Nodes

File Operations

File operation nodes serve as file input to the graph. Data can be loaded, extracted, or merged before running the pipeline.

Loading

In each loading node, the input file can be selected from the dropdown menu. The file has to be loaded into the platform first through the Files tab. If it fits the file format of the loading node, it automatically appears in the dropdown menu.

Load Fasta

Loads a fasta file (.fasta) containing a biological sequence, such as a DNA or protein sequence. The node automatically recognizes the sequence type. Xyna.bio provides various options to process the sequence further, e.g., protein structure prediction for protein sequences.

Various -omics data types share the same file ending .fasta, even though the data they store is not compatible. This can lead to mistakes, such as chaining together incompatible tools. For this reason, xyna.bio automatically recognizes the contents of fasta files and infers the reference type for further processing. The reference type can also be assigned manually.

Data handling for fasta files is split between gene fasta files and protein fasta files (both with the ending .fasta).

Additionally, there are aligned fasta files and multi fasta files.

Aligned fasta files contain gene or protein sequences that are the output of a multiple sequence alignment. These alignments in fasta format may contain “-” as a special character indicating a gap in the alignment.

Multi fasta files contain more than one sequence per file and can be manipulated using merge nodes.

Fasta files can, for instance, be downloaded from GenBank.

Input: (Multi-)Fasta file

Output: Sequence (e.g., Protein, NucleicAcid, MultiProtein)
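
The reference type inference described above can be illustrated with a small heuristic. This is only a sketch and not xyna.bio's actual recognition logic: the function names parse_fasta and guess_reference_type are hypothetical, and the rule "only nucleotide letters means NucleicAcid" is an assumption that would misclassify a protein made up solely of A, C, G, T, U, or N.

```python
from typing import List, Tuple

# Letters accepted as nucleotides in this sketch (assumption).
NUCLEOTIDES = set("ACGTUN")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse fasta text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def guess_reference_type(text: str) -> str:
    """Guess Protein/NucleicAcid and add the Multi prefix for several records."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no fasta records found")
    # Gap characters from aligned fasta files are ignored for the guess.
    residues = set("".join(seq for _, seq in records).upper()) - {"-"}
    kind = "NucleicAcid" if residues <= NUCLEOTIDES else "Protein"
    return "Multi" + kind if len(records) > 1 else kind
```

Because such a heuristic can guess wrong for short or unusual sequences, manually assigning the reference type remains a useful fallback.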

Load Genbank

Loads a GenBank (.gb) file containing a gene sequence with annotations.

Input (GenBank file):

The GenBank file format (.gb) allows for the storage of gene sequences along with additional information like region annotations, sample information and references to publications.

Output: (reference to) GenBank file

Load PDB

Loads a PDB (.pdb) file containing a 3D protein structure. The loaded structure can, for instance, be displayed in MolStar or with the Protein Structure Viewer node.

Input (PDB file):
The PDB file format (.pdb) contains three-dimensional structural data in the form of atomic coordinates. Xyna.bio expects a PDB file to contain a protein structure. Usually, the files are downloaded from RCSB PDB.

Requirements: The file must contain a protein structure. For some use cases, such as Protein Ligand Binding Affinity (see Chemistry), a PDB file can contain both a protein structure and a ligand. In the case of AlphaFold Multimer (see Structure), the file must contain multiple protein chains.

Output: Protein structure (loaded pdb file)

Load SDF

Loads an SDF (.sdf) file containing the structure of a molecule. The structure can, for instance, be docked to a protein with the DiffDock node.

Input (SDF file):
The SDF file format (.sdf) contains three-dimensional structural data, but instead of protein structures, it is used to store coordinates of smaller molecules like drugs or metabolites.

SDF and PDB files can be combined into one PDB file containing the information of both. In xyna.bio, this can be done with the Add SDF to PDB node (see Structure).

A common source for SDF files is the PubChem database.

Output: Chemical structure

Load Single Cell Dataset

Loads a .zip file containing scRNA-seq data. The dataset can be further processed with the integrated single cell pipeline (see Single Cell RNA-seq).

Input: Zip file containing processed scRNA-seq data.

Requirements: The zip file must contain a folder which contains one folder per condition. Each of these folders must contain three files:

  • A tsv file containing barcodes: must have “barcode” in file name
  • A tsv file containing genes: must have “gene” in file name
  • An mtx file containing the matrix

The files are often generated with CellRanger. A potential source for scRNA-seq datasets is the Gene Expression Omnibus.

Input Parameters: Prefix must be provided if the gene names in the gene file contain a prefix.

Output: (reference to) Single Cell Dataset
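
Before uploading, the folder layout required above can be checked locally. The following is a minimal sketch under stated assumptions: check_single_cell_zip is a hypothetical helper, not part of xyna.bio, and it assumes uncompressed files with the endings .tsv and .mtx located exactly at top folder/condition folder/file.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def check_single_cell_zip(path: str) -> Dict[str, List[str]]:
    """Return, per condition folder, the list of required files that are missing."""
    files_by_condition: Dict[str, List[str]] = defaultdict(list)
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            parts = PurePosixPath(name).parts
            # Expected layout: <top folder>/<condition folder>/<file>
            if len(parts) == 3 and not name.endswith("/"):
                files_by_condition[parts[1]].append(parts[2].lower())

    problems: Dict[str, List[str]] = {}
    for condition, files in files_by_condition.items():
        missing = []
        if not any("barcode" in f and f.endswith(".tsv") for f in files):
            missing.append("barcodes tsv")
        if not any("gene" in f and f.endswith(".tsv") for f in files):
            missing.append("genes tsv")
        if not any(f.endswith(".mtx") for f in files):
            missing.append("matrix mtx")
        problems[condition] = missing
    return problems
```

An empty list for every condition means the archive matches the layout described in the requirements; the check does not validate the contents of the files themselves.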

Extracting

Extract Selection From Fasta

Extracts a subsequence from a fasta sequence based on the name of a previously defined selection (see Add Selection node in Annotation).

Input:

  • Fasta: Loaded fasta file
  • Selections: Selections (e.g., from Add Selection node)

Input Parameters:

  • Selection Name: Name of the desired selection from Selections input.

Output:

  • Extracted Fasta: Extracted sequence (Protein or Nucleic Acid).
  • Extracted Selections: Selection of the extracted region.
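
Conceptually, the extraction looks up a named range and slices the sequence. The sketch below illustrates this under assumptions that are not taken from xyna.bio: selections are modeled as a dictionary mapping a selection name to a 1-based, inclusive (start, end) pair, and extract_selection is a hypothetical function name.

```python
from typing import Dict, Tuple

# Hypothetical selection model: 1-based, inclusive (start, end) positions.
Selection = Tuple[int, int]


def extract_selection(
    sequence: str, selections: Dict[str, Selection], name: str
) -> Tuple[str, Dict[str, Selection]]:
    """Return the selected subsequence and a selection re-based to it."""
    if name not in selections:
        raise KeyError(f"unknown selection {name!r}")
    start, end = selections[name]
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f"selection {name!r} lies outside the sequence")
    extracted = sequence[start - 1:end]
    # The extracted region now spans the whole new sequence.
    return extracted, {name: (1, end - start + 1)}
```

For example, extract_selection("MKTAYIAKQR", {"helix": (3, 6)}, "helix") returns the subsequence "TAYI" together with the re-based selection {"helix": (1, 4)}.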

Extract Selection Structure

Gets a substructure of a given PDB file based upon the name of a previously defined selection (see Add Selection node in Annotation).

Returns all amino acids within the selection range as a new PDB structure. Does not return water molecules or other types of atoms present in the PDB file.

Input:

  • Protein structure: Protein Structure (e.g., from “Load PDB” node)
  • Selections: Selections (e.g., from “Add Selection” node)

Input Parameters:

  • Selection Name: Name of the desired selection from “Selections” input.

Output:

  • Extracted Structure: Extracted protein structure, can be viewed in MolStar.
  • Extracted Selections: Selection of the extracted region.
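
The filtering behavior described above (keep amino acids in the selection range, drop water and other atoms) can be sketched with the fixed-column layout of PDB records. This is an illustration, not the node's implementation: extract_residue_range is a hypothetical name, the selection is assumed to be a range of residue numbers, and only ATOM records are kept, which also drops modified residues stored as HETATM.

```python
from typing import Optional


def extract_residue_range(
    pdb_text: str, start: int, end: int, chain: Optional[str] = None
) -> str:
    """Keep ATOM records whose residue number lies in [start, end]."""
    kept = []
    for line in pdb_text.splitlines():
        # HETATM records (waters, ligands) and header lines are skipped.
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        # PDB columns: chain ID in column 22, residue number in columns 23-26.
        if chain is not None and line[21] != chain:
            continue
        try:
            residue_number = int(line[22:26])
        except ValueError:
            continue
        if start <= residue_number <= end:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

The optional chain argument restricts the extraction to one chain, which matters for multi-chain files where residue numbers repeat across chains.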

Extract Sequence From Fasta

Extracts a single fasta sequence from a multifasta file.

Input:

  • Source: Multifasta file (MultiProtein or MultiNucleicAcid) to extract the sequence from.
  • Selections (optional): Selection of the source fasta file.

Input Parameters:

  • Name: Name of the sequence to extract from the provided source fasta file.

Output:

  • Extracted: Extracted single protein or nucleic acid sequence as fasta file.
  • Extracted Selections: Selections associated with the extracted sequence.
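
Picking one record out of a multifasta file by name can be sketched as follows. The helper names read_multifasta and extract_sequence are hypothetical, and the sketch assumes that the sequence name is the first whitespace-delimited word of the header line, which may differ from how xyna.bio matches names.

```python
from typing import Dict


def read_multifasta(text: str) -> Dict[str, str]:
    """Map each record name (first word of the header) to its sequence."""
    sequences: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            sequences[name] = ""
        elif line and name is not None:
            # Sequence lines of one record are joined into a single string.
            sequences[name] += line
    return sequences


def extract_sequence(text: str, name: str) -> str:
    """Return the named record as single-sequence fasta text."""
    sequences = read_multifasta(text)
    if name not in sequences:
        raise KeyError(f"no sequence named {name!r}; available: {sorted(sequences)}")
    return f">{name}\n{sequences[name]}\n"
```

Listing the available names in the error message helps when the Name parameter contains a typo or includes the header description by mistake.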

Merging

Merge Fasta

Merges two fasta sequences of the same biological reference type into one multi fasta file. A multi fasta file is a file in fasta format containing multiple sequences. The merging is based on selections, which can be created with the Add Selection node (see Annotation). The selections are merged, too.

Input:

  • File 1: First fasta file to combine.
  • File 2: Second fasta file to combine. Must match the biological reference type of the first file (both must be Protein/MultiProtein or NucleicAcid/MultiNucleicAcid).
  • Selections 1: Selection for first file.
  • Selections 2: Selection for second file.

Output:

  • File: Merged fasta sequence (MultiProtein or MultiNucleicAcid)
  • Selections: Merged selections.
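
The rule that both files must share a biological reference type can be illustrated with a short sketch. The functions base_type and merge_fasta are hypothetical, the reference types are passed in as plain strings, and merging of selections is left out because their internal format is not described here.

```python
from typing import Tuple


def base_type(reference_type: str) -> str:
    """Strip the Multi prefix, e.g. MultiProtein -> Protein."""
    prefix = "Multi"
    if reference_type.startswith(prefix):
        return reference_type[len(prefix):]
    return reference_type


def merge_fasta(
    file_1: str, type_1: str, file_2: str, type_2: str
) -> Tuple[str, str]:
    """Concatenate two fasta texts if their reference types are compatible."""
    if base_type(type_1) != base_type(type_2):
        raise ValueError(f"cannot merge {type_1} with {type_2}")
    # Normalize trailing newlines so records do not run into each other.
    merged = file_1.rstrip("\n") + "\n" + file_2.rstrip("\n") + "\n"
    return merged, "Multi" + base_type(type_1)
```

Under these assumptions, merging a Protein file with a MultiProtein file yields a MultiProtein result, while mixing Protein and NucleicAcid inputs raises an error.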

Other

Select File

Selects a loaded file from the file storage according to its ID and type. This node is helpful for accessing intermediate files or for avoiding loading datasets multiple times (e.g., if loading takes a long time). For more details on file management, check out the Getting Started section.

Input Parameters:

  • File Id: ID of the file. Can be retrieved from the Job Spreadsheet or the Files Tab (see System Features)
  • File Type: Type of the file. Can be retrieved from the output node or the Job Spreadsheet.

Output: Loaded file.