Annotation

Add Selection

Defines a selection on a protein (PDB or fasta) or gene (fasta) sequence.

Input:

  • Selections: Existing selections to which the new selection is added. A loaded sequence can also be supplied.

Input Parameters:

  • Name: Label assigned to the selection. Downstream nodes use it to refer to this selection.
  • Start: Start index of the selection in the sequence.
  • End: End index of the selection in the sequence.
  • Color: Color applied to the selection, given as a hex code.
  • Chain ID: The chain ID of the sequence containing the selection.
  • Sequence Name: Name of the sequence within a multi-sequence input.

Output: List of selections, including the one added by the node.
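As an illustration only, the parameters above can be pictured as fields of a small record that is appended to a list. The sketch below is not the node's implementation: the class name `Selection`, the function `add_selection`, the default color, and the validation rules are all hypothetical, and it makes no claim about whether the node counts positions from 0 or 1 or treats End as inclusive.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Selection:
    """Hypothetical record mirroring the Add Selection parameters."""
    name: str                            # Name: label used by downstream nodes
    start: int                           # Start index in the sequence
    end: int                             # End index in the sequence
    color: str = "#FF0000"               # Color as a hex code (default is an assumption)
    chain_id: Optional[str] = None       # Chain ID (relevant for PDB input)
    sequence_name: Optional[str] = None  # Sequence Name for multi-sequence input

    def __post_init__(self) -> None:
        # Illustrative sanity checks; the real node may validate differently.
        if self.start > self.end:
            raise ValueError("start must not be greater than end")
        if not re.fullmatch(r"#[0-9A-Fa-f]{6}", self.color):
            raise ValueError("color must be a hex code such as #FF0000")


def add_selection(selections: List[Selection], new: Selection) -> List[Selection]:
    """Return a new list holding the existing selections plus the new one."""
    return [*selections, new]
```

For example, `add_selection([], Selection("active_site", 10, 25, "#00FF00", chain_id="A"))` returns a one-element list, matching the node's output of "the existing selections plus the one just added".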

Fasta from GenBank

Generates a fasta sequence and a selection from a GenBank file.

Input: Loaded GenBank file, usually from the Load GenBank node (see the File Operations section).

Output:

  • Sequence: Generated nucleic acid sequence.
  • Selections: Sequence annotations from the GenBank input file.
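To make the conversion concrete, here is a minimal sketch of how a sequence and its annotations could be pulled out of GenBank text using only the Python standard library. It is an assumption-laden illustration, not the node's implementation: the function name `genbank_to_fasta` is hypothetical, and the sketch only understands simple locations of the form `start..end` or `complement(start..end)`, silently skipping joins and other compound locations.

```python
import re
from typing import List, Tuple

# (feature key, start, end, strand); positions are copied as written in the file
Feature = Tuple[str, int, int, str]

# Feature lines begin with exactly five spaces followed by the feature key.
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+(complement\()?(\d+)\.\.(\d+)\)?$")


def genbank_to_fasta(text: str) -> Tuple[str, List[Feature]]:
    """Return a FASTA string and a list of simple features from GenBank text."""
    name = "sequence"
    features: List[Feature] = []
    seq_parts: List[str] = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines carry a position counter and spaces; keep letters only.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
        else:
            match = FEATURE_RE.match(line)
            if match:
                strand = "-" if match.group(2) else "+"
                features.append(
                    (match.group(1), int(match.group(3)), int(match.group(4)), strand)
                )
    fasta = f">{name}\n{''.join(seq_parts).upper()}\n"
    return fasta, features
```

In practice a dedicated GenBank parser is a better choice, since real files contain compound locations and qualifiers that this sketch ignores.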

Find ORFs

Finds possible open reading frames (ORFs) in each gene sequence by searching for start and stop codons in all six possible reading frames. Which codons count as start and stop codons depends on the selected codon table.

An ORF is a stretch of DNA that can potentially be translated into a protein. Because DNA codes for amino acids in triplets of bases (codons), there are three possible frames in which to read a strand, depending on which base reading starts on. Because DNA is a double-stranded, antiparallel molecule, the reverse complement of each sequence can be read as well, adding another three possible reading frames.
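The six-frame idea can be sketched in a few lines of Python. This is a simplified illustration, not the node's algorithm: it assumes the standard genetic code with `ATG` as the only start codon and `TAA`, `TAG`, `TGA` as stop codons, whereas the node takes these from the selected codon table. The function names and the coordinate convention (0-based, end-exclusive, relative to the strand being scanned) are choices made for this example.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODONS = {"ATG"}                # assumption: standard code only
STOP_CODONS = {"TAA", "TAG", "TGA"}   # assumption: standard code only

# (strand, frame, start, end); coordinates refer to the scanned strand
Orf = Tuple[str, int, int, int]


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_protein_length: int = 1) -> List[Orf]:
    """Scan three frames on each strand for start-to-stop stretches."""
    orfs: List[Orf] = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            orf_start = None
            # Step through the strand one codon at a time in this frame.
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if orf_start is None and codon in START_CODONS:
                    orf_start = i
                elif orf_start is not None and codon in STOP_CODONS:
                    residues = (i - orf_start) // 3  # stop codon not counted
                    if residues >= min_protein_length:
                        orfs.append((strand, frame, orf_start, i + 3))
                    orf_start = None
    return orfs
```

For the input `"ATGAAATAG"`, the only ORF found is on the forward strand in frame 0 and encodes two residues (`ATG AAA` followed by the stop codon `TAG`), so raising `min_protein_length` to 3 yields no results.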

Input:

  • Input DNA file: Loaded nucleic acid (DNA) fasta file containing the sequence.
  • Selections (optional): Selection of the sequence (e.g., from the Add Selection node in the Annotation section).

Input Parameters:

  • Codon table: Selected from the dropdown menu. Dependent on the source and host organism of the sequence.
  • Minimum Protein length: The minimum length of the predicted protein encoded by a potential ORF.

Output: Annotated ORFs as a GenBank file containing the DNA sequence and the ORF annotation.

Interpro Scan

Runs InterProScan on a supplied protein sequence.

InterProScan annotates sequences by scanning them against the InterPro protein signature databases. These databases contain information on known sequence motif families, allowing protein domains to be classified.

Matches are added as selections to the protein sequence.

Input: Fasta file of the sequence to annotate.

Input Parameters: Keep Overlapping (optional). When selected, overlapping annotations are kept. Otherwise, overlapping annotations are removed automatically and only a set of the longest non-overlapping annotations is kept.
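One plausible way to arrive at "a set of the longest, non-overlapping annotations" is a greedy pass that considers matches from longest to shortest and keeps each one that does not overlap anything already kept. The sketch below shows that idea under stated assumptions: the node's actual filtering rule is not documented here, the function name is hypothetical, and intervals are treated as inclusive at both ends.

```python
from typing import List, Tuple

# (annotation name, start, end) with both ends inclusive (assumption)
Match = Tuple[str, int, int]


def keep_longest_non_overlapping(matches: List[Match]) -> List[Match]:
    """Greedily keep the longest matches that do not overlap one another."""
    kept: List[Match] = []
    # Longest first; ties are broken by the earlier start position.
    by_length = sorted(matches, key=lambda m: (-(m[2] - m[1] + 1), m[1]))
    for name, start, end in by_length:
        # Two inclusive intervals are disjoint if one ends before the other starts.
        if all(end < k_start or start > k_end for _, k_start, k_end in kept):
            kept.append((name, start, end))
    # Report the survivors in sequence order.
    return sorted(kept, key=lambda m: m[1])
```

Given matches A (1 to 100), B (50 to 120), C (101 to 150), and D (10 to 20), this keeps A and C: B and D both overlap the longer match A, while C starts right after A ends.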

Output: Fasta file with added selections.

Citations:

Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., ... & Hunter, S. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), 1236-1240. DOI: https://doi.org/10.1093/bioinformatics/btu031

Blum, M., Chang, H. Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., ... & Finn, R. D. (2021). The InterPro protein families and domains database: 20 years on. Nucleic Acids Research, 49(D1), D344-D354. DOI: https://doi.org/10.1093/nar/gkaa977