VOLPES - Documentation

General settings

Sequence input

Sequences can be entered directly in the textbox (as a raw string or FASTA formatted) or can be retrieved from a public database (UniProt¹ or ENA²) using the corresponding ID. For protein sequences, the provided sequence should consist of the single-letter alphabet for the 20 standard amino acids. For RNA and DNA sequences, the four standard nucleobases should be used. However, if a DNA sequence is provided under molecule type RNA, any Ts will automatically be interpreted as Us. Similarly, when an RNA sequence is provided under the molecule type DNA, any Us will automatically be interpreted as Ts.

Additionally, it is possible to provide sequences with gaps, represented by the character "-". Further information on how these are treated during smoothing is given below.
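
The input rules above can be illustrated with a small sketch. The helper `normalizeSequence` and the molecule type strings `"rna"` and `"dna"` are assumptions made for illustration and are not part of VOLPES. The sketch strips FASTA header lines, applies the T/U interpretation, and leaves gap characters untouched. Uppercasing the input is also an assumption.

```javascript
// Hypothetical helper, not part of VOLPES: accepts a raw or FASTA-formatted
// string and applies the T/U interpretation for the chosen molecule type.
function normalizeSequence(input, moleculeType) {
    const residues = input
        .trim()
        .split(/\r?\n/)
        .filter((line) => !line.startsWith(">")) // drop FASTA header lines
        .join("")
        .replace(/\s+/g, "")
        .toUpperCase(); // assumption: input is treated case-insensitively
    if (moleculeType === "rna") return residues.replace(/T/g, "U");
    if (moleculeType === "dna") return residues.replace(/U/g, "T");
    return residues; // any other type (e.g. protein) is left unchanged
}

// A DNA sequence with a gap, entered under molecule type RNA.
console.log(normalizeSequence(">example\nATG-CT\nTA", "rna"));
```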

Treatment of gaps in the sequence

When gaps are encountered in a window during smoothing of the sequence profile, the average is calculated only over the values present in that window. This applies to gaps ("-") placed intentionally in the sequence, but also to any amino acid for which no data is available for the property of interest. Dashed lines above the profile mark the positions where values are derived from averaging windows containing missing values. When a smoothing window consists entirely of gaps, no y-value is assigned to the profile at that position.

The Visuals tab of the sequence settings offers an option to specify the minimum percentage of values that must be present in an averaging window for a y-value to be assigned to the profile. By default this is set to 0%, meaning that as long as a window contains at least one value, an average is calculated as described above. Setting this parameter to, for example, 80% leaves the profile without a y-value for every averaging window containing more than 20% missing values.
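
The gap handling described in this section can be sketched as follows. This is a minimal illustration and not the VOLPES source. The function name `smoothWithGaps`, the use of `null` for missing values, the plain mean as smoothing method, and the restriction to full windows are all assumptions.

```javascript
// Hypothetical sketch of gap-aware smoothing; not the VOLPES implementation.
// values:     per-residue numbers, null where no value exists (gap or no data)
// windowSize: number of residues per averaging window (assumed odd)
// minPercent: minimum percentage of present values required in a window
function smoothWithGaps(values, windowSize, minPercent = 0) {
    const half = Math.floor(windowSize / 2);
    const profile = [];
    // Assumption: only positions with a complete window are evaluated.
    for (let center = half; center < values.length - half; center++) {
        const windowValues = values.slice(center - half, center + half + 1);
        const present = windowValues.filter((v) => v !== null);
        const percentPresent = (100 * present.length) / windowValues.length;
        if (present.length === 0 || percentPresent < minPercent) {
            // No y-value is assigned at this position.
            profile.push({ y: null, hasMissing: true });
        } else {
            const sum = present.reduce((a, b) => a + b, 0);
            profile.push({
                y: sum / present.length, // average over existing values only
                hasMissing: present.length < windowValues.length,
            });
        }
    }
    return profile;
}

const raw = [1, null, 3, null, null, null, 5];
console.log(smoothWithGaps(raw, 3));     // default threshold of 0%
console.log(smoothWithGaps(raw, 3, 50)); // at least 50% of values required
```

The `hasMissing` flag corresponds to the positions that would be marked with a dashed line above the profile.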

Components

The VOLPES application consists of two main parts: the drawing functionality and the generated interface. These are not yet split into separate files, but this is planned.

Drawing

Drawing a profile requires two things: an SVG element that serves as the canvas, and a correctly formatted data object that defines the drawing.


    <svg class="canvas" width="1050" height="330"></svg>

The height and width can be chosen nearly arbitrarily, with the only limitation being that the axes need to fit within the area provided.


    data = { "seq" : [
        {
            "type" : string,
            "scale" : string,
            "sequence" : string,
            "window" : int,
            "shift" : int,
            "smoothing_method" : string,
            "name" : string,
            "organism" : string (optional),
            "thickness" : float,
            "color" : rgb_hex_string,
            "active" : bool
        }],
        "title" : string (optional)
    }
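
When a data object is built by hand, checking it against the schema above before drawing can catch mistakes early. The following validator is a hypothetical helper and not part of VOLPES. It assumes that `rgb_hex_string` means a `#rrggbb` string, and the example values for `type` and `smoothing_method` are placeholders, since the accepted strings are not listed here.

```javascript
// Hypothetical validator for one entry of data.seq. Field names and types
// follow the schema above; the helper itself is not part of VOLPES.
const REQUIRED_FIELDS = {
    type: "string", scale: "string", sequence: "string",
    window: "int", shift: "int", smoothing_method: "string",
    name: "string", thickness: "float", color: "color", active: "bool",
};

function matchesKind(value, kind) {
    switch (kind) {
        case "string": return typeof value === "string";
        case "int": return Number.isInteger(value);
        case "float": return typeof value === "number" && Number.isFinite(value);
        case "bool": return typeof value === "boolean";
        // Assumption: colors are given as "#rrggbb".
        case "color": return typeof value === "string" && /^#[0-9a-fA-F]{6}$/.test(value);
        default: return false;
    }
}

// Returns a list of problems; an empty list means the entry matches the schema.
function validateEntry(entry) {
    const problems = [];
    for (const [field, kind] of Object.entries(REQUIRED_FIELDS)) {
        if (!matchesKind(entry[field], kind)) {
            problems.push(`${field}: expected ${kind}`);
        }
    }
    // "organism" is optional, but must be a string when present.
    if ("organism" in entry && typeof entry.organism !== "string") {
        problems.push("organism: expected string");
    }
    return problems;
}

// Example entry; "protein" and "mean" are placeholder values.
const entry = {
    type: "protein", scale: "WIMW960101", sequence: "MKT-LLV",
    window: 3, shift: 0, smoothing_method: "mean", name: "example",
    thickness: 1.5, color: "#336699", active: true,
};
console.log(validateEntry(entry));                     // no problems reported
console.log(validateEntry({ ...entry, window: 2.5 })); // flags "window"
```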

Given a data object, a saved representation can be drawn at any time:


    drawingProfiles(data, id = ".canvas")

An SVG element with the class canvas is selected by default, but any selector can be provided, which allows multiple graphs to be drawn on a single page. To change the displayed graph interactively, modify the data object by any means and call drawingProfiles(data, id) again; the graph is then redrawn from the updated data.

Interface

In addition to the drawing library, functionality is provided that generates interfaces for interacting with the data object. It is still under development and may change before the final release. The addInterface function adds a simple interface box for a single sequence, with controls for sliding sequences against each other and for changing the smoothing window size, the scale, the color, and the primary sequence data:


    addInterface()

This function modifies the current state of the data object and adds entries to it. It is aware of the current status of the data object, but currently does not support multiple objects. Therefore, it should not be used in conjunction with multiple SVG canvases to draw on.

If a data object has been created by other means and only the interactivity of the interface is needed, interface elements can also be created based on a data JSON object:


    addInterfaceBasedOnData(data)

This function does not modify the current state of the provided data object, although the created interface elements are able to do so.

Primary sequence and databases

NOTE: Currently all database connectivity is tied to the default interface - this will be changed in the future to allow this connectivity in custom interfaces.

The default interface supports several ways of specifying primary sequence data for each entry. The textbox accepts a primary sequence as a one-letter-code string, either as the bare sequence or FASTA formatted. Furthermore, parsers for two central databases are implemented: UniProt for protein sequences and the European Nucleotide Archive (ENA) for RNA sequences.

Customization

Scales

The application ships with a database of 600 protein property scales taken from the scientific literature. The scale values are stored in the scale object and their descriptions in the scaleDescriptors object. Additional scales need to be registered in both objects. Protein scales are stored in scale.protein:


    "WIMW960101": {
        "A": 4.08, "C": 4.49, "D": 3.02, "E": 2.23, "F": 5.38,
        "G": 4.24, "H": 4.08, "I": 4.52, "K": 3.77, "L": 4.81,
        "M": 4.48, "N": 3.83, "P": 3.8,  "Q": 3.67, "R": 3.91,
        "S": 4.12, "T": 4.11, "V": 4.18, "W": 6.1,  "Y": 5.19
    }

The key that defines a scale is the same key used in the scale attribute of the data object. To provide a description, each scale should also be registered under that key in scaleDescriptors.protein:


    "WIMW960101": {
        "PMID": "8836100",               // Pubmed ID (optional)
        "aaIndexID": "WIMW960101",       // AAindex ID (optional)
        "author": "Wimley, W.C. and White, S.",
        "journal": "Nature Structural Biol. 3, 842-848 (1996)",
        "name": "Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996)",
        "summary": "Experimentally determined hydrophobicity scale for proteins at membrane interfaces"
    }

RNA scales need to be defined in both the scale.rna and scaleDescriptors.rna objects. While the structure of a description is identical to a protein scale, the scale itself uses different keys. For example, one can define a purine composition scale as follows:


    'PUR' : {'A': 1, 'C': 0, 'G': 1, 'U': 0, 'T': 0}
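
Registering a scale in both objects can be sketched as follows. The `scale` and `scaleDescriptors` objects below are empty stand-ins for the ones shipped with VOLPES, `registerScale` is a hypothetical helper, and the descriptor text is a placeholder. The last lines show how a registered scale maps a sequence to raw per-residue values, with `null` assumed for gaps and unknown characters.

```javascript
// Stand-ins for the objects shipped with VOLPES (assumption: plain objects
// keyed by scale ID, as in the snippets above).
const scale = { protein: {}, rna: {} };
const scaleDescriptors = { protein: {}, rna: {} };

// Hypothetical helper: registers values and description under the same key.
function registerScale(kind, key, values, descriptor) {
    if (key in scale[kind] || key in scaleDescriptors[kind]) {
        throw new Error(`Scale "${key}" is already registered for ${kind}`);
    }
    scale[kind][key] = values;
    scaleDescriptors[kind][key] = descriptor;
}

registerScale(
    "rna",
    "PUR",
    { A: 1, C: 0, G: 1, U: 0, T: 0 },
    {
        // Placeholder descriptor text, for illustration only.
        author: "n/a",
        journal: "n/a",
        name: "Purine composition",
        summary: "1 for purines (A, G), 0 for pyrimidines (C, U, T)",
    }
);

// Map a sequence to raw scale values; gaps and unknown characters become null.
const purine = scale.rna["PUR"];
const rawValues = [..."GA-UC"].map((base) => (base in purine ? purine[base] : null));
console.log(rawValues);
```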

Additional functionality

Beyond what is accessible through the provided interface functions, a few further functions are included that can be useful when building customized interfaces.

Auto align

Profiles can be shifted against each other to find the optimal alignment. This is defined as the alignment with the highest (or the lowest) Pearson correlation coefficient R between the two sequences as obtained by sliding the sequences against each other. This process can be automated:


    findBestShift(id1, id2, inv=false)

id1 and id2 refer to indices in the data object. The default behavior is to align the profiles by maximizing the Pearson R between them. Setting the inv flag inverts this, so that the observed correlation is minimized instead.
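
The search described above can be sketched on two plain numeric profiles. This is an illustration of the idea, not the code behind findBestShift. The function names, the `minOverlap` safeguard, and the sign convention of the returned shift are assumptions, and the sketch works on arrays rather than on indices into the data object.

```javascript
// Pearson correlation coefficient of two equally long numeric arrays.
function pearson(x, y) {
    const n = x.length;
    const meanX = x.reduce((a, b) => a + b, 0) / n;
    const meanY = y.reduce((a, b) => a + b, 0) / n;
    let cov = 0, varX = 0, varY = 0;
    for (let i = 0; i < n; i++) {
        const dx = x[i] - meanX;
        const dy = y[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    return cov / Math.sqrt(varX * varY);
}

// Hypothetical sketch: slides profile b along profile a and keeps the shift
// with the highest Pearson R (or the lowest, when inv is true).
// minOverlap is an assumed safeguard against very short overlaps.
function bestShiftSketch(a, b, inv = false, minOverlap = 3) {
    let best = { shift: 0, r: inv ? Infinity : -Infinity };
    for (let shift = minOverlap - b.length; shift <= a.length - minOverlap; shift++) {
        // Overlapping region when b[0] is placed at position `shift` of a.
        const start = Math.max(0, shift);
        const end = Math.min(a.length, shift + b.length);
        if (end - start < minOverlap) continue;
        const r = pearson(a.slice(start, end), b.slice(start - shift, end - shift));
        if (Number.isNaN(r)) continue; // constant segment: R is undefined
        if (inv ? r < best.r : r > best.r) best = { shift, r };
    }
    return best;
}

const profileA = [0, 1, 0, 2, 5, 2, 0, 1];
const profileB = [2, 5, 2, 0];
console.log(bestShiftSketch(profileA, profileB, false, 4)); // maximize R
console.log(bestShiftSketch(profileA, profileB, true, 4));  // minimize R
```

Requiring a minimum overlap matters because very short overlaps can produce a perfect correlation by chance.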