GlycoPP V2.0
A webserver for glycosite prediction in prokaryotes



Overview of GlycoPP V2.0

GlycoPP V2.0 is a highly accurate glycosylation prediction tool for the analysis of prokaryotic protein sequences, made available on the web-based Galaxy Platform. The GlycoPP prediction programmes are trained on the largest available dataset of N-glycosites and O-glycosites, extracted from experimentally characterized prokaryotic glycoproteins obtained from ProGlycProt V2.0 (http://www.proglycprot.org/).
GlycoPP V2.0 is an enhanced and updated version of our GlycoPP V1.0 (http://crdd.osdd.net/raghava/glycopp).

Glycosylation

O-linked Glycosylation SVM models: CTD, PAAC, SER, CPP+SS, DPC+SS, DPC+ASA

N-linked Glycosylation SVM models: BPP, BPP+SS, BPP+ASA, BPP+ASA+SS

The webserver predicts N- or O-glycosites using any one of the user-selected SVM (Support Vector Machine) based prediction approaches listed above, namely:

  • Shannon Entropy of Residues (SER)
  • Conjoint Triad Descriptors (CTD)
  • Pseudo Amino Acid Composition (PAAC)
  • Composition Profile of Patterns (CPP)
  • Dipeptide Composition (DPC)
  • Binary Profile of Pattern (BPP)
  • Secondary Structure (SS)
  • Accessible Surface Area (ASA)
  • Hybrid approaches:
    BPP+ASA, BPP+SS, BPP+ASA+SS for N-glycosite prediction, or CPP+SS, DPC+ASA, DPC+SS for O-glycosite prediction.

SVM Model generation methods

Composition Profile of Patterns (CPP)

The composition profile of patterns is the percentage frequency of each amino acid in a fixed-length sequence pattern.
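The definition above can be sketched in a few lines of Python. This is only an illustration, not the server's implementation: the function name `composition_profile`, the alphabetical ordering of residues, and the choice to ignore non-standard characters are all assumptions.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_profile(pattern: str) -> list[float]:
    """Percentage frequency of each standard amino acid in one pattern."""
    counts = Counter(pattern.upper())
    # Assumption: characters outside the 20 standard residues are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]

# Toy pattern; real patterns on the server are fixed-length windows.
profile = composition_profile("AACS")
```

Each pattern yields a 20-dimensional vector whose entries sum to 100.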

Binary Profile of Pattern (BPP)

In this approach, fixed-length sequence patterns of 41 residues were converted into binary form. Each residue of a pattern was represented by a vector of dimension 20 (e.g. Ala by a vector with 1 in the first position and 0 elsewhere; Cys by a vector with 1 in the second position and 0 elsewhere).
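A minimal sketch of this binary encoding is shown below. It assumes an alphabetical one-letter ordering (which places Ala first and Cys second, as in the example) and encodes any non-standard character as an all-zero vector; both choices, and the name `binary_profile`, are assumptions rather than details confirmed by the server documentation.

```python
# Assumed residue ordering: Ala first, Cys second, and so on.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def binary_profile(pattern: str) -> list[int]:
    """Concatenate one 20-dimensional one-hot vector per residue."""
    features: list[int] = []
    for residue in pattern.upper():
        one_hot = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # assumption: unknown residues stay all-zero
            one_hot[idx] = 1
        features.extend(one_hot)
    return features

encoded = binary_profile("AC")
```

Under these assumptions a 41-residue pattern gives a vector of 41 × 20 = 820 binary features.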

Shannon Entropy of Residues (SER)

Shannon entropy (SE) was used to assess the structural organization of sequences, i.e. their propensity towards order or disorder. An SE score was calculated for each consensus sequence. Entropy serves as a measure of disorder: the greater the disorder, the higher the entropy.
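As an illustration, the Shannon entropy of the residue distribution in a sequence can be computed as H = -Σ p_i log2(p_i), where p_i is the fraction of residues of type i. The sketch below assumes a base-2 logarithm and a score computed over residue frequencies of a single sequence; the exact formulation used by GlycoPP is not stated here, so treat the function `shannon_entropy` as hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of the residue frequencies of a sequence."""
    n = len(sequence)
    if n == 0:
        return 0.0
    counts = Counter(sequence.upper())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low = shannon_entropy("AAAA")   # a single residue type: fully ordered
high = shannon_entropy("ACDE")  # four equally frequent residue types
```

A sequence made of one residue type scores zero, and the score grows as the residue distribution becomes more uniform.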

Conjoint Triad Descriptor (CTD)

The conjoint triad feature encodes sequence information for proteins. The twenty amino acid types are clustered into seven classes to construct the conjoint triad feature. First, each protein sequence is encoded as a numerical vector by replacing every amino acid with its class. Then any three consecutive amino acids are treated as a unit, and the sequence is scanned to count the frequency of each triad type, giving a 343-dimensional (7 × 7 × 7) numerical vector.
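The procedure can be sketched as follows. The seven-class grouping used here is the one commonly used for conjoint triads in the literature; the text above does not list the classes, so this grouping, the raw (unnormalized) counts, and the name `conjoint_triad` are assumptions.

```python
# Assumed seven-class grouping of the 20 amino acids.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

def conjoint_triad(sequence: str) -> list[int]:
    """Count every triad of consecutive residue classes (7 * 7 * 7 = 343 bins)."""
    codes = [RESIDUE_TO_CLASS.get(aa) for aa in sequence.upper()]
    counts = [0] * 343
    for a, b, c in zip(codes, codes[1:], codes[2:]):
        if a is None or b is None or c is None:
            continue  # assumption: triads with non-standard residues are skipped
        counts[a * 49 + b * 7 + c] += 1
    return counts

vec = conjoint_triad("AGVC")  # triads: AGV -> (0,0,0), GVC -> (0,0,6)
```

Some implementations rescale the counts (for example by min-max normalization) before training; that step is omitted here.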

PSSM profile of patterns (PPP)

Multiple sequence alignment information in the form of a position-specific scoring matrix (PSSM) was used as the input feature for this model. Each target sequence was searched against Swiss-Prot with the PSI-BLAST program to generate its alignment profile (PSSM). Three iterations of PSI-BLAST were run for each protein with an e-value cut-off of 0.001. The PSSM contains the probability of occurrence of each type of amino acid at each residue position of the protein sequence. Finally, the PSSM rows corresponding to each fixed-length sequence pattern are extracted from the full-length PSSM; this is called the PSSM profile of patterns.
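The final extraction step might look like the sketch below, which takes an already parsed full-length PSSM (one row of 20 scores per residue) and cuts out the rows around a candidate site. The window length of 41, the zero padding at the sequence ends, and the function name `pssm_window` are assumptions; running PSI-BLAST and parsing its output are not shown.

```python
def pssm_window(pssm, center, window=41):
    """Return the PSSM rows of a fixed-length pattern centred on `center`.

    pssm   -- list of rows, one per residue, each a list of scores
    center -- 0-based index of the candidate glycosite
    window -- pattern length (assumed odd; 41 is an assumption)
    """
    half = window // 2
    width = len(pssm[0]) if pssm else 20
    rows = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            rows.append(list(pssm[pos]))
        else:
            rows.append([0.0] * width)  # assumed zero padding past the termini
    return rows

# Toy PSSM with 5 residues; row i is filled with the value i + 1.
toy_pssm = [[float(i + 1)] * 20 for i in range(5)]
pattern_rows = pssm_window(toy_pssm, center=0, window=3)
```

Flattening the returned rows gives one feature vector per candidate site.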

Dipeptide Composition (DPC)

For fixed-length sequence patterns of 41 residues, we consider gapped dipeptide composition, where DPC1_AA represents an amino acid pair with a gap of order Q (Q = 0, 1 and 2). The best performance was obtained at Q = 1.
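A gapped dipeptide composition can be sketched as below. It assumes that a gap of order Q means Q residues lie between the two members of a pair (so Q = 0 gives ordinary adjacent dipeptides) and that the composition is expressed as a percentage; the function name `gapped_dipeptide_composition` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered amino acid pairs, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def gapped_dipeptide_composition(pattern: str, gap: int = 1) -> list[float]:
    """Percentage of each residue pair separated by `gap` intervening residues."""
    pattern = pattern.upper()
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    step = gap + 1
    for i in range(len(pattern) - step):
        pair = pattern[i] + pattern[i + step]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [100.0 * counts[p] / total for p in PAIRS]

# With gap=1, "ACAC" contains the pairs A_A and C_C once each.
dpc = gapped_dipeptide_composition("ACAC", gap=1)
```

Each value of Q yields its own 400-dimensional vector.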

Pseudo Amino Acid Composition (PAAC)

Pseudo amino acid composition uses a discrete model to represent a protein without completely losing its sequence-order information. The concept of PAAC has been used in predicting post-translational modifications. In this study, we capture each residue's effect on subsequent residues with lambda (gap, l) values of 1, 2 and 3, and obtained the best result at l = 3.
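The idea can be illustrated with a simplified, single-property variant of pseudo amino acid composition: 20 composition terms plus lambda sequence-order correlation terms. The sketch uses the Kyte-Doolittle hydropathy index as a stand-in property scale and a weight of 0.05; the properties, weight, and exact formulation used by GlycoPP are not given above, so all of these, and the name `pseudo_aac`, are assumptions.

```python
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (stand-in property; an assumption).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
_MEAN = statistics.mean(HYDROPATHY.values())
_STD = statistics.pstdev(HYDROPATHY.values())
# Standardize the scale to zero mean and unit variance over the 20 residues.
NORMALISED = {aa: (v - _MEAN) / _STD for aa, v in HYDROPATHY.items()}

def pseudo_aac(sequence: str, lam: int = 3, weight: float = 0.05) -> list[float]:
    """Return 20 composition terms followed by `lam` correlation terms."""
    # Assumption: non-standard residues are dropped before encoding.
    seq = [aa for aa in sequence.upper() if aa in NORMALISED]
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # Mean squared property difference between residues k positions apart.
        diffs = [(NORMALISED[seq[i]] - NORMALISED[seq[i + k]]) ** 2
                 for i in range(n - k)]
        thetas.append(sum(diffs) / (n - k))
    denom = 1.0 + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]

paac = pseudo_aac("MKTAYIAKQRNSTLV", lam=3)
```

With lambda = 3 the vector has 20 + 3 = 23 components, and the components sum to 1 by construction.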

Secondary Structure (SS)

Previous studies on eukaryotic glycoproteins suggested that the probability of finding a glycosite is higher at positions where the secondary structure changes.

Accessible Surface Area (ASA)

Surface accessibility is employed as another important feature because glycosylation tends to occur in extracellular regions of proteins, with the side chain of the amino acid in the sequon exposed at the surface.

Hybrid Approaches

ASA and SS were obtained from SARpred and PSIPRED predictions, respectively; the values for the residues of each fixed-length sequence pattern were extracted from the predictions made on the full-length protein sequence.

In view of the current understanding that glycosylation in prokaryotes occurs on folded proteins, we also provide hybrid models that combine the above-mentioned sequence pattern properties with ASA, SS, and ASA+SS, as shown in the graphical abstract above.
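One plausible way to build such a hybrid input is to append per-residue SS and ASA values to the sequence-derived feature vector of the same pattern. The sketch below is an assumption about how the combination could be done, not a description of the server's code: it encodes three-state secondary structure (H, E, C) as one-hot vectors and appends ASA values as plain numbers, and the names `encode_ss` and `hybrid_vector` are hypothetical.

```python
SS_STATES = "HEC"  # helix, strand, coil (three-state secondary structure)

def encode_ss(ss_pattern: str) -> list[int]:
    """One-hot encode a per-residue secondary structure string."""
    features: list[int] = []
    for state in ss_pattern.upper():
        features.extend(1 if state == s else 0 for s in SS_STATES)
    return features

def hybrid_vector(sequence_features, ss_pattern=None, asa_values=None):
    """Concatenate sequence features with optional SS and ASA features."""
    vector = list(sequence_features)
    if ss_pattern is not None:
        vector.extend(encode_ss(ss_pattern))
    if asa_values is not None:
        vector.extend(float(v) for v in asa_values)
    return vector

# Toy two-residue example: sequence features + SS + ASA.
combined = hybrid_vector([1, 0], ss_pattern="HC", asa_values=[12.5, 40.0])
```

Passing only `ss_pattern` or only `asa_values` gives the +SS or +ASA variants; passing both gives the +ASA+SS variant.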