SVM Model generation methods
Composition Profile of Patterns (CPP)
Composition profile of patterns is the percentage frequencies of each
amino acid in a fixed length sequence patterns.
BPP Binary Profile Pattern (BPP)
Binary profile of pattern (BPP): In this approach, sequence patterns of
fixed length of 41-residues were converted into binary form. Each
residue of patterns was represented by a vector of dimension 20 (e.g.
Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; Cys by
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0).
Shannon Entropy of Residues (SER)
To understand the structural orchestration of sequences i.e.,
propensity towards order and disorder. In this regard, the Shannon
entropy (SE) score was calculated for each consensus sequence. As it
was evidenced that entropy possesses an idea of the disorder. Entropy
was directly proportional to the rate of disorder i.e., if the disorder
increases, it signifies higher entropy.
Conjoint Triad Descriptor (CTD)
The conjoint triad feature is sequence information for proteins. Twenty
amino acid types are clustered into seven classes to construct the
C-triad feature. First, protein sequences are encoded into a numerical
vector using the amino acid groups list in to seven classes.
Subsequently, any three continuous amino acids are regarded as a unit,
and scanning along the sequences and counting the frequencies of each
triad type is performed to obtain a 343-dimensional numerical vector.
PSSM profile of patterns (PPP)
The multiple sequence alignment information in the form of position
specific scoring matrix (PSSM) has been used as input feature to
develop this learning model. Each target sequence was scanned at
Swiss-Prot to generate the alignment profiles or position specific
scoring matrices (PSSM) by PSI-BLAST program. Three iterations of
PSI-BLAST were run for each protein with cut off e-value 0.001 thus
generating the profile matrices. The PSSM contains probability of
occurrence of each type of amino acid at each residue position of
protein sequence. Finally we extract PSSM contains probability of
occurrence of each type of amino acid of fixed length sequence patterns
from full length sequence PSSM matrix that is calles PSSM profile of
patterns.
Dipeptide Composition (DPC):
As sequence patterns of fixed length of 41-residues, we consider gapped
dipeptides composition, where DPC1_AA represents an amino acid, having
a gap of order Q (Q=0,1 and 2), here we get the best performance at
Q=1.
Pseudo Amino Acid Composition (PAAC)
Pseudo amino acid composition using a discrete model to represent a
protein yet without completely losing its sequence-order information.
The concept of PAAC was used in predicting the post-translational
modification. Here in this study, we extract each residue's impact on
the subsequent residues with lambda (gap) (l) 1, 2 and 3, got the best
result at l = 3.
Secondary Structure (SS)
Previous studies on eukaryotic glycoproteins suggested that the
probability of finding glycosite was higher at positions where there
was a secondary structure change.
Accessible Surface Area (ASA)
Surface accessibility is employed as another important feature because
glycosylation has tenancy to occur at extracellular regions of proteins
with the side chain of amino acid in the sequon exposed to the surface.
Hybrid Approaches:
We have obtained the ASA and SS from SARpred and PSIPRED prediction
respectively which contains amino acid of fixed length sequence
patterns from full length sequence on protein.
In view of the current understanding that glycosylation
occurs on folded proteins in prokaryotes, we also provide hybrid models
of above mention properties of protein sequence patterns in combination
of ASA, SS, and ASA+SS as shown in graphical abstract above.