Filtering hydroxyproline rich glycoproteins • ragp

The filtering layer in ragp consists of two sequential tasks:

predict the presence of secretory signals (N-sp) and filter sequences containing them.
predict proline hydroxylation and filter sequences containing at least several potential hydroxyprolines.

The following section explains how to perform these tasks on a set of 2700 Arabidopsis protein sequences included in ragp as `at_nsp() data frame.

Required packages:

suppressMessages(library(ragp))
suppressMessages(library(dplyr))

Predicting N-terminal secretory signal sequences (N-sp)

Hydroxyproline rich glycoproteins (HRGPs) are secreted proteins, therefore they are expected to contain a secretory signal sequence on the N-terminus (N-sp). ragp incorporates N-sp prediction by querying SignalP 5 (Almagro Armenteros et al. 2019), SignalP 4.1, TargetP 1.1 (Emanuelsson et al. 2000) and Phobius (Käll, Krogh, and Sonnhammer 2007) via the functions: get_signalp5(), get_signalp(), get_targetp() and get_phobius().

In the ragp versions prior 0.3.5 we advised a majority vote between SignalP 4.1, TargetP 1.1 and Phobius should be used to determine which sequences should be flagged as secreted. Since the newest SignalP (version 5) utilizes a more sophisticated prediction method and a larger training set compared to the older methods we trust it provides a more accurate prediction of secreted protein sequences compared to the before mentioned majority vote.

To query SignalP 5 predictions:

nsp_signalp <- get_signalp5(data = at_nsp, #data frame name
                            sequence = sequence, #column containing the sequences
                            id = Transcript.id) #column containing protein identifiers

The predictions for the 2700 sequences contained in at_nsp data frame should be available after several minutes. The returned object nsp_signalp is a data frame resembling the web servers output:

head(nsp_signalp)
#>            id  Prediction SP.Sec.SPI  Other CS_pos   Pr cleave.site is.signalp sp.length
#> 1 ATCG00660.1       OTHER    0.00038 0.9996          NA                  FALSE        NA
#> 2 AT2G43600.1 SP(Sec/SPI)    0.99980 0.0002  22-23 0.96      VFS-QN       TRUE        22
#> 3 AT2G28410.1 SP(Sec/SPI)    0.99042 0.0096  22-23 0.89      ALA-QD       TRUE        22
#> 4 AT2G22960.1 SP(Sec/SPI)    0.99814 0.0019  22-23 0.94      AES-GS       TRUE        22
#> 5 AT2G19580.1       OTHER    0.26479 0.7352          NA                  FALSE        NA
#> 6 AT2G19690.2 SP(Sec/SPI)    0.98954 0.0105  28-29 0.90      ARS-EE       TRUE        28

To filter the sequences predicted to contain an N-sp:

nsp_signalp %>%
  filter(is.signalp) %>% #filter rows where is.signalp column is TRUE
  pull(id) -> id_nsp #pull the id column into a vector

at_nsp_filtered <- filter(at_nsp,
                          Transcript.id %in% id_nsp) #filter at_nsp by pulled ids

This will create at_nsp_filtered data frame with 1890 sequences.

Predicting proline hydroxylation

The key feature of HRGPs is the presence of hydroxyprolines (Hyp, O) which represent glycosylation sites (Showalter et al. 2010). While many HRGPs can be found based on biased amino acid composition, or the presence of certain amino acid motifs, there exists an abundance of chimeric proteins comprised from specific domains and HRGP motifs which are much harder to identify based on the mentioned features. In an attempt to identify these types of sequences ragp incorporates a model specifically built to predict the probability of proline hydroxylation in plant proteins. This model is incorporated in the predict_hyp() function and can be utilized to filter potential HRGPs.

at_hyp <- predict_hyp(data = at_nsp_filtered,
                      sequence = sequence,
                      id = Transcript.id)

The object returned by predict_hyp is a list with two elements prediction and sequence. The prediction element is a data frame containing the probability of hydroxylation for each P in each sequence along with the amino acid span used for prediction. First few rows of this element:

head(at_hyp$prediction)

#>            id                substr P_pos  prob HYP
#> 1 AT2G43600.1 FSQNCMDTSCPGLKECCSRWG    31 0.015  No
#> 2 AT2G43600.1 EYCGFFCFSGPCNIKGKSYGY    58 0.016  No
#> 3 AT2G43600.1 YGYDYNVDAGPRGKIETVITS    76 0.018  No
#> 4 AT2G43600.1 ERYCSKSKKYPCEPGKNYYGR   163 0.015  No
#> 5 AT2G43600.1 CSKSKKYPCEPGKNYYGRGLL   166 0.014  No
#> 6 AT2G43600.1 YYGAGKHLGLPLLKDPDLVSR   194 0.015  No

The sequence (at_hyp$sequence) element is a data frame with the same sequences as provided in the function call in which prolines (P) are replaced by hydroxyprolines (O) at positions predicted to be hydroxylated. This element can be useful as input to scan_ag() as explained in HRGP analysis tutorial.

The predictions are based on a probability threshold of 0.224 which offers the best trade-off between specificity and sensitivity. To increase specificity at the cost of sensitivity the threshold can be increased using the tprob argument:

at_hyp2 <- predict_hyp(data = at_nsp_filtered,
                       sequence = sequence,
                       id = Transcript.id,
                       tprob = 0.6)

The at_hyp object can be used to filter the sequences that contain more than a certain number of hydroxyprolines. The default threshold will be used. To filter sequences with three or more O:

at_hyp$prediction %>% #the prediction element
  group_by(id) %>% #group by id
  summarise(n = sum(HYP == "Yes")) %>% #calculate the number of hydroxyprolines per sequence
  filter(n >= 3) %>% #filter sequences with three or more predicted hydroxyprolines
  pull(id) -> at_3hyp #pull the ids of the filtered sequences

at_nsp_filtered %>%
  filter(Transcript.id %in% at_3hyp) -> at_nsp_3hyp #filter sequences from `at_nsp_filtered`

There are 365 sequences that satisfy this condition.

After the filtering step the sequences can be analyzed by scan_ag() to identify arabinogalactan motifs, by maab() to classify prototypical HRGPs, by get_big_pi(), get_pred_gpi() or get_netGPI() to predict GPI anchor sites in sequences and by get_hmm(), get_cdd() and get_espritz() to identify domains and disordered regions. This is explained in HRGP analysis tutorial.

References

Almagro Armenteros, José Juan, Konstantinos D. Tsirigos, Casper Kaae Sønderby, Thomas Nordahl Petersen, Ole Winther, Søren Brunak, Gunnar von Heijne, and Henrik Nielsen. 2019. “SignalP 5.0 Improves Signal Peptide Predictions Using Deep Neural Networks.” Nature Biotechnology 37: 420–23. https://doi.org/0.1038/s41587-019-0036-z.

Emanuelsson, O., H. Nielsen, S. Brunak, and G. von Heijne. 2000. “Predicting Subcellular Localization of Proteins Based on Their N-Terminal Amino Acid Sequence.” Journal of Molecular Biology 300 (4): 1005–16. https://doi.org/10.1006/jmbi.2000.3903.

Käll, Lukas, Anders Krogh, and Erik L. L. Sonnhammer. 2007. “Advantages of Combined Transmembrane Topology and Signal Peptide Prediction–the Phobius Web Server.” Nucleic Acids Research 35 (Web Server issue): W429–432. https://doi.org/10.1093/nar/gkm256.

Showalter, Allan M., Brian Keppler, Jens Lichtenberg, Dazhang Gu, and Lonnie R. Welch. 2010. “A Bioinformatics Approach to the Identification, Classification, and Analysis of Hydroxyproline-Rich Glycoproteins.” Plant Physiology 153 (2): 485–513. https://doi.org/10.1104/pp.110.156554.