The filtering layer in ragp
consists of two sequential tasks:
The following section explains how to perform these tasks on a set of 2700 Arabidopsis protein sequences included in ragp
as `at_nsp()
data frame.
Required packages:
Hydroxyproline rich glycoproteins (HRGPs) are secreted proteins, therefore they are expected to contain a secretory signal sequence on the N-terminus (N-sp). ragp
incorporates N-sp prediction by querying SignalP 5 (Almagro Armenteros et al. 2019), SignalP 4.1, TargetP 1.1 (Emanuelsson et al. 2000) and Phobius (Käll, Krogh, and Sonnhammer 2007) via the functions: get_signalp5()
, get_signalp()
, get_targetp()
and get_phobius()
.
In the ragp
versions prior 0.3.5
we advised a majority vote between SignalP 4.1, TargetP 1.1 and Phobius should be used to determine which sequences should be flagged as secreted. Since the newest SignalP (version 5) utilizes a more sophisticated prediction method and a larger training set compared to the older methods we trust it provides a more accurate prediction of secreted protein sequences compared to the before mentioned majority vote.
To query SignalP 5 predictions:
nsp_signalp <- get_signalp5(data = at_nsp, #data frame name
sequence = sequence, #column containing the sequences
id = Transcript.id) #column containing protein identifiers
The predictions for the 2700 sequences contained in at_nsp
data frame should be available after several minutes. The returned object nsp_signalp
is a data frame resembling the web servers output:
head(nsp_signalp)
#> id Prediction SP.Sec.SPI Other CS_pos Pr cleave.site is.signalp sp.length
#> 1 ATCG00660.1 OTHER 0.00038 0.9996 NA FALSE NA
#> 2 AT2G43600.1 SP(Sec/SPI) 0.99980 0.0002 22-23 0.96 VFS-QN TRUE 22
#> 3 AT2G28410.1 SP(Sec/SPI) 0.99042 0.0096 22-23 0.89 ALA-QD TRUE 22
#> 4 AT2G22960.1 SP(Sec/SPI) 0.99814 0.0019 22-23 0.94 AES-GS TRUE 22
#> 5 AT2G19580.1 OTHER 0.26479 0.7352 NA FALSE NA
#> 6 AT2G19690.2 SP(Sec/SPI) 0.98954 0.0105 28-29 0.90 ARS-EE TRUE 28
To filter the sequences predicted to contain an N-sp:
nsp_signalp %>%
filter(is.signalp) %>% #filter rows where is.signalp column is TRUE
pull(id) -> id_nsp #pull the id column into a vector
at_nsp_filtered <- filter(at_nsp,
Transcript.id %in% id_nsp) #filter at_nsp by pulled ids
This will create at_nsp_filtered
data frame with 1890 sequences.
The key feature of HRGPs is the presence of hydroxyprolines (Hyp, O) which represent glycosylation sites (Showalter et al. 2010). While many HRGPs can be found based on biased amino acid composition, or the presence of certain amino acid motifs, there exists an abundance of chimeric proteins comprised from specific domains and HRGP motifs which are much harder to identify based on the mentioned features. In an attempt to identify these types of sequences ragp
incorporates a model specifically built to predict the probability of proline hydroxylation in plant proteins. This model is incorporated in the predict_hyp()
function and can be utilized to filter potential HRGPs.
at_hyp <- predict_hyp(data = at_nsp_filtered,
sequence = sequence,
id = Transcript.id)
The object returned by predict_hyp
is a list with two elements prediction
and sequence
. The prediction
element is a data frame containing the probability of hydroxylation for each P in each sequence along with the amino acid span used for prediction. First few rows of this element:
head(at_hyp$prediction)
#> id substr P_pos prob HYP
#> 1 AT2G43600.1 FSQNCMDTSCPGLKECCSRWG 31 0.015 No
#> 2 AT2G43600.1 EYCGFFCFSGPCNIKGKSYGY 58 0.016 No
#> 3 AT2G43600.1 YGYDYNVDAGPRGKIETVITS 76 0.018 No
#> 4 AT2G43600.1 ERYCSKSKKYPCEPGKNYYGR 163 0.015 No
#> 5 AT2G43600.1 CSKSKKYPCEPGKNYYGRGLL 166 0.014 No
#> 6 AT2G43600.1 YYGAGKHLGLPLLKDPDLVSR 194 0.015 No
The sequence
(at_hyp$sequence
) element is a data frame with the same sequences as provided in the function call in which prolines (P) are replaced by hydroxyprolines (O) at positions predicted to be hydroxylated. This element can be useful as input to scan_ag()
as explained in HRGP analysis tutorial.
The predictions are based on a probability threshold of 0.224 which offers the best trade-off between specificity and sensitivity. To increase specificity at the cost of sensitivity the threshold can be increased using the tprob
argument:
at_hyp2 <- predict_hyp(data = at_nsp_filtered,
sequence = sequence,
id = Transcript.id,
tprob = 0.6)
The at_hyp
object can be used to filter the sequences that contain more than a certain number of hydroxyprolines. The default threshold will be used. To filter sequences with three or more O:
at_hyp$prediction %>% #the prediction element
group_by(id) %>% #group by id
summarise(n = sum(HYP == "Yes")) %>% #calculate the number of hydroxyprolines per sequence
filter(n >= 3) %>% #filter sequences with three or more predicted hydroxyprolines
pull(id) -> at_3hyp #pull the ids of the filtered sequences
at_nsp_filtered %>%
filter(Transcript.id %in% at_3hyp) -> at_nsp_3hyp #filter sequences from `at_nsp_filtered`
There are 365 sequences that satisfy this condition.
After the filtering step the sequences can be analyzed by scan_ag()
to identify arabinogalactan motifs, by maab()
to classify prototypical HRGPs, by get_big_pi()
, get_pred_gpi()
or get_netGPI()
to predict GPI anchor sites in sequences and by get_hmm()
, get_cdd()
and get_espritz()
to identify domains and disordered regions. This is explained in HRGP analysis tutorial.