The analysis layer in ragp
consists of:
The following section explains how to perform these tasks on a set of 365 Arabidopsis protein sequences obtained in Filtering hydroxyproline rich glycoproteins tutorial (the at_nsp_3hyp
data frame).
Required packages:
To identify potential arabinogalactans the function scan_ag()
can be used. This function scans the protein sequences for AG glycomodule clusters, by constructing regular expressions based on user input.
By default scan_ag()
searches for at least three dipeptides PA, PS, PT, PG, PV, AP, SP, TP, GP and VP which are separated by maximally 10 amino acids. The minimal number of dipeptides to be considered can be changed by the argument dim
while the maximum number of separating amino acids can be tweaked using the argument div
. To consider only PA, PS, PT, AP, SP and TP dipeptides the type
argument can be changed to "conservative"
. In cases when the dipeptides are ambiguous, such as SPT, PTP etc. all three amino acids will be counted as one dipeptide, while cases such as PTPA are considered as two dipeptides separated by zero amino acids. When the sequence
object returned by predict_hyp()
function is passed to scan_ag()
only prolines predicted to be hydroxylated are considered when searching for dipeptides:
scaned_at <- scan_ag(data = at_nsp_3hyp,
sequence,
Transcript.id)
The output of scan_ag()
can be a list, or a data frame depending on the simplify
/tidy
arguments. The default output is a data frame with one row per sequence which contains five columns. Here are the 55 N-terminal amino acids from the Classical arabinogalactan protein 9 as an example:
scaned_at %>%
mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
filter(id == "AT2G14890.1") #filter Q9C5S0
#> id sequence AG_aa total_length longest
#> 1 AT2G14890.1 marsfaiavicivliagvtgqAPTSPPTaTPAPPTPTTPpPAaTPpPVsAPpPVt 100 144 144
The column sequence
contains the input sequence where only the amino acids in potential AG glycomodules are in uppercase while all other are in lower case. The column AG_aa
contains the amino acid count found in potential AG glycomodules, while columns total_length
and longest
correspond to the matched sequence spans. From the output it can be observed that 100 amino acids were found in AG glycomodules, that the longest span of AG glycomodules is 144 amino acids long (AG glycomodule dipeptides in one span are separated by less then 10 amino acids, since div
argument which determines this is set per default to 10) and that it is the only span in the sequence.
To obtain a more detailed data frame the function can be called with simplify = FALSE
and tidy = TRUE
:
scaned_at2 <- scan_ag(data = at_nsp_3hyp,
sequence,
Transcript.id,
simplify = FALSE,
tidy = TRUE)
scaned_at2 %>%
mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
filter(id == "AT2G14890.1")
#> sequence id location.start location.end P_pos length
#> 1 marsfaiavicivliagvtgqAPTSPPTaTPAPPTPTTPpPAaTPpPVsAPpPVt AT2G14890.1 22 165 23, 26, .... 144
#> AG_aa
#> 1 100
Output obtained in this way contains the start and end location of each AG glycomodul span (in this example there is only one) as well as a vector of proline positions in AG glycomodules.
To incorporate hydroxyproline position knowledge when searching AG glycomodules the sequence
output of predict_hyp()
can be provided to scan_ag()
:
predict_hyp()
will be called on at_nsp_filtered
data frame created during Filtering hydroxyproline rich glycoproteins tutorial.
at_hyp <- predict_hyp(data = at_nsp_filtered,
sequence = sequence,
id = Transcript.id)
scan_ag()
will be called on the output at_hyp$sequence
:
scaned_at3 <- scan_ag(data = at_hyp$sequence,
sequence,
id, #`at_hyp$sequence` data frame contains the column `id`
simplify = FALSE,
tidy = TRUE)
#> sequence vector contains O, O will be considered instead of P
scaned_at3 %>%
mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
filter(id == "AT2G14890.1")
#> sequence id location.start location.end P_pos length
#> 1 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTTOoOAaTOoOVsAOoOVt AT2G14890.1 22 165 23, 26, .... 144
#> AG_aa
#> 1 98
It is clear almost all prolines (P) were predicted to be hydroxylated (O) so the start and end location as well as the length of the amino acid span containing AG glycomoduls are unchanged for this sequence.
Many of the found motifs in AT2G14890.1
(shown in uppercase) correspond to extensin motifs (PPP+ or OOO+). This might not be desired when searching for AG glycomodules, and scan_ag()
offers the argument exclude_ext
which enables the function to omit extensin motifs either of the common sequence SP{3,}
or SO{3,}
(exclude_ext = "yes"
) or of the form P[3,}
or O{3,}
(exclude_ext = "all"
):
scaned_at4 <- scan_ag(data = at_hyp$sequence,
sequence,
id,
simplify = FALSE,
tidy = TRUE,
exclude_ext = "all")
#> sequence vector contains O, O will be considered instead of P
scaned_at4 %>%
mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
filter(id == "AT2G14890.1")
#> sequence id location.start location.end P_pos length
#> 1 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTtoooaatooovsaooovt AT2G14890.1 22 37 23, 26, .... 16
#> 2 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTtoooaatooovsaooovt AT2G14890.1 114 165 115, 119.... 52
#> AG_aa
#> 1 15
#> 2 34
When extensin motifs are masked two AG glycomodule spans are found in AT2G14890.1
, from position 22 to 37 and from position 114 to 165.
Next we will compare how many sequences are identified as arabinogalactans with and without prediction of proline hydroxylation when extensin motifs are masked:
#without proline hydroxylation knowledge:
scaned_at5 <- scan_ag(data = at_nsp_filtered,
sequence,
Transcript.id,
exclude_ext = "all")
sum(scaned_at5$total_length != 0)
#> [1] 627
#using the output of `predict_hyp()
scaned_at6 <- scan_ag(data = at_hyp$sequence,
sequence,
id,
exclude_ext = "all")
#> sequence vector contains O, O will be considered instead of P
sum(scaned_at6$total_length != 0)
#> [1] 208
Incorporating hydroxyproline predictions in the search for AGPs reduced the pool of identified sequences by threefold. scan_ag()
is capable of identifying potential chimeric and short AGPs without pronounced amino acid bias and with few characteristic motifs. In combination with predict_hyp()
the specificity of the found sequences is greatly increased since the majority of sequences unlikely to contain hydroxyprolines are excluded.
The MAAB pipeline is explained in detail in Johnson et al. (2017). In short, after removal of sequences without a predicted N-sp and removal of sequences with predicted domains the sequences are first evaluated for amino acid bias. Sequences with > 45% of PAST (AGP bias) or PSKY (EXT bias) or PVKY (proline-rich proteins (PRP) bias) are grouped further into 24 distinct groups based on amino acid bias, presence of GPIs and the following motifs (as regular expression):
SP{3,5}
and the tyr motifs [FY].Y
, KHY
, VY[HKDE]
, V.Y
and YY
.PPV.[KT]
, PPV[QK]
and KKPCPP
.[AVTG]P{1,3}
and [ASVTG]P{1,2}
.The motifs are counted in a specific order ext > tyr > prp > agp, and overlapping motifs are not counted twice. Based on the final motif count and the overall coverage of the motifs HRGPs are clustered in the mentioned 24 groups. This can be performed using maab()
:
maab_at <- maab(at_nsp_3hyp,
sequence,
Transcript.id)
The output of maab()
is a data frame containing columns: id
- sequence identifier; ext_sp
, ext_tyr
, prp
and agp
- corresponding motif count; past_percent
, pvyk_percent
, psky_percent
, p_percent
- percentage of the corresponding amino acids in sequence; coverage
- the motif coverage; maab_class
- the MAAB class. First several hits as example:
maab_at %>%
filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
head #first 6 hits
#> id ext_sp ext_tyr prp agp past_percent pvyk_percent psky_percent p_percent coverage maab_class
#> 1 AT2G22470.1 0 0 0 21 63 31 32 21 0.35 1/4
#> 2 AT2G21140.1 0 0 4 5 31 46 43 20 0.11 24
#> 3 AT2G23130.1 0 0 0 32 57 31 37 21 0.37 1/4
#> 4 AT2G23130.2 0 0 0 32 59 34 40 23 0.42 1/4
#> 5 AT2G32300.1 0 1 0 16 47 24 29 13 0.14 24
#> 6 AT2G34870.1 0 0 0 6 47 32 34 26 0.14 24
Since the function was not provided knowledge on the GPI presence, both possible classes are indicated as in: “1/4” or “2/9.” To remove ambiguities in classes get_gpi
argument can be set to "bigpi"
, "predgpi"
or “netgpi
” this will run get_big_pi()
, get_pred_gpi()
or get_netGPI()
only on the sequences that belong to one of the HRGP classes:
maab_at <- maab(at_nsp_3hyp,
sequence,
Transcript.id,
get_gpi = "bigpi")
maab_at %>%
filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
head
#> id ext_sp ext_tyr prp agp past_percent pvyk_percent psky_percent p_percent coverage maab_class
#> 1 AT2G22470.1 0 0 0 21 63 31 32 21 0.35 1
#> 2 AT2G21140.1 0 0 4 5 31 46 43 20 0.11 24
#> 3 AT2G23130.1 0 0 0 32 57 31 37 21 0.37 1
#> 4 AT2G23130.2 0 0 0 32 59 34 40 23 0.42 4
#> 5 AT2G32300.1 0 1 0 16 47 24 29 13 0.14 24
#> 6 AT2G34870.1 0 0 0 6 47 32 34 26 0.14 24
To compare how many sequences are classified to one of the HRGP classes when proline hydroxylation is not considered:
#when considering only sequences with at least 3 predicted hydroxyprolins
maab_at %>%
filter(maab_class != "0") %>%
nrow
#> [1] 69
#when considering all sequences predicted to be secreted
maab_at2 <- maab(at_nsp_filtered,
sequence,
Transcript.id,
get_gpi = "bigpi")
maab_at2 %>%
filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
nrow
#> [1] 70
Out of the 1890 sequences in at_nsp_filtered
, 70 are predicted to belong to any MAAB class, one less compared to when proline hydroxylation prediction was incorporated. To check which sequence differs between maab_at
and maab_at2
:
maab_at2 %>%
filter(maab_class != "0") %>%
filter(!id %in% (maab_at %>%
filter(maab_class != "0") %>%
pull(id)))
#> id ext_sp ext_tyr prp agp past_percent pvyk_percent psky_percent p_percent coverage maab_class
#> 1 AT5G09530.1 0 0 0 23 35 49 46 29 0.12 24
The differing sequence belongs to MAAB class 24 - sequences with HRGP bias and low motif coverege (<15%).
Domain annotation is a fundamental research tool in protein sequence analyses. In many occasions domain annotation has already been performed prior to HRGP mining, either by using hmmer (Eddy 2011), InterProScan (Jones et al. 2014) or some other resource. In such circumstances the annotation can be imported into R
and joined with the filtered sequences. When this is not the case ragp
offers the functions get_hmm()
which queries the hmmscan web server (Finn, Clements, and Eddy 2011) and get_cdd()
which queries the Conserved Domain Database. Domain annotation is not necessary for AGP prediction although it enhances interpretation, but it is a fundamental part of the MAAB classification scheme. As with other ragp
functions the first argument is the data
, followed by the names of the columns containing the protein sequences and corresponding identifiers:
at_hmm <- get_hmm(at_nsp_3hyp,
sequence,
Transcript.id)
The resulting data frame resembles the web server’s output, or for those familiar with the hmmer
software the domain hits table produced by using the argument --domtblout
. First few rows:
head(at_hmm)
#> id name acc desc clan align_start align_end model_start model_end
#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I <NA> 41 52 74 85
#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037 94 229 2 156
#> 3 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037 237 283 185 232
#> 4 AT2G43620.1 Chitin_bind_1 PF00187.21 Chitin recognition protein <NA> 31 64 2 38
#> 5 AT2G43620.1 Chitin_bind_1 PF00187.21 Chitin recognition protein <NA> 164 167 28 31
#> 6 AT2G30933.1 X8 PF07983.15 X8 domain <NA> 82 151 2 75
#> ievalue cevalue bitscore reported
#> 1 1.9e+04 2.0e+00 -5.3 FALSE
#> 2 9.0e-38 9.4e-42 130.4 TRUE
#> 3 9.1e-06 9.5e-10 25.7 TRUE
#> 4 2.4e-08 2.5e-12 34.3 TRUE
#> 5 7.9e+03 8.3e-01 -2.5 FALSE
#> 6 3.2e-21 1.7e-25 75.9 TRUE
To check if any of the sequence with MAAB classes contain domains the output of get_hmm()
, in this case the at_hmm
data frame can be merged (dplyr::left_join()
) with the output of maab()
- the maab_at
data frame. After some data cleaning we can plot the frequency of domains using ggplot2::geom_col()
.
maab_at %>%
filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
left_join(at_hmm) %>% #join hmmscan annotation by protein id (only common column name)
filter(ievalue < 1e-10) %>% #keep domains with independent e-value lower then 1e-10
group_by(id) %>% #group by sequence identifier
distinct(name) %>% #remove duplicated domains per sequnce
group_by(name) %>% #group by domain name
summarise(n = n()) %>% #count number of sequences per domain
ggplot() + #plot the frequency
geom_col(aes(x = name, y = n)) +
theme_bw()+ #style the plot
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
ylab("number of sequences") +
xlab("domain name")
Several sequences contain domains such as plastocyanin-like domain (Cu_bind_like), hydrophobic seed protein, probable lipid transfer (LTP_2), X8 and others. In the original MAAB description (Johnson et al. 2017) the authors suggested removal of sequences with identified domains prior to MAAB however we trust it is more informative to first run MAAB and then check if any of the sequences contain domains.
Similarly, to check the domain frequencies in sequences identified as AGPs:
scaned_at6 %>%
filter(total_length != 0) %>% #remove sequences without motif spans
left_join(at_hmm) %>% #join hmmscan annotation by protein id (only common column name)
filter(ievalue < 1e-10) %>% #remove domains with independent e-value higher then 1e-10
group_by(id) %>% #group by sequence identifier
distinct(name) %>% #remove duplicated domains per sequnce
group_by(name) %>% #group by domain name
summarise(n = n()) %>% #count number of sequences per domain
arrange(desc(n)) %>% #arrenge by count
top_n(10) %>% #select top ten
ggplot() + #plot the frequency
geom_col(aes(x = name, y = n)) +
theme_bw()+ #style the plot
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
ylab("number of sequences") +
xlab("domain name")
As evident from the above examples an especially effective way to manipulate objects returned by ragp
functions is to utilize the tidyverse
collection of packages (dplyr
and ggplot2
are used above).
Gene ontology enrichment can be performed with pfam2go()
which maps Pfam accessions to GO terms based on http://geneontology.org/external2go/pfam2go. The function is called on the output of get_hmm()
:
go_enriched <- pfam2go(at_hmm)
head(go_enriched) #first few rows
#> id name acc desc clan align_start align_end model_start model_end ievalue
#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I <NA> 41 52 74 85 1.9e+04
#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I <NA> 41 52 74 85 1.9e+04
#> 3 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I <NA> 41 52 74 85 1.9e+04
#> 4 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037 94 229 2 156 9.0e-38
#> 5 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037 94 229 2 156 9.0e-38
#> 6 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037 94 229 2 156 9.0e-38
#> cevalue bitscore reported Pfam_name GO_name GO_acc
#> 1 2.0e+00 -5.3 FALSE Glyco_hydro_19 GO:chitin catabolic process GO:0006032
#> 2 2.0e+00 -5.3 FALSE Glyco_hydro_19 GO:chitinase activity GO:0004568
#> 3 2.0e+00 -5.3 FALSE Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998
#> 4 9.4e-42 130.4 TRUE Glyco_hydro_19 GO:chitin catabolic process GO:0006032
#> 5 9.4e-42 130.4 TRUE Glyco_hydro_19 GO:chitinase activity GO:0004568
#> 6 9.4e-42 130.4 TRUE Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998
HRGPs, and especially AGPs are often linked to the membrane by a glycosylphosphatidylinositol (GPI) anchor (Ellis et al. 2010). Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide occurs after proteolytic cleavage of a C-terminal propeptide. In order to predict the presence of the omega sites in the filtered sequences two functions are available: get_big_pi()
which queries the big-PI Plant Predictor (Eisenhaber et al. 2003) server, and get_pred_gpi()
which queries the PredGPI (Pierleoni, Martelli, and Casadio 2008) server and get_netGPI()
which queries the NetGPI server (Gíslason et al. 2021).
To query PredGPI:
at_gpi_pred <- get_pred_gpi(at_nsp_3hyp,
sequence,
Transcript.id)
The resulting data frame has one row per sequence:
head(at_gpi_pred)
#> id omega_site specificity is.gpi
#> 1 AT2G43620.1 252 0.42 FALSE
#> 2 AT2G30933.1 197 1.00 TRUE
#> 3 AT2G30933.2 139 0.48 FALSE
#> 4 AT2G22470.1 107 1.00 TRUE
#> 5 AT2G43610.1 260 0.16 FALSE
#> 6 AT2G27035.1 138 1.00 TRUE
To check which of the predicted AGPs using scan_ag()
are also predicted to be anchored by a GPI the output of get_pred_gp()
will be merged with scaned_at6
data frame using dplyr::left_join()
.
A key structural feature of HRGPs, based on abundance of disorder-promoting residues, is that they are intrinsically disordered proteins. To identify potential disordered regions in proteins ragp
contains get_espritz()
which queries ESpritz (Walsh et al. 2012) web server. Three models trained on different data sets are available and can be selected via the argument model
:
These models provide quite different predictions. To query ESpritz:
at_espritz <- get_espritz(at_nsp_3hyp,
sequence,
Transcript.id,
model = "NMR")
#> start end id
#> 1 1 7 AT2G43620.1
#> 2 49 92 AT2G43620.1
#> 3 103 116 AT2G43620.1
#> 4 149 152 AT2G43620.1
#> 5 168 170 AT2G43620.1
#> 6 175 178 AT2G43620.1
The output of the function depends on the simplify
argument. If simplify = TRUE
(default) the output is a data.frame
(one row per disordered region) with columns start
and end
which indicate the start and end of the predicted disordered region as well as the column id
which is a protein identifier. When simplify = FALSE
the output contains predicted probabilities of disorder for each amino acid.
at_espritz2 <- get_espritz(at_nsp_3hyp,
sequence,
Transcript.id,
model = "NMR",
simplify = FALSE)
#> id probability
#> 1 AT2G43620.1 0.477693....
#> 2 AT2G30933.1 0.636924....
#> 3 AT2G30933.2 0.631344....
#> 4 AT2G22470.1 0.743908....
#> 5 AT2G43610.1 0.742569....
#> 6 AT2G27035.1 0.556894....
#> prediction
#> 1 DDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOODDDDDDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDOOOOOOOOOOOOOOODDDOOOODDDDOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDOOOOOOOOOOOOOOOOOOOODDDDDDDDD
#> 2 DDDDDDDOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOODDDOOOOOOOOOOOOOOOOOOOOOOOODDDDOOOOODDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOODDDD
#> 3 DDDDDDDOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDD
#> 4 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOODDD
#> 5 DDDDDDDDOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOODDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDOOOOOOOOOOOOODDDDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDD
#> 6 DDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOOOO