The analysis layer in ragp consists of:

  • Arabinogalactan identification by finding localized clusters of characteristic motifs (AG glycomodules)
  • Hydroxyproline rich glycoprotein classification by the motif and amino acid bias scheme (MAAB, (Johnson et al. 2017)).
  • Domain annotation using Pfam data base and enrichment with gene ontology (GO) terms. Alternatively Domain annotation can be performed by querying CDD.
  • Glycosylphosphatidylinositol attachment sites prediction.
  • Identification of disordered regions in proteins.

The following section explains how to perform these tasks on a set of 365 Arabidopsis protein sequences obtained in Filtering hydroxyproline rich glycoproteins tutorial (the at_nsp_3hyp data frame).

Required packages:

Arabinogalactan identification

To identify potential arabinogalactans the function scan_ag() can be used. This function scans the protein sequences for AG glycomodule clusters, by constructing regular expressions based on user input.

By default scan_ag() searches for at least three dipeptides PA, PS, PT, PG, PV, AP, SP, TP, GP and VP which are separated by maximally 10 amino acids. The minimal number of dipeptides to be considered can be changed by the argument dim while the maximum number of separating amino acids can be tweaked using the argument div. To consider only PA, PS, PT, AP, SP and TP dipeptides the type argument can be changed to "conservative". In cases when the dipeptides are ambiguous, such as SPT, PTP etc. all three amino acids will be counted as one dipeptide, while cases such as PTPA are considered as two dipeptides separated by zero amino acids. When the sequence object returned by predict_hyp() function is passed to scan_ag() only prolines predicted to be hydroxylated are considered when searching for dipeptides:

scaned_at <- scan_ag(data = at_nsp_3hyp,
                     sequence,
                     Transcript.id)

The output of scan_ag() can be a list, or a data frame depending on the simplify/tidy arguments. The default output is a data frame with one row per sequence which contains five columns. Here are the 55 N-terminal amino acids from the Classical arabinogalactan protein 9 as an example:

scaned_at  %>%
  mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
  filter(id == "AT2G14890.1") #filter Q9C5S0
#>            id                                                sequence AG_aa total_length longest
#> 1 AT2G14890.1 marsfaiavicivliagvtgqAPTSPPTaTPAPPTPTTPpPAaTPpPVsAPpPVt   100          144     144

The column sequence contains the input sequence where only the amino acids in potential AG glycomodules are in uppercase while all other are in lower case. The column AG_aa contains the amino acid count found in potential AG glycomodules, while columns total_length and longest correspond to the matched sequence spans. From the output it can be observed that 100 amino acids were found in AG glycomodules, that the longest span of AG glycomodules is 144 amino acids long (AG glycomodule dipeptides in one span are separated by less then 10 amino acids, since div argument which determines this is set per default to 10) and that it is the only span in the sequence.

To obtain a more detailed data frame the function can be called with simplify = FALSE and tidy = TRUE:

scaned_at2 <- scan_ag(data = at_nsp_3hyp,
                      sequence,
                      Transcript.id, 
                      simplify = FALSE,
                      tidy = TRUE)

scaned_at2  %>%
  mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
  filter(id == "AT2G14890.1")
#>                                                  sequence          id location.start location.end        P_pos length
#> 1 marsfaiavicivliagvtgqAPTSPPTaTPAPPTPTTPpPAaTPpPVsAPpPVt AT2G14890.1             22          165 23, 26, ....    144
#>   AG_aa
#> 1   100

Output obtained in this way contains the start and end location of each AG glycomodul span (in this example there is only one) as well as a vector of proline positions in AG glycomodules.

To incorporate hydroxyproline position knowledge when searching AG glycomodules the sequence output of predict_hyp() can be provided to scan_ag():

at_hyp <- predict_hyp(data = at_nsp_filtered,
                      sequence = sequence,
                      id = Transcript.id)
  • Then scan_ag() will be called on the output at_hyp$sequence:
scaned_at3 <- scan_ag(data = at_hyp$sequence,
                      sequence,
                      id, #`at_hyp$sequence` data frame contains the column `id`
                      simplify = FALSE,
                      tidy = TRUE)
#> sequence vector contains O, O will be considered instead of P

scaned_at3  %>%
  mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
  filter(id == "AT2G14890.1")
#>                                                  sequence          id location.start location.end        P_pos length
#> 1 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTTOoOAaTOoOVsAOoOVt AT2G14890.1             22          165 23, 26, ....    144
#>   AG_aa
#> 1    98

It is clear almost all prolines (P) were predicted to be hydroxylated (O) so the start and end location as well as the length of the amino acid span containing AG glycomoduls are unchanged for this sequence.

Many of the found motifs in AT2G14890.1 (shown in uppercase) correspond to extensin motifs (PPP+ or OOO+). This might not be desired when searching for AG glycomodules, and scan_ag() offers the argument exclude_ext which enables the function to omit extensin motifs either of the common sequence SP{3,} or SO{3,} (exclude_ext = "yes") or of the form P[3,} or O{3,} (exclude_ext = "all"):

scaned_at4 <- scan_ag(data = at_hyp$sequence,
                      sequence,
                      id, 
                      simplify = FALSE,
                      tidy = TRUE,
                      exclude_ext = "all")
#> sequence vector contains O, O will be considered instead of P

scaned_at4  %>%
  mutate(sequence = strtrim(sequence, 55)) %>% #take the first 55 amino acids from the sequences
  filter(id == "AT2G14890.1")
#>                                                  sequence          id location.start location.end        P_pos length
#> 1 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTtoooaatooovsaooovt AT2G14890.1             22           37 23, 26, ....     16
#> 2 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTtoooaatooovsaooovt AT2G14890.1            114          165 115, 119....     52
#>   AG_aa
#> 1    15
#> 2    34

When extensin motifs are masked two AG glycomodule spans are found in AT2G14890.1, from position 22 to 37 and from position 114 to 165.

Next we will compare how many sequences are identified as arabinogalactans with and without prediction of proline hydroxylation when extensin motifs are masked:

#without proline hydroxylation knowledge:
scaned_at5 <- scan_ag(data = at_nsp_filtered,
                      sequence,
                      Transcript.id,
                      exclude_ext = "all")
sum(scaned_at5$total_length != 0)
#> [1] 627

#using the output of `predict_hyp()
scaned_at6 <- scan_ag(data = at_hyp$sequence,
                      sequence,
                      id,
                      exclude_ext = "all")
#> sequence vector contains O, O will be considered instead of P
sum(scaned_at6$total_length != 0)
#> [1] 208

Incorporating hydroxyproline predictions in the search for AGPs reduced the pool of identified sequences by threefold. scan_ag() is capable of identifying potential chimeric and short AGPs without pronounced amino acid bias and with few characteristic motifs. In combination with predict_hyp() the specificity of the found sequences is greatly increased since the majority of sequences unlikely to contain hydroxyprolines are excluded.

Motif and amino acid bias classification - MAAB

The MAAB pipeline is explained in detail in Johnson et al. (2017). In short, after removal of sequences without a predicted N-sp and removal of sequences with predicted domains the sequences are first evaluated for amino acid bias. Sequences with > 45% of PAST (AGP bias) or PSKY (EXT bias) or PVKY (proline-rich proteins (PRP) bias) are grouped further into 24 distinct groups based on amino acid bias, presence of GPIs and the following motifs (as regular expression):

  • EXT: the SPn motif SP{3,5} and the tyr motifs [FY].Y, KHY, VY[HKDE], V.Y and YY.
  • PRP: PPV.[KT], PPV[QK] and KKPCPP.
  • AGP: [AVTG]P{1,3} and [ASVTG]P{1,2}.

The motifs are counted in a specific order ext > tyr > prp > agp, and overlapping motifs are not counted twice. Based on the final motif count and the overall coverage of the motifs HRGPs are clustered in the mentioned 24 groups. This can be performed using maab():

maab_at <- maab(at_nsp_3hyp,
                sequence,
                Transcript.id)

The output of maab() is a data frame containing columns: id - sequence identifier; ext_sp, ext_tyr, prp and agp - corresponding motif count; past_percent, pvyk_percent, psky_percent, p_percent - percentage of the corresponding amino acids in sequence; coverage - the motif coverage; maab_class - the MAAB class. First several hits as example:

maab_at %>%
  filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
  head #first 6 hits
#>            id ext_sp ext_tyr prp agp past_percent pvyk_percent psky_percent p_percent coverage maab_class
#> 1 AT2G22470.1      0       0   0  21           63           31           32        21     0.35        1/4
#> 2 AT2G21140.1      0       0   4   5           31           46           43        20     0.11         24
#> 3 AT2G23130.1      0       0   0  32           57           31           37        21     0.37        1/4
#> 4 AT2G23130.2      0       0   0  32           59           34           40        23     0.42        1/4
#> 5 AT2G32300.1      0       1   0  16           47           24           29        13     0.14         24
#> 6 AT2G34870.1      0       0   0   6           47           32           34        26     0.14         24

Since the function was not provided knowledge on the GPI presence, both possible classes are indicated as in: “1/4” or “2/9.” To remove ambiguities in classes get_gpi argument can be set to "bigpi", "predgpi" or “netgpi” this will run get_big_pi(), get_pred_gpi() or get_netGPI() only on the sequences that belong to one of the HRGP classes:

maab_at <- maab(at_nsp_3hyp,
                sequence,
                Transcript.id,
                get_gpi = "bigpi")

maab_at %>%
  filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
  head
#>            id ext_sp ext_tyr prp agp past_percent pvyk_percent psky_percent p_percent coverage maab_class
#> 1 AT2G22470.1      0       0   0  21           63           31           32        21     0.35          1
#> 2 AT2G21140.1      0       0   4   5           31           46           43        20     0.11         24
#> 3 AT2G23130.1      0       0   0  32           57           31           37        21     0.37          1
#> 4 AT2G23130.2      0       0   0  32           59           34           40        23     0.42          4
#> 5 AT2G32300.1      0       1   0  16           47           24           29        13     0.14         24
#> 6 AT2G34870.1      0       0   0   6           47           32           34        26     0.14         24

To compare how many sequences are classified to one of the HRGP classes when proline hydroxylation is not considered:

#when considering only sequences with at least 3 predicted hydroxyprolins
maab_at %>%
  filter(maab_class != "0") %>%
  nrow
#> [1] 69

#when considering all sequences predicted to be secreted
maab_at2 <- maab(at_nsp_filtered,
                 sequence,
                 Transcript.id,
                 get_gpi = "bigpi")

maab_at2 %>%
  filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
  nrow
#> [1] 70

Out of the 1890 sequences in at_nsp_filtered, 70 are predicted to belong to any MAAB class, one less compared to when proline hydroxylation prediction was incorporated. To check which sequence differs between maab_at and maab_at2:

maab_at2 %>%
  filter(maab_class != "0") %>%
  filter(!id %in% (maab_at %>%
                     filter(maab_class != "0") %>%
                     pull(id))) 
#>            id ext_sp ext_tyr prp agp past_percent pvyk_percent psky_percent p_percent coverage maab_class
#> 1 AT5G09530.1      0       0   0  23           35           49           46        29     0.12         24

The differing sequence belongs to MAAB class 24 - sequences with HRGP bias and low motif coverege (<15%).

Domain annotation and GO enrichment

Domain annotation is a fundamental research tool in protein sequence analyses. In many occasions domain annotation has already been performed prior to HRGP mining, either by using hmmer (Eddy 2011), InterProScan (Jones et al. 2014) or some other resource. In such circumstances the annotation can be imported into R and joined with the filtered sequences. When this is not the case ragp offers the functions get_hmm() which queries the hmmscan web server (Finn, Clements, and Eddy 2011) and get_cdd() which queries the Conserved Domain Database. Domain annotation is not necessary for AGP prediction although it enhances interpretation, but it is a fundamental part of the MAAB classification scheme. As with other ragp functions the first argument is the data, followed by the names of the columns containing the protein sequences and corresponding identifiers:

at_hmm <- get_hmm(at_nsp_3hyp,
                  sequence,
                  Transcript.id)

The resulting data frame resembles the web server’s output, or for those familiar with the hmmer software the domain hits table produced by using the argument --domtblout. First few rows:

head(at_hmm)
#>            id           name        acc                       desc   clan align_start align_end model_start model_end
#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.21          Chitinase class I   <NA>          41        52          74        85
#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.21          Chitinase class I CL0037          94       229           2       156
#> 3 AT2G43620.1 Glyco_hydro_19 PF00182.21          Chitinase class I CL0037         237       283         185       232
#> 4 AT2G43620.1  Chitin_bind_1 PF00187.21 Chitin recognition protein   <NA>          31        64           2        38
#> 5 AT2G43620.1  Chitin_bind_1 PF00187.21 Chitin recognition protein   <NA>         164       167          28        31
#> 6 AT2G30933.1             X8 PF07983.15                  X8 domain   <NA>          82       151           2        75
#>   ievalue cevalue bitscore reported
#> 1 1.9e+04 2.0e+00     -5.3    FALSE
#> 2 9.0e-38 9.4e-42    130.4     TRUE
#> 3 9.1e-06 9.5e-10     25.7     TRUE
#> 4 2.4e-08 2.5e-12     34.3     TRUE
#> 5 7.9e+03 8.3e-01     -2.5    FALSE
#> 6 3.2e-21 1.7e-25     75.9     TRUE

To check if any of the sequence with MAAB classes contain domains the output of get_hmm(), in this case the at_hmm data frame can be merged (dplyr::left_join()) with the output of maab() - the maab_at data frame. After some data cleaning we can plot the frequency of domains using ggplot2::geom_col().

maab_at %>%
  filter(maab_class != "0") %>% #remove maab_class 0 - not HRGP
  left_join(at_hmm) %>% #join hmmscan annotation by protein id (only common column name)
  filter(ievalue < 1e-10) %>% #keep domains with independent e-value lower then 1e-10
  group_by(id) %>% #group by sequence identifier
  distinct(name) %>% #remove duplicated domains per sequnce
  group_by(name) %>% #group by domain name
  summarise(n = n()) %>% #count number of sequences per domain
  ggplot() + #plot the frequency
  geom_col(aes(x = name, y = n)) +
  theme_bw()+ #style the plot
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1)) +
  ylab("number of sequences") +
  xlab("domain name")

Several sequences contain domains such as plastocyanin-like domain (Cu_bind_like), hydrophobic seed protein, probable lipid transfer (LTP_2), X8 and others. In the original MAAB description (Johnson et al. 2017) the authors suggested removal of sequences with identified domains prior to MAAB however we trust it is more informative to first run MAAB and then check if any of the sequences contain domains.

Similarly, to check the domain frequencies in sequences identified as AGPs:

scaned_at6 %>%
  filter(total_length != 0) %>% #remove sequences without motif spans
  left_join(at_hmm) %>% #join hmmscan annotation by protein id (only common column name)
  filter(ievalue < 1e-10) %>% #remove domains with independent e-value higher then 1e-10
  group_by(id) %>% #group by sequence identifier
  distinct(name) %>% #remove duplicated domains per sequnce
  group_by(name) %>% #group by domain name
  summarise(n = n()) %>% #count number of sequences per domain
  arrange(desc(n)) %>% #arrenge by count
  top_n(10) %>% #select top ten
  ggplot() + #plot the frequency
  geom_col(aes(x = name, y = n)) +
  theme_bw()+ #style the plot
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1)) +
  ylab("number of sequences") +
  xlab("domain name")

As evident from the above examples an especially effective way to manipulate objects returned by ragp functions is to utilize the tidyverse collection of packages (dplyr and ggplot2 are used above).

Gene ontology enrichment can be performed with pfam2go() which maps Pfam accessions to GO terms based on http://geneontology.org/external2go/pfam2go. The function is called on the output of get_hmm():

go_enriched <- pfam2go(at_hmm)

head(go_enriched) #first few rows
#>            id           name        acc              desc   clan align_start align_end model_start model_end ievalue
#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I   <NA>          41        52          74        85 1.9e+04
#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I   <NA>          41        52          74        85 1.9e+04
#> 3 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I   <NA>          41        52          74        85 1.9e+04
#> 4 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037          94       229           2       156 9.0e-38
#> 5 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037          94       229           2       156 9.0e-38
#> 6 AT2G43620.1 Glyco_hydro_19 PF00182.21 Chitinase class I CL0037          94       229           2       156 9.0e-38
#>   cevalue bitscore reported      Pfam_name                                      GO_name     GO_acc
#> 1 2.0e+00     -5.3    FALSE Glyco_hydro_19                  GO:chitin catabolic process GO:0006032
#> 2 2.0e+00     -5.3    FALSE Glyco_hydro_19                        GO:chitinase activity GO:0004568
#> 3 2.0e+00     -5.3    FALSE Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998
#> 4 9.4e-42    130.4     TRUE Glyco_hydro_19                  GO:chitin catabolic process GO:0006032
#> 5 9.4e-42    130.4     TRUE Glyco_hydro_19                        GO:chitinase activity GO:0004568
#> 6 9.4e-42    130.4     TRUE Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998

GPI prediction

HRGPs, and especially AGPs are often linked to the membrane by a glycosylphosphatidylinositol (GPI) anchor (Ellis et al. 2010). Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide occurs after proteolytic cleavage of a C-terminal propeptide. In order to predict the presence of the omega sites in the filtered sequences two functions are available: get_big_pi() which queries the big-PI Plant Predictor (Eisenhaber et al. 2003) server, and get_pred_gpi() which queries the PredGPI (Pierleoni, Martelli, and Casadio 2008) server and get_netGPI() which queries the NetGPI server (Gíslason et al. 2021).

To query PredGPI:

at_gpi_pred <- get_pred_gpi(at_nsp_3hyp,
                            sequence,
                            Transcript.id)

The resulting data frame has one row per sequence:

head(at_gpi_pred)
#>            id omega_site specificity is.gpi
#> 1 AT2G43620.1        252        0.42  FALSE
#> 2 AT2G30933.1        197        1.00   TRUE
#> 3 AT2G30933.2        139        0.48  FALSE
#> 4 AT2G22470.1        107        1.00   TRUE
#> 5 AT2G43610.1        260        0.16  FALSE
#> 6 AT2G27035.1        138        1.00   TRUE

To check which of the predicted AGPs using scan_ag() are also predicted to be anchored by a GPI the output of get_pred_gp() will be merged with scaned_at6 data frame using dplyr::left_join().

scaned_at6 %>%
  filter(total_length != 0) %>% #filter out non AGPs
  left_join(at_gpi_pred) %>% #merge GPI predictions
  summarize(with_gpi = sum(is.gpi), #sum TRUE instances
            without_gpi = sum(!is.gpi)) #sum FALSE instances
#> Joining, by = "id"
#>   with_gpi without_gpi
#> 1       88         120

Disorder prediction

A key structural feature of HRGPs, based on abundance of disorder-promoting residues, is that they are intrinsically disordered proteins. To identify potential disordered regions in proteins ragp contains get_espritz() which queries ESpritz (Walsh et al. 2012) web server. Three models trained on different data sets are available and can be selected via the argument model:

  • ‘X-Ray’ - based on missing atoms from the Protein Data Bank (PDB) X-ray solved structures. If this option is chosen then the predictors with short disorder options are executed.
  • ‘Disprot’ - contains longer disorder segments compared to x-ray. In particular, disprot is a manually curated database which is often based on functional attributes of the disordered region was used for this definition. Disorder residues are defined if the disprot curators consider the residue to be disordered at least once. All other residues are considered ordered. If this option is chosen then the predictors with long disorder options are executed.
  • ‘NMR’ - based on NMR mobility. NMR flexibility is calculated using the Mobi server (Martin, Walsh, and Tosatto 2010) optimized to replicate the ordered-disordered NMR definition used in CASP8.

These models provide quite different predictions. To query ESpritz:

at_espritz <- get_espritz(at_nsp_3hyp,
                          sequence,
                          Transcript.id,
                          model = "NMR")
#>   start end          id
#> 1     1   7 AT2G43620.1
#> 2    49  92 AT2G43620.1
#> 3   103 116 AT2G43620.1
#> 4   149 152 AT2G43620.1
#> 5   168 170 AT2G43620.1
#> 6   175 178 AT2G43620.1

The output of the function depends on the simplify argument. If simplify = TRUE (default) the output is a data.frame (one row per disordered region) with columns start and end which indicate the start and end of the predicted disordered region as well as the column id which is a protein identifier. When simplify = FALSE the output contains predicted probabilities of disorder for each amino acid.

at_espritz2 <- get_espritz(at_nsp_3hyp,
                           sequence,
                           Transcript.id,
                           model = "NMR",
                           simplify = FALSE)
#>            id  probability
#> 1 AT2G43620.1 0.477693....
#> 2 AT2G30933.1 0.636924....
#> 3 AT2G30933.2 0.631344....
#> 4 AT2G22470.1 0.743908....
#> 5 AT2G43610.1 0.742569....
#> 6 AT2G27035.1 0.556894....
#>                                                                                                                                                                                                                                                                                    prediction
#> 1 DDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOODDDDDDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDOOOOOOOOOOOOOOODDDOOOODDDDOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDOOOOOOOOOOOOOOOOOOOODDDDDDDDD
#> 2                                                         DDDDDDDOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOODDDOOOOOOOOOOOOOOOOOOOOOOOODDDDOOOOODDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOODDDD
#> 3                                                                                                                             DDDDDDDOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDD
#> 4                                                                                                                                                         DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOODDD
#> 5   DDDDDDDDOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDOOOOOODDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOODDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDOOOOOOOOOOOOODDDDDDDDDDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDD
#> 6                                                                                                                         DDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDDDDDDDDDDDDDDDOOOOOOOOOOOOOOOOO

References

Eddy, Sean R. 2011. “Accelerated Profile HMM Searches.” PLOS Computational Biology 7 (10): 1–16. https://doi.org/10.1371/journal.pcbi.1002195.
Eisenhaber, Birgit, Michael Wildpaner, Carolyn J. Schultz, Georg H. H. Borner, Paul Dupree, and Frank Eisenhaber. 2003. “Glycosylphosphatidylinositol Lipid Anchoring of Plant Proteins. Sensitive Prediction from Sequence- and Genome-Wide Studies for Arabidopsis and Rice.” Plant Physiology 133 (4): 1691–701. https://doi.org/10.1104/pp.103.023580.
Ellis, Miriam, Jack Egelund, Carolyn J. Schultz, and Antony Bacic. 2010. “Arabinogalactan-Proteins: Key Regulators at the Cell Surface?” Plant Physiology 153 (2): 403–19. https://doi.org/10.1104/pp.110.156000.
Finn, Robert D., Jody Clements, and Sean R. Eddy. 2011. HMMER Web Server: Interactive Sequence Similarity Searching.” Nucleic Acids Research 39 (Web Server issue): W29–37. https://doi.org/10.1093/nar/gkr367.
Gíslason, Magnús Halldór, Henrik Nielsen, José Juan Almagro Armenteros, and Alexander Rosenberg Johansen. 2021. “Prediction of GPI-Anchored Proteins with Pointer Neural Networks.” Current Research in Biotechnology 3: 6–13. https://doi.org/https://doi.org/10.1016/j.crbiot.2021.01.001.
Johnson, Kim L., Andrew M. Cassin, Andrew Lonsdale, Antony Bacic, Monika S. Doblin, and Carolyn J. Schultz. 2017. “Pipeline to Identify Hydroxyproline-Rich Glycoproteins.” Plant Physiology 174 (2): 886–903. https://doi.org/10.1104/pp.17.00294.
Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, et al. 2014. InterProScan 5: Genome-Scale Protein Function Classification.” Bioinformatics 30 (9): 1236–40. https://doi.org/10.1093/bioinformatics/btu031.
Martin, Alberto J. M., Ian Walsh, and Silvio C. E. Tosatto. 2010. MOBI: A Web Server to Define and Visualize Structural Mobility in NMR Protein Ensembles.” Bioinformatics 26 (22): 2916–17. https://doi.org/10.1093/bioinformatics/btq537.
Pierleoni, Andrea, Pier Luigi Martelli, and Rita Casadio. 2008. PredGPI: A GPI-Anchor Predictor.” BMC Bioinformatics 9 (September): 392. https://doi.org/10.1186/1471-2105-9-392.
Walsh, Ian, Alberto J. M. Martin, Tomàs Di Domenico, and Silvio C. E. Tosatto. 2012. ESpritz: Accurate and Fast Prediction of Protein Disorder.” Bioinformatics 28 (4): 503–9. https://doi.org/10.1093/bioinformatics/btr682.