Protein Sequence Visualization

All of the protein sequence attributes which can be acquired using ragp functions can be visualized using the function plot_prot() which accepts protein sequences in the form of strings, calls the appropriate ragp functions to obtain sequence features and returns a ggplot2 object.

The following section explains how to use plot_prot() and manipulate the output visual style.

Required packages:

library(ggplot2)
library(protr) 
library(ragp)

Basic usage

In the next example the function protr::getUniProt() will be used to download several protein sequences from UniProt and then plot_prot() will be used to visualize the sequence attributes. Do note that downloading the sequences from UniProt is not a requirement for plot_prot(). No information apart the actual protein sequences and their identifiers (accessions) is required for plot_prot():

ids <- c("Q9C7T7",
         "Q9ZPR7",
         "Q9ZQ23")

Download the corresponding sequences using protr::getUniProt() and unlist them so they are in an appropriate format for plot_prot():

seqs <- unlist(protr::getUniProt(ids))
seqs
#> [1] "MSRRPDLLRGSVVATVAATFLLFIFPPNVESTVEKQALFRFKNRLDDSHNILQSWKPSDSPCVFRGITCDPLSGEVIGISLGNVNLSGTISPSISALTKLSTLSLPSNFISGRIPPEIVNCKNLKVLNLTSNRLSGTIPNLSPLKSLEILDISGNFLNGEFQSWIGNMNQLVSLGLGNNHYEEGIIPESIGGLKKLTWLFLARSNLTGKIPNSIFDLNALDTFDIANNAISDDFPILISRLVNLTKIELFNNSLTGKIPPEIKNLTRLREFDISSNQLSGVLPEELGVLKELRVFHCHENNFTGEFPSGFGDLSHLTSLSIYRNNFSGEFPVNIGRFSPLDTVDISENEFTGPFPRFLCQNKKLQFLLALQNEFSGEIPRSYGECKSLLRLRINNNRLSGQVVEGFWSLPLAKMIDLSDNELTGEVSPQIGLSTELSQLILQNNRFSGKIPRELGRLTNIERIYLSNNNLSGEIPMEVGDLKELSSLHLENNSLTGFIPKELKNCVKLVDLNLAKNFLTGEIPNSLSQIASLNSLDFSGNRLTGEIPASLVKLKLSFIDLSGNQLSGRIPPDLLAVGGSTAFSRNEKLCVDKENAKTNQNLGLSICSGYQNVKRNSSLDGTLLFLALAIVVVVLVSGLFALRYRVVKIRELDSENRDINKADAKWKIASFHQMELDVDEICRLDEDHVIGSGSAGKVYRVDLKKGGGTVAVKWLKRGGGEEGDGTEVSVAEMEILGKIRHRNVLKLYACLVGRGSRYLVFEFMENGNLYQALGNNIKGGLPELDWLKRYKIAVGAAKGIAYLHHDCCPPIIHRDIKSSNILLDGDYESKIADFGVAKVADKGYEWSCVAGTHGYMAPELAYSFKATEKSDVYSFGVVLLELVTGLRPMEDEFGEGKDIVDYVYSQIQQDPRNLQNVLDKQVLSTYIEESMIRVLKMGLLCTTKLPNLRPSMREVVRKLDDADPCVSNSQDTTGKITV"
#> [2] "MYMIESKGGAIACMLLALLFLGTWPAIMTLTERRGRLPQHTYLDYTLTNLLAAVIIALTLGEIGPSRPNFFTQLSQDNWQSVMFAMAGGIVLSLGNLATQYAWAYVGLSVTEVITASITVVIGTTLNYFLDDRINRAEVLFPGVACFLIAVCFGSAVHKSNAADNKTKLQNFKSLETTSSFEMETISASNGLTKGKAKEGTAAFLIELEKQRAIKVFGKSTIIGLVITFFAGICFSLFSPAFNLATNDQWHTLKHGVPKLNVYTAFFYFSISAFVVALILNIRFLYWPILGLPRSSFKAYLNDWNGRGWSFLAGFLCGFGNGLQFMGGQAAGYAAADAVQALPLVSTFWGILLFGEYRRSSRKTYTLLISMLLMFIVAVAVLMASSGHRK"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
#> [3] "MGLKVSSSLLCLTILLAVSSIVSAVNITRVLEKYPEFSTMTELLAKTELTPIINKRQTITVLALNNDAIGSISGRPEEEVKNILMNHVVLDYFDELKLKALKEKSTLLTTLYQSTGLGQQQNGFLNCTKSNGKIYFGSGVKGAPQTAEYITTVFRNPYNLSVVQISMPIVAPGLGSPVKVPPPPPMSSPPAPSPKKGAATPAPAPADEGDYADAPPGLAPETAPASAPSESDSPAPAPDKSGKKKMAAADEAEPPSSASNTGLSFGAVLVLGFVASFVGF"

Provide the sequences and protein identifiers to plot_prot():

p1 <- plot_prot(sequence = seqs,
                id = ids)
p1

By default the plot contains the following information:

Transmembrane regions (TM) are shown in yellow. Extracellular regions are indicated by the dashed line above the sequences while intracellular regions are indicated by the dashed line bellow the sequence. For TM prediction plot_prot() can query Phobius or TMHMM via the functions get_phobius() and get_tmhmm(). The plot_prot() argument tm (can be: “phobius”, “tmhmm” or “none”) instructs which method should be used. Default is “phobius”.
Signal peptides are indicated by the thick red line on the N-terminal side. For N-sp prediction plot_prot() can query SignalP 4.1 or SignalP-5.0. The plot_prot() argument nsp (can be: “signalp”, “signalp5” or “none”) instructs which method should be used. Default is “signalp5”.
GPI attachment sites (as predicted by get_big_pi(), get_pred_gpi() or get_netGPI() depending on the plot_prot() argument gpi) are indicated by blue rhomboids. Default is to query NetGPI
Hydroxyprolines (as predicted by predict_hyp()) are indicated by vertical dark gray lines.
Arabinogalactan glycomodule spans are indicated by the light grey background (motifs are found using scan_ag()).
Domains are indicated by rectangles with an appropriate fill as indicated in the legend. Domain annotation for the plot can be obtained using get_hmm() or get_cdd() depending on the domain argument (can be: “cdd”, “hmm” or “none”). Default is to use get_cdd().

Plotting can take a while since plot_prot() first obtains the appropriate predictions using the above mentioned function and then generates the plot. To check what actions the function is performing set the argument progress to TRUE.

Alternatively the outputs of get_hmm(), get_cdd(), get_signalp(), get_signalp5(), get_phobius(), get_tmhmm(), get_big_pi(), get_pred_gpi(), get_netGPI() and get_espritz() can be supplied as plot_prot() arguments domain, nsp, tm, gpi and disorder:

domains <- get_cdd(sequence = seqs,
                   id = ids)

tm <- get_phobius(sequence = seqs,
                  id = ids)

nsp <- get_signalp5(sequence = seqs,
                   id = ids)

gpi <- get_netGPI(sequence = seqs,
                  id = ids)

p1_df <- plot_prot(sequence = seqs,
                   id = ids,
                   domain = domains,
                   tm = tm,
                   nsp = nsp,
                   gpi = gpi)

p1_df

Mixing the two types of input is also a possibility. In the below call the data frames nsp and gpi are used to tell prot_plot() the length of the N-sp and the positions of omega sites while domain = "hmm" and tm = "tmhmm" instruct it to use get_hmm() to obtain domain predictions and get_tmhmm() to obtain TM predictions:

p2 <- plot_prot(sequence = seqs,
                id = ids,
                domain = "hmm",
                tm = "tmhmm",
                nsp = nsp,
                gpi = gpi)
p2

From the above plot it can be observed TMHMM tends to predict TM regions where there are N-sp’s and that get_cdd() and get_hmm() differ in the type of detected domains. For instance the Q9C7T7 is a leucine-rich repeat (LRR) receptor-like protein kinase - the Conserved domain database has an entry for exactly the mentioned class PLN00113 while the Pfam database has more general domain models so it predicts a protein kinase domain and several LRR domains.

To plot disordered regions as predicted by Espritz vie the function get_espritz():

p2 <- plot_prot(sequence = seqs,
                id = ids,
                domain = domains,
                tm = tm,
                nsp = nsp,
                gpi = gpi,
                disorder = TRUE)
                
p2

The longest predicted disordered region overlaps with the AG motif span in Q9ZQ23

`plot_prot()` options

determining what features to plot: to turn off some elements from being plotted use the appropriate argument, for instance hyp = FALSE will result in hydroxyproline predictions to not be plotted (check full list of options: ?plot_prot):

p1 <- plot_prot(sequence = seqs,
                id = ids,
                hyp = FALSE, #do not plot hydroxyprolines
                ag = FALSE, #do not plot arabinogalactan motifs
                gpi = "predgpi", #use `get_pred_gpi()` to obtain GPI predictions
                domain = "hmm")  #use `get_hmm()` to obtain domain predictions
p1

feature specific options: passing appropriate function arguments to functions called by plot_prot() such as dim and div for scan_ag() or tprob for predict_hyp() to change the rules of arabinogalactan motif scan and the threshold for decision if the proline is hydroxylated is supported:

p2 <- plot_prot(sequence = seqs,
                id = ids,
                dim = 5, #`scan_ag()` argument, minimum number of dipeptides for a motif
                div = 5, #`scan_ag()` argument, maximum number of amino acids between dipeptides 
                tprob = 0.9,
                domain = "hmm",
                ievalue = 1e-10) #`get_hmm()` argument, remove domains with independent E-value higher than this value
                
p2

change feature colors:

p3 <- plot_prot(sequence = seqs,
                id = ids,
                hyp_col = "chocolate2",
                gpi_col = "lightsalmon4",
                nsp_col = "sienna1",
                ag_col = "bisque2",
                tm_col = "aquamarine") 
p3

change the order of domain plotting: in the above figures where plot_prot() was called with domain = "hmm" it can be observed there are more entries in the legend then can be seen in the plot. Some domains cover other domains. By default the domains with the lowest I-evalue (independent e-value) are plotted on top. To change this the argument dom_sort can be used. This argument can take the following values: "ievalue" (default), "abc" (alphabetically sorted by domain names), "cba" (reverse of "abc").

p4 <- plot_prot(sequence = seqs,
                id = ids,
                dom_sort = "abc",
                domain = "hmm")
p4

Ureide permease domain overlaps with TMEM144 transmembrane transporter domain and also protein kinase and protein tyrosine kinase domains overlap.

Customizing the plot using ggplot layers

plot_prot() returns a ggplot object which can be further customized:

change theme elements:

p1 +
  theme(legend.position = "bottom",
        legend.direction = "vertical")

add annotations:

p1 +
  theme(legend.position = "bottom",
        legend.direction = "vertical") +
  annotate(geom = "text",
           y = 3,
           x = 90,
           label = "FAS domain")

The y axis is continuous - 1 on the plot corresponds to Q9C7T7, 2 corresponds to Q9ZPR7 and 3 to Q9ZQ23. The x axis represents amino acids in the sequence. This enables adding annotations anywhere on the plot.

p1 +
  theme(legend.position = "bottom",
        legend.direction = "vertical") +
  annotate(geom = "text",
           y = 1.6,
           x = 530,
           label = "annotation 1")

change plot ratio: the ratio of the axes is determined automatically by plot_prot() and scales with the longest sequence plotted. To change this use the ratio argument to coord_equal():

p1 +
  theme(legend.position = "bottom",
        legend.direction = "vertical") +
  coord_equal(ratio = 80,
              expand = FALSE)

change domain colors: use another fill palette:

p1 +
  theme(legend.position = "bottom",
        legend.direction = "vertical") +
  scale_fill_brewer(type = "div")

or create a manual palette:

p1 +
  theme(legend.position = "bottom",
        legend.direction = "vertical") +
  scale_fill_manual(values = c("orange",
                               "red",
                               "yellow",
                               "dodgerblue",
                               "green",
                               "blue",
                               "purple",
                               "grey30"))

Basic usage

plot_prot() options

Customizing the plot using ggplot layers

`plot_prot()` options