All of the protein sequence attributes which can be acquired using ragp
functions can be visualized using the function plot_prot()
which accepts protein sequences in the form of strings, calls the appropriate ragp
functions to obtain sequence features and returns a ggplot2
object.
The following section explains how to use plot_prot()
and manipulate the output visual style.
Required packages:
In the next example the function protr::getUniProt()
will be used to download several protein sequences from UniProt and then plot_prot()
will be used to visualize the sequence attributes. Do note that downloading the sequences from UniProt is not a requirement for plot_prot()
. No information apart the actual protein sequences and their identifiers (accessions) is required for plot_prot()
:
ids <- c("Q9C7T7",
"Q9ZPR7",
"Q9ZQ23")
Download the corresponding sequences using protr::getUniProt()
and unlist them so they are in an appropriate format for plot_prot()
:
seqs <- unlist(protr::getUniProt(ids))
seqs
#> [1] "MSRRPDLLRGSVVATVAATFLLFIFPPNVESTVEKQALFRFKNRLDDSHNILQSWKPSDSPCVFRGITCDPLSGEVIGISLGNVNLSGTISPSISALTKLSTLSLPSNFISGRIPPEIVNCKNLKVLNLTSNRLSGTIPNLSPLKSLEILDISGNFLNGEFQSWIGNMNQLVSLGLGNNHYEEGIIPESIGGLKKLTWLFLARSNLTGKIPNSIFDLNALDTFDIANNAISDDFPILISRLVNLTKIELFNNSLTGKIPPEIKNLTRLREFDISSNQLSGVLPEELGVLKELRVFHCHENNFTGEFPSGFGDLSHLTSLSIYRNNFSGEFPVNIGRFSPLDTVDISENEFTGPFPRFLCQNKKLQFLLALQNEFSGEIPRSYGECKSLLRLRINNNRLSGQVVEGFWSLPLAKMIDLSDNELTGEVSPQIGLSTELSQLILQNNRFSGKIPRELGRLTNIERIYLSNNNLSGEIPMEVGDLKELSSLHLENNSLTGFIPKELKNCVKLVDLNLAKNFLTGEIPNSLSQIASLNSLDFSGNRLTGEIPASLVKLKLSFIDLSGNQLSGRIPPDLLAVGGSTAFSRNEKLCVDKENAKTNQNLGLSICSGYQNVKRNSSLDGTLLFLALAIVVVVLVSGLFALRYRVVKIRELDSENRDINKADAKWKIASFHQMELDVDEICRLDEDHVIGSGSAGKVYRVDLKKGGGTVAVKWLKRGGGEEGDGTEVSVAEMEILGKIRHRNVLKLYACLVGRGSRYLVFEFMENGNLYQALGNNIKGGLPELDWLKRYKIAVGAAKGIAYLHHDCCPPIIHRDIKSSNILLDGDYESKIADFGVAKVADKGYEWSCVAGTHGYMAPELAYSFKATEKSDVYSFGVVLLELVTGLRPMEDEFGEGKDIVDYVYSQIQQDPRNLQNVLDKQVLSTYIEESMIRVLKMGLLCTTKLPNLRPSMREVVRKLDDADPCVSNSQDTTGKITV"
#> [2] "MYMIESKGGAIACMLLALLFLGTWPAIMTLTERRGRLPQHTYLDYTLTNLLAAVIIALTLGEIGPSRPNFFTQLSQDNWQSVMFAMAGGIVLSLGNLATQYAWAYVGLSVTEVITASITVVIGTTLNYFLDDRINRAEVLFPGVACFLIAVCFGSAVHKSNAADNKTKLQNFKSLETTSSFEMETISASNGLTKGKAKEGTAAFLIELEKQRAIKVFGKSTIIGLVITFFAGICFSLFSPAFNLATNDQWHTLKHGVPKLNVYTAFFYFSISAFVVALILNIRFLYWPILGLPRSSFKAYLNDWNGRGWSFLAGFLCGFGNGLQFMGGQAAGYAAADAVQALPLVSTFWGILLFGEYRRSSRKTYTLLISMLLMFIVAVAVLMASSGHRK"
#> [3] "MGLKVSSSLLCLTILLAVSSIVSAVNITRVLEKYPEFSTMTELLAKTELTPIINKRQTITVLALNNDAIGSISGRPEEEVKNILMNHVVLDYFDELKLKALKEKSTLLTTLYQSTGLGQQQNGFLNCTKSNGKIYFGSGVKGAPQTAEYITTVFRNPYNLSVVQISMPIVAPGLGSPVKVPPPPPMSSPPAPSPKKGAATPAPAPADEGDYADAPPGLAPETAPASAPSESDSPAPAPDKSGKKKMAAADEAEPPSSASNTGLSFGAVLVLGFVASFVGF"
Provide the sequences and protein identifiers to plot_prot()
:
p1 <- plot_prot(sequence = seqs,
id = ids)
p1
By default the plot contains the following information:
Transmembrane regions (TM) are shown in yellow. Extracellular regions are indicated by the dashed line above the sequences while intracellular regions are indicated by the dashed line bellow the sequence. For TM prediction plot_prot()
can query Phobius or TMHMM via the functions get_phobius()
and get_tmhmm()
. The plot_prot()
argument tm
(can be: “phobius”, “tmhmm” or “none”) instructs which method should be used. Default is “phobius”.
Signal peptides are indicated by the thick red line on the N-terminal side. For N-sp prediction plot_prot()
can query SignalP 4.1 or SignalP-5.0. The plot_prot()
argument nsp
(can be: “signalp”, “signalp5” or “none”) instructs which method should be used. Default is “signalp5”.
GPI attachment sites (as predicted by get_big_pi()
, get_pred_gpi()
or get_netGPI()
depending on the plot_prot()
argument gpi
) are indicated by blue rhomboids. Default is to query NetGPI
Hydroxyprolines (as predicted by predict_hyp()
) are indicated by vertical dark gray lines.
Arabinogalactan glycomodule spans are indicated by the light grey background (motifs are found using scan_ag()
).
Domains are indicated by rectangles with an appropriate fill as indicated in the legend. Domain annotation for the plot can be obtained using get_hmm()
or get_cdd()
depending on the domain
argument (can be: “cdd”, “hmm” or “none”). Default is to use get_cdd()
.
Plotting can take a while since plot_prot()
first obtains the appropriate predictions using the above mentioned function and then generates the plot. To check what actions the function is performing set the argument progress
to TRUE
.
Alternatively the outputs of get_hmm()
, get_cdd()
, get_signalp()
, get_signalp5()
, get_phobius()
, get_tmhmm()
, get_big_pi()
, get_pred_gpi()
, get_netGPI()
and get_espritz()
can be supplied as plot_prot()
arguments domain
, nsp
, tm
, gpi
and disorder
:
domains <- get_cdd(sequence = seqs,
id = ids)
tm <- get_phobius(sequence = seqs,
id = ids)
nsp <- get_signalp5(sequence = seqs,
id = ids)
gpi <- get_netGPI(sequence = seqs,
id = ids)
p1_df <- plot_prot(sequence = seqs,
id = ids,
domain = domains,
tm = tm,
nsp = nsp,
gpi = gpi)
p1_df
Mixing the two types of input is also a possibility. In the below call the data frames
nsp
and gpi
are used to tell prot_plot()
the length of the N-sp and the positions of omega sites while domain = "hmm"
and tm = "tmhmm"
instruct it to use get_hmm()
to obtain domain predictions and get_tmhmm()
to obtain TM predictions:
p2 <- plot_prot(sequence = seqs,
id = ids,
domain = "hmm",
tm = "tmhmm",
nsp = nsp,
gpi = gpi)
p2
From the above plot it can be observed TMHMM tends to predict TM regions where there are N-sp’s and that get_cdd()
and get_hmm()
differ in the type of detected domains. For instance the Q9C7T7 is a leucine-rich repeat (LRR) receptor-like protein kinase - the Conserved domain database has an entry for exactly the mentioned class PLN00113 while the Pfam database has more general domain models so it predicts a protein kinase domain and several LRR domains.
To plot disordered regions as predicted by Espritz vie the function get_espritz()
:
p2 <- plot_prot(sequence = seqs,
id = ids,
domain = domains,
tm = tm,
nsp = nsp,
gpi = gpi,
disorder = TRUE)
p2
The longest predicted disordered region overlaps with the AG motif span in Q9ZQ23
plot_prot()
optionshyp = FALSE
will result in hydroxyproline predictions to not be plotted (check full list of options: ?plot_prot
):
p1 <- plot_prot(sequence = seqs,
id = ids,
hyp = FALSE, #do not plot hydroxyprolines
ag = FALSE, #do not plot arabinogalactan motifs
gpi = "predgpi", #use `get_pred_gpi()` to obtain GPI predictions
domain = "hmm") #use `get_hmm()` to obtain domain predictions
p1
plot_prot()
such as dim
and div
for scan_ag()
or tprob
for predict_hyp()
to change the rules of arabinogalactan motif scan and the threshold for decision if the proline is hydroxylated is supported:
p2 <- plot_prot(sequence = seqs,
id = ids,
dim = 5, #`scan_ag()` argument, minimum number of dipeptides for a motif
div = 5, #`scan_ag()` argument, maximum number of amino acids between dipeptides
tprob = 0.9,
domain = "hmm",
ievalue = 1e-10) #`get_hmm()` argument, remove domains with independent E-value higher than this value
p2
p3 <- plot_prot(sequence = seqs,
id = ids,
hyp_col = "chocolate2",
gpi_col = "lightsalmon4",
nsp_col = "sienna1",
ag_col = "bisque2",
tm_col = "aquamarine")
p3
plot_prot()
was called with domain = "hmm"
it can be observed there are more entries in the legend then can be seen in the plot. Some domains cover other domains. By default the domains with the lowest I-evalue (independent e-value) are plotted on top. To change this the argument dom_sort
can be used. This argument can take the following values: "ievalue"
(default), "abc"
(alphabetically sorted by domain names), "cba"
(reverse of "abc"
).
p4 <- plot_prot(sequence = seqs,
id = ids,
dom_sort = "abc",
domain = "hmm")
p4
Ureide permease domain overlaps with TMEM144 transmembrane transporter domain and also protein kinase and protein tyrosine kinase domains overlap.
plot_prot()
returns a ggplot
object which can be further customized:
p1 +
theme(legend.position = "bottom",
legend.direction = "vertical")
p1 +
theme(legend.position = "bottom",
legend.direction = "vertical") +
annotate(geom = "text",
y = 3,
x = 90,
label = "FAS domain")
The y
axis is continuous - 1
on the plot corresponds to Q9C7T7, 2
corresponds to Q9ZPR7 and 3 to Q9ZQ23. The x
axis represents amino acids in the sequence. This enables adding annotations anywhere on the plot.
p1 +
theme(legend.position = "bottom",
legend.direction = "vertical") +
annotate(geom = "text",
y = 1.6,
x = 530,
label = "annotation 1")
plot_prot()
and scales with the longest sequence plotted. To change this use the ratio
argument to coord_equal()
:
p1 +
theme(legend.position = "bottom",
legend.direction = "vertical") +
coord_equal(ratio = 80,
expand = FALSE)
p1 +
theme(legend.position = "bottom",
legend.direction = "vertical") +
scale_fill_brewer(type = "div")
or create a manual palette:
p1 +
theme(legend.position = "bottom",
legend.direction = "vertical") +
scale_fill_manual(values = c("orange",
"red",
"yellow",
"dodgerblue",
"green",
"blue",
"purple",
"grey30"))