Getting started • ragp

This short section explains how to get started with ragp, from installation and basic function arguments to the manipulation of function outputs.

Installation

There are several ways to install R packages hosted on GitHub. The simplest is to use remotes::install_github(), which performs all the required steps automatically.

To install ragp run:

# install.packages("remotes") #if not present
# install.packages("git2r") #if not present
remotes::install_github("missuse/ragp")

Alternatively, run:

# install.packages("remotes") #if not present
# install.packages("git2r") #if not present
remotes::install_git("https://github.com/missuse/ragp",
                     build_vignettes = TRUE)

to also build the vignettes, which can then be viewed with:

browseVignettes("ragp")
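The commented-out install.packages() calls above are only needed when the helper packages are absent. A minimal sketch of how that check might be automated is given below; missing_pkgs() is a hypothetical helper written for this illustration and is not part of ragp or remotes.

```r
# Hypothetical helper: report which of the named packages are not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("remotes", "git2r"))
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Keeping the actual install call commented out leaves the decision to install with the user.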

Inputs

Most ragp functions require protein sequences in single-letter code and the corresponding identifiers as input. These can be provided as basic R data types such as vectors or data frames. Additionally, ragp imports the seqinr package for the manipulation of .FASTA files, so the input can also be a list of SeqFastaAA objects returned by seqinr::read.fasta(). The path to a .FASTA file is accepted as input as well. As of ragp version 0.3.5, objects of class AAStringSet are also supported.
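To make the expected shape of the input concrete, the sketch below builds a small toy data frame of identifiers and single-letter sequences and flags entries containing anything other than the 20 standard amino acid letters. The data, the column names and the is_standard_aa() helper are all made up for illustration. Whether ragp itself tolerates ambiguity codes such as X is not stated here, so the check is only a pre-flight sanity test, not a description of ragp's own validation.

```r
# Toy input: identifiers plus sequences in single-letter amino acid code.
toy <- data.frame(
  id = c("seq1", "seq2", "seq3"),
  sequence = c("MKTAPSPAPSPSPA", "MGSSPPPPKL", "MAB1X"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a sequence uses only the 20 standard letters.
is_standard_aa <- function(x) {
  grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", toupper(x))
}

toy$valid <- is_standard_aa(toy$sequence)
toy[!toy$valid, "id"]
#> [1] "seq3"
```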

Input options will be illustrated using the scan_ag() function:

provide a character vector of protein sequences to the sequence argument and a character vector of protein identifiers to the id argument:

library(ragp)
data(at_nsp) #a data frame of 2700 Arabidopsis sequences
input1 <- scan_ag(sequence = at_nsp$sequence,
                  id = at_nsp$Transcript.id)

provide a data.frame to the data argument, and the names of the columns containing the protein sequences and corresponding identifiers to the sequence and id arguments:

input2 <- scan_ag(data = at_nsp,
                  sequence = "sequence",
                  id = "Transcript.id")

quoting column names is not necessary:

input3 <- scan_ag(data = at_nsp,
                  sequence = sequence,
                  id = Transcript.id)

provide a list of SeqFastaAA objects to the data argument:

library(seqinr) #to create a fasta file with protein sequences

#write a FASTA file (full argument names instead of relying on partial matching)
seqinr::write.fasta(sequences = strsplit(at_nsp$sequence, ""),
                    names = at_nsp$Transcript.id,
                    file.out = "at_nsp.fasta")

#read a FASTA file to a list of SeqFastaAA objects
At_seq_fas <- read.fasta("at_nsp.fasta",
                         seqtype =  "AA", 
                         as.string = TRUE) 

input4 <- scan_ag(data = At_seq_fas)

provide the path to the .FASTA file to be analyzed as a string:

input5 <- scan_ag(data = "at_nsp.fasta") #file at_nsp.fasta is in the working directory

provide an AAStringSet object:

dat <- Biostrings::readAAStringSet("at_nsp.fasta") #file at_nsp.fasta is in the working directory
input6 <- scan_ag(data = dat)
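Because AAStringSet input is only supported from ragp version 0.3.5 onward, a script that relies on it may want to check the version first. The sketch below shows one way to do that; supports_aastringset() is a hypothetical helper written for this illustration and is not a ragp function.

```r
# Hypothetical helper: TRUE if the given ragp version is at least 0.3.5,
# the version from which AAStringSet input is supported.
supports_aastringset <- function(version) {
  as.package_version(version) >= as.package_version("0.3.5")
}

supports_aastringset("0.3.4")
#> [1] FALSE
supports_aastringset("0.3.5")
#> [1] TRUE

# With ragp installed, the installed version can be passed in directly:
# supports_aastringset(utils::packageVersion("ragp"))
```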

All of the outputs are equal:

all.equal(input1,
          input2)
#> [1] TRUE

all.equal(input1,
          input3)
#> [1] TRUE

all.equal(input1,
          input4)
#> [1] TRUE

all.equal(input1,
          input5)
#> [1] TRUE

all.equal(input1,
          input6)
#> [1] TRUE

The only exceptions to this design are the plotting function plot_prot(), which requires protein sequences to be supplied as character vectors (as for input1 in the above example), and pfam2go(), which does not take sequences as input.
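When sequences are held in a list rather than in vectors, they can be flattened into the character-vector form first. The sketch below uses a plain named list as a stand-in; it assumes each element is a single string and that the list names are the identifiers. Real SeqFastaAA objects carry extra attributes, so treat this as an illustration of the idea, not a tested recipe for seqinr output.

```r
# Stand-in for a list of sequences read from a FASTA file;
# the list names play the role of protein identifiers.
seq_list <- list(seq1 = "MKTAPSPAPSPSPA", seq2 = "MGSSPPPPKL")

ids <- names(seq_list)                             # character vector of identifiers
sequences <- unlist(seq_list, use.names = FALSE)   # character vector of sequences
```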

Outputs

All ragp functions return basic R data structures such as data frames, lists of vectors and lists of data frames, which makes their output convenient to manipulate for anyone familiar with R. A particularly effective way to work with these objects is the tidyverse collection of packages, especially dplyr for data wrangling and ggplot2 for plotting.

Examples of using these functions on objects returned by ragp functions are provided in the HRGP filtering and HRGP analysis tutorials. There are also extensive examples of their use available online.
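As a small taste of this kind of wrangling, the sketch below chains a few common dplyr verbs on a toy data frame. The data and the column names (n_motifs, seq_length) are hypothetical stand-ins and do not reproduce the actual columns returned by any ragp function. The choice of verbs is likewise only illustrative.

```r
library(dplyr)

# Hypothetical stand-in for a data frame returned by a ragp function;
# the column names are illustrative, not ragp's actual output columns.
res <- data.frame(
  id = c("prot1", "prot2", "prot3", "prot4"),
  n_motifs = c(0L, 3L, 7L, 2L),
  seq_length = c(120L, 250L, 310L, 95L),
  stringsAsFactors = FALSE
)

summary_tbl <- res %>%
  filter(n_motifs > 0) %>%                                   # keep proteins with at least one motif
  mutate(motifs_per_100aa = 100 * n_motifs / seq_length) %>% # normalise by sequence length
  arrange(desc(motifs_per_100aa)) %>%                        # densest first
  select(id, motifs_per_100aa)

summary_tbl
```

Because the result is an ordinary data frame, it can be passed straight on to further dplyr steps or to a plotting function.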

Clear visualizations are usually the goal of the data manipulations mentioned above. The current gold standard for R graphics is the ggplot2 package, and we recommend it for summarizing the data graphically. Additionally, ragp contains the plot_prot() function, which is a wrapper for ggplot2. While plot_prot() can be used without knowing ggplot2 syntax, tweaking the plot style requires at least a basic knowledge of ggplot2. Examples are provided in the protein sequence visualization tutorial.
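For readers new to ggplot2, the sketch below shows the kind of style tweaks meant here: changing axis labels, the theme and the text angle on a simple bar chart. The data frame is a hypothetical toy example. Since plot_prot() is described as a ggplot2 wrapper, similar layers may well be applicable to its output, but that is an assumption and is not demonstrated here.

```r
library(ggplot2)

# Hypothetical per-protein summary; column names are illustrative only.
res <- data.frame(
  id = c("prot1", "prot2", "prot3"),
  n_motifs = c(3L, 7L, 2L)
)

p <- ggplot(res, aes(x = id, y = n_motifs)) +
  geom_col(fill = "steelblue") +                            # one bar per protein
  labs(x = "Protein", y = "Number of motifs") +             # readable axis labels
  theme_bw() +                                              # switch the overall theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # tilt the x-axis labels

p
```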