back to Big Data Biology main page

Solution to Making a scatter plot from last time:

#make a subset
prot <- subset(gene, protein_coding ==1)

#simple plot without log
plot(
  prot$mRNA_copies_per_cell,
  prot$protein_copies_per_cell
)

#log10 plot
plot(
  log10(prot$mRNA_copies_per_cell),
  log10(prot$protein_copies_per_cell)
)

#add some labels
plot(
  log10(prot$mRNA_copies_per_cell),
  log10(prot$protein_copies_per_cell),
  xlab="mRNA copies per cell (log10)",
  ylab="protein copies per cell(log10)",
  main = "this is the plots title"
)

Video guide to this workshop


Introduction

In this workshop we'll add some more data to our data frame using merge(), and improve our plotting skills using the ggplot2 library.


Aims

  1. To gain more familiarity with data processing in R.
  2. To learn how to add new data to a data frame and plot that data.
  3. To learn about transposon mutagenesis data. (Sometimes called Tn-seq.)
  4. To show the power of combining high throughput data from different sources.

Learning Outcomes

When you have completed and revised this workshop in your own time you should be able to:

  • make a better scatter plot, using smoothScatter.
  • make simple histograms and scatter plots using ggplot2.
  • test for correlations using cor.test.

Don't forget to revise this workshop in your own time.


The data we use is described here.


Getting started

  1. Open your R script.
  2. Load your own saved Rda file from last time.
load("my-saved-data.Rda")

You should have a data frame called gene. If not, download the Rda file yeast_data.28-02-2020.Rda, from here, then load it into R, like so:

load("yeast_data.28-02-2020.Rda")

Or you can load the data directory from the web, like so:

load(url("https://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))

Exploring the data

Transposon mutagenesis data

Now we'll add another column of data to this data frame. This data will be transposon mutagenesis data. This data was derived from a Tn-seq experiment, where we made millions of transposon insertions into the S. pombe genome, and found where they inserted using Illumina DNA sequencing

You can read more about transposon mutagenesis here. You can read more detail about the fission yeast experiment here, and similar experiment with the budding yeast Saccharomyces cerevisiae here.


Downloading, loading and filtering the data

First, download the transposon data (yeast-transposon-data.txt), from here.

Then load it into a data frame called tra.

tra <- read.table("yeast-transposon-data.txt",h=T)

Then use the commands head(tra), nrow(tra), names(tra) and summary(tra) to take a quick look at your data.

You'll see that the data has genes in rows and various numeric values in columns.


Select some columns of the data to use.

This data is complex. So lets select just some of the columns, to keep it simple.

Do it yourself: See what columns the tra data frame has using,

names(tra)

or

summary(tra)

We'll select the gene, and uipkm columns, because we found these the most useful. Here uipkm means unique insertions per kb, per million insertions. It's very much like the gene expression measure RPKM. UIPKM is in the 13th column. To select this and the gene column:

tra2 <- tra[,c(1,13)]

Merge data sets

Now merge this new data frame with the gene data frame (according to what is in the column "gene" of both datasets).

gene2 <- merge(gene,tra2,by="gene",all=T)

Next, check out that the merging went OK, using your favourite combination of head(), nrow(), names() and summary() on the gene, gene2, tra or tra2 data frames.

Why do this? Because you need to make sure that you data is put together correctly before you analyse it.


A simple test of the data

Now, let's check out the data, by plotting something simple, where we know what to expect. When a transposon is inserted into a gene, it frequently inactivates the gene. So insertions in essential genes should kill the cell. Here's how it works:



Confused?

  • During the workshops, or for help with R commands go to the google hangout
  • For more complex questions, outside workshop times, post something on the discussion board on the VLE. We'll check this and respond once a week.

So, we should have less insertions in essential genes.

Do it yourself: Examine this, by making a box plot that compares the transposon insertions in essential vs non-essential genes.

ess <- subset(gene2, essential ==1 & protein_coding ==1)
not.ess <- subset(gene2, essential ==0 & protein_coding ==1)
boxplot(
  ess$uipkm,
  not.ess$uipkm,
  outline=F,
  notch=T,
  names=c("something","something else"),
  ylab="name the y axis"
)

Do it yourself: Test whether essential vs non-essential genes have different insertion densities using a Wilcoxon Rank Sum test.

wilcox.test(ess$uipkm,not.ess$uipkm)

Presenting data clearly

Setting up ggplot2

By now you should be good at making box and whisker plots with boxplot and scatter plots with plot.

Now let's make some plots with the wonderful ggplot2 package. We'll need to tidy up the data a little.

  • First, make the essential column into a factor (it's currently numeric) This will change the data from continuous to discrete categorical which is what it should be.
gene2$essential <- as.factor(gene2$essential)
  • Then make a new data frame that contains only lines where the essential column is not NA (which means missing data). Here the "!" means NOT TRUE so "!is.na" means select rows where "is.na" (is it NA?) is false
gene3 <- subset(gene2, !is.na(essential))
  • Now load the ggplot library
library(ggplot2)

Need help? Ask a demonstrator, or post a question on the google hangout


Making a plot with ggplot2

ggpplot works by adding layers to plots. Each line of your code adds a new layer. You need to add a plus sign (+) at the end of each line apart from the last, like so:

ggplot(bla bla) +
geom_something(bla bla) +
something(bla bla bla)  

Let's make a boxplot using ggplot2.

ggplot(data = gene3, aes(x = essential, y = uipkm)) +
geom_boxplot(notch=T)


  • Now let's make it look better, by making the axis titles and text bigger.
ggplot(data = gene3, aes(x = essential, y = uipkm)) +
geom_boxplot(notch=T) +
theme(axis.title.x = element_text(face="bold", size=20), axis.text.x  = element_text(size=16)) +
theme(axis.title.y = element_text(face="bold", size=20), axis.text.y  = element_text(size=16)) 

Do it yourself: Try adding these extra lines to the boxplot commands. Remember to add the plus signs in the ends of all non-final lines.

coord_flip() +
theme_classic() 

You can download a ggplot2 cheatsheet here. This explains everything you need to know, and plenty more.


Made a nice plot and want some recognition? Post it on the google hangout. Include your code. Tell us what is shows. Win the edible prize.


Back up your work

  • Save your script somewhere where you can get to it later.
  • Save an image of all your data, like so:
save.image("this_is_my_stuff.Rda")

Before the next workshop

1. Decide which data set you will use for your project. The assessement for this course is a 1500 word report. The report should be in a 'letter' style of scientific article, describing your research project. See more details on the VLE. You should choose either the fission yeast data or the Brassica napus data for your project.

2. Start to explore your chosen data set in your own time. Begin by thinking of a question, or data type that you want to explore. Then try out some of the commands we have shown you. Then alter them to perform slightly different analyses.