back to Big Data Biology main page


Video guide to this workshop


Introduction

In this workshop, we'll explore the S. pombe yeast data some more. You will learn how to test whether two variables are correlated, and have a think about what this might mean if they are (or if they are not). Then you'll learn how to make a plot that shows this.

We'll use subsets of data to make box and whisker plots, then use a statistical test to find out if two sets of values are significantly different.

We'll learn about these principles of big data analysis:

  • What is the biological question?
  • Statistical tests are powerful.

If you haven't done so already download the The Seven Principles of Big Data Biology here. These principles sum up what we want you to learn.


Aims

  1. To gain more familiarity with data processing in R.
  2. To learn how to apply some statistical tests and visualisation methods (plots).
  3. To learn some aspects of RNA and protein gene expression data.

Learning Outcomes

At the end of this workshop you will be able to:

  • Make subsets of data frames (and understand why this is useful).

  • Make a box and whiskers plot, like this:

  • Make a scatter plot, like this:
  • Test whether two parameters are correlated, using the cor.test() function.

  • Test whether one set of values (eg: gene expression) is greater than the other, using the wilcox.test() function.


Getting started

You should have RStudio started already. Keep recording your commands in your own R script.

First load up the fission yeast data from here.


Set the working directory first to where your data file is (otherwise the computer will complain).

Then load it into R.

load("yeast_data.28-02-2020.Rda")

Or you can load the data directory from the web, like so:

load(url("http://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))

Need some help?

  • During the workshops, or for help with R commands go to the google hangout
  • For more complex questions, outside workshop times, post something on the discussion board on the VLE. We'll check this and respond once a week.

Exploring the data

Now let's find out something about biology by playing about with the gene data frame.

What is the biological question?

To focus ourselves, let's choose a question to look into. In fission yeast (and many other model species), we know that some genes can be 'knocked out' (removed) without killing the cell, and some gene knockouts are lethal. Genes that are lethal when knocked out are called essential genes.

So: Are essential genes are more highly expressed?

Fortunately, some very determined people tried to knock out every single protein-coding gene in the fission yeast genome (read about it here). This data is in the gene table, in the essential column.

gene$essential

Let's make two subsets of the data that contain the essential protein-coding genes, and the non-essential protein-coding genes:

ess.prot <- subset(gene, protein_coding == 1 & essential==1)
non.ess.prot <- subset(gene, protein_coding == 1 & essential==0)

Do it yourself: Make these subsets, so you can use them later today.

Plotting our data

Now let's make a box and whiskers plot that shows the gene expression of these two sets of genes:

boxplot(
  ess.prot$gene.expression.RPKM,
  non.ess.prot$gene.expression.RPKM,
  outline=T
)

Not very informative, is it?

Let's try that again, on a log scale. Let's put a red dotted line at the essential gene median, using these commands:

ess.med <- median(ess.prot$gene.expression.RPKM,na.rm=T)
abline(h=log10(ess.med),col="red",lty=2)

Let's make a new plot:

  • Note that the names=c("this","that") argument gives names to the boxes.
boxplot(
  log10(ess.prot$gene.expression.RPKM),
  log10(non.ess.prot$gene.expression.RPKM),
  ylab="RPKM (log10)",
  xlab="gene type",
  names=c("ESS","NON")
)
ess.med <- median(ess.prot$gene.expression.RPKM,na.rm=T)
abline(h=log10(ess.med),col="red",lty=2)

Looks like the essential genes are more highly expressed, doesn't it?


Do it yourself: See if you can alter this plot so it doesn't show all the outlying values, by adding this line to the boxplot code.

outline=F

Need some help with this? Post your question on the google hangout


Statistical tests are powerful.

It looks like the essential genes are more highly expressed, but could this be by chance? Let's use a Wilcoxon signed rank test, wilcox.test, to find out.

Note: A Wilcoxon signed rank test is like a T-test, but doesn't assume that data is normally distributed (i.e. it's non-parametric). Just to confuse us, sometimes these are called Mann-Whitney tests. To use a Mann-Whitney, to test whether the vectors a and b are sigificantly different, do this:

wilcox.test(a,b)

Now, let's look at our gene expression data:

wilcox.test(ess.prot$gene.expression.RPKM,non.ess.prot$gene.expression.RPKM)

The p-value is less than 2.2e-16 (2.2 x 10-16), meaning that this difference is very unlikely to occur by chance. So we can be very sure that essential genes are more highly expressed than non-essential genes.

  • Why might this be so?

Making a scatter plot

Let's look at something else to do with gene expression. Take a look at the names of columns in the gene data frame, using the head() command.

head(gene)

Now, we can plot two numeric parameters of genes against each other using the plot() command.

plot(gene$a,gene$b)

Do it yourself: By altering the commands we've shown you already, see if you can:

  • Make a subset of all the protein coding genes

  • With this subset, plot the mRNA copies per cell against the protein copies per cell

  • Plotting log10(mRNA) vs log10(protein)

If you get stuck, the solution is in the next workshop, here.


Back up your work

  • Save your script somewhere where you can get to it later.
  • Save an image of all your data, like so:
save.image("kittens.Rda")

Perhaps replace 'kittens' with something sensible


Before the next workshop

  1. Practice what you have learned here, including these commands:
  • scatter plots, plot()

  • box plots, boxplot()

  • subset, subset()

  • Wilcoxon test wilcox.test()

  1. Read about how gene expression was measured in copies per cell here

  2. See what other features of genes correlate with mRNA copies per cell. For example, these ones should really be correlated:

  • protein_copies_per_cell
  • gene.expression.RPKM
  • mRNA.stabilities

You can see a full description of the yeast and the Brassica data here.

Then make a scatter plot of your favourite one. Try plotting one or both using a log scale. Name the like we showed you earlier, and make a pdf, like so:

pdf("puppies.pdf")
#x axis on  log scale
plot(
    log10(gene$something),
     gene$something_else)
)
#two axis on  log scale
plot(
    log10(gene$something),
    log10(gene$something_else)
)

Made a pretty plot that shows something interesting? Post it on the google hangout, include your code and explain what it means to win that Yorkie Bar. One per Yorkie Bar per workshop awarded for the best plot. If you don't upload plots, the lecturers will eat the chocolate themselves.