back to Big Data Biology main page
In this workshop, we'll explore the S. pombe yeast data some more. You will learn how to test whether two variables are correlated, and have a think about what this might mean if they are (or if they are not). Then you'll learn how to make a plot that shows this.
We'll use subsets of data to make box and whisker plots, then use a statistical test to find out if two sets of values are significantly different.
We'll learn about these principles of big data analysis:
If you haven't done so already download the The Seven Principles of Big Data Biology here. These principles sum up what we want you to learn.
At the end of this workshop you will be able to:
Make subsets of data frames (and understand why this is useful).
Make a box and whiskers plot, like this:
Test whether two parameters are correlated, using the cor.test() function.
Test whether one set of values (eg: gene expression) is greater than the other, using the wilcox.test() function.
You should have RStudio started already. Keep recording your commands in your own R script.
First load up the fission yeast data from here.
Set the working directory first to where your data file is (otherwise the computer will complain).
Then load it into R.
load("yeast_data.28-02-2020.Rda")
Or you can load the data directory from the web, like so:
load(url("http://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))
Need some help?
Now let's find out something about biology by playing about with the gene data frame.
To focus ourselves, let's choose a question to look into. In fission yeast (and many other model species), we know that some genes can be 'knocked out' (removed) without killing the cell, and some gene knockouts are lethal. Genes that are lethal when knocked out are called essential genes.
So: Are essential genes are more highly expressed?
Fortunately, some very determined people tried to knock out every single protein-coding gene in the fission yeast genome (read about it here). This data is in the gene table, in the essential column.
gene$essential
Let's make two subsets of the data that contain the essential protein-coding genes, and the non-essential protein-coding genes:
ess.prot <- subset(gene, protein_coding == 1 & essential==1)
non.ess.prot <- subset(gene, protein_coding == 1 & essential==0)
Do it yourself: Make these subsets, so you can use them later today.
Now let's make a box and whiskers plot that shows the gene expression of these two sets of genes:
boxplot(
ess.prot$gene.expression.RPKM,
non.ess.prot$gene.expression.RPKM,
outline=T
)
Not very informative, is it?
Let's try that again, on a log scale. Let's put a red dotted line at the essential gene median, using these commands:
ess.med <- median(ess.prot$gene.expression.RPKM,na.rm=T)
abline(h=log10(ess.med),col="red",lty=2)
Let's make a new plot:
boxplot(
log10(ess.prot$gene.expression.RPKM),
log10(non.ess.prot$gene.expression.RPKM),
ylab="RPKM (log10)",
xlab="gene type",
names=c("ESS","NON")
)
ess.med <- median(ess.prot$gene.expression.RPKM,na.rm=T)
abline(h=log10(ess.med),col="red",lty=2)
Looks like the essential genes are more highly expressed, doesn't it?
Do it yourself: See if you can alter this plot so it doesn't show all the outlying values, by adding this line to the boxplot code.
outline=F
Need some help with this? Post your question on the google hangout
It looks like the essential genes are more highly expressed, but could this be by chance? Let's use a Wilcoxon signed rank test, wilcox.test, to find out.
Note: A Wilcoxon signed rank test is like a T-test, but doesn't assume that data is normally distributed (i.e. it's non-parametric). Just to confuse us, sometimes these are called Mann-Whitney tests. To use a Mann-Whitney, to test whether the vectors a and b are sigificantly different, do this:
wilcox.test(a,b)
Now, let's look at our gene expression data:
wilcox.test(ess.prot$gene.expression.RPKM,non.ess.prot$gene.expression.RPKM)
The p-value is less than 2.2e-16 (2.2 x 10-16), meaning that this difference is very unlikely to occur by chance. So we can be very sure that essential genes are more highly expressed than non-essential genes.
Let's look at something else to do with gene expression. Take a look at the names of columns in the gene data frame, using the head() command.
head(gene)
Now, we can plot two numeric parameters of genes against each other using the plot() command.
plot(gene$a,gene$b)
Do it yourself: By altering the commands we've shown you already, see if you can:
Make a subset of all the protein coding genes
With this subset, plot the mRNA copies per cell against the protein copies per cell
Plotting log10(mRNA) vs log10(protein)
If you get stuck, the solution is in the next workshop, here.
save.image("kittens.Rda")
Perhaps replace 'kittens' with something sensible
scatter plots, plot()
box plots, boxplot()
subset, subset()
Wilcoxon test wilcox.test()
Read about how gene expression was measured in copies per cell here
See what other features of genes correlate with mRNA copies per cell. For example, these ones should really be correlated:
You can see a full description of the yeast and the Brassica data here.
Then make a scatter plot of your favourite one. Try plotting one or both using a log scale. Name the like we showed you earlier, and make a pdf, like so:
pdf("puppies.pdf")
#x axis on log scale
plot(
log10(gene$something),
gene$something_else)
)
#two axis on log scale
plot(
log10(gene$something),
log10(gene$something_else)
)
Made a pretty plot that shows something interesting? Post it on the google hangout, include your code and explain what it means to win that Yorkie Bar. One per Yorkie Bar per workshop awarded for the best plot. If you don't upload plots, the lecturers will eat the chocolate themselves.