back to Big Data Biology main page
Solution to Making a scatter plot from last time:
#make a subset
prot <- subset(gene, protein_coding ==1)
#simple plot without log
plot(
prot$mRNA_copies_per_cell,
prot$protein_copies_per_cell
)
#log10 plot
plot(
log10(prot$mRNA_copies_per_cell),
log10(prot$protein_copies_per_cell)
)
#add some labels
plot(
log10(prot$mRNA_copies_per_cell),
log10(prot$protein_copies_per_cell),
xlab="mRNA copies per cell (log10)",
ylab="protein copies per cell(log10)",
main = "this is the plots title"
)
In this workshop we'll add some more data to our data frame using merge(), and improve our plotting skills using the ggplot2 library.
When you have completed and revised this workshop in your own time you should be able to:
Don't forget to revise this workshop in your own time.
The data we use is described here.
load("my-saved-data.Rda")
You should have a data frame called gene. If not, download the Rda file yeast_data.28-02-2020.Rda, from here, then load it into R, like so:
load("yeast_data.28-02-2020.Rda")
Or you can load the data directory from the web, like so:
load(url("https://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))
Transposon mutagenesis data
Now we'll add another column of data to this data frame. This data will be transposon mutagenesis data. This data was derived from a Tn-seq experiment, where we made millions of transposon insertions into the S. pombe genome, and found where they inserted using Illumina DNA sequencing
You can read more about transposon mutagenesis here. You can read more detail about the fission yeast experiment here, and similar experiment with the budding yeast Saccharomyces cerevisiae here.
First, download the transposon data (yeast-transposon-data.txt), from here.
Then load it into a data frame called tra.
tra <- read.table("yeast-transposon-data.txt",h=T)
Then use the commands head(tra), nrow(tra), names(tra) and summary(tra) to take a quick look at your data.
You'll see that the data has genes in rows and various numeric values in columns.
This data is complex. So lets select just some of the columns, to keep it simple.
Do it yourself: See what columns the tra data frame has using,
names(tra)
or
summary(tra)
We'll select the gene, and uipkm columns, because we found these the most useful. Here uipkm means unique insertions per kb, per million insertions. It's very much like the gene expression measure RPKM. UIPKM is in the 13th column. To select this and the gene column:
tra2 <- tra[,c(1,13)]
Now merge this new data frame with the gene data frame (according to what is in the column "gene" of both datasets).
gene2 <- merge(gene,tra2,by="gene",all=T)
Next, check out that the merging went OK, using your favourite combination of head(), nrow(), names() and summary() on the gene, gene2, tra or tra2 data frames.
Why do this? Because you need to make sure that you data is put together correctly before you analyse it.
Now, let's check out the data, by plotting something simple, where we know what to expect. When a transposon is inserted into a gene, it frequently inactivates the gene. So insertions in essential genes should kill the cell. Here's how it works:
Confused?
So, we should have less insertions in essential genes.
Do it yourself: Examine this, by making a box plot that compares the transposon insertions in essential vs non-essential genes.
ess <- subset(gene2, essential ==1 & protein_coding ==1)
not.ess <- subset(gene2, essential ==0 & protein_coding ==1)
boxplot(
ess$uipkm,
not.ess$uipkm,
outline=F,
notch=T,
names=c("something","something else"),
ylab="name the y axis"
)
Do it yourself: Test whether essential vs non-essential genes have different insertion densities using a Wilcoxon Rank Sum test.
wilcox.test(ess$uipkm,not.ess$uipkm)
By now you should be good at making box and whisker plots with boxplot and scatter plots with plot.
Now let's make some plots with the wonderful ggplot2 package. We'll need to tidy up the data a little.
gene2$essential <- as.factor(gene2$essential)
gene3 <- subset(gene2, !is.na(essential))
library(ggplot2)
Need help? Ask a demonstrator, or post a question on the google hangout
ggpplot works by adding layers to plots. Each line of your code adds a new layer. You need to add a plus sign (+) at the end of each line apart from the last, like so:
ggplot(bla bla) +
geom_something(bla bla) +
something(bla bla bla)
Let's make a boxplot using ggplot2.
ggplot(data = gene3, aes(x = essential, y = uipkm)) +
geom_boxplot(notch=T)
ggplot(data = gene3, aes(x = essential, y = uipkm)) +
geom_boxplot(notch=T) +
theme(axis.title.x = element_text(face="bold", size=20), axis.text.x = element_text(size=16)) +
theme(axis.title.y = element_text(face="bold", size=20), axis.text.y = element_text(size=16))
Do it yourself: Try adding these extra lines to the boxplot commands. Remember to add the plus signs in the ends of all non-final lines.
coord_flip() +
theme_classic()
You can download a ggplot2 cheatsheet here. This explains everything you need to know, and plenty more.
Made a nice plot and want some recognition? Post it on the google hangout. Include your code. Tell us what is shows. Win the edible prize.
save.image("this_is_my_stuff.Rda")
1. Decide which data set you will use for your project. The assessement for this course is a 1500 word report. The report should be in a 'letter' style of scientific article, describing your research project. See more details on the VLE. You should choose either the fission yeast data or the Brassica napus data for your project.
2. Start to explore your chosen data set in your own time. Begin by thinking of a question, or data type that you want to explore. Then try out some of the commands we have shown you. Then alter them to perform slightly different analyses.