back to Big Data Biology main page
In this part of the workshop we'll look at some data from the fission yeast Schizosaccharomyces pombe. This yeast isn't the yeast we use to make beer, wine or bread, but it has a long history of research. So there is a large amount of high-throughput data - and you'll be using some of it.
Need some help?
You can see all the others here, and download the details here
The Seven Principles of Big Data Biology are a summary of all the concepts we want you to learn.
* So print them out. * Carry them about in your pocket. * Read them on the bus.*
By completing this practical and carrying out your own independent study, you will be able to:
You should have RStudio started already.
You can continue to use your script from the first part, or start another.
To start a new script, go to the File menu, then choosing New File, then R Script. Now save the file using the Save icon.
Use this file to record all the commands that you use. So you can use them later, share them, and record what you have done. As you are working, copy your commands into this file.
This will be very helpful if you have problems, so you can show us what you have done.
Now let's look at the data.
This time we will load data from a specialised R Data File (or 'Rda' for short). These data files allow you to save all your R data, so when you start working again it's all there. We'll show you how to make one later.
First, let's clear R's memory to avoid any confusion. Remove any variable or R 'objects' from before, so you start with a clean sheet.
rm(list=ls())
Then download the Rda file called yeast_data.28-02-2020.Rda, from here. Or from the Data Sets tab in the VLE.
Then load it into R, like so:
load("yeast_data.28-02-2020.Rda")
Or you can load the data directory from the web, like so:
load(url("http://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))
Now take a look to see what 'objects' we have in our collection.
ls()
You should see one object, called 'gene'. To see what type of object this is:
class(gene)
This 'object' is a data frame, a very handy data format. It's like an excel spreadsheet (but more powerful). Let's start to explore our gene data frame, so we know what we have.
To see how many rows the data has, do this:
nrow(gene)
And how many columns
ncol(gene)
Let's look at the first few rows, like this
head(gene)
## gene NumberIntrons NumResidues protein_coding ncRNA Rel_telomere
## 1 SPAC1002.01 1 179 1 0 0.4768186
## 2 SPAC1002.02 0 229 1 0 0.4770079
## 3 SPAC1002.03c 0 923 1 0 0.4772343
## 4 SPAC1002.04c 0 199 1 0 0.4782177
## 5 SPAC1002.05c 0 715 1 0 0.4784627
## 6 SPAC1002.06c 2 118 1 0 0.4791844
## mRNA_copies_per_cell protein_copies_per_cell mRNA.stabilities
## 1 0.460 NA 38.30908
## 2 3.800 2564.51 32.34325
## 3 7.200 9719.87 47.75722
## 4 1.400 1600.25 26.73106
## 5 0.520 NA 20.38781
## 6 0.088 NA NA
## GeneticDiversity ProteinHalfLife Golgi Mitochondrion Nuclear_dots
## 1 0.0024331 NA 0 1 0
## 2 0.0025389 177.6 0 0 0
## 3 0.0041078 959.8 0 0 0
## 4 0.0043796 756.3 0 0 0
## 5 0.0043057 NA 0 0 0
## 6 0.0000000 NA 0 0 0
## Nuclear_envelope Nucleolus Nucleus Vacuole start end
## 1 0 0 0 0 1798347 1799015
## 2 0 0 0 0 1799061 1800053
## 3 0 0 0 0 1799915 1803141
## 4 0 0 0 0 1803624 1804491
## 5 0 0 1 0 1804548 1806797
## 6 0 0 0 0 1807270 1807781
## solid.media.KO.fitness gene.expression.RPKM conservation.phyloP chromosome
## 1 1.042607 6.94700 0.0000000 I
## 2 1.095876 88.64500 0.3780292 I
## 3 1.035408 103.73833 0.5133217 I
## 4 NA 33.71667 0.5185620 I
## 5 1.056457 8.13000 0.4469873 I
## 6 1.060421 1.09700 0.5372165 I
## essential
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
Each row in this data frame is a gene, and each column contains some information about this gene. Most of the data was downloaded from the Angeli website here. The researchers who made this website collected all sort of high throughput data for fission yeast.
We also added some of our own data. We'll explain what all the columns mean as we go along.
let's look at one column, protein_coding. This column tells us if a gene is a protein-coding gene or a non-coding RNA gene (ncRNAs). We can refer to this column of the data frame like so:
gene$protein_coding
Let's count the number of genes of each type. First, make a new data frame, that contains only the protein coding genes
prot <- subset(gene, protein_coding ==1)
Note that the assignment operator "<-" puts the result of the subset command into the new object called 'prot'. You can call R objects almost anything you like - as long as they don't start with a number, or have any spaces. For example we could call it kittens, but prot is simpler to type, and sounds more meaningful (at least to me).
Need some help? Go to the google hangout and ask us.
Also see that in R we use "==" to mean is equal to. A single "=" means the same as the assignment operator "<-", that we use to push some results into a new variable (as we did above). Bit confusing, but you'll get used to it.
Now let's make a new data frame, with just the ncRNAs genes,like so:
ncrna <- subset(gene, protein_coding ==0)
We can count how many rows there are in our new data frames like this. This will tell us how many genes there are, because each row contains the information for one gene.
nrow(prot)
nrow(ncrna)
This may look simple, but we have just uncovered some real biology. Now we know more about what the fission yeast genome 'outputs'. We know how many different proteins there are, and now many different ncRNAs there are. Later, we'll look at how transcription there is from each of these. We'll then have a genome-wide view of the genomic outputs of this species.
Now that we have made some new objects (prot and ncrna), let's make sure we don't lose our work! Save your R 'session' (all the objects we have). It's a good idea to back up your data regularly.
save.image("mydata.Rda")
PS: You don't have to call it mydata.Rda, you can call it what you like. When you start again, you can load up all this data, like so:
load("mydata.Rda")
Now let's look at the gene expression column in the gene data frame "gene.expression.RPKM". This is a measure of how gene expression during rapid growth, much like the oilseed rape data. This data is from this article, which described gene expression in S. pombe using RNA-Seq.
First, let's a get a summary of the data, so we know what we have.
summary(prot$gene.expression.RPKM)
Then make a histogram or gene expression levels for the protein-coding genes.
hist(prot$gene.expression.RPKM)
It looks much like the oilseed rape data, doesn't it? Many genes with very low expression, and only a few with high expression. This is a general principle of gene expression in many organisms.
By plotting the (base 10) of the gene expression value, we can see the range of values more clearly.
hist(log10(prot$gene.expression.RPKM))
Why plot on a log scale?
Biological data sets often contain many values, and few very large ones. These can be difficult to view on a plot, and a bit of a challenge to understand. Plotting data on a log scale, can often show the range of your data more clearly. Log base 10 transforms data like this: log10(1) = 0, log10(10)= 1, log10(100) = 2, log10(1000) = 3, and so on.
For example,
compare this plot, showing gene expression along chromosome I,
with this one, where the RPKM is on a log10 scale
See how the high and the low values are easier to spot in the second plot?
log.n <- log10(n)
Making your plot pretty: let's make this plot nicer, so it's easier for a reader to understand. We will add an axis labels, and a heading. In the next command:
main = "something" makes the heading
xlab = "something" makes the x-axis label (on the bottom)
I'm sure you can guess what ylab does.
hist(log10(prot$gene.expression.RPKM),
main = "Fission yeast gene expression (protein coding genes)",
xlab = "gene expression (RPKM, log10 scale)",
ylab = "number of genes"
)
Note that we plot our data honestly, by pointing out on the x axis label that the RPKM values are on a log10 scale.
Why not see if the gene expression measured in mRNA_copies_per_cell looks similar to the RPKM measure? Hint: replace gene.expression.RPKM with mRNA_copies_per_cell in the command above.
Made a plot you are proud of? Post it on the google hangout. Include your code. One Yorkie Bar (per workshop) will be awarded to the person who posts the best plot.
Save your working, with your own name
save.image("myname-workshop1.Rda")
Then put this file somewhere that you can get to later (Google drive, a USB stick, or email it to yourself).
You should now be able to:
This course requires that you practise skills and learn in your own time.
So, before the next workshop:
Read over what you have learned here
Make a histogram of the expression levels for ncRNAs using the ncrna data frame you made earlier. Compare it to the protein-coding genes - how does it differ? These commands will allow you to see two plots side by side, and make a pdf of your plot.
pdf("kittens.pdf")
par(mfrow=c(1,2))
#plot 1
#plot 2
dev.off()
To switch off the two plot option, use: par(mfrow=c(1,1))
hist(dataframe$column,breaks = 60)