back to Big Data Biology main page


Video guide to this workshop


Introduction

In this part of the workshop we'll look at some data from the fission yeast Schizosaccharomyces pombe. This yeast isn't the yeast we use to make beer, wine or bread, but it has a long history of research. So there is a large amount of high-throughput data - and you'll be using some of it.


Need some help?

  • During the workshops, or for help with R commands go to the google hangout
  • For more complex questions, outside workshop times, post something on the discussion board on the VLE. We'll check this and respond once a week.

Aims

  1. Learn how to analyse large datasets to look for biological patterns.
  2. Understand one of The Seven Principles of Big Data Biology: Know your data.
  3. To learn some aspects of RNA-seq gene expression data.

You can see all the others here, and download the details here

The Seven Principles of Big Data Biology are a summary of all the concepts we want you to learn.
* So print them out. * Carry them about in your pocket. * Read them on the bus.*


Learning Outcomes

By completing this practical and carrying out your own independent study, you will be able to:

  • open and explore large gene datasets
  • make subsets of data frames, using the subset() function
  • transform data to a log scale, using the log10() function
  • make a histogram of some data, using the hist() function

Getting started

You should have RStudio started already.

You can continue to use your script from the first part, or start another.

To start a new script, go to the File menu, then choosing New File, then R Script. Now save the file using the Save icon.

Use this file to record all the commands that you use. So you can use them later, share them, and record what you have done. As you are working, copy your commands into this file.

This will be very helpful if you have problems, so you can show us what you have done.


Exploring the data

Now let's look at the data.

This time we will load data from a specialised R Data File (or 'Rda' for short). These data files allow you to save all your R data, so when you start working again it's all there. We'll show you how to make one later.

First, let's clear R's memory to avoid any confusion. Remove any variable or R 'objects' from before, so you start with a clean sheet.

rm(list=ls())

Then download the Rda file called yeast_data.28-02-2020.Rda, from here. Or from the Data Sets tab in the VLE.

Then load it into R, like so:

load("yeast_data.28-02-2020.Rda")

Or you can load the data directory from the web, like so:

load(url("http://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))

Exploring data

Now take a look to see what 'objects' we have in our collection.

ls()

You should see one object, called 'gene'. To see what type of object this is:

class(gene)

This 'object' is a data frame, a very handy data format. It's like an excel spreadsheet (but more powerful). Let's start to explore our gene data frame, so we know what we have.

To see how many rows the data has, do this:

nrow(gene)

And how many columns

ncol(gene)

Let's look at the first few rows, like this

head(gene)
##           gene NumberIntrons NumResidues protein_coding ncRNA Rel_telomere
## 1  SPAC1002.01             1         179              1     0    0.4768186
## 2  SPAC1002.02             0         229              1     0    0.4770079
## 3 SPAC1002.03c             0         923              1     0    0.4772343
## 4 SPAC1002.04c             0         199              1     0    0.4782177
## 5 SPAC1002.05c             0         715              1     0    0.4784627
## 6 SPAC1002.06c             2         118              1     0    0.4791844
##   mRNA_copies_per_cell protein_copies_per_cell mRNA.stabilities
## 1                0.460                      NA         38.30908
## 2                3.800                 2564.51         32.34325
## 3                7.200                 9719.87         47.75722
## 4                1.400                 1600.25         26.73106
## 5                0.520                      NA         20.38781
## 6                0.088                      NA               NA
##   GeneticDiversity ProteinHalfLife Golgi Mitochondrion Nuclear_dots
## 1        0.0024331              NA     0             1            0
## 2        0.0025389           177.6     0             0            0
## 3        0.0041078           959.8     0             0            0
## 4        0.0043796           756.3     0             0            0
## 5        0.0043057              NA     0             0            0
## 6        0.0000000              NA     0             0            0
##   Nuclear_envelope Nucleolus Nucleus Vacuole   start     end
## 1                0         0       0       0 1798347 1799015
## 2                0         0       0       0 1799061 1800053
## 3                0         0       0       0 1799915 1803141
## 4                0         0       0       0 1803624 1804491
## 5                0         0       1       0 1804548 1806797
## 6                0         0       0       0 1807270 1807781
##   solid.media.KO.fitness gene.expression.RPKM conservation.phyloP chromosome
## 1               1.042607              6.94700           0.0000000          I
## 2               1.095876             88.64500           0.3780292          I
## 3               1.035408            103.73833           0.5133217          I
## 4                     NA             33.71667           0.5185620          I
## 5               1.056457              8.13000           0.4469873          I
## 6               1.060421              1.09700           0.5372165          I
##   essential
## 1         0
## 2         0
## 3         0
## 4         1
## 5         0
## 6         0

Each row in this data frame is a gene, and each column contains some information about this gene. Most of the data was downloaded from the Angeli website here. The researchers who made this website collected all sort of high throughput data for fission yeast.

We also added some of our own data. We'll explain what all the columns mean as we go along.

let's look at one column, protein_coding. This column tells us if a gene is a protein-coding gene or a non-coding RNA gene (ncRNAs). We can refer to this column of the data frame like so:

gene$protein_coding

Let's count the number of genes of each type. First, make a new data frame, that contains only the protein coding genes

prot <- subset(gene, protein_coding ==1)

Note that the assignment operator "<-" puts the result of the subset command into the new object called 'prot'. You can call R objects almost anything you like - as long as they don't start with a number, or have any spaces. For example we could call it kittens, but prot is simpler to type, and sounds more meaningful (at least to me).


Need some help? Go to the google hangout and ask us.


Also see that in R we use "==" to mean is equal to. A single "=" means the same as the assignment operator "<-", that we use to push some results into a new variable (as we did above). Bit confusing, but you'll get used to it.

Now let's make a new data frame, with just the ncRNAs genes,like so:

ncrna <- subset(gene, protein_coding ==0)

We can count how many rows there are in our new data frames like this. This will tell us how many genes there are, because each row contains the information for one gene.

nrow(prot)
nrow(ncrna)

This may look simple, but we have just uncovered some real biology. Now we know more about what the fission yeast genome 'outputs'. We know how many different proteins there are, and now many different ncRNAs there are. Later, we'll look at how transcription there is from each of these. We'll then have a genome-wide view of the genomic outputs of this species.


Back up your work

Now that we have made some new objects (prot and ncrna), let's make sure we don't lose our work! Save your R 'session' (all the objects we have). It's a good idea to back up your data regularly.

save.image("mydata.Rda")

PS: You don't have to call it mydata.Rda, you can call it what you like. When you start again, you can load up all this data, like so:

load("mydata.Rda")


Plotting some data.

Now let's look at the gene expression column in the gene data frame "gene.expression.RPKM". This is a measure of how gene expression during rapid growth, much like the oilseed rape data. This data is from this article, which described gene expression in S. pombe using RNA-Seq.

First, let's a get a summary of the data, so we know what we have.

summary(prot$gene.expression.RPKM)

Then make a histogram or gene expression levels for the protein-coding genes.

hist(prot$gene.expression.RPKM)

It looks much like the oilseed rape data, doesn't it? Many genes with very low expression, and only a few with high expression. This is a general principle of gene expression in many organisms.

By plotting the (base 10) of the gene expression value, we can see the range of values more clearly.

hist(log10(prot$gene.expression.RPKM))


Why plot on a log scale?

Biological data sets often contain many values, and few very large ones. These can be difficult to view on a plot, and a bit of a challenge to understand. Plotting data on a log scale, can often show the range of your data more clearly. Log base 10 transforms data like this: log10(1) = 0, log10(10)= 1, log10(100) = 2, log10(1000) = 3, and so on.

For example,

compare this plot, showing gene expression along chromosome I,

with this one, where the RPKM is on a log10 scale

See how the high and the low values are easier to spot in the second plot?

  • To transform a vector of numbers to log base 10, use the log10() command, like so:
log.n <- log10(n)

Presenting data honestly and clearly

Making your plot pretty: let's make this plot nicer, so it's easier for a reader to understand. We will add an axis labels, and a heading. In the next command:

main = "something" makes the heading

xlab = "something" makes the x-axis label (on the bottom)

I'm sure you can guess what ylab does.

hist(log10(prot$gene.expression.RPKM),
     main = "Fission yeast gene expression (protein coding genes)",
     xlab = "gene expression (RPKM, log10 scale)",
     ylab = "number of genes"
)

Note that we plot our data honestly, by pointing out on the x axis label that the RPKM values are on a log10 scale.

Why not see if the gene expression measured in mRNA_copies_per_cell looks similar to the RPKM measure? Hint: replace gene.expression.RPKM with mRNA_copies_per_cell in the command above.


Made a plot you are proud of? Post it on the google hangout. Include your code. One Yorkie Bar (per workshop) will be awarded to the person who posts the best plot.


Finally

Save your working, with your own name

save.image("myname-workshop1.Rda")

Then put this file somewhere that you can get to later (Google drive, a USB stick, or email it to yourself).

You should now be able to:

  • open an R Data File
  • make subsets of data frames
  • make a histogram
  • add axis labels and a title to your histogram

Before the next workshop

This course requires that you practise skills and learn in your own time.

So, before the next workshop:

  • Read over what you have learned here

  • Make a histogram of the expression levels for ncRNAs using the ncrna data frame you made earlier. Compare it to the protein-coding genes - how does it differ? These commands will allow you to see two plots side by side, and make a pdf of your plot.

pdf("kittens.pdf")
par(mfrow=c(1,2))
#plot 1
#plot 2
dev.off()

To switch off the two plot option, use: par(mfrow=c(1,1))

  • Make a histogram of the ProteinHalfLife column. These values are in minutes. Add more breaks to your histogram and see if the plot looks better, like so:
hist(dataframe$column,breaks = 60)
  • On average, how many hour do proteins last?
  • Does this surprise you? (it surprised me)
  • Read about the source of the protein half life data, by looking at the article that produced the data here.