back to Big Data Biology main page


Video guide to this workshop


Introduction

In this workshop we'll explore some ways to use online tools to understand your data. We'll use a model organism data base PomBase to find information about a particular gene. Then we'll use it to understand something about a list of genes.


Aim

  1. To gain more familiarity with online databases and search tools for fission yeast.
  2. To use some of these yourself.

Learning Outcomes

When you have completed and revised this workshop in your own time make sure that you:

  • Know how to use Pombase to find out about genes and gene lists.

Getting started

  1. Open your R script.
  2. Load your own saved Rda file from last time.
load("my-saved-data.Rda")

Or load the original saved file.

load("yeast_data.28-02-2020.Rda")

Or you can load the data directory from the web, like so:

load(url("http://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))

Exploring the data

Let's start with a question:

What kinds of genes in fission yeast have evolved most slowly over time?

We can easily find this out, because one of the columns in the gene data frame is "conservation.phyloP". This number indicates how conserved each gene is between the four Schizosaccharomyces yeast species. We obtained this value aligning all the genomes, and running the PhyloP program. You can read more about phyloP here.

phyloP scores are interpreted like this:

  • Higher scores mean that the gene is more conserved (evolves slowly)
  • Lower scores mean that the gene is more less conserved (evolves more quickly)

First, let's look at the most conserved gene. Extract this from the gene table, like so:

#get the top phyloP score 
max.cons.level <- max(gene$conservation.phyloP,na.rm=T)
#make subset with the gene that has this score
most.conserved.gene.table <- subset(gene, conservation.phyloP == max.cons.level)
#show this genes name
most.conserved.gene.table$gene

The "na.rm=T" part means remove the scores that are "NA" (missing data).


The yeast and the Brassica data is described here.


Using PomBase: searching a gene list


That wasn't very informative was it?!


Perhaps we should look more genes. This is big data biology, so lets look at the top 100 most conserved genes, or so. First, lets make sure we know our data, by looking at the distribution of phyloP scores (limiting ourselves to the protein-coding genes), like so:


#make a subset of the gene table that contain only protein-coding genes
prot <- subset(gene, protein_coding ==1)

#make a histogram of phyloP scores from this subset
hist(prot$conservation.phyloP)

Now, make a subset of the genes that have fairly high scores. Choose what 'fairly high' means yourself, so you get about 100 genes.

#make a subset of the prot table that contain only genes with high conservation
conserved <- subset(prot, conservation.phyloP > some_number)

#count how many you get
nrow(conserved)

Now extract a gene list from this table and output this to a file, so we can use it in PomBase (give your file a sensible name!).

#make gene list
genelist <- conserved$gene

#output a table of data to a file called puppies.txt
write.table(genelist,file="puppies.txt",col.names = F,quote=F, row.names = F)



Now let's do a gene list search at PomBase, as indicated below (Leela cartoon added by me, not by PomBase). Paste your gene list into the gene list search window, and see what you find. A pattern should emerge now.



Then click on the resulting information.


Finally click on the visualise tool button at the top right of the screen. This will show a customisable graphical display of various aspects of this gene list.



When I looked at the visualisations, I noticed that conserved and non-conserved genes appeared to have diffent proportions of genes with transmembrane domains (TM's). So let's follow this up wih a plot and a test.

First lets make a matrix that contains the number of TM-containing proteins and non-TM proteins, like so:

mat<-matrix(c(1,96,95,226),nrow=2)
mat
##      [,1] [,2]
## [1,]    1   95
## [2,]   96  226

Then, we'll do a chi-squared contingency table test

chisq.test(mat)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mat
## X-squared = 32.758, df = 1, p-value = 1.044e-08

And make our bar plot, with a legend:

#the plot
barplot(mat,
        ylab="number of genes",
        col=c("red","black"),
        names=c("highly conserved","not conserved"),
        beside=T
)
#the legend
legend("topleft",
       c("TM domain","no TM domain"),
       fill=c("red","black"),
       bty="n")


Congratulations, this is a biological result!

  • What does this list, and the visualise tool tell you about highly conserved genes?
  • What processes inside yeast cells don't change rapidly with time?

Do it yourself: Repeat this analysis using one of these categories of genes, that describe where the protein is located in the cell.

  • Mitochondrion
  • Nuclear_dots
  • Nuclear_envelope
  • Nucleolus
  • Nucleus
  • Vacuole

You can make a subset like so:

mysubset <- subset(Vacuole ==1)
  • What do the results tell you about this cellular compartment?

Got a result and want some (edible) recognition? Download the gene list visualisation image, and post it on the google hangout. Include a sentence telling us what it shows.


Before the next workshop

  • Practise what you have learned here.
  • Find out if protein-coding genes are more or less conserved than ncRNA genes, by:
    1. Making a subset of the gene table that contains only non protein-coding genes (ncRNA genes).
    2. Comparing the phyloP score of protein-coding to non protein-coding genes with a box plot.
    3. Comparing the phyloP scores with a Mann-Whitney test.
  • Find out what kind of genes are in the golgi, by making a subset of the golgi-located genes, writing the gene names to a table, and pasting them into PomBase gene search.
golgi <- subset(gene, Golgi==1)
genelist2 <- golgi$gene
write.table(genelist2,col.names = F,quote=F, row.names = F,file="kittens.txt")

Make a visualisation, then see how things look if you deselect or sort the categories. Post it on the google hangout if you make something you are proud of.

  • Find out if one of these categories of gene is more conserved, or more highly expressed than genes that are not in this category: Nuclear_dots, Nuclear_envelope or the Nucleolus. Post your plot, and tell us what it could mean.
#hint: in R != means is not equal to
not.golgi <- subset(gene, Golgi != 1)

Most informative plot wins some chocolate.


Assessment for this course

The only assessment for this course is a 1500 word report. This is due XXX. This report should describe some analysis of one of the data sets you have been working with (yeast or oilseed rape). There is more information about the format we want on the VLE.

Start planning your report. For example, consider:

  • Which data set interests you?
  • What biology underlies this data? (what might be interesting to find out?)
  • Take notes about what plots and tests you think are useful.
  • Experiment with altering the plots we have shown you using different columns, or different subsets.

The data sets are described here.


Adding something special

If you would like to add some new yeast data and analyse that, you can. We wrote a small R script that explains how to downoad and add data from the Angeli website.

You can get this script here.


If you want some help starting your report, or need some ideas:

  • During a workshop: ask a demonstrator, or post a question on the google hangout.
  • At other times: Post your question on the Discussion Board on the VLE. We will try to answer within a week.