back to Big Data Biology main page
In this workshop we'll explore some ways to use online tools to understand your data. We'll use a model organism data base PomBase to find information about a particular gene. Then we'll use it to understand something about a list of genes.
When you have completed and revised this workshop in your own time make sure that you:
load("my-saved-data.Rda")
Or load the original saved file.
load("yeast_data.28-02-2020.Rda")
Or you can load the data directory from the web, like so:
load(url("http://www-users.york.ac.uk/~dj757/BIO00047I/data/yeast_data.28-02-2020.Rda"))
Let's start with a question:
What kinds of genes in fission yeast have evolved most slowly over time?
We can easily find this out, because one of the columns in the gene data frame is "conservation.phyloP". This number indicates how conserved each gene is between the four Schizosaccharomyces yeast species. We obtained this value aligning all the genomes, and running the PhyloP program. You can read more about phyloP here.
phyloP scores are interpreted like this:
First, let's look at the most conserved gene. Extract this from the gene table, like so:
#get the top phyloP score
max.cons.level <- max(gene$conservation.phyloP,na.rm=T)
#make subset with the gene that has this score
most.conserved.gene.table <- subset(gene, conservation.phyloP == max.cons.level)
#show this genes name
most.conserved.gene.table$gene
The "na.rm=T" part means remove the scores that are "NA" (missing data).
The yeast and the Brassica data is described here.
To find out what is known about this gene, go to PomBase, and paste the gene name into the search box at top right. From the gene page that you are sent to, use the menu on the left to learn something about this gene.
That wasn't very informative was it?!
Perhaps we should look more genes. This is big data biology, so lets look at the top 100 most conserved genes, or so. First, lets make sure we know our data, by looking at the distribution of phyloP scores (limiting ourselves to the protein-coding genes), like so:
#make a subset of the gene table that contain only protein-coding genes
prot <- subset(gene, protein_coding ==1)
#make a histogram of phyloP scores from this subset
hist(prot$conservation.phyloP)
Now, make a subset of the genes that have fairly high scores. Choose what 'fairly high' means yourself, so you get about 100 genes.
#make a subset of the prot table that contain only genes with high conservation
conserved <- subset(prot, conservation.phyloP > some_number)
#count how many you get
nrow(conserved)
Now extract a gene list from this table and output this to a file, so we can use it in PomBase (give your file a sensible name!).
#make gene list
genelist <- conserved$gene
#output a table of data to a file called puppies.txt
write.table(genelist,file="puppies.txt",col.names = F,quote=F, row.names = F)
Now let's do a gene list search at PomBase, as indicated below (Leela cartoon added by me, not by PomBase). Paste your gene list into the gene list search window, and see what you find. A pattern should emerge now.
Then click on the resulting information.
Finally click on the visualise tool button at the top right of the screen. This will show a customisable graphical display of various aspects of this gene list.
When I looked at the visualisations, I noticed that conserved and non-conserved genes appeared to have diffent proportions of genes with transmembrane domains (TM's). So let's follow this up wih a plot and a test.
First lets make a matrix that contains the number of TM-containing proteins and non-TM proteins, like so:
mat<-matrix(c(1,96,95,226),nrow=2)
mat
## [,1] [,2]
## [1,] 1 95
## [2,] 96 226
Then, we'll do a chi-squared contingency table test
chisq.test(mat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mat
## X-squared = 32.758, df = 1, p-value = 1.044e-08
And make our bar plot, with a legend:
#the plot
barplot(mat,
ylab="number of genes",
col=c("red","black"),
names=c("highly conserved","not conserved"),
beside=T
)
#the legend
legend("topleft",
c("TM domain","no TM domain"),
fill=c("red","black"),
bty="n")
Congratulations, this is a biological result!
Do it yourself: Repeat this analysis using one of these categories of genes, that describe where the protein is located in the cell.
You can make a subset like so:
mysubset <- subset(Vacuole ==1)
Got a result and want some (edible) recognition? Download the gene list visualisation image, and post it on the google hangout. Include a sentence telling us what it shows.
golgi <- subset(gene, Golgi==1)
genelist2 <- golgi$gene
write.table(genelist2,col.names = F,quote=F, row.names = F,file="kittens.txt")
Make a visualisation, then see how things look if you deselect or sort the categories. Post it on the google hangout if you make something you are proud of.
#hint: in R != means is not equal to
not.golgi <- subset(gene, Golgi != 1)
Most informative plot wins some chocolate.
The only assessment for this course is a 1500 word report. This is due XXX. This report should describe some analysis of one of the data sets you have been working with (yeast or oilseed rape). There is more information about the format we want on the VLE.
Start planning your report. For example, consider:
The data sets are described here.
If you would like to add some new yeast data and analyse that, you can. We wrote a small R script that explains how to downoad and add data from the Angeli website.
You can get this script here.
If you want some help starting your report, or need some ideas: