back to Big Data Biology main page
Previously, we found the distribution of phyla in hedge row roots. In this workshop, we would like to know (i) whether the environmental samples have different distributions of phyla and (ii) whether certain phyla of fungi tend to grow together.
You will also have the opportunity to learn the following new programming skills if you choose to (i.e. optional):
load("Workshop2_ITS.Rda")
In the previous session, we constructed the variable ‘counts_by_phylum’, which lists the number of times fungi from each phylum was observed across the ALL environmental samples. In this session, we want to compare the diversity of the environmental samples. In order to do this, we will first need to calculate the number of times fungi from each phylum were observed in each environmental sample separately.
Recall that we previously found which columns corresponded to environmental samples. We stored this in the variable called “sampleColumns” (see Workshop 1).
colnames(ITS_counts)[sampleColumns]
## [1] "W14" "W17" "W18" "W19" "W22" "W23" "W24" "W25" "W26" "W27" "W28" "W33"
## [13] "W34" "W35" "W36" "W3" "W41" "W42" "W43" "W44" "W50" "W51" "W52" "W53"
## [25] "W58" "W59" "W60" "W61" "W65" "W66" "W69" "W6" "W70" "W73" "W74" "W76"
## [37] "W77" "W79" "W81" "W83" "W85" "W86" "W87" "W89" "W8" "W95" "W9"
Previously, we counted how often each unique phylum was observed (the variable ‘counts_by_phylum’). To do this, we looped over every unique phylum. Now, we want to count how often each unique phylum was observed in each environmental sample. To do this, we need to iterate over every column in sampleColumn AND iterate over every unique phylum.
***Optional: Learn to program***
The apply function
I recommend watching the apply video: link
Another new function that we will introduce here is the 'apply' function. 'Apply' is almost the same as the 'sapply' function (see Workshop 2).
Q1: Do you remember what the sapply function does?
Q2: Can you predict what the following line of code will produce? (think about it before copying and pasting it!)
sapply(c(4,6,2,6), function(i){i*i})
If you can't answer these questions, I would recommend watching this video about apply functions:
link
The difference between apply and sapply takes in a vector or a list, while apply takes in a matrix or a data frame.
Q3: Do you remember what the difference is between a vector, list, matrix and data frame?
If you can't answer these questions, I would recommend watching this video about data structures:
link
In addition, the apply function takes in one additional parameter, which describes what variable is being iterated over:
* If this parameter has a value of 1, then the function will run on every row of the matrix.
* If this parameter has a value of 2, then the function will run on every column of the matrix
* If this parameter has a value of c(1,2), then the function will run on every element of the matrix.
It is easier to understand the apply function after looking at a few examples:
Q4: Run the following examples of the apply function in R and see what they produce. Can you understand why the function 'apply' produces the outputs it does in these three cases? Note that these examples are not related to fungal ecology.
mat=matrix(c(1,-4,2,2.5,-3,4), nrow=2)
mat
apply(mat, 1, function(i){max(i)})
apply(mat, 2, function(i){max(i)})
apply(mat, c(1,2), function(i){max(i)})
Q5: Now try to predict what the output will be in the following examples, BEFORE running the code in RStudio. Here we introduce one other feature of the apply/sapply functions. We can replace the 'i' with any variable name we want:
apply(mat, 1, function(j){max(j*j)})
apply(mat, 2, function(hello){min(hello*hello)})
apply(mat, c(1,2), function(i){i*i+2})
The algorithm for finding the diversity of each environmental sample is as follows:
Here is an illustration of this algorithm:
Here is the code for finding the counts per phylum per environmental sample, using the apply function and the sapply function:
ITS_counts_envSamples=ITS_counts[,sampleColumns] #extract the columns of ITS_counts that correspond to environmental samples
hedgerow_counts_by_phylum=apply(ITS_counts_envSamples, 2, function(col){ #for every column with an environmental sample
sapply(unique(phyla), function(i){ # and for every unique phylum
#Step 1:
index_phylum=which(phyla==i)
#Step 2:
sum_by_phylum=col[index_phylum]
#Step 3:
sum(sum_by_phylum)
})
})
dim(hedgerow_counts_by_phylum)
## [1] 15 47
hedgerow_counts_by_phylum[1:5, 1:5]
## W14 W17 W18 W19 W22
## k__Fungi;p__Ascomycota 25607 14359 21956 17524 20237
## k__Fungi;p__Basidiomycota 13518 1325 2440 2063 1145
## k__Fungi;p__Chytridiomycota 10 0 31 108 293
## k__Fungi;p__Entomophthoromycota 0 0 0 0 0
## k__Fungi;p__Glomeromycota 1468 120 0 86 31
The first question we might ask once we have this table is: are there different counts of each phylum in each environmental sample? For instance, are there different amounts of Ascomycota?
barplot(sort(hedgerow_counts_by_phylum["k__Fungi;p__Ascomycota",]), xlab="sample", ylab="Ascomycota count")
At first glance, it may appear that there are very different amounts of Ascomycota in each sample. However, remember that each sample had a very different total number of OTUs identified. As you might remember from Workshop 1, we previously counted the total number of OTUs that were found in each environmental sample, and we saved this in a variable called colSums_ITS. It could be that these differences in the frequency of Ascomycota between environmental samples is just a reflection of the different sequencing depths of the experiments.
#Let us remind ourselves what is in the variable "colSums_ITS":
sort(colSums_ITS)
## W41 W85 W65 W79 W59 W25 W77 W73 W53 W8 W89 W95 W17
## 2 2 19 237 1218 6444 10257 11592 11932 12951 14370 15734 15806
## W86 W33 W26 W70 W61 W44 W74 W69 W43 W81 W19 W83 W76
## 17160 17250 17683 17971 18065 18558 18927 20390 20994 21377 21584 21723 22923
## W27 W24 W42 W6 W18 W52 W22 W28 W23 W35 W58 W36 W34
## 22999 23488 24087 24371 24756 24803 25151 25771 26613 28248 28273 29147 30125
## W3 W60 W9 W66 W14 W51 W87 W50
## 32423 32991 38771 40897 40925 45262 53482 57966
ascomycota_counts=hedgerow_counts_by_phylum["k__Fungi;p__Ascomycota",]
plot(colSums_ITS, ascomycota_counts, xlab="number of fungi sequenced", ylab="number of Ascomycota sequenced")
As you can see, there is a strong positive correlation between the number of total counts and the counts of Ascomycota. It might be a good idea to instead look at the proportion of reads that come from each phylum. For each environmental sample, what proportion of the reads come from Ascomycota?
normalised_Ascomycota=ascomycota_counts/colSums_ITS
normalised_Ascomycota
## W14 W17 W18 W19 W22 W23 W24
## 0.62570556 0.90845249 0.88689611 0.81189770 0.80462009 0.79359711 0.91923535
## W25 W26 W27 W28 W33 W34 W35
## 0.32293606 0.52457162 0.25779382 0.57735439 0.96666667 0.95093776 0.87365477
## W36 W3 W41 W42 W43 W44 W50
## 0.89731362 0.24127934 0.00000000 0.69904928 0.72877965 0.63449725 0.82776110
## W51 W52 W53 W58 W59 W60 W61
## 0.95658610 0.37551909 0.24346296 0.78460015 0.80952381 0.89721439 0.81583172
## W65 W66 W69 W6 W70 W73 W74
## 0.89473684 0.94261193 0.88857283 0.34619014 0.98976128 0.73628364 0.78464627
## W76 W77 W79 W81 W83 W85 W86
## 0.88404659 0.65837964 0.81856540 0.61622304 0.47571698 0.50000000 0.77331002
## W87 W89 W8 W95 W9
## 0.48436857 0.60758525 0.01080998 0.54271005 0.84805654
plot(colSums_ITS, normalised_Ascomycota, xlab="number of fungi sequenced", ylab="proportion of Ascomycota")
We see that there no strong correlation between the total number of reads and the proportion of the reads that map to Ascomycota. However, we see that there is more variability when there are low read counts! This is something very important to keep in mind for the next workshop!
However, even among samples that have a lot of OTUs (over 30,000), we still see lots of variability in the proportion of Ascomycota– one sample has less than 30% Ascomycota, while another has over 90%.
Now, let’s apply this procedure to all the rows, so we can find the proportion of each phylum in each environmental sample:
normalised_ITS_counts_by_phylum=apply(hedgerow_counts_by_phylum, 1, function(i){
i/colSums_ITS
})
Let’s compare the distributions of phyla in each environmental sample. Here we use the function ‘t’ which transposes the matrix (i.e. the columns become rows and the rows become columns). What would this barplot look like if we did not transpose the matrix first?
barplot(t(normalised_ITS_counts_by_phylum))
This is really hard to read, so let’s add a colour key. How about we give the 5 most populous phyla ‘rainbow colours’, and make all the other phyla grey. First, we will sort the phyla by how common they are, using the ‘sort’ function that we used in Workshop 1. Do you remember what this function does? What do you think the output would be if we either leave out decreasing=TRUE or set decreasing=FALSE ?
phyla_by_pop=sort(counts_by_phylum, decreasing=TRUE)
phyla_by_pop
## k__Fungi;p__Ascomycota k__Fungi;p__Basidiomycota
## 739935 203335
## k__Fungi;p__Glomeromycota k__Fungi;p__Olpidiomycota
## 61131 13352
## k__Fungi;p__Mortierellomycota k__Fungi;p__unidentified
## 5698 3800
## k__Fungi;p__Rozellomycota k__Fungi;p__Zoopagomycota
## 3264 1986
## k__Fungi;p__Chytridiomycota other
## 1644 1026
## k__Protista;p__Cercozoa k__Fungi;p__Kickxellomycota
## 276 131
## k__Fungi;p__Mucoromycota k__Fungi;p__GS19
## 92 28
## k__Fungi;p__Entomophthoromycota
## 20
Next, we will find 5 rainbow colours, using the ‘rainbow’ function. The colours in R can be expressed as words like “red”, “blue”, and “green”, or they can be expressed as a ‘hexadecimal code’, an 8-digit code that contains the values 0 to 9 and/or the letters A to F. The rainbow function returns a vector of colours in a ‘hexadecimal code’ format.
rainbow_cols=rainbow(5)
rainbow_cols
## [1] "#FF0000FF" "#CCFF00FF" "#00FF66FF" "#0066FFFF" "#CC00FFFF"
We would like the remaining colours to be grey, so let’s make a vector of length length(phyla_by_pop)-5 of the colour “grey”. This uses the function ‘rep’, which makes a vector with the same value repeated a certain number of times.
grey_cols=rep("grey", length(phyla_by_pop)-5)
Now, let’s combine these two vectors into one vector:
col_labels=c(rainbow_cols, grey_cols)
Finally, let us set the names to be the phyla names.
names(col_labels)=names(phyla_by_pop)
col_labels
## k__Fungi;p__Ascomycota k__Fungi;p__Basidiomycota
## "#FF0000FF" "#CCFF00FF"
## k__Fungi;p__Glomeromycota k__Fungi;p__Olpidiomycota
## "#00FF66FF" "#0066FFFF"
## k__Fungi;p__Mortierellomycota k__Fungi;p__unidentified
## "#CC00FFFF" "grey"
## k__Fungi;p__Rozellomycota k__Fungi;p__Zoopagomycota
## "grey" "grey"
## k__Fungi;p__Chytridiomycota other
## "grey" "grey"
## k__Protista;p__Cercozoa k__Fungi;p__Kickxellomycota
## "grey" "grey"
## k__Fungi;p__Mucoromycota k__Fungi;p__GS19
## "grey" "grey"
## k__Fungi;p__Entomophthoromycota
## "grey"
Now let’s add this legend to the barplot:
barplot(t(as.matrix(normalised_ITS_counts_by_phylum)), xlab="environmental sample", ylab="proportion of each phylum", col=col_labels[colnames(normalised_ITS_counts_by_phylum)], xlim=c(0,80))
len_pretext=nchar("k__Fungi;p__")
shorterNames=substring(names(col_labels), len_pretext+1)
legend(60, 0.9, shorterNames, fill=col_labels, cex=0.6)
Let us order the bars by the relative amount of Ascomycota in the sample
order_plot=order(normalised_ITS_counts_by_phylum[,1])
order_plot
## [1] 17 45 16 24 10 8 32 23 40 43 41 9 46 11 44 39 1 20 37 18 19 34 42 25 35
## [26] 6 5 26 4 28 38 21 47 14 36 3 31 29 27 15 2 7 30 13 22 12 33
barplot(t(as.matrix(normalised_ITS_counts_by_phylum)[order_plot,]),
xlab="environmental sample", ylab="proportion of each phylum", col=col_labels[colnames(normalised_ITS_counts_by_phylum)])
How do you interpret this plot? For instance, what do you observe about the proportion of Glomeromycota and Olidiomycota in each sample, and how this relates to the proportion of Ascomycota in each sample? Can you think of ways to visualise this more clearly?
Our null hypothesis is that phyla diversity is independent of sampling location. “Phyla” and “environmental sample” are both categorical variables (i.e. they are not numbers– they can only take a certain number of discrete values). To compare the frequencies of phyla in different sampling locations, we can use the chi-squared test of independence. For more information about this test, see the example here: http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence
When can you use a Chi-squared test? (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900058/)
Given these criteria, why are we using “hedgerow_counts_by_phylum” below instead of “normalised_ITS_counts_by_phylum”?
#install.packages("MASS") #uncomment in your own code
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.2
chisq.test(hedgerow_counts_by_phylum)
## Warning in chisq.test(hedgerow_counts_by_phylum): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: hedgerow_counts_by_phylum
## X-squared = 804578, df = 644, p-value < 2.2e-16
You will see a warning: what does it mean? Which criteria is probably broken? (As a hint, see the code below)
over5=length(which(hedgerow_counts_by_phylum>5))
over5/length(hedgerow_counts_by_phylum)
## [1] 0.4184397
Here is a challenge question that will really test your knowledge of what we’ve done so far. This is yet another idea for how to include a ‘novel analysis’ in your final report.
Possible ‘novel analysis’ for your final report: Make a version of hedgerow_counts_by_phylum that has only 6 phyla categories: Ascomycota, Basidiomycota, Glomeromycota, Olpidiomycota, Moortierellomycota, and ‘Other Phyla’. Perform a Chi-squared test on this table.
Lets save the workspace for next time:
save.image("Workshop3_ITS.Rda")
Today we compared the diversity of the different environmental samples. In particular, we visualised the different frequencies of the phyla in each of the environmental samples and performed a Chi-squared test.
Here are some biological questions that you should consider:
In the process, we learned a few general techniques for data analysis:
For those doing the optional programming content: you also learned some new R skills. See you you can do the following excercises to see if you have an understanding of the material:
Exercise 1:
mat=matrix(c(1,6,7,6,4,5,-4,0,1,3,-2,1), nrow=3)
mat
## [,1] [,2] [,3] [,4]
## [1,] 1 6 -4 3
## [2,] 6 4 0 -2
## [3,] 7 5 1 1
temp=c()
for(i in c(1:dim(mat)[2])){
temp[i]=min(mat[,i])+7
}
temp
## [1] 8 11 3 5
Exercise 2:
a=c()
for(j in c(1:dim(mat)[1])){
a[j]=max(mat[j,])-4
}
a
## [1] 2 2 3
Exercise 3:
a=mat
for(hello in c(1:dim(mat)[1])){
for(world in c(1:dim(mat)[2])){
a[hello,world]=mat[hello,world]+5
}
}
a
## [,1] [,2] [,3] [,4]
## [1,] 6 11 1 8
## [2,] 11 9 5 3
## [3,] 12 10 6 6
#a=c()
#index=1
#for(j in c(1:5)){
# for(k in c(4:7)){
# a[index]=j*k
# index=index+1
# }
#}
#a
Exercise 2:
#a=c()
#index=1
#for(j in c(1:5)){
# for(k in c(4:7)){
# a[index]=j*k
# }
# index=index+1
#}
#a
Exercise 3:
#mat=matrix(c(1,6,7,6,4,5,-4,0,1,3,-2,1), nrow=3)
#sapply(c(1:5), function(hello){
# sapply(c(7, 3, 2), function(world){
# apply(mat, 1, function(k){
# hello+world-min(k)
# })
# })
#})
Do you remember what these functions do?