back to Big Data Biology main page
Previously, we explored a dataset about fungal diversity in the soil around hedge roots. We found out how many unique species there are in the dataset. We noticed that some environmental samples had very few reads that mapped to OTUs.
In this workshop, we will answer the following biological question: what is the distribution of fungi phyla in the soil near hedge roots?
You will also have the opportunity to learn the following new programming skills if you choose to (i.e. optional):
load("Workshop1_ITS.Rda")
Last week, we saw that there were a lot of OTUs! Before we fully characterise the diversity of the hedges at a species-level, it can be helpful to start off by grouping the data by a more broad taxonomic division, such as ‘phylum’, ‘class’, ‘order’ or ‘family’. In this example, we will group the OTUs by phylum, but you might want to repeat the analysis with a different grouping for your final project.
Let’s remind ourselves how the taxonomic data looks like:
# Print the first 7 elements of the "taxonomy" column of table ITS_counts
ITS_counts[1:7,"taxonomy"]
## [1] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa
## [2] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Capnodiales_fam_Incertae_sedis;g__Vermiconia;s__Vermiconia_calcicola
## [3] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum
## [4] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_halotolerans
## [5] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__Mycosphaerella;s__Mycosphaerella_ulmi
## [6] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__unidentified;s__unidentified
## [7] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified
## 510 Levels: k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa ...
Let us make sure we understand the format of this textual data. The format is:
# k__[kingdom];p__[phylum];c__[class];o__[order];f__[family];g__[genus];s__[species]
For instance, for the first row, the kingdom is Fungi, the phylum is ‘Ascomycota’, and so on.
Since we want to group the OTUs by phylum, we want to extract the taxonomy only up to the level of phylum. We can do this by splitting each taxonomy by the pattern ";c__". The first half of the text will contain the information we want (the kingdom and phylum) and the second half of the text will contain the order, family, genus and species.
First, let us extract the text in the column ‘taxonomy’. (Note to students doing optional programming exercises: By default, R does not read in text as a ‘character’ data type, so we need to convert it using the as.character command):
#Get taxonomies
taxonomies=as.character(ITS_counts[,"taxonomy"])
#Question: What is the taxonomy of the first OTU? (i.e. print out the first element of the vector "taxonomies")
taxonomies[1]
## [1] "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa"
Next, let’s split the text by the pattern ";c__":
#split by the pattern ";c__"
split_by_phylum=strsplit(taxonomies, ";c__")
***Optional: Learn to program***
If you haven't already, consider watching the optional videos about data types and data structures: (links)
Previously, we've worked with matrices and data frames (types of data tables) and vectors (a sequence of values).
The function 'strsplit' produces a new data structure for us: a *list*. A list is like a vector, in that it also contains a sequence of elements. However, vectors can only contain (i) sequences of numbers (data type: numeric), c(1,2,3) (ii) sequences of text (data type: character), c("hello", "world") (iii) sequences of true/falses (data type: booleans), c(TRUE, TRUE, FALSE).
A list can contain any of these things, but it can *also* contain sequences of vectors or matrices. For instance, let's say that I make two vectors: a=c(1,2,3) and b=c(2,3,4). I can place both of these vectors into a list: mylist=list(a, b). The variable mylist will now contain a sequence of two *vectors*.
To reference an element in a list, you need to use a double bracket: mylist[[1]] will print out 1, 2 and 3, while mylist[[2]] will print out 2, 3, and 4.
Let us take a peek at what strsplit produces. The result of strsplit for the first OTU is:
split_by_phylum[[1]]
## [1] "k__Fungi;p__Ascomycota"
## [2] "Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa"
Let us print the kingdom and phylum information only (for the first OTU)
split_by_phylum[[1]][1]
## [1] "k__Fungi;p__Ascomycota"
Let us print the rest of the taxonomy, excluding the kingdom and phylum (for the first OTU)
split_by_phylum[[1]][2]
## [1] "Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa"
Let us print the kingdom and phylum information only, for the second OTU
split_by_phylum[[2]][1]
## [1] "k__Fungi;p__Ascomycota"
Print the kingdon and phylum information for the 100th OTU. Do you think you understand how to access data in a list now?
split_by_phylum[[100]][1]
## [1] "k__Fungi;p__Ascomycota"
Now we have almost accomplished our goal of identifying the phylum associated with each OTU, but the information is stored in a really complicated way. In the next section, we will simplify the way we store the information about the phylums.
Our next goal is to take the complicate list called split_by_phylum and turn it into a simpler format. Once we have accomplished this task, we will easily be able to identify which OTU corresponds with which phylum.
To do this, we want to extract the first part of the text from split_by_phylum (which should contain the information we care about: the kingdom and phylum).
***Optional: Learn to program***
Now, we will introduce another extremely important function in R: *sapply*. This is a tricky one, but is extremely powerful.
First, you need to know what a function is. You have already used lots of functions in R. These are just commands: they take in some data and produce some output. For instance, rownames is a function that takes a data table (matrix or data frame) as an input and returns a vector containing all the rownames of that data table. Indeed, almost every line of this workshop contains a function!
Introduction to functions (optional video): link
The sapply requires two inputs (i) a vector or list (ii) another function. Then, sapply will produce the output of the other function applied to every element of the vector or list. In other words, it first runs the function on the first element of the vector, then the second element of the vector, etc. This is a little tricky, so I really recommend watching the optional video so you can some examples.
Introduction to apply functions (optional video): link
In our case, we want to (i) iterate over every element in the list split_by_phylum and (ii) extract the kingdom/phylum of every element of the list "split_by_phylum". Therefore, the first parameter of the sapply function will be the list split_by_phylum. The second parameter of sapply will be a function that prints the kingdom/phylum.
So now, lets extract the phylums of every OTU and save it as a variable called phyla.
phyla=sapply(split_by_phylum, function(i){ i[1] })
#What does this accomplish?
phyla[1:5]
## [1] "k__Fungi;p__Ascomycota" "k__Fungi;p__Ascomycota" "k__Fungi;p__Ascomycota"
## [4] "k__Fungi;p__Ascomycota" "k__Fungi;p__Ascomycota"
Great! Now we know the phyla of every OTU! By the way, to get a really good mark on your report, you will need to provide evidence that you can use the tools that your learned to perform a novel analysis that isn’t directly covered in the workshop. One great and easy idea for extending the project in the final report is to modify this code so it extracts information from each OTU at a different taxonomic level (like class, order or family). For instance, if you want to analyse the ‘class’ instead of ‘phylum’, all you need to do is replace ’;c__’ with ’;o__’ in the code above.
Next, we might ask: How many different phyla do we have? What are they? This will use two functions that you have seen in previous workshops: “unique” (which returns the unique values of a vector, getting rid of duplicates) and “length” (which returns the length of a vector).
length(unique(phyla))
## [1] 15
unique(phyla)
## [1] "k__Fungi;p__Ascomycota" "k__Fungi;p__Basidiomycota"
## [3] "k__Fungi;p__Chytridiomycota" "k__Fungi;p__Entomophthoromycota"
## [5] "k__Fungi;p__Glomeromycota" "k__Fungi;p__GS19"
## [7] "k__Fungi;p__Kickxellomycota" "k__Fungi;p__Mortierellomycota"
## [9] "k__Fungi;p__Mucoromycota" "k__Fungi;p__Olpidiomycota"
## [11] "k__Fungi;p__Rozellomycota" "k__Fungi;p__unidentified"
## [13] "k__Fungi;p__Zoopagomycota" "k__Protista;p__Cercozoa"
## [15] NA
This means we have 15 unique phyla in our hedges and we now know their names.
As you see, one of the ‘phyla’ is NA, which stands for ‘Not applicable’. These are missing values! What happened there? Let us print out the taxonomy for the corresponding rows. (For those of you doing the optional programming content, take a look: this is another application of ‘which’ that we learned about last week. The function “is.na” takes as input a vector (in this case phyla) and returns a new vector that has a value of TRUE if the vector contains a value of NA in that position, and a FALSE otherwise):
na_phyla=is.na(phyla)
na_indices=which(na_phyla)
na_indices
## [1] 926 927 928 929 930 931 932 933 934 935 936 937 938 939
taxonomies[na_indices]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
We asked the person who built this data table to explain why there were missing values. As it turns out, these OTUs correspond to sequences where no significant matches were found to reference fungal ITS2 sequence database. This could be because of a number of reasons: (i) The sequences might still be fungal, but they may be very divergent from any reference fungal ITS2. (ii) The sequences might be of low quality– it could be a technical problem with the sequencing. (iii) These sequences might have a non-fungal origin– it could be contamination.
For now, let’s save the sequences that correspond to these NAs to a file, in a format (fasta) that can be used in a webtool called BLAST. BLAST will search these sequences against all the sequences that were ever submitted to GenBank, an international sequencing repository: (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome). We won’t be doing this analysis as part of the workshop, but this is another great way to include a novel analysis in your final report! This is an especially good option if you want to do a novel analysis, but want to avoid doing any additional coding.
You will need to remove the # character from the first line of code to install the necessary package (seqinr). By the way, if you have low quality internet access and can’t download the package for various Covid-19 related reasons, it is fine to skip the next two lines of code (i.e. the line beginning with ‘library’ and the line beginning with ‘write.fasta’): it won’t impact the rest of the project. I’m just giving you more tools so you have lots of options of what to do in your final project!
#install.packages("seqinr")
library(seqinr)
## Warning: package 'seqinr' was built under R version 3.6.2
write.fasta(as.list(ITS_counts[na_indices,"sequence"]), as.character(rownames(ITS_counts)[na_indices]), file.out="ITS_na.fasta")
For now, we will consider these ’NA’s to be a unique phylum. However, there are other alternative approaches that you could take to deal with the missing values, such as filtering out these OTUs or filling in the missing data using additional information from the BLAST program.
For now, let’s replace the missing values with “other”:
phyla[which(is.na(phyla))]="other"
It is always important to have a good justification for these kinds of decisions. We are not investigating this further (for the purposes of time), but this is a great project extension for the final report!
We have two related questions that we will focus on in this part of the analysis: (i) How many OTUs correspond to each phylum? (ii) How many times was each phylum observed in the hedge samples?
On first glance, you might think that these questions are the same, but they are actually very different. This is because some OTUs were observed very frequently and others were only observed a few times. In (i) we count the number of unique OTUs that correspond to each phylum, ignoring how frequently the OTUs are found. In (ii) we will look up how many times each OTU that matched a phylum was observed and calculate the total by taking the sum.
First, let us find out how many OTUs correspond to each phylum, using ‘table’– a function we learned about in the previous workshop. ‘Table’ counts how many times each unique value appears in a vector.
table(phyla)
## phyla
## k__Fungi;p__Ascomycota k__Fungi;p__Basidiomycota
## 618 185
## k__Fungi;p__Chytridiomycota k__Fungi;p__Entomophthoromycota
## 12 1
## k__Fungi;p__Glomeromycota k__Fungi;p__GS19
## 39 1
## k__Fungi;p__Kickxellomycota k__Fungi;p__Mortierellomycota
## 4 17
## k__Fungi;p__Mucoromycota k__Fungi;p__Olpidiomycota
## 3 3
## k__Fungi;p__Rozellomycota k__Fungi;p__unidentified
## 20 15
## k__Fungi;p__Zoopagomycota k__Protista;p__Cercozoa
## 5 2
## other
## 14
As you can see, there are 618 unique OTUs that are from the phylum Ascomycota and 185 unique OTUs that are from the phylum Basidiomycota and so on.
Now, let’s compare the total reads that belong to each phylum. Whenever you write code that involves multiple steps, it is useful to write down a plan as to what to do and to break the problem into parts.
For each unique phylum (let’s refer to each phylum by the variable name i), we:
Here is a cartoon that can help you work out why this accomplishes our goal:
Again, we use ‘sapply’ and ‘which’.
counts_by_phylum=sapply(unique(phyla), function(i){
# step 1: get indices that correspond to each phylum:
index_phylum=which(phyla==i)
# step 2: extract the sum column
sum_by_phylum=ITS_counts[index_phylum,'sum']
# step 3: find the sum of all these values
sum(sum_by_phylum)
})
counts_by_phylum #print output
## k__Fungi;p__Ascomycota k__Fungi;p__Basidiomycota
## 739935 203335
## k__Fungi;p__Chytridiomycota k__Fungi;p__Entomophthoromycota
## 1644 20
## k__Fungi;p__Glomeromycota k__Fungi;p__GS19
## 61131 28
## k__Fungi;p__Kickxellomycota k__Fungi;p__Mortierellomycota
## 131 5698
## k__Fungi;p__Mucoromycota k__Fungi;p__Olpidiomycota
## 92 13352
## k__Fungi;p__Rozellomycota k__Fungi;p__unidentified
## 3264 3800
## k__Fungi;p__Zoopagomycota k__Protista;p__Cercozoa
## 1986 276
## other
## 1026
Now, we have a variable called counts_by_phylum that contains the number of times each phyla was observed. We seee that across all of the hedge samples, we observe fungi from the kingdom Ascomycota 739,935 times and so on.
Let’s draw counts_by_phylum as a bar plot, so we can visualise the frequency of each phylum in our hedge samples.
barplot(counts_by_phylum, xlab="phylum", ylab="count")
As you see, the x-axis labels are unreadable. Let us parse the phyla into text that will look nicer on the bar plot. First, let’s get rid of the redundant ‘kingdom’ information. Then, we’ll extract the first 2 letters of each phylum.
First, we get the names of the kingdom/phyla:
phylum_names=names(counts_by_phylum)
Next, we split the text by the pattern ";p__". The first part will contain the kingdom information and the second part will contain the phylum information:
phylum_split=strsplit(phylum_names, ";p__")
Then, we can use the sapply function to extract the phylum information. We want to iterate over every element of the variable “phylum_split”, and extract the second part of the text:
phylum_labels_long=sapply(phylum_split, function(i){i[2]})
Now, let’s extract the first two letters of each phylum. This can be done using the “substring” function.
Imagine that all the letters in some text are numbered. For instance, in the text “hello world”, “h” is in position 1, “e” is in position 2 and “l” is found in positions 3, 4 and 10. Substring returns the text that is found between two positions. For instance, substring(“hello world”, 3, 8), will print the text between the 3rd letter (“l”) and the 8th letter (“o”), so it will print “llo wo”. The substring function can also take an entire vector of text as an input, and it will extract the text between two positions for every element in the vector. For instance, substring(c(“hello”, “world”), 2, 3) will produce “el” and “or”. Let us find the first two letters of each phylum:
phylum_labels=substring(phylum_labels_long, 1, 2)
phylum_labels
## [1] "As" "Ba" "Ch" "En" "Gl" "GS" "Ki" "Mo" "Mu" "Ol" "Ro" "un" "Zo" "Ce" NA
Recall that one of the phylums is NA. Let us make sure that the phylum label for this phylum has the text “NA”.
na_label=is.na(phylum_labels) #find NA
ids_no_label=which(na_label) # find postion of NA
phylum_labels[ids_no_label]="NA" # set value to "NA"
Make a vector that has the same values as counts_by_phylum, but that has shorter phylum names:
counts_by_phylum_shortname=counts_by_phylum
names(counts_by_phylum_shortname)=phylum_labels
Draw a barplot:
barplot(counts_by_phylum_shortname)
# Set the x-axis label to be "phylum" and the y-axis label to be "count"
barplot(counts_by_phylum_shortname, xlab="phylum", ylab="count")
Lets save the workspace for next time:
save.image("Workshop2_ITS.Rda")
Today we answered our first real biological question about the dataset: what is the distribution of phyla of fungi that live near hedge roots?
I’ve also provided you with two opportunities to extend the analysis in this workshop in your final report. You should have the tools to answer the questions: (i) How many different classes of Ascomycota are found in soil near hedge roots? (ii) What is the distribution of classes of fungi that live near hedge roots? (ii) Why might the taxonomic information be missing for some OTUs: is it contamination? Are the sequences just too divergent from reference ITS2 sequences?
You should think about: Why might these research questions be interesting to ecologists? Can you think of other biological questions that you think you could answer using what you’ve learned today in R?
In the process, we learned a few general techniques for data analysis:
If you have been looking at the optional programming videos, you also learned some new R skills. See if you can do the following excercises to see if you have an understanding of the material:
Exercise 1:
temp=c()
for(i in c(1:10)){
temp[i]=i*i+7
}
temp
## [1] 8 11 16 23 32 43 56 71 88 107
Exercise 2:
a=c()
b=c(7, 4, 2, 9)
for(i in c(1:length(b))){
a[i]=max(b[i], 5)
}
a
## [1] 7 5 5 9
There are also a number of other functions that you saw today that you’ve seen in previous workshops or that have such intuitive names that we didn’t bother to discuss them in detail. Can you read the R documentation for each of these and understand what they do?