back to Big Data Biology main page

Introduction

Aims

Previously, we explored a dataset about fungal diversity in the soil around hedge roots. We found out how many unique species there are in the dataset. We noticed that some environmental samples had very few reads that mapped to OTUs.

In this workshop, we will answer the following biological question: what is the distribution of fungi phyla in the soil near hedge roots?

Learning Outcomes

  • Understand that ‘text’ can be a kind of data
  • Understand that sometimes there is missing data in a data table and that it is important to investigate why this data is missing.
  • Learn a useful strategy for analysing a large data set: group the rows by a certain quality and analyse the properties of each group.

You will also have the opportunity to learn the following new programming skills if you choose to (i.e. optional):

  • Practice accessing data in “lists” (see optional data structure video)
  • Learn what a function is in R (see optional function video)
  • Splitting text into different components, using the ‘strsplit’ function
  • Learning the sapply function in R (see optional apply video)
  • Learning how to install and use R packages for saving fasta files

Load the data from last time

load("Workshop1_ITS.Rda")

Text is data too: interpretting taxonomic information

Splitting text

Last week, we saw that there were a lot of OTUs! Before we fully characterise the diversity of the hedges at a species-level, it can be helpful to start off by grouping the data by a more broad taxonomic division, such as ‘phylum’, ‘class’, ‘order’ or ‘family’. In this example, we will group the OTUs by phylum, but you might want to repeat the analysis with a different grouping for your final project.

Let’s remind ourselves how the taxonomic data looks like:

# Print the first 7 elements of the "taxonomy" column of table ITS_counts
ITS_counts[1:7,"taxonomy"]
## [1] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa         
## [2] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Capnodiales_fam_Incertae_sedis;g__Vermiconia;s__Vermiconia_calcicola
## [3] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum         
## [4] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_halotolerans        
## [5] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__Mycosphaerella;s__Mycosphaerella_ulmi         
## [6] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__unidentified;s__unidentified                  
## [7] k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified                        
## 510 Levels: k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa ...

Let us make sure we understand the format of this textual data. The format is:

# k__[kingdom];p__[phylum];c__[class];o__[order];f__[family];g__[genus];s__[species]

For instance, for the first row, the kingdom is Fungi, the phylum is ‘Ascomycota’, and so on.

Since we want to group the OTUs by phylum, we want to extract the taxonomy only up to the level of phylum. We can do this by splitting each taxonomy by the pattern ";c__". The first half of the text will contain the information we want (the kingdom and phylum) and the second half of the text will contain the order, family, genus and species.

First, let us extract the text in the column ‘taxonomy’. (Note to students doing optional programming exercises: By default, R does not read in text as a ‘character’ data type, so we need to convert it using the as.character command):

#Get taxonomies
taxonomies=as.character(ITS_counts[,"taxonomy"])

#Question: What is the taxonomy of the first OTU? (i.e. print out the first element of the vector "taxonomies")
taxonomies[1]
## [1] "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa"

Next, let’s split the text by the pattern ";c__":

#split by the pattern ";c__"
split_by_phylum=strsplit(taxonomies, ";c__")
***Optional: Learn to program***

If you haven't already, consider watching the optional videos about data types and data structures: (links)

Previously, we've worked with matrices and data frames (types of data tables) and vectors (a sequence of values).  

The function 'strsplit' produces a new data structure for us: a *list*.  A list is like a vector, in that it also contains a sequence of elements.  However, vectors can only contain (i) sequences of numbers (data type: numeric), c(1,2,3) (ii) sequences of text (data type: character), c("hello", "world") (iii) sequences of true/falses (data type: booleans), c(TRUE, TRUE, FALSE).
A list can contain any of these things, but it can *also* contain sequences of vectors or matrices.  For instance, let's say that I make two vectors: a=c(1,2,3) and b=c(2,3,4).  I can place both of these vectors into a list: mylist=list(a, b).  The variable mylist will now contain a sequence of two *vectors*.  

To reference an element in a list, you need to use a double bracket: mylist[[1]] will print out 1, 2 and 3, while mylist[[2]] will print out 2, 3, and 4.  
Some examples of R data structures

Some examples of R data structures

Let us take a peek at what strsplit produces. The result of strsplit for the first OTU is:

split_by_phylum[[1]]
## [1] "k__Fungi;p__Ascomycota"                                                                       
## [2] "Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa"

Let us print the kingdom and phylum information only (for the first OTU)

split_by_phylum[[1]][1]
## [1] "k__Fungi;p__Ascomycota"

Let us print the rest of the taxonomy, excluding the kingdom and phylum (for the first OTU)

split_by_phylum[[1]][2]
## [1] "Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa"

Let us print the kingdom and phylum information only, for the second OTU

split_by_phylum[[2]][1]
## [1] "k__Fungi;p__Ascomycota"

Print the kingdon and phylum information for the 100th OTU. Do you think you understand how to access data in a list now?

split_by_phylum[[100]][1]
## [1] "k__Fungi;p__Ascomycota"

Now we have almost accomplished our goal of identifying the phylum associated with each OTU, but the information is stored in a really complicated way. In the next section, we will simplify the way we store the information about the phylums.

The sapply function

Our next goal is to take the complicate list called split_by_phylum and turn it into a simpler format. Once we have accomplished this task, we will easily be able to identify which OTU corresponds with which phylum.

To do this, we want to extract the first part of the text from split_by_phylum (which should contain the information we care about: the kingdom and phylum).

***Optional: Learn to program***

Now, we will introduce another extremely important function in R: *sapply*.  This is a tricky one, but is extremely powerful.

First, you need to know what a function is.  You have already used lots of functions in R. These are just commands: they take in some data and produce some output.  For instance, rownames is a function that takes a data table (matrix or data frame) as an input and returns a vector containing all the rownames of that data table.  Indeed, almost every line of this workshop contains a function!

Introduction to functions (optional video): link

The sapply requires two inputs (i) a vector or list (ii) another function.  Then, sapply will produce the output of the other function applied to every element of the vector or list.  In other words, it first runs the function on the first element of the vector, then the second element of the vector, etc.  This is a little tricky, so I really recommend watching the optional video so you can some examples.

Introduction to apply functions (optional video): link

In our case, we want to (i) iterate over every element in the list split_by_phylum and (ii) extract the kingdom/phylum of every element of the list "split_by_phylum".  Therefore, the first parameter of the sapply function will be the list split_by_phylum.  The second parameter of sapply will be a function that prints the kingdom/phylum.   

So now, lets extract the phylums of every OTU and save it as a variable called phyla.

phyla=sapply(split_by_phylum, function(i){ i[1] })

#What does this accomplish?
phyla[1:5]
## [1] "k__Fungi;p__Ascomycota" "k__Fungi;p__Ascomycota" "k__Fungi;p__Ascomycota"
## [4] "k__Fungi;p__Ascomycota" "k__Fungi;p__Ascomycota"

Great! Now we know the phyla of every OTU! By the way, to get a really good mark on your report, you will need to provide evidence that you can use the tools that your learned to perform a novel analysis that isn’t directly covered in the workshop. One great and easy idea for extending the project in the final report is to modify this code so it extracts information from each OTU at a different taxonomic level (like class, order or family). For instance, if you want to analyse the ‘class’ instead of ‘phylum’, all you need to do is replace ’;c__’ with ’;o__’ in the code above.

Be careful of missing values

Next, we might ask: How many different phyla do we have? What are they? This will use two functions that you have seen in previous workshops: “unique” (which returns the unique values of a vector, getting rid of duplicates) and “length” (which returns the length of a vector).

length(unique(phyla))
## [1] 15
unique(phyla)
##  [1] "k__Fungi;p__Ascomycota"          "k__Fungi;p__Basidiomycota"      
##  [3] "k__Fungi;p__Chytridiomycota"     "k__Fungi;p__Entomophthoromycota"
##  [5] "k__Fungi;p__Glomeromycota"       "k__Fungi;p__GS19"               
##  [7] "k__Fungi;p__Kickxellomycota"     "k__Fungi;p__Mortierellomycota"  
##  [9] "k__Fungi;p__Mucoromycota"        "k__Fungi;p__Olpidiomycota"      
## [11] "k__Fungi;p__Rozellomycota"       "k__Fungi;p__unidentified"       
## [13] "k__Fungi;p__Zoopagomycota"       "k__Protista;p__Cercozoa"        
## [15] NA

This means we have 15 unique phyla in our hedges and we now know their names.

As you see, one of the ‘phyla’ is NA, which stands for ‘Not applicable’. These are missing values! What happened there? Let us print out the taxonomy for the corresponding rows. (For those of you doing the optional programming content, take a look: this is another application of ‘which’ that we learned about last week. The function “is.na” takes as input a vector (in this case phyla) and returns a new vector that has a value of TRUE if the vector contains a value of NA in that position, and a FALSE otherwise):

na_phyla=is.na(phyla)
na_indices=which(na_phyla)
na_indices
##  [1] 926 927 928 929 930 931 932 933 934 935 936 937 938 939
taxonomies[na_indices]
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA

We asked the person who built this data table to explain why there were missing values. As it turns out, these OTUs correspond to sequences where no significant matches were found to reference fungal ITS2 sequence database. This could be because of a number of reasons: (i) The sequences might still be fungal, but they may be very divergent from any reference fungal ITS2. (ii) The sequences might be of low quality– it could be a technical problem with the sequencing. (iii) These sequences might have a non-fungal origin– it could be contamination.

For now, let’s save the sequences that correspond to these NAs to a file, in a format (fasta) that can be used in a webtool called BLAST. BLAST will search these sequences against all the sequences that were ever submitted to GenBank, an international sequencing repository: (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome). We won’t be doing this analysis as part of the workshop, but this is another great way to include a novel analysis in your final report! This is an especially good option if you want to do a novel analysis, but want to avoid doing any additional coding.

You will need to remove the # character from the first line of code to install the necessary package (seqinr). By the way, if you have low quality internet access and can’t download the package for various Covid-19 related reasons, it is fine to skip the next two lines of code (i.e. the line beginning with ‘library’ and the line beginning with ‘write.fasta’): it won’t impact the rest of the project. I’m just giving you more tools so you have lots of options of what to do in your final project!

#install.packages("seqinr")
library(seqinr)
## Warning: package 'seqinr' was built under R version 3.6.2
write.fasta(as.list(ITS_counts[na_indices,"sequence"]), as.character(rownames(ITS_counts)[na_indices]), file.out="ITS_na.fasta")

For now, we will consider these ’NA’s to be a unique phylum. However, there are other alternative approaches that you could take to deal with the missing values, such as filtering out these OTUs or filling in the missing data using additional information from the BLAST program.

For now, let’s replace the missing values with “other”:

phyla[which(is.na(phyla))]="other"

It is always important to have a good justification for these kinds of decisions. We are not investigating this further (for the purposes of time), but this is a great project extension for the final report!

Grouping OTU counts by phylum

We have two related questions that we will focus on in this part of the analysis: (i) How many OTUs correspond to each phylum? (ii) How many times was each phylum observed in the hedge samples?

On first glance, you might think that these questions are the same, but they are actually very different. This is because some OTUs were observed very frequently and others were only observed a few times. In (i) we count the number of unique OTUs that correspond to each phylum, ignoring how frequently the OTUs are found. In (ii) we will look up how many times each OTU that matched a phylum was observed and calculate the total by taking the sum.

Q1: How many OTUs correspond to each phylum?

First, let us find out how many OTUs correspond to each phylum, using ‘table’– a function we learned about in the previous workshop. ‘Table’ counts how many times each unique value appears in a vector.

table(phyla)
## phyla
##          k__Fungi;p__Ascomycota       k__Fungi;p__Basidiomycota 
##                             618                             185 
##     k__Fungi;p__Chytridiomycota k__Fungi;p__Entomophthoromycota 
##                              12                               1 
##       k__Fungi;p__Glomeromycota                k__Fungi;p__GS19 
##                              39                               1 
##     k__Fungi;p__Kickxellomycota   k__Fungi;p__Mortierellomycota 
##                               4                              17 
##        k__Fungi;p__Mucoromycota       k__Fungi;p__Olpidiomycota 
##                               3                               3 
##       k__Fungi;p__Rozellomycota        k__Fungi;p__unidentified 
##                              20                              15 
##       k__Fungi;p__Zoopagomycota         k__Protista;p__Cercozoa 
##                               5                               2 
##                           other 
##                              14

As you can see, there are 618 unique OTUs that are from the phylum Ascomycota and 185 unique OTUs that are from the phylum Basidiomycota and so on.

Q2: How many times was each phylum observed in the hedge samples?

Now, let’s compare the total reads that belong to each phylum. Whenever you write code that involves multiple steps, it is useful to write down a plan as to what to do and to break the problem into parts.

For each unique phylum (let’s refer to each phylum by the variable name i), we:

  • Find the indices (positions) that correspond to that phylum: which(phyla==i)
  • Extract the ‘sum’ column from the table ITS_counts. This column tells us the number of times each OTU was seen in all the environmental samples: ITS_counts[which(phyla==i), “sum”]
  • Find the sum of all these values. sum(ITS_counts[which(phyla==i),“sum”])

Here is a cartoon that can help you work out why this accomplishes our goal:

cartoon example that illustrates how these steps help us accomplish our goal of counting the number of reads that map to each phylum.

cartoon example that illustrates how these steps help us accomplish our goal of counting the number of reads that map to each phylum.

Again, we use ‘sapply’ and ‘which’.

counts_by_phylum=sapply(unique(phyla), function(i){
    # step 1: get indices that correspond to each phylum:
    index_phylum=which(phyla==i)
    # step 2: extract the sum column
    sum_by_phylum=ITS_counts[index_phylum,'sum']
    # step 3: find the sum of all these values
    sum(sum_by_phylum)
})
counts_by_phylum #print output
##          k__Fungi;p__Ascomycota       k__Fungi;p__Basidiomycota 
##                          739935                          203335 
##     k__Fungi;p__Chytridiomycota k__Fungi;p__Entomophthoromycota 
##                            1644                              20 
##       k__Fungi;p__Glomeromycota                k__Fungi;p__GS19 
##                           61131                              28 
##     k__Fungi;p__Kickxellomycota   k__Fungi;p__Mortierellomycota 
##                             131                            5698 
##        k__Fungi;p__Mucoromycota       k__Fungi;p__Olpidiomycota 
##                              92                           13352 
##       k__Fungi;p__Rozellomycota        k__Fungi;p__unidentified 
##                            3264                            3800 
##       k__Fungi;p__Zoopagomycota         k__Protista;p__Cercozoa 
##                            1986                             276 
##                           other 
##                            1026

Now, we have a variable called counts_by_phylum that contains the number of times each phyla was observed. We seee that across all of the hedge samples, we observe fungi from the kingdom Ascomycota 739,935 times and so on.

Visualising results with bar plots

Let’s draw counts_by_phylum as a bar plot, so we can visualise the frequency of each phylum in our hedge samples.

barplot(counts_by_phylum, xlab="phylum", ylab="count")

As you see, the x-axis labels are unreadable. Let us parse the phyla into text that will look nicer on the bar plot. First, let’s get rid of the redundant ‘kingdom’ information. Then, we’ll extract the first 2 letters of each phylum.

First, we get the names of the kingdom/phyla:

phylum_names=names(counts_by_phylum) 

Next, we split the text by the pattern ";p__". The first part will contain the kingdom information and the second part will contain the phylum information:

phylum_split=strsplit(phylum_names, ";p__") 

Then, we can use the sapply function to extract the phylum information. We want to iterate over every element of the variable “phylum_split”, and extract the second part of the text:

phylum_labels_long=sapply(phylum_split, function(i){i[2]})

Now, let’s extract the first two letters of each phylum. This can be done using the “substring” function.

Imagine that all the letters in some text are numbered. For instance, in the text “hello world”, “h” is in position 1, “e” is in position 2 and “l” is found in positions 3, 4 and 10. Substring returns the text that is found between two positions. For instance, substring(“hello world”, 3, 8), will print the text between the 3rd letter (“l”) and the 8th letter (“o”), so it will print “llo wo”. The substring function can also take an entire vector of text as an input, and it will extract the text between two positions for every element in the vector. For instance, substring(c(“hello”, “world”), 2, 3) will produce “el” and “or”. Let us find the first two letters of each phylum:

phylum_labels=substring(phylum_labels_long, 1, 2)
phylum_labels
##  [1] "As" "Ba" "Ch" "En" "Gl" "GS" "Ki" "Mo" "Mu" "Ol" "Ro" "un" "Zo" "Ce" NA

Recall that one of the phylums is NA. Let us make sure that the phylum label for this phylum has the text “NA”.

na_label=is.na(phylum_labels) #find NA
ids_no_label=which(na_label) # find postion of NA
phylum_labels[ids_no_label]="NA" # set value to "NA"

Make a vector that has the same values as counts_by_phylum, but that has shorter phylum names:

counts_by_phylum_shortname=counts_by_phylum
names(counts_by_phylum_shortname)=phylum_labels

Draw a barplot:

barplot(counts_by_phylum_shortname)

# Set the x-axis label to be "phylum" and the y-axis label to be "count" 
barplot(counts_by_phylum_shortname, xlab="phylum", ylab="count")

Save the results for next time

Lets save the workspace for next time:

save.image("Workshop2_ITS.Rda")

Take home messages

Today we answered our first real biological question about the dataset: what is the distribution of phyla of fungi that live near hedge roots?

I’ve also provided you with two opportunities to extend the analysis in this workshop in your final report. You should have the tools to answer the questions: (i) How many different classes of Ascomycota are found in soil near hedge roots? (ii) What is the distribution of classes of fungi that live near hedge roots? (ii) Why might the taxonomic information be missing for some OTUs: is it contamination? Are the sequences just too divergent from reference ITS2 sequences?

You should think about: Why might these research questions be interesting to ecologists? Can you think of other biological questions that you think you could answer using what you’ve learned today in R?

In the process, we learned a few general techniques for data analysis:

  • We learned how to analyse data that is in a ‘text’ format. Specifically, we looked at taxonomic information, but similar techniques can apply to other kind of text. For instance, a DNA sequence or a protein sequence is a kind of text.
  • We also learned that it is important to look for ‘missing data’ and we learned some strategies for dealing with it: (i) investigating why the data is missing (ii) assigning missing data to its own group and (iii) throwing away missing data. The decision of which approach to take must be fully justified and explained in the methods of your paper.
  • We learned that it ‘grouping’ data is a useful data analysis strategy
  • We learned how to visualise the data in bar plots in a more visually impactful way, by making the labels easier to read.

If you have been looking at the optional programming videos, you also learned some new R skills. See if you can do the following excercises to see if you have an understanding of the material:

  • Can you explain the difference between a vector and a list? How do you access elements from each?
  • Using strsplit: In this workshop, we the taxonomic division up to the phylum level. Can you change the code to instead find the taxonomic division up to the class, order, or family level?
  • Using sapply: Can you translate the following for loops into sapply functions? (Note: some of these could be written even more concisely, but the point of this exercise is to see the connection between sapply and for-loops)

Exercise 1:

temp=c()

for(i in c(1:10)){
    temp[i]=i*i+7
}
temp
##  [1]   8  11  16  23  32  43  56  71  88 107

Exercise 2:

a=c()
b=c(7, 4, 2, 9)
for(i in c(1:length(b))){
    a[i]=max(b[i], 5)
}
a
## [1] 7 5 5 9
  • fasta files:
    • What are the qualities of a fasta file?
    • Copy-and-paste the contents of the fasta file into the link for BLAST provided. Try to understand what the output of BLAST means. Does this change your idea for how to handle the missing values?

There are also a number of other functions that you saw today that you’ve seen in previous workshops or that have such intuitive names that we didn’t bother to discuss them in detail. Can you read the R documentation for each of these and understand what they do?

  • length
  • unique
  • which
  • as.character
  • table
  • sum
  • c
  • substring
  • is.na
  • names
  • barplot