• Introduction
    • Aims
    • Learning Outcomes
    • Getting Started
    • What is the biological question?
    • Get to know your data:
    • Take a look at your data
    • Save the data
    • Optional programming topic: Verify that values in the table make sense
      • Step 1: Find the columns that refer to environmental samples, using the grep function
      • Step 2: Find if the “sum” column is correct, using the which function
    • Take home messages

back to Big Data Biology main page

Introduction

Large genetic datasets are not only relevant for molecular biology and genetics research: they are also important for ecological research. In this workshop, we will explore a genetic dataset that will help us uncover fungal species richness in the soil around hedges. This kind of metagenomics research is a very powerful way of understanding microbial ecology https://en.wikipedia.org/wiki/Metagenomics.

Most plants do not have the capacity to effectively harvest the abiotic nutrients in the soil without the help of fungal symbionts that grow amongst their roots. In return for providing abiotic nutrients, the plants supply the fungi with energy via sugars produced by photosynthesis. It is very hard to study these fungi in the lab, because many are obligate symbionts (i.e. they can only grow with in the presence of the plant). However, we can collect samples in the field and then use DNA sequencing to identify which fungi are associated with each plant. Instead of sequencing the whole genome of all the fungi in each sample, we sequence a single specific DNA region that can be used to barcode the species of fungus (in this case, Internal Transcribed Spacer 2 (ITS2)). Then, ITS2 sequences that are extremely similar to one another are grouped together: each group is referred to as an Operational Taxonomic Unit (OTU). Finally, we can count how often we find sequencing reads that map to each OTU in each environmental sample.

This data set includes samples from the soil near hedges in four different fields, and there are 3 sampling points per hedge. Each environmental sample will contain a mixture of many OTUs. This table lists the number of sequencing reads that map to each OTU from each environmental sample. These counts represent the abundance of the species present in the sample. It also contains additional columns that provide information about each OTU, such as its closest match to an ITS2 sequence of a reference species. The complete sequence of each OTU is also provided.

Aims

The purpose of analysing this data set is to characterise the diversity of fungi associated with the roots of hedges:

Workshop 1: we will aim to understand the data in the table, clean the table, and do some preliminary data visualisations.

Workshop 2: what is the population structure of fungi found in soil near hedge roots?

Workshop 3: what is the population structure of fungi in each environmental sample? How variable are the samples?

Workshop 4: what is the correlation between the spatial distribution of the samples and their population structures?

Learning Outcomes

This project has the same learning objectives as some of the other workshops:

  • Loading data tables in R
  • Printing properties of the data table, such as its dimensions and its row and column names
  • Accessing and printing subsets of the data table
  • Drawing histograms
  • Saving data tables in R

Optional topics (Don’t be intimidated if you don’t know what these mean yet):

  • Referencing elements in a vector
  • Searching for a pattern using the ‘grep’ command
  • Using logical operators
  • Finding which elements of a vector meet a specific condition, using the ‘which’ command
  • Counting how often an element occurs in a vector, using the ‘table’ command

Getting Started

First, we will need to load the data into R. The data is stored in ITS_table_cleaned.csv in the VLE. Copy and paste the file into your working directory and then load the file into R as you have done in previous workshops:

ITS_counts <- read.csv("ITS_table_cleaned.csv") 

What is the biological question?

How much fungal diversity is there among hedge rows? Do the fungi significantly differ between hedge rows?

Get to know your data:

It is important to always look at the data in the file and see if it has been loaded properly. For instance, you can learn how many rows and columns the data table has. You should have seen the following functions in the Plant Gene Expression Workshop 1: (i) class (ii) dim. Make sure you remember what they do.

#what is the class of "ITA_counts"?
class(ITS_counts)
#> [1] "data.frame"
#how many rows does this table have?  How many columns?
dim(ITS_counts)
#> [1] 939  53

Next, we will want to print the column names and row names and see if they are reasonable. Since there are a lot of rows, you may want to only print the first few rownames.

***Optional: Learn to program***
If you want to have a deeper understanding of programming, watch these now:
    
Data types:

Data structures:
    
How these videos relate to the next section?
You have already worked with 'data.frames' (tables that can contain mixtures of different kinds of data, like numbers and words) and 'matrices' (tables that contain data of the same type, like only having numbers).  R also allows you to store data in a 'vector', which is like a single column in a matrix.  Vectors contain sequences of data (of the same type).  

To access specific elements in a matrix, you need to first specify which rows you want to access, and then which columns you want to access.  The notation for this is: matrix[rows, columns]; for example, mat[1:5, 1:3] refers to the first 5 rows and the first 3 columns of a matrix called 'mat'.  

However, vectors only have 1 column of data, so they can be refered to with a single range of numbers.  For instance, vec[1:5] refers to the first 5 elements of the vector.  To make a new vector, you can use the 'c' command.  For instance vec=c(2,5,6) produces a new vector called 'vec' that contains the numbers 2, 5 and 6.  

The R functions 'rownames' and 'colnames' both return vectors.

So lets print the rownames and colnames:

rows=rownames(ITS_counts) #find all the rownames and save their values in a variable called "rows". The variable "rows" is a vector.
rows[1:5] #print the first five elements of "rows" 
#> [1] "1" "2" "3" "4" "5"

cols=colnames(ITS_counts)
cols
#>  [1] "OTU"       "W14"       "W17"       "W18"       "W19"       "W22"      
#>  [7] "W23"       "W24"       "W25"       "W26"       "W27"       "W28"      
#> [13] "W33"       "W34"       "W35"       "W36"       "W3"        "W41"      
#> [19] "W42"       "W43"       "W44"       "W50"       "W51"       "W52"      
#> [25] "W53"       "W58"       "W59"       "W60"       "W61"       "W65"      
#> [31] "W66"       "W69"       "W6"        "W70"       "W73"       "W74"      
#> [37] "W76"       "W77"       "W79"       "W81"       "W83"       "W85"      
#> [43] "W86"       "W87"       "W89"       "W8"        "W95"       "W9"       
#> [49] "sum"       "taxonomy"  "p.value"   "something" "sequence"
# Question: print out all the column names using the "colnames" command. One of the columns has a very bad name: what is it?

It appears as if there are no row names, only numbers! The column called OTU contains unique identifiers, so we can assign this column to be the row names. Once we do that, we will no longer need the OTU column, so we can delete it.

# Assign the column "OTU" as the rownames
ITS_counts[1:5,1] #print first 5 OTU names
#> [1] "OTU.W36.543937" "OTU.W25.265387" "OTU.W14.3337"   "OTU.W25.265241"
#> [5] "OTU.W23.192303"
rownames(ITS_counts)<-ITS_counts[,1] #assign new rownames
rownames(ITS_counts)[1:5] #Double check: what are the rownames now?
#> [1] "OTU.W36.543937" "OTU.W25.265387" "OTU.W14.3337"   "OTU.W25.265241"
#> [5] "OTU.W23.192303"

ITS_counts=ITS_counts[,-1] # Delete the OTU column
#Use 'dim' and 'colnames' functions to confirm that you did this correctly
dim(ITS_counts)
#> [1] 939  52
colnames(ITS_counts)
#>  [1] "W14"       "W17"       "W18"       "W19"       "W22"       "W23"      
#>  [7] "W24"       "W25"       "W26"       "W27"       "W28"       "W33"      
#> [13] "W34"       "W35"       "W36"       "W3"        "W41"       "W42"      
#> [19] "W43"       "W44"       "W50"       "W51"       "W52"       "W53"      
#> [25] "W58"       "W59"       "W60"       "W61"       "W65"       "W66"      
#> [31] "W69"       "W6"        "W70"       "W73"       "W74"       "W76"      
#> [37] "W77"       "W79"       "W81"       "W83"       "W85"       "W86"      
#> [43] "W87"       "W89"       "W8"        "W95"       "W9"        "sum"      
#> [49] "taxonomy"  "p.value"   "something" "sequence"

Now let us take a closer look at what is in this table. For instance, what do we see inside the columns whose names begin with ‘W’? Let’s take a peek at some of the columns that have names beginning with ‘W’.

sample_col_names=c("W17", "W61", "W87") #Make a new vector that contains the column names "W17", "W16" and "W87"
ITS_counts[1:10, sample_col_names] #print the first 10 rows of columns "W17", "W16" and "W87" of ITS_counts
#>                 W17 W61  W87
#> OTU.W36.543937    0   0    0
#> OTU.W25.265387    0   0    0
#> OTU.W14.3337     76 597 2810
#> OTU.W25.265241    0   0    0
#> OTU.W23.192303    0   0    0
#> OTU.W34.462356    0   0    0
#> OTU.W22.160264    0   0    0
#> OTU.W66.1110248   0   0    0
#> OTU.W50.759148    0   0   19
#> OTU.W18.93769     0   2    0

OTUs are operational taxonomic units– groupings of similar ITS2 sequences that were observed in the sequencing experiment. Each column that starts with a W represents a different environmental sample. For instance, we see that there were 597 reads sequenced that match the OTU labelled ‘OTU.W14.3337’ in sample ‘W61’.

Now let us look at some of the other columns for this OTU:

ITS_counts["OTU.W14.3337", c("sum", "taxonomy", "p.value", "sequence")]
#>                sum
#> OTU.W14.3337 10643
#>                                                                                                                             taxonomy
#> OTU.W14.3337 k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum
#>              p.value
#> OTU.W14.3337  5e-113
#>                                                                                                                                                                                                                                                                             sequence
#> OTU.W14.3337 AACGCACATTGCGCCCCCTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCCTCGCTTGGTATTGGGCAACGCGGTCCGCCGCGTGCCTCAAATCGACCGGCTGGGTCTTCTGTCCCCTAAGCGTTGTGGAAACTATTCGCTAAAGGGTGTTCGGGAGGCTACGCCGTAAAACAACCCCATTTCTAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAGGA

The ‘sum’ column should contain the sum of the counts of each OTU. The ‘taxonomy’ column contains the taxonomic information about the closest species match to the OTU. The ‘p-value’ describes the significance of this match. The ‘sequence’ column lists the complete sequence of the OTU. This information tells us what kind of species this OTU sequence represents (in the taxonomy field) and the actual DNA sequence, and some other information we’ll explain later.

Take a look at your data

We can begin to ask some simple questions about the data, such as what is the distribution of OTUs across all of the samples? A histogram can be used to show this.

sum_column=ITS_counts[,"sum"]  #get the column named "sum" from ITS_counts
# Draw a histogram of the "sum" column of ITS_counts (i.e. sum_column)
hist(sum_column)

As you can see, some OTUs were found over 50,000 times, but most were found less frequently. We might get a better view of the distribution after taking the log. Often, when you see a histogram with a ‘long-tail’ (i.e. most of the data grouped near zero, but a few samples with really high values), then taking the log can help you see the distribution more clearly.

hist(log(sum_column))

Another question we can ask is: how many OTUs map to the same species?

***Optional: Learn to program***
This uses a new command in R called 'table'.  The 'table' function finds the frequency of each element in a vector. 
freq_species=table(ITS_counts[,"taxonomy"])

# Print out the first five elements of freq_species
freq_species[1:5]
#> 
#>          k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa 
#>                                                                                                                                1 
#> k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Capnodiales_fam_Incertae_sedis;g__Vermiconia;s__Vermiconia_calcicola 
#>                                                                                                                                1 
#>          k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum 
#>                                                                                                                                1 
#>         k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_halotolerans 
#>                                                                                                                                1 
#>          k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__Mycosphaerella;s__Mycosphaerella_ulmi 
#>                                                                                                                                1
# Draw a histogram of freq_species
hist(freq_species)

Which species have a large number of OTUs associated with them?

***Optional: Learn to program***
Which statements: We can use a new function called 'which' to perform this operation.  You will learn more about the which command in the optional section at the end of this workshop. Revisit this line of code after you complete that section and see if you understand how it works in this context.

Note: You might be confused why we use 'names', instead of rownames/colnames in this line of code.  Rownames/colnames is used for matrices and data frames, but 'names' is used for vectors.  As you see, the output of the 'table' function is a *vector*. 
multi_otu_species=which(freq_species>5) #which species have more than 5 OTUs associated with them? 
names_species=names(freq_species) #get the names of the species
names_species[multi_otu_species] #get the species that have more than 5 OTUs associated with them
#>  [1] "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__unidentified;g__unidentified;s__unidentified"         
#>  [2] "k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Chaetothyriales;f__Herpotrichiellaceae;g__Exophiala;s__unidentified"   
#>  [3] "k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Chaetothyriales;f__Herpotrichiellaceae;g__unidentified;s__unidentified"
#>  [4] "k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Chaetothyriales;f__unidentified;g__unidentified;s__unidentified"       
#>  [5] "k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__unidentified;g__unidentified;s__unidentified"             
#>  [6] "k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Rhytismatales;f__Rhytismataceae;g__unidentified;s__unidentified"        
#>  [7] "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__unidentified;s__unidentified"           
#>  [8] "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__unidentified;g__unidentified;s__unidentified"          
#>  [9] "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__unidentified;f__unidentified;g__unidentified;s__unidentified"         
#> [10] "k__Fungi;p__Ascomycota;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified"            
#> [11] "k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__unidentified;g__unidentified;s__unidentified"         
#> [12] "k__Fungi;p__Glomeromycota;c__Glomeromycetes;o__Glomerales;f__Glomeraceae;g__unidentified;s__unidentified"          
#> [13] "k__Fungi;p__Rozellomycota;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified"         
#> [14] "k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified"

Another question we can ask is: how many fungi were sequenced in each sample? First, we need to find the indices (positions) of all the columns that begin with the letter W.

***Optional: Learn to program***
Grep: We can use a new function called "grep" to perform this operation.  You will learn more about grep in the optional section at the end of this workshop. Revisit this line of code after you complete that section and see if you understand how it works in this context.

colSum: You may have used "colSum" in Plant Gene Expression Workshop 1: it finds the sum of the designated columns. 

Sort: "Sort" is a function that sorts a vector, from lowest value to highest value. 
sampleColumns=grep("^W", colnames(ITS_counts))  #finds the column names that begin with the letter W

colSums_ITS=colSums(ITS_counts[,sampleColumns]) #finds the total number of counts per column

sort(colSums_ITS) #prints a sorted version of this output
#>   W41   W85   W65   W79   W59   W25   W77   W73   W53    W8   W89   W95   W17 
#>     2     2    19   237  1218  6444 10257 11592 11932 12951 14370 15734 15806 
#>   W86   W33   W26   W70   W61   W44   W74   W69   W43   W81   W19   W83   W76 
#> 17160 17250 17683 17971 18065 18558 18927 20390 20994 21377 21584 21723 22923 
#>   W27   W24   W42    W6   W18   W52   W22   W28   W23   W35   W58   W36   W34 
#> 22999 23488 24087 24371 24756 24803 25151 25771 26613 28248 28273 29147 30125 
#>    W3   W60    W9   W66   W14   W51   W87   W50 
#> 32423 32991 38771 40897 40925 45262 53482 57966

Let’s draw this as a sorted bar plot

barplot(sort(colSums(ITS_counts[,sampleColumns])), xlab="environmental samples (sorted)", ylab="number of OTUs identified")

Clearly, there were some issues with some of the environmental samples. For instance, W41 and W85 only have two OTUs! This is something that we will need to seriously consider when we do statistical analysis on these samples in the next few classes. Do we want to filter these columns out or does this provide important biological information?

Save the data

It is always important to save your work. Not only will this help us jump right into the data analysis when we return to this data set next time, but it also means that it will be easier for other researchers to re-analyse the data and make sure your analysis is reproducible.

write.table(ITS_counts, "ITS_counts.txt", quote=F, sep="\t")

Lets save the workspace for next time:

save.image("Workshop1_ITS.Rda")

Optional programming topic: Verify that values in the table make sense

Surprisingly, many published papers will contain messy datasets and sometimes the columns do not contain the information that is claimed in the paper. For this reason, it is always important to make sure that the values in the table seem reasonable.

The ‘sum’ column should represent the sum of the counts of this OTU in all the environmental samples. Let us make sure that this is true before doing any further analysis.

Step 1: Find the columns that refer to environmental samples, using the grep function

To do this, we need to start by finding all the columns that represent environmental values (i.e. the columns whose names begin with the letter ‘W’.) This involves a really useful function called ‘grep’.

grep: Grep takes as input a ‘pattern’ and a vector of text. It tells you the positions of the elements in the vector contain the pattern. For instance, if the first and third elements of the vector contain the pattern, then it will return the values 1 and 3.

For instance, grep(“a”, c(“a”, “aa”, “bc”, “bca”, “bac”, “A”, “AA”, “1235a”)), will produce a vector containing: 1,2,4,5,8, because the text at these positions contain the lowercase letter “a”. However, patterns can be more complicated. For instance, if you want to make sure that you only look for patterns in the beginning of the text, you can add the ‘^’ character in front of your pattern.

For instance, grep(“^a”, c(“a”, “aa”, “bc”, “bca”, “bac”, “A”, “AA”, “1235a”)), will produce a vector containing: 1,2. This is because these are the only times where “a” is at the start of the text. To find out about other more complex patterns you can use, please see ?regex.

To help you understand the documentation of R functions that take ‘text’ as inputs, let’s go through a little terminology. In R, text is stored as a data type called ‘character’. However, sometimes documentation will refer to textual data by the word ‘string’.

First, we need to find the indices (positions) of all the columns that begin with the letter W.

sampleColumns=grep("^W", colnames(ITS_counts))

#What does this produce?
sampleColumns
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Are these really the numbers of the columns that begin with the letter W?

all_cols=colnames(ITS_counts) #get all column names
all_cols[sampleColumns] #print column names at the indices indicated by sampleColumns variable
#>  [1] "W14" "W17" "W18" "W19" "W22" "W23" "W24" "W25" "W26" "W27" "W28" "W33"
#> [13] "W34" "W35" "W36" "W3"  "W41" "W42" "W43" "W44" "W50" "W51" "W52" "W53"
#> [25] "W58" "W59" "W60" "W61" "W65" "W66" "W69" "W6"  "W70" "W73" "W74" "W76"
#> [37] "W77" "W79" "W81" "W83" "W85" "W86" "W87" "W89" "W8"  "W95" "W9"

Step 2: Find if the “sum” column is correct, using the which function

Next, let’s see if the ‘sum’ column in the ITS_count table contains the correct results. First, let us check this for a single row (for OTU “OTU.W14.3337”). Find the sum of the columns of ‘ITS_counts’ that begin with ‘W’ for row “OTU.W14.3337”.

otu_w14_row=ITS_counts["OTU.W14.3337", sampleColumns] #access the row named "OTU.W14.3337" and the sample columns that begin with "W"
sum(otu_w14_row) #print the sum
#> [1] 10643
ITS_counts["OTU.W14.3337", "sum"] #print the element in ITS_counts that is in row "OTU.W14.3337" and column "sum".
#> [1] 10643

Finally, let’s check if the sums are correct for ALL rows in the table. This will involve the ‘which’ function in R, another really useful one!

which: This function takes as an input a vector of logical values (also called “booleans”). Logical values can be TRUE or FALSE. The ‘which’ function returns the indices at which the vector is TRUE. Here are a few examples. See if you can predict what the output will be in each case:

  • which(c(TRUE, FALSE, TRUE))
  • which(c(1, 8, 3, 5)>4)
  • which(c(1, 8, 3, 5)==c(1, 5, 3, 4))
  • which(c(1, 8, 3, 5)!=5)

Hints:

  • Remember that c(1, 8, 3, 5) produces a new vector that contains the numbers 1, 8, 3 and 5
  • Remember that which returns the indices (positions) of the vector that are true, not the vector values themselves.
  • Logical operators (>, <, >=, <=, ==, !=) compare the values on the lefthand and righthand side and return TRUE if the condition is met or FALSE otherwise. If two vectors of equal length are compared, then each element is compared separately: i.e. in the operation c(a,b,c)==c(d,e,f), a is compared to d, b is compared to e, and so on. If a vector is compared to a single number, then each element of the vector is compared to that number: i.e. in c(a,b,c)==d. a is compared to d, b is compared to d, etc.
  • The logical operator != means ‘not equal’
#Find the sum of all samples, for every row:
sums_ITS=rowSums(ITS_counts[,sampleColumns])

#are there any values that do not match the sums in the table?
length(which(sums_ITS!=ITS_counts[,"sum"]))
#> [1] 0

Take home messages

Today you explored the data set and learned:

  • The distribution of OTUs in hedge soil
  • Instances of different OTUs that are most similar to the same taxonomic grouping
  • Examples of suspicious samples

Optional advanced exercises: You also learned some new R skills. See you you can do the following excercises to see if you have an understanding of the material:

  • Can you explain the difference between a matrix, a data frame, and a vector? How do you access elements from each?
  • Using grep: In this workshop, we found the indices that corresponded to column names that began with the letter ‘W’. Can you find the indices that correspond to column names that contain the following patterns:
    • “ono”
    • “1”
    • “s”, but only if it appears in the beginning of the column name
  • Logical operations: what will the following logical operations produce?
    • c(4,5,6)>c(6,5,4)
    • c(9,6,8)!=6
    • c(5,7)==(c(4,6)+1)
  • Using which: Make a new vector a=c(1:6). Write code that will do the following:
    • print all values in this vector that are greater or equal to 3
    • print all values in this vector that are not equal to 5
  • Using table: Can you predict what the outcome will be from the following lines of code?
    • table(c(1,3,1,4,5))
    • table(c(“a”, “A”, “A”, “A”, “a”, “abc”))
    • table(c(“hello”, “world”, “123”, “world”))

There are also a number of other functions that you saw today. Can you read the R documentation for each of these and understand what they do?

  • dim
  • rownames/colnames
  • length
  • names
  • rowSums/colSums
  • hist
  • barplot
  • write.table
  • sort
  • save