Documenting your work is important.
Keep recording your commands with a plain text editor, and keep backing this up.

Introduction

In the previous workshop we:

Mark PCR duplicates from bam files using Picard
Used FreeBayes to identify SNPs and indels (output was VCF format)
Looked a little at the VCF format output.
Visually inpected read mapping and variant calls using IGV

All of these steps used the bam files (binary version of a sam file), which read alignment files. You can read about the sam format here. The SNP and indel calling produced variant call format (VCF) files.

This workshop

In this workshop we will use the VCF files to examine the genomic diversity of the population. Rember that each row of a VCF file describes one genetic variant in the population, and the columns describe the allele calls in each individual.

See the Wikipedia VCF page here, if you need to remind yourself about VCF files.

What is genetic diversity?

Analysis of genetic diversity allows to describe populations, selection and some aspects of epidemiology. We often use the number of SNPs and INDELs as a way of determining how diverse individuals are from each other.

However there are more robust measures of genetic diversity that allows us to compare how varied individuals are, and to compare these values to other samples directly, and takes into account differences in the number of samples, or genome size.

Nucleotide diversity

Nucleotide diversity (often referred to using the symbol π) is the average pairwise difference between all possible pairs of individuals in your sample. It is a very intuitive and simple measure of genetic diversity, and is accurately estimated even with very few samples. A formal definition is here.

We can obtain the nucleotide diversity (π) from our VCF file using vcftools software. In our case we will collect the π value from each 10 kb (10,000 bp) window of the genome.

NB: vcftools is a very flexible tool for analying, manipulating VCF files. It can do many other wonderful things. The vcftools manual is on github here.

To calculate π over 10kb windows of the genome, use this command:

vcftools --vcf all_samples.vcf --window-pi  10000 --out all_samples

This will generate a tab separated file called all_sample.pi, with a π value for every 10,000 bases across the genome.

Later, we will load this output file into R, and take a look at the π along a chromosome.

Discuss with your group:

* If you saw one region with a dramatic drop in π, what could this be?

* What other summary statistic might you look at next?

Tajima’s D

We can also perform statistics which allow us to see whether a population is neutrally evolving or whether it is under selective pressures. One of the measures we can use to test for this is Tajima’s D.

Remember that Tajima’s D will change depending on population size changes, and selection.

We can use vcftools again to caluculate Tajima’s D for us, in windows accross the genome, like so:

$ vcftools --vcf all_samples.vcf --TajimaD 10000 --out all_samples

The output file will be called all_samples.tajd. Use ls to check that it is there.

Using a perl script, we have also used vcftools to calculate Tajima’s D over windows around each gene. This means we can find out the gene with the highest and lowest Tajima’s D. We will make this file availble for you to look at later in R.

Estimating population distance with FST

Another important method used in population genetics is FST. The fixation index (Fst) is a measure of population separation. Fst ranges from 0 to 1. If Fst = 0, there is no genetic distance between populations. If Fst = 1, the two populations are very strongly separated.

Strictly speaking, the Fst between two genes is the probability of identity by descent. It can also be thought of as the amount of genetic variation that can be explained by population structure (if Fst is high, much of the negetic variation is causes by differences between populations). You cean read more about this on Wikipedia here.

We can use FST in order to identify the relationship between individuals and can be used to identify populations. We compare pairwise between a set of populations or between individuals in vcftools.

First we will need to define which strains are part of one sub-population, and which are part of another. We know this from PCA. We have made two files that contains the names of the strains in each population:

These are:

Population_1_ethiopia_names

Population_2_ethiopia_names

You can use these to calculate the FST between these populations, like so:

vcftools --vcf all_samples.vcf --weir-fst-pop Population_1_ethiopia_names --weir-fst-pop Population_2_ethiopia_names --out pop1_vs_2_FST

In the above command, –weir-fst is specified twice to give the lists of individuals from the two population file, that form each population. The population lists we had previously, have been used here to specify each population.

We will give you the output files that you can read into R to make some plots.

Tajima’s D statistics:

Ethiopian_all_samples_20kb_tajD

Ethiopian_population_one_20kb_tajD

Ethiopian_population_two_20kb_tajD

Note that we caluclated Tajima’s D for all samples, and then the two populations seprately.

π statistics:

Ethiopian_window_20kb_pi

FST files:

popwise_fst.weir.fst

Generating plots in R

Graphs (or plots) are excellent ways to understand large genome-scale data sets. We tend to use plots very often.

We will be using R and R studio (an R GUI) to make plots. You should have this installed already. If not ask for some assistance and we will help with this. The first thing we will need to do is load R/R studio and set the current directory to the folder which you have all tehe data files in (Downloads in our case).

First start R studio from the Windows Start menu.

If you need some help getting to know R, ask a tutor. We are happy to help.

We can check what the current working directory is by typing the following command into the console.

$ getwd()

This will give us the path that R will currently be loading its data from. If this is the folder where the files we want to use are in, change nothing. If it is not we can correct this by setting the directory to the folder it is in using the following command:

$ setwd("/Users/Sarah/Downloads")

You should set Users/Sarah/Downloads to where your data is.

Plotting nucleotide diversity.

Now you are ready to load some data into R, and make a plot.

To read the data in the file into an object called pi.all, type this command in the console:

pi.all <- read.table("Ethiopian_window_20kb_pi",header=T)

We can then make a histogram of π, like so:

hist(pi.all$tajima,br=20)

This isn’t very informative, because most sites have relatively low diversity. So lets make a box and whiskers plot, like so:

boxplot(pi.all$tajima,ylab="diversity")

Now let’s look at π along chromosome 12. To do this, lets first make a subset of the pi.all object:

pi.chr12 <- subset(taj.all, chr = "chr12")

This command means, take the subset of lines in the pi.all object, where the chr column is equal to chr12.

Try using using the commands below to see what you have done.

#find out how many rows your objects have
nrow(pi.chr12)
nrow(pi.all)

#see thr first and last lines
head(pi.chr12)
head(pi.all)

tail(pi.chr12)
tail(pi.all)

#see a summary of the data frame
summary(pi.chr12)
summary(pi.all)

Now, lets make the plot.

Let’s can plot the π value (on the y axis) against the position on the chromosome along the x axis. The position is in the column of the pi.chr12 data frame called mid. We plot like so:

plot(pi.chr12$mid,pi.chr12$pi,xlab="position",ylab="diversity")

Try it yourself: make a subset of another chromosome, and plot it.

Plotting Tajima’s D

To read the data in the file into an object called taj.all, type this command in the console:

taj.all <- read.table("Ethiopian_all_samples_20kb_tajD",header=T)

We can then make a histogram of Tajima’s D here.

hist(taj.all$tajima,br=20)

How does distribution compare to the expected distribution of Tajima’s D?

Now let’s plot Tajima’s D along a chromosome 36. To do this, lets first make a subset of the taj.all object like we did before for π:

Then we can plot the Tajima’s D value (on thr y axis) against the position on the chromosome along the x axis. The position is in the column of the taj.chr36 data frame called mid. We plot like so:

plot(taj.chr36$mid,taj.chr36$tajima,xlab="position",ylab="Tajima's D")

Try it yourself. Plot Tajima’s D for another chromosome.

Try it yourself. Make histograms for the other Tajima’s D files.

Plotting FST.

By re-using these commands you can plot the FST distrubution and then plot FST along a chromosome.

Feel free to ask us if you want some help.

Summary

From the few examples here, we hope you can see how vcftools and other programs can be very informative about the effects of demography and selection on genomic diversity.

Discuss with your group.

Why do we use summary statistics?
How separate are L. donovani populations in Ethiopia?

End

Link back to main course page

Workshop 5. Describing genomic diversity using summary statistics.