

In the previous workshop you:

  • Checked sequence quality using FastQC
  • Cleaned the sequences using cutadapt
  • Aligned sequences to a reference genome using bwa
  • Sorted the BAM alignment files

In this workshop, you will continue with the next steps in the analysis of Illumina re-sequencing data as it is used in population genomics. Note that there are many other uses of next-generation sequencing (NGS) data, and we can’t cover all of these here.

In this session you will:

  • Learn about symbolic links and the Linux online manual
  • Mark PCR duplicates in BAM files using Picard
  • Use FreeBayes to identify SNPs and indels in these strains
  • Visually inspect read mapping and variant calls with the Integrative Genomics Viewer (IGV)

Useful links for this workshop

VCF (variant call format)

https://samtools.github.io/hts-specs/VCFv4.1.pdf
https://en.wikipedia.org/wiki/Variant_Call_Format

SAM (sequence alignment map) format

https://en.wikipedia.org/wiki/SAM_(file_format)


Background

The data you will be analysing is from clinical isolates of Leishmania donovani from Ethiopia. To save time, we’ll show you how to process three samples. Later on, you will get access to data from 40 individuals that we have processed for you. This is a population genomics data set.


Documenting your work is important.
Keep recording your commands with a plain text editor, and keep backing this up.

Logging into the server

Log on to the server

$ ssh -Y username@biologin.york.ac.uk

Use the cd command to navigate to the analysis directory you made last time.

$ cd /scratch/username/something

Initial data processing

Removing PCR duplicates

It is important to remove or mark PCR duplicate sequences, as they are not independent observations. We’ll use the MarkDuplicates tool from Picard Tools to mark PCR duplicates so that they are not used in the subsequent steps.

Picard can do a lot of other useful things too; see the Picard documentation: https://broadinstitute.github.io/picard/

Load the modules

$ source biolprograms

Make a symbolic link to the picard.jar Java archive, so you can refer to it without typing the full path

$ ln -sf /opt/yarcc/biology/picard/2.18.1/1/default/picard.jar .
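A symbolic link is just a pointer to the original file, so nothing is copied. You can check where the link points with ls -l, and read about ln in the Linux online manual:

$ ls -l picard.jar
$ man ln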

Marking duplicates (and other steps) can take some time to run, so we’ll just work with a subset of the data for now. We can use samtools to extract only the read alignment records that aligned to one region or chromosome.

First, index the BAM file from the last workshop

$ samtools index Sample1.q20.sort.bam

Subsample the bam to contain only chromosome 1 records

$ samtools view Sample1.q20.sort.bam chr1 -b > Sample1.q20.sort.chr1.bam &
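The trailing & runs the command in the background (type jobs to check whether it has finished). Once it is done, you can count how many alignment records ended up in the chromosome 1 subset, if you are curious:

$ samtools view -c Sample1.q20.sort.chr1.bam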

Now we can run Picard Tools MarkDuplicates on a smaller dataset, like so

$ java -XX:ParallelGCThreads=2 -Xmx2g -jar picard.jar MarkDuplicates \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=SILENT \
INPUT=Sample1.q20.sort.chr1.bam \
OUTPUT=Sample1.q20.sort.chr1.rmdup.bam \
METRICS_FILE=Sample1.q20.sort.chr1.rmdup.bam.metrics \
>& Sample1.q20.sort.chr1.rmdup.bam.MarkDuplicates.log &

Use gedit/less/more to have a look at the metrics file that was produced (Sample1.q20.sort.chr1.rmdup.bam.metrics). The 8th line of this file contains the percentage of duplication (second number from the end).
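If you would rather pull that number out on the command line, here is a minimal sketch, assuming (as above) that the values are on line 8 and the percentage of duplication is the second field from the end:

$ sed -n '8p' Sample1.q20.sort.chr1.rmdup.bam.metrics | awk '{print $(NF-1)}'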



Discuss with your group:

Would it be more important to remove PCR duplicates with high coverage, or low coverage?

What about if errors in reads were not randomly distributed?


Getting new bam files

To save time, I have repeated the quality filtering, mapping, and duplicate removal for two other L. donovani isolates (Sample2 and Sample3). These should be in the Course materials folder, which you copied yesterday.

Identifying genetic variations within the genomes

In population genomics studies, the purpose of aligning reads to a reference genome is to be able to identify differences between individuals. Other studies (e.g. RNA-seq) map reads to measure coverage (how many reads map to each location, such as a gene or exon), to provide quantitative information. For ChIP-seq and other biochemical studies, the locations that reads map to in the genome are more important; see: https://en.wikipedia.org/wiki/ChIP-sequencing

The most abundant genetic differences (variants) are single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms (indels). We will use FreeBayes to identify these. The FreeBayes manual is here: https://github.com/ekg/FreeBayes/blob/master/README.md

NB: We’ll use FreeBayes because it is fairly fast (though it still takes some time). There is plenty of other variant-calling software. GATK (The Genome Analysis Toolkit) is a popular choice, and can do lots of other things too: https://software.broadinstitute.org/gatk/documentation/



SNP and indel calling

As an exercise, we’ll call variants from the chromosome 1 bams. We will give you a variant file for all the others for your projects, if you need them.

First ‘index’ the fasta format file of the genome with samtools

$ samtools faidx TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta
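This creates a small plain-text index (TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta.fai) listing each sequence name and its length; you can have a quick look at it with:

$ head TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta.fai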

So that we can tell FreeBayes what data to call variants from, make a file of file names (a file that contains the names of all your subsampled bams). Here you are redirecting the output of the ls (list) command to a file.

$ ls Sample*.q20.sort.chr1.rmdup.bam > bams.list
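Check that the file lists the three bams you expect:

$ cat bams.list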

Now ‘call’ variants (identify potential variants) with FreeBayes

$ freebayes -f TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta --ploidy 2 --bam-list bams.list > three-samples.chr1.vcf &

NB: SNP calling takes about 10 minutes! Be patient.

You can tell if your SNP calling is still running using this command:

$ top -u username

replace username with your own username
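You can also check that FreeBayes is producing output by watching the size of the VCF file grow:

$ ls -lh three-samples.chr1.vcf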


Once SNP calling is complete, this will create a variant call format (VCF) file. This is a complicated file. VCF format is described in detail here: https://samtools.github.io/hts-specs/VCFv4.2.pdf and more simply here: https://en.wikipedia.org/wiki/Variant_Call_Format. VCF files (or their binary form, BCF files) are the substrate for many further population genetics analyses and genome-wide studies.



Take a look at the first few lines with

$ less three-samples.chr1.vcf

You can scroll down in less using the space bar, then move up/down using the arrow keys (press q to quit).


Look at the column header line of the VCF file using grep, like so:

$ grep CHROM three-samples.chr1.vcf

NB: grep is a very useful tool. To understand what it’s doing, type man grep.


The first few columns of the VCF file describe the SNP or indel, as follows:

CHROM = the chromosome where the genetic variant is found

POS = the position on the chromosome where the genetic variant is found

ID = the SNP id (blank for us, but SNPs in the human genome have ids).

REF = the reference allele

ALT = the alternative (non-reference) allele(s)

QUAL = the SNP call quality (phred scaled)

FILTER = which filters apply to this SNP (if any). Will be PASS if the SNP passes all filters.

INFO = an information-rich column with the format KEY=value. The meaning of all the keys is described in the VCF header.

The last four columns are FORMAT, Sample1, Sample3, Sample2.

The Sample columns contain information about the three individuals (they can be named anything, by the way; I merely chose to call them Sample1, 2 and 3). The FORMAT column describes what is in the per-sample data. This can differ, depending on what the SNP caller outputs. In our case it is: GT:DP:AD:RO:QR:AO:QA:GL. The first two of these are:

GT = genotype

DP = read depth

The remainder are described in the header of the VCF file. You can see these with:

$ grep "##FORMAT" three-samples.chr1.vcf

All this information in the VCF file is used for filtering SNPs.
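If you want to see the variant description and the per-sample columns side by side for a few records, you can cut out just those columns. This is a minimal sketch, assuming the standard nine fixed columns followed by our three sample columns (columns 10-12):

$ grep -v '^#' three-samples.chr1.vcf | cut -f 1,2,4,5,9-12 | head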


Filtering SNP and indel calls

Initial variant calling is generally very approximate, and will identify many sites as SNPs or indels that are merely errors. The INFO field of the vcf file contains lots of information about each site in the genome, and the reads aligned there, and the quality of the variant calls.

We will trial using two of these to filter the initial vcf file, using vcffilter https://github.com/vcflib/vcflib/blob/master/README.md#vcffilter

Load the vcflib software

$ module load vcflib/1.0.0

Count the number of variants we have, using grep

$ grep -vc '#' three-samples.chr1.vcf

Filter by variant quality 20, and pipe the output directly to grep to count again:

$ vcffilter -f  "QUAL > 20" three-samples.chr1.vcf  | grep -vc '#'

Note: this quality is phred-scaled (like fastq base qualities).

Remind yourself what phred score 20 means (if you have forgotten), or read the wikipedia page on phred quality scores.


Now try filtering variants with "QUAL > 10", and see how many SNPs you get.

Then filter by variant quality 20 together with SAF > 5 & SAR > 5, and count again.

$ vcffilter -f "QUAL > 20 & SAF > 5 & SAR > 5" three-samples.chr1.vcf  | grep -vc '#'

Use grep "ID=SAF" three-samples.chr1.vcf to find out what SAF and SAR are.

Are there any other info fields that might be good filters?

Use grep "INFO=" three-samples.chr1.vcf to look at the other INFO fields.
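Once you have settled on a set of filters, you can write the filtered variants to a new VCF file instead of just counting them (the output file name here is only a suggestion):

$ vcffilter -f "QUAL > 20 & SAF > 5 & SAR > 5" three-samples.chr1.vcf > three-samples.chr1.filt.vcf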


Scrutinising variant calls manually

Automated processing of alignments and variant calling will result in errors if we are not careful with the data. With so many reads (all of which may contain base-calling errors), and so many differences between genomes, it is very challenging to describe all the changes correctly with a single piece of software. So it pays to look at the alignments. The Integrative Genomics Viewer (IGV) is a good tool for this: https://software.broadinstitute.org/software/igv/


In practice, IGV will not work well over a slow network connection.

So: We will talk through some regions of the alignments and show you the IGV screen.


We loaded the L. donovani genome into IGV with the following menu commands: Genomes > Load genome from file > TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta

Then we loaded the bam(s) and VCF files into IGV to look at particular sites, using File > Load from file > [navigate to the directory with the bams/VCFs in it] > select the BAM or VCF

NB: bams need to be indexed with samtools index some.bam to be loaded by IGV
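If you need to index several bams in one go (the file name pattern below is an assumption; adjust it to match your own files), a simple shell loop will do it:

$ for b in Sample*.q20.sort.chr1.rmdup.bam; do samtools index $b; done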

Some interesting regions to look at are:

  • A badly mapping region: chr1:257,429-268,931
  • A clear homozygous SNP: chr1:160,181-165,226
  • A clear heterozygous SNP: chr1:272,180-272,809
  • A gene that doesn’t seem to map well: chr1:241,228-246,978

For the last region, note the low mapping score of the reads, indicated by the colours of the read bars. Look at the mapping quality by mousing over some reads. Why might this be?

To navigate to these regions, paste the chromosome:position text above into the location window (to the left of the Go button). If you mouse over a SNP record from the VCF track, you will see the SNP quality etc.


Summary

Here is what we learned, and some questions to ask yourselves (or ask the data).

  • How to use Picard to clean up a bam by removing duplicates.
    • What kind of duplicates are we removing?
    • How are they created?
  • How to use symbolic links to avoid copying large data files all over the place.

  • How to use samtools to extract parts of a bam
    • Note that samtools has many other useful functions.
  • How to call and filter genetic variants from a bam (freebayes and vcffilter)

  • The basic anatomy of a vcf file.
    • What types of genetic variants did we not look at?
  • How to use IGV to inspect bam files.