In the previous workshop we will:
In this workshop, you will continue with the next steps in the analysis of Illumina re-sequence data as it is used in population genomics. Note that there are many other uses of next generation sequencing data (NGS), and we can’t cover all of these here.
In this session you will:
Useful links for this workshop
VCF (variant call format)
https://samtools.github.io/hts-specs/VCFv4.1.pdf https://en.wikipedia.org/wiki/Variant_Call_Format
SAM (sequence alignment map) format
https://en.wikipedia.org/wiki/SAM_(file_format)
The data you will be analysing is from clinical isolates of Leishmania donovani from Ethiopia. To save time, we’ll show you how to process three samples. Later on, you will get access to data from 40 individuals that we have processed for you. This is a population genomics data set.
Log on to the server
$ ssh -Y username@biologin.york.ac.uk
Use the cd command to navigate to the analysis directory you made last time.
$ cd /scratch/username/something
It is important to remove/mark PCR duplicate sequences as they are non-independent. We’ll use the MarkDuplicates tool from Picard Tools, to mark PCR duplicates so that they are not used in the subsequent steps.
Picard can do a lot of other useful things too, see here.
Load the modules
$ source biolprograms
Make a symbolic link to the java executable picard.jar file
$ ln -sf /opt/yarcc/biology/picard/2.18.1/1/default/picard.jar .
Mark duplicates (and other steps) can take some time to run. So, we’ll just work with a subset of the data for now. We can use the samtools tool to extract only the read alignment records that have aligned to one region or chromosome
First index the bam from the last workshop
$ samtools index Sample1.q20.sort.bam
Subsample the bam to contain only chromosome 1 records
$ samtools view Sample1.q20.sort.bam chr1 -b > Sample1.q20.sort.chr1.bam &
Now we can run Picard Tools MarkDuplicates on a smaller dataset, like so
$ java -XX:ParallelGCThreads=2 -Xmx2g -jar picard.jar MarkDuplicates \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=SILENT \
INPUT= Sample1.q20.sort.chr1.bam \
OUTPUT= Sample1.q20.sort.chr1.rmdup.bam \
METRICS_FILE=Sample1.q20.sort.chr1.rmdup.bam.metrics \
>& Sample1.q20.sort.chr1.rmdup.bam.MarkDuplicates.log &
Use gedit/less/more to have a look at the metrics file produced in the log file. The 8th line of this file contains information about the percentage of duplication (second number from the end).
Discuss with your group: Would it be more important to remove PCR duplicates with high coverage, or low coverage?
What about if errors in reads were not randomly distributed?
To save time, I have repeated the quality filtering, mapping, and duplicate removal for two other L. donovani isolates (Sample2 and Sample3). These should be in the folder Course materials, which you copied yesterday:
In population genomics studies, the purpose of aligning reads to a reference genome is to be able to identify differences between individuals. Other studies (eg: RNAseq) map reads to identify coverage (how many reads map to each location, such as a gene or exon), to provide quantitative information. For chipseq or other biochemical studies the location that read map to in the genome is more important, see: https://en.wikipedia.org/wiki/ChIP-sequencing
The most abundant genetic differences (variants) are single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms (indels). We will use FreeBayes to identify these. The FreeBayes manual is here: https://github.com/ekg/FreeBayes/blob/master/README.md
NB: We’ll use FreeBayes because it is fairly fast (though it still takes some time). There are many other variant calling software. GATK (The Genome Analysis Toolkit) is a popular choice, and can to lots of other things too: https://software.broadinstitute.org/gatk/documentation/
As an exercise, we’ll call variants from the chromosome 1 bams. We will give you a variant file for all the others for your projects, if you need them
First ‘index’ the fasta format file of the genome with samtools
$ samtools faidx TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta
So we can tell FreeBayes what data we will be using to call variants from, make a file of file names (a file that contains the names of all your subsampled bams). Here you are piping the output of the ls (list) command to a file.
$ ls Sample*.q20.sort.chr1.rmdup.bam > bams.list
Now ‘call’ variants (identify potential variants) with FreeBayes
$ freebayes -f TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta --ploidy 2 --bam-list bams.list > three-samples.chr1.vcf &
NB: SNP calling takes about 10 minutes! Be patient.
You can tell if your SNP callig is still running using this command:
top -u username
replace username with your own username
Once SNP calling is completed, this will create a variant call format (VCF) file. This is a complicated file. VCF format is described in detail here: https://samtools.github.io/hts-specs/VCFv4.2.pdf And more simply here: https://en.wikipedia.org/wiki/Variant_Call_Format VCF files (or their binary form bcf files) are the substrate for many further population genetics analyses and genome-wide studies
Take a look at the first few lines with
$ less three-samples.chr1.vcf
You can scroll down with less using the space key, then up/down again using the arrow keys
To look at the header lines of the vcf file using grep, lik so:
$ grep CHROM three-samples.chr1.vcf
NB: grep is a very useful tool. To understand what its doing type man grep.
The first few columns of the vcf file describe the SNP or indel, as follows
CHROM = the chromosome where the genetic variant is found
POS = the position chromosome where the genetic variant is found
ID = the SNP id (blank for us, but SNPs in the human genome have ids).
REF = the reference allele
ALT = the reference allele(s)
QUAL = the SNP call quality (phred scaled)
FILTER = which filters apply to this SNP (if any). Will be PASS if the SNP passes all filters.
INFO = an information-rich column with the format KEY=value. The meaning of all the keys is described in the VCF header.
The last four columns are FORMAT, Sample1, Sample3, Sample2.
The Sample columns describe information about the three individuals (they can be names anything by the way, I merely chose to name them Sample1,2,3). The FORMAT column describes what is in the individual data. This can be different, depending on what the SNP caller outputs. In our case it is: GT:DP:AD:RO:QR:AO:QA:GL. The first of these are:
GT = genotype
DP = read depth
The remainder are described in the header of the VCF file. You can see these with:
$ grep "##FORMAT" three-samples.chr1.vcf
All this information in the VCF file is used to filtering SNPs.
Initial variant calling is generally very approximate, and will identify many sites as SNPs or indels that are merely errors. The INFO field of the vcf file contains lots of information about each site in the genome, and the reads aligned there, and the quality of the variant calls.
We will trial using two of these to filter the initial vcf file, using vcffilter https://github.com/vcflib/vcflib/blob/master/README.md#vcffilter
Load the vcfultils software
$ module load vcflib/1.0.0
Count the number of variants we have, using grep
$ grep -vc '#' three-samples.chr1.vcf
Filter by variant quality 20, and pipe directly to grep to count again vcffilter -f “QUAL > 20” three-samples.chr1.vcf | grep -vc ‘#’
$ vcffilter -f "QUAL > 20" three-samples.chr1.vcf | grep -vc '#'
Note: this quality is phred-scaled (like fastq base qualities).
Remind yourself what phred score 20 means (if you have forgotten), or read the wikipedia page on phred quality scores.
First try filtering variants with “QUAL > 10”“”, and see how many SNPs you get;
Then, filter by variant quality 20, SAF > 5 & SAR > 5 and count again.
$ vcffilter -f "QUAL > 20 & SAF > 5 & SAR > 5" three-samples.chr1.vcf | grep -vc '#'
Use grep “ID=SAF” three-samples.chr1.vcf to find out what SAF and SAR are
Are there any other info fields that might be good filters?
Use grep “INFO=” three-samples.chr1.vcf to look at other INFO fields
Automated processing of alignments, and variant calling will result in errors, if we are not careful with the data. With so many reads (all may have errors in base calling), and so many differences between genomes, it is impossible/very challenging to describe all the changes with one software. So, it pays to look at alignments. The Integrated Genomics Viewer (IGV) is a good tool for this. See: here.
In practise IGV will not work well with a slow network connection.
So: We will talk through some regions of the alignments and show you the IGV screen.
We loaded in the L. donovani genome into the IGV menu with the following commands: Genomes > Load genome from file > TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta
Then we loaded the bam(s) and VCF files into IGV to look at particular sites, using File > Load from file > [navigate to the directory with bams/VCFs in it] > select bam of VCF
NB: bams need to be indexed with samtools index some.bam to be loaded by IGV
Some interesting regions to look at are: A badly mapping region: chr1:257,429-268,931 A clear homozygous SNP chr1:160,181-165,226 A clear heterozygous SNP chr1:272,180-272,809 A gene that doesn’t seem to map well: chr1:241,228-246,978 Note the low mapping score of the reads, indicated by the colours of the read bars. Look at mapping quality by musing over some reads. Why might this be?
To navigate to these regions paste the chromosome:position text above into the location window (to the left of the GO button). If you mouse over the SNP record f4rom the VCF to will see SNP quality etc.
Here is what we learned, and some questions to ask yourselves (or ask the data).
How to use symbolic links to avoid copying large data files all over the place.
How to call and filter genetic variants from a bam (freebayes and vcffilter)
How to use IGV to inspect bam files.