In this workshop, you will get your hands on some Illumina sequence data and begin to analyse it. As in the first workshop, we are using sequencing data for evolutionary analysis, but there are many other uses for ‘next generation’ sequencing data. The different uses of NGS (and 3rd generation) data often start with different ‘wet lab’ laboratory processes; for example, RNAseq starts with RNA, not DNA.
However, many of these uses share common bioinformatic processes early on, including the sequencing itself, quality checking and data cleaning (with some differences), and mapping (aligning reads to a reference genome). They diverge into different analyses after that.
In this workshop, you will:
Keeping track of bioinformatic analysis is as important as taking notes in the lab.
The simplest way to document bioinformatic analyses is to keep a plain text file that acts as a lab book. You should record all the commands you use, comments about why you are running them, and what files they produced (or any errors they return if they fail). Record this with a plain text editor (NOT MS Word or Google Docs), and keep this backed up somewhere.
Keeping records of everything you do helps you to run commands again, because you can copy text from your notes file and paste it into the terminal window. (Copying from MS Word or Google Docs will lead to the insertion of invisible characters in your commands.) It will also make it easier for you (and us) to work out what has gone wrong when things do not work out.
We suggest making a text file on the linux server (see below). Keep records of everything you do in this workshop (including logging in etc.). Submitting a well-documented lab book file is an essential requirement of your project.
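For example, an entry in your notes file might look something like this (the date, username and directory names are illustrative, not ones you need to copy):

```
## 2024-10-01, Workshop 2
# logged into biologin with ssh
cd /scratch/myusername        # my scratch space
mkdir leish_workshop          # working directory for this workshop
# first attempt failed: typed "mkdr" -> "command not found"
```

Commands plus brief comments on why you ran them (and any errors) are enough; the exact style is up to you.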
A single lane of Illumina sequencing produces many billions of bases of sequence, and traditional methods of analysing sequences (as we used in Workshop 1) are not feasible for such vast quantities of data. In this practical and the following workshop, you will be analysing Illumina paired-end whole genome re-sequencing data from clinical isolates of the parasite Leishmania infantum. The DNA libraries that have been sequenced comprise random fragments of ~300-500bp. Each of these fragments has been sequenced from each end, hence the “paired-end”, like so:
From here on in, things you need to do are preceded with: $
Let’s run through the brief linux tricks presentation first. It’s on the VLE, but I will run through it with the entire class. If you are not already familiar with the basic linux commands mkdir, rm and cd, please see the following tutorial before running the commands below. If you get stuck, or you don’t understand the syntax of a program, you can always use -h or --help to bring up the help page for the program you are trying to use.
The analyses will be carried out on the linux operating system. You will be connecting to, and running commands on, a linux server called biologin.
We will first log into this machine from linux, using ssh (secure shell). Open a terminal window
Log into the server using PuTTY.
First thing: make a plain text file on your computer using notepad.
Change the filename to something sensible, obviously; I call mine notes.txt. Keep updating this with your commands, and comments about what’s going on.
Log in to your allocated server biologin.york.ac.uk using PuTTY.
Use the York username we have given you (or your temporary username)
Open the Terminal app
Connect with this command:
ssh -X username@research0.york.ac.uk
username: your username
Make sure you have logged into the server before you continue
To display the processes running on your server type:
$ top
Go to your scratch space and make a working directory:
$ cd /scratch/username
$ mkdir sensible_directory_name
$ cd sensible_directory_name
You are now ‘in’, or working from, this directory. Files you make will appear here.
$ cd /shared/biology/bioldata2/bl-leish
$ mkdir username
Of course, replace username with your username!
$ cd username
cp /path/to/files /destination/path
The files you need for this workshop are bundled in a gzipped ‘tar’ file, called Course_material.tar.gz.
This is stored in the directory /shared/biology/bioldata2/bl-leish/DATA/popgenomics-teaching.
To copy this file to your scratch directory, do this:
$ cp /shared/biology/bioldata2/bl-leish/DATA/popgenomics-teaching/Course_material.tar.gz .
NB: Note the space between Course_material.tar.gz and the dot. The dot means the current directory, and can be put in place of the path of the folder we want to copy our files to
Now decompress the file
$ tar zxvf Course_material.tar.gz
This will create a folder called Course_material. You will need to cd into this directory to see its contents. Like so:
$ cd Course_material
Then, list the files and directories
$ ls
To load all the software you will need, type the following:
$ source /shared/biology/bioldata2/bl-leish/DATA/popgenomics-teaching/biolprograms
The server uses a module system, with each program loaded into the current session with the command:
$ module load programme_name
When you ran source biolprograms, you loaded all of the modules you need for this session.
Other key commands you may need to use are as follows:
$ module unload programme_name
This unloads the module
$ module list
This gives you the list of currently loaded modules
$ module avail
This gives a list of all available modules you can load
In here there is: TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta, a fasta format file containing the Leishmania donovani reference genome.
Also, six fastq.gz files, which are the paired-end Illumina sequence reads from parasite isolates from Ethiopia: Sample1_R1.fastq.gz
Sample1_R2.fastq.gz
Sample2_R1.fastq.gz
Sample2_R2.fastq.gz
Sample3_R1.fastq.gz
Sample3_R2.fastq.gz
We have called these samples 1-3 for simplicity; however, can you tell what each sample’s name is (clue: it begins with ERR) from opening the fastq file and looking at the headers? Question: Why are there two files for each sample? What might R1 and R2 mean?
*.gz files are compressed files. Look at the structure of a fastq file
View the first 20 lines, using zcat to unzip
$ zcat Sample1_R1.fastq.gz | head -n 20
zcat uncompresses the file; and | ‘pipes’ the uncompressed output to head, which prints the first 20 lines
Use the wc (word count) program in linux to count how many sequence reads are in one of the files, by ‘piping’ zcat output to wc. Specifying -l means it counts only the number of lines in the file
Count the lines of the file
$ zcat Sample1_R1.fastq.gz | wc -l
How many lines does each sequence take up?
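Once you know how many lines each read takes up, you can turn the line count into a read count with a little shell arithmetic (a sketch, assuming every record occupies the standard fixed number of lines in a fastq file):

```shell
# Count the reads in a gzipped fastq: total lines divided by 4,
# since each standard fastq record takes exactly four lines.
lines=$(zcat Sample1_R1.fastq.gz | wc -l)
echo $(( lines / 4 ))
```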
Understanding the format of fastq sequence data: look up the fastq format on the internet; the Wikipedia entry is good. What information does the first line of each sequence contain? What information does the fourth line of each sequence contain, and how is it encoded?
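As a pointer while you read, a single (made-up, not from your data) fastq record looks like this, and the four lines repeat for every read:

```
@read_id optional description    <- line 1: read identifier
ACGTTGCA                         <- line 2: the base calls
+                                <- line 3: separator (may repeat the identifier)
IIIIHHGF                         <- line 4: per-base quality, one character per base
```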
FastQC gives you a quick overview of the quality of your sequences. Use this to analyse each of your fastq files.
Now run fastQC to analyse the quality of one of the fastq files
$ fastqc something.fastq.gz
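To analyse every fastq file without retyping the command, a shell loop like this should work (a sketch, assuming all your fastq.gz files sit in the current directory):

```shell
# Run FastQC on each gzipped fastq file in the current directory in turn.
for f in *.fastq.gz; do
    fastqc "$f"
done
```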
The sequences are generally of good quality, though some have a Kmer Content problem. You can trim/eliminate the few poorer quality sequences
Pete will now show you the output of fastQC.
You will use a piece of software called cutadapt to do the quality filtering. Cutadapt can do many types of filtering; one important type is removing adapter contamination (which we will not be doing), and you can find out more here.
Use cutadapt to remove bases from the 3’ end of each sequence that have a quality < 20 (how stringent is this?)
Some linux shorthand we use here: a) the backslash symbol (\) allows one command to be specified over multiple lines. This tells the linux machine “don’t run this command yet, it’s not complete yet”. This makes long commands more human-readable. Please do this in your lab book. b) the ampersand (&) tells linux to run the process in the background. c) the greater-than symbol (>) redirects the output of a command to a file. Any output that would usually come up on the screen can be redirected to a file.
The basic command to filter 3’ (tail) ends by phred score 20 is:
$ cutadapt -q 20 -o output.fastq input.fastq
This trims reads using a phred score threshold of 20, keeping only pairs in which both reads pass the filter:
$ cutadapt -q 20 --pair-filter=any \
-o Sample1_R1_fastq.q20.gz \
-p Sample1_R2_fastq.q20.gz \
Sample1_R1.fastq.gz \
Sample1_R2.fastq.gz \
> Sample1_R1.fastq.gz.log &
Check the window running top to keep an eye on the processes running and to see when your command finishes. A log file has been created. We can open this file in a read only mode using less, see whether you can understand the content by doing the following:
$ less Sample1_R1.fastq.gz.log
Examine what cutadapt did (why run it, if you don’t examine the results?). You can do this in three ways:
List the files that have been created
Look at the log file, with more Sample1_R1.fastq.gz.log
Run FastQC again on the filtered output (Sample1_R1_fastq.q20.gz) to see what difference the filtering made
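You will want the same filtering for Samples 2 and 3. Rather than retyping the cutadapt command above for each pair, a loop over the sample names can apply it to all three (a sketch, assuming the SampleN_R1/_R2 naming used in this workshop; it runs the samples one after another rather than in the background):

```shell
# Quality-trim all three samples with the same cutadapt settings.
for s in Sample1 Sample2 Sample3; do
    cutadapt -q 20 --pair-filter=any \
        -o ${s}_R1_fastq.q20.gz \
        -p ${s}_R2_fastq.q20.gz \
        ${s}_R1.fastq.gz \
        ${s}_R2.fastq.gz \
        > ${s}_R1.fastq.gz.log
done
```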
NB: Both cutadapt and fastqc take a while to run, which makes it challenging to test which parameters are best to use. We can improve this by making a subset of the data, and testing on that. Since each fastq read comprises 4 lines, we can use the linux commands below to extract some lines to use for testing. How many reads will this extract?
$ zcat Sample1_R1.fastq.gz | head -n 40000 > Sample1_R1.subset.fastq
$ zcat Sample1_R2.fastq.gz | head -n 40000 > Sample1_R2.subset.fastq
Now that you have clean sequences, you will align the fastq files to the reference genome using the aligner called bwa (see here and here for details).
Before you align to a reference, you must index that reference:
$ bwa index TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta &
NB: By default bwa index uses the “is” algorithm; there are two other index algorithm options, bwtsw and rb2. In this case, Leishmania has a smallish genome, so the default algorithm is the best option. However, for larger genomes you may need to use a different algorithm.
While the reference is indexing you can display the various options available in bwa by typing in:
$ bwa
To get more detail about one of the displayed options, type in, for example:
$ bwa mem
The bwa websites will have further details about the available options.
Align the sequences to the reference genome:
$ bwa mem -t 2 \
-R '@RG\tID:S1\tSM:S1\tPL:Illumina' \
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
Sample1_R1_fastq.q20.gz \
Sample1_R2_fastq.q20.gz \
2> Sample1.bwamem.log \
> Sample1.q20.sam &
NB: the “2>” pipes the comments (called stderr in linux) to one file (the .log file), while the normal redirection “>” pipes the main output (stdout) to the .sam file
In the other terminal window (the one running top) you will be able to see your bwa aligner using up the CPU, and also when bwa is finished. This will take ~10 minutes to run; pretty fast for aligning so many sequences. While you are waiting for it to finish, go to this site to understand what the -t and -R options are doing (we will come back to -R in the next workshop).
If you are still waiting, listing the files in the folder will show you that bwa has started making a sam file. This contains the alignment information. View the last 20 lines:
$ tail -n 20 Sample1.q20.sam
Try to understand the format of the data. This pdf contains a detailed description of the format; see section 1.4 in particular. The format is similar to the fastq file you looked at previously, but now has information as to where each sequence aligned in the genome
Once your bwa alignment has finished, view the log file to check that everything ran okay
$ less Sample1.bwamem.log
Always look at the log files if something goes wrong. Sometimes your processes may fail because you have used the wrong syntax, or your input files are incorrect in some way. The log files will give you information as to what went wrong so that you can sort it out.
Now carry out the above steps to align the Samples 2 and 3 to the reference genome
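The alignment step for Samples 2 and 3 can likewise be scripted rather than retyped. This sketch assumes the file naming used above and gives each sample its own read group ID in the -R string:

```shell
# Align each sample's quality-trimmed read pairs to the indexed reference.
REF=TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta
for i in 1 2 3; do
    bwa mem -t 2 \
        -R "@RG\tID:S${i}\tSM:S${i}\tPL:Illumina" \
        $REF \
        Sample${i}_R1_fastq.q20.gz \
        Sample${i}_R2_fastq.q20.gz \
        2> Sample${i}.bwamem.log \
        > Sample${i}.q20.sam
done
```

Each iteration writes its own .sam and .log file; check the logs as before once the loop finishes.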
samtools is essential software for manipulating bam and sam alignment files. It has many very useful functions. We strongly advise you to get to know samtools by reading the manual here.
To display the various options available in samtools type:
$ samtools
Create a bam file from the sam file. A bam file is merely a compressed version of the sam file:
$ samtools view -bt TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
-o Sample1.q20.bam \
Sample1.q20.sam &
How much space is saved by the compression?
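To answer this, compare the two files side by side; ls -lh prints human-readable sizes:

```shell
# Compare the sizes of the sam file and its compressed bam counterpart.
ls -lh Sample1.q20.sam Sample1.q20.bam
```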
Once the bam file has been created you can delete the sam file, using the rm command. (Once a file is deleted you CANNOT get it back; there is no recycle bin.)
The sequences in your sam/bam file are in the same order as they were in the fastq files, i.e. randomly ordered with respect to where they map on the reference genome. It makes sense for the subsequent steps, such as SNP genotyping, to have the sequences ordered by where they aligned in the genome.
Sort each of your bam files, like so:
$ samtools sort -o Sample1.q20.sort.bam \
Sample1.q20.bam &
NB: you may put the ampersand “&” at the end of any command.
List the files in your folder with file details. You will see that more space saving is made by sorting the bam files.
You can get a crude summary of the quality of your aligned sequence by doing:
$ samtools flagstat Sample1.q20.sort.bam
Repeat for the other bams, if you have made them. What do the different statistics mean? How do the statistics differ between the samples? Is this surprising?
If you wish, more detailed statistics can be made by doing:
$ samtools stats Sample1.q20.sort.bam > Sample1.q20.sort.bam.stats
and then viewing the stats file with more/less, or gedit
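The headline figures in a samtools stats report are on lines prefixed “SN” (summary numbers), so you can pull out just those with grep; cut then drops the SN prefix column:

```shell
# Extract only the summary-numbers (SN) section of the samtools stats output.
grep ^SN Sample1.q20.sort.bam.stats | cut -f 2-
```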
Here’s what you learned, and some questions to discuss in your group with a tutor