In this workshop, you will get your hands on some Illumina sequence data and begin to analyse it. As in the first workshop, we are using sequencing data for evolutionary analysis, but there are many other uses for 'next generation' sequencing (NGS) data. These different uses of NGS (and third-generation) data often start with different 'wet lab' laboratory processes; for example, RNAseq starts with RNA, not DNA. However, many of them share common bioinformatic processes early on, including the sequencing itself (with some differences), quality checking and data cleaning, and 'mapping' (aligning reads to a reference genome). They diverge after that.
In this workshop, you will:
Keeping track of bioinformatic analysis is as important as taking notes in the lab.
The simplest way to document bioinformatic analyses is to keep a plain text file that acts as a lab book. You should record all the commands you use, comments about why you are running them, and what files they produced (or any errors they return if they fail). Record this with a plain text editor (NOT MS Word or Google Docs), and keep it backed up somewhere. Keeping records of everything you do helps you to run commands again, because you can copy text from your notes file and paste it into the terminal window. Copying from MS Word or Google Docs will lead to the insertion of invisible characters into your commands. Good records will also make it easier for you (and us) to work out what has gone wrong when things do not work out.
I suggest making a text file on the linux server (see below). Keep records of everything you do in this workshop (including logging in etc.). Submitting a well-documented lab book file is an essential requirement of your project.
A single lane of Illumina sequencing produces many billions of bases of sequence, and traditional methods of analysing sequences (as we used in Workshop 1) are not feasible for such vast quantities of data. In this practical and the following workshop, you will be analysing Illumina paired-end whole-genome re-sequencing data from clinical isolates of the parasite Leishmania infantum. The DNA libraries that have been sequenced comprise random fragments of ~300-500 bp. Each of these fragments has been sequenced from each end, hence "paired-end", like so:
Short read sequencing
From here on in, things you need to do are preceded with: $
Let’s run through the brief linux tricks presentation first. It’s on the VLE, but I will run through it with the entire class. If you are not already familiar with the basic linux commands mkdir, rm and cd, please see the following tutorial before running the commands below. If you get stuck, or you don’t understand the syntax of a program, you can always use -h or --help to bring up the help page for the program you are trying to use.
The analyses will be carried out on the linux operating system. You will be connecting to, and running commands on, a linux server called biologin.
We will first log into this machine from linux, using ssh (secure shell). Open a terminal window
Log into the server
$ ssh -Y username@biologin.york.ac.uk
First thing: make a plain text file on the server that serves as your notebook:
$ gedit filename &
Change filename to something sensible, obviously. I call mine notes.txt. Keep updating this with your commands, and comments about what’s going on
Log in to your allocated server biolar11X.york.ac.uk using software called PuTTY. This should be possible to do from home, as long as you have the Pulse Secure VPN installed. See here: https://www.york.ac.uk/it-services/services/vpn/
To display the processes running on your server type:
$ top
This will allow you to keep track of whether your jobs are completed or still running. You can have two terminal windows running and log in twice, if you like
In one terminal window make a new directory with a sensible name, and move into this directory
Go to your scratch space and make a working directory:
$ cd /scratch/username
$ mkdir sensible_directory_name
$ cd sensible_directory_name
You are now ‘in’ or working from this directory. Files you make will appear here
Copy over the relevant data to this new folder using the cp command, which has the following format
$ cp /path/to/files /destination/path
$ cp /opt/yarcc/data/biology/Leish-data.tar.gz .
NB: note the space between Leish-data.tar.gz and the dot. The dot means the current directory, and can be used in place of the destination path when you want to copy files into the directory you are currently in
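As a minimal illustration of '.' as a copy destination (the directory and file names here are throwaway examples, not the workshop data):

```shell
# Make an example source file, then copy it into the current directory with '.'
mkdir -p src dest
echo "data" > src/example.txt
cd dest
cp ../src/example.txt .    # '.' = the directory we are currently in
cat example.txt
```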
Decompress the file
$ tar zxvf Leish-data.tar.gz
List the files and directories
$ ls
In here, there is a file called biolprograms
To load all the software you will need, type the following
$ source biolprograms
The server uses a modular system: each program is loaded into the current session with the command
$ module load programme_name
When you ran
$ source biolprograms
you loaded all of the modules you need for this session. Other key commands you may need to use are as follows:
$ module unload programme_name
This unloads the module
$ module list
This gives you the list of currently loaded modules
$ module avail
This gives a list of all available modules you can load
In here there is:
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta, a fasta format file containing the Leishmania infantum reference genome
Also, six fastq.gz files that are the paired-end Illumina sequence reads from parasite isolates from Ethiopia:
Sample1_R1.fastq.gz
Sample1_R2.fastq.gz
Sample2_R1.fastq.gz
Sample2_R2.fastq.gz
Sample3_R1.fastq.gz
Sample3_R2.fastq.gz
We have called these Samples 1-3 for simplicity; however, can you tell what each sample’s real name is (clue: it begins with ERR) by opening one of the fastq files and looking at the read headers? Question: Why are there two files for each sample? What might R1 and R2 mean?
*.gz files are compressed files. Let’s look at the structure of a fastq file
View the first 20 lines, using zcat to unzip
$ zcat Sample1_R1.fastq.gz | head -n 20
zcat uncompresses the file; and | ‘pipes’ the uncompressed file to head; which prints the first 20 lines
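The same zcat-to-head pipeline can be tried on any small gzipped file. Here is a throwaway sketch (the file name tiny.txt.gz is made up for illustration, not part of the workshop data):

```shell
# Create a small gzipped file, then pipe its uncompressed content to head
printf 'line1\nline2\nline3\n' | gzip > tiny.txt.gz
zcat tiny.txt.gz | head -n 2     # prints only line1 and line2
first_two=$(zcat tiny.txt.gz | head -n 2)
echo "$first_two"
```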
Use the wc (word count) program in linux to count how many sequence reads are in one of the files, by ‘piping’ zcat output to wc. Specifying -l means it counts only the number of lines in the file
Count lines of file
$ zcat Sample1_R1.fastq.gz | wc -l
How many lines does each sequence take up?
Understanding the format of fastq sequence data
Look up fastq format on the internet; the Wikipedia entry is good.
What information does the first line of each sequence contain?
What information does the fourth line of each sequence contain, and how is it encoded?
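Once you have looked up the encoding, here is a quick sanity check: in the Sanger/Illumina 1.8+ convention, each character on the quality line encodes a phred score as its ASCII code minus 33. A minimal sketch (the character 'I' is just an example):

```shell
# Decode one quality character: phred score = ASCII code - 33
qual_char="I"
ascii=$(printf '%d' "'$qual_char")   # printf "'X" prints the ASCII code of X
phred=$((ascii - 33))
echo "$phred"                        # 'I' is ASCII 73, so phred = 40
```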
FastQC gives you a quick overview of the quality of your sequences. Use this to analyse each of your fastq files. FastQC wasn’t loaded by the biolprograms command above, so load it yourself. Load fastqc
$ module load fastqc
Then run FastQC to analyse the quality of one of the fastq files
$ fastqc something.fastq.gz &
Repeat the FastQC analysis for the remaining R1 fastq files. Get them running ‘in the background’ (using &) while looking at the output of the first one. Once FastQC is done, you will see that each FastQC run produces an html file. You can view this in one of two ways: a) open firefox within the linux machine (this is the simple way)
$ firefox filename.html &
b) transfer the file over from the server to your local computer (this is the hard way). To do this we can use scp, a secure version of the copy command we used earlier. Open a terminal window that is not logged into the server. You can then give the path of your file on the server, and the folder on your computer where you want the file deposited. We can use pwd in the current directory on the server to see the path our files can be copied from. Check current directory path
$ pwd
The current path to the directory is:
/scratch/username/sensible_directory_name/
In the terminal window that is not logged into the server, give the path of the file on the server and the destination path as follows:
$ scp username@biologin.york.ac.uk:/scratch/username/sensible_directory_name/file_to_transfer /path/on/computer/
If the open terminal window is already in the directory you want to transfer the server files to, you can replace /path/on/computer/ with ‘.’, which means ‘this directory’, as follows:
$ scp username@biologin.york.ac.uk:/scratch/username/sensible_directory_name/file .
What do the various sections in the FastQC report mean? Look up the FastQC webpage, in particular the “Example Reports” section at **http://www.bioinformatics.babraham.ac.uk/projects/fastqc/**
The sequences are generally of good quality, though some have a Kmer Content problem. You can trim/eliminate the few poorer quality sequences
You will use a piece of software called cutadapt to do the quality filtering. Cutadapt can do many types of filtering. One important type of filtering it can do is to remove adapter contamination (which we will not be doing) and you can find out more here: **http://cutadapt.readthedocs.io/en/stable/guide.html**
Use cutadapt to remove bases from the 3’ end of each sequence that have a quality < 20 (how stringent is this?)
Some linux shorthand we use here: a) the backslash symbol (\) allows one command to be written over multiple lines. This tells the linux machine “don’t run this command yet, it’s not complete yet”, and makes long commands more human-readable. Please do this in your lab book. b) the ampersand (&) tells linux to run the process in the background. c) the greater-than symbol (>) redirects the output of a command to a file. Any output that would usually come up on the screen can be sent to a file instead.
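A minimal, throwaway demonstration of all three shorthands together (demo.txt is just an example file name):

```shell
# Backslash continues the command over lines, > redirects stdout to demo.txt,
# and & runs the whole thing in the background
echo "hello" \
     "world" \
     > demo.txt &
wait               # wait for the background job to finish before reading the file
cat demo.txt       # hello world
```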
The basic command to filter 3’ (tail) ends by phred score 20 is:
$ cutadapt -q 20 -o output.fastq input.fastq
Trim reads using a phred score threshold of 20, keeping a read pair only if both reads in the pair pass the filter
$ cutadapt -q 20 --pair-filter=any \
-o Sample1_R1_fastq.q20.gz \
-p Sample1_R2_fastq.q20.gz \
Sample1_R1.fastq.gz \
Sample1_R2.fastq.gz \
> Sample1_R1.fastq.gz.log &
Check the window running top to keep an eye on the running processes and to see when your command finishes. A log file has been created. We can open this file in read-only mode using less; see whether you can understand the content by doing the following:
$ less Sample1_R1.fastq.gz.log
Examine what cutadapt did (why run it, if you don’t examine the results?). You can do this in three ways:
List the files that have been created
Look at the log file, with: more Sample1_R1.fastq.gz.log
Run FastQC again on the filtered output (Sample1_R1_fastq.q20.gz) to see what difference the filtering made
NB: Both cutadapt and fastqc take a while to run, which makes it challenging to test which parameters are best. We can improve this by making a subset of the data, and testing on that. Since each fastq read takes up 4 lines, we can use the linux commands below to extract some lines to use for testing. How many reads will this extract?
$ zcat Sample1_R1.fastq.gz | head -n 40000 > Sample1_R1.subset.fastq
$ zcat Sample1_R2.fastq.gz | head -n 40000 > Sample1_R2.subset.fastq
Now that you have clean sequences, you will align the fastq files to the reference genome using the aligner called bwa (see http://bio-bwa.sourceforge.net/ and http://bio-bwa.sourceforge.net/bwa.shtml for details).
Before you align to a reference, you must index that reference:
$ bwa index TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta &
NB: By default bwa index uses the “is” algorithm; there are two other index algorithm options, bwtsw and rb2. Leishmania has a smallish genome, so the default algorithm is the best option here. However, for larger or more divergent reference genomes, you may need to use a different algorithm.
While the reference is indexing you can display the various options available in bwa by typing in:
$ bwa
To get more detail about one of the displayed options, type in, for example:
$ bwa mem
The bwa websites have further details about the available options. Align the sequences to the reference genome
$ bwa mem -t 2 \
-R '@RG\tID:S1\tSM:S1\tPL:Illumina' \
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
Sample1_R1_fastq.q20.gz \
Sample1_R2_fastq.q20.gz \
2> Sample1.bwamem.log \
> Sample1.q20.sam &
NB: “2>” redirects the comments (called stderr in linux) to one file (the .log file), while the normal redirection “>” sends the main output (stdout) to the .sam file
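You can see the difference between the two streams with a throwaway command (out.log and err.log are made-up names, not workshop files):

```shell
# stdout and stderr are separate streams, and can be redirected independently
{ echo "normal output"; echo "an error message" >&2; } > out.log 2> err.log
cat out.log    # normal output
cat err.log    # an error message
```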
In the other terminal window you will be able to see your bwa aligner using up the CPU, and also when bwa is finished. This will take ~10 minutes to run; pretty fast for aligning so many sequences. While you are waiting for it to finish, go to http://bio-bwa.sourceforge.net/bwa.shtml to understand what the -t and -R options are doing (we will come back to -R in the next workshop).
If you are still waiting, listing the files in the folder will show you that bwa has started making a sam file. This contains the alignment information. View the last 20 lines:
$ tail -n 20 Sample1.q20.sam
Try to understand the format of the data. **https://samtools.github.io/hts-specs/SAMv1.pdf** contains a detailed description of the format. See section 1.4 in particular. The format is similar to the fastq file you looked at previously, but now has information on where each sequence aligns in the genome
Once your bwa alignment has finished, view the log file to check that everything ran okay
$ less Sample1.bwamem.log
Always look at the log files if something goes wrong. Sometimes your processes may fail because you have used the wrong syntax, or your input files are incorrect in some way. The log files will give you information as to what went wrong so that you can sort it out.
Now carry out the above steps to align the Samples 2 and 3 to the reference genome
samtools **(http://www.htslib.org/doc/samtools.html)** is a piece of software for manipulating alignment files. To display the various options available in samtools type in:
$ samtools
Create a bam file from the sam file. A bam file is merely a compressed, binary version of the sam file:
$ samtools view -bt TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
-o Sample1.q20.bam \
Sample1.q20.sam &
How much space is saved by the compression?
Once the bam file has been created you can delete the sam file, using the rm command. (once a file is deleted you CANNOT get it back; there is no recycle bin)
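To answer the space question, compare the file sizes. As a generic illustration of how much compression can save on repetitive, sequence-like text (plain.txt is a throwaway example, not the workshop data):

```shell
# Generate repetitive 'sequence-like' text, compress it, and compare byte counts
yes "ACGTACGTACGTACGT" | head -n 10000 > plain.txt
gzip -c plain.txt > plain.txt.gz
orig=$(wc -c < plain.txt)
comp=$(wc -c < plain.txt.gz)
echo "original: $orig bytes, compressed: $comp bytes"
```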
The sequences in your sam/bam file are in the same order as they were in the fastq files, i.e. randomly ordered with respect to where they map on the reference genome. For subsequent steps, such as SNP genotyping, it makes sense to have the sequences ordered by where they align in the genome
Sort each of your bam files, like so:
$ samtools sort -o Sample1.q20.sort.bam \
Sample1.q20.bam &
NB: you may put the ampersand “&” at the end of any command. List the files in your folder with file details. You will see that further space saving is made by sorting the bam files.
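The idea of coordinate sorting can be illustrated with plain sort on a mock table of (chromosome, position) pairs, which is essentially what samtools sort does to the alignment records (mock.txt and its contents are made up for illustration):

```shell
# Three mock alignment records in read order: chromosome, then position
printf 'chr2\t500\nchr1\t900\nchr1\t100\n' > mock.txt
# Sort by chromosome name, then numerically by position
sort -k1,1 -k2,2n mock.txt > mock.sorted.txt
head -n 1 mock.sorted.txt    # chr1 at position 100 now comes first
```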
You can get a crude summary of the quality of your aligned sequence by doing:
$ samtools flagstat Sample1.q20.sort.bam
Repeat for the other bams, if you have made them. What do the different statistics mean? How do the statistics differ between the samples? Is this surprising?
If you wish, more detailed statistics can be made by doing:
$ samtools stats Sample1.q20.sort.bam \
> Sample1.q20.sort.bam.stats
and then viewing the stats file with more/less, or gedit
Here’s what you learned, and some questions to discuss in your group with a tutor