
Introduction

In this workshop, you will get your hands on some Illumina sequence data and begin to analyse it. As in the first workshop, we are using sequencing data for evolutionary analysis, but there are many other uses for ‘next generation’ sequencing data. The different uses of NGS (and 3rd generation) data often start with different ‘wet lab’ laboratory processes; for example, RNAseq starts with RNA, not DNA.

But many of these uses share common bioinformatic processes early on, including the sequencing itself, quality checking and data cleaning (with some differences between applications), and mapping (aligning reads to a reference genome). They diverge into different analyses after that.


In this workshop, you will:

  • Gain more familiarity with using command line Linux
  • Check the sequence quality using FastQC
  • Clean the sequences using cutadapt
  • Align sequences to a reference genome using bwa
  • Use samtools to manipulate alignment files
  • Understand the fastq and sam file formats

Documenting your work

Keeping track of bioinformatic analysis is as important as taking notes in the lab.

The simplest way to document bioinformatic analyses is to keep a plain text file that acts as a lab book. You should record all the commands you use, comments about why you are running them, and what files they produced (or any errors they return if they fail). Record this with a plain text editor (NOT MS Word or Google Docs), and keep this backed up somewhere.

Keeping records of everything you do helps you to run commands again, because you can copy text from your notes file and paste it into the terminal window. (Copying from MS Word or Google Docs will lead to the insertion of invisible characters in your commands.) It will also make it easier for you (and us) to work out what has gone wrong when things do not work out.

Start keeping records now!

We suggest making a text file on the linux server (see below). Keep records of everything you do in this workshop (including logging in etc.). Submitting a well-documented lab book file is an essential requirement of your project.


Modern high-throughput sequencers produce huge amounts of sequences

A single lane of Illumina sequencing produces many billions of bases of sequence, and traditional methods of analysing sequences (as we used in Workshop 1) are not feasible for such vast quantities of data. In this practical and the following workshop, you will be analysing Illumina paired-end whole-genome re-sequencing data from clinical isolates of the parasite Leishmania infantum. The DNA libraries that have been sequenced comprise random fragments of ~300-500 bp. Each of these fragments has been sequenced from each end, hence the “paired-end”, like so:

[Figure: short-read paired-end sequencing]


The workshop

From here on in, things you need to do are preceded with: $

Let’s run through the brief Linux tricks presentation first. It’s on the VLE, but I will run through it with the entire class. If you are not already familiar with the basic Linux commands mkdir, rm and cd, please see the following tutorial before running the commands below. If you get stuck, or you don’t understand the syntax of a program, you can always use -h or --help to bring up the help page for the program you are trying to use.


Logging into the server

The analyses will be carried out on the Linux operating system. You will connect to, and run commands on, a Linux server called biologin.

We will log into this machine using ssh (secure shell): from a PC using PuTTY, or from a Mac using the Terminal app. Instructions for both are below.

First thing: make a plain text file on your computer using Notepad.

Change the filename to something sensible, obviously. I call mine notes.txt. Keep updating this with your commands, and comments about what’s going on.
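
For example, an entry in your notes file might look something like this (a sketch; the date, path and comments are just illustrative):

# 2024-10-01, Workshop 2
# logged into biologin using PuTTY
cd /scratch/dj757     # moved to my scratch space
mkdir workshop2       # made a working directory for today
# fastqc failed at first: I had mistyped the file name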


To log into the server from a PC:

  • Log in to your allocated server biologin.york.ac.uk using PuTTY.

  • Use the York username we have given you (or your temporary username)

To log into the server from a Mac:

  • Open the Terminal app

  • Connect with this command:

    ssh -X username@biologin.york.ac.uk

    username: your username


Make sure you have logged into the server before you continue


Get to know the server in the terminal

To display the processes running on your server type:

$ top
  • This will allow you to keep track of whether your jobs are completed or still running. You can have two terminal windows running and log in twice, if you like. Press q to quit top.
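
If the list is busy with other users’ jobs, you can restrict top to your own processes (replace username with your own):

$ top -u username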

Go to your scratch space and make a working directory:

$ cd /scratch/username 
$ mkdir sensible_directory_name
$ cd sensible_directory_name 

You are now ‘in’ or working from this directory. Files you make will appear here


Getting the data files

  • First, navigate to the directory where we will store our work:

$ cd /shared/biology/bioldata2/bl-leish

  • Now make your own directory within here. Please name it as your username (for example, mine would be dj757). To create a directory in here, do this:

$ mkdir username

Of course, replace username with your username!

  • Now enter your directory:

$ cd username

  • Now copy over the relevant data to this new folder using the cp command, which has the following format:

$ cp /path/to/files /destination/path

 

The files you need for this workshop are bundled in a gzipped ‘tar’ file, called Course_material.tar.gz.

This is stored in the directory /shared/biology/bioldata2/bl-leish/DATA/popgenomics-teaching.

To copy this file into the directory you just made, do this:

$ cp /shared/biology/bioldata2/bl-leish/DATA/popgenomics-teaching/Course_material.tar.gz .

NB: Note the space between Course_material.tar.gz and the dot. The dot means the current directory, and can be used in place of the path of the folder we want to copy our files to.

Now decompress the file

$ tar zxvf Course_material.tar.gz

This will create a folder called Course_material. You will need to cd into this directory to see its contents. Like so:

$ cd Course_material

Then, list the files and directories

$ ls

To load all the software you will need, type the following:

$ source /shared/biology/bioldata2/bl-leish/DATA/popgenomics-teaching/biolprograms

The server uses a module system, with each program loaded into the current session with the command:

$ module load programme_name

When you ran source biolprograms, you loaded all of the modules you need for this session.

     

Other key commands you may need to use are as follows:

$ module unload programme_name 

This unloads the module

$ module list 

This gives you the list of currently loaded modules

$ module avail 

This gives a list of all available modules you can load

In here there is:

TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta: a fasta format file containing the Leishmania donovani reference genome

There are also six fastq.gz files, the paired-end Illumina sequence reads from parasite isolates from Ethiopia:

Sample1_R1.fastq.gz
Sample1_R2.fastq.gz
Sample2_R1.fastq.gz
Sample2_R2.fastq.gz
Sample3_R1.fastq.gz
Sample3_R2.fastq.gz

We have called these Sample 1-3 for simplicity. However, can you tell what each sample’s real name is (clue: it begins with ERR) by opening one of the fastq files and looking at the read headers?

Question: Why are there two files for each sample? What might R1 and R2 mean?

*.gz files are compressed files.

Look at the structure of a fastq file.

View the first 20 lines, using zcat to unzip

$ zcat Sample1_R1.fastq.gz | head -n 20 

zcat uncompresses the file, and | ‘pipes’ the uncompressed output to head, which prints the first 20 lines.
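
You can combine head and tail in the same way to view any slice of the file; for example, to see lines 21 to 40 of the same sample:

$ zcat Sample1_R1.fastq.gz | head -n 40 | tail -n 20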

Use the wc (word count) program in Linux to count how many sequence reads are in one of the files, by ‘piping’ zcat output to wc. Specifying -l means it counts only the number of lines in the file.

Count lines of file

$ zcat Sample1_R1.fastq.gz | wc -l

How many lines does each sequence take up?

Understanding the format of fastq sequence data

Look up fastq format on the internet; the Wikipedia entry is good.

What information does the first line of each sequence contain? What information does the fourth line of each sequence contain, and how is it encoded?
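
For reference, a single fastq record looks something like this (a made-up read, shown purely for illustration):

@ERR000001.1            <- line 1: the read header
GATTTGGGGTTCAAAGCAGT    <- line 2: the sequence
+                       <- line 3: separator (may repeat the header)
!''*((((***+))%%%++)    <- line 4: one quality character per base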


Data quality check with fastqc

FastQC gives you a quick overview of the quality of your sequences. Use this to analyse each of your fastq files.

Now run FastQC to analyse the quality of one of the fastq files:

$ fastqc something.fastq.gz
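
FastQC will also happily take several files in one go; for example, to analyse all six read files at once using a shell wildcard:

$ fastqc Sample*_R*.fastq.gz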

The sequences are generally of good quality, though some have a Kmer Content problem. You can trim/eliminate the few poorer quality sequences


Pete will now show you the output of fastQC.


Quality filtering fastq files

You will use a piece of software called cutadapt to do the quality filtering. Cutadapt can do many types of filtering; one important type is removing adapter contamination (which we will not be doing here). You can find out more here.

Use cutadapt to remove bases from the 3’ end of each sequence that have a quality < 20 (how stringent is this?)

Some Linux shorthand we use here:

a) the backslash symbol (\) allows one command to be specified over multiple lines. This tells the Linux machine “don’t run this command yet, it’s not complete yet”, and makes long commands more human-readable. Please do this in your lab book.

b) the ampersand (&) tells Linux to run the process in the background.

c) the greater-than symbol (>) redirects the output of a command to a file. Any output that would usually come up on the screen can be sent to a file instead.
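A toy demonstration of all three together (nothing here is part of the analysis):

$ ls -l \
/scratch > listing.txt &

This runs ls -l /scratch in the background and sends the output that would have appeared on screen to the file listing.txt.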

The basic command to filter 3’ (tail) ends by phred score 20 is:

$ cutadapt -q 20 -o output.fastq input.fastq 

This trims reads using a phred score threshold of 20, keeping a pair of reads only if both reads pass the filter:

$  cutadapt -q 20 --pair-filter=any \
-o Sample1_R1_fastq.q20.gz \
-p Sample1_R2_fastq.q20.gz \
Sample1_R1.fastq.gz \
Sample1_R2.fastq.gz \
> Sample1_R1.fastq.gz.log & 

Check the window running top to keep an eye on the processes running and to see when your command finishes. A log file has been created; we can open this file in read-only mode using less. See whether you can understand the content:

$ less Sample1_R1.fastq.gz.log

Examine what cutadapt did (why run it, if you don’t examine the results?). You can do this in three ways:

  • List the files that have been created
  • Look at the log file, with more Sample1_R1.fastq.gz.log
  • Run FastQC again on the filtered output (Sample1_R1_fastq.q20.gz) to see what difference the filtering made

NB: Both cutadapt and FastQC take a while to run, which makes it challenging to test which parameters are best. We can improve this by making a subset of the data and testing on that. Since each fastq read occupies 4 lines, we can use the Linux commands below to extract some lines for testing. How many reads will this extract?

$ zcat Sample1_R1.fastq.gz | head -n 40000 > Sample1_R1.subset.fastq
$ zcat Sample1_R2.fastq.gz | head -n 40000 > Sample1_R2.subset.fastq
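
You can then try settings out quickly on the subset before committing to the full files; for example, a stricter threshold (the q30 file name here is made up for this test):

$ cutadapt -q 30 -o Sample1_R1.subset.q30.fastq Sample1_R1.subset.fastq
$ fastqc Sample1_R1.subset.q30.fastq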

Aligning sequences with BWA

Now that you have clean sequences, you will align the fastq files to the reference genome using the aligner called bwa (see here and here for details).

Before you align to a reference, you must index that reference:

$ bwa index TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta & 

NB: By default bwa index uses the “is” algorithm; there are two other index algorithm options, bwtsw and rb2. In this case the Leishmania genome is smallish, so the default algorithm is the best option, but for larger genomes you may need to use a different algorithm.

While the reference is indexing you can display the various options available in bwa by typing in:

$ bwa

To get more detail about one of the displayed options, type in, for example:

$ bwa mem

The bwa websites will have further details about the available options.

Align the sequences to the reference genome:

$ bwa mem -t 2 \
-R '@RG\tID:S1\tSM:S1\tPL:Illumina' \
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
Sample1_R1_fastq.q20.gz \
Sample1_R2_fastq.q20.gz \
2> Sample1.bwamem.log \
> Sample1.q20.sam & 

NB: “2>” redirects the diagnostic messages (called stderr in Linux) to one file (the .log file), while the normal redirection “>” sends the main output (stdout) to the .sam file.
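
A quick way to see the difference between the two streams (a toy example, not part of the analysis): ask ls about one file that exists and one that does not:

$ ls Sample1.q20.sam no_such_file > out.txt 2> err.txt

out.txt will contain the name of the existing file, and err.txt the error message about the missing one.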

In the other terminal window you will be able to see your bwa aligner using up the CPU, and also when bwa is finished. This will take ~10 minutes to run; pretty fast for aligning so many sequences. While you are waiting for it to finish, go to this site to understand what the -t and -R options are doing (we will come back to -R in the next workshop).

If you are still waiting, listing the files in the folder will show you that bwa has started making a sam file. This contains the alignment information. View the last 20 lines:

$ tail -n 20 Sample1.q20.sam

Try to understand the format of the data. This pdf contains a detailed description of the format; see section 1.4 in particular. The format is similar to the fastq file you looked at previously, but now has information as to where each sequence is aligned in the genome.
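
The first few lines of the file are header lines beginning with @ (for example, @SQ lines describing the reference sequences, and the @RG read group we set with -R). To see them:

$ head -n 5 Sample1.q20.sam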

Once your bwa alignment has finished, view the log file to check that everything ran okay

$ less Sample1.bwamem.log 

Always look at the log files if something goes wrong. Sometimes your processes may fail because you have used the wrong syntax, or your input files are incorrect in some way. The log files will give you information as to what went wrong so that you can sort it out.

Now carry out the above steps to align Samples 2 and 3 to the reference genome.
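
If you are comfortable with the shell, a for loop can run the filtering and alignment for the remaining samples in one go. This is only a sketch, assuming the same file-naming pattern as above; running the commands one at a time is just as good:

$ for S in Sample2 Sample3
do
  # quality-trim both reads of the pair, as for Sample1
  cutadapt -q 20 --pair-filter=any \
    -o ${S}_R1_fastq.q20.gz -p ${S}_R2_fastq.q20.gz \
    ${S}_R1.fastq.gz ${S}_R2.fastq.gz > ${S}.cutadapt.log
  # align the trimmed pair to the reference
  bwa mem -t 2 -R "@RG\tID:${S}\tSM:${S}\tPL:Illumina" \
    TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
    ${S}_R1_fastq.q20.gz ${S}_R2_fastq.q20.gz \
    2> ${S}.bwamem.log > ${S}.q20.sam
done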


Compressing and sorting bam files

samtools is an essential piece of software for manipulating bam and sam alignment files. It has many very useful functions. We strongly advise you to get to know samtools by reading the manual here.

     

To display the various options available in samtools, type:

$ samtools

Create a bam file from the sam file. A bam file is merely a compressed version of the sam file:

$ samtools view -bt TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
-o Sample1.q20.bam \
Sample1.q20.sam & 

How much space is saved by the compression?
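
Compare the file sizes directly with ls; the -h flag prints human-readable sizes:

$ ls -lh Sample1.q20.sam Sample1.q20.bam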

Once the bam file has been created you can delete the sam file, using the rm command. (once a file is deleted you CANNOT get it back; there is no recycle bin)

The sequences in your sam/bam file are in the same order as they were in the fastq files, i.e. randomly ordered with respect to where they map on the reference genome. It makes sense for subsequent steps, such as SNP genotyping, to have the sequences ordered by where they align in the genome.

Sort each of your bam files, like so:

$ samtools sort -o Sample1.q20.sort.bam \
Sample1.q20.bam & 

NB: you may put the ampersand “&” at the end of any command.

List the files in your folder with file details. You will see that more space saving is made by sorting the bam files.
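
To list the files with details such as their sizes:

$ ls -l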


A brief summary of your alignment

You can get a crude summary of the quality of your aligned sequence by doing:

$ samtools flagstat Sample1.q20.sort.bam

Repeat for the other bams, if you have made them.

What do the different statistics mean? How do the statistics differ between the samples? Is this surprising?

If you wish, more detailed statistics can be made by doing:

$ samtools stats Sample1.q20.sort.bam > Sample1.q20.sort.bam.stats

and then viewing the stats file with more/less, or gedit
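
The stats file is long; the summary numbers sit on lines beginning with SN, so you can pull out just those:

$ grep ^SN Sample1.q20.sort.bam.stats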


Summary

Here’s what you learned, and some questions to discuss in your group with a tutor

  • How to log into a server. Why use a server? Why not just do this on a laptop?
  • How to extract tar files. Why use tar files and gzip?
  • How to check that your fastq’s are healthy, and remove bad quality data.
  • How to align reads to a reference. Why did we do this?
  • How to do some things with sam files. What’s a sam file?

The End

That is all for today. More analysis of Illumina sequence data will come in Workshop 4.

