In this workshop, you will use principle components analysis (PCA) in order to identify population structure in the samples analysed. By the end of this workshop you will understand the strengths and limitations of this approach, and further analyses you may be able to perform following PCA. You will also be able to understand why population structure should be taken into account, and how PCA can be used to subset a dataset into panmectic populations for use with other downstream analyses.
In this workshop, you will:
Learn about principle components analyses (PCA) is
Gain some gamilarity with the popular genomics toolkit PLINK
Use PLINK to generate eigenvalues for PCA from a VCF file
Gain more familarity with R
Understand output from PCA pca command, and use this to generate a PCA plot
Inferring ancestry differences among individuals from different populations, or identifying population structure has been motivated by many applications:
Population genetics
Genetic association studies
Prediction of whether an individual will have particular traits (for instance in people that have the duffy antigen that allows them to have resistance to malaria)
Until recently we were able to stratify individuals into populations using few markers, however the decrease in cost and improvements in technology mean that is now feasible to use whole genomes to perform these kinds of analyses. This means that we can now have very fine scaled analyses of populations based on many thousands of markers.
Principal components analysis (PCA) is the most commonly recognised and used method for identifying population structure.
This analysis allows us to calculate principle components (PCs) that explain the differences between individuals in genetic data. The naming of each of these components is based on the percentage of variation in the data they explain, with PC1 being the most explanatory PC of the data. For instance if PC1 explains 61%, this means that all of the other PCs most account for 39% of the variance observed in the data. PC2 will have the next largest contribution to the genetic variance, for example PC2 may contribute 20%. The overall variance should add up to 100% and so it PC1 and PC2 explain 81% of the variance, the remaining PCs will explain the maining 19%.
The following are a summary of the key steps behind how PCA works
It is important to understand the principle behind how a PCA works, however we will be using plink to perform the generation of the eigenvalues for us, which will perform these stages, and allow us to directly plot them in R.
More information of plink can be found here.
All these steps should be carrie dout in the linux server.
In order to load plink which is already installed on the server, use the following
module load plink
The first thing we need to do is made a bed file and the associated plink format files using the following command in plink
plink --vcf file.vcf --make-bed --chr-set no-xy
Instead of file.vcf, we will be using the vcf file we have prepared, which contains the SNP calls from multiple individuals. The make bed command means that the vcf file information will be used to create a bed file and plink specific files it needs to perform other functions. This software was based originally on human data, and so it expects a X/Y chromosome and 23 chromosomes. However if we are using a non human chromosome set, like Leishmania in this case, which has 36 chromosomes, we need to specify this. We can also specify the number of sex chromosomes, which is none in leishmania, and we specify with no-xy
You should now have files with a suffix bed,fam,ped and bim. If we used a file called file.vcf file would be the prefix
Using the files plink generated, we can now generate the eigenvalues
plink --bfile file --pca --chr-set 36 no-xy
Here the –bfile means use a bed file and fam,ped and bim files with the prefix file. The chr-set and no-xy we also have to use here. The –pca part of the command means generate eigenvalues
You will notice that following this command we have two new output files, they will be file.eigenvec and file.eigenval. The eigenval tells you in order of each PC ( so PC1,PC2….) the percentage each eigenvalue contributes to the variance. The eigenvec contains the coordinates for each sample. This file has no headers, and is tab seperated and contains the sample name in columns one and two, and then subsequently the eigenvals for each PC.
These values can then be outputted and used to generate a plot in R
You could simply download the eigen value files (Population_1_ethiopia_eigenval.txt and Population_2_ethiopia_eigenval.txt) from the zip file here.
But to learn how to download files from a server, read on.
We now want to copy the eigenvalue to a local folder. To do this we will use the secure copy command (scp) we used previously. In a terminal window that is not logged into the server, navigate to the downloads folder using the cd command. In the window that is still logged in, using the following command to determine the path that you are currently in and want to download from
pwd
This is the path we want to copy from. In this example we are going to copy from a file called output.eigenval from a path called /path/on/server/ to the local downloads directory
Copy over the relevant data to this new folder using the scp command using the following command in the terminal window that is not logged in, like so:
scp username@login.yarcc.york.ac.uk:/path/on/server/Population_?_ethiopia_eigenval.txt .
Note how we use the wild card symbol (?) here to get two files:
Population_1_ethiopia_eigenval.txt and Population_2_ethiopia_eigenval.txt.
If you are working from a windows PC, you will need to use winscp instead. The guide and installation process is explained here.
This section will be using R studio. You should have this installed already. If not ask for some assistance and we will help with this. The first thing we will need to do is open R studio and set the current directory to the folder which you have the eigenvec file in (Downloads in our case).
First, open R Studio from the Start menu.
Then start a new R script, or open your old one.
Then, check what the current working directory is by typing the following command in the console window.
getwd()
This will give us the path that R will currently be loading its data from. If this is the folder than the eigenvec file we created is in. If it is not we can correct this by setting the directory to the folder it is in (Downloads in our case)
setwd("/Users/Sarah/Downloads")
By default, the eigenvec file that we have is seperated by spaces, we need to tell R this, so that it can recognise that there are multiple columns in the file. We can specify this with sep=" " when we read the file in. In the following, we will be saving the information from the file into an object called pca_input using the “<-” symbol. (Of course, like all object names in R, you can call it anything you like. You could call it chocolate_fish, but that wouldn’t help you to remember what the data object contains).
pca1 <- read.table("Population_1_ethiopia_eigenval.txt",sep=" ",header=F)
And get the file (Population_2_ethiopia_eigenval.txt), using a similar command. We suggest you call this pca2.
There are several important things to note here. The file name needs to be enclosed within double quotes, the sep command specifies how columns are defined, and header=F, means that the first line of the file is not column names. Otherwise the first sample and it’s eigenvalues will be used as column names
We can then plot by the first and second principle components using the plot command. Because we have no headers in this file, R arbitrarily gives these are V number, so the first two columns are called V1 and V2. These contain the sample names, V3 corresponds with PC1, V4 with PC2 and so on. We can plot the first and second PCs using the following command
plot(data=pca1, V3~V4))
R has extra tools called packages you can very easily install. A popular package we like is called ggplot2. This package has advanced plots are are easy to customise. We can install this packge with the following command
install.packages("ggplot2")
We can then load this any time we want to use it after it has been installed by using the following command
library("ggplot2")
This package then has several different types of plots called geoms that we can use with the data, after we have specified the dataset in the first part of the command as follows:
ggplot(data=pca1, aes(V3,V4)) + geom_point()
Here we are using geom_point with default parameters, this takes the x,y from data_input, columns, V3 and V4.
However we can also increase the size of the points, or colour the points. Examples of this can be seen here on this cheatsheet.
Try it yourself: plot pca2 by modifying the commands above.
This was the last workshop,here is what we did.
In Workshop 1 we downloaded influenza virus sequences, and used MEGA phylogenetics software to analyse them.
In Workshop 2 we ran though some linux training, so we would work on a server.
In Workshop 3 we mapped reads to a refernce genome, and processed the bam file.
In Workshop 4 we used the bam file to identfy small genetic variants (SNPs and indels).
In Workshop 5 we used the SNPs and indels to examine genomic diversity.
In this workshop we used the SNPs and indels to examine population structure.
Wow. Busy week!