Link back to main course page


Introduction

In this workshop, you will use principle components analysis (PCA) in order to identify population structure in the samples analysed. By the end of this workshop you will understand the strengths and limitations of this approach, and further analyses you may be able to perform following PCA. You will also be able to understand why population structure should be taken into account, and how PCA can be used to subset a dataset into panmectic populations for use with other downstream analyses.


In this workshop, you will:

  • Learn about principle components analyses (PCA) is

  • Gain some gamilarity with the popular genomics toolkit PLINK

  • Use PLINK to generate eigenvalues for PCA from a VCF file

  • Gain more familarity with R

  • Understand output from PCA pca command, and use this to generate a PCA plot


Inferring population structure

Inferring ancestry differences among individuals from different populations, or identifying population structure has been motivated by many applications:

  • Population genetics

  • Genetic association studies

  • Prediction of whether an individual will have particular traits (for instance in people that have the duffy antigen that allows them to have resistance to malaria)

Until recently we were able to stratify individuals into populations using few markers, however the decrease in cost and improvements in technology mean that is now feasible to use whole genomes to perform these kinds of analyses. This means that we can now have very fine scaled analyses of populations based on many thousands of markers.


Inferring population structure with PCA

Principal components analysis (PCA) is the most commonly recognised and used method for identifying population structure.

This analysis allows us to calculate principle components (PCs) that explain the differences between individuals in genetic data. The naming of each of these components is based on the percentage of variation in the data they explain, with PC1 being the most explanatory PC of the data. For instance if PC1 explains 61%, this means that all of the other PCs most account for 39% of the variance observed in the data. PC2 will have the next largest contribution to the genetic variance, for example PC2 may contribute 20%. The overall variance should add up to 100% and so it PC1 and PC2 explain 81% of the variance, the remaining PCs will explain the maining 19%.

How does PCA work?

The following are a summary of the key steps behind how PCA works

  • Subtract the mean
    • You need to subtract the mean from each data dimension
    • The mean subtracted is the average across each dimension
    • This produces a data set whose mean is zero
  • Calculate the covariance matrix
    • Covariance is always measured between 2 dimensions.
    • If we have more than 2 dimensions, there is more than one covriance measurement to be calculated
    • If we have 3 dimensions, x,y,z, then we need cov(x,y), cov(y,z) and cov(x,z)
  • Calculate the eigenvectors and eigenvalues of the covariance matrix
  • Choosing components and forming a feature vector
    • The eigenvector with the highest eigenvalue is the principle component of the data set and the most significant relationship between the data dimensions.
    • These needed to be ordered to give the components in the order of significance

It is important to understand the principle behind how a PCA works, however we will be using plink to perform the generation of the eigenvalues for us, which will perform these stages, and allow us to directly plot them in R.

More information of plink can be found here.


The workshop

All these steps should be carrie dout in the linux server.

Generating eigenvalues

Using the files plink generated, we can now generate the eigenvalues

plink --bfile file --pca --chr-set 36 no-xy

Here the –bfile means use a bed file and fam,ped and bim files with the prefix file. The chr-set and no-xy we also have to use here. The –pca part of the command means generate eigenvalues


Understanding the output files

You will notice that following this command we have two new output files, they will be file.eigenvec and file.eigenval. The eigenval tells you in order of each PC ( so PC1,PC2….) the percentage each eigenvalue contributes to the variance. The eigenvec contains the coordinates for each sample. This file has no headers, and is tab seperated and contains the sample name in columns one and two, and then subsequently the eigenvals for each PC.

These values can then be outputted and used to generate a plot in R


Downloading the eigenvalue files

You could simply download the eigen value files (Population_1_ethiopia_eigenval.txt and Population_2_ethiopia_eigenval.txt) from the zip file here.

But to learn how to download files from a server, read on.

We now want to copy the eigenvalue to a local folder. To do this we will use the secure copy command (scp) we used previously. In a terminal window that is not logged into the server, navigate to the downloads folder using the cd command. In the window that is still logged in, using the following command to determine the path that you are currently in and want to download from

pwd

This is the path we want to copy from. In this example we are going to copy from a file called output.eigenval from a path called /path/on/server/ to the local downloads directory

Copy over the relevant data to this new folder using the scp command using the following command in the terminal window that is not logged in, like so:

scp username@login.yarcc.york.ac.uk:/path/on/server/Population_?_ethiopia_eigenval.txt .

Note how we use the wild card symbol (?) here to get two files:

Population_1_ethiopia_eigenval.txt and Population_2_ethiopia_eigenval.txt.


If you are working from a windows PC, you will need to use winscp instead. The guide and installation process is explained here.


Generating a PCA plot in R

This section will be using R studio. You should have this installed already. If not ask for some assistance and we will help with this. The first thing we will need to do is open R studio and set the current directory to the folder which you have the eigenvec file in (Downloads in our case).


  • First, open R Studio from the Start menu.

  • Then start a new R script, or open your old one.

  • Then, check what the current working directory is by typing the following command in the console window.

getwd()

This will give us the path that R will currently be loading its data from. If this is the folder than the eigenvec file we created is in. If it is not we can correct this by setting the directory to the folder it is in (Downloads in our case)

setwd("/Users/Sarah/Downloads")

Important: Keep records of your commands in your R script.

By default, the eigenvec file that we have is seperated by spaces, we need to tell R this, so that it can recognise that there are multiple columns in the file. We can specify this with sep=" " when we read the file in. In the following, we will be saving the information from the file into an object called pca_input using the “<-” symbol. (Of course, like all object names in R, you can call it anything you like. You could call it chocolate_fish, but that wouldn’t help you to remember what the data object contains).

pca1 <- read.table("Population_1_ethiopia_eigenval.txt",sep=" ",header=F)

And get the file (Population_2_ethiopia_eigenval.txt), using a similar command. We suggest you call this pca2.

There are several important things to note here. The file name needs to be enclosed within double quotes, the sep command specifies how columns are defined, and header=F, means that the first line of the file is not column names. Otherwise the first sample and it’s eigenvalues will be used as column names

We can then plot by the first and second principle components using the plot command. Because we have no headers in this file, R arbitrarily gives these are V number, so the first two columns are called V1 and V2. These contain the sample names, V3 corresponds with PC1, V4 with PC2 and so on. We can plot the first and second PCs using the following command

plot(data=pca1, V3~V4))

Using the ggplot2 package.

R has extra tools called packages you can very easily install. A popular package we like is called ggplot2. This package has advanced plots are are easy to customise. We can install this packge with the following command

install.packages("ggplot2")

We can then load this any time we want to use it after it has been installed by using the following command

library("ggplot2")

This package then has several different types of plots called geoms that we can use with the data, after we have specified the dataset in the first part of the command as follows:

ggplot(data=pca1, aes(V3,V4)) + geom_point()

Here we are using geom_point with default parameters, this takes the x,y from data_input, columns, V3 and V4.

However we can also increase the size of the points, or colour the points. Examples of this can be seen here on this cheatsheet.


Try it yourself: plot pca2 by modifying the commands above.


Summary

  • What is PCA? What can we use it to do?
  • What are the key steps involved in the generation of eigenvalues for PCA?
  • We should now be familiar with plink and how to use it to generate eigenvalues
  • We should have been able to take the files off the server and use locally in R to produce a plot
  • We should have gained more familiarity with R and learnt how to generate some simple plots

The End

This was the last workshop,here is what we did.

  • In Workshop 1 we downloaded influenza virus sequences, and used MEGA phylogenetics software to analyse them.

  • In Workshop 2 we ran though some linux training, so we would work on a server.

  • In Workshop 3 we mapped reads to a refernce genome, and processed the bam file.

  • In Workshop 4 we used the bam file to identfy small genetic variants (SNPs and indels).

  • In Workshop 5 we used the SNPs and indels to examine genomic diversity.

  • In this workshop we used the SNPs and indels to examine population structure.

Wow. Busy week!


Link back to main course page