In this workshop, we will graph some of the results we generated in Workshop 2, and identify some of the markers that are significantly associated with the glucosinolate trait. The examples in this workshop continue to use the results from the small dataset. When you’re ready, work through all of the workshops using the large dataset, which you can find here or from the Data Sets tab in the VLE.
By doing this practical and carrying out the independent study, the successful student will be able to:
Start RStudio from the Start menu
In RStudio, set your working directory to your scripts folder
Open the results file you created in workshop 2, and take a look at the first five rows
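The code for reading the file back in isn’t shown in the handout, so here is a minimal sketch. The file name results.txt is an assumption — use whatever name you gave the file when you saved your results in workshop 2.

```r
#Read the workshop 2 results back in, using the gene names as row names
#(the file name "results.txt" is an assumption - substitute your own)
results <- read.delim("results.txt", row.names=1)
results[1:5,]
```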
## Df Sum.Sq Mean.Sq F.value P.value R2
## A_JCVI_10003 1 621.42226 621.42226 0.63884240 0.4278372 0.0123713541
## A_JCVI_10018 1 29.86376 29.86376 0.03033915 0.8624125 0.0005945316
## A_JCVI_10019 1 470.69649 470.69649 0.48242565 0.4904779 0.0093706860
## A_JCVI_10026 1 133.66851 133.66851 0.13607770 0.7137389 0.0026610898
## A_JCVI_10036 1 155.57474 155.57474 0.15844805 0.6922509 0.0030972020
## Intercept Gradient
## A_JCVI_10003 48.88972 -1.3402828
## A_JCVI_10018 40.44312 0.2956792
## A_JCVI_10019 51.01240 -2.7162996
## A_JCVI_10026 47.96460 -2.4231560
## A_JCVI_10036 36.88214 3.8151561
We have a lot of statistics here, and the p values tell us which genes show the strongest relationship with our trait. But how do we know whether these p values differ from what we would expect if there were no true associations at all (the null hypothesis)?
We can use two functions to help with this: qqnorm() produces a normal Q-Q plot of the variable, and qqline() adds a reference line
Let’s try it with our results
qqnorm(results$P.value, pch = 1, frame = FALSE)
qqline(results$P.value, col = "steelblue", lwd = 2)
As you can see, these functions have computed the quantiles for a theoretical normal distribution of p values (x axis), and plotted them against the quantiles of your p values (y axis). If there were no significant results in your data, we would expect the points to follow the line. There is some divergence from the line here, but how significant is it?
We can also draw these so-called Q-Q (quantile-quantile) plots using a package called car, whose qqPlot() function also adds confidence intervals
library("car")
## Loading required package: carData
qqPlot(results$P.value)
## [1] 43 26
Much nicer! (The two numbers printed are the row indices of the most extreme points.) As we can now see, some of our results fall outside the confidence intervals
So, which genes have the best p values? Let’s take another look at our data, but this time, let’s sort it by ascending p value so that the best results are at the top of the table. We can do that using the order() function to sort the rows by their p value. We’ll take a look at the first five rows.
results_sorted <- results[order(results$P.value),]
results_sorted[1:5,]
## Df Sum.Sq Mean.Sq F.value P.value R2
## A_JCVI_10238 1 11408.052 11408.052 14.986357 0.0003092814 0.2271130
## A_JCVI_10398 1 10306.477 10306.477 13.165687 0.0006596128 0.2051827
## A_JCVI_10097 1 7299.818 7299.818 8.671854 0.0048576830 0.1453257
## A_JCVI_10299 1 5695.814 5695.814 6.522668 0.0136817492 0.1133930
## A_JCVI_10155 1 5091.926 5091.926 5.753102 0.0201550254 0.1013707
## Intercept Gradient
## A_JCVI_10238 7.503403 3.7931699
## A_JCVI_10398 87.651634 -6.2952872
## A_JCVI_10097 16.374764 2.4834231
## A_JCVI_10299 75.122442 -0.1349702
## A_JCVI_10155 12.505862 2.5569946
Great! Save this file if you like. I wonder if the top few genes are all in the same place on the genome? There is an additional file provided called osr_dir.txt, which has information about the chromosome and map positions for every gene in the dataset. Let’s open it and take a look.
dir <- read.delim("http://www-users.york.ac.uk/~ah1309/BigData/data/osr_dir.txt")
dim(dir)
## [1] 13597 3
dir[1:5,]
## Gene Graph Position
## 1 A_JCVI_33904 1 1
## 2 A_JCVI_27430 1 2
## 3 A_JCVI_26834 1 3
## 4 A_JCVI_22505 1 4
## 5 A_JCVI_4601 1 5
Let’s open it again, but this time, set the first column to be the row names
dir <- read.delim("http://www-users.york.ac.uk/~ah1309/BigData/data/osr_dir.txt", row.names=1)
dim(dir)
## [1] 13597 2
dir[1:5,]
## Graph Position
## A_JCVI_33904 1 1
## A_JCVI_27430 1 2
## A_JCVI_26834 1 3
## A_JCVI_22505 1 4
## A_JCVI_4601 1 5
Much better. In this file, the Graph column designates different chromosomes, and the Position column tells you what order the genes are in on that chromosome. This should be enough information to put our results in genome order.
Let’s start by merging the files again as we did in workshop 2. Take a look if you like.
results_dir <- merge(dir, results, by="row.names")
results_dir[1:5,]
## Row.names Graph Position Df Sum.Sq Mean.Sq F.value P.value
## 1 A_JCVI_10003 2 3363 1 621.42226 621.42226 0.63884240 0.4278372
## 2 A_JCVI_10018 2 3385 1 29.86376 29.86376 0.03033915 0.8624125
## 3 A_JCVI_10019 2 1122 1 470.69649 470.69649 0.48242565 0.4904779
## 4 A_JCVI_10026 2 1945 1 133.66851 133.66851 0.13607770 0.7137389
## 5 A_JCVI_10036 1 5014 1 155.57474 155.57474 0.15844805 0.6922509
## R2 Intercept Gradient
## 1 0.0123713541 48.88972 -1.3402828
## 2 0.0005945316 40.44312 0.2956792
## 3 0.0093706860 51.01240 -2.7162996
## 4 0.0026610898 47.96460 -2.4231560
## 5 0.0030972020 36.88214 3.8151561
Let’s order them by p value again and see if the top results are in the same place in the genome
results_sorted2 <- results_dir[order(results_dir$P.value),]
results_sorted2[1:5,]
## Row.names Graph Position Df Sum.Sq Mean.Sq F.value
## 38 A_JCVI_10238 1 528 1 11408.052 11408.052 14.986357
## 67 A_JCVI_10398 2 655 1 10306.477 10306.477 13.165687
## 18 A_JCVI_10097 1 8093 1 7299.818 7299.818 8.671854
## 47 A_JCVI_10299 2 4320 1 5695.814 5695.814 6.522668
## 24 A_JCVI_10155 1 1837 1 5091.926 5091.926 5.753102
## P.value R2 Intercept Gradient
## 38 0.0003092814 0.2271130 7.503403 3.7931699
## 67 0.0006596128 0.2051827 87.651634 -6.2952872
## 18 0.0048576830 0.1453257 16.374764 2.4834231
## 47 0.0136817492 0.1133930 75.122442 -0.1349702
## 24 0.0201550254 0.1013707 12.505862 2.5569946
Hmm, they don’t seem to be very close to each other, but let’s graph them all in genome order to be sure!
As the best p values are likely to be very small, they probably won’t be very clear on a graph. Let’s do a simple transformation so that they become the highest points on the graph. The most common way to do this is to use the -log10 of the p values. We can create a new column directly in the dataframe like this:
results_sorted2$logP <- -log10(results_sorted2$P.value)
results_sorted2[1:5,]
## Row.names Graph Position Df Sum.Sq Mean.Sq F.value
## 38 A_JCVI_10238 1 528 1 11408.052 11408.052 14.986357
## 67 A_JCVI_10398 2 655 1 10306.477 10306.477 13.165687
## 18 A_JCVI_10097 1 8093 1 7299.818 7299.818 8.671854
## 47 A_JCVI_10299 2 4320 1 5695.814 5695.814 6.522668
## 24 A_JCVI_10155 1 1837 1 5091.926 5091.926 5.753102
## P.value R2 Intercept Gradient logP
## 38 0.0003092814 0.2271130 7.503403 3.7931699 3.509646
## 67 0.0006596128 0.2051827 87.651634 -6.2952872 3.180711
## 18 0.0048576830 0.1453257 16.374764 2.4834231 2.313571
## 47 0.0136817492 0.1133930 75.122442 -0.1349702 1.863858
## 24 0.0201550254 0.1013707 12.505862 2.5569946 1.695617
As you can see, this has created a new column called logP containing the -log10-transformed p values. Now we can graph them!
Let’s create a new object for chromosome 1 using the subset function again. Don’t forget to look at the new objects or check the dimensions to make sure the function has done what you expected it to do!
#This will subset results_sorted2, keeping only the rows that have a 1 in the Graph column, and saving as a new object called chr1
chr1 <- subset(results_sorted2, results_sorted2$Graph==1)
dim(chr1)
## [1] 44 12
Let’s try a simple plot, with map position on the x axis and our transformed p values on the y axis
plot(chr1$Position, chr1$logP)
This does the job, but it is pretty basic. As our data is already tidy, we can also use ggplot() to make a better graph.
library(ggplot2)
Here’s a basic plot…
ggplot(chr1, aes(x=Position, y=logP)) + geom_point()
Changing the point size and using classic theme…
ggplot(chr1, aes(x=Position, y=logP)) + geom_point(size=2) + theme_classic()
Try playing around with some of the ggplot settings to get a graph that you like the look of, and clearly describes the data.
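If you would like to see both chromosomes at once rather than subsetting, one option (not part of the original workshop steps, but using the same ggplot2 tools) is to facet the plot by the Graph column, giving one panel per chromosome:

```r
#Plot every gene, split into one panel per chromosome using the Graph column
ggplot(results_sorted2, aes(x=Position, y=logP)) +
  geom_point(size=2) +
  theme_classic() +
  facet_wrap(~Graph)
```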
Now we can see that one point is higher than the others, but is it significant? What was the p value?
min(results_dir$P.value)
## [1] 0.0003092814
Normally, a p value this low would be highly significant, but with big data we need to be even more stringent. This is because the more tests you perform, the more likely it is that some apparent hits are false positives (Type 1 errors). So, how many tests did we perform in our “for” loop regressions? Simple: it is the number of rows in our results table…
nrow(results_dir)
## [1] 73
The simplest way to account for Type 1 error is to adjust the threshold at which we consider a result “significant”. The most commonly used, and most stringent, adjustment is the Bonferroni correction. The calculation is simple: the usual significance threshold (normally 0.05) divided by the number of tests. We’ll also transform it using -log10, as we did for the p values
#The usual threshold divided by the number of tests (73 here), -log10 transformed
bon <- -log10(0.05/nrow(results_dir))
bon
## [1] 3.164353
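Equivalently, base R’s p.adjust() function can apply the Bonferroni correction to the p values themselves (multiplying each by the number of tests) instead of lowering the threshold; comparing the adjusted values to 0.05 gives the same answer. As a quick cross-check:

```r
#Bonferroni-adjust the p values themselves and count how many stay below 0.05
p_bon <- p.adjust(results_dir$P.value, method="bonferroni")
sum(p_bon < 0.05)
```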
Using this threshold, we can draw a line on our plot, so that any points passing the Bonferroni threshold will be more obvious
ggplot(chr1, aes(x=Position, y=logP)) + geom_point(size=2) + theme_classic() + geom_hline(yintercept = bon)
It looks like our top point is the only one that passes the threshold, but of course Bonferroni is very stringent. A less stringent approach is to adjust the p values themselves, controlling the False Discovery Rate (FDR), but there is a bit more work required to calculate it.
The first step is to put our results in p value order. We did this earlier and called our object results_sorted2
## Row.names Graph Position Df Sum.Sq Mean.Sq F.value
## 38 A_JCVI_10238 1 528 1 11408.052 11408.052 14.986357
## 67 A_JCVI_10398 2 655 1 10306.477 10306.477 13.165687
## 18 A_JCVI_10097 1 8093 1 7299.818 7299.818 8.671854
## 47 A_JCVI_10299 2 4320 1 5695.814 5695.814 6.522668
## 24 A_JCVI_10155 1 1837 1 5091.926 5091.926 5.753102
## P.value R2 Intercept Gradient logP
## 38 0.0003092814 0.2271130 7.503403 3.7931699 3.509646
## 67 0.0006596128 0.2051827 87.651634 -6.2952872 3.180711
## 18 0.0048576830 0.1453257 16.374764 2.4834231 2.313571
## 47 0.0136817492 0.1133930 75.122442 -0.1349702 1.863858
## 24 0.0201550254 0.1013707 12.505862 2.5569946 1.695617
Next we need to create a new column which contains the rank of the p value in each row. Now that our data is in order, we can do this easily by filling the column with the sequence from 1 to the number of rows
#v=number of rows in our file
v=nrow(results_sorted2)
#Add new column with numbers from 1 to v
results_sorted2$Rank <- 1:v
#Take a look
results_sorted2[1:5,]
## Row.names Graph Position Df Sum.Sq Mean.Sq F.value
## 38 A_JCVI_10238 1 528 1 11408.052 11408.052 14.986357
## 67 A_JCVI_10398 2 655 1 10306.477 10306.477 13.165687
## 18 A_JCVI_10097 1 8093 1 7299.818 7299.818 8.671854
## 47 A_JCVI_10299 2 4320 1 5695.814 5695.814 6.522668
## 24 A_JCVI_10155 1 1837 1 5091.926 5091.926 5.753102
## P.value R2 Intercept Gradient logP Rank
## 38 0.0003092814 0.2271130 7.503403 3.7931699 3.509646 1
## 67 0.0006596128 0.2051827 87.651634 -6.2952872 3.180711 2
## 18 0.0048576830 0.1453257 16.374764 2.4834231 2.313571 3
## 47 0.0136817492 0.1133930 75.122442 -0.1349702 1.863858 4
## 24 0.0201550254 0.1013707 12.505862 2.5569946 1.695617 5
Now we can do the False Discovery Rate correction of our p values, using the following formula for each row in the dataset:
p value * (number of tests / rank)
Let’s add the corrected p values into a new column called FDR
#Calculate FDR
results_sorted2$FDR <- results_sorted2$P.value*(v/results_sorted2$Rank)
results_sorted2[1:5,]
## Row.names Graph Position Df Sum.Sq Mean.Sq F.value
## 38 A_JCVI_10238 1 528 1 11408.052 11408.052 14.986357
## 67 A_JCVI_10398 2 655 1 10306.477 10306.477 13.165687
## 18 A_JCVI_10097 1 8093 1 7299.818 7299.818 8.671854
## 47 A_JCVI_10299 2 4320 1 5695.814 5695.814 6.522668
## 24 A_JCVI_10155 1 1837 1 5091.926 5091.926 5.753102
## P.value R2 Intercept Gradient logP Rank FDR
## 38 0.0003092814 0.2271130 7.503403 3.7931699 3.509646 1 0.02257754
## 67 0.0006596128 0.2051827 87.651634 -6.2952872 3.180711 2 0.02407587
## 18 0.0048576830 0.1453257 16.374764 2.4834231 2.313571 3 0.11820362
## 47 0.0136817492 0.1133930 75.122442 -0.1349702 1.863858 4 0.24969192
## 24 0.0201550254 0.1013707 12.505862 2.5569946 1.695617 5 0.29426337
As you can see, the top two markers fall below the 0.05 threshold after False Discovery Rate adjustment, and can be considered significant.
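As a sanity check, base R’s p.adjust() with method="BH" computes the same rank-based quantity (the Benjamini-Hochberg procedure), with one extra step: it takes a running minimum from the bottom of the table, so adjusted values never decrease as the rank worsens. For these data the top rows should agree with our manual FDR column:

```r
#Benjamini-Hochberg adjustment via base R, for comparison with our manual FDR
results_sorted2$BH <- p.adjust(results_sorted2$P.value, method="BH")
results_sorted2[1:5, c("P.value","Rank","FDR","BH")]
```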
Let’s clean up this file and save it…
Change the first column name to “Gene”
colnames(results_sorted2)[1] <- c("Gene")
Save this file as your final results file, excluding the row names
write.table(results_sorted2, "Final results.txt", quote=F, sep="\t", row.names=F)
Try completing these tasks using your results from the full dataset. The full dataset is here if you need it.
save.image("Workshop3.Rda")