In this workshop, we will look at how to perform some statistical tests tp identify significant associations betweent he glucosinolate trait and gene expression. The examples worked through here are for the smaller expression dataset, but you can find the complete one you will need to complete a full analysis here, or from the Data Sets tab in the VLE.
By doing this practical and carrying out the independent study the successful student will be able to:
Start RStudio from the Start menu
In RStudio, set your working directory to your scripts folder
Open the filtered data file you created in workshop 1, and make sure it still looks ok by looking at the first few rows as we did in workshop 1
OSR101_final <- read.table("OSR101_final.txt", header=T)
OSR101_final[1:5,1:5]
## A11_Bie A135_Reg A136_Roc A144_She A155_Ste
## A_JCVI_10003 5.593431 2.574767 2.429578 3.794030 7.921821
## A_JCVI_10018 10.619471 7.443781 8.998179 4.292112 8.903025
## A_JCVI_10019 2.076349 2.097265 2.724073 1.931506 2.380547
## A_JCVI_10026 2.022319 2.543542 3.243398 2.785690 2.570395
## A_JCVI_10036 1.334796 1.101064 1.678618 1.738355 1.008232
In this workshop, we want to see if any of the genes may be linked to a trait of interest. Let’s take a look at some trait data for these accessions
gluc <- read.delim("http://www-users.york.ac.uk/~ah1309/BigData/data/Glucosinolates.txt", header=T)
gluc
## Line Trait
## 1 A11_Bie 83.7
## 2 A155_Ste 16.2
## 3 A168_TEM 17.7
## 4 A176_Ver 17.6
## 5 A18_Bro 6.3
## 6 A187_Xia 45.4
## 7 A188_Zho 20.7
## 8 A25_Cap 21.1
## 9 A42_Dar 27.9
## 10 D128_Q10 79.3
## 11 D131_Raf 86.1
## 12 D173_TOP 40.0
## 13 D181_WESD 12.7
## 14 D6_APE 17.1
## 15 D7_APExGIN 58.8
## 16 D75_INCxCON 12.7
## 17 IB02_AbuN 20.8
## 18 IB100_Maj 48.2
## 19 IB106_MoaR 80.0
## 20 IB110_N01D_1330 17.8
## 21 IB111_N02D_1952 15.3
## 22 IB120_Palm 14.9
## 23 IB125_POH285B 14.2
## 24 IB130_Qui 114.3
## 25 IB133_Ram 48.4
## 26 IB141_SenNZ 75.5
## 27 IB143_ShaxWin 29.8
## 28 IB151_SloK 64.7
## 29 IB16_Bra 23.9
## 30 IB164_Tai 82.7
## 31 IB178_Vig 97.1
## 32 IB180_Vis 17.9
## 33 IB184_Yor 82.9
## 34 IB21_Cab 16.2
## 35 IB23_Can 80.5
## 36 IB24_CanxCou 17.9
## 37 IB28_Cas 27.8
## 38 IB29_Cat 17.0
## 39 IB37_ColxNic 19.5
## 40 IB45_Dim 23.8
## 41 IB49_Dup 13.8
## 42 IB50_DwaE 99.8
## 43 IB52_EngG 103.0
## 44 IB53_Erg 92.7
## 45 IB57_Exc 35.1
## 46 IB60_Fla 22.9
## 47 IB70_HanxGas 58.1
## 48 IB73_Hug 10.6
## 49 IB86_LemM 74.3
## 50 IB91_LicxExp 17.6
## 51 IB99_MadxRec 14.2
## 52 NIN 56.1
## 53 TAP 29.5
This file has a column of accession names that matches your data (Line), and a column giving percentage of glucosinolates in the seed oil (Trait)
We can merge this with our expression data using the merge function, but first we should set the rownames from the column named Line
rownames(gluc) <- gluc$Line
gluc
## Line Trait
## A11_Bie A11_Bie 83.7
## A155_Ste A155_Ste 16.2
## A168_TEM A168_TEM 17.7
## A176_Ver A176_Ver 17.6
## A18_Bro A18_Bro 6.3
## A187_Xia A187_Xia 45.4
## A188_Zho A188_Zho 20.7
## A25_Cap A25_Cap 21.1
## A42_Dar A42_Dar 27.9
## D128_Q10 D128_Q10 79.3
## D131_Raf D131_Raf 86.1
## D173_TOP D173_TOP 40.0
## D181_WESD D181_WESD 12.7
## D6_APE D6_APE 17.1
## D7_APExGIN D7_APExGIN 58.8
## D75_INCxCON D75_INCxCON 12.7
## IB02_AbuN IB02_AbuN 20.8
## IB100_Maj IB100_Maj 48.2
## IB106_MoaR IB106_MoaR 80.0
## IB110_N01D_1330 IB110_N01D_1330 17.8
## IB111_N02D_1952 IB111_N02D_1952 15.3
## IB120_Palm IB120_Palm 14.9
## IB125_POH285B IB125_POH285B 14.2
## IB130_Qui IB130_Qui 114.3
## IB133_Ram IB133_Ram 48.4
## IB141_SenNZ IB141_SenNZ 75.5
## IB143_ShaxWin IB143_ShaxWin 29.8
## IB151_SloK IB151_SloK 64.7
## IB16_Bra IB16_Bra 23.9
## IB164_Tai IB164_Tai 82.7
## IB178_Vig IB178_Vig 97.1
## IB180_Vis IB180_Vis 17.9
## IB184_Yor IB184_Yor 82.9
## IB21_Cab IB21_Cab 16.2
## IB23_Can IB23_Can 80.5
## IB24_CanxCou IB24_CanxCou 17.9
## IB28_Cas IB28_Cas 27.8
## IB29_Cat IB29_Cat 17.0
## IB37_ColxNic IB37_ColxNic 19.5
## IB45_Dim IB45_Dim 23.8
## IB49_Dup IB49_Dup 13.8
## IB50_DwaE IB50_DwaE 99.8
## IB52_EngG IB52_EngG 103.0
## IB53_Erg IB53_Erg 92.7
## IB57_Exc IB57_Exc 35.1
## IB60_Fla IB60_Fla 22.9
## IB70_HanxGas IB70_HanxGas 58.1
## IB73_Hug IB73_Hug 10.6
## IB86_LemM IB86_LemM 74.3
## IB91_LicxExp IB91_LicxExp 17.6
## IB99_MadxRec IB99_MadxRec 14.2
## NIN NIN 56.1
## TAP TAP 29.5
We also need to transpose (swap the dimensions) of our expression data so that the lines are the rows as they are in the trait data.
This is easy in R, we can just create a new object called tOSR and use the t function to transpose the data. Take another look at the file afterwards
tOSR <- t(OSR101_final)
tOSR[1:5,1:5]
## A_JCVI_10003 A_JCVI_10018 A_JCVI_10019 A_JCVI_10026 A_JCVI_10036
## A11_Bie 5.593431 10.619471 2.076349 2.022319 1.334796
## A135_Reg 2.574767 7.443781 2.097265 2.543542 1.101064
## A136_Roc 2.429578 8.998179 2.724073 3.243398 1.678618
## A144_She 3.794030 4.292112 1.931506 2.785690 1.738355
## A155_Ste 7.921821 8.903025 2.380547 2.570395 1.008232
Now we are ready merge the two objects by their rownames. Take a look - the trait data should be in the new ORSmerge object along with the expression data.
OSR_merge <- merge(gluc, tOSR, by="row.names")
OSR_merge[1:5,1:5]
## Row.names Line Trait A_JCVI_10003 A_JCVI_10018
## 1 A11_Bie A11_Bie 83.7 5.593431 10.619471
## 2 A155_Ste A155_Ste 16.2 7.921821 8.903025
## 3 A168_TEM A168_TEM 17.7 4.283325 6.187312
## 4 A176_Ver A176_Ver 17.6 5.622508 7.067371
## 5 A18_Bro A18_Bro 6.3 6.735284 4.684116
We need to clean this file up a bit by setting the rownames and removing some unnecessary columns. Notice the slight change to remove the first two columns instead of one
rownames(OSR_merge) <- OSR_merge$Line
OSR_merge <- OSR_merge[,-c(1:2)]
OSR_merge[1:5,1:5]
## Trait A_JCVI_10003 A_JCVI_10018 A_JCVI_10019 A_JCVI_10026
## A11_Bie 83.7 5.593431 10.619471 2.076349 2.022319
## A155_Ste 16.2 7.921821 8.903025 2.380547 2.570395
## A168_TEM 17.7 4.283325 6.187312 2.901607 3.458517
## A176_Ver 17.6 5.622508 7.067371 2.862368 2.955425
## A18_Bro 6.3 6.735284 4.684116 4.356354 1.879341
Much better! Check the dimensions using the dim function. They may have changed if there was trait data for only some of the lines…
Now we have our trait data and gene expression data in one object, let’s do some stats on them. First of all, let’s take a look at the first gene, and plot it against the trait data. This may look complicated, but it just says to plot column 2 against the column called “Trait”, and set the labels for the plot.
plot(OSR_merge[,2], OSR_merge$Trait, ylab="Glucosinolates", xlab=paste("RPKM for gene", colnames(OSR_merge[2])))
It looks like there may be a slightly negative relationship between the trait and gene expression for this gene. Let’s plot the graph again and use a linear model using a function called lm to fit a linear model to the data. We can then use this to fit a line through the data points on the plot
#Linear model using lm function
lm1 <- lm(OSR_merge$Trait~OSR_merge[,2])
#plot column 2 against "Trait" column, and use abline to draw a line based on the linear regression
plot(OSR_merge[,2], OSR_merge$Trait, ylab="Glucosinolates", xlab=paste("RPKM for gene", colnames(OSR_merge[2])))
abline(lm1)
Maybe a slight negative relationship, but is it significant? We can test this using an anova!
anova(lm1)
## Analysis of Variance Table
##
## Response: OSR_merge$Trait
## Df Sum Sq Mean Sq F value Pr(>F)
## OSR_merge[, 2] 1 621 621.42 0.6388 0.4278
## Residuals 51 49609 972.73
Oh, not significant at all, but that would be too easy anyway!
We can output some other details about this test using some other commands. coefficients will give us the gradient and intercept of the line we drew on the plot, and summary will give us all of the test statistics
#Coefficients
lm1$coefficients
## (Intercept) OSR_merge[, 2]
## 48.889721 -1.340283
#Summary
summary(lm1)
##
## Call:
## lm(formula = OSR_merge$Trait ~ OSR_merge[, 2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.15 -25.09 -16.04 29.07 69.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.890 9.287 5.264 2.84e-06 ***
## OSR_merge[, 2] -1.340 1.677 -0.799 0.428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.19 on 51 degrees of freedom
## Multiple R-squared: 0.01237, Adjusted R-squared: -0.006994
## F-statistic: 0.6388 on 1 and 51 DF, p-value: 0.4278
If we want to save any of these results, we need to coerce them into dataframes, and select individual results. Here I’m going to save each important result as a separate object
#Row 1 of the Anova table
anova <- as.data.frame(anova(lm1)[1,])
#Intercept of regression line
intercept = as.data.frame(coefficients(lm1))[1,1]
#Gradient of regression line
gradient = as.data.frame(coefficients(lm1))[2,1]
#R2 (correlation coefficient)
R2 <- as.data.frame(summary(lm1)$r.squared)
Now it would be nice to join all of these in the same object
We can use the c function to stick everything together
result1 <- as.data.frame(c(anova, R2, intercept, gradient))
result1
## Df Sum.Sq Mean.Sq F.value Pr..F. summary.lm1..r.squared
## 1 1 621.4223 621.4223 0.6388424 0.4278372 0.01237135
## X48.8897208547186 X.1.34028278394752
## 1 48.88972 -1.340283
Not bad, we’ll fix the horrible column names and add rownames back in later.
So, now we want to do this test and output these results, for all of the genes in the dataset. This will take a while by hand, so we should instruct R to do this for us.
The simplest way to do this is to write a “for” loop. These seem complicated at first, but the structure is quite simple. You just define the data you need to loop through, and surround what you want to do each time in curly brackets.
First, let’s tell the loop where to stop! We can count the total number of columns using the ncol function
numcol <- ncol(OSR_merge)
Next, lets create an empty dataframe with 8 columns for our results to go into called results
results <- as.data.frame(matrix(nrow = 0, ncol = 8))
Onto the loop! We need to run the whole “for”" loop all at once, so first, let’s look at each individual part.
The first part of the function will tell r which rows to loop though. We want to loop our test through each column of gene expression data which starts in column 2 and ends at numcol.
#for (i in 2:numcol){
The next part of the loop will run the linear model, but this time, we will use i (which will loop from 2 to numcol) instead of an individual column number.
#lm1 <- lm(OSR_merge$Trait~OSR_merge[,i])
Next, we will output and arrange our results as before
#anova <- as.data.frame(anova(lm1)[1,])
#intercept = as.data.frame(coefficients(lm1))[1,1]
#gradient = as.data.frame(coefficients(lm1))[2,1]
#R2 <- as.data.frame(summary(lm1)$r.squared)
#result1 <- as.data.frame(c(anova, R2, intercept, gradient))
Finally, we will paste our results into our new results dataframe using rbind. This only works if both dataframes have the same column names, so we’ll make sure they do! Finally, close the curly bracket
#colnames(results) <- colnames(result1)
#results <-rbind(results, result)
#}
Ok, here’s the full “for” loop… Don’t forget to run the whole thing in one go!
for (i in 2:numcol){
lm1 <- lm(OSR_merge$Trait~OSR_merge[,i])
anova <- as.data.frame(anova(lm1)[1,])
intercept = as.data.frame(coefficients(lm1))[1,1]
gradient = as.data.frame(coefficients(lm1))[2,1]
R2 <- as.data.frame(summary(lm1)$r.squared)
result1 <- as.data.frame(c(anova, R2, intercept, gradient))
colnames(results) <- colnames(result1)
results <-rbind(results, result1)
}
Lets take a look at the first few rows of our results file!
results [1:5,]
## Df Sum.Sq Mean.Sq F.value Pr..F. summary.lm1..r.squared
## 1 1 621.42226 621.42226 0.63884240 0.4278372 0.0123713541
## 2 1 29.86376 29.86376 0.03033915 0.8624125 0.0005945316
## 3 1 470.69649 470.69649 0.48242565 0.4904779 0.0093706860
## 4 1 133.66851 133.66851 0.13607770 0.7137389 0.0026610898
## 5 1 155.57474 155.57474 0.15844805 0.6922509 0.0030972020
## X39.8734375897093 X0.370848391401888
## 1 48.88972 -1.3402828
## 2 40.44312 0.2956792
## 3 51.01240 -2.7162996
## 4 47.96460 -2.4231560
## 5 36.88214 3.8151561
We need to put the rownames back on and sort out the column names
rownames(results) <- colnames(OSR_merge[,2:numcol])
colnames(results) <- c("Df", "Sum.Sq", "Mean.Sq", "F.value", "P.value", "R2", "Intercept", "Gradient")
results[1:5,]
## Df Sum.Sq Mean.Sq F.value P.value R2
## A_JCVI_10003 1 621.42226 621.42226 0.63884240 0.4278372 0.0123713541
## A_JCVI_10018 1 29.86376 29.86376 0.03033915 0.8624125 0.0005945316
## A_JCVI_10019 1 470.69649 470.69649 0.48242565 0.4904779 0.0093706860
## A_JCVI_10026 1 133.66851 133.66851 0.13607770 0.7137389 0.0026610898
## A_JCVI_10036 1 155.57474 155.57474 0.15844805 0.6922509 0.0030972020
## Intercept Gradient
## A_JCVI_10003 48.88972 -1.3402828
## A_JCVI_10018 40.44312 0.2956792
## A_JCVI_10019 51.01240 -2.7162996
## A_JCVI_10026 47.96460 -2.4231560
## A_JCVI_10036 36.88214 3.8151561
Much better. Save it for next time!
write.table(results, "Results.txt", quote=F, sep="\t")
Try completing these tasks using your results from the full dataset. The full dataset is here if you need it.
save.image("Workshop2.Rda")