In this part of the workshop, we will look at some Brassica napus data. You are probably most familiar with B. napus as bright yellow fields of oilseed rape, but the species also includes the vegetable swede, some exotic varieties grown in places like China, and fodder crops for cattle. The complete dataset includes three files, a large dataset of gene expression data for 101 different lines of B. napus, a file with the content of glucosinolates in the seed oil for each line, and a file with information about where each gene is in the B. napus genome. For this workshop, we will work on a smaller expression dataset, but you can find the complete one you will need to complete a full analysis here, or from the Data Sets tab in the VLE.
By doing this practical and carrying out the independent study the successful student will be able to:
The words ‘folder’ and ‘directory’ mean the same thing (for most purposes). Folder is mainly a Windows term; directory is more classical and widespread.
I like to work in one folder named for the particular task. I recommend creating a folder called Big Data (or similar) containing folders called ‘data’ and ‘scripts’.
Start RStudio from the Start menu
In RStudio, set your working directory to your scripts folder
Make a new script file to carry out the rest of the work.
Open the manual page for every command you want to use
The data in OSR101_sample.txt show measurements of gene expression in a random sample of 101 Brassica napus (oilseed rape) cultivars. The gene expression measure used here is RPKM, or reads per kilobase per million aligned reads.
Read the sample data file into R
OSR101_RPKM <- read.delim("http://www-users.york.ac.uk/~ah1309/BigData/data/OSR101_sample.txt")
You should find a view of the data appears and an object called OSR101_sample can be seen in the Environment window. If you are having difficulty, Reading in data files might help.
Using this data, we would like to know which genes are involved in changing the level of glucosinolates found in the seed oil. If you want to know why glucosinolates are important, take a look here
Use class(OSR101_sample)
to see what type of data you have, and dim(OSR101_sample)
to see how large it is.
class(OSR101_RPKM)
## [1] "data.frame"
dim(OSR101_RPKM)
## [1] 100 102
It looks like the file is a data frame (contains both numbers and text), and has 100 rows, and 102 columns.
You can tell R which rows and columns you are interested in using square brackets. Rows are always the first number and columns the second, and they are seperated by a comma. If you want all rows or columns, don’t put any number.
So, let’s take a look at the first 5 rows and 6 columns of the data…
OSR101_RPKM[1:5,1:6]
## Gene A11_Bie A135_Reg A136_Roc A144_She A155_Ste
## 1 A_JCVI_10003 5.59343127 2.57476690 2.4295783 3.794030 7.92182114
## 2 A_JCVI_10016 1.26169112 0.07965001 1.6105420 0.000000 0.00000000
## 3 A_JCVI_10018 10.61947131 7.44378077 8.9981791 4.292112 8.90302512
## 4 A_JCVI_10019 2.07634949 2.09726467 2.7240726 1.931506 2.38054726
## 5 A_JCVI_1002 0.07684213 0.13582828 0.0762911 0.000000 0.07255291
Or the first 2 rows and all columns…
OSR101_RPKM[1:2,]
## Gene A11_Bie A135_Reg A136_Roc A144_She A155_Ste A168_TEM
## 1 A_JCVI_10003 5.593431 2.57476690 2.429578 3.79403 7.921821 4.2833250
## 2 A_JCVI_10016 1.261691 0.07965001 1.610542 0.00000 0.000000 0.4274322
## A176_Ver A18_Bro A187_Xia A188_Zho A25_Cap A34_Chu A42_Dar A87_Les
## 1 5.622508 6.735284 3.648835 1.994188 6.143167 8.3851445 5.622014 3.122881
## 2 1.087071 0.000000 0.000000 0.000000 1.556345 0.1158005 1.057598 2.113252
## A93_Lih D10_BAL D102_MAT D108_MON D113_NEWH D117_NOR D127_PRI
## 1 6.037484 8.676367 5.283080 3.739883 2.9882969 5.172192 4.3474426
## 2 0.000000 1.178793 1.003525 0.000000 0.6847593 0.296298 0.5308716
## D128_Q10 D131_Raf D132_RAGJ D134_RAPCR D137_ROCxLIZ D146_SIBB D170_TIN
## 1 0.2575557 6.8788506 4.333797 6.062497 3.926827 5.0722436 3.015604
## 2 1.5934889 0.9143586 0.000000 0.000000 1.036988 0.8076202 0.000000
## D173_TOP D181_WESD D182_WILR D26_CAPxMOH D3_ALT D31_CES D32_CHE_DZA
## 1 4.529879 4.879594 3.615997 4.9456027 3.989128 4.301258 4.1101159
## 2 1.980113 0.000000 0.000000 0.5624686 0.000000 0.000000 0.3345942
## D39_COR D4_AMBxCOM D46_DIP D55_EUR D6_APE D69_HAN D7_APExGIN
## 1 6.559233 5.180249 6.0447656 9.933069 5.1389542 2.713562 4.616804
## 2 2.149458 1.306387 0.5081356 0.000000 0.9748329 0.000000 1.020144
## D74_HURxNAV D75_INCxCON D76_JANS D79_JETN D8_AphRR D83_KAR D85_KRO
## 1 3.144852 6.4239522 6.0230211 1.4486174 6.404243 2.25651 5.774085
## 2 1.114730 0.4006531 0.5822538 0.5601594 0.000000 0.00000 0.000000
## IB02_AbuN IB100_Maj IB106_MoaR IB110_N01D_1330 IB111_N02D_1952
## 1 3.542750 7.2451790 4.679582 3.238331 1.623746
## 2 1.786866 0.7690683 0.000000 0.000000 0.000000
## IB120_Palm IB121_Palu IB125_POH285B IB130_Qui IB133_Ram IB139_Sam
## 1 3.086280 6.199095 4.254501 2.938252 6.245337 4.463855
## 2 1.971737 0.000000 1.398381 1.076381 1.006242 1.119638
## IB141_SenNZ IB143_ShaxWin IB15_BraS IB150_SlaS IB151_SloK
## 1 1.837676 2.390600 12.2336049 3.3055130 4.484327
## 2 0.000000 1.355802 0.7493954 0.7303966 2.275907
## IB156_Sur400_024 IB16_Bra IB164_Tai IB169_TeqxAra IB175_Tri IB178_Vig
## 1 7.499272 3.908620 4.859321 3.575879 4.354051 1.213971
## 2 0.000000 1.231517 0.000000 0.000000 1.473193 0.000000
## IB180_Vis IB184_Yor IB21_Cab IB22_Cab IB23_Can IB24_CanxCou IB28_Cas
## 1 5.302275 4.718072 5.672416 7.945805 2.429071 6.690075 3.406890
## 2 1.500229 0.000000 1.602899 0.801529 0.692106 1.361555 1.621409
## IB29_Cat IB37_ColxNic IB45_Dim IB49_Dup IB50_DwaE IB52_EngG IB53_Erg
## 1 5.401047 7.417543 4.3777038 3.3859 2.0645808 6.340437 4.2671610
## 2 2.021136 1.147303 0.4836557 0.0000 0.7983436 1.163545 0.1885771
## IB57_Exc IB58_Exp IB60_Fla IB65_GroGS IB70_HanxGas IB73_Hug IB86_LemM
## 1 6.0709772 3.100292 7.4209342 5.1588938 3.5328799 2.344014 8.019127
## 2 0.6476025 1.298741 0.3248568 0.6435064 0.7932273 0.000000 1.148474
## IB91_LicxExp IB99_MadxRec NIN R_A48_Dra R_IB177_Vic R_IB40_CouN
## 1 3.811818 4.107241 16.06488 1.5608199 2.2624017 2.56388
## 2 1.351144 0.922187 0.00000 0.5633101 0.2187095 0.00000
## R_IB78_JauCV TAP D14_Bol
## 1 0.5393794 11.7503195 3.9636929
## 2 0.0000000 0.5105257 0.9366517
As you can see, we have a load of gene names (which are the names of genes). Let’s make these the row names.
rownames(OSR101_RPKM) <- OSR101_RPKM$Gene
OSR101_RPKM[1:2,]
## Gene A11_Bie A135_Reg A136_Roc A144_She A155_Ste
## A_JCVI_10003 A_JCVI_10003 5.593431 2.57476690 2.429578 3.79403 7.921821
## A_JCVI_10016 A_JCVI_10016 1.261691 0.07965001 1.610542 0.00000 0.000000
## A168_TEM A176_Ver A18_Bro A187_Xia A188_Zho A25_Cap
## A_JCVI_10003 4.2833250 5.622508 6.735284 3.648835 1.994188 6.143167
## A_JCVI_10016 0.4274322 1.087071 0.000000 0.000000 0.000000 1.556345
## A34_Chu A42_Dar A87_Les A93_Lih D10_BAL D102_MAT
## A_JCVI_10003 8.3851445 5.622014 3.122881 6.037484 8.676367 5.283080
## A_JCVI_10016 0.1158005 1.057598 2.113252 0.000000 1.178793 1.003525
## D108_MON D113_NEWH D117_NOR D127_PRI D128_Q10 D131_Raf
## A_JCVI_10003 3.739883 2.9882969 5.172192 4.3474426 0.2575557 6.8788506
## A_JCVI_10016 0.000000 0.6847593 0.296298 0.5308716 1.5934889 0.9143586
## D132_RAGJ D134_RAPCR D137_ROCxLIZ D146_SIBB D170_TIN D173_TOP
## A_JCVI_10003 4.333797 6.062497 3.926827 5.0722436 3.015604 4.529879
## A_JCVI_10016 0.000000 0.000000 1.036988 0.8076202 0.000000 1.980113
## D181_WESD D182_WILR D26_CAPxMOH D3_ALT D31_CES D32_CHE_DZA
## A_JCVI_10003 4.879594 3.615997 4.9456027 3.989128 4.301258 4.1101159
## A_JCVI_10016 0.000000 0.000000 0.5624686 0.000000 0.000000 0.3345942
## D39_COR D4_AMBxCOM D46_DIP D55_EUR D6_APE D69_HAN
## A_JCVI_10003 6.559233 5.180249 6.0447656 9.933069 5.1389542 2.713562
## A_JCVI_10016 2.149458 1.306387 0.5081356 0.000000 0.9748329 0.000000
## D7_APExGIN D74_HURxNAV D75_INCxCON D76_JANS D79_JETN
## A_JCVI_10003 4.616804 3.144852 6.4239522 6.0230211 1.4486174
## A_JCVI_10016 1.020144 1.114730 0.4006531 0.5822538 0.5601594
## D8_AphRR D83_KAR D85_KRO IB02_AbuN IB100_Maj IB106_MoaR
## A_JCVI_10003 6.404243 2.25651 5.774085 3.542750 7.2451790 4.679582
## A_JCVI_10016 0.000000 0.00000 0.000000 1.786866 0.7690683 0.000000
## IB110_N01D_1330 IB111_N02D_1952 IB120_Palm IB121_Palu
## A_JCVI_10003 3.238331 1.623746 3.086280 6.199095
## A_JCVI_10016 0.000000 0.000000 1.971737 0.000000
## IB125_POH285B IB130_Qui IB133_Ram IB139_Sam IB141_SenNZ
## A_JCVI_10003 4.254501 2.938252 6.245337 4.463855 1.837676
## A_JCVI_10016 1.398381 1.076381 1.006242 1.119638 0.000000
## IB143_ShaxWin IB15_BraS IB150_SlaS IB151_SloK
## A_JCVI_10003 2.390600 12.2336049 3.3055130 4.484327
## A_JCVI_10016 1.355802 0.7493954 0.7303966 2.275907
## IB156_Sur400_024 IB16_Bra IB164_Tai IB169_TeqxAra IB175_Tri
## A_JCVI_10003 7.499272 3.908620 4.859321 3.575879 4.354051
## A_JCVI_10016 0.000000 1.231517 0.000000 0.000000 1.473193
## IB178_Vig IB180_Vis IB184_Yor IB21_Cab IB22_Cab IB23_Can
## A_JCVI_10003 1.213971 5.302275 4.718072 5.672416 7.945805 2.429071
## A_JCVI_10016 0.000000 1.500229 0.000000 1.602899 0.801529 0.692106
## IB24_CanxCou IB28_Cas IB29_Cat IB37_ColxNic IB45_Dim
## A_JCVI_10003 6.690075 3.406890 5.401047 7.417543 4.3777038
## A_JCVI_10016 1.361555 1.621409 2.021136 1.147303 0.4836557
## IB49_Dup IB50_DwaE IB52_EngG IB53_Erg IB57_Exc IB58_Exp
## A_JCVI_10003 3.3859 2.0645808 6.340437 4.2671610 6.0709772 3.100292
## A_JCVI_10016 0.0000 0.7983436 1.163545 0.1885771 0.6476025 1.298741
## IB60_Fla IB65_GroGS IB70_HanxGas IB73_Hug IB86_LemM
## A_JCVI_10003 7.4209342 5.1588938 3.5328799 2.344014 8.019127
## A_JCVI_10016 0.3248568 0.6435064 0.7932273 0.000000 1.148474
## IB91_LicxExp IB99_MadxRec NIN R_A48_Dra R_IB177_Vic
## A_JCVI_10003 3.811818 4.107241 16.06488 1.5608199 2.2624017
## A_JCVI_10016 1.351144 0.922187 0.00000 0.5633101 0.2187095
## R_IB40_CouN R_IB78_JauCV TAP D14_Bol
## A_JCVI_10003 2.56388 0.5393794 11.7503195 3.9636929
## A_JCVI_10016 0.00000 0.0000000 0.5105257 0.9366517
You can also use the square brackets to get rid of individual rows and columns by putting a minus sign before the number you want to get rid of.
It looks like we don’t need the first column of this data, as it now contains the same info as the row names. Let’s create a new object called OSR101_RPKM2 that has the first column deleted, and take another look at the first few rows and columns.
OSR101_RPKM2 <- OSR101_RPKM[,-1]
OSR101_RPKM2[1:5,1:5]
## A11_Bie A135_Reg A136_Roc A144_She A155_Ste
## A_JCVI_10003 5.59343127 2.57476690 2.4295783 3.794030 7.92182114
## A_JCVI_10016 1.26169112 0.07965001 1.6105420 0.000000 0.00000000
## A_JCVI_10018 10.61947131 7.44378077 8.9981791 4.292112 8.90302512
## A_JCVI_10019 2.07634949 2.09726467 2.7240726 1.931506 2.38054726
## A_JCVI_1002 0.07684213 0.13582828 0.0762911 0.000000 0.07255291
That looks much better! The row names are now the gene names, and the column names are the names of individual plant lines. The numerical data are normalised measures of gene expression called RPKM (reads per kilobase per million aligned reads).
Next, as some functions won’t work on non-numeric data, let’s check if all of the data is in a numeric format using the is.numeric question
is.numeric(OSR101_RPKM2)
## [1] FALSE
Oh dear. Let’s create a new object called OSR101_num, which is a matrix (a matrix only contains numbers), containing the contents of OSR101_RPKM2, which have been converted to numeric format by applying as.numeric to each row using sapply, and put the rownames back using row.names. We’ll then check it again to make sure it is now numeric. Finally, we’ll make sure the rownmaes are the same as the original.
OSR101_num <- as.matrix(sapply(OSR101_RPKM2, as.numeric))
is.numeric(OSR101_num)
## [1] TRUE
row.names(OSR101_num) <- row.names(OSR101_RPKM2)
Need some help? Go to the google hangout and ask us.
Now our data is nice and tidy, let’s take a look at it.
Plot the first gene using the hist function.
hist(OSR101_num[1,])
Take a look at a few more genes - do they show a similar pattern? Let’s look at all of the genes together this time.
hist(OSR101_num)
It looks like most of the genes have very small levels of expression. Let’s get rid of any that are expressed at very low levels in all accessions, as they probably aren’t being expressed.
We can use the rowmeans function to calculate mean expression values for each gene.
rowmeans <- rowMeans(OSR101_num)
If we come up with a threshold mean expression value that we want to keep (let’s say a mean value greater than or equal to 1), we can use the subset function to keep only those rows
keepmeans <- subset(OSR101_num, rowmeans >= 1)
Can you see if the dimensions of your new file are different to your old one?
## [1] 100 101
## [1] 73 101
They are! We can summarise the data now using the summary function. I’ll just look at the first five columns of the summary here:
summary(keepmeans)[,1:5]
## A11_Bie A135_Reg A136_Roc A144_She
## Min. : 0.000 Min. : 0.1859 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.295 1st Qu.: 2.5748 1st Qu.: 2.589 1st Qu.: 2.891
## Median : 5.071 Median : 5.7155 Median : 4.949 Median : 5.339
## Mean : 12.740 Mean : 12.5183 Mean : 12.861 Mean : 12.565
## 3rd Qu.: 10.079 3rd Qu.: 10.7393 3rd Qu.: 11.082 3rd Qu.: 13.317
## Max. :320.016 Max. :204.9678 Max. :296.278 Max. :227.489
## A155_Ste
## Min. : 0.000
## 1st Qu.: 2.381
## Median : 6.295
## Mean : 11.907
## 3rd Qu.: 10.099
## Max. :139.861
Double check that the keepmeans object looks how you want it to, and save this as a new datafile called OSR101_final
write.table(keepmeans, "OSR101_final.txt", quote=F, sep="\t")
If you want to save the summary data too:
write.table(summary(keepmeans), "Summary.txt", quote=F, sep="\t")
Let’s take a look at the mean RPKM for each individual in the dataset using a bar plot. Here I have also removed the line names on the x axis because they are messy.
barplot(colMeans(keepmeans), ylab="RPKM", xlab="Lines", xaxt="n")
RPKM normalises for total number of reads and the length of each gene, so we would expect all of the lines to be broadly similar. We could take out the ones that are outliers, but we might remove interesting biological phenomena!
Don’t forget to save your graphs if you like them, and you can save your session using save.image