The aim of this session is to strengthen your understanding of the concept of tidy data (Wickham 2014), introduce you to the tidyverse (Wickham 2017) and some commonly applied data tidying operations. We are covering data tidying before data import because some of the import methods will make use of tidyverse methods and generate tidy data structures.
The successful student will be able to: explain what makes a dataset tidy, use the pipe operator %>%, and carry out common tidying operations such as reshaping data from wide to long format, separating columns with regular expressions and filtering rows.
Tidy data adhere to a consistent structure which makes it easier to manipulate, model and visualise them. The structure is defined by three rules (Wickham 2014): each variable has its own column, each observation has its own row, and each value has its own cell.
It is a concept long recognised by data practitioners and is closely allied to the relational algebra of relational databases (Codd 1990). It underlies the enforced rectangular formatting of, for example, SPSS and STATA datafiles and R’s dataframe. However, the term ‘tidy data’ was popularised by Hadley Wickham (2014) and has led to a common language for data tasks and framework for tools.
There may be more than one potential tidy structure for a given data set depending on the particular analysis or visualisation you want to apply.
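As a minimal sketch with invented values, the same insecticide-style measurements can be laid out untidily, with one column per treatment, or tidily, with one variable per column:

```r
# untidy: values of one variable (mass) are spread across two columns
untidy <- data.frame(replicate = c(1, 2),
                     control   = c(4.2, 3.9),
                     treatment = c(6.1, 5.8))

# tidy: each variable (replicate, spray, mass) has its own column and
# each row is one observation
tidy <- data.frame(replicate = c(1, 2, 1, 2),
                   spray     = c("control", "control", "treatment", "treatment"),
                   mass      = c(4.2, 3.9, 6.1, 5.8))
```

The values here are made up purely for illustration; the biomass data used later in this session have exactly this wide-versus-long distinction.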
Tidying data includes getting it into ‘tidy’ format but also other tasks such as reshaping between wide and long formats, splitting and combining columns, filtering out unwanted rows and cleaning variable names.
The tidyverse (Wickham 2017) is “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” ggplot2 is one of these packages. You can install and load all of them with:
Do not do this if working on a Biology PC: it’s already installed.
install.packages("tidyverse")
library(tidyverse)
The magrittr package (Bache and Wickham 2014) is part of the tidyverse and includes the pipe operator, %>%, which can improve code readability by avoiding deeply nested function calls and cluttered intermediate variables.
For example, to apply a log-squareroot transformation you might use:
# generate some numbers
nums <- sample(1:100, size = 10, replace = FALSE)
# transformation either:
# a) nested functions
tnums <- log(sqrt(nums))
# b) intermediates
sqrtnums <- sqrt(nums)
tnums <- log(sqrtnums)
Nesting the functions means you have to read from the inside out, and creating intermediates can be cluttered. The pipe allows you to avoid both by taking the output of one operation as the input of the next. Pipes have long been used by Unix operating systems (where the pipe operator is |). The R pipe operator is %>%, for which the RStudio keyboard shortcut is Ctrl-Shift-M.
tnums <- nums %>%
sqrt() %>%
log()
This is short for
tnums <- nums %>%
sqrt(.) %>%
log(.)
where . stands for the object being passed in. In most cases you don’t need to include it, but some functions require you to (for example, when arguments are optional or there is ambiguity over which argument is meant).
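For instance (a made-up sketch; note that gsub() takes the string to modify as its third argument, x):

```r
library(magrittr)

# by default the piped value becomes the first argument:
c(1, 4, 9) %>% sqrt()                      # same as sqrt(c(1, 4, 9))

# '.' is needed to send the value to a later argument; here the
# x (third) argument of gsub():
"some text" %>% gsub("text", "words", .)
## [1] "some words"
```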
An additional benefit of using the pipe is that it makes solving problems step-by-step easier.
Data commonly need to be reshaped. The data in biomass.txt are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Also in the data file are variables indicating the replicate number and the identity of the tray in which the plant was grown.
These data are in “wide” format and can be converted to “long” format using the tidyr package function gather(). By default it collects all the values into a single column and creates a column of the column names, called the key, to indicate which column each value came from. We want to gather these columns: WaterControl, A, B, C, D, E. We do not want to gather the rep_tray column but instead want its contents to be repeated.
biomass2 <- biomass %>%
gather(key = spray,
value = mass,
-rep_tray)
file <- here::here("data", "processed", "biomass2.txt")
write.table(biomass2, file = file)
You can see the resulting file here: biomass2.txt
We sometimes have single columns which contain more than one type of encoded information. UK car number plates are an example: part of the number plate encodes a year, which you might want in an additional column for analyses by year. Birthdates are another. For the biomass2 data we could separate the replicate number from the tray identity and put them in two separate columns.
We can do this with a ‘regular expression’, or regex. A regex defines a pattern for matching text. It is a big topic and there are many tutorials; I remember a few bits and google “how to match #### regex” for the rest. A quick reference
We give the names of the new columns we want to create and the patterns matching the parts of the rep_tray value we want to go in each column.
biomass3 <- biomass2 %>%
extract(rep_tray,
c("replicate_number", "tray_id"),
"([0-9]{1,2})\\_([a-z])")
file <- here::here("data", "processed", "biomass3.txt")
write.table(biomass3, file = file)
- ( ) enclose the parts of the pattern to be captured and saved in a column.
- replicate_number is anything matching [0-9]{1,2}, i.e. 1 or 2 numbers.
- tray_id is anything matching [a-z], i.e. exactly one lowercase letter.
- \\_ between the two ( ) groups is a pattern that matches what is in rep_tray but is not to be saved.

You can see the resulting file here: biomass3.txt
The data in Y101_Y102_Y201_Y202_Y101-5.csv are proteomic data of soluble proteins from five immortalised mesenchymal stromal cell (MSC) lines.
Column names are spread over three rows but are primarily in the third row. We can read in from the third row by skipping the first two. We can also use the clean_names() function from the janitor package to improve the column names.
# define file name
library(janitor)
filesol <- here::here("data", "Y101_Y102_Y201_Y202_Y101-5.csv")
# skip first two lines
sol <- read.csv(filesol, skip = 2, stringsAsFactors = FALSE) %>%
clean_names()
stringsAsFactors = FALSE means strings are treated as character variables. Whilst analysis and visualisation often require factor variables, any processing of strings is made much easier if they are characters.
This dataset includes bovine serum proteins from the medium on which the cells were grown and proteins for which fewer than 2 peptides were detected. All such lines need to be filtered out. We can achieve this in the tidyverse way like this:
library(data.table)
sol <- sol %>%
filter(description %like% "OS=Homo sapiens") %>%
filter(x_1pep == "x")
The %like% operator from the data.table package returns TRUE for values which contain the pattern, so filter() keeps only the matching rows.
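A minimal sketch on invented description-style values:

```r
library(data.table)

descriptions <- c("AHNAK OS=Homo sapiens GN=AHNAK",
                  "ALBU OS=Bos taurus GN=ALB")

# TRUE where the pattern occurs anywhere in the value
descriptions %like% "OS=Homo sapiens"
## [1]  TRUE FALSE
```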
It would be good to add a column for the genename by processing the contents of the description. One entry from the description column looks like this:
glimpse(sol$description)
## chr [1:861] "Neuroblast differentiation-associated protein AHNAK OS=Homo sapiens GN=AHNAK PE=1 SV=2" ...
The genename is after GN=. We need to extract the part of the string with the genename and put it in a new column. The way to problem-solve your way through this is to work with one value, carrying out one operation at a time, until you have worked out what to do, before implementing it on the entire dataset.
# extract the first value of the description to work with
one_description <- sol$description[1]
We need to extract the part of the string after GN=. We can do this with a regex:
# extract the part of the string
one_genename <- one_description %>%
str_extract("GN=[^\\s]+")
one_genename
## [1] "GN=AHNAK"
- [ ] means a set of characters.
- ^ means ‘not’ when inside [ ].
- \s is white space.
- \ before a character is an escape to indicate that the next character should not be taken literally (it is doubled to \\ because the pattern is written inside an R string).
- + means one or more.

So GN=[^\\s]+ means GN= followed by one or more characters that are not whitespace. This means the pattern stops matching at the first white space after “GN=”.
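To see why the whitespace class matters, compare with a greedy pattern that does not stop at spaces (a quick check, reusing the example description from above):

```r
library(stringr)

d <- "Neuroblast differentiation-associated protein AHNAK OS=Homo sapiens GN=AHNAK PE=1 SV=2"

str_extract(d, "GN=.+")        # runs on to the end of the string
## [1] "GN=AHNAK PE=1 SV=2"

str_extract(d, "GN=[^\\s]+")   # stops at the first space
## [1] "GN=AHNAK"
```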
We’re close. Now we will drop the GN part by splitting the string on the =:
# extract the part of the string
one_genename <- one_description %>%
str_extract("GN=[^\\s]+") %>%
strsplit(., "=", fixed = TRUE)
one_genename
## [[1]]
## [1] "GN" "AHNAK"
Note the . representing the object being passed in (“GN=AHNAK” in this case). Including it makes explicit which argument of strsplit() (https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/strsplit) receives the piped value: it is the x argument, not the split argument.
The output value of strsplit() is a list whose single element is a character vector holding the two strings from before and after the “=”. We want the second of those strings. Lists like this, in which each element is a simple vector, are often easier to work with after flattening to a character vector; the unlist() function achieves this:
# extract the part of the string
one_genename <- one_description %>%
str_extract("GN=[^\\s]+") %>%
strsplit(., "=", fixed = TRUE) %>%
unlist()
one_genename
## [1] "GN" "AHNAK"
Then, to access the second element of the character vector, we use the . again:
# extract the part of the string
one_genename <- one_description %>%
str_extract("GN=[^\\s]+") %>%
strsplit(., "=", fixed = TRUE) %>%
unlist() %>%
.[2]
one_genename
## [1] "AHNAK"
We know how to get the genename for a single value and now need to apply this to every value in the column. This can be achieved with a for loop:
# add column for genename
for (i in 1:length(sol$description)) {
sol$genename[i] <- sol$description[i] %>%
str_extract("GN=[^\\s]+") %>%
strsplit(., "=", fixed = TRUE) %>%
unlist() %>%
.[2]}
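As an aside, a sketch of a vectorised alternative (shown here on invented descriptions rather than sol itself): str_extract() already works on whole vectors, and stringr’s str_remove() can then drop the “GN=” prefix, avoiding the loop. This assumes every description contains a “GN=” entry; rows without one would get NA.

```r
library(stringr)
library(magrittr)

# invented stand-ins for sol$description
descriptions <- c("Protein AHNAK OS=Homo sapiens GN=AHNAK PE=1 SV=2",
                  "Protein p53 OS=Homo sapiens GN=TP53 PE=1 SV=4")

genenames <- descriptions %>%
  str_extract("GN=[^\\s]+") %>%  # "GN=AHNAK" "GN=TP53"
  str_remove("GN=")              # "AHNAK"   "TP53"
```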
Can you add a column for the top protein identifier? This is the first Uniprot ID after the “1::” in the accession column. I suggest splitting on the semi-colon, then the double colon. Develop the solution by using a suitable entry from the accession column (i.e., not the first one, as it has only one entry).
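The same work-with-one-value approach applies. A sketch on an invented accession-style value (the real format may differ; check an actual entry from sol$accession first):

```r
library(magrittr)

# invented value with the assumed "1::ID;2::ID" shape
one_accession <- "1::P12345;2::Q67890"

one_accession %>%
  strsplit(";", fixed = TRUE) %>% unlist() %>% .[1] %>%   # "1::P12345"
  strsplit("::", fixed = TRUE) %>% unlist() %>% .[2]      # "P12345"
## [1] "P12345"
```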
Can you create a second dataframe, sol2, in which the protein abundances are in a single column and the cell lineage is indicated in another? All the other variables should also be in the new data frame.
## [1] "y101_a" "y101_b" "y101_c" "y102_a" "y102_b" "y102_c" "y201_a"
## [8] "y201_b" "y201_c" "y202_a" "y202_b" "y202_c" "y1015_a" "y1015_b"
## [15] "y1015_c"
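A sketch of the reshaping, shown on an invented two-row stand-in for sol with just two of the abundance columns listed above:

```r
library(tidyr)

# invented mini version of sol for illustration
sol_demo <- data.frame(genename = c("AHNAK", "TP53"),
                       y101_a   = c(1.2, 0.8),
                       y101_b   = c(1.1, 0.9),
                       stringsAsFactors = FALSE)

# gather the abundance columns; other variables are repeated
sol2_demo <- gather(sol_demo,
                    key = lineage_rep,
                    value = abundance,
                    y101_a, y101_b)
```

On the real data you would gather all fifteen abundance columns, keeping the remaining variables as they are.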
Can you also create separate columns in sol2 for the cell line and the replicate?
Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.
for the cell line and the replicate?Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.
Codd, E F. 1990. The Relational Model for Database Management: Version 2. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23.
———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.