The aim of this session is to show you how to organise your work in a logical, consistent and reproducible way using RStudio Projects, appropriate directory structures and portable code.
The successful student will be able to:
Project: a discrete piece of work which may have a number of files associated with it such as data, several scripts for analysis or production of the output (application, report etc).
|--stem_cell_proteomics
   |--data
      |--raw
      |--processed
   |--analysis
      |--main
      |--accessory
   |--figures
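A structure like this can also be created from R rather than by hand. The following is a minimal sketch using base R only; it assumes that raw and processed sit inside data and that main and accessory sit inside analysis, and it builds everything in a temporary folder (a hypothetical location chosen so that nothing is written to your real files).

```r
# Hypothetical location: a temporary folder, used only for illustration
project <- file.path(tempdir(), "stem_cell_proteomics")

# Assumed nesting: raw and processed inside data, main and accessory inside analysis
subdirs <- c("data/raw", "data/processed",
             "analysis/main", "analysis/accessory",
             "figures")

# recursive = TRUE creates any missing parent folders as well
for (d in subdirs) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# list every folder that now exists, relative to the project folder
list.dirs(project, full.names = FALSE, recursive = TRUE)
```

In real work you would replace the temporary folder with the directory in which you keep your projects.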
Path: the location of a filesystem object (i.e., file, directory or link).
Useful orientation commands:
getwd() prints your working directory (where you are).
dir() lists the file contents of your working directory.
There is a very useful way round having to think about path issues too much: using the here package. The here() function in the here package returns the top level directory of the RStudio project. We will discuss this below.
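As a rough illustration of the orientation commands, the sketch below prints the working directory, lists its contents and then builds a relative and an absolute path to the same file. The file name data/beewing.txt is only used to build the strings; the code does not require the file to exist.

```r
# where am I?
wd <- getwd()
wd

# what is in the working directory?
dir()

# a relative path is interpreted starting from the working directory
relative_path <- file.path("data", "beewing.txt")
relative_path

# the equivalent absolute path starts from the top of the filesystem
absolute_path <- file.path(wd, relative_path)
absolute_path
```

The relative path is the same on every machine, whereas the absolute path depends on where the working directory happens to be, which is why absolute paths make code less portable.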
An RStudio project is associated with a directory.
You create a new project with File | New Project…
When a new project is created RStudio:
Using a project helps you manage file paths. The working directory is the project directory (i.e., the location of the .Rproj file).
You can open a project with:
When you open a project, a new R session starts and various settings are restored to their condition when the project was closed.
Suggested reading: Chapter 2, Project-oriented workflow, of What They Forgot to Teach You About R (Bryan and Hester, n.d.).
here()
If you are not working with a Project, the here() function in the here package returns the working directory when the package was loaded.
here::here()
## [1] "M:/web/58M_BDS_2019"
The colon notation packagename::functionname() is an alternative to first using library(packagename) then functionname(). It’s useful when you only want to use a package a few times in a session or when you’ve loaded two packages which have functions with the same name and you want to specify which package should be used.
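The two styles can be compared with a function from the stats package, which ships with R. This is a small sketch; the three values are just illustrative numbers.

```r
# illustrative values only
x <- c(3.6, 3.7, 3.8)

# colon notation: name the package each time the function is used
m1 <- stats::median(x)

# library() first, then the bare function name
library(stats)
m2 <- median(x)

# both routes call the same function and give the same answer
identical(m1, m2)
```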
here() can be used to build paths in a way that will transport to other systems, i.e., it’s an alternative to typing the relative paths.
For example, to read in beewing.txt (the left wing widths of 100 honey bees in mm):
# the code to help you understand how it works is included
# what is the wd
here::here()
## [1] "M:/web/58M_BDS_2019"
# what is in the wd
dir(here::here())
## [1] "58M_BDS_2019.Rproj"
## [2] "data"
## [3] "figs"
## [4] "functions"
## [5] "index.html"
## [6] "index.Rmd"
## [7] "L1_Introduction_to_Data Science.pdf"
## [8] "pics"
## [9] "README.md"
## [10] "refs"
## [11] "workshops"
# what is in the data folder in the wd
dir(here::here("data"))
## [1] "beewing.txt" "biomass.txt"
## [3] "chaff.txt" "Icd10Code.csv"
## [5] "processed" "structurepred.txt"
## [7] "Y101_Y102_Y201_Y202_Y101-5.csv"
# build the path to the file.
file <- here::here("data", "beewing.txt")
# what's in file
file
## [1] "M:/web/58M_BDS_2019/data/beewing.txt"
# note that it builds the path you request, data/beewing.txt, even if it doesn't exist
# for example
nonsense <- here::here("bob", "sue", "banana.txt")
nonsense
## [1] "M:/web/58M_BDS_2019/bob/sue/banana.txt"
# read the file in
bee <- read.table(file, header = TRUE)
str(bee)
## 'data.frame': 100 obs. of 1 variable:
## $ wing: num 3.6 3.7 3.8 3.8 4 4 4 4 4.1 4.1 ...
When working in a Project, here::here() returns the top level directory of the RStudio project. This is the directory in which the .Rproj file is located, and it has the same name as that file.
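Because a path is built whether or not anything exists at that location, it can be worth checking with file.exists() before trying to read a file. The sketch below is self-contained: tempdir() stands in for the project directory that here::here() would return, and beewing_demo.txt is a small hypothetical file written only for this illustration.

```r
# tempdir() stands in for the project directory that here::here() would return
project_dir <- tempdir()

# a file that does exist: created here purely for illustration
real_file <- file.path(project_dir, "beewing_demo.txt")
writeLines(c("wing", "3.6", "3.7"), real_file)

# a path that can be built but points at nothing
missing_file <- file.path(project_dir, "bob", "sue", "banana.txt")

# check each path
file.exists(real_file)
file.exists(missing_file)

# only read the file if it is really there
if (file.exists(real_file)) {
  demo <- read.table(real_file, header = TRUE)
  str(demo)
}
```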
Often we want to write to files. My main reasons for doing so are to save copies of data that have been processed and to save graphics.
To write a dataframe to a plain text file:
# in this case I will just write the bee data frame to another file
# in a folder I have already (called processed)
file <- here::here("data", "processed", "bee2.txt")
write.table(bee, file)
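One way to check that a file was written as intended is to read it back in. The sketch below uses a three-row stand-in for the bee data and a temporary file (both hypothetical), and sets row.names = FALSE so that the row numbers are not written as an extra column; that option is a choice made for this example, not something the code above requires.

```r
# a small stand-in for the bee data frame
bee_demo <- data.frame(wing = c(3.6, 3.7, 3.8))

# a temporary file is used so nothing is written into your project
out_file <- tempfile(fileext = ".txt")

# row.names = FALSE leaves out the row numbers; quote = FALSE leaves out quote marks
write.table(bee_demo, out_file, row.names = FALSE, quote = FALSE)

# look at the raw contents of the file
readLines(out_file)

# read it back in and compare with the original
bee_back <- read.table(out_file, header = TRUE)
all.equal(bee_demo, bee_back)
```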
To create and save a ggplot:
# ggplot2 must be loaded (it is also loaded by the tidyverse package)
library(ggplot2)
ggplot(data = bee, aes(x = wing)) +
  geom_density() +
  theme_classic()
file <- here::here("figs", "beehist.png")
ggsave(file)
Imagine there is no inbuilt function to calculate the mean. We will write our own function to calculate it to demonstrate the principle of function writing on a simple example.
file <- here::here("data", "beewing.txt")
bee <- read.table(file, header = TRUE)
str(bee)
## 'data.frame': 100 obs. of 1 variable:
## $ wing: num 3.6 3.7 3.8 3.8 4 4 4 4 4.1 4.1 ...
If you were to calculate the mean bee wing width by hand you would sum all the values in bee$wing and divide by the total number of values. R makes this easy because it calculates things ‘elementwise’, i.e., it applies an operation to every element of a vector so that:
sum(bee$wing)
## [1] 455
adds up all the values.
You should remember this from previous work. This is unusual amongst programming languages, where you often need to use a loop to iterate through a vector, and it is very useful where data analysis and visualisation are the main tasks.
We need to divide that sum by the length of the vector to get the mean:
sum(bee$wing) / length(bee$wing)
## [1] 4.55
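To see what working on a whole vector saves you, the sketch below adds up a few values with an explicit loop, as many other languages would require, and compares the result with sum(). The five values are illustrative and are not the full bee dataset.

```r
# illustrative values only
wing <- c(3.6, 3.7, 3.8, 3.8, 4.0)

# the loop approach: visit each element in turn and keep a running total
total <- 0
for (w in wing) {
  total <- total + w
}

# the R approach: one call works on the whole vector
vec_total <- sum(wing)

# the two totals agree
all.equal(total, vec_total)

# arithmetic is also applied to every element at once
wing * 10
```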
We can create a function for future use. A function is defined by an assignment of the form:
functionname <- function(arg1, arg2, ...) {expression}
The {expression} is any R code that uses the arguments (arg1 etc) to calculate a value. In our case, we will have just one argument, the vector of values, and our expression will be that needed to calculate the mean.
my_mean <- function(v) {sum(v) / length(v)}
I chose the name v arbitrarily. It doesn’t matter what you call it (and it only exists inside the function when the function is called). All that matters is that the function expression describes what the function should do with the arguments passed. To call the function:
my_mean(bee$wing)
## [1] 4.55
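A quick way to gain confidence in a function you have written is to compare it with a trusted equivalent, here the inbuilt mean(). The sketch below repeats the definition of my_mean so that it runs on its own, and uses a few illustrative values. It also shows one limitation of this simple version: a missing value in the input gives NA.

```r
# the function as defined above
my_mean <- function(v) {sum(v) / length(v)}

# illustrative values only
x <- c(3.6, 3.7, 3.8, 3.8, 4.0)

# our function and the inbuilt one should agree
all.equal(my_mean(x), mean(x))

# with a missing value the sum is NA, so the result is NA too
my_mean(c(1, NA, 3))
```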
Functions are useful because they generalise a process thus making it reproducible without copying and pasting.
Rather than having your function code in your main analysis code file, it is good practice to put it in its own file and ‘source’ it from your main file. Sourcing a file makes any functions it contains available. Typically you save one function in a script with the same name as the function.
To ‘source’ a file called my_mean.R use:
file <- here::here("functions", "my_mean.R")
source(file)
You call the function in the same way as normal.
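The whole cycle can be tried out in a self-contained way. In this sketch the script my_mean.R is written to a temporary folder (a hypothetical location standing in for the functions folder of a project) and then sourced and used.

```r
# write the function definition to a script in a temporary folder
fun_file <- file.path(tempdir(), "my_mean.R")
writeLines("my_mean <- function(v) {sum(v) / length(v)}", fun_file)

# sourcing the script runs it, which creates my_mean in the workspace
source(fun_file)

# the function can now be called as normal
my_mean(c(1, 2, 3))
## [1] 2
```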
The data in chaff.txt are the masses of male and female chaffinches. They are organised into two columns, males and females, a format which is not normally ideal. Your task is to organise the analysis of these data.
You need to:
Read in the data.
Reformat the data into ‘tidy’ form, i.e., one column giving the sex, another giving the mass. Write the newly formatted data to a file.
Write your own function for calculating the sums of squares of the whole dataset. SS(x) is the sum of the squared deviations from the mean given by:
\(\sum (x_i - \bar{x})^2\)
Put the code for the function in its own script and call it from your main script.
Carry out a statistical test on the data and record the result in comments.
Create and save a figure to accompany the statistical result.
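As a hint for the reformatting step, here is one possible approach using the base R function stack(). It is only a sketch: the masses below are invented for illustration and are not the values in chaff.txt, and other tools (for example those in the tidyverse) would do the job equally well.

```r
# invented masses for illustration; the real values are in chaff.txt
chaff_demo <- data.frame(males   = c(20.1, 21.6, 22.3),
                         females = c(19.5, 20.9, 21.2))

# stack() puts all the values in one column and the column names in another
chaff_tidy <- stack(chaff_demo)

# give the two columns more meaningful names
names(chaff_tidy) <- c("mass", "sex")

chaff_tidy
```

Each row of the result is now a single bird, with its mass in one column and its sex in the other.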
theme_gray without the brackets
Bryan, Jennifer, and Jim Hester. n.d. “Chapter 2 Project-Oriented Workflow | What They Forgot to Teach You About R.” https://whattheyforgot.org/project-oriented-workflow.html.
Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.