The aim of this session is to strengthen your ability to import your own data files regardless of their formatting, and to introduce you to some of the other ways to import data, such as from Google Sheets, through web scraping and via APIs.
The successful student will be able to:
This should be revision. read.csv() and read.delim() are just wrappers for read.table(); read.table(file) is the minimum needed, as the other arguments have defaults.
Data file: structurepred.txt
file <- here::here("data", "structurepred.txt")
mydata <- read.table(file)
str(mydata)
## 'data.frame': 91 obs. of 3 variables:
## $ V1: Factor w/ 91 levels "0.08","0.353",..: 91 84 31 32 37 18 25 89 88 3 ...
## $ V2: Factor w/ 4 levels "Abstruct","Predicto",..: 3 1 1 1 1 1 1 1 1 1 ...
## $ V3: Factor w/ 31 levels "1","10","11",..: 31 1 12 23 25 26 27 28 29 30 ...
😒 The command runs but the result looks wrong.
All the variables have been made factors even though the first should be numeric. This is because the first line in the file contains strings. These strings are the column names rather than the data themselves. We can tell R the file includes headers.
mydata <- read.table(file, header = TRUE)
str(mydata)
## 'data.frame': 90 obs. of 3 variables:
## $ rmsd: num 9.04 14.95 17.73 3.12 11.28 ...
## $ prog: Factor w/ 3 levels "Abstruct","Predicto",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ prot: int 1 2 3 4 5 6 7 8 9 10 ...
😃 Now it looks better!
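If you want to experiment without the data file, the same behaviour can be reproduced with a throwaway file; the values below are made up to mimic structurepred.txt:

```r
# write a tiny file with a header line to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("rmsd prog prot",
             "9.04 Abstruct 1",
             "14.95 Abstruct 2"), tmp)

# without header = TRUE the column names are read in as data,
# so every column is forced to character (factor in R < 4.0)
str(read.table(tmp))

# with header = TRUE the column types are detected correctly
str(read.table(tmp, header = TRUE))
```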
Data file: Icd10Code.csv
file <- here::here("data", "Icd10Code.csv")
mydata <- read.table(file, header = TRUE)
## Error in read.table(file, header = TRUE): more columns than column names
😒 The command won’t run at all.
Read in the first line only:
read.table(file, header = FALSE, nrows = 1)
## V1
## 1 Code,Description
Or you could use readLines():
readLines(file, n = 5)
## [1] "Code,Description"
## [2] "A00,\"Cholera\""
## [3] "A000,\"Cholera due to Vibrio cholerae 01, biovar cholerae\""
## [4] "A001,\"Cholera due to Vibrio cholerae 01, biovar eltor\""
## [5] "A009,\"Cholera, unspecified\""
These look like they might be two column names but they have been read into one column. That is, the comma is not recognised as the separator.
Read in the third line only:
read.table(file, header = FALSE, nrows = 1, skip = 2)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 A000,"Cholera due to Vibrio cholerae 01, biovar cholerae"
It’s splitting on white space because that is the default so there are many columns in some rows - “more columns than column names”. The actual separator is a comma.
mydata <- read.table(file, header = TRUE, sep = ",")
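Since read.csv() is read.table() with sep = "," and header = TRUE already set, it does the same job here. A self-contained sketch, using a temporary file in place of Icd10Code.csv:

```r
# a throwaway comma-separated file with quoted fields
tmp <- tempfile(fileext = ".csv")
writeLines(c("Code,Description",
             "A00,\"Cholera\"",
             "A000,\"Cholera due to Vibrio cholerae 01, biovar cholerae\""), tmp)

# read.csv() is a wrapper for read.table(sep = ",", header = TRUE)
mydata <- read.csv(tmp)
str(mydata)
```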
There are packages for importing other formats:
- haven (Wickham and Miller 2018) for SPSS, STATA and SAS files
- readxl for Excel files
- jsonlite for JSON
haven
Data file: prac9a.sav
library(haven)
file <- here::here("data", "prac9a.sav")
mydata <- read_sav(file)
str(mydata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 120 obs. of 6 variables:
## $ terrsize: num 0.463 0.446 0.651 0.507 0.879 ...
## ..- attr(*, "label")= chr "Territory size (Ha)"
## ..- attr(*, "format.spss")= chr "F8.3"
## $ country : 'haven_labelled' num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Country"
## ..- attr(*, "format.spss")= chr "F8.0"
## ..- attr(*, "labels")= Named num 1 2 3
## .. ..- attr(*, "names")= chr "U.K" "France" "Germany"
## $ woodtype: 'haven_labelled' num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Wood Type"
## ..- attr(*, "format.spss")= chr "F8.0"
## ..- attr(*, "labels")= Named num 1 2
## .. ..- attr(*, "names")= chr "Deciduous" "Mixed"
## $ site : num 1 2 3 4 1 2 3 4 1 2 ...
## ..- attr(*, "label")= chr "Site"
## ..- attr(*, "format.spss")= chr "F8.0"
## $ age : num 2 1 1 3 2 1 3 2 3 3 ...
## ..- attr(*, "label")= chr "Age (years)"
## ..- attr(*, "format.spss")= chr "F8.0"
## $ Temp : num 9.5 8.9 9.8 11.4 9.5 8.9 9.8 11.4 9.5 8.9 ...
## ..- attr(*, "label")= chr "Mean annual temperature (C)"
## ..- attr(*, "format.spss")= chr "F8.1"
Note that a “tibble” is essentially a dataframe.
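The haven_labelled columns keep their SPSS value labels. To work with them as ordinary factors, haven provides as_factor(); a sketch on a hand-made labelled vector mimicking the country column above:

```r
library(haven)

# a small labelled vector like country: codes plus value labels
country <- labelled(c(1, 2, 3, 1),
                    labels = c("U.K" = 1, "France" = 2, "Germany" = 3))

# convert the value labels to factor levels
as_factor(country)
```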
There are read_dta() and read_sas() functions for STATA and SAS files respectively.
Simply use the URL rather than the local file path. Your options for import functions and arguments are the same as for local files.
For example, these data are from a buoy (buoy #44025) off the coast of New Jersey at http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/
file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"
# To check the data format, look on the web or use readLines():
readLines(file, n = 5)
## [1] "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
## [2] "#yr mo dy hr mn degT m/s m/s m sec sec degT hPa degC degC degC mi ft"
## [3] "2010 12 31 23 50 222 7.2 8.5 0.75 4.55 3.72 203 1022.2 6.9 6.7 3.5 99.0 99.00"
## [4] "2011 01 01 00 50 233 6.0 6.8 0.76 4.76 3.77 196 1022.2 6.7 6.7 3.7 99.0 99.00"
## [5] "2011 01 01 01 50 230 5.0 5.9 0.72 4.55 3.85 201 1021.9 6.8 6.7 3.5 99.0 99.00"
The first two lines give the column names and units; the data start on line 3.
mydata <- read.table(file, header = FALSE, skip = 2)
This can be done with the googlesheets package (Bryan and Zhao 2018) and requires two steps: registering the sheet for use (you will need to ‘authenticate’) and reading it in.
Assuming your google sheet has the name “colours (Responses)”:
library(googlesheets)
# Register the sheet for use with gs_title()
colours <- gs_title("colours (Responses)")
This will open a browser asking you to allow the googlesheets package to access your Google Sheets. Choose Allow.
Read in data:
coloursurvey <- gs_read(colours)
What if data are not in a file but on a web page? One solution is to ‘scrape’ the data using the rvest package. This is just one small example of a big topic.
We are going to get the information from a Wikipedia page on Global biodiversity. We will need to retrieve the HTML (i.e., the page source) and find the part of the source which contains the table information.
library(rvest)
url <- "https://en.wikipedia.org/wiki/Global_biodiversity"
test <- read_html(url) %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
html_table() %>%
.[[1]]
You’d need to tidy the resulting dataframe.
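If you want to see what html_table() does without fetching Wikipedia, rvest can parse an inline HTML fragment; the figures below are invented:

```r
library(rvest)

# a tiny stand-in for a page containing one table
html <- minimal_html('
  <table>
    <tr><th>Group</th><th>Species</th></tr>
    <tr><td>Insects</td><td>1000000</td></tr>
    <tr><td>Fungi</td><td>144000</td></tr>
  </table>')

# extract all tables, then take the first as a data frame
html %>% html_nodes("table") %>% html_table() %>% .[[1]]
```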
Many organisations provide an API (application programming interface) to access their data. This means you don’t have to scrape web pages of search results: there is a defined set of tools or protocols which allow access, so you don’t need to understand the underlying structure of the information. Many packages have been written that make use of these APIs and make your life much easier.
Bioconductor is a whole ecosystem of R tools to analyse high-throughput genomic data and access public repositories. They are available from Bioconductor rather than CRAN. To install Bioconductor packages on your own computer you will need to install BiocManager from CRAN, then use BiocManager::install("packagename").
rOpenSci is a curated set of R packages that provide programmatic access to a variety of scientific data, the full text of journal articles, and metrics of scholarly impact.
Not every API will have an associated R package, but many are JSON-based, meaning they return data in JSON format; Twitter’s is one example. The jsonlite package is very useful for dealing with such data.
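A self-contained sketch of jsonlite parsing JSON text into a data frame; the JSON here is made up, but a JSON-based API would return something of the same shape:

```r
library(jsonlite)

# a made-up JSON array of records, as an API might return
txt <- '[{"code": "A00", "description": "Cholera"},
         {"code": "A01", "description": "Typhoid fever"}]'

# fromJSON() simplifies an array of objects to a data frame
mydata <- fromJSON(txt)
str(mydata)
```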
Local files. You may feel confident enough to skip this exercise; otherwise, save the files below to your ‘data’ directory and read them in. Use the manual for read.table() to see all the options.
This is a tidying exercise. Can you read in the buoy data with appropriate column names? I suggest:
Below is a file containing flow cytometry data. Your task is to use Google to find out how to read it in, and to develop an understanding of the resulting data structure by examining its components. You should find any packages you need are already installed, but you will still need to identify and library() them. What kind of object results? Where are the data themselves? What is the other information? Can you plot any of the actual data, for example the classic forward- vs side-scatter plot?
The data: Specimen_001_Tube_001_001.fcs
If you’re stuck, here is one function I wrote: em_read_list.R. Work out how to use it from the code in that file.
Bryan, Jennifer, and Joanna Zhao. 2018. Googlesheets: Manage Google Spreadsheets from R. https://CRAN.R-project.org/package=googlesheets.
Wickham, Hadley, and Evan Miller. 2018. Haven: Import and Export ’Spss’, ’Stata’ and ’Sas’ Files. https://CRAN.R-project.org/package=haven.