Introduction

Aims

The aim of this session is to strengthen your ability to import your own data files regardless of their formatting, and to introduce you to some of the other ways to import data, such as from Google Sheets, through web scraping and via APIs.

Learning outcomes

The successful student will be able to:

  • apply an understanding of what matters when importing data
  • import plain text and proprietary data formats stored locally and on the web
  • carry out some simple web scraping
  • start to appreciate the number of packages available for importing publicly accessible data

Four aspects of importing data.

  • Where: stored locally (on your own computer) or remotely (on another computer/server).
  • Format: varies - may be plain text, structured as XML or JSON, held in a database, or may require harvesting (scraping) from a web page.
  • How: base R functions cover many formats; access to APIs for many forms of specialised data has been made easier with packages, e.g., from Bioconductor.
  • Result: often dataframes or dataframe-like structures (e.g., tibbles), sometimes specialised data structures.

Locally stored plain text (or similar).

This should be revision.

  • Essentially plain text (can be opened in notepad and make sense).
  • Columns usually ‘delimited’ by a particular character (but fixed width does occur).
  • Read in with the read.table() methods.
  • read.csv(), read.delim() are just wrappers for read.table().
  • read.table(file) is the minimum needed, other arguments have defaults.
  • Remember that relative file location matters.

Example 1

Data file: structurepred.txt

file <- here::here("data", "structurepred.txt")
mydata <- read.table(file)
str(mydata)
## 'data.frame':    91 obs. of  3 variables:
##  $ V1: Factor w/ 91 levels "0.08","0.353",..: 91 84 31 32 37 18 25 89 88 3 ...
##  $ V2: Factor w/ 4 levels "Abstruct","Predicto",..: 3 1 1 1 1 1 1 1 1 1 ...
##  $ V3: Factor w/ 31 levels "1","10","11",..: 31 1 12 23 25 26 27 28 29 30 ...

😒 The command runs but the result looks wrong.

All the variables have been made factors even though the first should be numeric. This is because the first line in the file contains strings. These strings are the column names rather than the data themselves. We can tell R the file includes headers.

mydata <- read.table(file, header = TRUE)
str(mydata)
## 'data.frame':    90 obs. of  3 variables:
##  $ rmsd: num  9.04 14.95 17.73 3.12 11.28 ...
##  $ prog: Factor w/ 3 levels "Abstruct","Predicto",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ prot: int  1 2 3 4 5 6 7 8 9 10 ...

😃 Now it looks better!

  • There are several arguments that can be set.
  • Arguments depend on data format.
  • Check manual for defaults.

Example 2

Data file: Icd10Code.csv

file <- here::here("data", "Icd10Code.csv")
mydata <- read.table(file, header = TRUE)
## Error in read.table(file, header = TRUE): more columns than column names

😒 The command won’t run at all.

  • This is a common error when the separator is not the default.
  • Troubleshoot by reading in one line.

Read in the first line only:

read.table(file, header = FALSE, nrows = 1)
##                 V1
## 1 Code,Description

Or you could use readLines():

readLines(file, n = 5)
## [1] "Code,Description"                                           
## [2] "A00,\"Cholera\""                                            
## [3] "A000,\"Cholera due to Vibrio cholerae 01, biovar cholerae\""
## [4] "A001,\"Cholera due to Vibrio cholerae 01, biovar eltor\""   
## [5] "A009,\"Cholera, unspecified\""

These look like they might be two column names but they have been read into one column. That is, the comma is not recognised as the separator.

Read in the third line only:

read.table(file, header = FALSE, nrows = 1, skip = 2)
##              V1  V2 V3     V4       V5  V6     V7        V8
## 1 A000,"Cholera due to Vibrio cholerae 01, biovar cholerae"

read.table() is splitting on white space because that is the default, so some rows end up with many columns - hence “more columns than column names”. The actual separator is a comma.

mydata <- read.table(file, header = TRUE, sep = ",")
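
Since read.csv() is just read.table() with header = TRUE and sep = "," as defaults, the same file can be read with:

mydata <- read.csv(file)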

Locally stored special formats.

  • Cannot usually be opened in notepad.
  • Often specific to proprietary software, e.g., SPSS, STATA, Matlab.
  • If you have that software you may be able to export in plain text format.
  • But usually there is a package or function that allows you to script the steps. Favourites of mine are:
    • haven (Wickham and Miller 2018) for SPSS, STATA, SAS
    • readxl for Excel files
    • jsonlite for JSON
  • Google is your friend.

SPSS example using haven

Data file: prac9a.sav

library(haven)
file <- here::here("data", "prac9a.sav")
mydata <- read_sav(file)
str(mydata)
## Classes 'tbl_df', 'tbl' and 'data.frame':    120 obs. of  6 variables:
##  $ terrsize: num  0.463 0.446 0.651 0.507 0.879 ...
##   ..- attr(*, "label")= chr "Territory size (Ha)"
##   ..- attr(*, "format.spss")= chr "F8.3"
##  $ country : 'haven_labelled' num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Country"
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "labels")= Named num  1 2 3
##   .. ..- attr(*, "names")= chr  "U.K" "France" "Germany"
##  $ woodtype: 'haven_labelled' num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Wood Type"
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "labels")= Named num  1 2
##   .. ..- attr(*, "names")= chr  "Deciduous" "Mixed"
##  $ site    : num  1 2 3 4 1 2 3 4 1 2 ...
##   ..- attr(*, "label")= chr "Site"
##   ..- attr(*, "format.spss")= chr "F8.0"
##  $ age     : num  2 1 1 3 2 1 3 2 3 3 ...
##   ..- attr(*, "label")= chr "Age (years)"
##   ..- attr(*, "format.spss")= chr "F8.0"
##  $ Temp    : num  9.5 8.9 9.8 11.4 9.5 8.9 9.8 11.4 9.5 8.9 ...
##   ..- attr(*, "label")= chr "Mean annual temperature (C)"
##   ..- attr(*, "format.spss")= chr "F8.1"

Note that a “tibble” is essentially a dataframe.

There are read_dta() and read_sas() functions for STATA and SAS files respectively.
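
readxl works in much the same way. A minimal sketch, assuming an Excel file called mydata.xlsx in the data directory (not one of the workshop files):

library(readxl)
file <- here::here("data", "mydata.xlsx")
# read_excel() reads the first sheet by default; use the sheet argument to pick another
mydata <- read_excel(file)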

Files on the internet

Simply use the URL rather than the local file path. Your options for import functions and arguments are the same as for local files.

For example, these data are from a buoy (buoy #44025) off the coast of New Jersey at http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/

file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"

# to check the data format: look at the web page or use readLines()
readLines(file, n = 5)
## [1] "#YY  MM DD hh mm WDIR WSPD GST  WVHT   DPD   APD MWD   PRES  ATMP  WTMP  DEWP  VIS  TIDE"
## [2] "#yr  mo dy hr mn degT m/s  m/s     m   sec   sec degT   hPa  degC  degC  degC   mi    ft"
## [3] "2010 12 31 23 50 222  7.2  8.5  0.75  4.55  3.72 203 1022.2   6.9   6.7   3.5 99.0 99.00"
## [4] "2011 01 01 00 50 233  6.0  6.8  0.76  4.76  3.77 196 1022.2   6.7   6.7   3.7 99.0 99.00"
## [5] "2011 01 01 01 50 230  5.0  5.9  0.72  4.55  3.85 201 1021.9   6.8   6.7   3.5 99.0 99.00"

The first two lines give the column names and units; the data start on line 3.

mydata <- read.table(file, header = FALSE, skip = 2)

  • It is better to read from the internet than to download the file if the data are likely to be updated.
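
If you do want a local copy, for example to keep a snapshot of data that might change, base R’s download.file() will save it to disk; the destination file name here is just an example:

# save a local copy of the buoy data
download.file(file, destfile = here::here("data", "buoy44025_2011.txt"))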

From googlesheets

This can be done with the googlesheets package (Bryan and Zhao 2018) and requires two steps: registering the sheet for use (you will need to ‘authenticate’) and reading it in.

Assuming your google sheet has the name “colours (Responses)”:

library(googlesheets)
# Register the sheet for use with gs_title()
colours <- gs_title("colours (Responses)")

This will open a browser window asking you to allow the googlesheets package to access your Google Sheets. Choose Allow.

Read in data:

coloursurvey <- gs_read(colours)
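
If the sheet title is ambiguous, the package can also register a sheet by its key or full URL; a minimal sketch, where the URL is only a placeholder:

# gs_url() registers a sheet from its browser URL; gs_key() works from the key alone
colours <- gs_url("https://docs.google.com/spreadsheets/d/KEY_GOES_HERE/")
coloursurvey <- gs_read(colours)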

Web scraping

What if the data are not in a file but on a web page? One solution is to ‘scrape’ the data using the rvest package. This is just one small example of a big topic.

We are going to get the information from a Wikipedia page on Global biodiversity. We will need to retrieve the HTML (i.e., the page source) and find the part of the source which contains the table information.

  • Go to the web page: https://en.wikipedia.org/wiki/Global_biodiversity
  • Find the table you want to extract.
  • Right-click the table and choose Inspect
  • A complicated-looking box will appear on the right.
  • You need to right-click the table element, then choose Copy and then Copy XPath.

library(rvest)

url <- "https://en.wikipedia.org/wiki/Global_biodiversity"
test <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table() %>% 
  .[[1]]

You’d need to tidy the resulting dataframe.
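
As an alternative to copying the XPath, tables can often be selected by their CSS class. A sketch under the assumption that the table of interest carries Wikipedia’s usual “wikitable” class:

# select tables by CSS class instead of XPath
test <- read_html(url) %>%
  html_nodes("table.wikitable") %>%
  html_table() %>%
  .[[1]]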

Data from databases

There are many packages:

  • For relational databases (Oracle, MSSQL, MySQL):
    • RMySQL, RODBC
  • For non-relational databases (MongoDB, Hadoop):
    • rmongodb, rhbase
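
The relational packages mostly follow the DBI pattern of connect, query, disconnect. A minimal sketch using RMySQL, in which all the connection details are made-up placeholders:

library(DBI)
library(RMySQL)

# connection details below are placeholders, not a real server
con <- dbConnect(MySQL(),
                 host = "myserver.example.com",
                 dbname = "surveys",
                 user = "analyst",
                 password = "secret")

# run an SQL query and return the result as a dataframe
mydata <- dbGetQuery(con, "SELECT * FROM measurements")

dbDisconnect(con)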

Data from APIs

Many organisations provide an API (application programming interface) to access their data. This means you don’t have to scrape web pages of search results: the API is a defined set of tools and protocols for accessing the data, so you don’t need to understand the underlying structure of the information.

Many packages have been written that make use of APIs and make your life much easier.

Bioconductor is a whole ecosystem of R tools to analyse high-throughput genomic data and access public repositories. They are available from Bioconductor rather than CRAN. To install Bioconductor packages on your own computer you will need to install BiocManager from CRAN then use BiocManager::install("packagename").
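
For example, to install a Bioconductor package (Biostrings is used here only as an illustration):

# BiocManager itself comes from CRAN
install.packages("BiocManager")
# Bioconductor packages are then installed through BiocManager
BiocManager::install("Biostrings")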

rOpenSci provides a curated set of R packages that give programmatic access to a variety of scientific data, the full text of journal articles, and metrics of scholarly impact.

Not every API has a dedicated R package, but many APIs are JSON-based, meaning they return data in JSON format; Twitter is one example. The jsonlite package is very useful for dealing with such data.
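
A minimal sketch of what jsonlite does, using a small in-line JSON string rather than a real API call:

library(jsonlite)

# a fragment of JSON, as an API might return it
txt <- '[{"species":"otter","count":12},{"species":"badger","count":7}]'

# fromJSON() converts a JSON array of objects into a dataframe
mydata <- fromJSON(txt)
str(mydata)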

Exercises

  1. Local files. You may feel confident enough to skip this exercise but otherwise save these files below to your ‘data’ directory and read them in. Use the manual for read.table() to see all the options.

  2. This is a tidying exercise. Can you read in the buoy data with appropriate column names? I suggest:

    • reading the first line of the file
    • processing that line using similar methods to those used in the case study of Workshop 2: Tidying data, tidy data and the tidyverse.
  3. Below is a file containing flow cytometry data. Your task is to use Google to find out how to read it in and to develop an understanding of the resulting data structure by examining its components. You should find any packages you need are already installed, but you will still need to identify and ‘library’ them. What kind of object results? Where is the data itself? What is the other information? Can you plot any of the actual data - for example, the classic forward vs side scatter plot?

The data: Specimen_001_Tube_001_001.fcs

  4. Can you work out how to read in all the files of the same format in a particular directory? You could try to read them into separate dataframes or a single dataframe. A collection of suitable files is specdata.zip which you could download and unzip.

If you’re stuck, here is one function I wrote: em_read_list.R. Work out how to use it from the code in that file.

  5. Browse the Bioconductor Workflow Packages. These are analysis packages which include a detailed example workflow.

The Rmd file

Rmd file

Bryan, Jennifer, and Joanna Zhao. 2018. Googlesheets: Manage Google Spreadsheets from R. https://CRAN.R-project.org/package=googlesheets.

Wickham, Hadley, and Evan Miller. 2018. Haven: Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files. https://CRAN.R-project.org/package=haven.