Introduction

Aims

In this session you will get an introduction to machine learning in general and to applying and interpreting some machine learning methods. You should also have time to experiment with your own data.

Learning outcomes

The successful student will be able to:

  • Describe what is meant by machine learning and the primary way to classify ML methods.
  • Know a few supervised and unsupervised methods.
  • Be able to apply, visualise and interpret clustering, PCA and LDA.
  • Be able to partition datasets to train and test ML models on different sets of observations.

Overview.

What is the difference between ML and statistics?

There isn’t really one! But perhaps …

  • Statistics.
    • Goal is understanding: variable significance.
    • Parsimonious models: penalties for complexity; interpretability is important.
    • Model entire data set.
  • Machine learning.
    • Goal is more often prediction: reliability of predictions.
    • How predictions arise is less important.
    • Often complex models: no penalties for complexity.
    • Train on ~75% of the data, test on ~25%.

Main types.

One of the main ways in which ML methods can be classified is as supervised or unsupervised.

  • Supervised:
    • We have a known response.
    • Model created through training set (~75%) with known response.
    • Model tested on test set (~25%).
  • Unsupervised:
    • No particular response; input is unlabeled.
    • Goal is to discover structure.
    • Can be used to explore, reduce or transform variables before applying supervised methods.

Unsupervised methods.

Often used to explore and visualise datasets with many variables. Very common examples:

  • Clustering ( >1 method)
    • Goal: find natural groupings which may suggest hypotheses.
    • Based on distance measures.
    • Clustering methods vary in 1. the distance measure used and 2. the clustering algorithm (see the sketch after this list).
    • Hierarchical - either successive agglomeration or successive division.
    • Illustrated with dendrogram.
  • Dimension reduction e.g., PCA, tSNE
    • New variables which are combinations of the original variables.
    • Summarise overall pattern of variation.
    • See the pattern of correlation between variables.
    • See similarities among observations.
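
For example, in base R those two choices appear as separate arguments: the distance measure is chosen in dist() and the linkage (agglomeration) method in hclust(). A minimal sketch, where x is a hypothetical numeric dataframe or matrix (the worked example below uses Euclidean distance with complete linkage):

# choose the distance measure in dist() ...
d <- dist(x, method = "euclidean")
# ... and the linkage method in hclust()
clust <- hclust(d, method = "average")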

Supervised methods.

Often intended to classify observations. There are many, many variations, for example:

  • Linear and logistic regression.
  • Linear Discriminant Analysis (LDA).
  • Support-Vector Machines (SVM).
  • Decision trees.
  • Naive Bayes.
  • K-nearest neighbours.

Examples

Getting started.

  • Create a new project.
  • Make a folder for data.
  • Create a new R Markdown document.
    • File | New File | R Markdown.
    • Add a title.
    • Add your name.
    • Delete everything except:
      • the YAML header between the --- delimiters
      • the first code chunk which begins: ```{r setup, include=FALSE}
    • Add a chunk for packages and include tidyverse (a suggested layout is sketched below).
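
A minimal packages chunk might look like this (the chunk label is just a suggestion):

```{r packages}
library(tidyverse)
```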

Example.

The Wheat Seeds Dataset involves the prediction of species given measurements of seed kernels from different varieties of wheat.

There are three different varieties of wheat: Kama, Rosa and Canadian. The internal kernel structure was visualised using a soft X-ray technique and 7 measurements were taken:

  • Area.
  • Perimeter.
  • Compactness.
  • Length of kernel.
  • Width of kernel.
  • Asymmetry coefficient.
  • Length of kernel groove.

Read in the data, naming the columns:

file <- here::here("data", "seeds_dataset.txt")
cols <- c("area", 
          "perimeter",
          "compactness",
          "kernal_length",
          "kernel_width",
          "asymmetry_coef",
          "groove_length",
          "species")
seeds <- read.table(file, header = FALSE, col.names = cols)

The species is coded as 1, 2, and 3 and it would be useful to recode to the species names:

seeds$species <- recode(seeds$species,
                        `1` = "Kama",
                        `2` = "Rosa",
                        `3` = "Canadian")

If you have about 4 to 15 variables, plotting them pairwise against each other gives a nice overview. This can be achieved with ggpairs() from the GGally package.

seeds %>% 
  GGally::ggpairs() 
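
ggpairs() also accepts a ggplot2 aesthetic mapping, so a useful variation (a sketch, not part of the original walkthrough) is to colour the points by species:

seeds %>% 
  GGally::ggpairs(aes(colour = species))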

Clustering

Clustering is one way to see if these variables could be used to distinguish between species. If they can, we would expect individuals of the same species to cluster together. We first compute the pairwise distances between observations using dist(), then use those distances for clustering with hclust():

clust <- seeds %>% 
  select(-species) %>% 
  dist() %>% 
  hclust(method = "complete")

Then we can plot the result:

plot(clust)

Since I like ggplot, I will convert the output of hclust() to a list of dataframes that ggplot can use, with the help of another package, ggdendro. Add this package to your packages chunk.

library(ggdendro)

Now we can use the dendro_data() function on our hclust output. Open the manual page for dendro_data().

ddata <- dendro_data(clust, type = "rectangle")

And plot it:

ggplot() + 
  geom_segment(data = ddata$segments, 
               aes(x = x, y = y, xend = xend, yend = yend)) + 
  coord_flip() + 
  scale_y_reverse()

  • How many clusters do you think exist?
  • If you had to identify 3 clusters (one for each species) where would they be?
  • Does this match the reality? Add the species labels by adding a geom to the plot: geom_text(data = ddata$labels, aes(x = x, y = y - 0.8, label = seeds$species), size = 2)
ggplot() + 
  geom_segment(data = ddata$segments, 
               aes(x = x, y = y, xend = xend, yend = yend)) + 
  geom_text(data = ddata$labels, 
            aes(x = x, y = y - 0.8,
                label = seeds$species,
                colour = seeds$species),
            size = 1) + 
  coord_flip() + 
  scale_y_reverse()

This is not entirely easy to read with so many observations, but you should still be able to see whether your thoughts about where the three clusters are match the actual groups.

  • What do you conclude about the ability of this unsupervised method to separate these species based on the kernel measurements?
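
One way to check this more formally (a sketch using cutree() from base R, not shown in the original walkthrough) is to cut the dendrogram into three clusters and cross-tabulate the cluster membership against the species:

# cut the tree into 3 clusters and compare with the known species
table(cluster = cutree(clust, k = 3), species = seeds$species)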

Principal Components Analysis.

Principal Components Analysis (PCA) is another method to investigate whether you have groups or patterns in a dataset. It is a ‘data reduction’ or ‘dimension reduction’ method which creates a set of ‘components’ (axes) that are linear combinations of the original variables. PCA is useful when relatively few components are needed to capture most of the variation in the data.

pca <- seeds %>% 
  select(-species) %>% 
  prcomp(scale. = TRUE)
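
To see why scaling (scale. = TRUE) matters for these data, a quick check of the raw variances (a minimal sketch using dplyr) is:

# compare the variances of the raw variables
seeds %>% 
  select(-species) %>% 
  summarise(across(everything(), var))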

Scaling: when the values in one variable are much bigger than in others we usually scale all the variables (to a mean of zero and unit variance) before undertaking PCA, to avoid the variable with the biggest values dominating the analysis. To see the variance accounted for by each component:

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.2430 1.0943 0.82341 0.26147 0.13680 0.07302
## Proportion of Variance 0.7187 0.1711 0.09686 0.00977 0.00267 0.00076
## Cumulative Proportion  0.7187 0.8898 0.98668 0.99645 0.99912 0.99988
##                            PC7
## Standard deviation     0.02850
## Proportion of Variance 0.00012
## Cumulative Proportion  1.00000
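
The same information can be viewed as a scree plot; base R's screeplot() gives a quick version (a sketch, not part of the original walkthrough):

screeplot(pca, type = "lines")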

To see the importance (loading) of each variable in each component:

pca$rotation
##                       PC1         PC2         PC3         PC4         PC5
## area           -0.4444735  0.02656355 -0.02587094  0.19363997 -0.20441167
## perimeter      -0.4415715  0.08400282  0.05983912  0.29545659 -0.17427591
## compactness    -0.2770174 -0.52915125 -0.62969178 -0.33281640  0.33265481
## kernal_length  -0.4235633  0.20597518  0.21187966  0.26340659  0.76609839
## kernel_width   -0.4328187 -0.11668963 -0.21648338  0.19963039 -0.46536555
## asymmetry_coef  0.1186925  0.71688203 -0.67950584  0.09246481  0.03625822
## groove_length  -0.3871608  0.37719327  0.21389720 -0.80414995 -0.11134657
##                        PC6          PC7
## area            0.42643686  0.734805689
## perimeter       0.47623853 -0.670751532
## compactness     0.14162884 -0.072552703
## kernal_length  -0.27357647  0.046276051
## kernel_width   -0.70301171 -0.039289079
## asymmetry_coef  0.01964186 -0.003723456
## groove_length  -0.04282974 -0.034498098

To plot, we might want to use the scores on each of the new axes and colour them by species. The scores are in an element called $x.

# For convenience, I'll put these in one 'tidy' dataframe
pca_labelled <- data.frame(pca$x, species = seeds$species)
# and then do a scatterplot
ggplot(pca_labelled, aes(x = PC1, y = PC2, color = species))+
  geom_point()

LDA

Linear Discriminant Analysis (LDA) aims to find the linear combinations of variables that maximise differences between groups. It is supervised because we label observations by their class and determine the allocation rules based on these. A ‘discriminant’ is a line that separates groups. As there are three classes we have at most two linear discriminants. The lda() function is in a package called MASS. I very rarely load MASS with a library() command since it has a function called select() that conflicts with dplyr’s select(). Thus I will use MASS:: to access its functions.

ldaout <- seeds %>% 
  select(-species) %>% 
  MASS::lda(grouping = seeds$species)

To determine how well the species were predicted, we need to obtain the scores on LD1 and LD2 and the predicted classes:

pldaout <- seeds %>% 
  select(-species) %>%
  predict(object = ldaout)

How many predicted classes are the same as the actual classes:

table(predicted = pldaout$class, observed = seeds$species)
##           observed
## predicted  Canadian Kama Rosa
##   Canadian       67    3    0
##   Kama            3   66    0
##   Rosa            0    1   70

How well do you think the species were predicted?
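
One way to quantify this (a sketch, not in the original walkthrough) is the proportion of observations whose predicted class matches the observed species:

# proportion of correct classifications
mean(pldaout$class == seeds$species)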

We can create a scatter plot of LD1 and LD2 just like we did for PC1 and PC2:

# add the observed species and the predicted class
ldaout_labelled <- data.frame(pldaout$x,
                              observed_species = seeds$species,
                              predicted_species = pldaout$class)
ldaout_labelled$match <- ldaout_labelled$observed_species == ldaout_labelled$predicted_species
# and then do a scatterplot
ggplot(ldaout_labelled, aes(x = LD1, y = LD2, 
                            colour = observed_species,
                            shape = match)) +
  geom_point()

Do you think the LDA gives better separation of groups than PCA?

Training and Testing

We used the same data to train the LDA model as we used to examine its performance. Only seven cases were incorrectly classified. But this isn’t very robust: the model could be overfitting.

A key part of using ML methods to make predictions is to test how good those predictions are. This is typically done by training the model on about 75% of your data and then testing it on the remainder.

The caret package includes functions to help with this (as well as lots of ML algorithms). The name comes from Classification And REgression Training.

library(caret)

Split the dataset into training and testing sets using createDataPartition().

It returns a random sample of row numbers amounting to the chosen proportion of the dataframe. Since it returns a list (of one item) I’ve added [[1]] to select that list element so I have a vector to work with.

ids <- createDataPartition(y = seeds$species, p = 0.75)[[1]]
str(ids)
##  int [1:159] 1 2 3 4 6 8 9 10 11 12 ...

Now use those row numbers to create the training and test datasets.

train <- seeds[ids,] # those rows only
test <- seeds[-ids,] # without those rows
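
createDataPartition() samples within each class of y, so the split should be roughly balanced across species; a quick check (a sketch) is:

# the three species should appear in roughly equal numbers in each set
table(train$species)
table(test$species)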

Perform the LDA on the training data:

# train

ldaout <- train %>% 
  select(-species) %>% 
  MASS::lda(grouping = train$species)

And predict the classes of the test data based on the LDA model:

pldaout <- test %>% 
  select(-species) %>%
  predict(object = ldaout)

Evaluate the model performance by comparing the predicted classes to the actual classes.

# compare the predicted classes to the actual classes
table(predicted = pldaout$class, observed = test$species)
##           observed
## predicted  Canadian Kama Rosa
##   Canadian       16    1    0
##   Kama            1   16    0
##   Rosa            0    0   17

I got 2 misclassified. You may have had more or fewer.
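
Because createDataPartition() samples rows at random, every run gives a slightly different split and therefore slightly different results. If you want a reproducible split, one option (not part of the original walkthrough) is to set the random seed before partitioning:

# any fixed number will do; this makes the random partition reproducible
set.seed(42)
ids <- createDataPartition(y = seeds$species, p = 0.75)[[1]]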

Exercise

Either:

  1. Start working with your own assessment.

Do continue to follow the practice covered in previous workshops with respect to project organisation, coding style and reproducibility.

Or

  1. Continue working with the example in Workshop 2: Tidying data and the tidyverse and Workshop 4: Reproducibility and an introduction to R Markdown to develop a report generated through R Markdown. So that you can focus on applying the methods from this workshop, a processed form of the data is in sol.txt. Note that, like the raw data file, these are organised with samples in columns and proteins in rows. In order to use the genes as variables and the samples as observations, we will need to transpose the data.

Import:

file <- here::here("data", "sol.txt")
sol <- read.table(file, header = TRUE)
names(sol)
##  [1] "genename" "Y101_A"   "Y101_B"   "Y101_C"   "Y102_A"   "Y102_B"  
##  [7] "Y102_C"   "Y201_A"   "Y201_B"   "Y201_C"   "Y202_A"   "Y202_B"  
## [13] "Y202_C"   "Y1015_A"  "Y1015_B"  "Y1015_C"

We can see that the genename is in the first column.

Transpose all the values except the genename:

tsol <- sol %>% 
  select(-genename) %>% 
  t() %>% 
  data.frame()

Use the genenames in sol to name the columns in tsol:

names(tsol) <- sol$genename

The column names of sol have become the row names of tsol. We can add a column for these as well.

tsol$sample <- row.names(tsol)

And process the sample name so that we have the cell lineage in one column and the replicate in another:

tsol <- tsol %>% 
  extract(sample, 
          c("lineage","rep"),
          "(Y[0-9]{3,4})\\_([A-C])")

Now the data should be in a format to which you can apply the methods in this workshop.
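
For example, a first look with PCA could follow the same pattern as the seeds example (a sketch, assuming tsol as created above and that no protein column has zero variance):

pca_sol <- tsol %>% 
  select(-lineage, -rep) %>% 
  prcomp(scale. = TRUE)
summary(pca_sol)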

The Rmd file

Rmd file