Getting started.
Example.
The Wheat Seeds Dataset involves the prediction of species given measurements of seed kernels from different varieties of wheat.
There are three different varieties of wheat: Kama, Rosa and Canadian. A soft X-ray technique was used to obtain high-quality visualizations of the internal kernel structure, and 7 measurements were taken:
- Area.
- Perimeter.
- Compactness.
- Length of kernel.
- Width of kernel.
- Asymmetry coefficient.
- Length of kernel groove.
file <- here::here("data", "seeds_dataset.txt")
cols <- c("area",
          "perimeter",
          "compactness",
          "kernal_length",
          "kernel_width",
          "asymmetry_coef",
          "groove_length",
          "species")
seeds <- read.table(file, header = FALSE, col.names = cols)
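It is worth a quick check that the import worked; a minimal sketch using base R:

# check the data imported as expected: 210 rows, 8 columns
str(seeds)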
The species is coded as 1, 2, and 3, and it would be useful to recode these to the species names with dplyr's recode():
seeds$species <- recode(seeds$species,
                        `1` = "Kama",
                        `2` = "Rosa",
                        `3` = "Canadian")
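A quick tabulation confirms the recoding worked; there are 70 kernels of each variety:

# check the recoding: 70 kernels of each variety
table(seeds$species)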
If you have about 4 - 15 variables, plotting them pairwise against each other gives a nice overview. This can be achieved with the ggpairs() function from the GGally package.
seeds %>%
  GGally::ggpairs()
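If you want the panels coloured by species, ggpairs() also accepts a ggplot2 aesthetic mapping; an optional variant worth trying:

# colour each panel by species to see group separation at a glance
seeds %>%
  GGally::ggpairs(aes(colour = species))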
Clustering
Clustering is one way to see if these variables could be used to distinguish between species. If they do, we would expect individuals of the same species to cluster together. We first create a matrix of the distances between observations using dist(), then use those distances for clustering with hclust():
clust <- seeds %>%
  select(-species) %>%
  dist() %>%
  hclust(method = "complete")
Then we can plot the result:
plot(clust)
Since I like ggplot, I will convert the output of hclust to a list of dataframes that ggplot can use, with the help of another package, ggdendro. Add this package to your packages chunk.
library(ggdendro)
Now we can use the dendro_data() function on our hclust output. Open the manual page for dendro_data().
ddata <- dendro_data(clust, type = "rectangle")
And plot it:
ggplot() +
  geom_segment(data = ddata$segments,
               aes(x = x, y = y, xend = xend, yend = yend)) +
  coord_flip() +
  scale_y_reverse()
- How many clusters do you think exist?
- If you had to identify 3 clusters (one for each species) where would they be?
- Does this match the reality? Add the species labels by adding a geom to the plot:
geom_text(data = ddata$labels, aes(x = x, y = y - 0.8, label = seeds$species), size = 2)
ggplot() +
  geom_segment(data = ddata$segments,
               aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = ddata$labels,
            aes(x = x, y = y - 0.8,
                label = seeds$species,
                colour = seeds$species),
            size = 1) +
  coord_flip() +
  scale_y_reverse()
This is not entirely easy to read with there being so many labels, but you should still be able to see if your thoughts about where the three clusters are match the actual groups.
- What do you conclude about the ability of this unsupervised method to separate these species based on these kernel measurements? (The cutree() sketch below formalises the comparison.)
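If you want to check formally rather than by eye, cutree() cuts the dendrogram into a chosen number of clusters; a minimal sketch comparing three clusters to the species labels:

# cut the tree into 3 clusters and cross-tabulate with the known species
clusters <- cutree(clust, k = 3)
table(cluster = clusters, species = seeds$species)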
Principal Components Analysis.
Principal Components Analysis (PCA) is another method to investigate whether you have groups or patterns in a dataset. It is a ‘data reduction’ or ‘dimension reduction’ method and creates a set of ‘components’ (axes) which are linear combinations of the original variables. PCA is useful when relatively few components are needed to capture most of the variation in the data.
pca <- seeds %>%
  select(-species) %>%
  prcomp(scale. = TRUE)
Scaling: When the values in one variable are much bigger than in others, we usually scale all the variables (to a mean of zero and unit variance) before undertaking PCA, to avoid the variable with the biggest values dominating the analysis. To see the variance accounted for by each component:
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.2430 1.0943 0.82341 0.26147 0.13680 0.07302
## Proportion of Variance 0.7187 0.1711 0.09686 0.00977 0.00267 0.00076
## Cumulative Proportion 0.7187 0.8898 0.98668 0.99645 0.99912 0.99988
## PC7
## Standard deviation 0.02850
## Proportion of Variance 0.00012
## Cumulative Proportion 1.00000
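The same information can be shown as a scree plot; a minimal sketch built from pca$sdev (the proportions match the summary output above):

# proportion of variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
data.frame(PC = seq_along(prop_var), prop_var = prop_var) %>%
  ggplot(aes(x = PC, y = prop_var)) +
  geom_col() +
  labs(x = "Principal component", y = "Proportion of variance")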
To see the importance (loading) of each variable in each component:
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## area -0.4444735 0.02656355 -0.02587094 0.19363997 -0.20441167
## perimeter -0.4415715 0.08400282 0.05983912 0.29545659 -0.17427591
## compactness -0.2770174 -0.52915125 -0.62969178 -0.33281640 0.33265481
## kernal_length -0.4235633 0.20597518 0.21187966 0.26340659 0.76609839
## kernel_width -0.4328187 -0.11668963 -0.21648338 0.19963039 -0.46536555
## asymmetry_coef 0.1186925 0.71688203 -0.67950584 0.09246481 0.03625822
## groove_length -0.3871608 0.37719327 0.21389720 -0.80414995 -0.11134657
## PC6 PC7
## area 0.42643686 0.734805689
## perimeter 0.47623853 -0.670751532
## compactness 0.14162884 -0.072552703
## kernal_length -0.27357647 0.046276051
## kernel_width -0.70301171 -0.039289079
## asymmetry_coef 0.01964186 -0.003723456
## groove_length -0.04282974 -0.034498098
To plot, we might want to use the scores on each of the new axes and colour them by species. The scores are in an element called $x:
# For convenience, I'll put these in one 'tidy' dataframe
pca_labelled <- data.frame(pca$x, species = seeds$species)
# and then do a scatterplot
ggplot(pca_labelled, aes(x = PC1, y = PC2, colour = species)) +
  geom_point()
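Optionally, the axes can be labelled with the variance each component explains, computed from pca$sdev as in the scree plot above:

# add the % variance explained to the axis labels
percent <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
ggplot(pca_labelled, aes(x = PC1, y = PC2, colour = species)) +
  geom_point() +
  labs(x = paste0("PC1 (", percent[1], "%)"),
       y = paste0("PC2 (", percent[2], "%)"))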
LDA
Linear Discriminant Analysis (LDA) aims to find linear combinations of variables that maximise differences between groups. It is supervised because we label observations by their class and determine the allocation rules based on these. A ‘discriminant’ is a line that separates groups. As there are three classes we have at most two linear discriminants. The lda() function is in a package called MASS. I very rarely load MASS with a library command since it has a function called select that conflicts with dplyr’s select(). Thus I will use MASS:: to access its functions.
ldaout <- seeds %>%
  select(-species) %>%
  MASS::lda(grouping = seeds$species)
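If you want to see how the original variables contribute to each discriminant, the coefficients are stored in the $scaling element of the result:

# coefficients of the linear discriminants
ldaout$scaling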
To determine how well the species were predicted, the scores on LD1 and LD2 and the classes need to be predicted:
pldaout <- seeds %>%
  select(-species) %>%
  predict(object = ldaout)
How many of the predicted classes are the same as the actual classes:
table(predicted = pldaout$class, observed = seeds$species)
## observed
## predicted Canadian Kama Rosa
## Canadian 67 3 0
## Kama 3 66 0
## Rosa 0 1 70
How well do you think the species were predicted?
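One way to quantify this is the overall accuracy, the proportion of matching classes (from the table above, 203 of 210, about 97%):

# proportion of kernels classified correctly
mean(pldaout$class == seeds$species)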
We can create a scatter plot of LD1 and LD2 just like we did for PC1 and PC2:
# we add the original species labels and the predicted classes
ldaout_labelled <- data.frame(pldaout$x,
                              observed_species = seeds$species,
                              predicted_species = pldaout$class)
ldaout_labelled$match <- ldaout_labelled$observed_species == ldaout_labelled$predicted_species
# and then do a scatterplot
ggplot(ldaout_labelled, aes(x = LD1, y = LD2,
                            colour = observed_species,
                            shape = match)) +
  geom_point()
Do you think the LDA gives better separation of groups than PCA?
Training and Testing
We used the same data to train the LDA model as we used to examine its performance. Only seven cases were incorrectly classified. But this isn’t very robust - we could have overfitting.
A key part of using ML methods to make predictions is to test how good those predictions are. This is typically done by training the model on about 75% of your data and then testing it on the remainder.
The caret package includes functions to help with this (as well as lots of ML algorithms). The name comes from Classification And REgression Training.
library(caret)
Split the dataset into training and testing sets using createDataPartition(). It returns a proportion of row numbers randomly sampled from the dataframe. Since it returns a list (of one item) I’ve added [[1]] to select that list element so I have a vector to work with.
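Note that the partition is random, so your split (and the numbers below) will differ from run to run. If you want a reproducible split, set a seed first; 42 here is an arbitrary choice:

# make the random partition reproducible (any seed value will do)
set.seed(42)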
ids <- createDataPartition(y = seeds$species, p = 0.75)[[1]]
str(ids)
## int [1:159] 1 2 3 4 6 8 9 10 11 12 ...
Now use those row numbers to create the training and test datasets.
train <- seeds[ids,] # those rows only
test <- seeds[-ids,] # without those rows
Perform the LDA on the training data and test on the test data.
# train the model on the training set only
ldaout <- train %>%
  select(-species) %>%
  MASS::lda(grouping = train$species)
And predict the classes of the test data based on the LDA model:
pldaout <- test %>%
  select(-species) %>%
  predict(object = ldaout)
Evaluate the model performance by comparing the predicted classes to the actual classes.
# compare the predicted classes to the actual classes
table(predicted = pldaout$class, observed = test$species)
## observed
## predicted Canadian Kama Rosa
## Canadian 16 1 0
## Kama 1 16 0
## Rosa 0 0 17
I got 2 misclassified. You may have had more or fewer.
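As with the full dataset, the accuracy on the test set can be computed directly (with my split, 49 of 51, about 96%):

# proportion of test kernels classified correctly
mean(pldaout$class == test$species)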