Overview

In this section, we’ll start with some key skills for loading datasets into R and performing basic manipulations on them. Then we’ll cover some descriptive statistics to summarise patterns in your data, before moving on to some vowel-specific visualisation techniques for generating vowel plots. At the end, we’ll work through a case study on GOOSE-fronting to put these skills into practice.

The dataset we’ll be working with contains formant values from five force-aligned sociolinguistic interviews, extracted automatically using FAVE-extract.


1 Installing and loading packages

The first thing we need to do is install and then load the tidyverse set of R packages, which provides us with lots of extra functionality. You only need to install it once; after that, we can simply load it into the workspace using the library() function each time we open a new R session.
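For reference, the install-and-load step looks something like this (the install line is commented out because it only ever needs running once):

```r
# install the tidyverse (only needs doing once)
# install.packages("tidyverse")

# load it into the workspace at the start of each R session
library(tidyverse)
```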

You can read more about the tidyverse here, if you’re interested.


2 Loading in data

Now let’s load in our vowel data. You can download the datafile here: workshop_vowels.csv

Since it’s in comma-separated format, you can read it in using the read_csv() function (if it were in tab-delimited txt format we might use read_delim() instead). Let’s assign it to an object called vowels. You can also use ‘=’ as an assignment operator, but I recommend ‘<-’ because you don’t want to get it confused with ==, which is a logical operator used for comparing two values (try typing "apples" == "oranges" into the console to test this out).
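Assuming the file is saved in your working directory, the call would be something like:

```r
# read in the comma-separated datafile and assign it to an object called 'vowels'
vowels <- read_csv("workshop_vowels.csv")
```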

Once we’ve loaded the data in, our new vowels dataframe should appear in the Environment window in the top right of your RStudio window. You can click on its name to view the dataframe in a spreadsheet-like format, but you can also take a look at the columns it contains by using colnames():
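The call itself is just:

```r
colnames(vowels)
```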

##  [1] "target_id" "speaker"   "sex"       "dob"       "city_born"
##  [6] "lexset"    "word"      "start"     "end"       "pre_seg"  
## [11] "fol_seg"   "F1"        "F2"        "F3"

And take a peek at the first six rows of data using head():
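```r
head(vowels)
```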

## # A tibble: 6 x 14
##   target_id speaker sex     dob city_born lexset word  start   end pre_seg
##       <dbl> <chr>   <chr> <dbl> <chr>     <chr>  <chr> <dbl> <dbl> <chr>  
## 1         1 FredN   M      1955 Manchest… LOT    salf…  5.04  5.12 S      
## 2         2 FredN   M      1955 Manchest… LOT    not   18.8  18.9  N      
## 3         3 FredN   M      1955 Manchest… KIT    in    19.5  19.5  T      
## 4         4 FredN   M      1955 Manchest… FACE   they  20.2  20.2  DH     
## 5         5 FredN   M      1955 Manchest… KIT    think 21.4  21.4  TH     
## 6         6 FredN   M      1955 Manchest… MOUTH  about 24.9  25.0  B      
## # … with 4 more variables: fol_seg <chr>, F1 <dbl>, F2 <dbl>, F3 <dbl>

Or a random sample by using sample_n() (the second argument tells R how many randomly-selected rows you want to print):
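For example, to print 10 random rows:

```r
sample_n(vowels, 10)
```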

## # A tibble: 10 x 14
##    target_id speaker sex     dob city_born lexset word  start   end pre_seg
##        <dbl> <chr>   <chr> <dbl> <chr>     <chr>  <chr> <dbl> <dbl> <chr>  
##  1      2734 GraceG  F      1994 Manchest… LOT    obvi…  352.  352. AE1    
##  2      5358 LillyR  F      1907 Manchest… KIT    been   832.  832. B      
##  3       543 FredN   M      1955 Manchest… FLEECE peop…  942.  942. P      
##  4      9943 WillowA F      1995 Blackburn TRAP   bang… 2907. 2907. B      
##  5      1918 FredN   M      1955 Manchest… KIT    lit   3126. 3126. L      
##  6      1125 FredN   M      1955 Manchest… DRESS  twen… 1874. 1874. W      
##  7      2631 GraceG  F      1994 Manchest… PRICE  i      155.  155. SP     
##  8       384 FredN   M      1955 Manchest… NORTH  foot…  656.  656. B      
##  9      3268 GraceG  F      1994 Manchest… GOAT   loads 1417. 1417. L      
## 10      5430 LillyR  F      1907 Manchest… PRICE  mine  1012. 1012. M      
## # … with 4 more variables: fol_seg <chr>, F1 <dbl>, F2 <dbl>, F3 <dbl>

3 Data wrangling

3.1 Adding and removing columns

Now that our data is loaded in, we might want to make some adjustments before conducting any analysis.

Let’s first create some new columns with things we might be interested in. To add a new column (or change an existing column), we use mutate(). Note that in the code below, we use %>% to ‘pipe’ together multiple lines of code. This is a really nice way of structuring your code, and we’ll be using this a lot throughout the workshop. The chunk of code below basically says:

“take the vowels dataframe, input this to the mutate() command to create a new column called duration, which is just the value in the start column subtracted from the value in the end column”

Note that by preceding all of this code with vowels <-, it means we save these changes back to the original vowels dataframe.
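Putting that together, the chunk would look something like this:

```r
# create a new duration column (end time minus start time)
# and save the result back to the vowels dataframe
vowels <- vowels %>%
  mutate(duration = end - start)
```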

We can use the select() function to keep just the columns we’re interested in, and drop the others from our dataframe. Now that we’ve used the start and end columns to calculate the duration of each vowel, we don’t really need them anymore so let’s drop them.

To do this, we just need to list the column names we want to drop and precede each with -.
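Something like:

```r
# drop the start and end columns, keeping everything else
vowels <- vowels %>%
  select(-start, -end)
```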

Now let’s rename some of these columns to make our life easier down the line (you don’t want to be constantly typing out a long column name every time you need to refer to it). We can use rename() for this. Let’s change target_id to id, and city_born to location:
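For example:

```r
# rename() takes the form new_name = old_name
vowels <- vowels %>%
  rename(id = target_id,
         location = city_born)
```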

Ok, let’s take another look at our final dataset:

## # A tibble: 6 x 13
##      id speaker sex     dob location lexset word  pre_seg fol_seg    F1
##   <dbl> <chr>   <chr> <dbl> <chr>    <chr>  <chr> <chr>   <chr>   <dbl>
## 1     1 FredN   M      1955 Manches… LOT    salf… S       L        544.
## 2     2 FredN   M      1955 Manches… LOT    not   N       T        570.
## 3     3 FredN   M      1955 Manches… KIT    in    T       N        446 
## 4     4 FredN   M      1955 Manches… FACE   they  DH      W        360 
## 5     5 FredN   M      1955 Manches… KIT    think TH      NG       362.
## 6     6 FredN   M      1955 Manches… MOUTH  about B       T        620.
## # … with 3 more variables: F2 <dbl>, F3 <dbl>, duration <dbl>

Much better!


3.2 Re-coding columns

Recall that we used mutate() earlier to create a new column for duration. The values there were relatively simple to calculate: just the difference between two columns already in the data, the start and end times of each vowel.

In more complicated scenarios, we can use case_when() either to make a new column based on some existing variable, or to re-code an existing column. Let’s say we want to create a new column, age.group, containing a binary measure of age to complement the existing continuous measure in the dob column. We can do this like so:
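A sketch of this step - note that the exact cutoff year here is my assumption; any year that separates the older speakers (born in the 1900s-1950s) from the younger ones (born in the 1990s) would do:

```r
# code speakers born before 1980 as 'older', and everyone else as 'younger'
vowels <- vowels %>%
  mutate(age.group = case_when(dob < 1980 ~ 'older',
                               dob >= 1980 ~ 'younger'))
```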

Let’s take a look at 10 random rows of data to make sure it’s worked:
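For example:

```r
vowels %>%
  select(speaker, dob, age.group) %>%
  sample_n(10)
```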

## # A tibble: 10 x 3
##    speaker   dob age.group
##    <chr>   <dbl> <chr>    
##  1 FredN    1955 older    
##  2 FredN    1955 older    
##  3 WillowA  1995 younger  
##  4 GraceG   1994 younger  
##  5 FredN    1955 older    
##  6 WillowA  1995 younger  
##  7 GraceG   1994 younger  
##  8 HenryM   1954 older    
##  9 WadeT    1991 younger  
## 10 WillowA  1995 younger

Great! Now what if we wanted to create a new column categorising each lexical set as either a diphthong or monophthong? Let’s find out what vowels are actually in the dataset first.

We can get this information relatively easily using the unique() command. If we pass a column (which is essentially just a vector of values) into this command, it’ll print out only the unique values. The chunk of code below pipes together a few things:

"take the entire vowels dataframe, ‘pull’ the lexset column out of it, and then print out the unique values in this column

##  [1] "LOT"    "KIT"    "FACE"   "MOUTH"  "NURSE"  "GOAT"   "GOOSE" 
##  [8] "DRESS"  "NORTH"  "STRUT"  "CHOICE" "FLEECE" "PALM"   "TRAP"  
## [15] "PRICE"  "FOOT"

Ok, looks like our diphthongs are FACE, GOAT, MOUTH, CHOICE and PRICE.

One option would be to include a line in our case_when() command that goes something like this: lexset == 'FACE' | lexset == 'GOAT' | lexset == 'MOUTH' ..., which basically means:

lexset is equal to ‘FACE’, or lexset is equal to ‘GOAT’, or lexset is equal to ‘MOUTH’, and so on…

Needless to say, there are quicker and more efficient ways to combine these conditional statements! We can instead use the %in% operator alongside a vector of values that we want to check the column against:
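A sketch of this approach, assuming we call the new column type:

```r
vowels <- vowels %>%
  mutate(type = case_when(
    lexset %in% c('FACE', 'GOAT', 'MOUTH', 'CHOICE', 'PRICE') ~ 'diphthong',
    !lexset %in% c('FACE', 'GOAT', 'MOUTH', 'CHOICE', 'PRICE') ~ 'monophthong'))
```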

In fact, we can make this even more efficient. Once we’ve listed the diphthongs, we can set the second line to TRUE ~ 'monophthong', which basically means “code everything else as monophthong”:
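Which gives us something like this:

```r
vowels <- vowels %>%
  mutate(type = case_when(
    lexset %in% c('FACE', 'GOAT', 'MOUTH', 'CHOICE', 'PRICE') ~ 'diphthong',
    TRUE ~ 'monophthong'))   # everything else gets coded as a monophthong
```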

Exercise

Using the tools we’ve covered so far, make a new variable categorising the values in the pre_seg column into either coronal, velar, or other based on their place of articulation.

If you’re not familiar with the coding scheme used in the pre_seg column, where sounds have been transcribed in ARPAbet, you can find IPA translations here (look in the 2-letter columns).


🤔Stuck? Solution here


4 Summary statistics

Before we move on to visualisation, we might want to conduct some basic summary statistics to describe our data.

Let’s say we want to calculate things like the average F1/F2 for our vowels. We can start off looking at the mean F2 of the GOOSE vowel using a combination of filter(), pull(), and mean(). Note the double equals in the filter command!
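Something along these lines:

```r
vowels %>%
  filter(lexset == 'GOOSE') %>%   # == for comparison, not = for assignment!
  pull(F2) %>%
  mean()
```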

## [1] 1683.275

Now we could do the same for F1, and then repeat this for each vowel category in our dataset, but we don’t have all day. Luckily, the powerful tidyverse set of packages lets us summarise our data really easily, using a combination of group_by() and summarise().

By specifying a column in the group_by() command, we tell R to temporarily split the dataframe into separate groups for each unique value in that column. This means that, when combined with summarise(), we can perform summary statistics for individual vowel categories rather than aggregating over the entire dataset.

The code below will produce a summary table with columns F1.avg and F2.avg, telling us the mean F1 and F2 for each vowel category:
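A sketch of that summary:

```r
vowels %>%
  group_by(lexset) %>%
  summarise(F1.avg = mean(F1),
            F2.avg = mean(F2))
```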

## # A tibble: 16 x 3
##    lexset F1.avg F2.avg
##    <chr>   <dbl>  <dbl>
##  1 CHOICE   576.  1180.
##  2 DRESS    601.  1576.
##  3 FACE     536.  1785.
##  4 FLEECE   423.  2089.
##  5 FOOT     453.  1299.
##  6 GOAT     545.  1249.
##  7 GOOSE    412.  1683.
##  8 KIT      482.  1835.
##  9 LOT      628.  1200.
## 10 MOUTH    686.  1388.
## 11 NORTH    570.  1116.
## 12 NURSE    535.  1566.
## 13 PALM     707.  1248.
## 14 PRICE    724.  1424.
## 15 STRUT    529.  1278.
## 16 TRAP     684.  1545.

We can take this one step further by adding even more information, such as the standard deviation of F1 and F2 using sd(), or the number of tokens in each vowel category using length().
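For example:

```r
vowels %>%
  group_by(lexset) %>%
  summarise(F1.avg = mean(F1),
            F2.avg = mean(F2),
            F1.sd = sd(F1),
            F2.sd = sd(F2),
            n = length(F1))   # the number of rows (tokens) in each group
```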

## # A tibble: 16 x 6
##    lexset F1.avg F2.avg F1.sd F2.sd     n
##    <chr>   <dbl>  <dbl> <dbl> <dbl> <int>
##  1 CHOICE   576.  1180. 111.   225.    31
##  2 DRESS    601.  1576. 122.   259.  1152
##  3 FACE     536.  1785. 107.   273.   624
##  4 FLEECE   423.  2089.  87.8  322.   750
##  5 FOOT     453.  1299.  88.2  343.   184
##  6 GOAT     545.  1249. 109.   239.   732
##  7 GOOSE    412.  1683.  65.5  377.   498
##  8 KIT      482.  1835.  90.9  277.  1267
##  9 LOT      628.  1200. 121.   192.   524
## 10 MOUTH    686.  1388. 117.   229.   285
## 11 NORTH    570.  1116. 114.   206.   392
## 12 NURSE    535.  1566. 103.   243.   167
## 13 PALM     707.  1248. 124.   229.   101
## 14 PRICE    724.  1424. 143.   212.  1251
## 15 STRUT    529.  1278. 115.   264.  1083
## 16 TRAP     684.  1545. 139.   275.   959

Exercise

Calculate the following:

  • average duration of monophthongs and diphthongs for men and women separately
  • average F1 of the STRUT vowel for old and young speakers in Manchester and Blackburn separately

🤔Stuck? Solution here


5 Visualisation

Ok, it’s time for the really exciting part now! We can make vowel plots using the ggplot2 package, which should already be loaded as part of the tidyverse. The ‘gg’ in ggplot stands for grammar of graphics, and it’s an extremely powerful tool for data visualisation.

Before we get started making any actual plots, let’s change the default ggplot theme to theme_minimal(). Note that this isn’t mandatory - unless, like me, you have a strong aversion to ggplot’s default grey background:
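```r
# set the default theme for all subsequent plots
theme_set(theme_minimal())
```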

ggplots are built up in layers. If you start off just by running ggplot() on its own, you’ll see a blank graph is created in the Plots tab to the right of your RStudio window.

Now let’s start building this up bit by bit. The next (and most important!) step is to add some data. We can ‘pipe’ our dataframe into the ggplot() function using the %>% operator we saw earlier. We also need to specify which values we want mapped onto the X and Y axes, which we do inside aes().
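Something like:

```r
# F2 on the x axis and F1 on the y axis, as is conventional for vowel plots
vowels %>%
  ggplot(aes(x = F2, y = F1))
```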

Ok, we’ve now got some axes! But where’s our data? This is where different geom types come in. We need to tell R how to plot our data. Do we want a boxplot? Do we want lines? Polygons? A pie chart? (the answer to that last one is always no!)

Let’s start off by plotting each vowel as a single point on these F1/F2 dimensions. We can use geom_point() for this:
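```r
vowels %>%
  ggplot(aes(x = F2, y = F1)) +
  geom_point()
```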

Hm, it looks more like a swarm of bees than a vowel plot. Let’s colour-code each point based on the vowel:
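```r
# map the lexset column onto point colour
vowels %>%
  ggplot(aes(x = F2, y = F1, colour = lexset)) +
  geom_point()
```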

Better! Of course, we know that in formant plots the F1 and F2 axes should be reversed, so that the (0,0) point is in the top-right corner. We can flip the axes by adding scale_x_reverse() and scale_y_reverse() to our plot:
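```r
vowels %>%
  ggplot(aes(x = F2, y = F1, colour = lexset)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
```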

It’s looking a little crowded, so from now on let’s just plot the monophthongs. Rather than including a filter(type == 'monophthong') line every time we want to plot our data, let’s just make a separate, filtered-down dataframe:
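```r
# keep only the monophthongs, using the 'type' column we created earlier
vowels.mon <- vowels %>%
  filter(type == 'monophthong')
```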

Now we have a new dataframe vowels.mon, with fewer lexical sets to work with.

Exercise

‘lexset’ isn’t a particularly reader-friendly title for our legend, so let’s change it. To do this, you need to add a layer scale_colour_discrete().

To get an idea of what arguments you can specify for a particular command, you can check the help section for each command by typing its name, preceded by a ?, in the console below, i.e. ?scale_colour_discrete


🤔Stuck? Solution here


5.1 Averages

Another solution to the massive over-plotting issue is to just plot the average F1/F2 of each lexical set instead. Recall we can use the group_by() and summarise() functions we saw earlier for this:
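A sketch, assuming we want averages for the monophthongs only (to match the filtered data):

```r
vowel.avgs <- vowels.mon %>%
  group_by(lexset) %>%
  summarise(F1.avg = mean(F1),
            F2.avg = mean(F2))
```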

Now that we’ve got the averages in a vowel.avgs dataframe, we can plot them in the same way. Since we’re only plotting single-point averages though, we should make the points a bit bigger to stand out. We can do this by including a size argument inside the geom_point() function. Note that we don’t need to include this inside aes() like we do for the x/y/colour arguments - this is because the values for those latter three are based on columns in the dataframe, whereas for the size argument we’re just setting a static value.
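For example, with a fairly arbitrary choice of point size:

```r
vowel.avgs %>%
  ggplot(aes(x = F2.avg, y = F1.avg, colour = lexset)) +
  geom_point(size = 5) +   # a static value, so it sits outside aes()
  scale_x_reverse() +
  scale_y_reverse()
```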

Ideally, we might plot the averages overlaid on top of the individual vowel tokens. We can do this by including two separate geom_point() layers, one that inherits the data piped into the original ggplot() function and one where we specify an extra dataset - in this case the vowel.avgs dataframe.
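Something like the following (the alpha and size values here are just illustrative):

```r
vowels.mon %>%
  ggplot(aes(x = F2, y = F1, colour = lexset)) +
  geom_point(alpha = 0.5) +                                               # individual tokens
  geom_point(data = vowel.avgs, aes(x = F2.avg, y = F1.avg), size = 5) +  # overlaid averages
  scale_x_reverse() +
  scale_y_reverse()
```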

Rather than use two geom_point() layers, let’s introduce another geometric type for the averages: geom_label(). As the name suggests, this allows us to plot our data as text labels. You still need to provide the usual x and y arguments, but for this type of layer we also need to specify what the data points should be labelled with, using label.

There’s a lot going on in the code below, so I’ll break it down:

  • ggplot() inherits the vowels.mon dataframe that we pipe into it, and sets the x, y, and colour arguments
  • we plot each row of this data as a geom_point() layer, where we also set the alpha level (i.e. opacity) and size of these points to 0.5
  • we plot the vowel.avgs dataframe as a geom_label() layer overlaid on this, specifying the x, y, and label arguments. Note that we don’t have to specify colour again, as this layer will inherit the colour specification from the ggplot() line - it won’t inherit the x and y values though, because the columns in this dataframe are called F1.avg and F2.avg rather than just F1 and F2
  • we reverse both the x and y axes
  • we remove the legend, which is redundant now that we have labelled averages
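Putting those steps together, the chunk would look something like this (labelling each average with its lexical set name):

```r
vowels.mon %>%
  ggplot(aes(x = F2, y = F1, colour = lexset)) +
  geom_point(alpha = 0.5, size = 0.5) +
  geom_label(data = vowel.avgs, aes(x = F2.avg, y = F1.avg, label = lexset)) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme(legend.position = "none")   # drop the now-redundant legend
```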


5.2 Distributions

Another option is to not plot individual points at all, but rather to plot their distribution in the form of an ellipse.

In the code below, all we’ve done is replace the geom_point() layer with stat_ellipse(). Note how the arguments for this layer have been specified:

  • the stat_ellipse() layer will inherit the colour specification from the overarching ggplot() call in the line above, but these ellipses (in addition to things like boxplots) also have an optional fill argument. Generally speaking, we set the colour of the border using colour, and the colour of the shape itself with fill
  • we also set the geom type to ‘polygon’ (otherwise we wouldn’t be able to fill in the shape with a colour), and the alpha level to 0.3. Note that for both of these, we can specify them outside of aes(), because we’re using static values. For fill, however, we’re setting the value based on whatever is in the lexset column, so we have to specify this inside aes()
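So the chunk would look something like this (keeping the labelled averages from before):

```r
vowels.mon %>%
  ggplot(aes(x = F2, y = F1, colour = lexset)) +
  stat_ellipse(aes(fill = lexset), geom = 'polygon', alpha = 0.3) +
  geom_label(data = vowel.avgs, aes(x = F2.avg, y = F1.avg, label = lexset)) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme(legend.position = "none")
```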

Exercise

By default, stat_ellipse() will plot an ellipse that contains 95% of the data for that particular distribution (sort of similar to a 95% confidence interval).

Try and change this to a lower value, such as 68% (or even 10%!) to see how this influences the plot. Don’t forget you can check the help page by running ?stat_ellipse


🤔Stuck? Solution here


5.3 Faceting

Sometimes we might want to generate multiple vowel plots for a group of speakers. Our dataset only contains 5 different speakers, so we could just reuse the existing code from earlier and add an extra line along the lines of filter(speaker == 'SPEAKER_NAME_GOES_HERE').

However, it’s much more efficient to instead include a facet term in your code. If we add facet_wrap(~speaker), ggplot will generate a separate plot for each unique value in the speaker column.

Of course if we’re overlaying the averages from the vowel.avgs dataframe, we would first need to calculate speaker-specific averages, so let’s do that first. We can re-use the earlier code for this, simply adding speaker as an additional grouping variable in group_by():
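For instance (here I’m overwriting vowel.avgs, but you could just as easily save it under a new name):

```r
vowel.avgs <- vowels.mon %>%
  group_by(speaker, lexset) %>%   # group by speaker as well as lexical set
  summarise(F1.avg = mean(F1),
            F2.avg = mean(F2))
```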

Now we can reproduce the same plot as before, but with an extra line at the bottom for our facet term:
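```r
vowels.mon %>%
  ggplot(aes(x = F2, y = F1, colour = lexset)) +
  stat_ellipse(aes(fill = lexset), geom = 'polygon', alpha = 0.3) +
  geom_label(data = vowel.avgs, aes(x = F2.avg, y = F1.avg, label = lexset)) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme(legend.position = "none") +
  facet_wrap(~speaker)   # one panel per speaker
```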

Discussion

Compare the vowel spaces between each speaker - what do you notice, and what do these inter-speaker differences call for?


6 Normalisation

That’s right! Comparing the vowel spaces of HenryM and GraceG in particular highlights the need for normalisation. We can’t make any reliable inter-speaker comparisons based on raw formant frequencies, because female speakers tend to have much higher resonance frequencies than male speakers. This is why the vowel space looks so condensed for speakers like FredN and HenryM (and, to a lesser extent, WadeT).

There’s a really quick and easy way of conducting normalisation in R using the scale() function, which allows us to scale a given formant value relative to each speaker’s average frequency. Let’s create new normalised F1 and F2 columns using a combination of group_by(), mutate(), and scale():
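A sketch of this step, applied here to the monophthong dataframe (you could equally apply it to the full vowels dataframe):

```r
vowels.mon <- vowels.mon %>%
  group_by(speaker) %>%                     # scale within each speaker
  mutate(F1.norm = as.numeric(scale(F1)),   # z-score: (value - speaker mean) / speaker sd
         F2.norm = as.numeric(scale(F2))) %>%
  ungroup()
```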

The new columns F1.norm and F2.norm now contain z-scored formants instead of raw frequencies. Let’s now generate the same plot as before, but using our scaled formants instead:
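Something like the following, recomputing the speaker-specific averages on the normalised values first (vowel.avgs.norm is just a name I’ve made up for this sketch):

```r
# speaker-specific averages of the normalised formants
vowel.avgs.norm <- vowels.mon %>%
  group_by(speaker, lexset) %>%
  summarise(F1.avg = mean(F1.norm),
            F2.avg = mean(F2.norm))

vowels.mon %>%
  ggplot(aes(x = F2.norm, y = F1.norm, colour = lexset)) +
  stat_ellipse(aes(fill = lexset), geom = 'polygon', alpha = 0.3) +
  geom_label(data = vowel.avgs.norm, aes(x = F2.avg, y = F1.avg, label = lexset)) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme(legend.position = "none") +
  facet_wrap(~speaker)
```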

This isn’t the only method of vowel normalisation available in R. If you install the phonR package, you can try out other methods such as Bark, Mel, Lobanov, Nearey, and Watt-Fabricius. We won’t cover this package in this workshop, but you can read about it here and of course there are the usual package vignettes for specific help with each function.


7 Case study

Exercise

Now that we’ve covered all of the key tools in analysing and plotting vowel formant data in R, let’s try a little case study exploring GOOSE-fronting:

  • plot the distribution of only FLEECE and GOOSE tokens (including averages!) for each speaker to establish the degree of overlap between these categories

  • plot just the F2 of GOOSE by date of birth or age group to establish if we have evidence of apparent-time change - you might want to try a boxplot for this (hint: it’s geom_boxplot())

  • make a plot of all GOOSE tokens colour-coded by whether or not the following segment is /l/ - you might want to create a new column for this using case_when(). What do the results suggest?

  • make a plot of all GOOSE tokens colour-coded by whether the preceding segment is coronal or velar - you should have already made this column in Section 3.2 earlier. Does the preceding segmental environment also have an effect on the realisation of GOOSE?


🤔Stuck? Solution here