All data are from unpublished research by Katie E. Davis & Alex Payne.
Our biological question is how the diversification of Pseudosuchia was shaped by environmental changes in the geological past.
You will do this by testing for significant correlations between the speciation rate of Pseudosuchia and global environmental change through geological time.
Over the course of the four workshops you will be guided towards testing this hypothesis.
The purpose of this workshop is to learn how to conduct correlation analyses between time series, plot the output and test for statistical significance.
For this workshop you will not be exploring any new data; you will just need the saved output from the last workshop.
###DATAMUNGE###
When you see these tags, anything between them means that we’re either altering the data into the required format or working out parameters that are very specific to phylogenetic trees. This is beyond what I would expect you to be able to code yourself, so do not worry about it. You will not be expected to code anything this complex for your report analyses, so just use it when necessary and try not to overthink it! You might want to rename some variables to avoid over-writing, but that is all. The novel aspect of your report should come from the analyses you choose to run and how you choose to present your results.
###/DATAMUNGE###
You will need to load the data you saved at the end of the last workshop.
load("Crocs_Workshop3.RData")
We have one library we need to load.
library(ggplot2)
First we’re going to load a function that will be used to carry out the correlation analyses. Like the data munges, you don’t need to worry about this other than to understand why we use it.
This is a DCCA-based (Detrended Cross-Correlation Analysis) test, which calculates the correlation coefficient between two time series. Before it does this it detrends the data. For example, temperature tends to have a long-term trend: there has been overall global cooling over the last 66 MYR. Such trends can lead to spurious correlations, so we need to account for them in the analysis. This won’t stop our analyses from revealing correlations that are real, but it will prevent long-term trends from dominating.
DCCA <- function(x, y, s){
  # Integrate (cumulatively sum) both series
  xx <- cumsum(x)
  yy <- cumsum(y)
  t <- 1:length(xx)
  # Pre-allocate one slot per window for the detrended covariances/variances
  nWindows <- floor(length(xx)/s)
  F2sj_xy <- numeric(nWindows)
  F2sj_xx <- numeric(nWindows)
  F2sj_yy <- numeric(nWindows)
  # Slide a window of width s along the series; within each window fit a
  # linear trend and keep the residuals: this is the detrending step
  for(ss in seq(1, nWindows*s, by=s)){
    resx <- summary(lm(xx[ss:(ss+s-1)] ~ t[ss:(ss+s-1)]))$residuals
    resy <- summary(lm(yy[ss:(ss+s-1)] ~ t[ss:(ss+s-1)]))$residuals
    F2sj_xy[(ss-1)/s+1] <- sum(resx*resy)/(s-1)
    F2sj_xx[(ss-1)/s+1] <- sum(resx*resx)/(s-1)
    F2sj_yy[(ss-1)/s+1] <- sum(resy*resy)/(s-1)
  }
  # Correlation coefficient: detrended covariance scaled by the detrended variances
  rho <- mean(F2sj_xy)/sqrt(mean(F2sj_xx)*mean(F2sj_yy))
  # Return rho, its standard error, and a one-tailed p-value
  return(c(rho, 1/sqrt(length(xx)), 1-pnorm(abs(rho), mean=0, sd=1/sqrt(length(xx)))))
}
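To get a feel for what the function returns, here is a quick toy call on two made-up series (purely illustrative, nothing to do with the workshop data). DCCA returns three values: the correlation coefficient rho, its standard error, and a one-tailed p-value.
# Toy example: b is correlated with a by construction, window size 10
a <- rnorm(100)
b <- a + rnorm(100)
DCCA(a, b, 10)
# Returns c(rho, standard error, p-value); expect a rho clearly above zero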
###DATAMUNGE###
The subtree gets shifted to an arbitrary zero, so we need to reverse time and shift it to the correct node age; otherwise it will project into the future instead of starting in the past!
# Here we flip the times by subtracting the maximum from each time and taking the absolute value to remove the minus sign, so a time of 0 becomes 228 and a time of 228 becomes 0.
times = abs(rtt_T$times-max(rtt_T$times))
# Calculate number of simulations in our diversification data
numberOfSims = length(rtt_T$lambda)/length(rtt_T$times)
###/DATAMUNGE###
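It’s worth a quick sanity check that the munge did what we expect: the flipped times should now run from the oldest value down towards zero, and numberOfSims should come out at 9,001.
# The first few times should now be the oldest (largest) values
head(times)
# Check the number of BAMM simulations
numberOfSims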
Now let’s get started with our analyses!
Remember that we have two sets of environmental data - temperature & sea level - so we’ll be running correlation analyses for both.
We’re not going to run the full correlation with all 9,001 BAMM simulations due to time constraints, so we’ll set the number of samples to 100. If you have the time and/or enough computational power you can try increasing this, as more samples give a more reliable picture of the distribution. Consider increasing it for your report.
numberOfSamples = 100
Now we need somewhere to store the correlation coefficient from each sampled simulation. We create vectors of NAs, i.e. ‘empty’ slots ready for the correlation coefficients to be stored.
cors_temp_T <- rep(NA, numberOfSamples)
cors_seaLevel_T <- rep(NA, numberOfSamples)
Now we are setting a random seed. This isn’t essential for the code, but we’re doing it to make sure that you all get the same results for the workshops; otherwise it’ll be harder for you to compare what you’re doing with your peers. Once you’ve got the correlations working you could try changing the seed or commenting it out to see how it changes your results.
set.seed(1)
Now we’re generating the random sample (of the 100 we set a couple of lines back), drawn without replacement from the 9,001 simulations that we have in total.
samples = sample(1:numberOfSims, numberOfSamples, replace = FALSE)
Because we’re only looking at a sample, we need to keep track of how many correlations we’ve done while we’re in the loop, so we start a counter at 1.
count = 1
Now let’s run the actual correlations, starting with the correlation between global temperature & speciation rates.
Because we want to run 100 correlations we set up a for loop.
# This loops from 1 to the number of simulations, which is 9,001.
for (i in 1:numberOfSims ) {
# Is it one of our samples?
if (i %in% samples){
# If yes, do the correlation.
# Start by interpolating the data. We do this because the two time series are different lengths; we need them to start and end at the same times, and we need the points in between to match up, in order to carry out the correlation.
# This line takes our speciation rates (lambda) and the corresponding times and interpolates the lambdas onto the temperature times. We can only run a correlation for the temperature data that we have, so any lambda times beyond the temperature series cannot be used.
interpdiv = approx(times, rtt_T$lambda[i,], temperature$V1, method='linear', rule=1)
# Here we check whether there is a lambda for every temperature time, if not it's left as NA
end = which(is.na(interpdiv$y))
# If we have no NAs, i.e. there is a lambda value for every temperature time, we just use the interpolation as already calculated
if (length(end) == 0) {
div_rates = interpdiv$y
ft = finaltemp
# Otherwise, we only grab and use the times that have both lambda and temperature data
} else {
div_rates = interpdiv$y[-end]
ft = finaltemp[-end]
}
# Now do the correlation using the interpolated data
c = DCCA(as.numeric(unlist(div_rates)),as.numeric(unlist(ft)),length(ft)/10)
# Store the correlation coefficient
cors_temp_T[count] = c[1]
# Increase your count by 1 ready for the next correlation
count = count+1
}
}
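If the interpolation step feels opaque, here is a minimal standalone example of how approx() behaves (toy numbers, nothing to do with our data). With rule=1, any requested point that falls outside the range of the known times comes back as NA, which is exactly what the NA check inside the loop handles.
# Toy example of linear interpolation with approx()
x_known <- c(0, 10, 20)   # times where we have values
y_known <- c(1, 2, 3)     # the values at those times
approx(x_known, y_known, c(5, 15, 25), method='linear', rule=1)
## $y is 1.5, 2.5, NA: 25 lies outside 0-20, so rule=1 gives NA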
Now let’s plot the correlation coefficients as a histogram.
plot=qplot(cors_temp_T, geom="histogram", bins=30)
Let’s take a look.
plot
Let’s make it a bit nicer (you can change this to make it even nicer! - look up the ggplot2 library for options).
plot2 = plot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))
How is it looking?
plot2
Now save your plot. We can use ggplot2’s ggsave() as an alternative to the pdf() function, as ggplot2 can be used to make much more complex plots.
ggsave(plot2,file="SpeciationTerrestrialTemperature.pdf")
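If you want more control over the size of the saved figure, ggsave() also accepts explicit dimensions. The values below are just an illustration; pick whatever suits your report.
# The same plot saved at an explicit size of 8 x 6 inches
ggsave(plot2, file="SpeciationTerrestrialTemperature.pdf", width=8, height=6, units="in")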
How does your plot look? Does it show anything interesting?
We can’t just trust a histogram though, so now we’ll calculate some statistics, using the sink() function to write them all to the same file. We will carry out a one-sample Wilcoxon signed-rank test of whether the median of the distribution of correlation coefficients is significantly different from zero.
# Set file name
sink(file="SpeciationTerrestrialTemperatureStats.txt")
# Calculate and print to file some summary statistics, e.g., median
print(summary(cors_temp_T))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005345 0.187668 0.282985 0.267077 0.335030 0.544983
# Calculate and print the 95% confidence intervals
print(quantile(cors_temp_T, c(0.025, 0.975)))
## 2.5% 97.5%
## 0.06199339 0.42821211
# Carry out a one-sample Wilcoxon signed-rank test for significance. This will calculate the Wilcoxon test statistic and the p-value. Beware! The mu here is not the same as the mu mentioned elsewhere (extinction rate); here it is just the hypothesised centre of the distribution. We are testing whether the distribution of coefficients is centred on 0.0 or not.
stats <- (wilcox.test(cors_temp_T, mu=0.0, paired = FALSE))
# We want to know the p-value
stats$p.value
## [1] 3.955912e-18
# Close sink
sink()
Have a look at your output. Is the correlation statistically significant? Why?
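If you want some intuition for how the test behaves, here is a quick toy illustration with made-up data (entirely separate from our results): a sample centred on zero should give a large p-value, while a sample shifted away from zero should give a very small one.
# Toy illustration only, not part of the workshop data
wilcox.test(rnorm(100, mean=0), mu=0)$p.value    # centred on zero: large p expected
wilcox.test(rnorm(100, mean=0.5), mu=0)$p.value  # shifted away from zero: small p expected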
If you’re short on time you can save your data at this point and continue later. Otherwise, continue…
Let’s restart the counter ready for the next correlation.
count = 1
Now the correlation between sea level & speciation rate.
# This loops from 1 to the number of simulations, which is 9,001.
for (i in 1:numberOfSims ) {
# Is it one of our samples?
if (i %in% samples){
# If yes, do the correlation.
# Again, we have to start by interpolating the data.
interpdiv = approx(times, rtt_T$lambda[i,], seaLevel$Age, method='linear', rule=1)
end = which(is.na(interpdiv$y))
if (length(end) == 0) {
div_rates = interpdiv$y
ft = seaLevel$SL
} else {
div_rates = interpdiv$y[-end]
ft = seaLevel$SL[-end]
}
# Now do the correlation
c = DCCA(as.numeric(unlist(div_rates)),as.numeric(unlist(ft)),length(ft)/10)
# Store the correlation coefficient
cors_seaLevel_T[count] = c[1]
# Increase your count by 1 ready to do the next.
count = count+1
}
}
Now let’s plot our second histogram using ggplot2.
plot=qplot(cors_seaLevel_T, geom="histogram", bins=30)
Let’s take a look.
plot
Make it a bit nicer again.
plot2 = plot + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))
Let’s check how it looks.
plot2
Now save your plot to file.
ggsave(plot2,file="SpeciationTerrestrialSeaLevel.pdf")
Calculate and save some statistics to check whether our correlation is significant.
sink(file="SpeciationTerrestrialSeaLevelStats.txt")
# Print some summary stats
print(summary(cors_seaLevel_T))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.65330 -0.52404 -0.44867 -0.43421 -0.37012 -0.08432
# Print the 95% confidence intervals
print(quantile(cors_seaLevel_T, c(0.025, 0.975)))
## 2.5% 97.5%
## -0.6245419 -0.1430661
# Carry out the Wilcoxon test, this time on the sea-level correlations
stats <- (wilcox.test(cors_seaLevel_T, mu=0.0, paired = FALSE))
# We want to know the p-value
stats$p.value
# Now close the sink
sink()
Have a look at your output. Is the correlation statistically significant? Why?
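If you would like to go a step further for your report, you could compare the two distributions of correlation coefficients in a single figure. Here is a minimal sketch using the two vectors we have just filled: it stacks them into a small data frame and overlays the two histograms.
# Combine both sets of coefficients into long format for ggplot2
cors_both <- data.frame(
  rho = c(cors_temp_T, cors_seaLevel_T),
  driver = rep(c("Temperature", "Sea level"), each = numberOfSamples)
)
plot3 <- ggplot(cors_both, aes(x = rho, fill = driver)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  theme_classic()
plot3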
Now save your data.
save.image("Crocs_Workshop4.RData")
For your report you could try running through all the workshops for the marine species too. You could also have a go at exploring extinction rates. What sort of relationship, if any, do you find there? Coding hint: speciation is expressed as lambda and extinction is expressed as mu. You could also try exploring what else BAMMtools can offer for studying macroevolutionary dynamics in this group. Finally, if you’re feeling ambitious you could think beyond just environmental change and consider biotic drivers of diversification. Hint - if you want to do this you will need to look into lineages through time (you can do this in ape; see the sketch below) and correlate those with your diversification rates. The opportunities for independent analysis are really open-ended and you certainly don’t have to do all of these! Just go with whatever takes your interest.
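If you do want to try the lineages-through-time route, here is a minimal sketch using ape. Note that tree_T is a placeholder name: substitute whichever phylogeny object you are working with from the earlier workshops, and you will still need to align the LTT times with your rate times (e.g. with approx(), as above) before correlating.
# Minimal sketch; 'tree_T' is a placeholder for your phylogeny object
library(ape)
ltt <- ltt.plot.coords(tree_T)  # two-column matrix: time and lineage count N
head(ltt)
# Times run from the root (negative) up to the present at 0, so flip/align
# them with your rate-through-time axis before running DCCA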
When you’re writing your report don’t forget to think about the biology & ecology behind the analyses. The R code is just a means of analysing big datasets; I’m interested in seeing how you have gone about analysing the data and how you explain your results in the context of the macroevolution of Pseudosuchia in response to global environmental change.
Here are some papers that have used these methods to answer similar macroevolutionary questions. You will probably find them helpful for framing and writing your report.
https://www.nature.com/articles/ncomms13003
https://www.nature.com/articles/s42003-018-0191-7