WWW Do online | PC Do on your computer | R Do in RStudio | ? Think about and answer
R In RStudio, set your working directory to your scripts folder for this course (set up earlier…)
R Make a new script file called datatypesAndStructures.R as a place to capture/carry out the rest of the work.
Open the manual page for every command you want to use e.g. ?integer
We have described programming as data + algorithms. In the first couple of sessions we focused on algorithms (the sequence of steps, calculations, loops and decisions that are necessary to solve a problem). In this session we turn to data: the basic data types and their manipulation through operators and functions. We will then look at (higher-order) data structures that can be built from the basic data types.
Base data types are the atomic representations of data in a programming language: they are the building blocks from which data can be expressed. We will look at the following:
Type | Description | Example Usage |
---|---|---|
Integer | discrete numbers | counting, categories, indexing |
Numeric | real numbers | continuous real valued quantities |
Logical | a bit of information | boolean expressions, evaluating to TRUE or FALSE |
String | sequence of characters | alphabetical, numeric, punctuation, operators etc. |
NB: There are other data types, such as Complex (complex numbers with real and imaginary components) and Date (data representing time points), but this will do for now.
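To see how R itself labels these four types, here is a minimal sketch using class(). The variable names are made up for illustration. Note that R reports strings as "character", and that a whole number is only stored as an integer if you add the L suffix.

```r
# Illustrative values, one per base type (names are hypothetical).
n_int <- 42L      # the L suffix asks R for an integer
n_num <- 3.14     # a bare number is stored as a numeric (double)
flag  <- TRUE     # logical
txt   <- "hello"  # string; R calls this type "character"

class(n_int)   # "integer"
class(n_num)   # "numeric"
class(flag)    # "logical"
class(txt)     # "character"
```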
With integers and numerics, we have some basic operators to do arithmetic:
Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
x %% y | modulus (x mod y), e.g. 5 %% 2 is 1 |
x %/% y | integer division, e.g. 5 %/% 2 is 2 |
We’ll illustrate the use of these below.
a <- 1
a <- a + a
a <- a * 5
a <- a - 12
a <- a / 2
a <- a^2
a
## [1] 1
R Stop to play around and familiarise yourself with some simple arithmetic operations on integers
Do you know the mod (modulus) operator?
WWW modulus operator
a <- 5
a %% 2 # read this as "a mod 2". e.g. The sequence (1 2 3 4 5) mod 2 is (1 0 1 0 1)
## [1] 1
It’s useful, and we’ll come back to it.
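As a small sketch of why it is useful: the remainder tells you whether a number is even, and it pairs up with integer division so that the two results together recover the original number. The variable names here are just for illustration.

```r
# The sequence 1..5 mod 2, as described in the comment above.
remainders <- (1:5) %% 2
remainders                 # 1 0 1 0 1

# A number is even exactly when its remainder mod 2 is 0.
is_even <- (10 %% 2) == 0
is_even                    # TRUE

# %/% and %% fit together: x equals (x %/% y) * y + (x %% y).
x <- 17
y <- 5
quotient  <- x %/% y       # 3
remainder <- x %% y        # 2
quotient * y + remainder   # 17
```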
Now for real numbers. Here we repeat the same code fragment, but assign 1.0 rather than 1.
a <- 1.0
a <- a + a
a <- a * 5
a <- a - 12
a <- a / 2
a <- a^2
a
## [1] 1
Hey! - hold on a minute - that looks just like what we did on integers above. That’s because for most purposes R shields us from worrying whether the numeric data we are using is real (continuous) or integer (discrete). (This is an explicit concern in some other programming languages, but not R).
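If you are curious what R is doing behind the scenes, here is a minimal sketch. Strictly speaking, a bare 1 is already stored as a real number (a double); you only get a true integer by writing 1L. The variable names are illustrative.

```r
a <- 1     # stored as a double, even though it looks like an integer
b <- 1L    # the L suffix makes it a genuine integer

is.integer(a)   # FALSE
is.integer(b)   # TRUE
a == b          # TRUE: the values compare equal regardless of storage

class(b + b)    # "integer": integer plus integer stays integer
class(b / 2L)   # "numeric": division always gives a real number
```

For most everyday work you can ignore the distinction, which is exactly the shielding described above.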
Here’s an example that is explicit in its use of real numbers.
a <- 1.3675
a <- a + a
a <- a * 5
a <- a - 12
a <- a / 2
a <- a^2
a
## [1] 0.7014062
NB: In R, numeric data types are really a superset of integer and real data. You can force things to be explicitly treated as numeric or integer (using the as.numeric() or as.integer() functions). Recall that we used a trick of integer division in the switch() version of the grade assignment code in the last session, to turn continuous grades in the interval [0,100] into discrete grades 0-9. Here's another example of explicit integer arithmetic on real numbers:
a <- 7.5
a / 2
## [1] 3.75
a %/% 2 # same value as as.integer(a/2) when a is positive - try it...
## [1] 3
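The two approaches agree for positive numbers, but not for negative ones. A short sketch of the difference (the value -7.5 is just an illustrative choice):

```r
a <- -7.5
a / 2                               # -3.75

floored   <- a %/% 2                # -4: integer division rounds down
truncated <- as.integer(a / 2)      # -3: as.integer() drops the fraction

floored
truncated

# Note that %/% on real numbers still returns a real number, not an integer.
class(7.5 %/% 2)                    # "numeric"
```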
This is not all we have to say on this matter. But for now, that’s enough.
We have looked at various operators on numeric data; now we will look at the more general case of functions. Functions are transformations of the data that we simply call with our numeric data as an argument to the function. Here are some of the most common numeric functions:
Function | Description |
---|---|
abs(x) | absolute value |
sqrt(x) | square root |
ceiling(x) | ceiling(2.347) is 3 |
floor(x) | floor(2.347) is 2 |
trunc(x) | trunc(4.899) is 4 |
round(x, digits=n) | round(2.347, digits=2) is 2.35 |
signif(x, digits=n) | signif(2.347, digits=2) is 2.3 |
cos(x), sin(x), tan(x) | also acos(x), cosh(x), acosh(x), etc. |
log(x) | natural logarithm |
log10(x) | logarithm to base 10 |
exp(x) | e^x |
R Try a few test calculations
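A few test calculations to get you started. This is only a sketch with illustrative values; the negative number is chosen to show how floor(), ceiling() and trunc() differ.

```r
x <- -2.347

abs(x)                          # 2.347
floor(x)                        # -3: next whole number down
ceiling(x)                      # -2: next whole number up
trunc(x)                        # -2: fractional part dropped

round(x, digits = 1)            # -2.3
signif(1234.5678, digits = 2)   # 1200: two significant figures

sqrt(16)                        # 4
log(exp(2))                     # 2: log() undoes exp()
log10(1000)                     # 3
```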
The logical data type is the basic bit of information: 1 or 0, or equivalently TRUE or FALSE. You will recall that we have already been using logical expressions in our look at control structures in R, in while (condition) and if (condition) ... else.
? Boolean logic lies at the core of the binary representation of data in digital computation
First we need to define the (Boolean) logical operators:
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
!x | Not x |
x \| y | x OR y (i.e. at least one of x or y needs to be TRUE for the condition to be TRUE) |
x & y | x AND y (i.e. both x and y need to be TRUE for the condition to be TRUE) |
Some of these will be more familiar than others. Some examples:
Truth Table for the OR ( | ) operator:
(NB: the way to read this table is: look at each element in the table as being the result of the operator on the corresponding row and column element)
## True False
## True TRUE TRUE
## False TRUE FALSE
Truth Table for the AND ( & ) operator:
## True False
## True TRUE FALSE
## False FALSE FALSE
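You can build truth tables like these yourself. One possible way (an assumption on our part, not necessarily how the tables above were produced) is to apply the operator to every row and column combination with outer(). The names vals, or_table and and_table are made up for this sketch.

```r
# A named logical vector; the names become the row and column labels.
vals <- c(True = TRUE, False = FALSE)

# outer() applies the operator to every (row, column) pair.
or_table  <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")

or_table    # only the False/False cell is FALSE
and_table   # only the True/True cell is TRUE
```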
a <- 1
Examples of logical expressions:
5 > 4
## [1] TRUE
! (5 > 4) # example of NOT operator (!)
## [1] FALSE
# expressions can be combined
target <- 300
step <- 141
(target > 500 & step < 300 )
## [1] FALSE
We will go on to see how applying logical expressions to data values is a powerful way of selecting elements of data structures in R.
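As a small preview, here is a sketch of that idea on a vector of made-up step values. The logical expression is evaluated for every element, and the resulting TRUE/FALSE values pick out which elements to keep.

```r
# Hypothetical data for illustration.
steps <- c(141, 320, 95, 410)

# One TRUE/FALSE per element: is it between 100 and 400?
in_range <- steps > 100 & steps < 400
in_range          # TRUE TRUE FALSE FALSE

# Use the logical values to select the matching elements.
steps[in_range]   # 141 320
```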
Strings are sequences of characters, such as letters of the alphabet, punctuation, etc. Your data may not be numerical; it might be text from scientific abstracts, or tweets on Twitter. In most cases, datasets have some text associated with them, e.g. metadata.
As with other data types, strings have some associated functions:
Function | Description |
---|---|
substr(s, start=pos1, stop=pos2) | Extract or replace substrings in a character vector. |
sub(pattern, replacement, s, ignore.case=FALSE, fixed=FALSE) | Find pattern in s and replace with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a literal text string. |
grep(pattern, s, ignore.case=FALSE, fixed=FALSE) | Search for pattern in s. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a literal text string. Returns matching indices. |
strsplit(x, split) | Split the elements of character vector x at split. |
paste(…, sep="") | Concatenate strings separated by the string sep |
toupper(s) | Convert s to uppercase characters |
tolower(s) | Convert s to lowercase characters |
nchar(s) | number of characters in s |
s <- 'abcdef'
substr(s, 2, 4)
## [1] "bcd"
substr(s,2,4) <- 'XXX' # assign a string to a substring
sub("X","Y",s) # sub substitutes the first instance it finds
## [1] "aYXXef"
gsub("X","Y",s) # gsub replaces all occurrences - a *global* substitution
## [1] "aYYYef"
strsplit(s,"")
## [[1]]
## [1] "a" "X" "X" "X" "e" "f"
toupper(s)
## [1] "AXXXEF"
nchar(s)
## [1] 6
sv <- c(s,"test","abXYZcd","defg")
sv
## [1] "aXXXef" "test" "abXYZcd" "defg"
findthis <- "X"
loc <- grep(findthis,sv,fixed=TRUE)
for (i in loc) print(paste(findthis,": found in element: ",i, "(",sv[i],")"))
## [1] "X : found in element: 1 ( aXXXef )"
## [1] "X : found in element: 3 ( abXYZcd )"
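The paste() call above relies on the default separator, which is a single space. Here is a short sketch of how sep changes the result, along with tolower() and splitting on a chosen character. The strings are illustrative only.

```r
first  <- "data"
second <- "Types"

spaced  <- paste(first, second)              # "data Types" (default sep = " ")
joined  <- paste(first, second, sep = "")    # "dataTypes"
snake   <- paste(first, second, sep = "_")   # "data_Types"
lowered <- tolower(snake)                    # "data_types"

# strsplit() returns a list; [[1]] extracts the pieces for the first string.
pieces <- strsplit("a,b,c", ",")[[1]]        # "a" "b" "c"
```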
R How would you write some code to show the peptide fragments resulting from a peptide digest performed by Asp-N Endopeptidase, which specifically cleaves bonds with Asp (D) in position P1' (the amino acid following the cleaved peptide bond)? Try out your ideas on: prot <- 'AVYRYPDEFGHESDGGPILKV'
WWW Look more at regular expressions. Can you find some examples of their use in R? Familiarise yourself with the functions regexpr() and gregexpr()
? How does this inform your approach to the peptide digest problem above?
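As a starting point, here is a minimal sketch of regexpr() and gregexpr() on a toy string (deliberately not the peptide, so the exercise is still yours to solve). Both report the positions at which a pattern matches: regexpr() gives the first match only, gregexpr() gives all of them.

```r
word <- "banana"

# regexpr(): position and length of the first match.
first_hit <- regexpr("an", word)
first_pos <- as.integer(first_hit)               # 2
first_len <- attr(first_hit, "match.length")     # 2

# gregexpr(): returns a list with one entry per input string;
# [[1]] holds the positions of all matches in our single string.
all_hits <- gregexpr("an", word)[[1]]
all_pos  <- as.integer(all_hits)                 # 2 4
```

Once you know the positions of every match, substr() can cut the string at those points.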