WWW Do online | PC Do on your computer | R Do in RStudio | ? Think about and answer

Data Types

Getting Started

R In RStudio, set your working directory to your scripts folder for this course (set up earlier…)

R Make a new script file called datatypesAndStructures.R as a place to capture/carry out the rest of the work.

Open the manual page for every command you want to use e.g. ?integer

Preamble

We have described programming as data + algorithms. In the first couple of sessions we focused on algorithms (the sequence of steps, calculations, loops and decisions that are necessary to solve a problem). In this session we turn to data: the basic data types and their manipulation through operators and functions. We will then look at (higher-order) data structures that can be built from the basic data types.

Data Types, operators and functions

Base data types

Base data types are the atomic representations of data in a programming language: they are the building blocks from which data can be expressed. We will look at the following:

Type	Description	Example Usage
Integer	discrete numbers	counting, categories, indexing
Numeric	real numbers	continuous real valued quantities
Logical	a bit of information	boolean expressions, evaluating to TRUE or FALSE
String	sequence of characters	alphabetical, numeric, punctuation, operators etc.

NB: There are other data types, such as Complex (complex numbers with real and imaginary components) and Date (data representing time points), but this will do for now.
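As a quick check of the table above, class() reports the type R has assigned to a value. This is a minimal sketch; the variable names and values are made up for illustration. Note that R calls the string type "character", and that a whole number is only stored as an integer if you ask for it with the L suffix.

```r
# One value of each base type (names and values are illustrative only)
n_samples   <- 42L      # integer: note the L suffix
temperature <- 36.6     # numeric (real number)
is_control  <- TRUE     # logical
gene_name   <- "BRCA1"  # string (R calls this type "character")

class(n_samples)        # "integer"
class(temperature)      # "numeric"
class(is_control)       # "logical"
class(gene_name)        # "character"
```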

Arithmetic Operators

With integers and numerics, we have some basic operators to do arithmetic:

Operator	Description
+	addition
-	subtraction
*	multiplication
/	division
^ or **	exponentiation
x %% y	modulus (x mod y) 5%%2 is 1
x %/% y	integer division 5%/%2 is 2

We’ll illustrate the use of these below.

INTEGERS

a <- 1
a <-  a + a
a <- a * 5
a <- a - 12
a <- a / 2
a <- a^2
a

## [1] 1

R Stop to ~~play around~~ familiarise yourself with some simple arithmetic operations on integers.
Do you know the mod (modulus) operator?
WWW modulus operator

a <- 5
a %% 2 # read this as "a mod 2". e.g. The sequence (1 2 3 4 5) mod 2 is (1 0 1 0 1)

## [1] 1

It’s useful, we’ll come back to it.
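To give a flavour of why the modulus operator is useful, here is a small sketch of two common uses: testing whether numbers are even, and "wrapping around" a 24-hour clock. The variable names are illustrative only.

```r
# Even/odd test: a number is even when the remainder after dividing by 2 is 0
n <- 1:10
is_even <- n %% 2 == 0
is_even                        # FALSE TRUE FALSE TRUE ... (TRUE where n is even)

# Wrap-around arithmetic: 5 hours after 22:00 on a 24-hour clock
hour <- 22
wrapped <- (hour + 5) %% 24
wrapped                        # 3
```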

NUMERIC

Now for real numbers. Here we assign 1.0 rather than 1 in the same code fragment.

a <- 1.0
a <-  a + a
a <- a * 5
a <- a - 12
a <- a / 2
a <- a^2
a

## [1] 1

Hey! - hold on a minute - that looks just like what we did on integers above. That’s because for most purposes R shields us from worrying whether the numeric data we are using is real (continuous) or integer (discrete). (This is an explicit concern in some other programming languages, but not R).

Here’s an example that is explicit in its use of real numbers.

a <- 1.3675
a <-  a + a
a <- a * 5
a <- a - 12
a <- a / 2
a <- a^2
a

## [1] 0.7014062

NB: In R, numeric data types are really a superset of integer and real data. You can force things to be explicitly treated as numeric or integer (using the as.numeric() or as.integer() functions). Recall that we used a trick of integer division in the switch() version of the grade assignment code in the last session, to turn continuous grades in the interval [0,100] into discrete grades 0-9. Here’s another example of explicit integer arithmetic on real numbers:

a <- 7.5
a / 2

## [1] 3.75

a %/% 2  # gives the same value as as.integer(a/2) for positive numbers - try it...

## [1] 3
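The following sketch makes the distinction between numeric and integer values a little more concrete, using is.integer(), is.numeric() and as.integer(). It also shows one case where %/% and as.integer() disagree: %/% rounds down, whereas as.integer() simply drops the fractional part, so the results differ for negative numbers. The variable names are illustrative only.

```r
whole <- 1
is.integer(whole)            # FALSE: a plain 1 is stored as a real number
is.numeric(whole)            # TRUE

converted <- as.integer(7.9) # explicit conversion drops the fractional part
converted                    # 7
is.integer(converted)        # TRUE

# %/% rounds down; as.integer() truncates towards zero
neg <- -7.5
neg %/% 2                    # -4
as.integer(neg / 2)          # -3
```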

This is not all we have to say on this matter. But for now, that’s enough.

Numeric functions

We have looked at various operators on numeric data; now we will look at the more general case of functions. Functions are transformations of the data that we simply call with our numeric data as an argument. Here are some of the most common numeric functions:

Function	Description
abs(x)	absolute value
sqrt(x)	square root
ceiling(x)	ceiling(2.347) is 3
floor(x)	floor(2.347) is 2
trunc(x)	trunc(4.899) is 4
round(x, digits=n)	round(2.347, digits=2) is 2.35
signif(x, digits=n)	signif(2.347, digits=2) is 2.3
cos(x), sin(x), tan(x)	also acos(x), cosh(x), acosh(x), etc.
log(x)	natural logarithm
log10(x)	logarithm to base 10
exp(x)	e^x

R Try a few test calculations
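A few test calculations of this kind might look like the sketch below, which simply applies functions from the table to some arbitrary values. Note the difference between trunc() and floor() for negative numbers.

```r
x <- 2.347
ceiling(x)              # 3
floor(x)                # 2
round(x, digits = 2)    # 2.35
signif(x, digits = 2)   # 2.3

trunc(-4.899)           # -4 (drops the fractional part)
floor(-4.899)           # -5 (rounds down)

sqrt(abs(-16))          # 4
log(exp(1))             # 1
log10(1000)             # 3
```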

LOGICAL

This data type is the basic bit of information: 1 or 0, or equivalently TRUE or FALSE. You will recall that we have already been using logical expressions in our look at control structures in R, as the conditions of while (condition) loops and if (condition) / else statements.
? Boolean logic lies at the core of the binary representation of data in digital computation

Logical Operators

First we need to define the comparison and (Boolean) logical operators:

Operator	Description
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to
!x	Not x
x | y	x OR y (i.e. at least one of x or y needs to be TRUE for the condition to be TRUE)
x & y	x AND y (i.e. both x and y need to be TRUE for the condition to be TRUE)

Some of these will be more familiar than others. Some examples:

Truth Table for the OR ( | ) operator:
(NB: the way to read this table is: look at each element in the table as being the result of the operator on the corresponding row and column element)

##       True False
## True  TRUE  TRUE
## False TRUE FALSE

Truth Table for the AND ( & ) operator:

##        True False
## True   TRUE FALSE
## False FALSE FALSE
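If you would like to build truth tables like these yourself, one possible approach (a sketch, not necessarily how the tables above were produced) is to apply the operator to every combination of TRUE and FALSE using outer(). The names vals, or_table and and_table are illustrative.

```r
# A named logical vector gives the table its row and column labels
vals <- c(True = TRUE, False = FALSE)

# outer() applies the operator to every row/column pair of values
or_table  <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")

or_table
and_table
```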

a <- 1

Examples of logical expressions:

5 > 4

## [1] TRUE

! (5 > 4) # example of NOT operator (!)

## [1] FALSE

# expressions can be combined
target <- 300
step <- 141
(target > 500 & step < 300 )

## [1] FALSE

We will go on to see that applying logical expressions to data values is a powerful way of selecting elements of data structures in R.
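As a small preview of that idea, the sketch below applies a logical expression to a whole vector of values and then uses the result to pick out elements. The data are made up for illustration.

```r
heights <- c(152, 178, 165, 190, 171)   # made-up measurements

tall <- heights > 170     # one TRUE/FALSE per element
tall                      # FALSE TRUE FALSE TRUE TRUE

heights[tall]             # 178 190 171
sum(tall)                 # 3 (TRUE counts as 1, FALSE as 0)

# Expressions can be combined with & and |
mid_range <- heights[heights > 160 & heights < 180]
mid_range                 # 178 165 171
```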

STRINGS

Strings are sequences of characters, such as letters of the alphabet, punctuation, etc. Your data may not be numerical, it might be text from scientific abstracts, or tweets on Twitter. In most cases, datasets have some text associated with them e.g. metadata.

String functions

As with other data types, strings have some associated functions:

Function	Description
substr(s, start=pos1, stop=pos2)	Extract or replace substrings in a character vector.
sub(pattern, replacement, s, ignore.case=FALSE, fixed=FALSE)	Find pattern in s and replace with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a literal text string.
grep(pattern, s, ignore.case=FALSE, fixed=FALSE)	Search for pattern in s. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a literal text string. Returns matching indices.
strsplit(x, split)	Split the elements of character vector x at split.
paste(..., sep="")	Concatenate strings, separated by the string sep.
toupper(s)	Convert s to uppercase characters.
tolower(s)	Convert s to lowercase characters.
nchar(s)	Number of characters in s.

s <- 'abcdef'
substr(s, 2, 4)

## [1] "bcd"

substr(s,2,4) <- 'XXX'  # assign a string to a substring

sub("X","Y",s)  # sub substitutes the first instance it finds

## [1] "aYXXef"

gsub("X","Y",s) # gsub replaces all occurrences - a *global* substitution

## [1] "aYYYef"

strsplit(s,"")

## [[1]]
## [1] "a" "X" "X" "X" "e" "f"

toupper(s)

## [1] "AXXXEF"

nchar(s)

## [1] 6

sv <- c(s,"test","abXYZcd","defg")
sv

## [1] "aXXXef"  "test"    "abXYZcd" "defg"

findthis <- "X"
loc <- grep(findthis,sv,fixed=TRUE)  # indices of the elements of sv that contain "X"
for (i in loc) print(paste(findthis,": found in element:  ",i, "(",sv[i],")"))

## [1] "X : found in element:   1 ( aXXXef )"
## [1] "X : found in element:   3 ( abXYZcd )"
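The extra spaces in the output above (for example around the brackets) appear because paste() separates its arguments with a single space unless told otherwise. The sketch below shows the effect of the sep argument, along with paste0(), which is a shorthand for sep = "". The variable names are illustrative.

```r
default_sep <- paste("X", "found", 1)             # "X found 1"
no_sep      <- paste("X", "found", 1, sep = "")   # "Xfound1"
dash_sep    <- paste("a", "b", "c", sep = "-")    # "a-b-c"

# paste0() joins with no separator, and works element-wise on vectors
labels <- paste0("element_", 1:3)                 # "element_1" "element_2" "element_3"
```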

R How would you write some code to show the peptide fragments resulting from a peptide digest performed by Asp-N Endopeptidase, which specifically cleaves peptide bonds with Asp (D) in position P1’ (the amino acid following the cleaved peptide bond)? Try out your ideas on: prot <- 'AVYRYPDEFGHESDGGPILKV'

WWW Look more at regular expressions. Can you find some examples of their use in R? Familiarise yourself with the functions regexpr() and gregexpr()
? How does this inform your approach to the peptide digest problem above?
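As a starting point for exploring regexpr() and gregexpr(), here is a minimal sketch on a short test string (not the peptide itself). Unlike grep(), which reports which elements of a vector match, these functions report where in the string the match occurs: regexpr() gives the position of the first match, and gregexpr() gives the positions of all matches. The variable names are illustrative.

```r
txt <- "aXXXef"

# Position of the first match (-1 would mean no match)
first_hit <- regexpr("X", txt, fixed = TRUE)
as.integer(first_hit)             # 2

# Positions of all matches; gregexpr() returns a list, one entry per input string
all_hits <- gregexpr("X", txt, fixed = TRUE)[[1]]
as.integer(all_hits)              # 2 3 4
attr(all_hits, "match.length")    # 1 1 1 (each match is one character long)

# No match is reported as -1
no_hit <- regexpr("Z", txt, fixed = TRUE)
as.integer(no_hit)                # -1
```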

Programming in the Biosciences: Session 3A

Leo Caves

Autumn Term