Chapter 2 Basics of R
2.1 Brief history
R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993 (Ihaka and Gentleman 1996). An excellent resource about R is the book “R Programming for Data Science” (Peng 2015). R can be seen as a free implementation of the S language, on which the commercial S-PLUS software is based (http://www.solutionmetrics.com.au/products/splus/default.html). It is free software, part of the GNU project (https://www.gnu.org/software/software.html), and distributed under the GNU General Public License (GPLv2 or GPLv3). This licensing has allowed many developers to improve and extend the code over the years.
R ships with many statistical functions, so users who only need its base capabilities are not forced to learn to program in R. However, the path from user to developer has been made easier by numerous publications, such as the books “R packages: organize, test, document, and share your code” (Wickham 2015) and “The art of R programming: A tour of statistical software design” (Matloff 2011).
2.1.1 Installation
R can be installed on different operating systems, including Linux, Windows, Mac, and Solaris. The webpage https://cran.r-project.org/ shows how to install R on any platform. A modern integrated development environment (IDE) is RStudio (https://www.rstudio.com/), which contains many useful integrated options. However, you can also run R in the terminal and use any text editor to write and save R scripts. You can even run an R script from the terminal by typing Rscript YourScript.R (the -e option is meant for evaluating an expression given directly on the command line, not for running a file).
2.2 Using R
If we type at the R prompt
x <- 1 # Not printing results
x # Print results
## [1] 1
print(x) # Explicit print x
## [1] 1
The number 1 was assigned (<-) to x; typing x or print(x) then shows the value of x. Besides numbers, other values such as strings, dates or other objects can be assigned. R also includes built-in constants. For instance, if we type ‘pi’ at the prompt, it prints the value of pi.
pi
## [1] 3.141593
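As a small illustration of assigning values other than numbers, the sketch below stores a string, a date and a logical value. The object names s, d and f are only illustrative.

```r
# Assignment works the same way for any kind of value.
s <- "emissions"             # a character string
d <- as.Date("2019-01-31")   # a date
f <- TRUE                    # a logical value

class(s)
class(d)
class(f)
```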
2.2.1 R objects
There are five basic classes (or atomic classes).
- character.
- numeric.
- integer.
- complex.
- logical (TRUE/FALSE or just T or F).
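A quick way to see these classes is to ask R for the class of one example value of each kind. This is only a sketch; the names examples and classes are illustrative.

```r
# One example value per atomic class.
examples <- list("a", 1.5, 1L, 1 + 2i, TRUE)

# The L suffix in 1L requests an integer instead of a numeric (double).
classes <- sapply(examples, class)
classes
```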
Objects of the same class can be grouped in vectors. Vectors are created with c(), writing the elements of the vector inside the parentheses, separated by commas; c is a built-in function to create vectors. The operator : creates a sequence. In the code below, the vector v1 contains a sequence of three integers, from 1 to 3, and its class is integer. The vector v2, built with c(1, 2, 3), holds the same values but has class numeric, which is why identical(v1, v2) returns FALSE.
v1 <- 1:3
v2 <- c(1, 2, 3)
identical(v1, v2)
## [1] FALSE
v1
## [1] 1 2 3
class(v1)
## [1] "integer"
With the operator [ we can get specific elements of a vector. The following code also uses the function length, which returns the length of an object.
v1[1] # first element of v1
## [1] 1
v1[length(v1)] # last element of v1
## [1] 3
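The [ operator also accepts several positions, negative positions and logical conditions. The following sketch uses an illustrative vector v that is not part of the examples above.

```r
v <- c(10, 20, 30, 40)

first_third <- v[c(1, 3)]  # several positions at once
no_first    <- v[-1]       # a negative index drops that position
above_15    <- v[v > 15]   # a logical condition keeps matching elements

first_third
no_first
above_15
```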
If we create a vector with numbers and letters, the resulting vector will be of class “character”, and the numbers are automatically converted into characters.
v2 <- c(1:3,"a")
v2
## [1] "1" "2" "3" "a"
class(v2)
## [1] "character"
Objects can be converted to other classes with the as.* functions. For instance, if we convert the vector v2 to numeric, R recognizes the numbers and puts an NA in the position of the character, with a warning.
as.numeric(v2)
## Warning: NAs introduced by coercion
## [1] 1 2 3 NA
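Other members of the as.* family behave in a similar way. The sketch below shows a few common conversions; the object names are illustrative.

```r
chr <- as.character(1:3)         # numbers to text
int <- as.integer(c(1.9, 2.1))   # decimals are truncated, not rounded
lgl <- as.logical(c(0, 1, 2))    # zero is FALSE, any other number is TRUE
num <- as.numeric("3.14")        # text that looks like a number

chr
int
lgl
num
```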
2.2.2 Factors
Factors represent categorical data. A typical example is a character vector of days of the week. The function factor creates factors. To see the help section, type ?factor. The following example uses this function with the arguments x and levels.
dow <- factor(x = c("Monday", "Thursday", "Friday"),
              levels = c("Monday", "Tuesday", "Wednesday",
                         "Thursday", "Friday"))
dow
## [1] Monday Thursday Friday
## Levels: Monday Tuesday Wednesday Thursday Friday
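To look inside a factor, the following sketch may help. It rebuilds the same dow factor and inspects its levels, the integer codes stored internally, and the counts per level (the names codes and counts are illustrative).

```r
dow <- factor(x = c("Monday", "Thursday", "Friday"),
              levels = c("Monday", "Tuesday", "Wednesday",
                         "Thursday", "Friday"))

levels(dow)               # all declared levels, in the given order
codes <- as.integer(dow)  # position of each value within the levels
counts <- table(dow)      # counts per level, including unused levels

codes
counts
```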
2.2.3 Missing Values
Missing values in datasets are a typical source of headaches for R users. Fortunately, R has several tools to avoid these headaches, such as the functions is.na, which checks for NA (not available), and is.nan, which checks for NaN (not a number). Both functions return logical values. Note that is.na also returns TRUE for NaN, while is.nan returns FALSE for NA.
n <- c(1, NA)
is.na(n)
## [1] FALSE TRUE
n <- c(1, NaN)
is.na(n)
## [1] FALSE TRUE
is.nan(n)
## [1] FALSE TRUE
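In practice, missing values matter most when summarizing data. The sketch below, with an illustrative vector obs, shows how an NA propagates through mean and two common ways to handle it.

```r
obs <- c(1, NA, 3)

m_all   <- mean(obs)                # NA: one missing value spoils the result
m_clean <- mean(obs, na.rm = TRUE)  # na.rm = TRUE ignores missing values
n_miss  <- sum(is.na(obs))          # how many values are missing
kept    <- obs[!is.na(obs)]         # keep only the non-missing values

m_all
m_clean
n_miss
kept
```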
2.2.4 Matrices
A matrix is a structure with rows and columns. Matrices can be created with the matrix function, usually from a vector. Remember, if you want to know more about any function, type ? followed by the function name (e.g. ?matrix), which opens the help documentation page.
Let’s create a matrix using vectors:
a <- 1:12 # integer vector
(m <- matrix(data = a, nrow = 3, ncol = 4))
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
We can check the dimensions of our matrix m with dim, which are 3 and 4.
dim(m)
## [1] 3 4
In order to get the elements of matrix m we can use the [ operator:
m[1, 1] # first element
## [1] 1
m[, 1] # first column
## [1] 1 2 3
m[1, ] # first row
## [1] 1 4 7 10
m[3,4] # last element
## [1] 12
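By default matrix fills the values column by column. As a sketch of two alternatives, the code below fills a matrix row by row with byrow = TRUE and binds vectors together with cbind and rbind. The object names are illustrative.

```r
# Fill row by row instead of column by column.
m_byrow <- matrix(data = 1:12, nrow = 3, ncol = 4, byrow = TRUE)
m_byrow[1, ]   # the first row now holds 1 2 3 4

# Bind vectors as columns or as rows.
cb <- cbind(1:3, 4:6)   # 3 rows, 2 columns
rb <- rbind(1:3, 4:6)   # 2 rows, 3 columns
dim(cb)
dim(rb)
```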
2.2.5 Arrays
Arrays are like stacks of matrices; a matrix is simply a 2-dimensional array. Let’s create an array using the same vector a and the same dimensions as m, 3 and 4. The function array has three arguments: data, dim and dimnames. In the argument dim, let’s add the number 2 as a third dimension, resulting in an array of two matrices identical to m.
(a <- array(data = a, dim = c(3,4,2)))
## , , 1
##
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
##
## , , 2
##
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
We can also subset elements of an array with the [ operator.
a[1, 1, 1] # first element
## [1] 1
dim(a) # dimensions
## [1] 3 4 2
a[3, 4, 2] # last element
## [1] 12
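The third argument of array, dimnames, was not used above. A minimal sketch of how it might be used follows; the labels r1, c1, first and second are invented for illustration.

```r
arr <- array(data = 1:12, dim = c(3, 4, 2),
             dimnames = list(paste0("r", 1:3),      # row names
                             paste0("c", 1:4),      # column names
                             c("first", "second"))) # names of the two matrices

first_el <- arr["r1", "c1", "first"]   # same element as arr[1, 1, 1]
second_m <- arr[, , "second"]          # the whole second matrix
first_el
dim(second_m)
```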
2.2.6 Lists
Lists are objects that can contain elements of different classes. I like to think of lists as bags. In a bag, you can put almost anything. Also, inside the bag, you can put other bags with different things, meaning that you can have a list of lists of any R object. You can create lists with the list function. Let’s create an empty list. Then I’m using the vector a to create another list.
a <- 1:3 # vector of three elements
l1 <- list() # empty list
l1 <- list(a) # list with a vector of three elements
length(l1) # length 1
## [1] 1
l1 <- as.list(a) # vector a as list
length(l1) # three elements
## [1] 3
As mentioned, the list can have lists inside.
l1 <- list(list(a),                  # a list holding the integer vector a
           list(TRUE, TRUE, FALSE))  # a list of three logical elements
length(l1) # length 2
## [1] 2
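To get things back out of the bag, lists use [[ and $ rather than a single [. The sketch below uses an illustrative named list called bag.

```r
bag <- list(numbers = 1:3,
            flag    = TRUE,
            inner   = list(letter = "a"))  # a list inside a list

nums   <- bag[["numbers"]]   # [[ returns the element itself
flag   <- bag$flag           # $ does the same, by name
letter <- bag$inner$letter   # reaching into the nested list
sub    <- bag["numbers"]     # a single [ returns a smaller list

nums
flag
letter
class(sub)
```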
2.2.7 Data-Frames
“Data-frames are used to store tabular data in R” (Peng 2015). You can think of data-frames as spreadsheet-like objects. They are similar to matrices, and you can have a matrix and a data.frame with the same dimensions. Both have rows and columns, but in a data-frame each column can have a different class, and the columns usually have a one-word name.
Programmers, scientists, and practitioners use tabular data. Hence, there are R packages created to extend the capabilities of data-frames. Below are some of them:
- data.frame: This is not a package. It is a class and function from the base library in R. With this function you can create data-frames.
- data.table (Dowle and Srinivasan 2017): data.table objects have the class data.table, which inherits from the class data.frame, meaning that they share common characteristics. However, the big difference with data.table is its memory efficiency. It includes functions such as fread for reading millions of observations in seconds.
- tidyr (Wickham and Henry 2018): Sometimes your data is in long format, and you need it in wide format, and vice-versa. This package does the job.
- readxl (Wickham and Bryan 2017): A good package for importing Excel files (.xls and .xlsx), including those saved from LibreOffice. I recommend this package for newbies.
- sf (Pebesma 2017): This package presents the class sf and introduces the list-column of spatial geometry.
Let’s create a data.frame.
a <- 1:3
b <- 3:5
(ab <- data.frame(a, b))
##   a b
## 1 1 3
## 2 2 4
## 3 3 5
class(ab)
## [1] "data.frame"
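The point that columns can have different classes can be sketched as follows. The data.frame veh and its values are invented for illustration.

```r
veh <- data.frame(type   = c("car", "bus", "truck"),  # character column
                  count  = c(120L, 15L, 30L),         # integer column
                  diesel = c(FALSE, TRUE, TRUE),      # logical column
                  stringsAsFactors = FALSE)

col_classes <- sapply(veh, class)   # class of each column
col_classes
dim(veh)                            # 3 rows and 3 columns
```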
2.2.8 Names
In R you can put names on almost everything, including vectors, matrices and data.frames. The example below reads and then changes the column names of the data.frame ab.
names(ab) # original names
## [1] "a" "b"
names(ab) <- c("c", "d")
names(ab) # new names
## [1] "c" "d"
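The same idea can be sketched for vectors and matrices. The objects v and mat and their labels are illustrative; names applies to vectors, while rownames and colnames apply to matrices.

```r
# Named vector: elements can then be selected by name.
v <- c(1, 2, 3)
names(v) <- c("x", "y", "z")
v_y <- v[["y"]]

# Matrix with row and column names.
mat <- matrix(1:4, nrow = 2)
rownames(mat) <- c("r1", "r2")
colnames(mat) <- c("c1", "c2")
mat_el <- mat["r2", "c1"]

v_y
mat_el
```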
2.2.9 Subsetting your data.frame
There are several packages for filtering and manipulating data.frames and databases, such as the famous dplyr (Wickham et al. 2017). However, I will mention an essential way to filter data.frames that does not require any additional package: subsetting using indices. Recall that data.frames are similar to matrices, and you can select values with this structure: [rows, columns].
Let’s create a data.frame with three columns and select all columns for the rows that match a criterion. Specifically, the rows where column a has a value equal to 2.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
(aa <- df[df$a == 2,])
##   a b c
## 2 2 5 8
class(aa)
## [1] "data.frame"
Here we are selecting all columns, and the class of aa is data.frame. The same applies for selecting columns; let’s select the columns “b” and “c”.
(df[, c("b", "c")])
##   b c
## 1 4 7
## 2 5 8
## 3 6 9
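Row and column selection can also be combined. The sketch below rebuilds the same df and shows a condition on two columns, plus the effect of drop = FALSE when a single column is selected. The result names are illustrative.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Rows matching two conditions at once (combined with &).
both <- df[df$a >= 2 & df$c < 9, ]

# Selecting one column returns a plain vector by default...
b_vec <- df[df$a == 2, "b"]
# ...unless drop = FALSE is used to keep the data.frame structure.
b_df <- df[df$a == 2, "b", drop = FALSE]

both
b_vec
class(b_df)
```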
2.3 Inputting data into R
Your data can be in different formats. In this section, I’m showing you how you can input text and comma separated value files, which is a ubiquitous task. Let’s say that you are using a spreadsheet program such as Microsoft Excel or LibreOffice. Then you might want to import your data into R.
- First, check whether the first row of your data contains column names. If it does, your data has a header.
- Then click on ‘Save as’ and choose a text file format with extension .txt or .csv.
- Then open your file with a text editor (e.g., Notepad, gedit, vi or another tool) to check whether the decimal character is a point ‘.’ and whether the separator character is a comma ‘,’, a semicolon ‘;’ or another character; this depends on the local configuration of your spreadsheet software.
- If your file has the extension ‘.txt’, use read.table, and if it is ‘.csv’, use read.csv.
Once you checked the format, you can import your ‘.txt’ or ‘.csv’ files in R:
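# Comma-separated file with a header row
a <- read.csv("path/to/file.csv", sep = ",",
              header = TRUE, stringsAsFactors = FALSE)
# Text file that also uses commas as the separator
a <- read.table("path/to/file.txt", sep = ",",
                header = TRUE, stringsAsFactors = FALSE)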
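The checks on the separator and the decimal character can be sketched with a small self-contained example. It writes a temporary file that uses a semicolon as separator and a comma as decimal character, a combination some spreadsheet locales produce, and reads it back with the sep and dec arguments. The file contents and the names tmp and d are invented for illustration.

```r
# Write a tiny example file: ';' separates columns, ',' marks decimals.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1,5;x", "2,5;y"), tmp)

# Tell read.csv which separator and decimal character the file uses.
d <- read.csv(tmp, sep = ";", dec = ",",
              header = TRUE, stringsAsFactors = FALSE)
d
sapply(d, class)   # column a is read as numeric, column b as character

unlink(tmp)        # remove the temporary file
```

If dec were left at its default of a point, column a would not be parsed as numbers, which is why checking the file in a text editor first is worthwhile.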
2.4 Reading spatial data
When I talk about spatial data in the context of emissions inventories, I am referring essentially to vectors and not gridded data (rasters). Although there are several regional and global emissions inventories available as gridded data, such as EDGAR (Olivier et al. 1996) and REAS (Kurokawa et al. 2013), this book focuses on the spatial location of the sources of pollution, more specifically, lines. A good reference on spatial computation is the book Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019), along with the paper about Simple Features (Pebesma 2018).
- The package sf (Pebesma 2017) reads spatial vector data through bindings to the GDAL library (https://www.gdal.org/ogr_formats.html).
A simple example:
library(sf)
data <- st_read("shapefile.shp") # ESRI Shapefile
data <- st_read("mapinfo.TAB")   # MapInfo TAB file
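The file names above are placeholders. For a version that can be run as it is, the sketch below assumes the sf package is installed and reads the North Carolina shapefile (shape/nc.shp) that is distributed with sf; the names nc_path and nc are illustrative. Note that this example file contains polygons, not lines.

```r
# Run only if sf is installed (an assumption of this sketch).
if (requireNamespace("sf", quietly = TRUE)) {
  # Path to an example shapefile shipped with the sf package.
  nc_path <- system.file("shape/nc.shp", package = "sf")

  # quiet = TRUE suppresses the summary that st_read prints by default.
  nc <- sf::st_read(nc_path, quiet = TRUE)

  print(class(nc))                         # an sf object is also a data.frame
  print(unique(sf::st_geometry_type(nc)))  # geometry type(s) found in the file
}
```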
References
Ihaka, Ross, and Robert Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics 5 (3). Taylor & Francis: 299–314.
Peng, Roger D. 2015. R Programming for Data Science. Leanpub.com.
Wickham, Hadley. 2015. R Packages: Organize, Test, Document, and Share Your Code. O’Reilly Media, Inc.
Matloff, Norman. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
Dowle, Matt, and Arun Srinivasan. 2017. Data.table: Extension of ‘Data.frame’. https://CRAN.R-project.org/package=data.table.
Wickham, Hadley, and Lionel Henry. 2018. Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, and Jennifer Bryan. 2017. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.
Pebesma, Edzer. 2017. Sf: Simple Features for R. https://CRAN.R-project.org/package=sf.
Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2017. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Olivier, Johannes Gerardus Jozef, AF Bouwman, JJM Berdowski, C Veldt, JPJ Bloos, AJH Visschedijk, PYJ Zandveld, JL Haverlag, and others. 1996. “Description of Edgar Version 2.0: A Set of Global Emission Inventories of Greenhouse Gases and Ozone-Depleting Substances for All Anthropogenic and Most Natural Sources on a Per Country Basis and on 1 Degree X 1 Degree Grid.” Rijksinstituut voor Volksgezondheid en Milieu RIVM.
Kurokawa, J., T. Ohara, T. Morikawa, S. Hanayama, G. Janssens-Maenhout, T. Fukui, K. Kawashima, and H. Akimoto. 2013. “Emissions of Air Pollutants and Greenhouse Gases over Asian Regions During 2000–2008: Regional Emission Inventory in Asia (Reas) Version 2.” Atmospheric Chemistry and Physics 13 (21): 11019–58. doi:10.5194/acp-13-11019-2013.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. 1st ed. Boca Raton, Florida: Chapman and Hall/CRC. https://geocompr.robinlovelace.net/.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal. https://journal.r-project.org/archive/2018/RJ-2018-009/index.html.