R-Ladies Istanbul: R For Beginners

author: Hazel Kavili date: 2017-09-30 width: 1400 height: 1280 font-family: 'Helvetica'

R and R Studio

Let's meet with the most used IDE (integrated desktop environment) for R

Scripts, Console,
- Code and workflow are more reproducible if we can document everything that we do.
Environment, History, Connections
Files, Plots, Packages, Help, Viewer
- The viewer window will helpy ou to see Plots, Shiny applications, blog pages of you did!

Looking for Help!

Ask Google
Search in Stackoverflow
An introduction to R
R for Data Science book from Hadley Wickham & Garrett Grolemund
Try R
R mailing list: first, learn how to ask questions!
R Tutorials
Get help from R:

help.start()
help(mean)
?mean
example(mean)

Getting Started - I

Motor Trend Car Road Tests - simple dataset

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(mtcars)

                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Getting Started - II

Teaching base or tidyverse way

Base R codes:

mtcars$transmission <- 
  ifelse(mtcars$am == 0, "automatic", "manual")

dplyr codes:

mtcars <- mtcars %>%
  mutate(transmission = case_when(am == 0 ~ "automatic", am == 1 ~ "manual"))

Getting Started - III

Teaching base or tidyverse way

Visualisation

Base R codes:

mtcars$trans_color <- 
  ifelse(mtcars$transmission == "automatic", "green", "blue")

pdf("plots/scatter_base.pdf", width = 5, height = 3)
plot(mtcars$mpg ~ mtcars$disp, col = mtcars$trans_color)
legend("topright", 
       legend = c("automatic", "manual"), 
       pch = 1, col = c("green", "blue"))
dev.off()

ggplot codes:

p1 <- ggplot(mtcars, aes(x = disp, y = mpg, color = transmission)) +
       geom_point()

ggsave("plots/scatter_tidy.pdf", p1, width = 5, height = 3)

Getting Started - IV

R commands:

are case sensitive
can be seperated either by a semi-colon(;), or by a newline
#comment

Objects:

varibables, arrays of numbers, character strings, functions

Getting Started - V

Assignment and Basic Operators

use <- for assigments
+,-, *, /, ^, %%

Logical Operators

<,>, <=, >=, ==, !=, !x, x & y, x | y

Others

sum, sqrt, min, max, mean, var, sd, abs, summary

Most Frequently Used Objects

Vectors
List
Matrices
Arrays
Factors
Data Frames

Vectors

use c() for concatenate more than one element.
in programming vectors are variable sized sequence of values (not necessarily numbers).

books <- c("history", "sci-fi", "fantasy")
print(books)

[1] "history" "sci-fi"  "fantasy"

print(class(books))

[1] "character"

ages <- c(12,13,14,15,9,8)
print(ages)

[1] 12 13 14 15  9  8

print(class(ages))

[1] "numeric"

Examples

#this is R-Ladies Istanbul
X <- 10
x <- 5
print(paste("X is", X))

[1] "X is 10"

print(paste("x is", x))

[1] "x is 5"

cat("X and x are equal? = ", X == x)

X and x are equal? =  FALSE

myNumbers <- c(1:10) # c is short for concatenate
rep(myNumbers, times = 3)

 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3
[24]  4  5  6  7  8  9 10

twice <- rep(myNumbers, each = 2)
print(twice)

 [1]  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10

Examples - II

y <- c(1,2,3,10,15,20)
z <- c(y,4,5,6,y)
print(y)

[1]  1  2  3 10 15 20

Vector Arithmetic

5/y

[1] 5.0000000 2.5000000 1.6666667 0.5000000 0.3333333 0.2500000

5*y

[1]   5  10  15  50  75 100

y^2

[1]   1   4   9 100 225 400

sqrt(y)

[1] 1.000000 1.414214 1.732051 3.162278 3.872983 4.472136

Examples - III

seq(from = -10, to = 10, by = 1)

 [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6
[18]   7   8   9  10

seq(from = 1, length = 25, by = 2 )

 [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
[24] 47 49

Simple Manipulations

weights <- c(55, 60, 45, 70, 56, 73, 59, 82)
sum(weights)

[1] 500

mean(weights)

[1] 62.5

sd(weights)

[1] 11.77164

var(weights)

[1] 138.5714

length(weights)

[1] 8

Lists

Lists can contain many different types of elements inside.

myList <- list(c(1,2,3), 15, "hello")
print(myList)

[[1]]
[1] 1 2 3

[[2]]
[1] 15

[[3]]
[1] "hello"

Select an element from lists

myList[1]

[[1]]
[1] 1 2 3

myList[[1]][2]

[1] 2

myList[2]

[[1]]
[1] 15

Matrices

A matrix is two-dimensional recteangular data set.

myMatrix <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3, byrow = TRUE)
print(myMatrix)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Select an element from matrices

myMatrix[1,2]

[1] 2

myMatrix[2,]

[1] 4 5 6

myMatrix[,1]

[1] 1 4

dim(myMatrix)

[1] 2 3

Matrices - II

myMatrix2 <- matrix(c(-6:-1), nrow = 2, ncol = 3, byrow = TRUE)
myMatrix + myMatrix2

     [,1] [,2] [,3]
[1,]   -5   -3   -1
[2,]    1    3    5

myMatrix %*% c(1,2,3)

     [,1]
[1,]   14
[2,]   32

diag(myMatrix2) #diagonal

[1] -6 -2

t(myMatrix2)  #transpose of matrix

     [,1] [,2]
[1,]   -6   -3
[2,]   -5   -2
[3,]   -4   -1

Arrays

Arrays can be of any number of dimensions

myArray <- array(c('uno','dos', 'tres'), dim = c(3,3,3))
print(myArray)

, , 1

     [,1]   [,2]   [,3]  
[1,] "uno"  "uno"  "uno" 
[2,] "dos"  "dos"  "dos" 
[3,] "tres" "tres" "tres"

, , 2

     [,1]   [,2]   [,3]  
[1,] "uno"  "uno"  "uno" 
[2,] "dos"  "dos"  "dos" 
[3,] "tres" "tres" "tres"

, , 3

     [,1]   [,2]   [,3]  
[1,] "uno"  "uno"  "uno" 
[2,] "dos"  "dos"  "dos" 
[3,] "tres" "tres" "tres"

Factors

They are categorical variables.
You can create using a vector. Factors stores the vector along with distinct values of the elements in the vector as labesl.
The labels are always character irrespective of wheter it is numeric or character or Boolean etc. input vector. They are useful in statistical modelling.

myVector <- c('blue', 'red', 'violet', 'red', 'red', 'blue')
print(myVector)

[1] "blue"   "red"    "violet" "red"    "red"    "blue"

myVectorFactor <- factor(myVector)
print(myVectorFactor)

[1] blue   red    violet red    red    blue  
Levels: blue red violet

print(nlevels(myVectorFactor))

[1] 3

Tidy Data Concept and Data Frames

Tidy Data

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

mySurvey <- data.frame(
  name = c("Harry", "Ron", "Hermione", "Draco", "Cedric"),
  gender = c("Male", "Male", "Female", "Male", "Male"),
  age = c(11, 11, 11, 11, 12),
  bloodStatus = c("Half-blood", "Pure-blood","Muggle-born", "Pure-blood", NA),
  house = c("Gryffindor", "Gryffindor", "Gryffindor", "Slytherin", NA)
  )
print(mySurvey)

      name gender age bloodStatus      house
1    Harry   Male  11  Half-blood Gryffindor
2      Ron   Male  11  Pure-blood Gryffindor
3 Hermione Female  11 Muggle-born Gryffindor
4    Draco   Male  11  Pure-blood  Slytherin
5   Cedric   Male  12        <NA>       <NA>

Indexing, Selecting and Modifying

is.na(mySurvey)

      name gender   age bloodStatus house
[1,] FALSE  FALSE FALSE       FALSE FALSE
[2,] FALSE  FALSE FALSE       FALSE FALSE
[3,] FALSE  FALSE FALSE       FALSE FALSE
[4,] FALSE  FALSE FALSE       FALSE FALSE
[5,] FALSE  FALSE FALSE        TRUE  TRUE

mySurvey[5,4] <- "Pure-blood"
str(mySurvey)

'data.frame':	5 obs. of  5 variables:
 $ name       : Factor w/ 5 levels "Cedric","Draco",..: 3 5 4 2 1
 $ gender     : Factor w/ 2 levels "Female","Male": 2 2 1 2 2
 $ age        : num  11 11 11 11 12
 $ bloodStatus: Factor w/ 3 levels "Half-blood","Muggle-born",..: 1 3 2 3 3
 $ house      : Factor w/ 2 levels "Gryffindor","Slytherin": 1 1 1 2 NA

levels(mySurvey$house) <- c("Gryffindor", "Slytherin", "Hufflepuff")
mySurvey[5,5] <- "Hufflepuff"
print(mySurvey)

      name gender age bloodStatus      house
1    Harry   Male  11  Half-blood Gryffindor
2      Ron   Male  11  Pure-blood Gryffindor
3 Hermione Female  11 Muggle-born Gryffindor
4    Draco   Male  11  Pure-blood  Slytherin
5   Cedric   Male  12  Pure-blood Hufflepuff

Indexing, Selecting and Modifying - II

mySurvey$name

[1] Harry    Ron      Hermione Draco    Cedric  
Levels: Cedric Draco Harry Hermione Ron

head(mySurvey)

      name gender age bloodStatus      house
1    Harry   Male  11  Half-blood Gryffindor
2      Ron   Male  11  Pure-blood Gryffindor
3 Hermione Female  11 Muggle-born Gryffindor
4    Draco   Male  11  Pure-blood  Slytherin
5   Cedric   Male  12  Pure-blood Hufflepuff

tail(mySurvey)

      name gender age bloodStatus      house
1    Harry   Male  11  Half-blood Gryffindor
2      Ron   Male  11  Pure-blood Gryffindor
3 Hermione Female  11 Muggle-born Gryffindor
4    Draco   Male  11  Pure-blood  Slytherin
5   Cedric   Male  12  Pure-blood Hufflepuff

look for:

x %in% y, !(x %in% y), !x, !is.na()

R Packages

R has a lot of packages for you to make some work easily done!
CRAN is the name of the R package repository but you can find and download many R packages on Github.
These packages are about: statistics, modelling, visualisation, manipulating, documentation, making websites/applications etc.

Tidyverse

dplyr, broom, tidyr, lubridate
ggplot2
purrr, magrittr, forecats, tibble
readxl

Installation

install.packages('tidyverse')
library(tidyverse)

Datasets

R has many example datasets and you can look at the list of datasets from here to make practice.
Today we will use 20170930dataset and learn how to load a data set on working directory and manipulate it with dplyr functions.
- To read a dataset, you can use these functions, according to your file type: read.csv, read.table, read.xls, read.xlsx etc.
- You need path of the file: where you store your file. For example (/Users/hazelkavili/Desktop/R-LadiesIstanbul/20170930dataset.csv)

myDataset <- read.csv(file = "~/Desktop/R-LadiesIstanbul/20170930dataset.csv", sep = ",", header = TRUE)
print(myDataset)

   GlobalRank         Company Sales Profits Assets MarketValue
1         307      KocHolding  47.1     1.3   61.1        13.6
2         333       IsBankasi  12.4     1.9  112.7        16.8
3         341  GarantiBankasi   9.8     1.9   99.2        22.0
4         382          Akbank   8.2     1.7   91.6        20.6
5         388  SabanciHolding  14.6     1.0   98.3        12.7
6         543        Halkbank   6.2     1.5   61.0        13.3
7         711       VakifBank   6.3     0.8   60.6         7.9
8         802     TurkTelekom   7.1     1.5    9.5        15.0
9         843        Turkcell   5.9     1.2   10.5        14.8
10       1210 TurkishAirlines   8.3     0.6   10.5         4.8
11       1245            Enka   5.7     0.6    8.2         8.3
12       1788     AnadoluEfes   3.6     0.3    6.5         9.0
13       1972             BIM   5.5     0.2    1.2         7.5
14       1977      FordOtosan   5.5     0.4    2.6         4.5

Structure of Dataset

str(myDataset)

'data.frame':	14 obs. of  6 variables:
 $ GlobalRank : int  307 333 341 382 388 543 711 802 843 1210 ...
 $ Company    : Factor w/ 14 levels "Akbank","AnadoluEfes",..: 9 8 6 1 10 7 14 11 12 13 ...
 $ Sales      : num  47.1 12.4 9.8 8.2 14.6 6.2 6.3 7.1 5.9 8.3 ...
 $ Profits    : num  1.3 1.9 1.9 1.7 1 1.5 0.8 1.5 1.2 0.6 ...
 $ Assets     : num  61.1 112.7 99.2 91.6 98.3 ...
 $ MarketValue: num  13.6 16.8 22 20.6 12.7 13.3 7.9 15 14.8 4.8 ...

summary(myDataset)

   GlobalRank               Company      Sales           Profits     
 Min.   : 307.0   Akbank        :1   Min.   : 3.600   Min.   :0.200  
 1st Qu.: 383.5   AnadoluEfes   :1   1st Qu.: 5.750   1st Qu.:0.600  
 Median : 756.5   BIM           :1   Median : 6.700   Median :1.100  
 Mean   : 917.3   Enka          :1   Mean   :10.443   Mean   :1.064  
 3rd Qu.:1236.2   FordOtosan    :1   3rd Qu.: 9.425   3rd Qu.:1.500  
 Max.   :1977.0   GarantiBankasi:1   Max.   :47.100   Max.   :1.900  
                  (Other)       :8                                   
     Assets         MarketValue   
 Min.   :  1.200   Min.   : 4.50  
 1st Qu.:  8.525   1st Qu.: 8.00  
 Median : 35.550   Median :13.00  
 Mean   : 45.250   Mean   :12.20  
 3rd Qu.: 83.975   3rd Qu.:14.95  
 Max.   :112.700   Max.   :22.00

Magical Words from dplyr!

Variables(columns)
- select
- mutate
Observations (rows)
- filter
- arrange
Groups
- summarise

look for Hadley's book for more magical words

%>% (Pipe) Operator

Basically tells R to take the value of that which is to the left and pass it to the right as an argument.
- cmd + shft + m
- kntr + shft + m

library(dplyr)
myDataset %>% filter(MarketValue > 5) %>% summarise(Average = mean(Assets))

  Average
1    51.7

Select

Choosing is not losing!

select(dataframe, var1, var2, ...)
select(dataframe, 1:4, -2)

smallSet <- myDataset %>% select(Company, MarketValue)
print(smallSet)

           Company MarketValue
1       KocHolding        13.6
2        IsBankasi        16.8
3   GarantiBankasi        22.0
4           Akbank        20.6
5   SabanciHolding        12.7
6         Halkbank        13.3
7        VakifBank         7.9
8      TurkTelekom        15.0
9         Turkcell        14.8
10 TurkishAirlines         4.8
11            Enka         8.3
12     AnadoluEfes         9.0
13             BIM         7.5
14      FordOtosan         4.5

There are also helper functions: starts_with, end_with, contains

Mutate

Deals with info in your data which is not display

mutate(dataframe, newVariable = var1 + var2)
mutate(dataframe, x = a + b, y = x + c)

mutateSet <- myDataset %>% mutate(TotalMoney = Assets + Profits)
head(mutateSet)

  GlobalRank        Company Sales Profits Assets MarketValue TotalMoney
1        307     KocHolding  47.1     1.3   61.1        13.6       62.4
2        333      IsBankasi  12.4     1.9  112.7        16.8      114.6
3        341 GarantiBankasi   9.8     1.9   99.2        22.0      101.1
4        382         Akbank   8.2     1.7   91.6        20.6       93.3
5        388 SabanciHolding  14.6     1.0   98.3        12.7       99.3
6        543       Halkbank   6.2     1.5   61.0        13.3       62.5

Filter

Filter out rows, specific type of observation

filter(dataframe, logicaltest)

filteredSet <- myDataset %>% filter(Assets < 10 & Profits > 1)
print(filteredSet)

  GlobalRank     Company Sales Profits Assets MarketValue
1        802 TurkTelekom   7.1     1.5    9.5          15

Arrange

Helps order observation (default ascending)

arrange(dataframe, var1)
arrange(dataframe, var1, desc(var2))

arrangedSet <- myDataset %>% select(GlobalRank, Company, MarketValue) %>%  arrange(desc(MarketValue))
print(arrangedSet)

   GlobalRank         Company MarketValue
1         341  GarantiBankasi        22.0
2         382          Akbank        20.6
3         333       IsBankasi        16.8
4         802     TurkTelekom        15.0
5         843        Turkcell        14.8
6         307      KocHolding        13.6
7         543        Halkbank        13.3
8         388  SabanciHolding        12.7
9        1788     AnadoluEfes         9.0
10       1245            Enka         8.3
11        711       VakifBank         7.9
12       1972             BIM         7.5
13       1210 TurkishAirlines         4.8
14       1977      FordOtosan         4.5

Summarise

Helps order observation (default ascending)

summarise(dataframe, newVar = expression,. . .)
summarise(dataframe, sum = sum(A), avg = mean(B)..)

summarisedSet <- myDataset %>% filter(grepl('Bank', Company) | grepl('bank', Company)) %>% 
  summarise(averageAssets = mean(Assets), averageSales = mean(Sales))

print(summarisedSet)

  averageAssets averageSales
1         85.02         8.58

look for grepl

References

Mine Cetinkaya-Rundell's Teaching Data Science to New Users presentation
Ismail Sezen's github
data.world for many datasets
Google :)
Data Carpentary

========================================================

Files

20170930_RForbeginners.md

Latest commit

History

20170930_RForbeginners.md

File metadata and controls

R-Ladies Istanbul: R For Beginners

R and R Studio

Looking for Help!

Getting Started - I

Motor Trend Car Road Tests - simple dataset

Getting Started - II

Teaching base or tidyverse way

Getting Started - III

Teaching base or tidyverse way

Visualisation

Getting Started - IV

Getting Started - V

Most Frequently Used Objects

Vectors

Examples

Examples - II

Examples - III

Simple Manipulations

Lists

Matrices

Matrices - II

Arrays

Factors

Tidy Data Concept and Data Frames

Indexing, Selecting and Modifying

Indexing, Selecting and Modifying - II

R Packages

Tidyverse

Installation

Datasets

Structure of Dataset

Magical Words from dplyr!

%>% (Pipe) Operator

Select

Mutate

Filter

Arrange

Summarise

References

Any questions?

- [email protected] & [email protected]

- twitter.com/RLadiesIstanbul

- meetup.com/rladies-istanbul