---
title: 'Web Scraping in R : rvest'
author: "Inayatus"
date: "`r format(Sys.Date(), '%A, %B %d, %Y')`"
output:
rmdformats::readthedown:
self_contained: true
thumbnails: true
lightbox: true
gallery: false
highlight: monochrome
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(plotly)
library(ggthemes)
```
Downloading data from data-provider websites is something most of us have done before. But what if the data we want is only displayed on a website, with no download option? In that case we need to scrape or crawl it ourselves. In this tutorial we will try to crawl data from the [Trustpilot website](https://www.trustpilot.com/).
Trustpilot is a popular website where customers post reviews of well-known e-commerce sites, covering the business and the service those sites give their customers. In this tutorial we will discuss how to scrape useful information from Trustpilot and then build a simple insight from the data, using the `rvest` package in R.
# Libraries Used
Before we can crawl data from a website, we need a few *packages* installed on our machine. These are the *packages* we need in order to crawl the data.
```{r, message=FALSE, warning=FALSE}
# general purpose data wrangling
library(tidyverse)
# parsing of HTML/XML files
library(rvest)
# string manipulation
library(stringr)
# Date time manipulation
library(lubridate)
# verbose regular expression
library(rebus)
```
# Find All Pages
Say that we want to pull information about an e-commerce site such as **Amazon**. To pull the information from Trustpilot, we need the review page's URL.
We save the URL we will use into an object.
```{r}
# base review URL; the page number is appended later with str_c()
url <- 'https://www.trustpilot.com/review/www.amazon.com'
```
Large companies, especially e-commerce companies, usually have a very large number of reviews, often running into the hundreds of pages.
To get data from a website we use the functionality of the `rvest` package. To convert a website into an XML object we use the `read_html()` function, giving it the target URL: it calls the web server and parses the page. To extract nodes from the XML object we use `html_nodes()`, with a `.` prefix to indicate a _class_ selector; the output is a list of all the nodes found. To extract the _tagged data_ we call `html_text()` on the nodes we have found. When we need to extract an attribute instead, we use `html_attrs()`, which returns the attributes of each node so we can inspect and extract them.
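Before touching the live site, here is a minimal sketch of that workflow on a hand-made HTML snippet (the class names here are made up for illustration; `minimal_html()` builds a tiny document so no web server is needed):
```{r}
# a toy page with the same structure we will meet on Trustpilot
example.page <- minimal_html('
  <div class="review">
    <span class="author" id="a1">Alice</span>
    <p class="body">Great service!</p>
  </div>')
example.page %>%
  html_nodes('.author') %>%   # select nodes by class with the `.` prefix
  html_text()                 # extract the tagged text: "Alice"
example.page %>%
  html_nodes('.author') %>%
  html_attrs()                # extract the attributes: class and id
```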
It is not as difficult as it sounds; let's try it together.
In this tutorial we will write several `function()`s whose purpose is to extract the data we want from a website. For each `function` we create, we need to know the _tag_ of the information we want to take.
Since this time I want to demonstrate scraping Amazon's reviews, these are the tags we need.
- `.pagination-page`: _tag_ for the number of review pages
- `.consumer-information__name`: _tag_ for the name of the reviewer
- `.star-rating`: _tag_ for the star rating given to the e-commerce site
- `.review-content__text`: _tag_ for the review written by each reviewer
- `.consumer-information__location`: _tag_ for where the reviewer is from
- `.consumer-information__review-count`: _tag_ for how many reviews the reviewer has written
> Keep in mind that every website has its own _tag_ and _class_ names, so these must be adjusted for the particular website you are scraping
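One way to verify these names is to peek at the class attributes a page actually uses; a rough sketch (the selector here is illustrative, and the chunk is not evaluated since it needs a live request):
```{r, eval=FALSE}
# list the CSS classes used on the review page, to check the tag
# names above before writing the extraction functions
read_html(url) %>%
  html_nodes('div') %>%
  html_attr('class') %>%
  unique() %>%
  head(20)
```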
Here is a `function` that can be used for the `.pagination-page` _tag_ above.
```{r}
last.page <- function(html){
  pages.data <- html %>%
    html_nodes('.pagination-page') %>%
    # extract the raw text into a character vector
    html_text()
  # take the last page number
  pages.data[(length(pages.data))] %>%
    # drop the raw string's names
    unname() %>%
    # convert to a number
    as.numeric()
}
```
The function above applies `html_nodes()` to extract the `pagination-page` class, takes the last item from the resulting list (the highest page number), and converts it to a numeric value.
To test the function, we can use `read_html()` on the URL and apply the function we have written:
```{r}
first.page <- read_html(url)
latest.page <- last.page(first.page)
```
Now that we have the number of pages, we can generate the list of all page URLs.
```{r}
list.of.pages <- str_c(url, '?page=', 1:latest.page)
head(list.of.pages)
```
We can check them manually from `list.of.pages`.
# Extract Information of One Page
We want to extract the review text, rating, author name, and review time from each subpage. Rather than repeating the same steps for every *field*, we write one generic extractor function.
```{r}
information <- function(html, tag){
  html %>%
    # select the relevant tag
    html_nodes(tag) %>%
    html_text() %>%
    # trim additional white space
    str_trim() %>%
    # convert from list to vector
    unlist()
}
```
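As a quick sanity check, we can apply this helper to the `first.page` we downloaded earlier (not evaluated here, since it depends on the live markup):
```{r, eval=FALSE}
# extract the reviewer names and the first review texts from the
# page we already downloaded
information(first.page, '.consumer-information__name') %>% head()
information(first.page, '.review-content__text') %>% head(2)
```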
Last but not least, we want a function to extract the rating of each review. The rating is stored as an attribute of a tag: it is not a plain number, but is embedded in a class like `star-rating-X`, where X is the number we want.
In the final step we will apply these functions to the whole list of URLs. To do that we use `map()` from the `purrr` package, which is part of the `tidyverse`.
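If `map()` is unfamiliar: it applies a function to every element of a list and returns a list of the same length; passing an integer instead of a function extracts that element from each entry, which is the trick used with `map(2)` below.
```{r}
# apply a function to each element of a list
purrr::map(list(1:3, 4:6), sum)
# passing an integer extracts that element from each entry
purrr::map(list(c("a", "b"), c("x", "y")), 2)
```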
```{r}
star.rating.information <- function(html){
  # pattern to look for: the digits in the 'star-rating-X' class
  pattern = '[0-9.]'
  rating <- html %>%
    html_nodes('.star-rating') %>%
    html_nodes('img') %>%
    html_attrs() %>%
    # apply the pattern match to all attributes
    map(str_match_all, pattern = pattern) %>%
    # keep the second entry of each match result
    map(2)
  rating <- lapply(rating, function(x) x %>% unlist() %>% paste(collapse = "")) %>% unlist()
  # leave out the first two instances, as they are not part of a review
  rating[3:length(rating)]
}
```
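Again, a quick check against the first page (not evaluated, since the live markup may differ):
```{r, eval=FALSE}
# ratings come back as character digits pasted together from the
# star-rating image attributes
star.rating.information(first.page) %>% head()
```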
Next, we want to get the date and time of each review.
```{r}
dates <- function(html){
  # use the html argument rather than re-reading the URL
  html %>%
    html_nodes('.review-card .review__content .review-content .review-content__header .review-content-header .review-content-header__dates') %>%
    html_text() %>%
    purrr::map(1) %>%
    # parse the strings into datetime objects with lubridate
    ymd_hms() %>%
    unlist()
}
```
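For reference, `ymd_hms()` from `lubridate` parses an ISO-style timestamp string into a POSIXct date-time, for example:
```{r}
# lubridate parses ISO 8601 timestamps into POSIXct date-times
ymd_hms("2019-06-13T09:12:30.000Z")
```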
Then we bind everything into one table.
```{r}
get.data.table <- function(html, company.name){
  # extract basic information from the HTML
  name.review <- information(html, '.consumer-information__name')
  text.review <- information(html, '.review-content__text')
  # location.review <- information(html, '.consumer-information__location')
  review.count <- information(html, '.consumer-information__review-count')
  rating <- star.rating.information(html)
  dates <- dates(html)
  # combine into a tibble
  combine.data <- tibble(Name = name.review, Dates = dates, Review = text.review,
                         # Location = location.review,
                         Review.count = review.count, Rating = rating)
  # tag individual data with the company name
  combine.data %>%
    mutate(Company = company.name) %>%
    select(Company, Name, Dates, Review, Review.count, Rating)
}
```
```{r}
get.data.url <- function(url, company.name){
  html <- read_html(url)
  get.data.table(html, company.name)
}
```
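Applied to a single page, this returns one tidy tibble of reviews (not evaluated here; it needs the live page):
```{r, eval=FALSE}
# scrape a single review page into a tibble
one.page <- get.data.url(list.of.pages[1], 'amazon')
head(one.page)
```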
Finally, we make a function that scrapes every page of the target URL and binds the results into one *tibble*, written out as a tab-separated file.
```{r}
scrape.table <- function(url, company.name){
  # read the first page
  first.page <- read_html(url)
  # extract the number of pages
  latest.page <- last.page(first.page)
  # build the target URLs
  list.of.pages <- str_c(url, '?page=', 1:latest.page)
  # apply the extraction and bind the individual results back into one table
  list.of.pages %>%
    purrr::map(get.data.url, company.name) %>%
    # combine the tibbles into one tibble
    bind_rows() %>%
    # write a tab-separated file
    write_tsv(str_c(company.name, '.tsv'))
}
```
Let's try to scrape the Amazon reviews from the Trustpilot website.
```{r eval = FALSE}
temp <- scrape.table(url, 'amazon')
write_tsv(temp, "amazon.tsv")
```
```{r}
# read the scraped data back in
amazon <- read_tsv('amazon.tsv')
head(amazon, 10)
```
# Data Visualisation {.tabset .tabset-fade .tabset-pills}
We have now obtained the review data. Based on the scraping results, 4500 reviewers left reviews of Amazon on Trustpilot. The distribution of star ratings across those 4500 reviews is as follows.
## Total Rating
```{r, echo=FALSE, fig.width=8, warning=FALSE, message=FALSE}
# library(extrafont)
# font_import()
# loadfonts(device = "win")
amazon.new <- amazon %>%
mutate(Date = as.Date(Dates),
         Time = format(Dates, "%H:%M:%S"),
rating = factor(Rating, levels = c(1:5),
labels = c("Star 1", "Star 2", "Star 3", "Star 4", "Star 5")),
Judgement = as.factor(case_when(rating == "Star 1"~"Worse",
rating == "Star 2"~"Bad",
rating == "Star 3"~"Good",
rating == "Star 4"~"Better",
rating == "Star 5"~"Best",
TRUE ~ as.character(Rating))))
amazon.new %>%
arrange(rating, Judgement) %>%
group_by(rating, Judgement) %>%
summarise(freq = n()) %>%
ungroup() %>%
ggplot(aes(x = rating, y = freq)) +
geom_col(aes(fill = rating)) +
geom_label(aes(label = factor(freq)), colour = "black", size = 3, nudge_y = 70, label.size = 0.15) +
geom_text(aes(label = Judgement), position = position_stack(vjust = 0.5), angle = 90, size = 2.5,
color = c("black", "black", "black", "white", "white"), fontface = "bold") +
labs(x = "Rating", y = "Amount of Rating", title = "Total Rating") +
scale_fill_manual(values = c("#fadbe0", "#eaadbd", "#b88a9f", "#876880", "#554562")) +
theme(title = element_text(size = 12, colour = "black",
family = "Franklin Gothic Medium"),
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 10, color = "black", family = "Calibri"),
legend.title = element_text(size = 8, hjust = 0.2, vjust = 0.2, angle = 0.5, family = "Calibri"),
legend.text = element_text(size = 8, hjust = 0.2, vjust = 0.2, angle = 0.5, family = "Calibri"),
legend.box.background = element_blank(),
legend.background = element_blank(),
legend.key = element_blank(),
axis.line = element_line(colour = "grey", size = 0.8),
panel.grid.major.y = element_line(colour = "white", linetype = 1),
panel.background = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
plot.margin = margin(1, 1, 1, 1, "cm"),
plot.background = element_rect(colour = "grey", fill = "grey90", size = 1),
legend.position = "bottom")
```
From the graph above, we can see that most reviewers give Amazon a 5-star (Star 5) rating. Next, let's look at how the average rating and the number of reviews move from month to month.
## Monthly Average Rating
```{r}
library(xts)
# build a time series of ratings indexed by review date
amazon.ts <- xts(amazon.new$Rating, amazon.new$Date)
colnames(amazon.ts) <- 'Rating'
# keep only reviews from 2009-01-01 onwards
ended.interval <- '2009-01-01/'
amazon.xts <- amazon.ts[ended.interval]
# average rating and number of reviews per month
avg.rating <- apply.monthly(amazon.xts, colMeans)
count.rating <- apply.monthly(amazon.xts, FUN = length)
```
```{r}
avg.rating <- avg.rating %>%
as.data.frame()
avg.rating$month <- row.names(avg.rating)
# avg.rating <- avg.rating[,c(ncol(avg.rating), 1:(ncol(avg.rating)-1))]
avg.rating <- avg.rating %>%
mutate(month = as.Date(month),
Year = year(month),
Month_ = month(month),
Month = factor(paste( Year, Month_, sep = "-")),
Month = factor(Month, levels = Month)) %>%
select(-Year, -Month_, -month)
count.rating <- count.rating %>%
as.data.frame()
count.rating$month <- row.names(count.rating)
# avg.rating <- avg.rating[,c(ncol(avg.rating), 1:(ncol(avg.rating)-1))]
count.rating <- count.rating %>%
mutate(month = as.Date(month),
Year = year(month),
Month_ = month(month),
Month = factor(paste( Year, Month_, sep = "-")),
Month = factor(Month, levels = Month)) %>%
select(-Year, -Month_, -month)
```
```{r, echo=FALSE, warning=FALSE, message=FALSE}
avg.plot <- avg.rating %>%
ggplot(aes(x= Month, y = Rating, group = 1)) +
geom_line(lwd = 1.5, colour = "#E93434") +
scale_y_continuous(limits = c(1.0, 5.5), breaks = seq(1.0, 10.0, 1.1),
name = "Average Rating/month") +
labs(title = "Average Month Rating", x = "Month") +
theme(title = element_text(family = "Lucida Sans Unicode", size = 12, face = "bold"),
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 10, vjust = 0.5),
axis.line = element_line(colour = "grey70", size = 0.8),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "grey70", linetype = 1),
panel.grid.minor.y = element_blank(),
plot.margin = margin(.5, .5, .5, .5, "cm"),
plot.background = element_rect(colour = "grey", fill = "#E1E1E1", size = 0.8))
count.plot <- count.rating %>%
ggplot(aes(x= Month, y = Rating, group = 1)) +
geom_line(lwd = 1.5, colour = "#E93434") +
# scale_y_continuous(limits = c(1.0, 5.5), breaks = seq(1.0, 10.0, 1.1),
# name = "Average Count") +
labs(title = "Number Review per Month", x = "Month") +
theme(title = element_text(family = "Lucida Sans Unicode", size = 12, face = "bold"),
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 10, vjust = 0.5),
axis.line = element_line(colour = "grey70", size = 0.8),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "grey70", linetype = 1),
panel.grid.minor.y = element_blank(),
plot.margin = margin(.5, .5, .5, .5, "cm"),
plot.background = element_rect(colour = "grey", fill = "#E1E1E1", size = 0.8))
library(gridExtra)
grid.arrange(avg.plot, count.plot, nrow =2)
```