---
title: "Text Analysis with Term Frequency for Mark Twain's Novels"
output: html_document
---
Samuel Langhorne Clemens, better known as Mark Twain, is one of the most important American writers. "The Adventures of Tom Sawyer" is one of my favorite books in all of English literature. I am happy to see that Twain's river novels remain required reading for young students; he is read more widely now than ever!
Project Gutenberg offers over 53,000 free books. I will use four of Twain’s best novels for this analysis:
* Roughing It
* Life on the Mississippi
* The Adventures of Tom Sawyer
* Adventures of Huckleberry Finn
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
We will be using the following packages for the analysis:
```{r}
library(tidyverse)   # loads ggplot2, dplyr, tidyr and stringr
library(tidytext)    # unnest_tokens(), get_sentiments(), bind_tf_idf()
library(gutenbergr)  # gutenberg_download()
theme_set(theme_minimal())
```
## Data preprocessing
We’ll retrieve these four books using the gutenbergr package:
```{r}
books <- gutenberg_download(c(3177, 245, 74, 76), meta_fields = "title")
```
An important preprocessing step is tokenization, the process of splitting a text into individual words or sequences of words. The unnest_tokens function does just that: it converts the text column into a one-token-per-row format, like so:
```{r}
tidy_books <- books %>%
  unnest_tokens(word, text)
tidy_books
```
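To make the one-token-per-row idea concrete, here is a minimal sketch on a made-up sentence. The toy data frame and its contents are assumptions for illustration only and are not part of the Twain data. The second call shows the "sequences of words" case with two-word tokens.

```{r}
library(tibble)
library(tidytext)

# Hypothetical one-line "book", used only for illustration
toy <- tibble(title = "Toy", text = "The river was wide, and the raft was small.")

# One word per row; by default the text is lowercased and punctuation is dropped
toy_tokens <- unnest_tokens(toy, word, text)
toy_tokens

# One two-word sequence (bigram) per row
toy_bigrams <- unnest_tokens(toy, bigram, text, token = "ngrams", n = 2)
toy_bigrams
```

Note that the title column is carried along for every token, which is what later lets us count words per book.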
After removing stop words, we can find the most common words across all four books.
```{r}
data("stop_words")
# Drop common stop words ("the", "and", ...) before counting
cleaned_books <- tidy_books %>%
  anti_join(stop_words, by = "word")
cleaned_books %>%
  count(word, sort = TRUE)
```
### A little bit of sentiment
Sentiment analysis is not the focus today, but since we are here already, why not have a quick look?
```{r}
bing <- get_sentiments("bing")
# Keep only words that appear in the Bing lexicon, then count them by sentiment
bing_word_counts <- tidy_books %>%
  inner_join(bing, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts
```
```{r}
bing_word_counts %>%
  filter(n > 100) %>%
  mutate(n = ifelse(sentiment == 'negative', -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab('Contribution to sentiment') +
  ggtitle('Most common positive and negative words')
```
We did not spot any anomalies in the sentiment analysis results, except that the word "miss" is identified as a negative word. In "Adventures of Huckleberry Finn" it is actually used as a title for the tough old spinster Miss Watson.
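One possible way to handle this, sketched here under the assumption that we simply want to ignore "miss" altogether, is to filter it out of the lexicon before joining. The name bing_no_miss is hypothetical and is not used elsewhere in this analysis.

```{r}
library(dplyr)
library(tidytext)

# Remove "miss" from the Bing lexicon so the title "Miss" is not scored as negative
bing_no_miss <- get_sentiments("bing") %>%
  filter(word != "miss")
```

The filtered lexicon could then be passed to inner_join in place of bing. The trade-off is that genuine uses of "miss" as a verb would no longer be counted either.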
### tf-idf
To blatantly quote the [Wikipedia article](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) on tf-idf:
In text analysis, tf-idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
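As a rough sketch of the arithmetic, the chunk below computes tf-idf by hand in base R for two made-up documents. The counts are hypothetical, and the definition used (term frequency times the natural log of the inverse document frequency) is, as far as I can tell, the same one bind_tf_idf applies.

```{r}
# Hypothetical word counts for two tiny "documents" (illustration only)
toy_counts <- data.frame(
  doc  = c("A", "A", "B", "B"),
  word = c("river", "raft", "river", "mine"),
  n    = c(3, 1, 2, 2)
)

# Term frequency: a word's count divided by the total count in its document
toy_counts$tf <- toy_counts$n / ave(toy_counts$n, toy_counts$doc, FUN = sum)

# Inverse document frequency: natural log of
# (number of documents / number of documents containing the word)
n_docs <- length(unique(toy_counts$doc))
docs_with_word <- ave(toy_counts$n, toy_counts$word, FUN = length)
toy_counts$idf <- log(n_docs / docs_with_word)

toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf
toy_counts
```

A word that occurs in every document, like "river" here, gets an idf of zero and therefore a tf-idf of zero, however frequent it is. Words concentrated in a single document score highest.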
For our purposes, we want to know the most important words (highest tf-idf) across Mark Twain's four books overall, and the most important words (highest tf-idf) in each of the four books. Let's find out.
```{r}
# Count each word per book
book_words <- cleaned_books %>%
  count(title, word, sort = TRUE) %>%
  ungroup()
# Total number of (non-stop) words per book
total_words <- book_words %>%
  group_by(title) %>%
  summarize(total = sum(n))
book_words <- left_join(book_words, total_words, by = "title")
book_words
```
Terms with high tf-idf across all four novels:
```{r}
book_words <- book_words %>%
  bind_tf_idf(word, title, n)
book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
```
```{r}
# Order words by tf-idf so the bars are sorted in the plot
plot <- book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))
plot %>%
  top_n(20, tf_idf) %>%
  ggplot(aes(word, tf_idf, fill = title)) +
  geom_bar(stat = 'identity', position = position_dodge()) +
  labs(x = NULL, y = "tf-idf") +
  coord_flip() +
  ggtitle("Top tf-idf words in Mark Twain's Four Novels")
```
```{r}
plot %>%
  group_by(title) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~title, ncol = 2, scales = "free") +
  coord_flip() +
  ggtitle('Top tf-idf words in each novel')
```
Each novel has its own highest tf-idf words. However, the language he used across these four novels is fairly similar; for example, the term "city" has a high tf-idf in both "Roughing It" and "Life on the Mississippi".
### Term frequency
Just for kicks, let's compare Mark Twain's works with those of Charles Dickens. Let's get “A Tale of Two Cities”, “Great Expectations”, “A Christmas Carol in Prose; Being a Ghost Story of Christmas”, “Oliver Twist” and “Hard Times”.
What are the most common words in these novels of Charles Dickens?
```{r}
dickens <- gutenberg_download(c(98, 1400, 46, 730, 786))
# Tokenize and drop stop words, as we did for Twain
tidy_dickens <- dickens %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
tidy_dickens %>%
  count(word, sort = TRUE)
```
```{r}
tidy_twains <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```
```{r}
frequency <- bind_rows(mutate(tidy_twains, author = "Mark Twain"),
                       mutate(tidy_dickens, author = "Charles Dickens")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, `Mark Twain`:`Charles Dickens`)
```
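To see what the chunk above does, here is a small sketch of the same steps on made-up counts. The toy values are assumptions for illustration. The point is that spreading and then gathering produces an NA for any word that only one author uses, and those rows are dropped by the complete.cases filter before plotting.

```{r}
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical counts: "river" is used by only one of the two authors
toy_freq <- tibble(
  author = c("Mark Twain", "Mark Twain", "Charles Dickens"),
  word   = c("time", "river", "time"),
  n      = c(3, 1, 2)
) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%   # share of each author's words
  ungroup() %>%
  select(-n) %>%
  spread(author, proportion) %>%        # one column per author, NA where a word is absent
  gather(author, proportion, -word)     # back to one row per word and author

toy_freq
toy_freq[complete.cases(toy_freq), ]    # the "river" row for Dickens is removed
```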
```{r}
frequency$word <- factor(frequency$word,
                         levels = unique(with(frequency,
                                              word[order(proportion, word,
                                                         decreasing = TRUE)])))
frequency <- frequency[complete.cases(frequency), ]
ggplot(aes(x = reorder(word, proportion), y = proportion, fill = author),
       data = subset(frequency, proportion > 0.0025)) +
  geom_bar(stat = 'identity', position = position_dodge()) +
  coord_flip() +
  ggtitle('Comparing the word frequencies of Mark Twain and Charles Dickens')
```
The top term for both authors is the same: "time". Other than that, their language is very different.