generated from emu-hacettepe-analytics/emu430-project-quarto-template
-
Notifications
You must be signed in to change notification settings - Fork 0
/
data.qmd
221 lines (168 loc) · 9.5 KB
/
data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
styles:
styles.css
format:
html:
toc: true
toc-depth: 6
toc-title: Contents
toc-location: left
toc-expand: 5
smooth-scroll: true
anchor-sections: true
include-after-body: abbrv_toc.html
number-sections: true
---
::: {style="text-align: center"}
::: {style="font-family: fantasy; font-size: 50px; font-style: normal; font-variant: small-caps; font-weight: 700; line-height: 39.6px; color: #333333;"}
**Data**
:::
:::
![](giphy.gif){style="text-align: center" fig-align="center" width="437"}
## General Information About the Data
::: {style="text-align: justify"}
The Türkiye Health Survey wants to find out how healthy people are and collect important information about key health measures. It helps compare health internationally and gives insights into what health needs a country has for its development.
:::
**Data source:** [Türkiye Health Survey 2022](https://data.tuik.gov.tr/Bulten/Index?p=Turkiye-Saglik-Arastirmasi-2022-49747){.download-button}
### Why We Selected This Data
::: {style="text-align: justify"}
We picked the 'Türkiye Health Survey' data set for our EMU430 course because it's interesting and fits well with what we're learning. This data set gives us a lot of information about health in Turkey. We can study things like how people take care of their health, how much alcohol they drink, what common diseases there are, and the body mass index of individuals. It's a good choice for our course because it covers a lot of important health topics in Turkey.
:::
### Our Objectives
::: {style="text-align: justify"}
Our main goal is to study how people's health changes with age and differs between men and women. We want to understand how behaviors and diseases vary based on these factors. By doing this research, we hope to gather useful information that can be used to create better health policies and strategies.
Our plan is to use numbers and graphs to look for patterns and differences in the data. We believe that showing our findings visually will make them easier to understand. But before we start, we need to make sure the information we have is correct and complete. We'll check and fix any mistakes or missing details to ensure that our results can be trusted. Following this plan, we aim to make our project's goals, the data we're using, and how we're studying it clear. This will set a strong foundation for our project to move forward successfully.
:::
## Importing and Preprocessing Data: Our Approach
::: {style="text-align: justify"}
Before importing our data, we manually cleaned it by removing Turkish text and unnecessary information. Then, we used the '*dplyr*' and '*tidyr*' packages to improve our understanding of the data. The relevant code is provided below.
:::
This study includes three data-sets:
### **Percentage of Health Problems in the Last 12 Months by Sex, 2016-2022** {toc-text="Dataset 1"}
```{r}
#| code-fold: true
#| code-summary: "Show the code"
suppressPackageStartupMessages(library(dplyr))
library(readxl)
library(dplyr)
library(tidyr)
data_1 <- read_excel("tidydataset1.xls")
colnames(data_1) <- c("Diseases", "men_2016", "women_2016", "men_2019", "women_2019", "men_2022", "women_2022")
data_1_longer <- data_1 %>%
pivot_longer(cols = starts_with(c("men", "women")),
names_to = "Gender",
values_to = "Percentage") %>%
separate(Gender, into = c("Gender", "Year"), sep = "_") %>%
arrange(Diseases) %>%
mutate_at(vars(Year, Percentage), as.numeric) %>%
mutate(Diseases = gsub("^\\n", "", Diseases))
```
[Click here to download the associated .RData file for Dataset 1.](https://github.com/emu-hacettepe-analytics/emu430-fall2023-team-data_ciphers/raw/master/tidy_dataset_1.RData){.download-button}
### **Percentage of Status of Alcohol Use by Sex and Age Group, 2016-2022** {toc-text="Dataset 2"}
```{r}
#| code-fold: true
#| code-summary: "Show the code"
suppressPackageStartupMessages(library(dplyr))
library(readxl)
library(dplyr)
library(tidyr)
data_2 <- read_excel("tidydataset2.xls")
colnames(data_2) <- c("age", "men_2016", "women_2016", "men_2019", "women_2019", "men_2022", "women_2022", "usage")
data_2_longer <- data_2 %>%
pivot_longer(cols = starts_with(c("men", "women")),
names_to = "gender",
values_to = "rate") %>%
separate(gender, into = c("gender", "year"), sep = "_") %>%
arrange(age) %>%
mutate_at(vars(year, rate), as.numeric) %>%
na.omit()
```
[Click here to download the associated .RData file for Dataset 2.](https://github.com/emu-hacettepe-analytics/emu430-fall2023-team-data_ciphers/raw/master/tidy_dataset_2.RData){.download-button}
### **Body Mass Index Distribution of Individuals by Sex, 2008-2022** {toc-text="Dataset 3"}
```{r}
#| code-fold: true
#| code-summary: "Show the code"
suppressPackageStartupMessages(library(dplyr))
library(readxl)
library(dplyr)
library(tidyr)
data_3 <- read_excel("tidydataset3.xls")
colnames(data_3) <- c("Year", "Sex", "Underweight", "Normal_weight", "Pre_Obese", "Obese")
data_3_long <- data_3 %>%
pivot_longer(cols = c(Underweight, Normal_weight, Pre_Obese, Obese),
names_to = "Category",
values_to = "Percentage") %>%
mutate(sex_group = ifelse(Sex == "Total", "Total", "Individual")) %>%
group_by(sex_group) %>%
ungroup() %>%
dplyr::filter(Sex != "Total" | n() == 1) %>%
select(-sex_group)
```
[Click here to download the associated .RData file for Dataset 3.](https://github.com/emu-hacettepe-analytics/emu430-fall2023-team-data_ciphers/blob/master/tidy_dataset_3.RData){.download-button}
::: callout-note
In the data pre-processing phase, we used ChatGPT to provide necessary functions and increase the quality of our content. Some of the functions are: "*mutate_at*", "*na.omit*", "*gsub*".
:::
## Exploratory Data Analysis
In our project, we are working on three data sets, all of which were sourced from the *Turkish Statistical Institute's* ***Türkiye Health Survey*** that was conducted in 2022.
### Dataset 1 : The Percentage of Main Diseases/Health Problems Declared by Individuals in the Last 12 Months by Sex, 2016-2022 {toc-text="Dataset 1"}
This data set showcases the percentage of health problems by sex. Only people over the age of 15 were considered for the study, and Alzheimer was evaluated for individuals in the 65+ age group.
```{r}
#| echo: false
library(readxl)
library(knitr)
dataset_1 <- readxl::read_excel("tidy1.xlsx")
kable(
head(dataset_1, 15),
format = "html",
caption = "The Percentage of Main Diseases/Health Problems Declared by Individuals in the Last 12 Months by Sex",
align = "c"
)
```
::: {style="text-align: justify"}
The first column, titled *Diseases* corresponds to the diseases, and the following columns represent the gender of the individual, the year the data was collected and the percentage information respectively. If we take a look at the visualization of the data as presented below we can see that low back problems are consistently what causes the most issues, in both men and women. In general, women have declared their health problems more than men for all years that were considered for the study.
:::
![The distribution of declared health problems for both genders, 2016](2016_dataset_1.png)
![The distribution of declared health problems for both genders, 2019](2019_dataset_1.png)
![The distribution of declared health problems for both genders, 2022](2022_dataset_1.png)
### Data Set 2 : The Percentage of Individuals' Status of Alcohol Use by Sex and Age Group, 2016-2022 {toc-text="Dataset 2"}
This data set showcases the percentage of individuals' alcohol consumption status by sex and age group.
The first column of the table below shows the various age ranges from the study, starting from age 15 and going all the way up to 75+.
::: {style="text-align: justify"}
The second column, titled *usage*, has three different field values: \*Consumers: Indicates the individual partakes in regular alcohol consumption. \*Doesn't consume: Means the individual has consumed alcohol before, but not anymore/not regularly. \*Never consume: Indicates the individual has never consumed alcohol before.
:::
```{r}
#| echo: false
library(readxl)
dataset_2 <- readxl::read_excel("tidy2.xlsx")
kable(
head(dataset_2, 15),
format = "html",
caption = "The Percentage of Individuals' Status of Alcohol Use by Sex and Age Group",
col.names = c("Age", "User Type", "Gender", "Year", "Percentage"),
align = "c"
)
```
The third column shows the gender of the individuals that took part in the study, followed by the year on column 4 and the percentages on column 5.
We can further understand this data by visualizing it via a bar chart, as shown below.
![The distribution of alcohol consumption habits for both genders; 2016, 2019, 2022](chart_dataset_2.png)
### Data Set 3 : Body Mass Index Distribution of Individuals by Sex, 2008-2022 {toc-text="Dataset 3"}
::: {style="text-align: justify"}
This data set showcases the body mass index distribution of male and female individuals, as well as a total for the year of each sub-study. The data here was collected every two years starting from 2008, and ending at 2022.
:::
```{r}
#| echo: false
library(readxl)
dataset_3 <- readxl::read_excel("tidy3.xlsx")
kable(
head(dataset_3, 15),
format = "html",
caption = "Body Mass Index Distribution of Individuals by Sex",
align = "c"
)
```
The first column has year information. The second column indicates whether the individual is male or female. *Category* column has four field values:
- Underweight
- Normal weight
- Pre-obese
- Obese
The last column showcases the percentage information.