---
title: "Step 1: Preprocess data for analysis"
author: "Noah Klammer"
date: "6/27/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
## Clear the global environment and report memory use
```{r}
rm(list = ls())
gc()
```
# Import
## ESO timeseries from "zemf_hvac_dhw_novar_eso"
```{r import, message=FALSE, warning=FALSE}
library(readr)
# skip the first three lines of the file before the header row
df <- read_csv("data_in/zemf_mini_erv_dhw_novar_eso.csv",
               skip = 3)
# name the first column explicitly
colnames(df)[1] <- "Date/Time"
#View(df)
```
```{r infer time sampling, include=FALSE}
# readr::spec(df)
# infer the timestep from the row count (assumes a non-leap, 8760-hour year)
rows <- nrow(df)
if (rows == 8760 * 4) {
  freq <- "15 minutes"
} else if (rows == 8760) {
  freq <- "hourly"
} else if (rows == 12) {
  freq <- "monthly"
} else {
  freq <- "could not determine"
}
```
There are ``r ncol(df)`` column variables in this file with names like ``r names(df)[3]``, ``r names(df)[6]``, and ``r names(df)[50]``.
The frequency of this .eso file's timestep is ``r freq``.
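The row-count check can also be wrapped in a small helper for reuse on other files. This is only a minimal sketch: the function name `infer_freq` is hypothetical and not part of the original analysis, and it assumes a non-leap year of 8760 hours, as the chunk above does.

```r
# Hypothetical helper: map a row count to a timestep label.
# Assumes a non-leap year of 8760 hours.
infer_freq <- function(n_rows) {
  known <- c("15 minutes" = 8760 * 4, "hourly" = 8760, "monthly" = 12)
  hit <- names(known)[known == n_rows]
  if (length(hit) == 1) hit else "could not determine"
}

infer_freq(8760)
infer_freq(100)
```

A lookup vector keeps the known row counts in one place, so adding another timestep would only require one more entry.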
### Rowname labels
Row names are helpful labels for a data frame without being data per se.
```{r warning=FALSE}
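# probably just easier to do this in Excel
# setting row names on a tibble is deprecated, hence warning=FALSE
eighty760_labels <- str_replace(df$`Date/Time`, "/\\d{4}", "") # take out year
rownames(df) <- eighty760_labels
# note: this overwrites the labels set on the previous line
rownames(df) <- paste(df$`Date/Time`, "Energy Use [J]") # add in variable name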
```
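To make the labelling step concrete, here is a small base-R sketch of the same idea: strip a `/YYYY` year from a timestamp and append a variable name. The timestamps are made up for illustration, and the real `Date/Time` values may not contain a year at all.

```r
# Made-up timestamps that include a year; the real file may differ.
stamps <- c("01/01/2021 01:00:00", "12/31/2021 24:00:00")

# Remove the first "/YYYY" occurrence, then append the variable name.
no_year <- sub("/\\d{4}", "", stamps)
labels <- paste(no_year, "Energy Use [J]")
labels
```

If a timestamp has no four-digit year, `sub()` leaves it unchanged, so the step is harmless on year-less data.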
### Zone names for columns
```{r}
# save the time column, carrying the row-name labels along
time_col <- select(df, c(`Date/Time`))
rownames(time_col) <- rownames(df)
# drop the time column before renaming
df <- select(df, -c(`Date/Time`))
# extract the zone name from each column name;
# columns that do not match the pattern get NA as their name
regex_str <- "(?<=Zone\\:).+(?=\\s)|(?<=\\s\\-\\s).*(?=\\sELECTRICITY)"
zone_names <- str_extract(colnames(df), regex_str)
colnames(df) <- zone_names
# reattach the time column
df <- cbind(time_col, df)
```
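The pattern has two alternatives: text between `Zone:` and the last whitespace, or text between ` - ` and ` ELECTRICITY`. The sketch below applies it with base R to a few hypothetical column names. These names are invented for illustration and may not match the real headers in the `.eso` export.

```r
regex_str <- "(?<=Zone\\:).+(?=\\s)|(?<=\\s\\-\\s).*(?=\\sELECTRICITY)"

# Hypothetical column names; the real headers may be formatted differently.
cols <- c(
  "Electricity:Zone:CORRIDOR_2 [J](Hourly)",
  "Water Heater - BDRM_2 ELECTRICITY [J](Hourly)",
  "Some Unrelated Column"
)

# Base-R stand-in for str_extract(): NA where the pattern does not match.
m <- regexpr(regex_str, cols, perl = TRUE)
zone_names <- rep(NA_character_, length(cols))
zone_names[m > 0] <- regmatches(cols, m)
zone_names
```

The lookbehind and lookahead assertions need `perl = TRUE` in base R. The third name matches neither alternative and comes back as `NA`, which is why unmatched columns end up with `NA` names.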
# Inspect Zones and select only residential zones of interest
Dwelling units, stairwells, and corridors above the first floor are all considered residential zones.
### Drop Date/Time and save for later
```{r}
# save the Date/Time column (it is dropped when the residential columns are selected below)
time_col <- select(df, c(`Date/Time`))
rownames(time_col) <- rownames(df)
```
```{r}
res_list <-
c(
grep("STAIRWELL_\\d", colnames(df), value = TRUE),
grep("CORRIDOR_\\d", colnames(df), value = TRUE),
grep("BDRM", colnames(df), value = TRUE)
)
sorted_list <- stringr::str_sort(res_list, numeric = TRUE)
df <- df[sorted_list]
```
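As an illustration of how the three `grep()` patterns select zones, here is a sketch on invented zone names. The naming scheme is an assumption, in particular the idea that a ground-floor stairwell carries a non-numeric suffix such as `_G`.

```r
# Hypothetical zone names; "G" marks a ground-floor zone in this made-up scheme.
zones <- c("CORRIDOR_10", "BDRM_2", "STAIRWELL_G", "CORRIDOR_2", "LOBBY", "STAIRWELL_3")

# Keep stairwells and corridors whose suffix starts with a digit, plus bedrooms.
res_list <- c(
  grep("STAIRWELL_\\d", zones, value = TRUE),
  grep("CORRIDOR_\\d", zones, value = TRUE),
  grep("BDRM", zones, value = TRUE)
)
res_list
```

`grep()` keeps matches in their original order, so `CORRIDOR_10` still precedes `CORRIDOR_2` here. `stringr::str_sort(..., numeric = TRUE)` sorts embedded numbers by value and would place `CORRIDOR_2` first.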
### Add Date/Time back in
```{r warning=FALSE}
`Date/Time` <- time_col
df <- cbind(`Date/Time`,df)
#rownames(df) <- paste(eighty760_labels, "Ideal Tot. Clg Load [J/m^2]")
```
### Create columns for categorical month, day, hour, minute
```{r warning=FALSE}
# separate the date and time into cols month, day, hour, minute
# make sure to have two digits for all days and months
df <- df %>%
  mutate(month = as.integer(substr(`Date/Time`, start = 1, stop = 2)),
         day = as.integer(substr(`Date/Time`, start = 4, stop = 5)),
         hour = as.integer(substr(`Date/Time`, start = 7, stop = 8)), # hour is not working
         `Date/Time` = as.integer(substr(`Date/Time`, start = 10, stop = 11))) %>% # minute, renamed on the next line
  rename(minute = `Date/Time`)
# reorder columns by name, sorting embedded numbers numerically
sorted_list <- stringr::str_sort(colnames(df), numeric = TRUE)
df <- df[sorted_list]
# numeric month var to month string
# df <- transform(df, month = month.abb[month])
```
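Fixed character positions only work if every timestamp has exactly the same layout, which may be why the hour field is flagged as not working. One alternative is to capture the fields with a regular expression. The sketch below assumes a `MM/DD  HH:MM:SS` layout with one or more spaces between date and time. That layout is an assumption and should be checked against the actual `Date/Time` values.

```r
# Made-up timestamps in an assumed "MM/DD  HH:MM:SS" layout.
stamps <- c("01/01  01:00:00", "12/31  24:00:00")

# Capture the four numeric fields; \\s+ tolerates one or more spaces.
pattern <- "^\\s*(\\d{2})/(\\d{2})\\s+(\\d{2}):(\\d{2})"
parts <- regmatches(stamps, regexec(pattern, stamps))

# Each element of `parts` is c(full_match, month, day, hour, minute).
fields <- t(vapply(parts, function(p) as.integer(p[2:5]), integer(4)))
colnames(fields) <- c("month", "day", "hour", "minute")
fields
```

A timestamp that does not match the pattern yields a row of `NA` values rather than silently misaligned numbers, so parsing problems show up in the later NA check.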
### Data QA/QC: remove observations with NA
```{r}
# remove minute col if subhourly data DNE (i.e. minute has zero variance)
# isTRUE() guards against var() returning NA when minute contains NA
if ("minute" %in% names(df) && isTRUE(var(df$minute) == 0)) {
  df <- select(df, -minute)
}
# remove NA observations
if (anyNA(df)) {
  df <- df %>% na.omit()
}
# remove the automatic row numbers
# rownames(df) <- NULL
```
### Data QA/QC: zero variance
```{r include=FALSE}
sel_days_df <- df
# visdat::vis_cor(sel_days_df)
#=> "the standard deviation is zero"
# let's find which cols have zero variance
zv <- which(apply(sel_days_df, 2, var) == 0)
# index the names directly; the empty pattern in str_subset(x, "") filtered
# nothing and is rejected by recent stringr versions
z_var_zones <- colnames(sel_days_df)[zv]
# zones with no cooling load (no "cooling" string match is applied here)
# note: this variable shares its name with base::c(); calls to c() still work
c <- z_var_zones
# get zones that str match with "heating"
# h <- grep("*\\Deating", z_var_zones, value = TRUE)
# extracts zone name logic
# c <- str_extract(c,"(?<=SYSTEM\\s).*(?=\\:)")
# h <- str_extract(h,"(?<=SYSTEM\\s).*(?=\\:)")
# are there any zones with neither heating nor cooling?
# intersect(c,h)
#=> unconditioned zones
# remove the unc zones from
# `zero_var_zone_names` string list
z_var_zone_names <- z_var_zones
```
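The zero-variance check can be illustrated on a toy data frame. The column names and values below are invented, and `vapply()` is used here in place of `apply()` so that each column is handled on its own rather than after conversion to a matrix. This is a sketch of the idea, not the chunk's exact method.

```r
# Toy data frame with hypothetical zone columns.
toy <- data.frame(
  BDRM_1 = c(1.0, 2.5, 3.0),
  CORRIDOR_2 = c(0, 0, 0),
  STAIRWELL_2 = c(4, 4, 4)
)

# Variance per column, computed column by column.
col_var <- vapply(toy, var, numeric(1))

# Names of the columns whose variance is exactly zero;
# which() drops any NA comparisons.
zero_var_cols <- names(col_var)[which(col_var == 0)]
zero_var_cols
```

A constant column has zero variance whether its value is zero or not, so a zone can be flagged here without its load actually being zero.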
The longer we work with this data set, the clearer it becomes that many zones have zero variance. We see that ``r length(z_var_zones)`` out of ``r length(sel_days_df)`` columns have zero variance for this temporal range. These counts are taken over all columns, including the month, day, and hour columns.
We find that there are ``r length(c)`` zones with no cooling load and `0` zones with no heating load. We assert that `r length(z_var_zones)` zones have neither cooling nor heating load. These zones are unconditioned spaces.
Naturally, in a correctly defined building energy model, few if any hours in a given zone will have both heating and cooling. For possible future regression purposes, I will treat heating load as negative cooling load.
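The sign convention can be sketched as follows. The hourly values are made up, and the vector names `cooling` and `heating` are hypothetical, since this file does not contain separate heating and cooling columns.

```r
# Made-up hourly loads for one zone, in joules.
cooling <- c(0, 120, 300, 0)
heating <- c(200, 0, 0, 50)

# Signed load: cooling counts as positive, heating as negative.
net_load <- cooling - heating
net_load

# Hours with both loads nonzero would be worth inspecting in the model.
both <- which(cooling > 0 & heating > 0)
length(both)
```

Folding both loads into one signed variable gives a single response column for regression. It loses information only in hours where heating and cooling overlap, which should be rare.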
### Save as .Rda R data file
```{r}
hvac_dhw_novar <- df
save(hvac_dhw_novar,file = "hvac_dhw_novar.Rda")
```
<br><br><br>