PCWG_share_01_main.Rmd

---
title: "PCWG_share_01_main"
author: "Andy Clifton"
date: '`r Sys.Date()`'
output: pdf_document
---
<!-- ## Reading the source code?
This is R Markdown. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com> or <http://kbroman.org/knitr_knutshell/pages/Rmarkdown.html>.

This script is designed to be used with RStudio. When you click the **Knit** button in RStudio a document will be generated that includes the output of any embedded R code chunks within the document. -->

# Introduction
This document contains the results of the Power Curve Working Group's Share_01 exercise, which ran from October to December 2015. The document and results are generated using the programing language `R` from the _PCWG_share_01_main.rmd_ file and can be run by participants themselves.

## How to use PCWG_share_01_main.rmd
install R (<http://www.r-project.org>) and Rstudio (<http://www.rstudio.com>), and then create a directory with all of the code and files (see below). When you click the **Knit** button in RStudio a document will be generated that includes text and results from the code embedded in _PCWG_share_01_main.rmd_.

```{r clean up, echo = FALSE}
rm(list = ls())
```
## User Inputs
The _project.root_ variable defines the location of the files required for this analysis. The _made.by_ variable forms part of a label that will be added to the plots. _data.public_ is a flag that indicates whether the results of the analysis are intended to be public, or not. _data.analyze.raw_ is a flag that indicates whether individual data files should be (re)analyzed (_data.analyze.raw = TRUE_) or whether saved, aggregated data should be used (_data.analyze.raw = FALSE_). 

The following user inputs were used in the preparation of this document:

```{r user-defined options}
# Where can files be found?
project.root <- file.path('/Users/stuart/PCWG-Share-01')

# Who ran this script
made.by = "A. Clifton, NREL"

# Will data be public or not?
data.public = TRUE

#change setting below to toggle normalisation 
# FALSE = Normalise to energy in bin (may overstate errors in low wind speed bins)
# TRUE = Normalise to energy in dataset
data.normalise_to_all = TRUE

# Reanalyze existing data?
data.analyze.raw = TRUE

# software version to use:
# version string in "x.x.x" format
sw.version = "0.8.0"
# should we use versions less, equal, or greater than that?
sw.version.logic = "from" 
# options are "to" (<=), "equals" (==), or "from" (>=).
# See SelectDatabySWVersion.R for implications.

# Do not load versions of the tool before...
# will be ignored if "" is used
# use data versions in the form "x.x.x". 
# Note that this may conflict with some code
data.version.earliest ="0.8.0"
```

Note that _data.version.earliest_ is used to limit the data that are processed; data before this version number will not be used. In this document, _data.version.earliest_ was set to  `cat('"',data.version.earliest,'"')`. Also, a major change in data processing occurred in version 0.5.9 of the PCWG tool. Therefore, the functions used in the preparation of this document have two inputs (_sw.version_ and _sw.version.logic_) that can be used to select the data that are used for each plot.

```{r check old data actually exists, results="asis", echo = FALSE}
# define the path to the directory that we will use to store all of the data
output.dir = file.path(project.root, 
                       "analysis",
                       "all")
# create this directory if it doesn't already exist
dir.create(output.dir, 
           showWarnings = FALSE,
           recursive = TRUE)

# check for existing data. If none exists, set data.analyze.raw back to TRUE
if (data.analyze.raw == FALSE){
  if (file.exists(file.path(output.dir,"AggregatedData.RData"))){
    cat(text = "\n This document was produced from data saved at ", file.path(output.dir,"AggregatedData.RData"), "\n")
  } else {
    cat(text = "\nThis document was produced from raw data files.\n")
    data.analyze.raw <- TRUE
  }
} else {
  cat(text = "\nThis document was produced from raw data files.\n")
}

```

## Packages
### Standard Packages
This script requires the _ggplot2_, _grid_, _knitr_, _RColorBrewer_, _rgdal_, _stringr_, and _XLConnect_ packages to run. These are called from the script but you may need to install them directly. For details of how to install packages, see the RStudio help. 


```{r load packages, message=FALSE, echo = FALSE}
require(ggplot2)
require(grid)
require(XLConnect)
require(knitr)
require(reshape2)
require(stringr)
```

### WARNING
Note that currently (September 21, 20216) the plots require the developer version of _ggplot2_. The github master of ggplot2 can be found at <https://github.com/hadley/ggplot2> . Installing this version requires the _devtools_ package. To install _devtools_ and the developer version of _ggplot2_, type:

````{r install_dev_ggplot2, eval=FALSE, echo=TRUE}
install.packages("devtools")
devtools::install_github("hadley/ggplot2")
````

## Directory structure

```{r define file locations, message=FALSE, echo = FALSE}
# define the working directory
working.dir <- project.root

setwd(working.dir)

#identify data directory
data.dir = file.path(project.root,
                     "data")

# define where functions live
code.dir = file.path(project.root,
                     "code")

# source these functions
code.files = dir(code.dir, pattern = "\\.R$")
for (file in code.files){
  source(file = file.path(code.dir,file))
}

```
The folowing files should be placed in the _project.root_ directory:

* PCWG_share_01_main.Rmd
* /__analysis__ directory containg results of the analysis
* /__code__ directory containing functions required for the analysis
* /__data__ directory containing all data files to be analyzed. This can include further sub directories. 
+ All .xls files contained in __data__ and sub directories will be used in the analysis. Any __Summary.R__ files will be ignored.
+ A __metaData.R__ file can be added to __data__ and include the line _data.supplier.type <- <<supplier type>>_, where <<<<supplier type>> could be one of "Consultant", "Developer", "Owner/Operator", or "OEM". 

```{r configure graphics, message=FALSE, echo = FALSE}
# configure graphics appearance
#theme_set(theme_PCWG(base_size = 8, 
#               base_family = "sans"))
```

# Results from each data set
We now analyse the data from each data set. The plots are saved to their own directories in the _analysis_ directory. If _data.public_ is FALSE, plots will be created for every data set. If _data.public_ is TRUE, only the final, aggregated data plots will be created.

```{r load individual data files, results='asis', echo = FALSE}
# set a counter to check for numbers of files that are data.version.earliest or later
count.version = 0
if (data.analyze.raw == TRUE){
  # identify the data sets that we have available
  data.files = dir(data.dir, 
                   recursive = TRUE,
                   pattern = "\\.xls$")
  
  # remove the file called summary.xls
  data.files <- data.files[lapply(data.files,function(x) length(grep("ummary",x,value=FALSE))) == 0]
  
  # create an empty list to store data in at a later date 
  all.data <- list(sub = NULL,
                   meta = NULL,
                   errors = NULL)
  for (data.file in data.files){
    # Read this data set;
    # open the excel sheet
    wb <- loadWorkbook(file.path(data.dir,data.file))
    # read the submission data

    in.sub <- ReadSubmissionData(wb)
    in.sub$data.file <- data.file
    # check the version
    if (data.version.earliest  !=""){
      # return 1 if the imported file is later than data.version.earliest 
      # return 0 if the imported file is the same as data.version.earliest 
      do.continue = compareVersion(substr(in.sub$sw.version, start =1, stop =5),
                                   data.version.earliest )
      #
    } else {
      do.continue = 0
    }
    
    # depending on how up to date the version is, continue or stop
    if (do.continue < 0){
      cat(text = paste0("\nChecking software versions. ", data.file ,
                   " was produced using ", in.sub$sw.version,
                   ": skipping file.\n"))
      rm(list = c("wb","in.sub"))
    } else {
      cat(text = paste0("\nChecking software versions. ", data.file ,
                   " was produced using ", in.sub$sw.version,
                   ": using file.\n"))
      count.version = count.version+1
      
      # keep going; read more of the file
      # read the meta data
      in.meta <- ReadMetaData(wb)
      
      meta.data.file <- file.path(data.dir,
                                  dirname(data.file),
                                  "metaData.R")
      if (file.exists(meta.data.file)){
        # read the meta data file associated with the entire submission
        source(file = meta.data.file)
        in.meta$data.supplier.type <- data.supplier.type 
      } else {
        in.meta$data.supplier.type <- NA
      }
      # append all of this good stuff
      in.meta$data.file <- data.file
      in.meta$sw.version <- in.sub$sw.version
      
      # read *all* errors
      in.errors <- ReadErrorData(wb,
                                 in.sub$sw.version)

      if (data.public == FALSE){
        # write out results from this case to file and to the document  
        cat(text = "\n## Data set", in.sub$random.ID, "from", data.file, "\n")
        # check to see if we have a directory for this case
        output.dir = file.path(project.root, 
                               "analysis",
                               strsplit(data.file, "\\.")[[1]][1])
        dir.create(output.dir, showWarnings = FALSE,recursive = TRUE)
        
        # create a label we'll use to annotate plots
        plot.label <- labelSubmission(in.sub,
                                      made.by)
        # plot errors
        if (NROW(na.omit(in.errors$by.WS$error.val.pc)) >=1){
          PlotSubErrorsByWS(in.errors$by.WS,
                            plot.label,
                            output.dir)
        }
        if (NROW(na.omit(in.errors$by.TOD$error.val.pc)) >=1){
          PlotSubErrorsByTOD(in.errors$by.TOD,
                             plot.label,
                             output.dir)
        }
        if (NROW(na.omit(in.errors$by.CM$error.val.pc)) >=1){
          PlotSubErrorsByCM(in.errors$by.CM,
                            plot.label,
                            output.dir)
        }
        if (NROW(na.omit(in.errors$by.WD$error.val.pc)) >=1){
          PlotSubErrorsByWD(in.errors$by.WD,
                            plot.label,
                            output.dir)
        }
        if (NROW(na.omit(in.errors$by.4CM$error.val.pc)) >=1){
          PlotSubErrorsBy4CM(in.errors$by.4CM,
                             plot.label,
                             output.dir)
        }
        
        # dummy text to clear knitr buffers  
        cat(text = "\n")
      }
      
      # prepare data for aggregation
      all.data <- aggregateDataSets(all.data,
                                    in.sub,
                                    in.meta,
                                    in.errors)
    }
  }
}
```

``` {r check we have data, echo = FALSE}
if ((count.version == 0)& (data.analyze.raw==TRUE)){
  stop("No data found.\n Check $count.version is empty or contains a sensible value, for example '0.5.10'.")
  
}
```


``` {r save data, echo=FALSE}
# define the path to the directory that we will use to store all of the data
output.dir = file.path(project.root, 
                       "analysis",
                       "all")

# save the data
if (data.analyze.raw == TRUE){
  save(list = c("project.root",
                "made.by",
                "data.public",
                "data.version.earliest",
                "working.dir",
                "output.dir",
                "all.data"), 
       file = file.path(output.dir,"AggregatedData.RData"),
       envir = .GlobalEnv)
} else {
  load(file = file.path(output.dir,"AggregatedData.RData"))
}
```

# Data Sets
## Data Suppliers
Data suppliers were categorized according to the role they play in the wind industry. Roles include:

* Consultants: providing technical services to others but with no financial interest in the development or operation of the wind plant

* Developers: site, design, and plan (or carry out) construction of prospective wind plants.

* Owner/operators: have a finacial stake in the wind plant and are involved with some or all of the day-to-day running of the wind plant

* Original Equipment Manufacturers (OEMs): design, manufature, and maintain wind turbines 

```{r Histogram of data supplier types, echo=FALSE, fig.width = 6.5, fig.height = 3.5, fig.cap="Data Sources"}
#PlotAllDataSuppliers(all.data$meta,
#                     sw.version = sw.version,
#                     sw.version.logic = sw.version.logic,
#                     output.dir,
#                     made.by)
``` 

## Turbine Sizes
In total, `r nrow(all.data$sub)` data sets were submitted. The `r nrow(all.data$sub)` data sets include tests carried out in the period from `r min(all.data$meta$year.of.measurement, na.rm=TRUE)` to `r max(all.data$meta$year.of.measurement, na.rm=TRUE)`. Turbine diameters range from `r min(all.data$meta$turbine.dia)` to `r max(all.data$meta$turbine.dia)` m, while hub heights range from `r min(all.data$meta$turbine.height)` to `r max(all.data$meta$turbine.height)` m.

```{r plot turbine year and size, fig.width = 6.5, fig.height = 3.5, echo=FALSE}
PlotAllTurbineYearSizeDia(all.data$meta,
                          sw.version = sw.version,
                          sw.version.logic = sw.version.logic,
                          output.dir,
                          made.by)

```

## Turbine Locations
The coutry in which the turbine was located was reported in `r NROW(na.omit(all.data$meta$Geography.country))` of the data sets. 

```{r Histogram of turbine locations, echo=FALSE, fig.width = 6.5, fig.height = 3.5, fig.cap="Turbine locations by country"}
PlotAllTurbineLocations(all.data$meta,
                        sw.version = sw.version,
                        sw.version.logic = sw.version.logic,
                        output.dir,
                        made.by)
``` 

Data were obtained from turbines in `r NROW(unique(na.omit(all.data$meta$Geography.country)))` countries including `r TextList(unique(na.omit(all.data$meta$Geography.country)))`. 

```{r Map of turbine locations, message = FALSE, echo=FALSE, fig.width = 6.5, fig.height = 3.5}
MapAllTurbineLocations(all.data$meta,
                       sw.version = sw.version,
                       sw.version.logic = sw.version.logic,
                       code.dir,
                       output.dir,
                       made.by =made.by)
```

# Results
In this section, data from all of the individual data sets have been combined. 

## Errors versus wind speed
Error may reasonably be expected to be a function of wind speed. In order to maintain confidentiality, wind speeds were normalized with respect to rated wind speed (?which one?) by the PCWG tool, and errors were binned into 1-m/s bins. Results for five different methods are shown in the following plots. They include

* The baseline method
* The rotor-equivalent wind speed method (REWS)
* The turbulence correction method
* The turbulence correction method using REWS wind speeds, and 
* The power deviation matrix.

### All Errors
```{r plot errors by wind speed bin using lines, echo=FALSE, fig.width = 6.5, fig.height = 4.5}
PlotAllErrorsByWSBin_Lines(all.data$errors$by.WS,
                           data.range = "all",
                           sw.version = sw.version,
                           sw.version.logic = sw.version.logic,
                           output.dir)
```               

### Errors grouped by magnitude of Error
The following figures show the error by wind speed for data sets with baseline inner range absolute NME greater than 2%.
```{r Error by wind speed for high-NME data sets, echo=FALSE, fig.width = 6.5, fig.height = 4.5}
# get the name of files with high baseline inner range NME
data.files.NMEanom <- unique(as.character(all.data$errors$by.Range$data.file[(all.data$errors$by.Range$range == "Inner") & (abs(all.data$errors$by.Range$error.val.pc) > 2) & (all.data$errors$by.Range$correction == "Baseline") & (all.data$errors$by.Range$error.name == "NME")]))

# plot the eror by wind speed for those files
PlotAllErrorsByWSBin_Lines(all.data$errors$by.WS[all.data$errors$by.WS$data.file %in% data.files.NMEanom,],
                           data.range = "all",
                           sw.version = sw.version,
                           sw.version.logic = sw.version.logic,
                           output.dir,
                           filename.suffix = "_absGT2pc")
```

The following figures show the error by wind speed for data sets with baseline inner range absolute NME less than or equal to 2%.

```{r Error by wind speed for low-NME data sets, echo=FALSE, fig.width = 6.5, fig.height = 4.5}
# get the name of files with low baseline inner range NME
data.files.NMEusual <- unique(as.character(all.data$errors$by.Range$data.file[(all.data$errors$by.Range$range == "Inner") & (abs(all.data$errors$by.Range$error.val.pc) <= 2) & (all.data$errors$by.Range$correction == "Baseline") & (all.data$errors$by.Range$error.name == "NME")]))

# plot the eror by wind speed for those files
PlotAllErrorsByWSBin_Lines(all.data$errors$by.WS[all.data$errors$by.WS$data.file %in% data.files.NMEusual,],
                           data.range = "all",
                           sw.version = sw.version,
                           sw.version.logic = sw.version.logic,
                           output.dir,
                           filename.suffix = "_absLTEq2pc")
```

## Errors Binned by Wind Speed
The following plots show the error in each wind speed bin, grouped by correction method. Plots are provided for all data, the inner range, and the oter range. Data are plotted using a box with whiskers that shows the variation of the error in each wind speed bin. Outliers are shown as black points.

###All Data

```{r plot errors by wind speed bin using box-and-whiskers plot, echo=FALSE,  fig.width = 6.5, fig.height = 3.5}
PlotAllErrorsByWSBin_BoxNWhiskers(all.data$errors$by.WS,
                                  data.range = "all",
                                  sw.version = sw.version,
                                  sw.version.logic = sw.version.logic,
                                  output.dir)
```               

###Inner Range
```{r plot all inner range errors by wind speed using box and whiskers, echo=FALSE,  fig.width = 6.5, fig.height = 3.5}
PlotAllErrorsByWSBin_BoxNWhiskers(all.data$errors$by.WS,
                                  data.range = "Inner",
                                  sw.version = sw.version,
                                  sw.version.logic = sw.version.logic,
                                  output.dir)
```               

###Outer Range
```{r plot all outer range errors by wind speed using box and whiskers, echo=FALSE,  fig.width = 6.5, fig.height = 3.5}
PlotAllErrorsByWSBin_BoxNWhiskers(all.data$errors$by.WS,
                                  data.range = "Outer",
                                  sw.version = sw.version,
                                  sw.version.logic = sw.version.logic,
                                  output.dir)
```               

## Baseline Inner and Outer Range Error Histograms
The following plot compares the normalized mean error (NME) for the inner and outer range for the baseline power curve. The error is the difference between xx and xx.

The inner and outer range are defined in the PCWG's 2013 Proposal (see http://www.pcwg.org/proposals/PCWG-Inner-Outer-Range-Proposal-Dec-2013.pdf) as:

* __The Inner Range__: the range of conditions for which one can expect to achieve an Annual Energy Production (AEP) of 100% (relative to a reference power curve).
* __The Outer Range__: the range of conditions for which one can expect to achieve an AEP of less than 100%. Stated another way the outer range is the range of all possible conditions excluding those in the inner range.

Based on this definition, it would be expected that the error in the inner range would be less than the error in the outer range. This expectation is supported by the following figure.

```{r baseline inner and outer range histogram, echo = FALSE, fig.width = 6.5, fig.height = 3.5}
PlotAllBaselineErrorsByRange(all.data$errors$by.Range,
                             sw.version = sw.version,
                             sw.version.logic = sw.version.logic,
                             output.dir)

```

```{r statistics, echo = TRUE}
StatsByRange(all.data$errors$by.Range,
             sw.version = sw.version,
             sw.version.logic = sw.version.logic) 
```

## The Effect of The Turbulence Correction on the Outer Range Error
The PCWG investigated the effect of turbulence on the power curve uncertainty. The effect of turbulence was to be accounted for using the turbulence correction approach. If this approach were succesful, we would expect to see reduced error with the turbulence correction, compared to the baseline. A comparison of the outer range error with and without turbulence correction is shown in the following figure.

```{r, echo = FALSE, fig.width = 6.5, fig.height = 3.5}
PlotAllOuterErrorsTurbulenceCorrection(all.data$errors$by.Range,
                                       sw.version = sw.version,
                                       sw.version.logic = sw.version.logic,
                                       output.dir)
```

# The Effect of Other Corrections
A similar analysis as in the previous section can be applied to the improvement in both the inner and outer range. The following plot shows the difference in the error in each data set in the inner or outer range, compared to the baseline method. The 1:1 line is shown for comparison; a data point below the line has a lower normalized mean error with the correction, than it had with the baseline method. The number of data sets with each correction is also shown, together with the percentage of the data sets that showed an improvement using this correction.

```{r plot improvement, fig.width = 6.5, fig.height = 4.5, echo=FALSE}
PlotImprovementByRangeAndCorrection(all.data$errors$by.Range,
                                    sw.version = sw.version,
                                    sw.version.logic = sw.version.logic,
                                    output.dir)

```

## Errors Binned by Wind Speed and Ti
The following plot shows how effective each set of corrections is, for each of four combinations of wind speed and turbulence intensity. Data are plotted as $\left|\textrm{error}\right|_\textrm{with corrections} - \left|\textrm{error}\right|_\textrm{baseline}$. Therefore, a positive change indicates an increase in the magnitude of the error, and a negative change indicates a decrease in the magnitude of the error.

```{r, echo=FALSE, fig.width = 6.5, fig.height = 4.5}
# plot the change compared to the baseline in each data set
PlotAllChangeInErrorsBy4CM(df.in = all.data$errors$by.4CM,
                           show.perfect = FALSE,
                           sw.version = sw.version,
                           sw.version.logic = "from",
                           error.name = "NMAE",
                           output.dir)
```             

It is also possible to define a perfect correction which reduces all error to zero. This is shown in the following plot as a benchmark correction.

```{r, echo=FALSE, fig.width = 6.5, fig.height = 4.5}
# plot the change compared to the baseline in each data set
PlotAllChangeInErrorsBy4CM(df.in = all.data$errors$by.4CM,
                           show.perfect = TRUE,
                           sw.version = sw.version,
                           sw.version.logic = "from",
                           error.name = "NMAE",
                           output.dir)
```             

```{r, echo=FALSE, fig.width = 6.5, fig.height = 4.5}
PlotImprovementBy4CM(all.data$errors$by.4CM,
                     sw.version = sw.version,
                     sw.version.logic = "from",
                     output.dir)
```