Name That Function

Ben Bond-Lamberty edited this page Aug 25, 2017 · 29 revisions

When you hit this function, do this...

Built-in (R and packages) functions

melt and dcast

Use gather and spread respectively, BUT note that many data reshapes are unnecessary in the new data system.
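
As a rough illustration of the mapping, the sketch below reshapes a small made-up table (the `wide` data frame and its values are assumptions for demonstration only) from wide to long with gather and back again with spread:

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide <- data.frame(region = c("USA", "China"),
                   `1990` = c(1, 2),
                   `2000` = c(3, 4),
                   check.names = FALSE)

# Old style: melt(wide, id.vars = "region")
# New style: gather every column except region into year/value pairs
long <- gather(wide, key = "year", value = "value", -region)

# Old style: dcast(long, region ~ year)
# New style: spread the year/value pairs back into one column per year
wide_again <- spread(long, key = year, value = value)
```

Note that gather stores the former column names as character strings, so a year column usually needs an explicit conversion (for example with as.integer) before it is used numerically.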

rbind and cbind

Use bind_rows and bind_cols instead.
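
One practical difference worth seeing: bind_rows matches columns by name and fills missing ones with NA. The tables below are invented for illustration:

```r
library(dplyr)

# Two hypothetical tables with different column order and an extra column
a <- tibble(region = "USA", value = 1)
b <- tibble(value = 2, region = "China", units = "Mt")

# Columns are matched by name; `units` is NA for the rows that came from `a`
combined <- bind_rows(a, b)

# bind_cols simply places tables side by side, so row counts must agree
side_by_side <- bind_cols(tibble(x = 1:2), tibble(y = 3:4))
```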

aggregate

Use group_by and summarise. NOTE that by default aggregate drops NA values in its grouping variables, while summarise does not!
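
The NA difference is easy to miss, so here is a small sketch using made-up data. If you need to reproduce the old behavior, filter out the NA groups explicitly:

```r
library(dplyr)

# Hypothetical data with a missing grouping value
d <- data.frame(sector = c("ag", "ag", NA), value = c(1, 2, 3))

# aggregate (formula interface) silently drops the row whose sector is NA
old <- aggregate(value ~ sector, data = d, FUN = sum)

# summarise keeps NA as its own group
new <- d %>%
  group_by(sector) %>%
  summarise(value = sum(value))

# Matching the old result requires dropping the NA group yourself
new_like_old <- d %>%
  filter(!is.na(sector)) %>%
  group_by(sector) %>%
  summarise(value = sum(value))
```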

match

It depends. For matching GCAM region IDs, and any other operation where you want an error thrown if there's no match, use left_join_error_no_match.

Use left_join if it's OK for NA values to appear after the join (i.e., you know that everything might not match).

If there are multiple potential matches, and you want only the first one (i.e. replicate the behavior of R's match):

y %>%
  distinct(join_var, .keep_all = TRUE) %>%
  right_join(x, by = "join_var")

...where x and y are two data frames. Note that here join_var can be one or multiple variables.
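
To check that this pattern behaves like match, the sketch below compares the two on small invented tables. The final arrange is only there to make the comparison independent of join row order:

```r
library(dplyr)

# Hypothetical inputs: y has two candidate rows for "a" and none for "c"
x <- tibble(join_var = c("a", "b", "c"))
y <- tibble(join_var = c("a", "a", "b"), val = c(1, 2, 3))

# Keep only the first row of y per key, then join onto every row of x
joined <- y %>%
  distinct(join_var, .keep_all = TRUE) %>%
  right_join(x, by = "join_var") %>%
  arrange(join_var)

# Base R equivalent: match returns the position of the first hit, or NA
base_result <- y$val[match(x$join_var, y$join_var)]
```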

Use inner_join if you want only the rows that are common to the two data sets (i.e., rows that appear in one data set but not the other will be dropped).

Use semi_join if you want to filter a data set to the rows that have matches in another data set, but you don't actually want to add the data from the other data set. You can think of this as a generalization of the %in% operator. A row is "in" the other data frame if it has a match.
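
A short sketch of the difference between inner_join and semi_join, using invented tables: semi_join never adds columns and never duplicates rows, even when the other table has repeated keys.

```r
library(dplyr)

# Hypothetical tables; `keep` contains a duplicated key (ind, Wheat)
prod <- tibble(iso = c("usa", "chn", "ind"),
               crop = c("Corn", "Rice", "Wheat"),
               value = c(1, 2, 3))
keep <- tibble(iso = c("usa", "ind", "ind"),
               crop = c("Corn", "Wheat", "Wheat"),
               other = c(10, 20, 30))

# Filters prod to rows with a match in keep; columns of prod only
semi <- semi_join(prod, keep, by = c("iso", "crop"))

# Also brings in `other`, and repeats the ind row once per match
inner <- inner_join(prod, keep, by = c("iso", "crop"))
```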

Other cases?

merge

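Careful here. By default, base R's merge joins on all shared column names, keeps only matching rows, and sorts the result, so pick the join function above that matches the behavior you actually want.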

Data system-specific functions

get_water_inputs_for_mapping

This is now called set_water_input_name.

interpolate_and_melt

This method was used in the old data system to call gcam_interp and then melt the results. Users should just call gather themselves instead of relying on this method to melt. For the rest of the functionality, see the section on gcam_interp.

gcam_interp

This method filled in values for all requested years, given data in a "wide" format with each year in its own column. The rule= parameter controlled how missing data was filled in. The following values were supported:

  1. Linearly interpolate (no extrapolation)
  2. Linearly interpolate or extrapolate for values outside the range of values which were specified.
  3. Linearly interpolate for values inside the range of values which were specified and extrapolate using an exponential decay function.

For rules 1 and 2, users can now call the helper approx_fun within a pipeline that gathers their assumptions data. For instance:

# Interpolate the GCAM population data to all historical and future years
GCAM3_population %>%
  complete(nesting(region_GCAM3), year = c(year, HISTORICAL_YEARS, FUTURE_YEARS)) %>%
  arrange(region_GCAM3, year) %>%
  group_by(region_GCAM3) %>%
  mutate(value = approx_fun(year, value)) %>%
  filter(year %in% c(HISTORICAL_YEARS, FUTURE_YEARS)) ->
  L101.Pop_thous_GCAM3_RG3_Y
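
For readers without gcamdata loaded, the same pipeline shape can be tried on toy data. In the sketch below, approx_standin is a hypothetical stand-in built on stats::approx; it is not the gcamdata approx_fun, whose details may differ, and the table and years are invented:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for approx_fun: interpolate value at every year,
# using only the years that already have a value
approx_standin <- function(year, value, rule = 1) {
  known <- !is.na(value)
  stats::approx(x = year[known], y = value[known], xout = year, rule = rule)$y
}

# Made-up population data with gaps between years
pop <- tibble(region = c("A", "A", "B", "B"),
              year = c(2000, 2004, 2000, 2002),
              value = c(10, 18, 5, 9))

pop_filled <- pop %>%
  # Add a row for every region/year combination; new rows get value = NA
  complete(nesting(region), year = as.numeric(2000:2004)) %>%
  arrange(region, year) %>%
  group_by(region) %>%
  mutate(value = approx_standin(year, value)) %>%
  ungroup()
```

With rule = 1, years outside a region's known range stay NA, which is why region B has no values after 2002 in this toy case.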

Since rule 3 is significantly more complicated, it was added as a module helper called fill_exp_decay_extrapolate, which does much of the pipeline processing for you. Start by giving it the wide assumptions data and the years that must be filled in:

A23.globaltech_capital %>%
  fill_exp_decay_extrapolate(c(HISTORICAL_YEARS, FUTURE_YEARS)) %>%
  rename(value=capital.cost) ->
  L223.globaltech_capital

See ?approx_fun and ?fill_exp_decay_extrapolate for more details on how to use those functions.

repeat_and_add_vector

The new function is called repeat_add_columns and can operate in a pipeline, e.g. x %>% repeat_add_columns(y).

Specifically, if the old call is repeat_and_add_vector(x, name_of_new_column, y), you will likely need to implement it as x %>% repeat_add_columns(tibble::tibble(name_of_new_column = y)). The tibble::tibble is often required because the new column, y, is usually a vector like c("CH4","N2O","NMVOC","NOx","SO2","CO","VOC"); in this case, leaving out the conversion to tibble will cause an error.

x %>% repeat_add_columns(name_of_new_column = tibble::tibble(y) ) and x %>% repeat_add_columns(name_of_new_column , tibble::tibble(y) ) will both also lead to errors.
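
To make the calling convention concrete, here is a minimal sketch. repeat_add_columns_sketch is a hypothetical stand-in written for this example (it is not the gcamdata function, and row order in the real helper may differ); it assumes only what the text above says, namely that both arguments must be data frames and that x is repeated for every row of y:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in: repeat every row of x once per row of y
repeat_add_columns_sketch <- function(x, y) {
  stopifnot(is.data.frame(x), is.data.frame(y))
  x_rep <- x[rep(seq_len(nrow(x)), each = nrow(y)), , drop = FALSE]
  y_rep <- y[rep(seq_len(nrow(y)), times = nrow(x)), , drop = FALSE]
  bind_cols(x_rep, y_rep)
}

x <- tibble(region = c("USA", "China"))
gases <- c("CH4", "N2O")

# Works: the vector is wrapped in a tibble that names the new column
out <- x %>% repeat_add_columns_sketch(tibble(Non.CO2 = gases))

# Fails: a bare vector is not a data frame
bare_vector_attempt <- try(repeat_add_columns_sketch(x, gases), silent = TRUE)
```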

translate_to_full_table

This can be replaced by a call to tidyr::complete. Here's a code sample, in which every combination of region and commodity will be included, with missing values set to 0:

DATA_FRAME %>% 
  complete(GCAM_region_ID = unique(iso_GCAM_regID$GCAM_region_ID),                                             
           GCAM_commodity = unique(FAO_ag_items_cal_SUA$GCAM_commodity),
           fill = list(value = 0)) 

Make sure to ungroup() data before using complete. Otherwise, complete will duplicate rows.

(Note that instead of writing tidyr::complete you can also add @importFrom tidyr complete to the function's header, and then just use complete.)
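
A self-contained version of the same idea, on invented data, shows the ungroup-then-complete order. The region IDs and commodities here are placeholders, not real GCAM mappings:

```r
library(dplyr)
library(tidyr)

# Hypothetical grouped table missing some region/commodity combinations
d <- tibble(GCAM_region_ID = c(1, 1, 2),
            GCAM_commodity = c("Corn", "Rice", "Corn"),
            value = c(5, 3, 7)) %>%
  group_by(GCAM_region_ID)

all_regions <- c(1, 2, 3)
all_commodities <- c("Corn", "Rice")

full <- d %>%
  ungroup() %>%                      # remove grouping before complete
  complete(GCAM_region_ID = all_regions,
           GCAM_commodity = all_commodities,
           fill = list(value = 0))   # new combinations get value 0
```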

vecpaste

This should never be necessary. Apart from the fact that the collapse argument to paste does this for you, vecpaste in the current code base is almost invariably used in conjunction with match to find corresponding rows in two data frames. Use one of the join functions above instead. Example from LA100.0_LDS_preprocessing:

# This used to be a complicated vecpaste call
L100.LDS_ag_HA_ha %>%
  semi_join(L100.LDS_ag_prod_t, by = c("iso", aglu.GLU, "GTAP_crop")) ->
  L100.LDS_ag_HA_ha
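
For comparison, the sketch below runs a pasted-key filter (the general pattern vecpaste plus match was used for) next to the semi_join replacement on made-up tables. The column name GLU and all values are illustrative assumptions:

```r
library(dplyr)

# Hypothetical harvested-area and production tables
ha <- tibble(iso = c("usa", "usa", "chn"),
             GLU = c("GLU001", "GLU002", "GLU001"),
             GTAP_crop = c("Wheat", "Wheat", "Rice"),
             value = c(1, 2, 3))
prod <- tibble(iso = c("usa", "chn"),
               GLU = c("GLU001", "GLU001"),
               GTAP_crop = c("Wheat", "Rice"),
               value = c(10, 30))

# Old pattern: build a pasted key in each table and compare the keys
key <- function(d) paste(d$iso, d$GLU, d$GTAP_crop)
old_style <- ha[key(ha) %in% key(prod), ]

# New pattern: let semi_join compare the key columns directly
new_style <- semi_join(ha, prod, by = c("iso", "GLU", "GTAP_crop"))
```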

EDGAR_nation

The EDGAR_nation file mapped the ISO codes used in the EDGAR data to the ISO codes used in the iso_GCAM_regID mapping. Essentially, this file did two things: (1) switched from capital to lower case and (2) changed the Romania iso code from its current (rou) to its pre-2002 value (rom). Instead of using this file, we are now making these changes explicit. If you have a chunk that uses EDGAR_nation, you should use the following three lines of code to go from EDGAR iso to GCAM_region_ID:

standardize_iso(col = "ISO_A3") %>%
      change_iso_code('rou', 'rom') %>%
      left_join(iso_GCAM_regID, by = "iso")
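
If it helps to see what these steps amount to, the following plain-dplyr sketch mimics them on toy data. It assumes only what the paragraph above describes (lower-casing the code and swapping rou for rom); it does not use the gcamdata helpers, and the region IDs are made up:

```r
library(dplyr)

# Hypothetical EDGAR-style input and region mapping (IDs are invented)
edgar <- tibble(ISO_A3 = c("USA", "ROU"), value = c(1, 2))
iso_GCAM_regID <- tibble(iso = c("usa", "rom"), GCAM_region_ID = c(1, 2))

mapped <- edgar %>%
  # Step 1: lower-case the code into a column named iso
  mutate(iso = tolower(ISO_A3)) %>%
  select(-ISO_A3) %>%
  # Step 2: replace the current Romania code with the pre-2002 one
  mutate(iso = if_else(iso == "rou", "rom", iso)) %>%
  # Step 3: look up the GCAM region
  left_join(iso_GCAM_regID, by = "iso")
```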

set_water_input_name

rename_SO2

get_logit_fn_tables

While translating most Level 2 data files, developers may find calls to a function get_logit_fn_tables, perhaps along with loops over its return value, or chunk output declarations that were automatically generated as something like L203.SectorLogitTables[[ curr_table ]]$data, which clearly isn't right. What to do with these? The old data system function get_logit_fn_tables generated a list of tables closely related to some other table that had logit exponents, such as L203.Supplysector_demand. We could have replicated this behavior in gcamdata, but luckily there was a much cleaner solution that avoids most of it.

Mostly it involves deleting all of the code associated with get_logit_fn_tables and its resulting tables. The following is a "diff" showing an example of how zchunk_L203.demand_input.R was changed from its template to accommodate the new solution for get_logit_fn_tables in gcamdata. It can be used as guidance for converting your chunk. For those not familiar with diff files, the basics are that lines starting with - are the old lines of code (from the automatically generated chunk template) and lines starting with + are what those lines were changed to.

diff --git a/R/zchunk_L203.demand_input.R b/R/zchunk_L203.demand_input.R
index 73ee311..5028c3a 100644
--- a/R/zchunk_L203.demand_input.R
+++ b/R/zchunk_L203.demand_input.R
@@ -37,9 +37,7 @@ module_aglu_L203.demand_input <- function(command, ...) {
"L101.Pop_thous_R_Yh",
"L102.pcgdp_thous90USD_Scen_R_Y"))
} else if(command == driver.DECLARE_OUTPUTS) {
-    return(c("L203.SectorLogitTables[[curr_table]]$data",
-             "L203.Supplysector_demand",
-             "L203.SubsectorLogitTables[[curr_table]]$data",
+    return(c("L203.Supplysector_demand",
"L203.SubsectorAll_demand",
"L203.StubTech_demand",
"L203.GlobalTechCoef_demand",
@@ -121,35 +119,13 @@ module_aglu_L203.demand_input <- function(command, ...) {

# L203.Supplysector_demand: generic info for demand sectors
A_demand_supplysector %>%
-      get_logit_fn_tables(names_SupplysectorLogitType, GCAM_region_names = GCAM_region_names,
-                          base_header = "Supplysector_", include_equiv_table = TRUE, write_all_regions = TRUE) ->
-      L203.SectorLogitTables
-    # Remove any regions for which agriculture and land use are not modeled
-    for(curr_table in names(L203.SectorLogitTables)) {
-      if(curr_table != "EQUIV_TABLE") {
-        L203.SectorLogitTables[[curr_table]]$data <- filter(L203.SectorLogitTables[[curr_table]]$data, !(region %in% aglu.NO_AGLU_REGIONS))
-      }
-    }
-
-    A_demand_supplysector %>%
-      write_to_all_regions(names_Supplysector, GCAM_region_names = GCAM_region_names) %>%
+      write_to_all_regions(c(names_Supplysector, "logit.type"), GCAM_region_names = GCAM_region_names) %>%
filter(!region %in% aglu.NO_AGLU_REGIONS) ->
L203.Supplysector_demand

# L203.SubsectorAll_demand: generic info for demand subsectors
A_demand_subsector %>%
-      get_logit_fn_tables(names_SubsectorLogitType, GCAM_region_names = GCAM_region_names,
-                          base_header = "SubsectorLogit_", include_equiv_table = FALSE, write_all_regions = TRUE) ->
-      L203.SubsectorLogitTables
-    # Remove any regions for which agriculture and land use are not modeled
-    for(curr_table in names(L203.SubsectorLogitTables)) {
-      if(curr_table != "EQUIV_TABLE") {
-        L203.SubsectorLogitTables[[curr_table]]$data <- filter(L203.SubsectorLogitTables[[curr_table]]$data, !region %in% aglu.NO_AGLU_REGIONS)
-      }
-    }
-
-    A_demand_subsector %>%
-      write_to_all_regions(names_SubsectorAll, GCAM_region_names = GCAM_region_names) %>%
+      write_to_all_regions(c(names_SubsectorAll, "logit.type"), GCAM_region_names = GCAM_region_names) %>%
filter(!region %in% aglu.NO_AGLU_REGIONS) ->
L203.SubsectorAll_demand

@@ -393,17 +369,6 @@ module_aglu_L203.demand_input <- function(command, ...) {
filter(!region %in% aglu.NO_AGLU_REGIONS) ->
L203.FuelPrefElast_ssp1

-    # Produce outputs
-    L203.SectorLogitTables[[curr_table]]$data %>%
-      add_title("descriptive title of data") %>%
-      add_units("units") %>%
-      add_comments("comments describing how data generated") %>%
-      add_comments("can be multiple lines") %>%
-      add_legacy_name("L203.SectorLogitTables[[ curr_table ]]$data") %>%
-      add_precursors("common/GCAM_region_names",
-                     "aglu/A_demand_supplysector") ->
-      L203.SectorLogitTables[[curr_table]]$data
-
L203.Supplysector_demand %>%
add_title("descriptive title of data") %>%
add_units("units") %>%
@@ -414,16 +379,6 @@ module_aglu_L203.demand_input <- function(command, ...) {
"aglu/A_demand_supplysector") ->
L203.Supplysector_demand

-    L203.SubsectorLogitTables[[curr_table]]$data %>%
-      add_title("descriptive title of data") %>%
-      add_units("units") %>%
-      add_comments("comments describing how data generated") %>%
-      add_comments("can be multiple lines") %>%
-      add_legacy_name("L203.SubsectorLogitTables[[ curr_table ]]$data") %>%
-      add_precursors("common/GCAM_region_names",
-                     "aglu/A_demand_subsector") ->
-      L203.SubsectorLogitTables[[curr_table]]$data
-
L203.SubsectorAll_demand %>%
add_title("descriptive title of data") %>%
add_units("units") %>%
@@ -688,7 +643,7 @@ module_aglu_L203.demand_input <- function(command, ...) {
"L102.pcgdp_thous90USD_Scen_R_Y") ->
L203.IncomeElasticity_SSP5

-    return_data(L203.SectorLogitTables[[curr_table]]$data, L203.Supplysector_demand, L203.SubsectorLogitTables[[curr_table]]$data, L203.SubsectorAll_demand, L203.StubTech_demand, L203.GlobalTechCoef_demand, L203.GlobalTechShrwt_demand, L203.StubTechProd_food_crop, L203.StubTechProd_food_meat, L203.StubTechProd_nonfood_crop, L203.StubTechProd_nonfood_meat, L203.StubTechProd_For, L203.StubTechFixOut_exp, L203.StubCalorieContent_crop, L203.StubCalorieContent_meat, L203.PerCapitaBased, L203.BaseService, L203.IncomeElasticity, L203.PriceElasticity, L203.FuelPrefElast_ssp1, L203.IncomeElasticity_SSP1, L203.IncomeElasticity_SSP2, L203.IncomeElasticity_SSP3, L203.IncomeElasticity_SSP4, L203.IncomeElasticity_SSP5)
+    return_data(L203.Supplysector_demand, L203.SubsectorAll_demand, L203.StubTech_demand, L203.GlobalTechCoef_demand, L203.GlobalTechShrwt_demand, L203.StubTechProd_food_crop, L203.StubTechProd_food_meat, L203.StubTechProd_nonfood_crop, L203.StubTechProd_nonfood_meat, L203.StubTechProd_For, L203.StubTechFixOut_exp, L203.StubCalorieContent_crop, L203.StubCalorieContent_meat, L203.PerCapitaBased, L203.BaseService, L203.IncomeElasticity, L203.PriceElasticity, L203.FuelPrefElast_ssp1, L203.IncomeElasticity_SSP1, L203.IncomeElasticity_SSP2, L203.IncomeElasticity_SSP3, L203.IncomeElasticity_SSP4, L203.IncomeElasticity_SSP5)
} else {
stop("Unknown command")
}

In summary, the lines associated with the results of get_logit_fn_tables are removed completely. One subtle change, however, is that the table L203.Supplysector_demand (and likewise L203.SubsectorAll_demand in the diff above) must now include the column logit.type, which did not exist in the old data system. The testing system has been updated to accommodate this additional column.

The other change will be in the "batch XML chunk" such as in zchunk_batch_demand_input_xml.R where instead of just converting L203.Supplysector_demand with:

create_xml("demand_input.xml") %>%
  add_xml_data(L203.Supplysector_demand,"Supplysector") %>%

You would instead use the specialized function add_logit_tables_xml:

create_xml("demand_input.xml") %>%
  add_logit_tables_xml(L203.Supplysector_demand,"Supplysector") %>%

In addition, you may need to double-check the proper header to use from the old data system. For instance, for the subsector, the Level 2 processing file earlier passed one header to get_logit_fn_tables (the third argument):

L203.SubsectorLogitTables <- get_logit_fn_tables( A_demand_subsector, names_SubsectorLogitType,
    base.header="SubsectorLogit_", include.equiv.table=F, write.all.regions=T )

While the corresponding table L203.SubsectorAll_demand was written to XML using a different header later in the Level 2 processing file (the second argument):

write_mi_data( L203.SubsectorAll_demand, "SubsectorAll", "AGLU_LEVEL2_DATA", "L203.SubsectorAll_demand", "AGLU_XML_BATCH", "batch_demand_input.xml" )

Thus in the add_logit_tables_xml function there is a third argument to supply this header (which defaults to the second argument since in most cases they will be the same):

add_logit_tables_xml(L203.SubsectorAll_demand,"SubsectorAll", "SubsectorLogit") %>%

Note that failing to use add_logit_tables_xml will result in XML that looks like the following. This is intentional: it signals, when run through GCAM, that the logit.type was not properly set.

<dummy-logit-tag>
  <logit-exponent fillout="1" year="1975">0</logit-exponent>
</dummy-logit-tag>

Instead of something like:

<relative-cost-logit>
  <logit-exponent fillout="1" year="1975">0</logit-exponent>
</relative-cost-logit>

write_to_all_regions

set_traded_names

set_years

add_node_leaf_names

append_GLU

replace_GLU

These are all defined in module-helpers.R. Note that in some cases the parameters required, and/or their names, may vary slightly from the original versions!