Name That Function
When you hit this function, do this...
Use `gather` and `spread`, respectively, but note that many data reshapes are unnecessary in the new data system.
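For example, a minimal wide-to-long round trip (the table and column names here are invented for illustration):

```R
library(tidyr)

wide <- tibble::tibble(region = c("USA", "EU"),
                       `2010` = c(1, 2),
                       `2015` = c(3, 4))

# Wide -> long: one (region, year, value) row per cell
long <- gather(wide, key = "year", value = "value", -region)

# Long -> wide: back to one column per year
spread(long, year, value)
```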
Use `bind_rows` and `bind_cols` instead.
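A quick sketch with invented tables:

```R
library(dplyr)

a <- tibble::tibble(iso = "usa", value = 1)
b <- tibble::tibble(iso = "fra", value = 2)
bind_rows(a, b)                  # stack rows: the replacement for rbind()

left  <- tibble::tibble(iso = c("usa", "fra"))
right <- tibble::tibble(value = c(1, 2))
bind_cols(left, right)           # combine columns: the replacement for cbind()
```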
Use `group_by` and `summarise`. Note that by default `aggregate` drops `NA` values in its grouping variables, while `summarise` does not!
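A minimal sketch of that difference, with invented data containing an `NA` group:

```R
library(dplyr)

df <- tibble::tibble(region = c("USA", "USA", NA),
                     value  = c(1, 2, 3))

# aggregate() silently drops the NA group:
aggregate(value ~ region, data = df, FUN = sum)   # one row: USA = 3

# group_by()/summarise() keeps NA as its own group:
df %>%
  group_by(region) %>%
  summarise(value = sum(value))                   # two rows: USA = 3, NA = 3
```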
It depends. For matching GCAM region IDs, and any other operation where you want an error thrown if there's no match, use `left_join_error_no_match`. Use `left_join` if it's OK for `NA` values to appear after the join (i.e., you know that everything might not match).
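For instance, a sketch of the two patterns (`my_data` is an invented name):

```R
# Strict join: use when every row should match; an error is thrown otherwise
my_data %>%
  left_join_error_no_match(iso_GCAM_regID, by = "iso")

# Permissive join: unmatched rows simply get NA in the new columns
my_data %>%
  left_join(iso_GCAM_regID, by = "iso")
```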
If there are multiple potential matches, and you want only the first one (i.e., to replicate the behavior of R's `match`):

```R
y %>%
  distinct(join_var, .keep_all = TRUE) %>%
  right_join(x, by = "join_var")
```

...where `x` and `y` are two data frames. Note that here `join_var` can be one or multiple variables. (`distinct` keeps only the first row for each value of `join_var`, so the subsequent `right_join` attaches at most one match to each row of `x`.)
Use `inner_join` if you want only the rows that are common to the two data sets (i.e., rows that appear in one data set but not the other will be dropped).
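A sketch with invented tables:

```R
library(dplyr)

prod   <- tibble::tibble(iso = c("usa", "fra"), prod  = c(1, 2))
prices <- tibble::tibble(iso = c("usa", "chn"), price = c(10, 30))

# Only rows common to both tables survive; "fra" and "chn" are dropped
prod %>% inner_join(prices, by = "iso")
```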
Use `semi_join` if you want to filter a data set to the rows that have matches in another data set, but you don't actually want to add the data from the other data set. You can think of this as a generalization of the `%in%` operator: a row is "in" the other data frame if it has a match.
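A sketch of the analogy (invented tables again):

```R
library(dplyr)

prod   <- tibble::tibble(iso = c("usa", "fra"), prod = c(1, 2))
prices <- tibble::tibble(iso = c("usa", "chn"), price = c(10, 30))

# Filter prod to rows with a match in prices, taking no columns from prices
prod %>% semi_join(prices, by = "iso")

# With a single key column this is equivalent to %in%:
prod %>% filter(iso %in% prices$iso)

# Unlike %in%, semi_join extends naturally to multi-column keys, e.g.
# semi_join(x, y, by = c("iso", "year"))
```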
Other cases? Careful here.
This is now called `set_water_input_name`.
This method was used in the old data system to call `gcam_interp` and then melt the results. Users should just call `gather` themselves instead of relying on this method to melt. For the rest of the functionality, please consult the section on `gcam_interp`.
This method was used to fill out values in all requested years, which were represented in the data given to the method in a "wide" format with each year in its own column. How missing data was filled in was controlled by the `rule=` parameter. The following values were supported:
1. Linearly interpolate (no extrapolation).
2. Linearly interpolate, or extrapolate for values outside the specified range.
3. Linearly interpolate for values inside the specified range, and extrapolate using an exponential decay function.
For rules 1 and 2, users can now use the pipeline helper `approx_fun` within a pipeline that gathers their assumptions data, for instance:
```R
# Interpolate the GCAM population data to all historical and future years
GCAM3_population %>%
  complete(nesting(region_GCAM3), year = c(year, HISTORICAL_YEARS, FUTURE_YEARS)) %>%
  arrange(region_GCAM3, year) %>%
  group_by(region_GCAM3) %>%
  mutate(value = approx_fun(year, value)) %>%
  filter(year %in% c(HISTORICAL_YEARS, FUTURE_YEARS)) ->
  L101.Pop_thous_GCAM3_RG3_Y
```
Since rule 3 is significantly more complicated, it was added as a module helper called `fill_exp_decay_extrapolate`, which will do a lot of the pipeline processing for you. You can start by giving it the wide assumptions data and the years that need to be filled in:
```R
A23.globaltech_capital %>%
  fill_exp_decay_extrapolate(c(HISTORICAL_YEARS, FUTURE_YEARS)) %>%
  rename(value = capital.cost) ->
  L223.globaltech_capital
```
See `?approx_fun` and `?fill_exp_decay_extrapolate` for more details on how to use these functions.
The new function is called `repeat_add_columns` and can operate in a pipeline, e.g. `x %>% repeat_add_columns(y)`. Specifically, if the old call is `repeat_and_add_vector(x, name_of_new_column, y)`, you will likely need to implement it as `x %>% repeat_add_columns(tibble::tibble(name_of_new_column = y))`.
The `tibble::tibble` is often required because the new column, `y`, is usually a vector like `c("CH4", "N2O", "NMVOC", "NOx", "SO2", "CO", "VOC")`; in this case, leaving out the conversion to `tibble` will cause an error. `x %>% repeat_add_columns(name_of_new_column = tibble::tibble(y))` and `x %>% repeat_add_columns(name_of_new_column, tibble::tibble(y))` will both also lead to errors.
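A minimal runnable sketch of the working pattern (the `region` and `GHG` names are invented):

```R
library(dplyr)
library(gcamdata)   # assumed source of repeat_add_columns

x     <- tibble::tibble(region = c("USA", "EU"))
gases <- c("CH4", "N2O", "CO")

# Every row of x is repeated once per row of the tibble, giving 2 x 3 = 6 rows
x %>% repeat_add_columns(tibble::tibble(GHG = gases))
```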
This can be replaced by a call to `tidyr::complete`. Here's a code sample, in which every combination of region and commodity will be included, with missing values assigned to 0:
```R
DATA_FRAME %>%
  complete(GCAM_region_ID = unique(iso_GCAM_regID$GCAM_region_ID),
           GCAM_commodity = unique(FAO_ag_items_cal_SUA$GCAM_commodity),
           fill = list(value = 0))
```
Make sure to `ungroup()` the data before using `complete`. Otherwise, `complete` will duplicate rows.
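That is, the safe pattern is a sketch like:

```R
DATA_FRAME %>%
  ungroup() %>%   # drop any grouping left over from earlier pipeline steps
  complete(GCAM_region_ID = unique(iso_GCAM_regID$GCAM_region_ID),
           fill = list(value = 0))
```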
(Note that instead of writing `tidyr::complete` you can also add `@importFrom tidyr complete` to the function's header, and then just use `complete`.)
This should never be necessary. Apart from the fact that the `collapse` argument to `paste` does this for you, `vecpaste` in the current code base is almost invariably used in conjunction with `match` to find corresponding rows in two data frames. Use one of the `join` functions above instead. Example from `LA100.0_LDS_preprocessing`:
```R
# This used to be a complicated vecpaste call
L100.LDS_ag_HA_ha %>%
  semi_join(L100.LDS_ag_prod_t, by = c("iso", aglu.GLU, "GTAP_crop")) ->
  L100.LDS_ag_HA_ha
```
The `EDGAR_nation` file mapped the ISO codes used in the EDGAR data to the ISO codes used in the `iso_GCAM_regID` mapping. Essentially, this file did two things: (1) switched from upper to lower case, and (2) changed the Romania ISO code from its current value (`rou`) to its pre-2002 value (`rom`). Instead of using this file, we are now making these changes explicit. If you have a chunk that uses `EDGAR_nation`, you should use the following three lines of code to go from the EDGAR ISO codes to `GCAM_region_ID`:
```R
standardize_iso(col = "ISO_A3") %>%
  change_iso_code('rou', 'rom') %>%
  left_join(iso_GCAM_regID, by = "iso")
```
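These lines are meant to sit inside a pipeline that starts from an EDGAR table; a sketch, where `EDGAR_CH4` is an invented input name:

```R
EDGAR_CH4 %>%                             # any EDGAR table with an ISO_A3 column (invented name)
  standardize_iso(col = "ISO_A3") %>%     # lower-case the codes into an iso column
  change_iso_code("rou", "rom") %>%       # Romania: current code rou -> pre-2002 rom
  left_join(iso_GCAM_regID, by = "iso")   # attach the GCAM_region_ID mapping
```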
While translating most Level 2 data files, developers may find calls to a function `get_logit_fn_tables`, as well as perhaps looping over its return value, or chunk output declarations that were automatically generated as something such as `L203.SectorLogitTables[[ curr_table ]]$data`, which clearly isn't right. What to do with these? The old data system function `get_logit_fn_tables` would generate a list of tables closely related to some other table that had logit exponents, such as `L203.Supplysector_demand`. We could have replicated all of this behavior in gcamdata, but luckily there was a much cleaner solution that avoids most of it. Mostly it involves deleting all of the code associated with `get_logit_fn_tables` and its resulting tables. The following is a "diff" showing an example of how `zchunk_L203.demand_input.R` was changed from its template to accommodate the new solution for `get_logit_fn_tables` in gcamdata. It can be used as guidance for converting your chunk. For those not familiar with diff files, the basics are that lines starting with `-` are the old lines of code (from the automatically generated chunk template) and lines starting with `+` are what those lines were changed to.
```diff
diff --git a/R/zchunk_L203.demand_input.R b/R/zchunk_L203.demand_input.R
index 73ee311..5028c3a 100644
--- a/R/zchunk_L203.demand_input.R
+++ b/R/zchunk_L203.demand_input.R
@@ -37,9 +37,7 @@ module_aglu_L203.demand_input <- function(command, ...) {
"L101.Pop_thous_R_Yh",
"L102.pcgdp_thous90USD_Scen_R_Y"))
} else if(command == driver.DECLARE_OUTPUTS) {
- return(c("L203.SectorLogitTables[[curr_table]]$data",
- "L203.Supplysector_demand",
- "L203.SubsectorLogitTables[[curr_table]]$data",
+ return(c("L203.Supplysector_demand",
"L203.SubsectorAll_demand",
"L203.StubTech_demand",
"L203.GlobalTechCoef_demand",
@@ -121,35 +119,13 @@ module_aglu_L203.demand_input <- function(command, ...) {
# L203.Supplysector_demand: generic info for demand sectors
A_demand_supplysector %>%
- get_logit_fn_tables(names_SupplysectorLogitType, GCAM_region_names = GCAM_region_names,
- base_header = "Supplysector_", include_equiv_table = TRUE, write_all_regions = TRUE) ->
- L203.SectorLogitTables
- # Remove any regions for which agriculture and land use are not modeled
- for(curr_table in names(L203.SectorLogitTables)) {
- if(curr_table != "EQUIV_TABLE") {
- L203.SectorLogitTables[[curr_table]]$data <- filter(L203.SectorLogitTables[[curr_table]]$data, !(region %in% aglu.NO_AGLU_REGIONS))
- }
- }
-
- A_demand_supplysector %>%
- write_to_all_regions(names_Supplysector, GCAM_region_names = GCAM_region_names) %>%
+ write_to_all_regions(c(names_Supplysector, "logit.type"), GCAM_region_names = GCAM_region_names) %>%
filter(!region %in% aglu.NO_AGLU_REGIONS) ->
L203.Supplysector_demand
# L203.SubsectorAll_demand: generic info for demand subsectors
A_demand_subsector %>%
- get_logit_fn_tables(names_SubsectorLogitType, GCAM_region_names = GCAM_region_names,
- base_header = "SubsectorLogit_", include_equiv_table = FALSE, write_all_regions = TRUE) ->
- L203.SubsectorLogitTables
- # Remove any regions for which agriculture and land use are not modeled
- for(curr_table in names(L203.SubsectorLogitTables)) {
- if(curr_table != "EQUIV_TABLE") {
- L203.SubsectorLogitTables[[curr_table]]$data <- filter(L203.SubsectorLogitTables[[curr_table]]$data, !region %in% aglu.NO_AGLU_REGIONS)
- }
- }
-
- A_demand_subsector %>%
- write_to_all_regions(names_SubsectorAll, GCAM_region_names = GCAM_region_names) %>%
+ write_to_all_regions(c(names_SubsectorAll, "logit.type"), GCAM_region_names = GCAM_region_names) %>%
filter(!region %in% aglu.NO_AGLU_REGIONS) ->
L203.SubsectorAll_demand
@@ -393,17 +369,6 @@ module_aglu_L203.demand_input <- function(command, ...) {
filter(!region %in% aglu.NO_AGLU_REGIONS) ->
L203.FuelPrefElast_ssp1
- # Produce outputs
- L203.SectorLogitTables[[curr_table]]$data %>%
- add_title("descriptive title of data") %>%
- add_units("units") %>%
- add_comments("comments describing how data generated") %>%
- add_comments("can be multiple lines") %>%
- add_legacy_name("L203.SectorLogitTables[[ curr_table ]]$data") %>%
- add_precursors("common/GCAM_region_names",
- "aglu/A_demand_supplysector") ->
- L203.SectorLogitTables[[curr_table]]$data
-
L203.Supplysector_demand %>%
add_title("descriptive title of data") %>%
add_units("units") %>%
@@ -414,16 +379,6 @@ module_aglu_L203.demand_input <- function(command, ...) {
"aglu/A_demand_supplysector") ->
L203.Supplysector_demand
- L203.SubsectorLogitTables[[curr_table]]$data %>%
- add_title("descriptive title of data") %>%
- add_units("units") %>%
- add_comments("comments describing how data generated") %>%
- add_comments("can be multiple lines") %>%
- add_legacy_name("L203.SubsectorLogitTables[[ curr_table ]]$data") %>%
- add_precursors("common/GCAM_region_names",
- "aglu/A_demand_subsector") ->
- L203.SubsectorLogitTables[[curr_table]]$data
-
L203.SubsectorAll_demand %>%
add_title("descriptive title of data") %>%
add_units("units") %>%
@@ -688,7 +643,7 @@ module_aglu_L203.demand_input <- function(command, ...) {
"L102.pcgdp_thous90USD_Scen_R_Y") ->
L203.IncomeElasticity_SSP5
- return_data(L203.SectorLogitTables[[curr_table]]$data, L203.Supplysector_demand, L203.SubsectorLogitTables[[curr_table]]$data, L203.SubsectorAll_demand, L203.StubTech_demand, L203.GlobalTechCoef_demand, L203.GlobalTechShrwt_demand, L203.StubTechProd_food_crop, L203.StubTechProd_food_meat, L203.StubTechProd_nonfood_crop, L203.StubTechProd_nonfood_meat, L203.StubTechProd_For, L203.StubTechFixOut_exp, L203.StubCalorieContent_crop, L203.StubCalorieContent_meat, L203.PerCapitaBased, L203.BaseService, L203.IncomeElasticity, L203.PriceElasticity, L203.FuelPrefElast_ssp1, L203.IncomeElasticity_SSP1, L203.IncomeElasticity_SSP2, L203.IncomeElasticity_SSP3, L203.IncomeElasticity_SSP4, L203.IncomeElasticity_SSP5)
+ return_data(L203.Supplysector_demand, L203.SubsectorAll_demand, L203.StubTech_demand, L203.GlobalTechCoef_demand, L203.GlobalTechShrwt_demand, L203.StubTechProd_food_crop, L203.StubTechProd_food_meat, L203.StubTechProd_nonfood_crop, L203.StubTechProd_nonfood_meat, L203.StubTechProd_For, L203.StubTechFixOut_exp, L203.StubCalorieContent_crop, L203.StubCalorieContent_meat, L203.PerCapitaBased, L203.BaseService, L203.IncomeElasticity, L203.PriceElasticity, L203.FuelPrefElast_ssp1, L203.IncomeElasticity_SSP1, L203.IncomeElasticity_SSP2, L203.IncomeElasticity_SSP3, L203.IncomeElasticity_SSP4, L203.IncomeElasticity_SSP5)
} else {
stop("Unknown command")
}
```
Again, in summary, the lines associated with the results of `get_logit_fn_tables` are removed completely. One subtle change, however, is that the table `L203.Supplysector_demand` must now include the column `logit.type`, which would not have existed in the old data system. The testing system has been updated to accommodate this additional column.
The other change will be in the "batch XML chunk", such as in `zchunk_batch_demand_input_xml.R`, where instead of just converting `L203.Supplysector_demand` with:

```R
create_xml("demand_input.xml") %>%
  add_xml_data(L203.Supplysector_demand, "Supplysector") %>%
```
you would instead use a specialized add-XML function:

```R
create_xml("demand_input.xml") %>%
  add_logit_tables_xml(L203.Supplysector_demand, "Supplysector") %>%
```
In addition, you may need to double-check the proper header to use from the old data system. For instance, for the subsector, earlier in the Level 2 processing file we had (the header is the third argument):

```R
L203.SubsectorLogitTables <- get_logit_fn_tables( A_demand_subsector, names_SubsectorLogitType,
    base.header="SubsectorLogit_", include.equiv.table=F, write.all.regions=T )
```
while the corresponding table `L203.SubsectorAll_demand` was written to XML using a different header later in the Level 2 processing file (the second argument):

```R
write_mi_data( L203.SubsectorAll_demand, "SubsectorAll", "AGLU_LEVEL2_DATA", "L203.SubsectorAll_demand", "AGLU_XML_BATCH", "batch_demand_input.xml" )
```
Thus the `add_logit_tables_xml` function takes a third argument to supply this header (it defaults to the second argument, since in most cases they will be the same):

```R
add_logit_tables_xml(L203.SubsectorAll_demand, "SubsectorAll", "SubsectorLogit") %>%
```
Note that failing to use `add_logit_tables_xml` will intentionally produce XML that looks like the following, to indicate, when run through GCAM, that the `logit.type` was not properly set:
```xml
<dummy-logit-tag>
    <logit-exponent fillout="1" year="1975">0</logit-exponent>
</dummy-logit-tag>
```
Instead of something like:
```xml
<relative-cost-logit>
    <logit-exponent fillout="1" year="1975">0</logit-exponent>
</relative-cost-logit>
```
These are all defined in `module-helpers.R`. Note that in some cases the required parameters, and/or their names, may vary slightly from the original versions!