Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benjamin's Initial Evaluation Brain Dump #133

Open
nutterb opened this issue Sep 18, 2016 · 2 comments
Open

Benjamin's Initial Evaluation Brain Dump #133

nutterb opened this issue Sep 18, 2016 · 2 comments

Comments

@nutterb
Copy link
Contributor

nutterb commented Sep 18, 2016

I apologize for the length but I wanted to make this available for review and discussion. I thought it would be good to get some feedback on ideas before I do any code work. i can break these up into separate issues when there's a specific aspect needing focus.

Also, be honest and direct in your responses. Part of this process will be transitioning me away from my redcapAPI mindset to the REDCapR mindset. I promise not to have hurt feelings if you don't want to incorporate anything in this list.

Observed Structure

Stand-alone functions

  1. checkbox_choices
  2. regex_named_captures
  3. redcap_column_sanitize
  4. replace_nas_with_explicit

The API Calls

This is the code I used to plot out how the existing API calls interact with each other. The blue boxes are my proposals for new functions. A child node indicates a function that is called within its parent.

    library(HydeNet)

    redcap <- 
      HydeNetwork(
        ~ redcap_metadata_read  + 
          redcap_read + 
          redcap_write + 
          redcap_download_file_oneshot + 
          redcap_upload_file_oneshot + 
          sanitize_token | redcap_metadata_read * 
                           redcap_read * 
                           redcap_write * 
                           redcap_download_file_oneshot * 
                           redcap_upload_file_oneshot +
          api_call | redcap_metadata_read * 
                           redcap_read * 
                           redcap_write * 
                           redcap_download_file_oneshot * 
                           redcap_upload_file_oneshot + 
          redcap_read_oneshot | redcap_read + 
          redcap_read_batch | redcap_read + 
          create_batch_glossary | redcap_read_batch *
                                  redcap_write_batch + 
          format_redcap_data | redcap_read + 
          redcap_write_oneshot | redcap_write +
          redcap_write_batch | redcap_write
      ) %>%
      setDecisionNodes("sanitize_token",
                       "format_redcap_data",
                       "api_call",
                       "redcap_read_batch",
                       "redcap_write_batch")

    plot(redcap)

redcap map

*Less complex Systems *

  1. populate_project_simple calls redcap_project
  2. retrieve_token_mssql calls retrieve_credential_mssql
  3. retrieve_credential_local
  4. validate_no_logical and validate_no_uppercase call
    validate_for_write

New functions

sanitize_token

As of 6.9.5,

The API is now more strict with regard to the validation of API tokens
sent in API requests. In previous versions, if the token was longer
than 32 characters, it would truncate the token to 32 characters
(which is the expected length). It no longer truncates the token if
longer than expected but merely returns an error message."

It wouldn't be hard to write a function to sanitize the token. It would
look something like:

sanitize_token <- function(token){
  token <- trimws(token)
  token <- substr(token, 1, 32)
  token
}

This has the same effect as the code introduce through Issue #103
(#103), and has the added
benefit of providing consistency of tokens into earlier version of
REDCap.

api_call

It might seem silly, but if the httr interface ever undergoes a
change, or if there's ever a decision to change the package making the
web call, putting all of the POST calls into one function means only
have to change that one function, instead of finding all of the calls in
the package.

redcap_read_batch

Would it simplify the interface for the user to have redcap_read be
the driver for exporting data, and to make it a wrapper for
redcap_read_oneshot and a new function redcap_read_batch?
redcapAPI::exportRecords uses batch.size = -1 to indicate reading in
one shot. Something similar could be done to direct traffic between the
subfunctions without affecting backward compatibility (though
redcap_read_oneshot would need to continue to be exported.)

redcap_write_batch

Similar concept to redcap_read_batch. Just make it one function to call for the user.

New Features

  1. Data formatting: The meta data provide enough information to convert
    fields to factors, Date, or POSIXct objects, as appropriate. I
    propose a logical format_data argument to redcap_read that
    formats the data (similar to how redcapAPI formats) when TRUE. A
    new function format_data would be the engine to do this work.
  2. Allow the meta data to be passed as an object to redcap_read and
    redcap_write. redcapAPI has a meta_data = NULL argument that
    allows the user to pass in the meta data data.frame if it exists. If
    given, there is no need to download the meta data again. If NULL,
    the meta data is downloaded. Not strictly necessary, but can save a
    bit of time. I'd consider this pretty low priority, because it is a
    feature that can be added without affect back compatibility.
  3. New methods for exporting other data objects, such as the events
    table, the arms table, the mappings table. I don't know that these
    get used much, but they can be handy for formatting event names
    and such.
  4. Provide internal validations prior to writing data. Honestly, I have
    mixed feelings about this. I put a lot of work into doing this in
    redcapAPI, but the REDCap validations return some pretty good
    messages as well.

Code Improvements

Sections of code like

events_collapsed <- ifelse(is.null(events), "", paste0(events, collapse=","))

can be optimized with

events_collapsed <- if (is.null(events)) "" else paste0(events, collapse = ",")

It isn't a big deal, (see benchmarking below, but as a package matures,
I think it's a good thing to incorporate some efficiency gains)

library(microbenchmark)
library(magrittr)
records <- NULL
microbenchmark(
  ifelse = ifelse(is.null(records), "", paste0(records, collapse=",")),
  strict_if = if (is.null(records)) "" else paste0(records, collapse = "")
)

## Warning in microbenchmark(ifelse = ifelse(is.null(records), "",
## paste0(records, : Could not measure a positive execution time for 6
## evaluations.

## Unit: nanoseconds
##       expr  min   lq    mean median   uq   max neval cld
##     ifelse 1155 1540 1991.12   1541 1926 11934   100   b
##  strict_if    0    1  397.44      1    1 26948   100  a

Other Notes

  • Since we're using RODBCext to manage credentials, it wouldn't be
    hard to write up instructions for acessing credentials from any
    SQL database. Users would just need to know their driver. We could
    probably make an adaptation that pulls from a CSV using the
    sqldf package. It would make a consistent style of management that
    would encourage best practices, even for users who don't have the
    benefit of SQL databases.
  • We should do some benchmarking to see if an non-sparse eav of a
    large data file is faster than the csv export. How large does a
    database have to be with the eav format before the server times
    out? If we find the eav to be faster, perhaps we could up the
    batch number from 100 to something larger (fewer calls to the API
    would mean less wait time for the user, but I suspect you would want
    to continue batching to make the server available to other users.
    Upping the default batch number seems like a good compromise.)

I should also add that at one point I had compiled a list of the API changes announced on the group. I was making notes on what changes I might incorporate into redcapAPI and how those changes could affect the R package. The link to the spreadsheet is
https://docs.google.com/spreadsheets/d/1NMdpb-k5nvrVxF0gvnpIfQyP4vZIpQgUxoLUKJuv2L0/edit?usp=sharing

This was referenced Mar 21, 2017
@wibeasley
Copy link
Member

@nutterb, sorry for not responding to your thoughtful post until now. I've read it several times since September. In short, I like your ideas and think they're all worth discussing. Some of these are large/mature/encapsulated enough they're spun off into entire issues.

Some specific reactions.

New Functions

  1. redcap_read_batch() & redcap_write_batch(). See Read EAV - w/ batching #145. If I use your approach of batching within the function, your advice makes even more sense.

  2. api_call my thought is that would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like redcap_read() encounters the error (than if api_call()).

    But I certainly agree that there's a lot of duplication across functions like redcap-metadata-read.R() and redcap-read.R(). I'm mildly convinced that I'm on the right side of the tradeoffs, but certainly could be persuaded otherwise.

  3. I'm phasing out retrieve_token_zzz() functions with retrieve_credential_zzz calls, so that removes a little complexity, and I think becomes aligned with your nested advice. The credential (as opposed to token) retrieval is more general and should work in a wider variety of scenarios.

  4. I like your sanitize_token(). See new issue sanitize_token() #144.

New Features

  1. Data formatting & additional metadata: Where would this metadata be stored?

    • In REDCap annotation field (ie, the big dumb box)?
    • Passed by the programmer, similar to readr's col()'
    • Passed by the programmer, reusing readr's col()' (and then using readr's text-to-tbl functionality?)

    Spin this off into a new issue if you'd like.

  2. Other API calls, like events. Sounds good.

  3. internal validations before writing from the client to REDCap? Can you point to examples?

Code Improvements

As long as the test suite has appropriate coverage of these areas, and doesn't break, I'm all for it. I had no idea ifelse() was so inferior to the approach you benchmarked. Any idea if dplyr::if_else() is as slow? I'll check it out if you haven't.

I think the lowest hanging fruit could still be associated with structuring network calls so they're less chatty & wasteful. But I don't have any ideas beyond EAV; and even that hits a limit and needs batching.

Other Notes

I agree with your two points. So far the EAV playground has been beating the flat export almost every time, even with non-sparse data. But I'd love to know more and describe the performance envelope to users.

Thanks again for all the thought you put into this, @nutterb.

@nutterb
Copy link
Contributor Author

nutterb commented Apr 1, 2017

api_call my thought is that would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like redcap_read() encounters the error (than if api_call()).

I think this is something that the checkmate package handles well. I've been using it extensively in another package. It uses an assertCollection object that can be passed between functions. The assertions can be reported from the main function and it will appear to the user that all of the errors are coming from the function they executed.

A crude, silly example is below. A less trivial example is the interaction between the sprinkle_bg function and the index_to_sprinkle

had no idea ifelse() was so inferior to the approach you benchmarked. Any idea if dplyr::if_else() is as slow? I'll check it out if you haven't.

The if and else approach is faster, but usually only advantageous when dealing with a single logical value. ifelse keeps copies of both the yes and no elements, does a bunch of error checking, and a number of other things that make it very useful for vectorized conditionals, but quite a bit slower. dplyr::if_else has the same limitations. Compare the results of the second code block at the bottom.

But as you mentioned, these kinds of optimizations are not going to make a big difference to the user. optimizing the calls to the API will help a lot, as will optimizing the construction of data sets from batches. This is just one of those things that I change when I see them.

passing error messages with checkmate

main_function <- function(x, y, z)
{
  coll <- checkmate::makeAssertCollection()
  
  checkmate::assert_numeric(x = x,
                            len = 1,
                            add = coll)
  
  x_add <- sub_function1(x, y, coll)
  
  res <- sub_function2(x_add, z, coll)
  
  checkmate::reportAssertions(coll)
  
  res
}


sub_function1 <- function(x, y, coll)
{
  
  checkmate::assert_numeric(x = y,
                            len = 1,
                            add = coll,
                            .var.name = "y")
  
  x + y
  
}


sub_function2 <- function(x, z, coll)
{
  checkmate::assert_character(x = z,
                              len = 1,
                              add = coll,
                              .var.name = "z")
  
  paste0(round(x, 2), " ", z)
}

main_function(1, 2, "units")

main_function(1, 2, c("km", "m"))

If else benchmarking

 records <- NULL
microbenchmark(
  ifelse = ifelse(is.null(records), "", paste0(records, collapse=",")),
  if_else = if_else(is.null(records), "", paste0(records, collapse=",")),
  strict_if = if (is.null(records)) "" else paste0(records, collapse = ",")
)

x <- factor(sample(letters[1:5], 10, replace = TRUE))
microbenchmark(
  base = ifelse(x %in% c("a", "b", "c"), x, factor(NA)),
  dplyr = if_else(x %in% c("a", "b", "c"), x, factor(NA))
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants