OSS 2014 Parking Lot

Please add any issues that you encounter here as parking-lot items that we should address during the course.

Day 18: August 07, 2014

Day 17: August 06, 2014

  • More discussion about why it is inappropriate to use algorithmic approaches to answer questions about what structures or causes a pattern. In particular, examples (ideally ecological ones) that illustrate the difference -- e.g., investigator A wants to know whether precipitation increases biodiversity; algorithmic approaches can tell you B and classical approaches can tell you C (except with an example that makes sense). A small illustrative sketch follows this item.
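
A minimal illustrative sketch of that contrast, using simulated data (the variable names precip, temp, and richness are made up for this example): the classical model returns an estimate and a test of the precipitation effect, while the algorithmic model returns predictions and variable importance but no direct answer to the structural/causal question.

library(randomForest)

set.seed(1)
n <- 200
precip   <- runif(n, 200, 1200)                    # mm of precipitation
temp     <- rnorm(n, 15, 3)                        # a second predictor
richness <- 5 + 0.01 * precip + rnorm(n, sd = 3)   # simulated positive precip effect
dat <- data.frame(richness, precip, temp)

## Classical approach: estimate and test the precipitation effect
fit_lm <- lm(richness ~ precip + temp, data = dat)
summary(fit_lm)     # slope, standard error, p-value for precip
confint(fit_lm)     # confidence interval for the effect size

## Algorithmic approach: predictive model with variable importance
fit_rf <- randomForest(richness ~ precip + temp, data = dat, importance = TRUE)
importance(fit_rf)  # ranks predictors by predictive contribution, but no slope or p-value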

  • Biological examples for spatial lab? A: I'm going to leave out part 1 (spatial simulation), as this is really a very general set of tools that you can use to simulate different kinds of data at different levels -- I'll focus on parts 2 (SAR/CAR/weight matrices) and 3 (geostatistical analyses). In general I would consider CAR/SAR (1) for larger-scale problems (e.g. data at geographic scales -- sampling units might be zip codes, counties, states/provinces, etc.); (2) for large data sets (upwards of 1000 observations), where computational efficiency gets more important; (3) for cases where space and spatial scale are nuisances rather than primary research questions. Geostatistical approaches would be the opposite (small-scale, medium-size data [below about 30-50 spatial sampling points I would probably give up on estimating spatial autocorrelation at all], interest in spatial scales). So I would be more likely to use CAR/SAR/neighbor weights for environmental analysis questions (diversity/land-use change across counties), and geostatistical questions for more 'ecological'/small-scale questions (I have been attempting to use them to understand the spatial scales of variables affecting establishment, i.e. the seed to seedling stage transition in pine trees).
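
A rough, hedged sketch of what the two families look like in code, using the meuse data set shipped with sp purely as a stand-in (not an ecological example). The variogram/fit.variogram pair is the basic geostatistical workflow in gstat; errorsarlm (in spdep at the time, later moved to spatialreg) is one SAR-type spatial error model. The starting values and the k = 4 neighbour rule are placeholders.

library(sp)
library(gstat)
library(spdep)

data(meuse)                    # plain data.frame with x/y coordinates
meuse_sp <- meuse
coordinates(meuse_sp) <- ~ x + y

## Geostatistical side: estimate and model the empirical variogram
v    <- variogram(log(zinc) ~ dist, meuse_sp)
vfit <- fit.variogram(v, vgm(psill = 0.5, model = "Sph", range = 900, nugget = 0.1))
plot(v, vfit)                  # how far does spatial autocorrelation extend?

## SAR side: neighbour weights plus a spatial error regression
nb <- knn2nb(knearneigh(as.matrix(meuse[, c("x", "y")]), k = 4))
lw <- nb2listw(nb, style = "W")
fit_sar <- errorsarlm(log(zinc) ~ dist, data = meuse, listw = lw)
summary(fit_sar)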

  • Dealing with large spatial datasets in R - the raster package and data.table package definitely help for dealing with large datasets, but there are a lot of GIS manipulations that can still be slow when working in R. Any suggestions on this? Or work with more dedicated GIS tools to do necessary manipulations and then extract/bring data into R for analyses? A: this is a very interesting question, but one that's still a little too vague to answer definitively (even if I knew all the answers). As you point out, there have been a lot of improvements in processing speed in R over the last few years, including most prominently the data.table, dplyr, and raster packages, as well as Rcpp for interfaces to C++ code. (The high-performance computing task view doesn't mention these packages, but has potentially useful information about out-of-memory and parallel computation tools.) I've had good luck with my limited use of rgeos via sp for things like polygon intersections ... It's also worth distinguishing between "I wish this operation were faster" and "I know this operation could be faster, because tool XXX does it faster". In the latter case, why? Better algorithms? Better implementation? Multithreaded/parallel implementations? Special-purpose tool with more restricted capabilities but better performance within that specialized scope? In the absence of more detailed information about particular tasks, I would encourage you to combine different tools within your workflow/pipeline. Except for the overhead of remembering multiple command sets/etc., this is how software ecosystems are supposed to work! This is easier with tools that play nicely as part of an ecosystem (good APIs): have you tried PostGIS? (Feel free to respond with slightly more examples/detail about the kinds of operations you'd like to accelerate ...)
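
One hedged illustration of the mixed-workflow idea: read a large point table with data.table, pull raster values with the raster package (which reads from disk rather than loading the whole grid into memory), and push heavier vector geoprocessing out to PostGIS or command-line GDAL/OGR, reading the result back into R afterwards. The file and column names below are placeholders.

library(data.table)
library(raster)

## Fast read of a large point table (placeholder file; assumed columns x, y, site_id)
pts <- fread("survey_points.csv")

## Extract raster values at those points; raster() reads lazily from disk
r <- raster("elevation.tif")
pts[, elev := extract(r, cbind(x, y))]

## For heavy vector overlays, one option is to do the geoprocessing in PostGIS
## or with GDAL/OGR on the command line, then read the result back, e.g.:
## result <- rgdal::readOGR("processed", "intersections")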

  • Free online course on Statistical Learning from Trevor Hastie et al.: http://online.stanford.edu/course/statistical-learning-winter-2014

  • Might be of interest regarding the differences between boosted regression trees and random forests: here (a short sketch fitting both appears after the references below).

  • A knowledgeable friend says: "The Strobl, et al. paper is not a quick read, in that it's long, but it's very digestible and discusses the broader class of these types of algorithms and their applications. The Prasad, et al. and Cutler papers may be closer to what you have in mind, less technical but a decent discussion."

    • Cutler, D. Richard, Thomas C. Edwards Jr., Karen H. Beard, Adele Cutler, Kyle T. Hess, Jacob Gibson, and Joshua J. Lawler. 2007. “Random Forests for Classification in Ecology.” Ecology 88 (11): 2783–92. http://www.jstor.org/stable/27651436.
    • Prasad, Anantha, Louis R. Iverson, and Andy Liaw. 2006. “Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction.” Ecosystems 9: 181–99. doi:10.1007/s10021-005-0054-1.
    • Strobl, Carolin, James Malley, and Gerhard Tutz. 2009. “An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests.” Psychological Methods 14 (4): 323–48. doi:10.1037/a0016973.
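
A minimal sketch fitting both a random forest and a boosted regression tree to the same simulated data, just to show the mechanics side by side; the tuning values (ntree, n.trees, shrinkage, interaction.depth) are placeholders rather than recommendations -- see the references above for guidance on choosing them.

library(randomForest)
library(gbm)

set.seed(2)
n <- 500
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
dat$y <- with(dat, 2 * x1 + sin(2 * pi * x2) + rnorm(n, sd = 0.3))

## Random forest: many deep trees averaged together; relatively few tuning knobs
fit_rf <- randomForest(y ~ ., data = dat, ntree = 500, importance = TRUE)

## Boosted regression trees: many shallow trees added sequentially;
## the learning rate (shrinkage) and tree depth matter much more
fit_brt <- gbm(y ~ ., data = dat, distribution = "gaussian",
               n.trees = 2000, interaction.depth = 3, shrinkage = 0.01,
               cv.folds = 5)
best_iter <- gbm.perf(fit_brt, method = "cv")   # choose number of trees by cross-validation

## Compare variable importance summaries from the two methods
importance(fit_rf)
summary(fit_brt, n.trees = best_iter)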

Day 16: August 05, 2014

Day 15: August 04, 2014

Is there any reason to choose between these two ways of structuring the code for the permutation? The difference is between using transform() on the data frame and just calling sample() on the response vector. I think this gets at "good" coding practice.

permfun <- function(myvar) {
  mysamp <- sample(myvar)
  myfit  <- lm(mysamp ~ tdat$cYear)
  myt    <- coef(summary(myfit))
  return(myt)
}

set.seed(101)
permfun(tdat$GS.NEE)

permfun2 <- function(mydata) {
  sim.data <- transform(mydata, response = sample(mydata$GS.NEE))
  myfit    <- lm(sim.data$response ~ sim.data$cYear)
  myt      <- coef(summary(myfit))
  return(myt)
}
set.seed(101)
permfun2(tdat)

A: it's mostly stylistic, but there are a few issues here:

  • it's generally best not to use $ inside formulas, i.e. you should generally use y ~ x, data = mydata rather than data$y ~ data$x. Many modeling methods, like predict(), rely on the separation of the data from the formula. Similarly, it's best to draw all of the information in a formula from within the same data set -- that makes it easier to make sure that all the data are aligned properly (e.g. that you haven't discarded NAs or somehow subsetted one but not the other of the variables). See the short sketch after this list.
  • the choice between data$y <- sample(data$y) and data <- transform(data,y=sample(y)) is really just stylistic.
  • old-school R coders don't generally like to use return() explicitly, but this is really just a religious issue.
    • MBJ comment: It may be debated like a religious issue, but it shouldn't be. Don't use implicit returns. R's convention of returning the value of the last evaluated expression is cryptic and prone to introducing errors, as it's unclear whether a function intends to have a return value or not. One and only one call to return() should be made to explicitly indicate what the proper return value is. Without that, it's quite easy to add a line of code to an R function and totally break its functionality by not realizing there was an implicit return value. It was a horrible design decision for R to even allow it.
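
A short sketch of the first point, using a hypothetical data frame d: keeping the formula symbolic and passing the data separately lets predict() work with new data, while $ inside the formula ties the fit to the original objects.

## Hypothetical data, only to illustrate the formula/data separation
d <- data.frame(x = 1:10, y = rnorm(10, mean = 1:10))

## Preferred: symbolic formula plus a data argument
fit1 <- lm(y ~ x, data = d)
predict(fit1, newdata = data.frame(x = 11:12))    # works as expected

## Discouraged: $ inside the formula
fit2 <- lm(d$y ~ d$x)
## predict(fit2, newdata = data.frame(x = 11:12)) # warns and ignores newdata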

Day 11: July 31, 2014

  • Dealing with pesky paths: many folks were discovering the joys of dealing with extremely long PATHS during Jeff's geospatial analysis presentations. E.g. we had: (to be continued)

  • Hi All, I recently put together a GUI-based tutorial for QGIS, for a workshop I taught. Following up on Jeff's workshop for open source GIS today, I wanted to share it here for folks who might be new to QGIS in general. I've uploaded the materials to a GitHub repository here. In the repository you'll find a .zip file with data I used, as well as the tutorial document in PDF, DOC, and MD formats [sorry the markdown is pretty poor - I just used a web-based tool to convert it from DOC].

    • The tutorial covers a pretty wide sampling of things, and you can probably get through in a couple of hours. I designed it to give folks a working knowledge of tools in QGIS, so I hope some of you find it helpful. Definitely let me know if you encounter any problems or have any comments/suggestions. I'll also note that at the QGIS Documentation page there are a number of other good resources including for scripting. -Mike

Day 10: July 30, 2014

  • Q. Working hypotheses into workflows

    • I noticed that some of our workflows began with data sources, discussed how we could mine the data, and then discussed what visuals/publications we would make. It seems like better "science" would be to define hypothesis, then find the right data, then mine it. Like a message box + workflow. Do you have a recommendation or best practice suggestion for this?
    • A: Post-hoc analysis can definitely be problematic, especially if you run lots of tests. In any large data set, running lots of tests all but guarantees that some will come back significant even when the pattern arose purely by chance. But post-hoc analysis is also feasible so long as you account for the statistical problems associated with running multiple tests (e.g., using a Bonferroni correction; see the short p.adjust sketch after this list). There is a well-developed literature on this subject. In addition, both machine learning and data mining techniques have made large advances, and are demonstrably useful in detecting real patterns in large data sets. The data mining community clearly has a different take on post-hoc analysis than the experiment-oriented ecological community. This is a worthwhile area for further discussion in groups.
    • followup (Ben B.): "data snooping" is indeed a big problem. There's a bit of a balance, though; you also have to let your data talk to you (e.g. if you find problems with your originally proposed model, you do need to change it ...).
      • Keep the dangers of data snooping in mind.
      • Try to be aware of whether you are being exploratory or confirmatory at any given point in your analysis.
      • One good strategy is writing down a short statement of your intended analysis before you start to look at your data.
      • Multiple-comparisons corrections can help, but if you're doing informal post hoc analysis it's very hard to quantify which comparisons you're actually making. Andrew Gelman calls this "the garden of forking paths".
      • An interesting article: Wickham et al. (2010) “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79. doi:10.1109/TVCG.2010.161.
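
A minimal sketch of the multiple-testing point, using base R's p.adjust; the twenty simulated "tests" below contain no real effect, so roughly one of them will dip below 0.05 by chance alone.

set.seed(3)
pvals <- replicate(20, {
  x <- rnorm(30)
  y <- rnorm(30)
  t.test(x, y)$p.value
})

sum(pvals < 0.05)                        # "significant" results from pure noise

## Adjusting for multiple comparisons (all in base R's stats package)
p.adjust(pvals, method = "bonferroni")   # conservative
p.adjust(pvals, method = "holm")         # uniformly better than plain Bonferroni
p.adjust(pvals, method = "BH")           # controls the false discovery rate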

Day 09: July 29, 2014

  • Testing frameworks for R

    • During discussion, I mentioned some good testing tools for R. In particular, I was referring to the testthat package, a companion to the devtools package, and Travis CI, a continuous integration platform that integrates well with GitHub and R. To see a nice test framework in action, see the Travis CI builds page for the rOpenSci rgbif package.
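
A minimal, made-up example of what a testthat test looks like (the function and the expectations are invented for illustration). In a package, test files conventionally live under tests/testthat/ and are run by devtools::test(); with a .travis.yml in the repository, Travis CI runs them on every push.

library(testthat)

## A made-up function to test
celsius_to_kelvin <- function(temp) {
  if (any(temp < -273.15)) stop("temperature below absolute zero")
  temp + 273.15
}

test_that("celsius_to_kelvin behaves sensibly", {
  expect_equal(celsius_to_kelvin(0), 273.15)
  expect_equal(celsius_to_kelvin(c(0, 100)), c(273.15, 373.15))
  expect_error(celsius_to_kelvin(-300), "absolute zero")
})
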
  • What is a good way to save and organize lots (100+) of output files generated from running code in R (e.g. from multiple runs of model fits) for easy access in the future?

    • A: Good question. No silver bullets. But that said, you'd be well served by creating a very formal process for naming your outputs, serializing them in a standard way (e.g., in a well-named directory for each scenario), and archiving those consistently. If you can create metadata to attach to each file that provides the details of each run (which parameters change, which versions of code were run, etc.) and be sure that is attached to the output as well, then you'll have a strong basis both for perusing the outputs, and for using scripts that can ingest and process the outputs later when you need to. Let's discuss this more during feedback.
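
One hedged sketch of such a scheme (the directory layout, parameter names, and stand-in 'fit' object are all hypothetical): a descriptive file name per run, saved with saveRDS into a per-scenario directory, plus a growing runs.csv that records the parameters and R version behind each file.

## Hypothetical helper: one directory per scenario, self-describing file names,
## and a metadata table recording what produced each output
run_and_save <- function(scenario, growth_rate, n_iter, outdir = "output") {
  dir.create(file.path(outdir, scenario), recursive = TRUE, showWarnings = FALSE)

  fit <- list(growth_rate = growth_rate, n_iter = n_iter,
              result = rnorm(n_iter))            # stand-in for a real model fit

  fname <- file.path(outdir, scenario,
                     sprintf("fit_r%.2f_n%d_%s.rds", growth_rate, n_iter,
                             format(Sys.time(), "%Y%m%d-%H%M%S")))
  saveRDS(fit, fname)

  ## Append one metadata row per run (parameters, output file, R version)
  meta_file <- file.path(outdir, "runs.csv")
  meta <- data.frame(file = fname, scenario = scenario,
                     growth_rate = growth_rate, n_iter = n_iter,
                     r_version = R.version.string, time = Sys.time())
  write.table(meta, meta_file, sep = ",", row.names = FALSE,
              col.names = !file.exists(meta_file), append = file.exists(meta_file))
  invisible(fname)
}

run_and_save("baseline", growth_rate = 0.10, n_iter = 1000)
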
  • Q: Error 43 with KnitPDF in RMarkdown...

    • A: I installed the texlive and texinfo packages on isis and now the check() command completes properly without error.
    • A: Courtesy of Karthik. To properly link Knit PDF you need to install MiKTeX before installing RStudio. So the fix was to install MiKTeX and then reinstall RStudio.

Day 08: July 28, 2014

Day 05: July 25, 2014

  • DONE Q: Would it be possible to vote ahead of time on which topics are most interesting? That way we could prioritize what gets dropped and what's saved if we run out of time? Additionally, prioritize topics that can't really be "written up" so that we can play catch up by reading tutorials on our own time?
    • A: Yes, to some extent, although some are easier to move around than others. I think each instructor will need to decide priorities, possibly with input from the participants when they have that flexibility. Let's discuss this Monday.

Day 04: July 24, 2014

  • If anyone on Mac OSX couldn't get Sublime to serve as their Git editor or open from the command line, here's a pretty clean way to do it -- Sublime install on Mac OSX

  • If anyone on Windows couldn't get Sublime to open from the command line, this finally worked for me. Just had to delete the "/cygdrive" part of the path (N.B. from sysadmins: be cautious about deleting "/cygdrive" from your path-- unless you know you aren't relying on any Cygwin commands, and haven't created scripts that reference paths based on a "/cygdrive" root): http://stackoverflow.com/questions/9440639/sublime-text-from-command-line-win7

  • Q: Are there some additional implications from the above tips about getting Sublime to work?

    • A: Yes. Two.

    • First-- the stackoverflow web site referenced above is a fantastic, crowd-sourced resource for resolving all sorts of technical questions. It involves a voting mechanism to bring the best answers to the top of the page. If you get a 'stackoverflow' reference in your google search results, it is usually worth checking out. Highly recommended.

    • Second-- the tips about installing Sublime reinforce the importance of knowing something about ENVIRONMENT VARIABLES. Whenever you log in to a system (Win, Mac, Linux or other), you work within an ENVIRONMENT where many ENVIRONMENT VARIABLES are defined. Things like your PATH (which determines the directories, and the order in which they are searched, when the system looks for commands), what SHELL you are using (bash, csh, tcsh, ksh, sh, etc.), what USER account you are logged in to, what your HOME directory is, what TERM-inal you are using, etc.-- are all defined in your ENVIRONMENT. The ENVIRONMENT is defined in several files, but most prominently for us, by your ~/.bashrc or ~/.bash_profile or ~/.profile (etc.). Details of what these files are called, where they are located, and the order in which they are executed can vary slightly by operating system. Learn more about ENVIRONMENT VARIABLES
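
You can also inspect (and, for the current session, set) environment variables from within R, which can help when diagnosing the PATH issues above; the variable name in the last line below is hypothetical.

## Inspecting environment variables from within R
Sys.getenv("HOME")
Sys.getenv("SHELL")

## The PATH, split into the individual directories searched for commands
strsplit(Sys.getenv("PATH"), .Platform$path.sep)[[1]]

## Set a variable for the current R session only (does not edit ~/.bashrc)
Sys.setenv(MY_PROJECT_DIR = "~/projects/oss2014")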

  • Q: Can group projects get access to compute servers for project work?

  • Q: Are there alternatives to GitBash for Windows (that have a fuller set of *nix commands)?

    • A: Yes. Perhaps Cygwin is the best known of these, and would almost certainly cover all the *nix commands we've mentioned so far. GitBash seems to be lacking man and apropos, at least. However, GitBash was chosen for this course because it is simpler to install, less intrusive on your operating system, and serves the need for running Git from a bash shell command line in Windows. We will certainly consider trying to support a more fully *nix-capable emulator for Windows in the future.

Day 03: July 23, 2014

  • DONE grep "10\?" paleo*s.txt (works- to grep out any lines that have 10 followed by some other character-- why double-quotes, and escape on the "?")

    • The double quotes keep the bash shell from treating the ? as a filename wildcard (glob) -- the shell doesn't interpret regular expressions at all, so quoting (or escaping) is just about getting the pattern to grep as is. Note also that in grep's basic regular expressions an escaped \? means "zero or one of the preceding character" (a GNU extension), so "10\?" matches a 1 followed by an optional 0; to match 10 followed by some other character, the pattern would be "10." instead.
  • DONE Provide example .bashrc for customizing shell

# .bashrc file for interactive bash(1) shells.  
PS1='\[\e[31;1m\]\h\[\e[0m\]:\[\e[34m\]\w\[\e[0m\] \u\$ '  
#BASH HISTORY  
HISTFILESIZE=2000;  
#ALIASES  
alias sl='ls -G'    
alias ls='ls -G'
# Alternate example; Make ls use colors
export CLICOLOR=1;
export LSCOLORS=exfxcxdxbxegedabagacad;
# Git prompt customization
source ~/.git-prompt.sh
GIT_PS1_SHOWDIRTYSTATE=true
GIT_PS1_SHOWUNTRACKEDFILES=true
#GIT_PS1_SHOWSTASHSTATE=true
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;33m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[0;32m\]$(__git_ps1 " (%s)")\[\033[00m\]\$ '
  • Note the inclusion of the .git-prompt.sh script, which modifies the prompt to display current git status information of the directory one is in. You'll need to download and save .git-prompt.sh from the web and save it in your home directory.
  • The bash customizations go in your home directory in the '.bashrc' file or the '.bash_profile' file. If '.bashrc' doesn't exist, you may need to create it.

Day 02: July 22, 2014

  1. DONE provide .bashrc (or whatever) syntax to double check whether you'd like to rm something

    • To force rm to always prompt, you can alias it to always use the 'interactive' mode.
    • In bash the alias syntax requires an equals sign with no spaces around it: alias rm='/bin/rm -i' (the form alias rm 'rm -i', without the equals sign, is csh/tcsh syntax, not bash).
    • Note also that it is typically good to point to the full path of the command you are aliasing -- hence '/bin/rm' instead of just 'rm'.
  2. NOTED tldr might be a good resource for learning new commands; it still needs details on how people should install it on their computers.

  3. NOTED Vim adventure is the best, most addictive, way to learn Vim!

  4. DONE What is the '@' symbol following the permissions in a long listing 'ls -l'?

    • In that context, the '@' symbol indicates the file has extended attributes. This is just additional metadata that is available about the file, and is filesystem-specific.

Day 01: July 21, 2014

  1. DONE M Jones: Explain how to use the parking lot for those without Github accounts

  2. DONE See Twitter Handles @BenjaminHCCarr @MirelaGTulbure
    @biogeocycle
    @diegosotomayor
    @RenwickKatie
    @TimAssal
    @SparkleLM85
    @georginaladams
    @_Sara_Varela
    @ajpelu
    @deboradrucker
    @TheSemmensLab
    @megankjennings
    @sweetlynnc
    @ae_schmidty
    @marissaruthlee @se_hampton @rayi @miketreglia @oliviarata @kellygarbach @emcfuller @RachaelOrben