linking data to code used to obtain the data using git hashes #509

Open
gureckis opened this issue May 20, 2021 · 11 comments

Comments

@gureckis
Member

psiturk stores the codeversion variable in a column in the assignments_table_name table. However, if the experimenter is using git, a more natural target is the current git hash of the experiment's code base. This would tie the data collected from an experiment specifically to the code written to collect that data. I have implemented this in other projects using the python-git-info package, but there are several ways to pull this off.

I guess the issue would be that not everyone would use git to manage their experiment project folder, but anyone using Heroku now would. Also, it makes sense to add this feature and encourage it because the provenance of data is very important. Linking the data from a subject to the code used to run that subject is, well, the gold standard in replicability.

My specific proposal is a major version change, though, because it would add a new metadata column to the assignments_table_name table that tracks the git info for the experiment code base. I guess if no git repo is found or anything weird happens (see "the pack of snakes" here), then it should default to some unknown value.
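A minimal sketch of the lookup-with-fallback idea, assuming the `git` command-line tool is available and calling it through `subprocess` rather than through python-git-info. The function name `current_git_hash` and the `UNKNOWN` placeholder are hypothetical, not part of psiTurk.

```python
import subprocess

UNKNOWN = "unknown"  # hypothetical placeholder value, not defined by psiTurk


def current_git_hash(repo_dir="."):
    """Return the commit hash of HEAD in repo_dir, or UNKNOWN on any failure."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        # git not installed, directory missing, not a repository, or a timeout
        return UNKNOWN
    return result.stdout.strip() or UNKNOWN
```

The returned string could then be written to the proposed new column next to codeversion whenever an assignment row is created.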

@deargle
Collaborator

deargle commented May 20, 2021 via email

@jacob-lee
Collaborator

jacob-lee commented May 21, 2021 via email

@jmuchovej
Contributor

Could a hybrid of these two be used? (e.g., track the git sha and codeversion?)

While I learned (painfully) in my first use of psiTurk that codeversion matters, I think having the git sha would have been quite helpful since I could use it as a fallback.

Another alternative (albeit "very advanced" – read as: I highly doubt many will use this, myself included) could be to use the git tagging system, git tag <codeversion-equivalent> as a way to track versions independent of manual updates to config.txt.
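A rough sketch of the tag-based alternative, assuming the `git` command-line tool is available: if HEAD carries a tag, use it as the codeversion; otherwise fall back to a manually supplied value such as the one in config.txt. The function name `codeversion_from_tag` is hypothetical.

```python
import subprocess


def codeversion_from_tag(repo_dir=".", fallback="unknown"):
    """Return the tag that points exactly at HEAD, or fallback if there is none."""
    try:
        result = subprocess.run(
            # --exact-match makes git fail unless a tag sits directly on HEAD
            ["git", "describe", "--tags", "--exact-match"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        # git missing, not a repository, no tag on HEAD, or a timeout
        return fallback
    return result.stdout.strip() or fallback
```

After tagging a commit with `git tag <codeversion-equivalent>`, calling this from the tagged checkout would return that tag name.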

@jacob-lee
Collaborator

jacob-lee commented Mar 1, 2022 via email

@jmuchovej
Contributor

Hmm... this seems to be different from what I was referring to.

Are you saying that every codeversion should have an analog of conda/pip freeze? That would be cool, but intuitively psiTurk doesn't feel like "the right tool" for that. (e.g., Docker or Singularity seem better suited as "freezers".) Maybe psiTurk manages/enforces that?
Though... I could see how that could turn into an adoption/usage nightmare, given the primary users of psiTurk (grad students, per @deargle's mention in discussions).

I think doing a conda/pip freeze equivalent would be cool, but it seems beside the point that @gureckis was making?

Also, I'm curious what you mean by:

> The tasks should be importable as a python module, or something similar.

I think I do something similar to what you're referring to. I have experiments/{{ codeversion }}/task.js and the like (I actually store a config.txt for each, but have to manually symlink this myself). Getting this setup to work nicely required a few custom_code.route(...)s and passing the codeversion around to any $.ajax(...) calls. I think I only found it "easy" to debug because I had ascended the jQuery learning curve and have a pretty lengthy web background. I'm not sure it's safe to assume many users would have the requisite knowledge for this. 😕 (Let alone the time/patience to acquire it.)
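As an illustration of the per-codeversion layout, here is a small standard-library sketch of the path lookup a custom route might perform before serving a file. The helper name `task_file_for`, the allowed-label pattern, and the default arguments are assumptions for illustration; the actual routes in the setup described above are not shown in this thread.

```python
import re
from pathlib import Path

# Assumption: codeversions are simple labels such as "1.0" or "pilot-2".
_CODEVERSION_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def task_file_for(codeversion, root="experiments", filename="task.js"):
    """Build <root>/<codeversion>/<filename>, rejecting labels with path separators."""
    if not _CODEVERSION_RE.fullmatch(codeversion):
        raise ValueError(f"invalid codeversion: {codeversion!r}")
    return Path(root) / codeversion / filename
```

Validating the label first keeps a request such as `../secrets` from escaping the experiments folder.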

@jacob-lee
Collaborator

jacob-lee commented Mar 1, 2022 via email

@jmuchovej
Contributor

Ahh. I see now. Hmm... so, that seems like it would require psiTurk to be more opinionated than it currently is to achieve that. Much like how Ruby on Rails forces particular application structures.

That seems sensible, but also presents a bit of a problem with non-MTurk crowdsourcing (e.g., Prolific). I'm not sure that there's a way to prevent experiment deployment without a version update on any platform except for those with an API. 😕

Also, re: psiTurk versioning – that seems like something a pip freeze, or better yet a conda freeze, should do, no? (I prefer conda here explicitly because it's a dependency manager, unlike pip; well, older versions of pip aren't dependency managers.) I know that psiTurk doesn't have a conda package of late, but I've been looking into getting that back up and running.

I guess I'm at a loss why a psiTurk version would/should change over the course of a research paper/project, if the researcher is doing proper dependency management. (Lol, big assumption here, but I think it makes sense, unless psiTurk runs into a critical bug, but that shouldn't be a patch update, I think?)

I definitely understand specifying psiTurk versions to encourage reproducibility, but I'm not sure it makes sense to do snapshots at a task level (strictly on the basis that tasks are part of projects). Allowing for different psiTurk versions across tasks seems redundant and not in a good/useful way. (Though I'm still fresh to the land of psiTurk, so I probably haven't run into the scenario(s) to justify this.)
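For the narrower goal of recording which psiTurk version collected the data, a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). This is a much lighter stand-in for a full pip/conda freeze, and the helper name `snapshot_versions` is hypothetical.

```python
from importlib import metadata


def snapshot_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Calling `snapshot_versions(["psiturk"])` at server start and storing the result with the project would give one record per project rather than per task.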

@deargle
Collaborator

deargle commented Mar 2, 2022

> I guess I'm at a loss why a psiTurk version would/should change over the course of a research paper/project

My psiturk version often changes, usually because I'm developing and releasing new features to make data collection easier. Like dashboard stuff. In hindsight, it might have been better to make the dashboard a psiturk plugin.

> unless psiTurk runs into a critical bug, but that shouldn't be a patch update, I think?

Huge mistakes have been made with semver (by me!). Won't happen again, fingers crossed.

> specifying psiTurk versions to encourage reproducibility

Backwards-compatibility to run old psiturk tasks has gotten somewhat easier with later releases, but there's still the problem of later versions adding the mode column to the database. psiTurk does not gracefully handle that column missing. Although I did write a psiturk.db.migrate_db function, callable via the cli, which could in theory add that column, if missing.
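To illustrate the "add the column if missing" idea, here is a generic SQLite sketch. It is not the actual psiturk.db.migrate_db implementation; the table name `assignments` and the `TEXT` column type are assumptions, and only the `mode` column name comes from the discussion above.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl="TEXT"):
    """Add column to table unless it already exists; return True if it was added.

    table and column are interpolated into SQL, so they must be trusted names.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    if column in existing:
        return False
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" {decl}')
    conn.commit()
    return True


# Usage: an old-style table without the mode column (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assignments (uniqueid TEXT PRIMARY KEY, codeversion TEXT)")
added_first = add_column_if_missing(conn, "assignments", "mode")
added_again = add_column_if_missing(conn, "assignments", "mode")
```

Because the check runs before the `ALTER TABLE`, the migration is safe to run more than once.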

@jacob-lee
Collaborator

jacob-lee commented Mar 2, 2022

Sometimes studies last years. Sometimes we come back and want to run the same experiment from five years ago, with a few tweaks. In the meantime, the world has changed: it has switched from Python 2 to 3, security bugs have been found and patched, etc. In some cases the old version of psiturk won't even work any more (e.g. because it assumed the existence of the psiturk server, or because of outdated SSL libraries). So you need to, and should, use updated versions of psiturk. But that's not always easy (e.g. the mode column issue Dave spoke about). You end up having to update task code (and sometimes you end up messing something up along the way). The cleaner the separation between what a user has to develop and psiturk itself, the easier that is to manage.

In general, again, I'd like a much cleaner separation between the psiturk application and the tasks people develop, with a standard way of importing them in. That ends up not being that opinionated, really; being able to develop your own routes, javascript, etc. already affords a tremendous degree of freedom.

@jacob-lee
Collaborator

ad.html is (or was) a good example of the problem. Every study has to modify the template, but it contains javascript code necessary for psiturk to function.

@deargle
Collaborator

deargle commented Mar 2, 2022

Oh my heck, ad.html, which we had to change the filename of (to pub.html) and css classes within (from .ad to .not-an-ad) so that it wouldn't trigger ad blockers.

edit: although strictly speaking, ad.html is only necessary if running on mturk. Not a hard requirement for psiturk if running in lab (anything besides 'live' or 'sandbox') modes.


4 participants