Cloud-based Data Science #223

Open
4 tasks
DingoEatingFuzz opened this issue Mar 17, 2019 · 4 comments
@DingoEatingFuzz
Contributor

For large datasets, improved collaboration, and link sharing, data science should be done ✨ In The Cloud ✨

Since most of our data science is done through the Python ecosystem, Jupyter Notebooks are the most obvious technology choice. R and RStudio come in as a close second.

Ideally, we self-host this so the notebooks run in the same data center as the data and get lower latency. With open tools, the data would have to be transferred over arbitrary distances and unknown network conditions.

Ideal solution

Other tools to look at

TODO

  • Learn more about SageMaker's pros and cons
  • Evaluate pricing (EC2 hourly costs, how many hours per month, storage costs, possible networking costs)
  • How does this play into S3 file storage and IAM user accounts?
  • Do users need AWS console access, or is this a separate thing entirely?
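
For the pricing item, a rough monthly estimate can be sketched by adding up the cost components listed above. This is only a minimal sketch: the function and field names are hypothetical, and every rate in the usage example is a made-up placeholder rather than a real AWS price, so actual numbers need to come from the current pricing pages.

```python
from dataclasses import dataclass


@dataclass
class NotebookUsage:
    """Hypothetical inputs for one notebook instance (all rates are placeholders)."""
    hourly_rate_usd: float            # instance price per hour
    hours_per_month: float            # hours the instance is actually running
    storage_gb: float                 # attached storage volume size
    storage_rate_usd_per_gb: float    # storage price per GB-month
    transfer_gb: float                # data moved over the network per month
    transfer_rate_usd_per_gb: float   # network price per GB transferred


def estimate_monthly_cost(usage: NotebookUsage) -> dict:
    """Break the monthly cost into compute, storage, and network parts."""
    compute = usage.hourly_rate_usd * usage.hours_per_month
    storage = usage.storage_gb * usage.storage_rate_usd_per_gb
    network = usage.transfer_gb * usage.transfer_rate_usd_per_gb
    return {
        "compute": compute,
        "storage": storage,
        "network": network,
        "total": compute + storage + network,
    }


if __name__ == "__main__":
    # Placeholder numbers only, chosen for illustration.
    example = NotebookUsage(
        hourly_rate_usd=0.25,
        hours_per_month=40,
        storage_gb=20,
        storage_rate_usd_per_gb=0.5,
        transfer_gb=8,
        transfer_rate_usd_per_gb=0.125,
    )
    for part, cost in estimate_monthly_cost(example).items():
        print(f"{part}: ${cost:.2f}")
```

Keeping the components separate makes it easier to see whether compute hours or storage dominate, which matters when deciding how long instances are allowed to stay running.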
@znmeb
Contributor

znmeb commented Mar 18, 2019

My only concern with SageMaker is that it seems to be geared towards a machine learning workflow / mindset. I think that's a great strategic goal (TensorFlow is eating the world), but I'm not sure how well that fits the tactical situation. It's definitely accessible from R / RStudio, so it wouldn't lock R programmers out.

@danieldn danieldn self-assigned this Mar 31, 2019
@danieldn

Holding off until the needs for SageMaker are clarified.

@karenng-civicsoftware

@danieldn @DingoEatingFuzz did we decide whether to use SageMaker for people needing cloud access?
The other cloud resource we recommended was Google Colaboratory notebooks, which are free.

@DingoEatingFuzz
Contributor Author

There is an open PR to introduce the SageMaker infrastructure: hackoregon/hackoregon-aws-infrastructure#61

This will make it easy for us to provision notebook instances, but I still want to be conservative about when we do that, since running instances can be costly.

If someone is working with private data or large datasets and cannot do the work locally, they should request a notebook instance.
