Commit ed21565

Adds scaffolding for user contributions (#350)
We want to enable users to contribute Hamilton dataflows. Data is the oil; code isn't — so we think we can help people get started with Hamilton quickly by enabling community contributions of off-the-shelf dataflows. The design premise is to:

1. Have a separate package called `sf-hamilton-contrib` that houses these dataflows.
2. Autogenerate Docusaurus docs based on that module and expose them at hub.dagworks.io.
3. Create GitHub workflows (and adjust existing ones) to minimize CI usage.
4. Accept that this isn't 100% complete, but it's good enough to be merged into main.

The main thing to note right now is that the pip install has to be updated with each new sf-hamilton-contrib release until it's out of the RC stage. We're also working on a template to ensure things are standardized between dataflows. One major TODO is to figure out the unit-testing story.

---- Squashed commits -- lots of them:

* Removes files from hub docs that aren't needed — to avoid confusion, this removes a few files that get overwritten when docs are generated. Adds to the changelog. (+9 squashed commits)
  * [9b5262b] Changes prints to logging statements
  * [c6e338b] Fixes off-by-one
  * [29ef702] Changes commits to be served via static
  * [0aa1a2d] Fix no-official case
  * [6c84b9b] Adds sitemap and upgrades Docusaurus
  * [1d6bd3e] Adds Google Analytics to hub, so that we can understand usage
  * [6b3a379] Adds telemetry; fixes a few things too (this commit should be squashed at some point)
  * [b7e7f7e] Enables gh-pages to be run off of main — required so things only run if they are merged to main
  * [74a8eee] Adds scaffolding for user contributions (same description as above); cleaning up code a bit (+41 squashed commits)
    * [5aaf10b] Changing base to see if it works with custom domain
    * [0ce0f8d] Fixes build
    * [a6e828e] Revert "Reverting sidebar" (this reverts commit d21065e)
    * [d21065e] Reverting sidebar
    * [2eff762] Fixing build
    * [9a5fbaf] Adds more to docs; adds more structure
    * [9867f5e] Adding more docs to contrib module
    * [fc35509] Adds templates for contrib — you can only have one default template, so I decided to link from our default to the specific one for now; depending on volume we should switch things around
    * [7dc35e2] Constrains build to contrib directory path
    * [2088037] WIP
    * [a2715f1] Fixes so that pre-commit is run for everything
    * [58dd21a] Adds more to contrib/README; should skip tests
    * [024364d] Fixes script to not error out on non-zero status
    * [e2c2e78] Adds start of README — this should stop tests early too
    * [2a32754] Updates CircleCI to skip contrib changes
    * [7fbd056] Test didn't work; GitHub manages things
    * [c7ffdb9] Testing path
    * [5c780c7] Adds tagging
    * [0aeb353] Reverting path back to what it was
    * [21f86b1] Changes project root
    * [8ba794c] Fixes workflow YAML
    * [8f2cf89] WIP adjust fetch depth
    * [953618a] Changes to include all commits
    * [cc56c44] Fix compile of docs path
    * [3e954fb] WIP fix Python build
    * [edfe6f6] Adds compile step
    * [c5dd405] Commit to test things
    * [b215379] Fixes path from moving package
    * [4826440] WIP
    * [d415a37] WIP packages — seems to work and gets added to the namespace appropriately without breaking anything else
    * [c1db95f] TODO: fix up package stuff
    * [0ffc1b9] WIP Docusaurus
    * [6d08fb7] WIP
    * [1977375] WIP commit
    * [828060d] WIP fix GitHub Action flow
    * [7e5c3ab] WIP commit
    * [df5fd77] WIP
    * [4eb1333] WIP
    * [721df5e] WIP
    * [9e918b3] WIP
    * [9511774] WIP
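To make the premise concrete, here is a minimal sketch of what the scaffolding enables for an end user, mirroring the README added in this commit (`NAME_OF_USER`, `NAME_OF_DATAFLOW`, and `FUNCTION_NAME` are placeholders, not real modules or functions):

```python
# Minimal consumption sketch after `pip install sf-hamilton-contrib`.
# NAME_OF_USER / NAME_OF_DATAFLOW / FUNCTION_NAME are placeholders from the README below.
from hamilton import driver
from hamilton.contrib.user.NAME_OF_USER import NAME_OF_DATAFLOW

dr = (
    driver.Builder()
    .with_config({})                 # configuration, if the dataflow needs any
    .with_modules(NAME_OF_DATAFLOW)  # the contributed module supplies the DAG
    .build()
)
result = dr.execute([NAME_OF_DATAFLOW.FUNCTION_NAME], inputs={})  # returns a dictionary
```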
1 parent 05b06be commit ed21565


73 files changed: +15,938 additions, −3 deletions

.circleci/config.yml

Lines changed: 63 additions & 0 deletions
@@ -1,5 +1,26 @@
 version: 2.1
 jobs:
+  check_for_changes:
+    docker:
+      - image: circleci/python:3.10
+    steps:
+      - checkout
+      - run:
+          name: Check for changes in specific paths
+          command: |
+            set +e
+            git diff --name-only HEAD^ HEAD | grep '^.ci\|^.circleci\|^graph_adapter_tests\|^hamilton\|^plugin_tests\|^tests\|^requirements\|setup' > /dev/null
+            if [ $? -eq 0 ]; then
+              echo "Changes found in target paths."
+              echo 'true' > /tmp/changes_detected
+            else
+              echo "No changes found in target paths."
+              echo 'false' > /tmp/changes_detected
+            fi
+      - persist_to_workspace:
+          root: /tmp
+          paths:
+            - changes_detected
   test:
     parameters:
       python-version:
@@ -13,6 +34,15 @@ jobs:
       CI: true
     steps:
       - checkout
+      - attach_workspace:
+          at: /tmp
+      - run:
+          name: Check if changes were detected
+          command: |
+            if grep -q 'false' /tmp/changes_detected; then
+              echo "No changes detected, skipping job..."
+              circleci-agent step halt
+            fi
       - run:
           name: install dependencies
           command: .ci/setup.sh
@@ -22,19 +52,28 @@
 workflows:
   unit-test-workflow:
     jobs:
+      - check_for_changes
       - test:
+          requires:
+            - check_for_changes
          name: build-py37
          python-version: '3.7'
          task: tests
      - test:
+          requires:
+            - check_for_changes
          name: build-py38
          python-version: '3.8'
          task: tests
      - test:
+          requires:
+            - check_for_changes
          name: build-py39
          python-version: '3.9'
          task: tests
      - test:
+          requires:
+            - check_for_changes
          name: build-py310
          python-version: '3.10'
          task: tests
@@ -47,22 +86,32 @@ workflows:
          python-version: '3.9'
          task: pre-commit
      - test:
+          requires:
+            - check_for_changes
          name: dask-py39
          python-version: '3.9'
          task: dask
      - test:
+          requires:
+            - check_for_changes
          name: ray-py39
          python-version: '3.9'
          task: ray
      - test:
+          requires:
+            - check_for_changes
          name: spark-py39
          python-version: '3.9'
          task: pyspark
      - test:
+          requires:
+            - check_for_changes
          name: spark-py310
          python-version: '3.10'
          task: pyspark
      - test:
+          requires:
+            - check_for_changes
          name: spark-py311
          python-version: '3.11'
          task: pyspark
@@ -71,30 +120,44 @@
          python-version: '3.7'
          task: integrations
      - test:
+          requires:
+            - check_for_changes
          name: integrations-py38
          python-version: '3.8'
          task: integrations
      - test:
+          requires:
+            - check_for_changes
          name: integrations-py39
          python-version: '3.9'
          task: integrations
      - test:
+          requires:
+            - check_for_changes
          name: integrations-py310
          python-version: '3.10'
          task: integrations
      - test:
+          requires:
+            - check_for_changes
          name: integrations-py311
          python-version: '3.11'
          task: integrations
      - test:
+          requires:
+            - check_for_changes
          name: asyncio-py39
          python-version: '3.9'
          task: async
      - test:
+          requires:
+            - check_for_changes
          name: asyncio-py310
          python-version: '3.10'
          task: async
      - test:
+          requires:
+            - check_for_changes
          name: asyncio-py311
          python-version: '3.11'
          task: async
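The gate above writes a `changes_detected` flag to a workspace; downstream test jobs read it and halt early when it is `false`. For illustration only, here is a hypothetical local helper (not part of this commit) that approximately reproduces the same path check in Python:

```python
# Hypothetical local reproduction of the CircleCI "check_for_changes" gate above.
# It inspects the last commit and reports whether any watched path changed --
# in CI, a 'false' result makes the downstream test jobs halt early.
import re
import subprocess

# Approximates the grep pattern used in the job above.
WATCHED = re.compile(
    r"^(\.ci|\.circleci|graph_adapter_tests|hamilton|plugin_tests|tests|requirements|setup)"
)


def changes_detected() -> bool:
    changed_files = subprocess.run(
        ["git", "diff", "--name-only", "HEAD^", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.splitlines()
    return any(WATCHED.match(path) for path in changed_files)


if __name__ == "__main__":
    # Mirrors the 'true' / 'false' flag the job writes to /tmp/changes_detected.
    print("true" if changes_detected() else "false")
```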
.github/PULL_REQUEST_TEMPLATE/HAMILTON_CONTRIB_PR_TEMPLATE.md

Lines changed: 32 additions & 0 deletions

[Summary of contribution]

## For new dataflows:
Do you have the following?
- [ ] Added a directory mapping to my GitHub username in the contrib/hamilton/contrib/user directory.
- [ ] If my author name contains hyphens, I have replaced them with underscores.
- [ ] If my author name starts with a number, I have prefixed it with an underscore.
- [ ] If my author name is a Python reserved keyword, I have reached out to the maintainers for help.
- [ ] Added an author.md file under my username directory and filled it out.
- [ ] Added an __init__.py file under my username directory.
- [ ] Added a new folder for my dataflow under my username directory.
- [ ] Added a README.md file under my dataflow directory that follows the standard headings and is filled out.
- [ ] Added an __init__.py file under my dataflow directory that contains the Hamilton code.
- [ ] Added a requirements.txt under my dataflow directory that contains the required packages outside of Hamilton.
- [ ] Added a tags.json under my dataflow directory to curate my dataflow.
- [ ] Added a valid_configs.jsonl under my dataflow directory to specify the valid configurations.
- [ ] Added a dag.png that shows one possible configuration of my dataflow.

## For existing dataflows -- what has changed?

## How I tested this

## Notes

## Checklist

- [ ] PR has an informative and human-readable title (this will be pulled into the release notes)
- [ ] Changes are limited to a single goal (no scope creep)
- [ ] Code passed the pre-commit check & code is left cleaner/nicer than when first encountered
- [ ] Any _change_ in functionality is tested
- [ ] New functions are documented (with a description, list of inputs, and expected output)
- [ ] Dataflow documentation has been updated if adding/changing functionality

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+Looking to submit a Hamilton Dataflow to the sf-hamilton-contrib module? If so, go to the `Preview` tab and select the appropriate sub-template:
+* [sf-hamilton-contrib template](?expand=1&template=HAMILTON_CONTRIB_PR_TEMPLATE.md)
+
+Else remove this block.
+
+---
 [Short description explaining the high-level reason for the pull request]

 ## Changes

.github/workflows/docusaurus-gh-pages.yml

Lines changed: 23 additions & 3 deletions
@@ -2,9 +2,11 @@
 name: Deploy Docusaurus with GitHub Pages dependencies preinstalled

 on:
-  # Runs on pushes targeting the default branch
+  # Runs on pushes targeting the default branch & contrib subdirectory
   push:
-    branches: ["user_contrib"]
+    branches: ["user_contrib", "main"]
+    paths:
+      - 'contrib/**'

   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:
@@ -30,16 +32,34 @@ jobs:
     steps:
       - name: Checkout
         uses: actions/checkout@v3
+        with:
+          fetch-depth: 1000
       # 👇 Build steps
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"
+      - name: Install dependencies
+        run: |
+          pip install -e .
+      - name: Compile code to create pages
+        working-directory: contrib/docs
+        run: python compile_docs.py
       - name: Set up Node.js
         uses: actions/setup-node@v3
         with:
-          path: contrib/docs
           node-version: 16.x
           cache: yarn
+          # The action defaults to search for the dependency file
+          # (package-lock.json or yarn.lock) in the repository root, and uses
+          # its hash as a part of the cache key.
+          # https://github.com/actions/setup-node#caching-packages-dependencies
+          cache-dependency-path: "./contrib/docs/package-lock.json"
       - name: Install dependencies
+        working-directory: contrib/docs
         run: yarn install --frozen-lockfile --non-interactive
       - name: Build
+        working-directory: contrib/docs
         run: yarn build
       # 👆 Build steps
       - name: Setup Pages

contrib/MANIFEST.in

Lines changed: 3 additions & 0 deletions
include *.md
include *.txt
include *.jsonl

contrib/README.md

Lines changed: 111 additions & 0 deletions
# Off-the-shelf Hamilton Dataflows

Welcome!

Here you'll find a package that curates a collection of Hamilton Dataflows that are
ready to be used in your own projects. They are user-contributed and maintained, with
the goal of making it easier for you to get started with Hamilton.

We expect this collection to grow over time, so check back often! As dataflows mature, we
will move them into the official sub-package of this repository, where they will be maintained by the
Hamilton team.

## Usage
There are two ways to get access to dataflows in this package. For either approach,
the assumption is that you have the requisite Python dependencies installed on your system;
you'll get import errors if you don't. Don't know what you need? We have convenience functions to help!

For more extensive documentation, please see [Hamilton User Contrib documentation]().

### Static installation
This approach relies on you installing the package on your system. This is the recommended path for
production purposes, as you can version-lock your dependencies.

To install the package, run:

```bash
pip install sf-hamilton-contrib==0.0.1rc1
```

Once installed, you can import the dataflows as follows.

Things you need to know:
1. Whether it's a user or official dataflow. If it's a user dataflow, the name of the user.
2. The name of the dataflow.
```python
from hamilton import driver
# from hamilton.contrib.official import NAME_OF_DATAFLOW
from hamilton.contrib.user.NAME_OF_USER import NAME_OF_DATAFLOW

dr = (
    driver.Builder()
    .with_config({})  # replace with configuration as appropriate
    .with_modules(NAME_OF_DATAFLOW)
    .build()
)
# Execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
    [NAME_OF_DATAFLOW.FUNCTION_NAME, ...],  # this specifies what you want back
    inputs={...}  # pass in inputs as appropriate
)
```

### Dynamic installation
Here we dynamically download the dataflow from the internet and execute it. This is useful for quickly
iterating in a notebook and pulling in just the dataflow you need.

```python
from hamilton import dataflow, driver

# Downloads into ~/.hamilton/dataflows and loads the module -- WARNING: ensure you know what code you're importing!
# NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW")  # if using an official dataflow
NAME_OF_DATAFLOW = dataflow.import_module("NAME_OF_DATAFLOW", "NAME_OF_USER")
dr = (
    driver.Builder()
    .with_config({})  # replace with configuration as appropriate
    .with_modules(NAME_OF_DATAFLOW)
    .build()
)
# Execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
    [NAME_OF_DATAFLOW.FUNCTION_NAME, ...],  # this specifies what you want back
    inputs={...}  # pass in inputs as appropriate
)
```

## How to contribute

If you have a dataflow that you would like to share with the community, please submit a pull request
to this repository. We will review your dataflow, and if it meets our standards we will add it to the
package. To submit a pull request, please use [this link](TODO), as it will take you to the specific PR template.

### Dataflow standards
We want to ensure that the dataflows in this package are of high quality and are easy to use. To that end,
we have a set of standards that we expect all dataflows to meet. If you have any questions, please reach out.

Standards:
- The dataflow must be a valid Python module.
- It must not do anything malicious.
- It must be well documented.
- It must work.
- It must follow our standard structure as outlined below.


### Checklist for new dataflows:
Do you have the following?
- [ ] Added a directory mapping to my GitHub username in the contrib/hamilton/contrib/user directory.
- [ ] If my author name contains hyphens, I have replaced them with underscores.
- [ ] If my author name starts with a number, I have prefixed it with an underscore.
- [ ] If my author name is a Python reserved keyword, I have reached out to the maintainers for help.
- [ ] Added an author.md file under my username directory and filled it out.
- [ ] Added an __init__.py file under my username directory.
- [ ] Added a new folder for my dataflow under my username directory.
- [ ] Added a README.md file under my dataflow directory that follows the standard headings and is filled out.
- [ ] Added an __init__.py file under my dataflow directory that contains the Hamilton code.
- [ ] Added a requirements.txt under my dataflow directory that contains the required packages outside of Hamilton.
- [ ] Added a tags.json under my dataflow directory to curate my dataflow.
- [ ] Added a valid_configs.jsonl under my dataflow directory to specify the valid configurations.
- [ ] Added a dag.png that shows one possible configuration of my dataflow.

# Got questions?
Join our [slack](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) community to chat/ask Qs/etc.
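To make the required layout concrete, here is a hypothetical snippet (not part of sf-hamilton-contrib) that checks a dataflow directory against the per-dataflow files the checklist for new dataflows above expects; the example path is illustrative only:

```python
# Hypothetical checklist helper -- not part of sf-hamilton-contrib.
from pathlib import Path

REQUIRED_FILES = [
    "__init__.py",          # contains the Hamilton code
    "README.md",            # standard headings, filled out
    "requirements.txt",     # packages required outside of Hamilton
    "tags.json",            # curation tags
    "valid_configs.jsonl",  # valid configurations
    "dag.png",              # one possible configuration, visualized
]


def missing_files(dataflow_dir: str) -> list:
    """Returns the required files that are missing from a dataflow directory."""
    root = Path(dataflow_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]


if __name__ == "__main__":
    # Example path only, e.g. contrib/hamilton/contrib/user/my_user_name/my_dataflow
    print(missing_files("contrib/hamilton/contrib/user/my_user_name/my_dataflow"))
```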

contrib/docs/.gitignore

Lines changed: 20 additions & 0 deletions
# Dependencies
/node_modules

# Production
/build

# Generated files
.docusaurus
.cache-loader

# Misc
.DS_Store
.env.local
.env.development.local
.env.test.local
.env.production.local

npm-debug.log*
yarn-debug.log*
yarn-error.log*
