Commit 3060446

Merge pull request #3554 from cal-itp/curriculum_docs_update
Edit and Update: Enhancements to Cal-ITP Data Services Documentation
2 parents 0de0734 + cc13683

File tree

10 files changed: +100 additions, −39 deletions
docs/analytics_onboarding/overview.md

Lines changed: 6 additions & 3 deletions
@@ -32,16 +32,19 @@
 **Python Libraries:**

-- [ ] **calitp-data-analysis** - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis))
-- [ ] **siuba** - Recommended data analysis library | ([Docs](siuba))
-- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) and [**here**](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) - A shared utilities library for the analytics team | ([Docs](shared-utils))
+- [ ] [**calitp-data-analysis**](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis) - Cal-ITP's internal Python library for analysis | ([Docs](calitp-data-analysis))
+- [ ] [**siuba**](https://siuba.org/) - Recommended data analysis library | ([Docs](siuba))
+- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) - A shared utilities library for the analytics team | ([Docs](shared-utils))

 **Caltrans Employee Resources:**

+- [ ] [**Organizational Chart**](https://pmp.onramp.dot.ca.gov/organizational-chart) - Data and Digital Services Organizational Chart
 - [ ] [**OnRamp**](https://onramp.dot.ca.gov/) - Caltrans employee intranet
 - [ ] [**Service Now (SNOW)**](https://cdotprod.service-now.com/sp) - Caltrans IT Service Management Portal for IT issues and requesting specific software
 - [ ] [**Cal Employee Connect**](https://connect.sco.ca.gov/) - State Controller's Office site for paystubs and tax information
 - [ ] [**Geospatial Enterprise Engagement Platform - GIS Account Request Form**](https://sv03tmcpo.ct.dot.ca.gov/portal/apps/sites/#/geep/pages/account-request) (optional) - User request form for ArcGIS Online and ArcGIS Portal accounts
+- [ ] [**Planning Handbook**](https://transportationplanning.onramp.dot.ca.gov/caltrans-transportation-planning-handbook) - Caltrans Transportation Planning Handbook
+- [ ] [**California Public Employees Retirement System**](https://www.calpers.ca.gov/) - System that manages pension and health benefits

 (get-help)=

docs/analytics_tools/jupyterhub.md

Lines changed: 25 additions & 8 deletions
@@ -14,14 +14,15 @@ Analyses on JupyterHub are accomplished using notebooks, which allow users to mi
 01. [Using JupyterHub](#using-jupyterhub)
 02. [Logging in to JupyterHub](#logging-in-to-jupyterhub)
-03. [Connecting to the Warehouse](#connecting-to-the-warehouse)
-04. [Increasing the Query Limit](#increasing-the-query-limit)
-05. [Increase the User Storage Limit](#increasing-the-storage-limit)
-06. [Querying with SQL in JupyterHub](querying-sql-jupyterhub)
-07. [Saving Code to Github](saving-code-jupyter)
-08. [Environment Variables](#environment-variables)
-09. [Jupyter Notebook Best Practices](notebook-shortcuts)
-10. [Developing warehouse models in Jupyter](jupyterhub-warehouse)
+03. [Default vs Power User](#default-user-vs-power-user)
+04. [Connecting to the Warehouse](#connecting-to-the-warehouse)
+05. [Increasing the Query Limit](#increasing-the-query-limit)
+06. [Increase the User Storage Limit](#increasing-the-storage-limit)
+07. [Querying with SQL in JupyterHub](querying-sql-jupyterhub)
+08. [Saving Code to Github](saving-code-jupyter)
+09. [Environment Variables](#environment-variables)
+10. [Jupyter Notebook Best Practices](notebook-shortcuts)
+11. [Developing warehouse models in Jupyter](jupyterhub-warehouse)

 (using-jupyterhub)=

@@ -39,6 +40,22 @@ JupyterHub currently lives at [notebooks.calitp.org](https://notebooks.calitp.or
 Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, ask in the `#services-team` channel in Slack.

+(default-user-vs-power-user)=
+
+### Default User vs Power User
+
+#### Default User
+
+The Default User profile is designed for general use and is ideal for less resource-intensive tasks. It's a good starting point for most users who don't expect to run very large, memory-hungry jobs.
+
+Because it requests less memory and can be scheduled on a smaller node, the Default User profile starts tasks quickly. However, if your task's memory usage grows over time, it may exceed the node's capacity, and the system may terminate your job. This makes the Default profile best for small to medium-sized tasks that don't require a lot of memory; workloads beyond those limits may become unstable or crash.
+
+#### Power User
+
+The Power User profile is intended for more demanding, memory-intensive tasks that need more resources upfront. It suits workloads with higher memory requirements or workloads expected to grow during execution.
+
+The Power User profile allocates a full node, or a significant portion of one, so your job has enough memory and computational power to avoid crashes or delays. The trade-off is a longer wait time while the system provisions a new node. Once it's ready, you'll have the resources necessary for memory-intensive work such as large datasets or simulations. This profile is ideal for jobs that would be unstable or crash on the Default profile, and it scales: if your task requires more resources than the initial node can provide, the system will automatically spin up additional nodes to meet the demand.

 (connecting-to-the-warehouse)=

 ### Connecting to the Warehouse

docs/analytics_tools/knowledge_sharing.md

Lines changed: 24 additions & 2 deletions
@@ -2,7 +2,7 @@
 # Helpful Links

-Here are some resources data analysts have collected and referenced, that will hopefully help you out in your work. Have something you want to share? Create a new markdown file, add it [to the example report folder](https://github.com/cal-itp/data-analyses/tree/main/example_report), and [message Amanda.](https://app.slack.com/client/T014965JTHA/C013N8GELLF/user_profile/U02PCTPSZ8A)
+Here are some resources data analysts have collected and referenced, that will hopefully help you out in your work.

 - [Data Analysis](#data-analysis)
 - [Python](#python)
@@ -11,12 +11,14 @@ Here are some resources data analysts have collected and referenced, that will h
 - [Merging](#merging)
 - [Dates](#dates)
 - [Monetary Values](#monetary-values)
+- [Tidy Data](#tidy-data)
 - [Visualizations](#visualization)
 - [Charts](#charts)
 - [Maps](#maps)
 - [DataFrames](#dataframes)
 - [Ipywidgets](#ipywidgets)
 - [Markdown](#markdown)
+- [ReviewNB](#reviewNB)

 (data-analysis)=
@@ -128,6 +130,20 @@ def adjust_prices(df):
     return df
 ```

+(tidy-data)=
+
+### Tidy Data
+
+Tidy Data follows a set of principles that ensure the data is easy to work with, especially when using tools like pandas and matplotlib. The primary rules of tidy data are:
+
+- Each variable must have its own column.
+- Each observation must have its own row.
+- Each value must have its own cell.
+
+Tidy data ensures consistency, making it easier to work with tools like pandas, matplotlib, or seaborn. It also simplifies data manipulation, as functions like `groupby()`, `pivot()`, and `melt()` work more intuitively when the data is structured properly. Additionally, tidy data enables vectorized operations in pandas, allowing for efficient analysis on entire columns or rows at once.
+
+Learn more about Tidy Data [here.](https://vita.had.co.nz/papers/tidy-data.pdf)
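As a small illustration of these principles (the column and route names are invented for the example), here is an untidy "wide" table reshaped into tidy form with pandas, after which grouped operations become straightforward:

```python
import pandas as pd

# Untidy "wide" table: the year variable is spread across column names
wide = pd.DataFrame(
    {
        "route": ["A", "B"],
        "riders_2022": [100, 200],
        "riders_2023": [150, 250],
    }
)

# melt() reshapes to tidy/long form: one row per (route, year) observation
tidy = wide.melt(id_vars="route", var_name="year", value_name="riders")
tidy["year"] = tidy["year"].str.replace("riders_", "", regex=False).astype(int)

# With tidy data, groupby() operates naturally on the year variable
total_by_year = tidy.groupby("year")["riders"].sum()
print(total_by_year)
```

`pivot()` performs the inverse reshape: `tidy.pivot(index="route", columns="year", values="riders")` recovers the wide layout.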
 (visualization)=

 ## Visualization
@@ -159,7 +175,6 @@ def add_tooltip(chart, tooltip1, tooltip2):
 ### Maps

-- [Examples of folium, branca, and color maps.](https://nbviewer.org/github/python-visualization/folium/blob/v0.2.0/examples/Colormaps.ipynb)
 - [Quick interactive maps with Geopandas.gdf.explore()](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html)

 (dataframes)=
@@ -188,3 +203,10 @@ def add_tooltip(chart, tooltip1, tooltip2):
 - [Add a table of content that links to headers throughout a markdown file.](https://stackoverflow.com/questions/2822089/how-to-link-to-part-of-the-same-document-in-markdown)
 - [Add links to local files.](https://stackoverflow.com/questions/32563078/how-link-to-any-local-file-with-markdown-syntax?rq=1)
 - [Direct embed an image.](https://datascienceparichay.com/article/insert-image-in-a-jupyter-notebook/)
+
+(reviewNB)=
+
+### ReviewNB on GitHub
+
+- [Tool designed to facilitate reviewing Jupyter Notebooks in a collaborative setting on GitHub](https://www.reviewnb.com/)
+- [Shows side-by-side diffs of Jupyter Notebooks, including changes to both code cells and markdown cells, and allows reviewers to comment on specific cells](https://www.reviewnb.com/#faq)

docs/analytics_tools/saving_code.md

Lines changed: 16 additions & 0 deletions
@@ -9,11 +9,16 @@ Doing work locally and pushing directly from the command line is a similar workf
 ## Table of Contents

 1. What's a typical [project workflow](#project-workflow)?
+
 2. Someone is collaborating on my branch, how do we [stay in sync](#pulling-and-pushing-changes)?
+
    - The `main` branch is ahead, and I want to [sync my branch with `main`](#rebase-and-merge)
    - [Rebase](#rebase) or [merge](#merge)
    - Options to [Resolve Merge Conflicts](#resolve-merge-conflicts)
+   - [Other Common Issues](#other-common-github-issues-encountered-during-saving-codes)
+
 3. [Other Common GitHub Commands](#other-common-github-commands)
+
    - [External Git Resources](#external-git-resources)
    - [Committing in the Github User Interface](#pushing-drag-drop)
@@ -111,6 +116,17 @@ If you discover merge conflicts and they are within a single notebook that only
 `git checkout --theirs path/to/notebook.ipynb`
 - From here, just add the file and commit with a message as you normally would and the conflict should be fixed in your Pull Request.
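This resolution flow can be sketched end-to-end in a throwaway repo (repo, branch, and file names here are all hypothetical). Note that during a *merge*, `--theirs` refers to the incoming branch, while during a *rebase* the roles of `--ours`/`--theirs` are reversed:

```shell
set -e
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.email "analyst@example.com"
git config user.name "Analyst"

echo "v0" > notebook.ipynb
git add notebook.ipynb && git commit -qm "initial notebook"

git switch -qc feature
echo "feature work" > notebook.ipynb
git commit -qam "feature edit"

git switch -q main
echo "main work" > notebook.ipynb
git commit -qam "main edit"

# Merging main into feature now conflicts on notebook.ipynb
git switch -q feature
git merge main || true

# During this merge, --theirs takes the incoming (main) version wholesale
git checkout --theirs notebook.ipynb
git add notebook.ipynb
git commit -qm "resolve conflict in notebook.ipynb"
cat notebook.ipynb
```

On your real branch only the last three commands apply; the rest just manufactures a conflict to resolve.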
+(other-common-github-issues-encountered-during-saving-codes)=
+
+### Other Common Issues
+
+- Untracked Files:
+  Sometimes files are created or modified locally but are not added to Git before committing, so they are never tracked or pushed to GitHub. Use `git add <filename>` to track files before committing.
+- Incorrect Branches:
+  Committing to the wrong branch (e.g., `main` instead of a feature branch) can cause problems, especially if changes are not meant to be merged into the main codebase. Always confirm you're on the correct branch using `git branch`, and create and switch branches with `git switch -c <branch-name>` before committing.
+- Merge Conflicts from Overlapping Work:
+  When multiple analysts work on the same files or sections of code, merge conflicts can occur. Creating feature branches and pulling regularly to stay updated with `main` helps avoid these conflicts.
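These habits can be sketched as a quick pre-commit routine in a throwaway repo (branch and file names are hypothetical):

```shell
set -e
cd "$(mktemp -d)"
git init -q -b main demo && cd demo
git config user.email "analyst@example.com"
git config user.name "Analyst"

# Work on a feature branch rather than committing straight to main
git switch -c my-analysis

# A new file is untracked until you explicitly add it
echo "route,riders" > new_analysis.csv
git status --short            # shows: ?? new_analysis.csv
git add new_analysis.csv

# Confirm the branch before committing
git branch --show-current     # prints: my-analysis
git commit -qm "add analysis output"
```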
 (other-common-github-commands)=

 ## Other Common GitHub Commands

docs/analytics_tools/tools_quick_links.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 | Tool | Purpose |
 | ---- | ------- |
 | [**Analytics Repo**](https://github.com/cal-itp/data-analyses) | Analytics team code repository. |
-| [**Analytics Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | Analytics team work management. |
+| [**Analytics Project Board**](https://github.com/cal-itp/data-analyses/projects/1) | Analytics team list of active issues. |
 | [**notebooks.calitp.org**](https://notebooks.calitp.org/) | JupyterHub cloud-based notebooks |
 | [**dashboards.calitp.org**](https://dashboards.calitp.org/) | Metabase dashboards & Business Insights |
 | [**dbt-docs.calitp.org**](https://dbt-docs.calitp.org/) | DBT warehouse documentation |

docs/analytics_welcome/how_we_work.md

Lines changed: 0 additions & 20 deletions
@@ -27,26 +27,6 @@ The section below outlines our team's primary meetings and their purposes, as we
 | #**data-office-hours** | Discussion | A place to bring questions, issues, and observations for team discussion. |
 | #**data-warehouse-devs** | Discussion | For people building dbt models - focused on data warehouse performance considerations, etc. |

-## Collaboration Tools
-
-(analytics-project-board)=
-
-### GitHub Analytics Project Board
-
-**You can access The Analytics Project Board [using this link](https://github.com/cal-itp/data-analyses/projects/1)**.
-
-#### How We Track Work
-
-##### Screencast - Navigating the Board
-
-The screencast below introduces:
-
-- Creating new GitHub issues to track your work
-- Adding your issues to our analytics project board
-- Viewing all of your issues on the board (e.g. clicking your avatar to filter)
-
-<div style="position: relative; padding-bottom: 62.5%; height: 0;"><iframe src="https://www.loom.com/embed/a7332ee2e1c040edbf2d11da70b4c3ea" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
-
 (analytics-repo)=

 ### GitHub Analytics Repo

docs/analytics_welcome/overview.md

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,10 @@ After you've read through this section, continue reading through the remaining s
 ______________________________________________________________________

+- [Data and Digital Services Organizational Chart](https://pmp.onramp.dot.ca.gov/downloads/pmp/files/Splash%20Page/org-charts-10-2024/DDS_OrgChart_October2024-signed.pdf)
+
+______________________________________________________________________

 **Other Analytics Sections**:

 - [Technical Onboarding](technical-onboarding)

docs/publishing/sections/5_analytics_portfolio_site.md

Lines changed: 10 additions & 1 deletion
@@ -11,6 +11,7 @@ Netlify is the platform that turns our Jupyter Notebooks uploaded to GitHub into
 To setup your netlify key:

 - Ask in Slack/Teams for a Netlify key if you don't have one yet.
+- If you already have a Netlify key set up, you can find it by running `cat ~/.bash_profile`.
 - Install netlify: `npm install -g netlify-cli`
 - Navigate to your main directory
 - Edit your bash profile using Nano:
@@ -47,14 +48,16 @@ Create a `README.md` file in the repo where your work lies. This also forms the
 Each `.yml` file creates a new site on the [Portfolio's Index Page](https://analysis.calitp.org/), so every project needs its own file. DLA Grant Analysis, SB125 Route Illustrations, and Active Transportation Program all have their own `.yml` file.

-All the `.yml` files live here at [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites).
+All the `.yml` files live at [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). Navigate to this folder to create the `.yml` file.

 Here's how to create a `yml` file:

 - Include the directory to the notebook(s) you want to publish.

 - Name your `.yml` file. For now we will use `my_report.yml` as an example.

+- The `.yml` file should contain the title, directory, `README.md` path, and notebook path.
+
 - The structure of your `.yml` file depends on the type of your analysis:

   - If you have one parameterized notebook with **one parameter**:
@@ -206,3 +209,9 @@ build_my_reports:
 	git add portfolio/my_report/district_*/ portfolio/my_report/*.yml portfolio/my_report/*.md
 	git add portfolio/sites/my_report.yml
 ```
+
+### Delete Portfolio / Refresh Index Page
+
+When redeploying your portfolio with new content, an old version's files may still exist on your portfolio site or in your local environment; it's important to clean up those old files before adding new content.
+
+Run `python portfolio/portfolio.py clean my_report` before deploying your report.

docs/publishing/sections/6_metabase.md

Lines changed: 4 additions & 0 deletions
@@ -9,3 +9,7 @@ An [Airflow DAG](https://github.com/cal-itp/data-infra/tree/main/airflow/dags) n
 Any tweaks to the data processing steps are easily done in scripts and notebooks, and it ensures that the visualizations in the dashboard remain updated with little friction.

 Ex: [Payments Dashboard](https://dashboards.calitp.org/dashboard/3-payments-performance-dashboard?transit_provider=mst)
+
+## Metabase Training Guide 2024
+
+Please see the [Cal-ITP Metabase Training Guide](https://docs.google.com/document/d/1ag9qmSDWF9d30lGyKcvAAjILt1sCIJhK7wuUYkfAals/edit?tab=t.0#heading=h.xdjzmfck1e7) to learn how to use the data warehouse to create meaningful and effective visuals and analyses.

docs/publishing/sections/7_gcs.md

Lines changed: 10 additions & 4 deletions
@@ -2,8 +2,14 @@
 # GCS

-NOTE: If you are planning on publishing to [CKAN](publishing-ckan) and you are
-using the dbt exposure publishing framework, your data will already be saved in
-GCS as part of the upload process.
+### Public Data Access in GCS

-TBD.
+Some data stored in Cloud Storage is configured to be publicly accessible, meaning anyone on the internet can read it at any time. In Google Cloud Storage, you can make data publicly accessible either at the bucket level or the object level. At the bucket level, you can grant public access to all objects within the bucket by modifying the bucket policy. Alternatively, you can provide public access to specific objects.
+
+Notes:
+
+- Always ensure that sensitive information is not exposed when configuring public access in Google Cloud Storage. Publicly accessible data should be carefully reviewed to prevent the accidental sharing of confidential or private information.
+- External users can't browse the public bucket on the web, only download individual files. If you have many files to share, it's best to use the [Command Line Interface.](https://cloud.google.com/storage/docs/access-public-data#command-line)
+- There is a [function](https://github.com/cal-itp/data-analyses/blob/f62b150768fb1547c6b604cb53d122531104d099/_shared_utils/shared_utils/publish_utils.py#L16) in shared_utils that handles writing files to the public bucket, regardless of the file type (e.g., Parquet, GeoJSON, etc.)
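As a hedged sketch of the two access levels (bucket and object names here are hypothetical, you need the appropriate IAM permissions, and anything made public should first be reviewed for sensitive data), with `gsutil`:

```shell
# Bucket level: grant read access to everyone for every object in the bucket
gsutil iam ch allUsers:objectViewer gs://my-public-bucket

# Object level: make a single file public (fine-grained-ACL buckets)
gsutil acl ch -u AllUsers:R gs://my-bucket/reports/summary.parquet

# A public object can then be fetched anonymously over HTTPS
curl -O https://storage.googleapis.com/my-public-bucket/summary.parquet
```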
+
+NOTE: If you are planning on publishing to [CKAN](publishing-ckan) and you are using the dbt exposure publishing framework, your data will already be saved in GCS as part of the upload process.
