- [ ] [**siuba**](https://siuba.org/) - Recommended data analysis library | ([Docs](siuba))
- [ ] [**shared_utils**](https://github.com/cal-itp/data-analyses/tree/main/_shared_utils) - A shared utilities library for the analytics team | ([Docs](shared-utils))

**Caltrans Employee Resources:**

- [ ] [**Organizational Chart**](https://pmp.onramp.dot.ca.gov/organizational-chart) - Data and Digital Services Organizational Chart
- [ ] [**Service Now (SNOW)**](https://cdotprod.service-now.com/sp) - Caltrans IT Service Management Portal for IT issues and requesting specific software
- [ ] [**Cal Employee Connect**](https://connect.sco.ca.gov/) - State Controller's Office site for paystubs and tax information
- [ ] [**Geospatial Enterprise Engagement Platform - GIS Account Request Form**](https://sv03tmcpo.ct.dot.ca.gov/portal/apps/sites/#/geep/pages/account-request) (optional) - User request form for ArcGIS Online and ArcGIS Portal accounts

10. [Jupyter Notebook Best Practices](notebook-shortcuts)
11. [Developing warehouse models in Jupyter](jupyterhub-warehouse)

(using-jupyterhub)=

JupyterHub currently lives at [notebooks.calitp.org](https://notebooks.calitp.org).

Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, ask in the `#services-team` channel in Slack.

(default-user-vs-power-user)=

### Default User vs Power User

#### Default User

The Default User profile is designed for general use and is ideal for less resource-intensive tasks. It's a good starting point for most users who don't expect to run very large, memory-hungry jobs.

The Default User profile offers quick availability: because it requests less memory, it can be scheduled on a smaller node and your session starts faster. However, if your task's memory usage grows over time, it may exceed the node's capacity, and the system may terminate your job. This makes the Default profile best suited to small and medium-sized tasks; workloads that outgrow its limits may become unstable or crash.

#### Power User

The Power User profile is intended for more demanding, memory-intensive tasks that require more resources up front. It is suitable for workloads with higher memory requirements or that are expected to grow during execution.

The Power User profile allocates a full node or a significant share of resources to ensure your job has enough memory and computational power, avoiding crashes or delays. The trade-off is a longer wait time, since the system needs to provision a new node for you. Once it's ready, you'll have all the resources necessary for memory-intensive work such as large datasets or simulations. The Power User profile is ideal for jobs that would be unstable or crash on the Default profile due to higher resource demands. It also offers scalability: if your task requires more resources than the initial node can provide, the system will automatically spin up additional nodes to meet the demand.

**File: docs/analytics_tools/knowledge_sharing.md**

# Helpful Links

Here are some resources data analysts have collected and referenced that will hopefully help you in your work.

7
-[Data Analysis](#data-analysis)
8
8
-[Python](#python)
@@ -11,12 +11,14 @@ Here are some resources data analysts have collected and referenced, that will h
11
11
-[Merging](#merging)
12
12
-[Dates](#dates)
13
13
-[Monetary Values](#monetary-values)
14
+
-[Tidy Data](#tidy-data)
14
15
-[Visualizations](#visualization)
15
16
-[Charts](#charts)
16
17
-[Maps](#maps)
17
18
-[DataFrames](#dataframes)
18
19
-[Ipywidgets](#ipywidgets)
19
20
-[Markdown](#markdown)
21
+
-[ReviewNB](#reviewNB)
20
22
21
23
(data-analysis)=
22
24
@@ -128,6 +130,20 @@ def adjust_prices(df):
128
130
return df
129
131
```
130
132
133
(tidy-data)=

### Tidy Data

Tidy Data follows a set of principles that ensure data is easy to work with, especially when using tools like pandas and matplotlib. The primary rules of tidy data are:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.

Tidy data ensures consistency, making it easier to work with tools like pandas, matplotlib, or seaborn. It also simplifies data manipulation: functions like `groupby()`, `pivot()`, and `melt()` work more intuitively when data is structured properly. Additionally, tidy data enables vectorized operations in pandas, allowing efficient analysis of entire columns or rows at once.

Learn more about Tidy Data [here.](https://vita.had.co.nz/papers/tidy-data.pdf)
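
As a short illustration (the column names here are made up), `melt()` reshapes a wide, untidy table into tidy form:

```python
import pandas as pd

# Wide, untidy layout: each year's ridership is spread across columns.
wide = pd.DataFrame({
    "route": ["A", "B"],
    "2023_riders": [100, 80],
    "2024_riders": [120, 90],
})

# Tidy layout: one row per (route, year) observation.
tidy = wide.melt(id_vars="route", var_name="year", value_name="riders")
tidy["year"] = tidy["year"].str.replace("_riders", "", regex=False)
```

Each variable (`route`, `year`, `riders`) now has its own column and each observation its own row, so `groupby()` and plotting work directly on the result.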
- [Add a table of contents that links to headers throughout a markdown file.](https://stackoverflow.com/questions/2822089/how-to-link-to-part-of-the-same-document-in-markdown)
- [Add links to local files.](https://stackoverflow.com/questions/32563078/how-link-to-any-local-file-with-markdown-syntax?rq=1)
- [Directly embed an image.](https://datascienceparichay.com/article/insert-image-in-a-jupyter-notebook/)

(reviewNB)=

### ReviewNB on GitHub

- [A tool designed to facilitate reviewing Jupyter Notebooks collaboratively on GitHub](https://www.reviewnb.com/)
- [Shows side-by-side diffs of Jupyter Notebooks, including changes to both code and markdown cells, and allows reviewers to comment on specific cells](https://www.reviewnb.com/#faq)

Sometimes, files are created or modified locally but are not added to Git before committing, so they are not tracked or pushed to GitHub. Use `git add <filename>` to track files before committing.

- Incorrect Branches:

  Committing to the wrong branch (e.g., `main` instead of a feature branch) can cause problems, especially if the changes are not meant to be merged into the main codebase. Always confirm you're on the correct branch with `git branch`, and create and switch to a new branch with `git switch -c <branch-name>` before committing.

- Merge Conflicts from Overlapping Work:

  When multiple analysts work on the same files or sections of code, merge conflicts can occur. Creating feature branches and pulling regularly to stay up to date with `main` can help avoid these conflicts.

- [Data and Digital Services Organizational Chart](https://pmp.onramp.dot.ca.gov/downloads/pmp/files/Splash%20Page/org-charts-10-2024/DDS_OrgChart_October2024-signed.pdf)

**File: docs/publishing/sections/5_analytics_portfolio_site.md**

Netlify is the platform that turns our Jupyter Notebooks uploaded to GitHub into a fully functional website.

To set up your Netlify key:

- Ask in Slack/Teams for a Netlify key if you don't have one yet.
- If you already have your Netlify key set up, find it by running `cat ~/.bash_profile` from the root of your repo.
- Install the Netlify CLI: `npm install -g netlify-cli`
- Navigate to your main directory
- Edit your bash profile using Nano:
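
The line you add to your bash profile is an export of your Netlify key. The variable name below follows the Netlify CLI convention, but it is an assumption here; confirm the exact name with the team:

```shell
# Hypothetical value; paste the key you received in Slack/Teams.
export NETLIFY_AUTH_TOKEN="your-netlify-key-here"
```

After saving, run `source ~/.bash_profile` (or open a new terminal) so the variable takes effect.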
Create a `README.md` file in the repo where your work lies. This also forms the landing page of your portfolio site.

Each `.yml` file creates a new site on the [Portfolio's Index Page](https://analysis.calitp.org/), so every project needs its own file. DLA Grant Analysis, SB125 Route Illustrations, and Active Transportation Program all have their own `.yml` file.

All the `.yml` files live at [data-analyses/portfolio/sites](https://github.com/cal-itp/data-analyses/tree/main/portfolio/sites). Navigate to this folder to create your `.yml` file.

Here's how to create a `.yml` file:

- Include the directory to the notebook(s) you want to publish.
- Name your `.yml` file. For now we will use `my_report.yml` as an example.
- The `.yml` file should contain the title, directory, `README.md` path, and notebook path.
- The structure of your `.yml` file depends on the type of your analysis:
  - If you have one parameterized notebook with **one parameter**:
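
For illustration, a minimal sketch of what `my_report.yml` might contain; the field names here are assumptions, so copy the exact structure from an existing file in `portfolio/sites/` rather than from this sketch:

```yaml
# Hypothetical sketch -- verify field names against existing files in portfolio/sites/
title: My Report                      # title shown on the index page
directory: ./my_report/               # folder containing the notebook(s)
readme: ./my_report/README.md         # forms the site's landing page
notebook: ./my_report/my_report.ipynb
```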
When redeploying your portfolio with new content, if an old version's files still exist on your portfolio site or in your local environment, it's important to clean up the old files before adding the new content.

Run `python portfolio/portfolio.py clean my_report` before deploying your report.

**File: docs/publishing/sections/6_metabase.md**

An [Airflow DAG](https://github.com/cal-itp/data-infra/tree/main/airflow/dags) …

Any tweaks to the data processing steps are easily done in scripts and notebooks, and it ensures that the visualizations in the dashboard remain updated with little friction.

Please see the [Cal-ITP Metabase Training Guide](https://docs.google.com/document/d/1ag9qmSDWF9d30lGyKcvAAjILt1sCIJhK7wuUYkfAals/edit?tab=t.0#heading=h.xdjzmfck1e7) to see how to utilize the data warehouse to create meaningful and effective visuals and analyses.

**File: docs/publishing/sections/7_gcs.md**

# GCS

### Public Data Access in GCS

Some data stored in Cloud Storage is configured to be publicly accessible, meaning anyone on the internet can read it at any time. In Google Cloud Storage, you can make data publicly accessible either at the bucket level or the object level. At the bucket level, you can grant public access to all objects within the bucket by modifying the bucket policy. Alternatively, you can provide public access to specific objects.
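
For example, a publicly readable object can be fetched over plain HTTPS at a predictable URL; the bucket and object names below are made up:

```python
# Public GCS objects are served at https://storage.googleapis.com/<bucket>/<object>.
bucket = "calitp-example-public-bucket"  # hypothetical bucket name
blob = "reports/my_report.parquet"       # hypothetical object path
url = f"https://storage.googleapis.com/{bucket}/{blob}"
```

Anyone can download such an object with `curl`, a browser, or `pd.read_parquet(url)`, with no Google credentials required.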
Notes:

- Always ensure that sensitive information is not exposed when configuring public access in Google Cloud Storage. Publicly accessible data should be carefully reviewed to prevent the accidental sharing of confidential or private information.
- External users can't browse the public bucket on the web, only download individual files. If you have many files to share, it's best to use the [Command Line Interface.](https://cloud.google.com/storage/docs/access-public-data#command-line)
- There is a [function](https://github.com/cal-itp/data-analyses/blob/f62b150768fb1547c6b604cb53d122531104d099/_shared_utils/shared_utils/publish_utils.py#L16) in shared_utils that handles writing files to the public bucket, regardless of the file type (e.g., Parquet, GeoJSON, etc.)

NOTE: If you are planning on publishing to [CKAN](publishing-ckan) and you are using the dbt exposure publishing framework, your data will already be saved in GCS as part of the upload process.