From 49d5d151f06e0c49ae8427e242a3ea49d5a49477 Mon Sep 17 00:00:00 2001
From: Ian Vanagas <34755028+ivanagas@users.noreply.github.com>
Date: Mon, 30 Sep 2024 10:45:49 -0700
Subject: [PATCH] Tutorial: How to query a CSV in PostHog (#9418)

* csv query tutorial

---------

Co-authored-by: Lior539
---
 contents/docs/data-warehouse/setup/r2.md   |   2 +-
 contents/docs/data-warehouse/setup/s3.md   |   2 +-
 contents/docs/data-warehouse/tutorials.mdx |   3 +-
 contents/tutorials/csv-query.md            | 131 +++++++++++++++++++++
 4 files changed, 135 insertions(+), 3 deletions(-)
 create mode 100644 contents/tutorials/csv-query.md

diff --git a/contents/docs/data-warehouse/setup/r2.md b/contents/docs/data-warehouse/setup/r2.md
index ff526aa1a8d4..73c06c4c6d32 100644
--- a/contents/docs/data-warehouse/setup/r2.md
+++ b/contents/docs/data-warehouse/setup/r2.md
@@ -22,7 +22,7 @@ The data warehouse can link to data in Cloudflare R2. To start, you'll need to:
 
 ![created bucket](https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_07_16_at_10_06_02_2x_152e3c2309.png)
 
-3. In your bucket, upload the data you want to query such as CSV or Parquet data. It can be as simple as a `.csv` file like this:
+3. In your bucket, upload the data you want to query, such as [CSV](/tutorials/csv-query) or Parquet data. It can be as simple as a `.csv` file like this:
 
 ```csv
 id,name,email

diff --git a/contents/docs/data-warehouse/setup/s3.md b/contents/docs/data-warehouse/setup/s3.md
index 6c7e00e1f484..df605877f08c 100644
--- a/contents/docs/data-warehouse/setup/s3.md
+++ b/contents/docs/data-warehouse/setup/s3.md
@@ -87,7 +87,7 @@ The final step is to create a new user and give them access to our bucket by att
 
 ### Step 3: Add data to the bucket
 
-> For this section, we'll be using **Airbyte**. However, we accept any data in CSV or Parquet format, so if you already have data in S3 you can skip this section.
+> For this section, we'll be using **Airbyte**. However, we accept any data in [CSV](/tutorials/csv-query) or Parquet format, so if you already have data in S3, you can skip this section.
 
 1. Go to [Airbyte](https://airbyte.com) and sign up for an account if you haven't already.
 2. Go to connections and click "New connection"

diff --git a/contents/docs/data-warehouse/tutorials.mdx b/contents/docs/data-warehouse/tutorials.mdx
index 203414d7100b..390ff3e4ba5a 100644
--- a/contents/docs/data-warehouse/tutorials.mdx
+++ b/contents/docs/data-warehouse/tutorials.mdx
@@ -10,4 +10,5 @@ Got a question which isn't answered below? Head to [the community forum](/questi
 - [How to set up Hubspot reports](/tutorials/hubspot-reports)
 - [How to set up Zendesk reports](/tutorials/zendesk-reports)
 - [How to set up Google Ads reports](/tutorials/google-ads-reports)
-- [How to query Supabase in PostHog](/tutorials/supabase-query)
\ No newline at end of file
+- [How to query Supabase in PostHog](/tutorials/supabase-query)
+- [How to query a CSV in PostHog](/tutorials/csv-query)
\ No newline at end of file

diff --git a/contents/tutorials/csv-query.md b/contents/tutorials/csv-query.md
new file mode 100644
index 000000000000..bf02d5a17ec6
--- /dev/null
+++ b/contents/tutorials/csv-query.md
@@ -0,0 +1,131 @@
---
title: How to query a CSV in PostHog
date: 2024-09-30
author:
  - ian-vanagas
tags:
  - data warehouse
---

PostHog can capture a lot of data about your users.
For data it can't capture, you can leverage the [data warehouse](/data-warehouse) to manually upload any data you'd like as a CSV.

This tutorial shows you how to upload a CSV to storage, connect that storage to PostHog, and then query the CSV alongside the rest of your PostHog data.

## Creating and uploading our CSV

For this tutorial, we create an example CSV with a list of users for an imaginary video conferencing company, which looks like this:

```csv
user_id,full_name,email,join_date,subscription_type,total_meetings_hosted,total_meetings_attended
001,John Doe,johndoe@example.com,2023-01-15,Pro,45,60
002,Jane Smith,janesmith@example.com,2022-11-30,Free,10,25
003,Michael Brown,michaelbrown@example.com,2023-03-10,Pro,55,70
004,Linda Green,lindagreen@example.com,2022-12-25,Business,120,150
005,David Lee,davidlee@example.com,2023-07-05,Free,5,10
006,Sarah Johnson,sarahj@example.com,2023-05-20,Business,75,80
007,Ian Vanagas,ian@posthog.com,2023-02-15,Pro,40,55
```

To get this into PostHog, we need to upload it to storage. The easiest way to do this is to use [Cloudflare R2](/docs/data-warehouse/setup/r2), but you can also use other storage services like [S3](/docs/data-warehouse/setup/s3), [Azure Blob](/docs/data-warehouse/setup/azure-blob), or [GCS](/docs/data-warehouse/setup/gcs).

After signing up for Cloudflare, go to your dashboard and create a new bucket (if you haven't already). We suggest using Eastern North America as a location hint if you're using PostHog Cloud US, or the European Union as a specific jurisdiction if you're using PostHog Cloud EU.

With the bucket created, upload your `.csv` file.

![Uploaded CSV file in a Cloudflare R2 bucket](https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_10_46_04_2x_0a9905a073.png)

## Connecting our R2 bucket to PostHog

With our bucket set up and `.csv` uploaded, we are ready to connect it to PostHog.

1. In Cloudflare, go to the R2 overview, and under account details, click **Manage R2 API Tokens**.
2. Click **Create API token**, give your token a name, choose **Object Read only** as the permission type, apply it to your bucket, and click **Create API token**.

![Creating an R2 API token in Cloudflare](https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_07_16_at_10_20_43_2x_97c29591fb.png)

3. Copy the credentials for S3 clients, including the **Access Key ID**, **Secret Access Key**, and jurisdiction-specific endpoint URL. These are not shown again, so copy them somewhere safe.

With these, we can add the bucket to PostHog:

1. Go to the [sources tab](https://us.posthog.com/pipeline/sources) of the data pipeline section in PostHog.
2. Click [**New source**](https://us.posthog.com/project/52792/pipeline/new/source) and, under self-managed, look for **Cloudflare R2** and click **Link**.
3. Fill in the table name to use in PostHog (like `csv_users`), then use the credentials from Cloudflare to fill out the rest of the fields:
    - For the files URL pattern, use the jurisdiction-specific endpoint URL with your bucket and file name, like `https://b27344y7bd543c.r2.cloudflarestorage.com/posthog-warehouse/my_users.csv`.
    - Choose the **CSV with headers** format.
    - For the access key, use your Access Key ID.
    - For the secret key, use your Secret Access Key.
4. Finally, click **Next** to link the bucket to PostHog.
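To confirm the link worked, you can create a [SQL insight](https://us.posthog.com/insights/new) and count the rows PostHog reads from the file. This is a minimal sanity check, assuming you kept the `csv_users` table name from step 3; the count should match the number of data rows in your CSV (seven in our example).

```sql
-- Count the rows PostHog can read from the linked file.
-- This should match the CSV: 7 data rows in our example.
SELECT count() AS row_count
FROM csv_users
```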
## Querying CSV data in PostHog

Once linked, we can query the data in PostHog by creating a [new SQL insight](https://us.posthog.com/insights/new) and selecting from the newly created table like this:

```sql
SELECT * FROM csv_users
```

This returns all the data from the CSV.

We can use [the features of SQL](/docs/product-analytics/sql) to filter and transform the data. For example, to get the Pro or Business users with the highest `total_meetings_hosted`, we can do this:

```sql
SELECT email, total_meetings_hosted
FROM csv_users
WHERE subscription_type = 'Pro' OR subscription_type = 'Business'
ORDER BY total_meetings_hosted DESC
```

### Joining CSV data to persons

When your data relates to [people](/docs/data/persons) in PostHog, you can create a [join](/docs/data-warehouse/join) between it and the `persons` table. This makes your CSV data much more useful, because it then acts like extended person properties.

To do this:

1. Go to the data warehouse tab, find the `persons` table, click the three dots next to it, and click **Add join**.
2. In the popup, set the **Source Table Key** to a property both tables include; in our case, that's `email`. Because it lives in person properties, we use HogQL to set the **Source Table Key** to `properties.email`.
3. Choose `csv_users` as your **Joining Table** and `email` as your **Joining Table Key**.
4. Click **Save**.

Once you've done this, you can query your CSV data from the `persons` table like this:

```sql
SELECT csv_users.total_meetings_hosted
FROM persons
WHERE properties.email = 'ian@posthog.com'
```

You can also use these extended person properties in insights. For example, you can get pageviews for users with the Pro subscription type by selecting `csv_users: subscription_type` from extended person properties when creating an insight.

## Further reading

- [How to query Supabase data in PostHog](/tutorials/supabase-query)
- [How to set up Google Ads reports](/tutorials/google-ads-reports)
- [The basics of SQL for analytics](/product-engineers/sql-for-analytics)