Tutorial: How to query a CSV in PostHog (#9418)
* csv query tutorial

---------

Co-authored-by: Lior539 <[email protected]>
ivanagas and Lior539 authored Sep 30, 2024
1 parent 1925a2f commit 49d5d15
Showing 4 changed files with 135 additions and 3 deletions.
2 changes: 1 addition & 1 deletion contents/docs/data-warehouse/setup/r2.md
@@ -22,7 +22,7 @@ The data warehouse can link to data in Cloudflare R2. To start, you'll need to:

![created bucket](https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_07_16_at_10_06_02_2x_152e3c2309.png)

3. In your bucket, upload the data you want to query, such as CSV or Parquet data. It can be as simple as a `.csv` file like this:
3. In your bucket, upload the data you want to query, such as [CSV](/tutorials/csv-query) or Parquet data. It can be as simple as a `.csv` file like this:

```csv
id,name,email
2 changes: 1 addition & 1 deletion contents/docs/data-warehouse/setup/s3.md
@@ -87,7 +87,7 @@ The final step is to create a new user and give them access to our bucket by att

### Step 3: Add data to the bucket

> For this section, we'll be using **Airbyte**. However, we accept any data in CSV or Parquet format, so if you already have data in S3 you can skip this section.
> For this section, we'll be using **Airbyte**. However, we accept any data in [CSV](/tutorials/csv-query) or Parquet format, so if you already have data in S3 you can skip this section.
1. Go to [Airbyte](https://airbyte.com) and sign up for an account if you haven't already.
2. Go to connections and click "New connection".
3 changes: 2 additions & 1 deletion contents/docs/data-warehouse/tutorials.mdx
@@ -10,4 +10,5 @@ Got a question which isn't answered below? Head to [the community forum](/questi
- [How to set up Hubspot reports](/tutorials/hubspot-reports)
- [How to set up Zendesk reports](/tutorials/zendesk-reports)
- [How to set up Google Ads reports](/tutorials/google-ads-reports)
- [How to query Supabase in PostHog](/tutorials/supabase-query)
- [How to query Supabase in PostHog](/tutorials/supabase-query)
- [How to query a CSV in PostHog](/tutorials/csv-query)
131 changes: 131 additions & 0 deletions contents/tutorials/csv-query.md
@@ -0,0 +1,131 @@
---
title: How to query a CSV in PostHog
date: 2024-09-30
author:
- ian-vanagas
tags:
- data warehouse
---

PostHog can capture a lot of data about your users. For data it can't capture, you can leverage the [data warehouse](/data-warehouse) to manually upload any data you'd like as a CSV.

This tutorial shows you how to upload a CSV to storage, connect that storage source to PostHog, and then query the CSV alongside your data in PostHog.

## Creating and uploading our CSV

For this tutorial, we can create an example CSV with a list of users for an imaginary video conferencing company, which looks like this:

```csv
user_id,full_name,email,join_date,subscription_type,total_meetings_hosted,total_meetings_attended
001,John Doe,[email protected],2023-01-15,Pro,45,60
002,Jane Smith,[email protected],2022-11-30,Free,10,25
003,Michael Brown,[email protected],2023-03-10,Pro,55,70
004,Linda Green,[email protected],2022-12-25,Business,120,150
005,David Lee,[email protected],2023-07-05,Free,5,10
006,Sarah Johnson,[email protected],2023-05-20,Business,75,80
007,Ian Vanagas,[email protected],2023-02-15,Pro,40,55
```

To get this into PostHog, we need to upload it into storage. The easiest way to do this is to use [Cloudflare R2](/docs/data-warehouse/setup/r2), but you can also use other storage services like [S3](/docs/data-warehouse/setup/s3), [Azure Blob](/docs/data-warehouse/setup/azure-blob), or [GCS](/docs/data-warehouse/setup/gcs).

After signing up for Cloudflare, go to your dashboard and create a new bucket (if you haven't already). We suggest using Eastern North America as a location hint if you're using PostHog Cloud US, or European Union as a specific jurisdiction if you're using PostHog Cloud EU.

With the bucket created, upload your `.csv`.

![https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_10_46_04_2x_0a9905a073.png](https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_10_46_04_2x_0a9905a073.png)

## Connecting our R2 bucket to PostHog

With our bucket set up and `.csv` uploaded, we are ready to connect it to PostHog.

1. In Cloudflare, go to the R2 overview, and under account details, click **Manage R2 API Tokens.**
2. Click **Create API token**, give your token a name, choose **Object Read only** as the permission type, apply it to your bucket, and click **Create API Token.**

![https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_07_16_at_10_20_43_2x_97c29591fb.png](https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_07_16_at_10_20_43_2x_97c29591fb.png)

3. Copy the credentials for S3 clients, including the **Access Key ID**, **Secret Access Key**, and jurisdiction-specific endpoint URL. These are not shown again, so copy them to a safe place.

With these, we can add the bucket to PostHog:

1. Go to the [sources tab](https://us.posthog.com/pipeline/sources) of the data pipeline section in PostHog.
2. Click [**New source**](https://us.posthog.com/project/52792/pipeline/new/source) and, under self managed, look for **Cloudflare R2** and click **Link.**
3. Fill in the table name to use in PostHog (like `csv_users`), then use the data from Cloudflare to fill out the rest of the fields:
- For the files URL pattern, use the jurisdiction-specific endpoint URL with your bucket and file name, like `https://b27344y7bd543c.r2.cloudflarestorage.com/posthog-warehouse/my_users.csv`.
- Choose the **CSV with headers** format.
- For the access key, use your Access Key ID.
- For the secret key, use your Secret Access Key.
4. Finally, click **Next** to link the bucket to PostHog.

<ProductScreenshot
imageLight="https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_11_23_44_2x_982f1f4214.png"
imageDark="https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_11_23_29_2x_4b68dbfec3.png"
classes="rounded"
alt="Connecting R2 bucket to PostHog"
/>

## Querying CSV data in PostHog

Once linked, we can query the data in PostHog by creating a [new SQL insight](https://us.posthog.com/insights/new) and querying the newly created table like this:

```sql
SELECT * FROM csv_users
```

This gets all the data from the CSV.

<ProductScreenshot
imageLight = "https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_11_28_54_2x_e37398b6b8.png"
imageDark = "https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_11_29_10_2x_4be5ee2166.png"
classes="rounded"
alt="Querying CSV data in PostHog"
/>
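
If your CSV is large, you can sanity check the first few rows before writing anything more involved; `LIMIT` works as you'd expect:

```sql
SELECT * FROM csv_users LIMIT 10
```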

We can use [the features of SQL](/docs/product-analytics/sql) to filter and transform the data. For example, to get the pro or business users with the highest `total_meetings_hosted`, we can do this:

```sql
SELECT email, total_meetings_hosted
FROM csv_users
WHERE subscription_type = 'Pro' OR subscription_type = 'Business'
ORDER BY total_meetings_hosted DESC
```
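
Aggregations work too. As a quick sketch, this averages meetings hosted per subscription tier (if a numeric column was ingested as a string, wrap it in `toFloat()` first):

```sql
SELECT subscription_type, avg(total_meetings_hosted) AS avg_meetings_hosted
FROM csv_users
GROUP BY subscription_type
ORDER BY avg_meetings_hosted DESC
```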

### Joining CSV data to persons

When your data relates to [people](/docs/data/persons) in PostHog, you can create a [join](/docs/data-warehouse/join) between it and our `persons` table. This makes your CSV data much more useful by acting like extended person properties.

To do this:

1. Go to the data warehouse tab and find the `persons` table, click the three dots next to it, and click **Add join**.
2. In the popup, set the **Source Table Key** to a property that both tables include; in our case, that is `email`, which we access in HogQL as `properties.email`.
3. Choose `csv_users` as your **Joining Table** and `email` as your **Joining Table Key.**
4. Click **Save**.

<ProductScreenshot
imageLight = "https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_13_18_33_2x_38449df291.png"
imageDark = "https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_13_17_59_2x_4896f8c63b.png"
classes="rounded"
alt="Joining CSV data to persons in PostHog"
/>

Once you've done this, you can query your CSV data from the persons table like this:

```sql
SELECT csv_users.total_meetings_hosted
FROM persons
WHERE properties.email = '[email protected]'
```

You can also use these extended person properties in insights. For example, you can get pageviews for users with the Pro subscription type by selecting `csv_users: subscription_type` from extended person properties when creating an insight.
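
The joined fields are also reachable when querying the `events` table directly. As a rough sketch, assuming the join's field name was left as the default (the joining table name) and follows the `person.<field name>.<column>` pattern, the insight above looks something like this:

```sql
-- Pageviews from users on the Pro plan, via the joined CSV column.
-- Assumes the join's field name is csv_users; adjust if yours differs.
SELECT count() AS pageviews
FROM events
WHERE event = '$pageview'
  AND person.csv_users.subscription_type = 'Pro'
```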

<ProductScreenshot
imageLight = "https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_13_24_54_2x_f6704d05eb.png"
imageDark = "https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2024_09_23_at_13_25_19_2x_6ad280fde5.png"
classes="rounded"
alt="Using extended person properties from CSV data in PostHog insights"
/>

## Further reading

- [How to query Supabase data in PostHog](/tutorials/supabase-query)
- [How to set up Google Ads reports](/tutorials/google-ads-reports)
- [The basics of SQL for analytics](/product-engineers/sql-for-analytics)
