feat: Add Dagster jobs to compute exchange rates (#29495)
rafaeelaudibert authored Mar 6, 2025
1 parent bd37883 commit 487488d
Showing 14 changed files with 811 additions and 125 deletions.
Empty file added .dagster_home/.gitkeep
Empty file.
3 changes: 3 additions & 0 deletions .gitignore
@@ -96,3 +96,6 @@ playwright-report/
test-results/
playwright/playwright-report/
playwright/test-results/

.dagster_home/*
!.dagster_home/.gitkeep
98 changes: 90 additions & 8 deletions dags/README.md
@@ -1,17 +1,99 @@
# PostHog Dagster DAGs

This directory contains [Dagster](https://dagster.io/) data pipelines (DAGs) for PostHog. Dagster is a data orchestration framework that allows us to define, schedule, and monitor data workflows.

## What is Dagster?

Dagster is an open-source data orchestration tool designed to help you define and execute data pipelines. Key concepts, illustrated in the sketch after this list, include:

- **Assets**: Data artifacts that your pipelines produce and consume (e.g., tables, files)
- **Ops**: Individual units of computation (functions)
- **Jobs**: Collections of ops that are executed together
- **Resources**: Shared infrastructure and connections (e.g. database connections)
- **Schedules**: Time-based triggers for jobs
- **Sensors**: Event-based triggers for jobs
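
To make these concepts concrete, here is a minimal sketch of how assets, a job, and a schedule fit together. The asset and job names are illustrative only and are not part of this repository:

```python
import dagster


@dagster.asset
def raw_numbers() -> list[int]:
    """An asset: a data artifact produced by the pipeline."""
    return [1, 2, 3]


@dagster.asset
def doubled_numbers(raw_numbers: list[int]) -> list[int]:
    """A downstream asset; Dagster wires the dependency via the parameter name."""
    return [n * 2 for n in raw_numbers]


# A job materializes a selection of assets together.
numbers_job = dagster.define_asset_job(
    name="numbers_job",
    selection=[raw_numbers, doubled_numbers],
)

# A schedule triggers the job on a cron cadence.
numbers_schedule = dagster.ScheduleDefinition(
    job=numbers_job,
    cron_schedule="0 0 * * *",  # daily at midnight
    execution_timezone="UTC",
)
```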

## Project Structure

- `definitions.py`: Main Dagster definition file that defines assets, jobs, schedules, sensors, and resources
- `common.py`: Shared utilities and resources
- Individual DAG files (e.g., `exchange_rate.py`, `deletes.py`, `person_overrides.py`)
- `tests/`: Tests for the DAGs

## Local Development

### Environment Setup

Dagster uses the `DAGSTER_HOME` environment variable to determine where to store instance configuration, logs, and other local artifacts. If it's not set, Dagster uses a temporary directory that is erased when you shut down `dagster dev`.

```bash
# Set DAGSTER_HOME to a directory of your choice
export DAGSTER_HOME=/path/to/your/dagster/home
```

For consistency with the PostHog development environment, you might want to set this to a subdirectory within your project:

```bash
export DAGSTER_HOME=$(pwd)/.dagster_home
```

You can add this to your shell profile if you want your Dagster state to persist across sessions, or to your local `.env` file, which is picked up automatically by `dagster dev`.

### Running the Development Server

To run the Dagster development server locally:

```bash
# Important: Set DEBUG=1 when running locally to use local resources
DEBUG=1 dagster dev
```

Setting `DEBUG=1` is required when running locally: it selects the local resource definitions in `definitions.py` instead of the production ones.

The Dagster UI will be available at http://localhost:3000 by default, where you can:

- Browse assets, jobs, and schedules
- Manually trigger job runs
- View execution logs and status
- Debug pipeline issues

## Adding New DAGs

When adding a new DAG (see the sketch after this list):

1. Create a new Python file for your DAG
2. Define your assets, ops, and jobs
3. Import and register them in `definitions.py`
4. Add appropriate tests in the `tests/` directory
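
As a sketch, a hypothetical `dags/my_new_dag.py` (the file and names below are illustrative, not part of this commit) might look like:

```python
# dags/my_new_dag.py (hypothetical)
import dagster


@dagster.asset
def my_new_asset() -> list[int]:
    """Compute or fetch the data this asset represents."""
    return [1, 2, 3]


# Bundle the asset into a job so it can be scheduled or run manually.
my_new_job = dagster.define_asset_job(
    name="my_new_job",
    selection=[my_new_asset],
)
```

You would then register it in `definitions.py` by importing the module and adding `my_new_dag.my_new_asset` to `assets=[...]` and `my_new_dag.my_new_job` to `jobs=[...]` in the `dagster.Definitions(...)` call.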

## Running Tests

Tests are implemented using pytest. The following command will run all DAG tests:

```bash
# From the project root
pytest dags/
```

To run a specific test file:

```bash
pytest dags/tests/test_exchange_rate.py
```

To run a specific test:

```bash
pytest dags/tests/test_exchange_rate.py::test_name
```

Add `-v` for verbose output:

```bash
pytest -v dags/tests/test_exchange_rate.py
```
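
For a new DAG, a test can often invoke the asset function directly. A minimal sketch, reusing the hypothetical `my_new_dag.py` example above:

```python
# dags/tests/test_my_new_dag.py (hypothetical)
from dags.my_new_dag import my_new_asset


def test_my_new_asset_returns_numbers():
    # Dagster asset definitions support direct invocation in tests.
    assert my_new_asset() == [1, 2, 3]
```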

## Additional Resources

- [Dagster Documentation](https://docs.dagster.io/)
- [PostHog Documentation](https://posthog.com/docs)
20 changes: 9 additions & 11 deletions dags/ch_examples.py
@@ -1,26 +1,24 @@
import dagster

from posthog.clickhouse.client import sync_execute  # noqa


class ClickHouseConfig(dagster.Config):
    result_path: str = "/tmp/clickhouse_version.txt"


@dagster.asset
def get_clickhouse_version(config: ClickHouseConfig) -> dagster.MaterializeResult:
    version = sync_execute("SELECT version()")[0][0]
    with open(config.result_path, "w") as f:
        f.write(version)

    return dagster.MaterializeResult(metadata={"version": version})


@dagster.asset(deps=[get_clickhouse_version])
def print_clickhouse_version(config: ClickHouseConfig):
    with open(config.result_path) as f:
        print(f.read())  # noqa

    return dagster.MaterializeResult(metadata={"version": config.result_path})
1 change: 1 addition & 0 deletions dags/common.py
@@ -12,6 +12,7 @@

class JobOwners(str, Enum):
    TEAM_CLICKHOUSE = "team-clickhouse"
    TEAM_WEB_ANALYTICS = "team-web-analytics"


class ClickhouseClusterResource(dagster.ConfigurableResource):
99 changes: 42 additions & 57 deletions dags/definitions.py
@@ -1,29 +1,21 @@
import dagster
import dagster_slack

from dagster_aws.s3.io_manager import s3_pickle_io_manager
from dagster_aws.s3.resources import s3_resource
from django.conf import settings

from dags.common import ClickhouseClusterResource
from dags import (
    ch_examples,
    deletes,
    exchange_rate,
    export_query_logs_to_s3,
    materialized_columns,
    orm_examples,
    person_overrides,
    slack_alerts,
)

# Define resources for different environments
resources_by_env = {
@@ -34,57 +26,50 @@
        ),
        "s3": s3_resource,
        # Using EnvVar instead of the Django setting to ensure that the token is not leaked anywhere in the Dagster UI
        "slack": dagster_slack.SlackResource(token=dagster.EnvVar("SLACK_TOKEN")),
    },
    "local": {
        "cluster": ClickhouseClusterResource.configure_at_launch(),
        "io_manager": dagster.fs_io_manager,
        "slack": dagster.ResourceDefinition.none_resource(description="Dummy Slack resource for local development"),
    },
}


# Get resources for current environment, fallback to local if env not found
env = "local" if settings.DEBUG else "prod"
resources = resources_by_env.get(env, resources_by_env["local"])


defs = dagster.Definitions(
    assets=[
        ch_examples.get_clickhouse_version,
        ch_examples.print_clickhouse_version,
        exchange_rate.daily_exchange_rates,
        exchange_rate.hourly_exchange_rates,
        exchange_rate.daily_exchange_rates_in_clickhouse,
        exchange_rate.hourly_exchange_rates_in_clickhouse,
        orm_examples.pending_deletions,
        orm_examples.process_pending_deletions,
    ],
    jobs=[
        deletes.deletes_job,
        exchange_rate.daily_exchange_rates_job,
        exchange_rate.hourly_exchange_rates_job,
        export_query_logs_to_s3.export_query_logs_to_s3,
        materialized_columns.materialize_column,
        person_overrides.cleanup_orphaned_person_overrides_snapshot,
        person_overrides.squash_person_overrides,
    ],
    schedules=[
        exchange_rate.daily_exchange_rates_schedule,
        exchange_rate.hourly_exchange_rates_schedule,
        export_query_logs_to_s3.query_logs_export_schedule,
        person_overrides.squash_schedule,
    ],
    sensors=[
        deletes.run_deletes_after_squash,
        slack_alerts.notify_slack_on_failure,
    ],
    resources=resources,
)
