Skip to content

Add support for reading webdav based files like Hetzner storage boxes #160

@onnimonni

Description

@onnimonni

Hey!

I created a short example on how to read data from Hetzner Storage boxes here:

# Install dependencies
brew install curl pwgen hcloud jq

# Get api keys from https://console.hetzner.com/projects/YOUR-PROJECT-ID/security/tokens
hcloud context create my-hetzner-context --token YOUR_HETZNER_API_TOKEN

# Generate new password with '_' in the middle to suit for the special character requirement
export HETZNER_STORAGEBOX_PASSWORD=$(pwgen -1 -n 16)_$(pwgen -1 -n 16)

# Create the 1 terabyte 3.2€/month (+ VAT) storage box in Helsinki
hcloud storage-box create --enable-webdav --reachable-externally --name duckdb-test-box --location hel1 --type bx11 --password $HETZNER_STORAGEBOX_PASSWORD

# Get the username for the storage box
export HETZNER_STORAGEBOX_USER=$(hcloud storage-box describe duckdb-test-box -o json | jq -r .username)

# Get sample data
curl -o data.parquet https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

# Upload the sample data
curl -T data.parquet --user "$HETZNER_STORAGEBOX_USER:$HETZNER_STORAGEBOX_PASSWORD" https://$HETZNER_STORAGEBOX_USER.your-storagebox.de/data.parquet

# Nice thing in webdav is that it supports http range requests, so we can read parts of the file without downloading the whole file.
# Query the data with duckdb
duckdb -c "
    SET VARIABLE basic_auth = (
        SELECT base64((getenv('HETZNER_STORAGEBOX_USER') || ':' || getenv('HETZNER_STORAGEBOX_PASSWORD'))::BLOB)
    );

    SET VARIABLE storagebox_parquet_url = format(
        'https://{}.your-storagebox.de/data.parquet',
        getenv('HETZNER_STORAGEBOX_USER')
    );

    SELECT getvariable('basic_auth') AS basic_auth;

    CREATE SECRET IF NOT EXISTS http_auth (
        TYPE HTTP,
        EXTRA_HTTP_HEADERS MAP {
            'Authorization': 'Basic ' || getvariable('basic_auth')
        }
    );

    SELECT COUNT(*)
    FROM read_parquet(getvariable('storagebox_parquet_url'));
"

# Remove the storage box after testing
hcloud storage-box delete duckdb-test-box

It works but it's pretty convoluted and uploading files doesn't work.

Since httpfs already has curl in the dependencies would it be possible to add support on reading and writing from webdav based remote sources like Hetzner Storage box?

This is one of the cheapest remote storages available (2.18€+VAT when using 5TB or more and with unlimited but slow bandwidth) and thus it would lower the barriers on many data storage use cases.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions