tap-stackoverflow-sampledata is a Singer tap for the Stack Overflow XML dump files available at Archive.org. This tap is intended to be used to test Singer targets and/or seed a source system with enough data to sufficiently test a source-to-target pipeline.
Built with the Meltano Tap SDK for Singer Taps.
2024-08-01 Upgraded to Meltano Singer-SDK 0.46.4:
Edgar at Arch expanded the SDK to make a faster JSON encoder available. I’ve updated tap-stackoverflow-sampledata to use the new MsgSpecWriter class, which leverages the lightweight and speedy msgspec. Big Thank You 🙏 to Jim Crist-Harif for creating and maintaining msgspec and Edgar for updating the SDK to utilize it!
2024-08-01 Upgraded to Meltano Singer-SDK 0.39.0
2024-04-04 Upgraded to Meltano Singer-SDK 0.36.1
2023-12-14 Upgraded to Meltano Singer-SDK 0.34.0
You will need to install the tap directly from the GitHub repository. Here is the command to use:

pipx install git+https://github.com/BuzzCutNorman/tap-stackoverflow-sampledata.git

You can find this tap at Meltano Hub, which makes installation a snap.
Add the tap-stackoverflow-sampledata extractor to your project using meltano add:

meltano add extractor tap-stackoverflow-sampledata

You will need to download the Stack Overflow files, unzip them, and place them into a directory. The files are compressed with 7-Zip (.7z), so you will need it to complete the unzip step. Currently this tap works with these files:
| File | Zipped Size | Unzipped Size | Rows |
|---|---|---|---|
| Badges | 514 MB | 5.65 GB | 51,287,627 |
| Comments | 6.51 GB | 23.3 GB | 90,380,323 |
| PostLinks | 144 MB | 768 MB | 6,552,590 |
| Posts | 21.4 GB | 96.7 GB | 59,819,048 |
| Tags | 1 MB | 5.56 MB | 65,675 |
| Users | 944 MB | 5.38 GB | 22,484,235 |
| Votes | 2.05 GB | 21.6 GB | 238,974,011 |
You can use one, two, or all.
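Given how large the unzipped files are, it can help to sanity-check a file by streaming it rather than loading it whole. The sketch below is not the tap's own code. It assumes each dump file holds one root element containing self-closing `<row ... />` elements whose attributes are the column values; the helper name `iter_rows` and the tiny stand-in file are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Iterator


def iter_rows(xml_path: Path) -> Iterator[Dict[str, str]]:
    """Yield each <row> element's attributes as a dict, one row at a time."""
    context = ET.iterparse(str(xml_path), events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # discard rows already processed to keep memory flat


# Tiny stand-in for a real dump file (assumed layout, not real data).
SAMPLE_XML = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<tags>\n"
    '  <row Id="1" TagName="python" Count="3" />\n'
    '  <row Id="2" TagName="sql" Count="5" />\n'
    "</tags>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "Tags.xml"
    sample_path.write_text(SAMPLE_XML, encoding="utf-8")
    sample_rows = list(iter_rows(sample_path))

print(len(sample_rows), "rows read")
```

Clearing the root after each row is what keeps memory use roughly constant, which matters for files in the tens of gigabytes.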
The only configuration you need to provide is the path of the directory where you placed the extracted Stack Overflow file(s).
Configure the tap-stackoverflow-sampledata settings using meltano config:

meltano config tap-stackoverflow-sampledata set --interactive

| Setting | Required | Default | Description |
|---|---|---|---|
| stackoverflow_data_directory | False | None | A path to the StackOverflow XML data files. |
| batch_config | False | None | Optional Batch Message configuration |
Singer: config.json
{
"stackoverflow_data_directory" : "C:\\Development\\StackOverflow\\"
}
Meltano: meltano.yml
config:
stackoverflow_data_directory: C:\Development\StackOverflow\
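Note that the JSON example doubles each backslash while the unquoted YAML value does not, because JSON strings require backslashes to be escaped. If you generate config.json from a script, letting a JSON library do the escaping avoids mistakes. A minimal sketch, where `build_config` is a hypothetical helper and not part of the tap:

```python
import json


def build_config(data_directory: str) -> str:
    """Return config.json text; json.dumps escapes the backslashes."""
    config = {"stackoverflow_data_directory": data_directory}
    return json.dumps(config, indent=2)


# The Python literal below is the path C:\Development\StackOverflow\
config_text = build_config("C:\\Development\\StackOverflow\\")
print(config_text)
```

Parsing the generated text back with `json.loads` returns the original single-backslash path.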
Singer: config.json
{
"stackoverflow_data_directory" : "C:\\Development\\StackOverflow\\",
"batch_config": {
"encoding": {
"format": "jsonl",
"compression": "gzip"
},
"storage": {
"root": "file://c://development/batches",
"prefix": "test-batch-"
}
}
}
Meltano: meltano.yml
config:
stackoverflow_data_directory: C:\Development\StackOverflow\
batch_config:
encoding:
format: jsonl
compression: gzip
storage:
root: "file://c://development/batches"
prefix: test-batch-
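With `format: jsonl` and `compression: gzip`, each batch file should be a gzip-compressed text file with one JSON record per line. The sketch below shows how such a file could be inspected. It writes its own small stand-in file first so it runs anywhere; the file name and the records are made up for illustration, and the real file names produced by the SDK may differ.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import List


def write_batch(path: Path, records: List[dict]) -> None:
    """Write records as gzip-compressed JSON Lines (one object per line)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_batch(path: Path) -> List[dict]:
    """Read a gzip-compressed JSON Lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name using the "test-batch-" prefix from the config.
    batch_path = Path(tmp) / "test-batch-example.json.gz"
    write_batch(batch_path, [{"Id": 1, "TagName": "python"}, {"Id": 2, "TagName": "sql"}])
    loaded = read_batch(batch_path)

print(loaded)
```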
Capabilities:
- about
- discover
A full list of supported settings and capabilities is available by running: tap-stackoverflow-sampledata --about
This Singer tap will automatically import any environment variables within the working directory's
.env if the --config=ENV is provided, such that config values will be considered if a matching
environment variable is set either in the terminal context or in the .env file.
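As a rough illustration of what a "matching" environment variable looks like: the sketch below assumes the usual Singer SDK convention of upper-casing the plugin name and the setting name, replacing hyphens with underscores, and joining them with an underscore. Treat the resulting name as an assumption and confirm it against the `--about` output; `env_var_name` and `config_from_env` are hypothetical helpers, not SDK functions.

```python
import os
from typing import Iterable, Mapping, Optional

TAP_NAME = "tap-stackoverflow-sampledata"


def env_var_name(setting: str, plugin_name: str = TAP_NAME) -> str:
    """Build the assumed environment variable name for a setting."""
    prefix = plugin_name.upper().replace("-", "_") + "_"
    return prefix + setting.upper().replace("-", "_")


def config_from_env(
    settings: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> dict:
    """Collect config values for settings that have a matching variable set."""
    environ = os.environ if environ is None else environ
    return {
        setting: environ[env_var_name(setting)]
        for setting in settings
        if env_var_name(setting) in environ
    }


name = env_var_name("stackoverflow_data_directory")
print(name)

# Example with an explicit mapping instead of the real environment.
fake_env = {name: "C:\\Development\\StackOverflow\\"}
resolved = config_from_env(["stackoverflow_data_directory", "batch_config"], fake_env)
```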
You can easily run tap-stackoverflow-sampledata by itself or in a pipeline using Meltano.
tap-stackoverflow-sampledata --version
tap-stackoverflow-sampledata --help
tap-stackoverflow-sampledata --config CONFIG --discover > ./catalog.json

Follow these instructions to contribute to this project.
Prerequisites:
- Python 3.9+
- uv
uv sync

Create tests within the tests subfolder and then run:

uv run pytest

You can also test the tap-stackoverflow-sampledata CLI interface directly using uv run:

uv run tap-stackoverflow-sampledata --help

Testing with Meltano
Note: This tap will work in any Singer environment and does not require Meltano. Examples here are for convenience and to streamline end-to-end orchestration scenarios.
Your project comes with a custom meltano.yml project file already created. Open the meltano.yml and follow any "TODO" items listed in
the file.
Next, install Meltano (if you haven't already) and any needed plugins:
# Install meltano
pipx install meltano
# Initialize meltano within this directory
cd tap-stackoverflow-sampledata
meltano install

Now you can test and orchestrate using Meltano:
# Test invocation:
meltano invoke tap-stackoverflow-sampledata --version
# OR run a test `elt` pipeline:
meltano run tap-stackoverflow-sampledata target-jsonl

See the dev guide for more instructions on how to use the SDK to develop your own taps and targets.