This repository has been archived by the owner on May 17, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 12 changed files with 163 additions and 309 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
.venv | ||
ml-25m* | ||
dev/ml-25m* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
FROM python:3.10 | ||
RUN apt-get update && apt-get install -y \ | ||
python3-dev libpq-dev wget unzip \ | ||
python3-setuptools gcc bc | ||
RUN pip install --no-cache-dir poetry==1.1.13 | ||
COPY . /app | ||
WORKDIR /app | ||
# For now while we are in heavy development we install the latest with Poetry | ||
# and execute directly with Poetry. Later, we'll move to the released Pip package. | ||
RUN poetry install -E preql -E mysql -E pgsql -E snowflake | ||
ENTRYPOINT ["poetry", "run", "python3", "-m", "data_diff"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,27 +1,74 @@ | ||
# Data Diff | ||
|
||
A cross-database, efficient diff between mostly-similar database tables. | ||
A cross-database, efficient diff using checksums between mostly-similar database | ||
tables. | ||
|
||
Use cases: | ||
- Validate that a table was copied properly | ||
- Be alerted before your customer finds out, or your report is wrong | ||
- Validate that your replication mechanism is working correctly | |
- Find changes between two versions of the same table | ||
|
||
- Quickly validate that a table was copied correctly | ||
It uses a bisection algorithm to efficiently check if e.g. a table is the same | ||
between MySQL and Postgres, or Postgres and Snowflake, or MySQL and RDS! | ||
|
||
- Find changes between two versions of the same table | ||
```shell-session | |
$ data-diff postgres:/// Original postgres:/// Original_1diff -v --bisection-factor=4 | ||
[16:55:19] INFO - Diffing tables of size 25000095 and 25000095 | segments: 4, bisection threshold: 1048576. | ||
[16:55:36] INFO - Diffing segment 0/4 of size 8333364 and 8333364 | ||
[16:55:45] INFO - . Diffing segment 0/4 of size 2777787 and 2777787 | ||
[16:55:52] INFO - . . Diffing segment 0/4 of size 925928 and 925928 | ||
[16:55:54] INFO - . . . Diff found 2 different rows. | ||
+ (20000, 942013020) | ||
- (20000, 942013021) | ||
[16:55:54] INFO - . . Diffing segment 1/4 of size 925929 and 925929 | ||
[16:55:55] INFO - . . Diffing segment 2/4 of size 925929 and 925929 | ||
[16:55:55] INFO - . . Diffing segment 3/4 of size 1 and 1 | ||
[16:55:56] INFO - . Diffing segment 1/4 of size 2777788 and 2777788 | ||
[16:55:58] INFO - . Diffing segment 2/4 of size 2777788 and 2777788 | ||
[16:55:59] INFO - . Diffing segment 3/4 of size 1 and 1 | ||
[16:56:00] INFO - Diffing segment 1/4 of size 8333365 and 8333365 | ||
[16:56:06] INFO - Diffing segment 2/4 of size 8333365 and 8333365 | ||
[16:56:11] INFO - Diffing segment 3/4 of size 1 and 1 | ||
[16:56:11] INFO - Duration: 53.51 seconds. | ||
``` | ||
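The checksum-and-bisection idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not data-diff's actual implementation: the two "tables" are plain in-memory dicts, the checksum is computed locally with `hashlib` rather than inside a database, and the names `checksum`, `rows_in_range`, and `diff_tables` are hypothetical. In this sketch, `+` marks a row found only in the first table and `-` a row found only in the second.

```python
import hashlib
from typing import Dict, Iterator, List, Tuple

Row = Tuple[int, int]  # (key, value)


def checksum(rows: List[Row]) -> str:
    """Checksum a list of rows; stands in for a checksum computed by the database."""
    digest = hashlib.md5()
    for key, value in rows:
        digest.update(f"{key}:{value};".encode())
    return digest.hexdigest()


def rows_in_range(table: Dict[int, int], lo: int, hi: int) -> List[Row]:
    """Return the rows with lo <= key < hi, sorted by key."""
    return sorted((k, v) for k, v in table.items() if lo <= k < hi)


def diff_tables(
    a: Dict[int, int],
    b: Dict[int, int],
    lo: int,
    hi: int,
    bisection_factor: int = 4,
    bisection_threshold: int = 8,
) -> Iterator[Tuple[str, Row]]:
    """Yield ('+', row) or ('-', row) for rows that differ in the key range [lo, hi)."""
    rows_a = rows_in_range(a, lo, hi)
    rows_b = rows_in_range(b, lo, hi)
    if checksum(rows_a) == checksum(rows_b):
        return  # Matching checksums: skip the whole segment.
    if max(len(rows_a), len(rows_b)) <= bisection_threshold or hi - lo <= 1:
        # Segment is small enough: compare the rows locally.
        set_a, set_b = set(rows_a), set(rows_b)
        for row in sorted(set_a ^ set_b):
            yield ("+" if row in set_a else "-", row)
        return
    # Otherwise split into at most `bisection_factor` sub-ranges and recurse.
    step = -(-(hi - lo) // bisection_factor)  # ceiling division
    start = lo
    while start < hi:
        end = min(start + step, hi)
        yield from diff_tables(a, b, start, end, bisection_factor, bisection_threshold)
        start = end


original = {k: k * 10 for k in range(100)}
changed = dict(original)
changed[42] = 421  # introduce a single differing row
for sign, row in diff_tables(original, changed, 0, 100):
    print(sign, row)
```

Only segments whose checksums differ are split further, which is why mostly-similar tables can be compared without transferring most of their rows.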
|
||
We currently support the following databases: | ||
|
||
- PostgreSQL | ||
|
||
- MySQL | ||
|
||
- Oracle | ||
|
||
- Snowflake | ||
|
||
- BigQuery | ||
|
||
- Redshift | ||
|
||
We plan to add more, including NoSQL, and even APIs like Shopify! | ||
|
||
# How to install | ||
|
||
Requires Python 3.7+ with pip. | ||
|
||
```pip install data-diff``` | ||
|
||
or when you need extras like mysql and postgres | ||
|
||
```pip install "data-diff[mysql,pgsql]"``` | ||
|
||
# How to use | ||
|
||
Usage: `data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]` | ||
|
||
Options: | ||
|
||
- `--help` - Show help message and exit. | ||
- `-k` or `--key_column` - Name of the primary key column | ||
- `-c` or `--columns` - List of names of extra columns to compare | ||
- `-l` or `--limit` - Maximum number of differences to find (limits maximum bandwidth and runtime) | ||
- `-s` or `--stats` - Print stats instead of a detailed diff | ||
- `-d` or `--debug` - Print debug info | ||
- `-v` or `--verbose` - Print extra info | ||
- `--bisection-factor` - Segments per iteration. When set to 2, it performs binary search. | ||
- `--bisection-threshold` - Minimal bisection threshold. i.e. maximum size of pages to diff locally. | ||
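As a rough illustration of how the options above combine, the following Python sketch assembles a `data-diff` command line as an argument list. The helper `build_data_diff_command` is hypothetical and not part of the package; passing several columns by repeating `-c` is an assumption, as is the `--bisection-threshold=N` form (only `--bisection-factor=4` appears in the example above).

```python
from typing import List, Optional, Sequence


def build_data_diff_command(
    db1_uri: str,
    table1: str,
    db2_uri: str,
    table2: str,
    key_column: Optional[str] = None,
    columns: Sequence[str] = (),
    limit: Optional[int] = None,
    stats: bool = False,
    verbose: bool = False,
    bisection_factor: Optional[int] = None,
    bisection_threshold: Optional[int] = None,
) -> List[str]:
    """Build an argv list for the documented data-diff usage line."""
    cmd = ["data-diff", db1_uri, table1, db2_uri, table2]
    if key_column is not None:
        cmd += ["-k", key_column]
    for column in columns:
        cmd += ["-c", column]  # assumption: one -c per extra column
    if limit is not None:
        cmd += ["-l", str(limit)]
    if stats:
        cmd.append("-s")
    if verbose:
        cmd.append("-v")
    if bisection_factor is not None:
        cmd.append(f"--bisection-factor={bisection_factor}")
    if bisection_threshold is not None:
        cmd.append(f"--bisection-threshold={bisection_threshold}")
    return cmd


print(build_data_diff_command(
    "postgres:///", "Original", "postgres:///", "Original_1diff",
    verbose=True, bisection_factor=4,
))
```

The resulting list can be handed to `subprocess.run` once the package is installed.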
|
||
|
||
# How does it work? | ||
|
||
|
@@ -63,57 +110,70 @@ We ran it with a very low bisection factor, and with the verbose flag, to demons | |
|
||
Note: It's usually much faster to use high bisection factors, especially when there are very few changes, like in this example. | ||
|
||
```shell-session | |
$ data_diff postgres:/// Original postgres:/// Original_1diff -v --bisection-factor=4 | ||
[16:55:19] INFO - Diffing tables of size 25000095 and 25000095 | segments: 4, bisection threshold: 1048576. | ||
[16:55:36] INFO - Diffing segment 0/4 of size 8333364 and 8333364 | ||
[16:55:45] INFO - . Diffing segment 0/4 of size 2777787 and 2777787 | ||
[16:55:52] INFO - . . Diffing segment 0/4 of size 925928 and 925928 | ||
[16:55:54] INFO - . . . Diff found 2 different rows. | ||
+ (20000, 942013020) | ||
- (20000, 942013021) | ||
[16:55:54] INFO - . . Diffing segment 1/4 of size 925929 and 925929 | ||
[16:55:55] INFO - . . Diffing segment 2/4 of size 925929 and 925929 | ||
[16:55:55] INFO - . . Diffing segment 3/4 of size 1 and 1 | ||
[16:55:56] INFO - . Diffing segment 1/4 of size 2777788 and 2777788 | ||
[16:55:58] INFO - . Diffing segment 2/4 of size 2777788 and 2777788 | ||
[16:55:59] INFO - . Diffing segment 3/4 of size 1 and 1 | ||
[16:56:00] INFO - Diffing segment 1/4 of size 8333365 and 8333365 | ||
[16:56:06] INFO - Diffing segment 2/4 of size 8333365 and 8333365 | ||
[16:56:11] INFO - Diffing segment 3/4 of size 1 and 1 | ||
[16:56:11] INFO - Duration: 53.51 seconds. | ||
## Tips for performance | ||
|
||
It's highly recommended that all involved columns are indexed. | ||
|
||
## Development Setup | ||
|
||
The development setup centers around using `docker-compose` to boot up various | ||
databases, and then inserting data into them. | ||
|
||
On a Mac, to improve Docker performance, we suggest enabling the following in the UI: | |
|
||
* Use new Virtualization Framework | ||
* Enable VirtioFS accelerated directory sharing | ||
|
||
**1. Install Data Diff** | ||
|
||
When developing/debugging, it's recommended to install dependencies and run it | ||
directly with `poetry` rather than go through the package. | ||
|
||
``` | ||
poetry install | ||
``` | ||
|
||
**2. Download CSV of Testing Data** | ||
|
||
# How to install | ||
```shell-session | ||
wget https://files.grouplens.org/datasets/movielens/ml-25m.zip | ||
unzip ml-25m.zip -d dev/ | ||
``` | ||
|
||
Requires Python 3.7+ with pip. | ||
**3. Start Databases** | ||
|
||
```pip install data-diff``` | ||
```shell-session | ||
docker-compose up -d mysql postgres | ||
``` | ||
|
||
or when you need extras like mysql and postgres | ||
**4. Run Unit Tests** | ||
|
||
```pip install "data-diff[mysql,pgsql]"``` | ||
```shell-session | ||
poetry run python3 -m unittest | ||
``` | ||
|
||
# How to use | ||
**5. Seed the Database(s)** | ||
|
||
Usage: `data_diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]` | ||
If you're just testing, we recommend just setting up one database (e.g. | ||
Postgres) to avoid incurring the long setup time repeatedly. | ||
|
||
Options: | ||
```shell-session | ||
preql -f dev/prepare_db.pql postgres://postgres:<password>@<host>:5432/postgres | |
preql -f dev/prepare_db.pql mysql://mysql:<password>@<host>:3306/mysql | |
preql -f dev/prepare_db.pql snowflake://<uri> | |
preql -f dev/prepare_db.pql mssql://<uri> | |
preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own | ||
``` | ||
|
||
- `--help` - Show help message and exit. | ||
- `-k` or `--key_column` - Name of the primary key column | ||
- `-c` or `--columns` - List of names of extra columns to compare | ||
- `-l` or `--limit` - Maximum number of differences to find (limits maximum bandwidth and runtime) | ||
- `-s` or `--stats` - Print stats instead of a detailed diff | ||
- `-d` or `--debug` - Print debug info | ||
- `-v` or `--verbose` - Print extra info | ||
- `--bisection-factor` - Segments per iteration. When set to 2, it performs binary search. | ||
- `--bisection-threshold` - Minimal bisection threshold. i.e. maximum size of pages to diff locally. | ||
**6. Run data-diff against seeded database** | ||
|
||
## Tips for performance | ||
```bash | ||
poetry run python3 -m data_diff postgres://user:password@host:db Rating mysql://user:password@host:db Rating_del1 -c timestamp --stats | ||
|
||
It's highly recommended that all involved columns are indexed. | ||
Diff-Total: 250156 changed rows out of 25000095 | ||
Diff-Percent: 1.0006% | ||
Diff-Split: +250156 -0 | ||
``` | ||
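The `Diff-Percent` figure in the stats output above is consistent with dividing the number of changed rows by the table size. A minimal Python sketch of that arithmetic, assuming this is how the percentage is derived (the function name `diff_percent` is hypothetical):

```python
def diff_percent(changed_rows: int, total_rows: int) -> str:
    """Format changed rows as a percentage of total rows, to four decimals."""
    return f"{100 * changed_rows / total_rows:.4f}%"


# Numbers taken from the stats output above.
print(diff_percent(250156, 25000095))  # prints 1.0006%
```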
|
||
# License | ||
|
||
|
This file was deleted.
This file was deleted.