Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS 2-5: Ingest Global Basins #3647

Closed
3 tasks done
rajadain opened this issue Sep 4, 2024 · 9 comments
Closed
3 tasks done

AWS 2-5: Ingest Global Basins #3647

rajadain opened this issue Sep 4, 2024 · 9 comments
Assignees
Labels
AWS Funding Source: AWS

Comments

@rajadain
Copy link
Member

rajadain commented Sep 4, 2024

Ingest the TDX Hydro Global Basins dataset into Postgres. This should match the additional fields added to the streams in #3646.

The source data will be made available by @ptomasula at s3://data.mmw.azavea.com/. Once it is:

  • Ingest it locally into Postgres using ogr2ogr
  • Export it to SQL using pg_dump
  • Import it on staging
@rajadain rajadain added blocked AWS Funding Source: AWS labels Sep 4, 2024
@rajadain rajadain removed the blocked label Oct 14, 2024
@rajadain rajadain self-assigned this Oct 14, 2024
@rajadain rajadain mentioned this issue Oct 30, 2024
50 tasks
@rajadain
Copy link
Member Author

rajadain commented Nov 12, 2024

Should use the following script to ingest them:

ogr2ogr -progress -if "Parquet" \
          -f "PostgreSQL" PG:"host=localhost user=mmw dbname=mmw password=XXX" \
          -nln tdxbasins \
          -lco GEOMETRY_NAME=geom \
          -lco FID=LINKNO \
          -append -skipfailures \
          TDX_streamreach_basins_mnsi_\*.parquet

Once imported, we should create this index to help with delineation:

CREATE INDEX idx_tdxbasins_root_discover_finish
ON tdxbasins (root_id, discover_time, finish_time);

@rajadain
Copy link
Member Author

Once that is in place, we can use the following query to delineate a watershed, given a basin clicked by the user:

WITH target AS (SELECT *
FROM tdxbasins
WHERE linkno = ?)

SELECT ST_Union(geom)
FROM tdxbasins
WHERE root_id = (SELECT root_id FROM target)
  AND discover_time <= (SELECT discover_time FROM target)
  AND finish_time >= (SELECT finish_time FROM target);

@rajadain
Copy link
Member Author

Data has been ingested on Staging successfully. Now exporting for reuse using:

env PGPASSWORD=xxx pg_dump --clean --if-exists --no-owner --dbname=modelmywatershed --username=modelmywatershed --host=database.service.mmw.internal --table=tdxbasins --compress=9 --verbose > tdxbasins.sql.gz

@rajadain
Copy link
Member Author

Data exported to s3://data.mmw.azavea.com/tdxbasins.sql.gz

@rajadain
Copy link
Member Author

rajadain commented Dec 3, 2024

This data was not imported correctly. Particularly, by specifying -lco FID=LINKNO, we created a new monotonic ID field called LINKNO and lost the original values.

Instead, the data must be reimported as follows:

  1. Delete the existing tdxbasins table
  2. Run the ogr2ogr command again, without -lco FID=LINKNO
  3. Do the following to set the LINKNO to be the primary key:
    -- Step 1: Drop the existing primary key constraint
    ALTER TABLE tdxbasins DROP CONSTRAINT tdxbasins_pkey;
    
    -- Step 2: Drop the `ogc_fid` column
    ALTER TABLE tdxbasins DROP COLUMN ogc_fid;
    
    -- Step 3: Add a new primary key constraint on `linkno`
    ALTER TABLE tdxbasins ADD CONSTRAINT tdxbasins_pkey PRIMARY KEY (linkno);
  4. Create the index again:
    CREATE INDEX idx_tdxbasins_root_discover_finish
    ON tdxbasins (root_id, discover_time, finish_time);
  5. Re-export the data and upload to S3, overwriting the old export
  6. Re-create a subset for development

@rajadain rajadain reopened this Dec 3, 2024
@rajadain
Copy link
Member Author

After deleting the ogc_fid, when I tried to make linkno the primary key I discovered that the column is not unique. Currently doing a full table scan (very slow) to discover the duplicates.

@rajadain
Copy link
Member Author

The total number of duplicate linknos is 334,017.

Some of these have very high numbers of duplicates:

SELECT linkno, count
FROM "20241210_tdxbasins_duplicate_linkno.csv"
ORDER BY count DESC
LIMIT 10;
linkno count
130565963 2105
130356958 1923
130690836 1774
130642990 1772
130565970 1767
130533444 1763
130612035 1646
130560773 1614
130343248 1569
130555645 1564

Looking at the full rows for these, they have identical values for every other column except for geom, which is a slightly different square for each case:

SELECT *
FROM tdxbasins
WHERE linkno IN (
    130565963,
    130356958,
    130690836,
    130642990,
    130565970,
    130533444,
    130612035,
    130560773,
    130343248,
    130555645)
ORDER BY linkno;
root_id finish_time discover_time element_count dissolve_root_id linkno geom
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.57016666666716 -4.330388888889166,29.57016666666716 -4.330500000000278,29.57027777777827 -4.330500000000278,29.57027777777827 -4.330388888889166,29.57016666666716 -4.330388888889166)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.57005555555605 -4.330500000000278,29.57005555555605 -4.33061111111139,29.57016666666716 -4.33061111111139,29.57016666666716 -4.330500000000278,29.57005555555605 -4.330500000000278)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56994444444494 -4.33061111111139,29.56994444444494 -4.3307222222225,29.57005555555605 -4.3307222222225,29.57005555555605 -4.33061111111139,29.56994444444494 -4.33061111111139)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56983333333383 -4.3307222222225,29.56983333333383 -4.330833333333612,29.56994444444494 -4.330833333333612,29.56994444444494 -4.3307222222225,29.56983333333383 -4.3307222222225)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56972222222272 -4.330833333333612,29.56972222222272 -4.330944444444722,29.56983333333383 -4.330944444444722,29.56983333333383 -4.330833333333612,29.56972222222272 -4.330833333333612)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56961111111161 -4.330944444444722,29.56961111111161 -4.331055555555833,29.56972222222272 -4.331055555555833,29.56972222222272 -4.330944444444722,29.56961111111161 -4.330944444444722)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.5695000000005 -4.331055555555833,29.5695000000005 -4.331166666666945,29.56961111111161 -4.331166666666945,29.56961111111161 -4.331055555555833,29.5695000000005 -4.331055555555833)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56938888888938 -4.331166666666945,29.56938888888938 -4.331277777778055,29.5695000000005 -4.331277777778055,29.5695000000005 -4.331166666666945,29.56938888888938 -4.331166666666945)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56927777777827 -4.331277777778055,29.56927777777827 -4.331388888889167,29.56938888888938 -4.331388888889167,29.56938888888938 -4.331277777778055,29.56927777777827 -4.331277777778055)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56916666666716 -4.331388888889167,29.56916666666716 -4.331500000000277,29.56927777777827 -4.331500000000277,29.56927777777827 -4.331388888889167,29.56916666666716 -4.331388888889167)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56905555555605 -4.331500000000277,29.56905555555605 -4.331611111111389,29.56916666666716 -4.331611111111389,29.56916666666716 -4.331500000000277,29.56905555555605 -4.331500000000277)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56894444444494 -4.331611111111389,29.56894444444494 -4.331722222222501,29.56905555555605 -4.331722222222501,29.56905555555605 -4.331611111111389,29.56894444444494 -4.331611111111389)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56883333333383 -4.331722222222501,29.56883333333383 -4.331833333333611,29.56894444444494 -4.331833333333611,29.56894444444494 -4.331722222222501,29.56883333333383 -4.331722222222501)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56872222222271 -4.331833333333611,29.56872222222271 -4.331944444444723,29.56883333333383 -4.331944444444723,29.56883333333383 -4.331833333333611,29.56872222222271 -4.331833333333611)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56527777777827 -4.335277777778057,29.56527777777827 -4.335388888889167,29.56538888888938 -4.335388888889167,29.56538888888938 -4.335277777778057,29.56527777777827 -4.335277777778057)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56516666666716 -4.335388888889167,29.56516666666716 -4.335500000000279,29.56527777777827 -4.335500000000279,29.56527777777827 -4.335388888889167,29.56516666666716 -4.335388888889167)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56505555555605 -4.335500000000279,29.56505555555605 -4.335611111111389,29.56516666666716 -4.335611111111389,29.56516666666716 -4.335500000000279,29.56505555555605 -4.335500000000279)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56494444444494 -4.335611111111389,29.56494444444494 -4.3357222222225,29.56505555555605 -4.3357222222225,29.56505555555605 -4.335611111111389,29.56494444444494 -4.335611111111389)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56483333333383 -4.3357222222225,29.56483333333383 -4.335833333333611,29.56494444444494 -4.335833333333611,29.56494444444494 -4.3357222222225,29.56483333333383 -4.3357222222225)))
130747391 326295 326294 1 130707940 130343248 SRID=4326;MULTIPOLYGON(((29.56472222222272 -4.335833333333611,29.56472222222272 -4.335944444444722,29.56483333333383 -4.335944444444722,29.56483333333383 -4.335833333333611,29.56472222222272 -4.335833333333611)))

Currently investigating if this is an ingest or source data issue.

@rajadain
Copy link
Member Author

The recommendation from @aufdenkampe was to keep only one of the duplicate rows for each linkno. Since the rows are true duplicates with only minor differences in GEOM, we could keep any of the duplicates. I decided to keep the first.

First I made a companion table nondup which had the non-duplicated ogc_fid for every unique linkno:

CREATE TABLE nondup AS
    SELECT MIN(ogc_fid), linkno
    FROM tdxbasins
    GROUP BY linkno;

Then I joined the tdxbasins table with that and deleted all non-matching rows:

DELETE FROM tdxbasins
WHERE ogc_fid IN (
    SELECT ogc_fid
    FROM tdxbasins a LEFT JOIN nondup b ON a.ogc_fid = b.min
    WHERE b.min IS NULL
);

This left 15936428 rows in tdxbasins. After that, I was able to delete the ogc_fid column and make linkno the primary key:

-- Step 1: Drop the existing primary key constraint
ALTER TABLE tdxbasins DROP CONSTRAINT tdxbasins_pkey;

-- Step 2: Add a new primary key constraint on `linkno`
ALTER TABLE tdxbasins ADD CONSTRAINT tdxbasins_pkey PRIMARY KEY (linkno);

-- Step 3: Drop the `ogc_fid` column
ALTER TABLE tdxbasins DROP COLUMN ogc_fid;

-- Step 4: Add index for root id, discover time, finish time for fast lookup
CREATE INDEX idx_tdxbasins_root_discover_finish
ON tdxbasins (root_id, discover_time, finish_time);

This was all done locally on my developer machine because our staging database ran out of CPU credits. Now I'm going to dump this, upload to S3, and re-import to staging.

@rajadain
Copy link
Member Author

This is now complete and deployed to production.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AWS Funding Source: AWS
Projects
None yet
Development

No branches or pull requests

1 participant