Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize Geodata Schema to avoid repeated lookups of same lat/lon #428

Open
lindner opened this issue Nov 21, 2024 · 5 comments
Open

Normalize Geodata Schema to avoid repeated lookups of same lat/lon #428

lindner opened this issue Nov 21, 2024 · 5 comments

Comments

@lindner
Copy link

lindner commented Nov 21, 2024

Describe the bug
The Points table has denormalized city, country and geodata, normalize this.

Version
0.16.4

To Reproduce
Import lots of Points, notice that reverse Geocoding is taking a long time. Look at schema and data to notice a lot of duplication.

Expected behavior
The schema should be adjusted use a single geodata/city/country for a given lat/lon. Alternatively the ReverseGeocode job could look for existing data from the other Points to see if data has already been fetched.

Additional context
Here are some statistics from my import:

dawarich=# SELECT latitude, longitude, count(*) AS c FROM points pp GROUP BY latitude, longitude ORDER BY c DESC LIMIT 10;

 latitude  |  longitude  |   c
-----------+-------------+-------
 37.834035 | -122.196643 | 72248
 37.833984 | -122.196687 | 21906
 37.834008 | -122.196732 |  5542
 37.834109 | -122.196643 |  4701
 37.834065 | -122.196699 |  3841
 37.834071 | -122.196710 |  3734
 37.834046 | -122.196699 |  3656
 37.834082 | -122.196732 |  3174
 37.833961 | -122.196643 |  2806
 37.832880 | -122.197131 |  2768
(10 rows)

As a workaround you can use this UPDATE sql to populate null fields where the result is already fetched. It processes 1000 item batches, so you just keep running it until you get 0 rows.

WITH t AS (SELECT p1.id, p2.city, p2.country, p2.geodata 
                     FROM points p1 
                     INNER JOIN points p2 ON p1.latitude = p2.latitude AND p1.longitude = p2.longitude 
                             AND p1.city IS NULL AND p2.city IS NOT NULL AND p1.id <> p2.id  LIMIT 1000) 
UPDATE points SET city = t.city, country = t.country, geodata = t.geodata FROM t WHERE points.id = t.id;
@tabacha
Copy link
Contributor

tabacha commented Dec 5, 2024

An CREATE INDEX idx_points_lat_lon ON points (latitude, longitude); will speed up the query.

@lindner
Copy link
Author

lindner commented Dec 10, 2024

I do have an example where denormalization would not be wanted. The geodata field contains place data such as business names. Thus a lookup in 2024 and 2030 would vary. So it's not so bad to have this duplicated. Perhaps a cache would be warranted, and the queue processing could order entries by lat/long to optimize cache usage.

@lindner
Copy link
Author

lindner commented Dec 10, 2024

For those importing millions of points here's an optimization

  1. Remove the reverse_geocoding queue in the the /sidekiq interface
  2. Start the dawarich rails console. For me, I use the following docker command
    docker exec -it dawarich-app bin/rails console
  3. Execute the following query that finds points with more than 100 duplicates:
    p = Point.find_by_sql("select * from points where id in (select max(id) as id from points where city is null group by latitude,longitude having count(*) > 100)")
  4. Add these items to the reverse geocoding queue by executing the following:
    p.all.each do |r| r.async_reverse_geocode end
  5. Wait for the queue to empty
  6. Execute the update SQL above. (Remove the limit 1000 to update everything in one shot)
  7. Repeat for >10 and > 1 to efficiently process entries.

Hope this helps! Maybe add this as an FAQ or add some helper to do all of the above...

@Freika
Copy link
Owner

Freika commented Dec 10, 2024

@lindner thanks for diving into the topic! This issue is not missed or ignored, I'll pay it more attention later, hopefully soon. Meanwhile, thank you for providing options and suggesting another approach!

@tjorim
Copy link

tjorim commented Jan 21, 2025

I guess #699 is a first step at tackling this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants