
Importing large Records.json from Takeout fails with no message #142

Open
savvasdalkitsis opened this issue Jul 29, 2024 · 32 comments

@savvasdalkitsis

Describe the bug
Importing my (large) location history from Google Takeout fails with no error messages

Version
freikin/dawarich:latest

Logs

/var/app # bundle exec rake import:big_file['/tmp/Records.json','[email protected]']
[dotenv] Set DATABASE_PORT
[dotenv] Loaded .env.development
W, [2024-07-29T10:52:30.754729 #126]  WARN -- : DEPRECATION WARNING: `Rails.application.secrets` is deprecated in favor of `Rails.application.credentials` and will be removed in Rails 7.2. (called from <main> at /var/app/config/environment.rb:5)
D, [2024-07-29T10:52:31.562514 #126] DEBUG -- :   User Load (0.5ms)  SELECT "users".* FROM "users" WHERE "users"."email" = $1 LIMIT $2  [["email", "[email protected]"], ["LIMIT", 1]]
D, [2024-07-29T10:52:31.562766 #126] DEBUG -- :   ↳ lib/tasks/import.rake:9:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.644237 #126] DEBUG -- :   TRANSACTION (0.2ms)  BEGIN
D, [2024-07-29T10:52:31.644938 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.645787 #126] DEBUG -- :   Import Create (1.7ms)  INSERT INTO "imports" ("name", "user_id", "source", "created_at", "updated_at", "raw_points", "doubles", "processed", "raw_data") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) RETURNING "id"  [["name", "/tmp/Records.json"], ["user_id", 1], ["source", 2], ["created_at", "2024-07-29 09:52:31.642947"], ["updated_at", "2024-07-29 09:52:31.642947"], ["raw_points", 0], ["doubles", 0], ["processed", 0], ["raw_data", nil]]
D, [2024-07-29T10:52:31.646293 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.647681 #126] DEBUG -- :   TRANSACTION (1.1ms)  COMMIT
D, [2024-07-29T10:52:31.647951 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
"Importing /tmp/Records.json for [email protected], file size is 1443919290... This might take a while, have patience!"
Killed
@savvasdalkitsis
Author

[screenshot]

@berger321

The same is happening to me; it keeps crashing my Unraid server.

@Freika
Owner

Freika commented Jul 31, 2024

Most likely, you're running out of memory. Consider giving more resources to Dawarich.

@savvasdalkitsis
Author

savvasdalkitsis commented Jul 31, 2024

I am running this on a server with 64 GB of RAM and no memory restrictions on the Docker deployment, so I doubt it's that, to be honest.

Especially since the job gets killed almost immediately after it starts (for me, at least).

@savvasdalkitsis
Author

Is there any way to enable more verbose logging to troubleshoot this further?

@savvasdalkitsis
Author

savvasdalkitsis commented Aug 5, 2024

In the meantime, if anyone else has this problem, this is how I worked around it.

I split the giant Records.json file into multiple smaller ones using this script:

#!/bin/bash

input_file="Records.json"        # The input JSON file
output_prefix="smaller_array"    # The prefix for the output files
chunk_size=100000                # Number of elements per smaller array

# Get the total number of elements in the 'locations' array
total_elements=$(jq '.locations | length' "$input_file")

# Loop through the array and split it into chunks
for ((i=0; i<total_elements; i+=chunk_size)); do
    start_index=$i
    end_index=$((i + chunk_size))    # exclusive upper bound for the jq slice
    output_file="${output_prefix}_$((i / chunk_size + 1)).json"

    # Extract the chunk and save it into a new JSON file with the same structure
    jq "{locations: .locations[$start_index:$end_index]}" "$input_file" > "$output_file"
    echo "Created $output_file"
done

And I imported each file individually. It takes a lot longer and is manual, but hey.

@Freika
Owner

Freika commented Aug 6, 2024

@savvasdalkitsis this is awesome, thank you! I'll consider using this approach to automatically split huge files during the import process.

@hartmark

hartmark commented Sep 8, 2024

This worked perfectly for me, so getting Dawarich to do this automagically behind the scenes would be awesome.

@hartmark

hartmark commented Sep 8, 2024

My 1.4 GB Records.json peaked at around 8 GB used by jq while running the script. Thankfully I have 32 GB on my server, but this is a small heads-up for future googlers :)

@jonhnet

jonhnet commented Sep 8, 2024

Confirming: I just ran into this problem importing my 2 GB Records.json. The bundle process terminates with the stoic message "Killed" after 59 seconds; at that point, top (running inside Docker) shows it using 4002572 VIRT (2.0g RES). top also shows "MiB Mem : 19523.4 total", suggesting the memory limit is on the bundle process, not the Docker machine.

I didn't see any suspiciously limited resources in the process limits:

/var/app # cat /proc/228/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       77636                77636                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
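
For what it's worth, a container-level cap wouldn't necessarily show up in the per-process limits above; checking the cgroup memory limit from inside the container rules that out. A rough diagnostic sketch in plain Python (the paths cover cgroup v2 and v1; this is just a check, not anything shipped with Dawarich):

# Print any cgroup memory limit visible from inside the container.
from pathlib import Path

for path in ("/sys/fs/cgroup/memory.max",                     # cgroup v2
             "/sys/fs/cgroup/memory/memory.limit_in_bytes"):  # cgroup v1
    p = Path(path)
    if p.exists():
        value = p.read_text().strip()
        print(f"{path}: {'no limit' if value == 'max' else value}")
        break
else:
    print("no cgroup memory limit files found")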

Anyway, I'm going to give @savvasdalkitsis's strategy a try.

(I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. 😃 )

@Freika
Owner

Freika commented Sep 8, 2024

(I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. 😃 )

@jonhnet existing points won't be doubled, that is taken care of :)

@jonhnet

jonhnet commented Sep 8, 2024

Better still, I see that the import jobs are identified in the Imports panel after the import is complete, suggesting there's metadata in the db that could be used to clean things up. "Luckily", the failures have all resulted in 0-point imports, so it's a non-event.

@jonhnet

jonhnet commented Sep 8, 2024

Here's my take on @savvasdalkitsis's chunk-ifying script.
splitter.py.txt

@hartmark

hartmark commented Sep 8, 2024

Here's my take on @savvasdalkitsis's chunk-ifying script. splitter.py.txt

Cool, thanks for your updated version, which also generates a nice script to add them to Dawarich.

My Python is quite noobish, but you should probably flush the file in the while loop in case you want to see its progress.

I also noticed that the sizes of the chunks are a bit uneven. Why is 001.json so much bigger?

% dawarich-splitter.py     
loaded record_count=2088720
Writing chunk-1000000-000.json
Writing chunk-1000000-001.json
Writing chunk-1000000-002.json
dawarich-splitter.py  138.83s user 7.53s system 98% cpu 2:29.16 total

% ls -lh *.json
-rw-r--r-- 1 markus markus 1.4G Sep  8 02:26 Records.json
-rw-r--r-- 1 markus markus 533M Sep  8 23:25 chunk-1000000-000.json
-rw-r--r-- 1 markus markus 1.1G Sep  8 23:27 chunk-1000000-001.json
-rw-r--r-- 1 markus markus 194M Sep  8 23:27 chunk-1000000-002.json

@jonhnet

jonhnet commented Sep 8, 2024

Chunks are divided by count of records. The records vary wildly in length depending on what Google decided to tuck into them that day. :)

@jonhnet

jonhnet commented Sep 8, 2024

Maybe a better script would count up to some total length, since the import process seems to die due to overall memory allocation.
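
Something along these lines might do it (an untested sketch, not the attached splitter: it still loads the whole file with json.load, so it needs enough RAM for one pass, and the 200 MB cap per chunk is an arbitrary choice):

#!/usr/bin/env python3
# Sketch: split Records.json into chunks capped by serialized size rather than
# record count, so no single chunk is too big for the importer.
import json

MAX_CHUNK_BYTES = 200 * 1024 * 1024  # arbitrary cap per output file

with open("Records.json") as f:
    locations = json.load(f)["locations"]

chunk, chunk_bytes, index = [], 0, 0
for record in locations:
    encoded_size = len(json.dumps(record))
    if chunk and chunk_bytes + encoded_size > MAX_CHUNK_BYTES:
        with open(f"chunk-{index:03d}.json", "w") as out:
            json.dump({"locations": chunk}, out)
        chunk, chunk_bytes, index = [], 0, index + 1
    chunk.append(record)
    chunk_bytes += encoded_size

if chunk:
    with open(f"chunk-{index:03d}.json", "w") as out:
        json.dump({"locations": chunk}, out)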

@hartmark

hartmark commented Sep 8, 2024

It seems it just swallowed the 1.1 GB file, so I guess all is good now :)

One small thing the script should also do is remove the uploaded files from the Docker volume so they won't take up storage after being imported.

I'll let it run overnight and see how it worked tomorrow.

@jonhnet

jonhnet commented Sep 9, 2024

@hartmark good point on cleaning up the mess afterwards. I haven't actually finished running the script over here yet. :v)

My first chunk is 1M records. It completed the bundle step in ~1 hour (I didn't measure), but the Sidekiq processing seems really slow. It has been 18 hours; I have 273k items processed and another million in the queue. I think that's because each point creates two work items serially, so I'm 273k/2M of the way through the job. That means this one chunk is going to take 6 days, and I have three more where that came from.

Is this expected performance for import? I can imagine improving it is a low priority, since it only happens once...

Sidekiq shows about 50-60 tasks completing per 10 s polling interval. I do notice that the Docker container is only running at ~10% CPU, suggesting that admitting more threads might make things a lot faster. I think I'll try docker compose down and bumping BACKGROUND_PROCESSING_CONCURRENCY from 10 to 100.

@jonhnet

jonhnet commented Sep 9, 2024

Oh, shame on me, this topic is covered in the FAQ. I tried making 4 extra Sidekiq containers, and now it looks like importing 1M records will take about a day. I guess I'll crank it up to 15 to grind through the next 4M records.
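
(For anyone following along: assuming the Sidekiq service in docker-compose.yml is named sidekiq, which I haven't verified, something like docker compose up -d --scale sidekiq=5 is one way to run extra worker containers without duplicating the service definition.)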

I note that if I docker compose stop dawarich, it loses track of the enqueued work items. I'm not sure if that's a bug or desired behavior. I guess the concern might be that a poorly timed shutdown (say, to upgrade) might silently leave some recently uploaded data unprocessed and hence invisible.

@jonhnet

jonhnet commented Sep 15, 2024

Here's a better version of my splitter script for enormous Google Takeout Records.json files. The main improvement is that it parses the input incrementally so this script itself doesn't encounter a memory bottleneck.

splitter.py.txt
(well, hold off on using this; I sent it before it finished, and my copy broke on one of my Records. I'll update.)
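
In the meantime, the incremental approach looks roughly like this (a sketch of the idea, not the attached script; it assumes the ijson package, installed with pip install ijson):

#!/usr/bin/env python3
# Stream the "locations" array with ijson so the whole Records.json never has to
# fit in memory, writing fixed-size chunks the importer can handle one at a time.
import json
import ijson

CHUNK_SIZE = 100_000  # records per output file

def flush(chunk, index):
    with open(f"chunk-{index:03d}.json", "w") as out:
        json.dump({"locations": chunk}, out)
    print(f"wrote chunk-{index:03d}.json ({len(chunk)} records)")

chunk, index = [], 0
with open("Records.json", "rb") as f:
    # use_float=True maps JSON numbers to float so json.dump can re-serialize them
    for record in ijson.items(f, "locations.item", use_float=True):
        chunk.append(record)
        if len(chunk) >= CHUNK_SIZE:
            flush(chunk, index)
            chunk, index = [], index + 1
if chunk:
    flush(chunk, index)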

@hartmark

Nice to have an updated script, but I have now managed to import my Google data.

It looks like quite a lot of the items in the queue fail, but there are no retries. I can only see 8 hours back in time, but it seems to be just ReverseGeocodingJob that fails. Will it retry the job somehow, or how does this work?

[screenshot]

@AngryJKirk

#279 (comment)

I did it in an alternative way; crossposting for visibility.

@makanimike

makanimike commented Dec 29, 2024

Trying to import the first chunk using savvas' script, I ran into this error:

[!] There was an error parsing `Gemfile`: No such file or directory @ rb_sysopen - .ruby-version. Bundler cannot continue.

#  from /var/app/Gemfile:6
#  -------------------------------------------
#  
>  ruby File.read('.ruby-version').strip
#  
#  -------------------------------------------

Any ideas?

ETA: I just had to execute the command from the root folder in the Docker container... #386

@makanimike

@hartmark:

I suppose your queue eventually got processed correctly?
I just imported my data 3-4 hours ago, but it appears there is no progress at all. In fact, it looks like it didn't even start! I'm wondering if I need to do something to kickstart the process, or if the import failed somehow... there are 3.7M imports in the queue, but nothing is happening...

@hartmark

@hartmark:

I suppose your queue eventually got processed correctly? I just imported my data 3-4 hours ago, but it appears there is no progress at all. In fact, it looks like it didn't even start! I'm wondering if I need to do something to kickstart the process, or if the import failed somehow... there are 3.7M imports in the queue, but nothing is happening...

Yeah, it took a few days for it to process everything.

@makanimike

It was pretty much a classic PEBMAC issue, but I thought I would still share my experience here for posterity, for anyone who might stumble into this thread searching for solutions...

During my early attempts to import the large Records.json file directly, I had adjusted the resource limits as suggested in the FAQ. One thing led to another, and a side effect was that Sidekiq was suspended in a permanent state of deploying because its CPU resources were set to 0.001. On the front end everything looked fine, though. I could see the Sidekiq interface and click around. The queue would grow with every successful import of the partial Records.json chunks, but it just would not process anything. It was flatlining.

I had to set the resource settings back to normal values. Now I just have to wait a few days.

@aarondoet

aarondoet commented Jan 4, 2025

I worked with Google Takeout exports in a Node.js project before. I used JSONStream to process the file; something similar exists for Ruby, and I would imagine it should make it possible to process huge files without much RAM usage:
https://github.com/thisismydesign/json-streamer

Here is the script I just wrote using JSONStream, because jq used too much RAM:

// Stream-parse Records.json with JSONStream so the whole file never sits in memory,
// writing the "locations" array out in chunks of 100k records as <n>_Records.json.
const { default: fs } = await import("fs");
const { default: JSONStream } = await import("JSONStream");

let chunk = 0;
let locations = {0: []};
let chunkSize = 100000;
let processed = 0;

let readStream = fs.createReadStream("./Records.json");
let stream = readStream.pipe(JSONStream.parse("locations.*"));

stream.on("data", l => {
	processed++;
	// Keep filling the current chunk until it reaches chunkSize...
	if (locations[chunk].length < chunkSize) return locations[chunk].push(l);
	// ...then write it out asynchronously, free its memory, and start the next chunk.
	const c = chunk;
	fs.promises.writeFile(c + "_Records.json", JSON.stringify({locations: locations[c]})).then(() => { delete locations[c]; });
	chunk++;
	locations[chunk] = [l];
});

stream.on("end", () => {
	// Flush whatever is left in the final (possibly partial) chunk.
	if (locations[chunk].length > 0) {
		fs.promises.writeFile(chunk + "_Records.json", JSON.stringify({locations: locations[chunk]})).then(() => console.log("done"));
	} else {
		console.log("done");
	}
});
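
(To run it as-is: the top-level await import(...) calls mean it has to be an ES module, so save it as e.g. split.mjs next to Records.json, install the JSONStream package from npm, and run node split.mjs.)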

@Sparticuz

And I imported each file individually. It takes a lot longer and is manual, but hey.

After splitting, I had 45 files, so instead of sitting there, I just ran this and it looped through all the imports:

for f in tmp/imports/*.json; do bundle exec rake import:big_file["${f}",'[email protected]']; done;

@jsixface

@aarondoet I agree that streaming the large file is the way to go. Loading the entire file into memory is going to run into errors at some point, even after splitting, if the chunks are still big.

@Freika is it possible to incorporate json-streamer for importing large JSON files?

@Freika
Owner

Freika commented Jan 20, 2025

@jsixface yes, it's possible, although I can't provide any estimates. This issue stays open specifically so that I can work on it later.

@Freika
Owner

Freika commented Jan 21, 2025

In https://github.com/Freika/dawarich/releases/tag/0.23.3 I released a performance update to the Records.json importing process. On my local machine, it now processes a 178 MB JSON file (627k points) in about 2 minutes.
