
Importing large Records.json from Takeout fails with no message #142

Open
savvasdalkitsis opened this issue Jul 29, 2024 · 32 comments

@savvasdalkitsis

Describe the bug
Importing my (large) location history from Google Takeout fails with no error messages

Version
freikin/dawarich:latest

Logs

/var/app # bundle exec rake import:big_file['/tmp/Records.json','[email protected]']
[dotenv] Set DATABASE_PORT
[dotenv] Loaded .env.development
W, [2024-07-29T10:52:30.754729 #126]  WARN -- : DEPRECATION WARNING: `Rails.application.secrets` is deprecated in favor of `Rails.application.credentials` and will be removed in Rails 7.2. (called from <main> at /var/app/config/environment.rb:5)
D, [2024-07-29T10:52:31.562514 #126] DEBUG -- :   User Load (0.5ms)  SELECT "users".* FROM "users" WHERE "users"."email" = $1 LIMIT $2  [["email", "[email protected]"], ["LIMIT", 1]]
D, [2024-07-29T10:52:31.562766 #126] DEBUG -- :   ↳ lib/tasks/import.rake:9:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.644237 #126] DEBUG -- :   TRANSACTION (0.2ms)  BEGIN
D, [2024-07-29T10:52:31.644938 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.645787 #126] DEBUG -- :   Import Create (1.7ms)  INSERT INTO "imports" ("name", "user_id", "source", "created_at", "updated_at", "raw_points", "doubles", "processed", "raw_data") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) RETURNING "id"  [["name", "/tmp/Records.json"], ["user_id", 1], ["source", 2], ["created_at", "2024-07-29 09:52:31.642947"], ["updated_at", "2024-07-29 09:52:31.642947"], ["raw_points", 0], ["doubles", 0], ["processed", 0], ["raw_data", nil]]
D, [2024-07-29T10:52:31.646293 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.647681 #126] DEBUG -- :   TRANSACTION (1.1ms)  COMMIT
D, [2024-07-29T10:52:31.647951 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
"Importing /tmp/Records.json for [email protected], file size is 1443919290... This might take a while, have patience!"
Killed
@savvasdalkitsis
Author

[screenshot]

@berger321

The same is happening to me; it keeps crashing my Unraid server.

@Freika
Owner

Freika commented Jul 31, 2024

Most likely, you're running out of memory. Consider giving more resources to Dawarich.

@savvasdalkitsis
Author

savvasdalkitsis commented Jul 31, 2024

I am running this on a server with 64 GB of RAM and no memory restrictions on the Docker deployment, so I doubt it's that, to be honest.

Especially since the job gets killed almost immediately after it starts (for me, at least).

@savvasdalkitsis
Author

Is there any way to enable more verbose logging to troubleshoot this further?

@savvasdalkitsis
Author

savvasdalkitsis commented Aug 5, 2024

In the meantime, if anyone else has this problem, this is how I worked around it.

I split the giant Records.json file into multiple smaller ones using this script:

#!/bin/bash

input_file="Records.json"        # The input JSON file
output_prefix="smaller_array"    # The prefix for the output files
chunk_size=100000                # Number of elements per smaller array

# Get the total number of elements in the 'locations' array
total_elements=$(jq '.locations | length' "$input_file")

# Loop through the array and split it into chunks
for ((i=0; i<total_elements; i+=chunk_size)); do
    start_index=$i
    end_index=$((i + chunk_size))    # exclusive upper bound for the jq slice
    output_file="${output_prefix}_$((i / chunk_size + 1)).json"

    # Extract the chunk and save it into a new JSON file with the same structure
    jq "{locations: .locations[$start_index:$end_index]}" "$input_file" > "$output_file"
    echo "Created $output_file"
done

And I imported each file individually. It takes a lot longer and is manual, but hey.

@Freika
Owner

Freika commented Aug 6, 2024

@savvasdalkitsis this is awesome, thank you! I'll consider using this approach to automatically split huge files during the import process.

@hartmark

hartmark commented Sep 8, 2024

This worked perfectly for me, so getting Dawarich to do this automagically behind the scenes would be awesome.

@hartmark

hartmark commented Sep 8, 2024

My 1.4 GB Records.json peaked at around 8 GB used by jq while running the script. Thankfully I have 32 GB on my server, but this is a small heads-up for future googlers :)

@jonhnet

jonhnet commented Sep 8, 2024

Confirming: I just ran into this problem importing my 2 GB Records.json. The bundle process terminates with the stoic message "Killed" after 59 seconds; at that point, top (running inside Docker) shows it using 4002572 VIRT (2.0g RES). top also shows "MiB Mem : 19523.4 total", suggesting the memory limit is on the bundle process, not the Docker machine.

I didn't see any suspiciously limited resources in the process limits:

/var/app # cat /proc/228/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       77636                77636                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
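
For what it's worth, a container-level cap wouldn't necessarily show up in the per-process limits above; checking the cgroup memory limit from inside the container rules that out. A rough diagnostic sketch in plain Python (the paths cover cgroup v2 and v1; this is just a check, not anything shipped with Dawarich):

# Print any cgroup memory limit visible from inside the container.
from pathlib import Path

for path in ("/sys/fs/cgroup/memory.max",                     # cgroup v2
             "/sys/fs/cgroup/memory/memory.limit_in_bytes"):  # cgroup v1
    p = Path(path)
    if p.exists():
        value = p.read_text().strip()
        print(f"{path}: {'no limit' if value == 'max' else value}")
        break
else:
    print("no cgroup memory limit files found")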

Anyway, I'm going to give @savvasdalkitsis's strategy a try.

(I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. 😃 )

@Freika
Owner

Freika commented Sep 8, 2024

(I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. 😃 )

@jonhnet existing points won't be doubled, that is taken care of :)

@jonhnet

jonhnet commented Sep 8, 2024

Better still, I see that the import jobs are identified in the Imports panel after the import is complete, suggesting there's metadata in the db that could be used to clean things up. "Luckily", the failures have all resulted in 0-point imports, so it's a non-event.

@jonhnet

jonhnet commented Sep 8, 2024

Here's my take on @savvasdalkitsis's chunk-ifying script.
splitter.py.txt

@hartmark

hartmark commented Sep 8, 2024

Here's my take on @savvasdalkitsis's chunk-ifying script. splitter.py.txt

Cool, thanks for your updated version, which also generates a nice script to add them to Dawarich.

My Python is quite noobish, but you should probably flush the file in the while loop in case you want to see its progress.

I also noticed that the sizes of the chunks are a bit uneven. Why is 001.json so much bigger?

% dawarich-splitter.py     
loaded record_count=2088720
Writing chunk-1000000-000.json
Writing chunk-1000000-001.json
Writing chunk-1000000-002.json
dawarich-splitter.py  138.83s user 7.53s system 98% cpu 2:29.16 total

% ls -lh *.json
-rw-r--r-- 1 markus markus 1.4G Sep  8 02:26 Records.json
-rw-r--r-- 1 markus markus 533M Sep  8 23:25 chunk-1000000-000.json
-rw-r--r-- 1 markus markus 1.1G Sep  8 23:27 chunk-1000000-001.json
-rw-r--r-- 1 markus markus 194M Sep  8 23:27 chunk-1000000-002.json

@jonhnet

jonhnet commented Sep 8, 2024

Chunks are divided by count of records. The records vary wildly in length depending on what Google decided to tuck into them that day. :)

@jonhnet

jonhnet commented Sep 8, 2024

Maybe a better script would count up to some total length, since the import process seems to die due to overall memory allocation.
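
Something along these lines might do it (an untested sketch, not the attached splitter: it still loads the whole file with json.load, so it needs enough RAM for one pass, and the 200 MB cap per chunk is an arbitrary choice):

#!/usr/bin/env python3
# Sketch: split Records.json into chunks capped by serialized size rather than
# record count, so no single chunk is too big for the importer.
import json

MAX_CHUNK_BYTES = 200 * 1024 * 1024  # arbitrary cap per output file

with open("Records.json") as f:
    locations = json.load(f)["locations"]

chunk, chunk_bytes, index = [], 0, 0
for record in locations:
    encoded_size = len(json.dumps(record))
    if chunk and chunk_bytes + encoded_size > MAX_CHUNK_BYTES:
        with open(f"chunk-{index:03d}.json", "w") as out:
            json.dump({"locations": chunk}, out)
        chunk, chunk_bytes, index = [], 0, index + 1
    chunk.append(record)
    chunk_bytes += encoded_size

if chunk:
    with open(f"chunk-{index:03d}.json", "w") as out:
        json.dump({"locations": chunk}, out)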

@hartmark

hartmark commented Sep 8, 2024

It seems it just swallowed the 1.1 GB file, so I guess all is good now :)

One small thing the script should also do is remove the uploaded files from the Docker volume so they won't take up storage after being imported.

I'll let it run overnight and see how it worked tomorrow.

@jonhnet

jonhnet commented Sep 9, 2024

@hartmark good point on cleaning up the mess afterwards. I haven't actually finished running the script over here yet. :v)

My first chunk is 1M records. It completed the bundle step in ~1 hour (I didn't measure), but the Sidekiq processing seems really slow. It has been 18 hours; I have 273k items processed and another million in the queue. I think that's because each point creates two work items serially, so I'm 273k/2M of the way through the job. That means this one chunk is going to take 6 days, and I have three more where that came from.

Is this expected performance for import? I can imagine improving it is a low priority, since it only happens once...

Sidekiq shows about 50-60 tasks completing per 10 s polling interval. I do notice that the Docker container is only running at ~10% CPU, suggesting that admitting more threads might make things a lot faster. I think I'll try docker compose down and bumping BACKGROUND_PROCESSING_CONCURRENCY from 10 to 100.

@jonhnet

jonhnet commented Sep 9, 2024

Oh, shame on me, this topic is covered in the FAQ. I tried making 4 extra Sidekiq containers, and now it looks like importing 1M records will take about a day. I guess I'll crank it up to 15 to grind through the next 4M records.
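
(For anyone following along: assuming the Sidekiq service in docker-compose.yml is named sidekiq, which I haven't verified, something like docker compose up -d --scale sidekiq=5 is one way to run extra worker containers without duplicating the service definition.)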

I note that if I docker compose stop dawarich, it loses track of the enqueued work items. I'm not sure if that's a bug or desired behavior. I guess the concern might be that a poorly timed shutdown (say, to upgrade) might silently leave some recently uploaded data unprocessed and hence invisible.

@jonhnet

jonhnet commented Sep 15, 2024

Here's a better version of my splitter script for enormous Google Takeout Records.json files. The main improvement is that it parses the input incrementally so this script itself doesn't encounter a memory bottleneck.

splitter.py.txt
(well, hold off on using this; I sent it before it finished, and my copy broke on one of my Records. I'll update.)
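
In the meantime, the incremental approach looks roughly like this (a sketch of the idea, not the attached script; it assumes the ijson package, installed with pip install ijson):

#!/usr/bin/env python3
# Stream the "locations" array with ijson so the whole Records.json never has to
# fit in memory, writing fixed-size chunks the importer can handle one at a time.
import json
import ijson

CHUNK_SIZE = 100_000  # records per output file

def flush(chunk, index):
    with open(f"chunk-{index:03d}.json", "w") as out:
        json.dump({"locations": chunk}, out)
    print(f"wrote chunk-{index:03d}.json ({len(chunk)} records)")

chunk, index = [], 0
with open("Records.json", "rb") as f:
    # use_float=True maps JSON numbers to float so json.dump can re-serialize them
    for record in ijson.items(f, "locations.item", use_float=True):
        chunk.append(record)
        if len(chunk) >= CHUNK_SIZE:
            flush(chunk, index)
            chunk, index = [], index + 1
if chunk:
    flush(chunk, index)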

@hartmark

Nice to have an updated script, but I have now managed to import my Google data.

It looks like quite a lot of the items in the queue fail, but there are no retries. I can only see 8 hours back in time, but it seems to be just ReverseGeocodingJob that fails. Will it retry the job somehow, or how does this work?

[screenshot]

@AngryJKirk

#279 (comment)

I did it in an alternative way; crossposting for visibility.

@makanimike

makanimike commented Dec 29, 2024

Trying to import the first chunk using savvas' script, I ran into this error:

[!] There was an error parsing `Gemfile`: No such file or directory @ rb_sysopen - .ruby-version. Bundler cannot continue.

#  from /var/app/Gemfile:6
#  -------------------------------------------
#  
>  ruby File.read('.ruby-version').strip
#  
#  -------------------------------------------

Any ideas?

ETA: I just had to execute the command from the root folder in the Docker container... #386

@makanimike

@hartmark:

I suppose your queue eventually got processed correctly?
I just imported my data 3-4 hours ago, but it appears there is no progress at all. In fact, it looks like it didn't even start! I'm wondering if I need to do something to kickstart the process, or if the import failed somehow... there are 3.7M imports in the queue, but nothing is happening...

@hartmark

@hartmark:

I suppose your queue eventually got processed correctly? I just imported my data 3-4 hours ago, but it appears there is no progress at all. In fact, it looks like it didn't even start! I'm wondering if I need to do something to kickstart the process, or if the import failed somehow... there are 3.7M imports in the queue, but nothing is happening...

Yeah, it took a few days for it to process everything.

@makanimike

It was pretty much a classic PEBMAC issue, but I thought I would still share my experience here for posterity, for anyone who might stumble into this thread searching for solutions...

During my early attempts to import the large Records.json file directly, I had adjusted the resource limits as suggested in the FAQ. One thing led to another, and a side effect was that Sidekiq was suspended in a permanent state of deploying because its CPU resources were set to 0.001. On the front end everything looked fine, though. I could see the Sidekiq interface and click around. The queue would grow with every successful import of the partial Records.json chunks, but it just would not process anything. It was flatlining.

I had to set the resource settings back to normal values. Now I just have to wait a few days.

@aarondoet

aarondoet commented Jan 4, 2025

I worked with Google Takeout exports in a Node.js project before. I used JSONStream to process the file; something similar exists for Ruby, and I would imagine it should make it possible to process huge files without much RAM usage:
https://github.com/thisismydesign/json-streamer

Here is the script I just wrote using JSONStream, because jq used too much RAM:

// Stream-parse Records.json with JSONStream so the whole file never sits in memory,
// writing the "locations" array out in chunks of 100k records as <n>_Records.json.
const { default: fs } = await import("fs");
const { default: JSONStream } = await import("JSONStream");

let chunk = 0;
let locations = {0: []};
let chunkSize = 100000;
let processed = 0;

let readStream = fs.createReadStream("./Records.json");
let stream = readStream.pipe(JSONStream.parse("locations.*"));

stream.on("data", l => {
	processed++;
	// Keep filling the current chunk until it reaches chunkSize...
	if (locations[chunk].length < chunkSize) return locations[chunk].push(l);
	// ...then write it out asynchronously, free its memory, and start the next chunk.
	const c = chunk;
	fs.promises.writeFile(c + "_Records.json", JSON.stringify({locations: locations[c]})).then(() => { delete locations[c]; });
	chunk++;
	locations[chunk] = [l];
});

stream.on("end", () => {
	// Flush whatever is left in the final (possibly partial) chunk.
	if (locations[chunk].length > 0) {
		fs.promises.writeFile(chunk + "_Records.json", JSON.stringify({locations: locations[chunk]})).then(() => console.log("done"));
	} else {
		console.log("done");
	}
});
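
(To run it as-is: the top-level await import(...) calls mean it has to be an ES module, so save it as e.g. split.mjs next to Records.json, install the JSONStream package from npm, and run node split.mjs.)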

@Sparticuz

And I imported each file individually. It takes a lot longer and is manual, but hey.

After splitting, I had 45 files, so instead of sitting there, I just ran this and it looped through all the imports:

for f in tmp/imports/*.json; do bundle exec rake import:big_file["${f}",'[email protected]']; done;

@jsixface

@aarondoet I agree that streaming the large file is the way to go. Loading the entire file into memory is going to run into errors at some point, even after splitting, if the chunks are still big.

@Freika is it possible to incorporate json-streamer for importing large JSON files?

@Freika
Owner

Freika commented Jan 20, 2025

@jsixface yes, it's possible, although I can't provide any estimates. This issue stays open specifically so that I can work on it later.

@Freika
Owner

Freika commented Jan 21, 2025

In https://github.com/Freika/dawarich/releases/tag/0.23.3 I released a performance update to the Records.json importing process. On my local machine, it now processes a 178 MB JSON file (627k points) in about 2 minutes.
