Importing large Records.json from Takeout fails with no message #142
Comments
The same is happening to me; it keeps crashing my Unraid server. |
Most likely, you're running out of memory. Consider giving more resources to Dawarich |
I am running this on a server with 64GB RAM and no memory restrictions on the Docker deployment, so I doubt it's that, to be honest. Especially since the job gets killed almost immediately after it starts (for me at least). |
Is there any way to enable more verbose logging to troubleshoot this further? |
In the meantime, if anyone else has this problem, this is how I worked around it. I split the giant Records.json file into multiple smaller ones using this script:
And I imported each file individually. It takes a lot longer and is manual, but hey. |
@savvasdalkitsis this is awesome, thank you! I'll consider using this approach to automatically split huge files during the import process. |
This worked perfectly for me, so getting dawarich to do this automagically behind the scenes would be awesome. |
My 1.4GB Records.json peaked at around 8GB of memory used by jq while running the script. Thankfully I have 32GB on my server, but a small heads-up for future googlers :) |
Confirming: I just ran into this problem importing my 2GB Records.json. The bundle process terminates with the stoic message "Killed" after 59 seconds; at that point top (running inside Docker) shows it using 4002572 VIRT (2.0g RES). top also shows "MiB Mem : 19523.4 total", suggesting the memory limit is on the bundle process, not the Docker machine. I didn't see any suspiciously limited resources in the process limits:
Anyway, I'm going to give @savvasdalkitsis 's strategy a try. (I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. 😃 ) |
@jonhnet existing points won't be doubled, that is taken care of :) |
Better still, I see that the import jobs are identified after the import is complete in the Imports panel, suggesting that there's metadata in the DB that could be used to clean things up. "Luckily", the failures have all resulted in 0-point imports, so it's a non-event. |
Here's my take on @savvasdalkitsis 's chunk-ifying script. |
Cool, thanks for your updated version, which also generates a nice script to add them to Dawarich. My Python is quite noobish, but you should probably flush the file in the while loop in case you want to see its progress. I also noticed that the sizes of the chunks are a bit uneven. Why is 001.json so much bigger?
|
Chunks are divided by count of records. The records vary wildly in length depending on what Google decided to tuck into them that day. :) |
Maybe a better script would split chunks by cumulative size rather than record count, since the import process seems to die due to overall memory allocation. |
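For illustration only (this is not one of the scripts shared in this thread): a minimal Python sketch of that size-based splitting idea. The 100 MB budget, the NNN.json naming, and the assumption that the whole file still fits in the splitter's memory are all assumptions.

import json

MAX_CHUNK_BYTES = 100 * 1024 * 1024  # assumed budget per chunk; tune to taste

with open("Records.json") as f:
    records = json.load(f)["locations"]

chunk, chunk_bytes, index = [], 0, 0
for record in records:
    encoded_len = len(json.dumps(record))
    # Flush the current chunk once adding this record would exceed the budget.
    if chunk and chunk_bytes + encoded_len > MAX_CHUNK_BYTES:
        with open(f"{index:03d}.json", "w") as out:
            json.dump({"locations": chunk}, out)
        chunk, chunk_bytes, index = [], 0, index + 1
    chunk.append(record)
    chunk_bytes += encoded_len

# Write whatever is left in the final, possibly partial chunk.
if chunk:
    with open(f"{index:03d}.json", "w") as out:
        json.dump({"locations": chunk}, out)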
It seems it just swallowed the 1.1GB file, so I guess all is good now :) One small thing the script should also do is remove the uploaded files from the Docker volume so they won't take up storage after being imported. I'll let it run overnight and see how it went tomorrow. |
@hartmark good point on cleaning up the mess afterwards. I haven't actually finished running the script over here yet. :v) My first chunk is 1M records. It completed the … Is this expected performance for import? I can imagine improving it is a low priority, since it only happens once... Sidekiq shows about 50-60 tasks completing per 10s polling interval. I do notice that the Docker container is only running at ~10% CPU, suggesting that admitting more threads might make things a lot faster. I think I'll try that. |
Oh, shame on me, this topic is covered in the FAQ. I tried making 4 extra Sidekiq containers, and now it looks like importing 1M records will take about a day. I guess I'll crank it up to 15 to grind through the next 4M records. I note that if I docker compose stop dawarich, it loses track of the enqueued work items. I'm not sure if that's a bug or desired behavior. I guess the concern might be that a poorly timed shutdown (say, to upgrade) might silently leave some recently uploaded data unprocessed and hence invisible. |
Here's a better version of my splitter script for enormous Google Takeout Records.json files. The main improvement is that it parses the input incrementally so this script itself doesn't encounter a memory bottleneck. splitter.py.txt |
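The attached splitter.py isn't reproduced here; a minimal sketch of the same incremental-parsing idea, assuming the ijson library and a 100k-record chunk size (both assumptions), might look like this:

import json
import ijson  # pip install ijson

CHUNK_SIZE = 100_000  # records per output file (assumption)

def write_chunk(index, records):
    # default=float converts the Decimal values ijson produces for
    # floating-point fields back into plain JSON numbers.
    with open(f"{index:03d}.json", "w") as out:
        json.dump({"locations": records}, out, default=float)

chunk, index = [], 0
with open("Records.json", "rb") as f:
    # Iterate over the elements of the top-level "locations" array one by
    # one, so the splitter never holds the whole file in memory.
    for record in ijson.items(f, "locations.item"):
        chunk.append(record)
        if len(chunk) >= CHUNK_SIZE:
            write_chunk(index, chunk)
            chunk, index = [], index + 1

# Flush the last, possibly partial chunk.
if chunk:
    write_chunk(index, chunk)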
Nice to have a new, updated script. I have now been able to import my Google data, but it looks like quite a lot of the items in the queue fail, and there are no retries. I can only see 8 hours back in time, but it seems to be just ReverseGeocodingJobs that fail. Will it retry the job somehow, or how does it work? |
I made it in an alternative way, crossposting for visibility |
Trying to import the first chunk using savvas's script, I ran into this error....
Any ideas? ETA: |
I suppose your queue eventually got processed correctly? |
Yeah, it took a few days for it to process all. |
It was pretty much a classic PEBMAC issue. But I thought I would still share my experience here for posterity, for anyone who might stumble into this thread searching for solutions.... During my early attempts to import the large Records.json file directly, I had adjusted the resource limits as suggested in the FAQ. One thing led to another, and a side effect of this was that Sidekiq was suspended in a permanent state of deploying because its CPU resources were set to 0.001. On the front end everything looked fine, though: I could see the Sidekiq interface and click around, and the queue would grow with every successful import of the partial Records.json chunks. But it just would not process anything. It was flatlining. I had to set the resource settings back to normal values. Now I just have to wait for a few days. |
I worked with Google Takeout exports in a Node.js project before. I used JSONStream to process the file, and something similar exists for Ruby. I would imagine this should make it possible to process huge files without much RAM usage. Here is a script I just wrote using JSONStream, because jq used too much RAM:

const { default: fs } = await import("fs");
const { default: JSONStream } = await import("JSONStream");

let chunk = 0;              // index of the chunk currently being filled
let locations = { 0: [] };  // records buffered per chunk
let chunkSize = 100000;     // records per output file
let processed = 0;          // total records seen so far

// Stream the "locations" array instead of loading the whole file into memory.
let readStream = fs.createReadStream("./Records.json");
let stream = readStream.pipe(JSONStream.parse("locations.*"));

stream.on("data", l => {
  processed++;
  // Keep filling the current chunk until it reaches chunkSize records.
  if (locations[chunk].length < chunkSize) return locations[chunk].push(l);
  // Chunk is full: write it out asynchronously and free its buffer.
  const c = chunk;
  fs.promises
    .writeFile(c + "_Records.json", JSON.stringify({ locations: locations[c] }))
    .then(() => { delete locations[c]; });
  // Start a new chunk with the current record.
  chunk++;
  locations[chunk] = [l];
});

stream.on("end", () => {
  // Flush whatever is left in the last (possibly partial) chunk.
  if (locations[chunk].length > 0) {
    fs.promises
      .writeFile(chunk + "_Records.json", JSON.stringify({ locations: locations[chunk] }))
      .then(() => console.log("done"));
  } else console.log("done");
}); |
After splitting, I had 45 files, so instead of sitting there I just ran this and it looped through all the imports:
for f in tmp/imports/*.json; do bundle exec rake import:big_file["${f}",'[email protected]']; done; |
@aarondoet I agree that streaming a large file is the way to go. Loading the entire file into memory is going to run into errors at some point, even after splitting, if the chunks are big. @Freika is it possible to incorporate json-streamer for importing large JSON? |
@jsixface yes, it's possible, although I can't provide any estimates. This issue stays open specifically so I can work on it later on. |
In https://github.com/Freika/dawarich/releases/tag/0.23.3 I released a performance update to the Records.json importing process. On my local machine, it now processes a 178MB JSON file in about 2 minutes (627k points). |
Describe the bug
Importing my (large) location history from Google Takeout fails with no error messages
Version
freikin/dawarich:latest
Logs