We need to re-code our script to work around the fact that R, by default, tries to load an entire file into memory.
The easiest alternative is to use the ff library, which works with data frames containing heterogeneous data; if the data were homogeneous (e.g., a numeric matrix), the bigmemory library would also do, but that does not appear to be our case.
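For reference, here is a minimal sketch of what ff-based loading could look like; the file name and chunk sizes are placeholders, not part of the current pipeline:

```r
# Illustrative sketch only: file name and chunk sizes are placeholders.
library(ff)

# read.csv.ffdf reads the CSV in chunks and keeps the columns on disk,
# exposing them as an ffdf object instead of an in-memory data frame.
docs <- read.csv.ffdf(file = "documents.csv",
                      header = TRUE,
                      first.rows = 10000,  # rows used to infer column types
                      next.rows = 50000)   # rows read per subsequent chunk

nrow(docs)  # dimensions are available without loading everything into RAM
```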
The most general solutions are instead to use Hadoop and MapReduce to split the complex task into smaller, faster subtasks [2], or, alternatively, to leverage a database for storing and then querying the data [3].
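As an illustration of the database route (not something implemented here), the data could be staged in SQLite once and then pulled back in manageable slices; the table and column names below are made up:

```r
# Illustrative sketch only: database, table, and column names are made up.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "documents.db")

# Store the full dataset once (e.g., dbWriteTable(con, "documents", docs)),
# then query only as many rows as fit comfortably in memory.
slice <- dbGetQuery(con, "SELECT id, text FROM documents LIMIT 1000 OFFSET 0")

dbDisconnect(con)
```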
A simpler solution is to chunk the data and feed each chunk to Senti4SD. That's what PR #9 did in a basic way. I also wrote my own script, which adds several improvements, including the ability to call Senti4SD as an R function rather than as a bash script.
For example, I can work with a 100k dataset like this:
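(Illustrative sketch below; the Senti4SD function name and file names are placeholders — the actual interface is in the linked commit.)

```r
# Illustrative sketch: the Senti4SD function and file names below are
# placeholders; see the linked commit for the actual interface.
library(data.table)

docs <- fread("documents-100k.csv")  # ~100k documents, one text per row

# Split the input into chunks of 1000 documents, score each chunk with
# Senti4SD separately, and bind the per-chunk results back together.
chunk.size <- 1000
chunks <- split(docs, ceiling(seq_len(nrow(docs)) / chunk.size))
results <- rbindlist(lapply(chunks, Senti4SD))

fwrite(results, "predictions-100k.csv")
```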
The code is available here: maelick/Senti4SD@5b0df31, and I can create a PR if there's interest.
I've been able to run it successfully on my laptop (8 GB of memory) in 3800 s with a chunk size of 1000. On a supercomputer I tried a higher chunk size (10k), but it only improved the run time to 2800 s. I suspect the improvement is so small because a significant amount of time is spent reading and writing huge CSV files. Using rJava (as I mentioned in #10) instead of CSV files to communicate between Java and R could significantly improve performance... and the reusability of the tool :-)
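For context, a rough sketch of the rJava idea follows; the jar path, Java class, and method names are placeholders, not the real Senti4SD API:

```r
# Illustrative sketch only: jar path, class, and method names are placeholders.
library(rJava)

.jinit(classpath = "Senti4SD.jar")  # start the JVM with the classifier jar

# Instead of round-tripping through CSV files, pass the texts directly as a
# Java String[] and read the predicted labels back as an R character vector.
texts <- c("I love this API", "This crashes all the time")
jtexts <- .jarray(texts)
# labels <- .jcall("org/example/ClassifierFacade",
#                  "[Ljava/lang/String;", "classify", jtexts)
```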
Currently, we don't have the resources to work on this issue. Please open the PR; we will merge it into a separate branch to make it available to others. Thank you.
[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii