-
Notifications
You must be signed in to change notification settings - Fork 1k
Add select parameter to fwrite() to avoid temporary objects #7236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7236 +/- ##
=========================================
Coverage ? 98.79%
=========================================
Files ? 81
Lines ? 15260
Branches ? 0
=========================================
Hits ? 15076
Misses ? 184
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Generated via commit b072be5 Download link for the artifact containing the test results: ↓ atime-results.zip
|
@MichaelChirico can I add # nocov to this line - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs manual update and NEWS entry
R/fwrite.R
Outdated
verbose=getOption("datatable.verbose", FALSE), | ||
encoding = "") { | ||
encoding = "", | ||
select) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use default value as NULL, same as fread
R/fwrite.R
Outdated
if (!null(select)) { | ||
if (is.data.table(x)) { | ||
cols = colnamesInt(x, select) | ||
x = .shallow(x, cols) | ||
} else { | ||
x = x[select] | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the distinction between data.table
and non data.table the best one here, since colnamesInt
basically also work with data.frames
and list
.
Moreover, we also need to handle matrix input!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I've placed our select logic after matrix conversion to data.table
thus it should work for matrix input also.
also would be good to add an atime performance test so we can see the reduction in memory usage. |
I surely will. But can you please tell where should |
test(2337.5, is.null(fwrite(data.table(a=numeric()), dec=",", sep=","))) | ||
|
||
# test for select parameter #4177 | ||
DT = data.table(a=1:2, b=3:4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add tests for other classes e.g.
df = as.data.frame(DT)
l = as.list(DT)
m = as.matrix(DT)
…into issue#4177
…e of select(keep x as list/dt)
Co-authored-by: Toby Dylan Hocking <[email protected]>
current error in atime CI is
this means that the file does not contain valid R code (open paren/brance without matching closer). |
Co-authored-by: Toby Dylan Hocking <[email protected]>
Co-authored-by: Toby Dylan Hocking <[email protected]>
the 1xN performance test is still failing, can you please investigate, explain/fix? |
@tdhock I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA. |
https://github.com/Rdatatable/data.table/wiki/Performance-testing#running-locally I don't understand you comment "I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA." can you please clarify? Current atime result is shown below. |
Offcourse, so atime rewrites only namespace qualified calls. agree? So when we write DT[, j], that uses the currently attached [.data.table and is not rewritten. Thus for slow(no select), we emulate select by first subsetting and that subsetting must be pinned so both fwrite and [ come from the same SHA. Hence data.table:::[.data.table(...). |
@tdhock thank you telling the atime CRAN/HEAD diffrence here! the problem I can identify is probably using - |
HEAD is still linear |
@tdhoch can you please tell how to check atime test locally so that I recheck before commits as I just added subsetting to data.table:::[.data.table. |
data.table::fwrite(DT, temp_file, select = select_idx) | ||
} else { | ||
data.table::fwrite(DT[, select_idx, with = FALSE], temp_file) | ||
data.table::fwrite(data.table:::`[.data.table`(DT, , select_idx, with = FALSE), temp_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good idea for correctness, thanks!
but I don't think it will affect this PR, because there are no modifications to [.data.table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for review! can you please guide how can I improve it then to get head constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdhock I can't able to figure out why plot is linear at high N, its clear to me that memory plot grows with N as out test builds a 1*N table but for time graph, for most N its flat and rises at very high N - likely some overhead when N is big maybe OS or disk effect, not sure.
Could you please spare some time to explain me the likely cause so that I can work upon it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you must have added some code which is linear in the number of columns. please look in the additions you made.
Hi @tdhock The code which is linear in the number of columns might be 'seq_len(NCOL(x))' - allocate linear number of columns at large N, and instead we can assign x as .shallow(x, cols) when we have length(cols) == 1L or a strict subset or a re-order. Does this seems right to you? |
that seems linear in the number of cols but your plot still shows linear for HEAD=CRAN=Fast. (ideally should be constant like master=Slow) |
@tdhock As far as I can understand is in fwrite(select) disc I/O caused the big N spikes as when I reproduced the 1*N case with a null device, with a shallow 1 column view and OS null device, both memory and time plots are constant across N |
thanks for your analysis. |
Sure, I added this test to test.R to check complexity without disc variance using shallow subset
|
why did you do |
I added this new test - 1xN (null, shallow) as it removes disk cost -null device and column width as shallow 1 col view so that both Slow - pre select and Fast - with select already pass only one column to c, thus they are expected to look identical and flat whose result tells us the spikes we saw in the disk version come from file I/O, not per column processing. |
closes #4177
Problem:
When writing only specific columns requires creating temporary objects
fwrite(DT[,.(col1,col3)])
Solution:
Add select parameter that uses .shallow() for data.table objects to create shallow copy without data duplication.
fwrite(DT, select = c("col1", "col3"))
@MichaelChirico @tdhock please review.