Skip to content

Conversation

Mukulyadav2004
Copy link
Contributor

closes #4177

Problem:
When writing only specific columns requires creating temporary objects
fwrite(DT[,.(col1,col3)])

Solution:
Add select parameter that uses .shallow() for data.table objects to create shallow copy without data duplication.

fwrite(DT, select = c("col1", "col3"))

@MichaelChirico @tdhock please review.

Copy link

codecov bot commented Aug 4, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@c27ec26). Learn more about missing BASE report.
⚠️ Report is 42 commits behind head on master.

Files with missing lines Patch % Lines
R/fwrite.R 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #7236   +/-   ##
=========================================
  Coverage          ?   98.79%           
=========================================
  Files             ?       81           
  Lines             ?    15260           
  Branches          ?        0           
=========================================
  Hits              ?    15076           
  Misses            ?      184           
  Partials          ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

github-actions bot commented Aug 4, 2025

  • HEAD=issue#4177 stopped early for fwrite(select) #4177 1xN
  • HEAD=issue#4177 slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit b072be5

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 2 minutes and 40 seconds
Installing different package versions 40 seconds
Running and plotting the test cases 9 minutes and 21 seconds

@Mukulyadav2004
Copy link
Contributor Author

Mukulyadav2004 commented Aug 4, 2025

@MichaelChirico can I add # nocov to this line - x = x[select] as this is an edge case and only executes when select parameter is provided, and input x is a data.frame.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs manual update and NEWS entry

R/fwrite.R Outdated
verbose=getOption("datatable.verbose", FALSE),
encoding = "") {
encoding = "",
select) {
Copy link
Member

@jangorecki jangorecki Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use default value as NULL, same as fread

R/fwrite.R Outdated
Comment on lines 32 to 39
if (!null(select)) {
if (is.data.table(x)) {
cols = colnamesInt(x, select)
x = .shallow(x, cols)
} else {
x = x[select]
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the distinction between data.table and non data.table the best one here, since colnamesInt basically also work with data.frames and list.

Moreover, we also need to handle matrix input!

Copy link
Contributor Author

@Mukulyadav2004 Mukulyadav2004 Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I've placed our select logic after matrix conversion to data.table thus it should work for matrix input also.

@tdhock
Copy link
Member

tdhock commented Aug 5, 2025

also would be good to add an atime performance test so we can see the reduction in memory usage.

@Mukulyadav2004
Copy link
Contributor Author

I surely will. But can you please tell where should atime test should be placed - .ci/atime/tests.R ?

test(2337.5, is.null(fwrite(data.table(a=numeric()), dec=",", sep=",")))

# test for select parameter #4177
DT = data.table(a=1:2, b=3:4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add tests for other classes e.g.

df = as.data.frame(DT)
l = as.list(DT)
m = as.matrix(DT)

@tdhock
Copy link
Member

tdhock commented Aug 20, 2025

current error in atime CI is

 Error in parse(test.env$tests.R) : 
  /__w/data.table/data.table/.ci/atime/tests.R:328:0: unexpected end of input
326:     tests=extra.test.list)
327: # nolint end: undesirable_operator_linter.

this means that the file does not contain valid R code (open paren/brance without matching closer).
You should double check by trying to source/run all code in tests.R

Mukulyadav2004 and others added 2 commits August 20, 2025 21:51
Co-authored-by: Toby Dylan Hocking <[email protected]>
Co-authored-by: Toby Dylan Hocking <[email protected]>
@tdhock
Copy link
Member

tdhock commented Aug 20, 2025

the 1xN performance test is still failing, can you please investigate, explain/fix?

@Mukulyadav2004
Copy link
Contributor Author

Mukulyadav2004 commented Aug 20, 2025

@tdhock I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA.
Also can you please tell how to test atime test locally

@tdhock
Copy link
Member

tdhock commented Aug 22, 2025

https://github.com/Rdatatable/data.table/wiki/Performance-testing#running-locally

I don't understand you comment "I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA." can you please clarify?

Current atime result is shown below.
image
We see that master=CRAN=slow is constant whereas HEAD/Fast is linear.
Can you please figure out why HEAD is linear, and change it to be constant?
Thanks!

@Mukulyadav2004
Copy link
Contributor Author

I don't understand you comment "I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA." can you please clarify?

Offcourse, so atime rewrites only namespace qualified calls. agree? So when we write DT[, j], that uses the currently attached [.data.table and is not rewritten. Thus for slow(no select), we emulate select by first subsetting and that subsetting must be pinned so both fwrite and [ come from the same SHA. Hence data.table:::[.data.table(...).
Its my understanding, maybe I'm wrong here.

@Mukulyadav2004
Copy link
Contributor Author

Mukulyadav2004 commented Aug 22, 2025

@tdhock thank you telling the atime CRAN/HEAD diffrence here! the problem I can identify is probably using - cols = if (is.numeric(x)) rather than - cols = if (is.numeric(select)) , I think this forces the name - resolution path for every run thus failure in intended O(1) path.

@tdhock
Copy link
Member

tdhock commented Aug 23, 2025

HEAD is still linear

@Mukulyadav2004
Copy link
Contributor Author

@tdhoch can you please tell how to check atime test locally so that I recheck before commits as I just added subsetting to data.table:::[.data.table.

data.table::fwrite(DT, temp_file, select = select_idx)
} else {
data.table::fwrite(DT[, select_idx, with = FALSE], temp_file)
data.table::fwrite(data.table:::`[.data.table`(DT, , select_idx, with = FALSE), temp_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good idea for correctness, thanks!
but I don't think it will affect this PR, because there are no modifications to [.data.table

Copy link
Contributor Author

@Mukulyadav2004 Mukulyadav2004 Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for review! can you please guide how can I improve it then to get head constant

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tinfo <- atime::atime_pkg_test_info("~/R/data.table")
alist <- eval(tinfo$test.call[["fwrite(select) #4177 1xN"]])
plot(alist)
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdhock I can't able to figure out why plot is linear at high N, its clear to me that memory plot grows with N as out test builds a 1*N table but for time graph, for most N its flat and rises at very high N - likely some overhead when N is big maybe OS or disk effect, not sure.
Could you please spare some time to explain me the likely cause so that I can work upon it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you must have added some code which is linear in the number of columns. please look in the additions you made.

@Mukulyadav2004
Copy link
Contributor Author

Hi @tdhock

The code which is linear in the number of columns might be 'seq_len(NCOL(x))' - allocate linear number of columns at large N, and instead we can assign x as .shallow(x, cols) when we have length(cols) == 1L or a strict subset or a re-order.
Below is the plot-
image

Does this seems right to you?

@tdhock
Copy link
Member

tdhock commented Sep 2, 2025

that seems linear in the number of cols but your plot still shows linear for HEAD=CRAN=Fast. (ideally should be constant like master=Slow)

@Mukulyadav2004
Copy link
Contributor Author

Mukulyadav2004 commented Sep 3, 2025

@tdhock As far as I can understand is in fwrite(select) disc I/O caused the big N spikes as when I reproduced the 1*N case with a null device, with a shallow 1 column view and OS null device, both memory and time plots are constant across N
image

@tdhock
Copy link
Member

tdhock commented Sep 4, 2025

thanks for your analysis.
can you please share the code you used to get that result?
in that case, can you suggest changes to either the code or test?

@Mukulyadav2004
Copy link
Contributor Author

Mukulyadav2004 commented Sep 5, 2025

Sure, I added this test to test.R to check complexity without disc variance using shallow subset

fwrite(select) #4177 1xN (null, shallow)" = atime::atime_test(
  setup = {
    DT <- data.table(t(1:N))
    shallow_fun <- getFromNamespace("shallow", "data.table")
    DT1 <- shallow_fun(DT, 1L)
    nullfile <- if (.Platform$OS.type == "windows") "NUL" else "/dev/null"
  },
  expr = data.table::fwrite(DT1, nullfile, buffMB = 1L, nThread = 1L, col.names = FALSE),
  times = 10L,
  seconds.limit = 0.05,
  Slow = "66cb6d2393cef30083b444346a7600a079207806",
  Fast = "1887699fe965b5aa1fb8cb16b5507b7a5cbf5c85"
)

@tdhock
Copy link
Member

tdhock commented Sep 5, 2025

why did you do shallow_fun <- getFromNamespace("shallow", "data.table") ?
can you please change to data.table::shallow?
Also can you please remove times and seconds.limit argument?
Also can you please add comments to explain the Slow/Fast commit ids?
Finally can you please explain why this test would be relevant? we don't see any difference between Fast and Slow, so it does not seem worth adding. (Fast and Slow should be obvious different)

@Mukulyadav2004
Copy link
Contributor Author

Mukulyadav2004 commented Sep 5, 2025

I added this new test - 1xN (null, shallow) as it removes disk cost -null device and column width as shallow 1 col view so that both Slow - pre select and Fast - with select already pass only one column to c, thus they are expected to look identical and flat whose result tells us the spikes we saw in the disk version come from file I/O, not per column processing.
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fwrite to gain select and drop column(s) arguments without creating a temporary object
5 participants