Add select parameter to fwrite() to avoid temporary objects #7236

Mukulyadav2004 · 2025-08-04T20:23:16Z

Problem:
When writing only specific columns requires creating temporary objects
fwrite(DT[,.(col1,col3)])

Solution:
Add select parameter that uses .shallow() for data.table objects to create shallow copy without data duplication.

fwrite(DT, select = c("col1", "col3"))

@MichaelChirico @tdhock please review.

codecov · 2025-08-04T20:30:41Z

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@c27ec26). Learn more about missing BASE report.
⚠️ Report is 42 commits behind head on master.

Files with missing lines	Patch %	Lines
R/fwrite.R	90.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master    #7236   +/-   ##
=========================================
  Coverage          ?   98.79%           
=========================================
  Files             ?       81           
  Lines             ?    15260           
  Branches          ?        0           
=========================================
  Hits              ?    15076           
  Misses            ?      184           
  Partials          ?        0

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2025-08-04T20:41:02Z

HEAD=issue#4177 stopped early for fwrite(select) #4177 1xN
HEAD=issue#4177 slower P<0.001 for memrecycle regression fixed in #5463

Generated via commit b072be5

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 40 seconds
Installing different package versions	40 seconds
Running and plotting the test cases	9 minutes and 21 seconds

Mukulyadav2004 · 2025-08-04T21:22:42Z

@MichaelChirico can I add # nocov to this line - x = x[select] as this is an edge case and only executes when select parameter is provided, and input x is a data.frame.

jangorecki

Needs manual update and NEWS entry

jangorecki · 2025-08-05T00:22:13Z

R/fwrite.R

           verbose=getOption("datatable.verbose", FALSE),
-           encoding = "") {
+           encoding = "",
+           select) {


I would use default value as NULL, same as fread

ben-schwen · 2025-08-05T10:54:14Z

R/fwrite.R

+  if (!null(select)) {
+    if (is.data.table(x)) {
+      cols = colnamesInt(x, select)
+      x = .shallow(x, cols)
+    } else {
+      x = x[select]
+    }
+  }


Is the distinction between data.table and non data.table the best one here, since colnamesInt basically also work with data.frames and list.

Moreover, we also need to handle matrix input!

Sure, I've placed our select logic after matrix conversion to data.table thus it should work for matrix input also.

tdhock · 2025-08-05T11:58:55Z

also would be good to add an atime performance test so we can see the reduction in memory usage.

Mukulyadav2004 · 2025-08-05T13:21:35Z

I surely will. But can you please tell where should atime test should be placed - .ci/atime/tests.R ?

ben-schwen · 2025-08-05T14:29:46Z

inst/tests/tests.Rraw

 test(2337.5, is.null(fwrite(data.table(a=numeric()), dec=",", sep=",")))
+
+# test for select parameter #4177
+DT = data.table(a=1:2, b=3:4)


please add tests for other classes e.g.

df = as.data.frame(DT) l = as.list(DT) m = as.matrix(DT)

…into issue#4177

…e of select(keep x as list/dt)

R/fwrite.R

Co-authored-by: Toby Dylan Hocking <[email protected]>

tdhock · 2025-08-20T04:08:16Z

current error in atime CI is

 Error in parse(test.env$tests.R) : 
  /__w/data.table/data.table/.ci/atime/tests.R:328:0: unexpected end of input
326:     tests=extra.test.list)
327: # nolint end: undesirable_operator_linter.

this means that the file does not contain valid R code (open paren/brance without matching closer).
You should double check by trying to source/run all code in tests.R

.ci/atime/tests.R

Co-authored-by: Toby Dylan Hocking <[email protected]>

tdhock · 2025-08-20T17:30:59Z

the 1xN performance test is still failing, can you please investigate, explain/fix?

Mukulyadav2004 · 2025-08-20T20:26:40Z

@tdhock I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA.
Also can you please tell how to test atime test locally

tdhock · 2025-08-22T14:07:37Z

https://github.com/Rdatatable/data.table/wiki/Performance-testing#running-locally

I don't understand you comment "I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA." can you please clarify?

Current atime result is shown below.

We see that master=CRAN=slow is constant whereas HEAD/Fast is linear.
Can you please figure out why HEAD is linear, and change it to be constant?
Thanks!

Mukulyadav2004 · 2025-08-22T17:50:33Z

I don't understand you comment "I find out that when select doesn't exist then we must subset first - so we've to emulatate select via subsetting before fwrite. And that subsetting must also be version pinned, so we've to use data.table:::[.data.table so that atime rewrites both [ and fwrite to same SHA." can you please clarify?

Offcourse, so atime rewrites only namespace qualified calls. agree? So when we write DT[, j], that uses the currently attached [.data.table and is not rewritten. Thus for slow(no select), we emulate select by first subsetting and that subsetting must be pinned so both fwrite and [ come from the same SHA. Hence data.table:::[.data.table(...).
Its my understanding, maybe I'm wrong here.

Mukulyadav2004 · 2025-08-22T21:35:19Z

@tdhock thank you telling the atime CRAN/HEAD diffrence here! the problem I can identify is probably using - cols = if (is.numeric(x)) rather than - cols = if (is.numeric(select)) , I think this forces the name - resolution path for every run thus failure in intended O(1) path.

tdhock · 2025-08-23T01:07:59Z

HEAD is still linear

Mukulyadav2004 · 2025-08-23T16:18:04Z

@tdhoch can you please tell how to check atime test locally so that I recheck before commits as I just added subsetting to data.table:::[.data.table.

tdhock · 2025-08-25T19:15:07Z

.ci/atime/tests.R

        data.table::fwrite(DT, temp_file, select = select_idx)
      } else {
-        data.table::fwrite(DT[, select_idx, with = FALSE], temp_file)
+        data.table::fwrite(data.table:::`[.data.table`(DT, , select_idx, with = FALSE), temp_file)


this is a good idea for correctness, thanks!
but I don't think it will affect this PR, because there are no modifications to [.data.table

thanks for review! can you please guide how can I improve it then to get head constant

tinfo <- atime::atime_pkg_test_info("~/R/data.table") alist <- eval(tinfo$test.call[["fwrite(select) #4177 1xN"]]) plot(alist)

@tdhock I can't able to figure out why plot is linear at high N, its clear to me that memory plot grows with N as out test builds a 1*N table but for time graph, for most N its flat and rises at very high N - likely some overhead when N is big maybe OS or disk effect, not sure.
Could you please spare some time to explain me the likely cause so that I can work upon it

you must have added some code which is linear in the number of columns. please look in the additions you made.

Mukulyadav2004 · 2025-09-02T12:39:41Z

Hi @tdhock

The code which is linear in the number of columns might be 'seq_len(NCOL(x))' - allocate linear number of columns at large N, and instead we can assign x as .shallow(x, cols) when we have length(cols) == 1L or a strict subset or a re-order.
Below is the plot-

Does this seems right to you?

tdhock · 2025-09-02T15:44:02Z

that seems linear in the number of cols but your plot still shows linear for HEAD=CRAN=Fast. (ideally should be constant like master=Slow)

Mukulyadav2004 · 2025-09-03T21:48:24Z

@tdhock As far as I can understand is in fwrite(select) disc I/O caused the big N spikes as when I reproduced the 1*N case with a null device, with a shallow 1 column view and OS null device, both memory and time plots are constant across N

tdhock · 2025-09-04T13:16:24Z

thanks for your analysis.
can you please share the code you used to get that result?
in that case, can you suggest changes to either the code or test?

Mukulyadav2004 · 2025-09-05T03:10:59Z

Sure, I added this test to test.R to check complexity without disc variance using shallow subset

fwrite(select) #4177 1xN (null, shallow)" = atime::atime_test(
  setup = {
    DT <- data.table(t(1:N))
    shallow_fun <- getFromNamespace("shallow", "data.table")
    DT1 <- shallow_fun(DT, 1L)
    nullfile <- if (.Platform$OS.type == "windows") "NUL" else "/dev/null"
  },
  expr = data.table::fwrite(DT1, nullfile, buffMB = 1L, nThread = 1L, col.names = FALSE),
  times = 10L,
  seconds.limit = 0.05,
  Slow = "66cb6d2393cef30083b444346a7600a079207806",
  Fast = "1887699fe965b5aa1fb8cb16b5507b7a5cbf5c85"
)

tdhock · 2025-09-05T13:17:00Z

why did you do shallow_fun <- getFromNamespace("shallow", "data.table") ?
can you please change to data.table::shallow?
Also can you please remove times and seconds.limit argument?
Also can you please add comments to explain the Slow/Fast commit ids?
Finally can you please explain why this test would be relevant? we don't see any difference between Fast and Slow, so it does not seem worth adding. (Fast and Slow should be obvious different)

Mukulyadav2004 · 2025-09-05T19:52:55Z

I added this new test - 1xN (null, shallow) as it removes disk cost -null device and column width as shallow 1 col view so that both Slow - pre select and Fast - with select already pass only one column to c, thus they are expected to look identical and flat whose result tells us the spikes we saw in the disk version come from file I/O, not per column processing.
Thank you!

Mukulyadav2004 added 3 commits August 5, 2025 01:03

added logic to handle select argumnt using .shallow()

1fb7294

add conditional passing

ee3e85f

assign x when selct is used

01d8bbe

Mukulyadav2004 requested a review from MichaelChirico as a code owner August 4, 2025 20:23

Merge branch 'master' into issue#4177

9837fd4

Mukulyadav2004 added 4 commits August 5, 2025 02:14

added coverage tests

97cb32b

typo

4f9928a

merge master

0eb227f

right number

8424176

jangorecki reviewed Aug 5, 2025

View reviewed changes

manual entry + news

d0956f3

ben-schwen reviewed Aug 5, 2025

View reviewed changes

moved select logic to address matrix inputs also

bb8320d

ben-schwen reviewed Aug 5, 2025

View reviewed changes

Mukulyadav2004 and others added 9 commits August 5, 2025 20:23

for matrix handling

8e63991

add tests for other classes too

d237f57

remove trail whit space

3a718c0

is.null

23214d1

add tests

2744a89

trailing space

66d3744

Merge branch 'issue#4177' of https://github.com/Rdatatable/data.table …

7f7a061

…into issue#4177

merge master branch

1887699

added atime performance test

d5c933b

Mukulyadav2004 requested review from tdhock and Anirban166 as code owners August 8, 2025 16:19

bypass name res for num select, avoid no-op shallow, and cols in plac…

c2c5d89

…e of select(keep x as list/dt)

tdhock reviewed Aug 15, 2025

View reviewed changes

R/fwrite.R Outdated Show resolved Hide resolved

Mukulyadav2004 and others added 4 commits August 15, 2025 22:19

better writing style

926bb0e

Co-authored-by: Toby Dylan Hocking <[email protected]>

seperate both cases

cbffb5b

use only is.numeric(x)

ea91282

better manual docs

28705de

tdhock requested changes Aug 20, 2025

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

closed parenthisis

94a9b59

tdhock reviewed Aug 20, 2025

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

tdhock reviewed Aug 20, 2025

View reviewed changes

.ci/atime/tests.R Outdated Show resolved Hide resolved

Mukulyadav2004 and others added 2 commits August 20, 2025 21:51

improved test stmt

377620e

Co-authored-by: Toby Dylan Hocking <[email protected]>

imprvd tst stmt

d834811

Co-authored-by: Toby Dylan Hocking <[email protected]>

use select parm instead of x in is.numeric

08f6fee

fwrite call to use data.table subsetting

b072be5

tdhock reviewed Aug 25, 2025

View reviewed changes

Add select parameter to fwrite() to avoid temporary objects #7236

Are you sure you want to change the base?

Add select parameter to fwrite() to avoid temporary objects #7236

Uh oh!

Conversation

Mukulyadav2004 commented Aug 4, 2025

Uh oh!

codecov bot commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Mukulyadav2004 commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jangorecki Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ben-schwen Aug 5, 2025

Choose a reason for hiding this comment

Uh oh!

Mukulyadav2004 Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tdhock commented Aug 5, 2025

Uh oh!

Mukulyadav2004 commented Aug 5, 2025

Uh oh!

ben-schwen Aug 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tdhock commented Aug 20, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tdhock commented Aug 20, 2025

Uh oh!

Mukulyadav2004 commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tdhock commented Aug 22, 2025

Uh oh!

Mukulyadav2004 commented Aug 22, 2025

Uh oh!

Mukulyadav2004 commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tdhock commented Aug 23, 2025

Uh oh!

Mukulyadav2004 commented Aug 23, 2025

Uh oh!

tdhock Aug 25, 2025

Choose a reason for hiding this comment

Uh oh!

Mukulyadav2004 Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tdhock Aug 25, 2025

Choose a reason for hiding this comment

Uh oh!

Mukulyadav2004 Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

tdhock Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

Mukulyadav2004 commented Sep 2, 2025

Uh oh!

tdhock commented Sep 2, 2025

codecov bot commented Aug 4, 2025 •

edited

Loading

github-actions bot commented Aug 4, 2025 •

edited

Loading

Mukulyadav2004 commented Aug 4, 2025 •

edited

Loading

jangorecki left a comment •

edited

Loading

jangorecki Aug 5, 2025 •

edited

Loading

Mukulyadav2004 Aug 5, 2025 •

edited

Loading

Mukulyadav2004 commented Aug 20, 2025 •

edited

Loading

Mukulyadav2004 commented Aug 22, 2025 •

edited

Loading

Mukulyadav2004 Aug 25, 2025 •

edited

Loading

Mukulyadav2004 commented Sep 3, 2025 •

edited

Loading

Mukulyadav2004 commented Sep 5, 2025 •

edited

Loading

Mukulyadav2004 commented Sep 5, 2025 •

edited

Loading