
xsv table displays incorrectly #151

Open
llimllib opened this issue Oct 3, 2018 · 13 comments

@llimllib

llimllib commented Oct 3, 2018

Given data with a very regular format, xsv table goes through an irregular "phase change" in its column alignment. I've reduced it to a test CSV where it happens around line 575:

$ (head -n2; tail) < test.csv
contract_id,plan_id,segment_id,pharmacy_number,price_id,brand_dispensing_fee_30,generic_dispensing_fee_30
abcde,001,000,123456789012,160,123456789012,000000012500
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000

$ (head -n2; tail) < <(xsv table test.csv)
contract_id  plan_id  segment_id  pharmacy_number  price_id  brand_dispensing_fee_30  generic_dispensing_fee_30
abcde        001      000         123456789012     160       123456789012             000000012500
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde   019  000  123456789012  107  123456789012  000000006000
abcde  019  000  123456789012  107  123456789012  000000006000
abcde  019  000  123456789012  107  123456789012  000000006000

The test file is available here: https://gist.githubusercontent.com/llimllib/91576d1fdbb1a0564932924af9313894/raw/3e716a3cd8aadd141a93e7683a5626876f5b548b/test.csv

@BurntSushi
Owner

My guess is that this is happening because of the buffer size. That is, once the output exceeds the internal buffer size, the alignment resets.

When I get a chance, I'll take a look at what can be done to fix this. If there is a problem, it would need to be fixed in https://github.com/BurntSushi/tabwriter most likely. It's been a long time since I wrote that code, so I can't remember whether this is "by design" or not.
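
For context, a minimal Rust sketch (not xsv's actual source, just the wiring being described) of a csv writer with a tab delimiter feeding a tabwriter::TabWriter. TabWriter only aligns the rows it has buffered since its last flush, so any flush that lands in the middle of the stream starts a new, independently aligned block:

// Minimal sketch, assuming the csv and tabwriter crates; not xsv's actual code.
use std::io::Write;
use tabwriter::TabWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Rows are emitted as tab-delimited records into a TabWriter over stdout.
    let tw = TabWriter::new(std::io::stdout());
    let mut wtr = csv::WriterBuilder::new().delimiter(b'\t').from_writer(tw);

    wtr.write_record(&["contract_id", "plan_id"])?;
    wtr.write_record(&["abcde", "001"])?;
    // If the csv writer's internal buffer filled up here and it flushed the
    // TabWriter, the rows below would be aligned independently of the rows
    // above, which is the misalignment shown in the report.
    wtr.write_record(&["abcde", "019"])?;
    wtr.flush()?;
    Ok(())
}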

@llimllib
Author

llimllib commented Oct 3, 2018

I wasn't sure whether to file the bug here or there, but I don't really know how to write Rust, so I figured I'd let you know here. It limits the usefulness of xsv table for me, for sure.

(I really like xsv though; thanks for the work and thanks for looking at the issue so quickly.)

@BurntSushi
Owner

@llimllib A possible work-around, if you're on a Unix-like system and have the column command installed, is to use xsv table test.csv | column -t.

@llimllib
Author

llimllib commented Oct 3, 2018

Hah, I went to xsv table because of the shortcomings of column, but I didn't consider piping back through column. That workaround works for me, thanks.

@ursetto

ursetto commented Oct 30, 2018

Unfortunately the column -t workaround does not work if your data contains any spaces, which is pretty common.

It seems like the "phase change" can occur in the middle of a line, and then the width of subsequent columns gets out of phase. Below it occurs in line 3, and in the lines after that the IP address column is extremely wide, until the next phase change, when this repeats.

Agent Managed - Reachable    hostname4.xyz.com            192.168.1.12    Red Hat Enterprise Linux Server 7 X86_64  abcdefg   Live
Agent Managed - Reachable    hostname5.xyz.com            192.168.1.13    Red Hat Enterprise Linux Server 7 X86_64  abcdefg   Not Specified
Agent Managed - Reachable    hostname6.xyz.com                     192.168.1.14       Red Hat Enterprise Linux Server 7 X86_64  abcdefg                                   Not Specified
Agent Managed - Reachable  hostname7.xyz.com  192.168.1.15                              Red Hat Enterprise Linux Server 7 X86_64  abcdefg  Not Specified
Agent Managed - Reachable  hostname8.xyz.com  192.168.1.16                              Red Hat Enterprise Linux Server 7 X86_64  abcdefg  Not Specified

@BurntSushi
Owner

@ursetto You probably need to use --separator to set tab as the only separator. column defaults to whitespace.

There is nothing significant about when rows get out of alignment; it isn't connected to the data. It's just an internal buffer filling up, at which point the alignment is reset. The reason this bug exists is that the underlying implementation does not allow itself to use memory without bound.

@ursetto

ursetto commented Oct 30, 2018

I'm guessing you are saying to replace xsv table with something like:

xsv fmt -t '\t'|column -s "$(echo -ne '\t')" -t

This seems to work fine as long as you can find a delimiter not in your data.

@owst

owst commented Oct 6, 2019

I recently encountered this issue too. Having taken a look, @BurntSushi was right - the cause is an internal buffer being filled, then flushed to the underlying TabWriter, which causes the layout to break.

The buffer in question is the internal buffer of rust-csv - when attempting to write a CSV record, if the internal buffer becomes full, flush is called: the buffer's contents are written to the underlying writer (in this case, an instance of TabWriter). In addition, the underlying writer is also flushed.

The effect of this flushing of TabWriter can be more easily demonstrated if we reduce the buffer size used by the CSV writer down here, from 32K to 14; then:

# 14th byte output to the tab writer is z
$ echo 'a,b\n1234,x\ny,z' | ./target/debug/xsv table
a     b
1234  x
y     z

# 14th byte output to the tab writer is the \t inserted after y
$ echo 'a,b\n12345,x\ny,z' | ./target/debug/xsv table 
a      b
12345  x
yz

# 14th byte output to the tab writer is y
$ echo 'a,b\n123456,x\ny,z' | ./target/debug/xsv table 
a       b
123456  x
y    z

# 14th byte output to the tab writer is the \n after x
$ echo 'a,b\n1234567,x\ny,z' | ./target/debug/xsv table 
a        b
1234567  x
y   z
  • The first example is as expected
  • The second is malformed: the TabWriter flush occurs just after a tab was written (which starts a new cell), and empty trailing cells are ignored. When the next field is written after TabWriter's reset, the fact that a separator was previously written (to end the y field) has been lost, so the next field starts immediately
  • The difference between the third and fourth examples is due to TabWriter seemingly interpreting a leading \t differently from one between fields (n.b. the 4 vs 3 spaces); see the xxd output and the Rust sketch that follows it:
    $ ./tabwriter-bin/target/debug/tabwriter <<< $'\tz' | xxd
    00000000: 2020 2020 7a0a                               z.
    $ ./tabwriter-bin/target/debug/tabwriter <<< $'y\tz' | xxd
    00000000: 7920 2020 7a0a                           y   z.
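
A rough Rust equivalent of the two tabwriter-bin invocations above (assuming the tabwriter crate's documented API; the exact widths are not asserted here):

// Sketch: render a string through TabWriter and print the result with
// escapes visible, so a leading \t can be compared with a \t between fields.
use std::io::Write;
use tabwriter::TabWriter;

fn render(input: &str) -> String {
    let mut tw = TabWriter::new(Vec::new());
    write!(tw, "{}", input).unwrap();
    tw.flush().unwrap();
    String::from_utf8(tw.into_inner().unwrap()).unwrap()
}

fn main() {
    println!("{:?}", render("\tz\n"));  // leading tab
    println!("{:?}", render("y\tz\n")); // tab between fields
}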
    

Having made some tweaks in a local build, this issue is resolved if the csv writer does not flush the underlying writer (TabWriter) when its internal buffer becomes full, but simply writes the buffer's contents out to the underlying writer (leaving it to flush as and when necessary). I'll raise an issue/PR against the csv library to do this.
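
In io::Write terms, the distinction is roughly the following; this is a sketch with a hypothetical BufferedCsvWriter type, not rust-csv's actual internals:

// Sketch only: the difference between the old and the proposed behaviour
// when a buffered CSV writer's internal buffer fills up. `wtr` stands in
// for the underlying writer (e.g. a TabWriter).
use std::io::{self, Write};

struct BufferedCsvWriter<W: Write> {
    buf: Vec<u8>,
    wtr: W,
}

impl<W: Write> BufferedCsvWriter<W> {
    // Old behaviour: drain the buffer and then flush the underlying writer,
    // which makes TabWriter emit and align only what it has seen so far.
    fn on_buffer_full_old(&mut self) -> io::Result<()> {
        self.wtr.write_all(&self.buf)?;
        self.buf.clear();
        self.wtr.flush() // this is the call that breaks the table layout
    }

    // Proposed behaviour: hand the bytes to the underlying writer and let it
    // decide when to flush, so TabWriter keeps accumulating rows.
    fn on_buffer_full_new(&mut self) -> io::Result<()> {
        self.wtr.write_all(&self.buf)?;
        self.buf.clear();
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut w = BufferedCsvWriter { buf: b"a\tb\n".to_vec(), wtr: io::stdout() };
    w.on_buffer_full_new()?; // write pending bytes without forcing a flush
    w.on_buffer_full_old()   // buffer is empty now; this just flushes stdout
}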

@llimllib
Author

llimllib commented Oct 6, 2019

Thanks for the much more detailed investigation, @owst

BurntSushi pushed a commit to BurntSushi/rust-csv that referenced this issue Jan 7, 2020
This fixes a bug where the CSV writer was erroneously flushing the
underlying writer when its internal buffer was full. Instead, when the
internal buffer is full, it should simply write the contents of the buffer
to the underlying writer. There is no need to flush it. Indeed, this causes
other problems such as those observed in
BurntSushi/xsv#151.

We are careful to ensure that calling `flush` explicitly still calls `flush`
on the underlying writer.

Fixes #173
@BurntSushi
Owner

While BurntSushi/rust-csv#174 fixes the underlying issue, I suspect it will be a bit of time before I get around to another xsv release. So if folks want this fix, then you'll need to do a cargo update -p csv and build xsv from source.

@llimllib
Author

llimllib commented Jan 7, 2020

Appreciate all your work, thanks for the update

@owst

owst commented Jan 11, 2020

Thanks @BurntSushi!

FYI, to build from source locally I had to give a more specific csv version in Cargo.toml to allow the update from 1.0.1 to 1.1.2 (without it, csv only updated to 1.0.7):

diff --git a/Cargo.toml b/Cargo.toml
index 051fb3d..eea20da 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -30,7 +30,7 @@ opt-level = 3
 [dependencies]
 byteorder = "1"
 crossbeam-channel = "0.2.4"
-csv = "1"
+csv = "1.1"
 csv-index = "0.1.5"
 docopt = "1"
 filetime = "0.1"

@apcamargo

@BurntSushi Any chance this will be fixed in the master branch sometime in the future?
