
xsv table displays incorrectly #151

Open
llimllib opened this issue Oct 3, 2018 · 13 comments

@llimllib

llimllib commented Oct 3, 2018

Given data with a very regular format, xsv table goes through an irregular "phase change" in its column alignment. I've reduced it to a test CSV where it happens around line 575:

$ (head -n2; tail) < test.csv
contract_id,plan_id,segment_id,pharmacy_number,price_id,brand_dispensing_fee_30,generic_dispensing_fee_30
abcde,001,000,123456789012,160,123456789012,000000012500
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000
abcde,019,000,123456789012,107,123456789012,000000006000

$ (head -n2; tail) < <(xsv table test.csv)
contract_id  plan_id  segment_id  pharmacy_number  price_id  brand_dispensing_fee_30  generic_dispensing_fee_30
abcde        001      000         123456789012     160       123456789012             000000012500
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde        019      000         123456789012     107       123456789012             000000006000
abcde   019  000  123456789012  107  123456789012  000000006000
abcde  019  000  123456789012  107  123456789012  000000006000
abcde  019  000  123456789012  107  123456789012  000000006000

The test file is available here: https://gist.githubusercontent.com/llimllib/91576d1fdbb1a0564932924af9313894/raw/3e716a3cd8aadd141a93e7683a5626876f5b548b/test.csv

@BurntSushi
Owner

My guess is that this is happening because of the buffer size. That is, once the output exceeds the internal buffer size, the alignment resets.

When I get a chance, I'll take a look at what can be done to fix this. If there is a problem, it would need to be fixed in https://github.com/BurntSushi/tabwriter most likely. It's been a long time since I wrote that code, so I can't remember whether this is "by design" or not.
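
For context, a minimal Rust sketch (not xsv's actual source, just the wiring being described) of a csv writer with a tab delimiter feeding a tabwriter::TabWriter. TabWriter only aligns the rows it has buffered since its last flush, so any flush that lands in the middle of the stream starts a new, independently aligned block:

// Minimal sketch, assuming the csv and tabwriter crates; not xsv's actual code.
use std::io::Write;
use tabwriter::TabWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Rows are emitted as tab-delimited records into a TabWriter over stdout.
    let tw = TabWriter::new(std::io::stdout());
    let mut wtr = csv::WriterBuilder::new().delimiter(b'\t').from_writer(tw);

    wtr.write_record(&["contract_id", "plan_id"])?;
    wtr.write_record(&["abcde", "001"])?;
    // If the csv writer's internal buffer filled up here and it flushed the
    // TabWriter, the rows below would be aligned independently of the rows
    // above, which is the misalignment shown in the report.
    wtr.write_record(&["abcde", "019"])?;
    wtr.flush()?;
    Ok(())
}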

@llimllib
Author

llimllib commented Oct 3, 2018

I wasn't sure whether to file the bug here or there, but I don't really know how to write Rust, so I figured I'd let you know here. It limits the usefulness of xsv table for me, for sure.

(I really like xsv though; thanks for the work and thanks for looking at the issue so quickly.)

@BurntSushi
Owner

@llimllib A possible work-around, if you're on a Unix-like system and have the column command installed, is to use xsv table test.csv | column -t.

@llimllib
Author

llimllib commented Oct 3, 2018

Hah, I went to xsv table because of the shortcomings of column, but I didn't consider piping back through column. That workaround works for me, thanks.

@ursetto

ursetto commented Oct 30, 2018

Unfortunately the column -t workaround does not work if your data contains any spaces, which is pretty common.

It seems like the "phase change" can occur in the middle of a line, and then the width of subsequent columns gets out of phase. Below it occurs in line 3, and in the lines after that the IP address column is extremely wide, until the next phase change, when this repeats.

Agent Managed - Reachable    hostname4.xyz.com            192.168.1.12    Red Hat Enterprise Linux Server 7 X86_64  abcdefg   Live
Agent Managed - Reachable    hostname5.xyz.com            192.168.1.13    Red Hat Enterprise Linux Server 7 X86_64  abcdefg   Not Specified
Agent Managed - Reachable    hostname6.xyz.com                     192.168.1.14       Red Hat Enterprise Linux Server 7 X86_64  abcdefg                                   Not Specified
Agent Managed - Reachable  hostname7.xyz.com  192.168.1.15                              Red Hat Enterprise Linux Server 7 X86_64  abcdefg  Not Specified
Agent Managed - Reachable  hostname8.xyz.com  192.168.1.16                              Red Hat Enterprise Linux Server 7 X86_64  abcdefg  Not Specified

@BurntSushi
Owner

@ursetto You probably need to use --separator to set tab as the only separator. column defaults to whitespace.

There is nothing significant about when rows get out of alignment; it isn't connected to the data. It's just an internal buffer filling up, at which point the alignment is reset. The reason this bug exists is that the underlying implementation does not allow itself to use memory without bound.

@ursetto

ursetto commented Oct 30, 2018

I'm guessing you are saying to replace xsv table with something like:

xsv fmt -t '\t'|column -s "$(echo -ne '\t')" -t

This seems to work fine as long as you can find a delimiter not in your data.

@owst

owst commented Oct 6, 2019

I recently encountered this issue too. Having taken a look, @BurntSushi was right - the cause is an internal buffer being filled, then flushed to the underlying TabWriter, which causes the layout to break.

The buffer in question is the internal buffer of rust-csv - when attempting to write a CSV record, if the internal buffer becomes full, flush is called: the buffer's contents are written to the underlying writer (in this case, an instance of TabWriter). In addition, the underlying writer is also flushed.

The effect of this flushing of TabWriter can be more easily demonstrated if we reduce the buffer size used by the CSV writer down here, from 32K to 14; then:

# 14th byte output to the tab writer is z
$ echo 'a,b\n1234,x\ny,z' | ./target/debug/xsv table
a     b
1234  x
y     z

# 14th byte output to the tab writer is the \t inserted after y
$ echo 'a,b\n12345,x\ny,z' | ./target/debug/xsv table 
a      b
12345  x
yz

# 14th byte output to the tab writer is y
$ echo 'a,b\n123456,x\ny,z' | ./target/debug/xsv table 
a       b
123456  x
y    z

# 14th byte output to the tab writer is the \n after x
$ echo 'a,b\n1234567,x\ny,z' | ./target/debug/xsv table 
a        b
1234567  x
y   z
  • The first example is as expected
  • The second is malformed: the TabWriter flush occurs just after a tab was written (which starts a new cell), and empty trailing cells are ignored. When the next field is written after TabWriter's reset, the fact that a separator was previously written (to end the y field) has been lost, so the next field starts immediately
  • The difference between the third and fourth examples is due to TabWriter seemingly interpreting a leading \t differently from one between fields (n.b. the 4 vs 3 spaces); see the xxd output and the Rust sketch that follows it:
    $ ./tabwriter-bin/target/debug/tabwriter <<< $'\tz' | xxd
    00000000: 2020 2020 7a0a                               z.
    $ ./tabwriter-bin/target/debug/tabwriter <<< $'y\tz' | xxd
    00000000: 7920 2020 7a0a                           y   z.
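
A rough Rust equivalent of the two tabwriter-bin invocations above (assuming the tabwriter crate's documented API; the exact widths are not asserted here):

// Sketch: render a string through TabWriter and print the result with
// escapes visible, so a leading \t can be compared with a \t between fields.
use std::io::Write;
use tabwriter::TabWriter;

fn render(input: &str) -> String {
    let mut tw = TabWriter::new(Vec::new());
    write!(tw, "{}", input).unwrap();
    tw.flush().unwrap();
    String::from_utf8(tw.into_inner().unwrap()).unwrap()
}

fn main() {
    println!("{:?}", render("\tz\n"));  // leading tab
    println!("{:?}", render("y\tz\n")); // tab between fields
}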
    

Having made some tweaks in a local build, this issue is resolved if the csv writer does not flush the underlying writer (TabWriter) when its internal buffer becomes full, but simply writes the buffer's contents out to the underlying writer (leaving it to flush as and when necessary). I'll raise an issue/PR against the csv library to do this.
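
In io::Write terms, the distinction is roughly the following; this is a sketch with a hypothetical BufferedCsvWriter type, not rust-csv's actual internals:

// Sketch only: the difference between the old and the proposed behaviour
// when a buffered CSV writer's internal buffer fills up. `wtr` stands in
// for the underlying writer (e.g. a TabWriter).
use std::io::{self, Write};

struct BufferedCsvWriter<W: Write> {
    buf: Vec<u8>,
    wtr: W,
}

impl<W: Write> BufferedCsvWriter<W> {
    // Old behaviour: drain the buffer and then flush the underlying writer,
    // which makes TabWriter emit and align only what it has seen so far.
    fn on_buffer_full_old(&mut self) -> io::Result<()> {
        self.wtr.write_all(&self.buf)?;
        self.buf.clear();
        self.wtr.flush() // this is the call that breaks the table layout
    }

    // Proposed behaviour: hand the bytes to the underlying writer and let it
    // decide when to flush, so TabWriter keeps accumulating rows.
    fn on_buffer_full_new(&mut self) -> io::Result<()> {
        self.wtr.write_all(&self.buf)?;
        self.buf.clear();
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut w = BufferedCsvWriter { buf: b"a\tb\n".to_vec(), wtr: io::stdout() };
    w.on_buffer_full_new()?; // write pending bytes without forcing a flush
    w.on_buffer_full_old()   // buffer is empty now; this just flushes stdout
}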

@llimllib
Author

llimllib commented Oct 6, 2019

Thanks for the much more detailed investigation, @owst

BurntSushi pushed a commit to BurntSushi/rust-csv that referenced this issue Jan 7, 2020
This fixes a bug where the CSV writer was erroneously flushing the
underlying writer when its internal buffer was full. Instead, when the
internal buffer is full, it should simply write the contents of the buffer
to the underlying writer. There is no need to flush it. Indeed, this causes
other problems such as those observed in
BurntSushi/xsv#151.

We are careful to ensure that calling `flush` explicitly still calls `flush`
on the underlying writer.

Fixes #173
@BurntSushi
Owner

While BurntSushi/rust-csv#174 fixes the underlying issue, I suspect it will be a bit of time before I get around to another xsv release. So if folks want this fix, then you'll need to do a cargo update -p csv and build xsv from source.

@llimllib
Author

llimllib commented Jan 7, 2020

Appreciate all your work, thanks for the update

@owst

owst commented Jan 11, 2020

Thanks @BurntSushi!

FYI, to build from source locally I had to give a more specific csv version in Cargo.toml to allow the update from 1.0.1 to 1.1.2 (without it, csv only updated to 1.0.7):

diff --git a/Cargo.toml b/Cargo.toml
index 051fb3d..eea20da 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -30,7 +30,7 @@ opt-level = 3
 [dependencies]
 byteorder = "1"
 crossbeam-channel = "0.2.4"
-csv = "1"
+csv = "1.1"
 csv-index = "0.1.5"
 docopt = "1"
 filetime = "0.1"

@apcamargo

@BurntSushi Any chance this will be fixed in the master branch sometime in the future?
