Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

open and from yaml fail on files with byte-order-mark #13925

Open
dcarosone opened this issue Sep 25, 2024 · 3 comments
Open

open and from yaml fail on files with byte-order-mark #13925

dcarosone opened this issue Sep 25, 2024 · 3 comments
Labels
file-format Parsing/Writing of file formats/protocols needs-triage An issue that hasn't had any proper look

Comments

@dcarosone
Copy link

dcarosone commented Sep 25, 2024

Describe the bug

parsing yaml files with a leading UTF-8 byte order mark fails.

Note: I expect this is an upstream issue in the serde_yaml crate. However, there's no way to raise the issue there as the repo has been archived and is no longer maintained.

I have raised an issue in the serde_norway fork: cafkafk/serde-norway#8

There are best-effort maintenance efforts being done there, with important caveats about long-term committment; perhaps nushell should consider switching to that fork, or consider what else to do about that problem. I didn't find an open issue relating to that directly, or I'd have just added the link above there.

How to reproduce

"id: thing\nfoo: bar" | save --raw test1.yaml"id: thing\nfoo: bar" | into binary | bytes add 0x[ef bb bf] | save --raw test2.yamlopen test1.yaml
╭─────┬───────╮
│ id  │ thing │
│ foo │ bar   │
╰─────┴───────╯open test2.yaml
Error:   × Error while parsing as yaml
   ╭─[entry #8:1:6]
 1 │ open test2.yaml
   ·      ─────┬────
   ·           ╰── Could not parse '/tmp/test2.yaml' with `from yaml`
   ╰────
  help: Check out `help from yaml` or `help from` for more options or open raw data with `open --raw '/tmp/test2.yaml'`

Error: nu::shell::unsupported_input

  × Unsupported input
   ╭─[entry #8:1:1]
 1 │ open test2.yaml
   · ──┬─┬
   ·   │ ╰── value originates from here
   ·   ╰── Could not load YAML: did not find expected <document start> at line 2 column 1
   ╰────

# a workaround that removes the BOM againopen --raw test2.yaml | decode utf8 | from yaml
╭─────┬───────╮
│ id  │ thing │
│ foo │ bar   │
╰─────┴───────╯

Expected behavior

parse all my yaml files with just open **/*.yaml rather than needing to glob **/*.yaml | each { open --raw $in | decode utf8 | from yaml }

Configuration

key value
version 0.98.0
major 0
minor 98
patch 0
branch
commit_hash
build_os linux-x86_64
build_target x86_64-unknown-linux-gnu
rust_version rustc 1.80.1 (3f5fd8dd4 2024-08-06) (built from a source tarball)
cargo_version cargo 1.80.0 (376290515 2024-07-16)
build_time 1980-01-01 00:00:00 +00:00
build_rust_channel release
allocator mimalloc
features default, sqlite, trash
installed_plugins
@dcarosone dcarosone added the needs-triage An issue that hasn't had any proper look label Sep 25, 2024
@fdncred
Copy link
Collaborator

fdncred commented Sep 25, 2024

This is the entire point of the decode command. IIRC, it was built into open with --decode but we changed it to be more composable. I think we'd have to sniff the file before opening it to determine if it was utf8 or not. I'm on the fence and could go either way with this, for a pr or against one.

@sholderbach sholderbach added the file-format Parsing/Writing of file formats/protocols label Sep 25, 2024
@dcarosone
Copy link
Author

I have no issue in general with having to pass through decode for some cases, especially when another encoding is used without BOMs, and whenever using open --raw and wanting to be explicit. As for the convenience of having it as an argument to open rather than a separate composable command, well, one can always define a custom command to combine them again.

The issue here, specifically, is about from yaml (regardless of whether called explicitly or implicitly by open via filename matching). It should, explicitly according to the standard, handle byte order marks. That's the issue raised upstream (for some approximate value thereof).

One thing that may be coming into play, though, is that some commands (like open) produce output (according to describe) of type "byte stream". That's not listed in the manual as one of the types. It seems to be accepted by things (like from yaml) that are documented as accepting string only.

So we may very well be passing things that are not valid utf-8 strings to things that are expecting them, and contributing to the issue (regardless that it's very convenient most of the time). This is not the first time I've been a bit confused about how the signature checking actually works in practice as part of all this.

@fdncred
Copy link
Collaborator

fdncred commented Sep 26, 2024

I've made up my mind. I don't think we should implicitly do any conversion in open. If people need to decode a file, that's why we have decode. Maybe it should be made clearer in an error message though?

One other thing that's related, I noticed that nushell does have a problem with anything non-utf8 and sometimes decode is the wrong answer. I was looking at some ansi art and couldn't really find any way to open and display the files in nushell because they had some non-utf8 characters in them. Not sure what to do about this. I ended up running them through decode but it wasn't a perfect solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
file-format Parsing/Writing of file formats/protocols needs-triage An issue that hasn't had any proper look
Projects
None yet
Development

No branches or pull requests

3 participants