Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-utf8 file names #42

Open
gulrotkake opened this issue Jul 11, 2024 · 12 comments
Open

Support non-utf8 file names #42

gulrotkake opened this issue Jul 11, 2024 · 12 comments
Labels
wontfix This will not be worked on

Comments

@gulrotkake
Copy link

If given a non-utf8 filename, such as

$ touch `echo -e "notutf8\xbf.dat"`

SftpSession::read_dir(path_to_directory) will fail to list the directory, as filename in DirEntry is of type string and try_get_string throws an error when encountering a non-utf8 string.

    fn try_get_string(&mut self) -> Result<String, Error> {
        let bytes = self.try_get_bytes()?;
        String::from_utf8(bytes).map_err(|_| Error::BadMessage("unable to parse str".to_owned()))
    }

A solution could be to use OsString or PathBuf, I'll try to write a PR, but I've barely started climbing the Rust learning curve :)

@TheBlindM
Copy link

I encountered the same problem

@AspectUnk
Copy link
Owner

Here it is described why we cannot use PathBuf. I can't say how appropriate it would be to use OsString. As a temporary solution we can use #39 because it allows to use .as_bytes() and leave further work to the end user but still this is not a panacea.

@AspectUnk AspectUnk added the bug Something isn't working label Jul 11, 2024
@gulrotkake
Copy link
Author

Would just using a Vec<u8> instead of String be a way to go? I suppose it's not too different from #39, if as_bytes() returns the original array, except it would make it very explicit that filenames are just a vector of bytes, but with the trade-off that the API becomes less ergonomic 🤷 .

@AspectUnk
Copy link
Owner

Yes

Would just using a Vec<u8> instead of String be a way to go?

but

API becomes less ergonomic

So I would like to find a more universal solution because I don't even like #39 that much

@kanpov
Copy link

kanpov commented Jul 11, 2024

Yes

Would just using a Vec<u8> instead of String be a way to go?

but

API becomes less ergonomic

So I would like to find a more universal solution because I don't even like #39 that much

OsString would be an alternative then?

@kanpov
Copy link

kanpov commented Jul 12, 2024

I could make a PR replacing with OsString @AspectUnk, don't see the disadvantage

@kanpov
Copy link

kanpov commented Jul 12, 2024

@AspectUnk Made a most definitely working implementation with OsString in #44

@AspectUnk
Copy link
Owner

AspectUnk commented Jul 13, 2024

After conducting research, I can say that UTF-16 support is not provided for in the specification because there is inconsistency regarding the size of a string character.

OsString like PathBuf is somewhat system-dependent, because its serialization will produce different results on different systems. The inconsistency in the size of the string character leads to packet integrity and byte order violations.

Linux:
image

Windows:
image

@gulrotkake Replacing String with Vec<u8> also won't yield results because the packet defines the number of characters, not bytes.
@kanpov The reason why your tests passed is because of the absence of tests on different systems.

Some leave the choice of encoding strings in UTF-16 to the end user, but warn that the remote server must also support it. It is possible to add this as a feature, but it is an extremely specific solution, so leaving #39 would be more appropriate. I may take care of it, but not very soon.

If I'm wrong about anything let me know.

@AspectUnk AspectUnk added wontfix This will not be worked on and removed bug Something isn't working labels Jul 13, 2024
@kanpov
Copy link

kanpov commented Jul 13, 2024

So it's pretty much a choice between not having utf-8 or not supporting windows @AspectUnk ? I'd rather have the latter, but maybe there's another way

@AspectUnk
Copy link
Owner

So it's pretty much a choice between not having utf-8 or not supporting windows @AspectUnk ? I'd rather have the latter, but maybe there's another way

Perhaps I misunderstood you, but it already supports utf-8 by default and at the same time windows, and #39 solves all problems with characters unsuitable for parsing.

@AspectUnk
Copy link
Owner

AspectUnk commented Jul 14, 2024

Some leave the choice of encoding strings in UTF-16 to the end user, but warn that the remote server must also support it.

Btw, along with this we can add support for the unfinished version of specification v6 (page 16) since it has a solution for sending charset.

@TheBlindM
Copy link

When will the version be released?😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants