GHC needs to output UTF-8 data on Windows #738

Closed
snoyberg opened this issue Aug 9, 2015 · 15 comments

Comments

@snoyberg
Contributor

snoyberg commented Aug 9, 2015

In particular, this affects #734, where we want GHC to dump .hi file contents. This operation will fail on Windows if the file contains characters not supported by the code page. Ideally, we'd like to be able to tell GHC to simply output in UTF-8, but there does not seem to be any way to do that besides going through the standard code page auto-discovery on Windows, which will necessitate stack setting the code page for the entire console to 65001. I've already experimented with this extensively, and I think it's the right behavior, but I'm writing up an issue and then a pull request for posterity, and so that others can test before this is merged to master.

Pinging @borsboom due to #734. @ndmitchell as our resident Windows expert, any thoughts on this? Also pinging @fumieval, as this looks somewhat related to #422.

snoyberg self-assigned this Aug 9, 2015
snoyberg added this to the 0.2.0.0 milestone Aug 9, 2015
@snoyberg
Contributor Author

snoyberg commented Aug 9, 2015

I've created branch 738-windows-code-page and pull request #739. I'd appreciate it if others could test this. @borsboom I'm going to assign that PR to you; if it resolves the issues you're having with #734, please merge.

@ndmitchell
Contributor

Shouldn't GHC just call hSetEncoding stdout utf8? That's what I do in Hoogle and it seems to work fine. If you can't get GHC to change, setting the code page is one alternative. Writing to a temp file and then re-encoding to the console might be nicer, since changing the code page could have other effects later on.
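For readers following along, here is a minimal sketch of the `hSetEncoding stdout utf8` approach mentioned above, for a program whose source you control. The helper name `useUtf8Stdout` is made up for illustration, and the sample string uses numeric escapes for curly quotes and a few Hebrew letters so the source file itself stays ASCII.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

-- | Force UTF-8 on stdout, regardless of the console code page or locale.
-- (Hypothetical helper name; not taken from Hoogle or GHC.)
useUtf8Stdout :: IO ()
useUtf8Stdout = hSetEncoding stdout utf8

main :: IO ()
main = do
  useUtf8Stdout
  -- U+2018/U+2019 curly quotes and four Hebrew letters: characters that
  -- fall outside most legacy Windows code pages.
  putStrLn "\8216smart quotes\8217 and \1513\1500\1493\1501"
```

This only changes how the bytes are encoded by the program itself. Whether the console then renders them correctly still depends on the console's code page and font.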

@snoyberg
Contributor Author

snoyberg commented Aug 9, 2015

I definitely intend to open up a GHC bug report referring to this issue. Nonetheless, we'll need to come up with a stop-gap solution for the current situation. Writing to a temp file won't actually help, because the issue is that GHC is deciding what it should be generating based on this global-ish setting. One possible enhancement to this branch I can think of is to use bracket to restore the original code pages when stack exits.
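The bracket idea can be sketched roughly as follows. This is an illustration only, not the code in the branch: `withCodePage` is a hypothetical helper, and the getter and setter are passed in as parameters so the example runs anywhere. On Windows one would presumably pass something like `getConsoleOutputCP` and `setConsoleOutputCP` from the Win32 package's `System.Win32.Console`; here an `IORef` stands in for the console.

```haskell
import Control.Exception (bracket)
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Word (Word32)

-- | Run an action with the code page set to @cp@, then restore the
-- previous value, even if the action throws. (Hypothetical helper.)
withCodePage
  :: IO Word32            -- ^ how to read the current code page
  -> (Word32 -> IO ())    -- ^ how to set the code page
  -> Word32               -- ^ code page to use while the action runs
  -> IO a
  -> IO a
withCodePage getCP setCP cp action =
  bracket
    (do old <- getCP      -- remember the original value
        setCP cp
        pure old)
    setCP                 -- restore the original value on the way out
    (const action)

main :: IO ()
main = do
  -- An IORef stands in for the real console; 437 is just a sample value.
  fake <- newIORef 437
  inside <- withCodePage (readIORef fake) (writeIORef fake) 65001 (readIORef fake)
  after <- readIORef fake
  print (inside, after)
```

The restore step only runs if the process exits through normal Haskell exception handling, so a hard kill would still leave the console on the new code page.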

@ndmitchell
Contributor

Can't you write to the temp file with a binary handle? Changing the code page is a bit dangerous. Running a process in a finally seems a bit dodgy. It's OK if there is nothing better, but yuk. Also, Cygwin sometimes screws up the code page (we see it about 1 time in 10,000 at work).

@snoyberg
Contributor Author

snoyberg commented Aug 9, 2015

Yes, I agree it's terrible. But GHC is being run as an external process, so whether we open up a binary handle or not is irrelevant. The specific case here is we're running ghc foo.hs -ddump-hi -ddump-to-file, and GHC is unable to write out the file because it doesn't know how to encode the out-of-code-page characters.

@ndmitchell
Contributor

I guess not much choice then!

@snoyberg
Contributor Author

snoyberg commented Aug 9, 2015

Sigh. I was hoping you had some Win32 magic hidden away that would save us. Ideally, we'd be able to do something like create a new anonymous pipe and just set the code page on that new pipe. But from everything I've read, that's not the way Windows works.

@ndmitchell
Contributor

I think GHC should probably always talk UTF8 and ignore the code page. The default Windows approach is somewhat broken (being a remnant of history rather than a modern design; it's fixed in Windows CE).

@snoyberg
Contributor Author

snoyberg commented Aug 9, 2015

I'm in favor of that approach on all operating systems. The default of respecting either some semi-obscure environment variables (non-Windows) or some truly obtuse codepages (Windows) has led to lots of bugs in my experience. I'm sure this proposal will go over very well :/
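As a rough illustration of what "always UTF-8, ignore the locale" looks like inside a single Haskell program, the sketch below forces the process-wide default text encoding with `setLocaleEncoding` from `GHC.IO.Encoding` before doing any file I/O. The function name `roundTripUtf8` and the file name `utf8-demo.txt` are invented for the example, and this says nothing about how GHC itself would implement such a change.

```haskell
import GHC.IO.Encoding (setLocaleEncoding, utf8)

-- | Write a string to a file and read it back, after forcing the
-- process-wide default encoding to UTF-8. (Hypothetical helper.)
roundTripUtf8 :: FilePath -> String -> IO String
roundTripUtf8 path s = do
  setLocaleEncoding utf8   -- handles opened after this default to UTF-8
  writeFile path s
  s' <- readFile path
  -- Force the whole lazy read so the file handle is closed before returning.
  length s' `seq` pure s'

main :: IO ()
main = do
  -- Curly quotes plus Hebrew letters, written as numeric escapes.
  let sample = "\8216quoted\8217 \1513\1500\1493\1501"
  back <- roundTripUtf8 "utf8-demo.txt" sample
  -- The round trip should compare equal whatever the locale or code page.
  print (back == sample)
```

Note that the demo leaves `utf8-demo.txt` behind in the current directory.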

@ndmitchell
Contributor

GHC already assumes all input is UTF8 anyway, so this seems consistent. I would be in favour of GHC using the code page only to decide between normal quotes and "smart quotes", which always corrupt on Windows anyway.

@snoyberg
Contributor Author

snoyberg commented Aug 9, 2015

Actually, with this change, smart quotes and non-Latin characters (I tested with Hebrew) work just fine.

[Screenshot not preserved]

However, that's a bit of a cheat, since that's with ConEmu, not the standard console. With the standard console, I get the smart quotes but no Hebrew:

[Screenshot not preserved]

@ndmitchell
Contributor

To be honest, I've never liked the smart quotes, even when rendered as intended :)

@snoyberg
Contributor Author

snoyberg commented Aug 9, 2015

GHC issue created: https://ghc.haskell.org/trac/ghc/ticket/10762

@chrisdone
Member

The lack of Hebrew looks like a problem with the font rendering (no glyphs for Hebrew in that font, and no font fallback like Pango does), rather than with the encoding, I expect.

(In a world where things are logical, anyway. And this is Windows.)

borsboom added a commit that referenced this issue Aug 9, 2015
@snoyberg
Contributor Author

PR merged, closing
