-
Notifications
You must be signed in to change notification settings - Fork 216
Select OEM/ANSI code page according to system locale setting #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hello, thank you for this. I've seen your progress across multiple issue trackers regarding Linux and 7zip path encoding. void MultiByteToUnicodeString3_iconv(UString &res, const AString &s)
{
res.Empty();
if (s.IsEmpty())
return;
iconv_t cd;
if ((cd = iconv_open("UTF-8", "CP932")) != (iconv_t)-1) {
AString sUtf8;
unsigned slen = s.Len();
char* src = s.Ptr_non_const();
unsigned dlen = slen * 4 + 1; // (source length * 4) + null termination
char* dst = sUtf8.GetBuf_SetEnd(dlen);
const char* dstStart = dst;
memset(dst, 0, dlen);
size_t slen_size_t = static_cast<size_t>(slen);
size_t dlen_size_t = static_cast<size_t>(dlen);
size_t done = iconv(cd, &src, &slen_size_t, &dst, &dlen_size_t);
if (done == (size_t)-1) {
iconv_close(cd);
// iconv failed. Falling back to default behavior
MultiByteToUnicodeString2(res, s, 932);
return;
}
// Null-terminate the result
*dst = '\0';
iconv_close(cd);
AString sUtf8CorrectLength;
size_t dstCorrectLength = dst - dstStart;
sUtf8CorrectLength.SetFrom(sUtf8, static_cast<unsigned>(dstCorrectLength));
if (ConvertUTF8ToUnicode(sUtf8CorrectLength, res) /*|| ignore_Utf8_Errors*/)
return;
}
}and then in LzhHandler's GetProperty: UString dst;
MultiByteToUnicodeString3_iconv(dst, item.GetName());
UString s = NItemName::WinPathToOsPath(dst);If you'd like to try implementing it for LZH, here's a sample file and the current / expected output of (I noticed the Debian package maintainer for 7zz is Japanese, so he may have some experience) |
Unfortunately, I also don't understand the 7-zip codebase well enough. I'm afraid we need Igor Pavlov to tell us how to implement this. |
|
@ip7z: What do you think about this PR? |
|
This and the patch applied to the debian package completely wrecks shift_jis encoded filenames. Difference is that the first one can be fixed with |
|
Will look at this, thanks! |
|
Can you please attach a sample file with such name? |
|
Absolutely, here you go. without patch: Also fwiw the current output with this patch is the same I get in Ark (KDE's archive GUI), no matter if I have an unpatched 7zip. But that's probably them just relying on how p7zip worked. Unzipping with pure 7z or specifying encoding with unzip is currently the only way to get these filenames without corrupting them. |
|
@kattjevfel please try to replace to after applying this patch, then check you archive once again. Make sure your system locale is set to ja_JP as 7zip should somehow "guess" what legacy code page to use, so its using system locale for that. |
|
Your patch does work after adding At that point I still prefer having them not be converted to UTF-8 by 7-zip at all and instead have an external tool identify and convert them, as that doesn't even require the locale to be installed. |
|
After change above it should also work without locale installed. |
|
You are right, when using |
|
Use |
|
Fix applied to Debian version also: |
|
I think the implementation should mimic
|
// Detect required code page name from current locale
char *lc = setlocale(LC_CTYPE, "");
if (!lc || !lc[0]) {
lc = getenv("LC_CTYPE");
}should be: // Detect required code page name from current locale
char *lc = setlocale(LC_CTYPE, "");
if (!lc || !lc[0]) {
lc = getenv("LC_ALL");
}
if (!lc || !lc[0]) {
lc = getenv("LC_CTYPE");
}
if (!lc || !lc[0]) {
lc = getenv("LANG");
}or if you do not want to change the global setlocale status (recommended): // Detect required code page name from current locale
char *lc = getenv("LC_ALL");
if (!lc || !lc[0]) {
lc = getenv("LC_CTYPE");
}
if (!lc || !lc[0]) {
lc = getenv("LANG");
} |
|
@defrag257 pls check now |
|
I checked in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT, and the behavior implemented in this PR is not correct according to the .ZIP specification. Please follow the specification by default. Also, this behavior doesn't make sense. The computer creating a .ZIP file may not be the same computer used to extract the file. The interpretation of the zip entry names is just not related to the system locale settings on the computer extracting the file. Maybe this behavior can be optionally enabled by a command-line switch? But IMO it is not worth the added complexity and added |
|
But it's just a way Windows XP and some others do it |
|
The most popular zip archives creator in the world |
|
Maybe you can create a tool called something like "utf8myzip" that will fix up the encoding of a .ZIP file. As input it takes the nonportable .ZIP file and any information on the system and program that created that file. Then as output it creates a new .ZIP file with the UTF-8 flag set, and UTF-8 encoded entry names. People who want to deal with nonportable .ZIP files could first use that tool, and people who want to follow the spec can just use 7zip directly. |
|
The thing is, a huge number of archivers on Windows use the Windows approach for compatibility. We just have to follow their example. Yes, it’s not a perfect solution, but the sender and receiver are speaking the same language with a probability very close to 1. There’s also unar, which tries to detect the encoding based on the frequency of certain characters or something like that. The risk of incorrect detection there is much higher than the chance that the sender and receiver don’t use the same language. Also: libarchive outputs raw data as is, and without additional processing this results in invalid UTF-8 codes in the file system. Doesn’t look like an elegant solution. |

Fixes
https://sourceforge.net/p/sevenzip/bugs/2473/
https://sourceforge.net/p/sevenzip/bugs/1060/