Select OEM/ANSI code page according to system locale setting #36

unxed · 2024-06-07T17:31:27Z

Fixes

https://sourceforge.net/p/sevenzip/bugs/2473/
https://sourceforge.net/p/sevenzip/bugs/1060/

LupinThidr · 2024-06-22T06:12:38Z

Hello, thank you for this. I've seen your progress across multiple issue trackers regarding Linux and 7zip path encoding.
I ran across your comments while trying to diagnose why 7zip is unable to decode Shift-JIS (CP932) encoded paths in LZH files.
I had thought that this patch would apply to those, but it seems ZipItem is specific to the .zip handler.
Other software such as unar doesn't support the .LZH spec as fully as 7zip does.
I think your iconv conversion should be implemented as a separate function in StringConvert, so it can be easily used with other classes.
I was able to make LzhHandler use it rather than the standard MultiByteToUnicodeString, but I don't completely understand the 7zip codebase, so I've hardcoded it to CP932 rather than taking an mcp argument.

// Requires <iconv.h> and <string.h>.
void MultiByteToUnicodeString3_iconv(UString &res, const AString &s)
{
  res.Empty();
  if (s.IsEmpty())
    return;

  iconv_t cd = iconv_open("UTF-8", "CP932");
  if (cd != (iconv_t)-1) {
    AString sUtf8;

    unsigned slen = s.Len();
    char* src = s.Ptr_non_const();

    unsigned dlen = slen * 4 + 1; // (source length * 4) + null termination
    char* dst = sUtf8.GetBuf_SetEnd(dlen);
    const char* dstStart = dst;

    memset(dst, 0, dlen);

    size_t slen_size_t = static_cast<size_t>(slen);
    size_t dlen_size_t = static_cast<size_t>(dlen);
    size_t done = iconv(cd, &src, &slen_size_t, &dst, &dlen_size_t);

    // The descriptor is no longer needed, whether or not the conversion succeeded
    iconv_close(cd);

    if (done != (size_t)-1) {
      // Null-terminate the result
      *dst = '\0';

      // Copy only the bytes that iconv actually wrote
      AString sUtf8CorrectLength;
      size_t dstCorrectLength = dst - dstStart;
      sUtf8CorrectLength.SetFrom(sUtf8, static_cast<unsigned>(dstCorrectLength));
      if (ConvertUTF8ToUnicode(sUtf8CorrectLength, res) /*|| ignore_Utf8_Errors*/)
        return;
    }
  }

  // iconv_open, iconv, or the UTF-8 decode failed. Falling back to default behavior
  MultiByteToUnicodeString2(res, s, 932);
}

and then in LzhHandler's GetProperty:

UString dst;
MultiByteToUnicodeString3_iconv(dst, item.GetName());

UString s = NItemName::WinPathToOsPath(dst);
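
As a rough illustration of the "separate function" idea, here is a minimal standalone sketch of a conversion helper that takes the source charset as an argument instead of hardcoding CP932. It uses std::string rather than 7-Zip's AString/UString, and the name ConvertToUtf8 is hypothetical (it is not part of 7-Zip or of this patch), so treat it as a starting point under those assumptions rather than as code for StringConvert.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Hypothetical helper (not part of 7-Zip): converts `src`, encoded in the
// charset named `fromCharset` (for example "CP932"), to UTF-8 in `out`.
// Returns false if iconv does not know the charset or the input is invalid.
bool ConvertToUtf8(const std::string &src, const char *fromCharset, std::string &out)
{
  out.clear();

  iconv_t cd = iconv_open("UTF-8", fromCharset);
  if (cd == (iconv_t)-1)
    return false;

  std::string in = src;          // iconv takes a non-const input pointer
  char *inPtr = &in[0];
  size_t inLeft = in.size();
  char buf[256];
  bool ok = true;

  while (inLeft > 0) {
    char *outPtr = buf;
    size_t outLeft = sizeof(buf);
    size_t r = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    int err = errno;             // save errno before any other call
    out.append(buf, sizeof(buf) - outLeft);
    // E2BIG only means the chunk buffer is full: loop and continue.
    // Any other error (EILSEQ, EINVAL) means the input is not valid.
    if (r == (size_t)-1 && err != E2BIG) {
      ok = false;
      break;
    }
  }
  // Note: a stateful source encoding would also need a final flush call;
  // CP932 is not stateful, so the flush is omitted in this sketch.

  iconv_close(cd);
  if (!ok)
    out.clear();
  return ok;
}
```

A caller such as the LZH handler could then pass "CP932" (or whatever charset name it derives from its code page argument) and fall back to the existing conversion when the function returns false.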

If you'd like to try implementing it for LZH, here's a sample file and the current / expected output of 7zz l
https://archive.org/download/narcissu/na_sabun.lzh

2007-06-03 22:25:18 .....          537          392  na_sabun/▒C▒▒▒▒▒▒.txt

2007-06-03 22:25:18 .....          537          392  na_sabun/修正差分.txt

(I noticed the Debian package maintainer for 7zz is Japanese, so he may have some experience)

unxed · 2024-09-01T19:48:21Z

> I don't completely understand the 7zip codebase

Unfortunately, I also don't understand the 7-zip codebase well enough. I'm afraid we need Igor Pavlov to tell us how to implement this.

Neustradamus · 2025-01-03T01:18:58Z

@ip7z: What do you think about this PR?

kattjevfel · 2025-01-14T12:51:58Z

This and the patch applied to the Debian package completely wreck Shift_JIS-encoded filenames.
normal 7zip: �k��s[mp4]
with patch: ûkÅÄ×ìs[mp4]
convmv'd: 北条時行[mp4]

The difference is that the first one can be fixed with convmv -f shift_jis -t utf8 -r <path>, but the second cannot, as it has already been improperly converted to UTF-8. FWIW this always happened with p7zip, but was fixed with 7zip (and now re-broken, hooray).

unxed · 2025-01-14T15:17:47Z

Will look at this, thanks!

unxed · 2025-01-14T15:18:26Z

Can you please attach a sample file with such name?

kattjevfel · 2025-01-14T15:58:40Z

Absolutely, here you go.
test.zip

without patch: 20240323��C�C���C���C�h/��p�K��i�ŏ��ɂ��ǂ݉��.]Read this first�j.TXT
with patch: 20240323é¿òùÿCâCââââCâââÅ¼ê½ûéâüâCâh/ùÿùpïKû±üiì┼Åëé╔é¿ôÃé¦ë║é│éó.]Read this firstüj.TXT
convmv'd: 20240323お風呂イチャイチャ小悪魔メイド/利用規約（最初にお読み下さい.]Read this first）.TXT

Also, FWIW, the current output with this patch is the same as what I get in Ark (KDE's archive GUI), even with an unpatched 7zip. But that's probably because they rely on how p7zip worked.

Unzipping with pure 7z or specifying encoding with unzip is currently the only way to get these filenames without corrupting them.

unxed · 2025-01-19T20:08:01Z

@kattjevfel please try to replace

    // Detect required code page name from current locale 
    char *lc = setlocale(LC_CTYPE, "");

    if (lc && lc[0]) {

with

    // Detect required code page name from current locale 
    char *lc = setlocale(LC_CTYPE, "");
    if (!lc || !lc[0]) {
      lc = getenv("LC_CTYPE");
    }

    if (lc && lc[0]) {

after applying this patch, then check your archive once again. Make sure your system locale is set to ja_JP, as 7zip has to somehow "guess" which legacy code page to use, and it uses the system locale for that.
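
The "guessing" step can be pictured as a lookup from the locale's language part to a legacy code page. The sketch below is only an illustration: the function name GuessLegacyCodePage and the small table are hypothetical and are not taken from the patch. The table lists the Windows ANSI code pages for a few languages; the OEM code page can differ (for Russian it is CP866, for example).

```cpp
#include <string>

// Hypothetical helper: maps a locale name such as "ja_JP.UTF-8" to the name
// of a legacy Windows ANSI code page. Returns an empty string if the language
// is not in the (deliberately tiny) illustrative table.
std::string GuessLegacyCodePage(const std::string &localeName)
{
  // The language part is everything before the first '_', '.', or '@'.
  const std::string lang =
      localeName.substr(0, localeName.find_first_of("_.@"));

  // Illustrative table only, NOT the table used by the patch.
  static const struct { const char *lang; const char *codePage; } kTable[] = {
    { "ja", "CP932"  },  // Japanese
    { "ko", "CP949"  },  // Korean
    { "ru", "CP1251" },  // Russian (ANSI)
    { "el", "CP1253" },  // Greek
    { "tr", "CP1254" },  // Turkish
  };

  for (const auto &e : kTable)
    if (lang == e.lang)
      return e.codePage;
  return std::string();
}
```

With a lookup like this, a value such as ja_JP.UTF-8 selects CP932, while C or an empty string selects nothing and leaves the default behavior in place.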

kattjevfel · 2025-01-21T16:36:04Z

Your patch does work after adding #include <cstdlib>, though having to run it with LANG=ja_JP.UTF-8 every time I encounter a zip packaged on a Japanese system is kind of annoying.

At that point I still prefer having them not be converted to UTF-8 by 7-zip at all and instead have an external tool identify and convert them, as that doesn't even require the locale to be installed.

unxed · 2025-01-21T18:21:28Z

After the change above, it should also work without the locale installed.

kattjevfel · 2025-01-21T19:39:43Z

You are right, when using LANG=ja_JP.UTF-8 without the locale installed, it behaves like without the patch entirely, which is at least better than the current state.

unxed · 2025-01-21T20:39:07Z

Use LC_CTYPE=ja_JP.UTF-8 to enable the patch.

unxed · 2025-03-03T13:21:39Z

Fix applied to Debian version also:
https://salsa.debian.org/debian/7zip/-/merge_requests/15

defrag257 · 2025-03-13T12:15:12Z

I think the implementation should mimic setlocale's behavior when locale is not installed:

Firstly, check LC_ALL, the environment variable with highest priority for overriding settings for all categories.
Then, check LC_CTYPE.
Finally, check LANG, the environment variable with lowest priority giving a default setting for all categories.

setlocale(3) - Linux manual page

defrag257 · 2025-04-12T13:30:16Z

// Detect required code page name from current locale
char *lc = setlocale(LC_CTYPE, "");
if (!lc || !lc[0]) {
  lc = getenv("LC_CTYPE");
}

should be:

// Detect required code page name from current locale
char *lc = setlocale(LC_CTYPE, "");
if (!lc || !lc[0]) {
  lc = getenv("LC_ALL");
}
if (!lc || !lc[0]) {
  lc = getenv("LC_CTYPE");
}
if (!lc || !lc[0]) {
  lc = getenv("LANG");
}

or, if you do not want to change the global setlocale state (recommended):

// Detect required code page name from current locale
char *lc = getenv("LC_ALL");
if (!lc || !lc[0]) {
  lc = getenv("LC_CTYPE");
}
if (!lc || !lc[0]) {
  lc = getenv("LANG");
}

unxed · 2025-04-12T17:43:45Z

@defrag257 pls check now

nhz2 · 2025-11-06T17:32:30Z

I checked in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT, and the behavior implemented in this PR is not correct according to the .ZIP specification.

Please follow the specification by default.

Also, this behavior doesn't make sense. The computer creating a .ZIP file may not be the same computer used to extract the file. The interpretation of the zip entry names is just not related to the system locale settings on the computer extracting the file.

Maybe this behavior can be optionally enabled by a command-line switch? But IMO it is not worth the added complexity and added iconv dependency.
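
For reference, the specification's signal for entry-name encoding is general purpose bit 11 (0x0800, the "language encoding flag"): when it is set, the file name and comment are UTF-8. The following is a minimal sketch of checking that bit in a local file header. The function name LocalHeaderNameIsUtf8 is hypothetical, and a real reader would also consult the central directory rather than only the local header.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if `buf` starts with a ZIP local file
// header ("PK\x03\x04") whose general purpose bit 11 (0x0800) is set,
// i.e. the entry name is declared to be UTF-8.
bool LocalHeaderNameIsUtf8(const unsigned char *buf, std::size_t size)
{
  // Signature (4 bytes) + version needed (2 bytes) + flags (2 bytes).
  if (size < 8)
    return false;
  if (buf[0] != 0x50 || buf[1] != 0x4B || buf[2] != 0x03 || buf[3] != 0x04)
    return false;

  // The general purpose bit flag is a little-endian 16-bit value at offset 6.
  const unsigned flags = buf[6] | (static_cast<unsigned>(buf[7]) << 8);
  return (flags & 0x0800u) != 0;
}
```

Any locale-based guessing would only come into play for entries where this check is false.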

unxed · 2025-11-06T17:39:19Z

But that's just the way Windows XP and some others do it

unxed · 2025-11-06T17:39:57Z

The most popular zip archive creator in the world

nhz2 · 2025-11-06T18:44:13Z

Maybe you can create a tool called something like "utf8myzip" that will fix up the encoding of a .ZIP file. As input it takes the nonportable .ZIP file and any information on the system and program that created that file. Then as output it creates a new .ZIP file with the UTF-8 flag set, and UTF-8 encoded entry names. People who want to deal with nonportable .ZIP files could first use that tool, and people who want to follow the spec can just use 7zip directly.

unxed · 2025-11-06T19:43:05Z

The thing is, a huge number of archivers on Windows use the Windows approach for compatibility. We just have to follow their example. Yes, it’s not a perfect solution, but the sender and receiver are speaking the same language with a probability very close to 1. There’s also unar, which tries to detect the encoding based on the frequency of certain characters or something like that. The risk of incorrect detection there is much higher than the chance that the sender and receiver don’t use the same language.

Also: libarchive outputs raw data as is, and without additional processing this results in invalid UTF-8 codes in the file system. Doesn’t look like an elegant solution.

nhz2 · 2025-11-06T21:42:21Z

Windows seems to have fixed this bug. When I create a zip archive using File Explorer now, I get a new UI.

And the created archive is marked as from Unix, and UTF-8 encoded:

Name: Å.txt
Folder: -
Size: 2
Packed Size: 4
Modified: 2025-11-06 21:08:40
Attributes:  -rw-rw-rw-
Encrypted: -
CRC: D8932AAC
Method: Deflate
Characteristics: ux UT:M:1 : Descriptor UTF8
Host OS: Unix
Version: 20
Volume Index: 0
Offset: 70
------------------------: 
Name: enctest\
Size: 2
Packed Size: 4
Folders: 0
Files: 1
CRC: D8932AAC
------------------------: 
Path: C:\Users\nzimm\testfiles\enctest.zip
Type: zip
Physical Size: 352
------------------------: 
------------------------:

Also, in terms of invalid UTF-8 codes in the file system: given the way most Linux filesystems work, this is not actually that bad, and since no data is lost, the filenames can be fixed up after extracting, if needed.

unxed added 2 commits June 7, 2024 19:30

Select OEM/ANSI code page according to system locale setting

445b1e3

Apply fix from https://salsa.debian.org/debian/7zip/-/merge_requests/14

4f339c6

unxed and others added 2 commits January 21, 2025 21:39

Merge branch 'ip7z:main' into main

299982c

do not break things if locale is not installed

7be55ef

kattjevfel added a commit to kattjevfel/scripts that referenced this pull request Jan 22, 2025

gallery-dl_postprocessing_selector: Work around ip7z/7zip#36

bbfe081

unxed added 2 commits February 23, 2025 16:30

added missing include

e807cf4

fix length calculation

b557f6f

fix by @defrag257

e60fa1f
