Support for "Unicode Path Extra Field (0x7075)" and "Unicode Comment Extra Field (0x6375)" #656

rougemeilland · 2021-08-21T10:16:00Z

Request

I would like to request one of the following features:

Support for "Unicode Path Extra Field (0x7075)" and "Unicode Comment Extra Field (0x6375)"
A way to get the CRC of the byte array underlying the entry name (and comment) in the "ZipEntry" class
A way to get the byte array underlying the entry name (and comment) in the "ZipEntry" class

Background of request

I'm using Windows (NTFS), and I mainly use "WinRAR" for compression and decompression.
Because my file names often contain multibyte characters, I often work with ZIP files whose entry names contain multibyte characters.
In most cases this causes no problem. In rare cases, however, an entry name cannot be converted correctly, because NTFS file names are Unicode while ZIP entry names are stored in Shift-JIS (for Japanese).
"WinRAR" adds a "Unicode Path Extra Field" when an entry name contains multibyte characters, presumably to avoid the problem described above.

Separately, I'm developing software for batch-processing ZIP files, and I'm thinking of using "SharpZipLib" for that.
Unfortunately, the current version of "SharpZipLib" doesn't seem to support "Unicode Path Extra Field". Fortunately, "SharpZipLib" exposes the "ITaggedData" interface, so I'm trying to implement the field myself.

However, the "Unicode Path Extra Field (0x7075)" specification (see section 4.6.9 of "https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT") places a condition on using the UTF-8 string contained in this extra field: the NameCRC32 value (a 4-byte integer) stored in the extra field must match the CRC of the byte array of the original entry name (the bytes behind the Name property of the ZipEntry class). The current version of "SharpZipLib" offers no way to get the byte array of an entry name or its CRC.
I can convert the value of the Name property of the ZipEntry class back to a byte array using the default Encoding, but the result doesn't always match the original byte array.

For these reasons, I would like a way to get the raw byte array of the entry name, or its CRC.
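The check described above can be sketched without any SharpZipLib types. This is only an illustration under stated assumptions: `UnicodePathSketch`, `Crc32`, and `TryGetUnicodeName` are hypothetical names, `fieldData` is assumed to be the payload of the 0x7075 field (after its tag ID and size), and `rawHeaderName` stands for the raw entry-name bytes that the library does not currently expose.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of SharpZipLib.
public static class UnicodePathSketch
{
    // Lookup table for the standard reflected CRC-32 (polynomial 0xEDB88320).
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint value = i;
            for (int bit = 0; bit < 8; bit++)
                value = (value & 1) != 0 ? (value >> 1) ^ 0xEDB88320u : value >> 1;
            table[i] = value;
        }
        return table;
    }

    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
            crc = (crc >> 8) ^ Table[(crc ^ b) & 0xFF];
        return crc ^ 0xFFFFFFFFu;
    }

    // Assumed payload layout: Version (1 byte), NameCRC32 (4 bytes, little-endian),
    // UnicodeName (remaining bytes, UTF-8).
    public static bool TryGetUnicodeName(byte[] fieldData, byte[] rawHeaderName, out string name)
    {
        name = null;
        if (fieldData == null || fieldData.Length < 5 || fieldData[0] != 1)
            return false;

        uint storedCrc = (uint)fieldData[1]
                       | ((uint)fieldData[2] << 8)
                       | ((uint)fieldData[3] << 16)
                       | ((uint)fieldData[4] << 24);

        // The UTF-8 name may be used only if the CRC of the raw header name matches.
        if (storedCrc != Crc32(rawHeaderName))
            return false;

        name = Encoding.UTF8.GetString(fieldData, 5, fieldData.Length - 5);
        return name.Length > 0;
    }
}
```

The important point is that the CRC is computed over the bytes exactly as stored in the ZIP header, not over a re-encoded string, which is why access to the raw name bytes matters.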

For reference, I am posting the source code of the "Unicode Path Extra Field (0x7075)" implementation that I am developing.
You are free to modify or redistribute this source code.

Note that "Unicode Comment Extra Field (0x6375)" for entry comments can be implemented in the same way; only the tag ID differs.
In that case, too, it must be possible to obtain the byte array (or its CRC) underlying the "Comment" property of the "ZipEntry" class.

Source code of "UnicodePathExtraField" class (sample under development)

You are free to modify or redistribute this source code.

using System;
using System.Text;
using ICSharpCode.SharpZipLib.Zip;

class UnicodePathExtraField
    : ITaggedData
{
    private const byte _supportedVersion = 1;
    private static Encoding _utf8Encoding;

    static UnicodePathExtraField()
    {
        // UTF8 encoding (without BOM)
        _utf8Encoding = new UTF8Encoding(false);
    }

    public UnicodePathExtraField()
    {
        CRC32 = 0;
        FullName = null;
    }

    public short TagID => 0x7075;

    public byte[] GetData()
    {
        if (FullName == null)
            return null;
        var writer = new ByteArrayOutputStream(); // A class similar to SharpZipLib's ZipHelperStream
        writer.WriteByte(_supportedVersion);
        writer.WriteUInt32LE(CRC32);
        writer.WriteBytes(_utf8Encoding.GetBytes(FullName));
        return writer.ToByteArray();
    }

    public void SetData(byte[] data, int index, int count)
    {
        var success = false;
        try
        {
            var reader = new ByteArrayInputStream(data, index, count); // A class similar to SharpZipLib's ZipHelperStream
            var version = reader.ReadByte();
            if (version != _supportedVersion)
                return;
            CRC32 = reader.ReadUInt32LE();
            FullName = _utf8Encoding.GetString(reader.ReadToEnd());
            if (string.IsNullOrEmpty(FullName))
                return;
            success = true;
        }
        finally
        {
            if (!success)
            {
                FullName = null;
                CRC32 = 0;
            }
        }
    }

    public UInt32 CRC32 { get; set; }

    public string FullName { get; set; }
}

How to use the "UnicodePathExtraField" class

You are free to modify or redistribute this source code.

var zipFilePath = @"(zip file path)";
using (var zipFile = new ZipFile(zipFilePath))
{
    foreach (ZipEntry entry in zipFile)
    {
        var utf8Name = entry.Name;
        using (var extraData = new ZipExtraData(entry.ExtraData))
        {
            if (!string.IsNullOrEmpty(entry.Name))
            {
                var unicodeEntryNameExtraData = extraData.GetData<UnicodePathExtraField>();
                if (unicodeEntryNameExtraData != null)
                {
                    // Ideally, the CRC would be calculated over the original byte array of the entry name.
                    // As a stopgap, the name bytes are re-created here from "entry.Name" with the default encoding.
                    var crc = CalculateCrc32(Encoding.Default.GetBytes(entry.Name));
                    // var crc = CalculateCrc32(entry.GetNameBytes());

                    // You may use the FullName property only if the CRCs match
                    if (crc == unicodeEntryNameExtraData.CRC32)
                        utf8Name = unicodeEntryNameExtraData.FullName;
                }
            }
        }
        Console.WriteLine(utf8Name);
    }
}

piksel · 2021-08-21T21:31:59Z · Collaborator

Instead of using those, you can just set the Unicode flag (bit 11). The docs for the extra fields you mentioned actually explicitly suggest you do that unless you need different encodings for comment and file name:

If both the File Name and Comment fields are UTF-8, the new General Purpose Bit Flag, bit 11 (Language encoding flag (EFS)), can be used to indicate that both the header File Name and Comment fields are UTF-8 and, in this case, the Unicode Path and Unicode Comment extra fields are not needed and SHOULD NOT be created.
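As a minimal sketch of what the quoted rule means for a reader, the following checks bit 11 of an entry's general purpose bit flag and picks the decoder accordingly. `NameEncodingSketch`, `IsUtf8`, and `DecodeName` are hypothetical names introduced for illustration, and the flag value and raw name bytes are assumed to have been read from the ZIP header by other code.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of SharpZipLib.
public static class NameEncodingSketch
{
    // General purpose bit flag, bit 11: language encoding flag (EFS).
    public const int LanguageEncodingFlag = 1 << 11; // 0x0800

    public static bool IsUtf8(int generalPurposeFlags)
    {
        return (generalPurposeFlags & LanguageEncodingFlag) != 0;
    }

    // Decodes the header name as UTF-8 when bit 11 is set; otherwise falls back
    // to a caller-supplied legacy encoding (for example, the system code page).
    public static string DecodeName(byte[] rawName, int generalPurposeFlags, Encoding legacyEncoding)
    {
        Encoding encoding = IsUtf8(generalPurposeFlags) ? Encoding.UTF8 : legacyEncoding;
        return encoding.GetString(rawName);
    }
}
```

When writing an archive, setting this bit and storing both name and comment as UTF-8 is the alternative to the extra fields that the specification text recommends.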

2 replies

rougemeilland Aug 22, 2021
Author

The purpose of the application I am developing is to batch-process existing ZIP files: apply various optimizations to them and save them again. For example:

In an ".epub" file (which is actually a ZIP file), move the "mimetype" entry to the beginning if it is not already there, and change it to "stored" if it is compressed.
Delete empty folders.
Delete unnecessary entries, such as implicitly created image thumbnail databases.
Warn the user, or automatically rename entries, when the display order of the entry list may depend on the execution environment or application because of how the entries are named (for example, whether the digits in '5 .jpg' and '099.jpg' are compared as numbers).
And of course, for entries ~~containing "Unicode Path Extra Field "or "Unicode Comment Extra Field"~~ that contain multibyte strings in their names, set bit 11 of the general purpose bit flag and save the entry name and comment again as UTF-8, because I think that is better for compatibility with other applications.

That is the application I am aiming to develop, and one of the challenges is correctly interpreting the "Unicode Path Extra Field" and "Unicode Comment Extra Field" of existing ZIP files.

I have no intention of adding "Unicode Path Extra Field" or "Unicode Comment Extra Field" to ZIP files. In practice, however, ZIP files containing these fields already exist, as do other applications that create such files.
Please don't misunderstand me on that point.

Addendum: ZipEntry.Name is converted to a string using the computer's default encoding (Shift-JIS for Japanese).
In rare cases, ZipEntry.Name and the entry name provided in "Unicode Path Extra Field" do not match, because Unicode and Shift-JIS characters do not always correspond one-to-one.
In most cases, the name provided in "Unicode Path Extra Field" matches the file name as it was before being stored in the ZIP file, because NTFS file names are Unicode.

piksel Oct 6, 2021
Collaborator

Yeah, allowing these fields to be read correctly (either in-library or with external code) should be supported. Especially as it's just a matter of adding a getter for the raw name/comment fields.

Sorry for misinterpreting your issue.

Numpsy · 2021-10-06T10:13:49Z · Collaborator

Are these the same fields mentioned in the (rather old) #33?

1 reply

piksel Oct 6, 2021
Collaborator

I think so. Or at least I think it applies to most fields of this type that are found in the wild.
