How Does Unzippopotamus Work?
This document explains the implementation approach of unzippopotamus. If you are looking for documentation on usage, you are probably trying to go here.
Understanding the zip File Format
The Problem with Unzip Streaming
This Library's Approach
Emitting Individual Files
Before we break down the technical implementation of unzippopotamus, there are some basic details about the zip file format worth covering first. The basic structure of a zip file is as follows:
[local file header 1]
[file data 1]
[data descriptor 1]
.
.
.
[local file header n]
[file data n]
[data descriptor n]
[archive decryption header] (EFS)
[archive extra data record] (EFS)
[central directory]
[zip64 end of central directory record]
[zip64 end of central directory locator]
[end of central directory record]
To put this into more human-readable terms, we can say that a zip contains two major sections: files and the central directory. The files section is a series of sandwiches, consisting of a local file header, the file data itself, and (sometimes) a data descriptor. The local file header generally contains the metadata necessary to understand how to unzip this file, with information like the file name and the compression method. The file data is just a series of raw bytes where the actual file that is in the zip is encoded.
The central directory, the other major section of a zip file, comes after all the file-sandwiches. It is a directory of all the contents of the zip, as well as important metadata about those files - you can think of it like a menu. Importantly, it's considered the authoritative source of that information, whereas the local file headers and data descriptors above are less so.
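To make the local file header concrete, here is a sketch of decoding its fixed-size fields. The byte offsets come from the zip file specification; the function and field names are my own for illustration, not unzippopotamus's actual internals:

```javascript
// Decode the fixed 30-byte portion of a local file header.
// Offsets follow the zip specification (PKWARE APPNOTE).
function parseLocalFileHeader(buf) {
  if (buf.readUInt32LE(0) !== 0x04034b50) {
    throw new Error('Not a local file header');
  }
  const fileNameLength = buf.readUInt16LE(26);
  const extraFieldLength = buf.readUInt16LE(28);
  return {
    versionNeeded: buf.readUInt16LE(4),
    flags: buf.readUInt16LE(6),
    compressionMethod: buf.readUInt16LE(8), // 0 = stored, 8 = deflate
    crc32: buf.readUInt32LE(14),
    compressedSize: buf.readUInt32LE(18),
    uncompressedSize: buf.readUInt32LE(22),
    fileName: buf.toString('utf8', 30, 30 + fileNameLength),
    // The file data begins after the variable-length name and extra field.
    dataStart: 30 + fileNameLength + extraFieldLength,
  };
}
```

Note that when the "data descriptor" flag is set, the sizes and CRC in this header may be zero, with the real values only appearing after the file data - one of the quirks that makes streaming tricky.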
Across different zip formats and styles the structure can be modified somewhat, and different flags can mean different sections are present or not present in the zip file. However, they all share this same general structure. For a deeper dive, you can read more about the zip file specification here: libzip zip file specification
Because the central directory is the authoritative source of information about how to unzip a zip file, it is considered to be necessary info for unzipping the file. However, because it's at the end of the file, and we are streaming from front to back, this poses an issue. Many libraries solve this by either A) not providing a streaming interface or B) storing whole files to memory or disk. However, there are many situations in which these limitations might not be acceptable (take an AWS lambda, for example, with limited memory/disk). For these situations, we have to step slightly outside the confines of the zip file spec and use the local file headers that precede each file instead of the central directory - we have to go "off menu" so-to-speak.
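For contrast, here is roughly what a non-streaming unzipper does to "read the menu": it scans backwards from the end of the file for the end-of-central-directory record, something that is impossible when bytes arrive front-to-back as a stream. The helper below is illustrative, not code from any particular library:

```javascript
// Scan backwards for the "end of central directory" record
// (signature 0x06054b50), which a seekable reader uses to locate
// the central directory. Field offsets follow the zip specification.
function findEndOfCentralDirectory(fileBytes) {
  // The record is at least 22 bytes; a trailing zip comment can push
  // its start earlier than (length - 22), hence the backwards scan.
  for (let i = fileBytes.length - 22; i >= 0; i--) {
    if (fileBytes.readUInt32LE(i) === 0x06054b50) {
      return {
        offset: i,
        entryCount: fileBytes.readUInt16LE(i + 10),
        centralDirOffset: fileBytes.readUInt32LE(i + 16),
      };
    }
  }
  return null; // record not found - not a usable zip
}
```

A streaming parser never has "the end of the file" in hand until the stream is over, which is exactly why unzippopotamus leans on the local file headers instead.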
However, this trade-off is not without its drawbacks. In fact, if you can at all avoid unzipping streams without using memory or disk, then you should absolutely not use unzippopotamus, and should instead use another library (I recommend yauzl).
There are some perfectly valid zip files that a "normal" unzip parser can handle yet unzippopotamus (or any stream unzipping library) cannot. With proper error handling, the risk of these files is limited to throwing an error while trying to unzip them. However, the fact remains that the file is technically unzippable, and yet unzippopotamus can't unzip it. For more information about this, check out the super informative section about unzip streaming in the yauzl documentation here.
Unzippopotamus under the hood is essentially just a finite state machine (FSM). By extending Node's Transform stream, it takes in data from the input zip stream one chunk at a time and uses it, plus the current state of the FSM, to process the data. For example, when in the LocalFileHeader state, unzippopotamus is looking to break down the metadata that precedes each file entry. Once it has the metadata it needs, it switches to the FileData state and uses that metadata to decompress, decrypt, or pass along the file data to an output stream. When it's done with that, it uses the signature bytes that follow to decide which state to enter next, and the whole thing repeats. A breakdown of the FSM is included in graph form below. (Note that the graph is for a broad-level understanding, and that the technical details in the code may be more nuanced.)
stateDiagram-v2
[*] --> LocalFileHeader
[*] --> InvalidZip: Local File Header not found
LocalFileHeader --> LocalFileHeader: Not enough bytes to read full header, fetch more
LocalFileHeader --> InvalidZip: Local File Header not valid
LocalFileHeader --> FileData: Header done, start reading the file data
FileData --> InvalidZip: File data invalid.
FileData --> OtherError: File data processing error (Decompression error, for example)
FileData --> Unknown: Done reading file data
Unknown --> LocalFileHeader: Local file header flag found, start reading next file header
Unknown --> CentralDirectoryHeader: Central directory flag found, start reading central directory
CentralDirectoryHeader --> [*]: Done reading zip file contents
%% Notes
note right of InvalidZip
This invalid zip error will happen when something
is wrong with the format of the zip file
end note
note right of OtherError
This error will happen when something
goes wrong while processing the file data,
most likely during decompression
end note
This approach works well with the nature of streams, as things happen "live" and the library has no clue what's coming next - only what's already happened, and what's happening right now. The FSM allows the library to detect when it's in faulty states and handle errors appropriately, resiliently handle tricky situations, and be performant and memory-conscious.
When an individual file is emitted from the unzip streamer, it does so in the form of an Entry. This class extends Node's PassThrough stream, meaning that it simply takes data in one end (from the unzip streamer) and passes it directly out the other (to the destination of your choice). This allows us to treat each entry from the zip file sort of like its own brand-new stream, instead of part of a larger one. Each Entry has metadata about the file already inside of it for use, and is fault tolerant (properly emitting events like 'end' and 'error').
The only tricky limitation to be aware of - and this is a limitation of unzip streaming itself - is that we have no way of knowing ahead of time how many files are in a given zip. Thus, we don't actually know how many instances of Entry we will see before the stream is done. What this means practically is that, although the abstraction of Entry gives us an easy way to interact with the individual files in the zip, we can't entirely forget about the zip file itself. If, for example, we wanted to wrap the unzipping process in a Promise so that we could use it with async, we would have to wrap the parent stream as opposed to an individual Entry, as the ending of an individual entry stream does not signify the ending of the entire unzipping process (even when you're pretty sure that's the last file).