unzip 使用什么方法在存档中查找单个文件？

Question 1

在大型存档中搜索单个文件时，它使用方法 1，您可以使用以下命令查看该方法strace：

open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive:  dataset.zip\n", 22Archive:  dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET)           = 943722880
read(3, "\3\f\225P\\ux\v\0\1\4\350\3\0\0\4\350\3\0\0", 20) = 20
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET)           = 849346560
read(3, "D\262nv\210\343\240C\24\227\344\367q\300\223\231\306\330\275\266\213\276M\7I'&35\2\234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37
read(3, "\275\3279Y\206\223\217}\355W%:\220YNT\0\257\260z^\361T\242\2\370\21\336\372+\306\310"..., 8192) = 8192

unzip打开dataset.zip，查找末尾，然后查找存档中所请求文件的开头（rand-28.txt位于偏移量 849346560 处）并从那里读取。

通过扫描档案的最后65557字节找到中心目录；看代码从这里开始:

/*---------------------------------------------------------------------------
    Find and process the end-of-central-directory header.  UnZip need only
    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-
    central-directory record is 18 bytes, and signature itself is 4 bytes;
    add some to allow for appended garbage.  Since ZipInfo is often used as
    a debugging tool, search the whole zipfile if zipinfo_mode is true.
  ---------------------------------------------------------------------------*/

Answer

在大型存档中搜索单个文件时，它使用方法 1，您可以使用以下命令查看该方法strace：

open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive:  dataset.zip\n", 22Archive:  dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET)           = 943722880
read(3, "\3\f\225P\\ux\v\0\1\4\350\3\0\0\4\350\3\0\0", 20) = 20
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET)           = 849346560
read(3, "D\262nv\210\343\240C\24\227\344\367q\300\223\231\306\330\275\266\213\276M\7I'&35\2\234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37
read(3, "\275\3279Y\206\223\217}\355W%:\220YNT\0\257\260z^\361T\242\2\370\21\336\372+\306\310"..., 8192) = 8192

unzip打开dataset.zip，查找末尾，然后查找存档中所请求文件的开头（rand-28.txt位于偏移量 849346560 处）并从那里读取。

通过扫描档案的最后65557字节找到中心目录；看代码从这里开始:

/*---------------------------------------------------------------------------
    Find and process the end-of-central-directory header.  UnZip need only
    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-
    central-directory record is 18 bytes, and signature itself is 4 bytes;
    add some to allow for appended garbage.  Since ZipInfo is often used as
    a debugging tool, search the whole zipfile if zipinfo_mode is true.
  ---------------------------------------------------------------------------*/

Question 2

实际上它是一种混合物。 unzip 从已知位置读取一些数据，然后读取与 zip 文件中的目标条目相关（但不相同）的数据块。

zip/unzip 的设计在源文件的注释中进行了解释。这是来自的相关内容extract.c:

/*--------------------------------------------------------------------------- 
    The basic idea of this function is as follows.  Since the central di- 
    rectory lies at the end of the zipfile and the member files lie at the 
    beginning or middle or wherever, it is not very desirable to simply 
    read a central directory entry, jump to the member and extract it, and 
    then jump back to the central directory.  In the case of a large zipfile 
    this would lead to a whole lot of disk-grinding, especially if each mem- 
    ber file is small.  Instead, we read from the central directory the per- 
    tinent information for a block of files, then go extract/test the whole 
    block.  Thus this routine contains two small(er) loops within a very 
    large outer loop:  the first of the small ones reads a block of files 
    from the central directory; the second extracts or tests each file; and 
    the outer one loops over blocks.  There's some file-pointer positioning 
    stuff in between, but that's about it.  Btw, it's because of this jump- 
    ing around that we can afford to be lenient if an error occurs in one of 
    the member files:  we should still be able to go find the other members, 
    since we know the offset of each from the beginning of the zipfile. 
  ---------------------------------------------------------------------------*/

该格式本身主要源自 PK-Ware 的实现，总结如下编程信息文本文件。据此，中央目录中也有不止一种类型的记录，因此 unzip 无法轻松地转到文件末尾并创建一个条目数组来查找目标文件。

现在...如果您花时间阅读源代码，您会发现unzip读取 8192 字节的缓冲区（查找INBUFSIZ）。我只会对相当大的 zip 文件（我想到的是 Java 源代码）使用单文件提取，但即使对于较小的 zip 文件，您也可以看到缓冲区大小的影响。为了看到这一点，我压缩了 PuTTY 的 Git 文件，其中包含 2727 个文件（计算 git 日志的副本）。 Java 比 20 年前更大，并且没有缩小。从 zip 文件中提取该日志（选择该日志是因为它不会位于按字母顺序排序的索引的末尾，并且可能不是在从中央目录读取的第一个块中）给出了这个strace对于lseek通话：

lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET)           = 120463360
lseek(3, 120468731, SEEK_SET)           = 120468731
lseek(3, 120135680, SEEK_SET)           = 120135680
lseek(3, 270336, SEEK_SET)              = 270336
lseek(3, 120463360, SEEK_SET)           = 120463360

像往常一样，通过基准测试，ymmv。

Answer