Splitting concatenated TIFF files

Question 1

If you are sure that the concatenated TIFFs are all little-endian files (magic number 49 49 2A 00), this Perl script should work. Invoke it like this:

perl foo.pl < file.tif

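#!/usr/bin/env perl
use strict;
use warnings;

# TIFF magic numbers, plus regex-escaped versions ('*' is a regex metacharacter)
my $big_endian = "MM\0*";
my $big_endian_regex = "MM\0\\*";
my $little_endian = "II*\0";
my $little_endian_regex = "II\\*\0";

# This script assumes every image in the stream is little-endian
my $tiff_magic = $little_endian;
my $tiff_magic_regex = $little_endian_regex;

my $n = 0;
my $fileprefix = "chunk";
my $buffer;

# Slurp all of standard input as raw bytes
binmode STDIN;
{ local $/ = undef; $buffer = <STDIN>; }

# split discards the delimiter, so the magic number is written back below
my @images = split /${tiff_magic_regex}/, $buffer;

for my $image (@images) {
    next if $image eq '';
    my $file = sprintf("$fileprefix.%02d.tif", $n++);
    open my $fh, ">", $file or die "open $file: $!";
    binmode $fh;
    print $fh $tiff_magic, $image or die "print $file: $!";
    close $fh or die "close $file: $!";
}

exit 0;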

Question 2

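I played around for a while and came up with another way of finding an arbitrary sequence of hex bytes in a file, so I thought I'd add a second answer. Assuming you have several TIFFs concatenated together in a file called manyTIFs, you can do this with just xxd and awk: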

#!/bin/bash

# Dump the concatenated TIFFs, one byte per line, so line number is byte offset
xxd -c1 manyTIFs | awk '
   BEGIN{
      sa[0]="49"          # sa = sought array. It contains the bytes we are seeking
      sa[1]="49"
      sa[2]="2a"
      sa[3]="00"
      si=0                # seek index, which item in "sa" we are looking for
   }
   { 
      byte=$2             # Pick up the hex byte, it is the second field, i.e. after the offset
      if(byte==sa[si]){   # if it's the one we are looking for
         si++             # look for next byte
         if(si==4){       # check if we have found all 4 bytes
            si=0          # restart the search
            print NR-4    # TIFF file started 4 bytes back
         }
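      } else if(byte==sa[0]){
         # Mismatch on a 49: a run of 49s still ends in "49 49", otherwise only
         # this byte counts, so a magic that overlaps a false start is not missed
         si=(si==2)?2:1
      } else {
         si=0             # restart the search
      }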
   }
'

It prints the byte offset at which each TIFF starts. I'll leave it as an exercise to feed those offsets into dd to do the actual cutting.
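One possible way to do that exercise, as a rough sketch rather than a finished tool: the function below reads ascending byte offsets on standard input, one per line, and cuts the file at each of them with dd. The function name cut_at_offsets and the output naming are made up for illustration, and bs=1 is chosen for simplicity, so it will be slow on large files.

```shell
# Hypothetical helper: cut FILE into PREFIX.NN.tif pieces that start at the
# byte offsets read from standard input (one per line, ascending).
cut_at_offsets() {
   file=$1
   prefix=$2
   size=$(($(wc -c < "$file")))
   n=0
   prev=
   # Append the file size so the last piece runs to the end of the file
   { cat; echo "$size"; } | while read -r off; do
      if [ -n "$prev" ]; then
         out=$(printf '%s.%02d.tif' "$prefix" "$n")
         # bs=1 makes skip and count plain byte counts
         dd if="$file" of="$out" bs=1 skip="$prev" count="$((off - prev))" 2>/dev/null
         n=$((n + 1))
      fi
      prev=$off
   done
}
```

If the xxd and awk script above were saved as, say, offsets.sh (a hypothetical name), it could be used as: ./offsets.sh | cut_at_offsets manyTIFs chunk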

If you want to see what the xxd command outputs, it looks like this, which is why awk looks at the second column:

Sample output

00000000: 49  I
00000001: 49  I
00000002: 2a  *
00000003: 00  .
00000004: 10  .
00000005: 6c  l
00000006: 04  .

Question 3

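I know that in a TIFF file the first 2 bytes are the ASCII characters "II" or "MM", which give the byte order (Intel or Motorola), and the next 2 bytes (a word) give the version, which should be decimal 42 (don't panic).

See, for example: http://www.fileformat.info/format/tiff/corion.htm

In your example you see II followed by 42, that is, Intel byte order and version 42.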
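As a small illustration of that header layout, here is a sketch that reads the first 4 bytes of a file and reports the byte order. The function name tiff_byte_order is invented for this example; it only looks at the header and does not validate the rest of the file.

```shell
# Hypothetical helper: report the byte order of a TIFF from its first 4 bytes
tiff_byte_order() {
   # od prints the bytes as hex; tr strips the spaces and newline od adds
   magic=$(dd if="$1" bs=1 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
   case $magic in
      49492a00) echo "little-endian" ;;   # "II", then 42 stored low byte first
      4d4d002a) echo "big-endian" ;;      # "MM", then 42 stored high byte first
      *)        echo "not a TIFF" ;;
   esac
}
```

Decimal 42 is 0x2a, which is why the version word appears as 2a 00 after "II" and as 00 2a after "MM".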

Question 4

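Since nobody was interested in my other answers, I thought I'd add a third, completely different approach.

First, make some big-endian TIFF, little-endian TIFF, JPG, PNG and GIF images with ImageMagick for testing: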

magick -size 640x480 xc:red image.gif
magick -size 640x480 xc:red image.jpg
magick -size 640x480 xc:red image.png
magick -size 640x480 xc:red -define tiff:endian=msb imageMSB.tif
magick -size 640x480 xc:red -define tiff:endian=lsb imageLSB.tif

Then concatenate them all into one big amorphous blob and check what we've got:

cat image* > blob

ls -l image* blob

-rw-r--r--    1 root     root       3692113 Oct 13 09:27 blob
-rw-r--r--    1 root     root           903 Oct 13 09:25 image.gif
-rw-r--r--    1 root     root          3888 Oct 13 09:25 image.jpg
-rw-r--r--    1 root     root           362 Oct 13 09:27 image.png
-rw-r--r--    1 root     root       1843480 Oct 13 09:26 imageLSB.tif
-rw-r--r--    1 root     root       1843480 Oct 13 09:26 imageMSB.tif

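Now for the answer: run binwalk, which shows the byte offset of every file in both decimal and hexadecimal. You can use those offsets with awk and dd to carve out your files, each with the correct extension: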

binwalk blob

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             GIF image data, version "89a", 640 x 480
903           0x387           JPEG image data, JFIF standard 1.01
4791          0x12B7          PNG image, 640 x 480, 1-bit colormap, non-interlaced
4926          0x133E          Zlib compressed data, best compression
5153          0x1421          TIFF image data, little-endian offset of first image directory: 1843208
1848633       0x1C3539        TIFF image data, big-endian, offset of first image directory: 1843208
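The awk and dd step could look roughly like the sketch below. It assumes the listing has exactly the column layout shown above, recognizes only the four image types that appear in it, and ignores every other line (such as the Zlib stream, which lies inside the PNG). The function name carve_from_binwalk and the carved.N.ext output names are invented for illustration.

```shell
# Hypothetical helper: carve BLOB using a binwalk listing read from stdin
carve_from_binwalk() {
   blob=$1
   size=$(($(wc -c < "$blob")))
   # Turn each recognized line into "offset extension"; add the blob size last
   awk -v size="$size" '
      $1 ~ /^[0-9]+$/ {
         ext = ""
         if ($3 == "GIF")  ext = "gif"
         if ($3 == "JPEG") ext = "jpg"
         if ($3 == "PNG")  ext = "png"
         if ($3 == "TIFF") ext = "tif"
         if (ext != "") print $1, ext
      }
      END { print size, "end" }
   ' | {
      prev=
      prevext=
      n=0
      # Each piece runs from its own offset up to the next recognized offset
      while read -r off ext; do
         if [ -n "$prev" ]; then
            dd if="$blob" of="carved.$n.$prevext" bs=1 skip="$prev" count="$((off - prev))" 2>/dev/null
            n=$((n + 1))
         fi
         prev=$off
         prevext=$ext
      done
   }
}
```

It would be invoked as: binwalk blob | carve_from_binwalk blob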

Note that you can extract the files even more simply with binwalk itself:

binwalk -e BIGBLOB.BIN

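For anyone who doesn't trust binwalk, or doesn't care to install it, just start a Docker alpine image and map the current directory on the host to /work in the container: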

docker run -it -v "$(pwd)":/work -w /work alpine:latest

Then, inside the container, run:

echo "https://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
apk update && apk add binwalk

Answer