Estimating the compressibility of a file

Answer

Here is a (hopefully equivalent) Python version of Stéphane Chazelas's solution:

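python -c "
import sys
import zlib
from functools import partial
from itertools import islice

# Read the file in 4 KiB chunks and compress every 10th chunk.
# Single quotes are used inside the script so the shell's double quotes stay intact.
with open(sys.argv[1], 'rb') as f:
    compressor = zlib.compressobj()
    t, z = 0, 0.0  # t: sampled input bytes, z: compressed output bytes
    for chunk in islice(iter(partial(f.read, 4096), b''), 0, None, 10):
        t += len(chunk)
        z += len(compressor.compress(chunk))
    z += len(compressor.flush())
    # Guard against an empty file, where t stays 0.
    print(z / t if t else 'empty file')
" file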
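To get a feel for what the printed number means, the same sampling loop can be run on in-memory data. This is only a sketch: `sampled_ratio`, its parameters, and the test data are made up here for illustration and are not part of the answer above. A value near 1.0 (or slightly above) suggests the data is already compressed or random; a value near 0 suggests it compresses very well.

```python
import io
import os
import zlib


def sampled_ratio(stream, blocksize=4096, every=10):
    """Compress one block out of `every`; return compressed/uncompressed size."""
    compressor = zlib.compressobj()
    total_in = total_out = 0
    index = 0
    while True:
        chunk = stream.read(blocksize)
        if not chunk:
            break
        if index % every == 0:
            total_in += len(chunk)
            total_out += len(compressor.compress(chunk))
        index += 1
    # compress() may buffer output, so the flush is needed for a fair count.
    total_out += len(compressor.flush())
    return total_out / total_in if total_in else None


# Hypothetical test data: about 1 MB of repeated text and 1 MiB of random bytes.
repetitive = io.BytesIO(b"all work and no play " * 50000)
random_like = io.BytesIO(os.urandom(1 << 20))

print("repetitive:", sampled_ratio(repetitive))
print("random:    ", sampled_ratio(random_like))
```

The repeated text should give a ratio far below 1, while the random bytes should come out at roughly 1.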

Answer

For example, you could compress one block out of every 10 to get an idea:

perl -MIPC::Open2 -nE 'BEGIN{$/=\4096;open2(\*I,\*O,"gzip|wc -c")}
                       if ($. % 10 == 1) {print O $_; $l+=length}
                       END{close O; $c = <I>; say $c/$l}'

(Here with 4K blocks. The file is read from standard input or given as an argument.)

Answer

I had a multi-gigabyte file and wasn't sure whether it was already compressed, so I test-compressed the first 10 MB:

head -c 10000000 large_file.bin | gzip | wc -c

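The output is the compressed size in bytes; a value close to 10000000 suggests the data is already compressed. It isn't perfect, but it worked well for me.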
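A rough Python equivalent that returns the ratio directly, instead of a byte count to compare by hand, could look like the sketch below. `prefix_ratio` is a name invented here, and it uses the zlib module rather than the gzip command, so the result differs from the pipeline by a few header bytes. The temporary file exists only for the demonstration.

```python
import os
import tempfile
import zlib


def prefix_ratio(path, nbytes=10_000_000, level=6):
    """Compress only the first `nbytes` of `path`; return compressed/original size."""
    with open(path, "rb") as f:
        head = f.read(nbytes)
    if not head:
        return None  # empty file: no ratio to report
    return len(zlib.compress(head, level)) / len(head)


# Demonstration on a throwaway file of highly repetitive data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 100_000)
    demo_path = tmp.name

demo_ratio = prefix_ratio(demo_path)
os.remove(demo_path)
print(demo_ratio)
```

As with the shell version, this only looks at the start of the file, so it can mislead when the beginning is not representative of the rest.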

Answer

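Here is an improved Python version based on iruvar's great solution. The main improvement is that the script only reads from disk the blocks it actually compresses: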

import zlib

def Predict_file_compression_ratio(MyFilePath):
    blocksize = 4096 * 1  # Increase to read more bytes per block at once.
    sample_every = 10     # Compress one block out of every 10.

    # "rb" = read, binary
    with open(MyFilePath, "rb") as f:
        # Make a zlib compressor object and set the compression level.
        # 1 is fastest, 9 is slowest.
        compressor = zlib.compressobj(1)
        t, z, offset = 0, 0, 0

        while True:
            # Seek straight to the next sampled block, so the skipped
            # blocks are never read from disk.
            f.seek(offset)
            data = f.read(blocksize)
            # Stop when there is no more data.
            if not data:
                break
            t += len(data)                       # uncompressed bytes sampled
            z += len(compressor.compress(data))  # compressed bytes emitted so far
            offset += blocksize * sample_every

        # compress() may buffer output internally; flush the rest,
        # otherwise the compressed size is undercounted.
        z += len(compressor.flush())

    # Print the results, avoiding division by zero.
    if t != 0:
        print('Compression ratio: ' + str(z / t))
    else:
        print('Compression ratio: none, file has no content.')
    print('Compressed: ' + str(z))
    print('Uncompressed: ' + str(t))

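If high throughput is critical and an exact compression ratio matters less, you can use lz4 instead. This is useful when you only want to find out which files compress best while keeping CPU usage low. The module has to be installed with pip. In the Python code itself, you need little more than this: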

import lz4.block
z += len(lz4.block.compress(data))
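As a sketch of how those two lines might slot into the sampling loop: `lz4.block.compress` compresses each block in one shot, so there is no compressor object and no final flush. The names `compress_block` and `sampled_block_ratio` are hypothetical, and the zlib fallback is an assumption added here only so the example still runs where lz4 is not installed.

```python
import os
import tempfile
import zlib

try:
    import lz4.block  # third-party module, assumed to be installed with pip

    def compress_block(data):
        return lz4.block.compress(data)
except ImportError:
    # Fallback for this sketch only: zlib at its fastest level.
    def compress_block(data):
        return zlib.compress(data, 1)


def sampled_block_ratio(path, blocksize=4096, every=10):
    """Compress every `every`-th block on its own; return compressed/uncompressed."""
    t = z = 0
    offset = 0
    with open(path, "rb") as f:
        while True:
            f.seek(offset)  # skip the blocks that are not sampled
            data = f.read(blocksize)
            if not data:
                break
            t += len(data)
            z += len(compress_block(data))
            offset += blocksize * every
    return z / t if t else None


# Demonstration on a throwaway file of repetitive data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcdefgh" * 65536)
    demo_path = tmp.name

demo_ratio = sampled_block_ratio(demo_path)
os.remove(demo_path)
print(demo_ratio)
```

Because every block is compressed independently, with no history shared between blocks, the ratio will usually look somewhat worse than with the streaming zlib compressor. It is better suited to ranking files against each other than to predicting a final archive size.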

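Note that I have observed this script trashing standby memory (definitely on Windows), which degrades file performance, especially on machines with classic hard disk drives and when you run it on many files at once. This can be avoided by setting a low memory page priority on the script's Python process. I chose to do this on Windows with AutoHotkey.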
