How can I detect corrupted images after downloading and delete them?

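I scraped a large number of images in jpg, jpeg, or png format. However, some of them are corrupted and do not show up in the thumbnail view. I have already raised the thumbnail size limit on Linux to 100MB, so now only the corrupted images fail to show a thumbnail.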

How can I detect these images with Python code or a bash script?

For example, when I click one of the PNG files, its content looks like this:

<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 16.0.4, SVG Export Plug-In . SVG Version: 6.00 Build 0)  -->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
     width="3024px" height="3024px" viewBox="0 0 3024 3024" enable-background="new 0 0 3024 3024" xml:space="preserve">
<path d="M1515.778,205.669C981.54,205.669,476.37,135.396,0-0.008c0,1068.396,0,2756.137,0,3024.016
    c476.37-135.406,981.54-205.678,1515.778-205.678c528.163,0,1031.723,68.74,1508.222,205.678c0-1028.865,0-1919.085,0-3024.016
    C2547.501,136.957,2043.941,205.669,1515.778,205.669z M2353.338,463.804l-18.163,28.818
    c-17.094,27.133-32.388,50.271-53.773,82.474c-28.56,42.729-81.775,127.019-145.562,235.543
    c-17.617,29.904-39.284,66.361-62.221,104.957c-43.037,72.421-91.819,154.506-129.665,219.889
    c-15.9,27.728-32.095,55.903-48.402,84.298c-42.299,73.63-86.041,149.764-127.873,223.314
    c-43.399,76.25-85.883,150.915-128.288,225.84v75.033c0,104.111,2.157,217.242,6.057,318.561
    c1.859,46.055,3.741,128.156,5.733,215.074c2.375,103.506,4.829,210.529,7.438,264.551l0.779,16.256l0.094,1.971l-17.549-5.006
    c-6.877-1.961-13.856-3.748-20.918-5.383c-21.498-4.521-44.457-8.213-66.956-10.348c-13.755-1.125-27.735-1.691-41.9-1.691
    c-0.057,0-0.113,0-0.17,0c-0.056,0-0.104,0-0.165,0c-14.169,0-28.146,0.566-41.904,1.691c-22.488,2.135-45.451,5.826-66.952,10.348
    c-7.059,1.635-14.037,3.422-20.919,5.383l-17.549,5.006l0.098-1.971l0.779-16.256c2.608-54.021,5.063-161.045,7.435-264.551
    c1.995-86.918,3.877-169.02,5.729-215.074c3.907-101.318,6.061-214.449,6.061-318.561v-75.033
    c-42.405-74.926-84.89-149.59-128.292-225.84c-41.829-73.551-85.57-149.685-127.87-223.314
    c-16.311-28.395-32.497-56.57-48.405-84.298c-37.84-65.383-86.621-147.467-129.663-219.889
    c-22.94-38.596-44.604-75.053-62.22-104.957c-63.787-108.524-117.003-192.814-145.562-235.543
    c-21.385-32.204-36.68-55.341-53.777-82.474l-18.159-28.818l-0.117-0.188l32.957,9.44c42.194,12.089,85.16,17.965,131.348,17.965
    c46.015,0,90.265-5.927,131.51-17.611l9.983-2.829l5.037,9.076c81.625,147.26,300.785,507.46,431.723,722.674
    c45.156,74.209,80.898,132.967,98.715,162.765c0.061-0.102,0.121-0.207,0.181-0.309c0.061,0.102,0.128,0.208,0.188,0.309
    c17.812-29.798,53.562-88.556,98.712-162.765c130.941-215.214,350.1-575.415,431.726-722.674l5.029-9.076l9.99,2.829
    c41.246,11.685,85.491,17.611,131.51,17.611c46.188,0,89.15-5.876,131.345-17.965l32.96-9.44L2353.338,463.804z"/>
</svg>

But some images cannot be opened at all, like this:

[Image: "Could not load image" error]

[Image: text file icon]

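When I open the jpg images that have no thumbnail preview, they open as text files full of strange characters. My end goal is to delete these corrupted files automatically rather than removing them by hand while browsing thumbnails, because I have 10,000 images.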

Also, for example, I cannot see this image when I click on it, yet I get the following results:

$ identify 590.jpeg
590.jpeg JPEG 450x338 450x338+0+0 8-bit DirectClass 47.8KB 0.000u 0:00.000

>>> from PIL import Image
>>> im = Image.open("590.jpeg")
>>> im.verify()
>>> 

Update: PIL refuses to open the corrupted png file, but it does not detect the corrupted jpg/jpeg file:

>>> im = Image.open("722.png")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/PIL/Image.py", line 2590, in open
    % (filename if filename else fp))
OSError: cannot identify image file '722.png'

722.png

[jalal@goku media30_GV_Aug2018]$ identify 722.png
722.png SVG 3024x3024 3024x3024+0+0 16-bit DirectClass 2.61KB 0.000u 0:00.009

Answer 1

Try ImageMagick's identify command. From the man page:

identify describes the format and characteristics of one or more image files. It will also report if an image is incomplete or corrupt.

Example:

$ identify foo.png
identify: NotAPNGImageFile (foo.png).

$ echo $?
1
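Note that in the question, identify reports 722.png as a valid SVG, so a check based only on identify succeeding would not flag a file whose content does not match its extension. A minimal Python sketch of a complementary check follows. It compares the first bytes of each file with the well-known PNG and JPEG signatures. The names sniff_format and extension_matches_content are hypothetical helpers introduced here for illustration, not part of any library.

```python
from pathlib import Path

# Well-known leading bytes ("magic numbers") for the two formats of interest.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
}

# Which format each file extension is expected to contain.
EXPECTED = {".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg"}


def sniff_format(path):
    """Return 'png', 'jpeg', or None based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    for name, magic in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def extension_matches_content(path):
    """Return True only if the extension is known and the signature agrees."""
    expected = EXPECTED.get(Path(path).suffix.lower())
    return expected is not None and sniff_format(path) == expected
```

A signature match only shows that the file starts like a PNG or JPEG. It says nothing about whether the rest of the data is intact, so it works best as a cheap first filter before a full decode.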

Another approach is to use PIL (the Python Imaging Library):

from PIL import Image

im = Image.open("foo.png")
im.verify()

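From the documentation:

im.verify()

Attempts to determine if the file is broken, without actually decoding the image data. If this method finds any problems, it raises suitable exceptions. This method only works on a newly opened image; if the image has already been loaded, the result is undefined. Also, if you need to load the image after using this method, you must reopen the image file.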
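Because verify() does not decode the pixel data, it can pass a file whose header is intact but whose image data is damaged, which may be what happens with 590.jpeg in the question. The sketch below, which assumes Pillow is installed, runs verify() and then reopens the file and calls load() to force a full decode. The names is_broken and find_broken_images, and the set of extensions, are assumptions made for illustration.

```python
from pathlib import Path

from PIL import Image

# Extensions to examine; an assumption based on the formats named in the question.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def is_broken(path):
    """Return True if Pillow cannot open and fully decode the file."""
    try:
        with Image.open(path) as im:
            im.verify()  # structural check, no pixel decoding
        # verify() leaves the image object unusable, so reopen before decoding.
        with Image.open(path) as im:
            im.load()  # full decode; raises on truncated data by default
    except Exception:
        # Pillow raises several exception types (OSError, SyntaxError, ...),
        # so any failure is treated as "broken" in this sketch.
        return True
    return False


def find_broken_images(directory):
    """Return a sorted list of image paths under directory that fail to decode."""
    return sorted(
        p
        for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES and is_broken(p)
    )
```

One cautious way to use this is to print the result of find_broken_images("some_directory") first, review the list, and only then delete the files, for example by calling unlink() on each returned path.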
