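I need a simple way to check whether my files are valid documents (pdf, doc, docx, ppt, pptx, xls, xlsx, odt, ods, odp, and so on).
I can't use the file command, because its magic database does not work well at all. For example, this is the output I get for PDF files: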
sweb@sweb-laptop:/media/files/ebooks/PDF and CHM$ file --mime *.pdf
PHP 5 for Dummies.pdf: application/pdf; charset=binary
PHP 6 and MySQL 5 for Dynamic Web Sites.pdf: application/octet-stream; charset=binary
PHP6 and MySQL Bible.pdf: application/pdf; charset=binary
PHP6.pdf: application/octet-stream; charset=binary
PHP and MySQL for Dummies SE.pdf: application/pdf; charset=binary
I also tried abiword. It is a nice tool, but it will convert any input, whatever its format. It does not check that the input is a valid document:
abiword --to=txt --to-name=output.txt audio.mp3
So is there any command I can use to check for valid files?
Answer 1
Update your /usr/share/file/magic file?
#------------------------------------------------------------------------------
# pdf: file(1) magic for Portable Document Format
#
0 string %PDF- PDF document
>5 byte x \b, version %c
>7 byte x \b.%c
I would use hexdump to look at the first few bytes of the PDFs that are not being recognized correctly.
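As a rough sketch of that inspection, the helpers below dump the start of a file and test for the %PDF- signature that the magic rule above expects at offset 0. The function names are made up for illustration, and od -c stands in for hexdump only because it is available almost everywhere. The 1024-byte window is an assumption: one possible cause of the misdetection, not confirmed by the output in the question, is that the signature is preceded by a few stray bytes.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from file(1).

# Print the first 20 bytes of a file, one character per column.
show_header() {
    head -c 20 "$1" | od -c
}

# Succeed if the file starts with "%PDF-" at offset 0,
# which is what the magic rule "0 string %PDF-" requires.
starts_with_pdf_magic() {
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Succeed if "%PDF-" appears anywhere in the first 1024 bytes.
# The 1024-byte window is an assumption, not something file(1) checks.
has_pdf_magic_nearby() {
    head -c 1024 "$1" 2>/dev/null | LC_ALL=C grep -q '%PDF-'
}
```

If starts_with_pdf_magic fails on a file but has_pdf_magic_nearby succeeds, the signature is there but not at offset 0, which would explain an application/octet-stream result. If both fail, the file is probably not a PDF at all.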
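Update:
How you update the magic file depends on your operating system and distribution. Usually you would use the package manager. For example, on Red Hat Linux and its successor distributions you can run yum provides /usr/share/file/magic to find the package that contains the file, and then run sudo yum update <packagename>: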
...
$ yum provides /usr/share/file/magic
...
file-4.17-15.el5_3.1.x86_64 : A utility for determining file types.
Repo : installed
Matched from:
Other : Provides-match: /usr/share/file/magic
$ sudo yum update file
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: centos.mirroring.pulsant.co.uk
* extras: centos.mirroring.pulsant.co.uk
* rpmforge: nl.mirror.eurid.eu
* updates: centos.mirroring.pulsant.co.uk
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package file.x86_64 0:4.17-21 set to be updated
rpmforge/filelists_db | 5.9 MB 00:08
updates/filelists_db | 1.9 MB 00:03
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Updating:
file x86_64 4.17-21 base 320 k
Transaction Summary
================================================================================
Install 0 Package(s)
Upgrade 1 Package(s)
Total download size: 320 k
Is this ok [y/N]: y
Downloading Packages:
file-4.17-21.x86_64.rpm | 320 kB 00:02
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Updating : file 1/2
Cleanup : file 2/2
Updated:
file.x86_64 0:4.17-21
Complete!
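If you prefer, you can edit the magic file yourself with a text editor, after reading man magic and working out signatures for yourself by running hexdump -C -n 20 on sample document files.
If you do this, it is best to create a separate magic file first and test it with file's -m magicfile option.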
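A minimal sketch of that workflow follows. It writes the PDF rule shown earlier into a scratch magic file and runs file against it with -m, so the system magic file is never touched. The helper names and the scratch path are hypothetical, and you would add your own experimental rules to the scratch file.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Write a small test magic file to the path given as $1.
write_pdf_magic() {
    cat > "$1" <<'EOF'
# Test copy of the PDF rule; add experimental signatures below it.
0 string %PDF- PDF document
>5 byte x \b, version %c
>7 byte x \b.%c
EOF
}

# Identify file $2 using only the magic file $1 (file's -m option).
identify_with() {
    file -m "$1" "$2"
}
```

For example, write_pdf_magic /tmp/magic.test followed by identify_with /tmp/magic.test PHP6.pdf shows what the test rules alone make of one of the problem files.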
P.S.
$ file --mime `locate *.pdf`
/usr/share/doc/bind-9.3.6/arm/Bv9ARM.pdf: application/pdf
/usr/share/doc/libtheora-1.0alpha7/Theora_I_spec.pdf: application/pdf
/usr/share/doc/prelink-0.4.0/prelink.pdf: application/pdf
/usr/share/doc/samba-3.0.33/Samba3-ByExample.pdf: application/pdf
/usr/share/doc/samba-3.0.33/Samba3-Developers-Guide.pdf: application/pdf
/usr/share/doc/samba-3.0.33/Samba3-HOWTO.pdf: application/pdf
/usr/share/doc/speex-1.0.5/manual.pdf: application/pdf
/usr/share/ghostscript/8.70/examples/annots.pdf: application/pdf
/usr/share/gimp-print/doc/users-guide.pdf: application/pdf
Perhaps you could upload some of the files that file fails to identify correctly to a file-sharing site.