file command apparently returning the wrong MIME type


Why doesn't the following return text/csv?

$ echo 'foo,bar\nbaz,quux' > temp.csv;file -b --mime temp.csv
text/plain; charset=us-ascii

I used this example for clarity, but I'm seeing the same problem with other CSV files as well.

$ file -b --mime '/Users/jasonswett/projects/client_work/gd/spec/test_files/wtf.csv'
text/plain; charset=us-ascii

Why doesn't it recognize a CSV as a CSV? Is there anything I can do to the CSV to make file return the "right" thing?

Answer 1

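MIME types are determined by what the Unix man page calls "magic numbers". Many file formats carry a magic number that identifies the file type and format; plain-text formats such as CSV generally do not. The excerpt below is from the file command's man page: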

The magic number tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable (compiled program) a.out file, whose format is defined in a.out.h and possibly exec.h in the standard include directory. These files have a 'magic number' stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of 'magic number' has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. The information identifying these files is read from the compiled magic file /usr/share/file/magic.mgc, or /usr/share/file/magic if the compile file does not exist. In addition file will look in $HOME/.magic.mgc, or $HOME/.magic for magic entries.

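The man page also says that if a file matches none of the magic entries, it is examined to see whether it looks like text, and if so its character set (ASCII, ISO-8859-x, non-ISO 8-bit extended ASCII, and so on, whichever fits best) is reported: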

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as ''text'' because they will be mostly readable on nearly any terminal
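
To make the two stages described in these excerpts concrete, here is a minimal Python sketch of the idea. It is not how file is actually implemented: the signature table and the function name sniff_mime are hypothetical, and the real magic database is far larger and supports offsets other than zero. It only shows why a CSV, which has no fixed identifier, falls through to the generic text result.

```python
# Hypothetical, heavily simplified stand-in for the compiled magic database.
# Each entry pairs a byte signature expected at offset 0 with a MIME type.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF8", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from the leading bytes of a file (illustrative only)."""
    # Stage 1: magic-number test against the known signatures.
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime

    # Stage 2: nothing matched, so check whether the bytes decode as text
    # and report the first character set that fits.
    for codec, label in (("ascii", "us-ascii"), ("utf-8", "utf-8")):
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        return f"text/plain; charset={label}"

    # Neither a known signature nor decodable text.
    return "application/octet-stream"


print(sniff_mime(b"foo,bar\nbaz,quux\n"))
```

A comma-separated payload has no signature in stage 1, so it is reported as plain ASCII text, which matches what the question observed.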

Suggestion

Use the mimetype command instead of the file command:

mimetype temp.csv
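
A name-based lookup can succeed where content sniffing does not, because the .csv extension itself carries the information. As a rough Python illustration of that idea (this is Python's standard mimetypes module, not the mimetype command, and the helper name mime_from_name is made up for this sketch):

```python
import mimetypes


def mime_from_name(filename: str) -> str:
    """Look up a MIME type from the file extension alone (illustrative only)."""
    # A fresh MimeTypes() instance uses only Python's built-in extension table,
    # so the result does not depend on system-wide mime.types files.
    table = mimetypes.MimeTypes()
    mime, _encoding = table.guess_type(filename)
    # Fall back to a generic type when the extension is unknown.
    return mime or "application/octet-stream"


print(mime_from_name("temp.csv"))
```

The file's contents are never read here, so this says nothing about whether the data really is well-formed CSV.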

Link for further reading:

http://unixhelp.ed.ac.uk/CGI/man-cgi?file

Answer 2

Unfortunately, there is probably nothing you can do to make file produce the right output.

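The file command tests the first few bytes of a file against a database of magic numbers. This is easy to check for binary files (such as images or executables), which carry a specific identifier at the start of the file.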

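If the file is not binary, file checks its encoding and looks for certain specific words in the file to determine the type, but only for a limited number of file types (most of them programming languages).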
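
Since CSV has no keyword or identifier to look for, any content-based check has to be a heuristic. If one is needed, a possible sketch in Python uses the standard csv.Sniffer class. The function name looks_like_csv and the candidate delimiter set are assumptions made for this example, and the heuristic can misfire, for instance on prose that happens to contain the same number of commas on every line.

```python
import csv


def looks_like_csv(sample: str, delimiters: str = ",;\t") -> bool:
    """Heuristically decide whether a text sample is delimiter-separated."""
    try:
        # Sniffer.sniff raises csv.Error when it cannot find a consistent
        # delimiter among the candidates.
        csv.Sniffer().sniff(sample, delimiters=delimiters)
    except csv.Error:
        return False
    return True


print(looks_like_csv("foo,bar\nbaz,quux\n"))
print(looks_like_csv("hello world\nplain text\n"))
```

This runs on decoded text, so it would sit after a step that has already established the file is text in a known encoding.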
