Why doesn't the following return text/csv?
$ echo 'foo,bar\nbaz,quux' > temp.csv;file -b --mime temp.csv
text/plain; charset=us-ascii
I used this example for clarity, but I see the same problem with other CSV files.
$ file -b --mime '/Users/jasonswett/projects/client_work/gd/spec/test_files/wtf.csv'
text/plain; charset=us-ascii
Why doesn't it recognize a CSV as a CSV? Is there anything I can do to the CSV to make file return the "right" thing?
Answer 1
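MIME types are determined by what the Unix man pages call "magic numbers". Many file formats store a magic number near the start of the file that identifies the file's type and format. The excerpt below is from the man page of the file command: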
The magic number tests are used to check for files with data in partic-
ular fixed formats. The canonical example of this is a binary exe-
cutable (compiled program) a.out file, whose format is defined in
a.out.h and possibly exec.h in the standard include directory. These
files have a 'magic number' stored in a particular place near the
beginning of the file that tells the UNIX operating system that the
file is a binary executable, and which of several types thereof. The
concept of 'magic number' has been applied by extension to data files.
Any file with some invariant identifier at a small fixed offset into
the file can usually be described in this way. The information identi-
fying these files is read from the compiled magic file
/usr/share/file/magic.mgc , or /usr/share/file/magic if the compile
file does not exist. In addition file will look in $HOME/.magic.mgc ,
or $HOME/.magic for magic entries.
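The magic-number test described in this excerpt can be sketched in a few lines of Python. This is only an illustration of the idea, not how file is implemented: the SIGNATURES table and the sniff_magic helper are made-up names, and the table holds a handful of well-known signatures, whereas a real magic database holds far more.

```python
# Illustrative (offset, signature, MIME type) entries; hypothetical table,
# not taken from the real magic database.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"%PDF-", "application/pdf"),
    (0, b"GIF87a", "image/gif"),
    (0, b"GIF89a", "image/gif"),
    (0, b"PK\x03\x04", "application/zip"),
]


def sniff_magic(data: bytes):
    """Return the MIME type of the first matching signature, or None."""
    for offset, signature, mime in SIGNATURES:
        if data[offset:offset + len(signature)] == signature:
            return mime
    return None


print(sniff_magic(b"%PDF-1.7\n"))            # application/pdf
print(sniff_magic(b"foo,bar\nbaz,quux\n"))   # None
```

A CSV file is just delimited text, so there is no fixed byte sequence at a fixed offset for a test like this to match.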
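The man page also says that a file matching no magic entry is examined to see whether it looks like text. If it does, the best-fitting character set is reported (ASCII, ISO-8859-x, non-ISO 8-bit extended ASCII, and so on):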
If a file does not match any of the entries in the magic file, it is
examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-
ISO 8-bit extended-ASCII character sets (such as those used on Macin-
tosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Uni-
code, and EBCDIC character sets can be distinguished by the different
ranges and sequences of bytes that constitute printable text in each
set. If a file passes any of these tests, its character set is
reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are iden-
tified as ''text'' because they will be mostly readable on nearly any
terminal
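A rough Python sketch of that fallback, under simplifying assumptions: guess_text_charset is a hypothetical helper, it only separates ASCII, UTF-8 and "some other 8-bit encoding", and it ignores the UTF-16 and EBCDIC cases that the man page also mentions.

```python
def guess_text_charset(data: bytes):
    """Rough charset guess for data that matched no magic entry.

    Hypothetical helper: a simplification of the idea in the excerpt,
    not the algorithm that file actually uses.
    """
    if b"\x00" in data:
        return None  # NUL bytes: treated as binary in this sketch
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8bit"


print(guess_text_charset(b"foo,bar\nbaz,quux\n"))     # us-ascii
print(guess_text_charset("café".encode("utf-8")))     # utf-8
print(guess_text_charset("café".encode("latin-1")))   # unknown-8bit
```

This matches the output in the question: the CSV content is plain ASCII text, so the generic text/plain; charset=us-ascii answer is all that a charset check can give.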
Suggestion
Use the mimetype command instead of the file command:
mimetype temp.csv
Link for further reading:
http://unixhelp.ed.ac.uk/CGI/man-cgi?file
Answer 2
Unfortunately, there is probably nothing you can do to make file produce the right output.
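The file command tests the first few bytes of a file against a database of magic numbers. This is easy to check for binary files (such as images or executables), which have a specific identifier at the start of the file.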
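If the file is not binary, it checks the encoding and looks for certain specific words in the file to determine the type, but only for a limited number of file types (most of which are programming languages).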
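If you need to tell CSV apart from other text yourself, one option is a content check of your own. Below is a minimal sketch, assuming a comma delimiter and that "CSV" means at least two rows with the same number of fields (more than one). looks_like_csv is a hypothetical helper, and the heuristic will misjudge some files, for example single-column CSVs.

```python
import csv
import io


def looks_like_csv(text: str, delimiter: str = ",") -> bool:
    """Heuristic: every non-empty row has the same field count, above 1."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if len(rows) < 2:
        return False
    width = len(rows[0])
    return width > 1 and all(len(row) == width for row in rows)


print(looks_like_csv("foo,bar\nbaz,quux\n"))             # True
print(looks_like_csv("just some prose\nwithout any\n"))  # False
```

A check like this can be combined with the file extension when both are available, since neither signal is reliable on its own.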