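I am trying to get the file command to detect some Windows text files that were never meant to be classified by file. The best option seems to be a regex that matches line content, but I cannot find a single example of its use (the common keywords "file", "magic" and "regex" are no help in a Google-centric world). The man page is no help either.
Also, I cannot get ^ and $ to work.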
Both files start with:
Project Units: <stuff>
Units & Scale - <stuff>
<blank line>
The next line is a header that starts with either 4a) Object Point ID,Photo #, or 4b) Id,Name,
The magic rules I tried for this are:
0 string Project\040Units:
>2 regex ^Object\040point\040ID,Photo\040#, PhotoModeler 2D Export Table
0 string Project\040Units:
>2 regex ^Id,Name, PhotoModeler 3D Export Table
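That is, match "Project Units:" on the first line, then try the regex over at most 2+1 lines. The regex is anchored to the start of the line for speed.
This is Ubuntu 14.04 with file 5.14.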
Sample of file type 1 (first 10 lines only):
Project Units: meters
Units & Scale - active, Translation - active, Rotation - active

Object Point ID,Photo #,X (pixels),Y (pixels),Residual X,Residual Y,Residual Vector,Marking Type,Layer,Material,Tags
2,1,1429.187065,1456.427823,-0.164541,0.182824,0.245964,LSM circular,Default,White,
2,2,666.583514,1126.807078,-0.168174,0.109780,0.200833,LSM circular,Default,White,
2,3,716.264669,1196.788962,0.152059,0.082258,0.172882,LSM circular,Default,White,
2,4,674.145595,442.969428,0.119315,-0.050084,0.129401,LSM circular,Default,White,
2,5,330.056929,836.292587,0.048372,-0.022235,0.053238,LSM circular,Default,White,
2,6,1147.101715,39.253316,0.475434,-0.189514,0.511814,LSM circular,Default,White,
Sample of file type 2 (first 10 lines only):
Project Units: meters
Units & Scale - active, Translation - active, Rotation - active

Id,Name,Photos (used),X (project units),Y (project units),Z (project units),X Precision,Y Precision,Z Precision,Precision Vector Length,Tightness (percent),Tightness (project units),Angle (deg.),Control Name,RMS Residual (pixels),Largest Residual (pixels),Photo Largest Residual,Material,Layer,Tag,Type,Used In Processing,Frozen,# Constraints,Target Code,Target Bits,Ref. Check Label,Photos (marked),Color (R),Color (G),Color (B)
2," ","1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",0.285721,1.143037,-0.000990,0.000044,0.000043,0.000075,0.000097,0.037511,0.000682,85.604862,,0.261467,0.511814,6,White,Default,yes,no,0,n/a,n/a,,"1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",255,255,255
3," ","1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",0.428622,1.143108,-0.000230,0.000044,0.000042,0.000074,0.000096,0.033814,0.000615,86.326354,,0.222883,0.475602,6,White,Default,yes,no,0,n/a,n/a,,"1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",255,255,255
4," ","1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",0.142979,1.143124,-0.000840,0.000045,0.000044,0.000078,0.000100,0.030045,0.000546,84.468461,,0.239445,0.374918,16,White,Default,,Normal,yes,no,0,n/a,n/a,,"1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",255,255,255
5," ","1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",0.571353,1.143164,0.000784,0.000044,0.000042,0.000074,0.000096,0.027194,0.000494,86.593419,,0.213540,0.430629,6,White,Default,yes,no,0,n/a,n/a,,"1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",255,255,255
6," ","1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",0.000141,1.143101,-0.000885,0.000046,0.000045,0.000081,0.000103,0.035513,0.000646,82.937166,,0.291437,0.465014,16,White,Default,,Normal,yes,no,0,n/a,n/a,,"1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",255,255,255
7," ","1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",0.714058,1.143134,0.000247,0.000044,0.000043,0.000075,0.000097,0.030057,0.000547,86.326626,,0.221009,0.426056,6,White,Default,yes,no,0,n/a,n/a,,"1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21",255,255,255
Answer 1
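The file(1) man page only tells you how to run the command. For a description of magic patterns, see magic(5). However, its section on regex is not especially detailed. Extensive examples of its use can be found in the pattern files that ship with file: https://github.com/file/file/tree/master/magic/Magdir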
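Your main problem is that the caret needs escaping: \^ for start of line, \\^ for a literal ^. I have not worked out what special meaning an unescaped ^ has. Spaces can also be escaped with a backslash, which makes the pattern more readable.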
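You intended to limit the match to a narrow range of lines. regex accepts a /<length> option (placed after the word regex, not after the pattern) that limits where the search ends. If the length is followed by l, it counts lines instead of bytes. In my tests, /1l could only match an empty line; a non-empty line, even with a precise starting offset, needed at least /2l.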
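To make the idea of a line-limited search concrete, here is a minimal Python sketch of the intent behind regex/<N>l. It is an analogy only, not how file implements the option, and it does not reproduce the off-by-one behaviour observed in the tests above. The helper name search_first_lines and the sample text are made up for illustration.

```python
import re

def search_first_lines(pattern, text, max_lines):
    """Search for pattern only within the first max_lines lines of text."""
    # keepends=True keeps the newlines, so ^ and $ still see line boundaries.
    window = "".join(text.splitlines(keepends=True)[:max_lines])
    return re.search(pattern, window, flags=re.MULTILINE)

# Hypothetical sample shaped like the files in the question.
sample = (
    "Project Units: meters\n"
    "Units & Scale - active\n"
    "\n"
    "Object Point ID,Photo #,X (pixels)\n"
    "2,1,1429.187065\n"
)

# The header sits on line 4, so a 4-line window finds it and a 3-line one does not.
found_in_4 = search_first_lines(r"^Object Point ID", sample, 4) is not None
found_in_3 = search_first_lines(r"^Object Point ID", sample, 3) is not None
print(found_in_4, found_in_3)
```

Keeping the window small is what stops a data row far down the file from triggering a false match.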
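As for where the search starts, the offset is interpreted as a byte count, even with regex. (Before version 5.19, the documentation suggested it was interpreted as a line count, but that claim was removed without a matching code change, so I doubt it was ever accurate.) You can use the offset &0 to start searching from the end of the previous match, but that makes little difference when the previous match ends in the middle of the first line.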
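Also, "start of line" matches the start of the search range (that is, the position given by the offset), whether or not that is the start of a line in the file.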
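The same effect can be reproduced with Python's re module, which may help in picturing it. This is an analogy under the assumption that file hands the regex engine a buffer beginning at the offset; it is not a trace of file's code. The sample text and the offset value are chosen for illustration.

```python
import re

text = "Project Units: meters\nUnits & Scale - active\n"
line_start = re.compile(r"^Units", re.MULTILINE)

# Position of "Units:" inside the first line (character offset, which equals
# the byte offset for this ASCII text).
offset = 8

# Searching a buffer that begins at the offset: ^ matches at the buffer start,
# even though that position is mid-line in the original text.
in_slice = line_start.search(text[offset:])
slice_pos = offset + in_slice.start()

# Searching the full text from the same position: Python still knows the real
# line boundaries, so ^ only matches at the start of the second line.
in_full = line_start.search(text, offset)
full_pos = in_full.start()

print(slice_pos, full_pos)
```

The first search reports a "start of line" match in the middle of line 1, which is the behaviour described above; the second one is what a strict reading of ^ would give.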
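So, for stricter matching, you can use a full-line regex on each line and an &1 offset on the following match. That skips the previous newline and lands in the right place for \^ to work as expected. This is probably overkill for recognizing your custom file type.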
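Finally, you do not need to repeat the common part. A line indented with > is only tried after the line one level above it has matched, so several lines at the same level can share that parent and are each tried in turn.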
Putting all of this together:
0 regex/2l \^Project\ Units:.*$
>&1 regex/2l \^Units\ &\ Scale.*$
>>&1 regex/1l \^$
>>>&1 regex/2l \^Object\ Point\ ID Photo Modeler 2D export table
>>>&1 regex/2l \^Id,Name,Photos Photo Modeler 3D export table
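As a cross-check while debugging the rules, the same four-line structure can be tested outside file. The following is a minimal Python sketch that mirrors the logic of the rules above (line 1, line 2, blank line, then one of two headers). The function name classify and the HEADERS table are made up for illustration, and the sketch is not a substitute for the magic file.

```python
from typing import Optional

# Header prefixes and descriptions copied from the rules above.
HEADERS = {
    "Object Point ID": "Photo Modeler 2D export table",
    "Id,Name,Photos": "Photo Modeler 3D export table",
}

def classify(text: str) -> Optional[str]:
    """Return a description if text looks like one of the two export tables."""
    lines = text.splitlines()  # handles both \n and Windows \r\n line endings
    if len(lines) < 4:
        return None
    if not lines[0].startswith("Project Units:"):
        return None
    if not lines[1].startswith("Units & Scale"):
        return None
    if lines[2] != "":
        return None
    # Both header variants hang off the same three common lines.
    for prefix, description in HEADERS.items():
        if lines[3].startswith(prefix):
            return description
    return None

# Hypothetical sample with Windows line endings.
sample_2d = (
    "Project Units: meters\r\n"
    "Units & Scale - active\r\n"
    "\r\n"
    "Object Point ID,Photo #,X (pixels)\r\n"
)
print(classify(sample_2d))
```

If this sketch and file disagree on a sample, the difference usually points at an offset or escaping problem in the rule rather than at the data.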
Answer 2
One solution, thanks to @JigglyNaga, is to escape the caret. The snippet below is now part of my .magic file.
0 string Project\040Units:
>2 regex \^Id, PhotoModeler 3D Export Table
0 string Project\040Units:
>2 regex \^Object\040Point\040ID, PhotoModeler 2D Export Table