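I need to delete all text between two given strings in a text file. The strings may be on different lines. For example, given the following text file: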
@article{ginsberg_lifespan_2018,
title = {On the lifespan of three-dimensional abstract gravity water waves with vorticity},
abstract = {test1
test2 abstract {NS}
test3},
language = {en},
urldate = {2018-12-05},
author = {Ginsberg, Daniel},
month = dec,
year = {2018}
}
@article{higaki_two-dimensional_2017,
title = {On the two-dimensional steady {Navier}-{Stokes} equations related to flows around a rotating obstacle},
abstract = {We study the two-dimensional stationary Navier-Stokes equations with rotating effect in the whole space. The unique existence and the asymptotics of solutions are obtained without the smallness assumption on the rotation parameter.},
journal = {arXiv:1703.07372 [math]},
author = {Higaki, Mitsuo and Maekawa, Yasunori and Nakahara, Yuu},
month = mar,
year = {2017},
note = {arXiv: 1703.07372},
keywords = {Mathematics - Analysis of PDEs}
}
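I want to delete everything between "abstract =" and a "}," that is always at the end of a line, including these strings themselves. That is, I want the following output: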
@article{ginsberg_lifespan_2018,
title = {On the lifespan of three-dimensional abstract gravity water waves with vorticity},
language = {en},
urldate = {2018-12-05},
author = {Ginsberg, Daniel},
month = dec,
year = {2018}
}
@article{higaki_two-dimensional_2017,
title = {On the two-dimensional steady {Navier}-{Stokes} equations related to flows around a rotating obstacle},
journal = {arXiv:1703.07372 [math]},
author = {Higaki, Mitsuo and Maekawa, Yasunori and Nakahara, Yuu},
month = mar,
year = {2017},
note = {arXiv: 1703.07372},
keywords = {Mathematics - Analysis of PDEs}
}
I know this kind of question has been asked before, and I have tried the solutions that were posted. For example, I used
perl -0777 -pe 's/abstract = .*},\n/\n/gs'
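but this deletes the text between the first occurrence of "abstract =" and the last occurrence of "},", rather than between consecutive occurrences. This is what I get: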
@article{ginsberg_lifespan_2018,
title = {On the lifespan of three-dimensional gravity water waves with vorticity},
keywords = {Mathematics - Analysis of PDEs}
}
How can I correct this command to get the desired result?
Answer 1
$ sed '/abstract = .*},$/d; /abstract = /,/},$/d' <file
@article{ginsberg_lifespan_2018,
title = {On the lifespan of three-dimensional abstract gravity water waves with vorticity},
language = {en},
urldate = {2018-12-05},
author = {Ginsberg, Daniel},
month = dec,
year = {2018}
}
@article{higaki_two-dimensional_2017,
title = {On the two-dimensional steady {Navier}-{Stokes} equations related to flows around a rotating obstacle},
journal = {arXiv:1703.07372 [math]},
author = {Higaki, Mitsuo and Maekawa, Yasunori and Nakahara, Yuu},
month = mar,
year = {2017},
note = {arXiv: 1703.07372},
keywords = {Mathematics - Analysis of PDEs}
}
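This first tries to delete a complete single-line abstract entry and, if that does not apply, tries to delete a multi-line abstract entry. A multi-line entry is the set of lines from a line containing "abstract = " to the next line that ends with "},".
Annotated sed script: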
/abstract = .*},$/d # delete complete abstract line, skip to next input line
/abstract = /,/},$/d # delete multi-line abstract entry
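If you need to be more specific about the start string, you could, for example, use ^[[:blank:]]*abstract in place of the abstract part of these expressions. This would only allow spaces or tabs before "abstract =" on those lines.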
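The anchored variant described above can be sketched as follows. This is only an illustration under assumptions: the function name strip_abstract and the sample record are made up, and the fields are assumed to start at the beginning of a line, optionally after blanks.

```shell
# Hypothetical helper (not from the original answer): remove abstract fields
# from BibTeX read on standard input. The start pattern is anchored, so only
# lines where "abstract = " follows optional spaces or tabs are matched.
strip_abstract() {
    sed -e '/^[[:blank:]]*abstract = .*},$/d' \
        -e '/^[[:blank:]]*abstract = /,/},$/d'
}

# Made-up sample: the note line mentions "abstract = " in the middle of the
# line and should survive, while the real multi-line abstract field is removed.
sample='@article{key,
note = {see abstract = elsewhere},
abstract = {line one
line two},
year = {2018}
}'

printf '%s\n' "$sample" | strip_abstract
```

With the unanchored expressions, the note line in this sample would also be deleted, because it contains "abstract = " and ends with "},".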
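Answer 2
One solution with sed (for example) is to convert each of the start and end strings to a single character, so that we can use a regular expression that avoids (negates) one character: [^…].
Convert to one character (assuming that % (start) and # (end) cannot appear in your file; more on that later):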
<infile sed 's/abstract =/%/g; s/},\n/#/g'
Then we can select (and delete) from the first start character (%) up to the first end character (#) that follows it:
sed 's/%[^#]*#//g'
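The [^#]* is what keeps the match non-greedy: it cannot run past the first #.
Since some of the delimiter characters may still remain, we need to convert them back: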
sed 's/%/abstract =/g; s/#/},\n/g' # assuming GNU sed.
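Of course, all of the above must be applied to the whole file, since the patterns may be on different lines. So we first collect the whole file in the hold space: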
sed 'H;1h;$!d;g;'
As one complete command line:
<infile sed 'H;1h;$!d;g; s/abstract =/%/g; s/},\n/#/g;
s/%[^#]*#//g ;
s/%/abstract =/g; s/#/},\n/g'
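If the chosen characters may exist in the input file, we can pick other, unambiguous delimiters that will not be present in your text file.
The characters (bytes) with the values 01 and 02, known in ASCII as SOH (start of heading) and STX (start of text), are control characters that are very rare in text files. To use them, it is best to build a shell script: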
#!/bin/bash
start=$'\1'
end=$'\2'
startpattern='abstract ='
endpattern=$'},\\\n' # The newline needs a `\` for sed to work.
sed 'H;1h;$!d;g;
s/'"$startpattern"'/'"$start"'/g;
s/'"$endpattern"'/'"$end"'/g;
s/'"$start"'[^'"$end"']*'"$end"'//g;
s/'"$start"'/'"$startpattern"'/g;
s/'"$end"'/'"$endpattern"'/g' <infile
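Before relying on the two placeholder bytes, it may be worth checking that the input really is free of them. The following is a minimal sketch, assuming a POSIX shell and grep; the function name has_placeholder_bytes and the default file name are hypothetical and not part of the answer.

```shell
# Hypothetical pre-flight check: succeeds (exit status 0) if the file given
# as the first argument contains byte 01 (SOH) or byte 02 (STX).
has_placeholder_bytes() {
    # LC_ALL=C makes the bracket expression match raw bytes.
    LC_ALL=C grep -q "[$(printf '\001\002')]" "$1"
}

# Usage sketch: warn when the input already contains one of the bytes.
# "infile" is only an assumed default name.
input=${1:-infile}
if [ -r "$input" ] && has_placeholder_bytes "$input"; then
    echo "$input contains SOH or STX; pick other delimiters" >&2
fi
```

If the check reports a hit, the substitutions in the script above could turn unrelated bytes into "abstract =" or "}," when converting back, so different delimiters would be needed.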
Answer 3
You are right, this or a similar question has been asked here countless times. How far would
sed '/abstract.*{/ {:L; /}/{d; b;}; N; bL; }' file
get you? After matching abstract, it loops, if necessary, until a } is found.
Edit: to accommodate the modified request:
sed '/abstract.*{/ {:L; /},$/{d; b;}; N; bL; }' file
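The loop in the one-liner may be easier to follow when written out over several lines. This is a sketch of the same idea rather than the answer's exact command; the function name strip_abstract_loop and the sample record are made up for illustration.

```shell
# Hypothetical wrapper around a multi-line version of the loop.
strip_abstract_loop() {
    sed '
/abstract.*{/ {
    :L
    # Pattern space ends with "}," : the field is complete, delete it.
    /},$/ d
    # Otherwise append the next input line and test again.
    N
    b L
}'
}

# Made-up sample: "abstract" in the title is not followed by "{", so the
# title line is kept; the two-line abstract field is removed.
sample='@article{k,
title = {An abstract title},
abstract = {a {NS}
b},
year = {2018}
}'

printf '%s\n' "$sample" | strip_abstract_loop
```

Note that if the input ends before a line ending in "}," is found, what happens to the collected lines depends on the sed implementation.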
Answer 4
Your Perl code is almost there; it only needs a few tweaks:
perl -0777pe 's/abstract = .*?\},\n/\n/msg'
The /s flag makes . match newlines, and the ? in .*? makes the match non-greedy.