Delete text between consecutive occurrences of two strings

I need to delete all text between two given strings in a text file. The strings may be on different lines. For example, in the following text file

@article{ginsberg_lifespan_2018,
    title = {On the lifespan of three-dimensional abstract gravity water waves with vorticity},
    abstract = {test1
test2  abstract {NS}

test3},
    language = {en},
    urldate = {2018-12-05},
    author = {Ginsberg, Daniel},
    month = dec,
    year = {2018}
}

@article{higaki_two-dimensional_2017,
    title = {On the two-dimensional steady {Navier}-{Stokes} equations related to flows around a rotating obstacle},
    abstract = {We study the two-dimensional stationary Navier-Stokes equations with rotating effect in the whole space. The unique existence and the asymptotics of solutions are obtained without the smallness assumption on the rotation parameter.},
    journal = {arXiv:1703.07372 [math]},
    author = {Higaki, Mitsuo and Maekawa, Yasunori and Nakahara, Yuu},
    month = mar,
    year = {2017},
    note = {arXiv: 1703.07372},
    keywords = {Mathematics - Analysis of PDEs}
}

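I would like to delete everything between "abstract =" and a "}," that is always at the end of a line, including these strings themselves. That is, I want the following output: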

@article{ginsberg_lifespan_2018,
    title = {On the lifespan of three-dimensional abstract gravity water waves with vorticity},
    language = {en},
    urldate = {2018-12-05},
    author = {Ginsberg, Daniel},
    month = dec,
    year = {2018}
}

@article{higaki_two-dimensional_2017,
    title = {On the two-dimensional steady {Navier}-{Stokes} equations related to flows around a rotating obstacle},
    journal = {arXiv:1703.07372 [math]},
    author = {Higaki, Mitsuo and Maekawa, Yasunori and Nakahara, Yuu},
    month = mar,
    year = {2017},
    note = {arXiv: 1703.07372},
    keywords = {Mathematics - Analysis of PDEs}
}

I know that questions of this kind have been asked before, and I have tried the posted solutions. For example, I used

perl -0777 -pe 's/abstract = .*},\n/\n/gs'

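But this deletes the text between the first occurrence of "abstract =" and the last occurrence of "},", rather than between consecutive occurrences. This is what I get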

@article{ginsberg_lifespan_2018,
    title = {On the lifespan of three-dimensional gravity water waves with vorticity},

    keywords = {Mathematics - Analysis of PDEs}
}

How can I correct this command to get the desired result?

Answer 1

$ sed '/abstract = .*},$/d; /abstract = /,/},$/d' <file
@article{ginsberg_lifespan_2018,
    title = {On the lifespan of three-dimensional abstract gravity water waves with vorticity},
    language = {en},
    urldate = {2018-12-05},
    author = {Ginsberg, Daniel},
    month = dec,
    year = {2018}
}

@article{higaki_two-dimensional_2017,
    title = {On the two-dimensional steady {Navier}-{Stokes} equations related to flows around a rotating obstacle},
    journal = {arXiv:1703.07372 [math]},
    author = {Higaki, Mitsuo and Maekawa, Yasunori and Nakahara, Yuu},
    month = mar,
    year = {2017},
    note = {arXiv: 1703.07372},
    keywords = {Mathematics - Analysis of PDEs}
}

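This first tries to delete a complete single-line abstract entry. If that does not apply, it tries to delete a multi-line abstract entry. A multi-line entry is the set of lines from a line containing "abstract = " to the next line that ends with "},".

Annotated sed script: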

/abstract = .*},$/d    # delete complete abstract line, skip to next input line
/abstract = /,/},$/d   # delete multi-line abstract entry

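If you need to be more specific about the starting string, you can for example use ^[[:blank:]]*abstract =  in place of abstract =  in these expressions. This allows only spaces or tabs before "abstract =" on those lines.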
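A minimal sketch of the anchored variant, assuming a small made-up input. The function name strip_abstract and the sample data are illustrative only and are not part of the original question. The sample includes a title line that itself contains "abstract = " and ends with "},", which the unanchored expressions would also delete:

```shell
# Hypothetical helper: delete abstract entries, allowing only blanks
# (spaces or tabs) before "abstract = " at the start of the line.
strip_abstract() {
    sed '/^[[:blank:]]*abstract = .*},$/d; /^[[:blank:]]*abstract = /,/},$/d'
}

# Made-up sample: a tricky title, one multi-line and one single-line abstract.
printf '%s\n' \
    '    title = {An abstract = example},' \
    '    abstract = {line one' \
    'line two},' \
    '    year = {2018},' \
    '    abstract = {single line},' \
    '    month = dec' | strip_abstract
```

With this input, only the title, year and month lines should remain.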

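Answer 2

One solution with sed (for example) is to convert each of the start and end strings into a single character, so that we can use a regex that excludes (negates) one character with [^…]

Convert to one character (assuming that % (start) and # (end) cannot appear in your file; more on this later):

<infile sed 's/abstract =/%/g; s/},\n/#/g'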

Then we can select (and delete) from the first start character (%) up to the first end character (#) that follows it:

sed 's/%[^#]*#//g'

The [^#] is what makes the match non-greedy, as required.

Since some delimiter characters may still remain, we need to restore them.
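To illustrate the difference, here is a small sketch on a made-up one-line input with two start/end pairs. The function names are hypothetical and exist only for this comparison:

```shell
# Negated class: each match stops at the nearest # after a %.
delete_nearest() { sed 's/%[^#]*#//g'; }

# Greedy .* for contrast: the match runs from the first % to the last #.
delete_greedy() { sed 's/%.*#//g'; }

printf '%s\n' 'keep1 %drop1# keep2 %drop2# keep3' | delete_nearest
# prints: keep1  keep2  keep3
printf '%s\n' 'keep1 %drop1# keep2 %drop2# keep3' | delete_greedy
# prints: keep1  keep3
```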

sed 's/%/abstract =/g; s/#/},\n/g'    # assuming GNU sed.

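Of course, all of the above has to be applied to the whole file at once, since the patterns may occur on different lines. So we collect the whole file in the hold space and then copy it back into the pattern space: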

sed 'H;1h;$!d;g;'
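A short sketch of why this step matters, using a made-up two-line input. The function names are hypothetical. Without collecting the file first, sed sees one line at a time, so a pattern containing \n never matches:

```shell
# Line by line: the pattern space never contains a newline, so nothing changes.
per_line() { sed 's/},\n/#/g'; }

# Whole file: H appends each line to the hold space, 1h starts it cleanly,
# $!d discards everything until the last line, g copies the hold space back.
whole_file() { sed 'H;1h;$!d;g; s/},\n/#/g'; }

printf 'a},\nb\n' | per_line      # prints the two input lines unchanged
printf 'a},\nb\n' | whole_file    # prints: a#b
```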

As one complete command line:

 <infile sed 'H;1h;$!d;g;  s/abstract =/%/g; s/},\n/#/g;
                           s/%[^#]*#//g ;
                           s/%/abstract =/g; s/#/},\n/g'

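If the chosen characters might exist in the input file, we can choose other, unambiguous delimiters that will not occur in your text file.

The characters (bytes) with values 01 and 02, called SOH (start of heading) and STX (start of text) in ASCII, are "control characters" and are very rare in text files. To use them, it is easier to build a shell script: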

 #!/bin/bash
 start=$'\1'
 end=$'\2'
 startpattern='abstract ='
 endpattern=$'},\\\n'         # The newline needs a `\` for sed to work.

 sed 'H;1h;$!d;g;
      s/'"$startpattern"'/'"$start"'/g;
      s/'"$endpattern"'/'"$end"'/g;
      s/'"$start"'[^'"$end"']*'"$end"'//g;
      s/'"$start"'/'"$startpattern"'/g;
      s/'"$end"'/'"$endpattern"'/g'  <infile

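Answer 3

You are right, this or similar questions have been asked here countless times. How far would

sed '/abstract.*{/ {:L; /}/{d; b;}; N; bL; }' file

get you? After matching abstract, it loops, if necessary, until a } is found.

EDIT: to allow for the modified request: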

sed '/abstract.*{/ {:L; /},$/{d; b;}; N; bL; }' file

Answer 4

Your Perl code is almost there; it only needs a few adjustments:

 perl -0777pe 's/abstract = .*?\},\n/\n/msg'

The /s flag makes . match newlines, and using .*? instead of .* makes the match non-greedy.
