Using sed to remove characters from an expression

2024-5-30

shell-script shell text-processing sed regular-expression


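I have a string of the form

|a Some text, letters or numbers. | Some other text letters or numbers |b some other part of text |c some other letters or numbers

Note that the bar can appear alone, as in "numbers. | Some other", or together with a character, as in "|a", "|b", "|c", etc., possibly all the way up to "|z".

But it could also be

| Title without any other bars

In other words, the number of bars is unknown.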

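I need to find two regular expressions to use with sed:

The first should find all the text between |a and |b, or between |b and |c, and so on.

For the first example string, for instance,

finding all text after |a but before |b yields:

Some text, letters or numbers. | Some other text letters or numbers

and finding all text after |b but before |c yields:

some other part of text

The second expression should find all text after |a but, instead of stopping at |b, simply remove every bar, whether it stands alone (|) or comes with a character (|a, |b, |c, etc.).

For the first example string, for instance:

Some text, letters or numbers. Some other text letters or numbers some other part of text some other letters or numbers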

Answer 1

Assuming GNU utilities and a data file named data,

grep -Po '(?<=\|a).*(?=\|b)' data

 Some text, letters or numbers. | Some other text letters or numbers

sed -r -e 's/^.*\|a//' -e 's/\|[a-z]?//g' data

 Some text, letters or numbers.  Some other text letters or numbers  some other part of text  some other letters or numbers 
 Title without any other bars

Change |a and |b to |c and |d, etc., as needed.
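For readers who would rather pass the marker letters as parameters than edit the pattern by hand, the same lookbehind/lookahead idea can be sketched in Python. This is only an illustration and not part of the original answer: the function name between is hypothetical, and the sample line is the one from the question.

```python
import re

def between(text, start, stop):
    """Return the text between the markers |start and |stop, or None.

    Hypothetical helper: mirrors grep -Po '(?<=\\|a).*(?=\\|b)' with the
    two marker letters passed in as arguments.
    """
    pattern = r"(?<=\|" + re.escape(start) + r").*(?=\|" + re.escape(stop) + r")"
    match = re.search(pattern, text)
    return match.group(0) if match else None

line = ("|a Some text, letters or numbers. | Some other text letters or numbers "
        "|b some other part of text |c some other letters or numbers")

# repr() makes the leading and trailing spaces visible
print(repr(between(line, "a", "b")))
print(repr(between(line, "b", "c")))
```

As with the grep command, the surrounding whitespace is kept, and a line without the requested pair of markers yields no match (None here).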

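Note that neither of these removes the whitespace around the |x markers, so your text will have leading and trailing spaces (which cannot be shown here). If you want those removed as well, you need to include them in the pattern: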

grep -Po '(?<=\|a ).*(?= \|b)' data
sed -r -e 's/^.*\|a ?//' -e 's/ ?\|([a-z] ?)?//g' data

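As written here, the sed command joins the individual segments together. If you want a space between them, just change the empty replacement // at the end of the second expression to / /.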
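To make the difference between the // and / / replacements concrete, here is a rough Python rendering of the two substitutions. It is a sketch under assumptions, not the answer's own code: strip_bars and its joiner parameter are hypothetical names, and the sample string is made up and shorter than the one in the question.

```python
import re

def strip_bars(text, joiner=""):
    """Hypothetical helper mirroring the two sed substitutions."""
    # Drop everything up to and including the "|a" marker and one following space
    text = re.sub(r"^.*\|a ?", "", text, count=1)
    # Replace every remaining bar, plus its optional letter and adjacent spaces
    return re.sub(r" ?\|([a-z] ?)?", joiner, text)

sample = "|a one |b two |c three"  # made-up sample line

print(strip_bars(sample))       # segments run together
print(strip_bars(sample, " "))  # segments separated by one space
```

One caveat applies to the pattern in both sed and Python: because the letter is optional, a lone bar followed directly by a lowercase letter is treated as a marker, so that letter is removed along with the bar.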

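Answer 2

It isn't clear to me whether you want the letters in the delimiters to be consecutive, so I went ahead and assumed you want to handle the harder case, where the delimiters must be consecutive (i.e. |a pairs with |b but not with |c). I'm not sure this can be done with a regular expression alone (at least not without a very verbose one). In any case, here is a simple Python script that handles this case: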

#!/usr/bin/env python3
"""parse.py"""

import sys
import re

def extract(string):
    """Returns the text between delimiters of the form `|START` and `|STOP`
    where START is a single ASCII letter and STOP is the next sequential
    ASCII character (e.g. `|a` and `|b` if START=a and STOP=b or
    `|x` and `|y` if START=x and STOP=y)."""

    # Find the opening delimiter (e.g. '|a' or '|b')
    start_match = re.search(r'\|[a-z]', string)
    if start_match is None:
        # No opening delimiter, so there is nothing to extract
        return ''
    start_index = start_match.start()
    start_letter = string[start_index+1]

    # Find the matching closing delimiter, searching after the opening one
    stop_letter = chr(ord(start_letter) + 1)
    stop_index = string.find('|' + stop_letter, start_index+2)
    if stop_index == -1:
        # No closing delimiter, so take everything up to the end of the line
        stop_index = len(string.rstrip('\n'))

    # Extract and return the substring
    return string[start_index+2:stop_index]

def remove(string):
    """Removes everything up to and including the opening delimiter, then
    removes every remaining bar along with the letter following it, if any."""

    # Find the opening delimiter (e.g. '|a' or '|b')
    start_match = re.search(r'\|[a-z]', string)

    # Remove everything up to and including the opening delimiter
    if start_match is not None:
        string = string[start_match.end():]

    # Remove the remaining bars, with or without a trailing letter
    string = re.sub(r'\|[a-z]?', '', string)

    # Return the updated string
    return string

if __name__ == "__main__":
    input_string = sys.stdin.readline()
    sys.stdout.write(extract(input_string) + '\n')
    sys.stdout.write(remove(input_string))
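The script handles only the first delimiter it finds. One possible extension, sketched here under assumptions and not taken from the answer, applies the same "next letter" pairing to every marker in the line. The name sequential_fields is hypothetical, the sample string is made up, and a bar counts as a marker only when a lowercase ASCII letter follows it directly.

```python
import re

def sequential_fields(text):
    """Map each marker letter to the text that follows it, up to the marker
    with the next letter (or the end of the string if that marker is absent)."""
    fields = {}
    for match in re.finditer(r"\|([a-z])", text):
        letter = match.group(1)
        # The closing delimiter is the bar followed by the next ASCII character
        stop = text.find("|" + chr(ord(letter) + 1), match.end())
        end = stop if stop != -1 else len(text)
        # Unlike the script, surrounding whitespace is trimmed here
        fields[letter] = text[match.end():end].strip()
    return fields

sample = "|a one | two |b three |c four"  # made-up sample line
for letter, value in sequential_fields(sample).items():
    print(letter, repr(value))
```

Lone bars stay inside the field they belong to, because only a bar followed by the expected next letter ends a field.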
