How to remove the newlines between the data lines of each record located between two patterns?

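I have a large file to parse and reformat, preferably with sed (under bash). The file contains repeated sequences that begin with PATTERN_START and end with PATTERN_END. These sequences are mixed with other text that I must keep unchanged. Each sequence contains several records (numbered from 1 to n, where n can be from 1 to 12). A record is a group of lines that begins with a line of the form Record i, where i is an integer between 1 and n, and ends at the next such line (Record (i+1)) or at a PATTERN_END line. A record can be from 1 to 30 lines long.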

Here is a generic representation of the input file:

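unrelated data            (possibly many lines)
PATTERN_START                                              ⎤
Record 1                       ⎤                           |
data for Record 1              |  (up to 30 lines)    ⎤    |
      ︙                       ⎦                      |    |  (repeated many times)
Record 2                          (up to 12 records)  |    |
data for Record 2                                     ⎦    |
PATTERN_END                                                ⎦
unrelated data            (possibly many lines)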

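So, only for the records located between PATTERN_START and PATTERN_END, I would like all the data lines of each record to be gathered onto that record's Record line.

Can anyone help?

Below is a sample of the file I have to parse, and the kind of result I would like: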

Input

Blabla
Blabla
PATTERN_OTHER
Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Record 3         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Data
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla

Output

Blabla
Blabla
PATTERN_OTHER
Record 1         <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1 Data Data Data        <- record data grouped in one line
Record 2 Data Data             <- record data grouped in one line
Record 3 Data Data Data Data   <- record data grouped in one line
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1 Data Data Data        <- record data grouped in one line
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1         <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
Record 2         <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1 Data                  <- record data grouped in one line
Record 2 Data Data Data        <- record data grouped in one line
PATTERN_END
Blabla
Blabla

Answer 1

I think this is what you want, using GNU sed:

 sed -n '/^PATTERN_START/,/^PATTERN_END/{
         //!{H;/^Record/!{x;s/\n\([^\n]*\)$/ \1/;x}};
         /^PATTERN_START/{h};/^PATTERN_END/{x;p;x;p};d
         };p' file

Explanation

sed -n #Non printing


'/^PATTERN_START/,/^PATTERN_END/{
#If the line falls between these two patterns execute the next block

  //!{
  #If the line does not match the most recently used pattern (the empty regex //
  #reuses it, so this skips the start and end lines), then execute next block

        H;
        #append the line to the hold buffer, so this appends all lines between 
       #`/^PATTERN_START/` and `/^PATTERN_END/` not including those.

        /^Record/!{
        #If the line does not begin with record then execute next block

            x;s/\n\([^\n]*\)$/ \1/;x
            #Swap the current line with the hold buffer holding all our other lines 
            #up to now, then replace the last newline with a space. As this is only 
            #executed when Record is not matched, it just joins the `Data` line onto 
            #the end of the line before it.
            #The buffers are then swapped back, returning the result to the hold buffer.

        }
        #End of not record block

    }; 
    #End of previous pattern match block

    /^PATTERN_START/{h};

    #If line begins with `PATTERN_START` then the hold buffer is overwritten 
    #with this line removing all the previous matched lines.

    /^PATTERN_END/{x;p;x;p}
    #If line begins with `PATTERN_END` then swap in our saved lines, print them,
    #then swap back in the PATTERN_END line and print that as well.

    ;d
    #Delete all the lines within the range, as we print them explicitly in the 
    #Pattern end block above


         };p' file
         # Print every line that is not in the range; `file` is the name of the input file
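
As a small illustration of the key substitution described above, the sketch below applies s/\n\([^\n]*\)$/ \1/ to a three-line pattern space. The two N commands and the variable name out are only scaffolding for the demo and are not part of the answer. It assumes GNU sed, as the answer does, because [^\n] is relied on to mean "any character except a newline".

```shell
# Build a pattern space of three lines with N;N, then replace only the
# LAST newline with a space (GNU sed assumed for [^\n]).
out=$(printf 'PATTERN_START\nRecord 1\nData\n' |
      sed 'N;N;s/\n\([^\n]*\)$/ \1/')
printf '%s\n' "$out"
```

Only the final newline is replaced, so Data is joined onto Record 1 while PATTERN_START stays on its own line.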

Answer 2

This is the best I could come up with:

sed -n '/^PATTERN_START/, /^PATTERN_END/{
            /^PATTERN_START/{x;s/^.*$//;x};
            /^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
            /^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p};
            /^Record/!H
        };   
        /^PATTERN_START/, /^PATTERN_END/!p'

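Explanation

I assume you are familiar with the concepts of hold space and pattern space in sed. In this solution, we will do a lot of manipulation in the pattern space. So the first point is to disable automatic printing with the -n option and print wherever required.

The first task is to join all the lines between Record lines.

Consider the following file: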

a
b
Record 1
c
d
Record 2
e
f
Record 3

After joining the lines, we want it to be

a
b
Record 1 c d
Record 2 e f
Record 3

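So, here is the plan:

  1. We read a line and append it to the hold space.
  2. If the line begins with Record, it means the previous record is finished and a new one has started. So we print out the hold space, flush it, and start again from point 1.

Point 1 is implemented by the code /^Record/!H (the 5th line of the command). It means "if the line does not begin with Record, add a newline to the hold space and then append this line to the hold space".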
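
To make the behaviour of H concrete, here is a minimal sketch (the input lines, the | marker and the variable name out are illustrative assumptions). It appends every line to the hold space and, on the last line, swaps the hold space in and makes its newlines visible.

```shell
# H appends "newline + current line" to the hold space on every line.
# On the last line ($), swap the hold space in and show each newline as "|".
out=$(printf 'a\nb\n' | sed -n 'H;${x;s/\n/|/g;p;}')
printf '%s\n' "$out"
```

The result starts with the marker because the hold space is initially empty and H always adds a newline before the appended line. The rest of the explanation relies on that leading newline.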

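Point 2 can be implemented by the code /^Record/{x;s/\n/ /gp;}, where x exchanges the hold space and the pattern space, the s command replaces every \n with a space, and the p flag prints the pattern space. Using x has the added advantage that the hold space now contains the current Record line, so that we can start another cycle of points 1 and 2.

However, there is a problem with this. In the given example, there are two lines, a and b, before the first Record line. We do not want \n to be replaced in those lines. Since they do not begin with Record, according to point 1 a \n is added to the hold space and then these lines are appended. So, if the first character of the hold space is \n, it means that no Record has been encountered before, and we should not replace \n with spaces. This is done by the command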

/^\n/{s/^\n//p;d}

So the whole command becomes

/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
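
As a quick check of the two pieces so far, the sketch below runs this command together with /^Record/!H on the small example file from above (GNU sed syntax is assumed, and the variable name out is only for the demo).

```shell
# Join the lines that follow each Record line; lines before the first
# Record line are printed unchanged.
out=$(printf '%s\n' a b 'Record 1' c d 'Record 2' e f 'Record 3' |
      sed -n '/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};/^Record/!H')
printf '%s\n' "$out"
```

This prints a, b, Record 1 c d and Record 2 e f. The Record 3 line is still sitting in the hold space when the input ends, so it is never printed, which is why a terminating line needs handling of its own.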

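Now, the second complication is that we want to join the lines even when a record is terminated not by a Record line but by a PATTERN_END line. We want to do exactly the same thing as in point 2 when the line begins with PATTERN_END. So the command becomes

/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp}

However, there is a problem with this. As with the Record lines, the PATTERN_END line now ends up in the hold space. But we know that no more lines will be joined onto PATTERN_END, so we can print it right away. Hence we bring the PATTERN_END line into the pattern space with g and print it with p. So the final command becomes

/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p}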

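The next problem is the PATTERN_START line. In the explanation above, we assumed that the hold space is empty at the start. But after a PATTERN_END, there is something in the hold space. (That something is just the PATTERN_END line.) When we start a new cycle with PATTERN_START, we want to clear the hold space.

So, when PATTERN_START is encountered, we exchange the contents of the hold space and the pattern space, clear the pattern space, and exchange again. This leaves the hold space clean. This is precisely what the following command does: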

/^PATTERN_START/{x;s/^.*$//;x}
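
A minimal sketch of this clearing step in isolation (the junk line, the 1h command that seeds the hold space, the (empty) marker and the variable name out are all assumptions made for the demo):

```shell
# Line 1 seeds the hold space with leftover text. On PATTERN_START the
# x;s/^.*$//;x sequence empties the hold space. On the last line we swap
# the hold space in and print "(empty)" if nothing is left in it.
out=$(printf 'junk\nPATTERN_START\n' |
      sed -n '1h;/^PATTERN_START/{x;s/^.*$//;x;};${x;s/^$/(empty)/;p;}')
printf '%s\n' "$out"
```

Without the middle command, the leftover line junk would still be in the hold space at the end.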

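The last point is that we want to do all this fiddling only for the lines between PATTERN_START and PATTERN_END. All other lines are simply printed. This is done by the commands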

/^PATTERN_START/, /^PATTERN_END/{
    ----above commands go here----
};
/^PATTERN_START/, /^PATTERN_END/!p

Putting all of this together gives the final command :)
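
For reference, here is a sketch of the assembled command run on a shortened sample. The nine input lines are a cut-down version of the question's sample and the variable name out is only for the demo. GNU sed is assumed.

```shell
# Feed a short sample through the full command from this answer.
out=$(printf '%s\n' Blabla PATTERN_START 'Record 1' Data Data \
                    'Record 2' Data PATTERN_END Blabla |
sed -n '/^PATTERN_START/, /^PATTERN_END/{
            /^PATTERN_START/{x;s/^.*$//;x};
            /^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
            /^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p};
            /^Record/!H
        };
        /^PATTERN_START/, /^PATTERN_END/!p')
printf '%s\n' "$out"
```

On this input the Blabla lines and the two PATTERN lines come out unchanged, and the records are printed as Record 1 Data Data and Record 2 Data.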

Answer 3

Other ways with sed:

sed '/PATTERN_START/,/PATTERN_END/{   # in this range
//!{                                  # if not start or end of range
/^Record/{                            # if line matches Record
x                                     # exchange pattern and hold space
/^$/d                                 # if pattern space is empty, delete it
s/\n/ /g                              # replace newlines with spaces
}
/^Record/!{                           # if line doesn't match Record
H                                     # append it to hold space
d                                     # then delete it
}
}
/PATTERN_END/{                        # at end of range
x                                     # exchange pattern and hold space
s/\n/ /g                              # replace newlines with space
G                                     # append hold space to pattern space
x                                     # exchange again
s/.*//                                # empty pattern space
x                                     # exchange again > empty line in hold space
}
}' infile

or

sed '/PATTERN_START/,/PATTERN_END/{     # same as above
//!{                                    # same as above
: again
N                                       # pull the next line into pattern space
/\nRecord/!{                            # if pattern space doesn't match this
/\nPATTERN_END/!{                       # and doesn't match this either
s/\n/ /                                 # replace newline with space
b again                                 # go to : again
}
}
P                                       # print up to first newline
D                                       # then delete up to first newline
}
}' infile

As one-liners:

sed '/PATTERN_START/,/PATTERN_END/{//!{/^Record/{x;/^$/d;s/\n/ /g};/^Record/!{H;d}};/PATTERN_END/{x;s/\n/ /g;G;x;s/.*//;x}}' infile

sed '/PATTERN_START/,/PATTERN_END/{//!{: again;N;/\nRecord/!{/\nPATTERN_END/!{s/\n/ /;b again}};P;D}}' infile
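
To see the second variant (the N/P/D loop) at work, the sketch below runs it on a shortened sample. The input lines and the variable name out are assumptions made for the demo. The script is written one command per line, without trailing comments, so that the label again ends cleanly at the end of its line.

```shell
# N pulls in the next line; unless it starts a new Record or is
# PATTERN_END, the newline is replaced by a space and the loop repeats.
# P prints the finished record and D restarts with the remaining line.
out=$(printf '%s\n' Blabla PATTERN_START 'Record 1' Data Data \
                    'Record 2' Data PATTERN_END Blabla |
sed '/PATTERN_START/,/PATTERN_END/{
//!{
: again
N
/\nRecord/!{
/\nPATTERN_END/!{
s/\n/ /
b again
}
}
P
D
}
}')
printf '%s\n' "$out"
```

Here the hold space is not used at all: each record is assembled directly in the pattern space and printed as soon as the next Record or PATTERN_END line is pulled in.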

Answer 4

I made three versions.


v1


sed     -e'/^PATTERN_START/!b'  -e:n -eN  \
        -e'/\nPATTERN_END$/!bn' -eh\;s/// \
        -e'x;s/\n[[:print:]]*$//;x'       \
        -e's/\(\nRecord [[:print:]]*\)\{0,1\}\n/\1 /g'  \
        -e'G;P;D'       data

That one prints the whole file, applying the edits only to the Record lines that occur between PATTERN_{START,END}.


v2


sed   -ne'/\n/P;:n'    \
       -e'/^PATTERN_[OS]/!D'   -eN     \
       -e'/\nPATTERN_END$/!bn' -es///  \
       -e'/^PATTERN_S/s/\(\nRecord [[:print:]]*\)\{0,1\}\n/\1 /g'      \
       -eG\;D  ./data                 ###<gd data> haha

That one prints only the Record lines found within either PATTERN_{(START|OTHER),END}, but applies the edits only to those that occur within PATTERN_{START,END}.


v3


sed   -ne'/\n/P;:n'    \
       -e'/^PATTERN_START/!D'  -eN     \
       -e'/\nPATTERN_END$/!bn' -es///  \
       -e's/\(\nRecord [[:print:]]*\)\{0,1\}\n/\1 /g'      \
       -eG\;D  ./data

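And that one edits only, and prints only, the Record lines that occur between PATTERN_{START,END}.

Below is the output of each version after running it on the input sample. The output samples are presented in reverse order, that is, from shortest to longest.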


v3


Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data
Record 3         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data Data
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data

v2


Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data
Record 3         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data Data
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data

v1


Blabla
Blabla
PATTERN_OTHER
Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data
Record 3         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data Data
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Record 3         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Data
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Blabla
Blabla
PATTERN_OTHER
Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Blabla
Blabla
