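I have a large file that I need to parse and reformat, preferably with sed (under bash). The file contains repeating sequences that begin with PATTERN_START and end with PATTERN_END. These sequences are mixed with other text that I must leave unchanged. Each sequence contains several records (numbered from 1 to n, where n can be anything from 1 to 12). A record is a group of lines that begins with a line of the form Record i, where i is an integer between 1 and n, and ends at the next record line (Record (i+1)) or at a PATTERN_END line. A record can be from 1 to 30 lines long.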
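Here is a generic representation of the input file:
unrelated data (possibly many lines)
PATTERN_START
Record 1
data for Record 1 (up to 30 lines)
Record 2
data for Record 2
︙ (up to 12 records)
PATTERN_END
unrelated data (possibly many lines)
The whole PATTERN_START ... PATTERN_END sequence is repeated many times.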
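So, only for the records located between PATTERN_START and PATTERN_END, I would like to gather all the data lines of each record onto its Record line.
Can anyone help?
Below is a sample of the file I have to parse, and the kind of result I want:
Input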
Blabla
Blabla
PATTERN_OTHER
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Record 3 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Data
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla
Output
Blabla
Blabla
PATTERN_OTHER
Record 1 <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1 Data Data Data <- record data grouped in one line
Record 2 Data Data <- record data grouped in one line
Record 3 Data Data Data Data <- record data grouped in one line
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1 Data Data Data <- record data grouped in one line
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1 <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
Record 2 <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1 Data <- record data grouped in one line
Record 2 Data Data Data <- record data grouped in one line
PATTERN_END
Blabla
Blabla
Answer 1
I think this does what you want, using GNU sed:
sed -n '/^PATTERN_START/,/^PATTERN_END/{
//!{H;/^Record/!{x;s/\n\([^\n]*\)$/ \1/;x}};
/^PATTERN_START/{h};/^PATTERN_END/{x;p;x;p};d
};p' file
Explanation
sed -n #Non printing
'/^PATTERN_START/,/^PATTERN_END/{
#If the line falls between these two patterns execute the next block
//!{
#If the line does not match the most recently tested pattern (which skips
#the start and end lines), then execute the next block
H;
#append the line to the hold buffer, so this appends all lines between
#`/^PATTERN_START/` and `/^PATTERN_END/` not including those.
/^Record/!{
#If the line does not begin with record then execute next block
x;s/\n\([^\n]*\)$/ \1/;x
#Swap the current line with the hold buffer, which holds all our other lines
#up to now. Then replace the last newline with a space. As this is only
#executed when Record is not matched, it joins the `data` line that was just
#appended onto the line before it.
#The result is then swapped back into the hold buffer.
}
#End of not record block
};
#End of previous pattern match block
/^PATTERN_START/{h};
#If line begins with `PATTERN_START` then the hold buffer is overwritten
#with this line removing all the previous matched lines.
/^PATTERN_END/{x;p;x;p}
#If the line begins with `PATTERN_END` then swap in our saved lines, print them,
#then swap the PATTERN_END line back in and print that as well.
;d
#Delete all the lines within the range, as we print them explicitly in the
#Pattern end block above
};p' file
# Print every line that is not in the range; `file` is the name of the input file
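As a quick sanity check of the script above, the following sketch runs it on a reduced sample. The sample lines and the variable name answer1_out are made up for illustration, and GNU sed is assumed, as in the answer.

```shell
# Reduced sample: one PATTERN_START..PATTERN_END block surrounded by unrelated lines.
answer1_out=$(printf '%s\n' Blabla PATTERN_START 'Record 1' Data Data 'Record 2' Data PATTERN_END Blabla |
sed -n '/^PATTERN_START/,/^PATTERN_END/{
//!{H;/^Record/!{x;s/\n\([^\n]*\)$/ \1/;x}};
/^PATTERN_START/{h};/^PATTERN_END/{x;p;x;p};d
};p')
printf '%s\n' "$answer1_out"
```

With this input, each data line should end up on its Record line, while the lines outside the block pass through unchanged.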
Answer 2
This is the best I could come up with:
sed -n '/^PATTERN_START/, /^PATTERN_END/{
/^PATTERN_START/{x;s/^.*$//;x};
/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p};
/^Record/!H
};
/^PATTERN_START/, /^PATTERN_END/!p'
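Explanation
I assume you are familiar with the concepts of hold space and pattern space in sed. In this solution we will do a lot of manipulation in the pattern space, so the first point is to disable automatic printing with the -n option and to print only where required.
The first task is to join all the lines that lie between Record lines.
Consider the following file: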
a
b
Record 1
c
d
Record 2
e
f
Record 3
After joining the lines, we want it to be
a
b
Record 1 c d
Record 2 e f
Record 3
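So, here is the plan:
1. We read a line and append it to the hold space.
2. If the line begins with Record, the previous record is over and a new one has begun. So we print the hold space, flush it and start again from point 1.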
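Point 1 is implemented by the code /^Record/!H (the 5th line of the command). It means: "if the line does not begin with Record, add a newline to the hold space and then append this line to the hold space".
Point 2 can be implemented by the code /^Record/{x;s/\n/ /gp;}, where x exchanges the hold space and the pattern space, the s command replaces every \n with a space, and the p flag prints the pattern space when a substitution was made. Using x has a further advantage: the hold space now contains the current Record line, so we can begin another cycle of points 1 and 2.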
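However, there is a problem. In the given example there are two lines, a and b, before the first Record line. We do not want the \n substitution applied to them. Since they do not begin with Record, point 1 adds a \n to the hold space and then appends each of them. So if the first character of the hold space is \n, no Record line has been encountered yet and we should not replace \n with a space. This is done by the command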
/^\n/{s/^\n//p;d}
So the whole command becomes
/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
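To see the join step in isolation, here is a small sketch that runs the command above together with /^Record/!H on the sample file from this walkthrough. The trailing ${g;p} is not part of the answer: it is added here only to flush the last Record line at end of input, and it assumes the input ends with a Record line. The variable name joined is also just for illustration. GNU sed is assumed.

```shell
# Sample file from the walkthrough, piped through the join step.
# ${g;p} is a demo-only addition that prints the record still held at end of input.
joined=$(printf '%s\n' a b 'Record 1' c d 'Record 2' e f 'Record 3' |
sed -n '/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};/^Record/!H;${g;p}')
printf '%s\n' "$joined"
```

In the full command this flushing is done by the PATTERN_END handling rather than by an end-of-input address.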
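Now for the second complication: we want to join the lines even when a record is terminated not by a Record line but by a PATTERN_END line. We want to do exactly the same thing as in point 2 when the line begins with PATTERN_END. So the command becomes
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp}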
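However, there is a problem here too. As with a Record line, the PATTERN_END line now ends up in the hold space. But we know that no more lines will be joined to PATTERN_END, so we can print it right away. We therefore bring the PATTERN_END line into the pattern space with g and print it with p. So the final command becomes
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p}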
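The next issue is the PATTERN_START line. In the explanation above we assumed that the hold space is empty at the beginning. But after a PATTERN_END there is something left in the hold space (namely the PATTERN_END line). When we begin a new cycle with PATTERN_START, we want to clear the hold space.
So when we encounter PATTERN_START, we exchange the contents of the hold space and the pattern space, clear the pattern space and exchange again. This leaves the hold space clean. That is exactly what the following command does: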
/^PATTERN_START/{x;s/^.*$//;x}
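The last point is that we want to do all this fiddling only between the PATTERN_START and PATTERN_END lines. All other lines are simply printed. This is done by the commands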
/^PATTERN_START/, /^PATTERN_END/{
----above commands go here----
};
/^PATTERN_START/, /^PATTERN_END/!p
Putting all of this together gives the final command :)
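One case the walkthrough does not spell out is a record that has no data lines. Because the p flag of s only prints after a successful substitution, such a record contains no newline to replace and would not be printed. The sketch below is a possible variation, not the original answer's command: it uses s/\n/ /g;p in place of s/\n/ /gp so the record is printed either way. The sample input and the name fixed_out are invented for illustration, and GNU sed is assumed.

```shell
# Variation: a separate p command, so that one-line records are printed too.
fixed_out=$(printf '%s\n' Blabla PATTERN_START 'Record 1' 'Record 2' Data PATTERN_END Blabla |
sed -n '/^PATTERN_START/,/^PATTERN_END/{
/^PATTERN_START/{x;s/^.*$//;x}
/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /g;p}
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /g;p;g;p}
/^Record/!H
}
/^PATTERN_START/,/^PATTERN_END/!p')
printf '%s\n' "$fixed_out"
```

Here Record 1 has no data lines and should still appear in the output, on its own line.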
Answer 3
Another way with sed:
sed '/PATTERN_START/,/PATTERN_END/{ # in this range
//!{ # if not start or end of range
/^Record/{ # if line matches Record
x # exchange pattern and hold space
/^$/d # if pattern space is empty, delete it
s/\n/ /g # replace newlines with spaces
}
/^Record/!{ # if line doesn't match Record
H # append it to hold space
d # then delete it
}
}
/PATTERN_END/{ # at end of range
x # exchange pattern and hold space
s/\n/ /g # replace newlines with space
G # append hold space to pattern space
x # exchange again
s/.*// # empty pattern space
x # exchange again > empty line in hold space
}
}' infile
or
sed '/PATTERN_START/,/PATTERN_END/{ # same as above
//!{ # same as above
: again
N # pull the next line into pattern space
/\nRecord/!{ # if pattern space doesn't match this
/\nPATTERN_END/!{ # and doesn't match this either
s/\n/ / # replace newline with space
b again # go to : again
}
}
P # print up to first newline
D # then delete up to first newline
}
}' infile
As one-liners:
sed '/PATTERN_START/,/PATTERN_END/{//!{/^Record/{x;/^$/d;s/\n/ /g};/^Record/!{H;d}};/PATTERN_END/{x;s/\n/ /g;G;x;s/.*//;x}}' infile
and
sed '/PATTERN_START/,/PATTERN_END/{//!{: again;N;/\nRecord/!{/\nPATTERN_END/!{s/\n/ /;b again}};P;D}}' infile
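For a quick check of the N/P/D approach, the sketch below runs the second script (in its multi-line form, without the comments) on a reduced sample. The sample lines and the variable name npd_out are made up for illustration, and GNU sed is assumed.

```shell
# Reduced sample run through the N/P/D variant: N pulls in the next line,
# P prints up to the first newline, D deletes that part and restarts the cycle.
npd_out=$(printf '%s\n' Blabla PATTERN_START 'Record 1' Data Data 'Record 2' Data PATTERN_END Blabla |
sed '/PATTERN_START/,/PATTERN_END/{
//!{
:again
N
/\nRecord/!{
/\nPATTERN_END/!{
s/\n/ /
b again
}
}
P
D
}
}')
printf '%s\n' "$npd_out"
```

Note that the range patterns in this answer are not anchored with ^, so the sample deliberately avoids lines that mention PATTERN_START or PATTERN_END in the middle of the text.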
Answer 4
I made three versions.
v1
sed -e'/^PATTERN_START/!b' -e:n -eN \
-e'/\nPATTERN_END$/!bn' -eh\;s/// \
-e'x;s/\n[[:print:]]*$//;x' \
-e's/\(\nRecord [[:print:]]*\)\{0,1\}\n/\1 /g' \
-e'G;P;D' data
That one prints all of the file, having applied edits only to Record lines occurring between PATTERN_{START,END}.
v2
sed -ne'/\n/P;:n' \
-e'/^PATTERN_[OS]/!D' -eN \
-e'/\nPATTERN_END$/!bn' -es/// \
-e'/^PATTERN_S/s/\(\nRecord [[:print:]]*\)\{0,1\}\n/\1 /g' \
-eG\;D ./data ###<gd data> haha
That one prints Record lines found within either PATTERN_{(START|OTHER),END}, but only applies edits to those occurring within PATTERN_{START,END}.
v3
sed -ne'/\n/P;:n' \
-e'/^PATTERN_START/!D' -eN \
-e'/\nPATTERN_END$/!bn' -es/// \
-e's/\(\nRecord [[:print:]]*\)\{0,1\}\n/\1 /g' \
-eG\;D ./data
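And that one only edits and only prints Record lines occurring between PATTERN_{START,END}.
Following are the outputs of each after being run on the input sample. The output samples are presented in reverse order, from shortest to longest.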
v3
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data
Record 3 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data Data
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
v2
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data
Record 3 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data Data
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
v1
Blabla
Blabla
PATTERN_OTHER
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data
Record 3 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data Data
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Record 3 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Data
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Blabla
Blabla
PATTERN_OTHER
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line Data Data Data
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Blabla
Blabla