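I have a large file with more than 3264880 lines. I want to split it on the two strings "BEGIN JOB" and "END JOB" and write each block to its own file. The file name should be based on the Identifier string found between the BEGIN JOB and END JOB lines.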
Sample data
BEGIN JOB
Identifier "ADHOC_Extract"
DateModified "2018-10-02"
TimeModified "15.09.52"
BEGIN DSRECORD
Identifier "ROOT"
OLEType "CJobDefn"
Readonly "0"
Name "ADHOC_Extract"
END JOB
BEGIN JOB
Identifier "HOC_Extract"
DateModified "2018-11-02"
TimeModified "12.09.52"
BEGIN DSRECORD
Identifier "ROOT"
OLEType "CJobDefn"
Readonly "0"
Name "HOC_Extract"
END JOB
The expected output is two files, because my sample contains only two jobs, but the real file has more than 1000 of these repeating blocks.
ADHOC_Extract.txt
BEGIN JOB
Identifier "ADHOC_Extract"
DateModified "2018-10-02"
TimeModified "15.09.52"
BEGIN DSRECORD
Identifier "ROOT"
OLEType "CJobDefn"
Readonly "0"
Name "ADHOC_Extract"
END JOB
HOC_Extract.txt
BEGIN JOB
Identifier "HOC_Extract"
DateModified "2018-11-02"
TimeModified "12.09.52"
BEGIN DSRECORD
Identifier "ROOT"
OLEType "CJobDefn"
Readonly "0"
Name "HOC_Extract"
END JOB
A shell script would be fine for this too.
Answer 1
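Using GNU awk (RS="" selects paragraph mode, so this assumes the jobs are separated by blank lines; the three-argument match() is a gawk extension):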
gawk -v RS="" '
match($0, /Identifier "([^"]+)/, m) {
print > (m[1]".txt")
close(m[1]".txt")
}
' sample.txt
With Perl, using the handy Path::Tiny module from CPAN (-00 is Perl's paragraph mode):
perl -MPath::Tiny -00 -ne '/Identifier "(.+?)"/ and path("$1.txt")->spew($_)' sample.txt
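Both commands above rely on paragraph mode, so they need blank lines between the jobs. If the file has none, a rough sketch that keys on the BEGIN JOB and END JOB marker lines instead could look like the following. The input name jobs.txt and the shortened sample data are assumptions made for illustration, and identifiers are assumed to contain no spaces.

```shell
#!/bin/sh
# Sketch: stream each job straight to its own file, keyed on the
# BEGIN JOB / END JOB marker lines (no blank-line separators needed).
# Assumptions: jobs.txt is a hypothetical input name, and the first
# Identifier line after BEGIN JOB names the output file.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > jobs.txt <<'EOF'
BEGIN JOB
Identifier "ADHOC_Extract"
DateModified "2018-10-02"
BEGIN DSRECORD
Identifier "ROOT"
END JOB
BEGIN JOB
Identifier "HOC_Extract"
DateModified "2018-11-02"
BEGIN DSRECORD
Identifier "ROOT"
END JOB
EOF

awk '
/^ *BEGIN JOB/ { injob = 1; out = ""; held = "" }   # start of a job
injob && out == "" && /Identifier/ {
    out = $2                       # second field, still quoted
    gsub(/"/, "", out)             # strip the quotes
    out = out ".txt"
    printf "%s", held > out        # flush lines seen before the Identifier
}
injob {
    if (out == "") held = held $0 ORS    # file name not known yet
    else print > out
}
/^ *END JOB/ { if (out != "") close(out); injob = 0 }
' jobs.txt

ls -- *.txt
```

Closing each file at END JOB keeps the number of open files at one, which matters with more than 1000 jobs. Lines outside any BEGIN JOB / END JOB pair are dropped.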
Answer 2
If every paragraph in the data has the same layout (i.e. ten lines per paragraph), then the split command works well (see man split).
#!/bin/bash
# Remove blank lines from the original dataset.
awk NF original_data.txt > Free_spaces_data.txt
# Split the dataset into one file per paragraph; each paragraph is 10 lines.
# -a 4 leaves room for more than 676 pieces (the limit of two-letter suffixes).
split -a 4 -l 10 Free_spaces_data.txt new
# Rename each piece after the Name value inside it.
for f in ./new*; do
    name=$(awk -F'"' '/Name/ {print $2; exit}' "$f")
    mv "$f" "${name}.txt"
done
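If the jobs are not all the same length, the fixed split -l 10 cuts in the wrong places. One possible variation, assuming GNU csplit is available, is to cut at every BEGIN JOB line instead. The piece_ prefix and the shortened sample data are illustrative choices, not part of the original answer.

```shell
#!/bin/sh
# Sketch: variable-length jobs with GNU csplit (-z and '{*}' are GNU features).
# Assumption: piece_ is an arbitrary prefix chosen for this example.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > original_data.txt <<'EOF'
BEGIN JOB
Identifier "ADHOC_Extract"
BEGIN DSRECORD
Identifier "ROOT"
Name "ADHOC_Extract"
END JOB

BEGIN JOB
Identifier "HOC_Extract"
DateModified "2018-11-02"
TimeModified "12.09.52"
BEGIN DSRECORD
Identifier "ROOT"
Name "HOC_Extract"
END JOB
EOF

# Drop blank lines, then start a new piece at every line beginning with BEGIN JOB.
# -z skips the empty piece before the first match, -s silences the byte counts,
# -f sets the piece prefix, -n 5 gives five-digit suffixes (room for many jobs),
# and '{*}' repeats the pattern as often as it matches.
awk NF original_data.txt | csplit -z -s -n 5 -f piece_ - '/^BEGIN JOB/' '{*}'

# Rename each piece after the first Name value it contains.
for f in piece_*; do
    name=$(awk -F'"' '/Name/ {print $2; exit}' "$f")
    [ -n "$name" ] && mv -- "$f" "$name.txt"
done

ls -- *.txt
```

Pieces without a Name line keep their piece_ name rather than being renamed to ".txt".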
Answer 3
This takes the first "Identifier" line of each job to derive the file name:
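awk '
!/^ *$/         {BUF = BUF $0 ORS          # append each non-blank line to the buffer
                }
! FN &&
/Identifier/    {FN = $NF ".txt"           # first Identifier in the job names the file
                 gsub (/"/, "", FN)
                }
/END JOB/       {printf "%s", BUF > FN     # write the buffered job
                 close (FN)                # avoid running out of open files
                 BUF = FN = ""
                }
' file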
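It skips empty lines, appends every other line to a buffer, builds the file name from the first "Identifier" line seen while FN is still empty, strips any " characters from it, and on END JOB prints the buffer to that file and resets BUF and FN to empty.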
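Because the file name comes from the Identifier, two jobs with the same identifier would overwrite each other. A quick check that could be run before splitting, as a sketch only (the sample data here is invented to include a repeat, and the variable names are arbitrary):

```shell
#!/bin/sh
# Sketch: count the jobs and list identifiers that occur more than once.
# Assumption: the sample below is made up so that one identifier repeats.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > file <<'EOF'
BEGIN JOB
Identifier "ADHOC_Extract"
BEGIN DSRECORD
Identifier "ROOT"
END JOB
BEGIN JOB
Identifier "HOC_Extract"
END JOB
BEGIN JOB
Identifier "ADHOC_Extract"
END JOB
EOF

# Print the first Identifier after each BEGIN JOB, then keep only the repeats.
dups=$(awk -F'"' '
/BEGIN JOB/           { want = 1; next }
want && /Identifier/  { print $2; want = 0 }
' file | sort | uniq -d)

njobs=$(grep -c 'BEGIN JOB' file)

echo "jobs: $njobs"
echo "duplicate identifiers: ${dups:-none}"
```

If the job count is higher than the number of .txt files produced, a duplicate identifier is the likely cause.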