How can I turn this ugly output into nice, useful data?
Output:
/* ---------- TA#box#AbC_p ---------- */
insert_job: TA#box#AbC_p job_type: a
#owner: bob
permission: gx
date_conditions: 1
days_of_week: su
start_times: "16:15"
run_window: "16:15-17:30"
description: "Job AbC that runs at 4:15PM on Sundays, and should end before 5:30PM"
/* ---------- TA#cmd#EfGJob_p ---------- */
insert_job: TA#cmd#EfGJob_p job_type: b
box_name: TA#box#AbC_p
command: /path/to/shell/script.sh
machine: vm_machine1
#owner: alex
permission: gx
date_conditions: 2
run_window: "16:20-16:30"
description: "job EfG that runs within box AbC"
term_run_time: 60
std_out: /path/to/log.log
std_err: /path/to/err.log
alarm_if_fail: 1
profile: /path/to/profile
And so on, for a long while. #cmd# jobs sometimes sit under a #box#. When they do, the #cmd# section is indented.
My ideal output would look like this:
"Job Name", "Time", "Schedule", "Machine", "Description", "Command"
"TA#box#AbC_p", "16:15", "su", "", "Job AbC that runs at 4:15PM on Sundays, and should end before 5:30PM", ""
"TA#cmd#EfGJob_p", "16:15", "su", "vm_machine1", "job EfG that runs within box AbC", "/path/to/shell/script.sh"
I have been trying awk, perl, and grep, but I can't keep all the information for one "section" together before printing a CSV line.
Answer 1
A somewhat scary sed one-liner:
sed -n '
# we divide the incoming text into small parts,
# each one, as you mentioned, from /---.*box.*/ to /profile/
/---.*box.*/,/profile/{
# inside of each part we do following things:
# if string matches our pattern we extract
# the value and give it some identifier (which you
# can see is "ij", "st" and so on)
# and we copy that value with its identifier to the hold buffer,
# but we do not replace the content of the hold buffer,
# we just append (capital H) the new var to it
/insert_job/{s/[^:]*: /ij"/;s/ .*/",/;H};
/start_times/{s/[^:]*: /st/;s/$/,/;H};
/days_of_week/{s/[^:]*: /dw"/;s/$/",/;H};
/machine/{s/[^:]*: /ma"/;s/$/",/;H};
/description/{s/[^:]*: /de/;s/$/,/;H};
/command/{s/[^:]*: /co"/;s/$/",/;H};
# when line matches next pattern (profile)
# we think that it is the end of our part,
# therefore we delete the whole line (s/.*//;)
# and exchange the pattern and hold buffers (x;)
# so now in pattern buffer we have several strings with all needed variables
# but all of them are in pattern space, therefore we can remove
# all newlines symbols (s/\n//g;). so it is just one string
# with a list of variables
# and we just need to move to the order we want,
# so in this section we do it with several s commands.
# after that we print the result (p)
/profile/{s/.*//;x;s/\n//g;s/ij\("[^"]*box[^"]*",\)/\1/;
s/,\(.*\)st\("[^"]*",\)\(.*ij"[^"]*",\)/,\2\1\3\2/;
s/\([^,]*,[^,]*,\)\(.*\)dw\("[^"]*",\)\(.*ij"[^"]*",[^,]*,\)/\1\3\2\4\3/;
s/de/"",/;s/ij/""\n/;
s/\(\n[^,]*,[^,]*,[^,]*,\)\(.*\)ma\("[^"]*",\)/\1\3\2/;
s/co\("[^"]*"\),\(.*\)/\2\1/;s/de//;p}
};
# the last command just adds table caption and nothing more.
# note: if you want to add some new commands,
# add them before this one
1i"Job Name", "Time", "Schedule", "Machine", "Description", "Command"'
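I wrote it this way because the order of the fields can differ from one box to another, but profile is always the last one. If the order were always the same, it would be a bit easier.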
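The same idea (collect the fields in whatever order they arrive, then emit once the profile line is seen) may be easier to follow outside sed. Below is a minimal Python sketch of that accumulate-then-flush pattern, not a translation of the sed script. The names parts, WANTED and sentinel are made up for illustration, and it assumes every part really does end with a profile line.

```python
import re

# Field names we care about; everything else is ignored.
WANTED = {"insert_job", "start_times", "days_of_week",
          "machine", "description", "command"}

def parts(lines, sentinel="profile"):
    """Yield one list of (key, value) pairs per part ending at `sentinel`."""
    held = []                          # plays the role of sed's hold buffer
    for line in lines:
        m = re.match(r"\s*(\w+):\s*(.*)", line)
        if m is None:
            continue                   # header comments, "#owner", blank lines
        key, value = m.group(1), m.group(2).strip()
        if key == "insert_job":
            value = value.split(" ", 1)[0]   # drop the trailing "job_type: ..."
        if key in WANTED:
            held.append((key, value))  # append, never overwrite (like H)
        if key == sentinel:
            yield held                 # the part is complete, hand it over
            held = []
```

Because the pairs are appended rather than stored by key, a part that contains both a box and its cmd keeps both insert_job entries, which is what the reordering step afterwards needs.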
Answer 2
I would use Perl, or at least awk.
perl -ne '
BEGIN {
print "\"Job Name\", \"Time\", \"Schedule\", \"Machine\", \"Description\", \"Command\"\n";
}
chomp; s/^\s+//; s/\s+$//;
if (($_ eq "" || eof) && exists $fields{insert_job}) {
print "\"", join("\", \"", @fields{qw(insert_job start_times days_of_week machine description command)}), "\"\n";
delete @fields{qw(insert_job)};
}
if (/^([^ :]+): *(.*)/) {$fields{$1} = $2}
'
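Explanation:
- The BEGIN block runs once at the start of the script; the rest runs for each input line.
- The line starting with chomp strips leading and trailing whitespace.
- The first if line triggers on an empty line (the paragraph separator), provided the insert_job field exists.
- The delete line removes the insert_job field. Add the names of any other fields that you don't want to spill over from one paragraph to the next.
- The last if line stores the field.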
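For readers who don't speak Perl, the steps above can be sketched in Python roughly as follows. This is an illustration of the same blank-line-delimited approach under the same assumptions, not a replacement for the one-liner; rows and COLUMNS are names invented here. As in the Perl, values are stored as found (surrounding quotes included, and anything after the job name on the insert_job line), so real data would need a little more cleanup.

```python
import re

COLUMNS = ["insert_job", "start_times", "days_of_week",
           "machine", "description", "command"]

def rows(lines):
    """Yield one list of column values per blank-line-separated paragraph."""
    fields = {}
    # A trailing empty string acts as a final separator so the last
    # paragraph is flushed even if the input does not end with a blank line.
    for raw in list(lines) + [""]:
        line = raw.strip()
        if line == "" and "insert_job" in fields:
            yield [fields.get(name, "") for name in COLUMNS]
            del fields["insert_job"]   # other fields deliberately carry over
        m = re.match(r"([^ :]+): *(.*)", line)
        if m:
            fields[m.group(1)] = m.group(2)
```

Only insert_job is cleared between paragraphs, so a cmd that follows a box picks up the box's start_times and days_of_week unless it sets its own.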
Answer 3
Using the TXR language:
@(bind inherit-time nil)
@(bind inherit-sched nil)
@(collect)
@ (all)
@indent/* ---------- @jobname ---------- */
@ (and)
@/ *//* ---------- @nil#@type#@nil ---------- */
@ (end)

@ (bind is-indented @(> (length indent) 0))
@ (gather :vars ((time "") (sched "") (mach "") (descr "") (cmd "")))
@/ */start_times: "@*time"
@/ */days_of_week: @sched
@/ */machine: @mach
@/ */description: "@*descr"
@/ */command: @cmd
@ (until)

@ (end)
@ (cases)
@ (bind type "box")
@ (set (inherit-time inherit-sched) (time sched))
@ (or)
@ (bind type "cmd")
@ (bind is-indented t)
@ (set (time sched) (inherit-time inherit-sched))
@ (end)
@(end)
@(output)
"Job Name", "Time", "Schedule", "Machine", "Description", "Command"
@ (repeat)
"@jobname", "@time", "@sched", "@mach", "@descr", "@cmd"
@ (end)
@(end)
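This is a very naive approach. From each record we extract all the fields we are interested in, substituting blanks for those that are absent (the default values in the :vars argument of @(gather)). We pay attention to the job type (box or cmd) and to the indentation. When we see a box, we copy some of its properties into global variables; when we see an indented cmd, it takes those properties from the globals. (We blindly assume that they were set by an earlier box.)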
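The inheritance step on its own can be sketched in Python like this. It assumes the records have already been parsed into dicts; the function name inherit and the keys name, indented, time and sched are hypothetical, chosen only to mirror the TXR variables.

```python
def inherit(records):
    """Copy time/sched from the most recent box into each indented cmd."""
    box_time = box_sched = ""          # "globals" remembered from the last box
    for rec in records:
        rec = dict(rec)                # work on a copy, leave the input alone
        pieces = rec["name"].split("#")
        kind = pieces[1] if len(pieces) > 2 else ""   # middle of TA#box#AbC_p
        if kind == "box":
            box_time, box_sched = rec["time"], rec["sched"]
        elif kind == "cmd" and rec["indented"]:
            rec["time"], rec["sched"] = box_time, box_sched
        yield rec
```

Like the TXR version, this blindly trusts that an indented cmd was preceded by its box; a cmd that is not indented keeps whatever it had.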
Run:
$ txr jobs.txr jobs
"Job Name", "Time", "Schedule", "Machine", "Description", "Command"
"TA#box#AbC_p", "16:15", "su", "", "Job AbC that runs at 4:15PM on Sundays, and should end before 5:30PM", ""
"TA#cmd#EfGJob_p", "16:15", "su", "vm_machine1", "job EfG that runs within box AbC", "/path/to/shell/script.sh"
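Note that the output is comma-separated quoted fields, but nothing is done about the possibility that the data contains quotes. If quotes are escaped somehow in description:, then of course that will be preserved. The @*descr notation is a greedy match, so description: "a b"c\"d" will result in descr taking the characters a b"c\"d, which will be reproduced verbatim in the output.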
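If embedded quotes do need handling, one common convention is to double them inside a quoted field. A small Python sketch using the standard csv module shows what that looks like; to_csv is a made-up helper name. Note that csv.writer only accepts a single-character delimiter, so the fields come out separated by "," rather than the ", " shown in the question.

```python
import csv
import io

def to_csv(rows):
    """Render rows as CSV text with every field quoted."""
    buf = io.StringIO()
    # QUOTE_ALL quotes every field; embedded quotes are doubled by default.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()
```

Whether the consumer of the file expects doubled quotes or backslash escapes is something to check before relying on this.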
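A nice thing about this solution is that even if we had no sample of the data, we could guess most of it from the structure of the code, because it expresses an ordered pattern match over the entire file. We can see that each collected section begins with a /* --- ... --- */ line in which the job name is embedded, and that there is a type field between the two hash marks in the middle of the job name. Then there is a mandatory blank line, after which properties are gathered until another blank line, and so on.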
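To make that header pattern concrete, here is roughly the same match written as a Python regular expression. HEADER is a name invented for this sketch, and it assumes exactly ten dashes on each side, as in the sample data.

```python
import re

# Leading spaces, then "/* ---------- NAME ---------- */", where NAME
# looks like PREFIX#TYPE#REST and TYPE sits between the two hash marks.
HEADER = re.compile(
    r"^(?P<indent> *)/\* -{10} "
    r"(?P<jobname>[^#]*#(?P<type>[^#]*)#.*?)"
    r" -{10} \*/$"
)

m = HEADER.match("/* ---------- TA#box#AbC_p ---------- */")
if m:
    print(m["jobname"], m["type"], len(m["indent"]))
```

The indent group corresponds to the TXR indent variable, so a non-zero length marks a job that sits inside a box.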