I have a CSV file, file.csv, that contains dates and times, like this:
id0,2020-12-12T07:18:26,7f
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id5,2017-04-28T00:31:54,65
id6,2020-12-12T20:13:47,45
id7,2017-04-28T21:04:30,7f
I want to split the file based on the date in column 2. For the example above, that should create 2 files:
file_1.csv
id1,2017-04-28T19:59:00,80
id2,2017-04-28T03:14:35,e4
id5,2017-04-28T00:31:54,65
id7,2017-04-28T21:04:30,7f
and
file_2.csv
id0,2020-12-12T07:18:26,7f
id3,2020-12-12T23:45:09,ff
id4,2020-12-12T09:12:34,a1
id6,2020-12-12T20:13:47,45
I tried to do this with sort and awk, but that splits the file into 8 files, based on both the date and the time:
sort -k2 -t, file.csv | awk -F, '!($2 in col) {col[$2]=++i} {print > ("file_" i ".csv")}'
How can I split the file based on the date only, rather than the date and time?
Answer 1
How about:
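awk -F', *' '
# FS is a comma plus optional spaces, so both a,b and a, b layouts split correctly.
# Key on the date only: the first 10 characters of column 2 (YYYY-MM-DD).
{ date = substr($2,1,10) }
# The first time a date is seen, assign it the next numbered output file.
!(date in outfile) { outfile[date] = "file_" (++numout) ".csv" }
{ print > outfile[date] }
' file.csv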
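To see why truncating the key matters, here is a minimal sketch that counts the distinct keys each approach produces for the sample data. The split_demo directory and the sample.csv name are hypothetical, chosen only for this illustration.

```shell
# Hypothetical scratch directory and file name, used only for this demo.
mkdir -p split_demo
printf '%s\n' \
  'id0,2020-12-12T07:18:26,7f' \
  'id1,2017-04-28T19:59:00,80' \
  'id2,2017-04-28T03:14:35,e4' \
  'id3,2020-12-12T23:45:09,ff' \
  'id4,2020-12-12T09:12:34,a1' \
  'id5,2017-04-28T00:31:54,65' \
  'id6,2020-12-12T20:13:47,45' \
  'id7,2017-04-28T21:04:30,7f' > split_demo/sample.csv

# Distinct values of the whole timestamp column (the key used in the question).
n_ts=$(awk -F', *' '!seen[$2]++ { n++ } END { print n + 0 }' split_demo/sample.csv)

# Distinct values of the first 10 characters of that column (the date only).
n_date=$(awk -F', *' '!seen[substr($2,1,10)]++ { n++ } END { print n + 0 }' split_demo/sample.csv)

echo "timestamps: $n_ts, dates: $n_date"   # prints: timestamps: 8, dates: 2
```

Eight distinct timestamps explain the 8 files in the question; two distinct dates give the 2 files that were wanted.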
If it is a large file with many unique dates, you may need to avoid a "too many open files" error by replacing the print line with:
{ print >> outfile[date]; close(outfile[date]) }
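One thing to watch with the >> variant: because it appends, output files left over from an earlier run are not truncated, so running the script twice doubles their contents. Below is a minimal sketch of one way around that, assuming the outputs go to a scratch directory that is safe to clean first. The split_demo2 directory, the sample.csv name, and the dir variable are hypothetical and exist only for this illustration.

```shell
# Hypothetical scratch directory; stale outputs are removed because >> appends.
mkdir -p split_demo2
rm -f split_demo2/file_*.csv
printf '%s\n' \
  'id0,2020-12-12T07:18:26,7f' \
  'id1,2017-04-28T19:59:00,80' \
  'id2,2017-04-28T03:14:35,e4' \
  'id3,2020-12-12T23:45:09,ff' \
  'id4,2020-12-12T09:12:34,a1' \
  'id5,2017-04-28T00:31:54,65' \
  'id6,2020-12-12T20:13:47,45' \
  'id7,2017-04-28T21:04:30,7f' > split_demo2/sample.csv

awk -F', *' -v dir=split_demo2 '
{ date = substr($2,1,10) }
!(date in outfile) { outfile[date] = dir "/file_" (++numout) ".csv" }
# Append and close right away so only one output file is open at a time.
{ print >> outfile[date]; close(outfile[date]) }
' split_demo2/sample.csv

wc -l split_demo2/file_*.csv
```

Files are numbered in order of first appearance, so with this sample file_1.csv receives the 2020-12-12 lines and file_2.csv the 2017-04-28 lines. Sorting the input by date first would make the numbering follow date order.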
Answer 2
$ cat tst.sh
#!/usr/bin/env bash
awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' "${@:--}" |
sort -k1,1n -k2,2n |
cut -f3- |
awk -F'[ -]' '
{ curr = $2$3 }
curr != prev {
close(out)
out = "file_" (++cnt) ".csv"
prev = curr
}
{ print > out }
'
$ ./tst.sh file
$ head file_*
==> file_1.csv <==
id1, 2017-04-28T19:59:00, 80
id2, 2017-04-28T03:14:35, e4
id5, 2017-04-28T00:31:54, 65
id7, 2017-04-28T21:04:30, 7f
==> file_2.csv <==
id0, 2020-12-12T07:18:26, 7f
id3, 2020-12-12T23:45:09, ff
id4, 2020-12-12T09:12:34, a1
id6, 2020-12-12T20:13:47, 45
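The above will work robustly, efficiently, and portably with any POSIX awk, sort, and cut, and it preserves the input order within each output file.
Here is how the first 3 steps rearrange the contents of the input file: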
$ cat file
id0, 2020-12-12T07:18:26, 7f
id1, 2017-04-28T19:59:00, 80
id2, 2017-04-28T03:14:35, e4
id3, 2020-12-12T23:45:09, ff
id4, 2020-12-12T09:12:34, a1
id5, 2017-04-28T00:31:54, 65
id6, 2020-12-12T20:13:47, 45
id7, 2017-04-28T21:04:30, 7f
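By the time the final awk script runs, its input is already sorted by year+month (fields $2 and $3 under -F'[ -]'), and lines that share the same year+month keep their input order: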
$ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file
202012 1 id0, 2020-12-12T07:18:26, 7f
201704 2 id1, 2017-04-28T19:59:00, 80
201704 3 id2, 2017-04-28T03:14:35, e4
202012 4 id3, 2020-12-12T23:45:09, ff
202012 5 id4, 2020-12-12T09:12:34, a1
201704 6 id5, 2017-04-28T00:31:54, 65
202012 7 id6, 2020-12-12T20:13:47, 45
201704 8 id7, 2017-04-28T21:04:30, 7f
$ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file | sort -k1,1n -k2,2n
201704 2 id1, 2017-04-28T19:59:00, 80
201704 3 id2, 2017-04-28T03:14:35, e4
201704 6 id5, 2017-04-28T00:31:54, 65
201704 8 id7, 2017-04-28T21:04:30, 7f
202012 1 id0, 2020-12-12T07:18:26, 7f
202012 4 id3, 2020-12-12T23:45:09, ff
202012 5 id4, 2020-12-12T09:12:34, a1
202012 7 id6, 2020-12-12T20:13:47, 45
$ awk -F'[ -]' -v OFS='\t' '{print $2$3, NR, $0}' file | sort -k1,1n -k2,2n | cut -f3-
id1, 2017-04-28T19:59:00, 80
id2, 2017-04-28T03:14:35, e4
id5, 2017-04-28T00:31:54, 65
id7, 2017-04-28T21:04:30, 7f
id0, 2020-12-12T07:18:26, 7f
id3, 2020-12-12T23:45:09, ff
id4, 2020-12-12T09:12:34, a1
id6, 2020-12-12T20:13:47, 45
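The pipeline above keys on year+month, which happens to coincide with the date for this sample. If two different days in the same month must land in separate files, the same decorate, sort, undecorate idea can key on the full date instead. This is a minimal sketch of that variant, not the script above: it assumes comma-separated input with or without a space after each comma, and the split_demo3 directory, the sample.csv name, and the dir variable are hypothetical.

```shell
# Hypothetical scratch directory and sample file, used only for this demo.
mkdir -p split_demo3
rm -f split_demo3/file_*.csv
printf '%s\n' \
  'id0,2020-12-12T07:18:26,7f' \
  'id1,2017-04-28T19:59:00,80' \
  'id2,2017-04-28T03:14:35,e4' \
  'id3,2020-12-12T23:45:09,ff' \
  'id4,2020-12-12T09:12:34,a1' \
  'id5,2017-04-28T00:31:54,65' \
  'id6,2020-12-12T20:13:47,45' \
  'id7,2017-04-28T21:04:30,7f' > split_demo3/sample.csv

# Decorate: prefix each line with its date (YYYY-MM-DD) and its line number.
awk -F', *' -v OFS='\t' '{ print substr($2,1,10), NR, $0 }' split_demo3/sample.csv |
# Sort by date, then by original line number to keep the input order.
LC_ALL=C sort -k1,1 -k2,2n |
# Undecorate: drop the two helper columns again.
cut -f3- |
# Split: start a new numbered file whenever the date changes.
awk -F', *' -v dir=split_demo3 '
{ curr = substr($2,1,10) }
curr != prev {
    if (out != "") close(out)
    out = dir "/file_" (++cnt) ".csv"
    prev = curr
}
{ print > out }
'

cat split_demo3/file_1.csv
# id1,2017-04-28T19:59:00,80
# id2,2017-04-28T03:14:35,e4
# id5,2017-04-28T00:31:54,65
# id7,2017-04-28T21:04:30,7f
```

ISO 8601 dates sort correctly as plain strings, which is why the first sort key needs no numeric flag.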
Answer 3
Staying with your current approach, that is, running sort first and then splitting into separate files, while also avoiding awk arrays:
<infile sort -t, -k2 \
|awk -F, '{
substr($2,1,10)!=prev && nxt++;
print >>("file_"nxt".csv"); close("file_"nxt".csv");
prev=substr($2,1,10);
}'
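Because this approach sorts on the whole timestamp column, the lines inside each output file end up in time order rather than in input order. The sketch below illustrates that behaviour with a restructured variant of the same idea (an explicit pattern instead of the && expression), not the answer's exact code. The split_demo4 directory, the sample.csv name, and the dir variable are hypothetical.

```shell
# Hypothetical scratch directory; stale outputs are removed because >> appends.
mkdir -p split_demo4
rm -f split_demo4/file_*.csv
printf '%s\n' \
  'id0,2020-12-12T07:18:26,7f' \
  'id1,2017-04-28T19:59:00,80' \
  'id2,2017-04-28T03:14:35,e4' \
  'id3,2020-12-12T23:45:09,ff' \
  'id4,2020-12-12T09:12:34,a1' \
  'id5,2017-04-28T00:31:54,65' \
  'id6,2020-12-12T20:13:47,45' \
  'id7,2017-04-28T21:04:30,7f' > split_demo4/sample.csv

# Sort on the timestamp column, then start a new file whenever the date changes.
LC_ALL=C sort -t, -k2,2 split_demo4/sample.csv |
awk -F, -v dir=split_demo4 '
{ date = substr($2,1,10) }
date != prev { out = dir "/file_" (++n) ".csv"; prev = date }
{ print >> out; close(out) }
'

cat split_demo4/file_1.csv
# id5,2017-04-28T00:31:54,65
# id2,2017-04-28T03:14:35,e4
# id1,2017-04-28T19:59:00,80
# id7,2017-04-28T21:04:30,7f
```

If the original line order inside each file matters, the order-preserving pipeline from Answer 2 is the better fit.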