I'd like to know whether there is an easy way, maybe a one-liner, using unix cli tools, to split a CSV file containing ISO-8601 UTC timestamps with millisecond precision (+%FT%T.%3NZ, e.g. 2021-05-27T13:59:33.641Z) along a defined time offset/break/difference (for example two hours).
As always there are different ways to get there, and for other users with a similar problem other options may also be relevant for a comprehensive answer; I...
- ...use/have Git 2.31.1's GNU Bash 4.4.23, GNU sed 4.8, GNU Awk 5.0.0 (and all the other tools it bundles), xsv 0.13.0 and jq 1.6 on Windows 7
- ...would rather use this in a script than in an interactive shell
- ...use semicolons (;) as delimiters, not commas
- ...do NOT quote my values (i.e. wrap them in single quotes (') or double quotes ("))
- ...have no header line
- ...already have the whole CSV in a variable and would like to have the results in variables (an array?) as well, to be able to analyze them further
- My columns do NOT actually have a fixed length and may contain spaces and hyphens in addition to alphanumeric characters
- The timestamp is the fifth of eight columns in my real-world data
- The file can be assumed to be at most 250k lines and 20 MiB
- While it would be nicer if the script/command took less than half a second on my i5-4300U, up to 5 or 10 seconds would still not be a deal breaker
Example
If I had an offset of 2 hours for the split (and didn't mix anything up), this file:
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z
ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z
abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
would be split into the following three parts:
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z

ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z

abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
Disclaimer: I'm not a native speaker, so feel free to rephrase if that makes this question easier to understand. Verbose answers are welcome, e.g. ones that also cover options that do not apply to my use case (commas, quoting). Using both the word semicolon and the symbol ; in the text of this question is intentional, for SEO purposes.
Answer 1
Given the sample CSV data in the variable $csv:
gawk '
    function timestamp2epoch(ts, m) {
        if (match(ts, /([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})\..*/, m))
            return mktime(m[1] " " m[2] " " m[3] " " m[4] " " m[5] " " m[6])
        else
            return -1
    }
    BEGIN {
        FS = ";"
        interval = 2 * 3600   # 2 hours
    }
    { t = timestamp2epoch($3) }
    t > start + interval { start = t; n++ }
    { batch[n] = batch[n] (batch[n] == "" ? "" : "/") $0 }
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (i in batch)
            print batch[i]
    }
' <<<"$csv"
Output
abc;square;2021-05-27T14:15:39.315Z/def;circle;2021-05-27T14:17:03.416Z/ghi;triang;2021-05-27T14:45:13.520Z/abc;circle;2021-05-27T15:25:47.624Z
ghi;square;2021-05-27T17:59:33.641Z/def;triang;2021-05-27T18:15:33.315Z
abc;circle;2021-05-27T21:12:13.350Z/ghi;triang;2021-05-27T21:15:31.135Z
This can be read into a shell array, for example:
mapfile -t batches < <(gawk '...' <<<"$csv")
declare -p batches
declare -a batches=([0]="abc;square;2021-05-27T14:15:39.315Z/def;circle;2021-05-27T14:17:03.416Z/ghi;triang;2021-05-27T14:45:13.520Z/abc;circle;2021-05-27T15:25:47.624Z" [1]="ghi;square;2021-05-27T17:59:33.641Z/def;triang;2021-05-27T18:15:33.315Z" [2]="abc;circle;2021-05-27T21:12:13.350Z/ghi;triang;2021-05-27T21:15:31.135Z")
and then iterated over like this:
for ((i = 0; i < "${#batches[@]}"; i++)); do
    IFS="/" read -ra records <<<"${batches[i]}"
    echo "batch $i"
    for record in "${records[@]}"; do echo " $record"; done
    echo
done
batch 0
 abc;square;2021-05-27T14:15:39.315Z
 def;circle;2021-05-27T14:17:03.416Z
 ghi;triang;2021-05-27T14:45:13.520Z
 abc;circle;2021-05-27T15:25:47.624Z

batch 1
 ghi;square;2021-05-27T17:59:33.641Z
 def;triang;2021-05-27T18:15:33.315Z

batch 2
 abc;circle;2021-05-27T21:12:13.350Z
 ghi;triang;2021-05-27T21:15:31.135Z
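The sample data has the timestamp in the third field, but the question notes that in the real-world data it is the fifth of eight columns. Only the field reference needs to change; here is a minimal sketch of the same gawk approach with the column number passed in as a variable (the name tscol is purely illustrative):

gawk -v tscol=5 '
    function timestamp2epoch(ts, m) {
        if (match(ts, /([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})\..*/, m))
            return mktime(m[1] " " m[2] " " m[3] " " m[4] " " m[5] " " m[6])
        return -1
    }
    BEGIN { FS = ";"; interval = 2 * 3600 }   # 2-hour offset, as in the question
    { t = timestamp2epoch($tscol) }           # $tscol is the timestamp field (5th here)
    t > start + interval { start = t; n++ }
    { batch[n] = batch[n] (batch[n] == "" ? "" : "/") $0 }
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (i in batch) print batch[i]
    }
' <<<"$csv"

As in the answer above, the records of a batch are joined with /; pick a different join character if / can occur in your values.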
Answer 2
The following perl script will output the input file, adding a blank line whenever it sees a line that is not within 2 hours of the previous batch's start time, splitting the input into batches of at most 2 hours' duration.
The start time is set when the first line is read and is only updated when the extra blank line is printed. This is to make sure that a new batch is started at least every 2 hours; otherwise your sample input would be split into only two batches (the 6 lines from 14:15 to 18:15, and the 2 lines at 21:12 and 21:15), and adding an extra log entry at 16:45 and another at 20:00 would prevent any split of the sample input at all.
It takes the date and time from the third field of the input. Note that perl arrays start at zero, not one, so $F[2] is the third field of the @F array.
#!/usr/bin/perl

use strict;
use Date::Parse;

my $start;

while(<>) {
  chomp;
  my $approx;
  my @F = split /;/;

  # approximate date/time to start of hour
  ($approx = $F[2]) =~ s/:\d\d:\d\d\.\d+Z$/:00:00/;
  my $now = str2time($approx);

  $start = $now if ($. == 1);

  if (($now - $start) > 7200) {
    $start = $now;
    print "\n";
  };

  print "$_\n";
}
Sample output:
$ ./split.pl input.csv
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z

ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z

abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
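Since the question also asks for the results in shell variables, the blank-line separated output shown above can be collected into a bash array, one element per batch with the records joined by newlines. A minimal sketch, assuming the script above is saved as split.pl:

batches=()
current=""
while IFS= read -r line; do
    if [[ -z $line ]]; then                      # blank line = batch boundary
        batches+=("$current")
        current=""
    else
        current+="${current:+$'\n'}$line"
    fi
done < <(./split.pl input.csv)
[[ -n $current ]] && batches+=("$current")       # the last batch has no trailing blank line
declare -p batches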
If you need the output in separate files instead, you could do it like this:
#!/usr/bin/perl

use strict;
use Date::Parse;

my $start;

# output-file counter
my $fc = 1;
my $outfile = "file.$fc.csv";

open (my $fh, ">", $outfile) || die "couldn't open $outfile for write: $!\n";

while(<>) {
  chomp;
  my $approx;
  my @F = split /;/;

  # approximate date/time to start of hour
  ($approx = $F[2]) =~ s/:\d\d:\d\d\.\d+Z$/:00:00/;
  my $now = str2time($approx);

  $start = $now if ($. == 1);

  if (($now - $start) > 7200) {
    $start = $now;

    close($fh);
    $fc++;
    $outfile = "file.$fc.csv";
    open ($fh, ">", $outfile) || die "couldn't open $outfile for write: $!\n";
  };

  print $fh "$_\n";
}
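As a quick check of the separate-files variant against the sample input (the script name split_files.pl is only illustrative), the three batches end up in file.1.csv through file.3.csv:

$ ./split_files.pl input.csv
$ wc -l file.*.csv
  4 file.1.csv
  2 file.2.csv
  2 file.3.csv
  8 total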
If you want either version of the script to be more flexible about the time formats it can handle, use:
($approx = $F[2]) =~ s/:\d\d:\d\d(?:\.\d+)?Z?$/:00:00/;
This makes both the fractional seconds and the Z in the time string optional.
Answer 3
Using GNU awk for gensub() and mktime():
$ cat tst.awk
BEGIN {
    FS = ";"
    maxSecs = 2 * 60 * 60
    prevTime = -(maxSecs + 1)
}
{
    split($3,dt,/[.]/)
    dateHMS = gensub(/[-T:]/," ","g",dt[1])
    currTime = mktime(dateHMS,1) "." dt[2]
    secsDelta = currTime - prevTime
    prevTime = currTime
}
secsDelta > maxSecs {
    close(out)
    out = "out" (++numOut)
}
{ print > out }
$ awk -f tst.awk file
$ head out?
==> out1 <==
abc;square;2021-05-27T14:15:39.315Z
def;circle;2021-05-27T14:17:03.416Z
ghi;triang;2021-05-27T14:45:13.520Z
abc;circle;2021-05-27T15:25:47.624Z

==> out2 <==
ghi;square;2021-05-27T17:59:33.641Z
def;triang;2021-05-27T18:15:33.315Z

==> out3 <==
abc;circle;2021-05-27T21:12:13.350Z
ghi;triang;2021-05-27T21:15:31.135Z
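Since the question keeps the CSV in a variable and wants the chunks in variables as well, the same script can be fed a here-string and the resulting out files collected into a bash array afterwards. A minimal sketch, assuming no unrelated out* files are present in the current directory:

gawk -f tst.awk <<<"$csv"        # writes out1, out2, ... into the current directory
batches=()
i=1
while [[ -f "out$i" ]]; do       # gather each chunk file into one array element
    batches+=("$(<"out$i")")
    ((i++))
done
declare -p batches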
Answer 4
If all the dates in the file belong to the same day:
#!/usr/bin/awk -f
BEGIN {
    FS=OFS=";"
    ho = 1
}
{
    # Split the last field into date and time
    split($NF, a, "T")
    # Get the hour from the time
    h = a[2]
    sub(/:.*$/, "", h)
    if (lh == 0) lh = h+ho
    if (h > lh) {
        lh = h+ho
        print "\n"
    }
}1
You can change ho (the hour offset) in the script's BEGIN block to split the csv along a different hour offset. The variant below converts each timestamp to an epoch value by calling date, so it is not limited to dates within a single day:
#!/usr/bin/awk -f
BEGIN {
    FS=OFS=";"
    # Set here the hour offset
    hour_offset = 1
    # Get the hour value in seconds
    ho = 60 * 60 * hour_offset
}
{
    sub(/Z$/, "", $NF)
    # Call /bin/date and translate the 'visual date' to
    # an epoch timestamp.
    cmd="/bin/date -d " $NF " +%s"
    epoch=((cmd | getline line) > 0 ? line : -1)
    close(cmd)
    if (epoch == -1) {
        print "date threw an error at line " NR;
        exit 1;
    }
    # If lh (last hour) is not set yet, set it to the current
    # epoch time plus the chosen offset
    if (!lh) lh = epoch + ho
    # If the current epoch minus the last hour processed is
    # greater than the chosen offset: update the offset and
    # print the separator
    if (epoch - lh > ho) {
        lh = epoch + ho
        print ""
    }
}1
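One caveat: because this version spawns /bin/date once per input line, it may struggle with the 250k-line upper bound and the sub-second target mentioned in the question. A possible workaround, sketched here as an assumption rather than part of the answer above, is to convert all timestamps with a single GNU date invocation (its -f option reads one date per line) and paste the epoch values onto the lines before splitting; field numbers follow the 3-column sample data, and the fractional timestamps are assumed to be parseable by date, just as the script above already assumes:

# one date call converts every timestamp (field 3 in the sample) to epoch seconds
# and pastes it onto each line as an extra column
cut -d';' -f3 input.csv | sed 's/Z$//' | date -f - +%s.%3N |
    paste -d';' input.csv - > with_epoch.csv

# split on the appended epoch column (now field 4): a 2-hour break measured in
# exact seconds, dropping the helper column again before printing
awk -F';' '
    NR == 1           { start = $4 }
    $4 - start > 7200 { start = $4; print "" }
    { sub(/;[^;]*$/, ""); print }
' with_epoch.csv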