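I need to print, in a new field, the difference in days between the start and end dates ($6) of the records for each unique ID ($5) that has more than two records.
The data looks like this: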
7 65 2 5 32070 2010-12-14 13:25:30
7 82 2 10 41920 2010-12-14 11:30:45
7 65 2 5 32070 2010-03-25 10:15:45
7 83 1 67 29446 2010-12-14 04:15:25
7 81 1 47 32070 2011-5-11 08:14:20
7 83 1 67 29446 2011-03-10 06:10:23
7 82 2 10 41920 2011-02-28 06:25:30
7 83 1 67 29446 2011-6-22 07:13:24
7 82 2 10 41920 2011-5-14 06:15:25
I need the output to look like this:
7 65 2 5 32070 2010-12-14 13:25:30 147
7 82 2 10 41920 2010-12-14 11:30:45 150
7 65 2 5 32070 2010-03-25 10:15:45 147
7 83 1 67 29446 2010-12-14 04:15:25 189
7 81 1 47 32070 2011-5-11 08:14:20 147
7 83 1 67 29446 2011-03-10 06:10:23 189
7 82 2 10 41920 2011-02-28 06:25:30 150
7 83 1 67 29446 2011-6-22 07:13:24 189
7 82 2 10 41920 2011-5-14 06:15:25 150
I wrote the following code, but it does not handle more than two records per unique ID ($5).
$ awk 'NR==FNR {
c = "date -d \""$6 "\" +%s"; # use system date for epoch time seconds
c | getline d; # execute command in c var,output to d
a[$5] = (($5 in a) ? d-a[$5] : d); # set or subtract from array
next # skip to next record
} { # for the second go:
# $1=$1; # uncomment to clean trailing space
print $0, int(a[$5]/86400) # print record and time difference
}' file file
Answer 1
This solution requires GNU awk:
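# Pass the input file twice so that NR == FNR is true only on the first pass.
NR == FNR {
    # First pass: turn the date in $6 into epoch seconds (time of day set to 00:00:00).
    split($6, arr, "-");
    date = mktime(sprintf("%4d %02d %02d 00 00 00", arr[1], arr[2], arr[3]));
    # Track the earliest and latest date seen for each ID in $5.
    if (!start[$5] || date < start[$5]) {
        start[$5] = date;
    }
    if (date > stop[$5]) {
        stop[$5] = date;
    }
    next;
}
{
    # Second pass: append the span between earliest and latest date, in whole days.
    print $0 " " int((stop[$5] - start[$5]) / (3600 * 24));
}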
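For readers without GNU awk, the same two-pass idea can probably be sketched in POSIX awk by converting each date to a day number with integer arithmetic instead of calling mktime. This is a variant under my own assumptions, not the answer's method: the `date_span` shell function and the `days` helper are hypothetical names. Like the answer, it ignores the time of day in $7 and measures from the earliest to the latest date per ID. Because it never goes through local time, a daylight-saving shift cannot shorten the difference by an hour.

```shell
# date_span FILE: hypothetical helper, not part of the original answer.
# Prints each record followed by the number of days between the earliest
# and latest date ($6) recorded for its ID ($5).
date_span() {
    awk '
    # days(s): convert "YYYY-M-D" to a Julian day number using integer math only.
    function days(s,    p, a, y, m) {
        split(s, p, "-")
        a = int((14 - p[2]) / 12)          # 1 for January/February, else 0
        y = p[1] + 4800 - a
        m = p[2] + 12 * a - 3              # March = 0 ... February = 11
        return p[3] + int((153 * m + 2) / 5) + 365 * y \
               + int(y / 4) - int(y / 100) + int(y / 400) - 32045
    }
    NR == FNR {
        # First pass: remember the earliest and latest day number per ID.
        n = days($6)
        if (!($5 in start) || n < start[$5]) start[$5] = n
        if (!($5 in stop)  || n > stop[$5])  stop[$5]  = n
        next
    }
    {
        # Second pass: append the span in days to every record.
        print $0, stop[$5] - start[$5]
    }
    ' "$1" "$1"    # the same file is read twice
}
```

Calling `date_span file` prints every input record with the day span for its ID appended as a new last field.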