我有包含以下元素的 CSV 文件:
timestamp, data1, data2, data3, data4, data5
2015-03-16 00:00:00, 3, 3, 4, 2, 5
2015-03-16 00:10:00, 4, 7, 8, 9, 3
2015-03-16 00:20:00, 1, 23, , 5, 4
2015-03-16 00:30:00, 3, 2, 46, 7, 6
2015-03-16 00:40:00, 4, 5, 6, 8, 4
2015-03-16 00:50:00, 2, , 4, , 2
2015-03-16 01:00:00, 22, 1, 3, 4, 3
2015-03-16 01:50:00, 3, , 7, 8, 89
2015-03-16 02:00:00, 44, 5, 6, 7,
2015-03-16 02:10:00, 3, 2, 1, 2, 2
2015-03-16 02:20:00, 4, , , 3, 3
2015-03-16 03:30:00, 5, 6, 7, 4, 3
我想添加缺失的时间戳并在NaN
数据不存在的地方填充数据字段,如下所示:
timestamp, data1, data2, data3, data4, data5
2015-03-16 00:00:00, 3, 3, 4, 2, 5
2015-03-16 00:10:00, 4, 7, 8, 9, 3
2015-03-16 00:20:00, 1, 23, NaN, 5, 4
2015-03-16 00:30:00, 3, 2, 46, 7, 6
2015-03-16 00:40:00, 4, 5, 6, 8, 4
2015-03-16 00:50:00, 2, NaN, 4, NaN, 2
2015-03-16 01:00:00, 22, 1, 3, 4, 3
2015-03-16 01:10:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 01:20:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 01:30:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 01:40:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 01:50:00, 3, NaN, 7, 8, 89
2015-03-16 02:00:00, 44, 5, 6, 7, NaN
2015-03-16 02:10:00, 3, 2, 1, 2, 2
2015-03-16 02:20:00, 4, NaN, NaN, 3, 3
2015-03-16 02:30:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 02:40:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 02:50:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 03:00:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 03:10:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 03:20:00, NaN, NaN, NaN, NaN, NaN
2015-03-16 03:30:00, 5, 6, 7, 4, 3
我怎样才能做到这一点?请注意,这种情况会持续几天,而不仅仅是一天。为了简单起见,我使用了一天的数据。我正在使用 Cygwin,谢谢。
答案1
我在下面使用 Miller,而不是awk
因为没有任何awk
实现对时间戳解析有任何真正的支持,并且除了解析 CSV 之外不支持解析简单的CSV(无嵌入分隔符或换行符)。
使用磨坊主( mlr
) 解析 CSV,清理空格,生成缺失的时间戳,最后用字符串替换缺失值NaN
:
mlr --csv \
clean-whitespace then \
put 't1 = strptime($timestamp,"%F %T");
NR > 1 {
while (@t + 600 != t1) {
@t += 600;
timestamp = strftime(@t,"%F %T");
emit timestamp
}
}
@t = t1' then \
unsparsify then \
put 'for (k,v in $*) { is_empty(v) { $[k] = "NaN" } }' file
该clean-whitespace
操作从所有字段中删除所有侧翼空格,并将所有连续的内部空格压缩为单个空格。
两个put
表达式中的第一个生成丢失的时间戳。它通过跟踪“当前时间” @t
(使用 Unix 时间)并以 600 秒(10 分钟)的步长向前推进,直到它对应于从输入读取的时间戳来实现此目的t1
。对于每个步骤,时间戳的输出格式与文件中的格式相同。为此,显然假设每个输入时间戳以 10 分钟间隔出现。
由于第一个put
表达式生成的新记录缺少除 之外的所有字段timestamp
,因此我们使用unsparsify
添加缺失的字段。
第二个put
表达式迭代所有字段并将每个空字段设置为 string NaN
。
考虑到问题中的数据,输出:
timestamp,data1,data2,data3,data4,data5
2015-03-16 00:00:00,3,3,4,2,5
2015-03-16 00:10:00,4,7,8,9,3
2015-03-16 00:20:00,1,23,NaN,5,4
2015-03-16 00:30:00,3,2,46,7,6
2015-03-16 00:40:00,4,5,6,8,4
2015-03-16 00:50:00,2,NaN,4,NaN,2
2015-03-16 01:00:00,22,1,3,4,3
2015-03-16 01:10:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:20:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:30:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:40:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:50:00,3,NaN,7,8,89
2015-03-16 02:00:00,44,5,6,7,NaN
2015-03-16 02:10:00,3,2,1,2,2
2015-03-16 02:20:00,4,NaN,NaN,3,3
2015-03-16 02:30:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 02:40:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 02:50:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:00:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:10:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:20:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:30:00,5,6,7,4,3
答案2
使用 GNU awk 执行时间函数 (mktime()
和strftime()
)、gensub()
和\s
:
$ cat tst.awk
BEGIN { FS=OFS="," }
{ gsub(/\s*,\s*/,",") }
NR == 1 { print; next }
NR == 2 {
dfltData = gensub(/[^,]/,"","g")
prevSecs = mktime(gensub(/[- :]/," ","g",$1))
}
{
currTime = $1
currData = $0
while ( prevTime < currTime ) {
prevTime = strftime("%F %T",prevSecs += 600)
if ( prevTime < currTime ) {
$0 = dfltData
$1 = prevTime
prt()
}
}
$0 = currData
prt()
}
function prt() {
$0 = gensub(/,(,|$)/,",Nan\\1","g") # this needs 2 passes
print gensub(/,(,|$)/,",Nan\\1","g")
}
$ awk -f tst.awk file
timestamp,data1,data2,data3,data4,data5
2015-03-16 00:00:00,3,3,4,2,5
2015-03-16 00:10:00,4,7,8,9,3
2015-03-16 00:20:00,1,23,Nan,5,4
2015-03-16 00:30:00,3,2,46,7,6
2015-03-16 00:40:00,4,5,6,8,4
2015-03-16 00:50:00,2,Nan,4,Nan,2
2015-03-16 01:00:00,22,1,3,4,3
2015-03-16 01:10:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:20:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:30:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:40:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:50:00,3,Nan,7,8,89
2015-03-16 02:00:00,44,5,6,7,Nan
2015-03-16 02:10:00,3,2,1,2,2
2015-03-16 02:20:00,4,Nan,Nan,3,3
2015-03-16 02:30:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 02:40:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 02:50:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:00:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:10:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:20:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:30:00,5,6,7,4,3