插入时间戳并替换 CSV 文件中的空字段

插入时间戳并替换 CSV 文件中的空字段

我有包含以下元素的 CSV 文件:

timestamp,      data1,  data2,  data3,  data4,  data5
2015-03-16 00:00:00,    3,  3,  4,  2,  5
2015-03-16 00:10:00,    4,  7,  8,  9,  3
2015-03-16 00:20:00,    1,  23, ,   5,  4
2015-03-16 00:30:00,    3,  2,  46, 7,  6
2015-03-16 00:40:00,    4,  5,  6,  8,  4
2015-03-16 00:50:00,    2,  ,   4,  ,   2
2015-03-16 01:00:00,    22, 1,  3,  4,  3
2015-03-16 01:50:00,    3,  ,   7,  8,  89
2015-03-16 02:00:00,    44, 5,  6,  7,  
2015-03-16 02:10:00,    3,  2,  1,  2,  2
2015-03-16 02:20:00,    4,  ,   ,   3,  3
2015-03-16 03:30:00,    5,  6,  7,  4,  3

我想添加缺失的时间戳并在NaN数据不存在的地方填充数据字段,如下所示:

timestamp,      data1,  data2,  data3,  data4,  data5
2015-03-16 00:00:00,    3,  3,  4,  2,  5
2015-03-16 00:10:00,    4,  7,  8,  9,  3
2015-03-16 00:20:00,    1,  23, NaN,    5,  4
2015-03-16 00:30:00,    3,  2,  46, 7,  6
2015-03-16 00:40:00,    4,  5,  6,  8,  4
2015-03-16 00:50:00,    2,  NaN,    4,  NaN,    2
2015-03-16 01:00:00,    22, 1,  3,  4,  3
2015-03-16 01:10:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 01:20:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 01:30:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 01:40:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 01:50:00,    3,  NaN,    7,  8,  89
2015-03-16 02:00:00,    44, 5,  6,  7,  NaN
2015-03-16 02:10:00,    3,  2,  1,  2,  2
2015-03-16 02:20:00,    4,  NaN,    NaN,    3,  3
2015-03-16 02:30:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 02:40:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 02:50:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 03:00:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 03:10:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 03:20:00,    NaN,    NaN,    NaN,    NaN,    NaN
2015-03-16 03:30:00,    5,  6,  7,  4,  3

我怎样才能做到这一点?请注意,这种情况会持续几天,而不仅仅是一天。为了简单起见,我使用了一天的数据。我正在使用 Cygwin,谢谢。

答案1

我在下面使用 Miller,而不是awk因为没有任何awk实现对时间戳解析有任何真正的支持,并且除了解析 CSV 之外不支持解析简单的CSV(无嵌入分隔符或换行符)。

使用磨坊主( mlr) 解析 CSV,清理空格,生成缺失的时间戳,最后用字符串替换缺失值NaN

mlr --csv \
    clean-whitespace then \
    put 't1 = strptime($timestamp,"%F %T");
         NR > 1 {
             while (@t + 600 != t1) {
                 @t += 600;
                 timestamp = strftime(@t,"%F %T");
                 emit timestamp
             }
         }
         @t = t1' then \
    unsparsify then \
    put 'for (k,v in $*) { is_empty(v) { $[k] = "NaN" } }' file

clean-whitespace操作从所有字段中删除所有侧翼空格,并将所有连续的内部空格压缩为单个空格。

两个put表达式中的第一个生成丢失的时间戳。它通过跟踪“当前时间” @t(使用 Unix 时间)并以 600 秒(10 分钟)的步长向前推进,直到它对应于从输入读取的时间戳来实现此目的t1。对于每个步骤,时间戳的输出格式与文件中的格式相同。为此,显然假设每个输入时间戳以 10 分钟间隔出现。

由于第一个put表达式生成的新记录缺少除 之外的所有字段timestamp,因此我们使用unsparsify添加缺失的字段。

第二个put表达式迭代所有字段并将每个空字段设置为 string NaN

考虑到问题中的数据,输出:

timestamp,data1,data2,data3,data4,data5
2015-03-16 00:00:00,3,3,4,2,5
2015-03-16 00:10:00,4,7,8,9,3
2015-03-16 00:20:00,1,23,NaN,5,4
2015-03-16 00:30:00,3,2,46,7,6
2015-03-16 00:40:00,4,5,6,8,4
2015-03-16 00:50:00,2,NaN,4,NaN,2
2015-03-16 01:00:00,22,1,3,4,3
2015-03-16 01:10:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:20:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:30:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:40:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 01:50:00,3,NaN,7,8,89
2015-03-16 02:00:00,44,5,6,7,NaN
2015-03-16 02:10:00,3,2,1,2,2
2015-03-16 02:20:00,4,NaN,NaN,3,3
2015-03-16 02:30:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 02:40:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 02:50:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:00:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:10:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:20:00,NaN,NaN,NaN,NaN,NaN
2015-03-16 03:30:00,5,6,7,4,3

答案2

使用 GNU awk 执行时间函数 (mktime()strftime())、gensub()\s

$ cat tst.awk
BEGIN { FS=OFS="," }
{ gsub(/\s*,\s*/,",") }
NR == 1 { print; next }
NR == 2 {
    dfltData = gensub(/[^,]/,"","g")
    prevSecs = mktime(gensub(/[- :]/," ","g",$1))
}
{
    currTime = $1
    currData = $0
    while ( prevTime < currTime ) {
        prevTime = strftime("%F %T",prevSecs += 600)
        if ( prevTime < currTime ) {
            $0 = dfltData
            $1 = prevTime
            prt()
        }
    }
    $0 = currData
    prt()
}

function prt() {
    $0 = gensub(/,(,|$)/,",Nan\\1","g") # this needs 2 passes
    print gensub(/,(,|$)/,",Nan\\1","g")
}

$ awk -f tst.awk file
timestamp,data1,data2,data3,data4,data5
2015-03-16 00:00:00,3,3,4,2,5
2015-03-16 00:10:00,4,7,8,9,3
2015-03-16 00:20:00,1,23,Nan,5,4
2015-03-16 00:30:00,3,2,46,7,6
2015-03-16 00:40:00,4,5,6,8,4
2015-03-16 00:50:00,2,Nan,4,Nan,2
2015-03-16 01:00:00,22,1,3,4,3
2015-03-16 01:10:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:20:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:30:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:40:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 01:50:00,3,Nan,7,8,89
2015-03-16 02:00:00,44,5,6,7,Nan
2015-03-16 02:10:00,3,2,1,2,2
2015-03-16 02:20:00,4,Nan,Nan,3,3
2015-03-16 02:30:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 02:40:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 02:50:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:00:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:10:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:20:00,Nan,Nan,Nan,Nan,Nan
2015-03-16 03:30:00,5,6,7,4,3

相关内容