根据最新时间戳数据创建新的 CSV 文件

Question

假设：

对于名为的输出文件，01_file1_202311301100.csv字符串'file1'来自输入文件名称中的第二个“_”分隔字段（例如，Real_file1_table2.dat）
OP的代码似乎为“昨天”创建了一个新的子目录，但这意味着我们不应该处理日期为“今天”的输入文件条目；为了这个答案，我将假设所有输入/输出文件都位于当前目录中； OP 可以扩展代码来寻址子目录以及如何处理“昨天”与“今天”
唯一用双引号括起来的输入字段是第一个（逗号分隔）字段
第一个字段将始终是格式"YYYY-MM-DD HH:MM:SS"，否则我们忽略该行
没有输入字段包含嵌入的换行符

总体设计：

确定输出文件的bash前缀 ( pfx)（根据输入文件的名称）
确定bash“最后一个”输出文件的名称
pfx将“最后一个”输出文件名传递给awk
用于awk处理输入*.dat文件
根据第一个字段的内容构建输出文件名（例如，2023-11-30 11:00:00变为202311301100）
如果输出文件名是少于“最后一个”输出文件名，这告诉我们输出文件已经存在，所以我们将忽略输入行
在输出文件名是的情况下等于“最后一个”输出文件名，我们将继续生成一个新的输出文件（这应该解决在脚本运行之前和之后将日期时间值 - 例如： 2023-11-30 11:00- 添加到文件的情况*.datawk
在输出文件名是的情况下比...更棒“最后一个”输出文件名，这告诉我们需要生成一个新的输出文件

一种bash / awk方法：

for datfile in *_table2.dat
do
    [[ ! -f "${datfile}" ]] && break

    ############
    #### the following bash code needs to be run before each run of the awk script

    IFS='_' read -r _ pfx _ <<< "${datfile}"

    case "${pfx}" in
        file1)  pfx="01_${pfx}" ;;
        file2)  pfx="02_${pfx}" ;;    
        file3)  pfx="03_${pfx}" ;;
            *)  pfx="00_${pfx}" ;;
    esac

    last_file="${pfx}_000000000000.csv"

    for outfile in "${pfx}"_*.csv
    do
        [[ -f "${outfile}" ]] && last_file="${outfile}"
    done

    ############
    #### at this point we have:
    ####   1) the '##_file#' prefix for our new output files(s)
    ####   2) the name of the 'last' output file

    awk -v pfx="${pfx}" -v last_file="${last_file}" '
    BEGIN      { FS=OFS=","
                 regex = "^\"[0-9]{4}.*\"$"                               # 1st field regex: "YYYY..."
               }

    FNR==2     { hdr = $0 }

    $1 ~ regex { dt = $1                                                    # copy 1st field
                 gsub(/[^[:digit:]]/,"",dt)                               # strip out everything other than digits
                 dt = substr(dt,1,12)                                     # grab YYYY-MM-DD HH:MM which now looks like YYYYMMDDHHMM

                 if ( dt != dt_prev ) {                                   # if this is a new dt value
                    dt_prev = dt
                    printme = 1                                           # default to printing input lines to new output file

                    close(outfile)                                        # close previous output file
                    outfile = pfx "_" dt ".csv"                           # build new output file name

                    if ( outfile < last_file ) {                          # if "less than" last file then we will skip
                       printf "WARNING: file exists: %s (skipping)\n", outfile
                       printme = 0
                    }
                    else
                    if ( outfile == last_file ) {                         # if "equal to" last file then overwrite
                       printf "WARNING: file exists: %s (overwriting)\n", outfile
                       print hdr > outfile                                # print default header to our overwrite file
                    }
                    else                                                  # else new output file is "greater than" last file
                       print hdr > outfile                                # print default header to our new output file
                 }

                 if ( printme ) {                                         # if printme==1 then print current line to outfile
                    print $1,$2,sprintf("%0.3f%s%0.3f%s%0.3f",$3,OFS,$4,OFS,$5) > outfile
                 }
               }
    ' "${datfile}"
done

针对 OP 的第一个版本运行Real_file1_table2.dat：

$ awk ....

$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676

==> 01_file1_202311301101.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969

为了测试“覆盖”逻辑，我们将更改 OP 的第二个版本，Real_file1_table2.dat如下所示：

$ cat Real_file1_table2.2.dat
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:01:00",666666,0.7777777,0.8888888,17.99999    # another 2023-11-30 11:01 entry
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702

针对这个新版本运行Real_file1_table2.dat：

$ awk ...
WARNING: file exists: 01_file1_202311301100.csv (skipping)
WARNING: file exists: 01_file1_202311301101.csv (overwriting)

$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676

==> 01_file1_202311301101.csv <==
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:01:00",666666,0.7777777,0.8888888,17.99999

==> 01_file1_202311301102.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404

==> 01_file1_202311301103.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702

Answer 1

假设：

对于名为的输出文件，01_file1_202311301100.csv字符串'file1'来自输入文件名称中的第二个“_”分隔字段（例如，Real_file1_table2.dat）
OP的代码似乎为“昨天”创建了一个新的子目录，但这意味着我们不应该处理日期为“今天”的输入文件条目；为了这个答案，我将假设所有输入/输出文件都位于当前目录中； OP 可以扩展代码来寻址子目录以及如何处理“昨天”与“今天”
唯一用双引号括起来的输入字段是第一个（逗号分隔）字段
第一个字段将始终是格式"YYYY-MM-DD HH:MM:SS"，否则我们忽略该行
没有输入字段包含嵌入的换行符

总体设计：

确定输出文件的bash前缀 ( pfx)（根据输入文件的名称）
确定bash“最后一个”输出文件的名称
pfx将“最后一个”输出文件名传递给awk
用于awk处理输入*.dat文件
根据第一个字段的内容构建输出文件名（例如，2023-11-30 11:00:00变为202311301100）
如果输出文件名是少于“最后一个”输出文件名，这告诉我们输出文件已经存在，所以我们将忽略输入行
在输出文件名是的情况下等于“最后一个”输出文件名，我们将继续生成一个新的输出文件（这应该解决在脚本运行之前和之后将日期时间值 - 例如： 2023-11-30 11:00- 添加到文件的情况*.datawk
在输出文件名是的情况下比...更棒“最后一个”输出文件名，这告诉我们需要生成一个新的输出文件

一种bash / awk方法：

for datfile in *_table2.dat
do
    [[ ! -f "${datfile}" ]] && break

    ############
    #### the following bash code needs to be run before each run of the awk script

    IFS='_' read -r _ pfx _ <<< "${datfile}"

    case "${pfx}" in
        file1)  pfx="01_${pfx}" ;;
        file2)  pfx="02_${pfx}" ;;    
        file3)  pfx="03_${pfx}" ;;
            *)  pfx="00_${pfx}" ;;
    esac

    last_file="${pfx}_000000000000.csv"

    for outfile in "${pfx}"_*.csv
    do
        [[ -f "${outfile}" ]] && last_file="${outfile}"
    done

    ############
    #### at this point we have:
    ####   1) the '##_file#' prefix for our new output files(s)
    ####   2) the name of the 'last' output file

    awk -v pfx="${pfx}" -v last_file="${last_file}" '
    BEGIN      { FS=OFS=","
                 regex = "^\"[0-9]{4}.*\"$"                               # 1st field regex: "YYYY..."
               }

    FNR==2     { hdr = $0 }

    $1 ~ regex { dt = $1                                                    # copy 1st field
                 gsub(/[^[:digit:]]/,"",dt)                               # strip out everything other than digits
                 dt = substr(dt,1,12)                                     # grab YYYY-MM-DD HH:MM which now looks like YYYYMMDDHHMM

                 if ( dt != dt_prev ) {                                   # if this is a new dt value
                    dt_prev = dt
                    printme = 1                                           # default to printing input lines to new output file

                    close(outfile)                                        # close previous output file
                    outfile = pfx "_" dt ".csv"                           # build new output file name

                    if ( outfile < last_file ) {                          # if "less than" last file then we will skip
                       printf "WARNING: file exists: %s (skipping)\n", outfile
                       printme = 0
                    }
                    else
                    if ( outfile == last_file ) {                         # if "equal to" last file then overwrite
                       printf "WARNING: file exists: %s (overwriting)\n", outfile
                       print hdr > outfile                                # print default header to our overwrite file
                    }
                    else                                                  # else new output file is "greater than" last file
                       print hdr > outfile                                # print default header to our new output file
                 }

                 if ( printme ) {                                         # if printme==1 then print current line to outfile
                    print $1,$2,sprintf("%0.3f%s%0.3f%s%0.3f",$3,OFS,$4,OFS,$5) > outfile
                 }
               }
    ' "${datfile}"
done

针对 OP 的第一个版本运行Real_file1_table2.dat：

$ awk ....

$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676

==> 01_file1_202311301101.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969

为了测试“覆盖”逻辑，我们将更改 OP 的第二个版本，Real_file1_table2.dat如下所示：

$ cat Real_file1_table2.2.dat
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:01:00",666666,0.7777777,0.8888888,17.99999    # another 2023-11-30 11:01 entry
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702

针对这个新版本运行Real_file1_table2.dat：

$ awk ...
WARNING: file exists: 01_file1_202311301100.csv (skipping)
WARNING: file exists: 01_file1_202311301101.csv (overwriting)

$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676

==> 01_file1_202311301101.csv <==
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:01:00",666666,0.7777777,0.8888888,17.99999

==> 01_file1_202311301102.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404

==> 01_file1_202311301103.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702

根据最新时间戳数据创建新的 CSV 文件

答案1

相关内容