I have a script that reads multiple .dat files and generates CSV files from the previous day's data. The DAT files are updated every minute with data from various instruments.
Script snippet:
gawk -F, '
{ gsub(/"/,"") }
FNR==2{
delete timestamp; #This code was added
start=strftime("%Y %m %d 00 00 00", systime()-172800);#to fix the
for(min=0; min<1440; min++) #timestamp formatting
timestamp[strftime("%F %H:%M", mktime(start)+min*60)] #issue from the input files.
fname=FILENAME;
gsub(/Real_|_Table2.*\.dat$/,"", fname);
$2="col1";
$3="col2";
$4="col3";
$5="col4";
if ( fname=="file1") ID1="01";
else if ( fname=="file2") ID1="02";
else if ( fname=="file3") ID1="03";
else ID1="00";
hdr=$1 FS $2 FS $3 FS $4 FS $5;
yday=strftime("%Y%m%d", systime()-86400);
dirName=yday;
system("mkdir -p "dirName); next
}
(substr($1,1,16) in timestamp){
fname=FILENAME;
gsub(/Real_|_Table2.*\.dat$/,"", fname);
cp=$1; gsub(/[-: ]|00$/, "", cp);
if ( fname=="file2"|| fname=="file3")
printf("%s%s,%.3f,%.3f,%.3f,%.3f\n", hdr ORS, $1, $3, $2, $4, $6)>(dirName"/"ID1"_"fname"_"cp".csv");
else
printf("%s%s,%.3f,%.3f,%.3f,%.3f\n", hdr ORS, $1, $3, $5, "999")>(dirName"/"ID1"_"fname"_"cp".csv");
close(dirName"/"ID1"_"fname"_"cp".csv");
delete timestamp[substr($1,1,16)] }
ENDFILE{ for (x in timestamp){
cpx=x; gsub(/[-: ]/, "", cpx);
print hdr ORS x "-999,-999,-999,-999," >(dirName"/"ID1"_"fname"_"cpx".csv");
close(dirName"/"ID1"_"fname"_"cpx".csv")
}
}' *_Table2.dat
I would like to edit the script so that it scans the .dat files for new data and creates CSV files only for that new data. In its current form the script creates a CSV file for every timestamp in the *.dat files, whether the data is new or historical.
Sample input file (Real_file1_table2.dat):
"Data for Site1"
TIMESTAMP,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
Note that the header is on line 2. The script then creates the following output files:
01_file1_202311301100.csv
01_file1_202311301101.csv
etc.
The data contained in each CSV file is based on the timestamp.
For example, 01_file1_202311301100.csv contains the following data:
TIMESTAMP,col1,col2,col3,col4
2023/11/30 11:00,289233,0.349,0.241,333.267
01_file1_202311301101.csv contains the following data:
TIMESTAMP,col1,col2,col3,col4
2023/11/30 11:01,289234,1.035,1.019,344.196
etc.
Note that the values in these CSV files are rounded to 3 decimal places.
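As a sketch of the naming and rounding conventions just described (illustration only, not part of the script; the values are taken from the sample file above):

```shell
# Map a timestamp to its per-minute CSV name: strip '-', ':' and the space,
# then keep the first 12 digits (YYYYMMDDHHMM).
ts='2023-11-30 11:00:00'
stamp=$(printf '%s' "$ts" | tr -d ':- ' | cut -c1-12)
printf '01_file1_%s.csv\n' "$stamp"   # -> 01_file1_202311301100.csv

# Values are formatted to 3 decimal places with printf's %.3f.
printf '%.3f\n' 0.2412115             # -> 0.241
```

Note that %.3f rounds to the nearest thousandth rather than truncating.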
When the script is executed a second time, Real_file1_table2.dat now contains the following data:
"Data for Site1"
TIMESTAMP,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702
I would like the script to create CSV files only for the newest data, i.e.:
01_file1_202311301102.csv
01_file1_202311301103.csv
I do not want to recreate CSV files that already exist. So each time the script runs, it must create CSV files only for the newest data.
Any help is appreciated.
Answer 1
Assumptions:
- for an output file named 01_file1_202311301100.csv, the string file1 comes from the 2nd '_'-delimited field of the input file's name (e.g., Real_file1_table2.dat)
- OP's code appears to create a new subdirectory for 'yesterday', which implies we should not be processing input file entries with a date of 'today'; for the sake of this answer I'm going to assume all input/output files reside in the current directory; OP can extend the code to address subdirectories and how to handle 'yesterday' vs 'today'
- the only input field wrapped in double quotes is the 1st (comma-delimited) field
- the 1st field will always be of the format "YYYY-MM-DD HH:MM:SS", otherwise we ignore the line
- no input fields contain embedded linefeeds
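The timestamp assumption can be checked with a minimal awk filter (a sketch; the sample lines come from the question, and the character classes are spelled out so the regex works in any awk):

```shell
# Keep only lines whose 1st comma-delimited field is a quoted "YYYY-MM-DD ..." stamp.
printf '%s\n' '"Data for Site1"' '"2023-11-30 11:00:00",289233' |
awk -F, '$1 ~ /^"[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] /'
# -> "2023-11-30 11:00:00",289233
```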
General design:
- in bash: determine the prefix (pfx) for the output files (based on the input file's name)
- in bash: determine the name of the 'last' output file for pfx, and pass that name to awk
- in awk: process the input *.dat file
- build the output file name based on the contents of the 1st field (e.g., 2023-11-30 11:00:00 becomes 202311301100)
- if the output file name is less than the 'last' output file name, the output file already exists, so we ignore the input line
- if the output file name is equal to the 'last' output file name, we proceed with generating a new output file (this should address the case where additional rows for a given minute, e.g. 2023-11-30 11:00, were appended to the *.dat file after the previous run of the script)
- if the output file name is greater than the 'last' output file name, we need to generate a new output file
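Because the YYYYMMDDHHMM stamp is fixed-width, plain string comparison orders the output file names chronologically; the three-way test above, sketched with hypothetical names:

```shell
# Lexicographic comparison of fixed-width stamps == chronological comparison.
last='01_file1_202311301101.csv'
new='01_file1_202311301102.csv'
if   [[ "$new" < "$last" ]]; then echo 'skip (already exists)'
elif [[ "$new" == "$last" ]]; then echo 'overwrite'
else echo 'create new'
fi
# -> create new
```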
One bash/awk approach:
for datfile in *_table2.dat
do
[[ ! -f "${datfile}" ]] && break
############
#### the following bash code needs to be run before each run of the awk script
IFS='_' read -r _ pfx _ <<< "${datfile}"
case "${pfx}" in
file1) pfx="01_${pfx}" ;;
file2) pfx="02_${pfx}" ;;
file3) pfx="03_${pfx}" ;;
*) pfx="00_${pfx}" ;;
esac
last_file="${pfx}_000000000000.csv"
for outfile in "${pfx}"_*.csv
do
[[ -f "${outfile}" ]] && last_file="${outfile}"
done
############
#### at this point we have:
#### 1) the '##_file#' prefix for our new output files(s)
#### 2) the name of the 'last' output file
awk -v pfx="${pfx}" -v last_file="${last_file}" '
BEGIN { FS=OFS=","
regex = "^\"[0-9]{4}.*\"$" # 1st field regex: "YYYY..."
}
FNR==2 { hdr = $0 }
$1 ~ regex { dt = $1 # copy 1st field
gsub(/[^[:digit:]]/,"",dt) # strip out everything other than digits
dt = substr(dt,1,12) # grab YYYY-MM-DD HH:MM which now looks like YYYYMMDDHHMM
if ( dt != dt_prev ) { # if this is a new dt value
dt_prev = dt
printme = 1 # default to printing input lines to new output file
close(outfile) # close previous output file
outfile = pfx "_" dt ".csv" # build new output file name
if ( outfile < last_file ) { # if "less than" last file then we will skip
printf "WARNING: file exists: %s (skipping)\n", outfile
printme = 0
}
else
if ( outfile == last_file ) { # if "equal to" last file then overwrite
printf "WARNING: file exists: %s (overwriting)\n", outfile
print hdr > outfile # print default header to our overwrite file
}
else # else new output file is "greater than" last file
print hdr > outfile # print default header to our new output file
}
if ( printme ) { # if printme==1 then print current line to outfile
print $1,$2,sprintf("%0.3f%s%0.3f%s%0.3f",$3,OFS,$4,OFS,$5) > outfile
}
}
' "${datfile}"
done
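For reference, this is how the prefix extraction at the top of the loop splits an input file name (illustration only):

```shell
# Split on '_' and keep the 2nd field: Real_file1_table2.dat -> file1
IFS='_' read -r _ pfx _ <<< 'Real_file1_table2.dat'
echo "$pfx"   # -> file1
```

Prefixing the assignment (IFS='_') scopes the field separator change to the read command alone.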
Running against OP's first version of Real_file1_table2.dat:
$ awk ....
$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.350,0.241,333.268
==> 01_file1_202311301101.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:01:00",289234,1.036,1.020,344.197
To test the 'overwrite' logic we'll change OP's second version of Real_file1_table2.dat to look like this:
$ cat Real_file1_table2.2.dat
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:01:00",666666,0.7777777,0.8888888,17.99999 # another 2023-11-30 11:01 entry
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702
Running against this new version of Real_file1_table2.dat:
$ awk ...
WARNING: file exists: 01_file1_202311301100.csv (skipping)
WARNING: file exists: 01_file1_202311301101.csv (overwriting)
$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.350,0.241,333.268
==> 01_file1_202311301101.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:01:00",289234,1.036,1.020,344.197
"2023-11-30 11:01:00",666666,0.778,0.889,18.000
==> 01_file1_202311301102.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:02:00",289235,0.776,0.725,17.754
==> 01_file1_202311301103.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:03:00",289236,0.769,0.710,359.070