我有数百万行的数据,所以我无法通过电子表格执行此操作。您会看到一些数据行没有数据,因此应该跳过这些数据行。数据采用 .CSV 格式。欢迎任何 Unix 工具,也许是 AWK,因为分隔符“”。
就像是
awk -F " " '{ print $1 $2 $3 $4}' filename.csv
但如何获取每天唯一的最高湿度。一种选择但不起作用:
awk -F, 'NR==1 {next} {date=$1; humidity=$2; if (!max[date] || humidity > max[date]) {max[date]=humidity; line[date]=$0}} END {for (date in line) print line[date]}' filename.csv
数据(来自 Airthings View Plus -设备)
recorded HUMIDITY %TEMP °C
2023-08-14 23:55:48 97.01 14.65
2023-08-14 23:55:50
2023-08-14 23:58:14 97.07 14.63
2023-08-14 23:58:16
2023-08-15 0:00:46 97.02 14.62
2023-08-15 0:00:48
2023-08-15 0:03:14 97.06 14.62
2023-08-15 0:03:16
头部真实数据
recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56,97.64,14.62
2023-08-14 16:55:58,,
2023-08-14 16:58:13,97.41,14.76
2023-08-14 16:58:15,,
2023-08-14 17:00:45,97.04,14.88
2023-08-14 17:00:47,,
2023-08-15 17:03:12,96.78,15.26
2023-08-15 17:03:14,,
2023-08-15 17:05:45,96.59,14.89
所需输出
recorded HUMIDITY %TEMP °C
2023-08-14 23:58:14 97.07 14.63
2023-08-15 0:03:14 97.06 14.62
操作系统:Debian 12 Linux
答案1
EX.3.feb.2024 - hum.t.p.day.csv
注意:现在已经用谷歌文档中的完整OP尝试了这些,这种csvsql
方法虽然在美学上很吸引人,但似乎是痛苦地慢(大约比米勒解慢 600 倍)。
鉴于你的例子
$ cat input.csv
recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56,97.64,14.62
2023-08-14 16:55:58,,
2023-08-14 16:58:13,97.41,14.76
2023-08-14 16:58:15,,
2023-08-14 17:00:45,97.04,14.88
2023-08-14 17:00:47,,
2023-08-15 17:03:12,96.78,15.26
2023-08-15 17:03:14,,
2023-08-15 17:05:45,96.59,14.89
然后使用csvsql
基于Python的csvkit:
$ csvsql --query '
SELECT recorded,MAX("HUMIDITY %") AS "HUMIDITY %","TEMP °C" FROM input \
WHERE "HUMIDITY %" NOT NULL GROUP BY DATE(recorded)
' input.csv
recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56.000000,97.64,14.62
2023-08-15 17:03:12.000000,96.78,15.26
对于表格输出,通过管道传输结果csvformat -T
。
或者使用装饰 - 排序 - 取消装饰的方法磨坊主
$ mlr --csvlite filter '!is_empty(${HUMIDITY %})' then \
put '$day = splitnvx($recorded, " ")[1]' then \
top -a -F -f 'HUMIDITY %' -g day then \
cut -x -f day input.csv
recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56,97.64,14.62
2023-08-15 17:03:12,96.78,15.26
如果您希望在这种情况下以表格形式输出,请更改--csvlite
为--icsvlite --opprint
.
答案2
AWK
$ awk -F',' 'NR>1{arr[$1]=arr[$1]>$2?arr[$1]:$2} END{for (i in arr) print i "," arr[i]}' ~/test.txt
输出
2023-08-14,97.07
2023-08-15,99.06
输入
$ cat ~/test.txt
recorded,HUMIDITY,TEMPC
2023-08-14,97.01,14.65
2023-08-14,90.01,19.65
2023-08-14,97.07,14.63
2023-08-15,97.02,14.60
2023-08-15,97.06,14.68
2023-08-15,99.06,14.62
2023-08-15,59.06,10.62
答案3
您可以使用排序命令。从您的示例数据开始:
$ cat infile
2023-08-14 23:55:48 97.01 14.65
2023-08-14 23:55:50
2023-08-14 23:58:14 97.07 14.63
2023-08-14 23:58:16
2023-08-15 0:00:46 97.02 14.62
2023-08-15 0:00:48
2023-08-15 0:03:14 97.06 14.62
2023-08-15 0:03:16
首先去掉没有湿度和温度数据的空行和短行:
$ awk 'NF>2' infile
然后将结果通过管道传递给排序命令:
$ awk 'NF>2' infile | sort -nr -t "," -k1,3
2023-08-15 0:03:14 97.06 14.62
2023-08-15 0:00:46 97.02 14.62
2023-08-14 23:58:14 97.07 14.63
2023-08-14 23:55:48 97.01 14.65
答案4
使用乐(以前称为 Perl_6)
~$ raku -MCSV::Parser -e 'my $file_handle = open $*ARGFILES, :r;
my $parser = CSV::Parser.new(:$file_handle, :contains_header_row);
my %accum;
until $file_handle.eof {
my %ln-in = $parser.get_line();
%accum.append: %ln-in{"recorded"}.words[0] => %ln-in{"HUMIDITY %"};
};
$file_handle.close;
put .key => .value.max for %accum.sort;' file
这是用 Raku(Perl 家族的一种编程语言)编写的答案。 Raku 的CSV::Parser
模块在命令行加载,输入按行读取。
首先$file_handle
声明 a,然后声明 a $parser
。最终%accum
哈希用于捕获数据。在中央until
条件语句中,get_line
读入行直到eof
。对于每一行,临时%ln-in
散列被解构为键/值对,日期记录为key
,湿度记录为value
。该键/值append
对被写%accum
入哈希值。
在最后的声明中,哈希值仅%accum
包含.put
.value.max
示例输出:
2023-08-14 97.64
2023-08-15 97.24
2023-08-16 97.33
2023-08-17 97.91
2023-08-18 97.83
2023-08-19 97.74
2023-08-20 97.85
2023-08-21 98.27
2023-08-22 99.79
2023-08-23 99.74
2023-08-24 99.24
2023-08-25 98.82
2023-08-26 99.34
2023-08-27 99.99
2023-08-28 99.95
2023-08-29 96.76
2023-08-30 99.93
2023-08-31 99.99
2023-09-01 99.99
2023-09-02 97.43
2023-09-03 97.85
2023-09-04 98.2
2023-09-05 98.42
2023-09-06 98.37
2023-09-07 97.78
2023-09-08 98.4
2023-09-09 98.68
2023-09-10 98.86
2023-09-11 99.06
2023-09-12 99.16
2023-09-13 99.26
2023-09-14 99.29
2023-09-15 98.66
2023-09-16 98.86
2023-09-17 98.95
2023-09-18 97.05
2023-09-19 98.59
2023-09-20 98.79
2023-09-21 99.08
2023-09-22 99.19
2023-09-23 99.34
2023-09-24 99.04
2023-09-25 98.83
2023-09-26 98.03
2023-09-30 97.06
2023-10-01 96.42
2023-10-04 95.12
2023-10-05 95.54
2023-10-10 89.17
2023-10-12 92.46
2023-10-28 61.28
2023-10-29 63.81
2023-10-30 72.65
2023-10-31 83.58
2023-11-01 80.13
2023-11-02 73.02
2023-11-03 84.29
2023-11-04 88.51
2023-11-05 88.78
2023-11-06 88.92
2023-11-07 88.42
2023-11-08 84.67
2023-11-09 82.41
2023-11-10 88.76
2023-11-11 88.49
2023-11-12 81.66
2023-11-13 77.46
2023-11-14 75.9
2023-11-15 68.31
2023-11-16 64.05
2023-11-17 58.46
2023-11-18 70.53
2023-11-19 74.63
2023-11-20 80.45
2023-11-21 68.86
2023-11-22 74.33
2023-11-23 78.19
2023-11-24 76.35
2023-11-25 53.12
2023-11-26 55.34
2023-11-27 53.97
2023-11-28 73.23
2023-11-29 73.78
2023-11-30 68.54
2023-12-01 65.52
2023-12-02 60.14
2023-12-03 57.08
2023-12-04 56.14
2023-12-05 58.63
2023-12-06 69.79
2023-12-07 71.09
2023-12-08 67.39
2023-12-09 64.75
2023-12-10 75.76
2023-12-11 74.33
2023-12-12 71.79
2023-12-13 68.64
2023-12-14 66.75
2023-12-15 67.56
2023-12-16 77.84
2023-12-17 78.79
2023-12-18 76.41
2023-12-19 79.71
2023-12-20 73.87
2023-12-26 70.58
2023-12-27 70.97
2023-12-28 73.21
2023-12-29 74.73
2023-12-30 73.81
2023-12-31 63.24
2024-01-01 53.16
2024-01-02 54.76
2024-01-03 55.39
2024-01-04 57.09
2024-01-05 54.78
2024-01-06 55.84
2024-01-07 49.51
2024-01-08 65.82
2024-01-09 76.71
2024-01-10 77.09
2024-01-11 77.02
2024-01-12 71.6
2024-01-13 69.29
2024-01-14 69.79
2024-01-15 69.02
2024-01-16 58.51
2024-01-17 64.79
2024-01-18 70.11
2024-01-19 69.38
2024-01-20 52.53
2024-01-21 67.66
2024-01-22 78.08
2024-01-23 79.02
2024-01-24 78.58
2024-01-25 78.84
2024-01-26 70.76
2024-01-27 74.59
2024-01-28 79.74
2024-01-29 79.76
2024-01-30 78.77
2024-01-31 80.31
2024-02-01 80.34
2024-02-02 74.42
上面的代码累积了每个键的多个值,仅max
在最后输出。对于真正庞大的文件,这可能是一个问题。作为提高内存效率的替代方案,Raku 代码append
会检查传入行的键/值对,然后立即检查value
元素是否 > 1。如果是,则将这些值减少为单个max
值。下面的代码给出与上面相同的示例输出:
~$ raku -MCSV::Parser -e 'my $file_handle = open $*ARGFILES, :r;
my $parser = CSV::Parser.new( :$file_handle, :contains_header_row);
my %accum;
until $file_handle.eof() { given $parser.get_line() -> %ln-in {
my $ln-key = %ln-in{"recorded"}.words[0]; my $ln-value = %ln-in{"HUMIDITY %"};
%accum.append: $ln-key => $ln-value;
%accum{$ln-key} = %accum{$ln-key}.max if %accum{$ln-key}.elems > 1 }
};
$file_handle.close;
put .key => .value for %accum.sort;' file