How do I filter a large CSV file for the highest humidity of each day?

I have millions of rows of data, so I can't do this in a spreadsheet. As you can see, some rows contain no measurements, and those rows should be skipped. The data is in .CSV format. Any Unix tool is welcome, perhaps AWK, given the delimiter.

Something like:

awk -F " " '{ print $1 $2 $3 $4}' filename.csv

But how do I keep only the single highest-humidity reading per day? Here is one option, but it doesn't work:

awk -F, 'NR==1 {next} {date=$1; humidity=$2; if (!max[date] || humidity > max[date]) {max[date]=humidity; line[date]=$0}} END {for (date in line) print line[date]}' filename.csv

Data (from an Airthings View Plus device):

recorded HUMIDITY % TEMP °C
2023-08-14 23:55:48 97.01 14.65
2023-08-14 23:55:50

2023-08-14 23:58:14 97.07 14.63
2023-08-14 23:58:16

2023-08-15 0:00:46 97.02 14.62
2023-08-15 0:00:48

2023-08-15 0:03:14 97.06 14.62
2023-08-15 0:03:16

Head of the real data:

recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56,97.64,14.62
2023-08-14 16:55:58,,
2023-08-14 16:58:13,97.41,14.76
2023-08-14 16:58:15,,
2023-08-14 17:00:45,97.04,14.88
2023-08-14 17:00:47,,
2023-08-15 17:03:12,96.78,15.26
2023-08-15 17:03:14,,
2023-08-15 17:05:45,96.59,14.89

Desired output:

recorded HUMIDITY % TEMP °C
2023-08-14 23:58:14 97.07 14.63
2023-08-15 0:03:14 97.06 14.62

Operating system: Debian 12 Linux

Answer 1

Note: having now tried these against the OP's full data set (the Google Docs file EX.3.feb.2024 - hum.t.p.day.csv), this csvsql approach, while aesthetically appealing, turns out to be painfully slow (roughly 600 times slower than the Miller solution).

Given your example:

$ cat input.csv
recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56,97.64,14.62
2023-08-14 16:55:58,,
2023-08-14 16:58:13,97.41,14.76
2023-08-14 16:58:15,,
2023-08-14 17:00:45,97.04,14.88
2023-08-14 17:00:47,,
2023-08-15 17:03:12,96.78,15.26
2023-08-15 17:03:14,,
2023-08-15 17:05:45,96.59,14.89

then, using csvsql from the Python-based csvkit:

$ csvsql --query '
    SELECT recorded,MAX("HUMIDITY %") AS "HUMIDITY %","TEMP °C" FROM input
    WHERE "HUMIDITY %" NOT NULL GROUP BY DATE(recorded)
' input.csv
recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56.000000,97.64,14.62
2023-08-15 17:03:12.000000,96.78,15.26

For tabular output, pipe the result through csvformat -T.
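
A sketch of that pipeline, assuming the same input.csv as above (csvformat -T simply re-emits the same result tab-separated):

$ csvsql --query '
    SELECT recorded,MAX("HUMIDITY %") AS "HUMIDITY %","TEMP °C" FROM input
    WHERE "HUMIDITY %" NOT NULL GROUP BY DATE(recorded)
' input.csv | csvformat -T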


Or, using a decorate - sort - undecorate approach with Miller (mlr):

$ mlr --csvlite filter '!is_empty(${HUMIDITY %})' then \
    put '$day = splitnvx($recorded, " ")[1]' then \
    top -a -F -f 'HUMIDITY %' -g day then \
    cut -x -f day input.csv
recorded,HUMIDITY %,TEMP °C
2023-08-14 16:55:56,97.64,14.62
2023-08-15 17:03:12,96.78,15.26

If you want tabular output in this case too, change --csvlite to --icsvlite --opprint.
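
Spelled out, that variant would look like this (identical to the pipeline above apart from the I/O flags), printing the same two rows as an aligned table rather than as CSV:

$ mlr --icsvlite --opprint filter '!is_empty(${HUMIDITY %})' then \
    put '$day = splitnvx($recorded, " ")[1]' then \
    top -a -F -f 'HUMIDITY %' -g day then \
    cut -x -f day input.csv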

Answer 2

AWK

$ awk -F',' 'NR>1{arr[$1]=arr[$1]>$2?arr[$1]:$2} END{for (i in arr) print i "," arr[i]}' ~/test.txt

Output

2023-08-14,97.07
2023-08-15,99.06

Input

$ cat ~/test.txt
recorded,HUMIDITY,TEMPC
2023-08-14,97.01,14.65
2023-08-14,90.01,19.65
2023-08-14,97.07,14.63
2023-08-15,97.02,14.60
2023-08-15,97.06,14.68
2023-08-15,99.06,14.62
2023-08-15,59.06,10.62
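
The same idea can be adapted to the comma-separated data from the question, where the first field is a full timestamp and some humidity cells are empty. A minimal sketch (untested against the full file; the order of days in the output is not guaranteed, so pipe through sort if that matters):

$ awk -F',' 'NR>1 && $2!="" {
      split($1, d, " "); day=d[1]                  # date part of the timestamp
      if (!(day in max) || $2+0 > max[day]+0) {    # compare humidity numerically
          max[day]=$2; line[day]=$0                # remember the whole winning row
      }
  }
  END {for (day in line) print line[day]}' filename.csv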

Answer 3

You can use the sort command. Starting from your sample data:

$ cat infile
2023-08-14 23:55:48 97.01 14.65
2023-08-14 23:55:50

2023-08-14 23:58:14 97.07 14.63
2023-08-14 23:58:16

2023-08-15 0:00:46 97.02 14.62
2023-08-15 0:00:48

2023-08-15 0:03:14 97.06 14.62
2023-08-15 0:03:16

First strip the blank lines and the short lines that lack humidity and temperature data:

$ awk 'NF>2' infile

Then pipe the result to the sort command:

$ awk 'NF>2' infile | sort -nr -t "," -k1,3
2023-08-15 0:03:14 97.06 14.62
2023-08-15 0:00:46 97.02 14.62
2023-08-14 23:58:14 97.07 14.63
2023-08-14 23:55:48 97.01 14.65
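
To reduce this to a single line per day (the one with the highest humidity), one option is to sort by date and then by humidity in descending numeric order, and keep only the first line seen for each date. A sketch on the same space-separated sample, assuming default whitespace field splitting:

$ awk 'NF>2' infile | sort -k1,1 -k3,3nr | awk '!seen[$1]++'
2023-08-14 23:58:14 97.07 14.63
2023-08-15 0:03:14 97.06 14.62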

Answer 4

Using Raku (formerly known as Perl_6)

~$ raku -MCSV::Parser -e 'my $file_handle = open $*ARGFILES, :r;     
                          my $parser = CSV::Parser.new(:$file_handle, :contains_header_row);  
                          my %accum;                                 
                          until $file_handle.eof {                   
                              my %ln-in = $parser.get_line();        
                              %accum.append: %ln-in{"recorded"}.words[0] => %ln-in{"HUMIDITY %"};  
                          };                                         
                          $file_handle.close;                        
                          put .key => .value.max for %accum.sort;'   file

This answer is written in Raku (a programming language of the Perl family). Raku's CSV::Parser module is loaded on the command line, and the input is read line by line.

First a $file_handle is declared, then a $parser. Finally, a %accum hash is used to collect the data. Within the central until loop, get_line reads lines until eof. For each line, the temporary %ln-in hash is destructured into a key/value pair, with the date portion of the recorded field as the key and the humidity as the value. That key/value pair is appended to the %accum hash.

In the final statement, only the .max of each .value stored in the %accum hash is output with put.

Sample output:

2023-08-14  97.64
2023-08-15  97.24
2023-08-16  97.33
2023-08-17  97.91
2023-08-18  97.83
2023-08-19  97.74
2023-08-20  97.85
2023-08-21  98.27
2023-08-22  99.79
2023-08-23  99.74
2023-08-24  99.24
2023-08-25  98.82
2023-08-26  99.34
2023-08-27  99.99
2023-08-28  99.95
2023-08-29  96.76
2023-08-30  99.93
2023-08-31  99.99
2023-09-01  99.99
2023-09-02  97.43
2023-09-03  97.85
2023-09-04  98.2
2023-09-05  98.42
2023-09-06  98.37
2023-09-07  97.78
2023-09-08  98.4
2023-09-09  98.68
2023-09-10  98.86
2023-09-11  99.06
2023-09-12  99.16
2023-09-13  99.26
2023-09-14  99.29
2023-09-15  98.66
2023-09-16  98.86
2023-09-17  98.95
2023-09-18  97.05
2023-09-19  98.59
2023-09-20  98.79
2023-09-21  99.08
2023-09-22  99.19
2023-09-23  99.34
2023-09-24  99.04
2023-09-25  98.83
2023-09-26  98.03
2023-09-30  97.06
2023-10-01  96.42
2023-10-04  95.12
2023-10-05  95.54
2023-10-10  89.17
2023-10-12  92.46
2023-10-28  61.28
2023-10-29  63.81
2023-10-30  72.65
2023-10-31  83.58
2023-11-01  80.13
2023-11-02  73.02
2023-11-03  84.29
2023-11-04  88.51
2023-11-05  88.78
2023-11-06  88.92
2023-11-07  88.42
2023-11-08  84.67
2023-11-09  82.41
2023-11-10  88.76
2023-11-11  88.49
2023-11-12  81.66
2023-11-13  77.46
2023-11-14  75.9
2023-11-15  68.31
2023-11-16  64.05
2023-11-17  58.46
2023-11-18  70.53
2023-11-19  74.63
2023-11-20  80.45
2023-11-21  68.86
2023-11-22  74.33
2023-11-23  78.19
2023-11-24  76.35
2023-11-25  53.12
2023-11-26  55.34
2023-11-27  53.97
2023-11-28  73.23
2023-11-29  73.78
2023-11-30  68.54
2023-12-01  65.52
2023-12-02  60.14
2023-12-03  57.08
2023-12-04  56.14
2023-12-05  58.63
2023-12-06  69.79
2023-12-07  71.09
2023-12-08  67.39
2023-12-09  64.75
2023-12-10  75.76
2023-12-11  74.33
2023-12-12  71.79
2023-12-13  68.64
2023-12-14  66.75
2023-12-15  67.56
2023-12-16  77.84
2023-12-17  78.79
2023-12-18  76.41
2023-12-19  79.71
2023-12-20  73.87
2023-12-26  70.58
2023-12-27  70.97
2023-12-28  73.21
2023-12-29  74.73
2023-12-30  73.81
2023-12-31  63.24
2024-01-01  53.16
2024-01-02  54.76
2024-01-03  55.39
2024-01-04  57.09
2024-01-05  54.78
2024-01-06  55.84
2024-01-07  49.51
2024-01-08  65.82
2024-01-09  76.71
2024-01-10  77.09
2024-01-11  77.02
2024-01-12  71.6
2024-01-13  69.29
2024-01-14  69.79
2024-01-15  69.02
2024-01-16  58.51
2024-01-17  64.79
2024-01-18  70.11
2024-01-19  69.38
2024-01-20  52.53
2024-01-21  67.66
2024-01-22  78.08
2024-01-23  79.02
2024-01-24  78.58
2024-01-25  78.84
2024-01-26  70.76
2024-01-27  74.59
2024-01-28  79.74
2024-01-29  79.76
2024-01-30  78.77
2024-01-31  80.31
2024-02-01  80.34
2024-02-02  74.42

The code above accumulates multiple values per key and only takes the max at the end, which could be a problem for truly huge files. As a more memory-efficient alternative, the Raku code below appends each incoming line's key/value pair and then immediately checks whether that key's value holds more than one element; if so, the values are reduced to a single max value. The code below gives the same sample output as above:

~$ raku -MCSV::Parser -e 'my $file_handle = open $*ARGFILES, :r;
                          my $parser = CSV::Parser.new( :$file_handle, :contains_header_row); 
                          my %accum; 
                          until $file_handle.eof() { given $parser.get_line() -> %ln-in {  
                              my $ln-key = %ln-in{"recorded"}.words[0]; my $ln-value = %ln-in{"HUMIDITY %"};
                              %accum.append: $ln-key => $ln-value;
                              %accum{$ln-key} = %accum{$ln-key}.max if %accum{$ln-key}.elems > 1 }
                         }; 
                         $file_handle.close; 
                         put .key => .value for %accum.sort;'  file

https://github.com/tony-o/perl6-csv-parser
https://raku.org
