我有一个 CSV 文件,其中包含这样的数据(几乎 100000 行)。我需要计算具有特定日期的特定单词在一定日期范围内出现在文件中的次数。例如,从 15/03/2020 到 16/04/2020 我需要计算“Sinaloa”一词出现的次数,这意味着计算 15/03/2020、16/03/2020、17/03 的所有“Sinaloa” /2020, ... , 16/04/2020
我尝试使用 grep 但它只计算第一个和最后一个日期。
编辑:我们以“墨西哥城”和一个日期为例,例如 25/04/2020。我需要统计从 15/03/2020 到 25/04/2020 的所有“Ciudad de Mexico”。在本例中,期望的输出为 5。但问题是,本例中的最终日期是用户的输入,因此如果更改为 01/05/2020,则输出应为 6。
"167386","Baja California","F",54,"01/04/2020","confirmado"
"0d01b5","Sinaloa","F",60,"13/04/2020","confirmado"
"1beec8","Ciudad de México","M",47,"16/04/2020","confirmado"
"15fcd7","Ciudad de México","M",46,"16/04/2020","confirmado"
"0a5675","Sinaloa","F",34,"19/05/2020","confirmado"
"0e9e95","Ciudad de México","F",31,"25/04/2020","confirmado"
"07fa63","Ciudad de México","M",37,"01/05/2020","confirmado"
"0693ef","Ciudad de México","F",48,"20/03/2020","confirmado"
"19afc8","Baja California","F",45,"06/04/2020","confirmado"
"093740","Baja California","M",81,"19/04/2020","confirmado"
"1b3c74","México","M",57,"16/04/2020","confirmado"
"025cb1","Baja California","M",51,"29/04/2020","confirmado"
"15764f","México","M",73,"05/05/2020","confirmado"
"07c084","Tabasco","F",52,"23/04/2020","confirmado"
"1b9e29","Ciudad de México","F",47,"11/04/2020","confirmado"
答案1
有一个很好的特定 CSV 解析器,比awk解析真实的 CSV,csvkit
您有 0 场比赛:
输入文件
添加了所需的标头:
"c1", "c2", "c3","c4", "c5", "c6"
"167386","Baja California","F",54,"01/04/2020","confirmado"
"0d01b5","Sinaloa","F",60,"13/04/2020","confirmado"
"1beec8","Ciudad de México","M",47,"16/04/2020","confirmado"
"15fcd7","Ciudad de México","M",46,"16/04/2020","confirmado"
"0a5675","Sinaloa","F",34,"19/05/2020","confirmado"
"0e9e95","Ciudad de México","F",31,"25/04/2020","confirmado"
"07fa63","Ciudad de México","M",37,"01/05/2020","confirmado"
"0693ef","Ciudad de México","F",48,"20/03/2020","confirmado"
"19afc8","Baja California","F",45,"06/04/2020","confirmado"
"093740","Baja California","M",81,"19/04/2020","confirmado"
"1b3c74","México","M",57,"16/04/2020","confirmado"
"025cb1","Baja California","M",51,"29/04/2020","confirmado"
"15764f","México","M",73,"05/05/2020","confirmado"
"07c084","Tabasco","F",52,"23/04/2020","confirmado"
"1b9e29","Ciudad de México","F",47,"11/04/2020","confirmado"
命令:
$ csvgrep -c2 -r '^Sinaloa$' file | grep -c '1[567]/04/2020'
输出
0
答案2
您可以迭代日期并 grep 模式
d="2020-03-15"
until [[ "$d" == "2020-04-26" ]]; do
pattern=$(date -d "$d" +%d/%m/%Y)
d=$(date -I -d "$d + 1 day")
grep "Ciudad de México.*$pattern" file.csv
done
输出
"0693ef","Ciudad de México","F",48,"20/03/2020","confirmado"
"1b9e29","Ciudad de México","F",47,"11/04/2020","confirmado"
"1beec8","Ciudad de México","M",47,"16/04/2020","confirmado"
"15fcd7","Ciudad de México","M",46,"16/04/2020","confirmado"
"0e9e95","Ciudad de México","F",31,"25/04/2020","confirmado"
答案3
将变量加载到awk
awk -F\" -v city="Ciudad de México" -v from="01/04/2020" -v to="16/04/2020" '
BEGIN{split(to,t,"/"); split(from,f,"/");
to=mktime(t[3]" "t[2]" "t[1]" 0 0 0"); from=mktime(f[3]" "f[2]" "f[1]" 0 0 0")}
$4~city{split($8,th,"/"); this=mktime(th[3]" "th[2]" "th[1]" 0 0 0");
c+=(this<=to&&this>=from)}END{print city, c}' file
Ciudad de México 3
演练
拆分"
并加载条件
awk -F\" -v city="Ciudad de México" -v from="01/04/2020" -v to="16/04/2020" '
从 datespec 变量中获取日期元素
BEGIN{split(to,t,"/"); split(from,f,"/");
将它们转换为时间戳
to=mktime(t[3]" "t[2]" "t[1]" 0 0 0"); from=mktime(f[3]" "f[2]" "f[1]" 0 0 0")}
循环遍历主文件检查city
in$4
并获取时间戳($8
如果存在匹配)
$4~city{split($8,th,"/"); this=mktime(th[3]" "th[2]" "th[1]" 0 0 0");
c
如果时间戳在范围内,则增加计数器
c+=(this<=to&&this>=from)}
....
END{print city, c}' file