I have a dataset in CSV format (6000 rows) that looks like this:
Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,1643690.9,0,42.31,2.572,211.0963582,8.106
1,12-02-2010,1641957.44,1,38.51,2.548,211.2421698,8.106
---
---
I want to split the Date column to create a new column that contains only the year. Something like this:
Store,Date,Year,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,2010,1643690.9,0,42.31,2.572,211.0963582,8.106
1,12-02-2010,2010,1641957.44,1,38.51,2.548,211.2421698,8.106
I tried to use this code, but I don't know how to adapt it to my problem: How to extract the year...
Can anyone give me some advice?
Answer 1
I would use a proper CSV parser, such as csvkit:
csvsql --query "select Store,Date,strftime('%Y', Date) as Year,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment from data" data.csv
Output (csvkit's type inference parses the Date column and prints it as YYYY-MM-DD):
Store,Date,Year,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,2010-05-02,2010,1643690.9,0,42.31,2.572,211.0963582,8.106
1,2010-12-02,2010,1641957.44,1,38.51,2.548,211.2421698,8.106
But if you want to do this with awk, you can do it like this:
awk '
BEGIN{FS=OFS=","}
NR==1{$2="Date,Year"}
NR>1 {sub(/[0-9]{4}/, "&,&", $2)}
1' data.csv
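As a hedged variation on the awk command above: the `{4}` interval expression may not be supported by some older awk implementations (older mawk releases, for example), so the sketch below takes the last `-`-separated piece of the date with split() instead. The function name add_year and the temporary sample file are assumptions made for illustration and are not part of the original answer; the sample rows are the ones from the question.

```shell
# Sample input: the header and the two rows shown in the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,1643690.9,0,42.31,2.572,211.0963582,8.106
1,12-02-2010,1641957.44,1,38.51,2.548,211.2421698,8.106
EOF

# add_year (hypothetical helper name): insert a Year column after Date.
add_year() {
  awk '
    BEGIN { FS = OFS = "," }
    # Header row: rename field 2 so that it expands into two columns.
    NR == 1 { $2 = "Date,Year" }
    # Data rows: split the date on "-" and append its last piece (the year).
    NR > 1  { n = split($2, d, "-"); $2 = $2 OFS d[n] }
    # Print every (possibly modified) record.
    1' "$1"
}

result=$(add_year "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Because only the last piece of the split is used, this works as long as the year is the final `-`-separated component of the Date field.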
Answer 2
awk 'BEGIN{FS=OFS=","}NR==1{gsub("Date","Date,year",$0)} NR>1 {$2=$2","substr($2,7,4)}1' filename
Output:
Store,Date,year,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,2010,1643690.9,0,42.31,2.572,211.0963582,8.106
1,12-02-2010,2010,1641957.44,1,38.51,2.548,211.2421698,8.106
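The substr($2,7,4) call above relies on every Date value having exactly the DD-MM-YYYY shape. With 6000 rows it may be worth checking that assumption first. The following is only a sketch: the function name check_dates is hypothetical, and the two malformed rows in the sample are made up for illustration (they do not come from the question's data).

```shell
# Sample input: one well-formed row from the question plus two
# deliberately malformed, made-up rows.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Store,Date,Weekly_Sales
1,05-02-2010,1643690.9
1,5-2-2010,0
1,2010-02-19,0
EOF

# check_dates (hypothetical helper name): report every data row whose
# Date field is not two digits, two digits, four digits joined by "-".
check_dates() {
  awk -F, '
    NR > 1 {
      n = split($2, d, "-")
      if (n != 3 || length(d[1]) != 2 || length(d[2]) != 2 ||
          length(d[3]) != 4 || $2 ~ /[^0-9-]/)
        printf "line %d: %s\n", NR, $2
    }' "$1"
}

bad=$(check_dates "$tmp")
printf '%s\n' "$bad"
rm -f "$tmp"
```

If the check prints nothing for the real file, the fixed-position substr approach should be safe to use on it.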