I got this script from a related question of mine, "How to insert the file name and a header at the beginning of a CSV":
find . -name '*.csv' -printf "%f\n" |
sed 's/.csv$//' |
xargs -I{} sed -i '1s/^/customer|/ '$'\n'' 1!s/^/{}|/' {}.csv;
Currently this takes quite a long time for a large number of files. I scaled it to 50,000 files and got this result:
real 1m41.251s
user 0m59.326s
sys 0m38.681s
For 100,000 files, I got this:
real 3m18.466s
user 1m58.451s
sys 1m16.550s
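du -sh reports 485M for the 100,000 files. I want to scale this data to 10-20 GB.
I would like to know whether there is any way to speed up the script above. I am open to using any tool that makes it faster.
In case it helps, I am using Ubuntu 18.04.02 LTS with 16 GB of RAM.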
time awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' *.csv
real 0m20.253s
user 0m3.336s
sys 0m14.854s
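It is much faster than the initial sed version. I don't understand why. If someone could explain it, that would be very helpful.
When I scaled it to one million files, the script above failed with "Argument list too long".
I tried the following, but it is very slow: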
find . -name \*.csv -exec awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' {} \;
Even when I run it in batches, it seems slow for 100,000 files:
time find . -name "10*.csv" -exec awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' {} \;
real 9m29.474s
user 2m3.336s
sys 6m37.822s
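I tried a plain for loop with Ed's answer, but it seems to run at about the same speed as generating the original files, roughly 40 minutes for one million records: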
for file in *.csv; do
echo "$file"
awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' "$file"
done
I tried batching it with ls and xargs, 100,000 files at a time, and the timings look reasonable, in line with Ed's initial solution:
time ls 11*.csv | xargs awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}'
real 0m23.619s
user 0m3.537s
sys 0m15.272s
time ls 12*.csv | xargs awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}'
real 0m25.044s
user 0m3.892s
sys 0m16.261s
time ls 13*.csv | xargs awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}'
real 0m24.997s
user 0m4.035s
sys 0m16.757s
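My plan now is to use the solution above, batching with a for loop. Assuming an average of 25 seconds per batch, ten batches come to 25*10 = 250 seconds, or about 4 minutes. I think that is fast for a million records.
If anyone has a better solution, please let me know. If any of the code above is wrong or bad practice, please tell me. I am still a beginner and may have copied or understood something incorrectly.
Answer 1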
$ awk -v OFS=',' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' 10000000.csv
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,[email protected],555-555-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
So with any awk you can do:
for file in *.csv; do
awk 'script' "$file" > tmp && mv tmp "$file"
done
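As a concrete illustration of that loop, here is a minimal sketch with the script from the example above filled in. The scratch directory, the sample rows, and the "$file.tmp" name are assumptions made up for the demo. The temporary file sits next to the original so that mv is a rename within one filesystem.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample files, for illustration only.
demo=$(mktemp -d) && cd "$demo" || exit 1
printf 'first_name,last_name\nChae,Cummings\n' > 10000000.csv
printf 'first_name,last_name\nFleta,Hurley\nBennett,George\n' > 10000001.csv

for file in *.csv; do
    tmp=$file.tmp   # assumed name; "*.csv" does not match it
    # Write the prefixed copy first; replace the original only on success.
    awk -v OFS=',' '
        FNR==1 { cust = FILENAME; sub(/\.csv$/, "", cust) }
        { print (FNR>1 ? cust : "customer"), $0 }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
done

head -n 2 10000000.csv
```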
Or, with GNU awk for "inplace" editing:
$ tail -n +1 10000000.csv 10000001.csv
==> 10000000.csv <==
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Chae,Jesusa,Cummings,Female,[email protected],555-555-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
==> 10000001.csv <==
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Fleta,Rosette,Hurley,Other,[email protected],1-555-555-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
Bennett,Dennis,George,Male,[email protected],(555) 555-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
Tommye,Pamula,Diaz,Other,[email protected],555.555.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
Albert,Jerrold,Cohen,Other,[email protected],+1-(555)-555-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
Louetta,Collene,Best,Fluid,[email protected],1-555-555-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
Kandace,Wesley,Diaz,Female,[email protected],+1-(555)-555-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
$ awk -i inplace -v OFS=',' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' 10000000.csv 10000001.csv
$ tail -n +1 10000000.csv 10000001.csv
==> 10000000.csv <==
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,[email protected],555-555-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
==> 10000001.csv <==
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000001,Fleta,Rosette,Hurley,Other,[email protected],1-555-555-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
10000001,Bennett,Dennis,George,Male,[email protected],(555) 555-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
10000001,Tommye,Pamula,Diaz,Other,[email protected],555.555.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
10000001,Albert,Jerrold,Cohen,Other,[email protected],+1-(555)-555-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
10000001,Louetta,Collene,Best,Fluid,[email protected],1-555-555-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
10000001,Kandace,Wesley,Diaz,Female,[email protected],+1-(555)-555-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
If you have too many files to pass on the command line, and running it through xargs is too slow, here is another option:
awk -i inplace ... '
BEGIN {
while ( (getline line < ARGV[1]) > 0 ) {
if ( line ~ /\.csv$/ ) {
ARGV[ARGC] = line
ARGC++
}
}
ARGV[1] = ""
}
{ the "real" script }
' <(ls)
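The above reads the output of ls as an input file rather than as arguments, populates the ARGV array with the file names that end in .csv, and then operates on the files as if they had been passed as arguments on the command line.
Answer 2
You can try these two approaches. The first uses Perl: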
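Another way to stay under the "Argument list too long" limit, sketched here as an assumption rather than something from the original answer, is to let the shell's builtin printf emit the names NUL-separated and have xargs -0 split them into batches. This also keeps file names with spaces intact, which plain ls | xargs does not. The scratch directory and sample rows are hypothetical, and the sketch calls gawk explicitly because -i inplace is a GNU awk feature.

```shell
#!/bin/bash
# "-i inplace" needs GNU awk 4.1 or later; skip the demo when gawk is missing.
command -v gawk >/dev/null 2>&1 || { echo "gawk not found, skipping" >&2; exit 0; }

# Hypothetical scratch directory and sample files, for illustration only.
demo=$(mktemp -d) && cd "$demo" || exit 1
for id in 10000000 10000001 10000002; do
    printf 'first_name|last_name\nChae|Cummings\n' > "$id.csv"
done

# printf is a shell builtin, so the expanded glob is never passed to execve()
# and cannot trigger "Argument list too long". xargs -0 then starts gawk as
# few times as the limit allows, each time with many file names.
printf '%s\0' *.csv |
    xargs -0 gawk -i inplace -v OFS='|' '
        FNR==1 { cust = FILENAME; sub(/\.csv$/, "", cust) }
        { print (FNR>1 ? cust : "customer"), $0 }
    '

head -n 2 10000002.csv
```

Because the names come from a glob in the current directory, FILENAME has no "./" prefix, so the customer column contains only the bare file name.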
$ find . -name \*.csv -type f ! -empty -exec \
perl -spe 's/^/,/;
$F //= $ARGV =~ s/\.csv$//r;
s/^/$. == 1 ? "\n$C" : $F/e;
undef $F, close ARGV if eof;
' -- -C="Customer" {} +
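The second makes use of GNU sed features, specifically the F command to get the file name and the -s option to treat multiple files separately rather than as a single stream: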
$ find . -name \*.csv -type f ! -empty -exec \
sed -se 'F;1s/^/CUSTOMER,/' {} + |
sed -E \
-e 'N;s/.*\.csv(\nCUSTOMER,)/\1/;t' \
-e 's/\.csv\n/,/;s/..//' \
;
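Both commands above end the -exec clause with {} + rather than {} \;. That detail probably matters for speed: with \; find starts the command once per file, which is consistent with the 9-minute timing in the question, while + hands over as many file names per invocation as fit. A small sketch with made-up empty files counts how many processes each form starts:

```shell
#!/bin/sh
# Hypothetical scratch directory and empty sample files, for illustration only.
demo=$(mktemp -d) && cd "$demo" || exit 1
touch a.csv b.csv c.csv

# Each sh invocation prints one line holding the number of files it received,
# so counting the lines counts the processes that find started.
per_file=$(find . -name '*.csv' -type f -exec sh -c 'echo "$#"' sh {} \; | wc -l | tr -d ' ')
batched=$(find . -name '*.csv' -type f -exec sh -c 'echo "$#"' sh {} + | wc -l | tr -d ' ')

printf 'runs with {} \\; : %s\n' "$per_file"
printf 'runs with {} +  : %s\n' "$batched"
```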