将具有相同标题的多个 CSV 文件合并为多个组文件

将具有相同标题的多个 CSV 文件合并为多个组文件

您好,我正在寻找最快的解决方案来处理 csv 文件负载。

情况: 我在一个文件夹中有多个 csv 文件不同的标头

我已经对它们进行了预处理,以删除顶部的垃圾线,因此所有的都有标准标头

我想合并 CSV 文件组完全相同的听众到一个新建文件夹

Single Folder:
    Tree 
    ├── 161598827330618_data_aa.csv 
    ├── ..............  
    ├── ............... 
    ├── ................ 
    ├── 161598852706227_data_bh.csv 
Note: Filenames are Random with no pattern*

示例文件-1.csv

School Name,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode
George HS - QUEENS,New,76850000,CAP,Q298,50-51 98TH STREET,Queens,11368 
MARBLE HILL INTERNATIONAL HS -,EXT MASONRY/FLOOD/PARAPETS/ROOFS,10490000,CIP,X475,99 TERRACE VIEW AVENUE,Bronx,10463
NEW DORP HS - STATEN ISLAND,PARTIAL ACCESSIBILITY,488000,CIP,R435,465 NEW DORP LANE,Staten Island,10306

示例文件-2.csv

School Name,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode
EAST SIDE COMMUNITY SCHOOL,FIFTH FLOOR CEILING REPLACEMENT,150000,CIP,M060,420 EAST 12 STREET,Manhattan,10009
RICHMOND HILL HS - QUEENS,STEEL DETERIORATED COLUMS & COLUMN,1064400,CIP,Q475,89-30 114 STREET,Queens,11418
SUCCESS ACADEMY CHARTER SCHOOL,INTERIOR STAIRS,2045000,CIP,M099,410 EAST 100 STREET,Manhattan,10029

示例文件-3.csv

Reporting Period,Project Number,City,County,Zip Code,Sector,Solicitation,Electric Utility
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority

示例文件-4.csv

Reporting Period,Project Number,City,County,Zip Code,Sector,Solicitation,Electric Utility
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority

示例文件-5.csv

OBJECTID,Borough,PSSite,ParkName,ParkZone,PSStatus,GlobalID,CreatedDate,UpdatedDate
283721,Brooklyn,Street,,,Populated,C90AAD08-D99E-4759-A64C-219D6143BFB3,07-08-15 13:10,12/20/2019 04:34:58 PM
7669836,Queens,Park,Astoria Park,Q004-ZN02,Empty,AB55A658-8276-4734-A698-5FFCAE96578E,08/13/2020 01:18:00 PM,08/20/2020 06:15:32 PM
7123408,Brooklyn,Park,Asser Levy Park,,Populated,B32D93C9-5958-4129-A87A-FA7C9A5A4E87,01-09-20 13:15,01-09-20 13:17

示例文件-6.csv

OBJECTID,Borough,PSSite,ParkName,ParkZone,PSStatus,GlobalID,CreatedDate,UpdatedDate
6036681,Manhattan,Park,Riverside Park,,Populated,6A3E747D-CD5E-43EB-9789-67DB2064E878,04-11-18 11:11,08-06-20 21:21
7170578,Bronx,Park,Garden Of Eden,,Populated,B1E8B660-4B65-437F-B61F-06B1B71A4E1C,01/28/2020 03:18:00 PM,01/28/2020 03:19:26 PM
740416,Bronx,Park,Mullaly Park,X034-ZN02,Populated,E8F51E3B-CC6F-46A3-AF17-02B6BE8DCC57,08/26/2015 04:34:00 PM,01/30/2020 04:10:41 PM
5004669,Queens,Street,,,Populated,20157769-88EC-4867-9F50-852EF4814BF0,11-02-16 16:56,08-03-20 13:12:00 AM

示例文件-7.csv

Indicator,Group,State,Subgroup,Phase,Time Period,Time Period Label,Value,Low CI,High CI,Confidence Interval
Private Health Insurance Coverage,National Estimate,United States,United States,1,1,Apr 23 - May 5,75.4,74.7,76.2,74.7 - 76.2
Public Health Insurance Coverage,By Age,United States,18 - 24 years,1,1,Apr 23 - May 5,19.5,15.4,24.3,15.4 - 24.3
Uninsured at the Time of Interview,By Gender,United States,Female,1,1,Apr 23 - May 5,11,10.3,11.7,10.3 - 11.7

示例文件-8.csv

Year, dtmSurveyDate, ColonyID, strAOUCode, Type, strPhotoInterpreters, strColonyName, strCounty, strState, strCountry
2014,03-Jun-14,219-001,COMU,Image Check - No Birds,Kirsten Bixler,"""Tillamook Head Rocks"" (Eastern Rocks)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-002,COMU,Image Check - No Birds,Kirsten Bixler,"""Tillamook Head Rocks"" (Northern Rock)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-003,COMU,Shapefile-RawCount,Kirsten Bixler,"""Tillamook Head Rocks"" (Southwestern Rocks)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-005,COMU,Shapefile,Shawn W. Stephensen,Tillamook Rock,Clatsop County,Oregon,United States

期望的结果:

示例文件-1.csv }
示例文件-2.csv } 标头检查 > Same Header found> 合并 >GROUP-1.csv

示例文件-3.csv }
示例文件-4.csv } 标头检查 > Same Header found> 合并 >GROUP-2.csv

示例文件-5.csv }
示例文件-6.csv } 标头检查 > Same Header found> 合并 >GROUP-3.csv

示例文件-7.csv } 标头检查 > No Similar Headers > 无>GROUP-5.csv
示例文件-8.csv } 标头检查 > No Similar Headers > 无>GROUP-6.csv

首选解决方案:Bash 脚本和 Linux 命令尝试过的解决方案:

#!/bin/bash
awk '
  FNR==1{
    if (!($0 in h)||file!=h[$0]){close(file)}
    if (!($0 in h)){file=h[$0]=i++}
    else{file=h[$0];next}
  }
  {print >> (file)}
' ./*.csv

https://unix.stackexchange.com/a/602291/459978 上述方法有效,但我不确定它是否可以处理数千个文件进行处理和分类。我需要在不同的文件夹中创建 Group*.csv 格式。

最短完成时间很重要 https://stackoverflow.com/a/51921621/3088275

寻找带有 Awk 或 Sed 或 Linux 命令的 bash 脚本的操作代码,哪个是最快实现所需输出的

答案1

是否保证仅在连续文件中找到重复的标头?有了这个答案,我想处理如果情况并非如此:

#!/bin/bash

# Declare header_list[] to be an associative array
declare -A header_list

# Read the first line from every *.csv file in $1
# Each filename is added to the appropriate entry in header_list[]
for f in "${1:?}"/*.csv; do
  echo "### Reading header from $f"
  header_list[$(head -1 "$f")]+="${IFS}$f"
done

# Handle the list of files for each entry in header_list[]
group_id=1
for key in "${!header_list[@]}"; do
  value="${header_list[$key]}"
  groupfile="${2:?}/GROUP-${group_id}.csv"
  echo "### Header: ${key}"
  echo "### Group File: ${groupfile}"
  
  # Echo the header as the first line of $groupfile
  echo "${key}" > "${groupfile}"
  
  # Skip the first line, but echo every other line from each file with this header
  for file in ${value}; do
  echo "# File: ${file}"
    tail --lines=+2 "$file" >> "${groupfile}"
  done
  
  # Increment group_id
  (( group_id++ ))
done

保存到文件,然后使用两个参数运行:包含源文件的目录和输出目录。

一些注意事项:

  • 输出目录必须存在。
  • 带有 $IFS 中字符的文件名将无法正确处理。

相关内容