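I want to compare two CSV files that have the format shown below. They have no header row. I want to compare them on a specific column (in this case, the second column).
The source CSV files are about 4-5 GB each, so loading them into memory is not an option.
Every row of new.csv whose second-column value has no match in old.csv should be written to out.csv.
In the real data the second column is an HTML link; for simplicity, only a single word is used here.
My question: can the same result be obtained with sed, awk, join, or grep?
old.csv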
"person"|"john"|"smith"
"person"|"anne"|"frank"
"person"|"bob"|"macdonald"
"fruit"|"orange"|"banana"
"fruit"|"strawberry"|"fields"
"fruit"|"ringring"|"banana"
new.csv
"person"|"john"|"smith"
"person"|"anne"|"frank"
"person"|"bob"|"macdonald"
"fruit"|"orange"|"banana"
"fruit"|"strawberry"|"fields"
"glider"|"person"|"airport"
"fruit"|"ringring"|"banana"
"glider"|"person2"|"airport"
diff.py
#!/usr/bin/env python3
"""
Source: https://gist.github.com/davidrleonard/4dbeebf749248a956e44
Usage: $ ./csv-difference.py -d new.csv -s old.csv -o out.csv -c 1
"""
import sys
import argparse
import csv


def main():
    parser = argparse.ArgumentParser(description='Output difference in CSVs.')
    parser.add_argument('-d', '--dataset', help='A CSV file of the full dataset', required=True)
    parser.add_argument('-s', '--subset', help='A CSV file that is a subset of the full dataset', required=True)
    parser.add_argument('-o', '--output', help='The CSV file we should write to (will be overwritten if it exists)', required=True)
    parser.add_argument('-c', '--column', help='A number of the column to be compared (0 is column 1, 1 is column 2, etc.)', required=True, type=int)
    args = parser.parse_args()
    dataset_file = args.dataset
    subset_file = args.subset
    output_file = args.output
    column_num = args.column
    with open(dataset_file, 'r') as datafile, open(subset_file, 'r') as subsetfile, open(output_file, 'w') as outputfile:
        # Both files are read into memory in full, keyed by the chosen column.
        data = {row[column_num]: row for row in csv.reader(datafile, delimiter='|', quotechar='"')}
        subset = {row[column_num]: row for row in csv.reader(subsetfile, delimiter='|', quotechar='"')}
        data_keys = set(data.keys())
        subset_keys = set(subset.keys())
        # Keys present in the full dataset but missing from the subset.
        output_keys = data_keys - subset_keys
        output = [data[key] for key in output_keys]
        output_csv = csv.writer(outputfile, delimiter='|', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in output:
            output_csv.writerow(row)


if __name__ == '__main__':
    main()
    sys.stdout.flush()
which produces out.csv:
"glider"|"person"|"airport"
"glider"|"person2"|"airport"
Answer 1
This is super simple with awk:
$ awk -F'|' 'NR == FNR {old[$2]; next} !($2 in old)' old.csv new.csv
"glider"|"person"|"airport"
"glider"|"person2"|"airport"
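While reading old.csv (the first file, where NR == FNR), it stores the second field of each record as a key in an array named "old". Then, for new.csv, it prints every record whose second field is not a key in the "old" array.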
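For comparison with the script in the question, the same two-pass idea can be sketched in Python. This is only a minimal sketch under stated assumptions: the function name `new_rows_by_column` and its parameters are made up for illustration, rows with too few columns are simply skipped, and the set of keys from old.csv must still fit in memory (just like the awk array), although the full rows never do.

```python
import csv


def new_rows_by_column(old_path, new_path, out_path, column=1, delimiter="|"):
    """Write rows of new_path whose `column` value never occurs in old_path.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    # First pass: keep only the key column of the old file, not whole rows.
    with open(old_path, newline="") as old_file:
        old_keys = {
            row[column]
            for row in csv.reader(old_file, delimiter=delimiter, quotechar='"')
            if len(row) > column  # assumption: skip blank or short rows
        }

    written = 0
    # Second pass: stream the new file and write unmatched rows right away.
    with open(new_path, newline="") as new_file, \
            open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file, delimiter=delimiter, quotechar='"',
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
        for row in csv.reader(new_file, delimiter=delimiter, quotechar='"'):
            if len(row) > column and row[column] not in old_keys:
                writer.writerow(row)
                written += 1
    return written
```

Calling `new_rows_by_column("old.csv", "new.csv", "out.csv")` on the sample files should write the two "glider" rows, in the order they appear in new.csv.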
Admittedly, the awk version does not respect pipe characters that appear inside quoted fields. For that, I like Ruby's csv module:
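ruby -rcsv -rset -e '
  # Collect the second field of every old.csv row in a Set for fast lookups.
  old_col2 = Set.new
  CSV.foreach("./old.csv", :col_sep => "|") do |row|
    old_col2 << row[1]
  end
  # Print each new.csv row whose second field was not seen in old.csv.
  CSV.foreach("./new.csv", :col_sep => "|") do |row|
    unless old_col2.include?(row[1])
      puts CSV.generate_line(row, :col_sep => "|", :force_quotes => true)
    end
  end
'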
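To illustrate why the quoting caveat matters when the second column holds links, here is a small Python sketch. The sample line and its URL are invented for this example. It contrasts a plain split on "|" (roughly what awk -F'|' does) with a CSV-aware parse.

```python
import csv
import io

# Hypothetical record whose second field contains a pipe inside the quotes.
line = '"link"|"http://example.com/?a=1|2"|"x"\n'

# Plain split on the delimiter: the quoted pipe is treated as a separator.
naive_fields = line.rstrip("\n").split("|")

# CSV-aware parse: the quotes protect the embedded pipe.
parsed_fields = next(csv.reader(io.StringIO(line), delimiter="|", quotechar='"'))

print(len(naive_fields), naive_fields[1])    # 4 "http://example.com/?a=1
print(len(parsed_fields), parsed_fields[1])  # 3 http://example.com/?a=1|2
```

With the plain split, the comparison key is a truncated fragment of the link, so two different links could be treated as equal or the wrong field could be compared. If the data never contains a pipe inside a quoted field, the awk one-liner is sufficient.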