File a.txt contains about 100k words, one word per line:
july.cpp
windows.exe
ttm.rar
document.zip
File b.txt contains 150k words, one word per line. Some of the words also appear in a.txt, but some are new:
july.cpp
NOVEMBER.txt
windows.exe
ttm.rar
document.zip
diary.txt
How can I merge these files into one, remove all the duplicate lines, and keep the new lines (lines that exist in a.txt but not in b.txt, and vice versa)?
Answer 1
There is a command that does this: comm. As stated in man comm, it is simple:
comm -3 file1 file2
Print lines in file1 not in file2, and vice versa.
Note that comm requires the file contents to be sorted, so you must sort the files before passing them to comm, like this:
sort unsorted-file.txt > sorted-file.txt
To sum up:
sort a.txt > as.txt
sort b.txt > bs.txt
comm -3 as.txt bs.txt > result.txt
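After running the commands above, result.txt will contain the expected lines. Lines unique to bs.txt are printed in the second column, so they are indented by a tab.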
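For readers who prefer a script, the same "in one file but not the other" result can be sketched in Python with a set symmetric difference. This is only an illustration, not what comm does internally: the function name symmetric_difference is made up for this example, the sample lists stand in for the contents of a.txt and b.txt, and the result is sorted by code point, which can order mixed-case lines differently from a locale-aware sort.

```python
def symmetric_difference(lines_a, lines_b):
    """Return the lines that occur in exactly one of the two inputs, sorted."""
    set_a = set(lines_a)
    set_b = set(lines_b)
    # The ^ operator on sets keeps elements that are in one set but not both.
    return sorted(set_a ^ set_b)


if __name__ == "__main__":
    # Sample data taken from the question; real use would read the files instead.
    a = ["july.cpp", "windows.exe", "ttm.rar", "document.zip"]
    b = ["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
         "document.zip", "diary.txt"]
    for line in symmetric_difference(a, b):
        print(line)
```

Because both inputs are held in memory as sets, this needs no pre-sorting step, at the cost of losing the original line order.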
Answer 2
Here is a short python3 script, based on Germar's answer, that should do this while preserving the unsorted order of b.txt.
#!/usr/bin/python3
with open('a.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)
with open('b.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            # Uncomment the following if you also want to remove duplicates:
            # a.add(line)
Answer 3
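#!/usr/bin/env python3
# Load the lines of a.txt into a set so membership tests stay fast.
with open('a.txt', 'r') as f:
    a = set(f.read().split('\n'))
# Print every line of b.txt that does not appear in a.txt.
with open('b.txt', 'r') as f:
    for line in f:
        b = line.strip('\n ')
        if not b:
            continue  # skip blank lines instead of stopping at the first one
        if b not in a:
            print(b)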
Answer 4
Take a look at the coreutils comm command (man comm):
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
For example, you can do:
$ comm -13 <(sort a.txt) <(sort b.txt)
diary.txt
NOVEMBER.txt
(the lines unique to b.txt)
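To see why comm insists on sorted input, here is a minimal Python sketch of a line-by-line comparison of two sorted lists that sorts lines into the same three columns the man page describes. It is an assumption-laden illustration, not the coreutils implementation: comm_columns is a hypothetical name, and lines are compared by code point rather than by the locale's collation order.

```python
def comm_columns(sorted_a, sorted_b):
    """Split two sorted lists into (only_a, only_b, common), like comm's columns."""
    only_a, only_b, common = [], [], []
    i = j = 0
    # Walk both lists at once; this only works because both are sorted.
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            common.append(sorted_a[i])
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            # The current line of A cannot appear later in B.
            only_a.append(sorted_a[i])
            i += 1
        else:
            # The current line of B cannot appear later in A.
            only_b.append(sorted_b[j])
            j += 1
    # Whatever is left over in either list is unique to that list.
    only_a.extend(sorted_a[i:])
    only_b.extend(sorted_b[j:])
    return only_a, only_b, common


if __name__ == "__main__":
    a = sorted(["july.cpp", "windows.exe", "ttm.rar", "document.zip"])
    b = sorted(["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
                "document.zip", "diary.txt"])
    only_a, only_b, common = comm_columns(a, b)
    print(only_b)
```

Taking only the second list of the result corresponds to comm -13; concatenating the first two corresponds to comm -3 without the tab indentation.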