Compare 2 csv files using a specific column and ignore specific strings

My OS: RHEL 7.x

I have a csv file with 3 columns, as shown below:

#######Category: Fruit
Berries are fruit,04415bef-8c82-4672-848a-9ac80035040a,"Blah/Blah  And/More/Blah"
Apple is a fruit,073c6bf3-74d2-4c20-b8a7-82625cffe256,"Blah1.0/Blah  And/More/Blah2.0"
Orange is a fruit,083dddd7-2df4-422a-a07b-00419ccf98fd,"Blah2.0/Blah  And/More/Blah2.1"
---
---
#######Category: Vegetable
Sprouts are vegetable,84e0a0d8-0c0e-448b-9ad6-21178dac8b86,"Blah/Blah  And/lot More/Blah"
Ginger is a vegetable,abb5fee2-a588-45a7-a8c0-9cd87d60a6d2,"Blah1.0/Blah  And/whole lot more/Blah2.0"
onions is a vegetable,baeee94a-6447-4c80-bddf-313e0fc144a7,"Blah2.0/Blah  And/Lot More/Blah2.1"
---
---
#######Category: Mixed
Tomatoes can be called a mix,fe31c693-cdf4-4171-9c80-7a297d4bdb96,"Blah/Blah And/More/Blah"
cucumber is a joke,d3540c7f-fea5-4e64-87df-c18fdb0b7ff3,"Blah1.0/Blah  And/More/Blah2.0"
Spinach is LOL,da816135-9852-4067-8780-4e504dc8084b,"Blah2.0/Blah  And/Even More/Blah2.1"
---
---

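• The data above comes from a single file

• A new csv is generated every day, with additional entries under each header

• Header example: #######Category: Fruit

  • The "---" in the example above stands for more entries. There may be thousands of them

• The headers (starting with "#") do not change from day to day.

• Every day new data is fetched and dumped to a file. Assume today's data is dumped to the file: /path/to/Today.txt

• The 24-hour-old data is dumped to the file: /path/to/Yesterday.txt

• The second column of this csv holds a unique string with no spaces. Let's call it the uid

• Not important, but worth mentioning: there can be more headers, and each header can have thousands of entries under it.

Goal 1: find the new entries in Today.txt that do not exist in Yesterday.txt, using the uid (second column)

This bash code does that: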

TodayFile=/path/to/Today.txt
YstdayFile=/path/to/Yesterday.txt

sed -e '/^\s*$/ d' -e '/^#/ d' $TodayFile | awk -F , '{print $2}' | sed '/^$/d' | while read uid; do

        if [[ -z $(grep $uid $YstdayFile) ]];then
            grep $uid $TodayFile >> NewEntries.txt
        fi
done

When we open the newly generated NewEntries.txt, we get the new entries that do not exist in /path/to/Yesterday.txt

cat -n NewEntries.txt

1 new entry
2 new entry
3 new entry
---
---

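But this output is not good enough.

The goal: find the new entries and keep the headers (the ones they belong to) intact and in order

The desired final output would look like this: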

#######Category: Fruit
new entry
new entry
---
---
#######Category: Vegetable
new entry
new entry
---
---
#######Category: Mixed
new entry
new entry
---
---

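New entries (if any) should appear under the header they belong to

How can I achieve this with a shell/bash script? Any suggestions?

Also, is there a simpler solution than running a loop?

Answer 1

Assumptions:

  • Everything in Yesterday.txt is also contained in Today.txt
  • The first column does not contain any embedded commas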
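
Both assumptions can be checked before relying on the result. The following is only a rough sketch: the function name `check_assumptions` is made up for illustration, and it assumes every uid looks like the ones in the sample data (36 characters made of lowercase hex digits and dashes). If a line has a comma in its first column, the second field will usually not have that shape, so it gets reported.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original answer.
# Usage: check_assumptions Yesterday.txt Today.txt
# Assumes uids are 36 characters of lowercase hex digits and dashes.
check_assumptions() {
    # Today is read first so that its uids are known while Yesterday is scanned.
    awk -F',' '
    /^#/ || /^[ \t]*$/ { next }                  # skip headers and blank lines
    length($2) != 36 || $2 !~ /^[0-9a-f-]+$/ {   # 2nd field is not uid-shaped
        print "malformed uid: " FILENAME ":" FNR
        next
    }
    FNR == NR      { today[$2]; next }           # 1st file given to awk (Today)
    !($2 in today) { print "missing from today: " $2 }
    ' "$2" "$1"
}
```

No output means neither assumption was violated in the lines that were checked.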

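In this particular case, since you are already using awk, the whole operation can be written as a single awk script. This reduces the complexity of the code and should significantly improve performance (especially for larger files), because each file is read only once instead of running grep once per uid.

Setup (Today.txt has one new line appended at the end of each of the 3 categories):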

$ cat Yesterday.txt
#######Category: Fruit
Berries are fruit,04415bef-8c82-4672-848a-9ac80035040a,"Blah/Blah  And/More/Blah"
Apple is a fruit,073c6bf3-74d2-4c20-b8a7-82625cffe256,"Blah1.0/Blah  And/More/Blah2.0"
#######Category: Vegetable
Sprouts are vegetable,84e0a0d8-0c0e-448b-9ad6-21178dac8b86,"Blah/Blah  And/lot More/Blah"
Ginger is a vegetable,abb5fee2-a588-45a7-a8c0-9cd87d60a6d2,"Blah1.0/Blah  And/whole lot more/Blah2.0"
#######Category: Mixed
Tomatoes can be called a mix,fe31c693-cdf4-4171-9c80-7a297d4bdb96,"Blah/Blah And/More/Blah"
cucumber is a joke,d3540c7f-fea5-4e64-87df-c18fdb0b7ff3,"Blah1.0/Blah  And/More/Blah2.0"

$ cat Today.txt
#######Category: Fruit
Berries are fruit,04415bef-8c82-4672-848a-9ac80035040a,"Blah/Blah  And/More/Blah"
Apple is a fruit,073c6bf3-74d2-4c20-b8a7-82625cffe256,"Blah1.0/Blah  And/More/Blah2.0"
Orange is a fruit,083dddd7-2df4-422a-a07b-00419ccf98fd,"Blah2.0/Blah  And/More/Blah2.1"
#######Category: Vegetable
Sprouts are vegetable,84e0a0d8-0c0e-448b-9ad6-21178dac8b86,"Blah/Blah  And/lot More/Blah"
Ginger is a vegetable,abb5fee2-a588-45a7-a8c0-9cd87d60a6d2,"Blah1.0/Blah  And/whole lot more/Blah2.0"
onions is a vegetable,baeee94a-6447-4c80-bddf-313e0fc144a7,"Blah2.0/Blah  And/Lot More/Blah2.1"
#######Category: Mixed
Tomatoes can be called a mix,fe31c693-cdf4-4171-9c80-7a297d4bdb96,"Blah/Blah And/More/Blah"
cucumber is a joke,d3540c7f-fea5-4e64-87df-c18fdb0b7ff3,"Blah1.0/Blah  And/More/Blah2.0"
Spinach is LOL,da816135-9852-4067-8780-4e504dc8084b,"Blah2.0/Blah  And/Even More/Blah2.1"

One awk idea:

awk -F',' '
FNR==NR { seen[$2]; next }              # 1st file: save 2nd column (uid) as index in array seen[]
/^#####/ || !($2 in seen)               # 2nd file: if line starts with "#####" or 2nd field is not an index in array seen[], then print current line to stdout
' Yesterday.txt Today.txt

This generates:

#######Category: Fruit
Orange is a fruit,083dddd7-2df4-422a-a07b-00419ccf98fd,"Blah2.0/Blah  And/More/Blah2.1"
#######Category: Vegetable
onions is a vegetable,baeee94a-6447-4c80-bddf-313e0fc144a7,"Blah2.0/Blah  And/Lot More/Blah2.1"
#######Category: Mixed
Spinach is LOL,da816135-9852-4067-8780-4e504dc8084b,"Blah2.0/Blah  And/Even More/Blah2.1"
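
Note that every header is printed, even when its category has no new entries. If that is not wanted, one possible variation is to hold each header back until the first new entry of its category shows up. This is a sketch under the same assumptions as above; the function name `new_entries_by_category` is hypothetical, and treating any line that starts with "#" as a header is an assumption based on the sample data.

```shell
#!/bin/bash
# Hypothetical variant, not part of the original answer.
# Usage: new_entries_by_category Yesterday.txt Today.txt
new_entries_by_category() {
    awk -F',' '
    FNR == NR  { seen[$2]; next }       # 1st file: remember the uids already known
    /^#/       { header = $0; next }    # 2nd file: hold the header back for now
    /^[ \t]*$/ { next }                 # ignore blank lines
    !($2 in seen) {                     # new uid: emit the pending header once, then the line
        if (header != "") { print header; header = "" }
        print
    }
    ' "$1" "$2"
}
```

It could be called as `new_entries_by_category Yesterday.txt Today.txt > NewEntries.txt`. As with the script above, an empty Yesterday.txt makes `FNR == NR` hold for Today.txt as well, so nothing is printed in that case.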
