My OS: RHEL 7.x
I have a csv file with 3 columns, as shown below:
#######Category: Fruit
Berries are fruit,04415bef-8c82-4672-848a-9ac80035040a,"Blah/Blah And/More/Blah"
Apple is a fruit,073c6bf3-74d2-4c20-b8a7-82625cffe256,"Blah1.0/Blah And/More/Blah2.0"
Orange is a fruit,083dddd7-2df4-422a-a07b-00419ccf98fd,"Blah2.0/Blah And/More/Blah2.1"
---
---
#######Category: Vegetable
Sprouts are vegetable,84e0a0d8-0c0e-448b-9ad6-21178dac8b86,"Blah/Blah And/lot More/Blah"
Ginger is a vegetable,abb5fee2-a588-45a7-a8c0-9cd87d60a6d2,"Blah1.0/Blah And/whole lot more/Blah2.0"
onions is a vegetable,baeee94a-6447-4c80-bddf-313e0fc144a7,"Blah2.0/Blah And/Lot More/Blah2.1"
---
---
#######Category: Mixed
Tomatoes can be called a mix,fe31c693-cdf4-4171-9c80-7a297d4bdb96,"Blah/Blah And/More/Blah"
cucumber is a joke,d3540c7f-fea5-4e64-87df-c18fdb0b7ff3,"Blah1.0/Blah And/More/Blah2.0"
Spinach is LOL,da816135-9852-4067-8780-4e504dc8084b,"Blah2.0/Blah And/Even More/Blah2.1"
---
---
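• The data above comes from a single file.
• A new csv is generated every day, with additional entries under each header.
• Header example: #######Category: Fruit
- The "---" lines in the sample above stand for more entries. There may be thousands of entries.
• The headers (lines starting with "#") do not change from day to day.
• New data is fetched every day and dumped to a file. Assume today's data is dumped to: /path/to/Today.txt
• The 24-hour-old data is dumped to: /path/to/Yesterday.txt
• The second column of this csv holds a unique string with no spaces. Let's call it the uid.
• Not important, but worth mentioning: there can be more headers, and each header can have thousands of entries under it.
Goal 1: find the new entries in Today.txt that do not exist in Yesterday.txt, using the uid (second column).
This bash code does that: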
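TodayFile=/path/to/Today.txt
YstdayFile=/path/to/Yesterday.txt
# Drop blank lines and headers, extract the uid (2nd column), then look each uid up in yesterday's file
sed -e '/^\s*$/ d' -e '/^#/ d' "$TodayFile" | awk -F , '{print $2}' | sed '/^$/d' | while IFS= read -r uid; do
    # -q: exit status only; -F: treat the uid as a fixed string, not a regex
    if ! grep -qF -- "$uid" "$YstdayFile"; then
        grep -F -- "$uid" "$TodayFile" >> NewEntries.txt
    fi
done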
When we open the newly generated NewEntries.txt, we get the new entries that do not exist in /path/to/Yesterday.txt:
cat -n NewEntries.txt
1 new entry
2 new entry
3 new entry
---
---
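But this output is not good enough.
The goal is: find the new entries and keep the headers they belong to intact and in order.
A mock-up of the final output should look like this: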
#######Category: Fruit
new entry
new entry
---
---
#######Category: Vegetable
new entry
new entry
---
---
#######Category: Mixed
new entry
new entry
---
---
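New entries (if any) should appear under their relevant header.
How can I achieve this with a shell/bash script? Any suggestions?
Is there a simpler solution than running a loop?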
Answer 1
Assumptions:
- everything in Yesterday.txt is also contained in Today.txt
- the first column does not contain any embedded commas
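The second assumption matters because the uid is taken as the second comma-separated field; a comma inside the first column would shift it. A rough sanity check could look like the sketch below. It assumes that every uid has the 36-character UUID shape seen in the sample data (the question only promises a unique string without spaces), and check_uid_column is a hypothetical helper name, not part of the original post.

```shell
# Hypothetical helper: list non-header lines whose 2nd comma-separated field
# does not look like a 36-character UUID (hex digits and dashes only).
# Assumption: uids follow the UUID shape shown in the sample data.
check_uid_column() {
    awk -F',' '
        /^#/ || /^[[:space:]]*$/ { next }    # skip headers and blank lines
        length($2) != 36 || $2 !~ /^[0-9a-fA-F-]+$/ {
            print FILENAME ":" FNR ": " $0   # report file, line number and the line
        }
    ' "$@"
}
```

Running check_uid_column Yesterday.txt Today.txt should print nothing if the assumption holds for both files; any line it reports would need a real csv parser rather than a plain split on commas.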
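In this particular case, since you are already using awk, we can write the whole process as a single awk script. This reduces the complexity of the code and should significantly improve performance (especially for larger files), because each file is read only once instead of being scanned by grep for every uid.
Setup (a new line appended at the end of each of the 3 categories):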
$ cat Yesterday.txt
#######Category: Fruit
Berries are fruit,04415bef-8c82-4672-848a-9ac80035040a,"Blah/Blah And/More/Blah"
Apple is a fruit,073c6bf3-74d2-4c20-b8a7-82625cffe256,"Blah1.0/Blah And/More/Blah2.0"
#######Category: Vegetable
Sprouts are vegetable,84e0a0d8-0c0e-448b-9ad6-21178dac8b86,"Blah/Blah And/lot More/Blah"
Ginger is a vegetable,abb5fee2-a588-45a7-a8c0-9cd87d60a6d2,"Blah1.0/Blah And/whole lot more/Blah2.0"
#######Category: Mixed
Tomatoes can be called a mix,fe31c693-cdf4-4171-9c80-7a297d4bdb96,"Blah/Blah And/More/Blah"
cucumber is a joke,d3540c7f-fea5-4e64-87df-c18fdb0b7ff3,"Blah1.0/Blah And/More/Blah2.0"
$ cat Today.txt
#######Category: Fruit
Berries are fruit,04415bef-8c82-4672-848a-9ac80035040a,"Blah/Blah And/More/Blah"
Apple is a fruit,073c6bf3-74d2-4c20-b8a7-82625cffe256,"Blah1.0/Blah And/More/Blah2.0"
Orange is a fruit,083dddd7-2df4-422a-a07b-00419ccf98fd,"Blah2.0/Blah And/More/Blah2.1"
#######Category: Vegetable
Sprouts are vegetable,84e0a0d8-0c0e-448b-9ad6-21178dac8b86,"Blah/Blah And/lot More/Blah"
Ginger is a vegetable,abb5fee2-a588-45a7-a8c0-9cd87d60a6d2,"Blah1.0/Blah And/whole lot more/Blah2.0"
onions is a vegetable,baeee94a-6447-4c80-bddf-313e0fc144a7,"Blah2.0/Blah And/Lot More/Blah2.1"
#######Category: Mixed
Tomatoes can be called a mix,fe31c693-cdf4-4171-9c80-7a297d4bdb96,"Blah/Blah And/More/Blah"
cucumber is a joke,d3540c7f-fea5-4e64-87df-c18fdb0b7ff3,"Blah1.0/Blah And/More/Blah2.0"
Spinach is LOL,da816135-9852-4067-8780-4e504dc8084b,"Blah2.0/Blah And/Even More/Blah2.1"
One awk idea:
awk -F',' '
FNR==NR { seen[$2]; next } # 1st file: save 2nd column (uid) as index in array seen[]
/^#####/ || !($2 in seen) # 2nd file: if line starts with "#####" or 2nd field is not an index in array seen[], then print current line to stdout
' Yesterday.txt Today.txt
This generates:
#######Category: Fruit
Orange is a fruit,083dddd7-2df4-422a-a07b-00419ccf98fd,"Blah2.0/Blah And/More/Blah2.1"
#######Category: Vegetable
onions is a vegetable,baeee94a-6447-4c80-bddf-313e0fc144a7,"Blah2.0/Blah And/Lot More/Blah2.1"
#######Category: Mixed
Spinach is LOL,da816135-9852-4067-8780-4e504dc8084b,"Blah2.0/Blah And/Even More/Blah2.1"
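Note that the awk script above prints every header, even when a category has no new entries. If empty categories should be left out, one possible variation is to hold each header back until the first new entry under it shows up. The following is only a sketch under the same assumptions as above; new_entries_by_category is a hypothetical function name, and treating every line that starts with "#" as a header is an assumption based on the sample data.

```shell
# Hypothetical helper: print today's new entries under their headers,
# skipping headers that have no new entries.
# Usage: new_entries_by_category Yesterday.txt Today.txt
new_entries_by_category() {
    local yesterday=$1 today=$2
    awk -F',' '
        FNR == NR { if (!/^#/) seen[$2]; next }  # 1st file: remember every uid
        /^#/ { header = $0; next }               # 2nd file: hold the header back
        /^[[:space:]]*$/ { next }                # ignore blank lines
        !($2 in seen) {                          # uid not present yesterday
            if (header != "") { print header; header = "" }  # emit header once
            print
        }
    ' "$yesterday" "$today"
}
```

With the sample files above this should produce the same output as the original script, since every category has a new entry. One caveat of the FNR==NR idiom: if Yesterday.txt is completely empty, awk treats Today.txt as the first file and prints nothing, so that case would need separate handling.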