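Processing a CSV file on the command line: remove only the middle lines between consecutive row entries if the consecutive entries have the same second-column value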

Question

Using awk:

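BEGIN { FS = "," }

# Lines starting with "*" are passed through unchanged.
/^[*]/ { print; next }

{
        if (NR > 1 && $2 == word) {
                # Same word as the current run: remember this record
                # as the last one seen in the run so far.
                tail = $0
                ++count
        } else {
                # A new run starts: flush the previous run's last record, if any.
                if (count) print tail
                word = $2; count = 0
                print
        }
}

# Flush the last record of the final run.
END { if (count) print tail }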

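This awk script unconditionally prints all lines that start with *. If the line is not such a line, and the word in the second field is the word we have remembered, the record is stored in the variable tail ("tail" as in the last record in a run of records with the same word in the second field).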

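If the second field is not the same as before, the script prints the tail record (if the previous run contained more than one record), remembers the new word, and prints the current record (the first record in a new run of one or more records with the same word in the second field).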
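As a small illustration of the logic described above, the following sketch wraps the same awk program in a shell function and feeds it a few lines. The function name squeeze_runs and the sample records are made up for this example; they are not the data set from the question.

```shell
# Hypothetical helper: the awk program from above, wrapped in a shell function.
squeeze_runs() {
    awk '
    BEGIN { FS = "," }
    /^[*]/ { print; next }
    {
        if (NR > 1 && $2 == word) {
            tail = $0
            ++count
        } else {
            if (count) print tail
            word = $2; count = 0
            print
        }
    }
    END { if (count) print tail }
    '
}

# Made-up sample input: a run of three Apple records interrupted by a "*" line,
# a single Box record, and a run of two Citrus records.
printf '%s\n' '1,Apple' '* note' '2,Apple' '3,Apple' '4,Box' '5,Citrus' '6,Citrus' |
    squeeze_runs
```

With this input, only 2,Apple is dropped, since it is the only record strictly inside a run. The * line is printed as soon as it is read, so it appears before 3,Apple, the held-back tail of the Apple run.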

Testing on the provided data, and assuming it is simple CSV (meaning no embedded delimiters, newlines, etc.):

$ awk -f script file
0,Apple
* Checkpoint
* Another checkpoint
4,Apple
5,Box
6,Box
7,Citrus
8,Box
9,Apple
11,Apple
12,Dove
13,Citrus
* Sudden checkpoint, * Leftover checkpoint note 1, * Leftover checkpoint note N
16,Citrus
17,Apple
18,Citrus
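The "simple CSV" assumption matters because awk with FS = "," splits on every comma, including commas inside quoted fields. A minimal sketch with a made-up record (the helper name second_field is hypothetical):

```shell
# Hypothetical helper: print what awk considers the second field
# when splitting on plain commas.
second_field() {
    awk 'BEGIN { FS = "," } { print $2 }'
}

# Made-up record whose second field is a quoted string containing a comma.
printf '%s\n' '1,"Apple, red",x' | second_field
# prints: "Apple
```

The quoted field is cut at the embedded comma, so comparing $2 between records would no longer compare the whole value.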

Similar to the above but using Miller (mlr), which is CSV-aware and able to handle CSV records with complex quoted strings:

if (is_not_null(@word) && $2 == @word) {
        @tail = $*;
        false # omit this record for now
} else {
        is_not_null(@tail) {
                emit @tail # emit the tail record
        }
        @word = $2; @tail = null;
        true  # emit this record
}

end { is_not_null(@tail) { emit @tail } }

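This is an expression for Miller's filter sub-command, which includes or omits records from the input data set using logic very similar to that of the awk code above. We can have Miller pass through the lines starting with * by using --pass-comments-with='*' on the command line. Using --csv with -N treats the input as header-less CSV.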

$ mlr --pass-comments-with='*' --csv -N filter -f script file
0,Apple
* Checkpoint
* Another checkpoint
4,Apple
5,Box
6,Box
7,Citrus
8,Box
9,Apple
11,Apple
12,Dove
13,Citrus
* Sudden checkpoint," * Leftover checkpoint note 1"," * Leftover checkpoint note N"
16,Citrus
17,Apple
18,Citrus

Answer 1