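I want to output certain rows of a spreadsheet to a different file if they contain a number that is 1 greater than the number in the cell above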

I have a large CSV file that looks something like this:

SomeData,SomeData,1,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,2,SomeData  
SomeData,SomeData,3,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,2,SomeData  
SomeData,SomeData,3,SomeData  
SomeData,SomeData,4,SomeData  
SomeData,SomeData,5,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,1,SomeData  
SomeData,SomeData,1,SomeData  

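I want to create a new CSV file that contains only the rows whose third value is part of a set, i.e. if the value in the third field of the row below is higher, I want to include both rows.

So in the example above, I only want to keep rows 2-4 and rows 8-12 in the new file.

I'm struggling to work out how to grep for this kind of pattern.

Any ideas?

Thanks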

Answer 1

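CSV files are tricky. Assuming that SomeData is a properly quoted data field that may contain commas, we can replace the delimiter with something that definitely does not occur in the data, such as a tab ($'\t' in most modern shells); you can change this to anything you know is safe. If the data fields contain no commas, skip the csvformat parts below.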
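
To see why the delimiter swap matters, here is a minimal sketch of what happens when a quoted field contains a comma and awk splits naively on commas. The sample row and the helper name naive_third_field are made up for illustration and are not part of the original data.

```shell
# Hypothetical sample row: the first field is quoted and contains a comma.
sample='"Some,Data",SomeData,2,SomeData'

# Splitting on bare commas ignores the quoting, so the counter is no longer
# in field 3 (it has shifted to field 4).
naive_third_field() {
    printf '%s\n' "$1" | awk -F',' '{ print $3 }'
}

naive_third_field "$sample"    # prints SomeData, not 2
```

Converting to a delimiter that cannot occur in the data first, as done below with csvformat, avoids this shift.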

Using csvkit:

$ csvformat -D$'\t' data.csv
SomeData        SomeData        1       SomeData
SomeData        SomeData        1       SomeData
SomeData        SomeData        2       SomeData
SomeData        SomeData        3       SomeData
SomeData        SomeData        1       SomeData
etc.

We can then pass this to an awk script, which does the actual work of finding the sets.

NR > 1 && $3  == count + 1 {
    # This line is part of the set.

    ++count;        # We expect this value on the next line.
    ++set_size;     # This is the number of lines in the set.

    # Output previous line and remember this line.
    print previous_line;
    previous_line = $0;

    # Continue with next line.
    next;
}

set_size > 0 && $3 != count + 1 {
    # This line is not part of the set, but we're currently tracking a
    # set.  This means that the set ended, so output the last line of
    # the set.

    print previous_line;
    set_size = 0;
}

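{
    # This line might be part of the next set.

    count = $3;
    previous_line = $0;
}

END {
    # If the input ends while a set is still open, output its last line.

    if (set_size > 0)
        print previous_line;
}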

Running it:

$ csvformat -D$'\t' data.csv | awk -F$'\t' -f script.awk
SomeData        SomeData        1       SomeData
SomeData        SomeData        2       SomeData
SomeData        SomeData        3       SomeData
SomeData        SomeData        1       SomeData
SomeData        SomeData        2       SomeData
SomeData        SomeData        3       SomeData
SomeData        SomeData        4       SomeData
SomeData        SomeData        5       SomeData

Then convert it back to the standard comma-separated form:

$ csvformat -D$'\t' data.csv | awk -F$'\t' -f script.awk | csvformat -d$'\t'
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,4,SomeData
SomeData,SomeData,5,SomeData

If the data fields contain no embedded commas, you can leave out csvformat entirely:

$ awk -F',' -f script.awk data.csv
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,4,SomeData
SomeData,SomeData,5,SomeData

Answer 2

Assuming the data is in the file datafile, you can use awk to process it. (You can't use grep, because it only looks at one line at a time.)

awk -F, '
    # Initially we are not in a block
    BEGIN { inblock=0 }

    # If the third field is greater than that on the previous line, print
    # it. But if we were not already in a block then print the saved line
    # too - and mark that we are now in a block
    NR>1 && $3>third { if (!inblock) { print line; inblock=1 } print $0 }

    # If we are in a block and the third field is not greater than the
    # previous line's value, drop out of the block
    inblock && $3<=third { inblock=0 }

    # Save the values each time around
    {third=$3; line=$0 }
' datafile
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,1,SomeData
SomeData,SomeData,2,SomeData
SomeData,SomeData,3,SomeData
SomeData,SomeData,4,SomeData
SomeData,SomeData,5,SomeData
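
The script above treats any increase as part of a block. If only steps of exactly one should count, as the question title suggests, the comparison can be tightened. The following is a sketch under that assumption, not a definitive implementation; the function name filter_runs and the sample rows are made up for illustration, and it assumes the data contains no quoted commas.

```shell
# filter_runs: print every line whose third comma-separated field is exactly
# one more than the third field of the line before it, together with the
# line that starts each such run.
filter_runs() {
    awk -F, '
        # Third field is exactly previous + 1: this line continues a run.
        NR > 1 && $3 == third + 1 {
            if (!inblock) { print line; inblock = 1 }   # emit the run start once
            print
            third = $3; line = $0
            next
        }

        # Any other line ends the current run and may start a new one.
        { inblock = 0; third = $3; line = $0 }
    ' "$@"
}

# The jump from 2 to 4 is not a step of one, so the 4 is left out.
printf '%s\n' 'x,y,1,z' 'x,y,1,z' 'x,y,2,z' 'x,y,4,z' 'x,y,1,z' | filter_runs
# prints:
# x,y,1,z
# x,y,2,z
```

To save the result in a new file, as the question asks, the output can be redirected, for example: filter_runs datafile > runs.csv (runs.csv being a placeholder name).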
