How to capture only the "csv" lines that have 5 values

We want to capture only the "csv" lines that have exactly 5 values, according to this pattern:

"","","","",""

Example:

more conf.csv

"linux02","cluster26","api2-thrift-apiconf","api.driver.memory",
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores"
"linux02","cluster26","api.executor.instances","2"

"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"

"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","api2-thrift-apiconf","api.executor.memory"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores"
"linux02","cluster26","api.executor.instances","2"

Expected output:

"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"

Answer 1

Use:

awk -F "," 'NF==5 {print $0}' conf.csv

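This prints the lines that contain 5 fields. However, the line:

"linux02","cluster26","api2-thrift-apiconf","api.driver.memory",

is also printed, which is wrong: the trailing comma tricks awk into believing the line has a fifth (empty) field.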
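
One possible workaround, sketched here under the assumption that no quoted field contains an embedded comma, is to additionally require that the fifth field is not empty. The sample file below is a hypothetical four-line subset of the question's conf.csv, created only so the snippet runs on its own:

```shell
# Hypothetical sample: four lines taken from the question's conf.csv.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory",
"linux02","cluster26","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","api2-thrift-apiconf","api.executor.memory"
EOF

# NF == 5 keeps lines with exactly five comma-separated fields;
# $5 != "" additionally rejects a line that merely ends in a comma.
awk -F, 'NF == 5 && $5 != ""' "$sample"
```

Of the four sample lines, only the one ending in "2" with five fields is printed. This is still plain text splitting, not CSV parsing, so it breaks as soon as a value itself contains a comma.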

Answer 2

To process CSV correctly, use a CSV parser:

ruby -rcsv -e '
  # Print only rows that have exactly 5 fields, none of them empty (nil)
  CSV.foreach(ARGV.shift) {|row|
    if row.size == 5 and row.none? {|elem| elem.nil?}
      puts CSV.generate_line(row, :force_quotes=>true)
    end
  }
' conf.csv

Answer 3

grep -E '(".+",){4}".+"' Csv.file
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","api2-thrift-apiconf","api.executor.memory"

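-E enables extended regular expressions; the pattern searches for ".+", four times, followed by one more ".+". You should still show what you have tried, though.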

Note: I used .+ to search for non-empty strings. If you want lines with 5 fields even when a field is empty, replace + with *:

grep -E '(".*",){4}".*"' Csv.file
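
Note that the output shown for the first command also includes the line with six fields: the pattern is not anchored, and . can match quotes and commas as well. A minimal sketch of a stricter variant, assuming fields never contain an embedded double quote, uses -x to match whole lines only and [^"] in place of the dot. The sample file is again a hypothetical subset of the question's data:

```shell
# Hypothetical sample: four lines taken from the question's conf.csv.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory",
"linux02","cluster26","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","api2-thrift-apiconf","api.executor.memory"
EOF

# -x requires the pattern to match the entire line;
# [^"]+ is a non-empty field body that cannot run past its closing quote.
grep -Ex '("[^"]+",){4}"[^"]+"' "$sample"
```

Only the five-field line is printed here. Replacing each [^"]+ with [^"]* would also accept quoted empty fields such as "".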

Answer 4

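Using Miller (mlr) to parse the data as a header-less, ragged (a different number of fields per record) CSV file and output all records that have exactly five fields: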

$ mlr --csv -N --ragged filter 'NF == 5' file
linux02,cluster26,api2-thrift-apiconf,api.driver.memory,
linux02,cluster26,api2-thrift-apiconf,api.driver.memory,2
linux02,cluster26,api2-thrift-apiconf,api.executor.cores,2
linux02,cluster26,api2-thrift-apiconf,api.executor.instances,2
linux02,cluster26,api2-thrift-apiconf,api.executor.memory,2
linux02,cluster26,api2-thrift-apiconf,api.sql.shuffle.partitions,141
linux02,cluster26,api2-thrift-apiconf,api.dynamicAllocation.enabled,true

Note that we get one extra record compared to the expected output, because the given input contains a record whose fifth field is empty.

We can exclude records whose fifth field is empty, and force quoting of all fields, like so:

$ mlr --csv -N --ragged --quote-all filter 'NF == 5 && !is_empty($5)' file
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"
