Under this rule, we only want to capture "csv" lines that have exactly 5 values:
"","","","",""
Example:
more conf.csv
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory",
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores"
"linux02","cluster26","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","api2-thrift-apiconf","api.executor.memory"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores"
"linux02","cluster26","api.executor.instances","2"
Expected output:
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"
Answer 1
Use:
awk -F "," 'NF==5 {print $0}' conf.csv
This prints the lines that contain 5 fields. However, the line:
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory",
is also printed, which is wrong: the trailing comma tricks awk into counting an (empty) fifth field on that line.
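One way to tighten the awk filter is to also reject lines in which any field is empty. The following is only a minimal sketch: the function name five_nonempty is made up for illustration, and it assumes that no field contains an embedded comma, since awk splits on every comma regardless of quoting.

```shell
# Sketch: keep lines with exactly 5 comma-separated fields, none of them empty.
# Assumption: no field contains an embedded comma.
five_nonempty() {
    awk -F',' '
        NF == 5 {
            for (i = 1; i <= NF; i++)
                if ($i == "" || $i == "\"\"") next   # skip missing or "" fields
            print
        }
    ' "$@"
}

# Try it on three sample lines; only the middle one should be printed.
printf '%s\n' \
    '"a","b","c","d",' \
    '"a","b","c","d","2"' \
    '"a","b","c","2"' | five_nonempty
```

It can be called with a file name as well, for example five_nonempty conf.csv.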
Answer 2
To handle CSV properly, use a CSV parser:
ruby -rcsv -e '
  # Keep rows with exactly 5 fields, none of them missing (nil)
  CSV.foreach(ARGV.shift) {|row|
    if row.size == 5 and row.none? {|elem| elem.nil?}
      puts CSV.generate_line(row, :force_quotes=>true)
    end
  }
' conf.csv
Answer 3
grep -E '(".+",){4}".+"' Csv.file
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","api2-thrift-apiconf","api.executor.memory"
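-E enables extended regular expressions. The pattern looks for ".+", four times, followed by one more ".+". That said, you should still show what you have tried.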
Note: I used .+ to search for non-empty strings. If you want lines with 5 fields even when a field is empty, replace + with *:
grep -E '(".*",){4}".*"' Csv.file
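Note that the grep output shown above also contains the six-field line: the pattern is not anchored, and .+ can match across quotes and commas. A stricter variant might look like the sketch below. It assumes that fields never contain embedded double quotes, and the function name five_quoted is hypothetical.

```shell
# Sketch: match only lines made of exactly five non-empty quoted fields.
# Assumption: no field contains an embedded double quote.
five_quoted() {
    # -x requires the whole line to match; [^"]+ keeps each field inside its own quotes.
    grep -Ex '("[^"]+",){4}"[^"]+"' "$@"
}

# Try it on three sample lines; only the first one should be printed.
printf '%s\n' \
    '"a","b","c","d","2"' \
    '"a","b","c","d","e","f"' \
    '"a","b","c","d",' | five_quoted
```

Replacing each + with * in this pattern would accept empty quoted fields ("") as well, while still rejecting lines with more or fewer than five fields.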
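Answer 4
Using Miller (mlr) to parse the data as a header-less, ragged (a differing number of fields per record) CSV file, and outputting all records that have exactly five fields: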
$ mlr --csv -N --ragged filter 'NF == 5' file
linux02,cluster26,api2-thrift-apiconf,api.driver.memory,
linux02,cluster26,api2-thrift-apiconf,api.driver.memory,2
linux02,cluster26,api2-thrift-apiconf,api.executor.cores,2
linux02,cluster26,api2-thrift-apiconf,api.executor.instances,2
linux02,cluster26,api2-thrift-apiconf,api.executor.memory,2
linux02,cluster26,api2-thrift-apiconf,api.sql.shuffle.partitions,141
linux02,cluster26,api2-thrift-apiconf,api.dynamicAllocation.enabled,true
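Note that we get one extra record compared with the expected output, because the given input contains a record whose fifth field is empty.
We can exclude records whose fifth field is empty, and force quoting of all fields, like so: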
$ mlr --csv -N --ragged --quote-all filter 'NF == 5 && !is_empty($5)' file
"linux02","cluster26","api2-thrift-apiconf","api.driver.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.cores","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.instances","2"
"linux02","cluster26","api2-thrift-apiconf","api.executor.memory","2"
"linux02","cluster26","api2-thrift-apiconf","api.sql.shuffle.partitions","141"
"linux02","cluster26","api2-thrift-apiconf","api.dynamicAllocation.enabled","true"