用逗号替换给定模式

用逗号替换给定模式

内容如下test.txt,一般为22栏:

BusinessDate,SourceSystemId,IceIndexId,IceIndexName,ComponentId,ComponentReferenceType,ComponentType,ComponentName,ComponentIssuerCIS,ComponentIssuerName,ComponentWeighting,IceCurveID,IceCurveName,RiskyCurveCIS,OriginalWeighting,DerivedWeighting,indexType,cafName,indexStartDate,indexCurrency,componentCurrency
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,20173QAG6,CUSIP,BOND,GCCFC 2007-GG9 A-M,FCMT7US,COML MORT TST 2007-GG9,0.04,19063270,risky_20173QAG6_FCMT7US.usd,FCMT7US,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,61753JAF6,CUSIP,BOND,MSC 2007-IQ13 AM,Z01IYUS,MORGAN STNLY CAP I TST 2007-IQ13,0.04,19059680,risky_61753JAF6_Z01IYUS.usd,Z01IYUS,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,12345400,risky_itraxx_europe32_14.eur,XSNOREFOB258,ISIN,BOND,NOREFOB_BANCO_SANTANDER_SA,BBDERES,BANCO SANTANDER SA, MADRID HO,0.008,20286090,risky_bank_bsch_14.eur,BBDERES,0.008,0.008,indexCds,index_risky_itraxx_europe32_14.eur,2019-09-18,EUR,EUR

我想找到超过 22 列的行,这些列是问题记录,并将其替换为逗号。

bcz BANCO SANTANDER SA, MADRID HO是一个关键字,不应该用逗号分隔,它应该是不带逗号的 like, as BANCO SANTANDER SA MADRID HO

答案1

检测哪些输入行具有超过 21 个字段非常容易。例如

awk -F, 'NF>21' input.txt

修复它并不那么容易。您必须手动编辑这些行(例如,使用vinano或您喜欢的编辑器),因为脚本没有简单的方法来知道哪些字段是假的(或者相反,哪些字段包含不应该包含的逗号)。

这是一项需要实际情报的工作,而不是一些简单(甚至复杂)的搜索和替换启发法。它还需要您对数据集的实际了解。

这就是为什么真实CSV文件在引号内包含字符串字段,以及为什么真正的 CSV 解析器能够理解带引号的字符串字段。

如果您可以获得以逗号分隔的文件生成的任何内容来生成带有引号字符串字段的真实 CSV,那么处理将是很多更轻松。这才是解决这个问题的真正办法——从源头上解决问题。


其他答案指出,您至少需要做的一部分是将以空格开头的任何字段与前一个字段合并。

如果您打算这样做,还应该将输入转换为格式正确的 CSV。例如:

$ perl -e '
use Text::CSV qw(csv);
$csv = Text::CSV->new();

while($row = $csv->getline(ARGV)) {

  # merge fields beginning with whitespace with the previous field.
  for ($i=1; $i <= @$row; $i++) {
    if ($row->[$i] =~ m/^\s/) {
      $row->[$i-1] .= "," . $row->[$i];
      $row->[$i] = undef;
    };
  };

  # delete any undef-ed fields    
  @$row = grep{defined $_} @$row;

  $csv->say(STDOUT, $row);
};
' input.txt

笔记:

  • $csv->getline()命令获取整行 CSV 数据,无论它是全部在单行上还是由于多行引号字符串等原因分布在多行上。

    这与您当前的输入文件并不是特别相关(因为它不是 CSV,只是看起来有点像 CSV 文件),但在处理真正的 CSV 文件时非常有用。

示例输出:

BusinessDate,SourceSystemId,IceIndexId,IceIndexName,ComponentId,ComponentReferenceType,ComponentType,ComponentName,ComponentIssuerCIS,ComponentIssuerName,ComponentWeighting,IceCurveID,IceCurveName,RiskyCurveCIS,OriginalWeighting,DerivedWeighting,indexType,cafName,indexStartDate,indexCurrency,componentCurrency
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,20173QAG6,CUSIP,BOND,"GCCFC 2007-GG9 A-M",FCMT7US,"COML MORT TST 2007-GG9",0.04,19063270,risky_20173QAG6_FCMT7US.usd,FCMT7US,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,61753JAF6,CUSIP,BOND,"MSC 2007-IQ13 AM",Z01IYUS,"MORGAN STNLY CAP I TST 2007-IQ13",0.04,19059680,risky_61753JAF6_Z01IYUS.usd,Z01IYUS,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,12345400,risky_itraxx_europe32_14.eur,XSNOREFOB258,ISIN,BOND,NOREFOB_BANCO_SANTANDER_SA,BBDERES,"BANCO SANTANDER SA, MADRID HO",0.008,20286090,risky_bank_bsch_14.eur,BBDERES,0.008,0.008,indexCds,index_risky_itraxx_europe32_14.eur,2019-09-18,EUR,EUR

这已将第四行的两个有问题的字段合并为"BANCO SANTANDER SA, MADRID HO"

注意全部带有嵌入空格(以及任何其他可能有问题的字符)的字段也用双引号引起来。即输出现在是格式正确的 CSV。

这只解决了我们所知道的一个问题。可能还有我们(即您)还不知道的其他内容。例如,可能有额外的逗号,但后面没有紧跟空格。

该脚本(就像此处实现该算法的变体的其他答案一样)将休息字段应以空白字符开头的任何输入行。您的示例输入中没有任何内容,但从一个标题和三个数据行的样本大小中假设这一点并不安全。

真实的解决方案仍然是,如前所述,从源头解决问题通过使其输出格式正确的 CSV 而不是这种损坏的垃圾。


如果您无法使生成此输出的程序生成正确引用的 CSV,另一种选择是让它使用列分隔符数据中没有。管道字符|或分号;或制表符 ( 0x09, Ctrl-I, \t) 通常是分隔符的不错选择。

答案2

看起来您真正想要的是将任何以空格开头的字段加入到前一个字段的末尾:

$ awk -F, '{print NF}' file
21
21
21
22

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    rec = $1
    for (i=2; i<=NF; i++)  {
        rec = rec ( $i ~ /^[[:space:]]/ ? "" : OFS) $i
    }
    print rec
}

$ awk -f tst.awk file
BusinessDate,SourceSystemId,IceIndexId,IceIndexName,ComponentId,ComponentReferenceType,ComponentType,ComponentName,ComponentIssuerCIS,ComponentIssuerName,ComponentWeighting,IceCurveID,IceCurveName,RiskyCurveCIS,OriginalWeighting,DerivedWeighting,indexType,cafName,indexStartDate,indexCurrency,componentCurrency
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,20173QAG6,CUSIP,BOND,GCCFC 2007-GG9 A-M,FCMT7US,COML MORT TST 2007-GG9,0.04,19063270,risky_20173QAG6_FCMT7US.usd,FCMT7US,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,61753JAF6,CUSIP,BOND,MSC 2007-IQ13 AM,Z01IYUS,MORGAN STNLY CAP I TST 2007-IQ13,0.04,19059680,risky_61753JAF6_Z01IYUS.usd,Z01IYUS,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,12345400,risky_itraxx_europe32_14.eur,XSNOREFOB258,ISIN,BOND,NOREFOB_BANCO_SANTANDER_SA,BBDERES,BANCO SANTANDER SA MADRID HO,0.008,20286090,risky_bank_bsch_14.eur,BBDERES,0.008,0.008,indexCds,index_risky_itraxx_europe32_14.eur,2019-09-18,EUR,EUR

$ awk -f tst.awk file | awk -F, '{print NF}'
21
21
21
21

答案3

如果您的目标是拦截以空格开头的字段,请使用 sed 和 Miller (https://github.com/johnkerl/miller) 是

sed -r 's/, /|/g' input.csv | mlr --csv put -S 'for (k in $*) {$[k] = gsub($[k], "[|]", ", ");}'

你将会拥有

BusinessDate,SourceSystemId,IceIndexId,IceIndexName,ComponentId,ComponentReferenceType,ComponentType,ComponentName,ComponentIssuerCIS,ComponentIssuerName,ComponentWeighting,IceCurveID,IceCurveName,RiskyCurveCIS,OriginalWeighting,DerivedWeighting,indexType,cafName,indexStartDate,indexCurrency,componentCurrency
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,20173QAG6,CUSIP,BOND,GCCFC 2007-GG9 A-M,FCMT7US,COML MORT TST 2007-GG9,0.04,19063270,risky_20173QAG6_FCMT7US.usd,FCMT7US,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,61753JAF6,CUSIP,BOND,MSC 2007-IQ13 AM,Z01IYUS,MORGAN STNLY CAP I TST 2007-IQ13,0.04,19059680,risky_61753JAF6_Z01IYUS.usd,Z01IYUS,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,12345400,risky_itraxx_europe32_14.eur,XSNOREFOB258,ISIN,BOND,NOREFOB_BANCO_SANTANDER_SA,BBDERES,"BANCO SANTANDER SA, MADRID HO",0.008,20286090,risky_bank_bsch_14.eur,BBDERES,0.008,0.008,indexCds,index_risky_itraxx_europe32_14.eur,2019-09-18,EUR,EUR

答案4

尝试这个,

awk -F ',' '{a=$1;for (i=2;i<=10;i++) a=a","$i ; 
             for (i=11;i<=(NF-11);i++) a=a$i; 
             for (i=(NF-10);i<=NF;i++) a=a","$i; 
             print a}'  input.csv  


BusinessDate,SourceSystemId,IceIndexId,IceIndexName,ComponentId,ComponentReferenceType,ComponentType,ComponentName,ComponentIssuerCIS,ComponentIssuerName,ComponentWeighting,IceCurveID,IceCurveName,RiskyCurveCIS,OriginalWeighting,DerivedWeighting,indexType,cafName,indexStartDate,indexCurrency,componentCurrency
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,20173QAG6,CUSIP,BOND,GCCFC 2007-GG9 A-M,FCMT7US,COML MORT TST 2007-GG9,0.04,19063270,risky_20173QAG6_FCMT7US.usd,FCMT7US,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,11260370,risky_CMBX3_AM_19HGEMAC7.usd,61753JAF6,CUSIP,BOND,MSC 2007-IQ13 AM,Z01IYUS,MORGAN STNLY CAP I TST 2007-IQ13,0.04,19059680,risky_61753JAF6_Z01IYUS.usd,Z01IYUS,0.04,0.04,indexCds,index_risky_CMBX3_AM_19HGEMAC7.usd,2010-02-09,USD,USD
2019-09-19,ICEEUR,12345400,risky_itraxx_europe32_14.eur,XSNOREFOB258,ISIN,BOND,NOREFOB_BANCO_SANTANDER_SA,BBDERES,BANCO SANTANDER SA MADRID HO,0.008,20286090,risky_bank_bsch_14.eur,BBDERES,0.008,0.008,indexCds,index_risky_itraxx_europe32_14.eur,2019-09-18,EUR,EUR

上面的代码将从第 11 列连接到列 NF-11...

相关内容