拆分 CSV 文件的某些字段以分布在多行上,同时保留非拆分字段的值

拆分 CSV 文件的某些字段以分布在多行上,同时保留非拆分字段的值

我有一个.csv这样的文件:

IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP$DOLAP,200000000$2048000,400000000$2048000,200000000,400000000,5,5

我想分割包含$符号的所有字段,以便该符号之前的部分保留在原始行上,而之后的部分则转移到直接在下面的新创建的行中,其中文件的其他字段保持与以下相同的值在最初找到分割字符的行上。如果一行中的多个字段包含分割分隔符,则将所有字段的“第二”部分传输到同一新创建的行。

在上面的示例中,输出应如下所示:

IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP,200000000,400000000,200000000,400000000,5,5
NAN,NAN,2,,,0,,0,,,,DOLAP,2048000,2048000,200000000,400000000,5,5

答案1

我会假设@Paul_Pedant提到的观点确实是一个错字,并且该OLAP$DOLAP字段也将被分割。在这种情况下,以下awk程序(我们称之为splitfields.awk)可能会起作用:

#/usr/bin/awk -f
{
    copy=0;

    for (i=1;i<=NF;i++)
    {
        j=index($i,"$");
        if (j>0)
        {
            fields[i]=substr($i,j+1);
            $i=substr($i,1,j-1);
            copy=1
        }
        else fields[i]=$i;
    }
    print;

    if (copy==1)
    {
        for (i=1;i<NF;i++) printf("%s%s",fields[i],OFS);
        printf("%s%s",fields[NF],ORS)
    }
}

你可以通过调用来使用它

user@host~$ awk -F, -v OFS="," -f splitfields.awk input.csv

该脚本检查每个字段中的$符号,并在数组中复制该字段fields以供进一步使用。如果$找到,则相应的字段当前的线减少到部分当数组$中相应的条目fields被填充时$

进行此检查后,无论如何都会打印(可能已修改的)当前行。此外,如果任何一个字段包含该符号,则会以 CSV 格式打印$由数组内容组成的新行。fields

答案2

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    delete subFlds
    numSubFlds = 0
    for (i=1; i<=NF; i++) {
        n = split($i,parts,/[$]/)
        for (subFldNr=1; subFldNr<=n; subFldNr++) {
            subFlds[i,subFldNr] = parts[subFldNr]
        }
        numSubFlds = (numSubFlds > n ? numSubFlds : n)
    }

    delete prev
    for (subFldNr=1; subFldNr<=numSubFlds; subFldNr++) {
        for (i=1; i<=NF; i++) {
            curr = ( (i,subFldNr) in subFlds ? subFlds[i,subFldNr] : prev[i] )
            printf "%s%s", curr, (i<NF ? OFS : ORS)
            prev[i] = curr
        }
    }
}

$ awk -f tst.awk file
IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP,200000000,400000000,200000000,400000000,5,5
NAN,NAN,2,,,0,,0,,,,DOLAP,2048000,2048000,200000000,400000000,5,5

答案3

awk -F, -v OFS=, '{split("",q);for(i=j=1;i<=NF;i++){if(split($i,a,/\$/)>1){j++;$i=a[1];q[i]=a[2]}}print;if(j>1){for(i in q)$i=q[i];print}}' file

或者,扩展:

awk -F, -v OFS=, '{
  split("", q)       # clear the q array; delete q is non-standard
  for(i = j = 1; i <= NF; i++){
    if(split($i, a, /\$/) > 1){  # if the field contains a $
       j++                # increase the cnt of fields with a $
       $i = a[1]          # set the field to the part before the $
       q[i] = a[2]        # save the part after the $ in the q array
    }
  }
  print                   # print the line with the parts before $
  if(j > 1){              # if any field had a $
     for(i in q)          # for each field which had a $
       $i = q[i]          # replace it with the part after $  
     print                # re-print the line with the parts after $
  }
}' file

答案4

如果我理解正确,某些字段是由 分隔的“双”变体字段$,您只需要这些行的两个副本,其中第一行包含第一个变体,第二行包含第二个变体....

awk '{v1=$0; gsub(/\$[^,]+/,""); gsub (/,[^,]+\$/,",",v1); print $0; if (v1!=$0) print v1}' file1

IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP,200000000,400000000,200000000,400000000,5,5
NAN,NAN,2,,,0,,0,,,,DOLAP,2048000,2048000,200000000,400000000,5,5

相关内容