How to split a field in a CSV and copy the row's other fields onto new lines

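I have a target that uses a CSV file. The 6th field contains words, but it has a maximum length of 16 characters. If the field is longer than 16 characters, I want to duplicate the row and split the field across the copies without breaking words.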

Current file

"5","4","3","2","1","XYZ ABCD E"
"1","2","3","4","5","AB CDE F GHI JK LMNOP Q RS TUV W XYZ 12 3456 7890"
"9","8","7","6","5","LMN O PQ R"

Desired output

"5","4","3","2","1","XYZ ABCD E"
"1","2","3","4","5","AB CDE F GHI JK"
"1","2","3","4","5","LMNOP Q RS TUV W"
"1","2","3","4","5","XYZ 12 3456 7890"
"9","8","7","6","5","LMN O PQ R"

Answer 1

Using GNU Awk (gawk) to run fold as a coprocess and read its output into a variable with getline:

gawk -F, '
  BEGIN{
    OFS=FS; 
    cmd="fold -sw 16";
  }

  # if total length (16 + 2 for quotes) is within limit, print as-is
  length($NF) <= 18 {print; next}

  # else
  {
    # trim the quotes, then fold
    print substr($NF,2,length($NF)-2) |& cmd; 
    close(cmd,"to"); 
    NF--; 
    while((cmd |& getline var) > 0){

      # (optional) trim trailing whitespace
      sub(/[ \t]+$/,"",var);

      print $0, "\"" var "\"" ;
    }
    close(cmd,"from");
  }
' file.csv

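The sub call removes the trailing whitespace left by the fold operation.

Note that to get the exact output shown, you need fold -sw17, so that the wrap happens at 16 characters plus the trailing space (which is subsequently removed). However, doing so may allow the last line of the folded output to exceed the 16-character limit.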
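One way around the width caveat is to do the wrapping inside awk, word by word, so that no chunk exceeds the limit unless a single word is itself longer than the limit. The following is only a sketch under stated assumptions: the function name split_field6 and its max argument are hypothetical, fields are assumed to contain no embedded commas or escaped quotes, and runs of spaces in the last field are collapsed to one space.

```shell
# Hypothetical helper (not from the answer above): reads CSV on stdin,
# writes the expanded rows to stdout. Optional argument: max width (default 16).
split_field6() {
  awk -F, -v OFS=, -v max="${1:-16}" '
    {
      text = $NF
      gsub(/^"|"$/, "", text)              # strip the surrounding quotes

      # rebuild the leading fields once; they are repeated on every output row
      prefix = ""
      for (i = 1; i < NF; i++) prefix = prefix $i OFS

      # greedy wrap: add words to the current chunk while they still fit
      n = split(text, word, " ")
      line = ""
      for (i = 1; i <= n; i++) {
        if (line == "")
          line = word[i]
        else if (length(line) + 1 + length(word[i]) <= max)
          line = line " " word[i]
        else {
          print prefix "\"" line "\""
          line = word[i]
        }
      }
      print prefix "\"" line "\""
    }'
}
```

It can be run as split_field6 16 < file.csv. Rows whose last field already fits are printed as a single row, so no separate length check is needed.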

Answer 2

I put together a rather clumsy awk script that preserves the double quotes. Here it is:

{
    for ( i=0; i<= length($6); i+=16 )
    {
        if ( i+17 < length($6) )
        {
            if ( i == 0 )
                printf ("%s,%s,%s,%s,%s,%s\"\n", $1, $2, $3, $4, $5, substr($6,i,16))
            else
                printf ("%s,%s,%s,%s,%s,\"%s\"\n", $1, $2, $3, $4, $5, substr($6,i+1,16))
        }
        else
        {
            if ( i == 0 )
                printf ("%s,%s,%s,%s,%s,%s\n", $1, $2, $3, $4, $5, substr($6,i,16))
            else
                printf ("%s,%s,%s,%s,%s,\"%s\n", $1, $2, $3, $4, $5, substr($6,i+1,16))
        }
    }
}

The output is:

$ awk -F, -f awks csvfields
"5","4","3","2","1","XYZ ABCD E"
"1","2","3","4","5","AB CDE F GHI JK"
"1","2","3","4","5"," LMNOP Q RS TUV "
"1","2","3","4","5","W XYZ 12 3456 78"
"1","2","3","4","5","90"
"9","8","7","6","5","LMN O PQ R"
$

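One problem is that a space at the chunk boundary is kept, unlike in the desired output, where it has been removed. The script also cuts at fixed 16-character offsets, so a word can be split across rows, as with 78 and 90 above.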

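Answer 3

Try the code below. It slices the last field at fixed offsets, so it relies on the word breaks falling where they do in the sample data.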

 k=16;for ((j=1;j<=50;j++)); do  awk -v j="$j" -v k="$k" -F "," '{if(length($NF) > 16){print $1,$2,$3,$4,$5,substr($NF,j,k)}else {print $0}}' filename; j=$(($j+16)); done|sort | uniq

Output

"5","4","3","2","1","XYZ ABCD E"
"1","2","3","4","5","AB CDE F GHI JK"
"1","2","3","4","5","LMNOP Q RS TUV W"
"1","2","3","4","5","XYZ 12 3456 7890"
"9","8","7","6","5","LMN O PQ R"

Answer 4

A shell-only approach (tested on Bash and Ksh93). I do like the fold approach, though, because it uses existing tools.

# read from stdin, output to stdout
# Note: no shebang line at the top, which makes it easier to try bash/ksh as interpreters

OIFS="$IFS"
IFS=,
while read f1 f2 f3 f4 f5 f6; do
    f6=${f6#\"}
    f6=${f6%\"}             # strip DQs
    if ((${#f6}<17)); then  # no action
            IFS="$OIFS"
            echo "$f1,$f2,$f3,$f4,$f5,\"$f6\""
            IFS=","
            continue
    else
            IFS="$OIFS"
            while ((${#f6}>17)); do
                    n6=${f6:0:16}
                    f6=${f6#$n6}
                    n6=${n6# }
                    n6=${n6% }
                    echo "$f1,$f2,$f3,$f4,$f5,\"$n6\""
            done
            echo "$f1,$f2,$f3,$f4,$f5,\"${f6# }\""
    fi
    IFS=","
done
IFS="$OIFS"
exit

Result:

"5","4","3","2","1","XYZ ABCD E"
"1","2","3","4","5","AB CDE F GHI JK"
"1","2","3","4","5","LMNOP Q RS TUV W"
"1","2","3","4","5","XYZ 12 3456 7890"
"9","8","7","6","5","LMN O PQ R"

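To handle word breaks without using fold or similar, replace the line n6=${f6:0:16} (shown commented out at the end of the snippet) with the following code. The second echo command also needs replacing, as described after the snippet: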

                    c6="$f6"
                    n6=""
                    while (((${#n6}+${#nw})<=16)); do
                            n6="$n6${c6%% *} "   # append the next word plus a separating space
                            n6=${n6# }
                            eval c6=\${c6\#${c6%% *} }
                            nw=${c6%% *}
                    done
                    #n6=${f6:0:16} ### replace by above

And replace

            echo "$f1,$f2,$f3,$f4,$f5,\"${f6# }\""

with

            ((${#f6}>0)) && echo "$f1,$f2,$f3,$f4,$f5,\"${f6# }\""

to avoid printing an empty field 6 remainder.
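The word-by-word loop above rests on two POSIX parameter expansions. A small sketch with made-up values (the variable rest is hypothetical; c6 and nw mirror the names in the script) may make them easier to follow:

```shell
# Illustrative values only, not taken from the script's input
c6="AB CDE F GHI"

# %% removes the longest suffix matching " *", leaving the first word
nw=${c6%% *}

# # removes the shortest prefix matching "* ", leaving everything after the first word
rest=${c6#* }

printf 'first word: %s\nremainder:  %s\n' "$nw" "$rest"
```

As far as I can tell, c6=${c6#* } has the same effect as the eval line in the loop, so it could be used there instead of eval, but treat that as a suggestion rather than part of the tested script.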

Using the following test file:

"5","4","3","2","1","XYZ ABCD E"
"1","2","3","4","5","AB CDE F GHI JK LMNOP Q RS TUV W XYZ 12 3456 7890"
"9","8","7","6","5","LMN O PQ R"
"1","2","3","4","5","A BB CCC DDD EEEE FFFFF GGGGGG HHHHHHH"

Result:

"5","4","3","2","1","XYZ ABCD E"
"1","2","3","4","5","AB CDE F GHI JK"
"1","2","3","4","5","LMNOP Q RS TUV W"
"1","2","3","4","5","XYZ 12 3456 7890"
"9","8","7","6","5","LMN O PQ R"
"1","2","3","4","5","A BB CCC DDD"
"1","2","3","4","5","EEEE FFFFF"
"1","2","3","4","5","GGGGGG HHHHHHH"

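However, using the existing tool fold is much easier and follows the UNIX philosophy: build on existing simple tools. But if you enjoy shell programming, the above is one way to get a solution. If anyone needs an explanation of the code, let me know.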
