Linux - 如何忽略“”之间的特殊字符?

Linux - 如何忽略“”之间的特殊字符?

我的文件:(1 个示例行)

MMP,"01_janitorial,02_cleaning_tools",1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i
& MetroMax Q shelf, NSF",CLEANING

我需要将其读入具有 7 列的 Postgresql 表中。

列的细分:

  1. MMP
  2. "01_janitorial,02_cleaning_tools"
  3. 1
  4. CUBIC_INCH
  5. "(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 1. 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF"
  6. CLEANING

该文件基本上以逗号分隔,但如果文本位于双引号内,我需要忽略逗号、回车符(如果存在)和双引号。如第 2 栏和第 6 栏所示。

我可以使用 postgresql 复制命令来加载,或者使用 awk、perl、sed 或任何其他命令来转换文件,然后加载。

答案1

正如所说,文件生成不正确。不过,您可以尝试不仅使用,分隔符还使用",和来解决它,"。当然,将需要自定义脚本,并且不保证您不会在第六个字段中遇到类似的情况。

或者,您可以删除前五个字段,假设第 6 个字段是唯一混乱的字段,然后从结果中删除最后一个字段和逗号。剩下的将是第六场内容。

答案2

仅仅使用-F,通常不足以解析 CSV 文件。特别是如果如所描述的那样,分隔符可以是带引号的字符串的一部分。你可以解决一些通过使用FPAT表达式来定义字段而不是为字段分隔符定义字符,但awk仍会逐行进行,因此您必须先行使用数据中的换行符。

完成后,您可以执行诸如 之类的操作awk 'BEGIN {FPAT="([^,]+)|(\"[^\"]+\")"} { /* normal processing here */ }' /path/to/file

该表达式将定义为“任何不是逗号的内容”或“双引号,一个或多个不是双引号的内容,后跟双引号”的字段。

然而,如果您引用的任何数据,这将会爆炸他们自己包含双引号。

答案3

该解决方案将非常具体于您的数据文件,因为引号未正确转义。由于只有一个故障栏,因此这是相当可行的。给你:

#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Line: $line"

# grabbing the first field is easy ..
    f1=$(echo $line | cut -d, -f1 )

# now remove the first field from the line
    line=$(echo $line | sed "s/$f1,//" )
    echo "Line is now: $line"

# to grab the second field use quote as a delimiter
    f2=$(echo $line | cut -d\" -f2 )

# now remove the second field from the line
    line=$(echo $line | sed "s/\"$f2\",//" )
    echo "Line is now: $line"

# fields 3,4,5 are trivial .. just repeat the same pattern as 1 and then remove them
    f3=$(echo $line | cut -d, -f1 )
    line=$(echo $line | sed "s/$f3,//" )
    echo "Line is now: $line"
    f4=$(echo $line | cut -d, -f1 )
    line=$(echo $line | sed "s/$f4,//" )
    echo "Line is now: $line"
    f5=$(echo $line | cut -d, -f1 )
    line=$(echo $line | sed "s/$f5,//" )

# here is the "trick" ... reverse the string, then you can cut field 7 first!
    line=$(echo $line | rev)
    echo "Line is now: $line"
    f7=$(echo $line | cut -d, -f1 )

# now remove field 7 from the string, then reverse it back
    line=$(echo $line | sed "s/$f7,//" )
    f7=$(echo $f7 | rev)

# now we can reverse the remaining string, which is field 6 back to normal
    line=$(echo $line | rev)
# and then remove the leading quote
    line=$(echo $line | cut --complement -c 1)
# and then remove the trailing quote
    line=$(echo $line | sed "s/\"$//" )
    echo "Line is now: $line"
# and then double up all the remaining quotes
    f6=$(echo $line | sed "s/\"/\"\"/g" )

    echo f1 = $f1
    echo f2 = $f2
    echo f3 = $f3
    echo f4 = $f4
    echo f5 = $f5
    echo f6 = $f6
    echo f7 = $f7
    echo $f1,\"$f2\",$f3,$f4,$f5,\"$f6\",$f7 >> fixed.txt
done < "$1"

我让它回显大量输出来向您展示它是如何工作的,一旦您理解了它,您就可以删除所有回显语句以使其更快。它将固定行附加到固定.txt

以下是运行和输出的示例:

[root@alpha ~]# ./fixit.sh test.txt
Line: MMP,"01_janitorial,02_cleaning_tools",1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: "01_janitorial,02_cleaning_tools",1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: 1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: ,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: GNINAELC,"FSN ,flehs Q xaMorteM & i xaMorteM stif ,yxope epuat ,D"42 x W"84 no stnuom ,gnicaps "3 htiw thgirpu "6 ,yticapac yart )41("
Line is now: (14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF
f1 = MMP
f2 = 01_janitorial,02_cleaning_tools
f3 = 1
f4 =
f5 = CUBIC_INCH
f6 = (14) tray capacity, 6"" upright with 3"" spacing, mounts on 48""W x 24""D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF
f7 = CLEANING

如果您需要以其他方式转义引号,鉴于上述内容,这应该是非常明显的。

答案4

我通过删除引用文件中的回车符来获得最终产品,如下脚本:

$ cat remove_cr.awk
#!/usr/bin/awk -f
{ record = record $0
  # If number of quotes is odd, continue reading record.
  if ( gsub( /"/, "&", record ) % 2 )
  { record = record " "
    next
  }
}
{ print record
  record = ""
}

相关内容