我的文件:(1 个示例行)
MMP,"01_janitorial,02_cleaning_tools",1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i
& MetroMax Q shelf, NSF",CLEANING
我需要将其读入具有 7 列的 Postgresql 表中。
列的细分:
MMP
"01_janitorial,02_cleaning_tools"
1
CUBIC_INCH
"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 1. 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF"
CLEANING
该文件基本上以逗号分隔,但如果文本位于双引号内,我需要忽略逗号、回车符(如果存在)和双引号。如第 2 栏和第 6 栏所示。
我可以使用 postgresql 复制命令来加载,或者使用 awk、perl、sed 或任何其他命令来转换文件,然后加载。
答案1
正如所说,文件生成不正确。不过,您可以尝试不仅使用,
分隔符还使用",
和来解决它,"
。当然,将需要自定义脚本,并且不保证您不会在第六个字段中遇到类似的情况。
或者,您可以删除前五个字段,假设第 6 个字段是唯一混乱的字段,然后从结果中删除最后一个字段和逗号。剩下的将是第六场内容。
答案2
仅仅使用-F,
通常不足以解析 CSV 文件。特别是如果如所描述的那样,分隔符可以是带引号的字符串的一部分。你可以解决一些通过使用FPAT
表达式来定义字段而不是为字段分隔符定义字符,但awk
仍会逐行进行,因此您必须先行使用数据中的换行符。
完成后,您可以执行诸如 之类的操作awk 'BEGIN {FPAT="([^,]+)|(\"[^\"]+\")"} { /* normal processing here */ }' /path/to/file
。
该表达式将定义为“任何不是逗号的内容”或“双引号,一个或多个不是双引号的内容,后跟双引号”的字段。
然而,如果您引用的任何数据,这将会爆炸他们自己包含双引号。
答案3
该解决方案将非常具体于您的数据文件,因为引号未正确转义。由于只有一个故障栏,因此这是相当可行的。给你:
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "Line: $line"
# grabbing the first field is easy ..
f1=$(echo $line | cut -d, -f1 )
# now remove the first field from the line
line=$(echo $line | sed "s/$f1,//" )
echo "Line is now: $line"
# to grab the second field use quote as a delimiter
f2=$(echo $line | cut -d\" -f2 )
# now remove the second field from the line
line=$(echo $line | sed "s/\"$f2\",//" )
echo "Line is now: $line"
# fields 3,4,5 are trivial .. just repeat the same pattern as 1 and then remove them
f3=$(echo $line | cut -d, -f1 )
line=$(echo $line | sed "s/$f3,//" )
echo "Line is now: $line"
f4=$(echo $line | cut -d, -f1 )
line=$(echo $line | sed "s/$f4,//" )
echo "Line is now: $line"
f5=$(echo $line | cut -d, -f1 )
line=$(echo $line | sed "s/$f5,//" )
# here is the "trick" ... reverse the string, then you can cut field 7 first!
line=$(echo $line | rev)
echo "Line is now: $line"
f7=$(echo $line | cut -d, -f1 )
# now remove field 7 from the string, then reverse it back
line=$(echo $line | sed "s/$f7,//" )
f7=$(echo $f7 | rev)
# now we can reverse the remaining string, which is field 6 back to normal
line=$(echo $line | rev)
# and then remove the leading quote
line=$(echo $line | cut --complement -c 1)
# and then remove the trailing quote
line=$(echo $line | sed "s/\"$//" )
echo "Line is now: $line"
# and then double up all the remaining quotes
f6=$(echo $line | sed "s/\"/\"\"/g" )
echo f1 = $f1
echo f2 = $f2
echo f3 = $f3
echo f4 = $f4
echo f5 = $f5
echo f6 = $f6
echo f7 = $f7
echo $f1,\"$f2\",$f3,$f4,$f5,\"$f6\",$f7 >> fixed.txt
done < "$1"
我让它回显大量输出来向您展示它是如何工作的,一旦您理解了它,您就可以删除所有回显语句以使其更快。它将固定行附加到固定.txt。
以下是运行和输出的示例:
[root@alpha ~]# ./fixit.sh test.txt
Line: MMP,"01_janitorial,02_cleaning_tools",1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: "01_janitorial,02_cleaning_tools",1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: 1,,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: ,CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: CUBIC_INCH,"(14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF",CLEANING
Line is now: GNINAELC,"FSN ,flehs Q xaMorteM & i xaMorteM stif ,yxope epuat ,D"42 x W"84 no stnuom ,gnicaps "3 htiw thgirpu "6 ,yticapac yart )41("
Line is now: (14) tray capacity, 6" upright with 3" spacing, mounts on 48"W x 24"D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF
f1 = MMP
f2 = 01_janitorial,02_cleaning_tools
f3 = 1
f4 =
f5 = CUBIC_INCH
f6 = (14) tray capacity, 6"" upright with 3"" spacing, mounts on 48""W x 24""D, taupe epoxy, fits MetroMax i & MetroMax Q shelf, NSF
f7 = CLEANING
如果您需要以其他方式转义引号,鉴于上述内容,这应该是非常明显的。
答案4
我通过删除引用文件中的回车符来获得最终产品,如下脚本:
$ cat remove_cr.awk
#!/usr/bin/awk -f
{ record = record $0
# If number of quotes is odd, continue reading record.
if ( gsub( /"/, "&", record ) % 2 )
{ record = record " "
next
}
}
{ print record
record = ""
}