我有一个庞大的客户帐户信息文件,目前按如下方式排序到一列中。我希望使用:
作为分隔符来分割每一行。但这样做时,对于每一行,当分开时,我想创建一个新列,将每行后面的数据放入:
相应的列中。我的最终目标是将其转换为 CSV 形式,以便导入到某个地方进行数据分析和/或构建数据库。
firstName:John
middleName:null
lastName:Doe
companyName:John Doe Corp
suffix:null
primaryEmail:[email protected]
primaryPhone:555.555.5555
secondaryEmail:[email protected]
secondaryPhone:null
此外,这不是每个客户的总行数。每个客户有 55 行。
答案1
使用perl
任何桌面或服务器 Linux 发行版上都存在的 :
perl -lne '
BEGIN{$,=","}
($k,$v)=split":",$_,2;
next unless defined $v;
for($k,$v){s/"/""/g,$_=qq{"$_"}if/[$,"]/}
$k=$t{$k}//=$t++;
if(exists$f[$k]){print@f;@f=()}
$f[$k]=$v;
END{print@f;print STDERR sort{$t{$a}<=>$t{$b}}keys%t}
' your_file
这应该将文件转换为标准 CSV,除了标头(包含字段名称的第一行)将在处理整个文件后打印到 stderr。您可以使用... >body 2>hdr
然后将其保存在某处cat hdr body > final_file.csv
。
这对空行等没有任何特殊意义:一条记录被视为由一组具有不同名称的字段组成,无论它们的顺序如何。
包含,
或的字段"
将被放入 内部"..."
,并且任何内部"
都将通过将其加倍来转义""
(使用 CSV 约定)。
您可以通过更改为例如来调整字段分隔符$,=","
。$,="|"
(或$,="\t"
用于选项卡)。您可以通过删除该行来摆脱引用和转义for($k,$v){ ... }
。
awk
这可以在(而不是在sed
or中完成tr
),只是它会更复杂一些,因为awk
无法一次打印整个数组(您必须循环遍历它们),也无法在中拆分字符串有限数量的字段(你必须使用一个substr
技巧)。
答案2
为了完整起见,awk
基于 - 的解决方案。
-脚本awk
(我们称之为convert_csv.awk
):
#!/bin/awk -f
BEGIN{FS=":";OFS=","}
# Process all non-empty lines
NF>0{
# Check if the "key" part of the line was not yet encountered, both globally
# and for the currently processes record.
# If not encountered globally yet, add to list of headers (=columns).
new_hdr=1; new_key=1;
for (i=1; i<=n_hdrs; i++) {if (hdr[i]==$1) new_hdr=0;}
if (new_hdr) hdr[++n_hdrs]=$1;
for (key in val) {if (key==$1) new_key=0;}
# Once no globally new keys are found, consider the "list of headers" as
# complete and print it as CSV header line.
if (!new_hdr && !hdr_printed)
{
for (i=1;i<=n_hdrs;i++) printf("%s%s", hdr[i], i==n_hdrs?ORS:OFS);
hdr_printed=1;
}
# If the current key was already found in the currently processed record,
# we assume that a new record was started, and print the data collected
# so far before collecting data on the next record.
if (!new_key)
{
for (i=1;i<=n_hdrs;i++) printf("%s%s", val[hdr[i]], i==n_hdrs?ORS:OFS);
delete val;
}
# Associate the "value" part of the line with the "key", perform transformations
# as necessary. Since both the 'gsub()' function used for escaping '"' to '""'
# and the 'index()' function used to localize ',' return non-zero if an occurence
# was found, the sum of both return values being > 0 indicates that the field
# must be quoted.
quote=gsub("\"","\"\"",$2)+index($2,",");
if (quote) $2="\""$2"\"";
val[$1]=$2;
}
# Print the last record. If it was the only record, print the header line, too (this
# is the case if 'new_hdr' is still 'true' at end-of-file).
END {
if (new_hdr)
{
for (i=1;i<=n_hdrs;i++) printf("%s%s", hdr[i], i==n_hdrs?ORS:OFS);
}
for (i=1;i<=n_hdrs;i++) printf("%s%s", val[hdr[i]], i==n_hdrs?ORS:OFS);
}
该函数在注释中进行了解释,但基本上它会查找唯一键的集合,并在行上找到“已经遇到”的键时认为记录已完成;然后它打印该记录并清除临时缓冲区以收集下一条记录。它还应用 @mosvy 指示的转换,以遵守字段中特殊字符的 CSV 标准。
将此称为
awk -f convert_csv.awk input.txt