Get the first name from a full name and extract it into a new column using awk or sed

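I have many .csv files containing customer information. In all of these files, I want to add an additional column, FIRSTNAME, next to the FULLNAME column. The first name can be generated by taking the first word of FULLNAME.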

There are no two-word first names such as Jean Paul. In the last column, commas are used within the field text.

Input

COMPANY,FULLNAME,EMAIL,FUNCTION,CITY,INDUSTRY,COMMENT
Company name,Firstname Lastname,[email protected],Marketing Manager,New York,Health Care,"home, work"
Company name,Firstname infix Lastname,[email protected],Marketing Manager,New York,Health Care,"home, workhome, work"
Company name,Firstname infix infix2 Lastname,[email protected],Marketing Manager,New York,Health Care,"home, work"

Expected output

COMPANY,FULLNAME,FIRSTNAME,EMAIL,FUNCTION,CITY,INDUSTRY,COMMENT
Company name,Firstname Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"
Company name,Firstname infix Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, workhome, work"
Company name,Firstname infix infix2 Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"

How can I do this with awk, sed, or something else?

Answer 1

Using the CSV-aware utility Miller (mlr):

mlr --csv \
    put '$FIRSTNAME = sub($FULLNAME," .*","")' then \
    reorder -f COMPANY,FULLNAME,FIRSTNAME file

...which, given the data in the question, results in

COMPANY,FULLNAME,FIRSTNAME,EMAIL,FUNCTION,CITY,INDUSTRY,COMMENT
Company name,Firstname Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"
Company name,Firstname infix Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, workhome, work"
Company name,Firstname infix infix2 Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"

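This use of Miller first creates a new field, FIRSTNAME, through a regex-based substitution that removes the first space character in the FULLNAME value and everything after it.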

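Since the new field is added last, the fields are then reordered so that the first few fields are COMPANY, FULLNAME, and FIRSTNAME, in that order. The remaining fields keep their original order.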
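The substitution at the heart of this approach is not specific to Miller. Purely as an illustration, the sketch below applies the same " .*" pattern to a bare name with sed. The helper name first_word is made up for this example, and it works on a plain string, not on a CSV record.

```shell
# first_word is a hypothetical helper for illustration; it is not part of Miller.
# It deletes the first space and everything after it, leaving only the first word.
first_word() {
    printf '%s\n' "$1" | sed 's/ .*//'
}

first_word 'Firstname infix infix2 Lastname'
```

A value without any space passes through unchanged, which matches what the sub() call does for single-word names.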

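Instead of the put expression using sub(), you could use put with the splitnv() function to split the value of the FULLNAME field on spaces and assign the first resulting string to FIRSTNAME: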

mlr --csv \
    put '$FIRSTNAME = splitnv($FULLNAME," ")[1]' then \
    reorder -f COMPANY,FULLNAME,FIRSTNAME file

For prettier output:

$ mlr --icsv --opprint --barred put '$FIRSTNAME = splitnv($FULLNAME," ")[1]' then reorder -f COMPANY,FULLNAME,FIRSTNAME file
+--------------+---------------------------------+-----------+--------------------------------+-------------------+----------+-------------+----------------------+
| COMPANY      | FULLNAME                        | FIRSTNAME | EMAIL                          | FUNCTION          | CITY     | INDUSTRY    | COMMENT              |
+--------------+---------------------------------+-----------+--------------------------------+-------------------+----------+-------------+----------------------+
| Company name | Firstname Lastname              | Firstname | [email protected] | Marketing Manager | New York | Health Care | home, work           |
| Company name | Firstname infix Lastname        | Firstname | [email protected] | Marketing Manager | New York | Health Care | home, workhome, work |
| Company name | Firstname infix infix2 Lastname | Firstname | [email protected] | Marketing Manager | New York | Health Care | home, work           |
+--------------+---------------------------------+-----------+--------------------------------+-------------------+----------+-------------+----------------------+

Answer 2

Using Raku (formerly known as Perl_6)

~$ raku -MText::CSV -e 'my @a = csv(in => $*IN);  \
                        my @b = [Z] @a>>[0..1], @a>>[1].map(*.words.[0]), @a>>[2..*];  \
                        @b = @b>>.[*;*]>>.Array; @b[0][2] = "FIRSTNAME";  \
                        csv(in => @b, out => $*OUT);'  < file

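If the OP can accept double quotes throughout the CSV output, this is probably the simplest approach (fields with embedded commas and/or spaces come out quoted; see RFC4180).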

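The Perl(5) module Text::CSV_XS is highly regarded, and the long-time author/maintainer of that module developed the Raku Text::CSV module (H. Merijn Brand, personal communication). Here, Raku's Text::CSV functionality is used only for IO, in particular for handling the quoted last column. The column manipulations above are otherwise done with standard Raku arrays. One advantage of this code, though, is that a double-quoted file path can be used in place of $*IN, which reads STDIN.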

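Above, the file is read into the @a array in the first line. In the second line, the first word of zero-indexed column 1 is taken and placed at zero-indexed column 2; this column and the remaining columns are assigned row by row to the @b array. The third line does some housekeeping (flattening the arrays, making the @b elements mutable, and correcting the column header). Finally, the fourth line writes the output.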

Sample input:

COMPANY,FULLNAME,EMAIL,FUNCTION,CITY,INDUSTRY,COMMENT
Company name,Firstname Lastname,[email protected],Marketing Manager,New York,Health Care,"home, work"
Company name,Firstname infix Lastname,[email protected],Marketing Manager,New York,Health Care,"home, workhome, work"
Company name,Firstname infix infix2 Lastname,[email protected],Marketing Manager,New York,Health Care,"home, work"

Sample output (from above):

COMPANY,FULLNAME,FIRSTNAME,EMAIL,FUNCTION,CITY,INDUSTRY,COMMENT
"Company name","Firstname Lastname",Firstname,[email protected],"Marketing Manager","New York","Health Care","home, work"
"Company name","Firstname infix Lastname",Firstname,[email protected],"Marketing Manager","New York","Health Care","home, workhome, work"
"Company name","Firstname infix infix2 Lastname",Firstname,[email protected],"Marketing Manager","New York","Health Care","home, work"

In practice, re-quoting only the last column takes a bit more work. Replace the final line above, csv(in => @b, out => $*OUT), with the following:

.join(",").put for @b[0];  \
.join(",").put for [Z] @b[1..*]>>.[0..*-2]>>.join(","), @b[1..*]>>.[*-1].map(*.raku);'  

Sample output (revised code example):

COMPANY,FULLNAME,FIRSTNAME,EMAIL,FUNCTION,CITY,INDUSTRY,COMMENT
Company name,Firstname Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"
Company name,Firstname infix Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, workhome, work"
Company name,Firstname infix infix2 Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"

https://datatracker.ietf.org/doc/html/rfc4180
https://github.com/Tux/CSV/blob/master/doc/Text-CSV.md
https://raku.org

Answer 3

Using sed

$ sed -E '2,$s/[^,]*,([^ ]*) [^,]*,/&\1,/;1s/([^,]*,){2}/&FIRSTNAME,/' input_file
COMPANY,FULLNAME,FIRSTNAME,EMAIL,FUNCTION,CITY,INDUSTRY,COMMENT
Company name,Firstname Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"
Company name,Firstname infix Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, workhome, work"
Company name,Firstname infix infix2 Lastname,Firstname,[email protected],Marketing Manager,New York,Health Care,"home, work"
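
Since the question also mentions awk, here is a minimal awk sketch of the same idea, wrapped in a shell function. The function name add_firstname is hypothetical. Like the sed command above, it assumes that the first two columns never contain commas or quotes; quoted commas in later columns survive only because awk rejoins the record with the same separator it split on.

```shell
# add_firstname is a hypothetical helper name used only for this sketch.
# Assumption: COMPANY and FULLNAME (columns 1 and 2) contain no commas or quotes.
add_firstname() {
    awk '
        BEGIN { FS = OFS = "," }
        # Header line: append the new column name right after FULLNAME.
        NR == 1 { $2 = $2 OFS "FIRSTNAME" }
        # Data lines: the first space-separated word of FULLNAME is the first name.
        NR > 1 && NF > 1 { split($2, parts, " "); $2 = $2 OFS parts[1] }
        { print }
    ' "$1"
}
```

For many files, one option is a loop such as `for f in *.csv; do add_firstname "$f" > "out/$f"; done`, where the output directory `out` is an assumption and must already exist. Writing to a separate directory keeps the originals untouched until the results have been checked.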
