Define fields by content and remove the separators inside them

I have some comma-separated files in which, unfortunately, some of the strings contain commas. This makes them hard to sort. See my previous question.

Regardless of sorting, I think it is best to remove these enclosed commas, since they are a potential hazard for every program in my pipeline.

I have only just started learning awk/gawk. I think a good strategy would be to:

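  1. Define the fields by content rather than by separator, as shown here
  2. Remove the separators inside a field, as shown here, taking care to restrict gsub to a single column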

I then tried the script sorter.awk, which is meant to remove commas from column 6 only:

BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"

}

{gsub(/[,]/,"",$6)}1

But when I use the command

gawk -f sorter.awk bugtest.csv > output.csv

to apply it to the following file, bugtest.csv:

1000,101,1,2,"VEN","Venezuela, Bolivarian Republic of",1967,22,4,99,0,0,1967-12-07,"R/22/2328A",0,1,"PRIVILEGES AND IMMUNITIES","TO ADOPT OPERATIVE PARAG. 2 OF DRAFT RESOL. (A/6965) ON DIPLOMATIC PRIVILEGES AND IMMUNITIES, WHICH PARAGRAPH URGES U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON PRIVILEGES AND IMMUNITIES.",0,0,0,0,0,0,0,22027
1000,713,1,1,"TWN","Taiwan, Province of China",1967,22,4,99,0,0,1967-12-07,"R/22/2328A",0,1,"PRIVILEGES AND IMMUNITIES","TO ADOPT OPERATIVE PARAG. 2 OF DRAFT RESOL. (A/6965) ON DIPLOMATIC PRIVILEGES AND IMMUNITIES, WHICH PARAGRAPH URGES U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON PRIVILEGES AND IMMUNITIES.",0,0,0,0,0,0,0,22027
100,101,1,2,"VEN","Venezuela, Bolivarian Republic of",1948,3,9,6,37,0,1948-11-07,"R/3/566C",0,1,"DISARMAMENT, NUCLEAR","TO ADOPT PARAGRAPH 7 OF THE USSR DRAFT RESOL. (A/723), SAID PARAGRAPH RECOMMENDING THE PROHIBITION OF ATOMIC WEAPONS INTENDED FOR AGGRESSION.",0,1,1,0,0,0,0,3023
1001,101,1,1,"VEN","Venezuela, Bolivarian Republic of",1967,22,1,101,0,0,1967-12-07,"R/22/2328B",0,0,"PRIVILEGES AND IMMUNITIES","TO ADOPT DRAFT RESOL. (A/6965) URGING U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON (DIPLOMATIC) PRIVILEGES AND IMMUNITIES AND DEPLORING ALL DEPARTURES FROM THE RULES OF INTERNATIONAL LAW ON THE SUBJECT.",0,0,0,0,0,0,0,22028
1001,713,1,1,"TWN","Taiwan, Province of China",1967,22,1,101,0,0,1967-12-07,"R/22/2328B",0,0,"PRIVILEGES AND IMMUNITIES","TO ADOPT DRAFT RESOL. (A/6965) URGING U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON (DIPLOMATIC) PRIVILEGES AND IMMUNITIES AND DEPLORING ALL DEPARTURES FROM THE RULES OF INTERNATIONAL LAW ON THE SUBJECT.",0,0,0,0,0,0,0,22028
1002,101,1,3,"VEN","Venezuela, Bolivarian Republic of",1967,22,11,50,51,0,1967-12-07,"R/22/2338A",1,1,"INTERNATIONAL YEAR FOR HUMAN RIGHTS","TO ADOPT THE AMENDMENT (A/L. 542) TO DRAFT RESOL. (A/7008) ON \INTERNATIONAL YEAR FOR HUMAN RIGHTS\\, WHICH AMENDMENT DELETES OPERATIVE PARAG.10.\""""",0,0,0,1,0,0,0,22029

output.csv looks like this, with no commas left between the fields:

1000 101 1 2 "VEN" "Venezuela Bolivarian Republic of" 1967 22 4 99 0 0 1967-12-07 "R/22/2328A" 0 1 "PRIVILEGES AND IMMUNITIES" "TO ADOPT OPERATIVE PARAG. 2 OF DRAFT RESOL. (A/6965) ON DIPLOMATIC PRIVILEGES AND IMMUNITIES, WHICH PARAGRAPH URGES U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON PRIVILEGES AND IMMUNITIES." 0 0 0 0 0 0 0 22027
1000 713 1 1 "TWN" "Taiwan Province of China" 1967 22 4 99 0 0 1967-12-07 "R/22/2328A" 0 1 "PRIVILEGES AND IMMUNITIES" "TO ADOPT OPERATIVE PARAG. 2 OF DRAFT RESOL. (A/6965) ON DIPLOMATIC PRIVILEGES AND IMMUNITIES, WHICH PARAGRAPH URGES U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON PRIVILEGES AND IMMUNITIES." 0 0 0 0 0 0 0 22027
100 101 1 2 "VEN" "Venezuela Bolivarian Republic of" 1948 3 9 6 37 0 1948-11-07 "R/3/566C" 0 1 "DISARMAMENT, NUCLEAR" "TO ADOPT PARAGRAPH 7 OF THE USSR DRAFT RESOL. (A/723), SAID PARAGRAPH RECOMMENDING THE PROHIBITION OF ATOMIC WEAPONS INTENDED FOR AGGRESSION." 0 1 1 0 0 0 0 3023
1001 101 1 1 "VEN" "Venezuela Bolivarian Republic of" 1967 22 1 101 0 0 1967-12-07 "R/22/2328B" 0 0 "PRIVILEGES AND IMMUNITIES" "TO ADOPT DRAFT RESOL. (A/6965) URGING U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON (DIPLOMATIC) PRIVILEGES AND IMMUNITIES AND DEPLORING ALL DEPARTURES FROM THE RULES OF INTERNATIONAL LAW ON THE SUBJECT." 0 0 0 0 0 0 0 22028
1001 713 1 1 "TWN" "Taiwan Province of China" 1967 22 1 101 0 0 1967-12-07 "R/22/2328B" 0 0 "PRIVILEGES AND IMMUNITIES" "TO ADOPT DRAFT RESOL. (A/6965) URGING U.N. MEMBER-STATES WHO HAVE NOT YET DONE SO TO ACCEDE TO THE U.N. CONVENTION ON (DIPLOMATIC) PRIVILEGES AND IMMUNITIES AND DEPLORING ALL DEPARTURES FROM THE RULES OF INTERNATIONAL LAW ON THE SUBJECT." 0 0 0 0 0 0 0 22028
1002 101 1 3 "VEN" "Venezuela Bolivarian Republic of" 1967 22 11 50 51 0 1967-12-07 "R/22/2338A" 1 1 "INTERNATIONAL YEAR FOR HUMAN RIGHTS" "TO ADOPT THE AMENDMENT (A/L. 542) TO DRAFT RESOL. (A/7008) ON \INTERNATIONAL YEAR FOR HUMAN RIGHTS\\, WHICH AMENDMENT DELETES OPERATIVE PARAG.10.\" """" 0 0 0 1 0 0 0 22029

So how can I remove only the quoted separators from column 6? To be clear, it is the commas after Venezuela and Taiwan that should go.

Answer 1

You just need to set the output field separator, OFS:

BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
OFS = ","
}

{gsub(/[,]/,"",$6)}1

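Otherwise you get the default OFS, which is a single space. Assigning to a field (here through gsub on $6) makes awk rebuild the whole record, joining the fields with OFS.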

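Note that , is not a regular-expression metacharacter, so it does not need to be enclosed in brackets in the pattern passed to gsub. This simpler expression works as well: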

{gsub(/,/,"",$6)}1
