Extract fields and substrings and merge sorted rows

Answer 1

Using awk and GNU datamash:

awk 'BEGIN{ OFS=FS="\t" }
  NR>2{                       # skip first two records
    split($3, a, "/" )        # split $3 into array a on /
    domain=a[3]               # 3rd element is the domain name
    sub(/^www\./, "", domain) # remove www. prefix
    print domain, $4          # print domain and email
  }
' file | datamash -g 1 unique 2

The awk part prints the domain and email of every record, skipping the first two lines. This gives

a.com   [email protected]
a.com   [email protected]
b.fr    [email protected]
b.fr    [email protected]

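The output is then piped to datamash, which groups the input on the first field and prints a comma-separated list of the unique values of the second field.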

Output:

a.com   [email protected]
b.fr    [email protected],[email protected]

The header line is left as an exercise.
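The group-then-unique step can also be sketched in Python, as a rough illustration rather than a description of how datamash works internally. The function name group_unique and the sample addresses are made up for this example. Like datamash -g without its sort option (as far as I understand it), itertools.groupby only merges rows whose keys are adjacent, so the input is assumed to be already sorted by domain.

```python
from itertools import groupby

def group_unique(rows):
    """Group consecutive (key, value) rows by key and join the
    unique values of each group with commas."""
    result = []
    for key, group in groupby(rows, key=lambda row: row[0]):
        # A set drops duplicates; sorting makes the order deterministic.
        values = sorted({value for _, value in group})
        result.append((key, ",".join(values)))
    return result

# Hypothetical (domain, email) pairs, shaped like the awk output above.
rows = [
    ("a.com", "alice@a.com"),
    ("a.com", "alice@a.com"),
    ("b.fr", "bob@b.fr"),
    ("b.fr", "carol@b.fr"),
]

grouped = group_unique(rows)
for domain, emails in grouped:
    print(domain, emails, sep="\t")
```

If the rows were not sorted by domain, sorting them first (for example with sorted(rows)) would be needed before grouping.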

Answer 2

Using GNU awk for arrays of arrays and gensub():

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR>1 { d2e[gensub(/[^.]+\.([^.]+\.[^./]+).*/,"\\1",1,$3)][$4] }
END {
    print "domain", "emails(s)"
    for (domain in d2e) {
        cnt = 0
        for (email in d2e[domain]) {
            row = (cnt++ ? row ", " : domain OFS) email
        }
        print row
    }
}

$ awk -f tst.awk file
domain  emails(s)
a.com   [email protected]
b.fr    [email protected], [email protected]
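To see what the gensub() pattern does, here is the same regular expression applied with Python's re.sub, purely as an illustration. The helper name domain_from_url and the sample URLs are assumptions made for this sketch. The pattern consumes everything up to the first dot, captures the next two dot-separated labels, and discards the rest, so it appears to rely on the host starting with a prefix such as www.

```python
import re

# Same pattern as in the gawk script: skip the text before the first dot,
# capture "label.label", then swallow the remainder of the string.
PATTERN = re.compile(r"[^.]+\.([^.]+\.[^./]+).*")

def domain_from_url(url):
    """Return the captured two-label domain, or the input unchanged
    when the pattern does not match."""
    return PATTERN.sub(r"\1", url, count=1)

# Hypothetical URLs for illustration only.
first = domain_from_url("http://www.a.com/index")
second = domain_from_url("https://www.b.fr/path/page.html")
print(first)
print(second)
```

A URL whose host has no leading label to drop (say, a bare two-label host) would not be reduced the same way, which is worth checking against the real data.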

Answer 3

Using Miller (with the help of sed):

$ mlr --prepipe 'sed "/^$/d"' --tsv   put -q -S '
  $domain = joinv(mapexcept(splitnvx(joinv(mapselect(splitnvx($URL,"/"),3),""),"."),1),".");
  @e[$domain] = mapsum(@e[$domain],{$email:1});
  end {
    for(k,v in @e){@{email(s)}[k] = joink(v,",")};
    emit @{email(s)}, "domain"
  }' File.tsv
domain  email(s)
a.com   [email protected]
b.fr    [email protected],[email protected]

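The sed --prepipe command just removes the extraneous empty lines so that the input can be parsed as TSV. The $domain variable is obtained by splitting the URL field twice: first on / (selecting the third element), then on . (selecting all but the first element, e.g. www). The out-of-stream map @e is then built up, for each domain, as a map keyed on the email field (each key with the value 1); this is the step that removes duplicate emails for the same domain. In the end block, the email maps are converted to comma-separated strings and emitted.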
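The same sequence of steps can be sketched in Python to make the data flow easier to follow. This is an analogy under stated assumptions, not a translation of Miller's semantics: the record layout, the helper name domain_of, and the sample values are hypothetical. Emails are stored as dictionary keys with a placeholder value of 1, mirroring the {$email:1} map in the Miller expression, so repeated emails collapse into one key.

```python
def domain_of(url):
    """Split on "/" and take the third element (the host), then split on "."
    and keep everything except the first label (e.g. "www")."""
    host = url.split("/")[2]
    return ".".join(host.split(".")[1:])

# Hypothetical records with the two fields the Miller expression uses.
records = [
    {"URL": "http://www.a.com/x", "email": "alice@a.com"},
    {"URL": "http://www.a.com/y", "email": "alice@a.com"},
    {"URL": "http://www.b.fr/x", "email": "bob@b.fr"},
    {"URL": "http://www.b.fr/y", "email": "carol@b.fr"},
]

emails = {}  # domain -> {email: 1}; plays the role of @e
for rec in records:
    # Inserting an existing key again changes nothing, which deduplicates.
    emails.setdefault(domain_of(rec["URL"]), {})[rec["email"]] = 1

# The "end" step: join the keys of each inner map with commas.
summary = {domain: ",".join(keys) for domain, keys in emails.items()}
for domain, joined in summary.items():
    print(domain, joined, sep="\t")
```

Python dictionaries keep insertion order, so the emails appear in the order they were first seen.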

Answer 4

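This problem can be solved by building a dictionary keyed on the third field (actually a part of it), whose corresponding value is a set into which the fourth field is thrown. The usefulness of a set is that it keeps its elements unique by nature, so we don't have to spend any programming effort on keeping the values unique.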

python3 -c 'import sys
ifile = sys.argv[1]
fs = ofs = "\t"
d = {}

with open(ifile) as fh:
  for i,l in enumerate(fh,1):
    if i < 3: continue
    x,x,y,email,x = l.split(fs)
    domain = y.split("/")[2].split(".",1)[1]
    if domain in d:
      d[domain].add(email)
    else:
      d[domain] = { email }

print(f"domain{ofs}email(s)",
      *[k+ofs+", ".join(v) for k,v in d.items()],
      sep="\n")
' file
domain  email(s)
a.com   [email protected]
b.fr    [email protected], [email protected]
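A possible variation on the same dictionary-of-sets idea, offered as a sketch rather than a replacement: collections.defaultdict removes the need for the if/else branch, and sorting each set makes the output order repeatable, since the iteration order of a set of strings can differ between runs. The function name emails_by_domain, the column names, and the sample lines are assumptions for illustration; the five tab-separated fields and the two skipped lines follow the one-liner above.

```python
from collections import defaultdict

def emails_by_domain(lines, fs="\t", skip=2):
    """Map each domain (host minus its first label) to a sorted list
    of the unique emails seen for it, ignoring the first `skip` lines."""
    seen = defaultdict(set)
    for lineno, line in enumerate(lines, 1):
        if lineno <= skip:
            continue
        fields = line.rstrip("\n").split(fs)
        url, email = fields[2], fields[3]
        domain = url.split("/")[2].split(".", 1)[1]
        seen[domain].add(email)  # a set ignores repeated emails
    return {domain: sorted(emails) for domain, emails in seen.items()}

# Hypothetical input: a header, a blank line, then five-field records.
sample = [
    "id\tname\tURL\temail\tnote\n",
    "\n",
    "1\tA\thttp://www.a.com/x\talice@a.com\tn1\n",
    "2\tA\thttp://www.a.com/y\talice@a.com\tn2\n",
    "3\tB\thttp://www.b.fr/x\tcarol@b.fr\tn3\n",
    "4\tB\thttp://www.b.fr/y\tbob@b.fr\tn4\n",
]

result = emails_by_domain(sample)
print("domain", "email(s)", sep="\t")
for domain, emails in result.items():
    print(domain, ", ".join(emails), sep="\t")
```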
