BASH script hangs after some processing on Ubuntu

Answer 1

A pure sed solution:

sed -r 's/^[^|]+\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|.+\( .+, ([^ ]+).+/\2:\3,\1/' <in.dat >out.dat
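
To make the three capture groups easier to follow, here is a minimal Python sketch that applies the same pattern with re.sub. The sample line is hypothetical (the original input data is not reproduced on this page); it only follows the shape the pattern expects: pipe-separated fields, with the last field holding a parenthesised, comma-separated pair.

```python
import re

# Same pattern as in the sed command; Python's re module accepts it unchanged.
PATTERN = re.compile(
    r'^[^|]+\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|.+\( .+, ([^ ]+).+'
)


def reformat_line(line):
    """Rewrite one record as <field 5>:<token after the comma>,<field 3>."""
    return PATTERN.sub(r'\2:\3,\1', line)


# Hypothetical input line, invented for illustration only.
sample = 'a|b|12-654-0330|d|202-00_MSRFKH00OL6|Some ( abc, R1.S1.LT7.PON8.ONT81.SERV1 )|'
print(reformat_line(sample))
```

As with sed, a line that does not match the pattern is passed through unchanged.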

Answer 2

doit() {
  # Hattip to @sudodus
  tr ' ' '|' |
    tr -s '|' '|' |
    cut -d '|' -f 3,5,9 
}
export -f doit
parallel -k --pipepart --block -1 -a input.txt doit > output.txt

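-k keeps the order, so the first/last line of the input is also the first/last line of the output
--pipepart splits the file on the fly
--block -1 splits the file into 1 block per CPU thread
-a input.txt is the file to split
doit is the command (or bash function) to call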
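
For readers unfamiliar with the tr/cut pipeline inside doit, the following Python sketch mimics what it does to a single line: turn spaces into pipes, squeeze runs of pipes, and keep fields 3, 5 and 9. It only illustrates the per-line logic and is not a replacement for parallel; the function name doit_line and the sample line are hypothetical.

```python
import re


def doit_line(line, fields=(3, 5, 9)):
    """Mimic tr ' ' '|' | tr -s '|' '|' | cut -d '|' -f 3,5,9 for one line."""
    piped = line.replace(' ', '|')          # tr ' ' '|'
    squeezed = re.sub(r'\|+', '|', piped)   # tr -s '|' '|'
    if '|' not in squeezed:
        # cut prints lines without any delimiter unchanged
        return squeezed
    parts = squeezed.split('|')
    # cut uses 1-based field numbers and skips fields that do not exist
    return '|'.join(parts[i - 1] for i in fields if i <= len(parts))


# Hypothetical input line, invented for illustration only.
sample = 'a|b|12-654-0330|d|202-00_MSRFKH00OL6|Some ( abc, R1.S1.LT7.PON8.ONT81.SERV1 )|'
print(doit_line(sample))
```

Note that this pipeline emits the three fields pipe-separated and in input order, so its output format differs from the field2:field3,field1 layout produced by the regex-based answers.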

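Speed-wise, on my system the parallel version (yellow in the original plot of seconds vs. MB) is faster than the plain tr version (black) from around 200 MB of input.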

Answer 3

Perl solution

This script doesn't do anything in parallel, but it is quite fast anyway. Save it as filter.pl (or whatever name you prefer) and make it executable.

#!/usr/bin/env perl

use strict;
use warnings;

while( <> ) {
    if ( /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/ ) {
        print "$2:$3,$1\n";
    }
}

I replicated your sample data until I had 1,572,864 lines, then ran it as follows:

me@ubuntu:~> time ./filter.pl < input.txt > output.txt
real    0m3,603s
user    0m3,487s
sys     0m0,100s

me@ubuntu:~> tail -3 output.txt
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.SERV1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT81.C1.P1,12-654-0330
202-00_MSRFKH00OL6:R1.S1.LT7.PON8.ONT75.SERV1,12-664-1186
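
The answer does not say how the sample was replicated. The line count 1,572,864 equals 3 × 2^19, which would be consistent with repeatedly doubling a three-line sample, but that is an assumption. A minimal Python sketch of such a doubling step, with a hypothetical helper name and placeholder lines:

```python
def replicate(lines, min_count):
    """Double the list of lines until it holds at least min_count entries."""
    data = list(lines)
    if not data:
        raise ValueError('need at least one line to replicate')
    while len(data) < min_count:
        data = data + data
    return data


# Placeholder lines standing in for a three-line sample (hypothetical).
seed = ['line one\n', 'line two\n', 'line three\n']
big = replicate(seed, 1572864)
print(len(big))
```

Writing the result out with open('input.txt', 'w').writelines(big) would give a test file of the size used in the timing above.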

If you prefer a one-liner, do:

perl -lne 'print "$2:$3,$1" if /^(?:[^|]+\|){2}([^|]+)\|[^|]+\|([^|]+)\|[^,]+,\s*(\S+)/;' < input.txt > output.txt

Answer 4

Python

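import re
import sys

# Capture field 3, field 5, and the last token inside the trailing "( ..., token )|"
pattern = re.compile(r'^.+\|.+\|(.+)\|.+\|(.+)\|.+, (.+) \)\|$')

for line in sys.stdin:
    match = pattern.match(line)
    if match:
        print(match.group(2) + ':' + match.group(3) + ',' + match.group(1))

(Works with both Python 2 and Python 3.)

Using a regex with non-greedy matching gives a 4x speedup (avoiding backtracking?) and puts Python on par with the cut/sed approaches (Python 2 is a bit faster than Python 3):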

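import re
import sys

# Same captures as above, but each field is matched with a non-greedy negated character class
pattern = re.compile(r'^[^|]+?\|[^|]+?\|([^|]+?)\|[^|]+?\|([^|]+?)\|[^,]+?, (.+) \)\|$')

for line in sys.stdin:
    match = pattern.match(line)
    if match:
        print(match.group(2) + ':' + match.group(3) + ',' + match.group(1))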
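
To check the speed claim on your own machine, a small timeit harness along the following lines may help. It is a sketch under assumptions: the sample line is hypothetical, and the helper names extract and time_pattern are invented here. Note that the second pattern not only uses non-greedy quantifiers but also replaces . with negated character classes such as [^|], which stop at the next delimiter instead of running to the end of the line and backtracking; that is a likely source of the speedup.

```python
import re
import timeit

GREEDY = re.compile(r'^.+\|.+\|(.+)\|.+\|(.+)\|.+, (.+) \)\|$')
NEGATED = re.compile(
    r'^[^|]+?\|[^|]+?\|([^|]+?)\|[^|]+?\|([^|]+?)\|[^,]+?, (.+) \)\|$'
)

# Hypothetical input line, invented for illustration only.
SAMPLE = 'a|b|12-654-0330|d|202-00_MSRFKH00OL6|Some ( abc, R1.S1.LT7.PON8.ONT81.SERV1 )|\n'


def extract(pattern, line):
    """Return the captured groups, or None if the line does not match."""
    match = pattern.match(line)
    return match.groups() if match else None


def time_pattern(pattern, line, number=100000):
    """Return the seconds needed to match the line `number` times."""
    return timeit.timeit(lambda: pattern.match(line), number=number)


if __name__ == '__main__':
    print('same result:', extract(GREEDY, SAMPLE) == extract(NEGATED, SAMPLE))
    print('greedy :', time_pattern(GREEDY, SAMPLE))
    print('negated:', time_pattern(NEGATED, SAMPLE))
```

Timings depend on the Python version and on the shape of the real data, so treat the numbers as indicative only.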
