How to filter a pair of curly braces

Answer 1

This uses GNU awk; doing it in POSIX awk would be a pain (it lacks gensub, which I use more than once).

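#!/usr/bin/gawk -f
# (adjust the path above if gawk is installed elsewhere)

# Join array[0..n-1] with commas; result, i and end are locals.
function join(array,    result, i, end)
{
    result = array[0];
    end = length(array) - 1;
    for (i = 1; i <= end; i++)
        result = result "," array[i];
    return result;
}

# Append elem to a 0-indexed array.
function push(arr, elem)
{
    arr[length(arr)] = elem;
}

# split("", arr) is a horribly unreadable way to clear an array
BEGIN { split("", arr); }

/{part}|{chapter}/ {
    # Rewrite the line as "type,page,title".
    l = gensub(".*{(.+)}{(.+)}{([0-9]+)}$", "\\1,\\3,\\2", "g");
    # awk strings are 1-indexed, so the first four characters start at 1.
    if ("part" == substr(l, 1, 4)) {
        if (length(arr) > 0) { print join(arr); }
        split("", arr);
        push(arr, gensub("^(.*),(.*),(.*)$", "\\2,\\3","g", l));
    } else {
        push(arr, gensub("^(.*),(.*),(.*)$", "\\3","g", l));
    }
}

# Flush the last part, if any.
END { if (length(arr) > 0) print join(arr); }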

This exploits the fact that regular expressions are greedy, so each match consumes the whole line. It took more effort than I first expected.
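To see what the greedy match captures, here is a minimal Python sketch of the same pattern. It is an illustration only, not part of the answer: the names PATTERN and split_fields are made up here, and Python's backtracking re engine may not resolve ambiguous matches exactly as gawk does. On a plain line the three groups come out as type, title and page. On a title that itself contains "}{" (such as \frac{a}{b}), the greedy leading .* pushes the groups to the right and the fields come out wrong, so the approach suits titles without nested brace pairs.

```python
import re

# The gawk pattern, with braces escaped for Python's re module.
PATTERN = re.compile(r".*\{(.+)\}\{(.+)\}\{([0-9]+)\}$")


def split_fields(line):
    """Return (kind, title, page), or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None


simple = r"\contentsline {chapter}{\numberline {}Person name here}{5}"
nested = r"\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}"

# Plain line: the groups are the type, the title and the page.
print(split_fields(simple))
# Nested braces: the greedy prefix swallows "{part}{Title ...", so the
# first group is "a" rather than "part".
print(split_fields(nested))
```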

Given the following input:

\contentsline {part}{Some title here\hfil }{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {part}{Some title here\hfil }{7}
\contentsline {chapter}{\numberline {}Person name here}{7}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{7}
blah blah
\contentsline {part}{Some title here\hfil }{9}
blah blah
blah blah
\contentsline {chapter}{\numberline {}Person name here}{9}

cat input | awk -f the_above_script.awk produces:

5,Some title here\hfil ,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here,\numberline {}Person name here
7,Some title here\hfil ,\numberline {}Person name here,\numberline {}Person name here
9,Some title here\hfil ,\numberline {}Person name here

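The page number is taken from the {part} line; every {chapter} that follows is appended to that {part}'s row. This allows a part of the book to contain several chapters.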

Answer 2

With Perl's Text::Balanced module, the contents of the top-level {} groups can be extracted like this:

#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# this will of course fail if the input is on multiple lines, as this
# is only a line-by-line parser of standard input or the filenames
# passed to this script
while ( my $line = readline ) {
    if ( $line =~ m/\\contentsline / ) {
        my @parts = extract_contents($line);
        # emit as CSV (though ideally instead use Text::CSV module)
        print join( ",", @parts ), "\n";
    } else {
        #print "NO MATCH ON $line";
    }
}

sub extract_contents {
    my $line = shift;
    my @parts;
    # while we can get a {} bit out of the input line, anywhere in the
    # input line
    while ( my $part = extract_bracketed( $line, '{}', qr/[^{]*/ ) ) {
        # trim off the delimiters
        $part = substr $part, 1, length($part) - 2;
        push @parts, $part;
    }
    return @parts;
}

With some input:

% < input 
not content line
\contentsline {chapter}{\numberline {}Person name here}{5}
\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}
also not content line
% perl parser input
chapter,\numberline {}Person name here,5
part,Title with math $\frac{a}{b}$\hfil ,15
%
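For readers without Perl at hand, the idea behind extracting top-level {} groups can be sketched by counting brace depth. This is a rough Python illustration, not a port of Text::Balanced: the function name top_level_groups is hypothetical, escaped braces such as \{ are not treated specially, and an unclosed group is silently dropped.

```python
def top_level_groups(line):
    """Return the contents of each top-level {...} group in line."""
    groups = []
    depth = 0      # current nesting level
    start = None   # index just after the opening brace of the current group
    for i, ch in enumerate(line):
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                # Closing brace of a top-level group: keep its contents.
                groups.append(line[start:i])
    return groups


line = r"\contentsline {part}{Title with math $\frac{a}{b}$\hfil }{15}"
print(",".join(top_level_groups(line)))
```

Because only depth-zero braces delimit a field, the nested braces of \frac{a}{b} stay inside the title field.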

Answer 3

In TXR:

@(repeat)
\contentsline {part}{@title\hfil }{@page}
@  (trailer)
@  (skip)
\contentsline {chapter}{\numberline {}@author}{@page}
@  (do (put-line `@title,@author,@page`))
@(end)

Sample data:

\lorem{ipsum}
\contentsline {part}{The Art of The Meringue\hfil }{5}
a
b
c
j
\contentsline {chapter}{\numberline {}Doug LeMonjello}{5}


\contentsline {part}{Parachuting Primer\hfil }{16}

\contentsline {chapter}{\numberline {}Hugo Phirst}{16}

\contentsline {part}{Making Sense of $\frac{a}{b}$\hfil }{19}

\contentsline {part}{War and Peace\hfil }{27}

\contentsline {chapter}{\numberline {}D. Vide}{19}

\contentsline {part}{War and Peace\hfil }{19}

Run:

$ txr title-auth.txr data
The Art of The Meringue,Doug LeMonjello,5
Parachuting Primer,Hugo Phirst,16
Making Sense of $\frac{a}{b}$,D. Vide,19

Notes:

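Because @(trailer) is used, the author lines do not have to strictly follow their part. The data may introduce several \contentsline {part} elements, followed by chapter lines whose page numbers match them.
@(skip) means the entire remainder of the data is searched. Performance can be improved by limiting the range with a numeric argument: if it can be assumed that the matching {chapter} is always found within 50 lines after the {part}, we can use @(skip 50).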
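The matching strategy described in these notes can be approximated in Python as a look-ahead search. This is only a sketch of the idea, not how TXR works internally: PART, CHAPTER, pair_parts_with_authors and the window parameter are names invented for this illustration, and window only roughly corresponds to the numeric argument of @(skip).

```python
import re

PART = re.compile(r"^\\contentsline \{part\}\{(.*?)\\hfil \}\{([^{}]*)\}$")
CHAPTER = re.compile(
    r"^\\contentsline \{chapter\}\{\\numberline \{\}(.*?)\}\{([^{}]*)\}$"
)


def pair_parts_with_authors(lines, window=None):
    """Return "title,author,page" rows for parts with a later matching chapter.

    window=None searches all remaining lines, like a bare @(skip);
    a number limits how many following lines are examined.
    """
    rows = []
    for i, line in enumerate(lines):
        part = PART.match(line)
        if not part:
            continue
        title, page = part.groups()
        end = None if window is None else i + 1 + window
        # Look ahead without consuming input, so later parts are still seen.
        for later in lines[i + 1:end]:
            chapter = CHAPTER.match(later)
            if chapter and chapter.group(2) == page:
                rows.append(f"{title},{chapter.group(1)},{page}")
                break
    return rows


sample = [
    r"\contentsline {part}{Parachuting Primer\hfil }{16}",
    "",
    r"\contentsline {chapter}{\numberline {}Hugo Phirst}{16}",
]
print("\n".join(pair_parts_with_authors(sample)))
```

A part whose page number never reappears on a later chapter line produces no row, which is why the two War and Peace entries are absent from the output above.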
