How do I find the character positions of a string in a file?

Answer

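In current versions of Perl, you can use the magic arrays @- and @+ to get the positions of the match for the whole regular expression and for any capture groups (@- holds the start offsets, @+ the end offsets). Element 0 of both arrays holds the offsets for the whole matched substring, so $-[0] is the one you are interested in.

As a one-liner: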

$ echo 'aöæaæaæa' | perl -CSDLA -ne 'BEGIN { $pattern = shift }; printf "%d\n", $-[0] while $_ =~ m/$pattern/g;'  æa
2
4
6

Or as a full script:

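#!/usr/bin/perl

use strict;
use warnings;
use utf8;                      # the script source itself is UTF-8
use Encode;                    # provides decode_utf8
use open  ":encoding(utf8)";   # default I/O layer for files opened by the script
undef $/;                      # slurp mode: read all input as one record
my $pattern = decode_utf8(shift);  # decode the pattern given as first argument
binmode STDIN, ":utf8";        # decode standard input as UTF-8
while (<STDIN>) {
    # Print the 0-based character offset where each match starts.
    printf "%d\n", $-[0] while $_ =~ m/$pattern/g;
}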

For example:

$ echo 'aöæaæaæa' | perl match.pl æa -
2
4
6

(The latter script only works on standard input. I could not seem to force Perl to treat all files as UTF-8.)
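
For comparison, here is a minimal Python sketch of the same idea that can also read named files, which sidesteps the standard-input limitation. It is not the Perl approach from this answer: it assumes every input is UTF-8 and uses Python's `re.finditer`, whose `m.start()` and `m.end()` play roughly the role of `$-[0]` and `$+[0]`. The function names `match_offsets` and `file_match_offsets` are made up for illustration, and Python's regex syntax differs from Perl's in some details.

```python
import re


def match_offsets(text, pattern):
    """Return (start, end) character offsets of each non-overlapping match."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


def file_match_offsets(path, pattern):
    """Read a whole file as UTF-8 (an assumption) and return its match offsets."""
    with open(path, encoding="utf-8") as handle:
        return match_offsets(handle.read(), pattern)


# Same sample string as above; offsets are 0-based and count characters.
for start, end in match_offsets("aöæaæaæa", "æa"):
    print(start, end)
```

The start offsets printed for the sample string are 2, 4 and 6, matching the Perl output above.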

Answer

With zsh:

set -o extendedglob # for (#m) which in patterns causes the matched portion to be
                    # made available in $MATCH and the offset (1-based) in $MBEGIN
                    # (and causes the expansion of the replacement in
                    # ${var//pattern/replacement} to be deferred to the
                    # time of replacement)

haystack=aöæaæaæa
needle=æ

offsets=() i=0
: ${haystack//(#m)$needle/$((offsets[++i] = MBEGIN - 1))}
print -l $offsets
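
The substitution above amounts to scanning left to right and recording the 0-based offset of each non-overlapping match. A rough Python sketch of that scan follows. It handles literal text only, and the name `literal_offsets` is made up for illustration.

```python
def literal_offsets(haystack, needle):
    """Return 0-based character offsets of non-overlapping occurrences."""
    if not needle:
        raise ValueError("needle must not be empty")
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Resume after the match so occurrences do not overlap.
        start = haystack.find(needle, start + len(needle))
    return offsets


# Same haystack and needle as the zsh example.
print(*literal_offsets("aöæaæaæa", "æ"), sep="\n")
```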

Answer

With GNU awk or any other POSIX-compliant awk implementation (not mawk), and with the locale set correctly:

$ LANG='en_US.UTF-8' gawk -v pat='æa' -- '
{
    s = $0;
    pos = 0;
    while (match(s, pat)) {
        pos += RSTART-1;
        print "file", FILENAME ": line", FNR, "position", pos, "matched", substr(s, RSTART, RLENGTH);
        pos += RLENGTH;
        s = substr(s, RSTART+RLENGTH);
    }
}
' <<<'aöæaæaæa'
file -: line 1 position 2 matched æa
file -: line 1 position 4 matched æa
file -: line 1 position 6 matched æa
$

The pattern passed to gawk in the -v pat argument can be any valid regular expression.
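
The locale matters because the reported position counts characters, not bytes. As a rough illustration in Python rather than awk (`char_and_byte_offsets` is a made-up helper), the sketch below contrasts the two for the same sample line encoded as UTF-8. A tool that works on bytes would presumably report the byte column instead.

```python
import re


def char_and_byte_offsets(line, pattern):
    """Yield (char_offset, byte_offset, matched_text) for each match in line."""
    for m in re.finditer(pattern, line):
        # The byte offset is the UTF-8 length of everything before the match.
        byte_offset = len(line[:m.start()].encode("utf-8"))
        yield m.start(), byte_offset, m.group()


for char_pos, byte_pos, text in char_and_byte_offsets("aöæaæaæa", "æa"):
    print("char", char_pos, "byte", byte_pos, "matched", text)
```

For the sample line the character offsets are 2, 4 and 6, while the byte offsets are 3, 6 and 9, because ö and æ each take two bytes in UTF-8.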
