Trying to use grep to remove all IDs from an HTML file

Question 1

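Although it goes against my better judgement, I'll post this anyway (the sed part).

That said: if this is for a quick and dirty fix, go ahead. If it is anything more serious, or something you will do often, use something else, such as Python or Perl, where you rely on a module built to handle HTML documents rather than on regular expressions.

One of the simpler ways is to use, for example, sed.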

sed 's/\(<[^>]*\) \+id="[^"]*"\([^>]*>\)/\1\2/' sample.html > noid.html

Explanation:

            +--------------------------------- Match group 1
            |                      +---------- Match group 2
         ___|___                ___|___
        |       |              |       |  
sed 's/\(<[^>]*\) \+id="[^"]*"\([^>]*>\)/\1\2/' sample.html > noid.html
     |   |  | |   |  |    | ||    |  |      |
     |   |  | |   |  |    | ||    |  |      +- \1\2  Subst. with group 1 and 2
     |   |  | |   |  |    | ||    |  +-------- >     Closing bracket
     |   |  | |   |  |    | ||    +----------- [^>]* Same as below
     |   |  | |   |  |    | |+---------------- "     Followed by "
     |   |  | |   |  |    | +----------------- *     Zero or more times
     |   |  | |   |  |    +------------------- [^"]  Not double-quote
     |   |  | |   |  +------------------------ id="  Literal string
     |   |  | |   +---------------------------  \+   Space 1 or more times
     |   |  | +------------------------------- *     Zero or more times 
     |   |  +--------------------------------- [^>]  Not closing bracket
     |   +------------------------------------ <     Opening bracket
     +---------------------------------------- s     Substitute

Use sed -i to edit the file in place. (You may come to regret it, and it cannot be undone.)
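
To see what the substitution does without touching a file, the pattern can be restated in Python's re syntax. This is only a sketch: the function names are made up for illustration, and the BRE forms \( \) and \+ are written as ( ) and + because Python uses a different regex dialect. It also shows one limitation of the command as written: without a trailing g flag, sed replaces only the first match on each line.

```python
import re

# Python spelling of the sed pattern: BRE \( \) and \+ become ( ) and +.
ID_ATTR = re.compile(r'(<[^>]*) +id="[^"]*"([^>]*>)')


def strip_first_id(line):
    """Mimic the command as written: at most one substitution per line."""
    return ID_ATTR.sub(r"\1\2", line, count=1)


def strip_all_ids(line):
    """Mimic the same command with a trailing g flag: replace every match."""
    return ID_ATTR.sub(r"\1\2", line)


if __name__ == "__main__":
    sample = '<div class="x" id="main"><span id="s">t</span></div>'
    print(strip_first_id(sample))
    print(strip_all_ids(sample))
```

Note that the pattern only recognizes double-quoted values preceded by a space, so id='a' or id=a is left alone.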

A better approach, using Perl:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
use HTML::Entities;
use utf8;

die "$0 [file]\n" unless defined $ARGV[0];

my $parser = HTML::TokeParser::Simple->new(file => $ARGV[0]);

if (!$parser) {
    die "No HTML file found.\n";
}

while (my $token = $parser->get_token) {
    $token->delete_attr('id');
    print $token->as_is;
}
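
The same idea can be sketched in Python, which was mentioned above as another option. This is a minimal sketch using only the standard library's html.parser, not a drop-in equivalent of the Perl script: the class name IdStripper and the helper strip_ids are hypothetical, tags that carry an id are rebuilt (so their names come out lower-cased and their attribute quoting is normalized), and everything else is echoed back as it was parsed.

```python
from html import escape
from html.parser import HTMLParser


class IdStripper(HTMLParser):
    """Re-emit an HTML document, dropping every id attribute."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; separately,
        # so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closer):
        if not any(name == "id" for name, _ in attrs):
            # No id attribute: echo the original tag text untouched.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if name == "id":
                continue  # this is the attribute being removed
            if value is None:
                parts.append(name)  # bare attribute, e.g. "disabled"
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_ids(html_text):
    """Return html_text with all id attributes removed."""
    parser = IdStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling strip_ids on the contents of sample.html and writing the result to a new file would mirror the sed command's redirect to noid.html.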

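Your grep command will not match anything. But because you use the invert option, -v, it prints every line that does not match, and therefore the entire file.

grep is not an in-place file modifier; it is normally a tool for finding things in files. Try, for example: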

grep -o '\(<[^>]*\)id="[^"]*"[^>]*>' sample.html

-o means print only the matched part (not the whole line).
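
As a rough illustration of what -o does, the sketch below applies the same pattern line by line in Python and collects only the matched pieces. The names TAG_WITH_ID and grep_o are invented for this example, and the pattern is rewritten in Python's regex dialect.

```python
import re

# Same pattern as the grep command, in Python syntax. The group is kept
# only to mirror the original; it is not needed for matching.
TAG_WITH_ID = re.compile(r'(<[^>]*)id="[^"]*"[^>]*>')


def grep_o(text):
    """Return each matched piece, line by line, as grep -o would print it."""
    hits = []
    for line in text.splitlines():
        hits.extend(m.group(0) for m in TAG_WITH_ID.finditer(line))
    return hits


if __name__ == "__main__":
    page = '<p>no id</p>\n<div id="a">x</div> <b id="b2" class="c">y</b>\n'
    for hit in grep_o(page):
        print(hit)
```

Because this pattern does not require a space before id, it also picks up attributes whose names merely end in id, such as data-id="7".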

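Tools such as sed and awk are what you would normally use to edit streams or files, as in the sed example above.

Your grep pattern also reflects a few misconceptions: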

 id\="[a-zA-Z][0-9]"

will match exactly:

id=
one character in the range a-z or A-Z
followed by one digit

In other words, it will match:

id="a0"
id="a1"
id="a2"
...
id="Z9"

It will not match anything like id="foo99" or id="blah-gah".
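
A quick check in Python makes the difference visible. This is a sketch with made-up sample strings; the backslash before = is dropped because = needs no escaping, and BROAD is the looser id="[^"]*" form used in the sed command.

```python
import re

NARROW = re.compile(r'id="[a-zA-Z][0-9]"')  # one letter, then one digit
BROAD = re.compile(r'id="[^"]*"')           # any double-quoted value

samples = ['id="a0"', 'id="Z9"', 'id="foo99"', 'id="blah-gah"']

# Keep the samples each pattern finds a match in.
narrow_hits = [s for s in samples if NARROW.search(s)]
broad_hits = [s for s in samples if BROAD.search(s)]

if __name__ == "__main__":
    print("narrow:", narrow_hits)
    print("broad: ", broad_hits)
```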

In addition, your pattern matches:

 ^ <-- start of line (As it is first in pattern or group)
 $ <-- end of line   (As you use the `-E` option)
 # Else it would be:
 ^ <-- start of line (As it is first in pattern or group)
 $ <-- dollar sign   (Does not mean end of line unless it is at end of
                      pattern or group)

Hence, nothing matches.
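
The effect of the anchors, and of inverting the match, can be sketched in Python. The original grep command is not reproduced here, so the anchored pattern below is an assumption made purely for illustration, as are the names ANCHORED and grep_lines. Python always treats ^ and $ as anchors, which corresponds to the -E case described above.

```python
import re

# Hypothetical anchored pattern, for illustration only: the whole line
# would have to consist of nothing but the id attribute.
ANCHORED = re.compile(r'^id="[a-zA-Z][0-9]"$')


def grep_lines(lines, invert=False):
    """Select lines like grep does; invert=True plays the role of -v."""
    return [ln for ln in lines if bool(ANCHORED.search(ln)) != invert]


html_lines = ['<div id="a1">', '  <p id="foo99">text</p>', '</div>']

if __name__ == "__main__":
    print(grep_lines(html_lines))               # lines that match
    print(grep_lines(html_lines, invert=True))  # lines that do not match
```

Since no line of ordinary HTML consists solely of an id attribute, the plain selection comes back empty and the inverted one returns every line, which is the "prints the entire file" behavior described earlier.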

Answer