Matching a number with a fixed number of digits when scraping web page content

2024-5-27

I am trying to parse the source of some web pages and find every href similar to this one:

href='http://example.org/index.php?showtopic=509480

where the number after showtopic= is random (and always has exactly 6 digits, e.g. 123456 - 654321)

while read -r line
do
    source=$(curl -L line) #is this the right way to parse the source?
    grep "href='http://example.org/index.php?showtopic=" >> output.txt 
done <file.txt #file contains a list of web pages

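Since I don't know the number in advance, how can I grab all the matching lines? Perhaps with a second grep that uses a regular expression? I was thinking of using a range in awk, something like this: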

awk "'/href='http://example.org/index.php?showtopic=/,/^\s/'" >> file.txt

or a double grep such as:

grep "href='http://example.org/index.php?showtopic=" | grep -e ^[0-9]{1,6}$ >> output.txt

Answer 1

cat input.txt |grep "href='http://example.org/index.php?showtopic=" > output.txt

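cat outputs the contents of the file, which are piped to grep. grep compares the input against the pattern line by line and writes every matching line, in full, to the output file.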
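
The command above matches any line that contains the fixed prefix, whatever follows it. To require exactly six digits after showtopic=, one option is an extended regular expression with grep -E. The following is a minimal sketch, not part of the original answer; the function name match_topic_lines and the sample input are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print every line containing an href whose showtopic=
# value is exactly six digits. Reads the files given as arguments, or stdin.
# -E enables extended regular expressions, which provide the {6} quantifier.
# The dots and the question mark are escaped so that they match literally.
# ([^0-9]|$) requires a non-digit or end of line after the six digits,
# so that a seven-digit number is not accepted.
match_topic_lines() {
    grep -E "href='http://example\.org/index\.php\?showtopic=[0-9]{6}([^0-9]|\$)" "$@"
}

# Usage with made-up sample input: only the first line has exactly six digits.
printf '%s\n' \
    "<a href='http://example.org/index.php?showtopic=509480'>six digits</a>" \
    "<a href='http://example.org/index.php?showtopic=1234567'>seven digits</a>" |
    match_topic_lines
```

This also shows why the double grep from the question finds nothing: the second pattern ^[0-9]{1,6}$ only matches lines consisting solely of digits, and the {1,6} interval needs -E (or backslash-escaped braces) to be treated as a quantifier.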

Alternatively, you can use sed:

 sed -n "\#href='http://example.org/index.php?showtopic=#p"  input.txt >  output-sed.txt
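
Tying this back to the loop in the question: there, curl -L line passes the literal word line instead of the variable's value, and grep is never given any input to read. A minimal sketch of one possible fix follows, assuming curl is installed and that the list file holds one URL per line; the function name collect_topic_lines is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read URLs (one per line) from the file named in $1,
# download each page, and append the matching lines to the file named in $2.
collect_topic_lines() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines in the list
        # "$url" expands the variable; -s hides the progress meter, -L follows redirects.
        # grep exits non-zero when a page has no match; "|| :" ignores that
        # so one page without matches does not abort a script run with set -e.
        curl -sL "$url" |
            grep -E "href='http://example\.org/index\.php\?showtopic=[0-9]{6}([^0-9]|\$)" \
            >> "$2" || :
    done < "$1"
}

# Example call, using the file names from the question:
#   collect_topic_lines file.txt output.txt
```

Piping curl straight into grep avoids holding the whole page in a shell variable, which is why the source=$(...) assignment from the question is dropped here.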
