Curl a text file of URLs, but grep each URL separately from a single file

Question 1

This can be done in two lines:

sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf|xargs -I {} curl -O "{}"
sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf|xargs -I {} basename "{}"|xargs -I {} sed '/mortgage/q' "{}"

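The first sed command on each line extracts the URLs from the URL file (/tmp/curl.conf in the example). In the first line, curl's -O option saves each page to a file named after the page. In the second line, we go back over each file and show only the text you are interested in. Of course, if the word "mortgage" does not appear in a file, the whole file is output.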

This leaves one temporary file per URL in your current directory.
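To see what the two building blocks do before pointing them at real URLs, here is a small offline sketch. The sample config, the sample page and the temporary paths are made up for illustration, and the `\s` in the pattern assumes GNU sed.

```shell
#!/bin/bash
# Hypothetical sample data, not taken from the original question.
conf=$(mktemp)
page=$(mktemp)
printf '%s\n' 'URL = http://example.com/a.html' 'user-agent = test' \
    'URL = http://example.com/b.html' >"$conf"
printf '%s\n' 'line one' 'current mortgage rates' 'line three' >"$page"

# Step 1: print only the value of each "URL = ..." line (\s needs GNU sed).
urls=$(sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' "$conf")
echo "$urls"

# Step 2: print a file up to and including the first line containing "mortgage".
head_part=$(sed '/mortgage/q' "$page")
echo "$head_part"

rm -f "$conf" "$page"
```

Note that the pattern is case-sensitive, so a line written as `url = ...` in lowercase would not be picked up by this sed command.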

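Edit:

Here is a short script that avoids leaving any files behind. It writes the results to standard output, and you can redirect them from there as needed: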

#!/bin/bash
TMPF=$(mktemp)
# sed command extracts URLs line by line
sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf >"$TMPF"
while IFS= read -r URL; do
    # retrieve each web page and print it up to and including the first 'mortgage' line (substitute whatever test you like)
    curl "$URL" 2>/dev/null | sed '/mortgage/q'
done <"$TMPF"
rm "$TMPF"
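The temporary file is not strictly required, since the URL list can be piped straight into the loop. The sketch below is one possible variant, not the script from the answer. The function name `fetch_until` and the `FETCH` override are hypothetical names introduced here so the loop can also be tried offline.

```shell
#!/bin/bash
# FETCH is an assumption of this sketch: the command used to retrieve one URL.
FETCH=${FETCH:-curl -s}

# Hypothetical helper: print each page up to the first line matching PATTERN.
# Usage: fetch_until CONF_FILE PATTERN
# PATTERN is pasted into a sed address, so it must not contain a "/".
fetch_until() {
    local conf=$1 pattern=$2 url
    # sed extracts the URLs; the loop reads them one per line, no temp file.
    sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' "$conf" |
    while IFS= read -r url; do
        # $FETCH is left unquoted on purpose so "curl -s" splits into words.
        # </dev/null keeps the fetch command from consuming the URL list.
        $FETCH "$url" </dev/null | sed "/$pattern/q"
    done
}
```

With this in place, `fetch_until /tmp/curl.conf mortgage > results.txt` would behave much like the script above, where `results.txt` is just an example name.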

Answer


Question 2

This general technique still works even if your curl config file contains other options (e.g. user-agent, referer, etc.).

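As a first step, assuming your config file is named curl_config, use awk '/^[Uu][Rr][Ll]/{print;print "output = dummy/"++k;next}1' curl_config > curl_config2 to create a new curl config file that adds a different, incrementing output file name under each url/URL line: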

Example:

[xiaobai@xiaobai curl]$ cat curl_config
URL = "www.google.com"
user-agent = "holeagent/5.0"

url = "m12345.google.com"
user-agent = "holeagent/5.0"

URL = "googlevideo.com"
user-agent = "holeagent/5.0"
[xiaobai@xiaobai curl]$ awk '/^[Uu][Rr][Ll]/{print;print "output = dummy/"++k;next}1' curl_config  > curl_config2 
[xiaobai@xiaobai curl]$ cat curl_config2
URL = "www.google.com"
output = dummy/1
user-agent = "holeagent/5.0"

url = "m12345.google.com"
output = dummy/2
user-agent = "holeagent/5.0"

URL = "googlevideo.com"
output = dummy/3
user-agent = "holeagent/5.0"
[xiaobai@xiaobai curl]$
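The awk one-liner is easier to follow when laid out with comments. The sketch below applies the same program to a made-up two-entry config; the temporary file and the example host names are assumptions for illustration.

```shell
#!/bin/bash
cfg=$(mktemp)   # hypothetical stand-in for curl_config
printf '%s\n' 'URL = "www.example.com"' 'user-agent = "holeagent/5.0"' '' \
    'url = "example.org"' >"$cfg"

numbered=$(awk '
    # Lines starting with url/URL in any letter case: print the line,
    # then add an "output =" line with an incrementing counter k.
    /^[Uu][Rr][Ll]/ { print; print "output = dummy/" ++k; next }
    # Every other line is printed unchanged.
    1
' "$cfg")
echo "$numbered"
rm -f "$cfg"
```

Each URL entry ends up followed by its own `output = dummy/N` line, which is what lets curl write every page to a separate file.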

Then run mkdir dummy to create a directory to hold these temporary files. Start an inotifywait session (replace sed '/google/q' with sed '/mortgage/q'):

[xiaobai@xiaobai curl]$ rm -r dummy; mkdir dummy;
[xiaobai@xiaobai curl]$ rm final 
[xiaobai@xiaobai curl]$ inotifywait -m dummy -e close_write | while read path action file; do echo "[$file]">> final ; sed '/google/q' "$path$file" >> final; echo "$path$file"; rm "$path$file"; done;
Setting up watches.
Watches established.

Open another bash/terminal session, rm the file named final if it exists, then run curl with the curl_config2 file created in the first step above:

[xiaobai@xiaobai curl]$ curl -vLK curl_config2
...processing

Now look at the inotifywait session: it prints each file as soon as it has been closed after writing, runs sed on it, and removes it right after that:

[xiaobai@xiaobai curl]$ inotifywait -m dummy -e close_write | while read path action file; do echo "[$file]">> final ; sed '/google/q' "$path$file" >> final; echo "$path$file"; rm "$path$file"; done;
Setting up watches.
Watches established.
dummy/1
dummy/3

Finally, you can inspect your output in the file named final. The [1] and [3] separators are generated by the echo "[$file]">> final command above.

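The files are removed immediately because I assume your output files are large and there are many URLs still to process, so deleting each one right away saves disk space.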
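For readability, the watcher one-liner can be split into a per-file step and a loop. This is a sketch under assumptions rather than the exact command used above: the function names `collect_file` and `watch_dir` are hypothetical, and `watch_dir` needs inotifywait from inotify-tools.

```shell
#!/bin/bash
# Hypothetical helper: append "[name]" plus the file content up to the first
# line matching PATTERN to OUT, then delete the file to free disk space.
# PATTERN is pasted into a sed address, so it must not contain a "/".
collect_file() {
    local file=$1 pattern=$2 out=$3
    echo "[$(basename "$file")]" >>"$out"
    sed "/$pattern/q" "$file" >>"$out"
    rm -- "$file"
}

# Watch DIR and process every file once its writer has closed it.
# Not called here; it would be started as: watch_dir dummy mortgage final
watch_dir() {
    local dir=$1 pattern=$2 out=$3 path action file
    inotifywait -m "$dir" -e close_write |
    while read -r path action file; do
        collect_file "$path$file" "$pattern" "$out"
        echo "$path$file"
    done
}
```

Keeping the per-file step in its own function makes it possible to try it on a single saved page before starting the watcher.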

Answer