Extracting all PDF links from multiple websites

Answer 1

I would create a separate (text) file that lists all the URLs, one per line:

www.url1
www.url2

Then change the line in the script so that it appends the PDF links it finds to pdflinks.txt (tee -a pdflinks.txt | more instead of tee pdflinks.txt):

lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee -a pdflinks.txt | more
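
For readers who want to see what the grep and awk stages do, here is a minimal Python sketch of the same filter. It assumes the usual `lynx -dump -listonly` output, where each link appears on a numbered line such as `   1. http://example.com/a.pdf`; the function name `extract_pdf_links` is made up for this illustration.

```python
def extract_pdf_links(lynx_output):
    """Return the URLs ending in .pdf from `lynx -dump -listonly` output.

    Assumes each link line looks like "   1. http://example.com/a.pdf".
    """
    links = []
    for line in lynx_output.splitlines():
        fields = line.split()
        # Like awk '{print $2}': the URL is the second whitespace-separated field
        if len(fields) >= 2 and fields[1].endswith(".pdf"):
            links.append(fields[1])
    return links
```

Header lines such as "References" have only one field and are skipped, so only the numbered link lines are considered.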

Then make the script executable and run it from another script (Python in this case):

#!/usr/bin/python3
import subprocess

url_list = "/path/to/url_list.txt"
script = "/path/to/script.sh"

# Read the URLs, one per line, skipping blank lines
with open(url_list) as sourcefile:
    urls = [line.strip() for line in sourcefile if line.strip()]

# Run the extractor script once for each URL
for url in urls:
    subprocess.call([script, url])

Paste the text above into an empty document, fill in the appropriate paths, save it as run_pdflinkextractor.py, and run it with the command:

python3 /path/to/run_pdflinkextractor.py

More options

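You did not actually ask for this, but if you want to download the PDF files you found, it would be a pity to stop halfway. The script below may come in handy for that. The procedure is the same: paste the text below into an empty file, save it as download_pdffiles.py, fill in the path to the pdflinks.txt created in the first step and the path to the folder you want to download the files into, then run it with the command:

python3 /path/to/download_pdffiles.py

The script that actually downloads the files: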

#!/usr/bin/python3

import subprocess

pdf_list = "/path/to/pdflinks.txt"
download_directory = "/path/to/downloadfolder"

# Read the PDF links, one per line, skipping blank lines
with open(pdf_list) as sourcefile:
    links = [line.strip() for line in sourcefile if line.strip()]

# Download each file into the target folder; wget errors are ignored here
for link in links:
    subprocess.call(["wget", "-P", download_directory, link])

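Of course, you can add more options to the script, for example what to do when an error occurs (as written, the script ignores errors). See man wget for more options.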
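
As one possible way to handle errors, the sketch below checks wget's exit status and collects the URLs that failed instead of ignoring them. The function name `download_all` and the `run` parameter (which defaults to `subprocess.call` and exists only to make the function easy to test) are assumptions of this example, not part of the script above.

```python
import subprocess


def download_all(urls, download_directory, run=subprocess.call):
    """Download each URL with wget and return the list of URLs that failed."""
    failed = []
    for url in urls:
        # wget exits with status 0 on success and non-zero on any error
        status = run(["wget", "-P", download_directory, url])
        if status != 0:
            failed.append(url)
    return failed
```

Calling `download_all(links, "/path/to/downloadfolder")` with the list read from pdflinks.txt returns the links that could not be fetched, which you could then print or retry.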

Answer 2

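The answer depends on what you mean by a "list of URLs".

If you want to run it as a command-line script that takes multiple arguments, use something like this: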

#!/bin/bash
# "$@" expands to one word per argument; "$*" would join them all into a single word
for WEBSITE in "$@"
do
    <scriptname> "$WEBSITE"
done

Alternatively, you can read the list of URLs from a file, line by line:

#!/bin/bash
_file="$1"
while IFS= read -r line
do
    <scriptname> "$line"
done < "$_file"
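
Both input styles can also be combined in one place. The following Python sketch merges URLs given as arguments with URLs read line by line; `collect_urls` and its parameters are hypothetical names chosen for this illustration.

```python
def collect_urls(args, lines=()):
    """Merge URLs given as arguments with URLs read line by line.

    `lines` can be any iterable of strings, for example an open file.
    """
    urls = []
    for candidate in list(args) + list(lines):
        candidate = candidate.strip()
        if candidate:  # skip blank lines and empty arguments
            urls.append(candidate)
    return urls
```

In a script you might call `collect_urls(sys.argv[1:])` for the argument style, or pass an open file handle as `lines` for the file style.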

Feel free to improve my answer. I am not a bash guru :)

Answer 3

I don't like hard-coding file names into scripts; I prefer to pass them as arguments. That can be done with a very small modification to Glutanimate's script:

#!/usr/bin/env bash

# NAME:         pdflinkextractor
# AUTHOR:       Glutanimate (http://askubuntu.com/users/81372/), 2013
# LICENSE:      GNU GPL v2
# DEPENDENCIES: wget lynx
# DESCRIPTION:  extracts PDF links from websites and dumps them to the stdout and as a textfile
#               only works for links pointing to files with the ".pdf" extension
#
# USAGE:        pdflinkextractor "www.website.com" > output_file


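# Print the status message to stderr so it does not end up in the redirected output
echo "Getting link list..." >&2

## Go through each URL given and find the PDFs it links to
for website in "$@"; do
    # The dot is escaped so that only a literal ".pdf" suffix matches
    lynx -cache=0 -dump -listonly "$website" | awk '/\.pdf$/{print $2}'
done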

You can save it as downloadpdfs, make it executable (chmod +x downloadpdfs), and then run it with multiple addresses as arguments:

downloadpdfs "http://example.com" "http://example2.com" "http://example3.com" > pdflinks.txt

The above creates a file named pdflinks.txt containing all the links extracted from each of the input URLs.
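
Since several pages may link to the same PDF, the combined list can contain repeats. As an optional follow-up, this sketch removes duplicates while keeping the original order; `unique_links` is a name invented for this example.

```python
def unique_links(links):
    """Return the links with duplicates removed, preserving first-seen order."""
    seen = set()
    result = []
    for link in links:
        link = link.strip()
        if link and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

Passing the lines of pdflinks.txt through `unique_links` before downloading avoids fetching the same file twice.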
