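This is a little off-topic, but I hope you can help me. I have found a website full of articles I need, but they are mixed in with a lot of useless files (mainly JPGs).
I would like to know whether there is a way to find (not download) all the PDFs on the server in order to make a list of links. Basically, I just want to filter out everything that is not a PDF, to get a better view of what to download and what not.
Answer 1
Overview
OK, here you go. This is a programmatic solution in the form of a script: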
#!/bin/bash
# NAME: pdflinkextractor
# AUTHOR: Glutanimate (http://askubuntu.com/users/81372/), 2013
# LICENSE: GNU GPL v2
# DEPENDENCIES: wget lynx
# DESCRIPTION: extracts PDF links from websites and dumps them to the stdout and as a textfile
# only works for links pointing to files with the ".pdf" extension
#
# USAGE: pdflinkextractor "www.website.com"
WEBSITE="$1"
echo "Getting link list..."
lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee pdflinks.txt
# OPTIONAL
#
# DOWNLOAD PDF FILES
#
#echo "Downloading..."
#wget -P pdflinkextractor_files/ -i pdflinks.txt
Installation
You will need to have wget and lynx installed:
sudo apt-get install wget lynx
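Usage
The script fetches a list of all links to .pdf files on the given page and writes it both to the command-line output and to a text file (pdflinks.txt) in the working directory. If you uncomment the "optional" wget command, the script will go on to download all the files into a new directory.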
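The filtering stage of the script only matches links that end in a lowercase ".pdf". As a minimal sketch, assuming you also want links such as "file.PDF" or "file.pdf?download=1" and no duplicate entries, that stage could be wrapped in a function like the one below. The name filter_pdf_links is hypothetical and not part of the original script; note that sort -u also reorders the list alphabetically.

```shell
#!/bin/bash
# Sketch of a more tolerant replacement for the grep/awk stage.
# filter_pdf_links is a hypothetical helper, not part of the original script.
filter_pdf_links() {
    # lynx -listonly prints lines such as "   1. http://host/file.pdf",
    # so the URL is the second whitespace-separated field.
    awk '{print $2}' \
        | grep -Ei '\.pdf([?#].*)?$' \
        | sort -u
    # grep -E: extended regex, -i: ignore case (.pdf, .PDF, ...);
    # ([?#].*)? allows an optional query string or fragment after the extension.
    # sort -u removes duplicates (and sorts the output).
}

# Usage inside the script (assumes lynx is installed):
# lynx -cache=0 -dump -listonly "$WEBSITE" | filter_pdf_links | tee pdflinks.txt
```

Because the function only reads standard input, it can be tried on a saved link list without contacting the server again.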
Example
$ ./pdflinkextractor http://www.pdfscripting.com/public/Free-Sample-PDF-Files-with-scripts.cfm
Getting link list...
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JSPopupCalendar.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ModifySubmit_Example.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/DynamicEmail_XFAForm_V2.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcquireMenuItemNames.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/BouncingButton.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JavaScriptClock.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/Matrix2DOperations.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/RobotArm_3Ddemo2.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/SimpleFormCalculations.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/TheFlyv3_EN4Rdr.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ImExportAttachSample.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcroForm_BasicToggle.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcroForm_ToggleButton_Sample.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcorXFA_BasicToggle.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ConditionalCalcScripts.pdf
Downloading...
--2013-12-24 13:31:25-- http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JSPopupCalendar.pdf
Resolving www.pdfscripting.com (www.pdfscripting.com)... 74.200.211.194
Connecting to www.pdfscripting.com (www.pdfscripting.com)|74.200.211.194|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 176008 (172K) [application/pdf]
Saving to: `/Downloads/pdflinkextractor_files/JSPopupCalendar.pdf'
100%[===========================================================================================================================================================================>] 176.008 120K/s in 1,4s
2013-12-24 13:31:29 (120 KB/s) - `/Downloads/pdflinkextractor_files/JSPopupCalendar.pdf' saved [176008/176008]
...
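Answer 2
A simple JavaScript snippet can solve this. (Note: I assume that the links to all PDF files end with .pdf.)
Open your browser's JavaScript console, copy the following code, paste it into the console, and you are done!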
// get all link elements
var link_elements = document.querySelectorAll(":link");
// extract all URIs
var link_uris = [];
for (var i = 0; i < link_elements.length; i++)
{
    // skip duplicated links (indexOf compares values; `in` would only test array indices)
    if (link_uris.indexOf(link_elements[i].href) !== -1)
        continue;
    link_uris.push(link_elements[i].href);
}
// keep only the links containing the ".pdf" string
var link_pdfs = link_uris.filter(function (lu) { return lu.indexOf(".pdf") !== -1; });
// print all pdf links
for (var i = 0; i < link_pdfs.length; i++)
    console.log(link_pdfs[i]);