我想在弃用(过时)的文献论坛中查找我的文章易祸网。一些论坛模块被禁用,我无法获取按作者列出的文章列表。此外,该网站未被 Google、Yndex 等搜索引擎收录。
找到我所有文章的唯一方法是打开档案页面站点(图 1)。然后我必须选择特定的年份和月份 - 例如2013 年 1 月(图 1)。然后我必须检查每篇文章(图 2)开头是否写着我的昵称 -pa4080(图3)。但文章数量只有几千篇。
我读过以下几个主题,但没有一个解决方案能满足我的需求:
我会发帖我自己的解决方案。但对我来说很有趣: 有没有更优雅的方法来解决这个任务?
答案1
为了解决这个任务,我创建了下一个简单的主要使用 CLI 工具的 bash 脚本wget
。
#!/bin/bash
TARGET_URL='http://e-bane.net/modules.php?name=Stories_Archive'
KEY_WORDS=('pa4080' 's0ther')
MAP_FILE='url.map'
OUT_FILE='url.list'
get_url_map() {
# Use 'wget' as spider and output the result into a file (and stdout)
wget --spider --force-html -r -l2 "${TARGET_URL}" 2>&1 | grep '^--' | awk '{ print $3 }' | tee -a "$MAP_FILE"
}
filter_url_map() {
# Apply some filters to the $MAP_FILE and keep only the URLs, that contain 'article&sid'
uniq "$MAP_FILE" | grep -v '\.\(css\|js\|png\|gif\|jpg\|txt\)$' | grep 'article&sid' | sort -u > "${MAP_FILE}.uniq"
mv "${MAP_FILE}.uniq" "$MAP_FILE"
printf '\n# -----\nThe number of the pages to be scanned: %s\n' "$(cat "$MAP_FILE" | wc -l)"
}
get_key_urls() {
counter=1
# Do this for each line in the $MAP_FILE
while IFS= read -r URL; do
# For each $KEY_WORD in $KEY_WORDS
for KEY_WORD in "${KEY_WORDS[@]}"; do
# Check if the $KEY_WORD exists within the content of the page, if it is true echo the particular $URL into the $OUT_FILE
if [[ ! -z "$(wget -qO- "${URL}" | grep -io "${KEY_WORD}" | head -n1)" ]]; then
echo "${URL}" | tee -a "$OUT_FILE"
printf '%s\t%s\n' "${KEY_WORD}" "YES"
fi
done
printf 'Progress: %s\r' "$counter"; ((counter++))
done < "$MAP_FILE"
}
# Call the functions
get_url_map
filter_url_map
get_key_urls
该脚本有三个功能:
第一个函数
get_url_map()
使用wget
as--spider
(这意味着它只会检查页面是否存在)并将创建深度级别的递归-r
URL 。(另一个示例可以在这里找到:$MAP_FILE
$TARGET_URL
-l2
将网站转换为 PDF)。目前,其中$MAP_FILE
包含约 20,000 个 URL。The second function
filter_url_map()
will simplify the content of the$MAP_FILE
. In this case we need only the lines (URLs) that contain the stringarticle&sid
and they are about 3000. More Ideas could be found here: How to remove particular words from lines of a text file?The third function
get_key_urls()
will usewget -qO-
(as the commandcurl
- examples) to output the content of each URL from the$MAP_FILE
and will try to find any of the$KEY_WORDS
within it. If any of the$KEY_WORDS
is founded within the content of any particular URL, that URL will be saved in the$OUT_FILE
.
During the working process the output of the script looks as it is shown on the next image. It takes about 63 minutes to finish if there are two keywords and 42 minute when only one keyword is searched.
答案2
script.py
:
#!/usr/bin/python3
from urllib.parse import urljoin
import json
import bs4
import click
import aiohttp
import asyncio
import async_timeout
BASE_URL = 'http://e-bane.net'
async def fetch(session, url):
try:
with async_timeout.timeout(20):
async with session.get(url) as response:
return await response.text()
except asyncio.TimeoutError as e:
print('[{}]{}'.format('timeout error', url))
with async_timeout.timeout(20):
async with session.get(url) as response:
return await response.text()
async def get_result(user):
target_url = 'http://e-bane.net/modules.php?name=Stories_Archive'
res = []
async with aiohttp.ClientSession() as session:
html = await fetch(session, target_url)
html_soup = bs4.BeautifulSoup(html, 'html.parser')
date_module_links = parse_date_module_links(html_soup)
for dm_link in date_module_links:
html = await fetch(session, dm_link)
html_soup = bs4.BeautifulSoup(html, 'html.parser')
thread_links = parse_thread_links(html_soup)
print('[{}]{}'.format(len(thread_links), dm_link))
for t_link in thread_links:
thread_html = await fetch(session, t_link)
t_html_soup = bs4.BeautifulSoup(thread_html, 'html.parser')
if is_article_match(t_html_soup, user):
print('[v]{}'.format(t_link))
# to get main article, uncomment below code
# res.append(get_main_article(t_html_soup))
# code below is used to get thread link
res.append(t_link)
else:
print('[x]{}'.format(t_link))
return res
def parse_date_module_links(page):
a_tags = page.select('ul li a')
hrefs = a_tags = [x.get('href') for x in a_tags]
return [urljoin(BASE_URL, x) for x in hrefs]
def parse_thread_links(page):
a_tags = page.select('table table tr td > a')
hrefs = a_tags = [x.get('href') for x in a_tags]
# filter href with 'file=article'
valid_hrefs = [x for x in hrefs if 'file=article' in x]
return [urljoin(BASE_URL, x) for x in valid_hrefs]
def is_article_match(page, user):
main_article = get_main_article(page)
return main_article.text.startswith(user)
def get_main_article(page):
td_tags = page.select('table table td.row1')
td_tag = td_tags[4]
return td_tag
@click.command()
@click.argument('user')
@click.option('--output-filename', default='out.json', help='Output filename.')
def main(user, output_filename):
loop = asyncio.get_event_loop()
res = loop.run_until_complete(get_result(user))
# if you want to return main article, convert html soup into text
# text_res = [x.text for x in res]
# else just put res on text_res
text_res = res
with open(output_filename, 'w') as f:
json.dump(text_res, f)
if __name__ == '__main__':
main()
requirement.txt
:
aiohttp>=2.3.7
beautifulsoup4>=4.6.0
click>=6.7
Here is python3 version of the script (tested on python3.5 on Ubuntu 17.10).
How to use:
- To use it put both code in files. As example the code file is
script.py
and package file isrequirement.txt
. - Run
pip install -r requirement.txt
. - Run the script as example
python3 script.py pa4080
It uses several libraryes:
- click for argument parser
- beautifulsoup for html parser
- aiohttp for html downloader
Things to know to develop the program further (other than the doc of required package):
- python library: asyncio, json and urllib.parse
- css selectors (mdn web docs), also some html. see also how to use css selector on your browser such as this article
How it works:
- First I create a simple html downloader. It is modified version from the sample given on aiohttp doc.
- After that creating simple command line parser which accept username and output filename.
- Create a parser for thread links and main article. Using pdb and simple url manipulation should do the job.
- Combine the function and put the main article on json, so other program can process it later.
Some idea so it can be developed further
- Create another subcommand that accept date module link: it can be done by separating the method to parse the date module to its own function and combine it with new subcommand.
- Caching the date module link: create cache json file after getting threads link. so the program don't have to parse the link again. or even just cache the entire thread main article even if it doesn't match
This is not the most elegant answer, but I think it is better than using bash answer.
- It use Python, which mean it can be used cross platform.
- Simple installation, all required package can be installed using pip
- It can be developed further, more readable the program, easier it can be developed.
- It does the same job as the bash script only for 13 minutes.
答案3
I recreated my script based on this answer provided by @karel. Now the script uses lynx
instead of wget
. In result it becomes significantly faster.
The current version does the same job for 15 minutes when there are two searched keywords and only 8 minutes if we searching for only one keyword. That is faster than the Python solution provided by @dan.
In addition lynx
provides better handling of non latin characters.
#!/bin/bash
TARGET_URL='http://e-bane.net/modules.php?name=Stories_Archive'
KEY_WORDS=('pa4080') # KEY_WORDS=('word' 'some short sentence')
MAP_FILE='url.map'
OUT_FILE='url.list'
get_url_map() {
# Use 'lynx' as spider and output the result into a file
lynx -dump "${TARGET_URL}" | awk '/http/{print $2}' | uniq -u > "$MAP_FILE"
while IFS= read -r target_url; do lynx -dump "${target_url}" | awk '/http/{print $2}' | uniq -u >> "${MAP_FILE}.full"; done < "$MAP_FILE"
mv "${MAP_FILE}.full" "$MAP_FILE"
}
filter_url_map() {
# Apply some filters to the $MAP_FILE and keep only the URLs, that contain 'article&sid'
uniq "$MAP_FILE" | grep -v '\.\(css\|js\|png\|gif\|jpg\|txt\)$' | grep 'article&sid' | sort -u > "${MAP_FILE}.uniq"
mv "${MAP_FILE}.uniq" "$MAP_FILE"
printf '\n# -----\nThe number of the pages to be scanned: %s\n' "$(cat "$MAP_FILE" | wc -l)"
}
get_key_urls() {
counter=1
# Do this for each line in the $MAP_FILE
while IFS= read -r URL; do
# For each $KEY_WORD in $KEY_WORDS
for KEY_WORD in "${KEY_WORDS[@]}"; do
# Check if the $KEY_WORD exists within the content of the page, if it is true echo the particular $URL into the $OUT_FILE
if [[ ! -z "$(lynx -dump -nolist "${URL}" | grep -io "${KEY_WORD}" | head -n1)" ]]; then
echo "${URL}" | tee -a "$OUT_FILE"
printf '%s\t%s\n' "${KEY_WORD}" "YES"
fi
done
printf 'Progress: %s\r' "$counter"; ((counter++))
done < "$MAP_FILE"
}
# Call the functions
get_url_map
filter_url_map
get_key_urls