I'm using this chain of commands to filter bot/crawler traffic and ban the IP addresses. Is there a way to make this chain of commands more efficient?
sudo awk -F' - |\\"' '{print $1, $7}' access.log |
grep -i -E 'bot|crawler' |
grep -i -v -E 'google|yahoo|bing|msn|ask|aol|duckduckgo' |
awk '{system("sudo ufw deny from "$1" to any")}'
Here is a sample of the log file I'm parsing. It's the default apache2 access.log.
173.239.53.9 - - [09/Oct/2019:01:52:39 +0000] "GET /robots.txt HTTP/1.1" 200 3955 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.6.01001)"
46.229.168.143 - - [09/Oct/2019:01:54:56 +0000] "GET /robots.txt HTTP/1.1" 200 4084 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)"
157.55.39.20 - - [09/Oct/2019:01:56:10 +0000] "GET /robots.txt HTTP/1.1" 200 3918 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
65.132.59.34 - - [09/Oct/2019:01:56:53 +0000] "GET /robots.txt HTTP/1.1" 200 4150 "-" "Gigabot (1.1 1.2)"
198.204.244.90 - - [09/Oct/2019:01:58:23 +0000] "GET /robots.txt HTTP/1.1" 200 4480 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
192.151.157.210 - - [09/Oct/2019:02:03:41 +0000] "GET /robots.txt HTTP/1.1" 200 4480 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
93.158.161.112 - - [09/Oct/2019:02:09:35 +0000] "GET /neighborhood/ballard/robots.txt HTTP/1.1" 404 31379 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
203.133.169.54 - - [09/Oct/2019:02:09:43 +0000] "GET /robots.txt HTTP/1.1" 200 4281 "-" "Mozilla/5.0 (compatible; Daum/4.1; +http://cs.daum.net/faq/15/4118.html?faqId=28966)"
Thanks
Answer 1
Using a single awk command:
awk -F' - |\"' 'tolower($7) ~ /bot|crawler/ && tolower($7) !~ /google|yahoo|bing|msn|ask|aol|duckduckgo/{system("sudo ufw deny from "$1" to any")}' access.log
This only matches the entries whose seventh column contains bot or crawler (what your first grep does), unless that column also contains google|yahoo|bing|msn|ask|aol|duckduckgo (what your second grep does). For every matching line it then runs sudo ufw deny from "$1" to any on the first column (what your final awk does).
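One thing to keep in mind: the same IP usually shows up on many log lines, so the command above may call ufw repeatedly for an address that is already blocked. A minimal variation (just a sketch, not tested against your setup) deduplicates on the first column and prints the rules instead of executing them, so you can review them first:

awk -F' - |\"' 'tolower($7) ~ /bot|crawler/ && tolower($7) !~ /google|yahoo|bing|msn|ask|aol|duckduckgo/ && !seen[$1]++ {print "sudo ufw deny from " $1 " to any"}' access.log

The !seen[$1]++ condition is true only the first time a given IP is seen; once the printed commands look right, pipe the output to sh to actually apply them.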