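For years, Google News Alerts has sent daily emails whenever certain words appear on websites, in tweets, etc. These emails (in HTML and .txt format) have one peculiarity: there is no easy way to extract the URLs that contain the Google News Alert keywords. Since we have literally 1000 of these emails, we want to extract only the URLs identified in the Google News Alerts, so that we can add those URLs to our Google Disavow list as well as to the .htaccess file of domains we need to block.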
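Try as we might, we cannot get this .bat file to work. With just two delimiters set, the batch file should search a directory full of .eml / .html / .txt files and, for each file it finds, extract the string between url=3D and \u0026ct, then write the URL to a .txt file. We can then add the output (manually, 1x/month or more often if needed) to our Google Disavow list and our .htaccess block list.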
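Step 2 would be full automation: once the URLs are pulled out, they are automatically appended to the running Disavow list and to a temporary .htaccess file that we keep in the same directory, ready to go live.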
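As a rough illustration of that second step, here is a minimal Python sketch that turns extracted URLs into de-duplicated `domain:` lines, which is the line format Google's disavow file accepts. The function name `disavow_lines` and the example.com / example.org URLs are hypothetical, chosen only for illustration; the .htaccess side is left out because the rules you need there depend on your server setup.

```python
from urllib.parse import urlsplit

def disavow_lines(urls):
    """Return sorted, de-duplicated 'domain:host' lines for the given URLs."""
    hosts = set()
    for url in urls:
        # hostname is None for strings that are not absolute URLs
        host = urlsplit(url.strip()).hostname
        if host:
            hosts.add(host)  # urlsplit already lower-cases the hostname
    return ["domain:" + host for host in sorted(hosts)]

# Hypothetical input, standing in for the lines of DidIT.txt.
extracted = [
    "https://example.com/a.php",
    "https://Example.com/b.php",
    "https://example.org/c.php",
]
for line in disavow_lines(extracted):
    print(line)
```

Appending these lines to an existing list would still need a check against what is already in the file; that part is not shown.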
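This has become very challenging. The MS-DOS .bat command for finding strings (findstr), used together with the usebackq option, appears to have problems when the search string itself contains any / and/or = characters. This matters because, when parsing any Google News Alert email (we think Google does this on purpose), we cannot parse out the start of the URL that needs to be grabbed: the delimiter that marks exactly where the URL starts, u0026url=3D, is what makes findstr fail.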
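For comparison, a plain substring search has no trouble with `=` or `/` inside the delimiters. The following is only a sketch in Python (not a batch fix): the helper name `between` and the one-line sample are made up for illustration, and the sample is assumed to have no quoted-printable line wrapping.

```python
def between(text, start, end):
    """Return every substring found between the literal markers start and end."""
    found = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        found.append(text[i:j])
        pos = j + len(end)
    return found

# Hypothetical single-line sample shaped like the alert markup (raw string,
# so \u0026 stays as the literal six characters found in the email).
sample = (r'"url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt'
          r'\u0026url=3Dhttps://example.com/a/b.php\u0026ct=3Dga\u0026cd=1"')

print(between(sample, r"\u0026url=3D", r"\u0026ct=3D"))
```

The markers are treated as whole strings here, not as sets of single delimiter characters.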
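As an example, here are two attempts: attempt 1 pulls out partial lines, while attempt 2 only returns some strange output on every line of the DidIT.txt output file: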
line:*=
line:*=
line:*=
This clearly shows that there is a problem.
If anyone can help, it would be much appreciated, as we are trying to parse 5000 Google News Alerts that we need to append to our block list and Disavow list.
Below are the two best examples of the .bat files we have tried, followed by a simple snippet of the typical raw .html / .txt output of an email we get from Google News Alerts.
Option #1
@echo off
setlocal enabledelayedexpansion
set "source_directory=D:\000\parser"
set "output_file=DidIT.txt"
REM Clear output file
type nul > "%output_file%"
REM Loop through all .eml files in the source directory
for %%F in ("%source_directory%\*.eml") do (
echo Processing file: %%~nxF
REM Read each line of the .eml file
for /f "tokens=2 delims==&" %%A in ('findstr /C:"url=3D" "%%F"') do (
REM Extract characters between "&url=" and "&ct" delimiters
for /f "tokens=1 delims=&" %%B in ("%%A") do (
REM Output the extracted characters to the output file
echo %%B>>"%output_file%"
)
)
)
echo Completed! The output has been saved to "%output_file%".
Option #2
@echo off
setlocal enabledelayedexpansion
set "directory=D:\000\parser"
set "outputFile=%directory%\DidIT.txt"
echo Processing .eml files in directory: %directory%
rem Delete the output file if it already exists
if exist "%outputFile%" del "%outputFile%"
for %%F in ("%directory%\*.eml") do (
echo Processing file: %%~nxF
set "delim1=u0026url"
set "delim2=u0026cd"
set "content="
for /f "usebackq tokens=1,* delims=" %%A in ("%%F") do (
set "line=%%A"
if "!line:%delim1%=!" neq "!line!" (
set "content=!line:*%delim1%=!"
setlocal enabledelayedexpansion
for /f "delims=%delim2%" %%C in ("!content!") do (
echo %%C>> "%outputFile%"
)
endlocal
)
)
)
echo Output file didit.txt created successfully.
endlocal
Sample Google News Alert .html / .txt / .eml email to run the above against (make several copies and label them #1.eml, #2.eml, #3.eml):
"widgets": [ {
"type": "LINK",
"title": "tmavomodr=C3=A9 =C5=w2q2wa blackblob, strih do A, obtiahnut=
=C3=A9, blabla",
"description": "yytree=C3=A1t =C4=8D. 814091234: tmiiiih=C3=A9 =C5=A1=
aty blackblob, strih do A, gheewse=C3=A9, wahwah, 234: 6 =E2=82=AC, L=
yayayay: Bansk=C3=A1 =C5=A0ooohooh.",
"url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt=
ps://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=
1-2-333333-56543554.php\u0026ct=3Dga\u0026cd=111111111111111111111111111111=
222222222222222222222222222222222222222222222222222222\2322usg=erererererer=
3333333333333333333333333"
} ]
} ]
}
</script> <!--[if mso]>
<table><tr><td width=342wq>
<![endif]-->
<div style=3D"width:100%;max-width:650px"> <div style=3D"font-family:Arial=
"> <table style=3D"border-collapse:collapse;border-left:1px solid #121212;b=
order-right:1px solid #343434"> <tr> <td style=3D"background-color:#w2w2w2;=
padding-left:18px;border-bottom:1px solid #121212;border-top:1px solid #121=
332"></td> <td valign=3D"middle" style=3D"padding:13px 10px 8px 0px;backgro=
und-color:#w2w2w2;border-top:1px solid #q1q1q1;border-bottom:1px solid #111=
The output in DidIT.txt should look like this:
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
Answer 1
Using this text as input.html:
"widgets": [ { "type": "LINK", "title": "tmavomodr=C3=A9 =C5=w2q2wa blackblob, strih do A, obtiahnut= =C3=A9, blabla", "description": "yytree=C3=A1t =C4=8D. 814091234: tmiiiih=C3=A9 =C5=A1= aty blackblob, strih do A, gheewse=C3=A9, wahwah, 234: 6 =E2=82=AC, L= yayayay: Bansk=C3=A1 =C5=A0ooohooh.", "url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt= ps://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d= 1-2-333333-56543554.php\u0026ct=3Dga\u0026cd=111111111111111111111111111111= 222222222222222222222222222222222222222222222222222222\2322usg=erererererer= 3333333333333333333333333" } ] } ] } <!--[if mso]>
...a Python snippet may get you further:
C:\>python -c "import sys, re;s=' '.join(sys.stdin.readlines());z=re.findall('.0026url=3D(.+).u0026ct=3D',s);print(z[0])" <input.html
...which prints the following from your (malformed?) sample:
htt= ps://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d= 1-2-333333-56543554.php
This will "work" on the version edited by @Destroy666, saved as 2input.html:
python -c "import sys, re;s=''.join([line.replace('=\n','') for line in sys.stdin.readlines()]);z=re.findall('.0026url=3D(.+).u0026ct=3D',s);print(z[0])" <2input.html
Output:
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
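The trailing `=` at the end of each line and the `=3D` sequences are what quoted-printable encoding looks like: `=` followed by a line break is a soft line break, and `=3D` is an encoded `=`. Instead of stripping `=\n` by hand, Python's standard `quopri` module can decode this. The sketch below assumes the message body really is quoted-printable; the bytes in `raw` are a made-up sample using example.com, not data from the question.

```python
import quopri

# Hypothetical quoted-printable fragment: soft line breaks ("=" at end of
# line) split the URL, and "=3D" stands for a literal "=".
raw = (b'"url": "https://www.google.com/url?rct=3Dj\\u0026sa=3Dt\\u0026url=3Dhtt=\n'
       b'ps://example.com/some/pa=\n'
       b'ge.php\\u0026ct=3Dga\\u0026cd=3D1"\n')

# decodestring joins the soft-wrapped lines and turns =3D back into "=".
decoded = quopri.decodestring(raw).decode("utf-8", errors="replace")
print(decoded)
```

After decoding, the delimiters to search for become `\u0026url=` and `\u0026ct=` rather than the `=3D` forms.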
Answer 2
The sample output strings in your question do not contain the "=" that appears in the source at:
...d=1-2-333333-56543554.php
If that was not intentional (a typo), then you can use:
@echo off && setlocal enabledelayedexpansion
cd.>"DidIT.txt"
for %%G in ("D:\000\parser\*.eml")do echo/Processing file: "%%~nxG" && =;(
for /f usebackq^tokens^=*delims^=^\ %%i in (`type "%%~fG" ^|find ".php"
`)do for /f ^usebackq^delims^=^\ %%? in (`echo;%%~i`)do echo/htt%%~?
);= >>".\DidIT.txt"
endlocal & echo Completed! The output has been saved to .\DidIT.txt
- Contents saved in .\DidIT.txt:
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
But to extract the string and also remove the "=" from the output file, you can try:
@echo off && setlocal enabledelayedexpansion
cd.>"DidIT.txt"
for %%G in ("D:\000\*.eml")do echo/Processing file: "%%~nxG" && =;(
for /f usebackq^tokens^=*delims^=^\ %%i in (`type "%%~fG" ^|find ".php"
`)do for /f ^usebackq^delims^=^\ %%? in (`echo;%%~i`)do set "_str=%%~?" && =;(
for /f ^usebackq^tokens^=1*delims^=^= %%a in (`%ComSpec% /v /c echo;!_str!
`)do >>".\DidIT.txt" echo/htt%%~a%%~b
);=
);=
endlocal & echo Completed! The output has been saved to .\DidIT.txt
- Contents saved in .\DidIT.txt:
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
Another approach, in PowerShell:
$str=$(gc -path "D:\000\*.eml" | sls ".php") |
?{$_} ; $str=($str -replace ("=","") -split '\\u'|sls .php)
$str -replace ("ps:","https:") | out-file .\DidIT.txt
- Contents saved in .\DidIT.txt:
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
Or...
$str=$(gc -path "D:\000\*.eml" | sls ".php") |
?{$_} ; $str=($str -split '\\u'|sls .php)
$str -replace ("ps:","https:") | out-file .\DidIT.txt
- Contents saved in .\DidIT.txt:
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
Another approach, in PowerShell:
$str=$(gc -path "D:\000\*.eml" | sls ".php") |
?{$_} ; $str=($str -replace ("=","") -split '\\u'|sls .php)
$str -replace ("ps:","https:") | out-file .\DidIT.txt
- Contents saved in .\DidIT.txt:
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
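Answer 3
So, from everyone's input, I have been able to put together the code below, but it still does not work: it cannot parse out the '=' that is inherent in every line of the .eml file and that sits between the two delimiters identifying the URL of the actual web page containing the keywords in a typical Google News Alert. Am I missing something?
@ioio @hannu, can you help?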
@echo off
setlocal enabledelayedexpansion
set "prefix=\u0026url=3D"
set "suffix=\u0026ct=3Dga"
set "outputFile=output.txt"
> "%outputFile%" (
for %%F in (*.eml) do (
set "content="
for /f "usebackq tokens=*" %%L in ("%%F") do (
set "line=%%L"
setlocal enabledelayedexpansion
if "!line!" neq "" (
set "content=!content!!line!"
if "!line!"=="!suffix!" (
set "content=!content:*!prefix!=!"
set "content=!content:~0,-12!"
set "content=!content:=!"
echo !content!
set "content="
)
)
endlocal
)
)
)
echo Output written to %outputFile%
endlocal
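Two things in the script above look like they prevent a match: `if "!line!"=="!suffix!"` is only true when an entire line equals the suffix, and `!content:*!prefix!=!` nests one delayed-expansion variable inside another, which cmd does not resolve that way. If switching tools is an option, the whole job can be sketched in Python. This is an outline under assumptions, not a tested replacement: it assumes each .eml file is quoted-printable throughout, and the names `extract_urls` and `process_directory` are hypothetical.

```python
import quopri
import re
from pathlib import Path

# After quoted-printable decoding, "=3D" has become "=", so the markers are
# the literal text \u0026url= and \u0026ct= (backslash escaped for the regex).
URL_RE = re.compile(r"\\u0026url=(.+?)\\u0026ct=")

def extract_urls(eml_bytes):
    """Decode quoted-printable bytes and return the URLs between the markers."""
    text = quopri.decodestring(eml_bytes).decode("utf-8", errors="replace")
    return URL_RE.findall(text)

def process_directory(source_dir, output_file):
    """Write one URL per line for every *.eml in source_dir; return the count."""
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).glob("*.eml")):
            for url in extract_urls(path.read_bytes()):
                out.write(url + "\n")
                count += 1
    return count

# Example call, using the directory from the question:
# process_directory(r"D:\000\parser", "DidIT.txt")
```

Messages whose HTML part uses a different transfer encoding (base64, for instance) would need proper MIME parsing with the standard `email` package instead of decoding the raw file.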