Seemingly impossible to batch-extract a URL string between two text delimiters from hundreds of messy Google News Alert emails

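For years, Google News Alerts has sent daily emails whenever certain words appear on websites, in tweets, and so on. These emails (in HTML and .txt format) share one quirk: there is no easy way to extract the URLs that contain the alert keywords. Since we literally have 1000s of these emails, we only want to extract the URLs identified by the alerts, so that we can add them to our Google Disavow list and to the .htaccess file for the domains we need to block.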

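After countless attempts, we still cannot get this .bat file to work. The idea is simple: given two delimiters, the batch file searches a directory full of .eml/.html/.txt files and, for each file it finds, extracts the string between url=3D and \u0026ct and writes the URL to a .txt file. We can then add that output (manually, once a month or more often if needed) to our Google Disavow list and our .htaccess block list.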

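Step 2 would be full automation: once the URLs are pulled out, they are appended automatically to the running Disavow list and to a temporary .htaccess file that we keep in the same directory, which can then be put live.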
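As a rough illustration of that second step, here is a minimal Python sketch that turns extracted URLs into `domain:` lines, the form Google's disavow file accepts, while skipping domains that are already listed. The function name `disavow_lines` and the example URL are hypothetical. The .htaccess side is left out on purpose, because the rules depend on how the blocking is set up.

```python
from urllib.parse import urlsplit

def disavow_lines(urls, already_listed=()):
    """Turn URLs into sorted, de-duplicated 'domain:' lines for a disavow file."""
    known = {line.strip().lower() for line in already_listed}
    new_lines = set()
    for url in urls:
        host = urlsplit(url.strip()).hostname
        if not host:
            continue  # skip blank or malformed lines
        line = f"domain:{host.lower()}"
        if line not in known:
            new_lines.add(line)
    return sorted(new_lines)

if __name__ == "__main__":
    # Hypothetical input; in practice this would be the lines of DidIT.txt.
    extracted = ["https://spam.example.com/page.php"]
    for line in disavow_lines(extracted, already_listed=["domain:example.org"]):
        print(line)
```

The returned lines could then be appended to the existing disavow file after a manual review.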

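This has turned out to be very challenging. The MS-DOS .bat command for finding strings (findstr) is used together with the usebackq option, and we believe findstr has a problem when the search string itself contains any / and/or = characters.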

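The problem arises because, when parsing any Google News Alert email (we believe Google does this deliberately), the start of the URL that needs to be captured cannot be parsed out: the delimiter that marks exactly where the URL begins, u0026url=3D, is what causes the findstr command to fail.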
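For context, the `=3D` sequences and the `=` at the end of each line look like quoted-printable encoding: `=3D` is an encoded `=`, and a trailing `=` is a soft line break. Assuming that is the case for these files, a minimal Python sketch using the standard `quopri` module shows how decoding first makes the delimiters easy to match. The function name and the example.com fragment are hypothetical.

```python
import quopri
import re

def decode_and_extract(raw: bytes):
    """Decode quoted-printable bytes, then return the first embedded URL or None."""
    # quopri turns "=3D" back into "=" and removes soft line breaks ("=" + newline).
    decoded = quopri.decodestring(raw).decode("utf-8", errors="replace")
    # After decoding, the delimiters read \u0026url= and \u0026ct= (no "3D").
    match = re.search(r"\\u0026url=(.+?)\\u0026ct=", decoded)
    return match.group(1) if match else None

# Hypothetical fragment shaped like the sample e-mail; example.com is a placeholder.
sample = (
    b'"url": "https://www.google.com/url?rct=3Dj\\u0026sa=3Dt\\u0026url=3Dhtt=\n'
    b'ps://example.com/some/pa=\n'
    b'ge.php\\u0026ct=3Dga"\n'
)

print(decode_and_extract(sample))
```

Decoding before searching avoids having to treat `=` as a special case in the search string.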

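As an example, here are two attempts. Attempt 1 pulls out only part of the line, while attempt 2 just returns some odd output on every line of the DidIT.txt result file: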

   line:*=
   line:*=
   line:*=

This clearly shows that there is a problem.

If anyone can help, we would be very grateful: we are trying to parse 5000 Google News Alerts, and we need to append the results to our block list and Disavow list.

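Below are the two best examples of the .bat files we have tried, along with a simple snippet of the typical raw .html/.txt output of an email we received from Google News Alerts.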


Option #1

@echo off
setlocal enabledelayedexpansion

set "source_directory=D:\000\parser"
set "output_file=DidIT.txt"

REM Clear output file
type nul > "%output_file%"

REM Loop through all .eml files in the source directory
for %%F in ("%source_directory%\*.eml") do (
    echo Processing file: %%~nxF

REM Read each line of the .eml file
    for /f "tokens=2 delims==&" %%A in ('findstr /C:"url=3D" "%%F"') do (
        REM Extract characters between "&url=" and "&ct" delimiters
        for /f "tokens=1 delims=&" %%B in ("%%A") do (
            REM Output the extracted characters to the output file
            echo %%B>>"%output_file%"
        )
    )
)

echo Completed! The output has been saved to "%output_file%".

Option #2

@echo off
setlocal enabledelayedexpansion

set "directory=D:\000\parser"
set "outputFile=%directory%\DidIT.txt"

echo Processing .eml files in directory: %directory%

rem Delete the output file if it already exists
if exist "%outputFile%" del "%outputFile%"

for %%F in ("%directory%\*.eml") do (
    echo Processing file: %%~nxF
    set "delim1=u0026url"
    set "delim2=u0026cd"
    set "content="
    for /f "usebackq tokens=1,* delims=" %%A in ("%%F") do (
        set "line=%%A"
        if "!line:%delim1%=!" neq "!line!" (
            set "content=!line:*%delim1%=!"
            setlocal enabledelayedexpansion
            for /f "delims=%delim2%" %%C in ("!content!") do (
                echo %%C>> "%outputFile%"
            )
            endlocal
        )
    )
)

echo Output file didit.txt created successfully.

endlocal

Sample Google News Alert .html/.txt/.eml email to run the above against; make several copies and name them #1.eml, #2.eml, #3.eml

"widgets": [ {
"type": "LINK",
"title": "tmavomodr=C3=A9 =C5=w2q2wa blackblob, strih do A, obtiahnut=
=C3=A9, blabla",
"description": "yytree=C3=A1t =C4=8D. 814091234: tmiiiih=C3=A9 =C5=A1=
aty blackblob, strih do A, gheewse=C3=A9, wahwah, 234: 6 =E2=82=AC, L=
yayayay: Bansk=C3=A1 =C5=A0ooohooh.",
      "url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt=
ps://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=
1-2-333333-56543554.php\u0026ct=3Dga\u0026cd=111111111111111111111111111111=
222222222222222222222222222222222222222222222222222222\2322usg=erererererer=
3333333333333333333333333"
    } ]
  } ]
}
</script> <!--[if mso]>
 <table><tr><td width=342wq>
<![endif]-->
 <div style=3D"width:100%;max-width:650px"> <div style=3D"font-family:Arial=
"> <table style=3D"border-collapse:collapse;border-left:1px solid #121212;b=
order-right:1px solid #343434"> <tr> <td style=3D"background-color:#w2w2w2;=
padding-left:18px;border-bottom:1px solid #121212;border-top:1px solid #121=
332"></td> <td valign=3D"middle" style=3D"padding:13px 10px 8px 0px;backgro=
und-color:#w2w2w2;border-top:1px solid #q1q1q1;border-bottom:1px solid #111=

The output in DidIT.txt should look like this:

https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php

https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php

https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php

Answer 1

With this text in input.html

"widgets": [ { "type": "LINK", "title": "tmavomodr=C3=A9 =C5=w2q2wa blackblob, strih do A, obtiahnut= =C3=A9, blabla", "description": "yytree=C3=A1t =C4=8D. 814091234: tmiiiih=C3=A9 =C5=A1= aty blackblob, strih do A, gheewse=C3=A9, wahwah, 234: 6 =E2=82=AC, L= yayayay: Bansk=C3=A1 =C5=A0ooohooh.", "url": "https://www.google.com/url?rct=3Dj\u0026sa=3Dt\u0026url=3Dhtt= ps://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d= 1-2-333333-56543554.php\u0026ct=3Dga\u0026cd=111111111111111111111111111111= 222222222222222222222222222222222222222222222222222222\2322usg=erererererer= 3333333333333333333333333" } ] } ] } <!--[if mso]>

...a Python snippet that may help you get further:

C:\>python -c "import sys, re;s=' '.join(sys.stdin.readlines());z=re.findall('.0026url=3D(.+).u0026ct=3D',s);print(z[0])" <input.html

...which prints the following from your (malformed?) sample:

htt= ps://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d= 1-2-333333-56543554.php

This will "work" on the version edited by @Destroy666, saved as 2input.html:

python -c "import sys, re;s=''.join([line.replace('=\n','') for line in sys.stdin.readlines()]);z=re.findall('.0026url=3D(.+).u0026ct=3D',s);print(z[0])" <2input.html 

Output:

https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
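To cover a whole directory rather than a single file, the same idea can be sketched as a short script. This is not part of the original answer. It assumes the files are readable as text, that every URL sits between `\u0026url=3D` and `\u0026ct=3D`, and that soft line breaks are a `=` directly before a newline. The function names are hypothetical.

```python
import re
import sys
from pathlib import Path

# Matches the text between the two delimiters once soft line breaks are gone.
URL_PATTERN = re.compile(r"\\u0026url=3D(.+?)\\u0026ct=3D")

def extract_urls(raw_text: str) -> list[str]:
    """Return every URL found between the two delimiters in raw_text."""
    # Drop quoted-printable soft line breaks ("=" directly before a newline).
    joined = re.sub(r"=\r?\n", "", raw_text)
    return URL_PATTERN.findall(joined)

def extract_from_directory(source_dir: str, output_file: str) -> int:
    """Scan *.eml files in source_dir and write one URL per line."""
    urls = []
    for path in sorted(Path(source_dir).glob("*.eml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        urls.extend(extract_urls(text))
    Path(output_file).write_text("".join(u + "\n" for u in urls), encoding="utf-8")
    return len(urls)

if __name__ == "__main__":
    if len(sys.argv) == 3:
        count = extract_from_directory(sys.argv[1], sys.argv[2])
        print(f"Wrote {count} URL(s) to {sys.argv[2]}")
    else:
        print("usage: python extract_urls.py SOURCE_DIR OUTPUT_FILE")
```

Saved under a name such as extract_urls.py, it could be run as `python extract_urls.py D:\000\parser DidIT.txt`, using the paths from the question. The non-greedy `(.+?)` keeps each match from running on into the next alert when one file contains several.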

Answer 2

Your sample content for the output string does not contain the "=" in:

...d=1-2-333333-56543554.php

If that is not intentional (a typo), then you can use:

@echo off && setlocal enabledelayedexpansion

cd.>"DidIT.txt"
for %%G in ("D:\000\parser\*.eml")do echo/Processing file: "%%~nxG" && =;(
     for /f usebackq^tokens^=*delims^=^\ %%i in (`type "%%~fG" ^|find ".php"
        `)do for /f ^usebackq^delims^=^\ %%? in (`echo;%%~i`)do echo/htt%%~?
    );= >>".\DidIT.txt"

endlocal & echo Completed! The output has been saved to .\DidIT.txt
  • Contents saved in .\DidIT.txt
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php

But to extract the string and also remove the "=" from the output/file, you can try:

@echo off && setlocal enabledelayedexpansion

cd.>"DidIT.txt"
for %%G in ("D:\000\*.eml")do echo/Processing file: "%%~nxG" && =;(
    for /f usebackq^tokens^=*delims^=^\ %%i in (`type "%%~fG" ^|find ".php"
       `)do for /f ^usebackq^delims^=^\ %%? in (`echo;%%~i`)do set "_str=%%~?" && =;(
            for /f ^usebackq^tokens^=1*delims^=^= %%a in (`%ComSpec% /v /c echo;!_str!
               `)do >>".\DidIT.txt" echo/htt%%~a%%~b
            );=
        );=

endlocal & echo Completed! The output has been saved to .\DidIT.txt
  • Contents saved in .\DidIT.txt
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php

Another approach, in PowerShell:

$str=$(gc -path "D:\000\*.eml" | sls ".php") | 
     ?{$_} ; $str=($str -replace ("=","") -split '\\u'|sls .php)
$str -replace ("ps:","https:") | out-file .\DidIT.txt
  • Contents saved in .\DidIT.txt
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php

Or...

$str=$(gc -path "D:\000\*.eml" | sls ".php") | 
     ?{$_} ; $str=($str -split '\\u'|sls .php)
$str -replace ("ps:","https:") | out-file .\DidIT.txt
  • Contents saved in .\DidIT.txt
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d1-2-333333-56543554.php

Another approach, in PowerShell:

$str=$(gc -path "D:\000\*.eml" | sls ".php") | 
     ?{$_} ; $str=($str -replace ("=","") -split '\\u'|sls .php)
$str -replace ("ps:","https:") | out-file .\DidIT.txt
  • Contents saved in .\DidIT.txt
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php
https://ohohohoh.wawawa.zz/ghhhhyh656y/deas/yayaya-iiiuiiu-dd1111111dee-yaya-d=1-2-333333-56543554.php

Answer 3

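So, from everyone's input I have been able to put together the code below, but it still does not work: it cannot parse out the '=' that is inherent in every line of the .eml file and that falls between the two delimiters marking the URL of the actual web page containing the keywords in a typical Google News Alert. Am I missing something?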

@ioio @hannu can you help?

@echo off
setlocal enabledelayedexpansion

set "prefix=\u0026url=3D"
set "suffix=\u0026ct=3Dga"
set "outputFile=output.txt"

> "%outputFile%" (
    for %%F in (*.eml) do (
        set "content="
        for /f "usebackq tokens=*" %%L in ("%%F") do (
            set "line=%%L"
            setlocal enabledelayedexpansion
            if "!line!" neq "" (
                set "content=!content!!line!"
                if "!line!"=="!suffix!" (
                    set "content=!content:*!prefix!=!"
                    set "content=!content:~0,-12!"
                    set "content=!content:=!"
                    echo !content!
                    set "content="
                )
            )
            endlocal
        )
    )
)

echo Output written to %outputFile%

endlocal
