编辑了问题,因为我愿意接受不同类型的解决方案,而不像以前那样只是批量处理 我在 Windows 上,并且有一些建议的 SED 等。所以我可以使用命令行来使用这些第三方独立 exe
假设我在 abc.txt 文件中有以下几行
"@yuy007 what are you doing friend #disneyrocks"
"STFU, i dont care what you think @happy55"
"@social88 @gg99 ok mate see you at the subway :)"
"btw arnold was great in that movie @tt11 @gg11 #disneyrocks"
"we are going to disney. Do you want to? #disneyrocks"
"We dont like disney. #disneyrocks we are not going"
".@socialguy what are you upto #disneyrocks "
我需要对上述文件使用 5 个过滤器来获取 def.txt
- 删除所有以 @ 字符开头的行,如第 1 行和第 3 行
- 删除所有以 .@ 字符开头的行,例如第 7 行
- 删除所有不包含以 # 开头的单词的行,例如 2nd 和 3rd
- 在剩下的行中,删除所有以 @ 字符开头的单词(保持行完整),例如第二个中的单词 @happy55、第三个中的单词 @social99 和 @gg99,等等。在这种情况下,我们仍然需要保留行首和行末的引号“
- 删除上述行后留下的所有空白行
编辑 如果我有以下行,它会错误地删除@word 之后的内容
"btw arnold was great in that movie @tt101 @gb1997 #whatthehell"
编辑为
"btw arnold was great in that movie"
谢谢
答案1
您将需要使用正则表达式来实现这一点。由于您已指定 BATCH 作为首选脚本语言,因此我们需要添加该功能。我们可以通过多种方式实现这一点,但我喜欢此版本由 dostips.com 上名为 Dave Benham 的某人编写,因为它仅使用应该已经存在于您的机器上的二进制文件:
@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment
::************ Documentation ***********
::REPL.BAT version 4.1
:::
:::REPL Search Replace [Options [SourceVar]]
:::REPL /?[REGEX|REPLACE]
:::REPL /V
:::
::: Performs a global regular expression search and replace operation on
::: each line of input from stdin and prints the result to stdout.
:::
::: Each parameter may be optionally enclosed by double quotes. The double
::: quotes are not considered part of the argument. The quotes are required
::: if the parameter contains a batch token delimiter like space, tab, comma,
::: semicolon. The quotes should also be used if the argument contains a
::: batch special character like &, |, etc. so that the special character
::: does not need to be escaped with ^.
:::
::: If called with a single argument of /?, then prints help documentation
::: to stdout. If a single argument of /?REGEX, then opens up Microsoft's
::: JScript regular expression documentation within your browser. If a single
::: argument of /?REPLACE, then opens up Microsoft's JScript REPLACE
::: documentation within your browser.
:::
::: If called with a single argument of /V, case insensitive, then prints
::: the version of REPL.BAT.
:::
::: Search - By default, this is a case sensitive JScript (ECMA) regular
::: expression expressed as a string.
:::
::: JScript regex syntax documentation is available at
::: http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx
:::
::: Replace - By default, this is the string to be used as a replacement for
::: each found search expression. Full support is provided for
::: substituion patterns available to the JScript replace method.
:::
::: For example, $& represents the portion of the source that matched
::: the entire search pattern, $1 represents the first captured
::: submatch, $2 the second captured submatch, etc. A $ literal
::: can be escaped as $$.
:::
::: An empty replacement string must be represented as "".
:::
::: Replace substitution pattern syntax is fully documented at
::: http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx
:::
::: Options - An optional string of characters used to alter the behavior
::: of REPL. The option characters are case insensitive, and may
::: appear in any order.
:::
::: I - Makes the search case-insensitive.
:::
::: L - The Search is treated as a string literal instead of a
::: regular expression. Also, all $ found in Replace are
::: treated as $ literals.
:::
::: B - The Search must match the beginning of a line.
::: Mostly used with literal searches.
:::
::: E - The Search must match the end of a line.
::: Mostly used with literal searches.
:::
::: V - Search and Replace represent the name of environment
::: variables that contain the respective values. An undefined
::: variable is treated as an empty string.
:::
::: A - Only print altered lines. Unaltered lines are discarded.
::: If both the M and V options are present, then prints the
::: entire result if there was a change anywhere in the string.
::: The A option is incompatible with the M option unless the S
::: option is also present.
:::
::: M - Multi-line mode. The entire contents of stdin is read and
::: processed in one pass instead of line by line, thus enabling
::: search for \n. This also enables preservation of the original
::: line terminators. If the M option is not present, then every
::: printed line is termiated with carriage return and line feed.
::: The M option is incompatible with the A option unless the S
::: option is also present.
:::
::: Note: If working with binary data containing NULL bytes,
::: then the M option must be used.
:::
::: X - Enables extended substitution pattern syntax with support
::: for the following escape sequences within the Replace string:
:::
::: \\ - Backslash
::: \b - Backspace
::: \f - Formfeed
::: \n - Newline
::: \q - Quote
::: \r - Carriage Return
::: \t - Horizontal Tab
::: \v - Vertical Tab
::: \xnn - Extended ASCII byte code expressed as 2 hex digits
::: \unnnn - Unicode character expressed as 4 hex digits
:::
::: Also enables the \q escape sequence for the Search string.
::: The other escape sequences are already standard for a regular
::: expression Search string.
:::
::: Also modifies the behavior of \xnn in the Search string to work
::: properly with extended ASCII byte codes.
:::
::: Extended escape sequences are supported even when the L option
::: is used. Both Search and Replace support all of the extended
::: escape sequences if both the X and L opions are combined.
:::
::: S - The source is read from an environment variable instead of
::: from stdin. The name of the source environment variable is
::: specified in the next argument after the option string. Without
::: the M option, ^ anchors the beginning of the string, and $ the
::: end of the string. With the M option, ^ anchors the beginning
::: of a line, and $ the end of a line.
:::
::: REPL.BAT was written by Dave Benham, with assistance from DosTips user Aacini
::: to get \xnn to work properly with extended ASCII byte codes. Also assistance
::: from DosTips user penpen diagnosing issues reading NULL bytes, along with a
::: workaround. REPL.BAT was originally posted at:
::: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855
:::
::************ Batch portion ***********
@echo off
if .%2 equ . (
if "%~1" equ "/?" (
<"%~f0" cscript //E:JScript //nologo "%~f0" "^:::" "" a
exit /b 0
) else if /i "%~1" equ "/?regex" (
explorer "http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx"
exit /b 0
) else if /i "%~1" equ "/?replace" (
explorer "http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx"
exit /b 0
) else if /i "%~1" equ "/V" (
<"%~f0" cscript //E:JScript //nologo "%~f0" "^::(REPL\.BAT version)" "$1" a
exit /b 0
) else (
call :err "Insufficient arguments"
exit /b 1
)
)
echo(%~3|findstr /i "[^SMILEBVXA]" >nul && (
call :err "Invalid option(s)"
exit /b 1
)
echo(%~3|findstr /i "M"|findstr /i "A"|findstr /vi "S" >nul && (
call :err "Incompatible options"
exit /b 1
)
cscript //E:JScript //nologo "%~f0" %*
exit /b 0
:err
>&2 echo ERROR: %~1. Use REPL /? to get help.
exit /b
************* JScript portion **********/
var env=WScript.CreateObject("WScript.Shell").Environment("Process");
var args=WScript.Arguments;
var search=args.Item(0);
var replace=args.Item(1);
var options="g";
if (args.length>2) options+=args.Item(2).toLowerCase();
var multi=(options.indexOf("m")>=0);
var alterations=(options.indexOf("a")>=0);
if (alterations) options=options.replace(/a/g,"");
var srcVar=(options.indexOf("s")>=0);
if (srcVar) options=options.replace(/s/g,"");
if (options.indexOf("v")>=0) {
options=options.replace(/v/g,"");
search=env(search);
replace=env(replace);
}
if (options.indexOf("x")>=0) {
options=options.replace(/x/g,"");
replace=replace.replace(/\\\\/g,"\\B");
replace=replace.replace(/\\q/g,"\"");
replace=replace.replace(/\\x80/g,"\\u20AC");
replace=replace.replace(/\\x82/g,"\\u201A");
replace=replace.replace(/\\x83/g,"\\u0192");
replace=replace.replace(/\\x84/g,"\\u201E");
replace=replace.replace(/\\x85/g,"\\u2026");
replace=replace.replace(/\\x86/g,"\\u2020");
replace=replace.replace(/\\x87/g,"\\u2021");
replace=replace.replace(/\\x88/g,"\\u02C6");
replace=replace.replace(/\\x89/g,"\\u2030");
replace=replace.replace(/\\x8[aA]/g,"\\u0160");
replace=replace.replace(/\\x8[bB]/g,"\\u2039");
replace=replace.replace(/\\x8[cC]/g,"\\u0152");
replace=replace.replace(/\\x8[eE]/g,"\\u017D");
replace=replace.replace(/\\x91/g,"\\u2018");
replace=replace.replace(/\\x92/g,"\\u2019");
replace=replace.replace(/\\x93/g,"\\u201C");
replace=replace.replace(/\\x94/g,"\\u201D");
replace=replace.replace(/\\x95/g,"\\u2022");
replace=replace.replace(/\\x96/g,"\\u2013");
replace=replace.replace(/\\x97/g,"\\u2014");
replace=replace.replace(/\\x98/g,"\\u02DC");
replace=replace.replace(/\\x99/g,"\\u2122");
replace=replace.replace(/\\x9[aA]/g,"\\u0161");
replace=replace.replace(/\\x9[bB]/g,"\\u203A");
replace=replace.replace(/\\x9[cC]/g,"\\u0153");
replace=replace.replace(/\\x9[dD]/g,"\\u009D");
replace=replace.replace(/\\x9[eE]/g,"\\u017E");
replace=replace.replace(/\\x9[fF]/g,"\\u0178");
replace=replace.replace(/\\b/g,"\b");
replace=replace.replace(/\\f/g,"\f");
replace=replace.replace(/\\n/g,"\n");
replace=replace.replace(/\\r/g,"\r");
replace=replace.replace(/\\t/g,"\t");
replace=replace.replace(/\\v/g,"\v");
replace=replace.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
function($0,$1,$2){
return String.fromCharCode(parseInt("0x"+$0.substring(2)));
}
);
replace=replace.replace(/\\B/g,"\\");
search=search.replace(/\\\\/g,"\\B");
search=search.replace(/\\q/g,"\"");
search=search.replace(/\\x80/g,"\\u20AC");
search=search.replace(/\\x82/g,"\\u201A");
search=search.replace(/\\x83/g,"\\u0192");
search=search.replace(/\\x84/g,"\\u201E");
search=search.replace(/\\x85/g,"\\u2026");
search=search.replace(/\\x86/g,"\\u2020");
search=search.replace(/\\x87/g,"\\u2021");
search=search.replace(/\\x88/g,"\\u02C6");
search=search.replace(/\\x89/g,"\\u2030");
search=search.replace(/\\x8[aA]/g,"\\u0160");
search=search.replace(/\\x8[bB]/g,"\\u2039");
search=search.replace(/\\x8[cC]/g,"\\u0152");
search=search.replace(/\\x8[eE]/g,"\\u017D");
search=search.replace(/\\x91/g,"\\u2018");
search=search.replace(/\\x92/g,"\\u2019");
search=search.replace(/\\x93/g,"\\u201C");
search=search.replace(/\\x94/g,"\\u201D");
search=search.replace(/\\x95/g,"\\u2022");
search=search.replace(/\\x96/g,"\\u2013");
search=search.replace(/\\x97/g,"\\u2014");
search=search.replace(/\\x98/g,"\\u02DC");
search=search.replace(/\\x99/g,"\\u2122");
search=search.replace(/\\x9[aA]/g,"\\u0161");
search=search.replace(/\\x9[bB]/g,"\\u203A");
search=search.replace(/\\x9[cC]/g,"\\u0153");
search=search.replace(/\\x9[dD]/g,"\\u009D");
search=search.replace(/\\x9[eE]/g,"\\u017E");
search=search.replace(/\\x9[fF]/g,"\\u0178");
if (options.indexOf("l")>=0) {
search=search.replace(/\\b/g,"\b");
search=search.replace(/\\f/g,"\f");
search=search.replace(/\\n/g,"\n");
search=search.replace(/\\r/g,"\r");
search=search.replace(/\\t/g,"\t");
search=search.replace(/\\v/g,"\v");
search=search.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
function($0,$1,$2){
return String.fromCharCode(parseInt("0x"+$0.substring(2)));
}
);
search=search.replace(/\\B/g,"\\");
} else search=search.replace(/\\B/g,"\\\\");
}
if (options.indexOf("l")>=0) {
options=options.replace(/l/g,"");
search=search.replace(/([.^$*+?()[{\\|])/g,"\\$1");
replace=replace.replace(/\$/g,"$$$$");
}
if (options.indexOf("b")>=0) {
options=options.replace(/b/g,"");
search="^"+search
}
if (options.indexOf("e")>=0) {
options=options.replace(/e/g,"");
search=search+"$"
}
var search=new RegExp(search,options);
var str1, str2;
if (srcVar) {
str1=env(args.Item(3));
str2=str1.replace(search,replace);
if (!alterations || str1!=str2) if (multi) {
WScript.Stdout.Write(str2);
} else {
WScript.Stdout.WriteLine(str2);
}
} else if (multi){
var buf=1024;
str1="";
while (!WScript.StdIn.AtEndOfStream) {
str1+=WScript.StdIn.Read(buf);
buf*=2
}
WScript.Stdout.Write(str1.replace(search,replace));
} else {
while (!WScript.StdIn.AtEndOfStream) {
str1=WScript.StdIn.ReadLine();
str2=str1.replace(search,replace);
if (!alterations || str1!=str2) WScript.Stdout.WriteLine(str2);
}
}
复制并保存为复制脚本。如果您认为会再次使用它,您可能希望将其放在系统路径中。否则,只需将其与您正在处理的文件放在一起。现在为该任务创建另一个文件(我将其命名为测试脚本):
@echo off
type abc.txt | repl "^[\s\q]@[^\s].*\r?\n?" "" XM | repl "[\s\q]@[^\s\q]+" "" X > abc.out.txt
这应该能满足您的要求。它已被修改为输出 Windows 行尾(我的文本编辑器不关心,所以我没有注意到这个问题)。
这部分
repl "^[\s\q]@[^\s].*\r?\n?" "" XM
将删除以引号或 @ 开头的每一行。它将忽略只有"@ some text
或@ some text
或只有@
或的行"@
(@ 后面必须至少有一个非空白字符)。您可以通过删除 来删除此要求[^\s]
。这部分
repl "[\s\q]@[^\s\q]+" "" X
将删除所有以 @ 开头且后面至少有一个非空格或引号字符的单词。
我们使用 X 参数是因为它添加了 /q 替换,这使我们能够搜索那些讨厌的引号。需要 M 选项,以便我们能够实际替换新行(此外,如果没有它,我们将在末尾有一个额外的空白行)。更多信息可以在JScript 正则表达式参考。
笔记:我现在已经解决了上述替换的一些问题,并使用更好的命令使它们变得更加简单。
如果只想显示包含@的行,那么可以使用:
type abc.txt | repl "^((?![\q\s]@\w+).)*$" "" X | repl "\r?\n?\s*$" "" M > abc.out2.txt
我花了很长时间才弄清楚如何在所有情况下工作,我可能错过了一些可能的组合。不过,它会忽略一行中的电子邮件地址和 @ 字符。RegEx 不擅长否定结果,需要使用前瞻来执行此操作。第二部分通过删除第一次调用后留下的所有空行来处理部分混乱。这可能会产生不必要的副作用,即还会删除文件中任何已经为空的行。