脚本用于查找文件中的某些单词并删除这些行

脚本用于查找文件中的某些单词并删除这些行

编辑了问题,因为我愿意接受不同类型的解决方案,而不像以前那样只是批量处理 我在 Windows 上,并且有一些建议的 SED 等。所以我可以使用命令行来使用这些第三方独立 exe

假设我在 abc.txt 文件中有以下几行

"@yuy007 what are you doing friend #disneyrocks"
"STFU, i dont care what you think @happy55"
"@social88 @gg99 ok mate see you at the subway :)"
"btw arnold was great in that movie @tt11 @gg11 #disneyrocks"
"we are going to disney. Do you want to? #disneyrocks"
"We dont like disney. #disneyrocks we are not going" 
".@socialguy what are you upto #disneyrocks " 

我需要对上述文件使用 5 个过滤器来获取 def.txt

  1. 删除所有以 @ 字符开头的行,如第 1 行和第 3 行
  2. 删除所有以 .@ 字符开头的行,例如第 7 行
  3. 删除所有不包含以 # 开头的单词的行,例如 2nd 和 3rd
  4. 在剩下的行中,删除所有以 @ 字符开头的单词(保持行完整),例如第二个中的单词 @happy55、第三个中的单词 @social99 和 @gg99,等等。在这种情况下,我们仍然需要保留行首和行末的引号“
  5. 删除上述行后留下的所有空白行

编辑 如果我有以下行,它会错误地删除@word 之后的内容

"btw arnold was great in that movie @tt101 @gb1997 #whatthehell"

编辑为

"btw arnold was great in that movie"

谢谢

答案1

您将需要使用正则表达式来实现这一点。由于您已指定 BATCH 作为首选脚本语言,因此我们需要添加该功能。我们可以通过多种方式实现这一点,但我喜欢此版本由 dostips.com 上名为 Dave Benham 的某人编写,因为它仅使用应该已经存在于您的机器上的二进制文件:

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::************ Documentation ***********
::REPL.BAT version 4.1
:::
:::REPL  Search  Replace  [Options  [SourceVar]]
:::REPL  /?[REGEX|REPLACE]
:::REPL  /V
:::
:::  Performs a global regular expression search and replace operation on
:::  each line of input from stdin and prints the result to stdout.
:::
:::  Each parameter may be optionally enclosed by double quotes. The double
:::  quotes are not considered part of the argument. The quotes are required
:::  if the parameter contains a batch token delimiter like space, tab, comma,
:::  semicolon. The quotes should also be used if the argument contains a
:::  batch special character like &, |, etc. so that the special character
:::  does not need to be escaped with ^.
:::
:::  If called with a single argument of /?, then prints help documentation
:::  to stdout. If a single argument of /?REGEX, then opens up Microsoft's
:::  JScript regular expression documentation within your browser. If a single
:::  argument of /?REPLACE, then opens up Microsoft's JScript REPLACE
:::  documentation within your browser.
:::
:::  If called with a single argument of /V, case insensitive, then prints
:::  the version of REPL.BAT.
:::
:::  Search  - By default, this is a case sensitive JScript (ECMA) regular
:::            expression expressed as a string.
:::
:::            JScript regex syntax documentation is available at
:::            http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx
:::
:::  Replace - By default, this is the string to be used as a replacement for
:::            each found search expression. Full support is provided for
:::            substituion patterns available to the JScript replace method.
:::
:::            For example, $& represents the portion of the source that matched
:::            the entire search pattern, $1 represents the first captured
:::            submatch, $2 the second captured submatch, etc. A $ literal
:::            can be escaped as $$.
:::
:::            An empty replacement string must be represented as "".
:::
:::            Replace substitution pattern syntax is fully documented at
:::            http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx
:::
:::  Options - An optional string of characters used to alter the behavior
:::            of REPL. The option characters are case insensitive, and may
:::            appear in any order.
:::
:::            I - Makes the search case-insensitive.
:::
:::            L - The Search is treated as a string literal instead of a
:::                regular expression. Also, all $ found in Replace are
:::                treated as $ literals.
:::
:::            B - The Search must match the beginning of a line.
:::                Mostly used with literal searches.
:::
:::            E - The Search must match the end of a line.
:::                Mostly used with literal searches.
:::
:::            V - Search and Replace represent the name of environment
:::                variables that contain the respective values. An undefined
:::                variable is treated as an empty string.
:::
:::            A - Only print altered lines. Unaltered lines are discarded.
:::                If both the M and V options are present, then prints the
:::                entire result if there was a change anywhere in the string.
:::                The A option is incompatible with the M option unless the S
:::                option is also present.
:::
:::            M - Multi-line mode. The entire contents of stdin is read and
:::                processed in one pass instead of line by line, thus enabling
:::                search for \n. This also enables preservation of the original
:::                line terminators. If the M option is not present, then every
:::                printed line is termiated with carriage return and line feed.
:::                The M option is incompatible with the A option unless the S
:::                option is also present.
:::
:::                Note: If working with binary data containing NULL bytes,
:::                      then the M option must be used.
:::
:::            X - Enables extended substitution pattern syntax with support
:::                for the following escape sequences within the Replace string:
:::
:::                \\     -  Backslash
:::                \b     -  Backspace
:::                \f     -  Formfeed
:::                \n     -  Newline
:::                \q     -  Quote
:::                \r     -  Carriage Return
:::                \t     -  Horizontal Tab
:::                \v     -  Vertical Tab
:::                \xnn   -  Extended ASCII byte code expressed as 2 hex digits
:::                \unnnn -  Unicode character expressed as 4 hex digits
:::
:::                Also enables the \q escape sequence for the Search string.
:::                The other escape sequences are already standard for a regular
:::                expression Search string.
:::
:::                Also modifies the behavior of \xnn in the Search string to work
:::                properly with extended ASCII byte codes.
:::
:::                Extended escape sequences are supported even when the L option
:::                is used. Both Search and Replace support all of the extended
:::                escape sequences if both the X and L opions are combined.
:::
:::            S - The source is read from an environment variable instead of
:::                from stdin. The name of the source environment variable is
:::                specified in the next argument after the option string. Without
:::                the M option, ^ anchors the beginning of the string, and $ the
:::                end of the string. With the M option, ^ anchors the beginning
:::                of a line, and $ the end of a line.
:::
::: REPL.BAT was written by Dave Benham, with assistance from DosTips user Aacini
::: to get \xnn to work properly with extended ASCII byte codes. Also assistance
::: from DosTips user penpen diagnosing issues reading NULL bytes, along with a
::: workaround. REPL.BAT was originally posted at:
::: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855
:::

::************ Batch portion ***********
@echo off
if .%2 equ . (
  if "%~1" equ "/?" (
    <"%~f0" cscript //E:JScript //nologo "%~f0" "^:::" "" a
    exit /b 0
  ) else if /i "%~1" equ "/?regex" (
    explorer "http://msdn.microsoft.com/en-us/library/ae5bf541(v=vs.80).aspx"
    exit /b 0
  ) else if /i "%~1" equ "/?replace" (
    explorer "http://msdn.microsoft.com/en-US/library/efy6s3e6(v=vs.80).aspx"
    exit /b 0
  ) else if /i "%~1" equ "/V" (
    <"%~f0" cscript //E:JScript //nologo "%~f0" "^::(REPL\.BAT version)" "$1" a
    exit /b 0
  ) else (
    call :err "Insufficient arguments"
    exit /b 1
  )
)
echo(%~3|findstr /i "[^SMILEBVXA]" >nul && (
  call :err "Invalid option(s)"
  exit /b 1
)
echo(%~3|findstr /i "M"|findstr /i "A"|findstr /vi "S" >nul && (
  call :err "Incompatible options"
  exit /b 1
)
cscript //E:JScript //nologo "%~f0" %*
exit /b 0

:err
>&2 echo ERROR: %~1. Use REPL /? to get help.
exit /b

************* JScript portion **********/
var env=WScript.CreateObject("WScript.Shell").Environment("Process");
var args=WScript.Arguments;
var search=args.Item(0);
var replace=args.Item(1);
var options="g";
if (args.length>2) options+=args.Item(2).toLowerCase();
var multi=(options.indexOf("m")>=0);
var alterations=(options.indexOf("a")>=0);
if (alterations) options=options.replace(/a/g,"");
var srcVar=(options.indexOf("s")>=0);
if (srcVar) options=options.replace(/s/g,"");
if (options.indexOf("v")>=0) {
  options=options.replace(/v/g,"");
  search=env(search);
  replace=env(replace);
}
if (options.indexOf("x")>=0) {
  options=options.replace(/x/g,"");
  replace=replace.replace(/\\\\/g,"\\B");
  replace=replace.replace(/\\q/g,"\"");
  replace=replace.replace(/\\x80/g,"\\u20AC");
  replace=replace.replace(/\\x82/g,"\\u201A");
  replace=replace.replace(/\\x83/g,"\\u0192");
  replace=replace.replace(/\\x84/g,"\\u201E");
  replace=replace.replace(/\\x85/g,"\\u2026");
  replace=replace.replace(/\\x86/g,"\\u2020");
  replace=replace.replace(/\\x87/g,"\\u2021");
  replace=replace.replace(/\\x88/g,"\\u02C6");
  replace=replace.replace(/\\x89/g,"\\u2030");
  replace=replace.replace(/\\x8[aA]/g,"\\u0160");
  replace=replace.replace(/\\x8[bB]/g,"\\u2039");
  replace=replace.replace(/\\x8[cC]/g,"\\u0152");
  replace=replace.replace(/\\x8[eE]/g,"\\u017D");
  replace=replace.replace(/\\x91/g,"\\u2018");
  replace=replace.replace(/\\x92/g,"\\u2019");
  replace=replace.replace(/\\x93/g,"\\u201C");
  replace=replace.replace(/\\x94/g,"\\u201D");
  replace=replace.replace(/\\x95/g,"\\u2022");
  replace=replace.replace(/\\x96/g,"\\u2013");
  replace=replace.replace(/\\x97/g,"\\u2014");
  replace=replace.replace(/\\x98/g,"\\u02DC");
  replace=replace.replace(/\\x99/g,"\\u2122");
  replace=replace.replace(/\\x9[aA]/g,"\\u0161");
  replace=replace.replace(/\\x9[bB]/g,"\\u203A");
  replace=replace.replace(/\\x9[cC]/g,"\\u0153");
  replace=replace.replace(/\\x9[dD]/g,"\\u009D");
  replace=replace.replace(/\\x9[eE]/g,"\\u017E");
  replace=replace.replace(/\\x9[fF]/g,"\\u0178");
  replace=replace.replace(/\\b/g,"\b");
  replace=replace.replace(/\\f/g,"\f");
  replace=replace.replace(/\\n/g,"\n");
  replace=replace.replace(/\\r/g,"\r");
  replace=replace.replace(/\\t/g,"\t");
  replace=replace.replace(/\\v/g,"\v");
  replace=replace.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
    function($0,$1,$2){
      return String.fromCharCode(parseInt("0x"+$0.substring(2)));
    }
  );
  replace=replace.replace(/\\B/g,"\\");
  search=search.replace(/\\\\/g,"\\B");
  search=search.replace(/\\q/g,"\"");
  search=search.replace(/\\x80/g,"\\u20AC");
  search=search.replace(/\\x82/g,"\\u201A");
  search=search.replace(/\\x83/g,"\\u0192");
  search=search.replace(/\\x84/g,"\\u201E");
  search=search.replace(/\\x85/g,"\\u2026");
  search=search.replace(/\\x86/g,"\\u2020");
  search=search.replace(/\\x87/g,"\\u2021");
  search=search.replace(/\\x88/g,"\\u02C6");
  search=search.replace(/\\x89/g,"\\u2030");
  search=search.replace(/\\x8[aA]/g,"\\u0160");
  search=search.replace(/\\x8[bB]/g,"\\u2039");
  search=search.replace(/\\x8[cC]/g,"\\u0152");
  search=search.replace(/\\x8[eE]/g,"\\u017D");
  search=search.replace(/\\x91/g,"\\u2018");
  search=search.replace(/\\x92/g,"\\u2019");
  search=search.replace(/\\x93/g,"\\u201C");
  search=search.replace(/\\x94/g,"\\u201D");
  search=search.replace(/\\x95/g,"\\u2022");
  search=search.replace(/\\x96/g,"\\u2013");
  search=search.replace(/\\x97/g,"\\u2014");
  search=search.replace(/\\x98/g,"\\u02DC");
  search=search.replace(/\\x99/g,"\\u2122");
  search=search.replace(/\\x9[aA]/g,"\\u0161");
  search=search.replace(/\\x9[bB]/g,"\\u203A");
  search=search.replace(/\\x9[cC]/g,"\\u0153");
  search=search.replace(/\\x9[dD]/g,"\\u009D");
  search=search.replace(/\\x9[eE]/g,"\\u017E");
  search=search.replace(/\\x9[fF]/g,"\\u0178");
  if (options.indexOf("l")>=0) {
    search=search.replace(/\\b/g,"\b");
    search=search.replace(/\\f/g,"\f");
    search=search.replace(/\\n/g,"\n");
    search=search.replace(/\\r/g,"\r");
    search=search.replace(/\\t/g,"\t");
    search=search.replace(/\\v/g,"\v");
    search=search.replace(/\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}/g,
      function($0,$1,$2){
        return String.fromCharCode(parseInt("0x"+$0.substring(2)));
      }
    );
    search=search.replace(/\\B/g,"\\");
  } else search=search.replace(/\\B/g,"\\\\");
}
if (options.indexOf("l")>=0) {
  options=options.replace(/l/g,"");
  search=search.replace(/([.^$*+?()[{\\|])/g,"\\$1");
  replace=replace.replace(/\$/g,"$$$$");
}
if (options.indexOf("b")>=0) {
  options=options.replace(/b/g,"");
  search="^"+search
}
if (options.indexOf("e")>=0) {
  options=options.replace(/e/g,"");
  search=search+"$"
}
var search=new RegExp(search,options);
var str1, str2;

if (srcVar) {
  str1=env(args.Item(3));
  str2=str1.replace(search,replace);
  if (!alterations || str1!=str2) if (multi) {
    WScript.Stdout.Write(str2);
  } else {
    WScript.Stdout.WriteLine(str2);
  }
} else if (multi){
  var buf=1024;
  str1="";
  while (!WScript.StdIn.AtEndOfStream) {
    str1+=WScript.StdIn.Read(buf);
    buf*=2
  }
  WScript.Stdout.Write(str1.replace(search,replace));
} else {
  while (!WScript.StdIn.AtEndOfStream) {
    str1=WScript.StdIn.ReadLine();
    str2=str1.replace(search,replace);
    if (!alterations || str1!=str2) WScript.Stdout.WriteLine(str2);
  }
}

复制并保存为复制脚本。如果您认为会再次使用它,您可能希望将其放在系统路径中。否则,只需将其与您正在处理的文件放在一起。现在为该任务创建另一个文件(我将其命名为测试脚本):

@echo off
type abc.txt | repl "^[\s\q]@[^\s].*\r?\n?" "" XM | repl "[\s\q]@[^\s\q]+" "" X > abc.out.txt

这应该能满足您的要求。它已被修改为输出 Windows 行尾(我的文本编辑器不关心,所以我没有注意到这个问题)。

  • 这部分repl "^[\s\q]@[^\s].*\r?\n?" "" XM将删除以引号或 @ 开头的每一行。它将忽略只有"@ some text@ some text或只有@或的行"@(@ 后面必须至少有一个非空白字符)。您可以通过删除 来删除此要求[^\s]

  • 这部分repl "[\s\q]@[^\s\q]+" "" X将删除所有以 @ 开头且后面至少有一个非空格或引号字符的单词。

我们使用 X 参数是因为它添加了 /q 替换,这使我们能够搜索那些讨厌的引号。需要 M 选项,以便我们能够实际替换新行(此外,如果没有它,我们将在末尾有一个额外的空白行)。更多信息可以在JScript 正则表达式参考。

笔记:我现在已经解决了上述替换的一些问题,并使用更好的命令使它们变得更加简单。


如果只想显示包含@的行,那么可以使用:

type abc.txt | repl "^((?![\q\s]@\w+).)*$" "" X | repl "\r?\n?\s*$" "" M > abc.out2.txt

我花了很长时间才弄清楚如何在所有情况下工作,我可能错过了一些可能的组合。不过,它会忽略一行中的电子邮件地址和 @ 字符。RegEx 不擅长否定结果,需要使用前瞻来执行此操作。第二部分通过删除第一次调用后留下的所有空行来处理部分混乱。这可能会产生不必要的副作用,即还会删除文件中任何已经为空的行。

相关内容