Extracting text from a file using the terminal?

Answer 1

Not a one-liner (although the command to run it is a single line :)), but here is a Python option:

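#!/usr/bin/env python3
import sys

file = sys.argv[1]

with open(file) as src:
    text = src.read()

# For every occurrence of "id_ad=", record where the value starts (i+6)
# and where the next "&action" after it begins.
starters = [(i+6, text.find("&action", i)) for i in range(len(text)) if text[i:i+6] == "id_ad="]
for start, end in starters:
    # Skip an "id_ad=" that has no "&action" after it.
    if end != -1:
        print(text[start:end])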

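The script first lists all occurrences (indices) of the start string "id_ad=", each paired with the index of the end string "&action" that follows it. It then prints everything between those "markers".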
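The same marker-based idea can also be sketched as a reusable function that walks the text with `str.find`. This is only an illustration: the function name `extract_between` and the sample string are made up for the example and are not part of the original answer.

```python
def extract_between(text, start="id_ad=", end="&action"):
    """Return every substring found between `start` and the next `end`."""
    results = []
    pos = text.find(start)
    while pos != -1:
        value_start = pos + len(start)
        value_end = text.find(end, value_start)
        if value_end == -1:
            # No closing marker after this point, so nothing more to extract.
            break
        results.append(text[value_start:value_end])
        # Continue searching for the next start marker.
        pos = text.find(start, value_start)
    return results


# Hypothetical sample text, made up for this example.
sample = "junk 12 id_ad=1929170&action more 7 junk id_ad=1889170&action tail"
print(extract_between(sample))
```

Searching from the previous hit with `str.find(sub, start)` avoids testing every index of the text, which may matter for large files.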

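Extracting from a prepared file:

"I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action"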

The result is:

1929170
1889170
1889170
1929990

How to use

Paste the script into an empty file, save it as extract.py, and run it with the command:

python3 <script> <file>

Note

If the marker occurs only once in the text file, the script can be much shorter:

#!/usr/bin/env python3
import sys
file = sys.argv[1]

with open(file) as src:
    text = src.read()
print(text[text.find("id_ad=")+6:text.find("&action")])
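The short version searches for both markers from the start of the text, so it would print an empty string if "&action" happened to appear before "id_ad=". A minimal sketch that avoids this, assuming the first occurrence is the one wanted, uses `str.partition`. The function name `extract_first` is hypothetical.

```python
def extract_first(text, start="id_ad=", end="&action"):
    """Return the text between the first `start` and the `end` after it, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker missing
    value, found_end, _ = rest.partition(end)
    # Only report a value when the end marker really follows the start marker.
    return value if found_end else None


print(extract_first("some text id_ad=42&action more text"))
```

Returning `None` when a marker is missing lets the caller tell "no match" apart from an empty value.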


Answer 2

For example:

 egrep "id_ad=[[:digit:]]+&action" file.txt |  tr "=&" "  " | cut -d " " -f2

...but I'm sure there are more elegant ways ;-).

Step by step:

egrep "id_ad=[[:digit:]]+&action" file.txt

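Scans file.txt for the pattern (regular expression) id_ad=[[:digit:]]+&action, which consists of the literal id_ad=, followed by one or more digits (the meaning of [[:digit:]]+), followed by the literal &action. Matching lines are sent to standard output.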

tr "=&" "  "

Translates each of the characters "=" and "&" into a space.

cut -d " " -f2

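Prints the second space-delimited field of standard input. This is the number only when no space precedes id_ad= on the line.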
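To make the three stages concrete, here is a rough Python rendering of the same pipeline. It is a sketch, not a replacement for the shell command: the function name `pipeline` and the sample lines are invented for the example, and `[0-9]` stands in for `[[:digit:]]`.

```python
import re

PATTERN = re.compile(r"id_ad=[0-9]+&action")   # the egrep stage
TABLE = str.maketrans("=&", "  ")              # the tr stage: "=" and "&" become spaces


def pipeline(lines):
    """Mimic: egrep PATTERN | tr "=&" "  " | cut -d " " -f2"""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if PATTERN.search(line):
            fields = line.translate(TABLE).split(" ")
            out.append(fields[1])              # the cut stage: second field
    return out


# Hypothetical input lines, made up for this example.
print(pipeline([
    "id_ad=1929170&action",
    "no match here",
    "see id_ad=1889170&action",
]))
```

The third sample line shows the limitation of counting fields: because a word and a space precede id_ad=, the second field is "id_ad" rather than the number.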


Answer 3

Using sed:

sed 's/id_ad=\(.*\)&action/\1/' filename

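Explanation:

The command above returns any string (.*) found between the START word (id_ad=) and the END word (&action) in filename.
\(...\) is used for a capture group. \( starts the capture group and \) ends it. With \1 we print the group by its index (we have a single capture group).

A better sed command for the solution above would be something like: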

sed 's/^id_ad=\([0-9]*\)&action/\1/' filename

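^: the beginning of the line, so this form only matches lines that start with id_ad=.
[0-9]*: any digit, occurring zero or more times.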
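For readers more at home in Python, the two substitutions can be approximated with `re.sub`. This is a sketch under the assumption that each input line is processed separately, as sed does. The function names are hypothetical, and Python writes groups as `( )` where sed's basic syntax uses `\( \)`.

```python
import re


def sed_greedy(line):
    # Counterpart of: sed 's/id_ad=\(.*\)&action/\1/'
    return re.sub(r"id_ad=(.*)&action", r"\1", line, count=1)


def sed_anchored(line):
    # Counterpart of: sed 's/^id_ad=\([0-9]*\)&action/\1/'
    return re.sub(r"^id_ad=([0-9]*)&action", r"\1", line, count=1)


print(sed_greedy("a id_ad=1&action b id_ad=2&action c"))
print(sed_anchored("id_ad=1929170&action rest"))
print(sed_anchored("text id_ad=1&action"))
```

The first call illustrates why `.*` is risky: it is greedy, so with two markers on one line the group runs up to the last &action. The last call shows that a line not starting with id_ad= passes through unchanged, just as sed prints non-matching lines as they are.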

Using grep:

grep -Po '(?<=id_ad=)[0-9]*(?=&action)' filename

Explanation:

From man grep:

-o, --only-matching
      Print only the matched (non-empty) parts of a matching line,
      with each such part on a separate output line.
-P, --perl-regexp
      Interpret PATTERN as a Perl compatible regular expression (PCRE)

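[0-9]* returns any digits, occurring zero or more times, found between the START word (id_ad=) and the END word (&action) in filename.

(?<=pattern): positive lookbehind. A pair of parentheses, with the opening parenthesis followed by a question mark, a "less than" sign, and an equals sign.

(?<=id_ad=)[0-9]* (positive lookbehind) matches zero or more digits that follow id_ad= in filename.

(?=pattern): positive lookahead. A pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

[0-9]*(?=&action) (positive lookahead) matches zero or more digits that are followed by the pattern (&action), without making the pattern (&action) part of the match.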
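The same lookbehind and lookahead syntax is accepted by Python's `re` module, which makes it easy to experiment with. The sketch below is an approximation of `grep -Po`, not an exact equivalent: the function name `grep_po` is hypothetical, and empty matches are filtered out by hand because `-o` prints only non-empty matches.

```python
import re

# Same pattern as in the grep command: digits preceded by "id_ad="
# and followed by "&action", with neither marker included in the match.
LOOKAROUND = re.compile(r"(?<=id_ad=)[0-9]*(?=&action)")


def grep_po(text):
    """Return the non-empty matches, roughly what grep -Po would print."""
    return [m for m in LOOKAROUND.findall(text) if m]


print(grep_po("x id_ad=1929170&action y id_ad=1889170&action 55"))
```

Because the markers sit inside lookarounds, they constrain where a match may occur without becoming part of it, so no capture group is needed.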


Answer 4

Another Python answer, via the re module. The example is taken from Jacob's post.

script.py

#!/usr/bin/python3
import sys
import re

file = sys.argv[1]
L = []                                                  # Start with an empty list
with open(file) as src:
    for j in src:                                       # Iterate through all the lines
        for i in re.findall(r'id_ad=(\d+)&action', j):  # Extract the digits found between `id_ad=` and `&action`
            L.append(i)                                 # Append the extracted digits to the list L
for f in L:                                             # Iterate through all the elements of L
    print(f)                                            # Print each element on its own line

Run the above script with:

python3 script.py /path/to/the/file

Example:

$ cat fi
I want to process the body of text and extract an integer from a specific position in the text, but I'm not sure how to describe that 'particular position'. Regular expressions really confuse me. I spent (wasted) a couple hours reading tutorials and I feel no closer to an answer :( There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

 id_ad=1929170&action There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains

 id_ad=1889170&action and then followed by a bunch of garbage I don't care about, again it may or may not include one or more integers. There's a bunch of text which may or may not include integers (that I don't want) and then there's a line that always contains id_ad=1929990&action

$ python3 script.py ~/file
1929170
1889170
1889170
1929990

