Extracting two strings from specific lines in multiple files and printing them to a new file, tab-separated

Answer 1

You can use sed in a loop over every file in the current folder. It extracts the relevant parts and appends them (with >>) to a file named file:

for files in *; \
do sed -n -e '/^From file/ H;' \
          -e '/Ratio of morphemes over utterances/ {H; x; s/\n//g; s/From file <\(.*\)>.*Ratio of morphemes over utterances = \([0-9]*\.[0-9]*\).*/\1:    \2/g; p;}' "$files";
done >>file


Answer 2

perl -0nE 'say "$1\t$2" if /From file <(.*?)>.*over utterances = (\d\S*)/s' * > out
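
Since the one-liner comes without explanation: -0 makes perl read each file as a single record, and the /s flag lets .* span newlines, so one regex can capture both the name and the ratio. A minimal Python sketch of the same idea follows; the function name extract and the sample text are assumptions made for illustration, not part of the original answer.

```python
import re

# Non-greedy file name inside <...>, then the first token after "= ".
# re.DOTALL plays the role of Perl's /s flag.
PATTERN = re.compile(r"From file <(.*?)>.*over utterances = (\d\S*)", re.DOTALL)


def extract(text):
    """Return 'name<TAB>ratio' for the full text of one file, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return "{}\t{}".format(m.group(1), m.group(2))


# Hypothetical input resembling the lines the answers search for.
sample = (
    "From file <adam01.cha>\n"
    "some other line\n"
    "Ratio of morphemes over utterances = 2.213\n"
)
print(extract(sample))
```

Like the one-liner, this reports one pair per file; if a file held several such blocks, the greedy .* would pair the first name with the last ratio.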


Answer 3

Since you mentioned that you are familiar with Python, here is a Python script that does the job:

#!/usr/bin/env python
from __future__ import print_function
import os
import re
import sys


def read_file(filepath):
    """Print the file name and the ratio found in filepath."""
    with open(filepath) as fd:
        for line in fd:
            clean_line = line.strip()

            # Split on '<', '>' and ' '; the name is the second-to-last item.
            if 'From file' in clean_line:
                words = re.split('<|>| ', clean_line)
                print(words[-2], end=" ")

            # Everything after the last '=' is the ratio.
            if 'Ratio of morphemes over utterances' in clean_line:
                print(clean_line.split('=')[-1])


def find_files(treeroot):
    """Walk treeroot recursively and process every file except this script."""
    selfpath = os.path.abspath(__file__)
    for dirpath, subdirs, files in os.walk(treeroot):
        for f in files:
            filepath = os.path.abspath(os.path.join(dirpath, f))
            if selfpath == filepath:
                continue
            try:
                read_file(filepath)
            except IOError:
                pass  # skip files that cannot be read


def main():
    # Default to the current directory unless a path is given.
    directory = '.'
    if len(sys.argv) == 2:
        directory = sys.argv[1]
    find_files(os.path.abspath(directory))


if __name__ == '__main__':
    main()

Sample run:

$ ./extract_data.py                                                                                               
adam02.cha  2.547
adam01.cha  2.213

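The way it works is simple: we use os.walk() to traverse the directory recursively, finding all files and excluding the script itself. For each file we run the read_file() function, which reads the file line by line and picks out the appropriate fields. re.split() is used, with space, < and > as word separators, to break the file-name line into a list of words more conveniently.

The script can take a directory as a command-line argument; if none is given, the current working directory is assumed. That way you can run the script against a given path or from the directory where the files are stored.

As for creating a new file with all the data, that is simple: use the shell's redirection, as in ./extract_data.py > /path/to/new_file.txt. Note: redirect the output to a file located in a different directory, because otherwise the new file may be picked up by os.walk() and break the script. An additional improvement is to change the for loop over the files to for f in sorted(files): so that the files are read in sorted order.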
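
The two suggestions above (sorted traversal, and keeping the output file out of the walk) could be sketched as a small generator. This is an illustrative variant under stated assumptions, not part of the original script: the name iter_files_sorted and its skip parameter are hypothetical.

```python
import os


def iter_files_sorted(treeroot, skip=()):
    """Yield file paths under treeroot in sorted order, omitting paths in skip."""
    skip = {os.path.abspath(p) for p in skip}
    for dirpath, subdirs, files in os.walk(treeroot):
        subdirs.sort()  # sorting in place makes os.walk descend in sorted order
        for f in sorted(files):
            path = os.path.abspath(os.path.join(dirpath, f))
            if path not in skip:
                yield path
```

Passing the output file's path in skip would let the redirect target live inside the tree without being read back as input.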


Answer 4

You can try this awk command:

awk '/Ratio of morphemes over utterances/{print FILENAME,$NF;next}' *.cha

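If you want to take the file name from the pattern From file <adam01.cha> inside each file instead, try the awk command below. It prints the stored name with the angle brackets stripped, rather than awk's built-in FILENAME.

awk '/From file/{filename=$NF; gsub(/[<>]/,"",filename)} filename && /Ratio of morphemes over utterances/{print filename,$NF;filename="";next}' *.txt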
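
The logic of the second command is a small state machine: remember the name from the most recent From file line, then emit it together with the last field of the next ratio line. A hedged Python sketch of that logic follows; the function name pairs is hypothetical and not taken from the answer.

```python
def pairs(lines):
    """Yield (name, ratio) tuples from an iterable of text lines."""
    name = None
    for line in lines:
        fields = line.split()
        if "From file" in line and fields:
            # Last whitespace-separated field, minus the angle brackets.
            name = fields[-1].strip("<>")
        elif name and "Ratio of morphemes over utterances" in line and fields:
            yield name, fields[-1]
            name = None  # wait for the next "From file" line
```

Unlike a whole-file regex, this handles several name/ratio blocks in one input and ignores a ratio line that has no preceding name.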

