Extracting information from a text file

Question 1

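This looks like some kind of XML or a similar markup language. Such files should not be parsed with simple regular expressions, lest you awaken TO͇̹̺ͅƝ̴ş̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡. You should use a parser written for that markup, together with your favorite scripting language.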

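This looks like OMIM or HPO data, in which case you should be able to download plain text files and make things simpler. If you can't, and you really do need to parse this file, you could do it in perl: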

perl -lne '/<.*?>([^<>]+)/ && print $1' foo.txt
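For readers who prefer Python, the one-liner above can be sketched roughly as follows. This is only an illustration under the same assumptions: the helper name `first_tag_content` is hypothetical, and the pattern is copied unchanged from the perl command.

```python
import re

# Same pattern as the perl one-liner: skip the first tag-like thing on the
# line, then capture the run of characters up to the next < or >.
TAG_CONTENT = re.compile(r"<.*?>([^<>]+)")


def first_tag_content(line):
    """Return the text following the first tag on the line, or None.

    Hypothetical helper, not part of any library.
    """
    # Strip the newline first, as perl's -l switch does.
    match = TAG_CONTENT.search(line.rstrip("\n"))
    return match.group(1) if match else None


# Illustrative input line, taken from the example file further down.
line = '<category="Modifier">Huntington disease</category>\n'
print(first_tag_content(line))  # prints: Huntington disease
```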

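However, this will break if there is more than one tag per line, if a tag's contents can span multiple lines, or if the tag's data can contain > or <. If all of your information is always between <category="whatever">blah blah</category>, you can get everything more robustly (including multi-line tag contents and embedded < or >) with: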

#!/usr/bin/env perl

## Set the start and end tags
$end="</category>"; 
$start="<category=.*?>"; 

## Read through the file line by line
while(<>){
    ## set $a to one if the current line matches $start
    $a=1 if /$start/; 
    ## If the current line matches $start, capture any relevant content.
    ## I am also removing any $start or $end tags if present.
    if(s/($start)*(.+)($end)*/$2/){
        push @lines,$2 if $a==1;
    }
    ## If the current line matches $end, capture any relevant content,
    ## print what we have saved so far, set $a back to 0 and empty the
    ## @lines array
    if(/$end/){
        map{s/$end//;}@lines;
        print "@lines\n";
        @lines=();
        $a=0;
    }
}

Save this script as foo.pl or whatever you like, make it executable, and run it on your file:

./foo.pl file.txt

For example:

$ cat file.txt 
<category="SpecificDisease">Type II 
 human complement C2 deficiency</category>
<category="Modifier">Huntington disease</category>
<category="CompositeMention">hereditary breast < and ovarian cancer</category>
<category="DiseaseClass">myopathy > cardiopathy</category>

$ ./foo.pl file.txt 
Type II   human complement C2 deficiency
Huntington disease
hereditary breast < and ovarian cancer
myopathy > cardiopathy

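Again, though: if (as is quite likely) your file is more complex than the example above, this will fail and you will need a more sophisticated approach.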
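The same idea can also be sketched in Python by reading the whole file and matching across newlines, rather than tracking state line by line. This is a minimal sketch, not a replacement for a real parser: the helper name `extract_categories` is hypothetical, and it assumes every record has exactly the <category="...">text</category> form shown in the example, so it shares the caveat above.

```python
import re

# Assumption: every record looks like <category="...">text</category>,
# as in the example file; anything outside such a pair is ignored.
# re.DOTALL lets "." match newlines so content may span several lines.
CATEGORY = re.compile(r"<category=.*?>(.*?)</category>", re.DOTALL)


def extract_categories(text):
    """Return the content of each <category=...>...</category> block.

    Hypothetical helper. Multi-line content is joined with single spaces,
    mirroring how the perl script prints its saved lines.
    """
    return [" ".join(body.split("\n")) for body in CATEGORY.findall(text)]


# Illustrative input, copied from the example file.txt.
sample = (
    '<category="SpecificDisease">Type II \n'
    ' human complement C2 deficiency</category>\n'
    '<category="Modifier">Huntington disease</category>\n'
    '<category="CompositeMention">hereditary breast < and ovarian cancer</category>\n'
    '<category="DiseaseClass">myopathy > cardiopathy</category>\n'
)

for item in extract_categories(sample):
    print(item)
```

To run it on a real file, the text could be read with `open("file.txt").read()` and passed to `extract_categories`. The lazy `.*?` stops at the first `</category>`, which is why embedded < and > survive, but nested or malformed tags would still give wrong results.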
