Counting occurrences of patterns from an input file matched against a large file

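I have a list of universities in one text file, and in a separate file a list of publications with their affiliations. I want to write a script that checks how many times each publication is repeated and counts how many times the universities collaborated. My data looks like the following; "p1" is the paper title and the affiliation is the institution that published the paper.

Example:

Data

UID, Affiliation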

p1    "ADPRI, S"
p1    "ADPRI, S"
p2    "ADPRI, S"
p2    "AAC&S, H"
p3    "AAC&S, H"
p3    "HU, USA" 
p3    "Penn, USA"
p4    "AAC&S, H"  
p5    "AAC&S, H"  
p6    "AAC&S, H"  
p7    "AAC&S, H"  
p8    "AU, A"  
p9    "AECI, A"  
p10   "AECI, A" 
p10   "AECI, A" 

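In the data above, paper "p2" is linked to "ADPRI, S" and "AAC&S, H".
Similarly, "p3" is linked to the universities "AAC&S, H", "HU, USA" and "Penn, USA".
So my script should produce a file giving the number of collaborations between each pair of universities. For the data above it would be

Desired output: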

 College_A       College_B       Collaborated
  ADPRI, S       AAC&S, H            2
  HU, USA        Penn, USA           1
  ....
  ....
 so on for all the colleges,

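**I used the sort and uniq commands on column 2 to get the list of universities: there are 797 of them, and my database holds more than 20,000 papers published by these universities. My data also contains a lot of spaces and special characters.**

PS: The data is tab-delimited, and I also have the same data as CSV.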
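
As a rough Python counterpart to running sort and uniq on column 2, the sketch below reads the tab-delimited data and counts rows per affiliation after collapsing stray whitespace. The function name `count_affiliations` and the inline sample are illustrative assumptions, not part of the original data set.

```python
import csv
import io
from collections import Counter


def count_affiliations(stream):
    """Count rows per affiliation in tab-delimited (UID, affiliation) text."""
    counts = Counter()
    # With delimiter="\t", csv.reader also removes the surrounding double quotes.
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Collapse runs of whitespace and trim both ends of the name.
        college = " ".join(row[1].split())
        counts[college] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; real input would come from the data file.
    sample = io.StringIO('p1\t"ADPRI, S"\np2\t"ADPRI, S"\np3\t"HU, USA" \n')
    for college, n in sorted(count_affiliations(sample).items()):
        print("%s\t%d" % (college, n))
```

The number of keys in the returned counter corresponds to the number of distinct universities.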

Answer 1

Using Perl:

#!/usr/bin/env perl

use strict;
use warnings;

use List::MoreUtils qw(uniq);
use Set::Intersection;

my ( %papers, @colleges );

while (<>) {          
    chomp; 
    my ( $paper, $college ) = m/(\S+)\t"(.+)"/;
    next unless defined $college;    # skip blank or malformed lines

    # normalize college names: collapse whitespace runs, trim both ends
    $college =~ s/\s+/ /g;
    $college =~ s/^\s+//;
    $college =~ s/\s+$//;

    $papers{$college} //= [];
    push @{ $papers{$college} }, $paper;
}

@colleges = sort keys %papers;
for my $college (@colleges) {
    $papers{$college} = [ uniq sort @{ $papers{$college} } ];
}

print qq(College_A\tCollege_B\tCollaborated\n);
for ( my $i = 0 ; $i < @colleges - 1 ; $i++ ) {
    for ( my $j = $i + 1 ; $j < @colleges ; $j++ ) {
        my $collaborations = scalar get_intersection(
            { -preordered => 1 },
            $papers{ $colleges[$i] },
            $papers{ $colleges[$j] }
        );  
        print $colleges[$i], "\t", $colleges[$j], "\t", $collaborations, "\n"
          if ($collaborations);
    }
}

Using Python:

#!/usr/bin/env python

from __future__ import print_function

import re
import sys
from collections import defaultdict

papers = defaultdict(set)
for line in sys.stdin:
    if not line.strip():
        continue  # skip blank lines
    paper, college = line.rstrip("\n").split("\t", 1)
    # collapse and trim whitespace first, so trailing blanks do not hide the closing quote
    college = re.sub(r'\s+', ' ', college).strip()
    college = re.sub(r'^"|"$', '', college)
    papers[college].add(paper)

colleges = sorted(papers.keys())

print("College_A\tCollege_B\tCollaborated")
for i in range(len(colleges) - 1):
    for j in range(i + 1, len(colleges)):
        collaborations = len(papers[colleges[i]].intersection(papers[colleges[j]]))
        if collaborations:
            print("%s\t%s\t%d" % (colleges[i], colleges[j], collaborations))
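
Comparing every pair of colleges means on the order of 797 × 797 set intersections. An alternative sketch, not taken from the answer above, groups the affiliations by paper and counts each unordered pair once per paper. The function name `count_collaborations` and the inline sample are illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_collaborations(rows):
    """Count, for each unordered pair of colleges, the papers they share.

    `rows` is an iterable of (paper_id, college) tuples (hypothetical input shape).
    """
    colleges_by_paper = defaultdict(set)
    for paper, college in rows:
        # A set drops repeated (paper, college) lines such as the two p1 rows.
        colleges_by_paper[paper].add(college)

    pair_counts = Counter()
    for colleges in colleges_by_paper.values():
        # Sorting gives each pair one canonical order, so (A, B) and (B, A) merge.
        for pair in combinations(sorted(colleges), 2):
            pair_counts[pair] += 1
    return pair_counts


if __name__ == "__main__":
    sample = [
        ("p1", "ADPRI, S"), ("p1", "ADPRI, S"),
        ("p2", "ADPRI, S"), ("p2", "AAC&S, H"),
        ("p3", "AAC&S, H"), ("p3", "HU, USA"), ("p3", "Penn, USA"),
    ]
    print("College_A\tCollege_B\tCollaborated")
    for (a, b), n in sorted(count_collaborations(sample).items()):
        print("%s\t%s\t%d" % (a, b, n))
```

The work here grows with the number of affiliations per paper rather than with the square of the number of colleges, which may matter for 20,000 papers.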

Answer 2

A gawk solution.

Usage: ./program.awk input.txt

Alternatively, if the column alignment is lost, you can run ./program.awk input.txt | column -t -s $'\t' for a prettier display.

#!/usr/bin/awk -f

function pub_to_aff() {
    for(i in pub_arr) {
        for(j in pub_arr) {
            if(i != j)
                aff_arr[i][j]++;    
        }   
    }   
    delete pub_arr;
}

BEGIN {
    OFS = "\t";
    FS = "\t";
}

$1 != prev_uid {
    prev_uid = $1; 
    pub_to_aff();
}
{
    pub_arr[$2] = 1;
}

END {
    pub_to_aff();
    print "College_A", "College_B", "Collaborated";

    for(i in aff_arr) {
        for(j in aff_arr[i]) {
            print i, j, aff_arr[i][j];          
        }   
    }   
}
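
Because the script increments both aff_arr[i][j] and aff_arr[j][i], every pair is printed in both directions. If one row per pair is preferred, a small post-processing step could fold them. This is a sketch in Python; the function name `fold_symmetric` and the row shape are assumptions for illustration.

```python
def fold_symmetric(rows):
    """Keep one entry per unordered pair from (college_a, college_b, count) rows."""
    folded = {}
    for a, b, count in rows:
        # Sorting the two names gives both directions the same key.
        key = tuple(sorted((a, b)))
        # Both directions carry the same count, so the first one seen is kept.
        folded.setdefault(key, count)
    return folded


if __name__ == "__main__":
    rows = [
        ("AAC&S, H", "ADPRI, S", 3),
        ("ADPRI, S", "AAC&S, H", 3),
        ("HU, USA", "Penn, USA", 1),
        ("Penn, USA", "HU, USA", 1),
    ]
    for (a, b), n in sorted(fold_symmetric(rows).items()):
        print("%s\t%s\t%d" % (a, b, n))
```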

Input - two lines were added for demonstration, to p3 and p4

p1  "ADPRI, S"
p1  "ADPRI, S"
p2  "ADPRI, S"
p2  "AAC&S, H"
p3  "AAC&S, H"
p3  "ADPRI, S"
p3  "HU, USA"
p3  "Penn, USA"
p4  "AAC&S, H"
p4  "ADPRI, S"
p5  "AAC&S, H"
p6  "AAC&S, H"
p7  "AAC&S, H"
p8  "AU, A"
p9  "AECI, A"
p10 "AECI, A"
p10 "AECI, A"

Output

College_A   College_B   Collaborated
"AAC&S, H"  "HU, USA"   1
"AAC&S, H"  "Penn, USA" 1
"AAC&S, H"  "ADPRI, S"  3
"HU, USA"   "AAC&S, H"  1
"HU, USA"   "Penn, USA" 1
"HU, USA"   "ADPRI, S"  1
"Penn, USA" "AAC&S, H"  1
"Penn, USA" "HU, USA"   1
"Penn, USA" "ADPRI, S"  1
"ADPRI, S"  "AAC&S, H"  3
"ADPRI, S"  "HU, USA"   1
"ADPRI, S"  "Penn, USA" 1

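Edit - testing on real data.

Input - I kept only part of the contents of your sample.txt and changed a few lines to demonstrate that the script works. Note that if the input file contains no collaborating universities, the script outputs only one line, the header.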

WOS:000355337800046 "ACHARYA NARENDRA DEV COLL, NEW DELHI"
WOS:000355337800046 "ACHARYA NARENDRA DEV COLL, NEW DELHI"
WOS:000355337800046 "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
WOS:000328700900001 "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
WOS:000338233800012 "ADAMAS INST TECHNOL, KOLKATA"
WOS:000338233800012 "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
WOS:000349637600009 "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
WOS:000314892400031 "ADITYA INST TECHNOL & MANAGEMENT, TEKKALI"

Command used: ./program.awk sample.txt | column -t -s $'\t'

Output

College_A                                            College_B                                            Collaborated
"ADAMAS INST TECHNOL, KOLKATA"                       "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"  1
"ACHARYA NARENDRA DEV COLL, NEW DELHI"               "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"             1
"ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"             "ACHARYA NARENDRA DEV COLL, NEW DELHI"               1
"ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"  "ADAMAS INST TECHNOL, KOLKATA"                       1
