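I have a list of universities in one text file, and in a separate file I have a list of publications with their affiliations. I want to write a script that checks how many times each publication is repeated and counts how many times universities collaborated. My data looks like the sample below; "p1" is a paper title and the affiliation is the institution that published the paper.
Example:
Data
UID, Affiliation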
p1 "ADPRI, S"
p1 "ADPRI, S"
p2 "ADPRI, S"
p2 "AAC&S, H"
p3 "AAC&S, H"
p3 "HU, USA"
p3 "Penn, USA"
p4 "AAC&S, H"
p5 "AAC&S, H"
p6 "AAC&S, H"
p7 "AAC&S, H"
p8 "AU, A"
p9 "AECI, A"
p10 "AECI, A"
p10 "AECI, A"
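In the data above, paper "p2" is linked to "ADPRI, S" and "AAC&S, H".
Similarly, "p3" is linked to the universities "AAC&S, H", "HU, USA" and "Penn, USA".
So my script should produce a file that gives the number of collaborations between each pair of universities. For the data above that would be
Expected output: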
College_A College_B Collaborated
ADPRI, S AAC&S, H 2
HU, USA Penn, USA 1
....
....
and so on for all the colleges.
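**I used the sort and uniq commands on column 2 to get the number of universities: it is a list of 797 universities, and my database has more than 20,000 papers published by them. My data also contains a lot of spaces and special characters.**
PS: The data is tab-separated, and I also have the same data as CSV.
Answer 1
Using Perl: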
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use Set::Intersection;

my ( %papers, @colleges );

while (<>) {
    chomp;
    # skip lines (such as a header) that are not: UID <tab> "affiliation"
    my ( $paper, $college ) = m/(\S+)\t"(.+)"/ or next;

    # normalize college names: collapse and trim whitespace
    $college =~ s/\s+/ /g;
    $college =~ s/^\s+//;
    $college =~ s/\s+$//;

    $papers{$college} //= [];
    push @{ $papers{$college} }, $paper;
}

@colleges = sort keys %papers;

# sorted, de-duplicated paper list per college (needed for -preordered)
for my $college (@colleges) {
    $papers{$college} = [ uniq sort @{ $papers{$college} } ];
}

print qq(College_A\tCollege_B\tCollaborated\n);

# compare every pair of colleges once
for ( my $i = 0 ; $i < @colleges - 1 ; $i++ ) {
    for ( my $j = $i + 1 ; $j < @colleges ; $j++ ) {
        my $collaborations = scalar get_intersection(
            { -preordered => 1 },
            $papers{ $colleges[$i] },
            $papers{ $colleges[$j] }
        );
        print $colleges[$i], "\t", $colleges[$j], "\t", $collaborations, "\n"
          if ($collaborations);
    }
}
Using Python:
#!/usr/bin/env python
from __future__ import print_function
import re
import sys
from collections import defaultdict

# college name -> set of papers it appears on
papers = defaultdict(set)

for line in sys.stdin:
    paper, college = line.split("\t")
    # strip the surrounding quotes, then collapse and trim whitespace
    college = re.sub(r'^"|"$', '', college)
    college = re.sub(r'\s+', ' ', college)
    college = re.sub(r'^\s+|\s+$', '', college)
    papers[college].add(paper)

colleges = sorted(papers.keys())

print("College_A\tCollege_B\tCollaborated")

# compare every pair of colleges once
for i in range(len(colleges) - 1):
    for j in range(i + 1, len(colleges)):
        collaborations = len(papers[colleges[i]].intersection(papers[colleges[j]]))
        if collaborations:
            print("%s\t%s\t%d" % (colleges[i], colleges[j], collaborations))
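Both scripts above compare every pair of colleges, which for 797 colleges is roughly 317,000 set intersections. A possible alternative is to group the affiliations by paper and count each pair once per paper. The following is only a sketch under assumptions: the function name `count_collaborations` and the use of the `csv` module are illustrative choices, not part of the answer, and the input is assumed to be tab-separated with optional double quotes around the affiliation.

```python
import csv
import re
from collections import Counter, defaultdict
from itertools import combinations


def count_collaborations(lines):
    """Count, for each unordered pair of colleges, the papers they share."""
    colleges_by_paper = defaultdict(set)
    # csv.reader splits on tabs and strips the double quotes around a field.
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        paper = row[0].strip()
        # Collapse runs of whitespace so differently spaced names compare equal.
        college = re.sub(r"\s+", " ", row[1]).strip()
        colleges_by_paper[paper].add(college)

    pair_counts = Counter()
    for colleges in colleges_by_paper.values():
        # sorted() gives each pair a fixed order, so (A, B) and (B, A) merge.
        for pair in combinations(sorted(colleges), 2):
            pair_counts[pair] += 1
    return pair_counts


# Small demonstration on a few rows shaped like the question's data.
sample = [
    'p1\t"ADPRI, S"\n',
    'p1\t"ADPRI, S"\n',
    'p2\t"ADPRI, S"\n',
    'p2\t"AAC&S, H"\n',
    'p3\t"AAC&S, H"\n',
    'p3\t"HU, USA"\n',
    'p3\t"Penn, USA"\n',
    'p4\t"AAC&S, H"\n',
]
print("College_A\tCollege_B\tCollaborated")
for (a, b), n in sorted(count_collaborations(sample).items()):
    print("%s\t%s\t%d" % (a, b, n))
```

Because the sets de-duplicate repeated rows such as the two `p1` lines, a paper listed twice for the same college is still counted once.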
Answer 2
A gawk solution.
Usage: ./program.awk input.txt
If the column alignment is lost, you can also pipe the output through column for a prettier display: ./program.awk input.txt | column -t -s $'\t'
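#!/usr/bin/awk -f
# Needs gawk 4.0 or later (arrays of arrays), so awk must resolve to gawk here.

# Count every ordered pair of affiliations seen on the current paper,
# then reset the per-paper list.
function pub_to_aff() {
    for(i in pub_arr) {
        for(j in pub_arr) {
            if(i != j)
                aff_arr[i][j]++;
        }
    }
    delete pub_arr;
}

BEGIN {
    OFS = "\t";
    FS = "\t";
}

# A new UID starts: flush the affiliations of the previous paper.
$1 != prev_uid {
    prev_uid = $1;
    pub_to_aff();
}

# Remember each affiliation once per paper.
{
    pub_arr[$2] = 1;
}

END {
    pub_to_aff();
    print "College_A", "College_B", "Collaborated";
    for(i in aff_arr) {
        for(j in aff_arr[i]) {
            print i, j, aff_arr[i][j];
        }
    }
}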
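Note that the script flushes `pub_arr` whenever `$1` changes, so it relies on all rows of one paper being adjacent in the input. A quick check along the following lines could confirm that before running it. This is a sketch only: `is_grouped_by_uid` is a hypothetical helper name, and tab-separated input is assumed. If it reports ungrouped input, sorting the file on the first column beforehand should help.

```python
def is_grouped_by_uid(lines):
    """Return True if every UID's rows form one contiguous run."""
    finished = set()  # UIDs whose run has already ended
    prev_uid = None
    for line in lines:
        uid = line.split("\t", 1)[0]
        if uid != prev_uid:
            if uid in finished:
                return False  # this UID reappears after a different UID
            if prev_uid is not None:
                finished.add(prev_uid)
            prev_uid = uid
    return True


# Grouped input: both p1 rows are adjacent.
print(is_grouped_by_uid(['p1\t"A"', 'p1\t"B"', 'p2\t"A"']))
# Ungrouped input: p1 shows up again after p2.
print(is_grouped_by_uid(['p1\t"A"', 'p2\t"A"', 'p1\t"B"']))
```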
Input (two lines were added for demonstration, one to p3 and one to p4):
p1 "ADPRI, S"
p1 "ADPRI, S"
p2 "ADPRI, S"
p2 "AAC&S, H"
p3 "AAC&S, H"
p3 "ADPRI, S"
p3 "HU, USA"
p3 "Penn, USA"
p4 "AAC&S, H"
p4 "ADPRI, S"
p5 "AAC&S, H"
p6 "AAC&S, H"
p7 "AAC&S, H"
p8 "AU, A"
p9 "AECI, A"
p10 "AECI, A"
p10 "AECI, A"
Output
College_A College_B Collaborated
"AAC&S, H" "HU, USA" 1
"AAC&S, H" "Penn, USA" 1
"AAC&S, H" "ADPRI, S" 3
"HU, USA" "AAC&S, H" 1
"HU, USA" "Penn, USA" 1
"HU, USA" "ADPRI, S" 1
"Penn, USA" "AAC&S, H" 1
"Penn, USA" "HU, USA" 1
"Penn, USA" "ADPRI, S" 1
"ADPRI, S" "AAC&S, H" 3
"ADPRI, S" "HU, USA" 1
"ADPRI, S" "Penn, USA" 1
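The output above lists every pair twice, once in each direction (for example "AAC&S, H" with "ADPRI, S" and "ADPRI, S" with "AAC&S, H", both with a count of 3). If only one row per pair is wanted, a post-processing step like the following sketch could drop the mirrored rows. It is not part of the gawk script: `dedupe_pairs` is a hypothetical helper name, and it assumes tab-separated rows with the header line already removed.

```python
def dedupe_pairs(rows):
    """Keep one row per unordered college pair from tab-separated rows."""
    seen = set()
    kept = []
    for row in rows:
        college_a, college_b, count = row.rstrip("\n").split("\t")
        # A frozenset ignores order, so A/B and B/A map to the same key.
        key = frozenset((college_a, college_b))
        if key in seen:
            continue  # mirrored duplicate of a pair already kept
        seen.add(key)
        kept.append((college_a, college_b, int(count)))
    return kept


# A few rows taken from the output above.
rows = [
    '"AAC&S, H"\t"HU, USA"\t1',
    '"AAC&S, H"\t"ADPRI, S"\t3',
    '"HU, USA"\t"AAC&S, H"\t1',
    '"ADPRI, S"\t"AAC&S, H"\t3',
]
for a, b, n in dedupe_pairs(rows):
    print(a, b, n, sep="\t")
```

The first occurrence of each pair is the one that survives, so the row order of the original output decides which direction is kept.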
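Edit: test on real data.
Input: I kept only part of your sample.txt and changed a few lines to demonstrate that the script works. Note that if the input file contains no collaborating universities, the script prints only one line, the header.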
WOS:000355337800046 "ACHARYA NARENDRA DEV COLL, NEW DELHI"
WOS:000355337800046 "ACHARYA NARENDRA DEV COLL, NEW DELHI"
WOS:000355337800046 "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
WOS:000328700900001 "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
WOS:000338233800012 "ADAMAS INST TECHNOL, KOLKATA"
WOS:000338233800012 "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
WOS:000349637600009 "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
WOS:000314892400031 "ADITYA INST TECHNOL & MANAGEMENT, TEKKALI"
Command used: ./program.awk sample.txt | column -t -s $'\t'
Output
College_A College_B Collaborated
"ADAMAS INST TECHNOL, KOLKATA" "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI" 1
"ACHARYA NARENDRA DEV COLL, NEW DELHI" "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA" 1
"ACHARYA PRAFULLA CHANDRA COLL. KOLKATA" "ACHARYA NARENDRA DEV COLL, NEW DELHI" 1
"ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI" "ADAMAS INST TECHNOL, KOLKATA" 1