I have a space-delimited file like this (it has 1775 lines):
head output.fam
0 ALIKE_g_1LTX827_BI_SNP_F01_33250.CEL 0 0 0 -9
0 BURRY_g_3KYJ479_BI_SNP_A12_40182.CEL 0 0 0 -9
0 ABAFT_g_4RWG569_BI_SNP_E12_35136.CEL 0 0 0 -9
0 MILLE_g_5AVC089_BI_SNP_F02_35746.CEL 0 0 0 -9
0 PEDAL_g_8WWR250_BI_SNP_B06_37732.CEL 0 0 0 -9
...
and a comma-separated file, phg000008.individualinfo (it has 1838 lines):
#Phen_Sample_ID - individual sample name associated with phenotypes
#Geno_Sample_ID - sample name associates with genotypes
#Ind_id - unique individual name which can be used to match duplicates (in this case same as Phen_Sample_ID)
#Ped_id - Pedigree ID
#Fa_id - Father individual ID
#Ma_id - Mother individual ID
#Sex - coded 1 for Male, 2 for Female
#Ind_QC_flag - value "ALL" indicates released in both Quality Filtered and Complete set
#Genotyping_Plate
#Sample_plate_well_string - This string corresponds to the file within the CEL files distribution
#Genotype_Clustering_Set
#Study-id - dbGaP assigned study id
#Phen_ID,Geno_Sample_ID,Ind_id,Ped_id,Fa_id,Ma_id,Sex,Ind_QC_flag,Genotyping_Plate,Sample_plate_well_string,Genotyping_Clustering_Set,Study_id
G1000,G1000,G1000,fam1000-,0,0,2,ALL,7FDZ321,POSED_g_7FDZ321_BI_SNP_B02_36506,set05,phs000018
G1001,G1001,G1001,fam1001-,G4243,G4205,1,ALL,3KYJ479,BURRY_g_3KYJ479_BI_SNP_H04_40068,set02,phs000018
G2208,G2208,G2208,fam2208-,G3119,G3120,2,ALL,1LTX827,ALIKE_g_1LTX827_BI_SNP_F01_33250,set01,phs000018
G1676,G1676,G1676,fam1676-,G1675,G1674,1,ALL,3KYJ479,BURRY_g_3KYJ479_BI_SNP_A12_40182,set02,phs000018
...
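I would like to update output.fam as follows. For each line, take the value in the second column, for example ALIKE_g_1LTX827_BI_SNP_F01_33250.CEL, and look it up (ignoring the .CEL suffix) in phg000008.individualinfo. If a line with that entry exists, replace the second column of output.fam with the value from the first column of phg000008.individualinfo and, on the same line, replace the first column of output.fam with the value from the fourth column of phg000008.individualinfo (without the trailing -).
For example, for these two lines, output.fam would look like this: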
fam2208 G2208 0 0 0 -9
fam1676 G1676 0 0 0 -9
Answer 1
Try
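awk '
FNR == NR {                     # first file: phg000008.individualinfo
    sub(/-$/, "", $4)           # get rid of the trailing "-" in $4
    T[$10 ".CEL"] = $4 " " $1   # save "Ped_id Phen_ID", keyed by CEL file name
    next
}
$2 in T {                       # second file: check if $2 is relevant
    split(T[$2], a, " ")        # a[1] = Ped_id, a[2] = Phen_ID
    $1 = a[1]                   # replace $1 with Ped_id
    $2 = a[2]                   # replace $2 with Phen_ID
    print                       # only lines with a match are printed
}
' FS=, phg000008.individualinfo FS=" " output.fam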
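The command above prints only the lines of output.fam that have a match. If the unmatched lines should be kept as they are, a variant along the following lines might work. This is only a sketch: the function name remap_fam and the array names fid and iid are made up for illustration, and it assumes that every comment or header line in the info file starts with #.

```shell
# remap_fam INFO FAM
# Hypothetical helper (name chosen for illustration only).
# Prints FAM with columns 1 and 2 replaced where column 2 matches a
# Sample_plate_well_string (plus ".CEL") from INFO; other lines pass through.
remap_fam() {
    awk '
    FNR == NR {                    # first file: comma-separated info file
        if ($0 ~ /^#/) next        # assumption: comments/header start with "#"
        ped = $4
        sub(/-$/, "", ped)         # drop the trailing "-" from Ped_id
        fid[$10 ".CEL"] = ped      # new column 1, keyed by CEL file name
        iid[$10 ".CEL"] = $1       # new column 2, keyed by CEL file name
        next
    }
    $2 in fid {                    # second file: known CEL name, rewrite it
        key = $2
        $1 = fid[key]
        $2 = iid[key]
    }
    { print }                      # print every line, changed or not
    ' FS=, "$1" FS=" " "$2"
}
```

It could be called as remap_fam phg000008.individualinfo output.fam > output.new.fam, writing to a new file rather than overwriting output.fam in place.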
Answer 2
You can generate a sed script from the phg file and use it to modify the fam file:
grep -v ^# phg000008.individualinfo \
| cut -d, -f3,4,10 \
| sed -E 's=(.*),(.*)-,(.*)=s/[^ ]+ \3\\.CEL/\2 \1/=' \
| grep s/ \
| sed -Ef- output.fam
The generated script looks like this:
s/[^ ]+ POSED_g_7FDZ321_BI_SNP_B02_36506\.CEL/fam1000 G1000/
s/[^ ]+ BURRY_g_3KYJ479_BI_SNP_H04_40068\.CEL/fam1001 G1001/
s/[^ ]+ ALIKE_g_1LTX827_BI_SNP_F01_33250\.CEL/fam2208 G2208/
s/[^ ]+ BURRY_g_3KYJ479_BI_SNP_A12_40182\.CEL/fam1676 G1676/
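Because sed leaves lines without a matching substitution unchanged, any sample that is missing from phg000008.individualinfo keeps its .CEL name in the second column. A quick way to see how many such lines remain could look like the sketch below; the function name count_unmapped is hypothetical, and it assumes the converted result was saved to a file.

```shell
# count_unmapped FAM
# Hypothetical helper (name chosen for illustration only).
# Prints how many lines of FAM still have a ".CEL" name in column 2.
count_unmapped() {
    awk '$2 ~ /\.CEL$/ { n++ } END { print n + 0 }' "$1"
}
```

For example, after redirecting the sed output to output.new.fam, count_unmapped output.new.fam prints 0 if every line was converted.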