Compare two columns of one file with two columns of another file: Python or Bash?

2024-6-5

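I have two tab-delimited text files and I am trying to merge them based on two columns. The files are unsorted and have no header row. Also, `final.tsv` is large: it contains about 2 million rows.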

  **`final.tsv`**

    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000028  Cryptorchidism  MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000083  Renal insufficiency MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000091  Abnormal renal tubule morphology    MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000093  Proteinuria MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000121  Nephrocalcinosis    MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000164  Abnormality of the dentition    MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000189  Narrow palate   MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000194  Open mouth  MONDO:0010645   oculocerebrorenal syndrome
    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000219  Thin upper lip vermilion    MONDO:0010645   oculocerebrorenal syndrome



  **`om.tsv`**

    309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000028  OMIM:309000 XLR
    309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000083  OMIM:309000 XLR
    309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000093  OMIM:309000 XLR
    309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000501  OMIM:309000 XLR
    309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000505  OMIM:309000 XLR

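The task is to match columns 6 and 3 of `final.tsv` against columns 2 and 3 of `om.tsv`. When both columns match, the rows from the two files should be merged and written to a matched file. When they do not match, the whole row should be written to a separate unmatched file. Note also that I need a case-insensitive approach in which the match is keyword-based.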

With the data above, for example, `oculocerebrorenal syndrome` should match `LOWE OCULOCEREBRORENAL SYNDROME`.
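To make "keyword-based" concrete, here is a minimal Python sketch of one possible reading: two names match when every word of the `final.tsv` name also occurs in the `om.tsv` name, ignoring case. The function names `keywords` and `names_match` are made up for illustration, and the subset rule itself is an assumption about what should count as a match.

```python
import re


def keywords(text):
    """Lowercase a disease name and return its alphanumeric words as a set."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def names_match(final_name, om_name):
    """Assumed rule: every keyword of final_name must also appear in om_name."""
    wanted = keywords(final_name)
    return bool(wanted) and wanted <= keywords(om_name)


print(names_match("oculocerebrorenal syndrome", "LOWE OCULOCEREBRORENAL SYNDROME"))  # True
print(names_match("oculocerebrorenal syndrome", "LOWE SYNDROME"))  # False
```

A stricter or looser rule (for example, requiring only one shared word) would change only the body of `names_match`.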

  **Output**

    ClinVarVariant:208014   OCRL:exon 6-12 del  HP:0000028  Cryptorchidism  MONDO:0010645   oculocerebrorenal syndrome  309000  LOWE OCULOCEREBRORENAL SYNDROME HP:0000028  OMIM:309000 XLR

I have tried several different approaches (awk, join, and even some pandas methods) to handle this. Any suggestions? Thanks in advance! :)
