组织缺少列的 CSV 文件

Question 1

现在，您重写的问题在概念上很容易回答：您有一组标签，这些标签可能出现在每行数据中，也可能不出现。您想要读取每一行，并按顺序浏览各列，检查该列中的标记是否是预期的标记。如果没有，请插入一个空白单元格，然后检查下一列。一旦到达预期标签列表的末尾，就发出重建的行。

这是一些伪代码，您可以用您选择的语言来实现：

read the first row
split the text on commas to create the array of expected tags
read the next row
    if no more data, exit
    split the text on commas to create a row data array
    for each expected tag
        check the current column in the row's data
        if the tag matches
            write the column data to the output
            advance the current column in the row data
        else
            write a blank column to the output
        terminate the output line

Answer

现在，您重写的问题在概念上很容易回答：您有一组标签，这些标签可能出现在每行数据中，也可能不出现。您想要读取每一行，并按顺序浏览各列，检查该列中的标记是否是预期的标记。如果没有，请插入一个空白单元格，然后检查下一列。一旦到达预期标签列表的末尾，就发出重建的行。

这是一些伪代码，您可以用您选择的语言来实现：

read the first row
split the text on commas to create the array of expected tags
read the next row
    if no more data, exit
    split the text on commas to create a row data array
    for each expected tag
        check the current column in the row's data
        if the tag matches
            write the column data to the output
            advance the current column in the row data
        else
            write a blank column to the output
        terminate the output line

Question 2

我刚刚注意到数据的每一列实际上都以列的名称开头。当我第一次看到你的问题时，我一定错过了这一点。这使得重新格式化数据不仅成为可能，而且相当容易。

#!/usr/bin/perl

use strict;
my @headers; # array to hold the headers in the order they were seen.
my @search;  # array to hold a copy of @headers sorted by string length

while(<>) {
  chomp;    # remove newline character at end-of-line

  if ($. == 1) {
    next if (scalar @headers); # only process headers for first file
    # Split the first line into @headers array, removing any
    # leading or trailing spaces from each column
    @headers = split '\s*,\s*';

    # In case one key might be a substring of another key, copy the
    # @headers array, sorted by length, so we can compare the data
    # with the longest header names first.
    @search = sort { length($b) <=> length($a) } @headers;

    print join(",", @headers), "\n";

  } else {
    my %columns = ();
    # Loop over each column of the input line (row), inserting it into
    # the %columns hash, using the appropriate column name as the key.
    foreach my $c (split '\s*,\s*') {
      my $found = 0;
      foreach my $h (@search) {
        # If the current column ($c) begins with a header
        # name ($h), we've found the right key for it.
        if ($c =~ s/^$h\s+//i) { # match and remove header from column
        #if ($c =~ m/^$h\s+/i) { # or just match without removing header
          $columns{$h} = $c;
          $found = 1;
        };
      };
      warn "Unknown column '$c' in line $. of $ARGV\n" if
        ($c ne '' && ! $found);
    };

    # Output every column in the same order as in the header line.
    # Columns not actually present in a row are output as an empty field
    print join(",", @columns{@headers}), "\n";
  };

  # Reset the line counter at the end of each input file if
  # there's more than one
  close(ARGV) if eof;
}

匹配每列的正则表达式不区分大小写进行匹配。如果您的数据中的列包含大小写或混合大小写版本相同的名称，然后/i从正则表达式中删除修饰符。

使用合适的名称保存它，例如./fix-data.pl，并使其可执行chmod +x ./fix-data.pl。

示例输出：

$ ./fix-data.pl datafile.txt 
Name,Date,Address,Email
Alex,Sept 3,123 Madeup,[email protected]
Jenn,Sept 4,,[email protected]

或者，使用注释掉的替代if语句：

$ ./fix-data.pl datafile.txt 
Name,Date,Address,Email
Name Alex,Date Sept 3,Address 123 Madeup,Email [email protected]
Name Jenn,Date Sept 4,,Email [email protected]

我不知道为什么有人会想要第二种格式，因为列名已经在标题行中，并且每个输出行的每一列都按正确的顺序......但如果这就是你想要的，那么它很容易做到。

顺便说一句，您可以通过管道将输出格式化为具有相同宽度列的表格column：

$ ./fix-data.pl datafile.txt | column -t -s , -o ', '
Name, Date  , Address   , Email
Alex, Sept 3, 123 Madeup, [email protected]
Jenn, Sept 4,           , [email protected]

使用columnwith' | '作为输出分隔符在我看来更易于阅读（并且仍然可以轻松导入电子表格或由其他程序解析）

$ ./fix-data.pl datafile.txt | column -t -s , -o ' | '
Name | Date   | Address    | Email
Alex | Sept 3 | 123 Madeup | [email protected]
Jenn | Sept 4 |            | [email protected]

column甚至可以将数据输出为有效的 json，例如：

$ ./fix-data.pl datafile.txt |
  tail -n +2 |
  column --json -s , \
      --table-columns "$(sed -n -e '1s/ *, */,/gp' datafile.txt)"
{
   "table": [
      {
         "name": "Alex",
         "date": "Sept 3",
         "address": "123 Madeup",
         "email": "[email protected]"
      },{
         "name": "Jenn",
         "date": "Sept 4",
         "address": null,
         "email": "[email protected]"
      }
   ]
}

（至少在 Debian 上，column它在bsdextrautils软件包中。在其他发行版上，它可能在实用程序Linux）

磨坊主和数据混合一旦您将数据转换为合理的格式，它们也是有用的命令行工具。

注意：该脚本假设数据只是简单的逗号分隔格式，不是格式正确的 CSV（参见，例如RFC 4180 - 逗号分隔值 (CSV) 文件的通用格式和 MIME 类型) 可以使用带引号的字符串字段甚至嵌入逗号之内引用的字段。如果任何行包含带引号的列，则需要使用 CSV 解析器，而不是简单地用逗号分割每个输入行。例如 Perl 的文本::CSV模块。我不认为这可能是需要的，因为你的数据采用了一种奇怪的非 CSV 格式，显然是由生成它的人发明的（如果他们知道 CSV，他们可能会使用它......或者弄乱了数据甚至比现在更糟糕）。

此警告适用于任何语言的任何实现，因为问题将是由混乱的数据而不是代码造成的。

column也不适用于包含嵌入逗号的 CSV。

Answer