我有一个 3GB(2400 万行)的 CSV 文件,需要从中提取数据。我使用 Excel 完成了此类提取,但此文件太大,无法在 Excel 中处理。我想根据第三列的值获取数据行。
这是数据样本。没有标题(但可以根据需要添加)。
5/23/2017,00:04:50,421, 0, ,B, 657.78240967, -1
5/23/2017,00:04:50,421, 1, ,B, 1.00000000, -1
5/23/2017,00:04:50,421, 2, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 3, ,B, 6.03906202, -1
5/23/2017,00:04:50,421, 4, ,B, 6000.00000000, -1
5/23/2017,00:04:50,421, 5, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 6, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 7, ,B, 11.00000000, -1
5/23/2017,00:04:50,421, 8, ,B, 65.00000000, -1
5/23/2017,00:04:50,421, 9, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 10, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 11, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 12, ,B, 2.00000000, -1
5/23/2017,00:04:50,421, 13, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 14, ,B, 5.28633022, -1
5/23/2017,00:04:50,421, 15, ,B, 2000.00000000, -1
5/23/2017,00:04:50,421, 16, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 17, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 18, ,B, 4.40085888, -1
5/23/2017,00:04:50,421, 19, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 20, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 21, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 22, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 23, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 24, ,B, 7.00000000, -1
5/23/2017,00:04:50,421, 25, ,B, 60.00000000, -1
5/23/2017,00:04:50,421, 26, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 27, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 28, ,B, 80.00000000, -1
5/23/2017,00:04:50,421, 29, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 30, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 31, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 32, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 33, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 34, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 35, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 36, ,B, 4.32588148, -1
5/23/2017,00:04:50,421, 37, ,B, 1718.00000000, -1
5/23/2017,00:04:50,421, 38, ,B, 0.05945313, -1
5/23/2017,00:04:50,421, 39, ,B, 2.95723248, -1
5/23/2017,00:04:50,421, 40, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 41, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 42, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 43, ,B, 35946.00000000, -1
5/23/2017,00:04:50,421, 44, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 45, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 46, ,B, 42271.00000000, -1
5/23/2017,00:04:50,421, 47, ,B, 25402.00000000, -1
5/23/2017,00:04:50,421, 48, ,B, 1.00000000, -1
5/23/2017,00:04:50,421, 49, ,B, 1.00000000, -1
5/23/2017,00:04:50,421, 50, ,B, 40745.00000000, -1
5/23/2017,00:04:50,421, 51, ,B, 24373.00000000, -1
5/23/2017,00:04:50,421, 52, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 53, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 54, ,B, 46517.00000000, -1
5/23/2017,00:04:50,421, 55, ,B, 35.00000000, -1
5/23/2017,00:04:50,421, 56, ,B, 51.00000000, -1
5/23/2017,00:04:50,421, 57, ,B, 51.00000000, -1
5/23/2017,00:04:50,421, 58, ,B, 49396.00000000, -1
5/23/2017,00:04:50,421, 59, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 60, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 61, ,B, 13610.00000000, -1
5/23/2017,00:04:50,421, 62, ,B, 50711.00000000, -1
5/23/2017,00:04:50,421, 63, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 64, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 65, ,B, 93927.00000000, -1
5/23/2017,00:04:50,421, 66, ,B, 50173.00000000, -1
5/23/2017,00:04:50,421, 67, ,B, 1.00000000, -1
5/23/2017,00:04:50,421, 68, ,B, 1.00000000, -1
5/23/2017,00:04:50,421, 69, ,B, 13797.00000000, -1
5/23/2017,00:04:50,421, 70, ,B, 75.00000000, -1
5/23/2017,00:04:50,421, 71, ,B, 114.00000000, -1
5/23/2017,00:04:50,421, 72, ,B, 75.00000000, -1
5/23/2017,00:04:50,421, 73, ,B, 47238.00000000, -1
5/23/2017,00:04:50,421, 74, ,B, 1.00000000, -1
5/23/2017,00:04:50,421, 75, ,B, 1.00000000, -1
5/23/2017,00:04:50,421, 76, ,B, 19338.00000000, -1
5/23/2017,00:04:50,421, 77, ,B, 29879.00000000, -1
5/23/2017,00:04:50,421, 78, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 79, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 80, ,B, 19633.00000000, -1
5/23/2017,00:04:50,421, 81, ,B, 16687.00000000, -1
5/23/2017,00:04:50,421, 82, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 83, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 84, ,B, 6823.00000000, -1
5/23/2017,00:04:50,421, 85, ,B, 85.00000000, -1
5/23/2017,00:04:50,421, 86, ,B, 91.00000000, -1
5/23/2017,00:04:50,421, 87, ,B, 91.00000000, -1
5/23/2017,00:04:50,421, 88, ,B, 0.00000000, -1
5/23/2017,00:04:50,421, 89, ,B, 0.00000000, -1
5/23/2017,00:04:54,312, 0, , , 660.28082275, 0
5/23/2017,00:04:54,312, 1, , , 1.00000000, 1
5/23/2017,00:04:54,312, 2, , , 0.00000000, 2
5/23/2017,00:04:54,312, 3, , , 5.95734215, 3
5/23/2017,00:04:54,312, 4, , , 6000.00000000, 4
5/23/2017,00:04:54,312, 5, , , 0.00000000, 5
5/23/2017,00:04:54,312, 6, , , 0.00000000, 6
5/23/2017,00:04:54,312, 7, , , 11.00000000, 7
5/23/2017,00:04:54,312, 8, , , 65.00000000, 8
5/23/2017,00:04:54,312, 9, , , 0.00000000, 9
5/23/2017,00:04:54,312, 10, , , 0.00000000, 10
5/23/2017,00:04:54,312, 11, , , 0.00000000, 11
5/23/2017,00:04:54,312, 12, , , 2.00000000, 12
5/23/2017,00:04:54,312, 13, , , 0.00000000, 13
5/23/2017,00:04:54,312, 14, , , 5.32481575, 14
5/23/2017,00:04:54,312, 15, , , 2000.00000000, 15
5/23/2017,00:04:54,312, 16, , , 0.00000000, 16
5/23/2017,00:04:54,312, 17, , , 0.00000000, 17
5/23/2017,00:04:54,312, 18, , , 4.46131372, 18
5/23/2017,00:04:54,312, 19, , , 0.00000000, 19
5/23/2017,00:04:54,312, 20, , , 0.00000000, 20
5/23/2017,00:04:54,312, 21, , , 0.00000000, 21
5/23/2017,00:04:54,312, 22, , , 0.00000000, 22
5/23/2017,00:04:54,312, 23, , , 0.00000000, 23
5/23/2017,00:04:54,312, 24, , , 7.00000000, 24
5/23/2017,00:04:54,312, 25, , , 60.00000000, 25
5/23/2017,00:04:54,312, 26, , , 0.00000000, 26
5/23/2017,00:04:54,312, 27, , , 0.00000000, 27
5/23/2017,00:04:54,312, 28, , , 80.00000000, 28
5/23/2017,00:04:54,312, 29, , , 0.00000000, 29
5/23/2017,00:04:54,312, 30, , , 0.00000000, 30
5/23/2017,00:04:54,312, 31, , , 0.00000000, 31
5/23/2017,00:04:54,312, 32, , , 0.00000000, 32
5/23/2017,00:04:54,312, 33, , , 0.00000000, 33
5/23/2017,00:04:54,312, 34, , , 0.00000000, 34
5/23/2017,00:04:54,312, 35, , , 0.00000000, 35
5/23/2017,00:04:54,312, 36, , , 4.33114624, 36
5/23/2017,00:04:54,312, 37, , , 1733.00000000, 37
5/23/2017,00:04:54,312, 38, , , 0.71863282, 38
5/23/2017,00:04:54,312, 39, , , 2.95582461, 39
5/23/2017,00:04:54,312, 40, , , 0.00000000, 40
5/23/2017,00:04:54,312, 41, , , 0.00000000, 41
5/23/2017,00:04:54,312, 42, , , 0.00000000, 42
5/23/2017,00:04:54,312, 43, , , 35946.00000000, 43
5/23/2017,00:04:54,312, 44, , , 0.00000000, 44
5/23/2017,00:04:54,312, 45, , , 0.00000000, 45
5/23/2017,00:04:54,312, 46, , , 42271.00000000, 46
5/23/2017,00:04:54,312, 47, , , 25402.00000000, 47
5/23/2017,00:04:54,312, 48, , , 1.00000000, 48
5/23/2017,00:04:54,312, 49, , , 1.00000000, 49
5/23/2017,00:04:54,312, 50, , , 40745.00000000, 50
5/23/2017,00:04:54,312, 51, , , 24373.00000000, 51
5/23/2017,00:04:54,312, 52, , , 0.00000000, 52
5/23/2017,00:04:54,312, 53, , , 0.00000000, 53
5/23/2017,00:04:54,312, 54, , , 46517.00000000, 54
5/23/2017,00:04:54,312, 55, , , 35.00000000, 55
5/23/2017,00:04:54,312, 56, , , 50.00000000, 56
5/23/2017,00:04:54,312, 57, , , 50.00000000, 57
5/23/2017,00:04:54,312, 58, , , 49396.00000000, 58
5/23/2017,00:04:54,312, 59, , , 0.00000000, 59
5/23/2017,00:04:54,312, 60, , , 0.00000000, 60
5/23/2017,00:04:54,312, 61, , , 13610.00000000, 61
5/23/2017,00:04:54,312, 62, , , 50711.00000000, 62
5/23/2017,00:04:54,312, 63, , , 0.00000000, 63
5/23/2017,00:04:54,312, 64, , , 0.00000000, 64
5/23/2017,00:04:54,312, 65, , , 93927.00000000, 65
5/23/2017,00:04:54,312, 66, , , 50173.00000000, 66
5/23/2017,00:04:54,312, 67, , , 1.00000000, 67
5/23/2017,00:04:54,312, 68, , , 1.00000000, 68
5/23/2017,00:04:54,312, 69, , , 13797.00000000, 69
5/23/2017,00:04:54,312, 70, , , 75.00000000, 70
5/23/2017,00:04:54,312, 71, , , 114.00000000, 71
5/23/2017,00:04:54,312, 72, , , 75.00000000, 72
5/23/2017,00:04:54,312, 73, , , 47238.00000000, 73
5/23/2017,00:04:54,312, 74, , , 1.00000000, 74
5/23/2017,00:04:54,312, 75, , , 1.00000000, 75
5/23/2017,00:04:54,312, 76, , , 19338.00000000, 76
5/23/2017,00:04:54,312, 77, , , 29879.00000000, 77
5/23/2017,00:04:54,312, 78, , , 0.00000000, 78
5/23/2017,00:04:54,312, 79, , , 0.00000000, 79
5/23/2017,00:04:54,312, 80, , , 19633.00000000, 80
5/23/2017,00:04:54,312, 81, , , 16687.00000000, 81
5/23/2017,00:04:54,312, 82, , , 0.00000000, 82
5/23/2017,00:04:54,312, 83, , , 0.00000000, 83
5/23/2017,00:04:54,312, 84, , , 6823.00000000, 84
5/23/2017,00:04:54,312, 85, , , 85.00000000, 85
5/23/2017,00:04:54,312, 86, , , 91.00000000, 86
5/23/2017,00:04:54,312, 87, , , 91.00000000, 87
5/23/2017,00:04:54,312, 88, , , 0.00000000, 88
5/23/2017,00:04:54,312, 89, , , 0.00000000, 89
假设我想要具有索引(第三列)的行,以便37
我的数据看起来像这样。
5/23/2017,00:04:50,421, 37, ,B, 1718.00000000, -1
5/23/2017,00:04:54,312, 37, , , 1733.00000000, 37
我可以使用什么工具来实现这一点?有什么建议吗?我主要使用Windows 10
linux 机器,但如果有很好的工具,也可以找到它。我已经用它对这个文件进行了基本的操作,Sublime Text 3
效果很好。
任何帮助将不胜感激。
答案1
我认为你指的是第 4 列。你可以awk
在命令行中使用:
awk -F, -v OFS=, '$4==37' input.csv > output.csv
-F,
将输入字段分隔符设置为逗号
-v OFS=,
将输出字段分隔符设置为逗号
$4==37
匹配并打印第四个字段为 37 的任何行
这将创建一个新文件output.csv
:
5/23/2017,00:04:50,421, 37, ,B, 1718.00000000, -1
5/23/2017,00:04:54,312, 37, , , 1733.00000000, 37
awk
用途非常广泛。您还可以使用它来合并多个 CSV 文件并删除标题。它是 Linux 和 MacOS 的标准配置,并且提供 Windows 版本。
答案2
我同意 teylyn 的评论:Excel 应该能够处理3GB 文件和24k 行。
64 位环境对文件大小没有硬性限制。工作簿大小仅受可用内存和系统资源的限制。
工作表上的总行数和列数:1,048,576 行 x 16,384 列
安全的选择是导入或解析 CSV,因为引号字符串内可能有一个逗号(确实,您的数据似乎没有逗号)。
然而,我想我会提出一个perl
解决方案:
perl -F, -lane 'print if $F[3]==37' input.csv
perl
在类 UNIX 操作系统上是标准配置;要在 Windows 上使用它,您需要安装例如赛格威。西弗吉尼亚海岸也应该提供,尽管我没有检查。
可以用 来实现同样的效果awk
,正如 Mike Fitzpatrick 所建议的:
awk -F, -v OFS=, '$4==37' input.csv
也可以通过以下方式实现相同的效果sed
:
sed -n '/[^,]*,[^,]*,[^,]*, *37,/p' input.csv