问题:
最近我一直在处理相当大的数据集。
但是,Excel 似乎处理这些问题非常慢(仅打开或保存文件就需要 3-5 分钟)。
电子表格详细信息
- 行数:50,000
- 列数:90
- 工作表:1
- 文件大小:157mb。
该工作表由纯数据组成。尚未添加任何公式或 VBA 脚本。没有单元格引用(例如 =H3)或数组。
眼镜:
Office 2010 32 位
处理器:i7 3.4 GHZ 四核
内存:6 Gb
Windows 7
迄今采取的措施
我尝试将其保存为 Excel 二进制 (.xlsb) 文件,这有点帮助(并且也稍微减小了文件大小)。
但是,50,000 行远未达到 Excel 2010 的极限(即 1,048,576 行),但它确实很慢。我还记得旧版本的 Excel 能够相对较快地处理大量数据。
有没有什么方法可以加快 Excel 的速度?
答案1
我决定尝试尽可能地模仿您的场景。为此,我:
- 创建了一个包含 50,000 行和 90 列数据的 MS Excel 电子表格(幸好我处理的是大型数据集,所以这并不难)
- 确保数据是多种格式的混合:文本、日期、数字和常规
- 没有使用任何公式或 VBA 代码
但是,对我来说,这个电子表格只有 24MB 大小,大约只有你的电子表格大小的 15%,所以我无法解释这一点。
不幸的是,您提供的唯一信息是打开/保存电子表格的读/写速度,因此我只能测试这些。我的结果如下:
- 7.68 秒将数据保存到本地硬盘
- 3 分 53 秒将数据保存到网络驱动器
- 10.4 秒打开本地驱动器中的数据
- 50.43 秒打开网络驱动器上的数据
现在,由于我的 50,000 行和 90 列数据只有 24MB,因此保存和加载我的数据显然比您的 157MB 文件(大约大 6.5 倍)要少得多。
但是,正如您所见,打开和保存文件的速度并不完全取决于文件本身的大小。使用 SSD 比使用本地硬盘快得多,而本地硬盘又比网络驱动器快得多。此外,网络类型对数据传输速度有很大影响(例如,千兆以太网比快速以太网快得多)。
不幸的是,我能找到的运行此测试的最接近的计算机具有以下规格:
- Windows 7 企业版(64 位)
- i5 2.6GHz
- 8GB 内存
- MS Excel 2010(32 位)
概括
简而言之:
- 影响读写速度的因素有很多,不仅仅是数据量,还有硬件和/或网络环境
- 更好的测试是查看 Excel 对您的数据进行实际数字运算需要多长时间,而不是读取/写入传输速度
- 对于 Excel 来说,另一个因素是你是否加载了任何插件等,以及它们位于何处
- 您的数据大小为 157MB,是我的电子表格大小的 6.5 倍,还包含 450 万个单元格(那么这是什么原因造成的?)
- 安装更多 RAM(我在 Windows 和 Mac 上的 Excel 中处理大型数据集已有近二十年,影响 Excel 性能的最大因素就是您安装的 RAM 数量
- 拥有 450 万个数据单元,你可能需要重新考虑数据库是否能更好地满足你的需求
答案2
电子表格绝对不是处理大量数据的正确工具,因为单元格是作为对象加载到内存中的(一种非常低效的方式)。
您可以使用 CSV(或 TSV)格式来存储数据,并使用专用工具来操作它们。例如 open-refine 或 GNU awk 等。
编辑:我最近尝试过维西达特在相当大的数据集(1.7GB,300k+行)上成功。非常方便快速地进行切片和切块。
答案3
强调文字处理大量数据的最快方法是使用 DataCollections 整合数据,然后再将其提供给 Excel 处理。我已广泛使用过此技术,它的代码很简单,而且运行速度比要求 Excel 处理大量数据快得多。
'Copy/Paste the following text into an Excelcode Module
'Consider a large dataset containing 5 years files of Date/customer account number/customer name/5 x address/phone number/invoice value
'and where customers may issue more than one invoice per day.
'Let's assume the business requirement is to provide a report of total value per customer by year/month.
'Rather than placing these records in an Excel worksheet, use VB to read the 5 yearly files and consolidate the information in a memory
'collection by year/month rather than year/month/day.
'If there were, say, 300,000 records this would result in a maximum of 60,000 records:- 5 years x 12 months x 1,000 customers.
'at the same time any rows or columns not contributing to the report can be excluded. (10 columns reduced to 4 in this example)
'and the Excel 1 million row limit would not apply unless the consolidated data exceeded this limit.
'With 300,000 rows and 10 columns of data, Excel has 3 million cells to process.
'Consolidation reduces the data to 60,000 rows and 4 columns i.e. 240,000 cells.
'Therefore Excel has 3,000,000 / 240,000 = 12.5% of the data to process which will directly impact runtime.
'The code below assumes the data is already held in an Excel worksheet but this could easily be converted to reading from one or more files.
'Using a Collection to consolidate the data results in Excel having far less data to process and less data means less time to process.
'I have used the Collection Class for 20 years but heard recently that there is also a Dictionary class which has benefits over the Collections class.
'I have never used this class but see details https://www.experts-exchange.com/articles/3391/Using-the-Dictionary-Class-in-VBA.html
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'First create an empty .xlsm Workbook containing sheet Data and Sheet Consolidation
'then create a Class record of only those data fields required for analysis (i.e. not address and phone number in this example)
'by Cut and pasting the following definitions into Class1 which specifies the Collection memory record format
'containing only those data elements required for analysis (i.e. not address and phone number)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Option Explicit
'Private pYYMM As String
'Private pAccountNumber As String
'Private pCustomerName As String
'Private pTotalValue As Single
''
'Property Let YYMM(xx As String) 'YYMM
' pYYMM = xx
'End Property
'Property Get YYMM() As String
' YYMM = pYYMM
'End Property
'Property Let AccountNumber(xx As String) 'AccountNumber
' pAccountNumber = xx
'End Property
'Property Get AccountNumber() As String
' AccountNumber = pAccountNumber
'End Property
'Property Let CustomerName(xx As String) 'CustomerName
' pCustomerName = xx
'End Property
'Property Get CustomerName() As String
' CustomerName = pCustomerName
'End Property
'Property Let TotalValue(xx As Single) 'TotalValue
' pTotalValue = xx
'End Property
'Property Get TotalValue() As Single
' TotalValue = pTotalValue
'End Property
''''''''''''''''''''''''''''''''''''''''''''''''''
'and remove the first ' comment on each row
'then, in a Code module, declare the Collection::-
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Option Explicit
Dim NewRecord As New Class1 'record area (definition of new Class1 record)
Dim Class1Collection As New Collection 'A collection of keyed records held in memory
Dim Class1Key As String 'Collection Key must be a String
Dim WalkRecord As Class1
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Next create the text key used to access each Class1 record in memory by year, month and account number:-
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Dim Rw As Long
Dim StartTime As Date
Sub Consolidate()
StartTime = Now()
Sheets("Data").Select
For Rw = 2 To Range("A1").CurrentRegion.Rows.Count
Class1Key = Format(Range("A" & Rw).Value, "yymm") & Range("B" & Rw).Value 'Key = yymm & AccountNumber
''''''''''''''''''''''''''''''''
'Next populate the Class1 record:-
''''''''''''''''''''''''''''''''
NewRecord.YYMM = Format(Range("A" & Rw).Value, "yymm") 'YYMM
NewRecord.AccountNumber = Range("B" & Rw).Value 'AccountNumber
NewRecord.CustomerName = Range("C" & Rw).Value 'CustomerName
NewRecord.TotalValue = Range("J" & Rw).Value 'TotalValue
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Next write the Class record into memory and, if an identical key already exists, accumulate TotalValue:-
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
On Error Resume Next
Call Class1Collection.Add(NewRecord, Class1Key)
Select Case Err.Number
Case Is = 0
On Error GoTo 0
'******** NEW RECORD INSERTED OK HERE *********
Case Is = 457
On Error GoTo 0
'******** CONSOLIDATE DUPLICATE KEY CASES HERE ************
Class1Collection(Class1Key).TotalValue = Class1Collection(Class1Key).TotalValue + NewRecord.TotalValue
Case Else
'******** UNEXPECTED ERROR *****************
Err.Raise Err.Number
Stop
End Select
Set NewRecord = Nothing
Next Rw
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' When all the records have been processed, write Collection entries to a WorkSheet:-
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Sheets("Consolidation").Select
Range("A1:D1").Value = Array("Date", "account number", "customer name", "total value")
Rw = 2
For Each WalkRecord In Class1Collection 'walk collection and write rows
Cells(Rw, 1) = "20" & Left(WalkRecord.YYMM, 2) & "/" & Right(WalkRecord.YYMM, 2)
Cells(Rw, 2) = WalkRecord.AccountNumber
Cells(Rw, 3) = WalkRecord.CustomerName
Cells(Rw, 4) = WalkRecord.TotalValue
Rw = Rw + 1
Next WalkRecord
Set Class1Collection = Nothing
Columns("A:A").NumberFormat = "dd/mm/yyyy;@"
Columns("D:D").NumberFormat = "#,##0.00"
Columns("A:D").HorizontalAlignment = xlCenter
Columns("A:D").EntireColumn.AutoFit
'
Range("A2").Select
Rw = Range("A1").CurrentRegion.Rows.Count
ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Clear
ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Add Key _
:=Range("A2:A" & Rw), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Add Key _
:=Range("B2:B" & Rw), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
With ActiveWorkbook.Worksheets("Consolidation").Sort
.SetRange Range("A1:D" & Rw)
.Header = xlYes
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
MsgBox ("Runtime " & Format(Now() - StartTime, "hh:mm:ss")) '300k records takes 1 min 46 seconds on my ancient HP laptop
End Sub
'
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Now run macro Generate to create 300k test rows, followed by macro Consolidate to demonstrate consolidation by data collection.
'In my experience the improvement in Excel run time for producing any reports is so spectacular that no further action is required
'but further run time improvements could result from avoiding calling the Format function twice every row,
'or holding CustomerName in another collection and accessing during the Walk process, or using the Dictionary class
'Note that is advisable to run macro ClearSheets before saving your WorkBook.
'Also the Record constant can be manually adjust up to a maximum of 1048575 if required
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'
Sub Generate() 'Generates test data in Sheet Data
Const StartDate As Date = 42370 '01/01/2016
Const FinishDate As Date = 44196 '31/12/2020
Const Records As Long = 300000 'number of records to generate - max 1,048,575
Const Customers As Integer = 1000 'number of Customers
Const StartValue As Integer = 5 '£5
Const FinishValue As Integer = 100 '£100
StartTime = Now()
Sheets("Data").Select
Range("A1:J1").Value = Array("Date", "account number", "customer name", "address 1", "address 2", "address 3", "address 4", "address 5", "phone number", "invoice value")
For Rw = 2 To Records + 1
Cells(Rw, 1) = WorksheetFunction.RandBetween(StartDate, FinishDate) 'Date
Cells(Rw, 2) = WorksheetFunction.RandBetween(1, Customers) 'Customer number
Range(Cells(Rw, 3), Cells(Rw, 9)) = Array("Customer " & Cells(Rw, 2), "Not required", "Not required", "Not required", "Not required", "Not required", "Not required")
Cells(Rw, 10) = WorksheetFunction.RandBetween(StartValue, FinishValue) 'Value
Next Rw
Columns("A:A").NumberFormat = "dd/mm/yyyy;@"
Columns("J:J").NumberFormat = "#,##0.00"
Columns("A:J").HorizontalAlignment = xlCenter
Columns("A:J").EntireColumn.AutoFit
Range("A2").Select
MsgBox ("Runtime " & Format(Now() - StartTime, "hh:mm:ss")) '300k records takes 1 min 50 seconds on my ancient HP laptop
End Sub
Sub ClearSheets() 'delete all data before saving
Sheets("Data").Select
Range("A1").CurrentRegion.Delete Shift:=xlUp
Range("A1").Select
Sheets("Consolidation").Select
Range("A1").CurrentRegion.Delete Shift:=xlUp
Range("A1").Select
End Sub
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'Finally, a data collection can comprise of keys and value without a Class record.
'Such a collection can be used to identify data duplicates (e.g. key=cellcontent/row number) or hold a lookup table (e.g. key=account/name)
'Most applications process far more data than they report therefore using collections benefits most data sets irrespective of volume.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''