There have been a few posts on the community related to working with large CSV files and memory issues. A lot of this is tied to two points:
- The Blue Prism executable is a 32-bit process and thus limited to a maximum of 2GB of addressable memory.
- Many of the actions, in existing VBOs, are designed to read the entire CSV file at one time.
There are a few ways to handle these large CSV files. The one I'll focus on here is
chunking. What is chunking? To understand chunking, consider the following example:
ExampleYou have a CSV file containing 1 million records of sales data with each record containing 14 columns. You need to read that data and identify every record for company XYZ. How might you go about it?
The standard solution comes down to using either the
MS Excel or
Utility - File Management VBOs to read/import the CSV file contents in one step. While this works fine in cases where the CSV file is relatively small, it begins to strain as the file size and record/column count increase.
ChunkingTo address this scenario, we have recently added a new action to the
Utility - File Management VBO called
Get CSV Text as Collection - Chunk. This new action is similar to the existing
Get CSV Text as Collection action except that instead of trying to read the entire CSV at one time it will read a subset of the records contained in the CSV (i.e. a chunk).
To use the action, you pass in the following parameters:
- CSV File Path - The full path and name of the CSV file.
- First Line is Header - A flag indicating if the first line of the CSV is a column header line.
- Starting Row - The row number where reading of the CSV data should start. Note: This value is zero-indexed meaning the first data record is row #0.
- Chunk Size - A number indicating the size (i.e. number of records) to read in the chunk.
Assuming the file exists, the action will open the file, pull out the required rows of data, and return that smaller collection along with a value showing the number of records that were actually read (in case the Chunk Size is larger than the number of available/remaining records) and the number of the next record where a subsequent call to this action should begin reading for the next chunk of data.
I've conducted some tests using CSV files ranging from 10K to 5M records (14 columns of data per record). Below is a screenshot of the elapsed times to read the various test files.
I tested both options of reading the data (non-chunked and chunked). For the tests up to, and including, 100K records, non-chunked reading worked fine. Beginning with the 500K test, and onward, memory issues arose. From 1M onward, the test failed every time unless the chunking action was used.
Note 1: The times shown are just for reading the data from the CSV file, populating the Collection, and returning the collection to the calling process. Additional time would be required to then loop through the collections to perform whatever business processing is necessary (ex. looking for company XYZ records).Note 2: These tests were conducted on a 3.5 year old Dell Latitude 7490 with a Core i7-8650U CPU and 16GB of RAM.
Note 3: The memory issues don't magically show up at record 100,001. It varies depending on the system you're running and it's specs. And there's a pretty significant jump between my test at 100K and 500K records.
Not exactly lightning fast, but for a Digital Worker I'd say more than respectable. And best of all, no
System.OutOfMemory exceptions. ;)
As I mentioned above, there are other options for addressing this issue including:
- Reading the CSV line-by-line and loading a work queue.
- Using a Code stage and possibly Linq.
- etc
In the future I may write additional articles discussing some of the other options. For now, if you're experiencing this issue, consider downloading the latest version of the
Utility - File Management VBO (v6.10.4) from the Digital Exchange and see how it works for you.
P.S. - I've attached the test process that I used. You'll need to furnish your own CSV test files, or you can download the same files I used. They are available on the site
E for Excel.
Cheers,
------------------------------
Eric Wilson
Director, Integrations and Enablement
Blue Prism Digital Exchange
------------------------------