Digital Exchange


This community is a place to discuss Blue Prism DX assets and development.

  • 1.  Working with large CSV files? Just chunk it! :-)

    Posted 02-24-2022 00:28
    There have been a few posts on the community related to working with large CSV files and memory issues. A lot of this is tied to two points:
    • The Blue Prism executable is a 32-bit process and thus limited to a maximum of 2GB of addressable memory.
    • Many of the actions in existing VBOs are designed to read the entire CSV file at one time. 

    There are a few ways to handle these large CSV files. The one I'll focus on here is chunking. What is chunking? To understand chunking, consider the following example:

    Example
    You have a CSV file containing 1 million records of sales data with each record containing 14 columns. You need to read that data and identify every record for company XYZ. How might you go about it?

    The standard solution comes down to using either the MS Excel or Utility - File Management VBOs to read/import the CSV file contents in one step. While this works fine in cases where the CSV file is relatively small, it begins to strain as the file size and record/column count increase.

    Chunking
    To address this scenario, we have recently added a new action to the Utility - File Management VBO called Get CSV Text as Collection - Chunk. This new action is similar to the existing Get CSV Text as Collection action except that instead of trying to read the entire CSV at one time it will read a subset of the records contained in the CSV (i.e. a chunk).

    To use the action, you pass in the following parameters:
    • CSV File Path - The full path and name of the CSV file.
    • First Line is Header - A flag indicating if the first line of the CSV is a column header line.
    • Starting Row - The row number where reading of the CSV data should start. Note: This value is zero-indexed, meaning the first data record is row #0.
    • Chunk Size - A number indicating the size (i.e. number of records) to read in the chunk.

    Assuming the file exists, the action opens the file, reads the requested rows, and returns that smaller collection, along with the number of records actually read (in case the Chunk Size is larger than the number of available/remaining records) and the row number where a subsequent call to this action should begin reading the next chunk of data.
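
    To make the pattern concrete, here is a rough VB.NET sketch of the same idea, using illustrative function and variable names only (this is not the VBO's actual code): skip to the starting row, read at most Chunk Size rows, and report how many rows were read and where the next chunk should start.

    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Linq

    Module CsvChunkDemo

        ' Illustrative sketch only: roughly what a chunked CSV read does.
        ' Skips to startingRow, reads at most chunkSize data rows, and reports
        ' how many rows were read plus where the next chunk should start.
        Function ReadCsvChunk(csvPath As String, firstLineIsHeader As Boolean,
                              startingRow As Integer, chunkSize As Integer,
                              ByRef rowsRead As Integer, ByRef nextRow As Integer) As List(Of String())
            Dim rows As New List(Of String())
            rowsRead = 0

            ' File.ReadLines streams the file rather than loading it all at once.
            Dim lines As IEnumerable(Of String) = File.ReadLines(csvPath)
            If firstLineIsHeader Then lines = lines.Skip(1)

            For Each line In lines.Skip(startingRow).Take(chunkSize)
                rows.Add(line.Split(","c))   ' naive split; quoted fields need a real CSV parser
                rowsRead += 1
            Next

            nextRow = startingRow + rowsRead
            Return rows
        End Function

        Sub Main()
            ' Hypothetical file and chunk size: read 10,000 records at a time.
            Dim startRow As Integer = 0
            Do
                Dim read As Integer = 0
                Dim nextRow As Integer = 0
                Dim chunk As List(Of String()) = ReadCsvChunk("C:\Data\sales.csv", True, startRow, 10000, read, nextRow)

                ' ...filter the chunk here, e.g. keep only company XYZ rows...

                If read < 10000 Then Exit Do   ' last (partial) chunk reached
                startRow = nextRow
            Loop
        End Sub

    End Module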

    I've conducted some tests using CSV files ranging from 10K to 5M records (14 columns of data per record). Below is a screenshot of the elapsed times to read the various test files.

    I tested both options for reading the data (non-chunked and chunked). For tests up to and including 100K records, non-chunked reading worked fine. From the 500K test onward, memory issues arose, and from 1M onward the test failed every time unless the chunking action was used.

    Note 1: The times shown are just for reading the data from the CSV file, populating the Collection, and returning the collection to the calling process. Additional time would be required to then loop through the collections to perform whatever business processing is necessary (e.g. looking for company XYZ records).

    Note 2: These tests were conducted on a 3.5-year-old Dell Latitude 7490 with a Core i7-8650U CPU and 16GB of RAM. 

    Note 3: The memory issues don't magically show up at record 100,001; it varies depending on the system you're running and its specs, and there's a pretty significant jump between my 100K and 500K test sizes.
     
    Not exactly lightning fast, but for a Digital Worker I'd say more than respectable. And best of all, no System.OutOfMemory exceptions. 😉

    As I mentioned above, there are other options for addressing this issue including:
    • Reading the CSV line-by-line and loading a work queue (see the streaming sketch after this list).
    • Using a Code stage and possibly LINQ.
    • etc.
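
    As a rough illustration of the line-by-line idea (the work-queue loading step is left out), the sketch below streams the file with File.ReadLines and a LINQ filter, so only one line needs to be in memory at a time. The path and column position are made up for the example.

    Imports System.IO
    Imports System.Linq

    Module StreamingFilterDemo
        Sub Main()
            ' Hypothetical path and layout: column 0 is assumed to hold the company name.
            ' File.ReadLines is lazy, so lines are read one at a time as the loop consumes them.
            Dim matches = File.ReadLines("C:\Data\sales.csv").Skip(1).
                          Where(Function(line) line.Split(","c)(0) = "XYZ")

            For Each line In matches
                Console.WriteLine(line)
            Next
        End Sub
    End Module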

    In the future I may write additional articles discussing some of the other options. For now, if you're experiencing this issue, consider downloading the latest version of the Utility - File Management VBO (v6.10.4) from the Digital Exchange and see how it works for you.

    P.S. - I've attached the test process that I used. You'll need to furnish your own CSV test files, or you can download the same files I used. They are available on the site E for Excel.

    Cheers,

    ------------------------------
    Eric Wilson
    Director, Integrations and Enablement
    Blue Prism Digital Exchange
    ------------------------------




  • 2.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 02-24-2022 05:11
    Thanks a lot for this wonderful article @Eric Wilson. This is probably one of the biggest headaches with most of the automations I have come across, and chunking is the only way around it. Loved the way you explained it, and I'm definitely going to try this new action in my future automations as well :)

    ------------------------------
    ----------------------------------

    Regards,
    Devneet Mohanty
    Intelligent Process Automation Consultant | Sr. Consultant - Automation Developer,
    Wonderbotz India Pvt. Ltd.
    Blue Prism Community MVP | Blue Prism 7x Certified Professional
    Website: https://devneet.github.io/
    Email: devneetmohanty07@gmail.com

    ----------------------------------
    ------------------------------



  • 3.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 02-24-2022 15:32
    Hi Eric.

    This is really great stuff.

    Thanks for sharing.

    jack

    ------------------------------
    Jack Look
    Sr Product Consultant
    Blue Prism
    ------------------------------



  • 4.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 05-02-2022 15:50

    Hi Eric, this is really promising!

    Is there any way for me to change the encoding options? It doesn't really like my CSV files because of special characters. 

    Hálsaþing 4;203;1116498;1000;Kópavogsbær;18.04.2011 00:00;02.05.2011 09:18;48000;44450;2007;10101;244,1;480;m²;Serbyli;1;0



    ------------------------------
    Atli Harðarson
    ------------------------------



  • 5.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 05-02-2022 18:59
    Hi @Atli Harðarson,

    Can you elaborate on your situation? Are you receiving an error or exception? If so, can you share the details? There is a CharacterSet value that can be added to the OleDb connection string created within the Code stage, which might help here. Here's an example of what the connection string might look like:

    Using cn As New OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" & Folder & ";Extended Properties=""Text;HDR=" & HDRString & ";Format=Delimited;CharacterSet=UTF8;""")
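
    For anyone adapting this, here's a rough, self-contained sketch (assumed folder, file name, and variable values; not the VBO's actual Code stage) of how a connection string like that could be used to fill a DataTable. Note the Jet 4.0 provider is 32-bit only, which matches the Blue Prism process.

    Imports System.Data
    Imports System.Data.OleDb

    Module OleDbCsvDemo
        Sub Main()
            ' Placeholders for this sketch: Folder holds the CSV; HDRString mirrors the header flag.
            Dim Folder As String = "C:\Data"
            Dim HDRString As String = "Yes"
            Dim dt As New DataTable()

            ' Same connection string shape as above, with the CharacterSet value added.
            Using cn As New OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" & Folder &
                                            ";Extended Properties=""Text;HDR=" & HDRString &
                                            ";Format=Delimited;CharacterSet=UTF8;""")
                cn.Open()
                ' With the Text driver, the file name (not the folder) acts as the table name.
                Using da As New OleDbDataAdapter("SELECT * FROM [sales.csv]", cn)
                    da.Fill(dt)
                End Using
            End Using

            Console.WriteLine(dt.Rows.Count & " rows read")
        End Sub
    End Module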

    Cheers,


    ------------------------------
    Eric Wilson
    Director, Integrations and Enablement
    Blue Prism Digital Exchange
    ------------------------------



  • 6.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 05-03-2022 12:03

    Thanks Eric, that solved the UTF-8 issue.

    I had to change UTF-8 to 65001, and now I'm also trying to solve the ";" delimiter :)

    Using cn As New OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" & Folder & ";Extended Properties=""Text;HDR=" & HDRString & ";Format=Delimited;CharacterSet=65001;""")


    ------------------------------
    Atli Harðarson
    ------------------------------



  • 7.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 05-03-2022 12:11
    @Atli Harðarson,

    Try changing Format=Delimited to Format=Delimited(;).

    Cheers,


    ------------------------------
    Eric Wilson
    Director, Integrations and Enablement
    Blue Prism Digital Exchange
    ------------------------------



  • 8.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 05-03-2022 09:02
    Thanks @Eric Wilson,

    Very useful article for working with large CSV files. Thanks for sharing!!

    ------------------------------
    Manpreet Kaur
    Manager
    Deloitte
    ------------------------------



  • 9.  RE: Working with large CSV files? Just chunk it! :-)

    Posted 06-12-2023 11:26

    Thanks for this post, really helpful.

    However, I have a question: how do I work with such a large xlsx/csv file when I need to find one record according to criteria X, then another record according to criteria Y, and so on? I have the criteria saved in another xlsx/csv file, which has about 600-1000 rows.

    Could you help?



    ------------------------------
    Marcin Grzelak
    Blue Prism Developer
    TSYS Managed Services EMEA Limited
    Europe/London
    ------------------------------



Welcome to the Blue Prism Digital Exchange Community!

The Blue Prism Digital Exchange is a "shop window" for new and emerging technologies—a platform that puts powerful RPA and AI capabilities into the hands of business leaders. Users can find and apply pre-built AI capabilities, in the form of downloadable integrations and Visual Business Objects (VBOs), to automated processes. These assets connect and integrate Digital Workers, existing systems and processes to Blue Prism's technology partners, creating a solid foundation of AI-enabled Intelligent Automation that's scalable and sustainable.


FAQs

• The Blue Prism Digital Exchange (DX) is an online marketplace where businesses can instantly access, apply and share pre-built AI, cognitive and advanced RPA technologies from best-in-class providers. These assets easily connect to existing digital workers, systems and processes to enhance automation capabilities.
• The Digital Exchange is free to all users. Most of the content on the DX is free to download, but some submissions do have a cost associated; these are advertised on the asset card and profile. No unwanted costs will be applied to any users.
• You can visit and browse the Digital Exchange here. If you would like to consume or download any material, you first need to create an account on the Blue Prism Portal.
• Everyone can access the Digital Exchange and consume the assets on it. If you would like to contribute to the marketplace, you need to create an account and sign up as a partner.