08-07-24 03:28 AM
Hi
I need help with extracting a CSV file from a database using SQL queries in Blue Prism. In the development environment, everything worked fine. However, now that my bot is in production, the CSV file size exceeds 2 GB, causing an "Out of Memory" issue.
How can I extract such large files in Blue Prism?
Additionally, will I face problems manipulating such a huge file after extraction? Any strategies or best practices for managing this would be greatly appreciated.
10-07-24 10:37 AM
Hi @sandeshtope
To expand on @devneetmohanty07's comments, you will need a way of chunking up the data and then either saving each chunk as an individual CSV file, or building up a string, appending each new dataset to it until complete, and exporting it as a single CSV file at the end.
The first thing is to chunk up the data: offset the rows by the number you want to fetch, and increment the next set from there, e.g. rows 0 up to 100, then the second set would be rows 100 up to 200, and so on. Here is an example query showing how to set that up. I would hold the start row and row limit as data items and vary them through your loop, i.e. after each query use a calculation stage to set the next start value based on the row limit. You could also hold the row limit as an environment variable if you wanted to increase or decrease it without a code release.
SELECT *
FROM my_table
WHERE my_column = 'some value'  -- optional filter
ORDER BY my_key_column          -- OFFSET/FETCH requires an ORDER BY; a stable column such as the primary key works best
OFFSET [Row num start] ROWS FETCH NEXT [Row limit] ROWS ONLY
After this query, use a calculation stage with the calculation [Row num start]+[Row limit] and save the result back into [Row num start]. This repeats on each loop, so [Row num start] increments by the row limit for each set of data.
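For illustration only (assuming a row limit of 100 and the same placeholder table and column names as above), the first two passes of the loop would resolve to something like:

-- Pass 1: [Row num start] = 0, [Row limit] = 100
SELECT * FROM my_table ORDER BY my_key_column
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY

-- Pass 2: after the calculation stage, [Row num start] = 0 + 100 = 100
SELECT * FROM my_table ORDER BY my_key_column
OFFSET 100 ROWS FETCH NEXT 100 ROWS ONLY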
To get this out as a CSV file, first use the Utility - Strings VBO > Get collection as CSV action and output to a text data item, then use the Utility - File Management VBO > Write Text File action.
Hopefully I've explained this clearly; let me know if you need any further clarification on anything. 🙂
08-07-24 08:15 AM
Hi @sandeshtope,
Whenever you are working with a very large dataset, one that easily exceeds 500 MB or so, you cannot rely directly on collections. Blue Prism is a 32-bit application (although from v7.2+ a 64-bit application is available), and if you try to store all of this data in a collection, the process will run out of memory and the Blue Prism service will crash. The same applies even if you use OLEDB connections, as you appear to be doing: OLEDB is faster than regular Excel operations, but the problem remains that you cannot hold more data than the memory limit allows in a collection at any one time.
Hence, the main challenge in automating use cases with large datasets is that if you want to use Blue Prism's collection data item, you need a chunking mechanism: while you read the data from any external source (Excel, database, Access, web table, CSV files, etc.), you need some logic that breaks the data into buckets small enough to hold in a collection and work with. So even if you use OLEDB, you would still need chunking logic in your SELECT queries. This can be a bit complicated, and the processing time can be long. For this reason, your best bet would be to interact with the external application itself, which in our case is Excel, and have your operations run on that application without relying on collections at all, for example using Power Query or code that works directly on the data in the file without bringing it all into a collection. However, such approaches would need a change in the overall solution design.
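As a rough sketch of what that chunking logic in the SELECT query could look like, and assuming your dataset does have a unique, indexed key column (called Id here purely for illustration), keyset pagination is one option: each chunk starts after the last key value already processed rather than using a row offset.

-- First chunk of 100 rows (Id and my_table are placeholder names; adjust to your schema)
SELECT TOP (100) * FROM my_table ORDER BY Id

-- Each subsequent chunk: [Last Id] is a data item holding the largest Id from the previous chunk
SELECT TOP (100) * FROM my_table
WHERE Id > [Last Id]
ORDER BY Id

Compared with a plain OFFSET, this avoids re-scanning all of the previously skipped rows on every pass, which matters at the 2 GB scale mentioned above.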
To recommend the best approach, we would need to know exactly what your goal is with this data, i.e. what you intend to do once you have read this CSV data, along with some details about your columns, such as whether your dataset has a primary key column (one with unique values and no nulls) and what types of columns you have.