OLEDB on .xlsx - 20,000 row limit?
01-03-21 02:20 PM
Hi,
My process uses OLEDB to select a data subset from an Excel sheet. Originally I used the Filter Collection action, but that turned out to be far too slow.
The OLEDB steps used are:
- Set Connection
  Connection string: Provider=Microsoft.ACE.OLEDB.12.0;Data Source="C:\MyFolder\MyXL.XLSX";Extended Properties="Excel 12.0 Xml;HDR=YES;IMEX=1;"
- Open
- Get Collection
  SQL: select * from Sheet1 where (Column1 = 'ABC') and (Column2='1' or Column2='2')
- Close
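For anyone who wants to try the same query outside Blue Prism: the identical connection string and SQL can be driven through plain ADO from, say, Python with pywin32. A minimal sketch, not the BP VBO itself - note that in plain ADO the sheet is usually addressed as [Sheet1$]:

# Minimal sketch: same ACE OLEDB connection string and SQL as in the
# BP steps above, driven from Python/ADO instead of the BP VBO, to see
# whether the row cutoff comes from the driver or from Blue Prism.
import win32com.client

CONN_STR = ('Provider=Microsoft.ACE.OLEDB.12.0;'
            'Data Source=C:\\MyFolder\\MyXL.XLSX;'
            'Extended Properties="Excel 12.0 Xml;HDR=YES;IMEX=1;"')

conn = win32com.client.Dispatch("ADODB.Connection")
conn.Open(CONN_STR)
rs = win32com.client.Dispatch("ADODB.Recordset")
rs.Open("select * from [Sheet1$] "
        "where (Column1 = 'ABC') and (Column2 = '1' or Column2 = '2')", conn)

count = 0
while not rs.EOF:        # row-by-row keeps memory use flat
    count += 1
    rs.MoveNext()
rs.Close()
conn.Close()
print("rows returned:", count)

If this also stops around 20K, the limit sits in the driver or the file; if it returns the full set, BP's handling of the result collection is the suspect.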
These steps seem to work fine but for one flaw: if my Excel sheet contains more than 20,000 rows, and most do, the selection is only executed on rows 1 to 20,000, while rows 20,001 to the end are completely ignored. The result collection from the SQL only contains data from rows 1 to 20,000. When I check the Excel content, the rows past 20,000 do contain data matching the SQL criteria; when I apply the same criteria as data filters in Excel, all relevant data is selected, including rows past 20,000. OLEDB delivers only some 200 rows where Excel with the same filters returns over 10,000.
As this is obviously not what I want, I spent time on Google looking for a solution to my little problem. Unfortunately, to no avail.
So my questions to you are:
- Does anyone recognize this tricky behavior in OLEDB?
- Are there any useful tips I can apply to either fix the problem or analyze it further?
------------------------------
Happy coding!
Paul
Sweden
------------------------------
3 REPLIES
01-03-21 06:31 PM
Hi Paul - as an experiment, if the file had only one small column (e.g. every cell = 1), does it still only read 20,000 rows? Just to prove the limit is not related to the number of columns or the volume of data (a quick way to generate such a test file is sketched below).
Also, is there anything strange at row 20,000 in the original file? Weird characters, large cells, anything like that? Just trying to think: what is it about 20K?
If there were a 'Row Number' column in the file and the WHERE clause of the query said [Row Number] > 20000, would that work?
Are there any drivers other than ACE that you could try? https://www.connectionstrings.com/excel/
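To generate that test file quickly, a throwaway sketch - the path and the 25,000-row size are just assumptions, and it needs the openpyxl package:

# Builds a small test workbook for the experiments above: an explicit
# 'Row Number' column plus one narrow data column where every cell = 1.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Sheet1"
ws.append(["Row Number", "Value"])      # header row, matches HDR=YES
for i in range(1, 25001):               # 25,000 data rows, well past 20K
    ws.append([i, 1])
wb.save(r"C:\MyFolder\TestXL.xlsx")     # hypothetical path

Then run both queries against it: a plain select *, and one with where [Row Number] > 20000.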
------------------------------
John Carter
Professional Services
Blue Prism
------------------------------
03-03-21 08:29 AM
Hi,
I tried to extract data from a file which has 23 columns (simple data) and over 145K rows. I came across the scenarios below:
- I was able to filter and extract data across the file; there is no row limit as such that I could observe.
- However, I was having issues (OutOfMemoryException) when the result set had more records. I am not sure about the exact limit, but a result set of around 80K records threw the same exception. Is there a limit on collection size, and how can I find the upper limit for the amount of data a collection can hold? (See the batched-fetch sketch below.)
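One way to probe that ceiling without materialising the whole result set at once is to pull the recordset in fixed-size batches. A minimal sketch outside BP, again via Python/pywin32 and ADO's GetRows(n) - the batch size and file path are assumptions:

# Fetches the query result in batches of BATCH rows instead of one big
# fetch; the running total shows how far you get before memory runs out.
import win32com.client

BATCH = 10000    # assumed batch size

conn = win32com.client.Dispatch("ADODB.Connection")
conn.Open('Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\\MyFolder\\Big.xlsx;'
          'Extended Properties="Excel 12.0 Xml;HDR=YES;IMEX=1;"')
rs = win32com.client.Dispatch("ADODB.Recordset")
rs.Open("select * from [Sheet1$]", conn)

total = 0
while not rs.EOF:
    batch = rs.GetRows(BATCH)     # column-major tuple of up to BATCH rows
    total += len(batch[0])
    # ...process or discard this batch before fetching the next...
rs.Close()
conn.Close()
print("total rows read:", total)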
Thank you
Prashant.
------------------------------
Prashant Boga
------------------------------
03-03-21 01:34 PM
Ok, here it is.
John:
- I checked the CSV that the XL is based on for out-of-ASCII-range characters with Notepad++; none were found.
- All cells are of data type Text, none numeric or date. This is a result of the IMEX=1 parameter in the connection string and the way the CSV was generated by the IBMi, though date and amount data is stored in the text-type cells.
- I read through the connectionstrings.com page for Excel without finding anything that would solve my little issue.
- Nope, I haven't checked for a driver other than ACE, because an earlier search led me to believe ACE was the thing to go by.
- All 10 columns in my XL have data type Text.
- The first part of my SQL example should result in 30K+ rows; the second part takes this down to 10K+ rows. When I first sort my XL on the first column in the SQL (Column1) and then run only the first part (where (Column1 = 'ABC')), around 20K rows were always selected - never more.
- Note that there's no crash! The OleDB does not crash or throw any form of exception; it just selects what it can and continues. In general I prefer things to crash when they don't work properly, and I don't like OleDB just carrying on without any exception as it does now - this makes it a dangerous feature to use.
According to an esteemed CoE colleague of mine (thanks RL!), problems are to be expected with 20K+ rows in OleDB. As John stated, there's no hard 20K limit in BP, but it seems the combination of many rows and filtering is likely to hit some sort of memory limit in BP or on my VDI. I previously encountered memory problems when BP was filtering larger collections with the built-in collection actions. This was partly resolved by splitting them into smaller collections and filtering those, and partly by switching to OleDB for the others. My colleague suggested a split-into-smaller-chunks solution to my current problem, so that's what I'll explore (a sketch of the idea below).
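For reference, the chunking idea looks roughly like this as a sketch outside BP in Python/pywin32. ACE lets you query a worksheet range such as [Sheet1$A2:J10001], and with HDR=NO the columns are addressed positionally as F1, F2, and so on; the chunk size and last-row figure below are assumptions, and A:J matches my 10 columns:

# Runs the same WHERE clause against fixed-size worksheet ranges so that
# no single OLEDB query has to scan (or return) more than CHUNK rows.
import win32com.client

CONN_STR = ('Provider=Microsoft.ACE.OLEDB.12.0;'
            'Data Source=C:\\MyFolder\\MyXL.XLSX;'
            'Extended Properties="Excel 12.0 Xml;HDR=NO;IMEX=1;"')
CHUNK = 10000        # assumed safe chunk size
LAST_ROW = 150000    # assumed last data row in the sheet

conn = win32com.client.Dispatch("ADODB.Connection")
conn.Open(CONN_STR)
matches = []
for start in range(2, LAST_ROW + 1, CHUNK):   # row 1 is the header, skip it
    end = min(start + CHUNK - 1, LAST_ROW)
    sql = ("select * from [Sheet1$A{0}:J{1}] "
           "where (F1 = 'ABC') and (F2 = '1' or F2 = '2')".format(start, end))
    rs = win32com.client.Dispatch("ADODB.Recordset")
    rs.Open(sql, conn)
    if not rs.EOF:
        matches.extend(zip(*rs.GetRows()))    # transpose column-major data to rows
    rs.Close()
conn.Close()
print(len(matches), "matching rows")

Whether this dodges the memory issue on the VDI remains to be seen, but each individual query stays well below the size where things started going wrong.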
Afterthought:
Knowing that BP filtering in smaller chunks takes a lot of time to process, while OleDB should be able to do the same (albeit incompletely) in seconds, moving to either the smaller-chunks solution or the BP-native-filtering solution does not make me feel I accomplished anything great. In other words, I'm happy to be able to deliver a solution, but less happy about the solution itself.
------------------------------
Happy coding!
Paul
Sweden
------------------------------
