
Gen AI on BP runtime resource using Onnx Runtime

EricNewton
Level 4

Hi All,

I just wanted to share a hobby project I have worked on over the last few weeks to see if I could get a local language model to run on a runtime resource.

The general inference workflow for LLMs I have seen is running a separate service and using an API to prompt the LLM, which makes sense for large models requiring a GPU to run. Even most local methods involve running a local server and using API calls in the same manner. A while ago I read about ONNX models (open source, MIT licensed, and led by Microsoft - https://github.com/onnxruntime), which are built for runtime inference, and I recently had the thought of seeing whether this could be integrated into a Blue Prism object and run on a resource.

I only did this on my local computer and on the Learning edition of Blue Prism, so your mileage may vary running elsewhere, but anyway I thought I would share in case anyone else has an interest or use case.

I will link my GitHub repo with the BP release you can download, along with the readme, which I have copied and pasted below so you can decide if you want to take the time to download it.

The TL;DR of the readme is that you will need to get a number of DLLs from Microsoft NuGet packages and then pick a model or models to play with from huggingface.co.

In the release I provide a toy example of taking non-formatted address details in a collection and running a loop to ask the model to output them in JSON format. I think small models offer a lot of potential around fine-tuning for specific processes and running on a runtime. I also played around with trying to get a multimodal model to work, but it was going to take a bit more time so I stopped for now; if there is a lot of interest I can look further into it.

If folks try this out and have issues, let me know and I will help out as best I can.

Anyway, here is the link to the GitHub repo, with the readme below -

https://github.com/etnewton/onnx-blueprism/tree/main

This project aims to run a language model at runtime within Blue Prism Automate.

I was able to achieve this utilizing the ONNX Runtime engine, https://onnxruntime.ai/, with Microsoft's OnnxRuntime and OnnxRuntimeGenAI packages. I used version 1.19.2 of the OnnxRuntime package and 0.4.0 of the GenAI package. I did test the newer versions, 1.20.1 and 0.5.2, and they worked, but I saw some odd behavior with the Phi model not completing, so I would suggest using the versions listed below. I would also like to get multimodal running with a Phi 3 vision model, which requires the newer versions.
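For anyone curious what these packages actually do, the core generation loop they expose in C# looks roughly like the sketch below. This is a minimal illustration assuming the 0.4.0 GenAI API; the model path, prompt, and max_length value are placeholders, not the exact code inside the released object.

using Microsoft.ML.OnnxRuntimeGenAI;

// Load the model folder and its tokenizer (path is a placeholder).
var model = new Model(@"D:\LLM\onnx\Phi-3-mini-128k-instruct-onnx");
var tokenizer = new Tokenizer(model);

// The prompt must already be wrapped in the model's chat template (see the template notes further down).
var sequences = tokenizer.Encode("<|system|>\nYou are a helpful assistant.<|end|>\n<|user|>\nHello<|end|>\n<|assistant|>\n");

var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 512);
generatorParams.SetInputSequences(sequences);

var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();     // run the model for the next position
    generator.GenerateNextToken(); // pick the next token
}

var output = tokenizer.Decode(generator.GetSequence(0));

generator.Dispose();
generatorParams.Dispose();
tokenizer.Dispose();
model.Dispose();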

The system this was tested on was a Windows 11 Pro desktop with Blue Prism v7.3.1 Learning edition installed, an Intel i7-12700K CPU, and 32 GB of RAM. Models are loaded into RAM and the CPU performs inference, so keep this in mind for any models being run, as larger models will take longer to complete inference.

Set Up -

1. Have Blue Prism installed on the resource.

2. Download the required packages and move the DLLs to the Blue Prism Automate folder.

I used NuGet Package Manager to download the packages, then found the appropriate DLL in the runtimes or lib folder noted below and moved it over.

These are typically saved to <YourUserPath>\.nuget\packages\ followed by the package name. To get the correct DLL: if the package has a runtimes folder under the version, navigate to the x64 folder inside it and take that DLL; if there is no runtimes folder, go into the lib folder, look for the netstandard2.0 folder, and take the DLL from there (example paths are shown after the package list below).

x64 Runtimes .dll

Microsoft.ML.OnnxRuntime.DirectML version 1.19.2

https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.DirectML

onnxruntime.dll

Microsoft.ML.OnnxRuntimeGenAI version 0.4.0

https://www.nuget.org/packages/Microsoft.ML.OnnxRuntimeGenAI

onnxruntime-genai.dll

netstandard2.0 lib .dll

Microsoft.ML.OnnxRuntime.Managed version 1.19.2

https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.Managed

Microsoft.ML.OnnxRuntime.dll

Microsoft.ML.OnnxRuntimeGenAI.Managed version 0.4.0

https://www.nuget.org/packages/Microsoft.ML.OnnxRuntimeGenAI.Managed

Microsoft.ML.OnnxRuntimeGenAI.Managed.dll

Microsoft.ML.OnnxTransformer version 3.0.1

https://www.nuget.org/packages/Microsoft.ML.OnnxTransformer

Microsoft.ML.OnnxTransformer.dll

System.Memory version 4.5.5

https://www.nuget.org/packages/System.Memory

System.Memory.dll (note: I overrode the existing DLL in the Blue Prism Automate folder; be sure to test before doing this blindly)

Additionally make sure netstandard.dll is in the same folder.
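To illustrate the two DLL locations, with the default NuGet cache the paths end up looking something like the following (the user folder and exact sub-folders are hypothetical examples and can differ slightly by package and version) -

C:\Users\YourUser\.nuget\packages\microsoft.ml.onnxruntime.directml\1.19.2\runtimes\win-x64\native\onnxruntime.dll

C:\Users\YourUser\.nuget\packages\microsoft.ml.onnxruntime.managed\1.19.2\lib\netstandard2.0\Microsoft.ML.OnnxRuntime.dll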

3. Download the Onnx GenAI.bprelease provided here and import it into Blue Prism - https://github.com/etnewton/onnx-blueprism/tree/main

4. Download a text model in ONNX format from Hugging Face.

Make sure you download the files related to CPU; typically they are in a folder labeled cpu-int4-rtn-block-32 or similar (example for Phi 3 Mini: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4).

I successfully used the following models -

Phi 3 Mini 128K Instruct - takes roughly 3 GB of RAM and ran at around 20 tokens/sec in my tests

Llama 3.2 3B Instruct - takes roughly 3 GB of RAM and ran at around 60 tokens/sec in my tests

Llama 3.1 8B Instruct - takes roughly 5 GB of RAM and ran at around 10 tokens/sec in my tests

The files should all be saved in a single folder; typically this includes the .onnx model file (plus its .onnx.data file), genai_config.json, config.json, and the tokenizer files.

Sometimes you need to rename the config and tokenizer files to remove the folder prefix if you downloaded them manually, for example cpu_and_mobile_cpu-int4-rtn-block-32-acc-level-4_config.json to config.json.

While you are looking at the models, be sure to look up the prompt template, since you need to provide it in the action that loads the model for it to work correctly.

5. Once you have this set up, you are ready to test.

Fire up Blue Prism, log in, and open the Example LM process. Navigate to the Prompt Example With Token Details page. Then, to run the example, you need to do the following -

Update the Model Path data item to the folder path where all the files for the model you downloaded are placed, for example D:\LLM\onnx\Llama-3.2-3B-Instruct-ONNX.

Update the Prompt Template data item to the prompt template of the model. I have already set this to the Phi template, so if you are using that model you don't need to update anything. Place "{systemPrompt}" where the system prompt goes and "{userPrompt}" where the user prompt goes; you pass the actual system and user prompts you want on the Send Prompt actions, and they replace those placeholders when running. For example, the Phi template is -

<|system|>

{systemPrompt}

<|end|>

<|user|>

{userPrompt}

<|end|>

<|assistant|>

It is important to include the line breaks in the template, as this is how the models were trained, and to get accurate output the prompt needs to match the right template. If you are getting weird output from a prompt, check that your template is correct.
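As a rough illustration of the placeholder substitution described above (a sketch in C# with made-up variable names, not the exact code in the released object) -

string promptTemplate = "<|system|>\n{systemPrompt}<|end|>\n<|user|>\n{userPrompt}<|end|>\n<|assistant|>\n";
string systemPrompt = "You convert raw address text into JSON.";
string userPrompt = "10 Downing St London SW1A 2AA";

// The Send Prompt action effectively swaps the placeholders for the real prompts.
string finalPrompt = promptTemplate
    .Replace("{systemPrompt}", systemPrompt)
    .Replace("{userPrompt}", userPrompt);
// finalPrompt is then tokenized and passed to the model.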

Then you can run the example, which takes 5 rows of non-formatted address data and outputs them in the requested JSON format. If everything was done successfully, you should be able to view the Prompts collection and see the generated output. Here is an example using Phi 3 Mini 128K -

In my tests I was achieving roughly 20 tokens per second with Phi 3, all from inference running on the CPU.

3 REPLIES

sastharpa
MVP

Thanks @EricNewton for the wonderful post; your thinking on using a runtime LLM is nothing short of a remarkable idea. I am sure a lot of people in our community will be intrigued and try it out. I will explore it too and provide my feedback.

VL Ganesh
Tech Arch RPA

This is such an interesting approach to improving the development method. I was reading through it and hopefully I am understanding your solution correctly: are you using the language model to generate JSON by taking addresses and correctly formatting them, which are then produced in JSON format?

It would save a lot of hassle, particularly around addresses. I had a development that caused issues with addresses because they are always in a different format/layout; a flat number can be shown as just a number, or people write "Flat 5" for example, and it makes it trial and error to get the right information out.

Yeah, the first use case I thought of was taking freeform data that needs formatting and just having a model format it for you. My background is onboarding client data to our systems, so this was always part of the process - getting client data into the format we need for our system.

It is just a toy in my example process, but I thought it might demonstrate the usefulness of using a small model at runtime versus having to send data to an API. Whether something like this works in a real-world application depends on the prompt and the model's results, as these things can always hallucinate something, but I think there are some easy solutions for checking hallucinations in an example like this (I think just checking that each JSON value is in the original address is probably the easiest in this case, along the lines of the sketch below).
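A minimal sketch of that containment check, assuming Newtonsoft.Json is available (Blue Prism ships it for its JSON utilities) and that the model returns a flat JSON object of address fields -

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

static class HallucinationCheck
{
    // Returns true only if every non-empty value the model produced can be
    // found somewhere in the raw address text (ignoring case).
    public static bool AllValuesAppearInSource(string modelJson, string originalAddress)
    {
        JObject parsed = JObject.Parse(modelJson);
        return parsed.Properties()
            .Select(p => p.Value.ToString())
            .Where(v => !string.IsNullOrWhiteSpace(v))
            .All(v => originalAddress.IndexOf(v, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}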

Another thought I had relates to the current big buzzword of an "Agent", where you could possibly fine-tune a small model with a list of processes available to run in your environment and let it decide which process to run with the data it receives. Something like a bot checking emails and deciding whether something is a new order or a customer complaint and spinning up the related process for each unique case.