Monday
Hi All,
I just wanted to share a hobby project I have worked on over the last few weeks to see if I could get a local language model to run on a runtime resource.
The general inference workflow for LLMs I have seen is running a separate service and using an API to prompt the LLM, which makes sense for large models that require a GPU. Even most local methods involve running a local server and making API calls in the same manner. A while ago I read about ONNX models and the ONNX Runtime (open source, MIT licensed, and led by Microsoft - https://github.com/microsoft/onnxruntime), which is built for runtime inference, and I recently had the thought of seeing whether I could integrate it into a Blue Prism object and run it on the resource.
I only did this on my local computer and on the Learning edition of Blue Prism, so your mileage may vary running elsewhere, but anyway I thought I would share in case anyone else has an interest or use case.
I will link my GitHub below with the BP release you can download, along with the readme, which I will copy and paste here so you can decide whether you want to take the time to download it.
The TL;DR of the readme is that you will need to get a number of DLLs from Microsoft NuGet packages and then pick a model or two to play with from huggingface.co.
In the release I provide a toy example that takes non-formatted address details in a collection and loops over them, asking the model to output each one in JSON format. I think small models offer a lot of potential for fine-tuning on specific processes and running on a runtime resource. I also played around with trying to get multimodal working, but it was going to take a bit more time so I stopped for now; if there is a lot of interest I can look further into it.
If folks try this out and have issues, let me know and I will help out as best I can.
Anyway, here is the link to the GitHub repo, with the readme below -
https://github.com/etnewton/onnx-blueprism/tree/main
This project aims to run a language model at runtime within Blue Prism Automate.
I was able to achieve this using the ONNX Runtime engine, https://onnxruntime.ai/, with Microsoft's OnnxRuntime and OnnxRuntimeGenAI packages. I used version 1.19.2 of OnnxRuntime and 0.4.0 of the GenAI packages. I did test the newer versions, 1.20.1 and 0.5.2, and they worked, but showed some odd behavior with the Phi model not completing, so I would suggest using the versions mentioned below. I would like to get multimodal running with a Phi 3 vision model, which requires the newer versions.
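For anyone curious what that boils down to in code, the load-and-generate loop those packages expose looks roughly like the following. This is a simplified sketch based on Microsoft's published C# examples for the 0.4.0 GenAI package, not the exact code stage shipped in the release, and the model path and prompt are just examples -

using Microsoft.ML.OnnxRuntimeGenAI;

// Load the folder of ONNX model files and its tokenizer (example path).
using var model = new Model(@"D:\LLM\onnx\Phi-3-mini-128k-instruct-onnx");
using var tokenizer = new Tokenizer(model);

// Encode the full prompt (the template with system/user text already filled in).
string fullPrompt = "<|system|>\nYou are a helpful assistant.\n<|end|>\n<|user|>\nSay hello.\n<|end|>\n<|assistant|>\n";
var sequences = tokenizer.Encode(fullPrompt);

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 512);
generatorParams.SetInputSequences(sequences);

// Generate token by token on the CPU until the model finishes or hits max_length.
using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));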
The system this was tested on was a Windows 11 Pro desktop with Blue Prism v7.3.1 Learning edition installed, an Intel i7-12700K CPU, and 32 GB of RAM. Models are loaded into RAM and the CPU performs inference, so keep this in mind for any models being run, as larger models will take longer to complete inference.
Set Up -
I used NuGet Package Manager to download the packages, then found the appropriate DLL in the build or runtime folder noted below and moved it over.
These are typically saved to YourUserPath\.nuget\packages\ followed by the package name. To get the correct DLL: if the package has a runtimes folder under the version, navigate to the x64 folder inside it and take the DLL from there; if there is no runtimes folder, go into the lib folder, look for the netstandard2.0 folder, and take the DLL from there.
x64 Runtimes .dll
Microsoft.ML.OnnxRuntime.DirectML version 1.19.2
https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.DirectML
onnxruntime.dll
Microsoft.ML.OnnxRuntimeGenAI version 0.4.0
https://www.nuget.org/packages/Microsoft.ML.OnnxRuntimeGenAI
onnxruntime-genai.dll
netstandard2.0 lib .dll
Microsoft.ML.OnnxRuntime.Managed version 1.19.2
https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.Managed
Microsoft.ML.OnnxRuntime.dll
Microsoft.ML.OnnxRuntimeGenAI.Managed version 0.4.0
https://www.nuget.org/packages/Microsoft.ML.OnnxRuntimeGenAI.Managed
Microsoft.ML.OnnxRuntimeGenAI.Managed.dll
Microsoft.ML.OnnxTransformer version 3.0.1
https://www.nuget.org/packages/Microsoft.ML.OnnxTransformer
Microsoft.ML.OnnxTransformer.dll
System.Memory version 4.5.5
https://www.nuget.org/packages/System.Memory
System.Memory.dll (note: I overrode the existing DLL in the Blue Prism Automate folder - be sure to test before doing this blindly)
Additionally, make sure netstandard.dll is in the same folder.
Make sure you download the model files related to CPU; typically they are in a folder labeled cpu-int4-rtn-block-32 or similar (example for Phi 3 Mini: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4).
I successfully used the following models -
Phi 3 Mini 128K Instruct - takes roughly 3 GB of RAM and ran at roughly 20 tokens/sec in my tests
Llama 3.2 3B Instruct - takes roughly 3 GB of RAM and ran at roughly 60 tokens/sec in my tests
Llama 3.1 8B Instruct - takes roughly 5 GB of RAM and ran at roughly 10 tokens/sec in my tests
The files should be saved in a single folder and they should look something like this -
Sometimes you need to rename the config and tokenizer files to remove the folder prefix if you downloaded them manually, for example cpu_and_mobile_cpu-int4-rtn-block-32-acc-level-4_config.json to config.json.
While you are looking at the models, be sure to look up the prompt template, since you need to provide it in the action that loads the model for the prompts to work correctly.
Now you can fire up Blue Prism, log in, and open the Example LM process. Navigate to the Prompt Example With Token Details page. Then, to run the example, you need to do the following -
Update the Model Path data item to the folder path where all the files for the model you downloaded are placed, for example D:\LLM\onnx\Llama-3.2-3B-Instruct-ONNX.
Update the Prompt Template data item to the prompt template of the model. I have already set this to the Phi template, so if you are using that model you don't need to update anything. Put "{systemPrompt}" where the system prompt should go in the template and "{userPrompt}" where the user prompt should go; you pass the actual system and user prompts you want to the Send Prompt actions, and they replace those placeholders when running. For example, the Phi template is -
<|system|>
{systemPrompt}
<|end|>
<|user|>
{userPrompt}
<|end|>
<|assistant|>
It is important to include the line breaks in the template, as this is how the models were trained, and the template needs to match exactly to get accurate output. If you are getting weird output from a prompt, check that your template is correct.
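In case it helps to see it spelled out, the placeholder replacement the Send Prompt action performs amounts to a simple string substitution along these lines (purely illustrative - the variable names are not from the release) -

// Illustrative only: fill the template placeholders with the actual prompts.
string promptTemplate = "<|system|>\n{systemPrompt}\n<|end|>\n<|user|>\n{userPrompt}\n<|end|>\n<|assistant|>\n";
string systemPrompt = "You are a helpful assistant.";
string userPrompt = "Format this address as JSON.";
string fullPrompt = promptTemplate
    .Replace("{systemPrompt}", systemPrompt)
    .Replace("{userPrompt}", userPrompt);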
Then you can run the example, which takes 5 rows of non-formatted addresses and outputs them in the requested JSON format. If everything was done successfully, you should be able to view the Prompts collection and see the generated output. Here is an example using Phi 3 Mini 128k -
As you can see, in my tests I was achieving roughly 20 tokens per second with Phi 3, all from inference running on the CPU.
yesterday
Thanks @EricNewton for the wonderful post; your thought on using a runtime LLM is nothing short of a remarkable idea. I am sure a lot of people in our community will be intrigued and try it out. I will explore it too and provide my feedback.
yesterday
This is such an interesting approach to improving the development method. I was reading through it and hopefully I'm understanding your solution correctly: are you using the language model to take addresses, format them correctly, and then produce them in JSON format?
It would save a lot of hassle, particularly around addresses. I had a development that caused issues with addresses because they're always in a different format/layout; a flat number can be shown as just a number, or people write "Flat 5" for example, and it makes it trial and error to get the right information out.
yesterday
Yeah, the first use case I thought of was taking freeform data that needs formatting and having a model format it for you. My background is onboarding client data to our systems, so this was always part of the process - getting client data into the format we need for our system.
It is just a toy in my example process, but I thought it might demonstrate the usefulness of a small model at runtime versus having to send requests to an API. Whether something like this works in a real-world application depends on the prompt and the model's results, as these things can always hallucinate, but I think there are some easy ways to check for hallucinations in an example like this (just checking whether each JSON value appears in the original address is probably the easiest in this case).
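For what it's worth, that check could be as simple as something like this. It's a rough sketch and not part of the release; it assumes Newtonsoft.Json is available (Blue Prism's Utility - JSON object uses it, but check your environment) and that the model returned flat key/value JSON -

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

// Rough sketch of the check described above: every value the model put into
// the JSON should appear somewhere in the original free-form address.
bool AllValuesInOriginal(string originalAddress, string modelJson)
{
    JObject parsed = JObject.Parse(modelJson);
    return parsed.Properties()
        .Select(p => p.Value.ToString())
        .Where(v => !string.IsNullOrWhiteSpace(v))
        .All(v => originalAddress.IndexOf(v, StringComparison.OrdinalIgnoreCase) >= 0);
}

// Made-up example values, purely to show usage.
Console.WriteLine(AllValuesInOriginal(
    "flat 5 12 high street london",
    "{\"line1\":\"flat 5 12 high street\",\"city\":\"london\"}"));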
Another thought I had was around the current big buzzword of an "Agent", where you could possibly fine-tune a small model with a list of processes available to run in your environment and let it decide which process to run with the data it receives. Something like a bot checking emails and deciding whether something is a new order or a customer complaint, and spinning up the related process for each case.