The democratization of open-weight, locally-run language models has been one of the great achievements of the open-source community over the last few years. Large language models had long been thought of as a tool for enterprise corporations, requiring compute and software out of reach of individuals. That changed when Meta released its LLaMA models, making the controversial but courageous decision to publish the weights for the community to use. The release sparked a wave of open-source projects built on those weights, laying the foundation for the local language model ecosystem we enjoy today.

The first of these developments was a project called llama.cpp, by Georgi Gerganov. It began as a C/C++ library for performing inference with the LLaMA model weights and those of similar models. It lets users load these weights onto a local GPU, or run inference on the CPU when there isn't enough VRAM to do so. As support for more models and architectures grew, it rapidly became one of the most popular projects on GitHub and is now a core library for local model inference alongside Hugging Face's Transformers.

As llama.cpp grew in popularity, other open-source projects spawned from it. One of these was Ollama, which began as a REST-based wrapper around llama.cpp that made it easy to download, manage, and run inference on models through simple REST calls, with support for multiple users. Started by Jeffrey Morgan, a Docker alum, the project also grew rapidly in size and support and is currently one of the most popular ways to get language models running quickly and easily on a local machine. All that's required is to download Ollama, pull the model you'd like to use, and then issue a simple REST API call to start interacting with it.

The Ollama frontend app, running on Windows. Credit: Ollama.com
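
To make that last step concrete, here is a minimal sketch of what that REST call can look like from C++, using the cpp-httplib and nlohmann-json libraries against Ollama's /api/generate endpoint. It is an illustration only, and assumes Ollama is running locally on its default port (11434) and that a model such as llama3:8b has already been pulled:

#include <iostream>
#include "httplib.h"
#include <nlohmann/json.hpp>

int main()
{
    // Connect to the local Ollama server on its default port.
    httplib::Client client("localhost", 11434);

    // Build the request body for the /api/generate endpoint.
    nlohmann::json request;
    request["model"]  = "llama3:8b";
    request["prompt"] = "Why is the sky blue?";
    request["stream"] = false; // ask for the full reply in a single JSON object

    auto result = client.Post("/api/generate", request.dump(), "application/json");
    if (result && result->status == 200)
        std::cout << nlohmann::json::parse(result->body)["response"].get<std::string>() << std::endl;
}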

The community has developed a large number of open-source wrappers and libraries for use with Ollama, including bindings for many different languages. The project officially provides JavaScript and Python libraries, and the community has created wrappers for most other languages, allowing developers to use Ollama quickly and easily across a variety of platforms.

Noticing the lack of C++ support, I developed Ollama-hpp last year. It is a set of header-only C++ bindings for Ollama, designed to reduce the time-to-code for using Ollama to a single include and a minimal set of commands for inference and control. It bundles nlohmann-json and cpp-httplib, two other popular open-source C++ libraries, to handle encoding messages to JSON and sending HTTP requests to the Ollama server. Inference can be achieved in just two lines, with most other features of the Ollama API also supported.

#include "ollama.hpp"

std::cout << ollama::generate("llama3:8b", "Why is the sky blue?") << std::endl;

The README and unit tests walk through the use cases that the library supports. A few examples are shown below:

Chat functionality:

ollama::message message1("user", "What are nimbus clouds?");
ollama::message message2("assistant", "Nimbus clouds are dark rain clouds.");
ollama::message message3("user", "What are some other kinds of clouds?");

ollama::messages messages = {message1, message2, message3};

ollama::response response = ollama::chat("llama3:8b", messages);
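
The returned ollama::response can be written directly to an output stream or inspected as JSON, using the same accessors shown in the streaming example below:

std::cout << response << std::endl;        // print the assistant's reply
bool done = response.as_json()["done"];    // true once generation has completed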

Asynchronous streaming with a user function:

bool on_receive_response(const ollama::response& response)
{   
  std::cout << response << std::flush;

  if (response.as_json()["done"]==true) std::cout << std::endl;

  // Return true to continue streaming, or false to stop immediately
  return true;
}

std::function<bool(const ollama::response&)> response_callback = on_receive_response;

ollama::message message("user", "Why is the sky blue?");
ollama::options options; // default model options; values can be set before the call

ollama::chat("llama3:8b", message, response_callback, options);
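
Because the call accepts an ollama::options object, generation parameters can be set on it before invoking ollama::chat. The parameter names below are standard Ollama options, shown purely as an illustration:

// Set these before the ollama::chat call above
options["num_predict"] = 128;   // limit the number of tokens generated
options["temperature"] = 0.7;   // adjust sampling randomness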

Image support:

ollama::image image = ollama::image::from_file("llama.jpg");

// We can optionally include images with each message.
// Vision-enabled models will be able to utilize these.
ollama::message message_with_image("user", "What do you see in this image?", image);
ollama::response response = ollama::chat("llava", message_with_image);

Power users who want full control over inference may be better served by linking against llama.cpp directly in their projects. For those already using Ollama who need C++ support, Ollama-hpp is a fast, low-code way to integrate language models into a project. The library has received over 200 stars and is hosted on GitHub. Please drop a comment or send me a message at code@jmont.net if you find it useful.

 

