LLM Agents with Ell
20 October 2024
Ell is a fairly new Python library that is similar to LangChain. In my opinion it feels simpler to use, though it is less featureful. It is mostly useful as a higher-level API on top of LLM providers (such as OpenAI or Anthropic) and for creating agentic tools - functions that can be called by an LLM. Its main selling point, however, is that it can store all requests and responses in a database that can later be viewed in Ell Studio - a GUI running in your browser that shows a dependency graph with details of each prompt and tool, while also keeping track of changes to these pieces. This makes Ell an excellent tool for prompt engineering. Each prompt and tool is represented as a Python function, and the database tracks changes to the functions' signatures and implementations. Moreover, the docstring of the function is the system message, which blends very nicely into the Python code. To mark a function as either a prompt/chat or a tool, you use Python decorators.
Basic example
I will show you a simple example of how to connect Ell to OpenAI to use GPT. The process for Claude is similar. I will cover other providers later on, as they are not directly included in Ell so far and need some custom configuration.
import ell

@ell.simple( model="gpt-4o-mini", temperature=0.7 )
def name_meaning(name: str) -> str:
    """
    You are a chatbot that tells the meaning of person's name in a funny way.
    """
    return f"The person who approached you is named {name}."

if __name__ == '__main__':
    print( name_meaning("Mordecai") )
This is a prompt that uses GPT-4o-mini. The system prompt is "You are a chatbot that tells the meaning of person's name in a funny way." The user message sent to the LLM will be "The person who approached you is named Mordecai." In order to run this example, you will need an API key from OpenAI. Log in to OpenAI, go to Projects, create a new project and create an API key in the new project for yourself. Restrict it to "Read" on "Models" and "Write" on "Model capabilities". Store the key in a safe place such as a password manager. Of course you also need some credits or a payment method 😄 - I always top up my account with $10 worth of credits so that I am on the safe side.
Now that you have saved the key into a password manager or something, you can run the example. Assuming you have Python 3 with venv installed globally, we will create a new virtual environment and install ell. Save the above example in a file called example.py and run the following commands:
$ python3 -m venv venv          # Create a virtual environment
$ source venv/bin/activate
$ pip install "ell-ai[all]"     # Install ell (the [all] extra also pulls in Ell Studio)
$ read -s OPENAI_API_KEY        # Paste the API key after running this command
$ export OPENAI_API_KEY
$ python3 example.py            # Run the example
Ah, Mordecai! The name that sounds like a fancy dish you’d order at a five-star restaurant but ends up being just a really intense salad!
Storing and browsing prompts
As stated before, one of Ell's great capabilities is its prompt browser. Storing all the calls made by Ell, the responses and the changes to the functions is super easy and straightforward. Just add this single line at the beginning of the main file. Then all the functions marked by any of Ell's decorators will be recorded.
import ell

ell.init(store='./log', autocommit=True)

@ell.simple( model="gpt-4o-mini", temperature=0.7 )
def name_meaning(name: str) -> str:
    ...
It should create a new directory called log with a database of calls. Try it out! Then try changing the function signature, for example as shown below, and run the program again.
import ell

ell.init( store="./log", autocommit=True )

@ell.simple( model="gpt-4o-mini", temperature=0.7 )
def name_meaning(name: str, age: int) -> str:
    """
    You are a chatbot that tells the meaning of person's name in a funny way.
    Adapt your response to the person's age.
    """
    return f"The person who approached you is named {name}. Their age is {age}."

if __name__ == '__main__':
    print(name_meaning("Mordecai", 20))
    print(name_meaning("Mordecai", 5))
Now you can run Ell Studio on this database. It will serve a web page on localhost, port 5555, where you can browse the calls and responses in your browser.
$ ell-studio --storage-dir log
INFO: Uvicorn running on http://127.0.0.1:5555 (Press CTRL+C to quit)
2024-10-10 17:39:29 INFO ] Database file found: log
Open the browser and go to http://localhost:5555 to see all the functions that were called. What is more, the database tracks changes to the functions and their signatures (stylized as Git commits). You can easily compare the prompts and see the behavior change.
Defining tools for agents
Defining tools for agents is almost as easy as defining prompts. You just need to use the @ell.tool decorator on the function that is supposed to be a tool. Then on the prompt function you use the @ell.complex decorator and specify the list of tools to use. However, remember that each tool adds to the tokens used. For modern models this shouldn't be a problem, but if you plan to retrieve entire websites and put them into the conversation, you need to keep the context window size in mind. GPT-4o and GPT-4o-mini have a 128k-token context window, which seems quite large, but building an entire backend service out of GPT-4o calls instead of business logic doesn't seem like a reasonable idea.
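Ell itself doesn't enforce any limit here, so a rough token count before each call can help you stay inside the window. Below is a minimal sketch of that idea, assuming a recent version of the tiktoken library (o200k_base is the encoding used by the GPT-4o family); the 120k budget is an arbitrary safety margin of my choosing, not anything Ell defines:
from typing import List
import tiktoken

# o200k_base is the tokenizer used by the GPT-4o family of models
encoding = tiktoken.get_encoding("o200k_base")

def estimate_tokens(texts: List[str]) -> int:
    # Rough estimate: sum of the tokens of every message text (ignores per-message overhead)
    return sum(len(encoding.encode(t)) for t in texts)

def fits_in_context(texts: List[str], budget: int = 120_000) -> bool:
    # Keep some headroom below the 128k limit for the response and the tool schemas
    return estimate_tokens(texts) <= budget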
One of the most difficult things for LLMs is comparing two numbers (or at least it used to be). The most popular example is "Is 9.11 bigger than 9.9?". However, as this trivial problem was recently patched, let's bump the difficulty and use some trigonometry. From my tests, it is confusing enough for GPT-4o-mini, which changes its answer every other time at a temperature of 0.5 without the help of any tools.
The tools in Ell are defined as Python functions where the docstring is the description of what the tool does (it is passed to the LLM), and each argument should be marked with pydantic.Field with a description of what that argument is supposed to be.
import ell
import math
from pydantic import Field

@ell.tool()
def sin(angle: float = Field(description="Angle in degrees.")) -> str:
    """Returns a value of sin(angle) with up to 10 decimal places. For example sin(30) = 0.5."""
    value = math.sin(math.radians(angle))
    return f"{value:.10f}"
Now we can pass this tool to the prompt. It will be automatically injected when sending the request to the API. Our new prompt function will return a full list of messages, because each request to use a tool comes back to us from the LLM as a message, and each result from a tool is a new user message appended to the conversation - otherwise we would be stuck in a loop of the model waiting for answers from the tools.
from typing import List

@ell.complex(model="gpt-4o-mini", tools=[ sin ], temperature=0.5)
def math_assistant(message_history: List[ell.Message]) -> List[ell.Message]:
    return [
        ell.system("You are a helpful assistant that helps with math."),
    ] + message_history
With each call to the LLM (so with each run of math_assistant), we send the entire history of the conversation, so all the previous messages returned by the LLM or by Ell's tools must be included. However, let's start with a very simple example and run it. This is a pure Python function that just calls math_assistant with a single message. Once we run the code below, we should get something like this:
def converse(initial_prompt: str):
    conversation = [ ell.user(initial_prompt) ]
    response: ell.Message = math_assistant(conversation)
    return response.text

if __name__ == '__main__':
    print(converse("What is sin(30)?"))
$ python3 sine_example.py
ToolCall(sin(angle=30.0), tool_call_id='call_LV1A18u8c38djddmns2m')
So, instead of a response from the LLM, we get an object defined in Ell, because the library decoded the LLM's response as a request to call a tool, which is not supposed to be given back to the user. Instead of returning it, we should check whether the latest response contains tool calls and, if so, execute what is requested and send the results back to the LLM for another inference.
def converse(initial_prompt: str):
    conversation = [ ell.user(initial_prompt) ]
    response: ell.Message = math_assistant(conversation)
    if response.tool_calls:
        tool_results = response.call_tools_and_collect_as_message()
        # Include what the user wanted, what the assistant requested to run, and what the tool returned
        conversation = conversation + [response, tool_results]
        response = math_assistant(conversation)
    return response.text

if __name__ == '__main__':
    print(converse("What is sin(30)?"))
Once we run it, the output for this script should be similar to the following:
$ python3 sine_example.py
The sine of 30 degrees is \( \sin(30) = 0.5 \).
However, what if we ask the model to run several sine calculations? Or to run sin recursively? Let's try this out by just changing the prompts.
if __name__ == '__main__':
    print("First question: ", converse("What is sin(35) and sin(51)?"))
    print("Second question: ", converse("What is sin(sin(67))?"))
$ python3 sine_example.py
First question: The values are:
- \( \sin(35^\circ) \approx 0.5735764364 \)
- \( \sin(51^\circ) \approx 0.7771459615 \)
Second question: ToolCall(sin(angle=0.9205048535), tool_call_id='call_LV1Am3HqgaPY6zhpEdR7j90m')
As you can see, for the first question the tools ran in parallel (hence the call_tools... method). However, for recursive calls, we need to implement a loop that runs all the requested tools until there are no more requests. Additionally, we should add a safeguard that protects us from the model looping infinitely. Once we simply change the if to a while, the application starts to work.
...
    # The only change for now
    while response.tool_calls:
        tool_results = response.call_tools_and_collect_as_message()
        # Include what the user wanted, what the assistant requested to run, and what the tool returned
        conversation = conversation + [response, tool_results]
        response = math_assistant(conversation)
...
$ python3 sine_example.py
First question: The values are as follows:
- \( \sin(35^\circ) \approx 0.5735764364 \)
- \( \sin(51^\circ) \approx 0.7771459615 \)
Second question: The value of \( \sin(\sin(67)) \) is approximately \( 0.0160651494 \).
As a safeguard we should add a variable that counts the number of loops and include it in the condition. This protects us from getting charged for burning tokens for no purpose. I will give my application at most 5 loops before it raises an error.
def converse(initial_prompt: str):
    conversation = [ ell.user(initial_prompt) ]
    response: ell.Message = math_assistant(conversation)
    max_iterations = 5
    while max_iterations > 0 and response.tool_calls:
        tool_results = response.call_tools_and_collect_as_message()
        # Include what the user wanted, what the assistant requested to run and what the tool returned
        conversation = conversation + [response, tool_results]
        response = math_assistant(conversation)
        max_iterations -= 1
    if max_iterations <= 0:
        raise Exception("Too many iterations, probably stuck in a loop.")
    return response.text
The easiest way to test this would be to stack 6 sine calls so that the program raises an exception. Let's do two tests - one prompt where we ask the agent to run it three or four times, and another prompt asking it to run it six times. Depending on the mood and temperature, the model will understand 🙄 and finally hit the limit.
if __name__ == '__main__':
    print("First question: ", converse("What is sine of sine of sine of 10 degrees?"))
    print("Second question: ", converse("What is sine of sine of sine of sine of sine of sine of 10 degrees?"))
$ python3 sine_example.py
First question: The sine of sine of sine of 10 degrees is approximately \(0.0000528962\).
Traceback (most recent call last):
File "/Users/ppabis/Ell1/sine_example.py", line 40, in <module>
print("Second question: ", converse("What is sine of sine of sine of sine of sine of sine of 10 degrees?"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ppabis/Ell1/sine_example.py", line 34, in converse
raise Exception("Too many iterations, probably stuck in a loop.")
Exception: Too many iterations, probably stuck in a loop.
Debugging with ell-studio
However, the example I presented here required several runs. I decided to record the calls using Ell's built-in logging and check what the calls to the functions looked like. I was pretty unlucky, because GPT-4o-mini sometimes did not understand how to run the 6 nested sin functions recursively and instead ran them in parallel. On the other hand, the three nested calls were correctly run recursively.
When you nest multiple calls, Ell Studio will also show dependencies between prompts and tools, as well as which output goes to which input. In our sine example, the output of math_assistant is fed back into its input, and the output of sin is one of math_assistant's inputs (orange dashed lines). The number next to the bolt ⚡️ also indicates how many times each function was invoked.
This way I was able to improve the system prompt and try again. This time the model was more stable, but it still sometimes forgot to run one of the functions, despite the fact that I described very verbosely what it should do. gpt-4o did understand it, though.
@ell.complex(model="gpt-4o-mini", tools=[ sin ], temperature=0.8)
def math_assistant(message_history: List[ell.Message]) -> List[ell.Message]:
return [
ell.system("""You are a helpful assistant that helps with math.
Be very careful with parentheses and run tools recursively when requested.
For example if the user asks you to run f(10) and f(20) then you should use the tool f twice at once.
If the user asks you f(f(f(5))) then you should request another function or tool call when needed."""),
] + message_history
Comparator function and a test set
Technically, gpt-4o-mini already compares numbers quite well. However, we can still improve its performance by creating a tool that returns "larger", "smaller" or "equal" based on two floats passed into the function. This way we can ask the model to compare two numbers that are wrapped inside the sin function or any other we come up with. I will also extract the internals of the functions so that they can be conveniently reused for testing, and add a tan function to mix things up a bit and make them more challenging.
def _sin(angle: float) -> float:
    return math.sin(math.radians(angle))

def _tan(angle: float) -> float:
    return math.tan(math.radians(angle))

def _number_comparator(a: float, b: float) -> str:
    if a > b:
        return f"{a} is larger than {b}, {a} > {b}"
    elif a < b:
        return f"{a} is smaller than {b}, {a} < {b}"
    else:
        return f"{a} is equal to {b}, {a} = {b}"

@ell.tool()
def number_comparator(a: float = Field(description="First number 'a' to compare."),
                      b: float = Field(description="Second number 'b' to compare.")) -> str:
    """Returns 'larger' if a > b, 'smaller' if a < b and 'equal' if a = b."""
    return _number_comparator(a, b)

@ell.tool()
def tangent(angle: float = Field(description="Angle in degrees.")) -> str:
    """Returns a value of tan(angle), which is tangent, with up to 10 decimal places. For example tan(45) = 1."""
    value = _tan(angle)
    return f"{value:.10f}"

@ell.tool()
def sin(angle: float = Field(description="Angle in degrees.")) -> str:
    """Returns a value of sin(angle), which is sine, with up to 10 decimal places. For example sin(30) = 0.5."""
    value = _sin(angle)
    return f"{value:.10f}"
Now we can prepare some test sets. We will print out both the actual values computed directly by Python and the results produced by the model. Some of the examples will be simple, others more complex. For each test we will use a structure that contains the prompt and the expected result. Instead of hardcoding the results, we will use the extracted functions, prefixed with an underscore, so that they are not recorded by Ell's logger. Remember to also update the math_assistant function to include the new tools.
...
@ell.complex(model="gpt-4o-mini", tools=[ sin, tangent, number_comparator ], temperature=0.8)
def math_assistant(message_history: List[ell.Message]) -> List[ell.Message]:
...

TESTS = [
    {
        'prompt': "Is 0.5 larger than 0.4?",
        'expected': f"{_number_comparator(0.5, 0.4)}"
    },
    {
        'prompt': "What is sin(3910)?",
        'expected': f"sin(3910) = {_sin(3910):.10f}"
    },
    {
        'prompt': "What is tan(4455)?",
        'expected': f"tan(4455) = {_tan(4455):.10f}"
    },
    {
        'prompt': "What is larger sin(sin(789)) or tan(9999)?",
        'expected': f"sin(sin(789)) = {_sin(_sin(789)):.10f}, tan(9999) = {_tan(9999):.10f}, {_number_comparator(_sin(_sin(789)), _tan(9999))}"
    },
    {
        'prompt': "What is the value of sin(tan(8910))?",
        'expected': f"sin(tan(8910)) = {_sin(_tan(8910)):.10f}"
    },
    {
        'prompt': "Is sin(8.333) larger than tan(5.45)?",
        'expected': f"{_number_comparator(_sin(8.333), _tan(5.45))}"
    }
]

if __name__ == '__main__':
    for test in TESTS:
        print(f"❓ Prompt: {test['prompt']}")
        print(f"👍 Expected: {test['expected']}")
        print(f"🤖 Model: {converse(test['prompt'])}")
        print()
I will just print the results and let you compare whether they are correct. You can play around and call another prompt that takes the expected result and the output of the converse function and asks the model to semantically compare whether the two are equal. This way you can add even more automation 😄.
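As a sketch of that idea (the judge prompt and its exact wording are my own assumption, not something from the examples above), a second @ell.simple prompt could do the semantic comparison:
@ell.simple(model="gpt-4o-mini", temperature=0.0)
def judge(expected: str, actual: str) -> str:
    """
    You are a strict grader. Compare the expected answer with the model's answer.
    Reply with only EQUIVALENT or DIFFERENT. Ignore formatting and small rounding differences.
    """
    return f"Expected: {expected}\nModel answer: {actual}"

# Hypothetical use inside the test loop:
# verdict = judge(test['expected'], converse(test['prompt']))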
I tried it on gpt-4o-mini and the results were quite good, but it still didn't follow all the nested function calls. sin(tan(8910)) was computed correctly, but sin(sin(789)) was only a single sin call. Even reducing the temperature did not help.
$ python3 sine_example.py
❓ Prompt: What is larger sin(sin(789)) or tan(9999)?
👍 Expected: sin(sin(789)) = 0.0162933313, tan(9999) = -6.3137515147, 0.016293331282790816 is larger than -6.313751514675357, 0.016293331282790816 > -6.313751514675357
🤖 Model: The value of \( \sin(\sin(789)) \) is larger than \( \tan(9999) \). Specifically, \( 0.9335804265 > -6.3137515147 \).
I retried the same tests with llama-3.2-3b on Groq to see how well a small model would behave. However, this model did not even understand how to deal with the functions - apparently the internal prompts were not good enough to guide it through the process. Even in the first call it failed to format the answer reasonably.
$ python3 sine_example.py
❓ Prompt: Is 0.5 larger than 0.4?
👍 Expected: 0.5 is larger than 0.4, 0.5 > 0.4
🤖 Model: number_comparator=a=0.5,b=0.4{"number_comparator": "larger"}</number_comparator>
❓ Prompt: What is sin(3910)?
👍 Expected: sin(3910) = -0.7660444431
Traceback (most recent call last):
...
openai.BadRequestError: Error code: 400 - {'error': {'message': "Failed to call a function. Please adjust your prompt. See 'failed_generation' for more details.", 'type': 'invalid_request_error', 'code': 'tool_use_failed', 'failed_generation': '<function=sin>{3910}</function>'}}
Any other standard Llama model on Groq also failed miserably or wasn't even available (responding with 503). However, llama3-groq-70b-8192-tool-use-preview actually worked, but it still made the same mistake of calling only a single sin function instead of two nested ones, just like gpt-4o-mini. It did not give me the intermediate values, just a plain "larger", and changing the tangent values to invert the test did not change the result. I looked into Ell Studio and saw that two sin functions were called in parallel.
{ "message_history": [
{ "content": [
{
"text": { "content": "What is larger sin(sin(789)) or tan(9939)? Provide all the intermediate values." }
}
],
"role": "user"
},
{ "content": [
{ "tool_call": {
"tool": "<function sin at 0x107a0eb60>",
"tool_call_id": { "content": "call_sxmq" }
}
},
{ "tool_call": {
"tool": "<function sin at 0x107a0eb60>",
"tool_call_id": { "content": "call_914c" }
}
},
{ "tool_call": {
"tool": "<function tangent at 0x107a0e3e0>",
"tool_call_id": { "content": "call_dd5a" }
}
}
],
"role": "assistant"
},
{ "content": [
{ "tool_result": {
"result": [ { "text": { "content": "\"0.9335804265\"" } } ],
"tool_call_id": { "content": "call_sxmq" }
}
},
{ "tool_result": {
"result": [ { "text": { "content": "\"0.9335804265\"" } } ],
"tool_call_id": { "content": "call_914c" }
}
},
{ "tool_result": {
"result": [ { "text": { "content": "\"0.8097840332\"" } } ],
"tool_call_id": { "content": "call_dd5a" }
}
}
],
"role": "user"
}
]}
Sorry for the weird formatting, but I wanted to shorten it a bit. These are the three messages exchanged between our program and the model. First we send the prompt. Then the assistant sends us back a request to run the tools: two sin functions and one tan function. Then, as the "user", we send the model the history of everything that happened before (the prompt and the requests to run the tools) along with the outputs of these tools.
Using different model providers
By default, Ell can use OpenAI models. Anthropic is another first-class provider in Ell. According to the documentation it also supports Groq and Cohere, but you still have to construct the client yourself, so it is not different from any other compatible client. I decided to demonstrate Groq as it has a quite generous free tier and very fast inference thanks to their own LPU hardware. Create a Groq account and generate a new key here. Save it somewhere safe, then load it into the environment. Also make sure the Groq Python package is installed.
$ read -s GROQ_API_KEY # Paste the key after running this command
$ export GROQ_API_KEY
$ pip install groq
There are two ways to construct and pass a Groq client - either using their client directly or via an OpenAI-compatible endpoint. ell is compatible with both so far, but I want to demonstrate the OpenAI one as well, since it can be reused for other providers that offer such API compatibility.
import os
from groq import Groq
from openai import Client

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

groq_openai_client = Client(base_url="https://api.groq.com/openai/v1", api_key=os.environ["GROQ_API_KEY"])

@ell.simple(model="llama-3.2-3b-preview", client=groq_client)
def llama_groq():
    """You are a helpful assistant."""
    return "Who are you, what's your name? What model are you? What version?"

@ell.simple(model="llama-3.2-3b-preview", client=groq_openai_client)
def llama_groq_openai():
    """You are a helpful assistant."""
    return "Who are you, what's your name? What model are you? What version?"
Gaining Mastery
However, you won't learn until you experiment (or even fail sometimes 😣). The examples above give only a glimpse of what Ell does; what you use it for is a completely different story, just as learning a new programming framework makes little sense if you never apply it. So look for things that you can optimize and automate with AI agents. Maybe port an old project you made with LangChain or Swarm (a new agent framework from OpenAI)?
Another thing you can try is embedding calls to other prompts within tools - for example, the main GPT-4o wants to search through a huge amount of text documents, so it requests a tool that loads a file, calls GPT-4o-mini with the document's content and a prompt to summarize it. That way you save on GPT-4o input tokens, because it never loads entire documents - just summaries - and can more easily search through a large pile of them.
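A rough sketch of that pattern could look like the following; the file handling, prompt wording and function names here are my own assumptions, not something taken from Ell's docs:
import ell
from typing import List
from pydantic import Field

@ell.simple(model="gpt-4o-mini", temperature=0.3)
def summarize_document(content: str) -> str:
    """You are a summarizer. Condense the given document into a few sentences, keeping the key facts."""
    return f"Summarize this document:\n\n{content}"

@ell.tool()
def summarize_file(path: str = Field(description="Path of the text file to summarize.")) -> str:
    """Loads a text file and returns a short summary of its contents."""
    with open(path, "r", encoding="utf-8") as f:
        # Only the summary produced by the smaller model ends up in the conversation
        return summarize_document(f.read())

@ell.complex(model="gpt-4o", tools=[ summarize_file ])
def document_searcher(message_history: List[ell.Message]) -> List[ell.Message]:
    return [
        ell.system("You are an assistant that answers questions about documents using the provided tools."),
    ] + message_history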