A/B test your LLM prompts with Kameleoon Feature Experimentation
With the widespread adoption of Large Language Model (LLM) technology, prompt engineering and optimization have spread across many industries. Creating metrics to evaluate prompts is easy, but linking them to user engagement is challenging. This is where experimentation shines: it is the best way to evaluate and quantify the impact of a change on the metrics that matter. In this article, you will learn how easy it is to set up and test different prompts for an LLM against user-relevant metrics with Kameleoon.
Optimizing Generative AI: The role of experimentation and prompt engineering
LLMs are very good at solving certain specific problems, and today every company is scratching its head to figure out how best to use them to create value for their customers. The research field is still exploring what such models can do and how to get the most out of them. The wider community quickly understood that when you build a use case, you need to write a set of instructions for your LLM that states and details the work you expect it to do. It is very hard to know how this set of instructions, usually called a prompt, will affect the model's output. That is why LLMs have given rise to a new field of research and experimentation: prompt engineering.
Although some best practices are starting to emerge, this is still a very new field. So, when developers want to know which prompt is best, they have to craft metrics which they hope will actually correlate with the experience of the users of the LLM-powered feature. How can you know that phrasing your prompt to explicitly ask for a 'detailed and accurate summary' will lead to better user satisfaction than asking for a 'succinct and user-friendly overview'? Or whether it has any impact at all?
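To make this concrete, two such competing prompt instructions could look like the following (the wording is purely illustrative and is not KAI's actual prompt):

Variant A: "Using the documents provided, write a detailed and accurate summary that answers the user's question."
Variant B: "Using the documents provided, write a succinct and user-friendly overview that answers the user's question."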
To answer these kinds of questions, experimentation is the best tool, because it can actually prove a causal effect of your change on the metrics that matter. You should still craft operational metrics, for example using RAGAS, to evaluate your LLM application. However, you don't only want to know whether your chatbot gives correct answers; what you actually care about is whether it lets your customers accomplish what they came to the chatbot for. These business-related metrics are exactly the ones that A/B testing teams optimize every day, and there is no reason for your generative AI teams to be dismissive of them.
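As an illustration, a small operational evaluation with RAGAS could look like the sketch below. It assumes the 0.1-style evaluate API with an LLM provider key configured in the environment; the exact RAGAS interface varies between versions, and the sample data is made up.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Hypothetical evaluation set: user questions, the chatbot's answers
# and the documents retrieved for each answer
eval_data = Dataset.from_dict({
    "question": ["How do I create a feature flag?"],
    "answer": ["Click the 'New Feature Flag' CTA in the Kameleoon platform."],
    "contexts": [["To create a feature flag, click the 'New Feature Flag' CTA and name it."]],
})

# Operational metrics: are answers faithful to the retrieved documents,
# and relevant to the question asked?
scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(scores)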
Leveraging Kameleoon’s Feature Flagging for A/B Testing
We will now demonstrate how you can set up a feature experiment with Kameleoon to A/B test two prompts for a chatbot. At Kameleoon we have developed KAI, an assistant powered by an LLM that leverages the retrieval-augmented generation (RAG) process to answer users' questions using relevant documents. If you want to learn more about how it works, you can read this article. KAI's backend is implemented in Python using, among other libraries, LangChain and FastAPI. You will see how to use Kameleoon's Python SDK in this context to build your prompt experiment. If you are only interested in seeing the implementation of the integration, you can skip ahead to here.
Creating the feature experiment
We will be following the steps outlined here. After logging into the Kameleoon Platform, we click the "New Feature Flag" CTA to create a new feature flag named "Test KAI Prompt", which we will use to run our prompt experiment. We note its purpose in the description field for future reference and information sharing.
Now that we have created our feature flag, we save the feature key and the site code for later use. We then create a Feature Variable inside it. The feature variable can take a different value in each variation, which lets us personalize the prompt according to the allocated variation.
We create a feature variable named "prompt_template" and set its value to the default prompt our assistant will use when answering questions.
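For illustration, the default value of "prompt_template" could be a RAG-style template along the following lines (KAI's real prompt is not shown here, and the {context} and {question} placeholders are assumptions about its format):

You are KAI, Kameleoon's assistant. Answer the user's question using only the documents below.

Documents:
{context}

Question: {question}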
We now move on to the Variation tab of the setup. Here, we create as many variations as we have prompts to A/B test. When adding a new variation, we fill the "prompt_template" variable created previously with the prompt we want to use in that variation.
Now comes the interesting part: we can attach any goal available to us for tracking with this feature flag. You can attach any metric you find relevant for evaluating the prompts. Kameleoon being a unified platform, you can attach any objective created by any team in your organization. For example, for an e-commerce website, you can attach and monitor in your experiments the transaction metric that might have been set up by the marketing teams. In my case, I will be monitoring the difference in interactions with KAI.
We are now done with the setup and can move on to the next step, which defines how we will roll out our experiment. We create a new experiment rule and allocate traffic equally between the two prompts we have created, choosing to target all the visitors reaching the application.
Now that everything is ready, we simply have to toggle ON the environment to start experimenting and delivering the prompts to users. It might be obvious, but note that if we want to add another variation or update a prompt's content, we only need to interact with the interface; there is no need to redeploy our application.
Setting up Kameleoon’s Python SDK
For the SDK integration and implementation, we will be following the steps outlined here. First, we add `kameleoon-client-python` as a new dependency to our project. As a side note: I recommend using rye for project management in Python; it is really great. With rye, adding a new dependency looks like this.
# bash
rye add kameleoon-client-python # add the new dependency
rye sync # synchronize your environment to take it into account
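If you are not using rye, installing the SDK with plain pip works just as well:

pip install kameleoon-client-python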
We create a new dependency for our application that returns an instantiated Kameleoon client. We use environment variables to pass the configuration values. Make sure to get your Client ID and Client Secret from Kameleoon's application as described here.
import os

from kameleoon import KameleoonClient, KameleoonClientConfig, KameleoonClientFactory


def get_kameleoon_client() -> KameleoonClient:
    """
    Dependency to get a Kameleoon SDK client
    """
    # Build the SDK configuration from environment variables
    kameleoon_client_config = KameleoonClientConfig(
        client_id=os.getenv("KAMELEOON_CLIENT_ID"),
        client_secret=os.getenv("KAMELEOON_CLIENT_SECRET"),
        environment=os.getenv("KAMELEOON_ENVIRONMENT"),
        top_level_domain=os.getenv("KAMELEOON_TOP_LEVEL_DOMAIN"),
    )
    # Create the SDK client for the configured site code
    return KameleoonClientFactory.create(
        site_code=os.getenv("KAMELEOON_SITE_CODE"), config=kameleoon_client_config
    )
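For local development, these variables could be exported like this (placeholder values only; your real credentials come from the Kameleoon application):

export KAMELEOON_CLIENT_ID="<your-client-id>"
export KAMELEOON_CLIENT_SECRET="<your-client-secret>"
export KAMELEOON_ENVIRONMENT="production"
export KAMELEOON_TOP_LEVEL_DOMAIN="example.com"
export KAMELEOON_SITE_CODE="<your-site-code>"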
We call this dependency during the start-up of our application by using the "lifespan" feature of FastAPI, and we make sure the client is initialized.
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(_app: FastAPI):
    """
    Initialize the Kameleoon SDK client at startup
    """
    # Add Kameleoon's SDK client to the application state and await its initialization
    _app.state.kameleoon_client = get_kameleoon_client()
    await _app.state.kameleoon_client.wait_init_async()
    yield
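For completeness, this is how the lifespan handler above is typically attached to the application instance (standard FastAPI usage):

# Register the lifespan handler so the Kameleoon client is ready before requests are served
app = FastAPI(lifespan=lifespan)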
Now, when our web server receives a request, we use Kameleoon's SDK to perform the assignment: we track the visitor and their associated variation, and we retrieve the prompt we should use to generate the answer to their question.
from typing import Dict

from fastapi import Request

# KAMELEOON_FEATURE_KEY, KAMELEOON_FEATURE_VARIABLE and KAMELEOON_GOAL_ID hold the
# feature key, variable name and goal id saved earlier from the Kameleoon platform


def add_template_to_config(config: Dict, request: Request) -> None:
    """
    Fetch the prompt template defined as a feature variable,
    then add it to the config for later use.
    Make sure to handle possible errors properly when calling this function.
    """
    kam_client = request.app.state.kameleoon_client
    # Get the visitor code from the request cookies
    kam_visitor_code = kam_client.get_visitor_code(cookies=request.cookies)
    # Accept consent if the user's consent is required
    kam_client.set_legal_consent(kam_visitor_code, True)
    # Get the variation assigned to this visitor and read the feature variable
    variation = kam_client.get_variation(
        visitor_code=kam_visitor_code, feature_key=KAMELEOON_FEATURE_KEY
    )
    prompt_template = variation.variables[KAMELEOON_FEATURE_VARIABLE].value
    # Add the prompt feature variable to the config
    config["template"] = prompt_template
    # Track a conversion for the "interaction with KAI" goal
    kam_client.track_conversion(
        visitor_code=kam_visitor_code, goal_id=KAMELEOON_GOAL_ID
    )
It's as simple as that. Later on in the logic, we read the prompt value from the config and use it to generate the answer to the user's question. After toggling ON the environment, we can already see in real time the users being targeted by the SDK.
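To illustrate how that config value might be consumed downstream (the route, placeholder values and template variables below are hypothetical, not KAI's actual code), here is one possible sketch:

from typing import Dict

from fastapi import Request
from langchain_core.prompts import ChatPromptTemplate


@app.post("/ask")
async def ask(request: Request, question: str) -> Dict[str, str]:
    config: Dict = {}
    # Assign the visitor to a variation and fetch the matching prompt template
    add_template_to_config(config, request)
    # Assumption: the template exposes {context} and {question} placeholders
    prompt = ChatPromptTemplate.from_template(config["template"])
    rendered = prompt.format(context="<retrieved documents go here>", question=question)
    # In the real application, the rendered prompt is sent to the LLM chain
    # to generate the answer returned to the user
    return {"prompt_preview": rendered}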
Benefits of using Kameleoon for Prompt testing
Feature flagging will be a centerpiece of the new technologies developing around generative AI. As Google notes in Rule #2 of its best practices for ML engineering, an experiment framework in which you can group users into buckets and aggregate statistics by experiment is important. We showcased here how easily you can get set up and start monitoring the impact of your changes on real end-user metrics. This will be crucial when you want to evaluate and improve on your baseline LLM application. Because Kameleoon is a unified platform, you can very easily attach to your feature flag the metrics that matter across the different teams in your organization and make sure you do not degrade them. We also support a wide range of SDKs, so even though we used the Python SDK here, you are sure to find one suited to your development language.
By using Feature Variables, you also empower less technical users to create and test new prompts or other features without having to redeploy your application. We didn't show it here, but you can also make use of precise segmentation criteria and real-time adjustments. You can already see how Kameleoon can deliver detailed insights that will help your team make data-driven decisions.