A small update regarding codi, our AI friend serving our users in multiple languages on its 24/7 shift.
So far, in 2024 Q1, codi handled around 25% of our communication with rather high quality. This number does not include answers drafted by AI and then moderated and sent by a human. Whether that happens depends on the person in charge: will they use AI to generate an answer or prefer to respond on their own. But in general, this gives us an additional 20% of AI-backed answers. So in total, around 40% of our support operations are assisted by AI at the moment.
The system that we've built for communication - our flespi chat - is cool because we can easily replace a human with AI there and swap the AI back to a human at any time. Humans mostly moderate AI answers and correct them when a mistake occurs. So it is a kind of fruitful cooperation between AI and humans, where AI performs all the dirty work and humans concentrate on higher-level tasks.
We adopted LLM API services from OpenAI and Anthropic and are now fully LLM-agnostic. Our AI platform automatically uses different models or providers depending on the communication context or LLM service availability.
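To give an idea of what such routing can look like, here is a minimal sketch built on the official openai and anthropic Python SDKs. The routing rule, model choices, and helper names are illustrative, not our actual implementation:

```python
# Minimal sketch of provider-agnostic routing with a fallback.
# Assumes the official SDKs; keys come from OPENAI_API_KEY / ANTHROPIC_API_KEY.
import anthropic
import openai
from openai import OpenAI

anthropic_client = anthropic.Anthropic()
openai_client = OpenAI()

def complete(system: str, user: str, prefer: str = "anthropic") -> str:
    """Try the preferred provider first; fall back to the other one on failure."""
    order = ("anthropic", "openai") if prefer == "anthropic" else ("openai", "anthropic")
    for provider in order:
        try:
            if provider == "anthropic":
                resp = anthropic_client.messages.create(
                    model="claude-3-sonnet-20240229",
                    max_tokens=1024,
                    system=system,
                    messages=[{"role": "user", "content": user}],
                )
                return resp.content[0].text
            resp = openai_client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{"role": "system", "content": system},
                          {"role": "user", "content": user}],
            )
            return resp.choices[0].message.content
        except (anthropic.APIError, openai.APIError):
            continue  # provider unavailable -- try the next one
    raise RuntimeError("all LLM providers failed")
```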
We have gathered quite noticeable expertise on the advantages and disadvantages of different models and providers. This experience, of course, is based on our use case: support operations for a highly demanding technical product. And it is only a snapshot in time - in the LLM space everything changes every month. If you are involved in GenAI development, let me share some of our observations:
Anthropic models are great when your task is to perform classification or extract valuable pieces of information from text (in our case, various identifiers). Even the cheapest and quickest claude-3-haiku-20240307 model is better than any OpenAI model for this task.
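For illustration, a minimal extraction call of this kind might look as follows; the prompt and the sample identifier are made up for the example:

```python
# Sketch: extracting identifiers from a support message with claude-3-haiku.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=256,
    system="Extract all device identifiers (15-digit IMEI-like numbers) "
           "from the user message. Reply with one identifier per line, "
           "or NONE if there are none.",
    messages=[{"role": "user",
               "content": "My tracker 356938035643809 stopped sending data yesterday."}],
)
print(message.content[0].text)  # e.g. "356938035643809"
```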
High-level OpenAI models are in general quicker than Anthropic models: gpt-4-1106-preview or gpt-4-turbo is 2-3 times quicker than claude-3-opus-20240229 and even quicker than claude-3-sonnet-20240229.
Anthropic models have the highest attention to detail. Among 50K tokens of knowledge, they are consistently able to find the right descriptive sentence and carefully extract this information during response generation. OpenAI models often get lost in the same context.
OpenAI models are degrading in attention to detail with each new release. The original gpt-4-0613 is most probably the best, but its context window is too small (just 8K tokens). While gpt-4-1106-preview (the first gpt-4-turbo class model) is really great at most tasks, each subsequent release (gpt-4-0125-preview and especially the latest gpt-4-turbo-2024-04-09) becomes lazier, more laconic, and more inclined to ignore the provided instructions.
OpenAI models are better when you need to use functions/tools. Anthropic function calling appeared in beta just a couple of weeks ago and still contains quite a lot of bugs.
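For comparison, here is roughly what tool use looks like with the OpenAI API; the get_device_status tool is a hypothetical example, not a real flespi function:

```python
# Sketch: OpenAI tool calling with a hypothetical get_device_status tool.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_device_status",  # hypothetical tool for this example
        "description": "Return the connection status of a device by its ID",
        "parameters": {
            "type": "object",
            "properties": {"device_id": {"type": "integer"}},
            "required": ["device_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Why is my device 12345 offline?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model decided to call the tool
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
```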
Comparing service availability, I would say that OpenAI API services work a little more reliably. The Anthropic API is now often overloaded, especially during US business hours. The same was happening to OpenAI at the end of 2023. However, the difference is not that big; most probably Anthropic is simply struggling with higher demand at the moment.
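In practice, overload errors are easy to smooth over with a retry. A minimal sketch, assuming the anthropic SDK's standard exceptions (the backoff parameters are arbitrary):

```python
# Sketch: retry with exponential backoff when the Anthropic API is overloaded.
import time
import anthropic

client = anthropic.Anthropic()

def ask_with_retry(prompt: str, attempts: int = 5) -> str:
    for attempt in range(attempts):
        try:
            resp = client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.content[0].text
        except (anthropic.RateLimitError, anthropic.InternalServerError):
            # 429 / 5xx (including 529 "overloaded") -- back off and retry
            time.sleep(2 ** attempt)
    raise RuntimeError("Anthropic API unavailable after retries")
```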
When the user's question is covered by the information in RAG, any model will respond nicely. In that case, IMHO the clear winner is claude-3-haiku-20240307 from Anthropic, which outputs tokens at machine-gun speed (it needs around 10 seconds maximum for a big response), has a 200K-token context, and is super cheap. However, if you need to apply high-level logic, such simple models fail at a high rate. The trick is to detect whether the information in RAG is sufficient for the best reply before selecting a model for the answer.
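One cheap way to make that detection is to ask a fast model to judge sufficiency first and route accordingly. A hypothetical sketch; the judging prompt and routing rule are illustrative:

```python
# Sketch: decide whether retrieved RAG context is sufficient, then route.
import anthropic

client = anthropic.Anthropic()

def pick_model(question: str, rag_context: str) -> str:
    """Return a cheap model if RAG covers the question, a strong one otherwise."""
    verdict = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        system="Answer strictly YES or NO: does the provided context contain "
               "enough information to fully answer the user's question?",
        messages=[{"role": "user",
                   "content": f"Context:\n{rag_context}\n\nQuestion:\n{question}"}],
    )
    covered = verdict.content[0].text.strip().upper().startswith("YES")
    return "claude-3-haiku-20240307" if covered else "claude-3-opus-20240229"
```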
If you compare this with the LLM evaluation recently published by OpenAI, you will see a totally different picture. Most likely, it's due to prompt design: in our case we evaluate models using a huge amount of RAG and extra instructions, while in their evaluation the system prompt consists of 1-2 sentences at most. So, from a chatbot perspective (pure ChatGPT as an AI tool), OpenAI models are improving. However, as LLMs for specific natural language tasks, they quickly degrade.
Another interesting observation, clearly visible in our and our customers' AI usage patterns, is the shift from a documentation-based answering machine to a real assistant that analyzes data. Once we enriched the AI with the correct knowledge and gave it tools to extract valuable information from the user's account, it started to analyze the data, extract valuable pieces of information on request, and even give advice.
Now, if a user reports a problem, the AI will find the problematic item (e.g. a device), retrieve information about its status, and perform various checks. Our next move is to give more and more AI assistance to human supporters for on-demand health checks of customer accounts: analyze telemetry, messages, and logs, and summarize what can be improved. Basically, the AI is going to implement all the internal processes we have established for a human supporter - look there, check this, try that, and so on. Once we optimize it, we will be ready to offer such a service to customers.
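Under the hood, such a check is a tool loop: the model requests a tool, we execute it and feed the result back until the model can summarize. A minimal sketch with the same hypothetical get_device_status as above, its body faked for the example:

```python
# Sketch: feeding tool results back to the model so it can finish a health check.
import json
from openai import OpenAI

client = OpenAI()

def get_device_status(device_id: int) -> dict:
    # Placeholder: a real implementation would query the platform's API.
    return {"device_id": device_id, "connected": False,
            "last_seen": "2024-04-10T08:15:00Z"}

tools = [{"type": "function", "function": {
    "name": "get_device_status",
    "description": "Return the connection status of a device by its ID",
    "parameters": {"type": "object",
                   "properties": {"device_id": {"type": "integer"}},
                   "required": ["device_id"]}}}]

messages = [{"role": "user", "content": "Check device 12345, it looks offline."}]
response = client.chat.completions.create(model="gpt-4-turbo",
                                          messages=messages, tools=tools)
msg = response.choices[0].message
while msg.tool_calls:
    messages.append(msg)  # keep the assistant's tool request in the history
    for call in msg.tool_calls:
        result = get_device_status(**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
    response = client.chat.completions.create(model="gpt-4-turbo",
                                              messages=messages, tools=tools)
    msg = response.choices[0].message
print(msg.content)  # the model's summary of the health check
```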
And one important note on the cost of AI. So far, it greatly depends on whether we are actively developing it or not. When nobody is working on the AI platform, the cost is around $20-30 per day. When we are running tests, optimizations, and active development with AI, the cost jumps to $50-100 per day. Taking into account the hundreds of users it serves on a monthly basis, it is still very cheap.