Ever since ChatGPT took off in November 2022, the AI chatbot space has been flooded with alternatives.
With different LLMs, pricing, UIs, and internet access, it’s tough to decide which one to use.
So, LMSYS Org (a UC Berkeley research org) created the Chatbot Arena – a benchmark platform for LLMs where users can compare two models by entering a prompt and picking the best answer without knowing which LLM generated it.
After they pick one, they get to see which LLM was used. The user ratings are then used to rank the LLMs on a leaderboard based on an Elo rating system (a popular chess rating system).
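To make the Elo idea concrete, here is a minimal sketch of one rating update. The K-factor, starting ratings, and function names are illustrative assumptions, not LMSYS's actual parameters or code:

```python
# Illustrative sketch of an Elo update, the chess-style scheme
# Chatbot Arena uses to rank models from pairwise votes.
# K-factor and starting ratings below are assumptions for the example.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return both models' new ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Example: two models start level at 1000 and model A wins one matchup.
new_a, new_b = update(1000, 1000, a_won=True)
# new_a == 1016.0, new_b == 984.0
```

The key property for a leaderboard is that beating a higher-rated model moves your rating more than beating a lower-rated one, so rankings converge even though each user only sees one matchup at a time.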
When I tried the arena for myself, the prompt was “Can you write me an email telling my boss that I will be out because I am going on a vacation that was planned months ago.”
The two responses were totally different; one was much longer, with more context and fill-in-the-blank placeholders. After picking "Model B" as the winner, I found out it was "vicuna-7b", a model LMSYS Org fine-tuned from Meta's LLaMA.
The losing LLM was "gpt4all-13b-snoozy", developed by Nomic AI and fine-tuned from LLaMA 13B.
Unsurprisingly, GPT-4 (OpenAI's most advanced LLM) currently sits in first place on the leaderboard with an Arena Elo rating of 1227. Claude-v1 (developed by Anthropic) holds second place.
GPT-4 is the best of the best, according to ZDNET’s AI chatbot rankings.
Anthropic's second-place Claude isn't publicly available yet, but you can sign up for early access on the company's waitlist.
PaLM-Chat-Bison-001, a submodel of PaLM 2 (Google Bard’s LLM), comes in at number eight and is considered “not the worst but not one of the best.”
If you want to pit two specific models against each other rather than vote blind, Chatbot Arena also has a side-by-side mode that lets you choose the matchup yourself.