By this point, most of us have a preferred language model to chat with.
Many stick to the good old GPT-4o, others swear by Claude 3.5 Sonnet, and a few fringe freaks still prefer talking to other humans.
But do you truly know how well your chosen model compares to other LLMs, or are you using it purely out of habit?
Chances are it’s the latter.
So for today’s post, I’ve dug up a few free sites that let you pit LLMs against each other to find out which model is best for a specific question or task.
For my demo purposes, our question will be this silly nonsense:
“What is the capital of Paris?”
I want to see how LLMs handle human stupidity.
Let’s roll!
1. LMSYS Chatbot Arena’s “Side-By-Side”
The LMSYS Chatbot Arena’s Leaderboard is widely used to rank LLMs based on blind tests by real users.
But it also has a side-by-side section that lets you chat with two selected models and compare them.
How to use
1. Navigate to lmarena.ai.
2. Click the “Arena (side-by-side)” tab at the top:
3. Select the two models you want to compare from their respective dropdowns:
4. Input your prompt into the box at the bottom:
5. Run and compare the results:
As you can see, it’s a straightforward process.
You can even continue the chat to see how the models handle repeated interactions:
The good
Great selection (68 models)
Can tweak some model settings like temperature, output tokens, etc.
Can work with uploaded documents or images
Convenient side-by-side comparison view
No sign-up required
The bad
Intermittent errors and runtime issues with specific models
No ability to set a system prompt for the models
2. Modelbench
Modelbench primarily targets developers and professionals building with LLMs, but it also offers a rather beginner-friendly way to compare model outputs.
How to use
1. Navigate to modelbench.ai
2. Click the “Sign Up For Free Today” button:
3. Provide your details or sign in using your Google account:
4. In the Playground, click the “Compare With” button in the top-right to create the side-by-side view:
5. Select your two models from the dropdown:
6. Input your prompt into the box at the bottom:
7. Run and compare the results:
Just as with LMSYS Arena, you can respond with follow-up messages to continue the conversation:
The good
Convenient side-by-side comparison view
Massive selection (180+ models)
Can modify and adjust system prompts per model
Can work with uploaded images
Can tweak every model setting (temperature, tokens, Top-P, and so on)
Follow-up messages can be different per model
Additional test tools within the Workbench environment
The bad
Free trial is limited to 7 days
Requires sign-up
3. Wordware’s “Try all the models”
This one’s a bit of a different beast.
Wordware lets anyone build AI apps using natural language. With this featured app called “Try all the models for a single question,” you can…well, it’s in the name.
How to use
1. Navigate directly to the Wordware app via this link.
2. Type your question into the “QUESTION” box:
3. Click “Run App” and wait for your results.
Yup, that’s all there’s to it!
The stand-out feature of this app is that all responses are analyzed, evaluated, and ranked by Claude 3 Opus, which then gives you its verdict on what model best handled your specific question.
The good
Test over a dozen models in one go
Helpful evaluation and ranking by Claude 3 Opus
Super simple, one-input-box interface
No sign-up required
The bad
Can’t add or deselect models (it runs all models for every question)
A limited number of tested models
No helpful side-by-side view (you have to scroll through responses)
No follow-up interactions
No ability to upload files
Long wait time for the responses and evaluation to finish processing
Verdict
I think all sites have their strength and their usefulness when it comes to comparing LLMs. So your choice ultimately depends on your needs.
Modelbench is the most complete and robust tool, but you’ll need to sign up and eventually pay. If you’re a professional user building with LLMs, this one’s a no-brainer.
LMSYS Chatbot Arena is a great free alternative that offers many of the same capabilities without any sign-up or payment requirements.
Wordware’s “Try all the models” is perfect if you want to test multiple models on a single task or question with as little effort as possible. It even helps you make sense of the results.
So go ahead and take them for a spin. I’m curious to hear your thoughts.
🫵 Over to you…
Which of the tools makes the most sense for your needs? Are you aware of any other sites where one can compare LLMs for free?
Leave a comment or shoot me an email at whytryai@substack.com.
Appreciate the shout out Daniel 🙏
Super useful for folks shopping around!
My own 2 cents: these tests are necessarily very limited for me since I rarely just prompt once, but that has a lot to do with the way I use LLLMs - almost exclusively for research. If I'm learning about something, I have little idea which are the right questions to ask in that moment.