Comparing LLMs Is Not Easy
I saw a few tweets about Kimi K2 being better than other LLMs when it comes to writing:
- https://x.com/koylanai/status/1986464588099952886
- https://x.com/ivanfioravanti/status/1949321346976281013
- https://x.com/garyfung/status/1986651483547554069
- https://x.com/laserai/status/1986576064701997291
- https://x.com/DeryaTR_/status/1987140084747939980
I decided to do a simple test.
Model Selection
My idea is to compare different models. However, testing LLMs is not easy; there are a few issues that one needs to consider:
- The number of LLMs keeps growing, with new closed-source and open-source models being released regularly.
- Many models come in different flavours: thinking, fast, chat, instruct, base, etc.
- Models have different parameters, such as temperature and top-p, which influence their output.
- Model performance is not consistent across inference providers. Open-source models are particularly affected by this: some providers limit the context window or serve a quantized version, and in some cases the model is simply not compatible with the provider's inference setup.
Let me go through all the issues one by one.
First, I am not going to compare Kimi K2 against all available models. I decided to narrow it down to a total of 4 models: 2 developed by Western AI labs and 2 developed by Chinese AI labs. The models I selected were:
- Kimi K2
- LongCat Flash
- Grok
- Gemini 2.5
I didn’t go with ChatGPT, DeepSeek, or Qwen because they are quite popular, and I wanted to test some of the less popular ones.
Second, to the best of my knowledge, models such as Grok and Gemini come in different flavours such as fast, pro, thinking, etc. For my test, I decided to stick with the fast or non-thinking flavour. So, I am using Grok fast, Gemini 2.5 Flash, LongCat Flash with no thinking, and Kimi K2 with thinking disabled.
Third, ideally, I should try different parameters for each model, generate an output for each combination, and compare those. However, the number of parameter combinations is large enough (see the quick sketch below) that I decided to use the default model parameters.
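For a sense of scale, here is a quick back-of-the-envelope sketch of how fast a parameter sweep grows. The temperature and top-p grids below are arbitrary example values, not settings I actually tested:

```python
# Rough count of runs needed for even a small parameter sweep.
# The grids below are arbitrary example values, not settings I tested.
from itertools import product

models = ["Kimi K2", "LongCat Flash", "Grok fast", "Gemini 2.5 Flash"]
temperatures = [0.2, 0.5, 0.7, 1.0]
top_ps = [0.8, 0.9, 0.95, 1.0]
samples_per_setting = 10  # several generations per setting, as discussed later

runs = len(list(product(models, temperatures, top_ps))) * samples_per_setting
print(runs)  # 640 generations for a single prompt
```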
Fourth, it is known that models do not perform consistently across different inference providers. Ideally, I would test all the models through a single inference provider, but no single provider hosts all of the shortlisted models. The next best option is to use each model's main or default provider. For LongCat Flash, I am using lmarena.ai since I couldn't find any other provider for it.
So, the final selection looks like this:
| Model | Tested using | Model Parameters | Thinking/Reasoning | Model Interface |
|---|---|---|---|---|
| LongCat Flash | lmarena.ai | Default | Off | Chat |
| Kimi k2 | kimi.com | Default | Off | Chat |
| Grok fast | grok.com | Default | Off | Chat |
| Gemini 2.5 flash | gemini.google.com | Default | Off | Chat |
I did not use any paid versions of the models for this test.
Test and output
In order to test the creative writing ability of these models, I decided to use the following prompt:
Write a short story in less than 1000 words. Be as creative and novel as possible.
A key thing to note here is that if you were to try this prompt in a new chat session multiple times, a model would give you a different story each time. Technically, I should run the same prompt 10-15 times in order to capture each model's general writing style, and even then I am not sure I would get a good representation of its writing skills. So instead, I decided to run it only once and judge the output.
If you were to repeat the same test, you would get a different output.
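If you did want to repeat the test properly, the sampling step is easy to script. Below is a minimal sketch assuming an OpenAI-compatible chat-completions endpoint; the URL, model id, and API key are placeholders, since I actually ran each model once through its own chat interface:

```python
# Minimal sketch: sample the same prompt N times from one model.
# The endpoint and model id are placeholders (I used the chat interfaces,
# not an API, for the actual test).
import os
import requests

PROMPT = "Write a short story in less than 1000 words. Be as creative and novel as possible."
API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
MODEL = "kimi-k2"                                         # placeholder model id

def sample_stories(n: int = 10) -> list[str]:
    stories = []
    for _ in range(n):
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={"model": MODEL, "messages": [{"role": "user", "content": PROMPT}]},
            timeout=120,
        )
        resp.raise_for_status()
        stories.append(resp.json()["choices"][0]["message"]["content"])
    return stories
```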
You can read the short stories written by these models by clicking the links below:
- LongCat Flash Chat Story: The Library of Echoes
- Kimi K2 Story: The Day Gravity Went on Strike
- Grok fast: The City of Vanta
- Gemini 2.5 flash: The Komorebi Contraction
Judging the results
Judging creative writing has a subjective element to it. In my opinion:
Kimi K2 > LongCat Flash == Grok fast > Gemini 2.5 Flash.
The story written by Kimi K2 felt more human and natural compared to all the other models.
Now that was my assessment. What if I try to use an LLM to judge these stories?
Failed approach
I used ChatGPT and DeepSeek to judge these stories using Colorado State University's creative writing rubric: https://tilt.colostate.edu/wp-content/uploads/2024/01/Written_CreativeWritingRubric_CURC.pdf
It turns out all of the short stories got a full score. This rubric is designed for students who are learning creative writing, not for LLMs, which can produce writing that looks near-perfect by its standards. It is not helpful here, since the goal is to find the best model by using an LLM as the judge.
LLM as a Judge
Approach 1: Rating via ChatGPT and DeepSeek
I decided to ask ChatGPT and DeepSeek to rate the short stories written by these models:
| Model | ChatGPT | DeepSeek | Average Score |
|---|---|---|---|
| LongCat Flash | 4.8 | 4.8 | 4.8 |
| Kimi k2 | 4.9 | 4.9 | 4.9 |
| Grok fast | 4.95 | 5 | 4.975 |
| Gemini 2.5 flash | 4.93 | 5 | 4.965 |
So, as per ChatGPT and DeepSeek, it is:
Grok == Gemini 2.5 Flash > Kimi K2 > LongCat Flash
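For what it's worth, this judging step is also easy to script if you want to scale it beyond a handful of stories. Here is a rough sketch along the same lines as the sampling snippet above; the judge model id and the rubric wording are illustrative placeholders, not the exact prompts I pasted into ChatGPT and DeepSeek:

```python
# Rough LLM-as-a-judge sketch. The rubric prompt, endpoint, and judge model
# id are illustrative placeholders, not what was used in the chat UIs.
import os
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint

JUDGE_PROMPT = """You are judging creative writing.
Rate the following short story from 1 to 5 on originality, language,
structure, and emotional impact, then give an overall score out of 5.

Story:
{story}
"""

def judge_story(story: str, judge_model: str = "deepseek-chat") -> str:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": judge_model,
            "messages": [{"role": "user", "content": JUDGE_PROMPT.format(story=story)}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```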
Approach 2: Let the models create a rubric and judge the output.
Instead of relying on ChatGPT and DeepSeek, I decided to ask each of the models to come up with its own rubric for judging high-quality creative writing. This gave some interesting results:
Kimi K2
Rubric created by Kimi K2: https://pastebin.com/7LDWaE9s
Result:
| Model | First read impact | Language control | Structure and pacing | Originality | Character and voice distinction | Emotional after-taste | Total |
|---|---|---|---|---|---|---|---|
| LongCat Flash | 10 | 9.5 | 9 | 10 | 9 | 10 | 57.5 |
| Kimi k2 | 9 | 9 | 9 | 9 | 8 | 9 | 53 |
| Grok fast | 9 | 9 | 9 | 9 | 8 | 9 | 53 |
| Gemini 2.5 flash | 9 | 9 | 9 | 10 | 8 | 8 | 53 |
LongCat Flash » Kimi K2 == Grok fast == Gemini 2.5 flash
LongCat Flash
Rubric created by LongCat Flash: https://pastebin.com/UtqL7W9q
Result:
| Model | Originality | Narrative craft & structure | Emotional resonance | Language & style | Thematic depth and subtext | Total |
|---|---|---|---|---|---|---|
| LongCat Flash | 5 | 5 | 5 | 5 | 4 | 24 |
| Kimi k2 | 5 | 5 | 5 | 5 | 4 | 24 |
| Grok fast | 5 | 5 | 5 | 5 | 5 | 25 |
| Gemini 2.5 flash | 5 | 5 | 4.75 | 5 | 5 | 24.75 |
Grok fast >= Gemini 2.5 Flash > Kimi K2 == LongCat Flash
Grok fast
Rubric created by Grok fast: https://pastebin.com/A32Qv0PC
Result:
| Model | Originality and voice | Theme depth and coherence | Narrative architecture | Character agency | Prose and architecture | Emotional impact | Intellectual impact | Technical polish (negative) | Weighted Total |
|---|---|---|---|---|---|---|---|---|---|
| LongCat Flash | 4 | 4 | 3.5 | 4 | 4 | 4 | 4 | 0 | 98 |
| Kimi K2 | 4 | 4 | 3.5 | 3 | 4 | 4 | 4 | 0 | 94 |
| Grok fast | 4 | 4 | 3.5 | 3.5 | 4 | 4 | 4 | 0 | 96 |
| Gemini 2.5 flash | 4 | 4 | 4 | 3.5 | 4 | 4 | 4 | 0 | 98 |
Gemini 2.5 Flash == LongCat Flash > Grok fast > Kimi K2
Gemini 2.5 Flash
Rubric created by Gemini 2.5 Flash: https://pastebin.com/sFkxNDJB
Result:
| Model | Narrative & structure | Character and voice | Language and imagery | Theme and depth | Overall impact | Total |
|---|---|---|---|---|---|---|
| LongCat Flash | 4 | 4 | 5 | 5 | 5 | 23 |
| Kimi K2 | 5 | 4 | 5 | 5 | 5 | 24 |
| Grok fast | 5 | 4 | 5 | 5 | 5 | 24 |
| Gemini 2.5 flash | 4 | 5 | 5 | 4 | 5 | 23 |
Kimi K2 == Grok fast > LongCat Flash == Gemini 2.5 Flash
Final Result
| Model | LongCat Flash judge | Kimi K2 judge | Grok fast judge | Gemini 2.5 Flash judge | Average |
|---|---|---|---|---|---|
| LongCat Flash | 24 | 57.5 | 98 | 23 | 50.625 |
| Kimi K2 | 24 | 53 | 94 | 24 | 48.75 |
| Grok fast | 25 | 53 | 96 | 24 | 49.5 |
| Gemini 2.5 flash | 24.75 | 53 | 98 | 23 | 49.6875 |
So, the final result is:
LongCat Flash > Gemini 2.5 Flash >= Grok fast > Kimi K2
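Since I'd rather not trust an LLM with the arithmetic (see the reflections below), the Average column above is a manual tally. A trivial sketch of that tally, with the per-judge totals copied from the tables above; note that each judge's rubric uses a different maximum score, so these are raw, unnormalized averages:

```python
# Tally of the final-result table. Per-judge totals are copied from the
# tables above; each judge's rubric has a different maximum score, so
# these are raw, unnormalized averages.
scores = {
    "LongCat Flash":    [24,    57.5, 98, 23],
    "Kimi K2":          [24,    53,   94, 24],
    "Grok fast":        [25,    53,   96, 24],
    "Gemini 2.5 flash": [24.75, 53,   98, 23],
}

for model, judged in scores.items():
    print(f"{model:18s} {sum(judged) / len(judged):.4f}")
# LongCat Flash      50.6250
# Kimi K2            48.7500
# Grok fast          49.5000
# Gemini 2.5 flash   49.6875
```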
My Reflections
After doing this simple test, the following are my thoughts:
- One needs to consider the flavour of the model, its parameters, and inference options while running a test.
- LLM-as-a-judge is a cool concept because it allows you to scale testing; however, the judgement given by an LLM may not match a human’s.
- Having your own benchmark or test is a must. Do not go by what the public benchmarks say.
- Famous models such as ChatGPT may not always be the best for a given task. It is important to test other models that might perform better at a lower cost.
- LLMs aren’t great at math. It’s better to let them assign scores on individual metrics, but the overall tally should be done manually.
- Statistically speaking, this test doesn’t prove much. I’d need to repeat it several times and judge the outputs myself to know which model really excels at creative writing. It’s clear that an LLM can’t reliably judge creative writing.