Comparing LLMs Is Not Easy
I saw a few tweets about Kimi K2 being better than other LLMs when it comes to writing:
- https://x.com/koylanai/status/1986464588099952886
- https://x.com/ivanfioravanti/status/1949321346976281013
- https://x.com/garyfung/status/1986651483547554069
- https://x.com/laserai/status/1986576064701997291
- https://x.com/DeryaTR_/status/1987140084747939980
I decided to do a simple test.
Model Selection
My idea is to compare different models. However, testing LLMs is not easy; there are a few issues that one needs to consider:
- The number of LLMs keeps growing, with new closed-source and open-source models being released regularly.
- Many models come in different flavours: thinking, fast, chat, instruct, base, etc.
- Models have different parameters, such as temperature and top-p, which influence their output.
- Model performance is not consistent across inference providers. Open-source models are particularly affected by this: some providers limit the context window or serve a quantized version, and in some cases the model is simply not compatible with the provider's inference setup.
Let me go through all the issues one by one.
First, I am not going to compare Kimi K2 against all available models. I decided to narrow it down to a total of 4 models: 2 developed by Western AI labs and 2 developed by Chinese AI labs. The models I selected were:
- Kimi K2
- LongCat Flash
- Grok
- Gemini 2.5
I didn’t go with ChatGPT, DeepSeek, or Qwen because they are quite popular, and I wanted to test some of the less popular ones.
Second, to the best of my knowledge, models such as Grok and Gemini come in different flavours such as fast, pro, thinking, etc. For my test, I decided to stick with the fast or non-thinking flavour. So, I am using Grok fast, Gemini 2.5 Flash, LongCat Flash with no thinking, and Kimi K2 with thinking disabled.
Third, ideally, I should try different parameters for each model, generate an output for each combination, and compare those. However, the number of parameter combinations is large enough (see the quick sketch below) that I decided to use the default model parameters.
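For a sense of scale, here is a quick back-of-the-envelope sketch of how fast a parameter sweep grows. The temperature and top-p grids below are arbitrary example values, not settings I actually tested:

```python
# Rough count of runs needed for even a small parameter sweep.
# The grids below are arbitrary example values, not settings I tested.
from itertools import product

models = ["Kimi K2", "LongCat Flash", "Grok fast", "Gemini 2.5 Flash"]
temperatures = [0.2, 0.5, 0.7, 1.0]
top_ps = [0.8, 0.9, 0.95, 1.0]
samples_per_setting = 10  # several generations per setting, as discussed later

runs = len(list(product(models, temperatures, top_ps))) * samples_per_setting
print(runs)  # 640 generations for a single prompt
```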
Fourth, it is known that models do not perform consistently across different inference providers. Ideally, I would test all the models through a single inference provider, but no single provider hosts all of the shortlisted models. The next best option is to use each model's main or default provider. For LongCat Flash, I am using lmarena.ai since I couldn't find any other provider for it.
So, the final selection looks like this:
| Model | Tested using | Model Parameters | Thinking/Reasoning | Model Interface |
|---|---|---|---|---|
| LongCat Flash | lmarena.ai | Default | Off | Chat |
| Kimi k2 | kimi.com | Default | Off | Chat |
| Grok fast | grok.com | Default | Off | Chat |
| Gemini 2.5 flash | gemini.google.com | Default | Off | Chat |
I did not use any paid versions of the models for this test.
Test and output
In order to test the creative writing ability of these models, I decided to use the following prompt:
Write a short story in less than 1000 words. Be as creative and novel as possible.
A key thing to note here is that if you were to try this prompt in a new chat session multiple times, a model would give you a different story each time. Technically, I should run the same prompt 10-15 times in order to capture each model's general writing style, and even then I am not sure I would get a good representation of its writing skills. So instead, I decided to run it only once and judge the output.
If you were to repeat the same test, you would get a different output.
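If you did want to repeat the test properly, the sampling step is easy to script. Below is a minimal sketch assuming an OpenAI-compatible chat-completions endpoint; the URL, model id, and API key are placeholders, since I actually ran each model once through its own chat interface:

```python
# Minimal sketch: sample the same prompt N times from one model.
# The endpoint and model id are placeholders (I used the chat interfaces,
# not an API, for the actual test).
import os
import requests

PROMPT = "Write a short story in less than 1000 words. Be as creative and novel as possible."
API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
MODEL = "kimi-k2"                                         # placeholder model id

def sample_stories(n: int = 10) -> list[str]:
    stories = []
    for _ in range(n):
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={"model": MODEL, "messages": [{"role": "user", "content": PROMPT}]},
            timeout=120,
        )
        resp.raise_for_status()
        stories.append(resp.json()["choices"][0]["message"]["content"])
    return stories
```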
You can read the short stories written by these models by clicking the links below:
- LongCat Flash Chat Story: The Library of Echoes
- Kimi K2 Story: The Day Gravity Went on Strike
- Grok fast: The City of Vanta
- Gemini 2.5 flash: The Komorebi Contraction
Judging the results
Judging creative writing has a subjective element to it. In my opinion:
Kimi K2 > LongCat Flash == Grok fast > Gemini 2.5 Flash.
The story written by Kimi K2 felt more human and natural compared to all the other models.
Now that was my assessment. What if I try to use an LLM to judge these stories?
Failed approach
I used ChatGPT and DeepSeek to judge these stories using Colorado State University's creative writing rubric: https://tilt.colostate.edu/wp-content/uploads/2024/01/Written_CreativeWritingRubric_CURC.pdf
It turns out all of the short stories got a full score. This rubric is designed for students who are learning creative writing, not for LLMs, which can produce writing that looks near-perfect by its standards. It is not helpful here, since the goal is to find the best model by using an LLM as the judge.
LLM as a Judge
Approach 1: Rating via ChatGPT and DeepSeek
I decided to ask ChatGPT and DeepSeek to rate the short stories written by these models:
| Model | ChatGPT | DeepSeek | Average Score |
|---|---|---|---|
| LongCat Flash | 4.8 | 4.8 | 4.8 |
| Kimi k2 | 4.9 | 4.9 | 4.9 |
| Grok fast | 4.95 | 5 | 4.975 |
| Gemini 2.5 flash | 4.93 | 5 | 4.965 |
So, as per ChatGPT and DeepSeek, it is:
Grok == Gemini 2.5 Flash > Kimi K2 > LongCat Flash
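For what it's worth, this judging step is also easy to script if you want to scale it beyond a handful of stories. Here is a rough sketch along the same lines as the sampling snippet above; the judge model id and the rubric wording are illustrative placeholders, not the exact prompts I pasted into ChatGPT and DeepSeek:

```python
# Rough LLM-as-a-judge sketch. The rubric prompt, endpoint, and judge model
# id are illustrative placeholders, not what was used in the chat UIs.
import os
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint

JUDGE_PROMPT = """You are judging creative writing.
Rate the following short story from 1 to 5 on originality, language,
structure, and emotional impact, then give an overall score out of 5.

Story:
{story}
"""

def judge_story(story: str, judge_model: str = "deepseek-chat") -> str:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": judge_model,
            "messages": [{"role": "user", "content": JUDGE_PROMPT.format(story=story)}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```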
Approach 2: Let the models create a rubric and judge the output.
Instead of relying on ChatGPT and DeepSeek, I decided to ask each of the models to come up with its own rubric for judging high-quality creative writing. This gave some interesting results:
Kimi K2
Rubric created by Kimi K2: https://pastebin.com/7LDWaE9s
Result:
| Model | First read impact | Language control | Structure and pacing | Originality | Character and voice distinction | Emotional after-taste | Total |
|---|---|---|---|---|---|---|---|
| LongCat Flash | 10 | 9.5 | 9 | 10 | 9 | 10 | 57.5 |
| Kimi k2 | 9 | 9 | 9 | 9 | 8 | 9 | 53 |
| Grok fast | 9 | 9 | 9 | 9 | 8 | 9 | 53 |
| Gemini 2.5 flash | 9 | 9 | 9 | 10 | 8 | 8 | 53 |
LongCat Flash » Kimi K2 == Grok fast == Gemini 2.5 flash
LongCat Flash
Rubric created by LongCat Flash: https://pastebin.com/UtqL7W9q
Result:
| Model | Originality | Narrative craft & structure | Emotional resonance | Language & style | Thematic depth and subtext | Total |
|---|---|---|---|---|---|---|
| LongCat Flash | 5 | 5 | 5 | 5 | 4 | 24 |
| Kimi k2 | 5 | 5 | 5 | 5 | 4 | 24 |
| Grok fast | 5 | 5 | 5 | 5 | 5 | 25 |
| Gemini 2.5 flash | 5 | 5 | 4.75 | 5 | 5 | 24.75 |
Grok fast >= Gemini 2.5 Flash > Kimi K2 == LongCat Flash
Grok fast
Rubric created by Grok fast: https://pastebin.com/A32Qv0PC
Result:
| Model | Originality and voice | Theme depth and coherence | Narrative architecture | Character agency | Prose and architecture | Emotional impact | Intellectual impact | Technical polish (negative) | Weighted Total |
|---|---|---|---|---|---|---|---|---|---|
| LongCat Flash | 4 | 4 | 3.5 | 4 | 4 | 4 | 4 | 0 | 98 |
| Kimi K2 | 4 | 4 | 3.5 | 3 | 4 | 4 | 4 | 0 | 94 |
| Grok fast | 4 | 4 | 3.5 | 3.5 | 4 | 4 | 4 | 0 | 96 |
| Gemini 2.5 flash | 4 | 4 | 4 | 3.5 | 4 | 4 | 4 | 0 | 98 |
Gemini 2.5 Flash == LongCat Flash > Grok fast > Kimi K2
Gemini 2.5 Flash
Rubric created by Gemini 2.5 Flash: https://pastebin.com/sFkxNDJB
Result:
| Model | Narrative & structure | Character and voice | Language and imagery | Theme and depth | Overall impact | Total |
|---|---|---|---|---|---|---|
| LongCat Flash | 4 | 4 | 5 | 5 | 5 | 23 |
| Kimi K2 | 5 | 4 | 5 | 5 | 5 | 24 |
| Grok fast | 5 | 4 | 5 | 5 | 5 | 24 |
| Gemini 2.5 flash | 4 | 5 | 5 | 4 | 5 | 23 |
Kimi K2 == Grok fast > LongCat Flash == Gemini 2.5 Flash
Final Result
| Model | LongCat Flash judge | Kimi K2 judge | Grok fast judge | Gemini 2.5 Flash judge | Average |
|---|---|---|---|---|---|
| LongCat Flash | 24 | 57.5 | 98 | 23 | 50.625 |
| Kimi K2 | 24 | 53 | 94 | 24 | 48.75 |
| Grok fast | 25 | 53 | 96 | 24 | 49.5 |
| Gemini 2.5 flash | 24.75 | 53 | 98 | 23 | 49.6875 |
So, the final result is:
LongCat Flash > Gemini 2.5 Flash >= Grok fast > Kimi K2
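Since I'd rather not trust an LLM with the arithmetic (see the reflections below), the Average column above is a manual tally. A trivial sketch of that tally, with the per-judge totals copied from the tables above; note that each judge's rubric uses a different maximum score, so these are raw, unnormalized averages:

```python
# Tally of the final-result table. Per-judge totals are copied from the
# tables above; each judge's rubric has a different maximum score, so
# these are raw, unnormalized averages.
scores = {
    "LongCat Flash":    [24,    57.5, 98, 23],
    "Kimi K2":          [24,    53,   94, 24],
    "Grok fast":        [25,    53,   96, 24],
    "Gemini 2.5 flash": [24.75, 53,   98, 23],
}

for model, judged in scores.items():
    print(f"{model:18s} {sum(judged) / len(judged):.4f}")
# LongCat Flash      50.6250
# Kimi K2            48.7500
# Grok fast          49.5000
# Gemini 2.5 flash   49.6875
```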
My Reflections
After doing this simple test, the following are my thoughts:
- One needs to consider the flavour of the model, its parameters, and inference options while running a test.
- LLM-as-a-judge is a cool concept because it allows you to scale testing; however, the judgement given by an LLM may not match a human’s.
- Having your own benchmark or test is a must. Do not go by what the public benchmarks say.
- Famous models such as ChatGPT may not always be the best for a given task. It is important to test other models that might perform better at a lower cost.
- LLMs aren’t great at math. It’s better to let them assign scores on individual metrics, but the overall tally should be done manually.
- Statistically speaking, this test doesn’t prove much. I’d need to repeat it several times and judge the outputs myself to know which model really excels at creative writing. It’s clear that an LLM can’t reliably judge creative writing.