I saw a few tweets about Kimi K2 being better than other LLMs when it comes to writing.

I decided to do a simple test.

Model Selection

My idea is to compare different models. However, testing LLMs is not easy; there are a few issues one needs to consider:

  • The number of LLMs keeps growing, with new closed-source and open-source models released regularly.
  • Some models come in different flavours: thinking, fast, chat, instruct, base, etc.
  • Models expose sampling parameters such as temperature and top-p, which influence their output (see the sketch after this list).
  • Model performance is not consistent across inference providers, and open-source models are especially affected by this. Some inference providers limit the context window or serve a quantized version, and in some cases the model is not compatible with the inference setup a provider uses.
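To make the parameter point concrete, here is a minimal sketch of how temperature and top-p are typically set when calling a model through an OpenAI-compatible API. The endpoint, key, and model id are placeholders I made up for illustration; for this test I used each model's chat interface with its defaults.

```python
# Minimal sketch: setting sampling parameters via an OpenAI-compatible API.
# The endpoint, key, and model id are placeholders, not what I used.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="sk-...")

response = client.chat.completions.create(
    model="example-model",  # placeholder model id
    messages=[{"role": "user", "content": "Write a haiku about rain."}],
    temperature=0.7,  # higher values produce more varied output
    top_p=0.9,        # nucleus sampling: sample only from the top 90% probability mass
)
print(response.choices[0].message.content)
```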

Let me go through all the issues one by one.

First, I am not going to compare Kimi K2 against all the models. I decided to narrow it down to four models: two developed by Western AI labs and two developed by Chinese AI labs. The models I selected were:

  • Kimi K2
  • LongCat Flash
  • Grok
  • Gemini 2.5

I didn’t go with ChatGPT, DeepSeek, or Qwen because they are quite popular, and I wanted to test some of the less popular ones.

Second, to the best of my knowledge, models such as Grok and Gemini come in different flavours: fast, pro, thinking, etc. For my test, I decided to stick with the fast or non-thinking flavour. So I am using Grok fast, Gemini 2.5 Flash, LongCat Flash with no thinking, and Kimi K2 with thinking disabled.

Third, ideally, I should try different parameter settings for each model, generate an output for each, and then compare them. However, the number of parameter combinations is large enough (see the sketch below) that I decided to use the default model parameters.
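To get a feel for the numbers, here is a rough count assuming a modest grid of values. The grids are illustrative assumptions, not settings I actually tested:

```python
# Rough count of runs needed for even a modest parameter sweep.
# The grids below are illustrative assumptions, not values I tested.
from itertools import product

models = ["Kimi K2", "LongCat Flash", "Grok fast", "Gemini 2.5 Flash"]
temperatures = [0.2, 0.5, 0.7, 1.0]
top_ps = [0.8, 0.9, 0.95, 1.0]
repeats = 5  # repeated samples per setting, since outputs vary

runs = len(list(product(models, temperatures, top_ps))) * repeats
print(runs)  # 4 * 4 * 4 * 5 = 320 stories to read and judge
```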

Fourth, it is known that models do not perform consistently across inference providers. In such cases, the correct choice would be to use a single inference provider and test all the models there. Unfortunately, no single inference provider hosts all the shortlisted models. The next best option is to use each model's main or default provider. For LongCat Flash, I am using lmarena.ai since I couldn't find any other provider for it.

So, the final selection looks like this:

| Model | Tested using | Model parameters | Thinking/Reasoning | Model interface |
|---|---|---|---|---|
| LongCat Flash | lmarena.ai | Default | Off | Chat |
| Kimi K2 | kimi.com | Default | Off | Chat |
| Grok fast | grok.com | Default | Off | Chat |
| Gemini 2.5 Flash | gemini.google.com | Default | Off | Chat |

I did not use any paid versions of the models for this test.

Test and output

In order to test the creative writing ability of these models, I decided to use the following prompt:

Write a short story in less than 1000 words. Be as creative and novel as possible.

A key thing to note here is that if you were to try this prompt in a new chat session multiple times, the models would give you a different story each time. Technically, I should run the same prompt 10-15 times to capture the general writing style of each model, and even then I am not sure I would get a good representation of its writing skills. So instead, I decided to run it only once and judge the output.

If you were to repeat the same test, you would get a different output.
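If I were to do the multi-run version properly, it would look roughly like the sketch below, with each call starting a fresh session so no history leaks between stories. This assumes API access and a placeholder model id; I actually used the chat interfaces.

```python
# Sketch: collect several independent samples per model to capture its
# general style. Assumes API access; I actually used the chat interfaces.
from openai import OpenAI

client = OpenAI()  # placeholder credentials

PROMPT = (
    "Write a short story in less than 1000 words. "
    "Be as creative and novel as possible."
)

stories = []
for _ in range(15):  # 10-15 runs, per the reasoning above
    resp = client.chat.completions.create(
        model="example-model",  # placeholder model id
        messages=[{"role": "user", "content": PROMPT}],  # fresh session each time
    )
    stories.append(resp.choices[0].message.content)
```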

You can read the short stories written by these models by clicking the links below:

Judging the results

Judging creative writing has a subjective element to it. In my opinion:

Kimi K2 > LongCat Flash == Grok fast > Gemini 2.5 Flash.

The story written by Kimi K2 felt more human and natural compared to all the other models.

Now that was my assessment. What if I try to use an LLM to judge these stories?

Failed approach

I used ChatGPT and DeepSeek to judge these stories using Colorado State University’s creative writing rubric: https://tilt.colostate.edu/wp-content/uploads/2024/01/Written_CreativeWritingRubric_CURC.pdf

It turns out all of the short stories got a full score. This rubric is designed for students who are learning creative writing, not for LLMs, which can produce near-perfect writing by its standards. It is not helpful here, since the goal is to find which LLM is best.

LLM as a Judge

Approach 1: Rating via ChatGPT and DeepSeek

I decided to ask ChatGPT and DeepSeek to rate the short stories written by these models:

| Model | ChatGPT | DeepSeek | Average score |
|---|---|---|---|
| LongCat Flash | 4.8 | 4.8 | 4.8 |
| Kimi K2 | 4.9 | 4.9 | 4.9 |
| Grok fast | 4.95 | 5 | 4.975 |
| Gemini 2.5 Flash | 4.93 | 5 | 4.965 |

So, as per ChatGPT and DeepSeek, it is:

Grok == Gemini 2.5 Flash > Kimi K2 > LongCat Flash
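For what it’s worth, this kind of rating can also be scripted instead of pasted into a chat window. Here is a minimal LLM-as-a-judge sketch; the judge model id and the instruction wording are my assumptions, not the exact prompts I gave ChatGPT and DeepSeek.

```python
# Minimal LLM-as-a-judge sketch. The judge model id and instruction
# wording are assumptions; I ran my ratings through chat interfaces.
from openai import OpenAI

client = OpenAI()  # placeholder credentials

def judge(story: str, judge_model: str = "example-judge-model") -> str:
    instruction = (
        "Rate the following short story for creative writing quality "
        "on a scale of 1 to 5 (decimals allowed). Reply with the score "
        "and a one-sentence justification.\n\n" + story
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": instruction}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return resp.choices[0].message.content
```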

Approach 2: Let the models create a rubric and judge the output.

Instead of relying on ChatGPT and DeepSeek, I decided to ask each of the four models to come up with its own rubric for judging high-quality creative writing, and then score all four stories with it.
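A rough sketch of how this two-step flow could be scripted, assuming API access and placeholder model ids (I did both steps by hand in each model’s chat interface):

```python
# Two-step flow: a model writes its own rubric, then scores a story
# against it. Model ids and prompt wording are my assumptions.
from openai import OpenAI

client = OpenAI()  # placeholder credentials

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

story = "..."  # one of the four generated short stories

rubric = ask(
    "example-model",  # placeholder: Kimi K2, LongCat Flash, Grok fast, or Gemini
    "Create a rubric for judging high-quality creative writing.",
)
score = ask(
    "example-model",
    f"Using this rubric:\n\n{rubric}\n\nScore the following story:\n\n{story}",
)
```

This gave some interesting results: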

Kimi K2

Rubric created by Kimi K2: https://pastebin.com/7LDWaE9s

Result:

| Model | First read impact | Language control | Structure and pacing | Originality | Character and voice distinction | Emotional after-taste | Total |
|---|---|---|---|---|---|---|---|
| LongCat Flash | 10 | 9.5 | 9 | 10 | 9 | 10 | 57.5 |
| Kimi K2 | 9 | 9 | 9 | 9 | 8 | 9 | 53 |
| Grok fast | 9 | 9 | 9 | 9 | 8 | 9 | 53 |
| Gemini 2.5 Flash | 9 | 9 | 9 | 10 | 8 | 8 | 53 |

LongCat Flash >> Kimi K2 == Grok fast == Gemini 2.5 Flash

LongCat Flash

Rubric created by LongCat Flash: https://pastebin.com/UtqL7W9q

Result:

| Model | Originality | Narrative craft & structure | Emotional resonance | Language & style | Thematic depth and subtext | Total |
|---|---|---|---|---|---|---|
| LongCat Flash | 5 | 5 | 5 | 5 | 4 | 24 |
| Kimi K2 | 5 | 5 | 5 | 5 | 4 | 24 |
| Grok fast | 5 | 5 | 5 | 5 | 5 | 25 |
| Gemini 2.5 Flash | 5 | 5 | 4.75 | 5 | 5 | 24.75 |

Grok fast >= Gemini 2.5 Flash > Kimi K2 == LongCat Flash

Grok fast

Rubric created by Grok fast: https://pastebin.com/A32Qv0PC

Result:

| Model | Originality and voice | Theme depth and coherence | Narrative architecture | Character agency | Prose and architecture | Emotional impact | Intellectual impact | Technical polish (negative) | Weighted total |
|---|---|---|---|---|---|---|---|---|---|
| LongCat Flash | 4 | 4 | 3.5 | 4 | 4 | 4 | 4 | 0 | 98 |
| Kimi K2 | 4 | 4 | 3.5 | 3 | 4 | 4 | 4 | 0 | 94 |
| Grok fast | 4 | 4 | 3.5 | 3.5 | 4 | 4 | 4 | 0 | 96 |
| Gemini 2.5 Flash | 4 | 4 | 4 | 3.5 | 4 | 4 | 4 | 0 | 98 |

Gemini 2.5 Flash == LongCat Flash > Grok fast > Kimi K2

Gemini 2.5 Flash

Rubric created by Gemini 2.5 Flash: https://pastebin.com/sFkxNDJB

Result:

| Model | Narrative & structure | Character and voice | Language and imagery | Theme and depth | Overall impact | Total |
|---|---|---|---|---|---|---|
| LongCat Flash | 4 | 4 | 5 | 5 | 5 | 23 |
| Kimi K2 | 5 | 4 | 5 | 5 | 5 | 24 |
| Grok fast | 5 | 4 | 5 | 5 | 5 | 24 |
| Gemini 2.5 Flash | 4 | 5 | 5 | 4 | 5 | 23 |

Kimi K2 == Grok fast > LongCat Flash == Gemini 2.5 Flash

Final Result

| Model | LongCat Flash judge | Kimi K2 judge | Grok fast judge | Gemini 2.5 Flash judge | Average |
|---|---|---|---|---|---|
| LongCat Flash | 24 | 57.5 | 98 | 23 | 50.625 |
| Kimi K2 | 24 | 53 | 94 | 24 | 48.75 |
| Grok fast | 25 | 53 | 96 | 24 | 49.5 |
| Gemini 2.5 Flash | 24.75 | 53 | 98 | 23 | 49.6875 |

So, the final result is:

LongCat Flash > Gemini 2.5 Flash >= Grok fast > Kimi K2
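Since LLMs aren’t great at arithmetic (more on this in the reflections below), I tallied the averages myself; the scores are copied from the per-judge tables above. Note that the four judges score on different scales (roughly out of 25, 60, 100, and 25), so the raw average leans heavily on the Grok fast judge’s 100-point rubric.

```python
# Manual tally of the per-judge totals from the tables above.
scores = {
    "LongCat Flash":    [24,    57.5, 98, 23],
    "Kimi K2":          [24,    53,   94, 24],
    "Grok fast":        [25,    53,   96, 24],
    "Gemini 2.5 Flash": [24.75, 53,   98, 23],
}
for model, s in sorted(scores.items(), key=lambda kv: -sum(kv[1])):
    print(f"{model}: {sum(s) / len(s)}")
# LongCat Flash: 50.625
# Gemini 2.5 Flash: 49.6875
# Grok fast: 49.5
# Kimi K2: 48.75
```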

My Reflections

After doing this simple test, the following are my thoughts:

  • One needs to consider the flavour of the model, its parameters, and inference options while running a test.
  • LLM-as-a-judge is a cool concept because it allows you to scale testing; however, the judgement given by an LLM may not match a human’s.
  • Having your own benchmark or test is a must. Do not go by what the public benchmarks say.
  • Famous models such as ChatGPT may not always be the best for a given task. It is important to test other models that might perform better at a lower cost.
  • LLMs aren’t great at math. It’s better to let them assign scores on individual metrics, but the overall tally should be done manually.
  • Statistically speaking, this test doesn’t prove much. I’d need to repeat it several times and judge the outputs myself to know which model really excels at creative writing. What is clear is that the LLM judges disagreed with each other and with me, so an LLM can’t reliably judge creative writing.