The following outlines a benchmark that hopefully measures, in a very rough way, whether a language model is able to write "well." Basically, we are going to ask models to write sentences and then see how much they like their own sentences. In other words, can they write well according to their own standards?

This approach has two things going for it: first, we hope that LLM taste currently serves as a good approximation of human taste, i.e. that they basically like the same things we do. Second, I would argue that this is a "truer" measure of taste than scoring LLM writing against our own opinions. If someone else likes different things than I do, but has a deeply held and internally coherent set of justifications for those preferences, I wouldn't say they have bad taste, just different taste. It seems equally plausible that language models could one day have a taste in writing that is different from ours but equally good. Thus, we let them judge themselves.

The idea here is to ask a model to both generate and discriminate for itself. In particular, we:

1. ask the model to rate an existing sentence,
2. ask it to rewrite that sentence to be better, and
3. ask it to rate its own rewrite with the same rating prompt.

If it thinks its own sentence is better, then it has some amount of internally coherent taste in the writing of this particular sentence. If it thinks its own attempted improvement has made the sentence worse, then clearly something has gone wrong.

Methods, in more detail

To start, I gathered a corpus of sentences for the models to work with. It was important to me that the sentences be "good" in some way, or at least very thoughtfully constructed, because if it's trivial to write an improved version, the benchmark doesn't tell us a lot. An example of this sort of sentence that came to mind for me was writing in literary magazines, especially reviews. The sentences are highly functional (vs. poetic fiction sentences or something), but people try pretty hard and have editors who make sure the sentences are basically okay. So I took a bunch of sentences from the reviews section of The Diagram, which I happen to like. I also filtered for sentences that were more than 10 words long, so there's enough material to work with, especially given that we aren't using the context of the surrounding sentences.
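As a rough sketch, the filtering step might look something like this (assuming the reviews are already scraped into plain text; the regex split and function name here are just illustrative, not necessarily what the actual code does):

```python
import re

def extract_sentences(text: str, min_words: int = 10) -> list[str]:
    """Split review text into sentences and keep only the longer ones."""
    # Crude split on sentence-ending punctuation followed by whitespace.
    candidates = re.split(r"(?<=[.!?])\s+", text)
    # Keep sentences with more than `min_words` words so there is
    # enough material for the model to work with.
    return [s.strip() for s in candidates if len(s.split()) > min_words]
```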

Some examples:

Now, every model we are testing rates each sentence from 1-100, so we have some granularity.
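Concretely, the rating step is just a prompt that asks for a number. A minimal sketch using the OpenAI Python client (the exact prompt wording, model name, and response parsing are my assumptions; the linked source may differ):

```python
from openai import OpenAI

client = OpenAI()

RATING_PROMPT = (
    "Rate the quality of the following sentence on a scale from 1 to 100. "
    'Sentence: "{sentence}". Do not return any text other than the number.'
)

def rate_sentence(sentence: str, model: str = "gpt-4o") -> int:
    """Ask a model for a 1-100 quality rating of a single sentence."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RATING_PROMPT.format(sentence=sentence)}],
    )
    return int(response.choices[0].message.content.strip())
```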

Original ratings

As an aside, we can also clearly see that some models are uniformly much harsher or kinder with their ratings. This is totally okay, because we are only ever comparing a model's rating with that same model's rating of a rewritten sentence, so one model generally giving scores twice as high as another won't affect the results.

Now, for the rewriting, I used the following prompt:

Rewrite the following sentence to be better, along whatever axis is most important. It could be more clear, more interesting, have more emotional depth, etc. Sentence: "{sentence}". Do not return any text other than the rewritten sentence.

I imagine this sort of task is probably pretty sensitive to prompting, and I haven't spent a lot of time trying different versions, but it felt like a simple "better" might be a bit too vague so I put in a few pointers.

Then, as described, we ask the models to rate their rewritten sentences via the same prompt as the original, and take the ratio of the new score to the old score: how much better they like the rewritten sentence.
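Putting the pieces together, the per-sentence score is just that ratio. A sketch, reusing the client and rate_sentence from the earlier snippet (the helper names are hypothetical, though the rewrite prompt is the real one):

```python
REWRITE_PROMPT = (
    "Rewrite the following sentence to be better, along whatever axis is most "
    "important. It could be more clear, more interesting, have more emotional "
    'depth, etc. Sentence: "{sentence}". '
    "Do not return any text other than the rewritten sentence."
)

def rewrite_sentence(sentence: str, model: str) -> str:
    """Ask a model for its improved version of the sentence."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(sentence=sentence)}],
    )
    return response.choices[0].message.content.strip()

def self_preference_ratio(sentence: str, model: str) -> float:
    """How much more the model likes its own rewrite than the original."""
    original_score = rate_sentence(sentence, model=model)
    rewritten = rewrite_sentence(sentence, model=model)
    rewritten_score = rate_sentence(rewritten, model=model)
    return rewritten_score / original_score
```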

Rewrite-to-original rating ratios

I was surprised to see Sonnet 4 have the highest score here by such a large margin, especially compared to Sonnet 4.5. Let's look at some of its rewrites:

| Original | Rewritten (Sonnet 4) |
| --- | --- |
| It is a book about the act of looking, about being looked at—and about going unnoticed. | This is a book about the hunger in our gaze, the weight of being seen, and the particular ache of invisibility. |
| Zack is in his early twenties and lives in Oklahoma City and comes from a background of wealth. | Zack's trust fund cushions him through his early twenties in Oklahoma City, where oil money runs as deep as the red dirt. |
| Instead of taking an internship at a law firm that would move him one step closer to the life prepared for him by his father, he decides on a whim to apply for a summer internship at the zoo. | He abandons the law firm internship his father had orchestrated—another rung on a ladder he never chose to climb—and impulsively applies to spend his summer mucking out animal cages at the zoo. |

Overall these do in fact seem rather nice! Potentially overwritten and a bit flowery for sure, but I'm quite fond of them. And given just one sentence, the impulse to make it flowery seems totally understandable.

Let's look at what 4.5 is doing to compare:

| Original | Rewritten (Sonnet 4.5) |
| --- | --- |
| It is a book about the act of looking, about being looked at—and about going unnoticed. | It is a book about seeing and being seen—and about the particular ache of invisibility. |
| Zack is in his early twenties and lives in Oklahoma City and comes from a background of wealth. | Zack, a young man in his early twenties, grew up cushioned by family money in Oklahoma City. |
| Instead of taking an internship at a law firm that would move him one step closer to the life prepared for him by his father, he decides on a whim to apply for a summer internship at the zoo. | He rejects the law firm internship his father had mapped out for him and applies to the zoo instead, trading a predetermined future for one summer of his own choosing. |

(It's funny that they both come up with the same "particular ache of invisibility" phrase!)

From this, it's pretty obvious why Sonnet 4 is scoring so much higher than 4.5. Comparing the second rewrite for each model, Sonnet 4 substantially embellishes the sentence with "where oil money runs as deep as the red dirt", which isn't based on anything in the original source, while 4.5 stays much more faithful to the original.

Both of these are totally valid interpretations of the task "rewrite the following sentence to be better": there's no specification of how many changes, or what sort of changes, are allowed. So our benchmark is measuring some combination of how good the models are at rewriting and how they weigh the tradeoff between preserving the original and taking more creative liberties.

I spent a while thinking about how one might try to solve this. "Amount of rewriting" is a very hard thing to specify. One could use vector embeddings to score each model on how different it made the sentences, or write more specific prompts that try to get the models to maintain some similarity metric.
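For what it's worth, the embedding version of that idea might look something like this (a sketch only; the sentence-transformers model choice is arbitrary and this is not part of the actual benchmark):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def rewrite_similarity(original: str, rewritten: str) -> float:
    """Cosine similarity between a sentence and its rewrite.

    Values near 1.0 mean the rewrite stayed close to the source;
    lower values mean the model took more creative liberties.
    """
    embeddings = embedder.encode([original, rewritten])
    return float(cos_sim(embeddings[0], embeddings[1]))
```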

This seemed rather fraught, however (there are so many types of similarity), so I ended up thinking the most likely solution was to give the model a highly constrained task. For example, instead of asking "rewrite the sentence", we could ask "change exactly 3 words in this sentence to make it better." With this setup, any differences in "amount of change" disappear and we are left with a purer metric (though one that leaves somewhat less room for differences).
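A side benefit of the constrained version is that it's easy to check mechanically how many words a model actually changed. One way to do it (my own rough sketch using difflib, not necessarily how the benchmark enforces the constraint):

```python
from difflib import SequenceMatcher

def words_changed(original: str, rewritten: str) -> int:
    """Count word-level edits (replacements, insertions, deletions)."""
    a, b = original.split(), rewritten.split()
    changed = 0
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            # Count the larger side of the edit so deletions register too.
            changed += max(i2 - i1, j2 - j1)
    return changed

# e.g. words_changed("the cat sat", "the dog sat") == 1
```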

Here's what we get overall with this setup:

model scores for 3 word edits

There's a lot less variation (obviously, since the sentences are changing less), but the Claudes are still winning out pretty substantially. Let's look at some specific rewrites again (added words are bolded, and self-ratings are in italics):

| Original | Rewritten (Sonnet 3.7 - our three word edit winner) |
| --- | --- |
| Whatever suspicions we have about our narrator's mental health are accompanied by the increasing pause we're given about the ethics of his spiritual journey. *85/100* | Whatever suspicions we **harbor** about our narrator's mental health are accompanied by the increasing **unease** we **experience** about the ethics of his spiritual journey. *85/100* |
| It is too trite to say that this is a post-apocalyptic novel, though it certainly is. *75/100* | It **feels** too trite to say that this is a post-apocalyptic novel, though it certainly is. *85/100* |
| All of the names hold deeper meanings. *55/100* | **Every** name **cradles hidden** meanings. *75/100* |

I kind of hate "cradles" here: it's definitely too far in the direction of purple prose for my taste. The first sentence's rewrite I rather like, though: it replaces words with more specific and evocative ones that I do think improve the sentence. It does seem highly likely, however, that Sonnet's good score comes substantially from its penchant for round numbers, making it bump itself from 75/100 to 85/100 for a minor improvement.

And below is our worst-scoring model by far, gpt-5 (it's very strange how much worse it does than models like o3!):

| Original | Rewritten (gpt-5) |
| --- | --- |
| Whatever suspicions we have about our narrator's mental health are accompanied by the increasing pause we're given about the ethics of his spiritual journey. *39/100* | Whatever suspicions we have about our narrator's mental health are accompanied by the increasing **doubt** about the ethics of his spiritual journey. *83/100* |
| It is too trite to say that this is a post-apocalyptic novel, though it certainly is. *82/100* | It is too trite to **call** this a post-apocalyptic novel, though it certainly is. *78/100* |
| All of the names hold deeper meanings. *76/100* | All the names **whisper** deeper **meaning**. *72/100* |

I don't totally know what to make of these. They are highly functional edits: I do think "doubt" improves the first sentence from a concision perspective, and similarly the second edit does feel like it improves the flow slightly, but overall the edits don't feel like they do a lot. It's surprising to me, though, that gpt-5 dislikes its own edits so much; the second sentence at least seems like an improvement to me!

Overall, this does feel like a more meaningful benchmark than the original. I certainly wouldn't go so far as to say that it is actually measuring taste: in the limit, a model that just absolutely loves the sound of its own voice and hates everything else would get a perfect score on this benchmark, but clearly wouldn't have good taste. In general, though, while this sort of self-preference could certainly be a factor, it feels like there's at least some real information to be gained from the differences here.

I'm also somewhat partial to the idea that "LLMs can write better when they have something to say." For these contextless sentences, the models don't really have any attachment to the meaning, so they aren't going to perform as well as they would on a more goal-directed sentence. I'm sure there's a version of this benchmark that could lean into that idea more.

I'm still thinking about other concrete ways to test this sort of taste, and will probably update this later on.

Awful vibe-coded source code is available here.