Stress testing Claude's language skills
Polyglot or poly-not?

What languages do LLM agents handle best? This Bluesky post had me wondering. After a few days of said wondering, I was successfully nerdsniped into benchmarking it myself. (Congratulations, Ed.) The question I set out to answer: in a greenfield, LLM-driven project, what language is most likely to succeed? Enter llmlangbench, a benchmark I wrote to measure Claude Code performance across 8 programming languages over 6 tasks. Let's dig into what I did and what I found.

Benchmark tasks

The basic element of a task is a prompt, and that makes starting off fairly simple. Here's the spec for the simplest task in my benchmark. It's only a Markdown doc with a description of the task, a list of requirements, and some sample inputs and outputs. It's as shrimple as that. To get a robust picture of performance, one likely wants a battery of tasks rather than a single trial. I chose a set of six programming tasks, each from a different domain, and of predicted escalating difficulty:

Thus, our challenges. Now for our contestants!

Language choice

I chose to run each task across 8 different languages:

| Language | Why it's interesting |
| --- | --- |
| Python | The lingua franca of LLM training data, and the expected baseline. |
| TypeScript | Also massively in-distribution, with types to keep things in line. |
| JavaScript | Just like TypeScript, but no types. How important are types to agent success? |
| Ruby | Elegant, concise, expressive... but a smaller training corpus than Python/JS, and highly dynamic. |
| Go | Simple language with strict conventions. Do LLMs thrive with less ambiguity? |
| Rust | Borrow checker and ownership are hard for humans, but provide strong guarantees. |
| Haskell | Pure FP with a powerful type system. A real test of reasoning. |
| Java | Lots of it around, but verbose and ceremony-heavy. Can LLMs handle the boilerplate? |

I wanted a spectrum of language types: some functional and some OO, some interpreted and some compiled, some dynamic and some strongly typed. To give agents an even playing field, I provided a scaffold for each task that sets up the environment appropriately for each language. ("That seems quite extensible, so why did you limit it to only 8?" Astute question, imaginary interlocutor! Regrettably, I've rent to pay, so I must keep costs down somehow.)
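To make that concrete, a scaffold might be organized something like this (the task name and layout here are my own illustration, not necessarily llmlangbench's actual structure):

```
tasks/word-count/
  SPEC.md                 # task description, requirements, sample inputs/outputs
  scaffolds/
    python/               # project stub: entry point, dependency manifest
    typescript/           # package.json, tsconfig, empty entry point
    rust/                 # Cargo project with an empty main
    ...                   # one directory per benchmarked language
```

The point of the scaffold is that the agent never burns turns on environment setup; every trial starts from a project that already builds and runs.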

Scoring the trials

I chose three axes of scoring:

- Correctness: the pass rate against a black-box test suite.
- Efficiency: the cost, turns, and wall-clock time each trial consumed.
- Code quality: a review score out of 100, assigned by an LLM against a per-task rubric.

Using black-box tests was important. Early iterations used a test suite provided alongside the task, but that amounted to too much "teaching to the test". My goal is to assess task completion from scratch, so providing the suite was overly leading. Doing it this way also let me abstract away some language quirks by giving every language the same standard interface, and it saved an absolute boatload of effort over writing a separate test suite per language.
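In case it helps to picture it, here's roughly the shape of a language-agnostic black-box harness. This is a sketch of the idea, not llmlangbench's actual code, and it assumes each solution is a command that reads stdin and writes stdout, so one suite covers all eight languages:

```python
import subprocess

def run_case(cmd, stdin_text, expected, timeout=60):
    """Feed one sample input on stdin and compare stdout to the expected output.

    The solution is a black box: any language works, as long as the command
    reads stdin and writes its answer to stdout.
    """
    try:
        result = subprocess.run(
            cmd, input=stdin_text, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False  # a hung or too-slow solution counts as a failure
    return result.returncode == 0 and result.stdout.strip() == expected.strip()

def pass_rate(cmd, cases):
    """Fraction of (input, expected) pairs that pass."""
    return sum(run_case(cmd, i, e) for i, e in cases) / len(cases)
```

A timeout-as-failure policy like the one above also matters later: slow-but-correct solutions score as failures, which is exactly what tripped up Haskell on one task.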

The subjective code metrics are probably the least definitive. For each task, I provided a rubric explaining what to grade on. This turns out to be more challenging than writing a code-review prompt for use on a PR, since the goal is comparison rather than flaw remediation. It took a few rounds of iteration to get a prompt strict enough to actually differentiate the results on quality, then a few rounds more to stop it from docking every result to a default 90. Take these scores with a grain of salt, and consider them mostly a directional indicator.

Results: Full run (60-turn limit)

For my first run, I ran all 6 tasks, with 3 trials per language, and a 60-turn limit per trial. (The intention was half to demonstrate performance under constraint, and half to guard against draining my entire bank account directly into Dario Amodei's 401k.) I ran these trials on claude-sonnet-4-5-20250929. Results were as follows:

| Language | Pass% | Avg Cost | Avg Turns | Avg Time | Review |
| --- | --- | --- | --- | --- | --- |
| python | 97.0% | $0.8225 | 29.4 | 269.7s | 94 |
| typescript | 95.9% | $0.9109 | 31.4 | 276.4s | 95 |
| javascript | 95.8% | $0.9573 | 34.2 | 316.1s | 89 |
| java | 93.9% | $1.0404 | 40.3 | 363.3s | 85 |
| ruby | 92.7% | $0.9733 | 36.0 | 324.8s | 94 |
| go | 91.1% | $0.9712 | 34.4 | 302.4s | 91 |
| rust | 87.0% | $0.8907 | 32.1 | 292.3s | 91 |
| haskell | 82.0% | $0.9340 | 38.0 | 356.8s | 92 |
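The summary columns here are just simple means over the three trials per language. Sketched roughly (the field names are mine, not the benchmark's actual schema):

```python
from statistics import mean

def summarize(trials):
    """Collapse per-trial records into one row of the results table.

    Each trial is assumed to record: passed (fraction of test cases passed),
    cost_usd, turns, seconds, and review (a 0-100 quality score).
    """
    return {
        "pass_pct": round(100 * mean(t["passed"] for t in trials), 1),
        "avg_cost": round(mean(t["cost_usd"] for t in trials), 4),
        "avg_turns": round(mean(t["turns"] for t in trials), 1),
        "avg_time_s": round(mean(t["seconds"] for t in trials), 1),
        "review": round(mean(t["review"] for t in trials)),
    }
```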

The top end of the results is not at all surprising. Python, TypeScript, and JavaScript, the most in-distribution languages, are well in the lead. TS and JS are not well-differentiated on pass %, but TypeScript got there a little faster and earned better reviews. My guess is that the types helped structure the code better and catch some problems earlier.

Java is in an interesting niche. It comes in ahead of all the other compiled languages... but it comes in dead last on cost, time, and review scores. My guess would be that Claude has seen plenty of Java and can thus wrangle it to work in the end, but that Claude struggles the whole way through with the ceremony & cruft of it.

Ruby comes in behind all the other dynlangs. This saddens but does not surprise me. Ruby is a standard-bearer for good ol' Tim Toady (TMTOWTDI: "there's more than one way to do it"). Though it's possible to write impressively elegant and simple Ruby, it's also possible to get overly fancy with almost any aspect of the language you'd like. And many common idioms are quite dynamic, using metaprogramming deep magic that one might frown upon in other languages. It doesn't particularly surprise me that LLMs can tie themselves in knots in such an environment.

Go, Rust, and Haskell bring up the rear on pass %, with pretty sharp gaps between them. To figure out why, we need to get a little more granular. Two tasks had the greatest impact. First, sudoku-solver:

| Task | Language | Pass% | Avg Cost | Review |
| --- | --- | --- | --- | --- |
| sudoku-solver | haskell | 71.4% | $0.3648 | 78 |

For sudoku-solver, Haskell was the only language that scored below 100%. The Haskell solutions consistently failed certain benchmark cases; my understanding is that these failures were test-time timeouts caused by inefficient code. Second, sql-database:

| Task | Language | Pass% | Avg Cost | Review |
| --- | --- | --- | --- | --- |
| sql-database | python | 88.9% | $1.6086 | 88 |
| sql-database | javascript | 82.8% | $2.1815 | 85 |
| sql-database | typescript | 82.8% | $2.3854 | 92 |
| sql-database | java | 70.0% | $2.1513 | 78 |
| sql-database | ruby | 66.0% | $2.0381 | 85 |
| sql-database | go | 59.3% | $2.0280 | 81 |
| sql-database | rust | 29.0% | $2.1932 | 92 |
| sql-database | haskell | 25.9% | $2.0842 | 85 |

For sql-database, all three of Go, Haskell, and Rust struggled. Can you guess why? It's about the 60-turn limit! Most languages hit the limit, but the dynamic languages got further, faster. Maybe this means the compiled languages are slower for LLMs, but does it mean they're actually worse? Let's do a bit more digging to see if we can differentiate further.

Results: sql-database only (240-turn limit)

To build a clearer picture, I chose to repeat the sql-database trial under looser constraints. All languages, 3 trials, claude-sonnet-4-5-20250929 again — but this time, a 240-turn limit. My hope was that the compiled languages would now be free to complete their runs, and score better. The data bears this out:

| Language | Pass% | Avg Cost | Avg Turns | Avg Time | Review |
| --- | --- | --- | --- | --- | --- |
| python | 90.2% | $2.7597 | 73.7 | 656.0s | 83 |
| go | 87.2% | $3.2566 | 78.3 | 778.2s | 83 |
| java | 85.2% | $3.7524 | 101.3 | 860.9s | 83 |
| javascript | 83.2% | $3.0833 | 85.0 | 974.6s | 83 |
| typescript | 79.8% | $2.6583 | 67.0 | 697.1s | 93 |
| ruby | 78.5% | $2.6174 | 76.0 | 878.0s | 83 |
| rust | 77.1% | $3.3993 | 77.0 | 957.9s | 82 |
| haskell | 44.1% | $3.0055 | 90.0 | 814.7s | 81 |

Python still leads the pack. Go rockets up from the bottom three to second place. Go and Python as the top two might say something about simplicity: both are "one way to do it" languages, and that may provide structure that's helpful for LLMs. Java also gets a little better, rising to third, though it's still quite slow to get there. JavaScript and TypeScript are now firmly middle-of-the-road, rather than up alongside Python. Rust is still second to last, but jumps up almost 50 percentage points, joining the same pack as TypeScript and Ruby. Yet poor Haskell is still dead last, with a 44% pass rate. That's rough.

My takeaways

How much did this cost?

Approximately $220 USD.

What this doesn't tell us

As an exercise in intellectual humility, it's always good to check what one's data doesn't say. Here are some things my benchmark cannot assess:

It stands to reason that some of these omissions could be highly impactful. For example, concurrent tasks might swing results strongly toward Go, with its easy goroutines, or Rust, with its robust borrow checking. Or brownfield codebases might swing things toward typed languages, since types serve as documentation and as guardrails that reduce the necessary context and exploration. So, take my results with a healthy pinch of salt. If these unknowns bug you, consider spinning up a benchmark of your own!

Raw results

You can find the raw results in the releases on GitHub, downloadable as tarballs. For each trial, this includes the code result, the actual chat logs, the scoring data, and the AI-driven reviews. (The raw results will be especially useful if you'd like to check any of the code quality yourself, rather than trusting the LLM-driven review score.)