I'm guessing he didn't have access to any LLMs while competing, but I think a "centaur" approach probably would have outperformed both "only human" and "only LLM" competitors.
Reading through the challenge, there's a lot of data modelling and test harness writing and ideating that an LLM could knock out fairly quickly, but would take even a competitive coder some time to write (even if just limited by typing speed).
That'd give the human more time to experiment with different approaches and test incremental improvements.
He apparently did use a little autocomplete, but worked in [VSCode](https://x.com/jacob_posel/status/1945585787690738051).
And apparently it's not against the competition rules to use LLMs (https://atcoder.jp/posts/1495). I'd be curious what other competitors used.
Interesting, thanks for the links! I had read this part of the article:
> All competitors, including OpenAI, were limited to identical hardware provided by AtCoder, ensuring a level playing field between human and AI contestants.
And assumed that meant a pretty restricted (and LLM-free) environment. I think their policy is pretty pragmatic.
Ten hours is a decent amount of time, so I'm not too surprised the human won. LLMs don't really tend to improve the longer they get to chew on a problem (often the opposite in fact).
The LLM was probably getting nowhere trying to improve after the first few minutes.
On the livestream (perhaps elsewhere?) you can watch the submissions and scores come in over time. The LLM steadily increased (and sometimes decreased) its score over time, though by the end it did seem to hit a plateau. You could even see it try out new strategies (with walls, e.g.), which didn't appear until about halfway through the competition.
> The LLM was probably getting nowhere trying to improve after the first few minutes.
How did you come to that conclusion from the contents of the article?
The final scores are all relatively close. How could that happen if the ai was floundering the whole time? Just a good initial guess?
>How could that happen if the ai was floundering the whole time? Just a good initial guess?
Yes, that and marginal improvements over it.
Yep, self-reinforcement learning is missing in LLMs.
I would think, though, that the LLM is not trying one solution for 10 hours like a human.
I would assume the LLM is trying an inhuman number of solutions and the best one was #2 in this contest.
Impressive by the human winner but good luck on that in 2026.
This is a real modern day John Henry story, except John Henry dies in the end.
https://en.wikipedia.org/wiki/John_Henry_(folklore)
Mentioned in TFA as well.
Guess I should read more than the summary :)
This is interesting, but aren't "coding competitions" about writing small leetcode programs from a prompt? I would expect the AI to excel at that.
I'm old enough to remember being taught the Ballad of John Henry...
The article mentions it...
Yup. I was talking about how they taught it to us, in school. It actually had an emotional place in my heart. For some reason, I found the story compelling.
I found it sad, because he dies at the end to prove he could beat the machine once, but the machine could keep producing its lesser output every day after his death. He gave it all for a Pyrrhic victory.
Well, millions did, that's why it's a classic!
I suspect it may not be taught, anymore, though.
I seem to encounter cultural milestones, that are no longer there, every day.
I learned it in elementary school in the late 90s
Really feels like it could be an onion title.
Ha! Came here to say the same thing...
How was the model operated? Was it someone prompting it continuously or was it just given the initial prompt?
I despise the company that competed, but I feel obligated to acknowledge that headline buries the lede that their bot got SECOND place, and their 2nd place was closer to first than 3rd was to 2nd.
Are the submissions available online without needing to become a member of AtCoder?
I want to see what these 'heuristic' solutions look like.
Is it just that the ai precomputed more states and shoved their solutions in as the 'heuristic' or did it come up with novel, more broad, heuristics? Did the human and ai solutions have overlapping heuristics?
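In the meantime, here's my guess at the general shape of such solutions (an assumption on my part, not anyone's actual submission): AtCoder-style heuristic entries are usually a problem-specific scorer wrapped in a local-search loop, hill climbing or simulated annealing, that keeps mutating a candidate until the time limit. A minimal Python sketch of that pattern, with placeholder score and mutation functions:

```python
# Purely illustrative sketch -- not any contestant's actual submission.
# Heuristic-contest solutions are typically a problem-specific scorer plus a
# local-search loop (hill climbing / simulated annealing) that keeps mutating
# a candidate solution until just under the problem's time limit.
import math
import random
import time

def score(solution):
    """Placeholder objective; higher is better. The real one is problem-specific."""
    return -sum((x - 0.5) ** 2 for x in solution)

def random_neighbor(solution):
    """Return a slightly mutated copy of the solution (also problem-specific)."""
    neighbor = list(solution)
    neighbor[random.randrange(len(neighbor))] = random.random()
    return neighbor

def solve(initial, time_limit=1.0):
    start = time.time()
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    while time.time() - start < time_limit:
        candidate = random_neighbor(current)
        cand_score = score(candidate)
        # Simulated annealing: always accept improvements, sometimes accept
        # worse moves, with the tolerance shrinking as the clock runs down.
        temperature = max(1e-9, 1.0 - (time.time() - start) / time_limit)
        if cand_score >= current_score or \
                random.random() < math.exp((cand_score - current_score) / temperature):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

print(solve([random.random() for _ in range(10)]))
```

If the entries do look like this, the interesting differences between the human and the model would be in the neighborhood moves and the scoring, not the outer loop.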
so many things
First, there's a world coding championship?! Of course there is. There's a competition for anything these days.
Why is he exhausted?
> The 10-hour marathon left him "completely exhausted."
> ... noting he had little sleep while competing in several competitions across three days. "I'm completely exhausted. ... I'm barely alive."
oh! That's a lot.
> beating an advanced AI model from OpenAI ...
> On Wednesday, programmer Przemysław Dębiak (known as "Psyho"), a former OpenAI employee,
Interesting that he used to work there.
> Dębiak won 500,000 yen
JPY 500,000 -> USD 3367.20 -> EUR 2889.35
I'm guessing it's more about the clout than it is about the payment, because that's not a lot of money for the effort spent
He's retired, so I'm guessing more about the clout. Or even just "love of the game"? He had a fairly popular tweet thread a couple years back where he wrote out 80 tips for competitive programming -- that feels less likely to be clout based
> I'm guessing it's more about the clout than it is about the payment
to be fair he also said
> "Honestly, the hype feels kind of bizarre," Dębiak said on X. "Never expected so many people would be interested in programming contests."
> I'm guessing it's more about the clout than it is about the payment
Yeah I'm not in tech but I've seen his handle like 3 times today already, so he's definitely got recognition.
Does someone know the problem/challenge being solved?
https://atcoder.jp/contests/awtf2025heuristic/tasks/awtf2025...
Damn I feel exhausted reading that problem, seeing the input/output... what. Granted I skimmed it for like 20 seconds but yeah.
This reminds me of many a challenge on Advent of Code
I'm at a complete loss to discern why this would be a useful task to solve. It seems like the equivalent of elementary schoolers saying "OK, if you're so smart, what's 9,203,278,023 times 3,333,300,209?"
It’s a coding contest not a fiverr programming task. If it seems like a challenge for challenge sake, it’s probably because it’s a challenge for challenge sake.
As someone with a degree in computer science, it reminds me of almost every course I took. As someone who has worked at multiple FAANG and adjacent companies with high expectations, I've encountered things like this in most interviews and have devised similar problems to give as interviews.

The point isn't to make something objectively useful in the question itself, but to provide a toy example of a specific class of problem that absolutely shows up in practical situations, even though by and large most IT programmers would never see such a problem in their careers. That doesn't mean such problems don't exist in the world or aren't solved professionally by computer scientists doing practical work. Beyond that, they are also tests of how well people have learned computer science, discrete math, and complex programming, as a proxy for general technical intelligence (albeit not testing any specific technology or toolkit, as is emphasized in IT work).

So it surprises me when people bellyache about computer science being asked about in any context (school, work, or a programming contest) as if the only worthwhile things to do are systems programming questions.
It's more the equivalent of "why would anyone race the 400m on a standard track, you just wind up back where you started!"
It's a problem that has no perfect solution, only incremental improvements. So it's really not like your example at all.
Say you have a bunch of warehouse robots, some of which work in different sections of the warehouse. Maybe one section has fewer things to do while another has more, and thus needs more help. So you need to move a bunch of robots there, in groups.
Something like that.
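As a purely hypothetical toy version of that analogy (not the actual contest task), a greedy rebalancer might repeatedly shift a robot from the least-loaded section to the most-loaded one until moving another robot no longer helps:

```python
# Hypothetical toy version of the warehouse analogy above -- not the contest problem.
def rebalance(workload, robots, max_moves=100):
    """workload[i] = pending tasks in section i; robots[i] = robots stationed there."""
    robots = list(robots)
    for _ in range(max_moves):
        # Work per robot in each section; an empty section has "infinite" need.
        load = [w / r if r else float("inf") for w, r in zip(workload, robots)]
        busiest = max(range(len(load)), key=lambda i: load[i])
        idlest = min(range(len(load)), key=lambda i: load[i])
        # Only move a robot if the donor section stays better off than the
        # current worst section; otherwise we are balanced enough.
        if robots[idlest] <= 1 or workload[idlest] / (robots[idlest] - 1) >= load[busiest]:
            break
        robots[idlest] -= 1
        robots[busiest] += 1
    return robots

print(rebalance(workload=[10, 50, 5], robots=[4, 4, 4]))  # -> [2, 9, 1]
```

Presumably the real problem layers movement and grouping constraints on top of something like this, which is what makes it a heuristic task rather than a solved one.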
I feel like if that question was asked when calculators were invented, and someone was claiming humans were still better at arithmetic than machines, that it would be appropriate.
I was surprised reading through this problem that the machine solved it well at all.
I get that it’s a leet code style question but it’s got a lot of specifics and I assumed the corpus of training data on optimizing this type of problem was several orders of magnitude too small to train an LLM on and have good results.
He misspelled psycho.
In Polish ch and h would be read the same (kinda like the English h sound).
Remember this is the worst AI will ever be from here on out. Models are only going to get better, faster, cheaper, more accessible and more easily deployable.
I think people need to realize that just because an AI model fails at one point, or some architecture has common failure modes, billions of dollars are being poured into correcting those failures and improving in every economically viable domain. Two years ago AI video looked like a garbled 144p nightmare; now it's higher-quality video than all but professional production studios could make.
AI agents don't get tired. They don't need to sleep. They don't require sick days, parental leave, or PTO. They don't file lawsuits, share company secrets, disparage the company, deliberately sandbag to get extra free time, whine, burn out, or go AWOL. The best AI model/employee is infinitely replicable and can share its knowledge with other agents perfectly, clone itself arbitrarily many times, and work with copies of itself without a clash of egos; it just optimizes and is refit to accomplish whatever task it's given.
All this means is that gradually the relative advantage of humans in any economically viable domain will predictably trend towards zero. We have to figure out now what that will mean for general human welfare, freedom and happiness, because barring extremely restrictive measures on AI development or voluntary cessation by all AI companies, AGI will arrive.
Yet, AI agents don't replace software engineers.
Imagine a software company without a single software engineer. What kind of software would it produce? How would a product manager or some other stakeholder work with "AI agents"? How do the humans decide that the agent is finished with the job?
Software engineering changes with the tools. Programming via text editors will be less important, that much is clear. But "AI" is a tool. A compressed database of all languages, essentially. You can use that tool to become more efficient, in some cases vastly more efficient, but you still need to be a software engineer.
Given that understanding, consider another question: When has a company you worked for ever said "that's enough software, the backlog is empty. We're done for the quarter with software development?"
AI agents are replacing junior software engineers now at big companies, or at least lowering the number they are hiring.
Currently AI failure modes (consistency over long context lengths, multi-modal consistency, hallucinations) make it untenable as a full-replacement software engineer, but effective as a short-term task agent overseen by an engineer who can review code and quickly determine what's good and what's bad. This lets a 5x engineer become a 7x engineer, a 10x become a 13x, etc., which allows the same amount of work to be done with fewer coders, effectively replacing the least productive engineers in aggregate.
However, as those failure modes become less and less frequent, we will gradually see "replacement". It will come in the form of senior engineers using AI tools noticing that a PR of a certain complexity is coded correctly 99% of the time by a given AI model, so they will start assigning longer, more complex tasks to it and stop overseeing the smaller ones. The length of tasks it can reliably complete gets longer and longer, until all a suite of agents needs is a spec, API endpoints, and the ability to serve testing deployments to PMs. At first it does only what a small, poorly run team could accomplish, but month after month it gets better, until companies start offloading entire teams to AI models and simply require a higher-up team to check and reconfigure them once in a while and manage the token budget.
This process will continue as long as AI models grow more capable and less hallucinatory over long-context horizons, and agentic/scaffolding systems become more robust and better designed to mitigate the issues that remain in the underlying models. It won't be easy or straightforward, but the potential economic gains are so enormous that it makes sense that billions are being poured into any AI agent startup that can snatch a few IOI medalists and a coworking space in SF.
You're very optimistic about the potential of these tools. I tend to agree, but I think they will find their master in formal systems. If productivity rises as you're predicting, the world won't accept 99.9% correct software anymore. There will be demand for 100% correctness.
Regarding the potential economic gains, they're essentially the salaries of software engineers. That's a decent amount, but not massive.
Compare this to civil engineers, architects, and craftsmen. None of them have been replaced just because machines let amateurs do something resembling their job.
Oh no, no, this isn't the worst AI will ever be. Way worse LLMs are yet to come once the cost cutting efforts begin.
I mean that this is the worst the "best currently existing AI model" will ever be.
Yes, I know. But the economic reality is that people won't have access to the best existing model, but the most profitable one.
Assuming that AI models don’t disappear, this is a tautology. Without that assumption, you can’t be sure.
> AGI will arrive.
This does not follow. Your argument, set in the 1950s, would be that cars keep getting faster, therefore they will reach light speed.
That analogy only makes sense if current AI capabilities : AGI :: 1950s car speed : light speed.
The speed equivalent of AGI is way below light speed, in that the compute required for silicon to replicate the synaptic complexity of the human brain is far below the maximum compute human civilization can achieve as allowed by physics.
The more important question is whether the progress we've seen in AI is putting us on a reliable track to hit AGI in the near future. My opinion is that we are, and not just because Demis, Sam, Elon, and Dario say so, though they have very good reasons for believing it (yes, beyond mere hype and speculation).
Haven't they already started to regress?
I'm bullish on specific areas improving (I'm sure you could selectively train an LLM on the latest Angular version to replace the majority of front-end devs, given enough time and money; it's a limited problem space and a strongly opinionated framework, after all), but for the most part enshittification is already starting to happen with the general models.
Nowadays ChatGPT doesn't even bother to refer back to the original question after a few responses, so you're left summarising the conversation and starting a new context to get anywhere.
So, yeah, I think we're very much into finding the equilibrium now. Cost vs scale. Exponential improvements won't be in the general LLMs.
Happy to be wrong on this one..
Are you using the free version of ChatGPT, or just 4o?
Whatever model is cheap enough to serve for free is irrelevant when it comes to discussing SOTA AI capabilities and their impact. The state of the art has been improving markedly and reliably over the past 3 years. o3, Claude Opus 4, and Gemini 2.5 all surpass their predecessors in every benchmark and indicate that improvement isn't slowing down.
If GPT-5 comes out and it's somehow worse, then I'll concede your point, but so far the claim that the latest models are getting worse is mere speculation, and it makes no sense given that most labs are already aware of the potential for data contamination and have taken measures to ensure high data quality for the models they're spending hundreds of millions to train.
Exactly. The inability of people to extrapolate towards the future and foresee second-order effects is astounding. We saw this with climate change and we just saw it with COVID. The ones with foresight are warning about the massive upheaval coming. It's time for people to shake off their preconceived notions, look at the situation with fresh eyes, and think deeply about what the technology diff from 5 years ago to today means for 5 years from now.
> Exactly. The inability of people to extrapolate towards the future and foresee second-order effects is astounding.
On a related note, many people also assume that just because something has been trending exponential that it will _continue_ to do so...
Things on an exponential trend tend to continue until they hit a fundamental limit, which leads to an inflection point and then a flattened-off S-curve.
Moore's law continued on an exponential for decades. The fundamental limit on transistor density is set by the laws of physics (the uncertainty principle will eventually be a problem), but so far so many new paradigms of compute improvement have emerged (especially in GPUs and AI-specific compute) that it has become super-exponential in some respects.
So the question is whether there is a fundamental barrier that AI will hit. The main issues people bring up are a lack of high-quality human-generated data, fall-off in value per unit of compute spent, and the limits of autoregressive models. So far, though, pretraining is the only paradigm beginning to show diminishing returns; test-time compute and RL are still on the exponential curve.
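To make the S-curve point concrete (a generic illustration with arbitrary numbers, not a model of AI progress): a logistic curve tracks an exponential closely early on and only reveals its ceiling near the inflection point, which is part of why it's hard to tell from inside a trend which curve you're on.

```python
# Generic illustration of the exponential-vs-S-curve point above; the numbers
# are arbitrary and not a model of AI progress.
import math

CEILING = 1000.0  # the "fundamental limit" the trend eventually runs into

def exponential(t):
    return math.exp(t)

def logistic(t):
    # Scaled so that logistic(t) tracks exponential(t) while both are small.
    return CEILING / (1.0 + (CEILING - 1.0) * math.exp(-t))

for t in range(0, 16, 3):
    print(f"t={t:2d}  exponential={exponential(t):11.1f}  logistic={logistic(t):7.1f}")
```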
Now imagine where we'll be in 10 years, and where we were 10 years ago. Things move, fast.