My experience is opposite to yours. I have had Claude Code fix issues in a compiler over the last week with very little guidance. Occasionally it gets frustrating, but most of the time Claude Code just churns through issue after issue, fixing subtle code generation and parser bugs with very little intervention. In fact, most of my intervention is down to tool weaknesses, mainly managing compaction to avoid running out of context at inopportune moments.
It's implemented methods I'd have to look up in books to even know about, and shown that it can get them working. It may not do much truly "novel" work, but very little code is novel.
They follow instructions very well if structured right, but you can't just throw random stuff in CLAUDE.md or similar. The biggest issue I've run into recently is that they need significant guidance on process. My instructions tend to focus on three separate areas:

1) Debugging guidance for the given project. For my compiler project, that means things like "here's how to get an AST dumped from the compiler" and "use gdb to debug crashes". It sometimes did that without being told, but not consistently; with the instructions it usually does.

2) Acceptance criteria. This does need reiteration.

3) Process: run tests frequently, make small, testable changes, and frequently update a detailed file outlining the approach to be taken, progress towards it, and any outcomes of investigation during the work.
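Roughly, the file ends up shaped something like the sketch below. The specific commands, headings, and file names (e.g. PLAN.md) are illustrative placeholders, not my exact setup:

    # Project notes for Claude (sketch)

    ## Debugging this project
    - To inspect parsing issues, dump the AST: <project-specific dump command>
    - For crashes, run the failing case under gdb and capture a backtrace first.

    ## Acceptance criteria
    - A change is done only when the agreed test cases pass and no previously
      passing tests regress.

    ## Process
    - Make small, testable changes; run the test suite after each one.
    - Keep PLAN.md updated with the intended approach, progress towards it,
      and any findings from investigation along the way.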
My experience is that with those three things in place, I can have Claude run for hours with --dangerously-skip-permissions and only step in to say "continue" or do a /compact in the middle of long runs, with only the most superficial checks.
It doesn't always produce perfect code at every step. But neither do I. It does, however, usually move in the right direction at every step, and has consistently produced progress over time with far less effort on my part.
I wouldn't yet have it start from scratch without at least some architecturally sound scaffolding, but it can often do that too, though that needs review before it "locks in" a bad choice.
I'm at a stage where I'm considering harnesses to let Claude work on a problem over the course of days without human intervention instead of just tens of minutes to hours.
It is like needing a prediction (e.g. about market behavior) and knowing that somewhere out there is a person who will make the perfect one. Instead of your problem being to make the prediction, it is now to find and identify that expert. Is the problem you have converted yours into any less hard, though?
I too have had some great minor successes; the current products are definitely a great step forward. However, every time I start anything more complex, I never know in advance whether I will end up with utterly unusable code, even after corrections (with the "AI" always confidently claiming that now it has definitely fixed the problem), or with something usable.
All those examples such as yours suffer from one big problem: They are selected afterwards.
To be useful, you would have to make predictions in advance and then run the "AI" and have your prediction (about its usefulness) verified.
Selecting positive examples after the work is done is not very helpful. All it does is prove that at least sometimes somebody gets something useful out of using an LLM for a complex problem. Okay? I think most people understand that by now.
PS/Edit: Also, success stories we only hear about but cannot follow and reproduce may have been somewhat useful initially, but by now most people are past that point: they are willing to give it a try and would like a link to a working, reproducible example. I understand that work can rarely be shared, but then those examples are no longer very useful at this point. What would add real value for readers of these discussions now is for people who say they were successful to post the full, working, reproducible example.
EDIT 2: Another thing: I see comments from people who say they did tweak CLAUDE.md and got it to work. But the point is predictability and consistency! If you have that one project where you twiddled around with the file and added random sentences that you thought could get the LLM to do what you need, that's not very useful. We already know that trying out many things sometimes yields results. But we need predictability and consistency.
We are used to being able to try stuff, and when we get it working we could almost always confidently say that we found the solution, and share it. But LLMs are not that consistent.
My point is that these are not minor successes, and not occasional. Not every attempt is equally successful, but a significant majority of my attempts are. Otherwise I wouldn't be letting it run for longer and longer without intervention.
For me this isn't one project where I've "twiddled around with the file and added random sentences". It's an increasingly systematic approach: giving it a process for making changes, giving it regression tests, and making it make small, testable changes.
I do that because, at this point, I can predict with a high success rate that it will make progress for me.
There are failures, but they are few, and they're usually fixed simply by restarting it from after the last successful change when it takes too long without passing more tests. Occasionally it requires me to turn off --dangerously-skip-permissions and guide it through a tricky part. But that is getting rarer and rarer.
No, I haven't formally documented it, so it's reasonable to be skeptical. (I have, however, started packaging up the hooks, agents, and instructions that consistently work for me across multiple projects. For now that's just for a specific client, but I might do a writeup at some point.) At the same time, it's equally warranted to wonder whether the vast difference in reported results comes down to what you suggest, or to something you're doing differently in how you use these tools.
New hires perform consistently. Even if you can't predict beforehand how well they'll work, after a short observation time you can predict very well how they will continue to work.
I had a highly repetitive task (/subagents is great to know about), but I never got more advanced than a script that sent "continue\n" into the terminal where CC was running every X minutes. What was frustrating is that CC was inconsistent in how long it would run. Needing to compact was a bit of a curveball.
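For anyone curious, a minimal sketch of that kind of nudge loop, assuming CC is running in a tmux pane (the pane name and interval below are made up; the original just pushed keystrokes into whatever terminal CC was in):

    # nudge.py - periodically type "continue" into a Claude Code session
    # running in a tmux pane. Pane target and interval are assumptions.
    import subprocess
    import time

    PANE = "claude"        # tmux session/pane where CC is running (assumed name)
    INTERVAL = 15 * 60     # seconds between nudges

    while True:
        time.sleep(INTERVAL)
        # Send the word "continue" followed by Enter to the pane.
        subprocess.run(
            ["tmux", "send-keys", "-t", PANE, "continue", "Enter"],
            check=False,
        )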
The compaction is annoying, especially when it sometimes will then fail to compact with an error, forcing rewinding. They do need to tighten that up so it doesn't need so much manual intervention...
That is true, so don't give it entirely free rein with that. I let Claude generate as many additional tests as it'd like, but I either produce high-level tests myself or review a set generated by Claude first, before I let it fill in the blanks. It's instructed very firmly to treat a specific set of test cases as critical, and then it's increasingly "boxed in" with more validated test cases as we go along.
E.g. for my compiler, I had it build scaffolding to make it possible to run rubyspecs. Then I've had it systematically attack the crashes and failures mostly by itself once the test suite ran.
Is it? Stuff like ripgrep, msmtp, … are very much one-man projects. And most packages in a distro are maintained by only one person. Expertise is a thing, and getting reliable results is what differentiates experts from amateurs.