My experience is opposite to yours. I have had Claude Code fix issues in a compiler over the last week with very little guidance. Occasionally it gets frustrating, but most of the time Claude Code just churns through issue after issue, fixing subtle code generation and parser bugs with very little intervention. In fact, most of my intervention is down to tool weaknesses, mainly managing compaction to avoid running out of context at inopportune moments.
It's implemented methods I'd have to look up in books to even know about, and shown that it can get them working. It may not do much truly "novel" work, but very little code is novel.
They follow instructions very well if structured right, but you can't just throw random stuff in CLAUDE.md or similar. The biggest issue I've run into recently is that they need significant guidance on process. My instructions tend to focus on three separate areas:

1) Debugging guidance for the given project. For my compiler project, that means things like "here's how to get an AST dumped from the compiler" and "use gdb to debug crashes". It sometimes did that without being told, but not consistently; with the instructions it usually does.

2) Acceptance criteria. This does need reiteration.

3) Process: run tests frequently, make small, testable changes, and frequently update a detailed file outlining the approach to be taken, progress towards it, and any outcomes of investigation during the work.
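Roughly, the file ends up shaped something like the sketch below. The specific commands, headings, and file names (e.g. PLAN.md) are illustrative placeholders, not my exact setup:

    # Project notes for Claude (sketch)

    ## Debugging this project
    - To inspect parsing issues, dump the AST: <project-specific dump command>
    - For crashes, run the failing case under gdb and capture a backtrace first.

    ## Acceptance criteria
    - A change is done only when the agreed test cases pass and no previously
      passing tests regress.

    ## Process
    - Make small, testable changes; run the test suite after each one.
    - Keep PLAN.md updated with the intended approach, progress towards it,
      and any findings from investigation along the way.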
My experience is that with those three things in place, I can have Claude run for hours with --dangerously-skip-permissions and only step in to say "continue" or do a /compact in the middle of long runs, with only the most superficial checks.
It doesn't always produce perfect code at every step. But neither do I. It does, however, usually move in the right direction at every step, and has consistently produced progress over time with far less effort on my part.
I wouldn't yet have it start from scratch without at least some architecturally sound scaffolding, but it can often do that too, though that needs review before it "locks in" a bad choice.
I'm at a stage where I'm considering harnesses to let Claude work on a problem over the course of days without human intervention instead of just tens of minutes to hours.
It is like needing a prediction (e.g. about market behavior) and knowing that somewhere out there is a person who will make the perfect one. Instead of your problem being to make the prediction, it is now to find and identify that expert. Is the problem you have converted yours into any less hard, though?
I too have had some great minor successes; the current products are definitely a great step forward. However, every time I start anything more complex, I never know in advance whether I will end up with utterly unusable code, even after corrections (with the "AI" always confidently claiming that now it has definitely fixed the problem), or with something usable.
All those examples such as yours suffer from one big problem: They are selected afterwards.
To be useful, you would have to make predictions in advance and then run the "AI" and have your prediction (about its usefulness) verified.
Selecting positive examples after the work is done is not very helpful. All it does is prove that at least sometimes somebody gets something useful out of using an LLM for a complex problem. Okay? I think most people understand that by now.
PS/Edit: Also, success stories we only hear about but cannot follow and reproduce may have been somewhat useful initially, but by now most people are past that point: they are willing to give it a try and would like a link to a working, reproducible example. I understand that work can rarely be shared, but then those examples are no longer very useful at this point. What would add real value for readers of these discussions now is for people who say they were successful to post the full, working, reproducible example.
EDIT 2: Another thing: I see comments from people who say they did tweak CLAUDE.md and got it to work. But the point is predictability and consistency! If you have that one project where you twiddled around with the file and added random sentences that you thought could get the LLM to do what you need, that's not very useful. We already know that trying out many things sometimes yields results. But we need predictability and consistency.
We are used to being able to try stuff, and when we get it working we could almost always confidently say that we found the solution, and share it. But LLMs are not that consistent.
My point is that these are not minor successes, and not occasional. Not every attempt is equally successful, but a significant majority of my attempts are. Otherwise I wouldn't be letting it run for longer and longer without intervention.
For me this isn't one project where I've "twiddled around with the file and added random sentences". It's an increasingly systematic approach: giving it a process for making changes, giving it regression tests, and making it make small, testable changes.
I do that because, at this point, I can predict with a high success rate that it will make progress for me.
There are failures, but they are few, and they're usually fixed simply by restarting it from after the last successful change when it takes too long without passing more tests. Occasionally it requires me to turn off --dangerously-skip-permissions and guide it through a tricky part. But that is getting rarer and rarer.
No, I haven't formally documented it, so it's reasonable to be skeptical. (I have, however, started packaging up the hooks, agents, and instructions that consistently work for me across multiple projects. For now that's just for a specific client, but I might do a writeup at some point.) At the same time, it's equally warranted to wonder whether the vast difference in reported results comes down to what you suggest, or to something you're doing differently in how you use these tools.
New hires perform consistently. Even if you can't predict beforehand how well they'll work, after a short observation time you can predict very well how they will continue to work.
I had a highly repetitive task (/subagents is great to know about), but I never got more advanced than a script that sent "continue\n" into the terminal where CC was running every X minutes. What was frustrating is that CC was inconsistent in how long it would run. Needing to compact was a bit of a curveball.
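For anyone curious, a minimal sketch of that kind of nudge loop, assuming CC is running in a tmux pane (the pane name and interval below are made up; the original just pushed keystrokes into whatever terminal CC was in):

    # nudge.py - periodically type "continue" into a Claude Code session
    # running in a tmux pane. Pane target and interval are assumptions.
    import subprocess
    import time

    PANE = "claude"        # tmux session/pane where CC is running (assumed name)
    INTERVAL = 15 * 60     # seconds between nudges

    while True:
        time.sleep(INTERVAL)
        # Send the word "continue" followed by Enter to the pane.
        subprocess.run(
            ["tmux", "send-keys", "-t", PANE, "continue", "Enter"],
            check=False,
        )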
The compaction is annoying, especially when it sometimes will then fail to compact with an error, forcing rewinding. They do need to tighten that up so it doesn't need so much manual intervention...
That is true, so don't give it entirely free rein with that. I let Claude generate as many additional tests as it'd like, but I either produce high-level tests myself or review a set generated by Claude first, before I let it fill in the blanks. It's instructed very firmly to treat a specific set of test cases as critical, and then it's increasingly "boxed in" with more validated test cases as we go along.
E.g. for my compiler, I had it build scaffolding to make it possible to run rubyspecs. Then I've had it systematically attack the crashes and failures mostly by itself once the test suite ran.
Is it? Stuff like ripgrep, msmtp, … are very much one-man projects. And most packages in a distro are maintained by only one person. Expertise is a thing, and getting reliable results is what differentiates experts from amateurs.