graphememes 2 days ago

Every time I read a post about this, none of the prompts are shared, and when I do get to review the actual commands and how the AI is being driven, it's clear the person at the wheel isn't experienced at it. AIs will make a best attempt; you can see this in the reasoning / thinking output. On top of that, the temperature is usually fairly moderate (4.5-8), so heavy "creative liberties" get taken. You have to account for that: show it the right way and the wrong way to do things. I don't usually use agents or AI for one-offs that aren't copy-and-paste, or for deep-thinking / critical tasks that genuinely require human thought and that AI can't handle.
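
To illustrate (a rough sketch only; the OpenAI Python SDK, model name, and snippet below are my own assumptions, and most agent frontends pick these settings for you): when you call a model directly, you can pin the temperature down and show it the right and wrong way explicitly.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    snippet = "def load(path):\n    try:\n        return open(path).read()\n    except: pass\n"

    # Keep the temperature low and demonstrate right vs. wrong explicitly,
    # instead of hoping the model guesses your conventions.
    resp = client.chat.completions.create(
        model="gpt-4o",   # illustrative model name
        temperature=0.2,  # lower temperature -> fewer "creative liberties"
        messages=[
            {"role": "system", "content": "Follow the project's error-handling conventions."},
            {"role": "user", "content":
                "Wrong: swallowing exceptions with a bare `except: pass`.\n"
                "Right: catch the specific exception and log it.\n"
                "Refactor this accordingly:\n\n" + snippet},
        ],
    )
    print(resp.choices[0].message.content)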

All the other trivial things I can delegate out to it. I expect junior-level results when I give it sub-optimal guidance; with reasonable (or even exhaustive) guidance I can get adequate to near-perfect results.

Another dimension that really matters here is the actual model used; not every model is the same.

Also, if the AI does something wrong, have it assess why things went wrong, revert to the previous checkpoint, and integrate that lesson into the plan.

You're driving; you are ultimately in control, so learn to drive. It's a tool: it can be adjusted, you can modify the output, you can revert, and you can also just not use it. But if you actually learn how to use it, you'll find it can speed up your process. It's not a cure-all, though; it's good in certain situations, just like a hammer.

  • davidclark 2 days ago

    On the other hand, when people who claim success with AI share their prompts, I see all the same misses and flaws that keep me from fully buying in. The people themselves, though, seem to gloss over these errors and claim wild success. Their prompts never actually seem that different from the ones that fail for me.

    It seems like “you’re not doing it correctly” is just a rationalization to protect the pro-AI person’s established opinion.

    • _boffin_ 2 days ago

      Nobody does it correctly, AI or not.

      It’s about breaking the problem down into epics, tasks, and acceptance criteria that get reviewed. Review the written code and adjust as needed.

      Tests… a lot of tests.
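
      For example (nothing here is from the comment; the Project/list_projects names are made up): one acceptance criterion written as a runnable pytest test, so "reviewed" has a concrete meaning.

          from dataclasses import dataclass

          _PROJECTS = []

          @dataclass
          class Project:
              name: str
              archived: bool = False

              def __post_init__(self):
                  _PROJECTS.append(self)

              def archive(self):
                  self.archived = True

          def list_projects(include_archived=False):
              return [p for p in _PROJECTS if include_archived or not p.archived]

          # Acceptance criterion: archiving a project hides it from the default listing.
          def test_archived_project_is_hidden_by_default():
              p = Project(name="demo")
              p.archive()
              assert p not in list_projects()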

reaslonik 2 days ago

One thing I find constantly causes pain for users is assuming that any of these models are thinking, when in reality they're completing a sentence. This might seem like a nitpick at first, but it's a huge deal in practice: if you ask a language model to evaluate whether a solution is right, it's not evaluating the solution, it's giving you a statistically likely next sentence, where both yes and no are fairly common. If you tell it it's wrong, the likely next sentence is one that affirms you, but it doesn't really make a difference.

The only way to use a tool like this is to give it a problem that fits in context, evaluate the solution it churns out, and re-roll if it isn't correct. Don't tell a language model to think, because it can't and won't; that's just a far less efficient way of re-rolling the solution.
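
A rough sketch of that workflow (not from the comment; the OpenAI Python SDK, model name, file path, and prompt are placeholders): generate a candidate, check it with something that actually evaluates it (here, a test suite), and re-roll on failure instead of arguing with the model.

    import pathlib
    import subprocess

    from openai import OpenAI

    client = OpenAI()

    PROMPT = ("Write the full contents of src/slugify.py: a slugify(text) "
              "function. Output code only, no prose, no code fences.")

    def generate_candidate() -> str:
        # One independent sample per attempt; no "are you sure?" follow-ups.
        r = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.7,
            messages=[{"role": "user", "content": PROMPT}],
        )
        return r.choices[0].message.content

    def tests_pass() -> bool:
        # The evaluation happens outside the model, in the test suite.
        return subprocess.run(["pytest", "-q"]).returncode == 0

    pathlib.Path("src").mkdir(exist_ok=True)
    for attempt in range(5):
        pathlib.Path("src/slugify.py").write_text(generate_candidate())
        if tests_pass():
            break  # accept the first candidate that passes; otherwise re-roll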

  • sunir 2 days ago

    You’re right and wrong at the same time. A quantum superposition of validity.

    The word "thinking" is doing too much work in your argument, but arguably "assume it's thinking" is not doing enough.

    The models do compute and can reduce entropy; however, they don't match the way we presume things do this, because we assume every intelligence is human, or more accurately the same as our own mind.

    To see the algorithm for what it is: you can make it work through a logical set of steps from input to output, but it requires multiple passes. The models take a heuristic, pattern-matching approach to reasoning rather than a computational one like symbolic logic.

    While the algorithms themselves are computed, the virtual space in which the input is transformed into the output is not computational.

    The models remain incredible and remarkable but they are incomplete.

    Further, there is a huge garbage-in, garbage-out problem: often the input to the model lacks enough information to decide on the next transformation to the code base. That's part of the illusion of conversationality that tricks us into thinking the algorithm is like a human.

    AI has always drawn human reactions like this. ELIZA was surprisingly effective, right?

    It may be that average humans are not capable of interacting with an AI reliably because the illusion is overwhelming for instinctive reasons.

    As engineers we should try to accurately assess and measure what is actually happening so we can predict and reason about how the models fit into systems.

  • giuscri 2 days ago

    but it’s also true that the next sentence is generated by evaluating the whole conversation including the proposed solution.

    my mental model is that the llm learned to predict what another person would say just by looking at that solution.

    so it’s really telling whether the solution is likely (likely!) to be right or wrong

    • ben_w 2 days ago

      Slight quibble, but the reinforcement learning from human feedback means they're trained (somewhat) on what the specific human asking the question is likely to consider right or wrong.

      This is both why they're sycophantic, and also why they're better than just median internet comments.

      But this is only a slight quibble, because what you say is also somewhat true, and why they have such a hard time saying "I don't know".

      • giuscri 2 days ago

        idk… maybe we’ll find out the reason is that on the internet no one ends a conversation saying “i don’t know” :D

        • ben_w a day ago

          That's my point :)

  • nijave 2 days ago

    >The only way to use a tool like this is to give it a problem that fits in context

    Or give the model context that fits the problem. That seems to be more of an art than a science at this point.

    I think the people having more success are those who are better at writing prompts, but that's non-trivial.

  • stray a day ago

    I get that a submarine can't swim.

    I'm just not so sure of the importance of the difference between swimming and whatever the word is for how a submarine moves.

    If it looks like thinking and quacks like thinking...

  • pietz 2 days ago

    Can you go into a bit more detail about why the two approaches are so different, in your opinion?

    I don't think I agree and I want to understand this argument better.

    • ismailmaj 2 days ago

      I’m guessing the argument is that LLMs get worse on problems they haven’t seen before, so you may assume they think when the problems are commonly discussed on the internet or seen on GitHub, but once you step out of that zone you get plausible but logically false results.

      That, or it’s a reductive fallacy. Either way I’m not convinced; IMO they’re just not smart enough (whether due to a lack of complexity in the architecture or to training that didn’t help them generalize reasoning patterns).

    • nijave 2 days ago

      They regurgitate what they're trained on, so they're largely consensus-based. However, the consensus can frequently be wrong, especially when the information is outdated.

      Someone with the ability to "think" should be able to separate oft repeated fiction from fact

      • solumunus 2 days ago

        > Someone with the ability to "think" should be able to separate oft repeated fiction from fact

        I guess humans don’t think then.

raflueder 2 days ago

I had a similar experience a couple of months ago where I decided to give it a go and "vibe code" a small TUI to get a feel for the workflow.

I used Claude Code, and while the end result works (kinda), I was less satisfied with the process. More importantly, I now had to review "someone else's" code instead of writing it myself; I had no idea of the internal workings of the application, and it felt like day one on a new codebase. It shifted my way of working from thinking/writing to reviewing/giving feedback, which for me personally is far less mentally stimulating and rewarding.

There were definitely some "a-ha" moments where CC came up with suggestions I wouldn't have thought of myself, but those were only a small fraction of the total output. And there's definitely a dopamine hit from seeing all that code spat out so fast.

Using it as a prototyping tool to quickly test an idea seems like a good use case, but there should be better tooling around taking that prototype, splitting it into manageable parts, and sharing the reasoning behind it, so I can rework it with the understanding needed to move it forward.

For now I've decided to stick to code completion, writing unit tests, commit messages, refactoring short snippets, and CHANGELOG updates. It does fairly well on all of those small, very focused tasks, and the time saved ends up being a net positive.

  • mnky9800n 2 days ago

    > Using it as a prototyping tool to quickly test an idea seems like a good use case, but there should be better tooling around taking that prototype, splitting it into manageable parts, and sharing the reasoning behind it, so I can rework it with the understanding needed to move it forward.

    This would be amazing. I think Claude Code is a great prototyping tool, but I agree, you don't really learn your code base. I think that's okay for a prototype, though, if you just want to see whether the idea works at all. Then, as you say, you can restart with some scaffolding and implement it better.

netdevphoenix 2 days ago

This has always been true. The difference is that more people are now admitting it. While you could argue that LLMs have junior-level capabilities, they definitely do not have junior-level self-reflection, self-awareness, or self-anything. They fundamentally don't learn, where learning means becoming significantly less likely to fail at a class of task after being taught about it. The same goes for even the basic ability to ask for help: these agents choose to generate unusable code rather than stopping and asking for help or guidance, which implies they are unable to tell where their limits are, skill-wise, knowledge-wise, etc.

Frankly, I have been highly concerned seeing all the transformer hype here when the gains people claim cannot be reliably replicated everywhere.

The financial incentives to make transformer tech work as it's being sold (even when it might not be cost-effective) deserve close attention, because to me it looks a bit too much like blockchain or big data.

pramodbiligiri 2 days ago

Great article.

One thing I was wondering after looking at the list of items in the “Cursor agent produced a coding plan” image: do folks actually make such lists when developing a feature without AI assistants?

That list has items like “Create API endpoints for …”, “Write tests …”. If you’re working on a feature that lives within a single codebase and doesn’t involve dependencies on other systems or teams, isn’t that a lot of ceremony for what you’ll eventually end up doing anyway (and would only miss through oversight)?

I see a downside to such lists, because when I see a dozen items lined up like that… who knows whether they’re all the right ones for the feature at hand? Or whether the feature needs some other change entirely, or whether you’ve figured out the right order to do them in?

Where I’ve seen such fine-grained lists have value is for task timeline estimation, but rarely for the actual implementation.

another_twist 2 days ago

Eerily maps to my experience almost word for word. I had Codex write a chunk of code step by step, with guidance and whatnot. I had to spend days cleaning up the mess.

  • vidarh 2 days ago

    My experience is that if AI creates the mess, AI should clean it up, and it usually can, if you put it in a suitable agent loop that does a review, hands off small, well-defined cleanup steps to an agent, and runs the test suite (a rough sketch of that loop is below).

    If you review the first-stage output from the AI manually, you're wasting time.

    You still need to review the final outputs, but reviewing the initial output is like demanding a developer hands over code they just barely got working and pointing out all of the issues to them without giving them a chance to clean it up first. It's not helpful to anyone unless your time costs the business less than the AI's time.
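
    Roughly, the shape of such a loop (a sketch, not my exact setup; ai_review_issues and ai_apply_cleanup are placeholder names for whatever agent CLI or API you drive):

        import subprocess
        from typing import List

        def tests_pass() -> bool:
            # The non-negotiable check: run the real test suite.
            return subprocess.run(["pytest", "-q"]).returncode == 0

        def ai_review_issues() -> List[str]:
            # Hypothetical: ask the agent for small, well-defined cleanup
            # steps in the latest changes; an empty list means "looks done".
            return []

        def ai_apply_cleanup(issue: str) -> None:
            # Hypothetical: hand one narrow cleanup step back to the agent.
            pass

        for _ in range(3):  # bounded; a human reviews whatever comes out of this
            issues = ai_review_issues()
            if not issues:
                break
            for issue in issues:
                ai_apply_cleanup(issue)
                if not tests_pass():
                    # drop the bad step rather than arguing with the model
                    subprocess.run(["git", "checkout", "--", "."])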

    • Zardoz84 2 days ago

      AI reviewing code generated by AI is a recipe for disaster.

      • vidarh 2 days ago

        That's categorically not true, as long as there's a human reviewer at the end of the chain. It can usually continue to deliver actual improvements over several iterations (just like a human would).

        That does not mean you can get away with not reviewing it. But you can, with substantial benefit, defer your review until an AI review thinks the code doesn't need further refinement. It probably still does need refinement despite the AI's say-so (and sometimes it needs throwing away), but in my experience it's highly likely to need less, and to take less time to review.

mprast 2 days ago

Great stuff; I've had almost exactly the same experience. I think blow-by-blow writeups like this are a sorely needed antidote to the hype.