100% agree. I had Gemini flash 2 chew through thousands of points of nasty unstr...

jdthedisciple · 2025-04-17T22:37:08 1744929428

> thousands of points of nasty unstructured client data

What I always wonder in these kinds of cases is: What makes you confident the AI actually did a good job since presumably you haven't looked at the thousands of client data yourself?

For all you know it made up 50% of the result.

mediaman · 2025-04-18T03:20:36 1744946436

This was solved a hundred years ago.

It's the same problem factories have: they produce a lot of parts, and it's very expensive to put a full operator or more on a machine to do 100% part inspection. And the machines aren't perfect, so we can't just trust that they work.

So starting in the 1920s Walter Shewhart and Edward Deming came up with Statistical Process Control. We accept the quality of the product produced based on the variance we see of samples, and how they measure against upper and lower control limits.

Based on that, we can estimate a "good parts rate" (which later got used in ideas like Six Sigma to describe the probability of bad parts being passed).

The software industry was built on determinism, but now software engineers will need to learn the statistical methods created by engineers who have forever lived in the stochastic world of making physical products.

thawawaycold · 2025-04-18T06:43:46 1744958626

I hope you're being sarcastic. SPC is necessary because mechanical parts have physical tolerances and manufacturing processes are affected by unavoidable statistical variations; it is beyond idiotic to be provided with a machine that can execute deterministic, repeatable processes and then throw that all into the gutter for mere convenience, justifying that simply because "the time is ripe for SWE to learn statistics"

int_19h · 2025-04-18T07:22:06 1744960926

We don't know how to implement a "deterministic, repeatable process" that can look at a bug in a repo and implement a fix end-to-end.

thawawaycold · 2025-04-18T07:35:21 1744961721

that is not what OP was talking about though.

rorytbyrne · 2025-04-18T09:23:29 1744968209

LLMs are literally stochastic, so the point is the same no matter what the example application is.

warkdarrior · 2025-04-18T18:26:55 1745000815

Humans are literally stochastic, so the point is the same no matter what the example application is.

perching_aix · 2025-04-18T18:09:49 1744999789

The deterministic, repeatable process of human (and now machine) judgement and semantic processing?

tominous · 2025-04-17T23:12:47 1744931567

In my case I had hundreds of invoices in a not-very-consistent PDF format which I had contemporaneously tracked in spreadsheets. After data extraction (pdftotext + OpenAI API), I cross-checked against the spreadsheets, and for any discrepancies I reviewed the original PDFs and old bank statements.

The main issue I had was it was surprisingly hard to get the model to consistently strip commas from dollar values, which broke the csv output I asked for. I gave up on prompt engineering it to perfection, and just looped around it with a regex check.

Otherwise, accuracy was extremely good and it surfaced a few errors in my spreadsheets over the years.

jofzar · 2025-04-18T01:04:29 1744938269

I hope there is a future where csv comma's don't screw up data. I know it will never happen but it's a nightmare.

Everyone has a story of a csv formatting nightmare

Nihilartikel · 2025-04-18T00:33:30 1744936410

For what it's worth, I did check over many hundreds of them. Formatted things for side by side comparison and ordered by some heuristics of data nastiness.

It wasn't a one shot deal at all. I found the ambiguous modalities in the data and hand corrected examples to include in the prompt. After about 10 corrections and some exposition about the cases it seemed to misundestand, it got really good. Edit: not too different from a feedback loop with an intern ;)

summerlight · 2025-04-17T23:16:29 1744931789

Though the same logic can be applied to everywhere, right? Even if it's done by human interns, you need to audit everything to be 100% confident or just have some trust on them.

andrei_says_ · 2025-04-18T18:34:54 1745001294

Not the same logic because interns can make meaning out of the data - that’s built-in error correction.

They also remember what they did - if you spot one misunderstanding, there’s a chance they’ll be able to check all similar scenarios.

Comparing the mechanics of an LLM to human intelligence shows deep misunderstanding of one, the other, or both - if done in good faith of course.

summerlight · 2025-04-18T22:14:26 1745014466

Not sure why you're trying to conflate intellectual capability problems into this and complicate the argument? The problem layout is the same. You delegate the works to someone so you cannot understand all the details. This makes a fundamental tension between trust and confidence. Their parameters might be different due to intellectual capability, but whoever you're going to delegate, you cannot evade this trade-off.

BTW, not sure if you have experiences of delegating some works to human interns or new grads and being rewarded by disastrous results? I've done that multiple times and don't trust anyone too much. This is why we typically develop review processes, guardrails etc etc.

andrei_says_ · 2025-04-22T02:47:54 1745290074

> not sure if you have experiences of delegating some works to human interns or new grads and being rewarded by disastrous results?

Oh yes I have ;)

Which is why I always explain the why behind the task.

FooBarWidget · 2025-04-18T05:05:27 1744952727

You can use AI to verify its own work. Last time I split a C++ header file into header + implementation file. I noticed some code got rewritten in a wrong manner, so I asked it to compare the new implementation file against the original header file, but to do so one method at a time. For each method, say whether the code is exactly the same and has the same behavior, ignoring superficial syntax changes and renames. Took me a few times to get the prompt right, though.

golergka · 2025-04-17T23:08:48 1744931328

Many types of data have very easily checkable aggregates. Think accounting books.

jofzar · 2025-04-18T01:02:21 1744938141

It also depends on what you are using the data for, if it's for non (precise) data based decisions then it's fine. Specially if you looking for "vibe" based decisions before then dedicating time to "actually" process the data for confirmation.

30$ to get an view into data that would take at least x many hours of someone's time is actually super cheap, specially if the decision of that result is then to invest or not invest the x many hours to confirm it.

pamplemoose · 2025-04-17T23:11:01 1744931461

You take a sample and check

visarga · 2025-04-18T04:50:55 1744951855

In my professional opinion they can extract data at 85-95% accuracy.

tcgv · 2025-04-18T16:20:06 1744993206

> I'm leveraging it for massive refactoring now and it is almost magical.

Can you share more about your strategy for "massive refactoring" with Gemini?

Like the steps in general for processing your codebase, and even your main goals for the refactoring.

roygbiv2 · 2025-04-18T04:15:00 1744949700

Isn't it better to get gemini to create a tool to format the data? Or was it in such a state that that would have been impossible?

cdelsolar · 2025-04-18T02:45:23 1744944323

what tool are you using 2.5-pro-exp through? Cline? Or the browser directly?

Nihilartikel · 2025-04-18T03:49:20 1744948160

For 2.5 pro exp I've been attaching files into AIStudio in the browser in some cases. In others, I have been using vscode's Gemini Code Assist which I believe recently started using 2.5 Pro. Though at one point I noticed that it was acting noticeably dumber, and over in the corner, sure enough it warned that it had reverted to 2.0 due to heavy traffic.

For the bulk data processing I just used the python API and Jupyter notebooks to build things out, since it was a one-time effort.

manmal · 2025-04-18T06:10:54 1744956654

Copilot experimental (need VSCode Insiders) has it. I‘ve thought about trying aider —-watch-files though, also works with multiple files.