Another anecdote. I've got a personal benchmark that I try out on these systems every time there's a new release. It is an academic math question which could be understood by an undergraduate, and which seems easy enough to solve if I were just to hammer it out over a few weeks. My prompt includes a big list of mistakes the model is likely to make and should avoid. The models haven't ever made any useful progress on this question. They usually spin their wheels for a while and then output one of the errors I said to avoid.
My hit rate with these models for academic questions is low, but non-trivial. I've definitely learned new math from using them, but it's really just an indulgence because they make stuff up so frequently.
I get generally good results from prompts asking for something I know definitely exists or is definitely possible, like an ffmpeg command I know I’ve used in the past but can’t remember. Recently I asked how to do something in ImageMagick which I’d not done before but which felt like the kind of thing ImageMagick should be able to do. It made up a feature that doesn’t exist.
Maybe I should have asked it to write a patch that implements that feature.
I find it incredibly useful for information retrieval from dense, archive-like bodies of text. I research cellular networks, and everything on Google/DDG is just fluffy SEO spam, but I find Gemini can reliably home in on the precise subsection, out of tens of thousands of dense standards, to tell me what 5G should do in a given scenario.
There is no difference between "hallucination" and "soberness"; it's just a database you can't trust.
The response to your query might not be what you needed, much as when you interact with an RDBMS and mistype a table name and get data from another table, or misremember which tables exist and get an error. We would not call such faults "hallucinations", and we shouldn't when the database is a pile of eldritch vectors either. If we persist in doing so, we'll teach other people to develop dangerous and absurd expectations.
No, it's absolutely not. One of these is a generative stochastic process with no guarantee at all that it will produce correct data; in fact you can make the OPPOSITE guarantee: you are guaranteed to sometimes get incorrect data. The other is a deterministic process of data access. I could perhaps only agree with you in the sense that such faults are not uniquely hallucinatory; all outputs from an LLM are.
I don't agree with the theoretical boundaries you draw. Any database can appear to lack determinism, because data might get deleted, corrupted, or mutated, and the hardware and software involved might fail intermittently.
The illusion of determinism in an RDBMS is just that, an illusion. I chose those examples of failures when interacting with such systems because most experienced developers are familiar with those situations and can relate to them, while the reader is less likely to have experienced a truer apparent indeterminism.
LLMs can provide an illusion of determinism as well; some are quite capable of repeating themselves, e.g. through overfitting, intentional or otherwise.
If the information it gives is wrong but grammatically correct, then the "AI" has fulfilled its purpose. So it isn't really "wrong output", because that is what the system was designed to do. The problem is when people use "AI" and expect it to produce truthful responses - it was never designed to do that.
But the point is that everyone uses the phrase "hallucinations" and language is just how people use it. In this forum at least, I expect everyone to understand that it is simply the result of next token generation and not an edge case failure mode.
I would have assumed that too, but given how many on HN throw around claims that LLMs can think, reason, and understand, I think it does bear clearly defining some of the terms used.
Sorry, I'm failing to see the danger of this choice of language? People who aren't really technical don't care about these nuances. It's not going to sway their opinion one way or another.
Yep. All these do is “hallucinate”. It’s hard to work those out of the system because that’s the entire thing it does. Sometimes the hallucinations just happen to be useful.
> It sometimes feels like the only thing saving LLMs are when they’re forced to tap into a better system like running a search engine query.
This is actually very profound. All free models are only reasonable if they scrape 100 web pages (according to their own output) before answering. Even then they usually have multiple errors in their output.
I like asking it about my great-great-grandparents (without mentioning that they were my great-great-grandparents, just giving their names, jobs, and places of birth).
It hallucinates whole lives out of nothing but stereotypes.
Responding with "skill issue" in a discussion is itself a skill issue. Maybe invest in some conversational skills and learn to be constructive rather than parroting a useless meme.
First of all, there is no such thing as "prompt engineering". Engineering, by definition, is a matter of applying scientific principles to solve practical problems. There are no clear scientific principles here. Writing better prompts is more a matter of heuristics, intuition, and empiricism. And there's nothing wrong with that — it can generate a lot of business value — but don't presume to call it engineering.
Writing better prompts can reduce the frequency of hallucinations, but hallucinations still occur frequently even with the latest frontier LLMs, regardless of prompt quality.
So you are saying the acceptable customer experience for these systems is that I need to explicitly tell them to accept defeat when they can’t find any training content or web search results that match my query closely enough?
Why don't they have any concept of a percentage of confidence in their answer?
It isn’t 2022 anymore, this is supposed to be a mature product.
Why am I even using this thing rather than using the game’s own mod database search tool? Or the wiki documentation?
What value is this system adding for me if I’m supposed to be a prompt engineer?
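As for the confidence question: the models do compute a probability for every token they emit; it just isn't surfaced as an answer-level confidence, and it measures how plausible the next token is, not whether the claim is true. A minimal sketch of reading those numbers out of a locally run model (assuming the Hugging Face transformers library, with the small gpt2 checkpoint purely as a stand-in):

```python
# Sketch: surface the per-token probabilities an LLM already computes internally.
# Assumes the Hugging Face "transformers" library; gpt2 is just a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of Australia is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=False,               # greedy decoding, for repeatability
    return_dict_in_generate=True,
    output_scores=True,            # keep the per-step logits
)

# Probability the model assigned to each token it actually emitted.
generated = outputs.sequences[0, inputs["input_ids"].shape[1]:]
for token_id, step_scores in zip(generated, outputs.scores):
    prob = torch.softmax(step_scores[0], dim=-1)[token_id].item()
    print(f"{tokenizer.decode(token_id)!r}: {prob:.2%}")
```

A high per-token probability only means the continuation was statistically likely given the prompt, which is a big part of why a trustworthy "percentage of confidence in the answer" is so hard to bolt on.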
To take a different perspective on the same event: the model expected a feature to exist because it fitted with the overall structure of the interface.
This in itself can be a valuable form of feedback. I currently don't know of anyone doing it, but testing interfaces by getting LLMs to use them could be an excellent resource. If the AI runs into trouble, it might be worth checking your designs to see if you have any inconsistencies, redundancies, or other confusion-causing issues.
One would assume that a consistent user interface would be easier for both AI and humans. Fixing the issues would improve it for both.
That failure could be leveraged into an automated process that identifies areas to improve.
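A rough sketch of what that automated process could look like. Everything model-related sits behind a placeholder ask_llm() callable (standing in for whatever chat API you use), and the help-text parsing assumes the tool lists its subcommands one per indented line, which is only an assumption:

```python
# Hypothetical sketch: use an LLM as an interface-consistency probe.
# ask_llm() is a placeholder for whatever model client you use.
import re
import subprocess

def real_subcommands(tool: str) -> set[str]:
    """Collect the subcommands the tool actually advertises in --help."""
    help_text = subprocess.run(
        [tool, "--help"], capture_output=True, text=True
    ).stdout
    # Assumption: subcommands appear one per indented line, e.g. "  convert  ..."
    return {m.group(1) for m in re.finditer(r"^\s{2,}(\w[\w-]*)", help_text, re.M)}

def probe(tool: str, tasks: list[str], ask_llm) -> dict[str, set[str]]:
    """Ask the model to solve each task with the tool, then report any
    subcommands it used that the tool does not actually have."""
    known = real_subcommands(tool)
    invented: dict[str, set[str]] = {}
    for task in tasks:
        answer = ask_llm(
            f"Using only the `{tool}` command line tool, show the exact "
            f"command you would run to: {task}"
        )
        used = {m.group(1) for m in re.finditer(rf"\b{tool}\s+([\w-]+)", answer)}
        made_up = used - known
        if made_up:
            invented[task] = made_up
    return invented
```

Anything that shows up in that report is a hint: either the model is simply wrong, or the interface made the invented command feel like it should exist, which is exactly the kind of inconsistency worth a second look.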
Literally yesterday, ChatGPT hallucinated an entire feature of a mod for a video game I am playing, including making up a fake console command.
It just straight up doesn’t exist; it simply seemed like a relatively plausible thing to exist.
This is still happening. It never stopped happening. I don’t even see a real slowdown in how often it happens.
It sometimes feels like the only thing saving LLMs are when they’re forced to tap into a better system like running a search engine query.