So if you set temperature=0 and run the LLM serially (making it deterministic), it would stop hallucinating? I don't think so. My guess is that the nondeterminism issues mentioned in the article are not a primary cause of hallucinations at all.
That's an implementation detail, I believe. What I meant was just greedy decoding (picking the token with the highest logit in the LLM's output), which is very easy to implement.
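Roughly, in code (a toy sketch with made-up helper names, not any particular framework's API):

    import numpy as np

    def greedy_pick(logits):
        # Greedy decoding: always take the token with the highest logit.
        return int(np.argmax(logits))

    def sample_pick(logits, temperature, rng):
        # Temperature sampling: rescale logits, softmax, then draw a token.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    logits = np.array([1.2, 3.7, 0.5, 3.6])  # toy logits over a 4-token vocabulary
    print(greedy_pick(logits))               # always index 1
    rng = np.random.default_rng()
    print(sample_pick(logits, temperature=1.0, rng=rng))  # can differ from run to run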
"In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism."
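The mechanism behind this, as I understand the article, is that floating-point addition isn't associative, so when the batch size changes how a reduction gets grouped, the same prompt can yield slightly different logits. A toy illustration of just the arithmetic fact (not the article's actual kernel analysis):

    vals = [0.1, 1e16, -1e16, 0.1]

    left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]
    regrouped     = (vals[0] + (vals[1] + vals[2])) + vals[3]

    print(left_to_right)  # 0.1 -- the first 0.1 is absorbed by 1e16 and lost
    print(regrouped)      # 0.2 -- grouping the large terms first preserves both 0.1s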
LLMs hallucinate because they are probabilistic by nature, not because the source material is lossy or too big. They are literally designed to introduce some level of "randomness": https://thinkingmachines.ai/blog/defeating-nondeterminism-in...