
The model has multiple layers of mechanisms to prevent carbon copy output of the training data.

Do you have a source for this?

Carbon copy would mean overfitting.


I saw weird results with Gemini 2.5 Pro when I asked it to provide concrete source code examples matching certain criteria, and to quote the source code it found verbatim. In its response it claimed to have quoted the sources verbatim, but that wasn't true at all: they had been rewritten, still in the style of the project it was quoting from, but otherwise quite different, and without a match in the Git history.

It looked a bit like someone at Google subscribed to a legal theory under which you can avoid copyright infringement if you take a derivative work and apply a mechanical obfuscation to it.


Source: just read the definition of what "temperature" is.

But honestly source = "a knuckle sandwich" would be appropriate here.
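
For what it's worth, here's a minimal sketch of what temperature does during sampling (hand-rolled Python, made-up logits; not how any particular product implements it):

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
        """Sample a token index from logits divided by temperature (softmax sampling)."""
        scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())   # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    # Toy logits where the model strongly prefers token 0.
    logits = [4.0, 1.0, 0.5, 0.1]
    print(sample_with_temperature(logits, temperature=0.1))   # almost always 0
    print(sample_with_temperature(logits, temperature=1.5))   # other tokens show up more often

Low temperature sharpens the distribution toward the most likely token; high temperature flattens it. Either way, if the model puts nearly all its probability mass on a memorized continuation, sampling noise alone won't stop it from coming out verbatim.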


forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt"

The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.

It's mind-boggling if you think about the fact that they're essentially "just" statistical models.

It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers / math is the ultimate truth


They are not just statistical models

They create concepts in latent space, which is basically compression, and that compression forces this.


You’re describing a complex statistical model.

What is "latent space"? I'm wary of metamagical descriptions of technology that's in a hype cycle.

How so? Truth is naturally an a priori concept; you don't need a chatbot to reach this conclusion.

That might be somewhat ungenerous unless you have more detail to provide.

I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.
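
I don't know how any particular vendor implements it, but a toy sketch of the general idea (a hypothetical n-gram overlap check in Python; the window size, threshold, and corpus are all made up) might look like:

    def ngrams(tokens, n=8):
        """Every contiguous n-token window, as a set of tuples."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def looks_copied(output_tokens, training_ngram_index, n=8, threshold=0.5):
        """Flag output when too many of its n-grams appear verbatim in the training index."""
        windows = ngrams(output_tokens, n)
        if not windows:
            return False
        hits = sum(1 for w in windows if w in training_ngram_index)
        return hits / len(windows) >= threshold

    # Toy example with word-level "tokens" standing in for a real tokenizer.
    corpus = "the quick brown fox jumps over the lazy dog".split()
    index = ngrams(corpus, n=4)          # in reality this would be a huge precomputed index
    candidate = "the quick brown fox jumps over the fence".split()
    print(looks_copied(candidate, index, n=4, threshold=0.5))   # True: 4 of 5 windows match

A product could then suppress or regenerate anything that scores above the threshold.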


Should they though? If the answer to a question^Wprompt happens to be in the training set, wouldn't it be disingenuous to not provide that?

Maybe it's intended to avoid legal liability resulting from reproducing copyright material not licensed for training?

Would it really be infeasible to take a sample and do a search over an indexed training set? Maybe a Bloom filter could be adapted.
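
If I understand the suggestion, a toy version of that Bloom filter (hand-rolled, word-level 8-grams standing in for real tokenizer output, sizes picked arbitrarily) would look something like this:

    import hashlib

    class BloomFilter:
        """Tiny Bloom filter: fast set membership with false positives but no false negatives."""

        def __init__(self, num_bits=1 << 20, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive num_hashes independent bit positions from SHA-256 of a seeded item.
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    # Index every 8-word window of a toy "training corpus", then probe with sampled output.
    corpus = "we hold these truths to be self evident that all men are created equal".split()
    bf = BloomFilter()
    for i in range(len(corpus) - 7):
        bf.add(" ".join(corpus[i:i + 8]))

    print("these truths to be self evident that all" in bf)          # True: verbatim window
    print("a completely novel eight word long sentence here" in bf)  # almost certainly False

Membership tests can give false positives but never false negatives, so it errs on the side of flagging. The catch is that it only catches verbatim windows.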

It's not the searching that's infeasible. Efficient algorithms for massive scale full text search are available.

The infeasibility is searching for the (unknown) set of translations that the LLM would put that data through. Even if you posit only basic symbolic LUT mappings in the weights (they're not), there's no good way to enumerate them anyway. The model might as well be a learned hash function that maintains semantic identity while utterly eradicating literal symbolic equivalence.
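
To make that concrete: even a trivial paraphrase shares no literal n-grams with its source, so exact-match indexing has nothing to latch onto (toy example, word-level trigrams, sentences invented for illustration):

    def ngrams(tokens, n=3):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    original   = "the cat sat on the mat".split()
    paraphrase = "a feline was seated upon the rug".split()

    print(ngrams(original) & ngrams(paraphrase))   # set(): zero literal overlap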


Unfortunately.

does it?

This is a verbatim quote from Gemini 3 Pro, from a chat a couple of days ago:

"Because I have done this exact project on a hot water tank, I can tell you exactly [...]"

I somehow doubt an LLM did that exact project, what with it not having any ability to do plumbing in real life...


Isn't that easily explicable as hallucination, rather than regurgitation?

Those are not mutually exclusive in this instance, it seems.

Someone presented a hypothetical scenario: what if a hacker wrote a virus that breached a totally unprotected database after the hacker had passed away? It's clear that the therapy provider would be at least partially responsible.

Posthumous crime is the ultimate because the legal system is all about punishing the living until they are dead.


If only human beings were good at learning from past mistakes. It requires multiple tries before we realize: fire bad, unless controlled, then good.

I believe the industry has largely accepted that prompt injection is an inherent part of LLM tech.

There are many open-source toy browser implementations available, so this seems quite likely.

I'm hopeful. The open source AI ecosystem could benefit from large players like Mozilla making moves.

In what world is Mozilla large?

Maybe influential would be a better word choice. Firefox has 100M+ users.

Has anyone analyzed the proportion of AI-related submissions on HN over time?

Here is something like that: https://beuke.org/hn-ai-coverage/

This proposal seems solid. I personally also like how many scientific journals have added a mandatory AI disclosure in publication. In practice it's one or two sentences on how (or whether) Gen AI was used.

"ChatGPT model GPT-5.2 was used to identify spelling errors"

"Google Gemini 3 was used to generate the abstract of the paper".


"Whatever Overleaf has was used to identify spelling errors"

"Google Docs AI (whatever the name is, Gemini) has was used to identify spelling, grammar and idioms errors"

"Gemini in Google Search has been used to understand how to use obscure Fortan 77 instruction"

...


Along those lines, yes. Often journals ask specifically about generative AI, so other types of AI tools don't require disclosure.

It doesn't happen in non-diabetic people. It's different in type 2 diabetics, who will see large swings in blood fat and glucose after meals.

I'm skeptical that a paleo diet would be healthy long term. There are studies that find atherosclerosis in pre-industrial hunter-gatherer remains; one is called the HORUS study.

From what I've managed to find in the newest research, diet does not appear to have any impact on atherosclerosis itself. But, as they say, more data is needed.

> In 2013, meanwhile, researchers in the Netherlands subjected 17 healthy adults to temperatures of 15-16C (59-60.8F) for six hours a day.

It seems that these articles often discuss cold plunges, cold showers, etc., but the rigorous research is often conducted simply via rooms with reduced temperature combined with light clothing.

