I wouldn't resort to language examples, as the other comments show how this gets imprecise and lost in semantic details quickly. Instead, think of a basic logic example: consider an OR gate with inputs A and B and output X. If B=1, then X=1. But if X=1 you can't infer that B=1, because there is an alternative (i.e. A=1, B=0). So from B=1→X=1, the converse (X=1→B=1) simply does not follow. This extends to all statements where a relation is not symmetric. Of course you can also go further and find cases where the relationship is not transitive or not even reflexive. There's a whole branch of language-based IQ test puzzles (e.g. "all X are Y, some Y are Z" kind of stuff) that exploit this rabbit hole. Any LLM that does well on these will not jump to conclusions about reversed relations quickly.
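To make the OR gate argument concrete, here is a minimal sketch that enumerates the truth table and checks both directions. The function and variable names are mine, purely for illustration.

```python
from itertools import product

def or_gate(a: int, b: int) -> int:
    """Two-input OR gate: output X is 1 if either input is 1."""
    return a | b

# Full truth table as (A, B, X) rows.
rows = [(a, b, or_gate(a, b)) for a, b in product((0, 1), repeat=2)]

# Forward direction: every row with B=1 also has X=1.
forward_holds = all(x == 1 for _, b, x in rows if b == 1)

# Converse: look for rows where X=1 but B is not 1.
counterexamples = [(a, b) for a, b, x in rows if x == 1 and b != 1]
converse_holds = not counterexamples

print("B=1 -> X=1 holds:", forward_holds)
print("X=1 -> B=1 holds:", converse_holds)
print("counterexamples (A, B):", counterexamples)
```

The single counterexample is the A=1, B=0 row, which is exactly the alternative mentioned above.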
>Some amount of knowledge is required for reasoning.
This is the root of the problem. If you think about STEM universities, they don't really teach you things you need in the real world. They teach you what you need to know in order to go out there and accumulate the necessary information which can then be used to solve problems. Giving a person access to the internet or a super powerful calculator (like Mathematica) won't mean that they can do anything useful. They need tons of experience to use these tools in an effective way. That experience is basically all that implicit adjacent knowledge that we pick up along the way getting our degrees. And LLMs pick that up during pre-training. Drop this part and the outcome will be worthless.
Take mathematics as an example. Humanity invented math notation, which made it possible to express mathematical rules and distill them to their core. Before that, math was expressed in prose, which is a very inefficient way and very similar to how current LLMs work.
At my school, the math teacher gave me prose, which I converted to math notation. I could argue that this prose→reasoning conversion is not required at training time and can be obtained at inference time with search tools.
I suppose it is both.
Basically all frontier models are inference-time compute bound thanks to reasoning. And actual reasoning traces are locked behind closed doors at all American labs. So whenever they want to push a new model and need to give it hardware, it would make sense to cut into the reasoning budgets of older models. Users will not be able to see that directly; it will only become apparent on high-end, difficult tasks, which are exactly the kind of tasks where the provider wants you to use the new model anyway, so they can further improve it.
You could also use the Responses API, which stores all message contents (including reasoning) on OAI servers. This has been possible for quite a while now. Encryption is only necessary if you really care about local storage (which is different from privacy concerns, because the data gets sent to their servers anyway).
well the encryption part is also mostly about OAI wanting to prevent others from distilling from their CoT/reasoning traces, since these are never displayed to devs or end users and, as you say, live on their servers.
but yes you're correct on the responses api already baking it in too
supposedly keeping these between tool calls should help the model reason and have better overall outputs etc
I guess they still use a tokenizer? Why would this kind of issue be solved? The model fundamentally can't see the word character by character like you do. For o200k tokenizers, for example, what the model sees are 3 tokens: [302, 1618, 19772]. These are shown to you as ["st", "raw", "berry"] in the UI. The only way any model can infer individual characters is by using external tools, or implicit knowledge picked up during training, or (what many of the big labs apparently do) special training for these edge cases, which fails once the next special case comes along.
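A toy sketch of what that means in practice. The three IDs and pieces are just the ones quoted above, hard-coded into a hypothetical lookup table; this is not a real tokenizer and I haven't re-checked the IDs against one.

```python
# Hypothetical mini-vocabulary built from the IDs/pieces quoted in the comment.
TOY_VOCAB = {302: "st", 1618: "raw", 19772: "berry"}

# What the model receives: a list of integers, nothing else.
token_ids = [302, 1618, 19772]

# There are no characters in the input to inspect or count.
characters_in_input = [t for t in token_ids if isinstance(t, str)]

def count_letter(ids, letter, vocab):
    """Counting a letter requires the id -> string table, i.e. an external
    tool (or knowledge about spellings memorized during training)."""
    text = "".join(vocab[i] for i in ids)
    return text.count(letter)

print("characters visible in the raw input:", characters_in_input)
print("count of 'r' after decoding:", count_letter(token_ids, "r", TOY_VOCAB))
```

The count only becomes available once something outside the token sequence supplies the spelling of each piece.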
People forget, but Google had their own internal version of ChatGPT before OpenAI. But they never even intended to launch it. If OpenAI hadn't just thrown the technology out there for everyone to see, Google would probably still be sitting on it. Google does tons of original stuff, but they haven't released any original product in more than a decade. All they do now is play catch-up once they see people actually like something.
There's also the major problem of people expecting Google to be right when it tells them something, whereas OpenAI had no starting reputation, so it was okay to say "be aware it might be wrong sometimes".
The chances are actually often way better than 1/4. For the words I didn't know, I was almost always able to exclude one or two options. Sometimes even three, finding the solution by exclusion.
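Rough arithmetic behind that, as a sketch. It assumes four options, that the excluded options are never the correct one, and that you guess uniformly among what's left; the function name is made up.

```python
from fractions import Fraction

def guess_probability(options: int, excluded: int) -> Fraction:
    """Chance of a correct uniform guess after ruling out `excluded` wrong options."""
    remaining = options - excluded
    if remaining < 1:
        raise ValueError("cannot exclude every option")
    return Fraction(1, remaining)

# Four-option question, excluding 0 to 3 wrong answers.
for k in range(4):
    print(f"excluded {k}: chance {guess_probability(4, k)}")
```

Excluding one option already lifts the odds from 1/4 to 1/3, two gives 1/2, and three leaves only the answer.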
Does the Swiss rail not receive public funding? It seems to me that undercharging would only necessitate more public funding, not some fundamental change where taxpayers suddenly have to pay for something they didn't before.
You should note that while a single use (like to kill mould) may be fine, regular use on stone, metal or wood (i.e. most stuff in a bathroom) is not recommended, because it is a powerful oxidizer that will considerably damage these surfaces over time. That's because it releases hydroxyl radicals that not only destroy molecular bonds in stains and microorganism cell walls, but also attack treated surfaces and corrode metals.