Going on an old legacy website, downloading reports, summarizing them, and then doing things based on those
Or basically any app without MCP capabilities
I ask the AI daily to summarize information across surfaces, and it's painful when I have to go screenshot things myself in a bunch of places, because those apps weren't made for extracting information and are complete black boxes with a UI on top
I enabled the computer use plugin yesterday. Today, without thinking about it, I asked it to summarize a Slack thread along with a spreadsheet
I was expecting it to use MCPs I have for them, but they happened to not be authenticated for some reason
I got _really_ freaked out when a glowing cursor popped up while I was doing something else, started looking at Slack, and then navigated in Chrome to the sheet to get the data it needed
Like on one hand it's really cool that it just "did the thing" but I was also freaked out during the experience
> do you mean they have branching early on to shortcut certain prompts?
Putting a classifier in front of a fleet of different models is a great way to provide higher quality results and spend less energy. Classification is significantly cheaper than generation and it is the very first thing you would do here.
A default, catch-all model is very expensive, but handles most queries reasonably well. The game from that point is to aggressively intercept prompts that would hit the catch-all model with cheaper, more targeted models. I have a suspicion that OAI employs different black boxes depending on things like the programming language you are asking it to use.
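A minimal sketch of that routing idea, with made-up model names and a keyword regex standing in for what would really be a cheap learned classifier:

```javascript
// Classifier-first routing: a cheap check labels the prompt, and only
// unmatched prompts fall through to the expensive catch-all model.
// All model names here are invented for illustration.
const routes = [
  { label: "code", model: "cheap-code-model", pattern: /\b(function|class|bug|compile)\b/i },
  { label: "translate", model: "cheap-translate-model", pattern: /\btranslate\b/i },
];

function routePrompt(prompt) {
  for (const r of routes) {
    if (r.pattern.test(prompt)) return r.model; // intercepted: cheaper model
  }
  return "expensive-catchall-model"; // default: handles everything else
}
```

The point is the shape, not the regexes: the classification step is far cheaper than a generation pass, so any prompt it can confidently redirect is pure savings.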
There are no date, time or datetime types in JSON, so you'll have to serialise it to a string or an int anyway, and then when deserialising you'll need to identify explicitly which values should be parsed as dates.
Well, you could still have a compound object in JSON that the Temporal API outputs, and which, given back as input, is guaranteed to deserialize into an object equal to the one it was created from. Such a compound object would have to contain all the required information about time zones and the like.
.... we're talking about serialization here. "convert to a raw string" is sort of the name of the game.
It's a string in a well specified string format. That's typically what you want for serialization.
Temporal is typed; but its serialization helpers aren't, because there's no single way to talk about types across serialization. That's functionality a serialization library may choose to provide, but can't really be designed into the language.
You realize that JSON isn't just for JavaScript to JavaScript communication, right? Even if you had a magical format (which doesn't make sense and is a bad idea to attempt to auto-deserialize), it wouldn't work across languages.
If you really want that, it's not very hard to design a pair of functions `mySerialize()`/`myDeserialize()` that are thin wrappers over `JSON.stringify`/`JSON.parse`.
One could argue that a smaller number of employees who are more motivated and feel connected to their coworkers is better than more employees who are all isolated and "meh".
I use Claude for work and Codex for private use due to already having a Plus subscription.
I can't say that I have noticed that 5.3-Codex is much better, but it's definitely on par with Opus 4.6, and its limits for $25/month are comparable to Max x5 at 1/4th of the cost (not to mention pay-per-token, which we use at work). Claude Code is generally a much better experience, though.
> I get it that in 10 years all of this might peak and we're gonna be content using old models
I would personally be happy using gpt 5.3 codex for the foreseeable future, with just improvements in harnesses
IMO we're already at the point where even if these companies collapse and the models end up being sold at the cost of inference (no new training), we would be massively ahead
It's hard to explain, but I've found LLMs to be significantly better in the "review" stage than the implementation stage.
So the LLM will do something and not catch at all that it did it badly. But the same LLM, asked to review against the same starting requirement, will almost always catch the problem
The missing thing in these tools is that automatic feedback loop between the two LLMs: one in review mode, one in implementation mode.
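The loop described above can be sketched like this. `llm` here is a caller-supplied stand-in for real model calls (one system prompt for "implement" mode, another for "review" mode); nothing about its interface is a real API.

```javascript
// Implement/review feedback loop: draft, have a reviewer check the
// draft against the original requirement, and revise until the
// reviewer is satisfied or we run out of rounds.
function runWithReview(requirement, llm, maxRounds = 3) {
  // First pass: implement from the requirement alone.
  let draft = llm("implement", requirement, null);
  for (let i = 0; i < maxRounds; i++) {
    // Same model family, review mode, same starting requirement.
    const review = llm("review", requirement, draft);
    if (review.ok) return draft; // reviewer found no problems
    // Feed the reviewer's feedback back into implement mode.
    draft = llm("implement", requirement, review.feedback);
  }
  return draft; // give up after maxRounds; return the latest draft
}
```

The interesting design question is the stopping condition: without a round cap, two disagreeing modes can ping-pong forever.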
Anecdotally, I think this is in Claude Code. It's pretty frequent to see it implement something, then declare it "forgot" a requirement and go back and alter or add to the implementation.
AFAICT this is already baked into the GitHub Copilot agent. I read its sessions pretty often and reviewing/testing after writing code is a standard part of its workflow almost every time. It's kind of wild seeing how diligent it is even with the most trivial of changes.
My reaction in that case is that most other readers of the codebase would probably also assume this, and so it should be either made clearer that it's stateful, or it should be refactored to not be stateful