Hacker News | frde_me's comments

Going on an old legacy website, downloading reports, summarizing them, and then doing things based on those

Or basically any app without MCP capabilities

I ask the AI daily to summarize information across surfaces, and it's painful when I have to go screenshot things myself in a bunch of places because those apps weren't made to let you extract information out of them; they're complete black boxes with a UI on top.


Feels like a lot of summarizing, which is just something I rarely need. YMMV depending on your job of course.

I enabled the computer use plugin yesterday. Today, without thinking about it, I asked it to summarize a Slack thread along with a spreadsheet.

I was expecting it to use MCPs I have for them, but they happened to not be authenticated for some reason

I got _really_ freaked out when a glowing cursor popped up while I was doing something else, started looking at Slack, and then navigated in Chrome to the sheet to get the data it needed.

Like on one hand it's really cool that it just "did the thing" but I was also freaked out during the experience


Out of curiosity, are there any sources on there being a significant number of other steps before a prompt is fed into the weights?

Security guards / ... are the obvious ones, but do you mean they have branching early on to shortcut certain prompts?


> do you mean they have branching early on to shortcut certain prompts?

Putting a classifier in front of a fleet of different models is a great way to provide higher quality results and spend less energy. Classification is significantly cheaper than generation and it is the very first thing you would do here.

A default, catch-all model is very expensive, but handles most queries reasonably well. The game from that point is to aggressively intercept prompts that would hit the catch-all model with cheaper, more targeted models. I have a suspicion that OAI employs different black boxes depending on things like the programming language you are asking it to use.
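The shape of that routing can be sketched in a few lines. Everything here is invented for illustration (the tier names, the keyword rules); a real front-end classifier would itself be a small trained model, not a regex, but the control flow is the same: classify cheaply, and only fall through to the expensive catch-all when no specialist claims the prompt.

```javascript
// Hypothetical prompt router: a cheap classifier picks a model tier
// before any expensive generation happens.
const TIERS = {
  code: "cheap-code-model",
  math: "cheap-math-model",
  default: "expensive-catchall-model",
};

function classify(prompt) {
  // A real classifier is a small model; keyword rules stand in here.
  if (/\b(function|bug|compile|python|rust)\b/i.test(prompt)) return "code";
  if (/\b(integral|derivative|equation)\b/i.test(prompt)) return "math";
  return "default";
}

function route(prompt) {
  // Only prompts no cheap specialist claims hit the catch-all model.
  return TIERS[classify(prompt)];
}

console.log(route("Why won't this Rust function compile?")); // cheap-code-model
console.log(route("Write me a sonnet")); // expensive-catchall-model
```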


Aren't you describing why they use mixture of experts? Where a subset of weights is activated depending on the query?


So it's intentional to make people pass around raw strings versus making the communication safe(r) by default?


There are no date, time or datetime types in JSON, so you'll have to serialise it to a string or an int anyway, and then when deserialising you'll need to identify explicitly which values should be parsed as dates.
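Concretely, with plain `Date` (Temporal's types behave the same way, serializing to well-specified strings): the field name `createdAt` below is just an example, and the reviver has to name it explicitly because nothing in the JSON marks it as a date.

```javascript
const record = { id: 7, createdAt: new Date("2024-05-01T12:00:00Z") };

// Date#toJSON produces an ISO 8601 string automatically.
const wire = JSON.stringify(record);
// → '{"id":7,"createdAt":"2024-05-01T12:00:00.000Z"}'

// On the way back in, we must identify date fields ourselves.
const parsed = JSON.parse(wire, (key, value) =>
  key === "createdAt" ? new Date(value) : value
);
console.log(parsed.createdAt instanceof Date); // true
```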


Well, you could still have a compound object in JSON that is output by the Temporal API, and which, given back as input, is guaranteed to deserialize to an object equal to the one it was created from. This compound object would have to contain all the required info about time zones and such.


.... we're talking about serialization here. "convert to a raw string" is sort of the name of the game.

It's a string in a well specified string format. That's typically what you want for serialization.

Temporal is typed; but its serialization helpers aren't, because there's no single way to talk about types across serialization. That's functionality a serialization library may choose to provide, but can't really be designed into the language.


You realize that JSON isn't just for JavaScript to JavaScript communication, right? Even if you had a magical format (which doesn't make sense and is a bad idea to attempt to auto-deserialize), it wouldn't work across languages.

If you really want that, it's not very hard to design a pair of functions `mySerialize()`, `myDeserialize()` that's a thin wrapper over `JSON.parse`.
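A minimal sketch of that wrapper pair, using an invented `$date` type tag (the tag convention is the part a serialization library would have to standardize):

```javascript
// Hypothetical type-tagged wrapper over JSON.stringify / JSON.parse.
function mySerialize(value) {
  // Must be a regular function: by the time the replacer sees a Date
  // it has already been converted via toJSON, so we check `this[key]`.
  return JSON.stringify(value, function (key, v) {
    if (this[key] instanceof Date) {
      return { $date: this[key].toISOString() };
    }
    return v;
  });
}

function myDeserialize(text) {
  return JSON.parse(text, (key, v) =>
    v && typeof v === "object" && typeof v.$date === "string"
      ? new Date(v.$date)
      : v
  );
}

const roundTripped = myDeserialize(mySerialize({ when: new Date(0) }));
console.log(roundTripped.when.getTime()); // 0
```

The downside, as noted above, is that `$date` only means anything to peers that share the convention; a consumer in another language sees an ordinary nested object.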


It's gotta become bytes somehow.


One could argue a smaller number of employees who are more motivated and feel connected to their coworkers is better than more employees who are all isolated and "meh".


Nothing inspires people to feel motivated and connected more than layoffs.


You have fewer people you work with, a constant threat of unemployment, and more work? Sign me up for that, boss.

/s


You see, in my circles (me included), people have been shifting _to_ Codex since 5.3 Codex came out.

The only places where I hear people say Claude is better are:

- Frontend design
- Random computer use tasks

But people trust Codex for large-scale architecture and changes.


I use Claude for work and Codex for private use due to already having a Plus subscription.

I can't say that I have noticed that 5.3-Codex is much better, but it's definitely on par with Opus 4.6, and its limits at $25/month are comparable to Max x5 at 1/4 of the cost (not to mention pay-per-token, which we use at work). Claude Code is generally a much better experience, though.


> I get it that in 10 years all of this might peak and we're gonna be content using old models

I would personally be happy using GPT-5.3 Codex for the foreseeable future, with just improvements in harnesses.

IMO we're already at the point where even if these companies collapse and the models end up being sold at the cost of inference (no new training), we would still be massively ahead.


What exact models were you using? And with what settings? 4.6 / 5.3 codex both with thinking / high modes?


minimax 2.5, kimi k2.5, codex 5.2, gemini 3 flash and pro, glm 4.7, devstral2 123b, etc.


It's hard to explain, but I've found LLMs to be significantly better in the "review" stage than in the implementation stage.

So the LLM will do something and not catch at all that it did it badly. But the same LLM, asked to review against the same starting requirement, will almost always catch the problem.

The missing piece in these tools is an automatic feedback loop between the two LLMs: one in review mode, one in implementation mode.
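The loop itself is simple. Here's a sketch with `callModel` as a stub standing in for whatever LLM API you'd actually use (a real version would be async and hit an endpoint); the stub "forgets" a requirement on the first attempt and approves the fix, purely to exercise the control flow.

```javascript
// Stub LLM: reviewer flags v1, approves v2; implementer fixes only
// after seeing review feedback. Invented for illustration.
function callModel(role, prompt) {
  if (role === "review") {
    return prompt.includes("v2") ? "LGTM" : "Missing: handle empty input";
  }
  return prompt.includes("Missing") ? "code v2" : "code v1";
}

function implementWithReview(requirement, maxRounds = 3) {
  let attempt = callModel("implement", requirement);
  for (let round = 0; round < maxRounds; round++) {
    // Review against the ORIGINAL requirement, not the attempt's framing.
    const review = callModel(
      "review",
      `Requirement: ${requirement}\nCode: ${attempt}`
    );
    if (review === "LGTM") return attempt;
    // Feed the review back to the implementer and try again.
    attempt = callModel("implement", `${requirement}\nReviewer said: ${review}`);
  }
  return attempt; // give up after maxRounds, keeping the last attempt
}

console.log(implementWithReview("sum a list of numbers")); // code v2
```

The `maxRounds` cap matters: without it, a reviewer that never says "LGTM" would loop forever.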


I've noticed this too and am wondering why this hasn't been baked into the popular agents yet. Or maybe it has and it just hasn't panned out?


Anecdotally, I think this is in Claude Code. It's pretty frequent to see it implement something, then declare it "forgot" a requirement and go back and alter or add to the implementation.


AFAICT this is already baked into the GitHub Copilot agent. I read its sessions pretty often and reviewing/testing after writing code is a standard part of its workflow almost every time. It's kind of wild seeing how diligent it is even with the most trivial of changes.


You have to dump the context window for the review to work well.


My reaction in that case is that most other readers of the codebase would probably also assume this, so it should either be made clearer that it's stateful, or it should be refactored to not be stateful.

