My general go-to for tasks that are n levels complex is to have the agent store summaries after every generation. I do this for interacting with websites - I'll sit there and type text for the agent to correctly inject JS to do something on a website, and on every iteration it asynchronously writes a history, in a background thread, of what it has done and what the result was. On every invocation, it injects that context.
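Roughly, the pattern looks like the sketch below. This is a minimal illustration, not my actual code: `HistoryLog`, `record`, `flush`, `as_context`, and `build_prompt` are made-up names, and the history is kept in memory rather than written to disk.

```python
import queue
import threading


class HistoryLog:
    """Collects (action, result) summaries on a background thread."""

    def __init__(self):
        self._pending = queue.Queue()
        self._entries = []
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, action, result):
        # Returns immediately; the worker thread does the actual write.
        self._pending.put((action, result))

    def _drain(self):
        while True:
            action, result = self._pending.get()
            with self._lock:
                self._entries.append(f"did: {action} -> got: {result}")
            self._pending.task_done()

    def flush(self):
        # Block until every queued summary has been written.
        self._pending.join()

    def as_context(self):
        with self._lock:
            return "\n".join(self._entries)


def build_prompt(history, user_text):
    # Inject everything done so far ahead of the new instruction.
    return (
        f"History so far:\n{history.as_context()}\n\n"
        f"New instruction: {user_text}"
    )
```

The agent loop calls `record(...)` after each generation and `build_prompt(...)` before the next one, so the writes never block the interaction itself.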
No, it reflects how badly Python is misunderstood by people who think their system is better, have no idea how Python actually works in production, and just publish things like the article to make themselves feel better.
Typing is not a huge issue, period. In Python, if you pass the wrong type to something, the program just throws an exception. Exceptions are not the end of the world like people make them seem. Functionally, finding errors by compiling code with type checking is no different from running the code against a set of tests, which every production codebase has (or should have).
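A rough sketch of what I mean, with made-up function and test names: the wrong type surfaces as a `TypeError` the moment a test exercises that path, assuming you actually have a test that does.

```python
def total_length(words):
    # Expects a list of strings; nothing enforces that until it runs.
    return sum(len(w) for w in words)


def test_total_length():
    assert total_length(["ab", "cde"]) == 5


def test_wrong_type_raises():
    # Passing ints instead of strings fails loudly inside the function.
    try:
        total_length([1, 2])
    except TypeError:
        return
    raise AssertionError("expected TypeError")
```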
The only way typing ever saves you from it is by being absolutely strict - every type defined has a finite range of values, and every operation has a bounded domain and range. I.e., if you have a string field, it's not enough that it's a string; you must also define the total number of characters that string can have and the allowed values for each character, along with more complex rules on sequences of characters.
If you have this system (something like Coq comes close), then if your program compiles, it's by definition correct. But even the strongest proponents of typing don't really want to do this, because they realize how long it would take to write code.
The simple truth is that Python is easy and flexible enough to work in that you don't even need type checking. An LLM can effectively function as a type checker for you if you care enough. For any errors that you encounter due to lack of typing, it's ultimately way faster to fix them in Python than it is to spend time writing in a strongly typed language.
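To make the "absolutely strict" string field concrete, here is a minimal sketch. The names `Username`, `ALLOWED`, and `MAX_LEN` and the specific rules are assumptions picked for illustration. Note that Python only enforces this at runtime, when the object is constructed; a system like Coq would enforce the equivalent constraints before the program runs.

```python
import string

# Hypothetical constraints, chosen only for the example.
ALLOWED = set(string.ascii_lowercase + string.digits + "_")
MAX_LEN = 16


class Username:
    """A string with a bounded length, a fixed alphabet, and one sequence rule."""

    def __init__(self, value):
        if not isinstance(value, str):
            raise TypeError("value must be a str")
        if not 1 <= len(value) <= MAX_LEN:
            raise ValueError("length out of range")
        if not set(value) <= ALLOWED:
            raise ValueError("character outside the allowed set")
        if value[0].isdigit():
            # Example of a rule on the sequence, not just on single characters.
            raise ValueError("must not start with a digit")
        self.value = value
```

Even this toy version needs four separate checks for one field, which is the cost I'm talking about.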
This is probably the most intellectualism I've seen anyone put into a comment that is so very, very obviously wrong.
Yeah, in the age of AI where the whole goal is to not have to think, type as fast as you can with misspellings, and copy paste stuff without thinking, it's TOTALLY a better system to worry about the types of whatever you are feeding into LLMs.
The gymnastics people are putting their ops teams through in order to validate oceans of generated slop is insane. Just use Rust and half of that work goes away.
I absolutely adore the historical revisionism that Apple cares about privacy.
Run your router through a Linux laptop as a proxy so you can capture traffic, connect any Apple device to your router, and see the vast amount of data your device sends to Apple.
Apple DGAF about privacy. They want your data as much as anyone else; their only thing is that they should be the only ones to get it and other people then have to pay them for it, rather than your device sending the data to the third party directly.
And if you think your data is secure, reminder that The Fappening was all done targeting Apple devices.
I don't use it, but yes, I think that it works because they have everything to lose and nothing to gain. They definitely could share whatever data they have with whoever, which is why you use E2EE.
At the end of the day you are trusting Apple with pretty much everything if you use this service because they make most of your phone, the entire operating system, host the update servers, etc.
I use small models exclusively. They aren't a replacement for large models. You need decent hardware to run those models efficiently, as smaller parameter models plain suck and are still slow on MacBooks. And affordability of higher end hardware is very limited.
Even at non VC subsidized $/token prices, it's still much cheaper to run cloud based models.
> Even at non VC subsidized $/token prices, it's still much cheaper to run cloud based models.
On a price-per-watt level, this is not true; people have done the math on /r/LocalLLaMA many times over[1]. Local models, while not as good as premier models (GPT 5.5, etc.), are roughly 80%+ of the way there, and often converge to a similar solution after a few dead ends.
Maybe not per watt, but unless you already happen to own the 3090 cited by that post, you'd have to buy one as well, and they are currently selling for around $1400 used.
3090s are running $1400 now? Wowsers. I thought I was overspending when I bought 6x of them for around $800 a pop.
Might be time to sell, to be honest. It's fun to have that at home, but I can't justify having $10k (with memory, mobo, cpu, etc) sitting in my basement without being fully utilized.
I do have a 3090 Ti on my gaming PC, but even my old M1 MBP (with a mere 32 GB of RAM) is quite competent and can run a quantized `Gemma4-26B-A4B` in the background while I do other stuff.
When you are developing software, it's significantly faster to use Google Gemini and copy paste code back and forth compared to having Gemini edit files for you.
Well, to be fair, that's right now. I think the question is: what about in 6 months, 12 months, 2 years?
Where do these improvement curves go? Does the gap close, do they intersect for practical purposes (factoring in cost etc.)? Or is the local curve always just a translation of the hosted one, lagging behind, or does hosted just pull ahead?
Nobody knows, but it's a very open question I feel, and it certainly appears like the answer might quite reasonably be that yes they intersect on that kind of short-ish term time horizon.
Large models haven't seen that much improvement, just better performance on small, specific tasks, all of which is special-cased RL to game metrics.
For local models, it's the same story. You can download Gemma 3 QAT from last year, and it will be just as good as Gemma:31b on average. Qwen also boasts that it's better, because again, they RLed it to game some metrics. It's better at coding than Gemma, but Gemma is better at more creative thinking (again, all RL).
Fundamentally, you need detail in the gradients for the models to pick up on the smaller details. If you don't have those, your output is gonna suck. No amount of clever architecture is going to fix this.
The only way to improve local models is by training them to fetch context, and then their job becomes much simpler because all they need to do is reinterpret the fetched content and provide an answer. But fundamentally, if you are trying to keep things in house for advertising purposes, like what all companies do with search, you want users to go to your service, which means running on your servers. And it's not really that much extra per invocation (i.e., excluding initial hardware costs) to instead just offer a large model as a service, which will be way better than any small model.
>I can't help but think that there's got to be a better mechanism
There is.
Transformers are basically autoencoders on the decode step - they take a compressed set of information and expand it into three matrices, which then get combined back into one matrix.
You can unroll the entire self-attention step into fully connected layers, just with a lot of zeros for things that don't get multiplied together.
So it stands to reason that there is probably an optimal form of weights that does the same thing as current transformers.
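Here is a small sketch of the unrolling for one piece of it, the per-token projection. The matrices are toy values I made up, and `matmul` and `block_diagonal` are hypothetical helpers. Applying the same weight matrix to every token is the same as one dense layer over the flattened input whose weight matrix is block diagonal, with zeros wherever tokens don't interact. This only demonstrates the linear projection; the attention weighting itself depends on the input, so the sketch does not cover that part.

```python
def matmul(a, b):
    # Plain list-of-lists matrix multiply.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block_diagonal(w, n_tokens):
    # n_tokens copies of w on the diagonal, zeros everywhere else.
    d, k = len(w), len(w[0])
    big = [[0.0] * (n_tokens * k) for _ in range(n_tokens * d)]
    for t in range(n_tokens):
        for i in range(d):
            for j in range(k):
                big[t * d + i][t * k + j] = w[i][j]
    return big


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, 2 features each
w_q = [[1.0, 0.0], [1.0, 1.0]]            # toy projection weights

# The usual form: the same small matrix applied to each token.
per_token = matmul(x, w_q)
flat_per_token = [v for row in per_token for v in row]

# The unrolled form: one fully connected layer over the flattened input.
flat_x = [[v for row in x for v in row]]
big_w = block_diagonal(w_q, len(x))
unrolled = matmul(flat_x, big_w)[0]

# Entries of the big matrix that sit outside the diagonal blocks.
off_block_zeros = len(big_w) * len(big_w[0]) - len(x) * len(w_q) * len(w_q[0])
```

Both forms give the same numbers, and here 24 of the 36 entries in the big matrix are structural zeros.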
I highly encourage you to vibecode something. It's really easy. You can get a small, fast library that can do OCR with coordinates, and the rest is just interfacing with the X server to draw stuff over the top.
As one who's driving Claude daily due to corporate mandates, I can see why people fall in love with genAI coding, but my revulsion has only grown as I've learned to do it, so I won't be spending my free time with LLMs.
Basically, with upcoming spark laptops, the smaller models will likely get fine-tuned to interface with Google services. Then Google can essentially make Chromebook software include those models, which is the same use case as Android.
And you better believe that they will be collecting user data and building advertising models.
Large models are still quite far ahead. Don't be fooled into thinking that even Gemma:31b (which is better than the 12b overall) is anywhere close to the big models.
There is definitely room for optimization, but fundamentally, for complex tasks, you need the small gradients to be visible during training so the model can learn the fine details (and consequently follow them during inference). For example, if you specify in the instructions not to write code but then ask a coding question, Gemma will still write code, whereas Gemini/Claude will pick up on that and follow your instructions better.
It doesn't matter if large models are undeniably better, if a local model is "good enough" to handle the task. With API costs ramping up, I think a lot of companies are going to want to look into what can be run locally instead, possibly only using larger models when the local models fall short.
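A minimal sketch of that "local first, large only on a miss" idea. Everything here is hypothetical: `answer`, the two model callables, and the `good_enough` check are stand-ins, and a real setup would need a much better way to decide when the local answer falls short.

```python
def answer(prompt, local_model, large_model, good_enough):
    """Try the local model first; pay for the large one only on a miss."""
    draft = local_model(prompt)
    if good_enough(prompt, draft):
        return draft, "local"
    return large_model(prompt), "large"


# Stand-in models so the sketch runs without any real backend.
def fake_local(prompt):
    return "" if "hard" in prompt else "local answer"


def fake_large(prompt):
    return "large answer"


def non_empty(prompt, draft):
    # Placeholder quality check: accept any non-blank draft.
    return bool(draft.strip())
```

The API bill then scales with how often the check fails rather than with total traffic.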