
Congrats on the launch. Quick feedback on the site, since I usually try to check out the blog section: the https://www.runcaptain.com/blog layout is broken on mobile. I tried Brave, Chrome, and Firefox.

Thanks, just shipped a fix ;)

Very, very heterogeneous and fast-moving space.

Depending on how they're made up, different teams do vastly different things.

Some teams have no evals at all, some run integration tests with no tooling, and some wire observability tools like Langfuse into their CI/CD. Others use tools like Arize Phoenix, DeepEval, Braintrust, Promptfoo, or Pydantic AI throughout their development.
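
Even with no tooling, the simplest starting point is a plain test that asserts properties of the output instead of exact strings. A minimal sketch in Python; `call_model` is a stand-in for whatever provider call you actually use, not any particular SDK:

    # Minimal hand-rolled eval: treat the model as a black box and
    # assert properties of the output rather than exact matches.
    import json

    def call_model(prompt: str) -> str:
        raise NotImplementedError  # swap in your provider call

    def test_returns_valid_json_with_required_fields():
        out = call_model("Extract {name, email} from: 'Ada Lovelace <ada@example.com>'")
        data = json.loads(out)                   # fails if the model drifts from JSON
        assert {"name", "email"} <= data.keys()  # schema check, not exact match
        assert "@" in data["email"]              # cheap sanity property

Run it with pytest like any other test; the point is that "just start" can be this small.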

It's definitely an afterthought for most teams although we are starting to see increased interest.

My hope is that we can start treating evals as a common language for "product" across role families, so I'm doing some advocacy [1] and trying to keep it very simple, including wrapping coding agents like Claude. Sandboxing and observability "for the masses" are still hard concepts, but the UX is getting better with time.

What are you doing for yourself/your teams? If not much yet, I'd recommend just starting and figuring out where the friction/value is for you.

- [1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception)


What I've noticed is that it's hard to measure outputs that aren't binary right/wrong, and that's where most human intervention is needed. The biggest examples are chatbots and coding agents – basically any output where you can say "hmm, that's a good response, but there is a better one". Benchmarking those kinds of responses is what still _feels_ like an unsolved problem.

On top of that, different model+prompt combinations give different results. For example, a prompt could yield a great response from Claude but a mediocre one from Gemini. Not just that: different models have different capabilities (for example, composite function calling doesn't work the same way across models).
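
(That model x prompt matrix is at least cheap to enumerate. A rough sketch with hypothetical model ids and a placeholder `complete` call:

    # Sweep the same prompts across models so a regression shows up
    # per (model, prompt) cell instead of averaging away.
    MODELS = ["claude-x", "gemini-y"]        # hypothetical identifiers
    PROMPTS = ["Summarize: ...", "Extract the fields from: ..."]

    def complete(model: str, prompt: str) -> str:
        raise NotImplementedError            # swap in real provider calls

    def sweep() -> dict[tuple[str, str], str]:
        # one cell per model/prompt pair, scored and logged separately
        return {(m, p): complete(m, p) for m in MODELS for p in PROMPTS}

Logging per cell keeps Claude's great response from hiding Gemini's mediocre one.)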

I'm asking because I'm genuinely curious how teams are solving this today – it _seems_ like there's no gold standard for evals yet, although the area is gaining interest.

How I do evals today is by scoring an output across different dimensions (which vary by use case): relevance, instruction following, clarity, hallucination rate, etc. It eats a lot of time (and can never be fully accurate, because how do you fully measure something like "clarity"?), and I feel like there's a better way out there.
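
Concretely, that looks something like an LLM-as-judge with one rubric per dimension. A rough sketch; `judge` stands in for whatever grader model you use, and the dimensions and threshold are illustrative:

    # Score one answer on several rubric dimensions and gate on a floor.
    DIMENSIONS = ["relevance", "instruction_following", "clarity"]

    def judge(prompt: str) -> str:
        raise NotImplementedError  # placeholder for a real judge-model call

    def score_response(question: str, answer: str) -> dict[str, int]:
        scores = {}
        for dim in DIMENSIONS:
            rubric = (f"Rate the ANSWER's {dim} for the QUESTION from 1 to 5. "
                      f"Reply with one digit.\nQUESTION: {question}\nANSWER: {answer}")
            scores[dim] = int(judge(rubric).strip())
        return scores

    # e.g. gate on a per-dimension floor rather than a single pass/fail:
    # assert all(v >= 4 for v in score_response(q, a).values())

It doesn't make "clarity" objective, but it does make the grading repeatable and auditable.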


> The vibes are not enough. Define what correct means. Then measure.

Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important for understanding how much you need to cover upfront.

The author mentions evaluations, which in this context are often called AI evals [1]. One thing I'd love to see is those evals becoming a common language of actually provable user stories, instead of a disconnect between different kinds of roles, e.g. a scientist, a business person, and a software developer.

The more we can speak a common language and easily write and maintain these, no matter our background, the easier it'll be to collaborate, empower people, and move fast without losing control.

- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )


Summers in Spain (e.g. in Madrid) can be extremely hot, and having the sun not set until very late can make city life unpleasant.

With people acknowledging heatwaves and energy issues, I find it interesting that this is seldom part of the conversation.


I don't like switching to daylight saving time and back, but I'd rather keep that than permanently move to +1. Then there are extreme examples that are already shifted, like Spain (for historical reasons around aligning with Germany), and I don't find that alignment useful economically, in daily city life, or otherwise.

If I had to make an executive choice right now, with no further analysis, I'd put every country back in its original time and move Spain and any other outlier to its proper timezone (a vertical alignment with the map, of sorts).


Don't get discouraged by being in the minority in one particular forum, especially when specific angles dominate.

People give different weights to different arguments.

On the Spain argument below: I actually think it's quite uncomfortable to be at +1/+2 in daily life, because people leaving the office at 5pm are actually leaving at 3pm solar time, under scorching sun. The difference between having light until 23:00 instead of 22:00 is negligible in a country that is still up at night even in winter.
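
The arithmetic behind that 5pm/3pm claim, roughly (Madrid sits about 3.7°W and summer clocks are UTC+2; this ignores the equation of time):

    # How far Madrid's summer clock runs ahead of the sun.
    longitude_deg_west = 3.7            # Madrid's longitude
    clock_offset_h = 2.0                # CEST is UTC+2

    solar_lag_h = longitude_deg_west / 15                 # ~0.25 h behind UTC
    clock_ahead_of_sun_h = clock_offset_h + solar_lag_h   # ~2.25 h

    print(17 - clock_ahead_of_sun_h)    # a 17:00 exit is ~14:45 solar time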

I can't cite anything at the moment, but from what I recall, the economic benefits of switching during the year have not been as large as touted, and the cost of changing every year has been harmful in many ways (operationally, for one). But I think the discussion here is about where countries should land.

I hope a country like the UK doesn't decide to switch to +1, and the same for Europe – further separating themselves from the countries of the Americas for the sake of summer sunlight, when summer already has a huge window of sun and people often want to escape that heat.


Yes, once you look at the daylight times it's clear that UK time naturally fits Spain better.


Dark patterns degrade our computing experience and are worth illustrating, but there's a larger discussion to be had about keeping individual control over our own devices.

Technically, that means being able to install Linux, run local models, and use open-source software as we see fit.

Legally, it means opposing measures that erode those rights under the guise of compliance, like backdoors or restrictions on what can run, so that we are no longer really in control of the hardware we own but must adjust to the whims of the controller/operator, who could, at a moment's notice, default to these dark patterns for "pragmatic reasons" of their own that don't align with your interests.

We know enough bad stories about "internet of things" devices. Anyone interested in FOSS and control should probably invest in this angle.


> I do think there is something interesting to think about here in how coding agents like Claude Code can quickly iterate given a test suite.

This is a point I've been advocating for a while, especially to empower non-coders and show them that we CAN approach automation with control.

Some aspects will be the classic unit or integration tests for validation. Others will be AI evals [1], which to me could become the common language for product design across families/disciplines that don't quite know how to collaborate with each other.
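
The coding-agent loop itself is simple to picture: run the suite, feed failures back, repeat under a budget. A hedged sketch (the `run_agent` call is a placeholder, not Claude Code's actual API):

    # Agent iterates against a test suite: bounded retries = control.
    import subprocess

    def run_agent(instructions: str) -> None:
        raise NotImplementedError  # stand-in for your coding-agent invocation

    def iterate(max_attempts: int = 5) -> bool:
        for _ in range(max_attempts):
            result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
            if result.returncode == 0:
                return True                    # suite is green, stop
            run_agent("Make these tests pass:\n" + result.stdout[-4000:])
        return False                           # budget spent, a human takes over

The tests (or evals) carry the intent; the loop is just mechanics.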

The amount of progress in a short time is amazing to see.

- [1] https://ai-evals.io/


Please stop spreading this "AI evals" terminology. "evals" is what providers like OpenAI and Anthropic do with their models. If you wrote a test for a feature that uses an LLM, it's just a test, there's no need to say "evals." Having a separate term only further confuses people who already have no idea what that actually means.


I respectfully disagree. I think there needs to be a common term for the aspects around LLM testing, and saying "it's just integration/system tests" doesn't reach audiences well; it doesn't disambiguate the differences.

Words win when they're used. Just because Agent Skills is just a pattern for standardization and saving context doesn't mean it wasn't incredibly useful.

Think beyond software developers by trade. Think beyond the people who realized they needed tests, and toward those who thought "the models will just get smarter" and "they told me there are guardrails".


I love emergent behaviour and storytelling. Anyone who has played city builders like SimCity or roguelikes like Dwarf Fortress knows how interesting, fun, and even informative they can be.

In a world where setting them up and letting rogue agents run rampant becomes relatively low-cost and fast, I think focusing on the desired outcomes, the storytelling, and especially the UX for the human user is key – and maybe we can take some learnings from Will Wright's "Designing User Interfaces to Simulation Games" [1].

I won't be able to do much this weekend, so I can't say I'll check this out (yet?), but I'd be interested in your own experiences so far. Any surprises? Things you'd like to do next? What's most fun/challenging?

An actual report/write-up will probably resonate more than a repo with people who can't easily check it out or aren't willing to.

- [1] https://donhopkins.medium.com/designing-user-interfaces-to-s...


Appreciate this! And yeah, the Will Wright talk is exactly what I was leaning into.

I actually posted this on X two weeks ago, hosted the werld observatory publicly, and had Gemini stream a new chapter of the story in natural language every 10,000 ticks – so it felt like reading a David Attenborough novel of a werld being born.
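
(Mechanically it's roughly a tick loop that flushes an event buffer to the model every N ticks. This is a toy reconstruction, not the actual werld code:

    import random

    CHAPTER_EVERY = 10_000

    def step_world(tick: int) -> list[str]:
        # toy stand-in for one simulation tick's event log
        return [f"t{tick}: agent {random.randint(0, 9)} migrated"] if tick % 997 == 0 else []

    def narrate(prompt: str) -> str:
        # placeholder for the Gemini call
        return "Chapter: " + prompt[-60:]

    events: list[str] = []
    for tick in range(1, 30_001):
        events.extend(step_world(tick))
        if tick % CHAPTER_EVERY == 0:
            print(narrate("Continue the chronicle:\n" + "\n".join(events)))
            events.clear()
)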

The most interesting thing from the last run was definitely the language and the behaviours: decoding what they were actually saying was difficult, as was noticing them group within their diverged species.

Up next, I want to get the storytelling side up and running too – I kept running out of storage, and Cloudflare was playing up as usual – maybe get Gemini to visualise each chapter, and build an upgraded interface for the werld observatory.

If you want to check out my previous attempt at streaming the story line, it's still on my X: https://x.com/im_urav?s=21&t=6Si-w-DvNJC7RfvSz2Aw-w


> Anyone who has played City builders like Sim City or ... knows how interesting, fun and even informative they can be.

Yes. The most interesting and fun thing in SimCity was always having to upgrade your storage. So instead of building a city you were building storage. /s


I think a key ingredient here is accountability and liability.

If there's a mistake, you can't blame the computer. Who is the human accountable at the end of it all? If there's liability, who pays for it?

That's where defining clear boundaries helps you design for your risk profile.

