You'd be surprised how many serious tech shops have close to zero performance metrics collected and utilised.
I've done this in fintech a few times already and the best stack that worked from my experience was telegraf + influxdb + grafana.
There are many things you can get wrong with metrics collection (what you collect, how, units, how the data is stored, aggregated and eventually presented) and I learned about most of them the hard way.
However, when done right and covering all layers of your software execution stack this can be a game changer, both in terms of capacity planning/picking low hanging perf fruit and day to day operations.
Highly recommend giving it a try as the tools are free, mature and cover a wide spectrum of platforms and services.
I'd like to followup on this and say InfluxDB is fantastic -- with the caveat that you know exactly what you want, and use it with that in mind. It's great at storing some metrics with high volume, but I've found the OSS version falls over hard in some circumstances (high cardinality in tags absolutely murders performance, making it nearly unusable).
I'm in the middle of a test run of vmagent + victoria-metrics as a mostly-replacement (i.e. not replacing the cases where it's known exactly what's wanted and InfluxDB is used explicitly for that), and victoria-metrics fully supports telegraf pushes, which is a big bonus.
Also, they're apparently going to drop InfluxQL entirely in favor of Flux, which is weird to me? The docs don't outright state this, but InfluxDB 2.0 has only Flux examples in the docs I can find. It's a better language, but the learning curve is not trivial.
> Was Tickscript what Kapacitor used for InfluxDB 1.7?
Yes. The learning curve was steep, and debugging was near-impossible. Flux is rough too, but it's already getting better documentation and tooling support than Tickscript had.
Note that this isn't an either/or. You can use telegraf to push data to prometheus, or pull data from prometheus compliant http endpoints. You can also use grafana to show data from both InfluxDB and Prometheus.
This might be obvious to some but I thought it worth mentioning.
Agreed; same experience (with many years of web perf optimization as a UI architect and/or on perf strategy consulting gigs). Even after all these years, simply using WebPageTest to capture test runs and share the results is often a transformational experience.
"I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing."
Starting the article off with "I did this in one day" - complete with a massive footnote disclaiming that it obviously took a lot more than one day - kinda ruined it for me. Why even bother with that totally unnecessary claim?
My read on it is the author is saying that seemingly small changes can have big impacts. I agree it could have been worded better, though I doubt he's trying to promote himself as a genius (as other people are saying) because he clearly highlights the effort his team put into the project in the footnote.
"I did it by myself in one day, well actually it was one week but had I known the stack I would likely have done it in one day. Oh, and by the way, after that week, there was yet another month of work involving at least two other persons from my team and then even more work from other teams. But let's not dwell on boring details".
It's nearly as infuriating as the "Appendix: stuff I screwed up", which doesn't contain any actual screw-ups. It's a shame because the rest of the writing is interesting and doesn't need to be propped up.
It's kind of a personal marketing thing these days to project this maverick/hero aura of genius instead of the "unproductive" but real, hard grinding work it takes to get something done and delivered. It worked for a few, so thousands try the same thing, and here we are now, I guess?
Given:
- the context that the author already has a very successful career as a well-known developer
- the humility he evidences in most posts on his blog
- the fact that he explicitly highlights the work of others in this post alongside his own
I really don't think Dan is doing this as any form of personal marketing. He has no need of personal marketing, his blog already has several million views per month and frequently shows up on HN as it is, and it isn't really his style.
I did not say he did that, I said that I believe it's common these days given the points I mentioned. You just need to hang around and see a bunch of posts on HN to notice that. QED, he's probably one of the "few" I talked about.
That strikes me as a short tolerance for feeling something is ruined. He appropriately highlighted the realistic time estimate of the more involved work in a footnote. He didn't literally mean all of the work was one day; he's trying to convey a larger point about outsized engineering returns from comparatively small person-hours of work.
Were you able to move past this to read the rest of the article? Because it's a very good article.
I did go back and read the rest of it, and I do agree that it's pretty good overall. For someone going into detail about the mistaken capitalization of a variable name (which I appreciated), the "one day" bit still stands out as oddly hand-wavy and boastful (and, as the opening remark, I would argue it's pretty important for setting the tone).
If that point needed to be made (and I don't think it really did in this article, that's not the focus), it could have been done more carefully.
There's a lot of interest in this space with respect to analytics on top of monitoring and observability data.
Anyone interested in this topic might want to check out an issue thread on the Thanos GitHub project. I would love to see M3, Thanos, Cortex and other Prometheus long-term storage solutions all benefit from a project in this space that could dynamically pull data back from any form of Prometheus long-term storage using the Prometheus Remote Read protocol:
https://github.com/thanos-io/thanos/issues/2682
Spark and Presto both support predicate pushdown to a data layer, which can be a Prometheus long-term metrics store, and are able to perform queries on arbitrary sets of data.
Spark is also super useful for ETLing data into a warehouse (such as HDFS or other backends; e.g. the BigQuery connector for Spark[1] could take the result of a query against, say, a Prometheus long-term metrics store and export it into BigQuery for further querying).
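To make that concrete, here's a minimal PySpark sketch of that kind of ETL; it's just an illustration, so the bucket, path, column and table names are made up, and it assumes the spark-bigquery connector is on the classpath:

    # Filter a Parquet metrics dump (the time-range predicate gets pushed down
    # to the storage layer) and export the result to BigQuery for further querying.
    # All paths and table names below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("metrics-etl").getOrCreate()

    metrics = spark.read.parquet("s3://example-bucket/metrics/")  # hypothetical location

    # Only the files/row groups overlapping this window and metric should be read.
    recent_cpu = (metrics
        .filter(F.col("timestamp") >= "2020-06-01")
        .filter(F.col("name") == "cpu_seconds_total"))

    # Hand the result to BigQuery via the spark-bigquery connector.
    (recent_cpu.write
        .format("bigquery")
        .option("temporaryGcsBucket", "example-temp-bucket")  # hypothetical bucket
        .save("example_dataset.cpu_metrics"))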
I was thinking the answer to 'why are programmers paid more than other white-collar professions that are similarly profitable' is: because programmers control the means of production.
I might be a great telecomm tech, a genius even, but once I'm out of a job, I can't build my own telecomm system - that would cost billions. I have to go back to some other telecomm system to start making money again.
But, at a startup, a kicked-out senior engineer can actually pretty much exactly recreate the company; they can do the equivalent of a laid-off telecomm employee starting a new, almost-as-good (except for branding) telecomm company.
No billions in infrastructure required: within a month or two, the cloned company could be near-indistinguishable from the original.
So companies have to pay employees more like partners, instead of employees, because either they pay them as equals or they'll be forced to compete against them, as equal rivals.
Interesting. I hadn't thought about it that way before, but it does seem to predict the bimodality; the lower mode is (presumably) programmers who either aren't skilled enough or don't work in the right areas to be able to take their ball and found a startup with it.
Just so I understand, the simple way the headline talks about was "collect all metrics, but store the small fraction you care about in an easily accessible place; delete the raw data every week"?
Title didn't live up to article imho. But I get it. Thanks for sharing your methods.
Love the section in this about using "boring technology" - and then writing about how you used it, to help counter the much more common narrative of using something exciting and new.
If we consider Graphite, InfluxDB, and Prometheus, at this point in the monitoring industry's evolution, we have the capability to easily criss-cross metrics generated in the format of one of these systems to store them in one of the other ones.
The missing piece remains to be able to query one system with the query language of the others. For example, query Prometheus using Graphite's query language.
Speaking of high cardinality metrics, what are good options that aren’t as custom as map reduce jobs and are a bit more real-time?
We killed our influx cluster numerous times with high cardinality metrics. We migrated to Datadog which charges based on cardinality so we actively avoid useful tags that have too much cardinality. I’m investigating Timescale since our data isn’t that big and btrees are unaffected by cardinality.
The boring technology observation (here referring to the challenge of getting publicity for "solving a 'boring' problem") is really true.
It extends very well to something that we constantly hammer home on my team: using boring tools is often best because it's easier to manage around known problems than forge into the unknown, especially for use-cases that don't have to do with your core business. Extreme & contrived example: it's much better to build your web backend in PHP over Rust because you're standing on the shoulders of decades of prior work, although people will definitely make fun of you at your next webdev meetup.
(Functionality that is core to your business is where you should differentiate and push the boundaries on interesting technology e.g. Search for Google, streaming infrastructure for Netflix. All bets are off here and this is where to reinvent the wheel if you must – this is where you earn your paycheck!)
Thank you for sharing this. I recently started working on metrics at a FAANG and saw some of the challenges you mentioned... the fact that you were able to get good results so quickly is super inspiring!
What's the standard for metrics gathering, push or pull? I prefer pull, but depending on the app it can mean you need to build in a micro HTTP server so there's something to query. That can be a PITA, but pushing a stat on every event seems wasteful, especially if there's a hot path in the code.
I don't think there's any clear standard. There are many confusions about push vs pull that make the discussions hard to follow, as they often make apples-to-oranges comparisons. For example, the push you're talking about in your comment is events, whereas a fair comparison for Prometheus would be with pushing metrics to Graphite. https://www.robustperception.io/which-kind-of-push-events-or... covers this in more detail.
Taking your example you could push without sending a packet on every event by instead accumulating a counter in memory, and pushing out the current total every N seconds to your preferred push-based monitoring system. You could even do this on top of a Prometheus client library, some of the official ones even as a demo allow pushing to Graphite with just two lines of code: https://github.com/prometheus/client_python#graphite
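For illustration, a small sketch along those lines with the official Python client (the metric name and Graphite host are placeholders): events just bump an in-memory counter, and the bridge pushes the accumulated totals out every 10 seconds.

    # Events increment an in-memory counter; the GraphiteBridge pushes the
    # current totals to a Graphite/Carbon endpoint every 10 seconds.
    # Host, port and metric name are placeholders.
    from prometheus_client import Counter
    from prometheus_client.bridge.graphite import GraphiteBridge

    REQUESTS = Counter("myapp_requests_total", "Requests handled")

    gb = GraphiteBridge(("graphite.example.com", 2003))
    gb.start(10.0)  # push interval in seconds

    def handle_request():
        REQUESTS.inc()  # cheap in-process increment, no network I/O on the hot path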
In my personal opinion, pull is overall better than push but only very slightly. Each have their own problems you'll hit as you scale, but those problems can be engineered around in both cases.
At this point Prometheus is pretty close to becoming the boring technology. The latest versions have finally brought in the plumbing and tuning knobs to protect against [most] overly expensive queries. So you can't easily take it down anymore.
The single-binary approach is still a problem, though. In my mind any serious telemetry collection stack should separate the query engine and ingestion path from each other - Prometheus has both the query interface and the ingestion/writing subsystem in the same process.[ß]
As for the parent poster: you certainly want to push telemetry out on every event, but the mechanism has to be VERY lightweight. With prometheus the solution is to have a telemetry collection/aggregation agent on the host, feed it with the event data and have prometheus scrape the agent. Statsd with the KV extension is a great protocol for shoveling the telemetry out from the process and into the agent.
ß: you can get around this with Thanos + Trickster to take care of the read path only, but it's quite a bit more complex than plain Prometheus.
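To illustrate how lightweight that per-event push can be, here's a bare-bones sketch of the plain statsd counter line over UDP (the agent address and metric name are placeholders, and a real client library would add batching and sampling on top):

    # One small fire-and-forget UDP datagram per event, so the hot path never
    # blocks on the monitoring agent. Agent address and metric name are placeholders.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    AGENT = ("127.0.0.1", 8125)  # local telemetry agent, e.g. statsd or telegraf

    def incr(metric: str, value: int = 1) -> None:
        # statsd counter line format: <name>:<value>|c
        sock.sendto(f"{metric}:{value}|c".encode(), AGENT)

    incr("myapp.events_processed")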
There have been a lot of articles posted recently about the 'old' web, and while I like the concept I still have a hard time finding quality information in many of the directories and webrings posted. The level of research and density of information in this blog is very good.
I'm not sure I understood the solution there. Storing only 0.1%-0.01% of the interesting metrics data makes sense in the same way that you'd take a poll of a small fraction of the population to make guesses about the whole?
I believe he means they are storing all the data, but for a subset of types of data, sorta like extracting just a few columns out of a big table. Presumably someone on some other team gets use out of having access to the 99.9% of data stored that is not relevant to "performance or capacity related queries."
You lose a lot of the benefits, but you can still take advantage of time range partition elimination just as long as your data is still physically partitioned by the timestamp column. That's the most important thing when processing time series data: never read from disk any of the data that's outside the time range you actually need for your query.
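As a hypothetical PySpark sketch of that (paths and column names are made up): write the data partitioned by a date column derived from the timestamp, then always filter on that column so whole directories get skipped.

    # Data is physically laid out by date, so a filter on that column lets the
    # reader skip every partition outside the requested window.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("metrics-partitioning").getOrCreate()

    raw = spark.read.json("s3://example-bucket/raw-metrics/")  # hypothetical source

    (raw.withColumn("date", F.to_date("timestamp"))
        .write.partitionBy("date")
        .parquet("s3://example-bucket/metrics-by-date/"))

    # Queries filtering on the partition column only touch matching directories.
    week = (spark.read.parquet("s3://example-bucket/metrics-by-date/")
            .filter(F.col("date").between("2020-06-01", "2020-06-07")))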
Yes and no. We already know there were too many named metrics to give each its own column on the system they were using (parquet on a data lake), so what are they left with?
Does a column store like parquet make a good time series DB? Trendy named time series databases I’ve had the displeasure of using would all fail miserably with high-cardinality series too, so I’m not convinced there is actually a better thing than files on a lake for this stuff.
So, use some format to name the metric in each row. If parquet, use dictionary encoding on that column and sort or cluster the rows ... that will give you min/max pruning etc.
But Presto is currently 5x slower to chew through parquet vs ORC, so perhaps simply use ORC. Or, for this data, Avro or JSON Lines.
And then when you’ve used presto to discover interesting metrics you can always use presto (or scalding or whatever your poison is) to extract the metrics you have identified you want to examine more closely and to put them into separate datasets etc.
I’m just outlining standard approaches to these kinds of problems.
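A rough PySpark sketch of that flow, with made-up paths and metric names: keep the metric name as a row-level column, cluster on it so Parquet's per-column stats can prune effectively, then pull the metrics you've identified into their own dataset.

    # Cluster rows by metric name so each name lands in few row groups, letting
    # dictionary/min-max stats prune reads; then extract an interesting subset.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("metrics-extract").getOrCreate()

    all_metrics = spark.read.parquet("s3://example-bucket/metrics-by-date/")

    (all_metrics.sortWithinPartitions("name")
        .write.mode("overwrite")
        .parquet("s3://example-bucket/metrics-clustered/"))

    # Pull the handful of metrics identified as interesting into a smaller dataset.
    interesting = ["cpu_seconds_total", "gc_pause_seconds"]
    (spark.read.parquet("s3://example-bucket/metrics-clustered/")
        .filter(F.col("name").isin(interesting))
        .write.mode("overwrite")
        .parquet("s3://example-bucket/metrics-interesting/"))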