You'd be surprised how many serious tech shops have close to zero performance metrics collected and utilised.
I've done this in fintech a few times already and the best stack that worked from my experience was telegraf + influxdb + grafana.
There are many things you can get wrong with metrics collection (what you collect, how, units, how the data is stored, aggregated and eventually presented) and I learned about most of them the hard way.
However, when done right and covering all layers of your software execution stack this can be a game changer, both in terms of capacity planning/picking low hanging perf fruit and day to day operations.
Highly recommend giving it a try as the tools are free, mature and cover a wide spectrum of platforms and services.
I'd like to followup on this and say InfluxDB is fantastic -- with the caveat that you know exactly what you want, and use it with that in mind. It's great at storing some metrics with high volume, but I've found the OSS version falls over hard in some circumstances (high cardinality in tags absolutely murders performance, making it nearly unusable).
I'm in the middle of a test run of vmagent + victoria-metrics as a mostly-replacement (i.e. not replacing the cases where it's known exactly what's wanted and InfluxDB is used explicitly for that), and victoria-metrics fully supports telegraf pushes, which is a big bonus.
Also, they're apparently going to drop InfluxQL entirely in favor of Flux, which is weird to me? The docs don't outright state this, but InfluxDB 2.0 has only Flux examples in the docs I can find. It's a better language, but the learning curve is not trivial.
> Was Tickscript what Kapacitor used for InfluxDB 1.7?
Yes. The learning curve was steep, and debugging was near-impossible. Flux is rough too, but it's already getting better documentation and tooling support than Tickscript had.
Note that this isn't an either/or. You can use telegraf to push data to prometheus, or pull data from prometheus compliant http endpoints. You can also use grafana to show data from both InfluxDB and Prometheus.
This might be obvious to some but I thought it worth mentioning.
Agreed; same experience (with many years of web perf optimization as a UI architect and/or on perf strategy consulting gigs). Even after all these years, simply using WebPageTest to capture test runs and share the results is often a transformational experience.
"I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing."
Starting the article off with "I did this in one day" - complete with a massive footnote disclaiming that it obviously took a lot more than one day - kinda ruined it for me. Why even bother with that totally unnecessary claim?
My read on it is the author is saying that seemingly small changes can have big impacts. I agree it could have been worded better, though I doubt he's trying to promote himself as a genius (as other people are saying) because he clearly highlights the effort his team put into the project in the footnote.
"I did it by myself in one day, well actually it was one week but had I known the stack I would likely have done it in one day. Oh, and by the way, after that week, there was yet another month of work involving at least two other persons from my team and then even more work from other teams. But let's not dwell on boring details".
It's nearly as infuriating as the "Appendix: stuff I screwed up", which doesn't contain any actual screw-ups. It's a shame because the rest of the writing is interesting and doesn't need to be propped up.
It's kind of a personal marketing thing these days to project this maverick/hero aura of genius instead of the "unproductive" but real, hard grinding work it takes to get something done and delivered. It worked for a few, so thousands try the same thing, and here we are now, I guess?
Given:
- the context that the author already has a very successful career as a well-known developer
- the humility he evidences in most posts on his blog
- the fact that he explicitly highlights the work of others in this post alongside his own
I really don't think Dan is doing this as any form of personal marketing. He has no need of personal marketing, his blog already has several million views per month and frequently shows up on HN as it is, and it isn't really his style.
I did not say he did that, I said that I believe it's common these days given the points I mentioned. You just need to hang around and see a bunch of posts on HN to notice that. QED, he's probably one of the "few" I talked about.
That strikes me as a short tolerance for feeling something is ruined. He appropriately highlighted the realistic time estimate of the more involved work in a footnote. He didn't literally mean all of the work was one day; he's trying to convey a larger point about outsized engineering returns from comparatively small person-hours of work.
Were you able to move past this to read the rest of the article? Because it's a very good article.
I did go back and read the rest of it, and I do agree that it's pretty good overall. For someone going into detail about the mistaken capitalization of a variable name (which I appreciated), the "one day" bit still stands out as oddly hand-wavy and boastful (and, as the opening remark, I would argue it's pretty important for setting the tone).
If that point needed to be made (and I don't think it really did in this article, that's not the focus), it could have been done more carefully.
There's a lot of interest in this space with respect to analytics on top of monitoring and observability data.
Anyone interested in this topic might want to check out an issue thread on the Thanos GitHub project. I would love to see M3, Thanos, Cortex and other Prometheus long-term storage solutions all benefit from a project in this space that could dynamically pull data back from any form of Prometheus long-term storage using the Prometheus Remote Read protocol:
https://github.com/thanos-io/thanos/issues/2682
Spark and Presto both support predicate pushdown to a data layer, which can be a Prometheus long-term metrics store, and are able to perform queries on arbitrary sets of data.
Spark is also super useful for ETLing data into a warehouse (such as HDFS or other backends; e.g. the BigQuery connector for Spark[1] could take the result of a query against, say, a Prometheus long-term metrics store and export it into BigQuery for further querying).
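To make that concrete, here's a minimal PySpark sketch of that kind of ETL; it's just an illustration, so the bucket, path, column and table names are made up, and it assumes the spark-bigquery connector is on the classpath:

    # Filter a Parquet metrics dump (the time-range predicate gets pushed down
    # to the storage layer) and export the result to BigQuery for further querying.
    # All paths and table names below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("metrics-etl").getOrCreate()

    metrics = spark.read.parquet("s3://example-bucket/metrics/")  # hypothetical location

    # Only the files/row groups overlapping this window and metric should be read.
    recent_cpu = (metrics
        .filter(F.col("timestamp") >= "2020-06-01")
        .filter(F.col("name") == "cpu_seconds_total"))

    # Hand the result to BigQuery via the spark-bigquery connector.
    (recent_cpu.write
        .format("bigquery")
        .option("temporaryGcsBucket", "example-temp-bucket")  # hypothetical bucket
        .save("example_dataset.cpu_metrics"))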
I was thinking the answer to 'why are programmers paid more than other white-collar professions that are similarly profitable' is: because programmers control the means of production.
I might be a great telecomm tech, a genius even, but once I'm out of a job, I can't build my own telecomm system - that would cost billions. I have to go back to some other telecomm system to start making money again.
But, at a startup, a kicked-out senior engineer can actually pretty much exactly recreate the company; they can do the equivalent of a laid-off telecomm employee starting a new, almost-as-good (except for branding) telecomm company.
No billions in infrastructure required: within a month or two, the cloned company could be near-indistinguishable from the original.
So companies have to pay employees more like partners, instead of employees, because either they pay them as equals or they'll be forced to compete against them, as equal rivals.
Interesting. I hadn't thought about it that way before, but it does seem to predict the bimodality; the lower mode is (presumably) programmers who either aren't skilled enough or don't work in the right areas to be able to take their ball and found a startup with it.
Just so I understand, the simple way the headline talks about was "collect all metrics, but store the small fraction you care about in an easily accessible place; delete the raw data every week"?
Title didn't live up to article imho. But I get it. Thanks for sharing your methods.
Love the section in this about using "boring technology" - and then writing about how you used it, to help counter the much more common narrative of using something exciting and new.
If we consider Graphite, InfluxDB, and Prometheus, at this point in the monitoring industry's evolution, we have the capability to easily criss-cross metrics generated in the format of one of these systems to store them in one of the other ones.
The missing piece remains to be able to query one system with the query language of the others. For example, query Prometheus using Graphite's query language.
Speaking of high cardinality metrics, what are good options that aren’t as custom as map reduce jobs and are a bit more real-time?
We killed our influx cluster numerous times with high cardinality metrics. We migrated to Datadog which charges based on cardinality so we actively avoid useful tags that have too much cardinality. I’m investigating Timescale since our data isn’t that big and btrees are unaffected by cardinality.
The boring technology observation (here referring to the challenge of getting publicity for "solving a 'boring' problem") is really true.
It extends very well to something that we constantly hammer home on my team: using boring tools is often best because it's easier to manage around known problems than forge into the unknown, especially for use-cases that don't have to do with your core business. Extreme & contrived example: it's much better to build your web backend in PHP over Rust because you're standing on the shoulders of decades of prior work, although people will definitely make fun of you at your next webdev meetup.
(Functionality that is core to your business is where you should differentiate and push the boundaries on interesting technology e.g. Search for Google, streaming infrastructure for Netflix. All bets are off here and this is where to reinvent the wheel if you must – this is where you earn your paycheck!)
Thank you for sharing this. I recently started working on metrics at a FAANG and saw some of the challenges you mentioned... the fact that you were able to get good results so quickly is super inspiring!
What's the standard for metrics gathering, push or pull? I prefer pull, but depending on the app it can mean you need to build in a micro HTTP server so there's something to query. That can be a PITA, but pushing a stat on every event seems wasteful, especially if there's a hot path in the code.
I don't think there's any clear standard. There are many confusions about push vs pull that make the discussions hard to follow, as they often make apples-to-oranges comparisons. For example, the push you're talking about in your comment is events, whereas a fair comparison for Prometheus would be with pushing metrics to Graphite. https://www.robustperception.io/which-kind-of-push-events-or... covers this in more detail.
Taking your example you could push without sending a packet on every event by instead accumulating a counter in memory, and pushing out the current total every N seconds to your preferred push-based monitoring system. You could even do this on top of a Prometheus client library, some of the official ones even as a demo allow pushing to Graphite with just two lines of code: https://github.com/prometheus/client_python#graphite
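For illustration, a small sketch along those lines with the official Python client (the metric name and Graphite host are placeholders): events just bump an in-memory counter, and the bridge pushes the accumulated totals out every 10 seconds.

    # Events increment an in-memory counter; the GraphiteBridge pushes the
    # current totals to a Graphite/Carbon endpoint every 10 seconds.
    # Host, port and metric name are placeholders.
    from prometheus_client import Counter
    from prometheus_client.bridge.graphite import GraphiteBridge

    REQUESTS = Counter("myapp_requests_total", "Requests handled")

    gb = GraphiteBridge(("graphite.example.com", 2003))
    gb.start(10.0)  # push interval in seconds

    def handle_request():
        REQUESTS.inc()  # cheap in-process increment, no network I/O on the hot path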
In my personal opinion, pull is overall better than push but only very slightly. Each have their own problems you'll hit as you scale, but those problems can be engineered around in both cases.
At this point Prometheus is pretty close to becoming the boring technology. The latest versions have finally brought in the plumbing and tuning knobs to protect against [most] overly expensive queries. So you can't easily take it down anymore.
The single-binary approach is still a problem, though. In my mind any serious telemetry collection stack should separate the query engine and ingestion path from each other - Prometheus has both the query interface and the ingestion/writing subsystem in the same process.[ß]
As for the parent poster: you certainly want to push telemetry out on every event, but the mechanism has to be VERY lightweight. With prometheus the solution is to have a telemetry collection/aggregation agent on the host, feed it with the event data and have prometheus scrape the agent. Statsd with the KV extension is a great protocol for shoveling the telemetry out from the process and into the agent.
ß: you can get around this with Thanos + Trickster to take care of the read path only, but it's quite a bit more complex than plain Prometheus.
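To illustrate how lightweight that per-event push can be, here's a bare-bones sketch of the plain statsd counter line over UDP (the agent address and metric name are placeholders, and a real client library would add batching and sampling on top):

    # One small fire-and-forget UDP datagram per event, so the hot path never
    # blocks on the monitoring agent. Agent address and metric name are placeholders.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    AGENT = ("127.0.0.1", 8125)  # local telemetry agent, e.g. statsd or telegraf

    def incr(metric: str, value: int = 1) -> None:
        # statsd counter line format: <name>:<value>|c
        sock.sendto(f"{metric}:{value}|c".encode(), AGENT)

    incr("myapp.events_processed")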
There have been a lot of articles posted recently about the 'old' web, and while I like the concept I still have a hard time finding quality information in many of the directories and webrings posted. The level of research and density of information in this blog is very good.
I'm not sure I understood the solution there. Storing only 0.1%-0.01% of the interesting metrics data makes sense in the same way that you'd take a poll of a small fraction of the population to make guesses about the whole?
I believe he means they are storing all the data, but for a subset of types of data, sorta like extracting just a few columns out of a big table. Presumably someone on some other team gets use out of having access to the 99.9% of data stored that is not relevant to "performance or capacity related queries."
You lose a lot of the benefits, but you can still take advantage of time range partition elimination just as long as your data is still physically partitioned by the timestamp column. That's the most important thing when processing time series data: never read from disk any of the data that's outside the time range you actually need for your query.
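As a hypothetical PySpark sketch of that (paths and column names are made up): write the data partitioned by a date column derived from the timestamp, then always filter on that column so whole directories get skipped.

    # Data is physically laid out by date, so a filter on that column lets the
    # reader skip every partition outside the requested window.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("metrics-partitioning").getOrCreate()

    raw = spark.read.json("s3://example-bucket/raw-metrics/")  # hypothetical source

    (raw.withColumn("date", F.to_date("timestamp"))
        .write.partitionBy("date")
        .parquet("s3://example-bucket/metrics-by-date/"))

    # Queries filtering on the partition column only touch matching directories.
    week = (spark.read.parquet("s3://example-bucket/metrics-by-date/")
            .filter(F.col("date").between("2020-06-01", "2020-06-07")))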
Yes and no. We already know there were too many named metrics to give each its own column on the system they were using (parquet on a data lake), so what are they left with?
Does a column store like parquet make a good time series DB? Trendy named time series databases I’ve had the displeasure of using would all fail miserably with high-cardinality series too, so I’m not convinced there is actually a better thing than files on a lake for this stuff.
So, use some format to name the metric in each row. If parquet, use dictionary encoding on that column and sort or cluster the rows ... that will give you min/max pruning etc.
But Presto is currently 5x slower to chew through parquet vs ORC, so perhaps simply use ORC. Or, for this data, Avro or JSON Lines.
And then when you’ve used presto to discover interesting metrics you can always use presto (or scalding or whatever your poison is) to extract the metrics you have identified you want to examine more closely and to put them into separate datasets etc.
I’m just outlining standard approaches to these kinds of problems.
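A rough PySpark sketch of that flow, with made-up paths and metric names: keep the metric name as a row-level column, cluster on it so Parquet's per-column stats can prune effectively, then pull the metrics you've identified into their own dataset.

    # Cluster rows by metric name so each name lands in few row groups, letting
    # dictionary/min-max stats prune reads; then extract an interesting subset.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("metrics-extract").getOrCreate()

    all_metrics = spark.read.parquet("s3://example-bucket/metrics-by-date/")

    (all_metrics.sortWithinPartitions("name")
        .write.mode("overwrite")
        .parquet("s3://example-bucket/metrics-clustered/"))

    # Pull the handful of metrics identified as interesting into a smaller dataset.
    interesting = ["cpu_seconds_total", "gc_pause_seconds"]
    (spark.read.parquet("s3://example-bucket/metrics-clustered/")
        .filter(F.col("name").isin(interesting))
        .write.mode("overwrite")
        .parquet("s3://example-bucket/metrics-interesting/"))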