More

anarazel · 2025-12-15T15:15:23 1765811723

The issue is more fundamental - if you have purely random keys, there's basically no spatial locality for the index data. Which means that for decent performance your entire index needs to be in memory, rather than just recent data. And it means that you have much bigger write amplification, since it's rare that the same index page is modified multiple times close-enough in time to avoid a second write.

anarazel · 2025-12-14T18:32:29 1765737149

> Every time Postgres advice says to “schedule [important maintenance] during low traffic period” (OP) or “outside business hours”, it reinforces my sense that it’s not suitable for performance-sensitive data path on a 24/7/365 service and I’m not sure it really aims to be.

It's a question of resource margins. If you have regular and predictable windows of low resource utilization, you can afford to run closer to the sun during busy periods, deferring (and amortizing, to some degree) maintenance costs till later. If you have a 24/7/365 service, you need considerably higher safety margins.

Also, there's a lot of terrible advice on the internet, if you haven't noticed.

> (To be fair, running it like that for several years and desperately trying to make it work also gave me that feeling. But I’m kind of aghast that necessary operational maintenance still carries these caveats.)

To be fair, I find oxides' continual low-info griping against postgres a bit tedious. There's plenty weaknesses in postgres, but criticizing postgres based on 10+ year old experiences of running an, at the time, outdated postgres, on an outdated OS is just ... not useful? Like, would it useful to criticize oxides lack of production hardware availability in 2021 or so?

Edit: duplicated word removed

dap · 2025-12-14T19:30:09 1765740609

> It's a question of resource margins.

What you describe is true and very important (more margin lets you weather more disruption), but it's not the whole story. The problem we had was queueing delays mainly due to I/O contention. The disks had the extra IOPS for the maintenance operation, but the resulting latency for all operations was higher. This meant overall throughput decreased when the maintenance was going on. The customer, finally accepting the problem, thought: "we'll just build enough extra shards to account for the degradation". But it just doesn't work like that. If the degradation is 30%, and you reduce the steady-state load on the database by 30%, that doesn't change the fact that when the maintenance is ongoing, even if the disks have the IOPS for the extra load, latency goes up. Throughput will still degrade. What they wanted was predictability but we just couldn't give that to them.

> To be fair, I find oxides' continual low-info griping against postgres a bit tedious. There's plenty weaknesses in postgres, but criticizing postgres based on 10+ year old experiences of running an, at the time, outdated postgres, on an outdated OS is just ... not useful?

First, although I work at Oxide, please don't think I speak for Oxide. None of this happened at Oxide. It informed some of the choices we made at Oxide and we've talked about that publicly. I try to remember to include the caveat that this information is very dated (and I made that edit immediately after my initial comment above).

I admit that some of this has been hard for me personally to let go. These issues dominated my professional life for three very stressful years. For most of that time (and several years earlier), the community members we reached out to were very dismissive, saying either these weren't problems, or they were known problems and we were wrong for not avoiding them, etc. And we certainly did make mistakes! But many of those problems were later acknowledged by the community. And many have been improved -- which is great! What remains is me feeling triggered when it feels like users' pain is being casually dismissed.

I'm sorry I let my crankiness slip into the comment above. I try to leave out the emotional baggage. Nonetheless, I do feel like it's a problem that, intentionally or otherwise, a lot of the user base has absorbed the idea that it's okay for necessary database maintenance to significantly degrade performance because folks will have some downtime in which to run it.*

anarazel · 2025-12-15T16:29:46 1765816186

> First, although I work at Oxide, please don't think I speak for Oxide. None of this happened at Oxide. It informed some of the choices we made at Oxide and we've talked about that publicly. I try to remember to include the caveat that this information is very dated (and I made that edit immediately after my initial comment above).

I said oxide, because it's come up so frequently and at such length on the oxide podcast... Without that I probably wouldn't have commented here. It's one thing to comment on bad experiences, but at this point it feels like more like bashing. And I feel like an open source focused company should treat other folks working on open source with a bit more, idk, respect (not quite the right word, but I can't come up with a better one right now).

I probably shouldn't have commented on this here. But I read the message after just having spent a Sunday morning looking into a problem and I guess that made more thin skinned than usual.

> For most of that time (and several years earlier), the community members we reached out to were very dismissive, saying either these weren't problems, or they were known problems and we were wrong for not avoiding them, etc.

I agree that the wider community sometimes has/had the issue of excusing away postgres problems. While I try to avoid doing that, I certainly have fallen prey to that myself.

Leaving fandom like stuff aside, there's an aspect of having been told over and over we're doing xyz wrong and things would never work that way, and succeeding (to some degree) regardless. While ignoring some common wisdom has been advantageous, I think there's also plenty where we just have been high on our own supply.

> What remains is me feeling triggered when it feels like users' pain is being casually dismissed.

Was that done in this thread?

bcantrill · 2025-12-15T17:02:02 1765818122

I don't agree that we have been "bashing" Postgres. As far as I can tell, Postgres has come up a very small number of times over the years: certainly on the CockroachDB episode[0] (where our experience with Postgres is germane, as it was very much guiding our process for finding a database for Oxide) and then again this year when we talked about our use of statemaps on a Rust async issue[1] (where our experience with Postgres was again relevant because it in part motivated the work that we had used to develop the tooling that we again used on the Rust issue).

I (we?) think Postgres is incredibly important, and I think we have properly contextualized our use of it. Moreover, I think it is unfair to simply deny us our significant experience with Postgres because it was not unequivocally positive -- or to dismiss us recounting some really difficult times with the system as "bashing" it. Part of being a consequential system is that people will have experience with it; if one views recounting that experience as showing insufficient "respect" to its developers, it will have the effect of discouraging transparency rather than learning from it.

[0] https://oxide-and-friends.transistor.fm/episodes/whither-coc...

[1] https://oxide-and-friends.transistor.fm/episodes/when-async-...

anarazel · 2025-12-15T17:33:47 1765820027

I'm certainly very biased (having worked on postgres for way too long), so it's entirely plausible that I've over-observed and over-analyzed the criticism, leading to my description.

> I (we?) think Postgres is incredibly important, and I think we have properly contextualized our use of it. Moreover, I think it is unfair to simply deny us our significant experience with Postgres because it was not unequivocally positive -- or to dismiss us recounting some really difficult times with the system as "bashing" it. Part of being a consequential system is that people will have experience with it; if one views recounting that experience as showing insufficient "respect" to its developers, it will have the effect of discouraging transparency rather than learning from it.

I agree that criticism is important and worthwhile! It's helpful though if it's at least somewhat actionable. We can't travel back in time to fix the problems you had in the early 2010s... My experience of the criticism of the last years from the "oxide corner" was that it sometimes felt somewhat unrelated to the context and to today's postgres.

> if one views recounting that experience as showing insufficient "respect" to its developers

I should really have come up with a better word, but I'm still blanking on choosing a really apt word, even though I know it exists. I could try to blame ESL for it, but I can't come up with a good German word for it either... Maybe "goodwill". Basically believing that the other party is trying to do the right thing.

dap · 2025-12-15T19:20:57 1765826457

>> What remains is me feeling triggered when it feels like users' pain is being casually dismissed.

> Was that done in this thread?

Well, I raised a general problem around 24/7/365 use cases (rooted in my operational experience, reinforced by the more-current words that I was replying to and the OP) and you called it "tedious", "low-info griping". Yes, that seems pretty dismissive.

(Is it fair? Though I thought the podcast episodes were fairly specific, they probably glossed over details. They weren't intended to be about those issues per se. I did write a pretty detailed post though: https://www.davepacheco.net/blog/2024/challenges-deploying-p... (Note the prominent caveat at the top about the experience being dated.))

You also wrote:

> running an, at the time, outdated postgres, on an outdated OS

Yes, pointing to the fact that the software is old and the OS is unusual (it was never outdated; it was just not Linux) are common ways to quickly dismiss users' problems. If the problems had been fixed in newer versions, that'd be one thing. Many (if not all) of them hadn't been. But also: the reason we were running an old version was precisely that it was a 24/7/365 service and there was no way to update databases without downtime, especially replicated ones, nor a great way to mitigate risk (e.g., a mode for running the new software without updating the on-disk format so that you can go back if it's a disaster). This should be seen as a signal of the problem, not a reason to dismiss it (as I feel like you're doing here). As for the OS, I can only think of one major issue we hit that was OS-specific. (We did make a major misconfiguration related to the filesystem that certainly made many of our issues much worse.)

I get that it sucks to keep hearing about problems from years ago. All of this was on 9.2 - 9.6 -- certainly ancient today. When this comes up, I try to balance sharing my operational experience with the fact that it's dated by just explaining that it's dated. After all, all experience is dated. Readers can ignore it if they want, do some research, or folks in the PostgreSQL world can update me when specific things are no longer a problem. That's how I learned that the single-threaded WAL receiver had been updated, apparently in part because of our work: https://x.com/MengTangmu/status/1828665449850294518 (full thread: https://x.com/MengTangmu/status/1828665439234474350). I'll happily share these updates wherever I would otherwise share my gripes!

anarazel · 2025-12-14T17:56:19 1765734979

> > When VACUUM runs, it removes those dead tuples and compacts the remaining rows within each page.

> No it doesn’t. It just removes unused line pointers and marks the space as free in the FSM.

It does:

https://github.com/postgres/postgres/blob/b853e644d78d99ef17...

Which is executed as part of vacuum.

supermatt · 2025-12-14T18:19:22 1765736362

I guess I stand corrected - I thought that was the purpose behind VACUUM FULL

anarazel · 2025-12-14T18:41:40 1765737700

VACUUM FULL is about cleanup up things above the level of a single page. Moving stuff around within a page doesn't allow you to reclaim space on the OS level, nor does it "compact" tuples onto fewer pages.

But it's important for normal vacuum to compact the tuples on the page, otherwise the space of deleted tuples couldn't effectively be reused. Imagine a page that's entirely filled with 100 byte tuples, then every other tuple is deleted. Then, after a vacuum, a single 108 byte tuple should be inserted onto the page. Without compacting the space in the page during the vacuum, there would not be any space for that larger tuple.

anarazel · 2025-12-14T17:26:45 1765733205

It's true - otherwise the space couldn't freely be reused, because the gaps for the vacuumed tuples wouldn't allow for any larger tuples to be inserted.

See https://github.com/postgres/postgres/blob/b853e644d78d99ef17...

anarazel · 2025-12-04T15:35:16 1764862516

You can just set fsync=off if you don't want to flush to disk and are ok with corruption in case of a OS/hw level crash.

ffsm8 · 2025-12-05T12:30:45 1764937845

Huh, i always just mounted the data directory as tmpfs/ramdisk. Worked nicely too

anarazel · 2025-11-11T17:05:02 1762880702

That should be trivial to change:

https://github.com/postgres/postgres/blob/d2f24df19b7a42a094...

ngalstyan4 · 2025-11-11T17:14:54 1762881294

They would need to handle all the translation changes as well, no?

<https://github.com/search?q=repo%3Apostgres%2Fpostgres+%22ma...>

I agree code change is simple, but I guess the task is complex for other reasons

anarazel · 2025-11-11T17:23:23 1762881803

If you just make a change of code, you don't need to handle translations at that time. That will get done by the various translation teams closer to the release. However you do need to make sure that the code is translatable (e.g. injecting pre-formulated english messages into a larger message is problematic).

sam_lowry_ · 2025-11-11T17:14:46 1762881286

Think of all the translations )

anarazel · 2025-11-11T17:26:37 1762881997

You don't need to deal with them as a patch author :)

anarazel · 2025-11-06T15:32:06 1762443126

We don't have the complete version history of postgres, so that's not easy to know. There definitely are still lines from Postgres95 that haven't been changed since the initial import into our repository.

Somewhere there's a CVS repository with some history from before the import into the current repository, but unfortunately there's a few years missing between that repository and the initial import. I've not done the work to analyze whether any lines from that historical repo still survive.

anarazel · 2025-10-24T13:58:29 1761314309

The docs for 18 also show it, where do you get from that it's not available for 18?

DicIfTEx · 2025-10-24T22:32:28 1761345148

Ah my mistake, I linked to the docs for `pg_dump` (which has long had the `format` option) rather than `pg_dumpall` (which lacks it).

Before Postgres 18 was released, the docs listed `format` as an option for `pg_dumpall` in the upcoming version 18 (e.g. Wayback Machine from Jun 2025 https://web.archive.org/web/20250624230110/https://www.postg... ). The relevant commit is from Apr 2025 (see link #0 in my original comment). But now all mention has been scrubbed, even from the Devel branch docs.

anarazel · 2025-10-25T07:42:10 1761378130

It got reverted for now: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...

anarazel · 2025-10-24T02:59:11 1761274751

Afaict nothing in this benchmark will actually use AIO in 18. As of 18 there is aio reads for seq scans, bitmap scans, vacuum, and a few other utility commands. But the queries being run should normally be planned as index range scans. We're hoping to the the work for using AIO for index scans into 19, but it could work end up in 20, it's nontrivial.

It's also worth noting that the default for data checksums has changed, with some overhead due to that.

mebcitto · 2025-10-24T04:10:27 1761279027

That explains why `sync` and `worker` have so similar results in almost all runs. The benchmarks from Tomas Vondra (https://vondra.me/posts/tuning-aio-in-postgresql-18/) showed some significant differences.

ozgune · 2025-10-24T13:22:12 1761312132

If the benchmark doesn’t use AIO, why the performance difference between PG 17 and 18 in the blog post (sync, worker, and io_uring)?

Is it because remote storage in the cloud always introduces some variance & the benchmark just picks that up?

For reference, anarazel had a presentation at pgconf.eu yesterday about AIO. anarazel mentioned that remote cloud storage always introduced variance making the benchmark results hard to interpret. His solution was to introduce synthetic latency on local NVMes for benchmarks.

nopurpose · 2025-10-24T08:52:42 1761295962

Then io_uring AIO mode underperformance is even more curious.

anarazel · 2025-10-24T13:38:31 1761313111

It is. I tried to repro it without success.

I wonder if it's just being executed on a different VMs with slightly different performance characteristics. I can't tell based on the formulation in the post whether all the runs for one test are executed on the same VM or not.

anarazel · 2025-09-11T15:00:56 1757602856

This is just wildly wrong.

On an older 2 socket workstation, with relatively poor memory bandwidth, I ran a linux kernel compile.

  perf stat --topdown --td-level 2

indicates that memory bandwidth is not a bottleneck. Fetch latency, branch mispredicts and the frontend are.

I also analyzed the memory bandwidth using

  perf stat --per-socket  -M memory_bandwidth_read,memory_bandwidth_write -a -r0 sleep 1

and it never gets anywhere close to the memory bandwidth the system can trivially utilize (it barely reaches the bandwidth a single core can utilize).

iostat indicates there are pretty much no reads/writes happening on the relevant disks.

Every core is 100% busy.

menaerus · 2025-09-11T15:18:27 1757603907

It is not wildly wrong, be more respectful please since I am speaking from my own experience. Nowhere in my comment have I used Linux kernel as an example. It's not a great example neither since it's mostly trivial to compile in comparison to the projects I had experience with.

Core can be 100% busy but as I see you're a database kernel developer you must surely know that this can be an artifact of a stall in a memory backend of the CPU. I rest my case.

anarazel · 2025-09-11T15:31:20 1757604680

> Nowhere in my comment have I used Linux kernel as an example. It's not a great example neither since it's mostly trivial to compile in comparison to the projects I had experience with.

It's true across a wide range of projects. I build a lot of stuff from source and I routinely look at performance counters and other similar metrics to see what the bottlenecks are (I'm almost clinically impatient).

Building e.g. LLVM, a project with much longer per-translation unit build times, shows that memory bandwidth is even less of a bottleneck. Whereas fetch latency increased as a bottleneck.

> Core can be 100% busy but as I see you're a database kernel developer you must surely know that this can be an artifact of a stall in a memory backend of the CPU. I rest my case.

Hence my reference to doing a topdown analysis with perf. That provides you with a high-level analysis of what the actual bottlenecks are.

Typical compiler work (with typical compiler design) has lots of random memory accesses. Due to access latencies being what they are, that prevents you from actually doing enough memory accesses to reach a particularly high memory bandwidth.

bluGill · 2025-09-11T15:14:53 1757603693

How many cores on that workstation? The claim is you need 40 cores to observe that - very few people have access to such a thing - they exist, but they are expensive.

anarazel · 2025-09-11T15:19:02 1757603942

That workstation has 2x10 cores / 20 threads. I also executed the test on a newer workstation with 2x24 cores with similar results, but I thought the older workstation is more interesting, as the older workstation has a much worse memory bandwidth.

Sorry, but compilation is simply not memory bandwidth bound. There are significant memory latency effects, but bandwidth != latency.

menaerus · 2025-09-11T15:43:24 1757605404

I doubt you can saturate the bandwidth with dual-socket configuration with each having 10 cores. Perhaps if you have very recent cores, which I believe you don't, but Intel design hasn't been that good. What you're also measuring in your experiment, and needs to be taken into account, is the latency across the NUMA nodes which is ridiculously high, 1.5x to 2x to the local node, amounting to usually ~130ns. Because of this, in NUMA configurations, you usually need more (Intel) cores to saturate the bw. I know because I have one sitting at my desk. Memory bandwidth saturation usually begins at ~20 cores with the Intel design that is roughly ~5 year old. I might be off with that number but it's roughly something like that. Other cores if you have them burning the cycles are just sitting there and waiting in the line for the bus to become free.

bluGill · 2025-09-11T19:23:18 1757618598

At 48 cores you are right about at the point where memory bandwidth becomes the limit. I suspect you are over the line, but by so little it is impossible to measure with all the ther noise. Get a larger machine and report back.

anarazel · 2025-09-11T20:19:13 1757621953

On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s.

The system has well over 450GB/s of memory bandwidth.

menaerus · 2025-09-12T10:12:37 1757671957

> On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s

LLVM peak is suspiciously low since building LLVM is heavier than the kernel? Anyway, on my machine, which is dual-socket 2x22-core skylake-x, for pure release build without debug symbols (less memory pressure), I get ~60GB/s.

   # python do_pair_combined.py out_clang_release
   Peak combined memory bandwidth found in block #180:
   S0_write: 8046.8 MB/s
   S0_read: 23098.2 MB/s
   S1_write: 7611.3 MB/s
   S1_read: 21231.3 MB/s
   Total: 59987.6 MB/s

For release build with debug symbols, which is much heavier, and what I normally use during the development, so my experience is probably more biased towards that workload, is >50% larger - ~98GB/s.

  $ python do_pair_combined.py out_clang_relwithdeb
  Peak combined memory bandwidth found in block #601:
  S0_write: 11648.5 MB/s
  S0_read: 17347.9 MB/s
  S1_write: 31686.2 MB/s
  S1_read: 37532.7 MB/s
  Total: 98215.3 MB/s

I repeated the experiment with linux kernel, and I get almost the same figure as you do - ~48GB/s.

  $ python do_pair_combined.py out_kernel 
  Peak combined memory bandwidth found in block #329:
  S0_write: 8963.9 MB/s
  S0_read: 16584.1 MB/s
  S1_write: 7863.4 MB/s
  S1_read: 14371.0 MB/s
  Total: 47782.399999999994 MB/s

Now this was peak accumulated but I was also interested in what is the single highest read/write bw measured. For LLVM/clang release with debug symbols this is what I get ~32GB/s for write bw and ~52GB/s for read bw.

  $ python do_single.py out_clang_relwithdeb
    Peak memory_bandwidth_write: 31686.2 MB/s
    Peak memory_bandwidth_read: 52038.0 MB/s

This is btw very close to what my socket can handle, store bandwidth is ~40GB/s, load bandwidth is ~80GB/s, and combined load-store bandwidth is 65G/s.

So, I think it is not unreasonable to say that there are compiler workloads that can be limited by the memory bandwidth. I for sure worked with heavier codebases even than LLVM, and even though I did not do the measurements back then, the gut feeling I was having is that the bw is consumed. Some translation units would literally stay for few minutes "compiling" but no progress would have been made.

I agree that random access memory patterns and the latency those patterns incur are also a cost that need to be added to this cost function.

My initial comment on this topic was - I don't really believe that the bottleneck in compilation for larger codebases, of course not on _any_ given machine, is on the compute side, and therefore I don't see how modules are going to fix any of this.

gpderetta · 2025-09-12T11:59:34 1757678374

> This is just wildly wrong.

Indeed! Compilation is notorious for being a classing pointer chasing load that is hard to brute force and a good way to benchmark overall single-thread core performance. It is more likely to be memory latency bound than memory bandwidth bound.