The article doesn't seem to discuss one of the most fundamental guarantees of current-day DB-application interaction:
Acknowledged writes must not be lost.
For example, if a user hits "Delete my account", and gets a confirmation "You account was deleted", that answer must be final. It would be bad if the account reappeared afterwards. Similarly, if a user uploads some data, and gets a confirmation (say via HTTP 200), they should be able to assume that the data was durably stored on the other side, and that they can delete it locally.
Most applications make this assumption, and that makes sense: Otherwise you could never know how how much longer a client needs to hold onto the data until being sure that the DB stored it.
This can only be achieved reliably with a server-side network roundtrip on write ("synchronous replication"), because a single machine can fry any time.
The approach presented in the article does not provide this guarantee. It provides low latency by writing to the local SSD, acknowledging the write to the client, and then performing "asynchronous replication" with some delay afterwards. If the server dies after the local SSD write, but before the WAL is shipped, the acknowledged write will be lost. It will still be on the local SSD, but that is not of much use if the server's mainboard is fried (long time to recovery) and another server with old data takes over as the source of truth.
This is why I think it's justified that some other commenters call this approach a "cache" when compared with a multi-AZ DB cluster doing synchronous replication.
The Litestream approach seems to provide roughly the same properties as postgres-on-localhost with async replication turned on. (I also wonder if that would be an interesting implementation of this approach for Fly.io -- it should provide similar microsecond latency while also providing all features that Postgres has.)
As I understand it, Fly.io provides Postgres with synchronous replication (kurt wrote "You can also configure your postgres to use synchronous replication", https://community.fly.io/t/early-look-postgresql-on-fly-we-w...), and https://fly.io/docs/reference/postgres/#high-availability explains that it uses Stolon, which does support synchronous replication if you turn it on. But the "Postgres on Fly" page doesn't seem to explain whether sync or async is the default, and how exactly I can turn on sync mode on Fly.
So I think it would be helpful if the article stated clearly "this is asynchronous replication", thus making clear that it will likely forget acknowledged writes on machine failure, and maybe link to Fly's Postgres offering that provides more guarantees.
Acknowledged writes must not be lost.
For example, if a user hits "Delete my account", and gets a confirmation "You account was deleted", that answer must be final. It would be bad if the account reappeared afterwards. Similarly, if a user uploads some data, and gets a confirmation (say via HTTP 200), they should be able to assume that the data was durably stored on the other side, and that they can delete it locally.
Most applications make this assumption, and that makes sense: Otherwise you could never know how how much longer a client needs to hold onto the data until being sure that the DB stored it.
This can only be achieved reliably with a server-side network roundtrip on write ("synchronous replication"), because a single machine can fry any time.
The approach presented in the article does not provide this guarantee. It provides low latency by writing to the local SSD, acknowledging the write to the client, and then performing "asynchronous replication" with some delay afterwards. If the server dies after the local SSD write, but before the WAL is shipped, the acknowledged write will be lost. It will still be on the local SSD, but that is not of much use if the server's mainboard is fried (long time to recovery) and another server with old data takes over as the source of truth.
This is why I think it's justified that some other commenters call this approach a "cache" when compared with a multi-AZ DB cluster doing synchronous replication.
The Litestream approach seems to provide roughly the same properties as postgres-on-localhost with async replication turned on. (I also wonder if that would be an interesting implementation of this approach for Fly.io -- it should provide similar microsecond latency while also providing all features that Postgres has.)
As I understand it, Fly.io provides Postgres with synchronous replication (kurt wrote "You can also configure your postgres to use synchronous replication", https://community.fly.io/t/early-look-postgresql-on-fly-we-w...), and https://fly.io/docs/reference/postgres/#high-availability explains that it uses Stolon, which does support synchronous replication if you turn it on. But the "Postgres on Fly" page doesn't seem to explain whether sync or async is the default, and how exactly I can turn on sync mode on Fly.
So I think it would be helpful if the article stated clearly "this is asynchronous replication", thus making clear that it will likely forget acknowledged writes on machine failure, and maybe link to Fly's Postgres offering that provides more guarantees.