i did look into this before writing the post. there's a fossil-users mailing list post by Isaac Jurado where he reported that importing Django took ~20 minutes and importing glibc on a 16 GB machine had to be interrupted after a couple of hours. he explicitly warned against trying the linux kernel. the largest documented import on the fossil site itself was NetBSD pkgsrc (~550 MB) which already showed scaling issues. so "never did" is fair - not because anyone tried and failed, but because it was known to be impractical and explicitly discouraged.
Thanks! LWN's development cycle reports are incredible and were actually an inspiration. The goal here wasn't to replace that kind of expert analysis but to show what becomes possible when you can just write SQL against the raw history. Your reports add the context and understanding that no database query can provide.
yeah totally get that. the main blocker was delta compression: sqlite's extension api made custom storage really slow, so i either had to do all the compression on the pgit side (and lose native SQL queryability) or just use postgres, which handles it natively. but an sqlite version isn't off the table for smaller repos where that tradeoff makes more sense.
haha, great that you tried! i've also imported it multiple times now and it does work. but it's huge. the times actually match quite well: i also had around 3 hours, so i'm actually surprised you managed it that fast. i'm currently working on multiple things to improve the import speed and then the analysis speed for the kernel, but that will be something for the next post. stay tuned! as a quick teaser: it imported the 123 GB uncompressed master branch into 2.98 GB of pgit actual data, while git's aggressive repack gets it down to 1.95 GB. but keep in mind, pgit was never meant to beat git on any metric. it really started as a demo XD
Ok, cheers! I occasionally need to investigate older releases and compare them to out-of-tree things, and was thinking pgit might help there. I've set a reminder to check pgit again next time I need to do that sort of thing!
Sounds great! Yeah, i've been working on a 3-layer cache in pg-xpatch, so it's not only an in-memory cache but a little more sophisticated, and hopefully uses less RAM... haha. but it's still not quite what i want.
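just to sketch the general shape of the idea (a generic tiered lookup with demotion between layers, not pg-xpatch's actual internals; all names and sizes here are made up):

```python
# generic tiered-cache sketch: two in-memory LRU layers plus a slow
# fallback (disk / recompute). purely illustrative, not pg-xpatch code.
from collections import OrderedDict

class TieredCache:
    def __init__(self, hot_size, warm_size, load_slow):
        self.hot = OrderedDict()    # layer 1: small, hottest entries
        self.warm = OrderedDict()   # layer 2: bigger, still in memory
        self.hot_size = hot_size
        self.warm_size = warm_size
        self.load_slow = load_slow  # layer 3: read from disk / recompute

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)          # refresh LRU position
            return self.hot[key]
        if key in self.warm:
            value = self.warm.pop(key)         # promote warm -> hot
        else:
            value = self.load_slow(key)        # slowest path
        self._promote(key, value)
        return value

    def _promote(self, key, value):
        self.hot[key] = value
        if len(self.hot) > self.hot_size:
            old_key, old_value = self.hot.popitem(last=False)
            self.warm[old_key] = old_value     # demote instead of dropping
            self.warm.move_to_end(old_key)
            if len(self.warm) > self.warm_size:
                self.warm.popitem(last=False)  # falls back to layer 3
```

the point being that evictions cascade down instead of getting dropped, so the hot layer can stay small without every miss going all the way to disk.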
thanks! FUSE is actually a really cool idea, hadn't thought about that. it would basically let you mount a repo as a filesystem backed by postgres. server-side branches and change sets are interesting too; postgres already handles concurrent access well, so that could work nicely. definitely adding these to the ideas list!
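for anyone curious what that could look like, here's a rough read-only sketch with fusepy and psycopg2. the pgit_files(path, content) table and the flat namespace are made up for the example, not pgit's real schema, and a real version would need caching and proper directory handling:

```python
# minimal read-only FUSE filesystem backed by a postgres table.
# hypothetical schema: pgit_files(path text primary key, content bytea).
import errno
import stat

import psycopg2
from fuse import FUSE, FuseOSError, Operations

class PgFS(Operations):
    def __init__(self, dsn):
        self.db = psycopg2.connect(dsn)

    def _lookup(self, path):
        with self.db.cursor() as cur:
            cur.execute("SELECT content FROM pgit_files WHERE path = %s",
                        (path.lstrip("/"),))
            row = cur.fetchone()
        return None if row is None else bytes(row[0])

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        data = self._lookup(path)
        if data is None:
            raise FuseOSError(errno.ENOENT)
        return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                "st_size": len(data)}

    def readdir(self, path, fh):
        with self.db.cursor() as cur:
            cur.execute("SELECT path FROM pgit_files")
            names = [r[0] for r in cur.fetchall()]
        return [".", ".."] + names

    def read(self, path, size, offset, fh):
        return self._lookup(path)[offset:offset + size]

if __name__ == "__main__":
    FUSE(PgFS("dbname=pgit"), "/mnt/pgit", foreground=True, ro=True)
```

files only get pulled from the database when something actually reads them, which is exactly what would make the instant-clone CI case below work.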
I've already spun up claude to make a POC for this.
I like Gerrit, but the server is such a pain to run (Java plus filesystem state). PG would be the only server-side component required, though you could have an optional review server that would act as a PG client as well.
A FUSE mount would be extremely nice for CI/CD: instant cloning with a local resource cache, which is much harder to do with FS-based git.
The FUSE angle is what got me. Our monorepo takes about 90 seconds just to clone in CI, and most jobs only touch two or three packages. Shallow clones help with history, but you still pull the entire working tree. Something that could mount the tree and fetch files on demand would cut that to almost nothing for most pipeline steps.
good question! the "pgit actual" column tries to compare just the compression algorithms, similar to how the git side only counts the .pack file and not .idx/.rev/.bitmap or filesystem overhead. so both sides strip their "container" overhead to make it a fair comparison.
but you're totally right that in practice the on-disk size is what you actually pay. that's why both numbers are in the table. and yes, pgit on-disk is usually larger than git aggressive. the tradeoff is that you get SQL queryability over your entire history, which git just can't do natively.
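as a toy example of the kind of question that's painful with plain git but trivial here (the schema is hypothetical, just commits(id, author, committed_at) and commit_files(commit_id, path) for the sake of the example):

```python
# which files were touched by the most distinct authors in the last year?
# with git this means scripting over `git log`; here it's one query.
import psycopg2

with psycopg2.connect("dbname=pgit") as conn, conn.cursor() as cur:
    cur.execute("""
        SELECT f.path, count(DISTINCT c.author) AS authors
        FROM commit_files f
        JOIN commits c ON c.id = f.commit_id
        WHERE c.committed_at > now() - interval '1 year'
        GROUP BY f.path
        ORDER BY authors DESC
        LIMIT 10
    """)
    for path, authors in cur.fetchall():
        print(f"{authors:4d}  {path}")
```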
Okay, thanks. I would revise the write-up then. It makes it sound like there's a storage benefit here when there really isn't. The real message might be that it's very close to git's aggressive optimization while also giving you the SQL benefits. I'm also a bit confused by all the write-up on delta compression. That's interesting for the size comparison, but if the real benefit to most users is the SQL features, I'm not sure why there's so much talk of delta compression, which I'm guessing slows things down slightly. I'm assuming you could do all the SQL features without any of the delta compression.
yeah, i get that. sorry if it comes across as too salesy. but keep in mind that pgit was only meant to be a demo of pg-xpatch and wasn't built with beating git in mind; the fact that it's SQL-queryable and comes close to git's compression was a nice side effect. the whole thing really started as a showcase for xpatch's compression and evolved into what it is now. but yes, in theory you could also just store the git history uncompressed, which would actually solve quite a lot of issues i had :)