My years-old M1 MacBook with 16GB of RAM runs them just fine. Several GeForce 40-series cards have at least 16GB of VRAM. MacBook Pros go up to 128GB of RAM and the Mac Studio goes up to 192GB. Running regular CPU inference on lots of system RAM is cheap-ish and not intolerably slow.
These aren't totally common configurations, but they're not totally out of reach like buying an H100 for personal use.
1. I wouldn't consider a Mac Studio ($7,000) a consumer product.
2. Yes, and my MBP M1 Pro can run quantized 34b models. My point was that when you do MoE, memory requirements suddenly become too challenging. A 7b Q8 is roughly 7GB (7b parameters × 8 bits each). But 8 experts of that size would be 56GB, and all of that must be in memory to run.
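To make the arithmetic in point 2 concrete, here is a rough back-of-the-envelope sketch. The function name `weights_gb` is made up for illustration, and it assumes the naive case where every expert is a full, separate copy of the 7b model. It counts weights only and ignores KV cache and runtime overhead; real MoE models may share some layers across experts, which would lower the total.

```python
def weights_gb(params_billions: float, bits_per_param: int, num_experts: int = 1) -> float:
    """Approximate weight memory in GB (1 GB = 10^9 bytes).

    Assumption: each expert is a full, independent copy of the model,
    so the total scales linearly with num_experts.
    """
    bytes_per_param = bits_per_param / 8
    # billions of params * bytes per param = GB of weights
    return params_billions * bytes_per_param * num_experts


single = weights_gb(7, 8)                  # one 7b model at 8 bits per parameter
moe = weights_gb(7, 8, num_experts=8)      # eight such experts, all resident in memory

print(f"7b Q8:      {single:.0f} GB")
print(f"8 x 7b Q8:  {moe:.0f} GB")
```

Swapping in `bits_per_param=4` gives an idea of how far a more aggressive quantization would shrink the same setup.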