Yeah, I've definitely switched largely away from Flux. Much as I do like Flux (for prompt adherence), BFL's baffling licensing structure along with its excessive censorship makes it a non-starter.
For ref, the porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled by a Qwen 20B + ZiT refiner workflow, which even with two separate models STILL runs faster than Flux.2 [dev].
If Z-Image Base is easy to train and works as well with multiple LoRAs as SDXL does, quite probably, yes, at least as the center of mass for new community efforts.
> Is Flux 1/2/Kontext left in the dust by the Z Image and Qwen combo?
Flux.2 looks like it may be the strongest of the weights-available editing models, but Qwen is strong here, too (and it remains to be seen whether the much lighter Z-Image Edit is a powerhouse here as well). But for most local generation tasks, it's probably hard to justify the weight of Flux.2 (even though recent improvements in ComfyUI, with assistance from Nvidia, make quantized versions usable on modest consumer systems, it's still big and slow). Add BFL's restrictive licensing on top of the difference in resource requirements for training, and I don't imagine you'll see nearly as much community LoRA and finetuning work for Flux.2 as for Z-Image, so the practical difference is likely to be narrower than the difference in base model quality.
Most of the people I know doing local AI prefer SDXL to Flux. Lots of people are still using SDXL, even today.
Flux has largely been met with a collective yawn.
The only things Flux had going for it were photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine-tune, and the licensing was odd. Furthermore, Flux never had good aesthetics; it always felt plain.
Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.
Yep. It's pretty difficult to fine-tune, mostly because it's a distilled model. You can fine-tune it a little bit, but it will quickly collapse and start producing garbage, even though fundamentally it should have been an easier architecture to fine-tune than SDXL (since it uses the much more modern flow matching paradigm).
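For context, here's roughly what a flow matching objective looks like, as a minimal PyTorch-style sketch (the model signature and shapes here are illustrative assumptions, not Flux's actual interface):

    import torch
    import torch.nn.functional as F

    def flow_matching_loss(model, x0, cond):
        # Rectified-flow style flow matching: draw noise, pick a random time t,
        # and interpolate linearly between the data x0 and the noise.
        noise = torch.randn_like(x0)
        t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
        xt = (1 - t) * x0 + t * noise
        # The target is the constant velocity of that straight path.
        target = noise - x0
        # The model is trained to predict that velocity at (xt, t).
        pred = model(xt, t.flatten(), cond)
        return F.mse_loss(pred, target)

The point is just that the objective itself is simple and well understood; the fine-tuning trouble comes from the distillation, not the architecture.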
I think that's probably the reason why we never really got any good anime Flux models (at least not as good as the ones for SDXL). You just don't have enough leeway to train the model long enough to make it great at a domain it's currently suboptimal for without completely collapsing it.
So what this does is: you run the model once with the negative prompt (which can be empty) to get a "starting point" for the prediction, then you run the model again with the positive prompt to get the direction you want to go in, and then you combine the two.
So, for example, let's assume your positive prompt is "dog" and your negative prompt is empty. Running the model with the empty prompt will generate a "neutral" latent, and then you nudge it in the direction of your positive prompt, in the direction of a "dog". And you do this for 20 steps, and you get an image of a dog.
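That combine step is classifier-free guidance. A minimal sketch of a single denoising step (illustrative pseudocode; the model call signature is an assumption, not any particular library's API):

    def cfg_step(model, latent, t, positive_emb, negative_emb, guidance_scale=7.5):
        # Prediction from the (possibly empty) negative prompt: the "starting point".
        pred_neg = model(latent, t, negative_emb)
        # Prediction from the positive prompt: the direction we want to go in.
        pred_pos = model(latent, t, positive_emb)
        # Nudge the starting point along the (positive - negative) direction;
        # guidance_scale controls how hard we push.
        return pred_neg + guidance_scale * (pred_pos - pred_neg)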
The guidance here was distilled into the model: it was trained so that a single forward pass approximates that combined output. That's cheaper to do inference with, but now we can't really train the model very much without destroying this embedded guidance (the model will just forget it and collapse).
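Roughly, inference with the distilled model looks more like this instead (again just a sketch; the guidance-as-an-input interface is an assumption for illustration):

    def distilled_step(model, latent, t, positive_emb, guidance_scale=3.5):
        # A guidance-distilled model takes the desired guidance strength as an
        # extra conditioning input and approximates the CFG-combined prediction
        # in a single forward pass, with no separate negative-prompt pass.
        return model(latent, t, positive_emb, guidance=guidance_scale)

One model call per step instead of two, but the "combine" behaviour now lives inside the weights, which is exactly what heavy fine-tuning erodes.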
There's also the issue of training dynamics. We don't know exactly how they trained their models, so it's impossible for us to jerry-rig our training runs in a similar way. And if you don't match the original training dynamics when fine-tuning, that also negatively affects the model.
So you might ask here: what if we just train the model for a really long time, will it be able to recover? And the answer is yes, but at that point most of the original model will essentially be overwritten. People actually did this for Flux Schnell, but you need way more resources to pull it off and the results can be disappointing: https://huggingface.co/lodestones/Chroma
Thanks for the extended reply, very illuminating. So the core issue is how they distilled it, i.e. that they "baked in the offset", so to speak.
I did try Chroma and I was quite disappointed, what I got out looked nowhere near as good as what was advertised. Now I have a better understanding why.
> How much would it cost the community to pretrain something with a more modern architecture?
Quite a lot. Search for "Chroma" (which was a partial-ish retraining of Flux Schnell) or Pony (which was a partial-ish retraining of SDXL). You're probably looking at a cost of at least tens of thousands or even hundreds of thousands of dollars. Even bigger SDXL community finetunes like bigASP cost thousands.
And it's not only the compute that's the issue. You also need a ton of data. You need a big dataset, with millions of images, and you need it cleaned, filtered, and labeled.
And of course you need someone who knows what they're doing. Training these state-of-the-art models takes quite a bit of skill, especially since a lot of it is pretty much a black art.
> Search for "Chroma" (which was a partial-ish retraining of Flux Schnell)
Chroma is not simply a "partial-ish" retraining of Schnell; it's a retraining of Schnell after rearchitecting part of the model (replacing a 3.3B-parameter portion of it with a 250M-parameter replacement with a different architecture).
> You're probably looking at a cost of at least tens of thousands or even hundred of thousands of dollars.
For reference here, Chroma involved 105,000 hours of H100 GPU time [0]. Doing a quick search, $2/hr seems to be about the low end of H100 pricing, so hundreds of thousands of dollars seems right for that model, and a base model trained from scratch would probably cost even more.
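Back of the envelope, with that $2/hr figure as the assumption:

    h100_hours = 105_000        # Chroma's reported H100 time [0]
    usd_per_hour = 2.00         # rough low-end H100 rental rate (assumption)
    print(f"~${h100_hours * usd_per_hour:,.0f}")  # ~$210,000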