As binary formats go, UTF-8 is extremely tame. Some of the complexities that binary formats love to throw at you:
* Things may be non-byte-aligned bitstreams.
* Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* Fields that may be optional if some parent of the current record has some weird value.
* Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
and so on.
File formats like ELF (supporting ELF32, ELF64, and both little-endian and big-endian, all in a single format definition) or Java class files (long and double entries in the constant pool take up two slots, not one) are a better guideline for how powerful the format is in handling weirder idiosyncracies.
I found their ELF format specification to have a decent coverage, even if not completely exhaustive (e.g. some debug info isn't breakdown after a certain point, but it just might be incomplete rather than limitations).
There are ID3 tags used for MP3 and other files. Old players may not know about them and may misread their data as MPEG frames. To prevent this a tag may break up sequences of 00 bytes with an FF byte. Or may not, because now most players are aware of tags. So there is a preference, at two levels, default and for a single tag. Not too hard to program, but rather unfriendly to a grammar-based approach.
* Things may be non-byte-aligned bitstreams.
* Arrays of structures that go "read until id is 5, but if id is 5, nothing else of the structure is emitted."
* Fields that may be optional if some parent of the current record has some weird value.
* Files may be composed of records at arbitrary, random offsets that essentially require seeking to make any sense of it.
* The metadata of your structure may depend on some early parameter (for example, is this field big-endian or little-endian?)
and so on.
File formats like ELF (supporting ELF32, ELF64, and both little-endian and big-endian, all in a single format definition) or Java class files (long and double entries in the constant pool take up two slots, not one) are a better guideline for how powerful the format is in handling weirder idiosyncracies.