I get your point about the surprising news that gpt4 does so poorly at translation. I didn’t know that!
However, I think the idea is that LLM technologies have improved considerably since then. Do you still feel that Claude or ChatGPT perform worse than DeepL? It would be really nice to have an objective benchmark for comparison
Well there's this[0] but unfortunately I don't see any results and I'm not about to put in the time to run it myself.
Responding to GP: I won't dispute that LLMs aren't optimized for translation, but I would generally expect them to perform quite well at it, given that handling multiple languages seems to come with the territory of responding to natural language prompts at all.
> massive, overfit generative models are awful at translation
> part-of-speech classification + dictionary lookup + grammar mapping – an incredibly simplistic system with performance measurable in chapters per second – does a better job
Those are two distinct claims you made and I'm not inclined to accept either of them without evidence given how unexpected both of them would be from my perspective.
Transformer models are good at translation, given appropriate training data (i.e., "this text but in multiple languages") – though you still have to watch out for them translating "English" as "Deutsch", "Français", etc. Asking them to do repeated next token prediction isn't asking them to translate, though, especially not after the RLHF passes that OpenAI does to their ChatGPT models. When you test it, you get exactly the failure modes you'd expect: translations that start off okay, but go off on tangents; translations that "correct" the original text, so the translations aren't faithful; attempts (usually successful) to cover up a gap in "knowledge" that prevents the model from translating correctly.
When considering these failure modes, which have come up every single time I've seen ChatGPT used for translation, it's clear that the simplistic system I described would work better. It'll output gibberish often (just as LibreTranslate does when given decontextualised Chinese and asked to translate to English), but that's better than a GPT model, which will just confabulate something in the same circumstance. The goal isn't "maximise the amount of successful translatedness": it's "reduce the language barrier as much as possible", something the benchmarks don't test.
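For concreteness, the word-by-word pipeline being described (part-of-speech tagging, dictionary lookup, grammar mapping) can be sketched in a few lines. The lexicon, tag set, and "grammar rule" below are invented for illustration; a real system like Apertium is far more elaborate. Note how the sketch exhibits exactly the trade-off claimed above: its output can be ungrammatical, and unknown words surface as visible gaps rather than fluent confabulations.

```python
# Toy sketch of a rule-based MT pipeline: tag each word, look it up,
# then apply a (here trivial) grammar mapping. All data is made up.

LEXICON = {
    # word: (part of speech, German gloss)
    "the": ("DET", "der"),
    "cat": ("NOUN", "Katze"),
    "sees": ("VERB", "sieht"),
    "a": ("DET", "eine"),
    "mouse": ("NOUN", "Maus"),
}

def translate(sentence: str) -> str:
    tagged = []
    for word in sentence.lower().split():
        # Unknown words become a visible gap instead of a plausible guess.
        pos, gloss = LEXICON.get(word, ("UNK", f"<{word}?>"))
        tagged.append((pos, gloss))
    # "Grammar mapping" would reorder constituents here; for an SVO -> SVO
    # pair this toy rule is the identity, so we just emit the glosses.
    return " ".join(gloss for _, gloss in tagged)

print(translate("the cat sees a mouse"))    # -> "der Katze sieht eine Maus"
print(translate("the cat sees a unicorn"))  # unknown word shows up as "<unicorn?>"
```

The first output is wrong German ("der Katze" mangles gender and case), but the error is glaring, and the second output openly marks what the system doesn't know; contrast that with a generative model, which would produce something fluent either way.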
It's not surprising news. It's obvious news. When I was making those claims, I hadn't tested it: I was making purely theoretical arguments, based on having glanced at the relevant papers in the GPT-2 days.
That GPT models are bad at translation, and will always be bad at translation (while, perhaps, "improving" where they're overfit on specific benchmarks), is obvious to anyone with even a cursory understanding of how they work.