What about deleting vision layers (e.g. the "multi_modal_projector" and the "vision_tower.vision_model" layers, assuming I go with Gemma 3), since I need just language generation? Would that also be considered a "kick in the balls", or a useful trimming?
Should be safe to do, as long as none of that is load-bearing. If it's the usual naive "massage the image into a hundred tokens and throw that into the context" vision implementation, nothing bad would happen from removing or just freezing those layers.
I've seen "cut off unused vision inputs" done for older multimodals, just not the newer Gemma 3.
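If you go the checkpoint-surgery route, the trimming itself can be as simple as dropping tensors by key prefix. A minimal sketch, assuming the Gemma 3 prefixes named above ("vision_tower.", "multi_modal_projector.") — verify them against your actual checkpoint's keys first:

```python
# Sketch: trim vision weights from a multimodal checkpoint's state dict.
# The key prefixes are assumptions based on Gemma 3's reported module
# names; check them against the keys in your downloaded checkpoint.

VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def trim_vision(state_dict):
    """Return a copy of the state dict without vision-related tensors."""
    return {k: v for k, v in state_dict.items()
            if not k.startswith(VISION_PREFIXES)}

# Toy demo with dummy "tensors":
sd = {
    "language_model.embed_tokens.weight": [0.1],
    "vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight": [0.2],
    "multi_modal_projector.mm_input_projection_weight": [0.3],
}
trimmed = trim_vision(sd)  # only the language_model.* keys survive
```

When reloading the trimmed weights into a text-only model class, expect to load non-strictly so missing vision keys are tolerated.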
The language is Hasidic Yiddish (which is by now different enough from YIVO Yiddish to almost be considered a different language). The amount of (all kinds of) Yiddish included in pre-training is probably very little, but not nothing. Also, it's a Germanic language written in Hebrew script, with Hebrew roots and some Slavic roots and suffixes. Most concepts and structure are probably not *very* foreign to a good model.
As I wrote in another comment, I have thought about initializing the new embeddings based on equivalent tokens in the old ones (e.g. by translating a token to English and finding the closest old token), but I'm starting to rethink the feasibility.
I will probably get more text sometime in the future, but I have to build the first version now.
Not an answer to your original question, but I think you’d be surprised how much high-quality historical linguistic content was hiding in the dusty old corners of the internet. I’ve been doing some work recently with LLMs on historical languages (various forms of Latin, Ancient Greek and medieval European languages) and the out-of-the-box performance of state-of-the-art LLMs is shockingly good. It isn’t that surprising when you remember all those archive digitization projects that took place in the early 00s but ended up either as stale links, preserved only by archive.org, or stored in arcane CRMs essentially unusable by humans. I assume the same is especially true for various historical Yiddish corpora.
I ran some tests and, without fine-tuning, GPT can translate medieval German, for example, considerably better than well-known scholars today.
Why would you throw out the original embedding layer? That seems like a step backwards to me. It was likely trained at least partly on Yiddish, and without it you're throwing away a lot of the information encoded in the rest of the model.
I strongly suspect you’re overvaluing how far Hasidic Yiddish has drifted, and that fine-tuning an existing model as a dialect will work just fine, particularly given that the languages the different loan words are from will be present in such a model, and that you’re going to a dialect with a simpler grammar.
There are plenty of guides online for fine-tuning for dialects. 2GB still isn’t a huge amount of data, but it seems like it would definitely be worth a concerted try (including fiddling with it a bit) given how expensive training from scratch is.
Perhaps. But I don't think there is an existing (open weights) model that really knows YIVO Yiddish, either, so what should I base this fine-tuning on?
You might be able to start with German, since German-Yiddish cognates tend to have fairly regular spelling correspondences (not exactly one-to-one, but often few-to-one).
So given a Latin-script token from a model that does OK in German (bonus points if it also does Hebrew), generate several candidate Hebrew-script tokens with some regex search-and-replace, then use the resulting vocabulary to tokenize your Yiddish corpus and for each original token keep the candidate replacement that was used most often in the tokenization.
This vocabulary replacement should give you a model that does OK in German-in-Hebrew-script. I think that would be a better base for a Yiddish model than training from scratch, but of course that's just a hunch that might turn out to be wrong.
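A toy sketch of that candidate-generation-and-selection step. The correspondence rules here are made-up placeholders, not a real German-to-Yiddish orthography, and raw substring counting stands in for "tokenize the corpus and count":

```python
from collections import Counter
from itertools import product

# Placeholder correspondence rules (Latin chunk -> possible Hebrew-script
# renderings). Illustrative only, not a real orthography mapping.
RULES = {
    "sch": ["ש"],
    "ei": ["ײ"],
    "w": ["וו"],
    "a": ["אַ"],
    "s": ["ס", "ז"],   # ambiguous rule: generates several candidates
}

def segment(token):
    """Split a Latin token into rule-sized chunks, longest match first."""
    keys = sorted(RULES, key=len, reverse=True)
    out, i = [], 0
    while i < len(token):
        for k in keys:
            if token.startswith(k, i):
                out.append(RULES[k])
                i += len(k)
                break
        else:
            out.append([token[i]])   # no rule: keep the character as-is
            i += 1
    return out

def candidates(token):
    """All Hebrew-script spellings reachable via the rules."""
    return {"".join(p) for p in product(*segment(token))}

def best_candidate(token, corpus):
    """Keep the candidate that actually occurs most often in the corpus."""
    counts = Counter({c: corpus.count(c) for c in candidates(token)})
    return counts.most_common(1)[0][0]
```

Picking by corpus frequency means you keep whichever spelling the community actually writes, rather than trusting the rules blindly.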
Qwen3 lists Eastern Yiddish (presumably YIVO) as one of the 119 training languages. It’s available at various sizes including rather small ones to experiment with cheaply, and has good documentation for suggested fine-tuning pipelines. I’d start with that.
For a similar project, I worked with GPT to create an extensive dataset of translations from a historical language. I could then use this both to evaluate the baseline capability of other models in the language (i.e. giving the model the task of translating the various passages and evaluating the results with GPT) and for fine-tuning.
Thank you!
I have thought about initializing the new embeddings based on equivalent tokens in the old ones (e.g. by translating a token to English and finding the closest old token), but this is all getting convoluted.
A new tokenizer and embeddings will probably be required anyway, since the language is practically missing from any model worth playing with. But at that point, is simply creating a small specialized model from scratch perhaps a better bet than trying to graft all this onto a big existing model?
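The embedding-initialization idea can be sketched with toy numbers. A real version would use the old model's actual embedding matrix, and the translation step (new token → English gloss → a vector in the old embedding space) would happen upstream; here that vector is just given:

```python
import math

# Toy sketch: initialize a new token's embedding from the old-model
# embedding of its nearest (cosine) neighbor. The tiny matrix below is a
# stand-in for the real embedding table.

old_emb = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_old_token(gloss_vec, emb):
    """Index of the old embedding row most cosine-similar to gloss_vec."""
    return max(range(len(emb)), key=lambda i: cosine(gloss_vec, emb[i]))

# A new Yiddish token whose English gloss embeds near row 2 gets
# initialized from that row:
new_row = old_emb[nearest_old_token([0.6, 0.8, 0.0], old_emb)][:]
```

A common refinement is to average the top-k nearest old rows instead of copying a single one, which smooths over bad translations.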
Here’s a quick sanity check before you embark on the real thing:
- Tokenize your entire corpus with a few off-the-shelf multilingual tokenizers like Llama, Qwen and Gemma and calculate the ratio of letters to tokens. The higher the better, ideally in the 3-5 range
- Manually produce or select sentences that are similar in meaning but not in writing (embedding models also leverage graphemic overlap, not just semantic similarity), and then check whether similar sentences show consistently higher cosine similarity than dissimilar ones. This applies to embedding models like XLM-RoBERTa rather than LLMs, but it has similar insight potential.
If both of these tests are promising then you likely don’t need custom implementations for these.
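The first check boils down to a few lines, for any tokenizer exposed as a callable (e.g. a Hugging Face tokenizer's `.tokenize` method):

```python
# Mean characters per token over a corpus. Roughly 3-5 chars/token
# suggests the vocab handles the script reasonably; a ratio near 1 means
# the tokenizer is falling back to per-character or per-byte pieces.

def chars_per_token(texts, tokenize):
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_chars / n_tokens

# Demo with a whitespace "tokenizer" as a stand-in for a real one:
corpus = ["דאָס איז אַ פּראָבע", "נאָך אַ שורה טעקסט"]
ratio = chars_per_token(corpus, str.split)
```

Run the same corpus through each candidate tokenizer and compare the ratios directly.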
Personally if I was you, I would just take Qwen 0.6B Base (not instruct, since you want text completion) and continue pretraining it on the data that you have. It is very likely to work decently out of the box.
Yes. There's a scoring and filtering pipeline first, which tries to automatically check the quality of each correction using a custom multilingual embedding model, madmon[0], and a language identification model, gherbal[1]. Above a certain similarity threshold the correction goes into the training dataset; below it, it's flagged for human review. This is mostly to stave off trolls or blatant mistakes.
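The accept/review split could look roughly like this — a generic sketch with a crude stand-in embedding, not the actual madmon/gherbal interface:

```python
import math

# Generic accept/review router: embed source and correction, accept pairs
# above a similarity threshold, flag the rest for human review. The
# threshold and the embedding function are placeholders.

THRESHOLD = 0.85

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def route(pairs, embed, threshold=THRESHOLD):
    accepted, review = [], []
    for src, corr in pairs:
        bucket = accepted if cosine(embed(src), embed(corr)) >= threshold else review
        bucket.append((src, corr))
    return accepted, review

# Demo with a crude bag-of-characters "embedding" (purely illustrative):
demo_embed = lambda s: [s.count(c) for c in "גושלעכט"]
accepted, review = route([("גוט", "גוט"), ("גוט", "שלעכט")], demo_embed)
```

The threshold itself is worth calibrating on a handful of known-good and known-troll corrections before trusting it.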
For the continual training itself, yes, I simply continue training the model from the last checkpoint (cosine LR scheduler). I am considering doing a full retraining at some point, once I collect enough data to compare against this progressive training.
Apologies for the poor links, it takes a lot of time to work on this let alone fully document everything.