faxipay349's comments | Hacker News

Yeah, the standard SPLADE model trained from BERT already has a vocabulary (and hence sparse vector) size of 30,522. If the SPLADE model is based on a multilingual version of BERT, such as mBERT or XLM-R, the vocabulary grows to roughly 120,000 or 250,000 respectively, and the vector size grows with it.
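If you want to sanity-check that, the sparse dimensionality is literally just the tokenizer's vocabulary size. A minimal sketch, assuming the usual Hugging Face checkpoints (named here purely as examples):

    # The sparse vector dimensionality of a SPLADE-style model is just
    # its tokenizer's vocabulary size. Checkpoint names are examples.
    from transformers import AutoTokenizer

    for name in ["bert-base-uncased",            # monolingual BERT
                 "bert-base-multilingual-cased", # mBERT
                 "xlm-roberta-base"]:            # XLM-R
        tok = AutoTokenizer.from_pretrained(name)
        print(name, tok.vocab_size)
    # prints 30522, 119547 and 250002 respectively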


I just came across an evaluation of state-of-the-art SPLADE models: https://github.com/frinkleko/LIMIT-Sparse-Embedding. They use BERT's vocabulary size as their sparse vector dimensionality and do capture semantics, and as expected they significantly outperform all dense models on this benchmark. The OpenSearch team also seems to have been working on inference-free versions of these models: like BM25, they only encode documents offline, so no model inference is needed at query time. So now we have sparse models that are small and efficient while being much better than dense ones, at least on LIMIT.
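For anyone wondering what "inference-free" means in practice, here is a rough sketch under my own assumptions (the checkpoint below is a public SPLADE model, not necessarily what OpenSearch ships, and encode_doc/score are hypothetical helper names): documents are expanded into vocab-sized sparse vectors offline, while a query is only tokenized and its token ids looked up, BM25-style.

    # Hedged sketch of document-only ("inference-free") sparse retrieval.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    name = "naver/splade-cocondenser-ensembledistil"  # example SPLADE checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name).eval()

    def encode_doc(text):
        # Offline step: one |V|-dimensional sparse weight vector per document,
        # using the standard SPLADE pooling max_i log(1 + relu(logits_i)).
        inputs = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits                      # [1, seq, vocab]
        return torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)

    def score(query, doc_vec):
        # Online step: no forward pass, just sum the document's weights
        # at the query's token ids (same flavour of lookup as BM25).
        q_ids = tok(query, add_special_tokens=False)["input_ids"]
        return doc_vec[q_ids].sum().item()

    doc_vec = encode_doc("SPLADE produces sparse lexical vectors over BERT's vocabulary.")
    print(score("sparse lexical retrieval", doc_vec))

Real inference-free variants train the document expansion so that this trivially cheap query side still scores well; the sketch only shows where the cost moves.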

