
I wrote it up. The full system reference is here: https://blakecrosley.com/guides/obsidian — vault architecture, hybrid retrieval (Model2Vec + FTS5 + RRF), MCP integration, incremental indexing, operational patterns. Covers everything from a 200-file vault to the 16,000-file setup I run.

The hybrid retriever piece has its own deep dive with the RRF math and an interactive fusion calculator: https://blakecrosley.com/blog/hybrid-retriever-obsidian
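For readers who want the gist without the calculator: RRF fuses ranked lists by scoring each document as the sum of 1/(k + rank) over every list it appears in. A minimal sketch (the doc ids and k=60 default are illustrative, not from the guide):

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch: fuse two ranked lists,
# e.g. one from vector search and one from FTS5/BM25.
def rrf_fuse(ranked_lists, k=60):
    """score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]
fts_hits = ["b", "d", "a"]
print(rrf_fuse([vector_hits, fts_hits]))  # -> ['b', 'a', 'd', 'c']
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem of calibrating cosine similarities against BM25 scores.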

See what your coding agent thinks of it and let me know if you have ways to improve it.



I implemented this as well, successfully. Re structured data: I transformed it from JSON into more "natural language". I also ended up using MiniLM-L6-v2. Will post a GitHub link once I've packaged it independently (it currently lives in the main app code; I want to extract it into a standalone micro-service).

You wrote:

>A search for “review configuration” matches every JSON file with a review key.

It's a good point; I'm not sure how to de-rank the keys or encode the "commonness" of those words.


IDF handles most of it. In BM25, inverse document frequency naturally down-weights terms that appear in every document, so JSON keys like "id", "status", "type" that show up in every chunk get low IDF scores automatically. The rare, meaningful keys still rank.
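A toy illustration of that effect, using the standard BM25 IDF formula with +0.5 smoothing (the documents here are made up; only the formula is standard):

```python
import math

def idf(term, docs):
    """BM25-style IDF with +0.5 smoothing, floored at 0 so ubiquitous
    terms contribute nothing rather than a negative score."""
    n = sum(1 for d in docs if term in d)
    return max(0.0, math.log((len(docs) - n + 0.5) / (n + 0.5)))

# Each "document" is the set of terms extracted from one JSON chunk.
docs = [
    {"id", "status", "review", "configuration"},
    {"id", "status", "type"},
    {"id", "status", "payment"},
]
print(idf("id", docs))      # in every doc -> 0.0 after the floor
print(idf("review", docs))  # rare -> positive weight, still ranks
```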

For the remaining noise, I chunk the flattened key-paths separately from the values. The key-path goes into a metadata field that BM25 indexes but with lower weight. The value goes into the main content field. So a search for "review configuration" matches on the value side, not because "configuration" appeared as a JSON key in 500 files.
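A hypothetical sketch of that split (the function name and sample document are mine, not from the guide): flatten nested JSON into dotted key-paths for the low-weight metadata field, and collect leaf values for the main content field.

```python
def split_fields(obj, prefix=""):
    """Flatten a JSON object into (key_paths, values) so the two can be
    indexed as separate BM25 fields with different weights."""
    keys, values = [], []
    for k, v in obj.items():
        path = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            sub_keys, sub_values = split_fields(v, path)
            keys.extend(sub_keys)
            values.extend(sub_values)
        else:
            keys.append(path)
            values.append(str(v))
    return keys, values

doc = {"review": {"status": "approved", "notes": "configuration looks good"}}
key_field, content_field = split_fields(doc)
print(key_field)      # ['review.status', 'review.notes']
print(content_field)  # ['approved', 'configuration looks good']
```

With FTS5 specifically, the per-column weighting can then be done at query time via the bm25() auxiliary function, which accepts one weight argument per column.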

MiniLM-L6-v2 is solid. I went with Model2Vec (potion-base-8M) for the speed tradeoff: roughly 50-500x faster on CPU at about 89% of MiniLM's MTEB quality. For a microservice where you're embedding on every request, the latency difference matters more than the quality gap.
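The reason static models are that much faster: encoding is just a table lookup plus a mean over precomputed token vectors, with no transformer forward pass. A toy sketch of the idea (the two-dimensional token vectors here are invented for illustration; real Model2Vec vectors are distilled from a transformer):

```python
import numpy as np

# Made-up static token table; a real model would have ~30k entries of
# a few hundred dimensions each.
token_vectors = {
    "review": np.array([0.9, 0.1]),
    "configuration": np.array([0.2, 0.8]),
}

def embed(text):
    """Static embedding: look up each token, average the vectors."""
    vecs = [token_vectors[t] for t in text.lower().split() if t in token_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

print(embed("review configuration"))  # -> [0.55 0.45]
```

That constant-time-per-token shape is why the CPU latency gap versus running a 6-layer transformer is so large.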


Thank you!


