I am working on yet another news aggregator: newsmuncher.com
So far I've got the scraping and embeddings / similarity clustering down (to build timelines of news stories); lots of data cleaning and UI refinement is still required. I find it hard to make choices on my own, so maybe I need a cofounder to pair up with. I'm looking to either monetize the news data or build a news analysis / intelligence platform.
May I ask what techniques you're using, or would recommend, for similarity clustering? I looked into topic modeling, but it seemed a long way off from reliably bundling together stories the way Techmeme does.
(I'm working on basic blog and video aggregators like Planet Python.)
For similarity it's important to consider the dimensionality of your embeddings. The longer the text you want to compare, the larger each embedding should be (to my limited understanding).
So a paragraph might be fine as a 384-dim vector, but for 1,000 words you might want a 768-dim embedding (if not higher). Embedding models vary somewhat in accuracy depending on their training data, and higher dimensionality generally gives better results, up to a point. If you have a very long piece of text, it's easier to chunk it into pieces and create a separate embedding per chunk. You do have to stitch the chunks back together and do some cleanup when displaying results, but it works.
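A minimal sketch of that chunking step, assuming word-based chunks with some overlap so sentences that straddle a boundary survive in at least one chunk (the chunk size and overlap values here are illustrative, not anything specific from this thread):

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word chunks of at most `chunk_size` words,
    with `overlap` words shared between consecutive chunks.
    Each chunk would then get its own embedding."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

You'd keep a mapping from each chunk back to its source document so results can be stitched together for display.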
Once you have embeddings for all your data, the rest is just cosine similarity; play around with the min_similarity threshold. You'll need to build good indexes in Postgres, but that's basically all you need.
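To make that concrete, here's a pure-Python sketch of cosine similarity plus a greedy threshold-based grouping (the min_similarity value and the greedy strategy are illustrative assumptions; in practice you'd push this into Postgres with a vector index rather than doing it in application code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cluster_by_threshold(vectors, min_similarity=0.8):
    """Greedy single-pass clustering: each vector joins the first
    cluster whose representative (its first member) is at least
    min_similarity away, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= min_similarity:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Tuning min_similarity is mostly empirical: too low and unrelated stories merge, too high and the same story splits across clusters.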