I am working on yet another news aggregator: newsmuncher.com
So far I've got the scraping and embeddings / similarity clustering down (to build timelines of news stories); lots of data cleaning and UI refinement is still required. I find it hard to make choices on my own, so maybe I need a cofounder to pair up with. I'm looking to either monetize the news data or build a news analysis / intelligence platform.
May I ask what techniques you're using, or would recommend, for similarity clustering? I looked into topic modeling, but it seemed a long way off from reliably bundling together stories the way Techmeme does.
(I'm working on basic blog and video aggregators like Planet Python.)
For similarity it's important to consider the dimensionality of your embeddings. The longer the text you want to compare, the larger each embedding should be (to my limited understanding).
So a paragraph might be fine as a 384-dim vector, but for 1,000 words you might want a 768-dim embedding (if not higher). Embedding models vary somewhat in accuracy depending on their training data, and higher dimensionality generally gives better results, up to a point. If you have a very long piece of text, it's easier to chunk it into pieces and create a separate embedding per chunk. You do have to stitch the chunks back together and do some cleanup when displaying results, but it works.
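A minimal sketch of that chunking step, assuming word-based chunks with some overlap so sentences that straddle a boundary survive in at least one chunk (the chunk size and overlap values here are illustrative, not anything specific from this thread):

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word chunks of at most `chunk_size` words,
    with `overlap` words shared between consecutive chunks.
    Each chunk would then get its own embedding."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

You'd keep a mapping from each chunk back to its source document so results can be stitched together for display.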
Once you have embeddings for all your data, the rest is just cosine similarity; play around with the min_similarity threshold. You'll need to build good indexes in Postgres, but that's basically all you need.
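To make that concrete, here's a pure-Python sketch of cosine similarity plus a greedy threshold-based grouping (the min_similarity value and the greedy strategy are illustrative assumptions; in practice you'd push this into Postgres with a vector index rather than doing it in application code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cluster_by_threshold(vectors, min_similarity=0.8):
    """Greedy single-pass clustering: each vector joins the first
    cluster whose representative (its first member) is at least
    min_similarity away, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= min_similarity:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Tuning min_similarity is mostly empirical: too low and unrelated stories merge, too high and the same story splits across clusters.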