Ask HN: Have AI learn my own business knowledge (variety of formats) for a chatbot

abdullin · on July 30, 2024

I’m consulting multiple teams on shipping LLM-driven business automation. So far I have seen only one case where fine-tuning a model really paid off (and didn’t just blow up the RLHF calibration and cause wild hallucinations).

I would suggest avoiding training and looking into RAG systems, prompt engineering and the OpenAI API for a start.

You can do a small PoC quickly using something like LangChain or LlamaIndex. Their pipelines can ingest unstructured data in all file formats, which is good for getting a quick feel.

Afterwards, if you encounter hallucinations in your tasks - throw out vector DB and embeddings into the trashcan (they are pulling junk information into the context and causing hallucinations). Replace embeddings with a RAG based on full text search and query expansion based on the nuances of your business.
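A minimal sketch of what query expansion over plain keyword search could look like. Everything here is an assumption for illustration: the `SYNONYMS` table, the sample documents and the overlap-count scoring are hypothetical stand-ins, and a real setup would use a proper FTS engine and business-specific expansion rules (possibly proposed by an LLM).

```python
import re

# Hypothetical domain vocabulary; in practice this reflects your own business terms.
SYNONYMS = {
    "invoice": ["bill", "receipt"],
    "shipment": ["delivery", "consignment"],
}

# Hypothetical documents standing in for a real knowledge base.
DOCS = {
    "a.md": "How to dispute a bill from a carrier",
    "b.md": "Tracking a delayed delivery",
    "c.md": "Office holiday schedule",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query: str) -> set[str]:
    """Return the query terms plus any known synonyms."""
    terms = set(tokenize(query))
    for term in list(terms):
        terms.update(SYNONYMS.get(term, []))
    return terms


def search(query: str, docs: dict[str, str]) -> list[str]:
    """Rank documents by how many expanded query terms they contain."""
    terms = expand_query(query)
    scored = []
    for doc_id, text in docs.items():
        hits = len(terms & set(tokenize(text)))
        if hits:
            scored.append((hits, doc_id))
    # Highest overlap first; ties broken by document id for stable output.
    return [doc_id for hits, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


print(search("invoice dispute", DOCS))
```

The point of the design is that a miss is easy to explain: if a document was not found, you can see which term was missing from the expansion table and add it.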

If there are any specific types of questions or requests that you need special handling for - add a lightweight router (request classifier) that will direct user request to a dedicated prompt with dedicated data.
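As a rough sketch, such a router can start out as a keyword classifier that picks a prompt template. The route names, keywords and prompt texts below are hypothetical, and in practice the classification step is often an LLM call rather than keyword matching.

```python
import re

# Hypothetical routes: each one has trigger keywords and a dedicated prompt.
ROUTES = {
    "billing": {
        "keywords": {"invoice", "refund", "payment"},
        "prompt": "You are a billing assistant. Use only these billing records:\n"
                  "{context}\n\nRequest: {request}",
    },
    "shipping": {
        "keywords": {"shipment", "delivery", "tracking"},
        "prompt": "You are a shipping assistant. Use only these shipping records:\n"
                  "{context}\n\nRequest: {request}",
    },
}
FALLBACK_PROMPT = "Answer the request using this general context:\n{context}\n\nRequest: {request}"


def classify(request: str) -> str:
    """Return the route with the most keyword hits, or 'general' if none match."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    best_route, best_hits = "general", 0
    for name, cfg in ROUTES.items():
        hits = len(words & cfg["keywords"])
        if hits > best_hits:
            best_route, best_hits = name, hits
    return best_route


def render_prompt(request: str, context: str) -> str:
    """Fill the prompt template that belongs to the request's route."""
    route = classify(request)
    template = ROUTES[route]["prompt"] if route in ROUTES else FALLBACK_PROMPT
    return template.format(context=context, request=request)


print(classify("Where is my refund?"))
```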

By that time you would’ve probably lost all of RAG, replacing it with a couple of prompt templates, a file based knowledge base in markdown and CSV and a few helpers to pull relevant information into the context.
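To make that concrete, here is a small sketch of one such helper: it pulls matching rows from a CSV and drops them into a prompt template. The CSV columns, rows and template wording are made up for illustration; the CSV is inlined as a string so the example is self-contained, where a real helper would read a file.

```python
import csv
import io
from string import Template

# Stand-in for a CSV file on disk (hypothetical columns and rows).
PRICING_CSV = """sku,name,price_eur
A100,Pallet wrap,12.50
B200,Shipping label roll,4.20
"""

ANSWER_TEMPLATE = Template(
    "Answer using only the facts below.\n\nFacts:\n$facts\n\nQuestion: $question"
)


def lookup_rows(csv_text: str, needle: str) -> list[dict[str, str]]:
    """Return rows whose 'name' column contains the needle (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if needle.lower() in row["name"].lower()]


def build_prompt(question: str, needle: str) -> str:
    """Put only the matching rows into the prompt context."""
    rows = lookup_rows(PRICING_CSV, needle)
    facts = "\n".join(
        f"- {r['sku']}: {r['name']} costs {r['price_eur']} EUR" for r in rows
    )
    return ANSWER_TEMPLATE.substitute(
        facts=facts or "- (no matching rows)", question=question
    )


print(build_prompt("How much is pallet wrap?", "pallet"))
```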

That’s how most working LLM-driven workflows end up (in my bubble). Maybe just with PostgreSQL and ES instead of a file-based knowledge base. But that’s an implementation detail.

Update: if you really want to try fine-tuning your own LLM - this article links to a Google Colab notebook for the latest Llama 3.1 8B: https://unsloth.ai/blog/llama3-1

It will not learn new things from your data, though. Might just pick up the style.

vannevar · on July 31, 2024

>throw out vector DB and embeddings into the trashcan (they are pulling junk information into the context and causing hallucinations)

Not sure why this would be true. In my experience, semantic search based on a vector index/embeddings pulls in more relevant information than a full-text keyword search. Maybe there is too broad a set of materials in your vector db, or the chunking strategy isn't good?

abdullin · on July 31, 2024

It might depend on the case.

My problem with similarity search - it is unpredictable. It can sometimes miss really obvious matches or pull completely irrelevant snippets. When this happens - this causes downstream hallucinations that are hard to fix.

My customers don’t tolerate hallucinations.

Query expansion with FTS search works more predictably for me. Especially, if we factor in search scope reduction driven by the request classifier (“agent router”)

vannevar · on July 31, 2024

For sure it will depend on use case, if you have fairly structured data or a clear domain-specific terminology to rely on, there's probably no reason to use semantic search.

>Query expansion with FTS search works more predictably for me. Especially, if we factor in search scope reduction driven by the request classifier (“agent router”)

You might be able to quantify this and gain some insight into why query expansion/FTS is working better by comparing the precision/recall with a vector db using some set of benchmark docs and queries.
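A minimal sketch of that comparison, assuming you have a hand-labelled benchmark of queries and their relevant document ids. The ids and the two result sets below are toy placeholders, not measurements of any real retriever; `RUN_A` and `RUN_B` stand for whatever two retrieval setups you want to compare.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of one retrieved list against the labelled relevant set."""
    found = set(retrieved)
    hits = len(found & relevant)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_scores(
    results: dict[str, list[str]], benchmark: dict[str, set[str]]
) -> tuple[float, float]:
    """Average precision and recall over all benchmark queries."""
    if not benchmark:
        raise ValueError("benchmark must contain at least one query")
    pairs = [precision_recall(results.get(q, []), rel) for q, rel in benchmark.items()]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


# Toy data only: query id -> relevant doc ids, and two hypothetical retriever outputs.
BENCHMARK = {"q1": {"d1", "d3"}, "q2": {"d7"}}
RUN_A = {"q1": ["d1", "d3"], "q2": ["d7", "d8"]}
RUN_B = {"q1": ["d1", "d5", "d6", "d9"], "q2": ["d2"]}

print("run A:", mean_scores(RUN_A, BENCHMARK))
print("run B:", mean_scores(RUN_B, BENCHMARK))
```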

abdullin · on Aug 1, 2024

> For sure it will depend on use case, if you have fairly structured data or a clear domain-specific terminology to rely on

Indeed. This works only in a subset of business domains for me: search and assistants within an enterprise knowledge base (e.g. ~40k documents with 20GB of text) in logistics, supply chain, legal, fintech and medtech.

> You might be able to quantify this and gain some insight into why query expansion/FTS is working better by comparing the precision/recall with a vector db using some set of benchmark docs and queries.

Embeddings tend to miss a lot of nuances, plus they are just unpredictable when searching on large sets of text (e.g. 40k documents fragmented), frequently pulling irrelevant texts before the relevant ones. Context contamination leads to hallucinations in our cases.

However, with LLM-driven query expansion and FTS search I can get controllable retrieval quality in business tasks. Plus, if an edge case shows up, it is fairly easy to explain and adjust the query expansion logic to cover specific nuances.

This is the setup I'm happy with.

cranberryturkey · on July 31, 2024

Don't the RAG approach and LangChain both require you to send the context data (i.e. your book data) in the prompt API call? How would you fit 20,000 books in that call?

abdullin · on July 31, 2024

It is impossible to fit all that information into the call.

The whole point of RAG - we (somehow) retrieve only the relevant information and put it into the context to generate the answer.
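A rough sketch of the "put it into the context" step: take snippets already ranked by relevance and keep adding them until a size budget is reached. The character budget is a crude, assumed stand-in for a token limit, and `build_context` is a hypothetical helper name.

```python
def build_context(snippets: list[str], max_chars: int) -> str:
    """Pack ranked snippets into one context string without exceeding max_chars.

    Snippets are assumed to be sorted most relevant first, so the least
    relevant ones are the ones that get dropped.
    """
    picked: list[str] = []
    used = 0
    for snippet in snippets:
        cost = len(snippet) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break
        picked.append(snippet)
        used += cost
    return "\n".join(picked)


ranked = ["aaaa", "bbbb", "cccc"]
print(build_context(ranked, max_chars=10))
```

So the 20,000 books never go into the call; only the handful of snippets that survive retrieval and the budget do.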

kingkongjaffa · on July 30, 2024

Plus one for this approach, I’m trying to say broadly the same thing with my comment.

What are you using for full text search RAG in production?

abdullin · on July 30, 2024

It really depends on the setup that the dev/ops at the customer are more comfortable with. Elastic or PostgreSQL can both be fine.

Personally for small cases (e.g. under 50k documents and 20GB of text) I like to use SQLite FTS, while linking text fragments to the additional metadata (native or extracted). This way I can really narrow down the search scope to a few case related documents in each conversation path.
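A minimal sketch of that idea, assuming Python's bundled SQLite was compiled with the FTS5 extension (it usually is, but not always). The table layout, the `customer` metadata column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# body is full-text indexed; doc_id and customer are stored as plain metadata.
conn.execute(
    "CREATE VIRTUAL TABLE fragments USING fts5("
    "body, doc_id UNINDEXED, customer UNINDEXED)"
)
conn.executemany(
    "INSERT INTO fragments (body, doc_id, customer) VALUES (?, ?, ?)",
    [
        ("Customs clearance requires a commercial invoice", "doc-1", "acme"),
        ("Customs forms for hazardous goods", "doc-2", "globex"),
        ("Warehouse opening hours", "doc-3", "acme"),
    ],
)


def search_fragments(query: str, customer: str, limit: int = 5) -> list[tuple[str, str]]:
    """Full-text search restricted to one customer's documents, best match first."""
    return conn.execute(
        "SELECT doc_id, body FROM fragments "
        "WHERE fragments MATCH ? AND customer = ? "
        "ORDER BY rank LIMIT ?",
        (query, customer, limit),
    ).fetchall()


# An expanded query can be expressed with FTS5's OR operator.
print(search_fragments("customs OR clearance", "acme"))
```

The metadata filter is what narrows the scope: the other customer's matching document never reaches the context.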

But ultimately the flavor of DB and FTS is just an implementation detail. Most of them will do just fine.

Edit: fixed grammar.

pjb88 · on Aug 2, 2024

Not sure if this will help, but a while ago I was thoroughly confused about all the AI options (and advice from other people), so I spent a while experimenting, and I now sometimes build systems commercially. For a basic-yet-functional knowledge base that you can expand with whatever tooling you want:

- Don't use LlamaIndex/LangChain etc. - fine to get started quick but you'll quickly get frustrated when you try to do something different

- Suck in all your files using public libraries. Convert to text. Remove obvious crap like line breaks etc. Don't worry about it too much.

- Use Postgres as the vector DB - cheap.

- OpenAI is fine, and the docs are great - GPT-3.5 gives fine results; the cheapest embedding model is fine.

- Spend some time optimising the prompts - that's the most important thing.
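For the "convert to text and remove line breaks" step, a minimal sketch might look like the following. The chunking helper is an added assumption (the list above doesn't prescribe one), as are the function names and the default sizes; embedding and storage are left out because they depend on your provider and database.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse line breaks, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word chunks of at most `size` words, sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    chunks: list[str] = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
        start += size - overlap
    return chunks


raw = "Line one\r\n\r\nline   two\t end "
print(clean_text(raw))
print(chunk_words("a b c d e f g", size=4, overlap=1))
```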

I wrote up the basics for my specific niche here, with cost/time breakdowns. It costs about $4 per month for hosting (and only then because I couldn't face setting up Postgres on my other server) and < $1 per 50GB of text/xlsx/etc. embedded: https://superstarsoftware.co.uk/ai-for-drilling-engineers/

(as in: dirt cheap).

I basically made it as a showcase for potential customers, was half thinking of open sourcing it so people can get up and running quickly including with decent frontend, but not sure if there's much appetite since it's basic.

kingkongjaffa · on July 30, 2024

If you’re on a budget you don’t want to “train” the model, i.e. fine-tune it.

Since you have multi format data you likely want a pipeline to convert it all to text using various tools, make sure it’s structured and then shove it in a RAG system for the LLM chatbot to work with.

You can get started with LangChain and OpenAI’s API.

Experiment with GPT-4o mini for a while to keep costs down and then test if cranking up to GPT-4o proper is worth it.

That’s the LLM part solved. You’ll need to control the logic after that depending on controls in your chatbot pop-up window to be able to arrange the calls/send emails etc.

ndr_ · on July 30, 2024

Depends on the budget, I’d say. If it’s less than 10¢, that may be true. If it’s 15¢ or more, you could try training with OpenAI: https://ndurner.github.io/training-own-model-finetuning

crazymoka · on July 30, 2024

I can put all the content into text. Would putting it all into JSON or CSV with headers be helpful? Should I break down content by topic, category, and links to other pages in case it needs to read other content?

kingkongjaffa · on July 30, 2024

Yeah whatever will give you accurate vector embeddings.

You’ll combine those JSON properties into one piece of text and embed it as a single vector.
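A small sketch of that combining step, assuming each record is a flat JSON object. The field names and the sample record are hypothetical, and the resulting string is what you would pass to whichever embedding model you use.

```python
import json


def record_to_text(record: dict, fields: list[str]) -> str:
    """Join the chosen fields of one record into a single labelled text block."""
    parts = []
    for field in fields:
        value = record.get(field)
        if not value:
            continue  # skip missing or empty fields
        if isinstance(value, list):
            value = ", ".join(str(item) for item in value)
        parts.append(f"{field}: {value}")
    return "\n".join(parts)


# Hypothetical record; real ones would come from your converted content.
record = json.loads(
    '{"title": "Returns policy", "category": "support", '
    '"tags": ["refund", "rma"], "body": "Items can be returned within 30 days."}'
)

print(record_to_text(record, ["title", "category", "tags", "body"]))
```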

For links you might hard code those as an extra part of each object or potentially require another API call to retrieve relevant links from your “link store”.

It’s useful to have benchmarks in place like:

input “should equal” output

And start building a suite of test cases to evaluate your setup.
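One way such a suite could start, as a sketch: a list of question/expected pairs and a runner that reports mismatches. The cases and `fake_bot` are placeholders; you would pass in whatever function calls your actual chatbot. A substring check is used here instead of strict equality, since LLM answers rarely match word for word.

```python
from typing import Callable

# Hypothetical test cases: (input, text the output should contain).
CASES = [
    ("What is the return window?", "30 days"),
    ("Which carrier do you use?", "DHL"),
]


def run_suite(
    answer_fn: Callable[[str], str], cases: list[tuple[str, str]]
) -> list[tuple[str, str, str]]:
    """Return (question, expected, actual) for every case that fails."""
    failures = []
    for question, expected in cases:
        actual = answer_fn(question)
        if expected.lower() not in actual.lower():
            failures.append((question, expected, actual))
    return failures


def fake_bot(question: str) -> str:
    """Stand-in for the real chatbot call; returns canned answers."""
    canned = {"What is the return window?": "You can return items within 30 days."}
    return canned.get(question, "I don't know.")


for question, expected, actual in run_suite(fake_bot, CASES):
    print(f"FAIL: {question!r} expected {expected!r}, got {actual!r}")
```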

cranberryturkey · on July 31, 2024

> shove it in a RAG system for the LLM chatbot to work with

How would I do this with an Ollama model?

kingkongjaffa · on Aug 2, 2024

You want to ingest your input documents and generate embeddings using an embedding model like https://ollama.com/blog/embedding-models

theolivenbaum · on Aug 2, 2024

If you want to try our software (includes search, RAG, file handling, and a bunch of integrations, and can be deployed on prem or cloud), happy to give you a license in exchange for some feedback! Just shoot me an email at rafael (at) curiosity.ai