Turns out what I've been calling Fact Extraction (as early as WalkingRAG) has a proper name: APS or Abstractive Proposition Segmentation.
Only learned this because of the new Gemma model finetuned exclusively for this purpose: huggingface.co/google/gemma-2b-aps-it
Why is this important?
Longer writeup coming soon (with cooking analogies) - but transforming your input data (BEFORE any kind of chunking) is the easiest way to improve retrieval performance.
At GW, the most useful thing we did in retrieval was to transform long docs using LLMs into Facts.
Or abstractive propositions. Essentially, take something like this thread and turn it into facts like:
'Hrishi has a longer writeup coming soon.'
'Hrishi learned about a new Gemma model.'
Even if you're just semantic searching, these facts are a lot closer to your question than your document is. When you start connecting facts - and enriching them with additional data - it becomes much easier to move them around without losing contextual knowledge.
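A minimal sketch of the idea, for the curious: split a doc into standalone facts, then retrieve over the facts instead of the raw doc. The `llm` hook and `extract_propositions` / `retrieve` names are mine, and the scoring here is a toy word-overlap stand-in for embedding similarity - in practice you'd prompt an APS-tuned model (like the Gemma one above) and use a real embedding search.

```python
import re

def extract_propositions(document: str, llm=None) -> list[str]:
    """Turn a long document into short, self-contained facts."""
    if llm is not None:
        # Hypothetical interface: llm(prompt) -> str, one fact per line.
        raw = llm(f"Rewrite as standalone propositions, one per line:\n{document}")
        return [line.strip() for line in raw.splitlines() if line.strip()]
    # Stub for illustration: a naive sentence split stands in for the model.
    return [s.strip() + "." for s in document.split(".") if s.strip()]

def retrieve(query: str, propositions: list[str], top_k: int = 2) -> list[str]:
    """Toy lexical scorer standing in for embedding similarity."""
    tokens = lambda text: set(re.findall(r"\w+", text.lower()))
    q = tokens(query)
    scored = sorted(propositions, key=lambda p: len(q & tokens(p)), reverse=True)
    return scored[:top_k]

doc = ("Hrishi has a longer writeup coming soon. "
       "Hrishi learned about a new Gemma model.")
facts = extract_propositions(doc)
print(retrieve("what model did Hrishi learn about?", facts, top_k=1))
# → ['Hrishi learned about a new Gemma model.']
```

The point the toy example makes: a question matches a single short fact far more cleanly than it matches the whole document the fact came from.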
Okay I'll save the rest for later, now to testing the model!
@TarunAmasa this is the model I was mentioning