Anthropic has published a great article about Contextual Retrieval that suggests the combination of vector-based search and TF-IDF.

The way to think about a data flow for the pipeline is:

Image description

In addition to prepare the vector embedding as basic embedding example above, after the source data parsing, we can do the following:

  • For each document, extract keywords from it with its frequency.

  • Across all documents, group by keywords and sum up their frequencies, putting into an internal storage.

  • For each document each keyword, calculate the TF-IDF score using two inputs: frequency in the current document, and the sum of frequency across all documents. Store keyword and along with TF-IDF score keyword (if above a certain threshold) to a keyword index.

On query time, we can query both the vector index and the keyword index, and combine results from both.