Incremental processing is one of the core values provided by CocoIndex.
In CocoIndex, users declare the transformation and don't need to worry about the work of keeping the index and the source in sync.
CocoIndex creates and maintains an index, and keeps the derived index up to date as the source changes, with minimal computation and minimal changes to the index. This makes it suitable for ETL, RAG, or any transformation task that needs low latency between source updates and index updates, while also keeping computation cost low.
If you like our work, it would mean a lot to us if you could support CocoIndex on GitHub with a star. Thank you so much with a warm coconut hug 🥥🤗.
What is incremental processing?
Incremental processing means figuring out exactly what needs to be updated and updating only that, without recomputing everything from scratch.
How does it work?
You don't need to do anything special; just focus on defining the transformation you need.
CocoIndex automatically tracks the lineage of your data and maintains a cache of computation results. When you update your source data, CocoIndex will:
- Identify which parts of the data have changed
- Only recompute transformations for the changed data
- Reuse cached results for unchanged data
- Update the index with minimal changes
And CocoIndex will handle the incremental processing for you.
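To make the four steps above concrete, here is a minimal sketch of the general technique, assuming changes are detected by comparing content fingerprints. It is not CocoIndex's implementation or API; `ToyIncrementalIndex`, `fingerprint`, and the shape of the returned stats are hypothetical names chosen for illustration.

```python
import hashlib
from typing import Callable, Dict, Tuple


def fingerprint(content: str) -> str:
    """Return a stable hash of the source content (assumed change detector)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class ToyIncrementalIndex:
    """Hypothetical illustration only; not the CocoIndex API."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self.transform = transform
        # key -> (fingerprint of the source content, cached transform result)
        self.cache: Dict[str, Tuple[str, str]] = {}
        self.index: Dict[str, str] = {}

    def update(self, source: Dict[str, str]) -> Dict[str, int]:
        stats = {"recomputed": 0, "reused": 0, "deleted": 0}
        for key, content in source.items():
            fp = fingerprint(content)
            cached = self.cache.get(key)
            if cached is not None and cached[0] == fp:
                # Unchanged source row: reuse the cached result, touch nothing.
                stats["reused"] += 1
                continue
            # New or changed source row: recompute only this row.
            result = self.transform(content)
            self.cache[key] = (fp, result)
            self.index[key] = result
            stats["recomputed"] += 1
        # Rows that disappeared from the source are removed from the index.
        for key in list(self.index):
            if key not in source:
                del self.index[key]
                del self.cache[key]
                stats["deleted"] += 1
        return stats
```

Calling `update` twice with the same source would recompute nothing on the second call, and changing one row would recompute only that row. A real system also has to track lineage through multi-step transformations, which this sketch leaves out.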
CocoIndex provides two modes for running a pipeline:
- One-time update: Once triggered, CocoIndex updates the target data to reflect the source data as of the current moment.
- Live update: CocoIndex continuously reacts to changes in the source data and updates the target data accordingly, using whichever change capture mechanisms the source supports.
Both modes run with incremental processing. You can view more details in Life Cycle of an Indexing Flow.
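The difference between the two modes can be sketched with a toy model. This is an illustration under stated assumptions, not the CocoIndex API: `ToyFlow`, `update_once`, and `update_live` are hypothetical names, the one-time mode is modeled as diffing a full snapshot, and the live mode is modeled as consuming a stream of change events. Both share the same incremental step.

```python
from typing import Dict, Iterable, Optional, Tuple

# (key, new content), or (key, None) when the source row was deleted.
Change = Tuple[str, Optional[str]]


def transform(content: str) -> int:
    """Stand-in transformation: count the words in a document."""
    return len(content.split())


class ToyFlow:
    """Hypothetical illustration only; not the CocoIndex API."""

    def __init__(self) -> None:
        self.seen: Dict[str, str] = {}    # last processed source content per key
        self.target: Dict[str, int] = {}  # derived target data

    def _apply(self, key: str, content: Optional[str]) -> bool:
        """Apply one change incrementally; return True if the target changed."""
        if content is None:
            self.seen.pop(key, None)
            return self.target.pop(key, None) is not None
        if self.seen.get(key) == content:
            return False  # unchanged, so no recomputation
        self.seen[key] = content
        self.target[key] = transform(content)
        return True

    def update_once(self, snapshot: Dict[str, str]) -> int:
        """One-time update: reconcile the target with a full source snapshot."""
        changes = list(snapshot.items())
        changes += [(k, None) for k in self.seen if k not in snapshot]
        return sum(self._apply(k, c) for k, c in changes)

    def update_live(self, events: Iterable[Change]) -> int:
        """Live update: react to change events as the source emits them."""
        return sum(self._apply(k, c) for k, c in events)
```

In this model, the only difference between the modes is how changes are discovered: by scanning a snapshot, or by receiving events. The work done per changed row is the same.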
Who needs Incremental Processing?
Many people think incremental processing is only beneficial for large-scale data. On closer inspection,
it really depends on the cost of processing and the requirement for data freshness.
Google processes data at huge scale, backed by a huge amount of resources.
Your data is much smaller, but your resources are also much more limited.
Incremental processing is needed under the following conditions:
- High freshness requirement
  Most user-facing applications need this. For example, when users update their documents, they don't expect to see stale information in search results. If the search results are fed into an AI agent, stale data may lead to unexpected responses (i.e. the LLM generates output based on inaccurate information). This is more dangerous, because users may accept the response without noticing the problem.
- Transformation cost is significantly higher than retrieval itself
Overall, say T is your maximum acceptable staleness. If you don't want to recompute everything once every cycle of length T,
you will need incremental processing to some degree.
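A back-of-the-envelope calculation shows why T matters. The sketch below assumes one refresh per cycle of length T and a fixed transformation cost per item; the function name and every number in the usage note are hypothetical, chosen only to show the arithmetic.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400


def daily_transform_cost(
    total_items: int,
    changed_per_cycle: int,
    cost_per_item: float,
    staleness_seconds: float,
) -> Tuple[float, float]:
    """Return (full recompute cost, incremental cost) per day.

    Assumes one refresh every `staleness_seconds` (the T above) and that
    the transformation cost is the same for every item.
    """
    cycles_per_day = SECONDS_PER_DAY / staleness_seconds
    # Full recompute: every item is transformed again on every cycle.
    full = cycles_per_day * total_items * cost_per_item
    # Incremental: only the items that changed in the cycle are transformed.
    incremental = cycles_per_day * changed_per_cycle * cost_per_item
    return full, incremental
```

For example, `daily_transform_cost(100_000, 50, 2, 600)` models 100,000 items, 50 changes per 10-minute cycle, and a cost of 2 units per item. Under those assumed numbers, full recomputation costs 2,000 times as much as incremental processing, because the ratio is simply total items divided by changed items per cycle.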