At Digdep, our goal is to help people find supplements that actually work — not just by claims, but by scientific research and user-reported outcomes.

The catch? We had over 30,000 product-condition combinations (e.g. Vitamin A for acne, Omega-3 for ADHD) and needed to generate trustworthy, dynamic, evolving pages — without hiring a hundred content writers.

So we did what any backend-leaning team would do:

We built a pipeline-first, AI-assisted content system, structured around research data, user reviews, and intent-based modules.

🧱 Architecture Overview
We split the problem into three systems:

  1. Content Orchestration Layer. A scheduled ETL engine (Airflow + custom workers; see the sketch after this list) that:

Fetches new research data from PubMed, clinical trial APIs, and internal annotations

Pulls structured review data from reputable sellers

Normalizes supplement metadata (dosage, source, purity, etc.)

  2. ML/NLP Layer. This is where the raw data gets meaning:

Clinical research is chunked, embedded (SBERT), and summarized using a hybrid of GPT-4 + in-house fine-tuned classifiers

Reviews are clustered by condition + sentiment, scored, and tagged (e.g. “2-week results”, “used with zinc”)

FAQ candidates are extracted from natural language queries, Reddit, Quora, and Digdep’s internal search logs

  3. Headless CMS + API Delivery. The processed content lives in a GraphQL-accessible store (we use Strapi, heavily extended)

Each page is assembled dynamically on the frontend via metadata-driven composition: which sections to show, what order, how they’re prioritized

Content updates are non-destructive and versioned — users get fresh insights without pages losing their SEO/indexing
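
To make the orchestration layer concrete, here's a minimal Airflow-style sketch of the daily ETL run. Task names and bodies are illustrative placeholders, assuming Airflow 2.x, not our actual DAG:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def supplement_content_etl():
    @task
    def fetch_research() -> list[dict]:
        # Pull new abstracts and results from PubMed / clinical trial APIs (clients assumed)
        return []

    @task
    def fetch_reviews() -> list[dict]:
        # Pull structured review data from seller feeds
        return []

    @task
    def normalize(research: list[dict], reviews: list[dict]) -> None:
        # Normalize supplement metadata (dosage, source, purity, ...) and write to the store
        ...

    normalize(fetch_research(), fetch_reviews())


supplement_content_etl()
```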

🧠 AI Where It Makes Sense
We were careful not to over-rely on LLMs. Here’s how we actually use them:

Summarization: Input = abstract + result + cohort size; Output = a two-sentence summary with risk qualifiers
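
Here’s a rough sketch of that summarization call, assuming the OpenAI Python SDK (v1+); the prompt wording and the study fields are illustrative, and in production the output also passes through our fine-tuned classifiers:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_study(study: dict) -> str:
    prompt = (
        f"Abstract: {study['abstract']}\n"
        f"Primary result: {study['result']}\n"
        f"Cohort size: {study['cohort_size']}\n\n"
        "Summarize in exactly two sentences for a consumer audience. "
        "Include risk qualifiers (small sample, short duration, industry funding) where relevant."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content
```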

Semantic clustering: We embed every user review and map it into symptom categories and conditions (some users don’t say “acne” — they say “skin bumps”)
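
In practice that mapping looks roughly like this with sentence-transformers; the model name and condition list are stand-ins, not our production setup:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for our SBERT variant

conditions = ["acne", "ADHD", "insomnia", "joint pain"]
condition_embeddings = model.encode(conditions, convert_to_tensor=True)

def map_review_to_condition(review_text: str) -> tuple[str, float]:
    review_embedding = model.encode(review_text, convert_to_tensor=True)
    scores = util.cos_sim(review_embedding, condition_embeddings)[0]
    best = int(scores.argmax())
    return conditions[best], float(scores[best])

# "skin bumps" lands closest to "acne" even though the word never appears in the review
print(map_review_to_condition("My skin bumps cleared up after three weeks"))
```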

Question synthesis: LLMs turn query logs into human-readable FAQs, then we pass them through filters for duplication, bias, and hallucination
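
The duplication filter is the simplest of those checks; here's a hedged sketch using the same embeddings (the 0.85 threshold is illustrative, not our tuned value):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def dedupe_faqs(questions: list[str], threshold: float = 0.85) -> list[str]:
    kept: list[str] = []
    kept_embeddings: list = []
    for question in questions:
        emb = model.encode(question, convert_to_tensor=True)
        # Drop the question if it is a near-duplicate of anything we already kept
        if all(float(util.cos_sim(emb, prev)) < threshold for prev in kept_embeddings):
            kept.append(question)
            kept_embeddings.append(emb)
    return kept
```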

We built a confidence scoring layer to decide when to show or suppress LLM output. If the model’s not sure, it defers to rules or hides the result.
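
A simplified version of that gate looks like this; the thresholds and fallback behavior are illustrative, not our production values:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredOutput:
    text: str
    confidence: float  # combined classifier + heuristic score in [0, 1]


def gate_llm_output(output: ScoredOutput, rule_based_fallback: Optional[str]) -> Optional[str]:
    if output.confidence >= 0.8:
        return output.text               # confident: show the LLM output
    if output.confidence >= 0.5 and rule_based_fallback:
        return rule_based_fallback       # unsure: defer to deterministic rules
    return None                          # low confidence: suppress the section entirely
```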

📦 How Pages Are Built
Each product page is made of composable modules, injected via API:

modules sourced from the ML pipeline

modules sourced from review tagging

modules derived from research weighting

modules generated dynamically

modules built from the co-purchase graph

The backend controls what renders, and the frontend just assembles.

We also exposed a JSON manifest for each page so QA/devs can debug pipeline decisions without inspecting raw DB rows.
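
Here's an illustrative shape for that manifest (field names, module types, and scores are assumptions), plus the trivial filter-and-sort the frontend does:

```python
manifest = {
    "page": "vitamin-a-for-acne",
    "version": "2024-05-12T03:00:00Z",
    "modules": [
        {"type": "research_summary", "source": "ml_pipeline", "priority": 1, "confidence": 0.92},
        {"type": "review_highlights", "source": "review_tagging", "priority": 2, "confidence": 0.88},
        {"type": "faq", "source": "query_synthesis", "priority": 3, "confidence": 0.81},
        {"type": "related_products", "source": "co_purchase_graph", "priority": 4, "confidence": 0.68},
    ],
}

def modules_to_render(manifest: dict, min_confidence: float = 0.7) -> list[dict]:
    # Hide low-confidence modules, then render the rest in priority order
    visible = [m for m in manifest["modules"] if m["confidence"] >= min_confidence]
    return sorted(visible, key=lambda m: m["priority"])
```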

📊 Feedback Loops
This system let us do things we couldn’t before:

Trigger model re-training when new research changes a supplement’s score

Use search and review logs to automatically discover emerging use cases (e.g. berberine + PCOS suddenly rising; a toy spike-detection sketch follows this list)

Log anonymized click paths to see which modules drive trust, then tune the page structure accordingly
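
As a toy version of that use-case discovery, here's a spike check over weekly search-log counts per (supplement, condition) pair; the thresholds are illustrative:

```python
from collections import Counter

def find_emerging_pairs(
    this_week: Counter, last_week: Counter, ratio: float = 3.0, min_count: int = 50
) -> list[tuple[str, str]]:
    emerging = []
    for pair, count in this_week.items():
        baseline = last_week.get(pair, 1)  # avoid division by zero for brand-new pairs
        if count >= min_count and count / baseline >= ratio:
            emerging.append(pair)  # e.g. ("berberine", "PCOS")
    return emerging
```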

🚀 Results & Takeaways
We scaled to thousands of pages within 2 weeks without bottlenecks

Pages adapt over time as new data/reviews/research arrives

Everything is traceable, explainable, and testable — no “black box content”

If you’re building content at scale in a high-trust domain (health, legal, finance), structured pipelines + LLM-assisted augmentation is a sweet spot. It’s not sexy, but it’s robust.

💬 Curious how we handle edge cases (e.g. conflicting research, multi-supplement effects), cold-start products, or data validation? Drop a question below — always happy to nerd out.