Research overview

This product is evaluated as execution memory, not just on summary quality.

The research thesis is simple: meeting intelligence becomes more useful when it remembers prior commitments, tracks accountability over time, and is benchmarked longitudinally against a transcript-only baseline.

Scenario

5 meetings

Onboarding Growth Initiative - 5 Week Accountability Regression

Suite coverage

2 scenarios

The suite now compares 2 systems across 2 scenarios, covering accountability and recovery sequences, instead of relying on one benchmark story.

Current system

37 / 1

Passed / failed checks with project memory enabled: 37 of 38 checks passed. Approximate pass rate: 97%.

Transcript-only baseline

34 / 4

Same scenario, but without prior project memory: 34 of 38 checks passed. Approximate pass rate: 89%.
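The two pass rates follow directly from the passed / failed tallies shown above. As a minimal sketch of that arithmetic (the `CheckTally` shape and `passRatePercent` name are illustrative assumptions, not the benchmark's actual API):

```typescript
// Hypothetical tally shape; the real benchmark report may use other field names.
interface CheckTally {
  passed: number;
  failed: number;
}

// Pass rate as a whole-number percentage of all checks that were run.
function passRatePercent(tally: CheckTally): number {
  const total = tally.passed + tally.failed;
  if (total === 0) return 0; // no checks run, so avoid dividing by zero
  return Math.round((tally.passed / total) * 100);
}

const withMemory: CheckTally = { passed: 37, failed: 1 };
const transcriptOnly: CheckTally = { passed: 34, failed: 4 };

console.log(passRatePercent(withMemory)); // 97
console.log(passRatePercent(transcriptOnly)); // 89

// Net difference in passed checks between the two systems on the same scenario.
console.log(withMemory.passed - transcriptOnly.passed); // 3
```

Both systems are scored on the same 38 checks, so the difference of 3 passed checks is the quantity the headline percentages summarize.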

Suite snapshot

The suite is designed to produce inspectable evidence, not just a headline number.

Every full suite run writes JSON and Markdown artifacts. When a suite report exists in the environment, it appears here automatically; otherwise the page shows the exact command needed to generate one.

No suite artifact found yet

This environment does not currently have a generated suite report under benchmark/reports/suites. The benchmark code is ready; the artifact appears here after the suite is run.

Generate the suite artifact

docker compose --env-file .env.docker up -d postgres
pnpm --filter @meeting-ai/ai-backend dev
pnpm benchmark:suite

Ablation design

The evaluation explicitly asks what the memory layer adds.

Stateful execution memory

live

Uses project memory, carry-forward reconciliation, evidence metadata, and accountability-aware extraction across recurring meetings.

Transcript-only baseline

live

Reasoning is limited to the current meeting transcript and meeting-local metadata, with no prior project state or carry-forward context.

Next ablations to add

planned

Planned follow-ups include memory-without-evidence, extraction-only without reconciliation, and lifecycle-transition scoring against gold labels.
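One way to picture the two live conditions is as a single switch that controls whether prior project state reaches the reasoning step. The sketch below is an assumption about how such a switch could look; the `Variant`, `ProjectState`, and `buildContext` names are hypothetical and not taken from the benchmark code.

```typescript
// Hypothetical names for the two live ablation conditions.
type Variant = "stateful-memory" | "transcript-only";

interface Meeting {
  id: string;
  transcript: string;
}

// Assumed minimal stand-in for carried-forward project memory.
interface ProjectState {
  openItemIds: string[];
}

interface ReasoningContext {
  transcript: string;
  priorState: ProjectState | null;
}

// The baseline sees the same transcript but never receives prior state.
function buildContext(
  variant: Variant,
  meeting: Meeting,
  prior: ProjectState | null
): ReasoningContext {
  return {
    transcript: meeting.transcript,
    priorState: variant === "stateful-memory" ? prior : null,
  };
}

const meeting: Meeting = { id: "week-3", transcript: "Status update on onboarding..." };
const prior: ProjectState = { openItemIds: ["item-1", "item-2"] };

console.log(buildContext("stateful-memory", meeting, prior).priorState?.openItemIds.length); // 2
console.log(buildContext("transcript-only", meeting, prior).priorState); // null
```

Keeping everything else identical between the two calls is what lets a score difference be attributed to the memory layer rather than to a different prompt or transcript.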

Research contribution

The novelty is in continuity, accountability, and defensible evaluation.

  • Stateful project memory instead of transcript-only summarization
  • Accountability-aware extraction that tracks owners, deadlines, and silent blockers
  • Longitudinal benchmark harness with a transcript-only baseline for ablation
  • Evidence trace surfaces that tie extracted items back to transcript spans and context

Why the baseline matters

Better notes are not enough.

A transcript-only system can produce readable outputs, but it cannot reliably decide what stayed open, what was resolved later, which deadlines silently slipped, or when the project is genuinely ready. That is the difference this project is trying to measure.
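To make "silently slipped" concrete, here is a minimal sketch under stated assumptions: a deadline counts as silently slipped when a carried-forward item is still open, its deadline is earlier than the current meeting date, and the current meeting never mentions it. The `CarriedItem` shape and `findSilentSlips` function are hypothetical illustrations, and the dates are example data.

```typescript
// Hypothetical shape for an item carried forward from earlier meetings.
interface CarriedItem {
  id: string;
  deadline: string; // ISO date (YYYY-MM-DD), so string comparison orders correctly
  status: "open" | "closed";
}

// Returns ids of open items that are past due and were not mentioned in this meeting.
function findSilentSlips(
  carried: CarriedItem[],
  mentionedIds: Set<string>,
  meetingDate: string
): string[] {
  return carried
    .filter(
      (item) =>
        item.status === "open" &&
        item.deadline < meetingDate &&
        !mentionedIds.has(item.id)
    )
    .map((item) => item.id);
}

const carried: CarriedItem[] = [
  { id: "a", deadline: "2024-03-01", status: "open" }, // past due, never mentioned
  { id: "b", deadline: "2024-03-20", status: "open" }, // not due yet
  { id: "c", deadline: "2024-03-01", status: "closed" }, // already resolved
  { id: "d", deadline: "2024-03-05", status: "open" }, // past due but discussed
];

console.log(findSilentSlips(carried, new Set(["d"]), "2024-03-10")); // [ 'a' ]
```

A transcript-only system has no `carried` list to consult, which is why this check is unavailable to it regardless of how well it reads the current transcript.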

Core claim

Longitudinal project memory should outperform single-meeting reasoning when the task is execution support, not note generation.

Evaluation layers

The system is scored on more than writing quality.

The rubric separates extraction quality, state-tracking quality, and PM usefulness so the research story stays grounded in measurable behavior.

Extraction quality

Measures whether the system identifies the right actions, owners, deadlines, questions, and risks from a meeting transcript.

  • item extraction F1
  • owner attribution accuracy
  • deadline accuracy
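The three extraction metrics above can be sketched as follows. This assumes predicted items have already been aligned to gold items through a shared, unique `id`, which is a simplification; the `Item` shape, the `scoreExtraction` function, and the sample names and dates are all illustrative.

```typescript
// Hypothetical item shape; ids are assumed unique and pre-aligned with gold labels.
interface Item {
  id: string;
  owner: string | null;
  deadline: string | null;
}

interface ExtractionScore {
  precision: number;
  recall: number;
  f1: number;
  ownerAccuracy: number;
  deadlineAccuracy: number;
}

function scoreExtraction(gold: Item[], predicted: Item[]): ExtractionScore {
  const goldById = new Map<string, Item>(gold.map((g): [string, Item] => [g.id, g]));
  // A predicted item counts as a match when a gold item shares its id.
  const matched = predicted.filter((p) => goldById.has(p.id));

  const precision = predicted.length > 0 ? matched.length / predicted.length : 0;
  const recall = gold.length > 0 ? matched.length / gold.length : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;

  // Owner and deadline accuracy are measured over matched items only.
  const ownerHits = matched.filter((p) => goldById.get(p.id)?.owner === p.owner).length;
  const deadlineHits = matched.filter(
    (p) => goldById.get(p.id)?.deadline === p.deadline
  ).length;

  return {
    precision,
    recall,
    f1,
    ownerAccuracy: matched.length > 0 ? ownerHits / matched.length : 0,
    deadlineAccuracy: matched.length > 0 ? deadlineHits / matched.length : 0,
  };
}

const gold: Item[] = [
  { id: "a", owner: "Ana", deadline: "2024-03-08" },
  { id: "b", owner: "Ben", deadline: "2024-03-15" },
  { id: "c", owner: "Chi", deadline: null },
  { id: "d", owner: "Dev", deadline: "2024-03-22" },
];
const predicted: Item[] = [
  { id: "a", owner: "Ana", deadline: "2024-03-08" },
  { id: "b", owner: "Ben", deadline: "2024-03-16" }, // wrong deadline
  { id: "c", owner: "Ana", deadline: null }, // wrong owner
  { id: "x", owner: null, deadline: null }, // spurious item
];

const score = scoreExtraction(gold, predicted);
console.log(score.f1); // 0.75
```

Scoring owner and deadline accuracy only over matched items keeps those two numbers from being dragged down a second time by items that extraction F1 already penalizes.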

State-tracking quality

Measures whether the system carries work forward correctly across recurring meetings instead of treating every meeting in isolation.

  • lifecycle transition accuracy
  • false-closure rate
  • unresolved-question carry-forward accuracy
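The first two state-tracking metrics can be sketched against per-meeting gold labels. The definitions here are assumptions: transition accuracy is the share of item-meeting labels where the predicted lifecycle state equals the gold state, and false-closure rate is the share of items that are not closed in the gold labels but were predicted closed. The `Lifecycle` states and `scoreStateTracking` function are hypothetical.

```typescript
// Hypothetical lifecycle states; the real label set may differ.
type Lifecycle = "open" | "blocked" | "closed";

// One gold/predicted pair for a given item at a given meeting.
interface StateLabel {
  itemId: string;
  meeting: number;
  gold: Lifecycle;
  predicted: Lifecycle;
}

interface StateTrackingScore {
  transitionAccuracy: number;
  falseClosureRate: number;
}

function scoreStateTracking(labels: StateLabel[]): StateTrackingScore {
  const correct = labels.filter((l) => l.gold === l.predicted).length;

  // A false closure is an item the system marked closed while gold says it is not.
  const notClosedInGold = labels.filter((l) => l.gold !== "closed");
  const falseClosures = notClosedInGold.filter((l) => l.predicted === "closed").length;

  return {
    transitionAccuracy: labels.length > 0 ? correct / labels.length : 0,
    falseClosureRate:
      notClosedInGold.length > 0 ? falseClosures / notClosedInGold.length : 0,
  };
}

const labels: StateLabel[] = [
  { itemId: "a", meeting: 2, gold: "open", predicted: "open" },
  { itemId: "b", meeting: 2, gold: "blocked", predicted: "closed" }, // false closure
  { itemId: "c", meeting: 2, gold: "closed", predicted: "closed" },
  { itemId: "d", meeting: 2, gold: "open", predicted: "blocked" },
];

const result = scoreStateTracking(labels);
console.log(result.transitionAccuracy); // 0.5
console.log(result.falseClosureRate.toFixed(2)); // 0.33
```

False closures are tracked separately from overall accuracy because marking unfinished work as done is the costliest error for accountability: the item disappears from follow-up.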

PM usefulness quality

Measures whether the output is actually operational for product and engineering follow-up, not just readable as notes.

  • readiness label accuracy
  • evidence grounding coverage
  • human PM scoring
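Evidence grounding coverage can be read as the share of extracted items that point back to at least one valid transcript span. The sketch below assumes spans are character offsets into the transcript; the `EvidenceSpan` and `GroundedItem` shapes and the `groundingCoverage` function are illustrative, not the project's actual schema.

```typescript
// Hypothetical evidence span: character offsets into the meeting transcript.
interface EvidenceSpan {
  start: number;
  end: number; // exclusive
}

interface GroundedItem {
  id: string;
  evidence: EvidenceSpan[];
}

// A span is usable only if it is non-empty and lies inside the transcript.
function isValidSpan(span: EvidenceSpan, transcriptLength: number): boolean {
  return span.start >= 0 && span.end > span.start && span.end <= transcriptLength;
}

// Fraction of items backed by at least one valid span.
function groundingCoverage(items: GroundedItem[], transcriptLength: number): number {
  if (items.length === 0) return 0;
  const grounded = items.filter((item) =>
    item.evidence.some((span) => isValidSpan(span, transcriptLength))
  );
  return grounded.length / items.length;
}

const items: GroundedItem[] = [
  { id: "a", evidence: [{ start: 0, end: 40 }] },
  { id: "b", evidence: [{ start: 120, end: 180 }] },
  { id: "c", evidence: [] }, // no evidence attached
  { id: "d", evidence: [{ start: 950, end: 1200 }] }, // runs past the transcript end
];

console.log(groundingCoverage(items, 1000)); // 0.5
```

Validating span bounds, rather than only checking that a span exists, keeps the coverage figure from counting evidence that cannot actually be opened in the transcript.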