Research overview

This product is evaluated as execution memory, not just on summary quality.

The research thesis is simple: meeting intelligence becomes more useful when it remembers prior commitments and tracks accountability over time, and that gain should be measured longitudinally against a transcript-only baseline.

Scenario

5 meetings

Onboarding Growth Initiative - 5 Week Accountability Regression

Suite coverage

2 scenarios

The suite now compares 2 systems across accountability and recovery sequences instead of relying on one benchmark story.

Current system

37 / 1

Passed / failed checks with project memory enabled. Approximate pass rate: 97%.

Transcript-only baseline

34 / 4

Same scenario, but without prior project memory. Approximate pass rate: 89%.
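
The approximate pass rates above follow from the passed and failed counts. A minimal sketch of that arithmetic, written in Python for brevity; the function name `pass_rate` is illustrative and is not taken from the benchmark code:

```python
def pass_rate(passed: int, failed: int) -> float:
    """Return the share of checks that passed, as a fraction in [0, 1]."""
    total = passed + failed
    if total == 0:
        raise ValueError("no checks were run")
    return passed / total


# Counts reported on this page.
stateful = pass_rate(37, 1)         # 37 of 38 checks
transcript_only = pass_rate(34, 4)  # 34 of 38 checks

print(f"stateful: {stateful:.0%}")                # stateful: 97%
print(f"transcript-only: {transcript_only:.0%}")  # transcript-only: 89%
```

Both counts sum to 38 checks, so the gap between the two systems is three checks, or roughly 8 percentage points.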

Suite snapshot

The suite is designed to produce inspectable evidence, not just a headline number.

Every full suite run writes JSON and Markdown artifacts. When a suite report exists in the environment, it appears here automatically; otherwise the page shows the exact command needed to generate one.

No suite artifact found yet

This environment does not currently have a generated suite report under benchmark/reports/suites. The benchmark code is ready; the artifact appears here after the suite is run.

Generate the suite artifact

docker compose --env-file .env.docker up -d postgres
pnpm --filter @meeting-ai/ai-backend dev
pnpm benchmark:suite
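
The "show the report if one exists, otherwise show the command" behavior described in this section could be sketched roughly as follows. This is an illustrative Python sketch, not the page's actual implementation. Only the `benchmark/reports/suites` directory and the `pnpm benchmark:suite` command come from this page; the function names, the `*.json` file pattern, and the choice of the most recently modified file are assumptions.

```python
from pathlib import Path
from typing import Optional

REPORT_DIR = Path("benchmark/reports/suites")  # directory named on this page
GENERATE_COMMAND = "pnpm benchmark:suite"      # command named on this page


def find_latest_suite_report(report_dir: Path = REPORT_DIR) -> Optional[Path]:
    """Return the most recently modified JSON report, or None if there is none.

    Assumption: reports are stored as *.json files directly in report_dir.
    """
    if not report_dir.is_dir():
        return None
    reports = sorted(report_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    return reports[-1] if reports else None


def snapshot_message(report_dir: Path = REPORT_DIR) -> str:
    """Describe what the snapshot section would show for this environment."""
    report = find_latest_suite_report(report_dir)
    if report is None:
        return f"No suite artifact found yet. Run: {GENERATE_COMMAND}"
    return f"Showing suite report: {report.name}"


print(snapshot_message())
```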

Scenario library

The benchmark now covers more than one kind of recurring PM situation.

Onboarding Growth Initiative

5 meetings

Covers carry-forward accountability, launch readiness, and unresolved questions, and tests whether a project-memory system can reconcile shifting execution state over time.

Release Recovery Cycle

4 meetings

Covers incident recovery memory, rollback decision closure, launch gating, vendor escalation tracking, and the transition from firefighting into release readiness.

Ablation design

The evaluation explicitly asks what the memory layer adds.

Stateful execution memory

live

Uses project memory, carry-forward reconciliation, evidence metadata, and accountability-aware extraction across recurring meetings.

Transcript-only baseline

live

Reasoning is limited to the current meeting transcript and meeting-local metadata, with no prior project state or carry-forward context.

Next ablations to add

planned

Planned follow-ups include memory-without-evidence, extraction-only without reconciliation, and lifecycle-transition scoring against gold labels.
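
To make the difference between the two live systems concrete, here is a toy Python sketch of carry-forward state versus single-meeting reasoning. All types, function names, item IDs, and owner names are hypothetical and chosen for illustration; the real system's extraction and reconciliation logic is not shown on this page.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass
class ActionItem:
    item_id: str
    owner: str
    status: str = "open"  # "open" or "done"


@dataclass
class MeetingUpdate:
    """What a single meeting mentions. Hypothetical shape, for illustration."""
    new_items: List[ActionItem] = field(default_factory=list)
    closed_ids: List[str] = field(default_factory=list)


def run_stateful(meetings: List[MeetingUpdate]) -> Dict[str, ActionItem]:
    """Carry items forward: memory persists across meetings."""
    memory: Dict[str, ActionItem] = {}
    for meeting in meetings:
        for item in meeting.new_items:
            memory[item.item_id] = replace(item)  # copy, leave input untouched
        for item_id in meeting.closed_ids:
            if item_id in memory:
                memory[item_id] = replace(memory[item_id], status="done")
    return memory


def run_transcript_only(meetings: List[MeetingUpdate]) -> Dict[str, ActionItem]:
    """Baseline: only the latest meeting is visible, with no prior state."""
    return run_stateful(meetings[-1:])


# Toy data: two items opened in meeting 1, one closed in meeting 2.
meetings = [
    MeetingUpdate(new_items=[ActionItem("A1", "Priya"), ActionItem("A2", "Sam")]),
    MeetingUpdate(closed_ids=["A1"]),
]
stateful = run_stateful(meetings)
baseline = run_transcript_only(meetings)
open_items = sorted(i for i, item in stateful.items() if item.status == "open")

print(open_items)  # ['A2']
print(baseline)    # {}
```

In this toy run the stateful path still knows that A2 is open after the second meeting, while the baseline has no record of either item because neither was introduced in the transcript it can see.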

Research contribution

The novelty is in continuity, accountability, and defensible evaluation.

Stateful project memory instead of transcript-only summarization
Accountability-aware extraction that tracks owners, deadlines, and silent blockers
Longitudinal benchmark harness with a transcript-only baseline for ablation
Evidence trace surfaces that tie extracted items back to transcript spans and context

Why the baseline matters

Better notes are not enough.

A transcript-only system can produce readable outputs, but it cannot reliably decide what stayed open, what was resolved later, which deadlines silently slipped, or when the project is genuinely ready. That is the difference this project is trying to measure.

Core claim

Longitudinal project memory should outperform single-meeting reasoning when the task is execution support, not note generation.

Evaluation layers

The system is scored on more than writing quality.

The rubric separates extraction quality, state-tracking quality, and PM usefulness so the research story stays grounded in measurable behavior.

Extraction quality

Measures whether the system identifies the right actions, owners, deadlines, questions, and risks from a meeting transcript.

item extraction F1
owner attribution accuracy
deadline accuracy
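
As a rough illustration of how the three extraction metrics could be computed, the Python sketch below scores predicted items against gold labels. The item representation (a key mapped to an owner and a deadline), the exact-match rule, and all names and data are assumptions made for this example; the rubric's actual matching rules may differ.

```python
from typing import Dict, Tuple

# Each item key maps to (owner, deadline). Hypothetical representation.
Items = Dict[str, Tuple[str, str]]


def extraction_f1(predicted: Items, gold: Items) -> float:
    """F1 over item keys: harmonic mean of precision and recall."""
    true_positives = len(predicted.keys() & gold.keys())
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def field_accuracy(predicted: Items, gold: Items, field_index: int) -> float:
    """Share of matched items whose owner (0) or deadline (1) agrees with gold."""
    matched = predicted.keys() & gold.keys()
    if not matched:
        return 0.0
    correct = sum(predicted[k][field_index] == gold[k][field_index] for k in matched)
    return correct / len(matched)


# Toy data, not taken from the benchmark dataset.
gold = {
    "ship-onboarding-email": ("Priya", "week 2"),
    "fix-signup-bug": ("Sam", "week 1"),
    "draft-launch-plan": ("Lee", "week 3"),
}
predicted = {
    "ship-onboarding-email": ("Priya", "week 2"),
    "fix-signup-bug": ("Lee", "week 1"),       # right item, wrong owner
    "update-pricing-page": ("Sam", "week 2"),  # not in gold
}

print(f"{extraction_f1(predicted, gold):.2f}")      # 0.67
print(f"{field_accuracy(predicted, gold, 0):.2f}")  # 0.50 (owner)
print(f"{field_accuracy(predicted, gold, 1):.2f}")  # 1.00 (deadline)
```

Owner and deadline accuracy are computed only over items that were matched, so a missed item lowers F1 without also being counted as an attribution error.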

State-tracking quality

Measures whether the system carries work forward correctly across recurring meetings instead of treating every meeting in isolation.

lifecycle transition accuracy
false-closure rate
unresolved-question carry-forward accuracy
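
The state-tracking metrics can be sketched in the same spirit. The Python below is a simplified illustration under stated assumptions: it compares each item's status after a meeting with a gold status rather than scoring full transition histories, and it defines false-closure rate as the share of system-closed items that gold still treats as not closed. The status names, function names, and data are hypothetical.

```python
from typing import Dict

# Status per item after a given meeting, e.g. "open", "done", "blocked".
Statuses = Dict[str, str]


def transition_accuracy(predicted: Statuses, gold: Statuses) -> float:
    """Share of gold-labelled items whose predicted status matches gold."""
    if not gold:
        return 0.0
    correct = sum(predicted.get(item_id) == status for item_id, status in gold.items())
    return correct / len(gold)


def false_closure_rate(predicted: Statuses, gold: Statuses, closed: str = "done") -> float:
    """Share of items the system closed that gold still treats as not closed."""
    closed_by_system = [i for i, s in predicted.items() if s == closed and i in gold]
    if not closed_by_system:
        return 0.0
    false_closures = sum(gold[i] != closed for i in closed_by_system)
    return false_closures / len(closed_by_system)


# Toy data, not taken from the benchmark dataset.
gold = {"A1": "done", "A2": "open", "A3": "blocked", "A4": "open"}
predicted = {"A1": "done", "A2": "done", "A3": "blocked", "A4": "open"}

print(transition_accuracy(predicted, gold))  # 0.75
print(false_closure_rate(predicted, gold))   # 0.5
```

Here A2 is the false closure: the system marked it done while the gold label still has it open.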

PM usefulness quality

Measures whether the output is actually operational for product and engineering follow-up, not just readable as notes.

readiness label accuracy
evidence grounding coverage
human PM scoring
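
Of the three usefulness metrics, evidence grounding coverage is the easiest to automate. One possible reading, sketched in Python: the share of extracted items backed by at least one valid transcript span. The character-offset span format, the validity rule, and all names and data below are assumptions for illustration, not the project's actual evidence schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExtractedItem:
    text: str
    # (start, end) character offsets into the transcript; hypothetical format.
    evidence_spans: List[Tuple[int, int]] = field(default_factory=list)


def is_valid_span(span: Tuple[int, int], transcript_length: int) -> bool:
    """A span is valid if it is non-empty and lies inside the transcript."""
    start, end = span
    return 0 <= start < end <= transcript_length


def grounding_coverage(items: List[ExtractedItem], transcript: str) -> float:
    """Share of items backed by at least one valid transcript span."""
    if not items:
        return 0.0
    grounded = sum(
        any(is_valid_span(span, len(transcript)) for span in item.evidence_spans)
        for item in items
    )
    return grounded / len(items)


# Toy data, not taken from the benchmark dataset.
transcript = "Sam will fix the signup bug by Friday. Launch is still blocked."
items = [
    ExtractedItem("Fix signup bug", [(0, 38)]),
    ExtractedItem("Launch is blocked", [(39, 50)]),
    ExtractedItem("Update pricing page"),           # no evidence at all
    ExtractedItem("Hire contractor", [(500, 520)]),  # span outside transcript
]

print(grounding_coverage(items, transcript))  # 0.5
```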

Artifacts

Benchmark code, rubric, and dataset are all inspectable.

The research claim should be inspectable by anyone reviewing the project, so the repo includes the scenario files, evaluation rubric, and benchmark runner instead of hiding them behind screenshots.

Benchmark harness

Scenario runner, schemas, reports, and dataset structure for longitudinal evaluation.

Evaluation rubric

Research framing, automatic metrics, human scoring, and recommended acceptance thresholds.

Dataset transcripts

Chronological transcript files used for the accountability regression scenario.

Benchmark suite

How to run the multi-scenario suite and compare the stateful system against the transcript-only baseline.

Latest suite reports

Committed JSON and Markdown outputs from the latest benchmark suite run, including aggregate totals and per-scenario breakdowns.

Suite runner

The orchestration script that executes every registered scenario and writes aggregate JSON plus Markdown research artifacts.
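
A minimal Python sketch of the aggregation step such a runner might perform, producing both a JSON payload and a Markdown table. The record fields, function names, and scenario labels are hypothetical, and the counts are placeholders rather than real benchmark results; the actual runner's output schema is not shown on this page.

```python
import json
from typing import Dict, List


def aggregate(results: List[Dict]) -> Dict:
    """Sum passed/failed counts per system across scenarios."""
    totals: Dict[str, Dict[str, int]] = {}
    for result in results:
        bucket = totals.setdefault(result["system"], {"passed": 0, "failed": 0})
        bucket["passed"] += result["passed"]
        bucket["failed"] += result["failed"]
    return {"totals": totals, "scenarios": results}


def to_markdown(report: Dict) -> str:
    """Render aggregate totals as a small Markdown table."""
    lines = ["| System | Passed | Failed |", "| --- | --- | --- |"]
    for system, counts in report["totals"].items():
        lines.append(f"| {system} | {counts['passed']} | {counts['failed']} |")
    return "\n".join(lines)


# Placeholder counts for illustration only.
results = [
    {"scenario": "scenario-a", "system": "stateful", "passed": 9, "failed": 1},
    {"scenario": "scenario-a", "system": "transcript-only", "passed": 7, "failed": 3},
    {"scenario": "scenario-b", "system": "stateful", "passed": 8, "failed": 0},
    {"scenario": "scenario-b", "system": "transcript-only", "passed": 6, "failed": 2},
]

report = aggregate(results)
json_artifact = json.dumps(report, indent=2)  # JSON artifact body
markdown_artifact = to_markdown(report)       # Markdown artifact body
print(markdown_artifact)
```

Keeping the per-scenario records alongside the totals means the aggregate number can always be traced back to the scenario that produced it.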