Stateful execution memory
live
Uses project memory, carry-forward reconciliation, evidence metadata, and accountability-aware extraction across recurring meetings.
Research overview
The research thesis is simple: meeting intelligence becomes more useful when it remembers prior commitments, tracks accountability over time, and is benchmarked longitudinally against a transcript-only baseline.
Scenario
5 meetings
Onboarding Growth Initiative - 5 Week Accountability Regression
Suite coverage
2 scenarios
The suite now compares 2 systems across accountability and recovery sequences instead of relying on one benchmark story.
Current system
37 / 1
Passed / failed checks with project memory enabled. Approximate pass rate: 97%.
Transcript-only baseline
34 / 4
Same scenario, but without prior project memory. Approximate pass rate: 89%.
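The approximate pass rates above follow directly from the passed and failed counts. A minimal sketch of the arithmetic, assuming rates are rounded to the nearest whole percent; `passRate` is a hypothetical helper for illustration, not a function from the benchmark code:

```typescript
// Hypothetical helper: pass rate as a whole-number percentage.
function passRate(passed: number, failed: number): number {
  const total = passed + failed;
  if (total === 0) return 0; // avoid dividing by zero on an empty run
  return Math.round((passed / total) * 100);
}

const statefulRate = passRate(37, 1); // 37 of 38 checks passed
const baselineRate = passRate(34, 4); // 34 of 38 checks passed
console.log(statefulRate, baselineRate); // 97 89
```

Both systems are scored on the same 38 checks, so the gap comes down to the three additional checks that fail without project memory.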
Suite snapshot
Every full suite run writes JSON and Markdown artifacts. When a suite report exists in the environment, it appears here automatically; otherwise the page shows the exact command needed to generate one.
No suite artifact found yet
This environment does not currently have a generated suite report under benchmark/reports/suites. The benchmark code is ready; the artifact appears here after the suite is run.
Generate the suite artifact
docker compose --env-file .env.docker up -d postgres
pnpm --filter @meeting-ai/ai-backend dev
pnpm benchmark:suite
Scenario library
Carry-forward accountability, launch readiness, unresolved questions, and whether a project-memory system can reconcile shifting execution state over time.
Incident recovery memory, rollback decision closure, launch gating, vendor escalation tracking, and transition from firefighting into release readiness.
Ablation design
Current system: uses project memory, carry-forward reconciliation, evidence metadata, and accountability-aware extraction across recurring meetings.
Transcript-only baseline: reasoning is limited to the current meeting transcript and meeting-local metadata, with no prior project state or carry-forward context.
Planned follow-ups include memory-without-evidence, extraction-only without reconciliation, and lifecycle-transition scoring against gold labels.
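One way to picture the ablation design is as a set of components that each variant switches on or off. The sketch below is illustrative only: the `Component` names, the `Variant` shape, and the `ablated` helper are assumptions rather than the repo's actual configuration schema, and the baseline is assumed to enable none of the stateful components.

```typescript
// Illustrative only: these names are assumptions, not the repo's schema.
type Component =
  | "projectMemory"
  | "carryForwardReconciliation"
  | "evidenceMetadata"
  | "accountabilityAwareExtraction";

interface Variant {
  name: string;
  components: Component[];
}

const stateful: Variant = {
  name: "stateful",
  components: [
    "projectMemory",
    "carryForwardReconciliation",
    "evidenceMetadata",
    "accountabilityAwareExtraction",
  ],
};

// Assumption: the baseline enables none of the stateful components.
const transcriptOnly: Variant = { name: "transcript-only", components: [] };

// Planned follow-up from the text: memory without evidence metadata.
const memoryWithoutEvidence: Variant = {
  name: "memory-without-evidence",
  components: stateful.components.filter((c) => c !== "evidenceMetadata"),
};

// Components present in the reference system but missing from the ablation.
function ablated(reference: Variant, ablation: Variant): Component[] {
  return reference.components.filter(
    (c) => ablation.components.indexOf(c) === -1
  );
}

console.log(ablated(stateful, memoryWithoutEvidence)); // [ 'evidenceMetadata' ]
```

Describing each variant by what it removes keeps every ablation comparable against the same reference system.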
Research contribution
Why the baseline matters
A transcript-only system can produce readable outputs, but it cannot reliably decide what stayed open, what was resolved later, which deadlines silently slipped, or when the project is genuinely ready. That is the difference this project is trying to measure.
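To make that distinction concrete, here is a minimal sketch of carry-forward reconciliation. It assumes a simplified data model: `ActionItem`, `MeetingUpdate`, and `reconcile` are hypothetical names, and the project's real schemas and logic may differ. The point is only that deciding what stayed open, what was resolved, and what slipped requires the prior open items as input, which a transcript-only system does not have.

```typescript
// Hypothetical shapes for illustration; the real schemas may differ.
interface ActionItem {
  id: string;
  owner: string;
  deadline: string; // ISO date such as "2024-01-08"
}

interface MeetingUpdate {
  meetingDate: string;    // ISO date of the current meeting
  resolvedIds: string[];  // prior items the transcript reports as done
  newItems: ActionItem[]; // items first raised in this meeting
}

interface ReconciledState {
  stillOpen: ActionItem[];
  resolved: ActionItem[];
  slipped: ActionItem[]; // carried forward and already past deadline
}

function reconcile(
  priorOpen: ActionItem[],
  update: MeetingUpdate
): ReconciledState {
  const isResolved = (item: ActionItem) =>
    update.resolvedIds.indexOf(item.id) !== -1;
  const resolved = priorOpen.filter(isResolved);
  const carried = priorOpen.filter((item) => !isResolved(item));
  // ISO dates sort lexicographically, so string comparison is enough here.
  const slipped = carried.filter((item) => item.deadline < update.meetingDate);
  return { stillOpen: carried.concat(update.newItems), resolved, slipped };
}

// Made-up example data, not taken from the benchmark scenarios.
const example = reconcile(
  [
    { id: "a", owner: "dana", deadline: "2024-01-05" },
    { id: "b", owner: "lee", deadline: "2024-01-20" },
    { id: "c", owner: "sam", deadline: "2024-01-08" },
  ],
  {
    meetingDate: "2024-01-10",
    resolvedIds: ["a"],
    newItems: [{ id: "d", owner: "dana", deadline: "2024-01-30" }],
  }
);
console.log(example.slipped.map((item) => item.id)); // [ 'c' ]
```

In this example, item "c" was never mentioned in the current meeting, yet it surfaces as slipped because its deadline passed while it was still open.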
Core claim
Longitudinal project memory should outperform single-meeting reasoning when the task is execution support, not note generation.
Evaluation layers
The rubric separates extraction quality, state-tracking quality, and PM usefulness so the research story stays grounded in measurable behavior.
Extraction quality measures whether the system identifies the right actions, owners, deadlines, questions, and risks from a meeting transcript.
State-tracking quality measures whether the system carries work forward correctly across recurring meetings instead of treating every meeting in isolation.
PM usefulness measures whether the output is actually operational for product and engineering follow-up, not just readable as notes.
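Keeping the three layers separate means each one gets its own tally rather than being folded into a single score. A minimal sketch, assuming each benchmark check is tagged with the layer it belongs to; the `Layer`, `CheckResult`, and `scoreByLayer` names are hypothetical and not taken from the rubric or the runner:

```typescript
// Hypothetical names; the actual rubric and runner may be structured differently.
type Layer = "extraction" | "stateTracking" | "pmUsefulness";

interface CheckResult {
  layer: Layer;
  passed: boolean;
}

interface LayerScore {
  passed: number;
  failed: number;
}

// Tally passed and failed checks per layer instead of one combined total.
function scoreByLayer(results: CheckResult[]): Record<Layer, LayerScore> {
  const scores: Record<Layer, LayerScore> = {
    extraction: { passed: 0, failed: 0 },
    stateTracking: { passed: 0, failed: 0 },
    pmUsefulness: { passed: 0, failed: 0 },
  };
  for (const result of results) {
    if (result.passed) {
      scores[result.layer].passed += 1;
    } else {
      scores[result.layer].failed += 1;
    }
  }
  return scores;
}
```

With per-layer tallies, a system that extracts well but tracks state poorly shows up as exactly that, instead of averaging out to a middling overall number.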
Artifacts
The research claim should be inspectable by anyone reviewing the project, so the repo includes the scenario files, evaluation rubric, and benchmark runner instead of hiding them behind screenshots.
Benchmark harness
Scenario runner, schemas, reports, and dataset structure for longitudinal evaluation.
Evaluation rubric
Research framing, automatic metrics, human scoring, and recommended acceptance thresholds.
Dataset transcripts
Chronological transcript files used for the accountability regression scenario.
Benchmark suite
How to run the multi-scenario suite and compare the stateful system against the transcript-only baseline.
Latest suite reports
Committed JSON and Markdown outputs from the latest benchmark suite run, including aggregate totals and per-scenario breakdowns.
Suite runner
The orchestration script that executes every registered scenario and writes aggregate JSON plus Markdown research artifacts.