Research roadmap

Eight pillars. One evidence layer.

Our research is shaped around what scientific teams need to defend. Every pillar produces an artifact a skeptical reviewer can inspect — open schemas, eval suites, grader recipes, routing policies, retrieval traces, safety bundles, and deployment notes.

01 Private alpha

Cartograph — scientific claim graphs

Scientific claims drift across papers, decks, reviews, and packages. Reviewers cannot answer which source, which span, and which model judgment supported a stated dose, mechanism, target rationale, or contraindication.

  • Schema for (claim_id, source_doc_id, span_offsets, retrieval_run_id, model_call_id, eval_verdict) anchored in a typed graph store.
  • Domain LM for claim extraction; entity and relation linker for downstream graph writes.
  • Contradiction detector that compares competing claims across sources with reviewer queue handoff.
  • Confidence flags driven by evaluator-model scores rather than free-form language.
  • Open schema and reference ingestion pipeline.
  • Golden contradiction set across published biomedical claims with known disagreements.
  • Reviewer state machine and queue API.
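
As a rough illustration of the schema bullet above, the record could be sketched as a small Python dataclass. The field names come from the list; the offset convention, the verdict vocabulary, and the score thresholds are assumptions made for this sketch, not part of any published schema.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed verdict vocabulary; the roadmap does not enumerate one.
VERDICTS = {"supported", "contradicted", "insufficient_evidence"}


@dataclass(frozen=True)
class ClaimRecord:
    """One provenance row linking a claim to its source span and model call."""
    claim_id: str
    source_doc_id: str
    span_offsets: Tuple[int, int]  # assumed: (start, end) character offsets, end exclusive
    retrieval_run_id: str
    model_call_id: str
    eval_verdict: str

    def __post_init__(self) -> None:
        start, end = self.span_offsets
        if not (0 <= start < end):
            raise ValueError(f"invalid span offsets: {self.span_offsets}")
        if self.eval_verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.eval_verdict}")


def confidence_flag(evaluator_score: float, high: float = 0.9, low: float = 0.6) -> str:
    """Map an evaluator-model score in [0, 1] to a flag (thresholds are hypothetical)."""
    if not 0.0 <= evaluator_score <= 1.0:
        raise ValueError("evaluator_score must be in [0, 1]")
    if evaluator_score >= high:
        return "high"
    if evaluator_score >= low:
        return "medium"
    return "low"
```

Keeping the flag a pure function of the evaluator score is one way to keep confidence language out of free-form model text.
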
What would change our minds

We would step back from this direction if reviewers consistently prefer chat over addressable graph artifacts when shown both in their actual review workflow.

02 Research track

Procedure — protocol reproducibility deltas

Methods sections compress steps. Lab execution diverges. The delta — parameter changes, instrument swaps, reagent substitutions — rarely makes it back into the next paper or the next batch.

  • Protocol ingestion that yields a structured parameter graph with reagents, instruments, ranges, and dependencies.
  • Ambiguity annotator that flags underspecified steps where execution can drift.
  • Executable checklists generated from the parameter graph for bench use.
  • Delta diff against captured run logs, with deltas written back to the claim graph.
  • Reference parameter-graph spec for a small set of assay families.
  • Open ambiguity-recall eval set with paired protocol and execution log examples.
  • Reproducibility delta visualizer for review.
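
The delta diff described above could look roughly like the following. This is a minimal sketch that assumes a flat list of parameters and a run log keyed by parameter name; `ParamSpec`, `Delta`, `diff_run`, and the three delta kinds are hypothetical names, and a real parameter graph would also carry dependencies.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

Value = Union[float, str]


@dataclass(frozen=True)
class ParamSpec:
    name: str
    expected: Value
    allowed_range: Optional[Tuple[float, float]] = None  # inclusive bounds, if numeric


@dataclass(frozen=True)
class Delta:
    name: str
    expected: Value
    observed: Optional[Value]
    kind: str  # "missing", "out_of_range", or "changed"


def diff_run(protocol: List[ParamSpec], run_log: Dict[str, Value]) -> List[Delta]:
    """Compare a captured run log against the protocol's parameters."""
    deltas: List[Delta] = []
    for spec in protocol:
        if spec.name not in run_log:
            deltas.append(Delta(spec.name, spec.expected, None, "missing"))
            continue
        observed = run_log[spec.name]
        if spec.allowed_range is not None and isinstance(observed, (int, float)):
            low, high = spec.allowed_range
            if not low <= observed <= high:
                deltas.append(Delta(spec.name, spec.expected, observed, "out_of_range"))
        elif observed != spec.expected:
            deltas.append(Delta(spec.name, spec.expected, observed, "changed"))
    return deltas
```

For example, a run that incubated outside the allowed range, swapped the centrifuge, and never logged the wash count would yield one delta of each kind, each of which could then be written back to the claim graph.
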
What would change our minds

If real bench operators find the checklists slower than current SOP docs even after adjustment, the harness fails its purpose.

03 Prototype

Crosswalk — biomedical eval foundry

Model selection and regression testing on scientific tasks still ride on leaderboards designed for chatbots. There is no standard suite for citation fidelity, dose extraction, adverse-event triage, mechanism plausibility, or synthesis reasoning.

  • Open eval task definitions with rubrics and adversarial probes.
  • Grader models per task: citation linker, dose extractor, AE classifier, synthesis step checker.
  • Reproducible run manifest (model id, temperature, seed, dataset version, grader version) attached to every benchmark run.
  • Regression reports comparable across model generations and fine-tunes.
  • First public eval suite for citation fidelity and dose extraction.
  • Grader-model recipes and weights where licensing permits.
  • Public model cards with eval verdicts per task.
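
One possible shape for the run manifest named above is sketched here. The five fields are taken from the list; the fingerprint scheme (SHA-256 over canonical JSON) and the `comparable` rule are assumptions for illustration, not a published format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    model_id: str
    temperature: float
    seed: int
    dataset_version: str
    grader_version: str

    def fingerprint(self) -> str:
        """Stable hash of the manifest; key order is fixed so equal manifests hash equally."""
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(a: RunManifest, b: RunManifest) -> bool:
    """Assumed rule: two runs can be compared only on the same dataset and grader versions."""
    return a.dataset_version == b.dataset_version and a.grader_version == b.grader_version
```

Attaching the fingerprint to every benchmark result would let a regression report refuse to compare runs whose dataset or grader changed underneath them.
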
What would change our minds

If grader models prove unstable across runs and cannot meet inter-rater agreement floors, the suites are not yet trustworthy enough to publish.

04 Research track

Model routing for scientific reasoning

Frontier models are expensive and slow for narrow extraction. Small models are cheap and fast but unreliable on reasoning. A single-model deployment therefore either overpays for extraction or underperforms on reasoning. There is no shared rubric for which model takes which step.

  • Router rubric mapping task families (extraction, reasoning, classification, grading) to model classes (frontier reasoner, domain LM, evaluator).
  • Cost-aware routing with bounded fallback chains and explicit decision logs.
  • Open routing policies committed to the run manifest, not hidden in product layers.
  • Inline eval gates that can demote routing decisions on contradiction or insufficient evidence.
  • Open routing policy DSL with example policies per task family.
  • Decision-trace export format compatible with the claim graph.
  • Comparison harness for routed runs versus single-model baselines.
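
To make the rubric and fallback bullets concrete, here is a minimal routing sketch. The task families and model classes are the ones listed above; the `POLICY` table, the cheapest-first ordering, and the `route` signature are hypothetical, and the eval gate is modeled as a plain callable that escalates to the next model class on failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical policy: task family -> ordered fallback chain, cheapest class first.
POLICY: Dict[str, List[str]] = {
    "extraction": ["domain_lm", "frontier_reasoner"],
    "classification": ["domain_lm", "frontier_reasoner"],
    "reasoning": ["frontier_reasoner"],
    "grading": ["evaluator"],
}


@dataclass(frozen=True)
class Decision:
    task_family: str
    model_class: str
    accepted: bool
    reason: str


def route(
    task_family: str,
    call_model: Callable[[str], str],
    eval_gate: Callable[[str], bool],
    max_attempts: int = 2,
) -> Tuple[Optional[str], List[Decision]]:
    """Try each model class in the chain until the eval gate accepts an output."""
    log: List[Decision] = []
    for model_class in POLICY[task_family][:max_attempts]:  # bounded fallback chain
        output = call_model(model_class)
        if eval_gate(output):
            log.append(Decision(task_family, model_class, True, "passed eval gate"))
            return output, log
        log.append(Decision(task_family, model_class, False, "failed eval gate"))
    return None, log  # chain exhausted; caller decides what to do next
```

The returned log is the explicit decision trace: every attempt is recorded, including rejected ones, so a reviewer can see why a step escalated.
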
What would change our minds

If routed runs do not show meaningful quality improvements or cost reductions over best single-model baselines on biomedical tasks, the layer is premature.

05 Prototype

Private corpus retrieval

Internal lab notes, private PDFs, ELN exports, and unpublished manuscripts sit in fragmented stores with different access rules. Public search and naive RAG either over-share or miss the right span.

  • Per-tenant retrieval with tenant-isolated indices and strict access boundaries.
  • Hybrid lexical and dense retrieval with reranking tuned for scientific phrasing.
  • Local embedding generation; configurable on-prem or edge deployment.
  • Trace export that records which spans crossed the model boundary on every run.
  • Reference deployment for self-hosted private retrieval.
  • Benchmarks across mixed lexical and domain-specific queries.
  • Open retrieval-trace format that downstream reviewers can audit.
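
The roadmap does not say how lexical and dense results are combined. One common option is reciprocal rank fusion, sketched below purely as an assumption; it covers only the fusion step, not reranking, tenant isolation, or trace export. The function name and the default `k = 60` are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of span ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, span_id in enumerate(ranking, start=1):
            scores[span_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda span_id: (-scores[span_id], span_id))


# Hypothetical span ids from a lexical run and a dense run over the same tenant index.
lexical = ["s1", "s2", "s3"]
dense = ["s3", "s1", "s4"]
fused = reciprocal_rank_fusion([lexical, dense])
```

Spans that rank well in both lists rise to the top, which is the behavior a benchmark against a tuned lexical baseline would need to justify.
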
What would change our minds

If hybrid retrieval cannot reliably outperform tuned BM25 plus reranker on scientific queries, the added complexity is not warranted.

06 Research track

Citation-preserving media generation

Hard science generally loses its source chain when adapted into explainers, diagrams, internal teaching, or onboarding materials. Stripped citations damage downstream trust.

  • Generation pipeline that requires every assertion to map to a (claim_id, source span) before render.
  • Diagram and storyboard agents that emit source attributions inline rather than as a postscript.
  • QA loops that detect quietly dropped citations between revisions.
  • Optional reviewer pass before media artifacts leave the system.
  • Reference pipeline for citation-preserving lesson and explainer artifacts.
  • Eval suite for citation-drop detection across revisions.
  • Open templates compatible with the broader claim graph.
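
The pre-render check and the citation-drop QA loop above can be sketched as two small functions. This assumes each assertion in a media artifact carries an optional `claim_id`; the `Assertion` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set


@dataclass(frozen=True)
class Assertion:
    text: str
    claim_id: Optional[str]  # None means the assertion has no mapped claim


def unmapped(assertions: Sequence[Assertion]) -> List[Assertion]:
    """Assertions that would block rendering because they lack a claim mapping."""
    return [a for a in assertions if not a.claim_id]


def dropped_citations(previous: Sequence[Assertion], current: Sequence[Assertion]) -> Set[str]:
    """Claim ids cited in the previous revision but absent from the current one."""
    before = {a.claim_id for a in previous if a.claim_id}
    after = {a.claim_id for a in current if a.claim_id}
    return before - after
```

A render step could refuse to proceed while `unmapped` is non-empty, and a revision diff could surface `dropped_citations` to the optional reviewer pass.
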
What would change our minds

If maintaining inline citations measurably reduces media quality without an audit benefit reviewers care about, this is a research-only direction.

07 Research track

Safety and QA for biomedical AI

Generic refusal language is the wrong tool for biomedical AI. The right tool is a domain-specific safety policy that blocks claims without citations and doses without sources, and routes adverse-event mentions to review.

  • Runtime guardrails framed as evidence requirements, not topic refusals.
  • Mandatory citation for any claim about dose, mechanism, contraindication, or adverse event.
  • Routing of adverse-event mentions and safety-critical claims to the reviewer queue with reasons attached.
  • Open policy bundles teams can audit, fork, and extend per deployment.
  • Reference safety policy bundle for biomedical reasoning.
  • Eval suite for policy bypass attempts.
  • Reviewer-handoff record with explicit rationale and policy hit.
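
A guardrail framed as an evidence requirement might look like the sketch below. The claim kinds come from the list above; the `Claim` and `PolicyResult` types, the action names, and the policy-hit labels are assumptions. Only adverse-event routing is modeled here, not the broader "safety-critical" category.

```python
from dataclasses import dataclass
from typing import Optional

# Claim kinds that must carry a citation, per the list above.
CITATION_REQUIRED = {"dose", "mechanism", "contraindication", "adverse_event"}
# Claim kinds that always go to the reviewer queue (assumed to be adverse events only).
REVIEW_REQUIRED = {"adverse_event"}


@dataclass(frozen=True)
class Claim:
    text: str
    kind: str
    source_span: Optional[str] = None  # hypothetical pointer to the cited span


@dataclass(frozen=True)
class PolicyResult:
    action: str                # "allow", "block", or "review"
    policy_hit: Optional[str]  # which rule fired, if any
    rationale: str


def check(claim: Claim) -> PolicyResult:
    """Apply evidence requirements: the topic is never refused, missing evidence is."""
    if claim.kind in CITATION_REQUIRED and not claim.source_span:
        return PolicyResult("block", "citation_required", f"{claim.kind} claim has no source span")
    if claim.kind in REVIEW_REQUIRED:
        return PolicyResult("review", "adverse_event_review", "adverse-event mention routed to reviewer queue")
    return PolicyResult("allow", None, "evidence requirements met")
```

The result doubles as a reviewer-handoff record, since it carries both the policy hit and the rationale.
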
What would change our minds

If teams adopt the policies but bypass them in practice for everyday questions, the policies are too coarse and should be rewritten as evidence requirements rather than topic gates.

08 Platform direction

Open-source roadmap

Scientific teams need to be able to inspect, audit, and self-host the systems they build evidence trails with. Closed black boxes are not a serious offering for this audience.

  • Open-source core for the claim graph schema, protocol harness primitives, and grader interfaces.
  • Permissive licensing where possible; copyleft where the artifact must remain open.
  • Private deployment patterns documented alongside the open-source core, with no feature gating that prevents self-hosted teams from running their own evidence layer.
  • Public benchmark gate built on the eval foundry — the company eats its own dog food.
  • Public repository with reference implementations.
  • Self-host guide covering the agent runtime, retrieval, and eval gate runner.
  • Release notes that document every grader change and eval-set version.
What would change our minds

If the open-source surface and the private deployment surface diverge enough that self-hosted teams cannot reproduce production behavior, the split has failed.