Evidence control plane for scientific work

Custom models for science, bound to a harness that proves their work.

SaaS Syndicate Labs is an independent AI systems lab. We build a stack of custom scientific models — domain fine-tunes, small task LMs, embeddings, rerankers, graders, and routing policies — and the open-source harness that orchestrates them with provenance, evaluations, and human review. Every output is addressable by run id and source span.

Domain fine-tunes · Claim graphs · Protocol harnesses · Biomedical eval suites · Private agents

What breaks

Scientific work does not fail because teams lack chatbots.

It fails because evidence, protocols, models, datasets, and decisions do not stay connected. Six bottlenecks shape every model we train and every product we build.

Claim drift

Citations shift across decks, reviews, and packages until no one can answer which source said what.

Protocol ambiguity

Methods sections compress steps. Lab execution diverges. The delta never makes it back.

Assay variance

Run-to-run drift in instruments and reagents goes unrecorded next to model-derived interpretations.

Model hallucination

Frontier models read papers well and invent references confidently. Chat UX hides which is which.

Fragmented corpora

Internal lab notes, private PDFs, and public literature sit in different stores with different access rules.

Unbenchmarked agents

Scientific agents are evaluated on chatbot leaderboards. No standard for citation fidelity or dose extraction.

How we approach it

Two things matter: the models we build, and the harness that proves them.

Frontier reasoning will keep getting better. That is not enough. Scientific work also needs custom models — extractors, classifiers, embedders, rerankers, graders — adapted to biomedical and chemical vocabulary, and an orchestration harness that produces an audit trail any reviewer can replay.

The model zoo and the harness are the two compounding assets. Both ship open-source where possible. Both versioned independently. Both inspectable end-to-end.

Layer 1

Custom models

Domain fine-tunes, small task LMs, embedding and reranker models, grader models, routing policies — built and adapted for scientific work, not borrowed from chatbot benchmarks.

Layer 2

Retrieval

Public literature, private corpora, run artifacts, and lab notes — indexed independently with hybrid lexical, vector, and reranker stages.
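As an illustration of how a hybrid stage could combine its inputs, the sketch below merges a lexical ranking and a vector ranking with reciprocal rank fusion before any reranker sees the candidates. The fusion method, the function name, the constant `k=60`, and the document ids are all assumptions made for the example; the description above does not say how the stages are actually combined.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

lexical = ["doc_a", "doc_b", "doc_c"]  # hypothetical keyword ranking
vector = ["doc_b", "doc_d", "doc_a"]   # hypothetical embedding ranking
fused = reciprocal_rank_fusion([lexical, vector])
```

Documents that appear high in both lists rise to the top, so a cross-encoder reranker only needs to score a short fused candidate list.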

Layer 3

Model routing

Frontier reasoners for hard inference, our domain fine-tunes for extraction and classification, our evaluator models for grading other models. One policy, one decision log.

Layer 4

Tool orchestration

Long-horizon tool use for protocol checks, claim extraction, contradiction detection, and structured graph writes.

Layer 5

Eval gates

Discrete gates with four explicit states (pass, review, contradiction, insufficient evidence), applied before any claim leaves the run.
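The gate states listed above can be sketched as an enum plus one decision function. This is a minimal illustration, not the harness implementation: the input fields, the check order, and the 0.8 threshold are hypothetical.

```python
from enum import Enum

class GateState(Enum):
    PASS = "pass"
    REVIEW = "review"
    CONTRADICTION = "contradiction"
    INSUFFICIENT_EVIDENCE = "insufficient_evidence"

def gate_claim(num_source_spans, grader_score, contradicted, pass_threshold=0.8):
    """Return exactly one gate state for a claim (thresholds are assumptions)."""
    if num_source_spans == 0:
        # Nothing to ground the claim in: stop before scoring.
        return GateState.INSUFFICIENT_EVIDENCE
    if contradicted:
        return GateState.CONTRADICTION
    if grader_score >= pass_threshold:
        return GateState.PASS
    # Grounded and uncontradicted, but the grader is not confident.
    return GateState.REVIEW
```

Because every claim resolves to exactly one state, downstream consumers can filter on the state without parsing free text.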

Layer 6

Provenance + review

Every claim carries (claim_id, run_id) with source spans, model calls, tool traces, eval verdicts, and reviewer state. Human review is part of the loop, not a footnote.
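A claim record of this shape might look like the sketch below. Only `claim_id`, `run_id`, and the listed categories come from the description above; the field names, types, and default reviewer state are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SourceSpan:
    document_id: str
    start: int  # character offset where the supporting text begins
    end: int    # character offset where it ends (exclusive)

@dataclass
class ClaimRecord:
    claim_id: str
    run_id: str
    text: str
    source_spans: list = field(default_factory=list)
    model_calls: list = field(default_factory=list)
    tool_traces: list = field(default_factory=list)
    eval_verdicts: dict = field(default_factory=dict)
    reviewer_state: str = "unreviewed"  # hypothetical default

    @property
    def key(self):
        # The (claim_id, run_id) pair that addresses this claim.
        return (self.claim_id, self.run_id)

record = ClaimRecord(
    claim_id="claim-001",
    run_id="run-042",
    text="Example claim text.",
    source_spans=[SourceSpan("paper-7", 120, 188)],
)
```

Keeping reviewer state on the record itself, rather than in a separate tool, is what lets a replay reproduce the review outcome along with the model output.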

Model layer

Six classes of model. One adaptation pipeline.

We do not ship a single fine-tune. We ship a stack: frontier reasoners we route, domain fine-tunes we adapt from open weights, small task LMs we train for the hot path, embedding and reranker models tuned to scientific phrasing, grader models that score other models, and routing policies that pick which model handles which step.

01 · Orchestrated, not trained

Frontier reasoners

Long-context reasoning across multi-document evidence bundles, mechanism plausibility, and contradiction synthesis. We orchestrate them through a router policy with full decision logs — we do not train models at this scale.

  • Claude / GPT / Grok / Gemini class
  • Used inside a router with bounded cost + fallback policy
  • Decision logs written to the run manifest
Hosted API · Routed

02 · Built on open weights

Domain fine-tunes

Open-weight base models adapted to scientific tasks where general models drop precision, miss biomedical entities, or fail on dose / mechanism / adverse-event reasoning. Trained with LoRA, QLoRA, or full SFT on curated golden sets.

  • Claim linker: paper claim to source span
  • Dose and units extractor
  • Adverse-event classifier
  • Synthesis-step checker
  • Mechanism plausibility scorer
7B–14B · LoRA / QLoRA · Self-hosted

03 · Purpose-built

Small task LMs

Narrow, fast models for extraction and classification on the hot path of every run. They handle the high-volume structured work where frontier reasoners would be slow and expensive.

  • Section + claim extractor
  • Entity + relation linker
  • PDF / methods structurer
  • Ambiguity-flag annotator
1B–3B · Quantized · Self-hosted

04 · Tuned to scientific phrasing

Embedding + reranker models

Domain-adapted embeddings and rerankers tuned to biomedical and chemistry vocabulary. Public embeddings under-recall on technical synonyms, abbreviations, and quantitative phrasing.

  • Biomedical sentence + passage embeddings
  • Chemistry-aware embeddings
  • Cross-encoder reranker for citation precision
Bi-encoder · Cross-encoder · On-prem

05 · Calibrated for science

Grader and judge models

Model-as-judge with explicit rubrics. Graders are versioned independently of the systems they score, so eval changes are auditable. These run inside every Crosswalk benchmark and every Cartograph contradiction check.

  • Citation-fidelity grader
  • Dose-extraction grader
  • Mechanism-plausibility grader
  • Synthesis-step verifier
Versioned · Adversarial probes · Reproducible

06 · Cost + quality aware

Routing policies

Small policy networks and rules that decide which model handles which step under cost, latency, sensitivity, and quality budgets. Routing decisions are part of the audit trail, not hidden in a product layer.

  • Cost-aware task routing
  • Sensitivity-aware fallback
  • Per-tenant routing overrides
  • Open policy DSL
Policy-as-code · Trace-exported
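
To make "routing decisions are part of the audit trail" concrete, here is a small rule-based sketch. The rule order, the model labels, the 0.05 USD budget cutoff, and the log format are all hypothetical; this is not the policy DSL named above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    task: str          # e.g. "extraction", "contradiction_synthesis"
    sensitive: bool    # True if the content must stay in-tenant
    budget_usd: float  # spend allowed for this step

def route(step, decision_log):
    """Pick a model class for one step and record why (first matching rule wins)."""
    if step.sensitive:
        choice, reason = "self_hosted_fine_tune", "sensitive content stays in-tenant"
    elif step.task in {"extraction", "classification"}:
        choice, reason = "small_task_lm", "high-volume structured work"
    elif step.budget_usd >= 0.05:
        choice, reason = "frontier_reasoner", "hard inference within budget"
    else:
        choice, reason = "self_hosted_fine_tune", "budget too low for hosted reasoner"
    # Every decision is appended to the log so a reviewer can replay it.
    decision_log.append({"task": step.task, "model": choice, "reason": reason})
    return choice

log = []
first = route(Step("extraction", sensitive=False, budget_usd=0.01), log)
second = route(Step("contradiction_synthesis", sensitive=True, budget_usd=1.0), log)
```

Checking sensitivity before cost means a generous budget can never send private content to a hosted endpoint.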

One adaptation pipeline under all of it.

Curated datasets → LoRA, QLoRA, or full SFT depending on the task → quantization-aware compression → eval gate through Crosswalk → versioned release with model card. The same pipeline that produces our claim linker also produces our adverse-event classifier and our biomedical embeddings. The product wedge is the harness; the platform asset is the model zoo this pipeline builds.

curate → SFT / LoRA / QLoRA → quantize → Crosswalk gate → release
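
The stage order above can be expressed as a short driver in which release is unreachable unless the gate passes. The stage names mirror the pipeline line; the function, its arguments, and the returned dictionary are assumptions for illustration, and no training actually happens here.

```python
STAGES = ["curate", "adapt", "quantize", "crosswalk_gate", "release"]

def run_pipeline(model_name, gate_passed):
    """Walk the stages in order, stopping before release if the eval gate failed."""
    completed = []
    for stage in STAGES:
        if stage == "release" and not gate_passed:
            break  # a failed gate blocks the release stage
        completed.append(stage)
    return {
        "model": model_name,
        "completed": completed,
        "released": "release" in completed,
    }

ok = run_pipeline("claim-linker", gate_passed=True)
blocked = run_pipeline("dose-extractor", gate_passed=False)
```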

System

A loop that produces evidence, not answers.

Ingestion, retrieval, model routing across our custom models and the frontier reasoners they sit alongside, tool orchestration, eval gates, a provenance graph, human review, and a typed export surface. Every run emits a manifest any consumer can replay.

What we are building

Five modules. One graph. One harness. One model foundation.

Four user-facing modules sit on top of one model foundation. Each module is a real system with named users, declared maturity, and a durable artifact it writes back into the claim graph.

01 · Private alpha

Cartograph

Scientific claim graph with provenance edges, confidence flags, contradictions, and review queues.

Built for
Computational lead at an early-stage discovery biotech.
Pain it removes
Claim drift across decks, reviews, target packages, and pre-IND evidence — no traceable answer to which source supported which decision.
System shape
Document ingestion → section + claim extraction → entity and relation linking → provenance edge writer → contradiction detector → reviewer queue.
Model layer
Our claim linker for extraction. Our entity + relation linker for graph writes. Our mechanism-plausibility grader for confidence. Frontier reasoner for contradiction synthesis. Embedding + reranker retrieval.
Compounds the graph every other product writes into.
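
As a toy picture of the contradiction-detector step, the sketch below flags pairs of claims that assert and deny the same subject-relation-object triple. Representing claims as dictionaries with a boolean `supports` field is an assumption made for the example; a real detector would also need entity normalization and model-based comparison.

```python
from collections import defaultdict

def find_contradictions(claims):
    """Return (supporting_id, refuting_id) pairs that disagree on the same triple."""
    by_triple = defaultdict(lambda: {True: [], False: []})
    for c in claims:
        triple = (c["subject"], c["relation"], c["object"])
        by_triple[triple][c["supports"]].append(c["claim_id"])
    pairs = []
    for sides in by_triple.values():
        for pos in sides[True]:
            for neg in sides[False]:
                pairs.append((pos, neg))
    return sorted(pairs)

claims = [  # hypothetical extracted claims
    {"claim_id": "c1", "subject": "drug_x", "relation": "inhibits", "object": "kinase_y", "supports": True},
    {"claim_id": "c2", "subject": "drug_x", "relation": "inhibits", "object": "kinase_y", "supports": False},
    {"claim_id": "c3", "subject": "drug_x", "relation": "binds", "object": "kinase_y", "supports": True},
]
conflicts = find_contradictions(claims)
```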
02 · Research track

Procedure

Protocol reproducibility harness — checklists, parameter maps, ambiguity flags, and execution deltas.

Built for
Bench scientist plus scientific-operations lead at a wet-lab discovery organization.
Pain it removes
Protocol drift between methods sections and bench execution. Assay variance never makes it back into the next paper or the next batch.
System shape
Protocol ingestion → structured parameter graph → ambiguity flag annotator → executable checklist generator → diff against captured run logs.
Model layer
Our PDF / methods structurer for ingestion. Our ambiguity annotator. Our dose + units extractor. Our parameter-completeness grader. Frontier reasoner for hard protocol structuring.
Feeds the claim graph with executable bench evidence.
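
The "diff against captured run logs" step can be pictured as a comparison between the parameters a protocol specifies and the values a run recorded. The flat dictionary format and the three delta labels below are assumptions for illustration.

```python
def execution_deltas(protocol_params, run_log):
    """Compare specified parameters with recorded ones.

    Returns {name: (kind, expected, observed)} where kind is
    "missing", "changed", or "unspecified".
    """
    deltas = {}
    for name, expected in protocol_params.items():
        if name not in run_log:
            deltas[name] = ("missing", expected, None)
        elif run_log[name] != expected:
            deltas[name] = ("changed", expected, run_log[name])
    for name, observed in run_log.items():
        if name not in protocol_params:
            # Recorded at the bench but never stated in the protocol.
            deltas[name] = ("unspecified", None, observed)
    return deltas

protocol = {"incubation_min": 30, "temperature_c": 37}           # hypothetical
run = {"incubation_min": 45, "temperature_c": 37, "rpm": 200}    # hypothetical
deltas = execution_deltas(protocol, run)
```

The "unspecified" entries are the ones a methods section tends to leave out, which makes them candidates for ambiguity flags.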
03 · Prototype

Crosswalk

Open biomedical eval foundry — citation fidelity, dose extraction, mechanism reasoning, adverse-event triage, synthesis checks.

Built for
AI / ML lead at a science-adjacent platform, publisher, or research lab evaluating science fine-tunes.
Pain it removes
Model selection for scientific tasks today rides on chatbot leaderboards. No standard suite for the things that actually matter at the bench or the desk.
System shape
Eval task definitions → golden sets → grader models per task → run harness → regression report generator. Pluggable to any model endpoint.
Model layer
Our grader stack: citation-fidelity grader, dose-extraction grader, mechanism-plausibility grader, synthesis-step verifier. Each versioned independently with adversarial probes.
Becomes the public benchmark gate every other module uses.
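
A regression report of the kind described could compare per-task scores between a baseline and a candidate model. The task names, the scores, and the 0.01 tolerance below are hypothetical.

```python
def regression_report(baseline, candidate, tolerance=0.01):
    """Label each baseline task as "ok", "regressed", or "missing" for the candidate."""
    report = {}
    for task, base_score in baseline.items():
        new_score = candidate.get(task)
        if new_score is None:
            report[task] = "missing"
        elif new_score < base_score - tolerance:
            report[task] = "regressed"
        else:
            report[task] = "ok"
    return report

baseline = {"citation_fidelity": 0.90, "dose_extraction": 0.80, "ae_triage": 0.70}
candidate = {"citation_fidelity": 0.85, "dose_extraction": 0.82}
report = regression_report(baseline, candidate)
```

A release gate built on this report would block on any "regressed" or "missing" entry.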
04 · Prototype

Atrium

Private research agent runtime — tool-using agents over local corpora with auditable traces.

Built for
Research-platform engineer at a discovery biotech or scientific publisher running private papers, protocols, and internal lab notes.
Pain it removes
Internal AI helpers either send private data to public endpoints or operate as chat with no audit trail. Teams need private-by-default agents that produce inspectable traces.
System shape
Private corpus ingestion → per-tenant retrieval → tool-using agent runtime → trace export → write-back into the claim graph.
Model layer
Routes between hosted frontier reasoner for hard reasoning and our self-hosted domain fine-tunes for sensitive content. Local biomedical embedding + reranker. Inline grader checks from Crosswalk.
Production traces improve every other module simultaneously.

05 · Foundation

Anvil

The custom-model foundry under everything else: adaptation pipelines, golden sets, and benchmark gates that produce the small models the rest of the stack runs on.

Built for
Internal — and partner research teams that want to fine-tune scientific models without rebuilding an ML platform.
What it produces
Claim linker · dose extractor · AE classifier · synthesis-step checker · mechanism-plausibility scorer · biomedical embeddings · cross-encoder reranker · graders.
Eval gate
Every release passes through versioned Crosswalk suites. No model ships without a published model card and an eval verdict per task.

Adaptation pipeline

Operating profile

Open-weight base: 7B–14B
Adaptation: LoRA · QLoRA · SFT
Serving: vLLM · TensorRT-LLM

Where it goes

Three horizons. One direction.

Horizons describe what we are building, what comes next, and what the platform becomes — not quarterly promises.

Now

Open-source core + first fine-tunes

  • Claim graph schema, ingestion, and contradiction checks.
  • First domain fine-tunes: claim linker, dose extractor, adverse-event classifier.
  • Initial biomedical eval suite: citation fidelity, dose extraction, adverse-event classification.
  • Private research agent runtime with OpenTelemetry-compatible traces.
Next

Lab integration + expanded model zoo

  • ELN and LIMS adapter SDK with vendor-agnostic data contracts.
  • Synthesis-step verifier, mechanism-plausibility scorer, chemistry-aware embeddings.
  • Team review workflows on top of the claim graph reviewer queue.
  • Hosted private deployments with per-tenant isolation.
Later

Evidence operating layer

  • Cross-organization evidence packets with revocable access.
  • Long-horizon agent runs gated by team-defined reviewer policies.
  • Public benchmark marketplace built on the eval foundry.
  • Open API for scientific organizations to route AI work through one auditable layer.

Operating posture

Five lines we will not cross.

These are not slogans. They are the constraints every product, every fine-tune, every eval, and every deployment is checked against.

Custom models, not borrowed benchmarks. Scientific tasks deserve domain models, not chatbot fine-tunes.
Reproducibility over vibes. Every claim ships with run, source, model, and grader manifests.
Traceability over chat UX. Outputs are addressable artifacts, not transcript fragments.
Human review as infrastructure. Reviewer state lives in the data model, not in a side channel.
Private by default. Sensitive corpora stay in-tenant; trace export is opt-in.

Company

Independent systems lab.

SaaS Syndicate Labs is an independent AI systems lab focused on the evidence layer for scientific work — open-source core, custom domain models, and private deployment patterns for teams operating under GxP, GLP, or IRB constraints. Research and product, not consulting.

Entity: DBA of SaaS Syndicate LLC, Wyoming
Posture: Independent systems lab, research-grade traces
Distribution: Open-source core, private deployment patterns
Contact: hello@saassyndicate.llc