GPU and infrastructure dossier

The compute and serving plan behind the evidence layer.

This dossier explains what GPUs are for, what runs on them, how training and inference are split, how retrieval and graph work fits in, what the deployment path looks like, and which NVIDIA-stack components play which role. It is a research and infrastructure plan, not a traction page.

SaaS Syndicate Labs builds and operates a stack of small domain models, frontier-model routing, private corpus retrieval, and an eval foundry. The compute footprint exists to make scientific AI outputs reproducible, auditable, and useful under real lab and operational constraints — not to advertise model size.

At a glance

Train and evaluate small domain models for extraction, classification, and grading on biomedical tasks.
Run frontier reasoners through cost-aware router policies, with inference-time eval gates applied inline.
Build and serve private corpus retrieval per-tenant, with embedding generation and rerankers under tenant isolation.
Operate the eval foundry as repeatable benchmark gates with explicit run manifests.

Workload map

Eight sections covering the full plan.

Each section names a workload class, why it needs the compute it needs, and the specific operations it runs. No marketing logo placement, no acceptance claims.

01 Workload

Why GPUs

Scientific evidence work is compute-heavy along four axes: training domain models that extract structured claims from messy literature; running frontier reasoners over long-context paper bundles; serving private retrieval and rerankers per-tenant with low latency; and executing repeatable eval suites that produce comparable, versioned scores across model generations.

Per-task fine-tunes for claim extraction, dose extraction, adverse-event classification, and synthesis-step checking.
Long-context inference over multi-document evidence bundles with grader-model verification inline.
Embedding generation across mixed public and private corpora at ingestion and incremental update cadence.
Batch grader runs across the eval foundry on every release and regression check.

02 Workload

Training workloads

Most of the training surface is small, focused domain models adapted from open weights. We do not train frontier models from scratch.

LoRA and QLoRA recipes for domain extractors (claim linker, dose extractor, adverse-event classifier, synthesis-step checker).
Supervised fine-tunes with domain-curated golden sets; preference data where available for grader models.
Quantization-aware training where serving constraints demand it.
Reproducible training manifests: base model, adapter spec, dataset version, seed, hardware profile.
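
As an illustration of the manifest fields listed above, here is a minimal sketch of a manifest record with a deterministic fingerprint. The class name `TrainingManifest`, the `fingerprint` method, and every value in the example are hypothetical placeholders, not the actual schema; the only thing taken from the text is the set of five fields.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingManifest:
    """Hypothetical manifest record; the fields mirror the list above."""
    base_model: str
    adapter_spec: dict
    dataset_version: str
    seed: int
    hardware_profile: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so identical
        # inputs always produce the same SHA-256 digest.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# All values below are illustrative placeholders.
manifest = TrainingManifest(
    base_model="example-open-model",
    adapter_spec={"method": "lora", "rank": 16},
    dataset_version="example-golden-set-v1",
    seed=1234,
    hardware_profile="example-single-gpu",
)
print(manifest.fingerprint())
```

Storing the fingerprint next to the trained adapter gives a cheap way to check whether two runs were configured identically, under the assumption that the manifest captures everything that matters for the run.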

03 Workload

Inference workloads

Inference splits between hosted frontier reasoners for the hardest reasoning tasks and self-hosted open models for sensitive content. Routing decisions are recorded with every run.

Self-hosted open models served via vLLM and TensorRT-LLM with quantized weights where quality holds.
Triton inference server for multi-model deployments with per-model batching policies.
Hosted frontier-reasoner calls gated by router policy and cost budgets, with decision logs.
Streaming and tool-use loops with bounded fallbacks to keep long agent runs responsive.
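
The routing behavior described above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual router: the names `Router` and `RouteDecision`, the `sensitive` and `hard` flags, and the dollar-denominated budget are all hypothetical. It shows the three ideas from the text: sensitive content stays self-hosted, hosted calls are gated by a cost budget, and every decision is logged.

```python
from dataclasses import dataclass, field


@dataclass
class RouteDecision:
    run_id: str
    target: str  # "self_hosted" or "hosted_frontier"
    reason: str
    estimated_cost_usd: float


@dataclass
class Router:
    """Hypothetical cost-aware router with an append-only decision log."""
    budget_usd: float
    spent_usd: float = 0.0
    log: list = field(default_factory=list)

    def route(self, run_id, sensitive, hard, estimated_cost_usd):
        if sensitive:
            # Sensitive content never leaves the self-hosted path.
            decision = RouteDecision(run_id, "self_hosted", "sensitive_content", 0.0)
        elif hard and self.spent_usd + estimated_cost_usd <= self.budget_usd:
            self.spent_usd += estimated_cost_usd
            decision = RouteDecision(
                run_id, "hosted_frontier", "hard_within_budget", estimated_cost_usd
            )
        elif hard:
            # Budget exhausted: fall back to the self-hosted model.
            decision = RouteDecision(run_id, "self_hosted", "budget_exhausted", 0.0)
        else:
            decision = RouteDecision(run_id, "self_hosted", "default", 0.0)
        self.log.append(decision)  # one log entry per run
        return decision


router = Router(budget_usd=1.0)
first = router.route("run-1", sensitive=True, hard=True, estimated_cost_usd=0.6)
second = router.route("run-2", sensitive=False, hard=True, estimated_cost_usd=0.6)
third = router.route("run-3", sensitive=False, hard=True, estimated_cost_usd=0.6)
```

In this sketch the sensitivity check comes first on purpose, so no budget state can ever route sensitive content to a hosted model.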

04 Workload

Evaluation workloads

Crosswalk runs batch evaluations across versioned suites whenever a model, grader, or routing policy changes. These runs are the public face of the eval foundry and need to be reproducible.

Batch grader-model inference per suite, per release.
Comparable regression reports across model generations with the same dataset version pinned.
Adversarial-probe runs that test specific failure modes (citation drop, dose unit confusion, contradiction tolerance).
Run manifests stored alongside results for full reproducibility.
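
One way to picture a regression report with a pinned dataset version is the sketch below. The names `SuiteRun` and `regression_report`, the metric names, the scores, and the tolerance are all hypothetical; the sketch assumes higher scores are better and refuses to compare runs whose suite or dataset version differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteRun:
    """Hypothetical record of one eval run."""
    suite: str
    dataset_version: str
    model_version: str
    scores: dict  # metric name -> score, higher is better (assumption)


def regression_report(baseline, candidate, tolerance=0.01):
    # Runs are only comparable when suite and dataset version are pinned.
    if (baseline.suite, baseline.dataset_version) != (
        candidate.suite,
        candidate.dataset_version,
    ):
        raise ValueError("runs must share the same suite and dataset version")
    regressions = {}
    for metric, base_score in baseline.scores.items():
        cand_score = candidate.scores.get(metric)
        # A missing metric or a drop larger than the tolerance is a regression.
        if cand_score is None or base_score - cand_score > tolerance:
            regressions[metric] = (base_score, cand_score)
    return {"passed": not regressions, "regressions": regressions}


# Illustrative values only; these are not measured results.
baseline = SuiteRun("probe-suite", "v1", "model-a",
                    {"citation_recall": 0.90, "dose_unit_accuracy": 0.95})
candidate = SuiteRun("probe-suite", "v1", "model-b",
                     {"citation_recall": 0.91, "dose_unit_accuracy": 0.90})
report = regression_report(baseline, candidate)
```

Here `report["passed"]` is false because `dose_unit_accuracy` dropped by more than the tolerance, even though `citation_recall` improved.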

05 Workload

Retrieval and graph workloads

Retrieval is per-tenant and hybrid by default. Graph writes happen on every claim extraction and contradiction check. Both process enough data to become compute-bound at scale.

Embedding generation across public literature and private corpora with incremental indexing.
Reranker scoring at query time, tuned to biomedical phrasing.
Graph extraction passes for entity and relation linking; contradiction detection across claim subgraphs.
Trace export emitting OpenTelemetry-compatible spans for retrieval and graph operations.
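
The text does not say how lexical and vector results are combined in hybrid retrieval. As one plausible illustration, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique, and applies a tenant filter before fusing. The function names, the `doc_tenants` mapping, and the constant `k=60` are assumptions made for the example, and handling of shared public corpora is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; a higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(tenant_id, lexical_hits, vector_hits, doc_tenants, top_n=3):
    """Hypothetical per-tenant hybrid search over two precomputed rankings."""
    def owned(hits):
        # Drop any document that does not belong to the requesting tenant.
        return [d for d in hits if doc_tenants.get(d) == tenant_id]

    fused = reciprocal_rank_fusion([owned(lexical_hits), owned(vector_hits)])
    return fused[:top_n]


doc_tenants = {"a": "tenant-1", "b": "tenant-1", "c": "tenant-2", "d": "tenant-1"}
results = hybrid_search(
    "tenant-1",
    lexical_hits=["a", "c", "b"],
    vector_hits=["a", "d", "b"],
    doc_tenants=doc_tenants,
)
print(results)  # ['a', 'b', 'd']
```

Filtering before fusion keeps another tenant's documents from influencing the ranking at all, not just from appearing in the output.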

06 Workload

Deployment path

The same systems run as a self-hosted open-source core or as a private deployment under our operational pattern. There is no third managed multi-tenant tier that fundamentally diverges from the open-source core.

Open-source core: claim graph schema, retrieval primitives, eval gate runner, agent runtime.
Private deployment: containerized, per-tenant isolation, audited trace export, configurable model routing.
Designed for teams operating under GxP, GLP, or IRB constraints, with documented data-handling boundaries.
No feature gating that prevents self-hosted teams from reproducing production behavior.

NVIDIA-relevant stack

Roles, not logos.

We rely on the standard NVIDIA software stack for training and serving. Each component plays a specific role rather than appearing as a logo.

CUDA: primary compute backend for all training and inference.

PyTorch: training framework for domain models and grader fine-tunes.

TensorRT-LLM: optimized serving for self-hosted models with quantized inference.

Triton Inference Server: multi-model serving with batching and scheduling for our deployment pattern.

NeMo: selectively used for customization and data tooling where it fits the model adaptation pipeline.

vLLM: open-source serving runtime used in parallel with TensorRT-LLM for throughput trade-offs.

Boundary

What we are not claiming.

This dossier describes the GPU and infrastructure plan for an early-stage research and product company. It is not a partnership announcement and not a traction page.

Explicit boundary

No production customers in deployment today.
No clinical validation, FDA approval, or regulatory clearance.
No partnership, acceptance, or sponsorship claims for any third party named in the stack.
Releases, eval gates, and deployment patterns are under active development; behavior may change before public release.