MCP tools

Status: NORMATIVE. Defines the tool surface exposed by doiget serve over stdio JSON-RPC. Renaming or removing a tool is a breaking change.

doiget runs as a Model Context Protocol server when invoked as doiget serve. It speaks stdio only (ADR-0001, SCOPE.md §non-goal 6).

1. Tool list

| Tool | Purpose |
| --- | --- |
| doiget_resolve_paper | Resolve DOI / arXiv id to authoritative metadata. |
| doiget_fetch_paper | Resolve and download a single PDF to the store. Accepts dry_run. |
| doiget_metadata_only | Resolve to metadata. Guarantees no PDF / publisher fetch. Accepts dry_run. |
| doiget_batch_fetch | Up to 100 refs in one call. Accepts dry_run. |
| doiget_info | Retrieve a store entry's metadata. |
| doiget_search_local | Search store metadata (title / authors / venue). |
| doiget_list_recent | Last N fetched entries. |
| doiget_paper_pdf_path | Return the local path of a cached PDF. Does not read, parse, or transmit content. |
| doiget_capability_profile | Report which sources this instance is allowed to use. |
| doiget_health | Operational sanity (store writable, version, schema). |

Additional tools:

| Tool | Purpose |
| --- | --- |
| doiget_expand_citation_graph | BFS expansion of citations. Hard-capped. |
| doiget_bibtex_export | BibTeX for one or many entries. |
| doiget_csl_export | CSL JSON for one or many entries. |

2. Naming and convention

3. Tool description format

Each tool's description field follows this six-section format so LLM agents can pick the right tool with minimal mistakes:

WHEN TO USE: <one sentence>
INPUTS: <field-by-field>
OUTPUTS: <shape on success>
COSTS: <network / time / quota>
SIDE EFFECTS: <what writes to disk / log / store>
LIMITS: <hard caps>
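A description in this format can be checked mechanically. The following is a minimal sketch, not part of doiget: `parseToolDescription` is a hypothetical helper, and it assumes each section occupies exactly one line and that the sections appear in the order listed above.

```typescript
// The six headers, in the order the format lists them.
const SECTION_HEADERS = [
  "WHEN TO USE",
  "INPUTS",
  "OUTPUTS",
  "COSTS",
  "SIDE EFFECTS",
  "LIMITS",
] as const;

type SectionHeader = (typeof SECTION_HEADERS)[number];

// Hypothetical helper: splits a description into its six sections.
// Returns null when a header is missing, misspelled, or out of order.
// Assumption: one section per line, separated by "\n".
function parseToolDescription(
  description: string,
): Record<SectionHeader, string> | null {
  const lines = description.split("\n");
  if (lines.length !== SECTION_HEADERS.length) return null;

  const sections = {} as Record<SectionHeader, string>;
  let wellFormed = true;
  SECTION_HEADERS.forEach((header, i) => {
    const line = lines[i] ?? "";
    const prefix = `${header}: `;
    if (line.indexOf(prefix) !== 0) {
      wellFormed = false;
      return;
    }
    sections[header] = line.slice(prefix.length);
  });
  return wellFormed ? sections : null;
}
```

A lint step could run such a check over every entry returned by tools/list and fail when any description yields null.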

4. Example tool spec — doiget_fetch_paper

{
  "name": "doiget_fetch_paper",
  "description": "WHEN TO USE: User wants to download a paper PDF given a DOI or arXiv id.\nINPUTS: ref: DOI ('10.1234/abc') or arXiv id ('2401.12345').\nOUTPUTS: { ok: true, ref, source, path, license, size_bytes } or { ok: false, error: { code, message } }.\nCOSTS: 1-3 s network call. May fail if not Open Access.\nSIDE EFFECTS: Writes PDF to the store. Appends a row to the provenance log.\nLIMITS: Max 5 fetches/sec. Use doiget_batch_fetch for >5 refs.",
  "inputSchema": {
    "type": "object",
    "required": ["ref"],
    "properties": {
      "ref": {
        "type": "string",
        "minLength": 7,
        "maxLength": 256,
        "pattern": "^(10\\.\\d{4,9}/[A-Za-z0-9._/()-]+|arXiv:\\d{4}\\.\\d{4,5}|\\d{4}\\.\\d{4,5})$"
      }
    },
    "additionalProperties": false
  }
}
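A client may want to reject malformed refs before spending a tool call. The sketch below reuses the pattern and length bounds from the inputSchema above; the `isValidRef` helper itself is hypothetical and is not something doiget ships.

```typescript
// Pattern and length bounds copied from the doiget_fetch_paper inputSchema.
const REF_PATTERN = new RegExp(
  "^(10\\.\\d{4,9}/[A-Za-z0-9._/()-]+|arXiv:\\d{4}\\.\\d{4,5}|\\d{4}\\.\\d{4,5})$",
);
const REF_MIN_LENGTH = 7;
const REF_MAX_LENGTH = 256;

// Hypothetical client-side pre-check mirroring the schema constraints.
// The server remains the authority; this only avoids obviously bad calls.
function isValidRef(ref: string): boolean {
  return (
    ref.length >= REF_MIN_LENGTH &&
    ref.length <= REF_MAX_LENGTH &&
    REF_PATTERN.test(ref)
  );
}
```

Under this check, `10.1234/abc`, `2401.12345`, and `arXiv:2401.12345` pass, while a DOI whose registrant prefix has fewer than four digits does not.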

5. Output shape (NORMATIVE)

type FetchResult =
  | { ok: true,
      ref: string,
      source: "crossref" | "unpaywall" | "arxiv"
            | "openalex" | "s2" | "doaj" | "oa-publisher"
            | "tdm-elsevier" | "tdm-aps" | "tdm-springer",
      // ADR-0021 §4 / ADR-0024: the resolver profile under which the
      // canonical-digest for this fetch was minted. Currently equal to
      // `source` verbatim; kept a distinct field so the two can be
      // decoupled if overlapping resolvers are ever added.
      resolver_profile: string,
      path: string,
      license: string,
      size_bytes: number,
      schema_version: string,
    }
  | { ok: true, dry_run: true, ref: RefShape, plan: FetchPlan,
      rate_limit_budget: { global_per_sec: number, per_source_min_gap_ms: number } }
  | { ok: false,
      ref: string,
      error: { code: ErrorCode, message: string, denial_context?: DenialContext }
    };

type DenialContext = {
  reason: "redirect_not_in_allowlist" | "insecure_scheme"
        | "host_in_block_list"
        | "size_cap_exceeded" | "schema_drift" | "capability_not_granted"
        | "rate_limit_window" | "ssrf_private_address" | "content_type_mismatch",
  source?: string,
  attempted?: string,
  // `expected?` is absent when the producer did not populate this field for
  // this reason. An empty array (`"expected": []`) is the distinct
  // "explicit empty allowlist" signal — see ADR-0023 §3 for the
  // None / Some(vec![]) disambiguation.
  expected?: string[],
  hop_index?: number,
  cap?: number,
  actual?: number,
};

ErrorCode is the closed enum in ERRORS.md. DenialContext is the optional structured-recovery payload defined in ADR-0023. FetchPlan is the dry-run preview shape — see §10 below.
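Because two of the three variants carry ok: true, a consumer has to look at dry_run as well as ok to tell them apart. The sketch below shows one way to narrow the union. The types are simplified for illustration: `source` and `ErrorCode` are widened to `string`, `RefShape`, `FetchPlan`, and `DenialContext` are left as `unknown`, and `summarize` is a hypothetical function.

```typescript
// Simplified mirrors of the three FetchResult variants (assumptions noted above).
type FetchSuccess = {
  ok: true;
  ref: string;
  source: string;
  resolver_profile: string;
  path: string;
  license: string;
  size_bytes: number;
  schema_version: string;
};
type FetchDryRun = {
  ok: true;
  dry_run: true;
  ref: unknown;
  plan: unknown;
  rate_limit_budget: { global_per_sec: number; per_source_min_gap_ms: number };
};
type FetchFailure = {
  ok: false;
  ref: string;
  error: { code: string; message: string; denial_context?: unknown };
};
type FetchResult = FetchSuccess | FetchDryRun | FetchFailure;

// Hypothetical consumer: narrows on `ok` first, then on the presence of `dry_run`.
function summarize(result: FetchResult): string {
  if (!result.ok) {
    return `failed with ${result.error.code}: ${result.error.message}`;
  }
  if ("dry_run" in result) {
    return "dry run: nothing was fetched or written";
  }
  return `stored ${result.size_bytes} bytes at ${result.path}`;
}
```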

5.1 denial_context presence: single-paper vs batch (NORMATIVE)

There is an intentional, normative asymmetry in how the optional denial_context field is represented on an ok:false error:

An agent that wants to work across both surfaces should read error.denial_context and treat both missing and null as "none". A serialization failure of a non-null DenialContext (today unreachable — the type is a typed Serialize struct) emits null and a tracing::warn! on stderr so the swallow is observable; it is never silent (see #154 / ADR-0023 §4).
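The "treat both missing and null as none" rule can be captured in one small accessor. This is a sketch with hypothetical helper names (`denialContextOf`, `describeExpected`); the `DenialContext` type is a simplified copy of the one in §5 with `reason` widened to `string`. The second helper illustrates the separate distinction between an absent `expected` and an explicitly empty one.

```typescript
// Simplified copy of DenialContext; `reason` is widened to string here.
type DenialContext = {
  reason: string;
  source?: string;
  attempted?: string;
  expected?: string[];
  hop_index?: number;
  cap?: number;
  actual?: number;
};

// The field may be absent on one surface and null on the other.
type ToolError = {
  code: string;
  message: string;
  denial_context?: DenialContext | null;
};

// Hypothetical accessor: collapses "missing" and "null" into undefined.
function denialContextOf(error: ToolError): DenialContext | undefined {
  return error.denial_context ?? undefined;
}

// Hypothetical helper: keeps "absent" and "empty array" distinct,
// since an empty array is the explicit-empty-allowlist signal.
function describeExpected(ctx: DenialContext): string {
  if (ctx.expected === undefined) return "expected hosts not reported";
  if (ctx.expected.length === 0) return "explicit empty allowlist";
  return `expected one of: ${ctx.expected.join(", ")}`;
}
```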

6. Excluded tools (permanent)

The following are intentionally not offered as MCP tools and will not be added. See SCOPE.md §"Credential / safety non-goals":

7. Capability awareness

Agents can call doiget_capability_profile first to determine which sources the instance is allowed to use. The output is redacted (no API key contents) and is suitable for an agent to use in planning whether a TDM-class fetch will succeed.

type CapabilityProfileResponse = {
  oa_enabled: true,
  metadata_sources: string[],          // e.g. ["openalex"]
  tdm_enabled: boolean,                // disjunction over individual TDM grants
  tdm_elsevier: boolean,
  tdm_aps: boolean,
  tdm_springer: boolean,
  rate_limit_per_sec: number,          // always 5.0
};
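An agent could use this response to decide up front whether a TDM-class fetch is worth attempting. The sketch below is illustrative only: `tdmGranted` and `tdmFlagsConsistent` are hypothetical helpers, and the mapping from the `tdm-*` source names in `FetchResult` to the `tdm_*` flags is an assumption based on the names alone.

```typescript
type CapabilityProfileResponse = {
  oa_enabled: true;
  metadata_sources: string[];
  tdm_enabled: boolean;
  tdm_elsevier: boolean;
  tdm_aps: boolean;
  tdm_springer: boolean;
  rate_limit_per_sec: number;
};

type TdmSource = "tdm-elsevier" | "tdm-aps" | "tdm-springer";

// Hypothetical planning helper. Assumption: each tdm-* source name
// corresponds to the identically named tdm_* flag in the profile.
function tdmGranted(
  profile: CapabilityProfileResponse,
  source: TdmSource,
): boolean {
  switch (source) {
    case "tdm-elsevier":
      return profile.tdm_elsevier;
    case "tdm-aps":
      return profile.tdm_aps;
    case "tdm-springer":
      return profile.tdm_springer;
  }
}

// Sanity check: tdm_enabled is documented as the disjunction of the grants.
function tdmFlagsConsistent(profile: CapabilityProfileResponse): boolean {
  const anyGrant =
    profile.tdm_elsevier || profile.tdm_aps || profile.tdm_springer;
  return profile.tdm_enabled === anyGrant;
}
```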

8. Server lifecycle

9. Smoke test

A CI workflow mcp-smoke.yml spawns the server, sends a minimal sequence (initialize → tools/list → tools/call doiget_health), asserts the responses, and asserts that no stray bytes appeared on stdout outside JSON-RPC frames.
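The stray-bytes assertion could look roughly like the sketch below. It is not the actual contents of mcp-smoke.yml: `findStrayLines` is a hypothetical helper, and it assumes the captured stdout is newline-delimited, with one JSON-RPC 2.0 message per line.

```typescript
// Hypothetical check: returns every non-blank stdout line that is not a
// JSON object carrying "jsonrpc": "2.0". An empty result means stdout was clean.
// Assumption: messages are newline-delimited, one JSON-RPC frame per line.
function findStrayLines(stdout: string): string[] {
  const stray: string[] = [];
  for (const line of stdout.split("\n")) {
    // Blank lines (including the one after a trailing newline) are tolerated.
    if (line.trim() === "") continue;
    try {
      const message: unknown = JSON.parse(line);
      if (
        typeof message === "object" &&
        message !== null &&
        (message as { jsonrpc?: unknown }).jsonrpc === "2.0"
      ) {
        continue; // a well-formed frame
      }
    } catch (_err) {
      // Not JSON at all: fall through and record the line as stray.
    }
    stray.push(line);
  }
  return stray;
}
```

A log line printed to stdout instead of stderr would show up in the returned array, which is the failure the smoke test is meant to catch.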

10. Dry-run preview (NORMATIVE; ADR-0022)

doiget_fetch_paper, doiget_metadata_only, and doiget_batch_fetch accept an optional dry_run: boolean input field, defaulting to false. When true:

{
  "ok": true,
  "dry_run": true,
  "ref": { "doi": "10.1234/foo" },
  "plan": {
    "metadata_sources": ["crossref", "unpaywall"],
    "pdf_sources":      [{
      "key":             "oa-publisher",
      "candidate_hosts": ["*.springer.com", "*.springeropen.com"]
    }],
    "redirect_allowlists_loaded":      ["crossref", "unpaywall", "arxiv", "oa-publisher"],
    "candidate_hosts_are_upper_bound": true,
    "target_pdf_path":                 "/home/.../store/doi_10.1234_foo.pdf",
    "target_metadata_path":            "/home/.../store/doi_10.1234_foo.toml",
    "would_append_provenance":         true
  },
  "rate_limit_budget": {
    "global_per_sec":        5.0,
    "per_source_min_gap_ms": 200
  }
}

Tools where dry_run does not apply (doiget_info, doiget_search_local, doiget_list_recent, doiget_paper_pdf_path, doiget_capability_profile, doiget_health, doiget_resolve_paper) reject the field with an INVALID_REF-class error, i.e. they return {ok:false, error:{code:"INVALID_REF", ...}}.
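Since a misplaced dry_run surfaces as an error, a client may prefer to guard against it before sending. The following is a minimal sketch with a hypothetical `buildArguments` helper; the list of accepting tools is taken from this section.

```typescript
// Tools documented in this section as accepting dry_run.
const DRY_RUN_TOOLS = [
  "doiget_fetch_paper",
  "doiget_metadata_only",
  "doiget_batch_fetch",
];

// Hypothetical client-side guard: adds dry_run only where it is accepted
// and fails fast elsewhere, instead of waiting for an INVALID_REF response.
function buildArguments(
  tool: string,
  args: Record<string, unknown>,
  dryRun: boolean,
): Record<string, unknown> {
  if (!dryRun) return { ...args };
  if (DRY_RUN_TOOLS.indexOf(tool) === -1) {
    throw new Error(`${tool} does not accept dry_run`);
  }
  return { ...args, dry_run: true };
}
```

Calling `buildArguments("doiget_fetch_paper", { ref: "10.1234/foo" }, true)` yields arguments with `dry_run: true`, while the same call for doiget_health throws locally.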

11. doiget_metadata_only (NORMATIVE)

doiget_metadata_only resolves a ref through the configured metadata sources (Crossref + Unpaywall + arXiv-meta) and returns the resulting metadata. It MUST NOT trigger a publisher-side PDF fetch, even when the metadata source returns an OA URL. The OA URL, when known, is surfaced in the response as oa_url (string) for the caller to act on separately.

{
  "name": "doiget_metadata_only",
  "description": "WHEN TO USE: User wants metadata for a DOI / arXiv id without paying for or being noticed by a PDF download.\nINPUTS: ref (DOI or arXiv id), dry_run (optional bool).\nOUTPUTS: { ok: true, ref, source, license?, oa_url:string|null, metadata } or { ok:false, error }.\nCOSTS: 1-2 s metadata round-trip. No publisher fetch.\nSIDE EFFECTS: Appends a provenance row tagged 'metadata-only' (unless dry_run). Writes the metadata TOML to the store.\nLIMITS: Subject to the same rate cap as fetch_paper (5/sec). The OA URL is reported but never followed.",
  "inputSchema": {
    "type": "object",
    "required": ["ref"],
    "properties": {
      "ref": {
        "type": "string",
        "minLength": 7,
        "maxLength": 256,
        "pattern": "^(10\\.\\d{4,9}/[A-Za-z0-9._/()-]+|arXiv:\\d{4}\\.\\d{4,5}|\\d{4}\\.\\d{4,5})$"
      },
      "dry_run": { "type": "boolean", "default": false }
    },
    "additionalProperties": false
  }
}

Output:

type MetadataOnlyResult =
  | { ok: true,
      ref: string,
      source: "crossref" | "unpaywall" | "arxiv",
      // ADR-0021 §4 / ADR-0024: the resolver profile under which the
      // canonical-digest for this metadata-only call was minted.
      // Currently equal to `source` verbatim.
      resolver_profile: string,
      license: string,
      oa_url: string | null,
      metadata: object,
      schema_version: string,
    }
  | { ok: true, dry_run: true, ref: RefShape, plan: FetchPlan,
      rate_limit_budget: { global_per_sec: number, per_source_min_gap_ms: number } }
  | { ok: false, ref: string, error: { code: ErrorCode, message: string, denial_context?: DenialContext } };

Posture: covered by the same posture-lint check as ADR-0022 §5 — a metadata_only codepath that reaches HttpClient::fetch_pdf is a hard failure.


Source: site/content/developer/mcp-tools.md