Module arxiv

arXiv source — arXiv id → PDF + Atom-feed metadata.

Spec: docs/SOURCES.md §4 arXiv. No authentication is required. The API's guideline is one request every 3 seconds, which doiget’s global limit of 5 requests/sec plus a 200 ms per-source backoff comfortably respects, so no source-specific tuning is needed.

Fetch flow (full)

  1. can_serve returns true only for Ref::Arxiv(_); Ref::Doi(_) is rejected up front.
  2. fetch acquires a permit from the shared RateLimiter, then best-effort fetches the Atom feed (<base>/api/query?id_list=<id>) and parses it into a JSON metadata object via the private parse_atom_feed helper. Atom failures degrade gracefully (metadata_json = None + tracing::warn!) — the existing 1.0 PDF-leg semantics are preserved.
  3. The PDF URL <base>/pdf/<id>.pdf is fetched via crate::http::HttpClient::fetch_pdf which enforces the magic-byte (%PDF-) check per docs/SECURITY.md §1.2.
  4. ONE LogEvent::Fetch row is appended for the PDF leg. The Atom leg does NOT emit its own row — the source-level audit unit is “one fetch attempt = one row” and the Atom call is a supporting leg of the same attempt.
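
The steps above can be sketched roughly as follows. This is a minimal illustration and not the crate's actual code: the `Ref` enum is a stand-in with only the two variants named here, `FetchOutcome` and `fetch_sketch` are hypothetical names, and the two network legs are passed in as closures so the example stays self-contained. The rate-limiter permit (step 2) and the single `LogEvent::Fetch` row (step 4) are deliberately left out.

```rust
/// Hypothetical stand-in for the crate's reference type; only the two
/// variants mentioned in the docs are modelled.
#[derive(Debug, Clone, PartialEq)]
pub enum Ref {
    Arxiv(String),
    Doi(String),
}

/// Step 1: only arXiv ids are served; DOIs are rejected up front.
pub fn can_serve(r: &Ref) -> bool {
    matches!(r, Ref::Arxiv(_))
}

/// Step 2: Atom feed URL, `<base>/api/query?id_list=<id>`.
pub fn atom_url(base: &str, id: &str) -> String {
    format!("{}/api/query?id_list={}", base.trim_end_matches('/'), id)
}

/// Step 3: PDF URL, `<base>/pdf/<id>.pdf`.
pub fn pdf_url(base: &str, id: &str) -> String {
    format!("{}/pdf/{}.pdf", base.trim_end_matches('/'), id)
}

/// Step 3: magic-byte check, the body must start with `%PDF-`.
pub fn has_pdf_magic(body: &[u8]) -> bool {
    body.starts_with(b"%PDF-")
}

/// Hypothetical result type for this sketch.
#[derive(Debug)]
pub struct FetchOutcome {
    pub pdf: Vec<u8>,
    pub metadata_json: Option<String>,
}

/// Sketch of the full flow. `atom` and `pdf` are injected closures that
/// stand in for the real HTTP calls (an assumption made for illustration).
pub fn fetch_sketch<A, P>(r: &Ref, base: &str, atom: A, pdf: P) -> Result<FetchOutcome, String>
where
    A: Fn(&str) -> Result<String, String>,
    P: Fn(&str) -> Result<Vec<u8>, String>,
{
    let id = match r {
        Ref::Arxiv(id) => id,
        Ref::Doi(_) => return Err("unsupported ref".to_string()),
    };
    // Atom leg is best-effort: a failure becomes `None`, not an error.
    let metadata_json = atom(&atom_url(base, id)).ok();
    // PDF leg is mandatory and must pass the magic-byte check.
    let body = pdf(&pdf_url(base, id))?;
    if !has_pdf_magic(&body) {
        return Err("response is not a PDF".to_string());
    }
    Ok(FetchOutcome { pdf: body, metadata_json })
}
```

Injecting the legs as closures makes the degradation rule easy to exercise: an `atom` closure that returns `Err` still yields a successful outcome with `metadata_json` set to `None`, while a `pdf` closure that returns an HTML body fails the whole attempt.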

Metadata-only path

ArxivSource::fetch_metadata_only performs ONLY the Atom feed fetch and is the entry point for the metadata_only orchestrator (crate::orchestrator::metadata_only). It MUST NOT call crate::http::HttpClient::fetch_pdf; doing so would violate the doiget_metadata_only contract (docs/MCP_TOOLS.md §11). It emits one LogEvent::Fetch row under Capability::Metadata so that the audit trail distinguishes metadata-only fetches from full fetches without breaking the schema. The capability field is the structured channel for this distinction; spec §3 documents its values as one of oa / metadata / tdm-*.
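
A rough sketch of this contract, under stated assumptions: the `Capability` enum below only mirrors the three value families named above, and `FetchRow`, `label`, and `metadata_only_sketch` are hypothetical names rather than the crate's API. The Atom call is an injected closure, and the sketch assumes the row is written whether or not the Atom call succeeds, which the text above does not specify.

```rust
/// Hypothetical mirror of the capability values named in the spec
/// (oa / metadata / tdm-*). The TDM suffix is kept as free text.
#[derive(Debug, Clone, PartialEq)]
pub enum Capability {
    Oa,
    Metadata,
    Tdm(String),
}

impl Capability {
    /// Render the structured label that would go in the audit row.
    pub fn label(&self) -> String {
        match self {
            Capability::Oa => "oa".to_string(),
            Capability::Metadata => "metadata".to_string(),
            Capability::Tdm(kind) => format!("tdm-{}", kind),
        }
    }
}

/// Hypothetical, simplified stand-in for a `LogEvent::Fetch` row.
#[derive(Debug, Clone, PartialEq)]
pub struct FetchRow {
    pub source: &'static str,
    pub capability: Capability,
}

/// Metadata-only sketch. The signature accepts no PDF fetcher at all, so
/// this path cannot reach the PDF leg by construction.
pub fn metadata_only_sketch<A>(
    id: &str,
    base: &str,
    atom: A,
    log: &mut Vec<FetchRow>,
) -> Result<String, String>
where
    A: Fn(&str) -> Result<String, String>,
{
    let url = format!("{}/api/query?id_list={}", base.trim_end_matches('/'), id);
    let result = atom(&url);
    // Exactly one row per attempt, tagged as metadata rather than oa.
    // Assumption: the row is appended on failure as well as on success.
    log.push(FetchRow {
        source: "arxiv",
        capability: Capability::Metadata,
    });
    result
}
```

Leaving the PDF fetcher out of the signature is one way to make the MUST NOT rule structural instead of relying on convention; the real implementation may enforce it differently.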

Structs

ArxivSource
arXiv Source impl. Phase 1 returns the PDF bytes and skips metadata (the export.arxiv.org Atom feed is documented but XML parsing is deferred to a follow-up PR — TODO Phase 1+).