arXiv source — arXiv id → PDF + Atom-feed metadata.
Spec: docs/SOURCES.md §4 arXiv. No auth; the API has a 3-second-per-request
rate guideline that doiget’s 5/sec global + 200ms per-source backoff
comfortably respects (no extra source-specific tuning needed).
§Fetch flow (full)
- can_serve returns true only for Ref::Arxiv(_); Ref::Doi(_) is rejected up front.
- fetch acquires a permit from the shared RateLimiter, then best-effort fetches the Atom feed (<base>/api/query?id_list=<id>) and parses it into a JSON metadata object via the private parse_atom_feed helper. Atom failures degrade gracefully (metadata_json = None + tracing::warn!); the existing 1.0 PDF-leg semantics are preserved.
- The PDF URL <base>/pdf/<id>.pdf is fetched via crate::http::HttpClient::fetch_pdf, which enforces the magic-byte (%PDF-) check per docs/SECURITY.md §1.2.
- ONE LogEvent::Fetch row is appended for the PDF leg. The Atom leg does NOT emit its own row: the source-level audit unit is “one fetch attempt = one row”, and the Atom call is a supporting leg of the same attempt.
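The building blocks of this flow can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: the Ref enum is a stand-in with only the two variants named above, and atom_url, pdf_url, has_pdf_magic and metadata_or_none are hypothetical helper names. The sketch assumes base carries no trailing slash, and it uses eprintln! where the real code is documented to use tracing::warn!.

```rust
/// Stand-in for the crate's reference type (assumption: only the two
/// variants mentioned in the docs are modelled here).
#[derive(Debug, Clone, PartialEq)]
pub enum Ref {
    Arxiv(String),
    Doi(String),
}

/// Mirrors the documented rule: only arXiv references are accepted.
pub fn can_serve(r: &Ref) -> bool {
    matches!(r, Ref::Arxiv(_))
}

/// Atom-feed URL shape from the docs: <base>/api/query?id_list=<id>.
/// Hypothetical helper; `base` is assumed to have no trailing slash.
pub fn atom_url(base: &str, id: &str) -> String {
    format!("{}/api/query?id_list={}", base, id)
}

/// PDF URL shape from the docs: <base>/pdf/<id>.pdf.
pub fn pdf_url(base: &str, id: &str) -> String {
    format!("{}/pdf/{}.pdf", base, id)
}

/// Magic-byte check: the body must begin with the five bytes `%PDF-`.
pub fn has_pdf_magic(body: &[u8]) -> bool {
    body.starts_with(b"%PDF-")
}

/// Best-effort metadata leg: a failed Atom fetch degrades to `None`
/// with a warning instead of failing the whole attempt.
pub fn metadata_or_none<E: std::fmt::Display>(atom: Result<String, E>) -> Option<String> {
    match atom {
        Ok(json) => Some(json),
        Err(err) => {
            // Stand-in for tracing::warn!, to keep the sketch std-only.
            eprintln!("warn: arXiv Atom leg failed: {}", err);
            None
        }
    }
}
```

With these helpers, can_serve(&Ref::Doi(..)) is false, an HTML error page fails has_pdf_magic, and an Atom error yields None while the PDF leg proceeds unaffected.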
§Metadata-only path
ArxivSource::fetch_metadata_only performs ONLY the Atom feed fetch
and is the entry point for the metadata_only orchestrator
(crate::orchestrator::metadata_only). It MUST NOT call
crate::http::HttpClient::fetch_pdf — doing so would violate the
doiget_metadata_only contract (docs/MCP_TOOLS.md §11). It emits
one LogEvent::Fetch row under Capability::Metadata so the audit
trail distinguishes metadata-only fetches from full fetches without
breaking the schema (the capability field is the structured channel
for this distinction; spec §3 documents it as one of oa / metadata
/ tdm-*).
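The contract above can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in rather than the crate's real API: FakeHttp counts calls in place of crate::http::HttpClient, FetchRow stands in for a LogEvent::Fetch row, and Capability models only two of the documented values. Whether the real implementation also writes the row when the Atom fetch fails is not stated in the docs; this sketch assumes it does.

```rust
/// Stand-in for the crate's capability tag (assumption: the tdm-*
/// values are omitted for brevity).
#[derive(Debug, Clone, PartialEq)]
pub enum Capability {
    Oa,
    Metadata,
}

/// Stand-in for one LogEvent::Fetch audit row.
#[derive(Debug, Clone, PartialEq)]
pub struct FetchRow {
    pub source: &'static str,
    pub capability: Capability,
}

/// Fake HTTP client that only counts which legs were exercised.
#[derive(Default)]
pub struct FakeHttp {
    pub atom_calls: u32,
    pub pdf_calls: u32,
}

impl FakeHttp {
    pub fn fetch_atom(&mut self, _id: &str) -> Result<String, String> {
        self.atom_calls += 1;
        Ok("<feed/>".to_string())
    }

    pub fn fetch_pdf(&mut self, _id: &str) -> Result<Vec<u8>, String> {
        self.pdf_calls += 1;
        Ok(b"%PDF-1.7".to_vec())
    }
}

/// Metadata-only path: performs the Atom leg, never the PDF leg, and
/// appends exactly one row tagged Capability::Metadata.
pub fn fetch_metadata_only(
    http: &mut FakeHttp,
    log: &mut Vec<FetchRow>,
    id: &str,
) -> Result<String, String> {
    let feed = http.fetch_atom(id);
    // Assumption: the row is written regardless of the Atom outcome.
    log.push(FetchRow {
        source: "arxiv",
        capability: Capability::Metadata,
    });
    feed
}
```

A test against such a fake can assert that pdf_calls stays at zero and that the log holds a single Capability::Metadata row, which is the property the doiget_metadata_only contract relies on.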
Structs§
- ArxivSource: arXiv Source impl. Phase 1 returns the PDF bytes and skips metadata (the export.arxiv.org Atom feed is documented, but XML parsing is deferred to a follow-up PR; TODO Phase 1+).