Redirect allowlist
Status: NORMATIVE. Binding for the doiget HTTP redirect policy. Changes require an ADR.
1. Purpose
Defense-in-depth against open-redirect SSRF and against publisher / metadata-source
responses being abused to misroute a fetch to an attacker-controlled host. Even though
SECURITY.md §1.3 already restricts redirects to https:// and bounds
the redirect chain to ten hops, those mitigations alone do not stop a redirect to an
arbitrary attacker-owned HTTPS host. The redirect allowlist closes that gap by
constraining each source's redirect targets to a small, source-specific set of hosts
that the source legitimately uses.
The allowlist is consulted on every redirect hop, not only the final location, and
on the OA URL discovered through metadata sources before the actual PDF fetch is issued
(see SECURITY.md §1.4 entry on Crossref re-validation).
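The per-hop rule can be sketched as below. This is a minimal illustration under stated assumptions, not doiget's implementation: `check_chain`, `ChainError`, and `MAX_HOPS` are hypothetical names, the hosts are assumed to be already extracted from each redirect target and lowercased, and the per-source allowlist lookup is abstracted behind an `is_allowed` closure.

```rust
/// Hop bound quoted in the text above ("ten hops"); the constant name is hypothetical.
const MAX_HOPS: usize = 10;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ChainError {
    TooManyHops,
    HostDenied { hop: usize, host: String },
}

/// Checks every redirect target host in order, not only the last one.
/// `hosts` holds one already-lowercased host per redirect hop.
fn check_chain<F>(hosts: &[&str], is_allowed: F) -> Result<(), ChainError>
where
    F: Fn(&str) -> bool,
{
    if hosts.len() > MAX_HOPS {
        return Err(ChainError::TooManyHops);
    }
    for (hop, &host) in hosts.iter().enumerate() {
        if !is_allowed(host) {
            // The first off-list hop aborts the chain, even if a later hop
            // would have landed back on an allowlisted host.
            return Err(ChainError::HostDenied { hop, host: host.to_string() });
        }
    }
    Ok(())
}
```

A real client would typically apply the same check incrementally as each redirect response arrives rather than over a pre-collected list; the list form is used here only to keep the sketch self-contained.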
2. Format
The allowlist is a structured table keyed by source (the same source key used in
SOURCES.md §1).
2.1 Required fields per source
| Field | Type | Description |
|---|---|---|
| `source` | string | Source key. MUST match a source value in SOURCES.md §1 (e.g. `crossref`, `unpaywall`, `arxiv`). |
| `redirect_hosts` | array of strings | Allowed redirect target host patterns. Each entry is either an exact FQDN or a wildcard suffix pattern as defined in §2.2. |
2.2 Host matching rule (NORMATIVE)
Each entry in redirect_hosts matches a candidate redirect target host as follows.
1. The candidate host is the lowercased hostname of the redirect target URL, i.e. the value of `Url::host_str()` after parse, lowercased. Port, path, query, and fragment are ignored. Userinfo is rejected unconditionally.
2. Exact-FQDN form: an entry without a leading `*.` matches only when the candidate host is byte-identical to the entry, after lowercasing.
3. Suffix-glob form: an entry of the form `*.<suffix>` matches when the candidate host either equals `<suffix>` exactly or ends with `.<suffix>`. This means `*.example.com` matches both `example.com` and `cdn.example.com`, but does not match `notexample.com`.
4. The matching rule is byte-level on the lowercased ASCII form of the host. IDN hosts MUST be Punycoded before comparison; raw Unicode in `redirect_hosts` is a spec violation and rejected at config-load time.
5. A redirect is permitted if and only if at least one entry in the source's `redirect_hosts` matches. No global fallback; an empty or missing `redirect_hosts` for a source means "no redirects permitted from this source".
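The matching rule above can be sketched in a few lines of Rust. This is an illustrative sketch, not the crate's actual matcher: the function names `entry_matches` and `redirect_permitted` are hypothetical, and the candidate host is assumed to have been extracted from the parsed URL already (so port, path, query, fragment, and userinfo handling are out of scope here).

```rust
/// Returns true when `candidate_host` matches one `redirect_hosts` entry.
/// Both sides are ASCII-lowercased before the byte-level comparison.
fn entry_matches(entry: &str, candidate_host: &str) -> bool {
    let entry = entry.to_ascii_lowercase();
    let host = candidate_host.to_ascii_lowercase();
    match entry.strip_prefix("*.") {
        // Suffix-glob form: the bare suffix itself, or any host ending in ".<suffix>".
        // The leading dot is what keeps "notexample.com" from matching "*.example.com".
        Some(suffix) => host == suffix || host.ends_with(&format!(".{suffix}")),
        // Exact-FQDN form: byte-identical after lowercasing.
        None => host == entry,
    }
}

/// A redirect is permitted only if at least one entry matches.
/// An empty list therefore permits nothing (no global fallback).
fn redirect_permitted(redirect_hosts: &[&str], candidate_host: &str) -> bool {
    redirect_hosts
        .iter()
        .any(|entry| entry_matches(entry, candidate_host))
}
```

With these definitions, `entry_matches("*.example.com", "cdn.example.com")` holds while `entry_matches("*.example.com", "notexample.com")` does not, and `redirect_permitted(&[], host)` is false for every host.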
2.3 Reference encoding
The allowlist data is stored as a single TOML document at
crates/doiget-core/src/sources/redirect_allowlist.toml, embedded into the binary via
include_str! and parsed once at process start. Schema:
```toml
# Reference TOML form. The file declares one [[source]] entry per integrated source.
[[source]]
source = "crossref"
redirect_hosts = [
    "api.crossref.org",
    "*.crossref.org",
]

[[source]]
source = "unpaywall"
redirect_hosts = [
    "api.unpaywall.org",
]
# ...
```
The exact list of entries is given in §3.
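A config-load check in the spirit of §2.2 might look like the sketch below. It is an assumption-laden illustration: TOML parsing is omitted (it would need a parser crate), the names `EntryError`, `validate_entry`, and `validate_source` are hypothetical, and only the non-ASCII rejection comes directly from §2.2. The empty-entry and misplaced-wildcard checks are extra assumptions inferred from the two entry forms that §2.1 allows.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum EntryError {
    /// Raw Unicode in an entry; IDN hosts must be Punycoded (§2.2).
    NonAscii,
    /// Assumption: an empty host (or a bare "*.") is never a valid entry.
    Empty,
    /// Assumption: '*' is only valid as the leading "*." of a suffix glob.
    MisplacedWildcard,
}

/// Validates one `redirect_hosts` entry after it has been read from the TOML file.
fn validate_entry(entry: &str) -> Result<(), EntryError> {
    if !entry.is_ascii() {
        return Err(EntryError::NonAscii);
    }
    // Strip the optional glob prefix, then inspect what remains.
    let rest = entry.strip_prefix("*.").unwrap_or(entry);
    if rest.is_empty() {
        return Err(EntryError::Empty);
    }
    if rest.contains('*') {
        return Err(EntryError::MisplacedWildcard);
    }
    Ok(())
}

/// Validates every entry of one source, reporting the first offending entry.
fn validate_source(source: &str, hosts: &[&str]) -> Result<(), (String, String, EntryError)> {
    for &host in hosts {
        validate_entry(host).map_err(|e| (source.to_string(), host.to_string(), e))?;
    }
    Ok(())
}
```

Failing at load time rather than at match time means a malformed entry stops the process at startup instead of silently never matching.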
3. Tier 1 entries
The entries below are the binding redirect-host allowlist for the Tier 1 sources. They were validated by replaying real fetches against representative DOIs / arXiv ids; subsequent changes follow the §5 update process.
The Tier 1 sources are taken from SOURCES.md §1: Crossref, Unpaywall,
arXiv. Tier 2 / Tier 3 sources are covered in §4.
3.1 crossref
| Field | Value |
|---|---|
| `source` | `crossref` |
| `redirect_hosts` | `api.crossref.org`, `*.crossref.org` |

Notes:
- `api.crossref.org` is the documented endpoint host. `*.crossref.org` covers any internal Crossref subdomain redirects (e.g. legacy or CDN-fronted variants).
- Crossref's `link` array can contain publisher-side OA URLs whose host is NOT under `crossref.org`. Those URLs are NOT followed under the `crossref` source's allowlist; they are instead handed to the publisher-side fetch path, which is governed by the allowlist of the source that owns that publisher (see §4).
3.2 unpaywall
| Field | Value |
|---|---|
| `source` | `unpaywall` |
| `redirect_hosts` | `api.unpaywall.org` |

Notes:
- `api.unpaywall.org` is the documented endpoint host.
- Unpaywall's response describes an OA URL hosted on a third-party server (publisher, preprint server, institutional repository). Redirects encountered while fetching that OA URL are NOT subject to the `unpaywall` allowlist; they are subject to the allowlist of the source that owns the publisher host (the synthetic `oa-publisher` source; see §3.4). OA URLs that resolve to non-allowlisted hosts abort the fetch.
3.3 arxiv
| Field | Value |
|---|---|
| `source` | `arxiv` |
| `redirect_hosts` | `arxiv.org`, `export.arxiv.org`, `*.arxiv.org` |

Notes:
- `arxiv.org` and `export.arxiv.org` are the documented endpoint hosts (HTML / API vs. metadata export). `*.arxiv.org` covers redirects to subdomains arXiv may use for PDF delivery.
- arXiv MAY in some configurations redirect to a CDN host outside `arxiv.org`. If the fetcher observes such a redirect, the response is to add the CDN's host suffix here via ADR, NOT to silently widen the allowlist at runtime.
3.4 oa-publisher
| Field | Value |
|---|---|
| `source` | `oa-publisher` (synthetic; see notes) |
| `redirect_hosts` | `*.springer.com`, `*.springeropen.com`, `*.springernature.com`, `*.nature.com`, `*.wiley.com`, `*.elsevier.com`, `*.sciencedirect.com`, `*.frontiersin.org`, `*.mdpi.com`, `*.plos.org`, `*.biorxiv.org`, `*.medrxiv.org`, `europepmc.org`, `*.europepmc.org`, `*.nih.gov`, `*.ncbi.nlm.nih.gov`, `*.aps.org`, `scipost.org`, `*.scipost.org`, `*.iop.org`, `arxiv.org`, `*.arxiv.org` |

Notes:
- Synthetic source key. Unlike the other §3 entries, `oa-publisher` is not one of the integrated metadata sources in SOURCES.md §1. It is the source key the orchestrator uses when fetching the PDF that Unpaywall's `best_oa_location.url_for_pdf` (or `best_oa_location.url`) resolves to, i.e. the URL Unpaywall hands back, which lives on a publisher / preprint / repository host, not on `api.unpaywall.org`. The redirect-policy closure is the same per-source one used everywhere else; the orchestrator registers an `HttpClient` with this allowlist and calls `HttpClient::fetch_pdf("oa-publisher", url)`.
- Documented OA hosts. Each entry below is the documented OA host for the named publisher / repository. Changes follow the §5 update process.
- Empirical verification (ADR-0027). The physics-society / diamond-OA hosts (`*.aps.org`, `scipost.org`, `*.scipost.org`, `*.iop.org`) were added from a real `doiget batch` over 30 OpenAlex-OA finite-temperature-MPS DOIs in which Unpaywall `best_oa_location` resolved to these hosts and the PDF leg was denied. This is the empirical pass the §3 informed-best-effort note calls for; these entries are verified, not `(unverified)`. Open-ended institutional / handle repositories observed in the same run (`hdl.handle.net`, `ruj.uj.edu.pl`) are deliberately excluded as a separate open-surface question.
- Partial-success semantics. When the OA URL host is NOT on this list, the denial is raised as `HttpError::RedirectDenied`, either at the pre-fetch check on the metadata-discovered OA URL per §1 (the host is rejected before the PDF fetch is issued, even when no redirect hop occurs) or, for hosts that pass the pre-check but redirect off-list mid-fetch, at the redirect-closure boundary. Both guards raise the same `HttpError::RedirectDenied` value, so the downstream shape (the structured `DenialContext` with `reason = redirect_not_in_allowlist`) is identical regardless of which guard fires. In either case the orchestrator falls back to metadata-only success, because the metadata is still useful. The PDF outcome is logged as a distinct `Fetch` provenance row with `source = "oa-publisher"` and `result = err` / `error_code = "NETWORK_ERROR"`.
- Host families:
  - Springer Nature OA imprints: `*.springer.com`, `*.springeropen.com`, `*.springernature.com`, `*.nature.com`.
  - Wiley OA: `*.wiley.com`.
  - Elsevier OA route only: `*.elsevier.com`, `*.sciencedirect.com`. The TDM gated path is a separate Tier 3 source (`elsevier-tdm`).
  - Frontiers: `*.frontiersin.org`.
  - MDPI: `*.mdpi.com`.
  - PLOS: `*.plos.org`.
  - Preprint servers: `*.biorxiv.org`, `*.medrxiv.org`.
  - Europe PMC + NIH PMC: `europepmc.org`, `*.europepmc.org`, `*.nih.gov`, `*.ncbi.nlm.nih.gov`.
  - Physics society / diamond OA (empirically verified, ADR-0027): `*.aps.org` (APS: `link.aps.org` / `journals.aps.org`; also trusted under the separate `aps-tdm` Tier 3 key), `scipost.org` + `*.scipost.org` (SciPost, community-run diamond OA), `*.iop.org` (IOP Publishing: `iopscience.iop.org`, New J. Phys. etc.).
  - arXiv: `arxiv.org`, `*.arxiv.org` (mirrors §3.3 because the Unpaywall flow re-derives an arXiv URL through the `oa-publisher` source key, not the Tier 1 `arxiv` key).
- Additions or removals follow the §5 update process.
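The partial-success semantics described in the notes above can be illustrated with a small sketch. Everything here is an assumption made for illustration except the names quoted from the text (`HttpError::RedirectDenied`, `DenialContext`, the `reason`, `source`, `result`, and `error_code` values): the field layout of `DenialContext`, the `Other` variant, the `Outcome` and `FetchRow` types, the `resolve` function, and the `"ok"` result for a successful PDF leg are all hypothetical.

```rust
/// Assumed shape; only the `reason` value is named in the text.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct DenialContext {
    reason: &'static str,
    host: String, // hypothetical field
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum HttpError {
    RedirectDenied(DenialContext),
    Other(String), // hypothetical stand-in for every other failure
}

/// Hypothetical provenance row for the PDF leg.
#[derive(Debug, PartialEq)]
struct FetchRow {
    source: &'static str,
    result: &'static str,
    error_code: Option<&'static str>,
}

/// Hypothetical orchestrator outcome.
#[derive(Debug, PartialEq)]
enum Outcome {
    Full { metadata: String, pdf: Vec<u8> },
    MetadataOnly { metadata: String },
}

/// Combines already-fetched metadata with the result of the PDF leg.
/// A redirect denial degrades to metadata-only success; other errors propagate.
fn resolve(
    metadata: String,
    pdf: Result<Vec<u8>, HttpError>,
) -> Result<(Outcome, FetchRow), HttpError> {
    match pdf {
        Ok(pdf) => Ok((
            Outcome::Full { metadata, pdf },
            // "ok" / no error code on success is an assumption of this sketch.
            FetchRow { source: "oa-publisher", result: "ok", error_code: None },
        )),
        // Same handling whichever guard raised the denial (pre-fetch or mid-redirect).
        Err(HttpError::RedirectDenied(_)) => Ok((
            Outcome::MetadataOnly { metadata },
            FetchRow { source: "oa-publisher", result: "err", error_code: Some("NETWORK_ERROR") },
        )),
        Err(other) => Err(other),
    }
}
```

Because both guards produce the same error value, the fallback needs only one match arm, which is the property the note relies on. How non-denial errors are handled is not specified in this document; propagating them is a choice made only for the sketch.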
4. Tier 2 / Tier 3 entries
| Source | Tier | Phase | Status |
|---|---|---|---|
| `openalex` | 2 | 4 | (reserved) |
| `semantic-scholar` | 2 | 4 | (reserved) |
| `doaj` | 2 | 4 | (reserved) |
| `springer-tdm` | 3 | 5a | (reserved) |
| `aps-tdm` | 3 | 5b | (reserved) |
| `elsevier-tdm` | 3 | 5c | (reserved) |
Each (reserved) entry is populated via the update process in §5 when that
source's redirect targets are validated. A (reserved) source has no
redirect_hosts, so per §2.2 rule 5 no redirects are permitted from it; such
a fetch is also blocked earlier by the source's Cargo feature gate
(SOURCES.md §3) and never reaches the redirect policy.
5. Update process
Changes to this allowlist are user-impacting: a fetch that previously worked may stop working (a redirect target host is removed) or a fetch that previously failed may start working (a host is added). Both directions are subject to the same process:
- ADR. Add or update a `docs/DECISIONS/NNNN-redirect-allowlist-<source>.md` ADR that names the source, lists the host(s) added or removed, and explains why (e.g., "observed in real fetch traces", "publisher migrated CDN").
- CHANGELOG. Add an entry under `[Unreleased] -> Changed` (or `Added` / `Removed` as appropriate) in `CHANGELOG.md` referencing the ADR.
- Reference file. Update the TOML reference file described in §2.3.
- Tests. Update or add a test in `crates/doiget-core/tests/` that asserts the new entry matches / does not match the relevant host strings, including the suffix-glob negative case (`notexample.com` MUST NOT match `*.example.com`).
The §3 entries were populated under the initial ADR series. Subsequent changes always require a dedicated ADR.
6. Non-goals
- This document does NOT govern the initial fetch URL; that is constructed from validated identifiers via source-side URL templates and is bounded by the `https://`-only redirect policy in SECURITY.md §1.3.
- This document does NOT define rate-limiting, retry behavior, or politeness; see SOURCES.md §6.
- This document does NOT govern outbound DNS, proxying, or anonymization; see SECURITY.md §1.10.