Redirect allowlist

Status: NORMATIVE. Binding for the doiget HTTP redirect policy. Changes require an ADR.

1. Purpose

Defense-in-depth against open-redirect SSRF and against publisher / metadata-source responses being abused to misroute a fetch to an attacker-controlled host. Even though SECURITY.md §1.3 already restricts redirects to https:// and bounds the redirect chain to ten hops, those mitigations alone do not stop a redirect to an arbitrary attacker-owned HTTPS host. The redirect allowlist closes that gap by constraining each source's redirect targets to a small, source-specific set of hosts that the source legitimately uses.

The allowlist is consulted on every redirect hop, not only the final location, and on the OA URL discovered through metadata sources before the actual PDF fetch is issued (see SECURITY.md §1.4 entry on Crossref re-validation).
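The per-hop check described above can be sketched roughly as follows. This is a minimal illustration, not the doiget implementation: `Hop`, `RedirectError`, `MAX_HOPS`, and `check_chain` are hypothetical names, the host test is passed in as a closure, and the sketch validates an already-recorded chain, whereas a real client would presumably apply the same checks before following each hop.

```rust
/// Why a redirect chain was rejected (hypothetical error type).
#[derive(Debug, PartialEq)]
enum RedirectError {
    TooManyHops,
    NotHttps { hop: usize },
    HostNotAllowed { hop: usize },
}

/// A redirect target reduced to the parts the policy inspects.
struct Hop<'a> {
    scheme: &'a str,
    host: &'a str,
}

/// Hop bound taken from SECURITY.md §1.3.
const MAX_HOPS: usize = 10;

/// Checks every hop of a chain, not only the final location.
/// `host_allowed` stands in for the source's allowlist lookup.
fn check_chain(
    hops: &[Hop<'_>],
    host_allowed: impl Fn(&str) -> bool,
) -> Result<(), RedirectError> {
    if hops.len() > MAX_HOPS {
        return Err(RedirectError::TooManyHops);
    }
    for (i, hop) in hops.iter().enumerate() {
        // Existing mitigation: only https:// targets are followed.
        if hop.scheme != "https" {
            return Err(RedirectError::NotHttps { hop: i });
        }
        // Allowlist mitigation: each intermediate host must be permitted.
        if !host_allowed(hop.host) {
            return Err(RedirectError::HostNotAllowed { hop: i });
        }
    }
    Ok(())
}
```

Checking each hop matters because an allowed final host does not make the chain safe: an intermediate hop on an attacker-owned host has already received the request by the time the final location is known.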

2. Format

The allowlist is a structured table keyed by source (the same source key used in SOURCES.md §1).

2.1 Required fields per source

| Field | Type | Description |
|---|---|---|
| source | string | Source key. MUST match a source value in SOURCES.md §1 (e.g. crossref, unpaywall, arxiv). |
| redirect_hosts | array of strings | Allowed redirect target host patterns. Each entry is either an exact FQDN or a wildcard suffix pattern as defined in §2.2. |

2.2 Host matching rule (NORMATIVE)

Each entry in redirect_hosts matches a candidate redirect target host as follows.

  1. The candidate host is the lowercased hostname of the redirect target URL — i.e. the value of Url::host_str() after parse, lowercased. Port, path, query, and fragment are ignored. Userinfo is rejected unconditionally.
  2. Exact-FQDN form: an entry without a leading *. matches only when the candidate host is byte-identical to the entry, after lowercasing.
  3. Suffix-glob form: an entry of the form *.<suffix> matches when the candidate host either equals <suffix> exactly or ends with .<suffix>. This means *.example.com matches both example.com and cdn.example.com, but does not match notexample.com.
  4. The matching rule is byte-level on the lowercased ASCII form of the host. IDN hosts MUST be Punycoded before comparison; raw Unicode in redirect_hosts is a spec violation and rejected at config-load time.
  5. A redirect is permitted if and only if at least one entry in the source's redirect_hosts matches. No global fallback; an empty or missing redirect_hosts for a source means "no redirects permitted from this source".
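The rules above can be sketched in Rust roughly as follows. This is an illustrative sketch only: `entry_matches` and `redirect_permitted` are hypothetical names, the candidate host is passed in as a plain string rather than taken from a parsed URL, and entries are assumed to be stored in lowercase ASCII already.

```rust
/// Returns true when `candidate_host` matches a single allowlist entry
/// (rules 2 to 4). Entries are assumed to be lowercase ASCII.
fn entry_matches(entry: &str, candidate_host: &str) -> bool {
    // Rules 1 and 4: compare on the lowercased ASCII form of the host.
    let host = candidate_host.to_ascii_lowercase();

    match entry.strip_prefix("*.") {
        // Rule 3: suffix-glob form. Matches the bare suffix, or any host
        // whose remainder before the suffix ends in a dot.
        Some(suffix) => {
            host == suffix
                || host
                    .strip_suffix(suffix)
                    .map_or(false, |rest| rest.ends_with('.'))
        }
        // Rule 2: exact-FQDN form.
        None => host == entry,
    }
}

/// Rule 5: permitted if and only if at least one entry matches.
/// An empty list therefore permits nothing.
fn redirect_permitted(redirect_hosts: &[&str], candidate_host: &str) -> bool {
    // Rule 4: hosts must already be Punycoded, so raw Unicode never matches.
    if !candidate_host.is_ascii() {
        return false;
    }
    redirect_hosts
        .iter()
        .any(|entry| entry_matches(entry, candidate_host))
}
```

Requiring the remainder before the suffix to end in a dot is what keeps `notexample.com` from matching `*.example.com`; a plain `ends_with("example.com")` test would wrongly accept it.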

2.3 Reference encoding

The allowlist data is stored as a single TOML document at crates/doiget-core/src/sources/redirect_allowlist.toml, embedded into the binary via include_str! and parsed once at process start. Schema:

```toml
# Reference TOML form. The file declares one [[source]] entry per integrated source.
[[source]]
source = "crossref"
redirect_hosts = [
  "api.crossref.org",
  "*.crossref.org",
]

[[source]]
source = "unpaywall"
redirect_hosts = [
  "api.unpaywall.org",
]
# ...
```

The exact list of entries is given in §3.
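As a rough illustration of what happens after the TOML is parsed, the sketch below validates patterns at load time and builds a per-source lookup table. It is not the doiget implementation: it takes already-parsed entries instead of parsing TOML, and `SourceEntry`, `validate_pattern`, `build_allowlist`, and `patterns_for` are hypothetical names. Only the raw-Unicode rejection comes from §2.2 rule 4; the wildcard-shape check is an added assumption.

```rust
use std::collections::HashMap;

/// One [[source]] table from the reference TOML (fields follow §2.1).
struct SourceEntry {
    source: String,
    redirect_hosts: Vec<String>,
}

/// Checks a single redirect_hosts pattern at config-load time.
fn validate_pattern(pattern: &str) -> Result<(), String> {
    // §2.2 rule 4: raw Unicode is a spec violation; IDN hosts must be Punycoded.
    if !pattern.is_ascii() {
        return Err(format!("non-ASCII host pattern: {}", pattern));
    }
    // Assumption (not stated in the spec): `*` may appear only as a
    // leading "*." prefix, and the host part must not be empty.
    let host_part = pattern.strip_prefix("*.").unwrap_or(pattern);
    if host_part.is_empty() || host_part.contains('*') {
        return Err(format!("malformed host pattern: {}", pattern));
    }
    Ok(())
}

/// Builds the per-source table, failing on the first invalid pattern.
fn build_allowlist(
    entries: Vec<SourceEntry>,
) -> Result<HashMap<String, Vec<String>>, String> {
    let mut table = HashMap::new();
    for entry in entries {
        for pattern in &entry.redirect_hosts {
            validate_pattern(pattern)?;
        }
        table.insert(entry.source, entry.redirect_hosts);
    }
    Ok(table)
}

/// §2.2 rule 5: a missing source yields an empty pattern list,
/// so no redirect from that source can match.
fn patterns_for<'a>(
    table: &'a HashMap<String, Vec<String>>,
    source: &str,
) -> &'a [String] {
    table.get(source).map(Vec::as_slice).unwrap_or(&[])
}
```

Returning an empty slice for an unknown source keeps the deny-by-default behaviour in one place, so callers do not need a separate "source not configured" branch.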

3. Tier 1 entries

The entries below are the binding redirect-host allowlist for the Tier 1 sources. They were validated by replaying real fetches against representative DOIs / arXiv ids; subsequent changes follow the §5 update process.

The Tier 1 sources are taken from SOURCES.md §1: Crossref, Unpaywall, arXiv. Tier 2 / Tier 3 sources are covered in §4.

3.1 crossref

| Field | Value |
|---|---|
| source | crossref |
| redirect_hosts | api.crossref.org, *.crossref.org |

Notes:

3.2 unpaywall

| Field | Value |
|---|---|
| source | unpaywall |
| redirect_hosts | api.unpaywall.org |

Notes:

3.3 arxiv

| Field | Value |
|---|---|
| source | arxiv |
| redirect_hosts | arxiv.org, export.arxiv.org, *.arxiv.org |

Notes:

3.4 oa-publisher

| Field | Value |
|---|---|
| source | oa-publisher (synthetic; see notes) |
| redirect_hosts | *.springer.com, *.springeropen.com, *.springernature.com, *.nature.com, *.wiley.com, *.elsevier.com, *.sciencedirect.com, *.frontiersin.org, *.mdpi.com, *.plos.org, *.biorxiv.org, *.medrxiv.org, europepmc.org, *.europepmc.org, *.nih.gov, *.ncbi.nlm.nih.gov, *.aps.org, scipost.org, *.scipost.org, *.iop.org, arxiv.org, *.arxiv.org |

Notes:

4. Tier 2 / Tier 3 entries

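| Source | Tier | Phase | Status |
|---|---|---|---|
| openalex | 2 | 4 | (reserved) |
| semantic-scholar | 2 | 4 | (reserved) |
| doaj | 2 | 4 | (reserved) |
| springer-tdm | 3 | 5a | (reserved) |
| aps-tdm | 3 | 5b | (reserved) |
| elsevier-tdm | 3 | 5c | (reserved) |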

Each (reserved) entry is populated via the update process in §5 when that source's redirect targets are validated. A (reserved) source has no redirect_hosts, so per §2.2 rule 5 no redirects are permitted from it; such a fetch is also blocked earlier by the source's Cargo feature gate (SOURCES.md §3) and never reaches the redirect policy.

5. Update process

Changes to this allowlist are user-impacting: a fetch that previously worked may stop working (a redirect target host is removed) or a fetch that previously failed may start working (a host is added). Both directions are subject to the same process:

  1. ADR. Add or update a docs/DECISIONS/NNNN-redirect-allowlist-<source>.md ADR that names the source, lists the host(s) added or removed, and explains why (e.g., "observed in real fetch traces", "publisher migrated CDN").
  2. CHANGELOG. Add an entry under [Unreleased] -> Changed (or Added / Removed as appropriate) in CHANGELOG.md referencing the ADR.
  3. Reference file. Update the TOML reference file described in §2.3.
  4. Tests. Update or add a test in crates/doiget-core/tests/ that asserts the new entry matches / does not match the relevant host strings, including the suffix-glob negative case (notexample.com MUST NOT match *.example.com).

The §3 entries were populated under the initial ADR series. Subsequent changes always require a dedicated ADR.

6. Non-goals


Source: site/content/developer/redirect-allowlist.md