Companion arXiv preprints (also_arxiv)
Option B from the reference-policy design. When [fetch].also_arxiv = true, every reference whose primary fetch succeeded via a publisher route (:unpaywall / :aps / :elsevier / :springer / :direct) triggers a second download: the arXiv preprint of the same paper, saved alongside as <key>__preprint.pdf.
The use case is comparative: appendices that only survive in a specific preprint revision, diffs between v1 and the journal camera-ready, archival runs that want both versions of record on disk. Plain-arxiv primaries skip the companion step, because the primary is the preprint.
Storage layout
For a DOI whose arXiv preprint is arxiv:1008.3477:
papers/
    10.1016_j.aop.2010.09.012.pdf              ← primary (publisher PDF)
    10.1016_j.aop.2010.09.012__preprint.pdf    ← companion (arXiv preprint)
    .metadata/
        10.1016_j.aop.2010.09.012.toml         ← single metadata TOML

The single metadata TOML gains six preprint_* fields:
| Field | Meaning |
|---|---|
| preprint_pdf | on-disk path of the companion |
| preprint_source | always "arxiv" for now |
| preprint_sha256 | hash of the companion PDF |
| preprint_fetched_at | timestamp of companion download |
| preprint_status | "ok" / "cached" / "failed" / "no-arxiv-id" |
| preprint_error | error message when status = "failed" |
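
As a rough illustration of how a script might consume these fields, the sketch below parses a metadata snippet with Julia's TOML standard library and summarizes the companion state. The field names come from the table above; the sample values, the assumption that the fields sit at the top level of the TOML, and the helper `companion_summary` are hypothetical and not part of BiblioFetch.

```julia
using TOML

# Illustrative metadata content. Field names follow the table above; the
# values and the top-level placement of the keys are assumptions.
sample = """
preprint_pdf = "papers/10.1016_j.aop.2010.09.012__preprint.pdf"
preprint_source = "arxiv"
preprint_status = "ok"
"""

md = TOML.parse(sample)

# Hypothetical helper (not a BiblioFetch function): turn the preprint_*
# fields of a parsed metadata Dict into a one-line summary.
function companion_summary(md::AbstractDict)
    status = get(md, "preprint_status", nothing)
    status === nothing && return "no companion recorded"
    status == "failed" &&
        return string("failed: ", get(md, "preprint_error", "unknown error"))
    status == "no-arxiv-id" && return "no arXiv id found"
    # Remaining documented values are "ok" and "cached".
    return string(status, ": ", get(md, "preprint_pdf", "(path not recorded)"))
end

println(companion_summary(md))
```

For a real store, `TOML.parsefile` on the metadata path would replace the inline sample.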
The job file
[folder]
target = "papers"
[fetch]
also_arxiv = true
[doi]
list = [
    "10.1016/j.aop.2010.09.012",    # Schollwöck → arxiv:1008.3477
    "10.1103/PhysRevB.99.214433",   # → arxiv:1905.07639 (via title search)
    "arxiv:1706.03762",             # primary already arxiv, so no companion
]

load_job doesn't care about also_arxiv at parse time; it's surfaced as a boolean field on FetchJob and threaded into fetch_paper! during the run.
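
using BiblioFetch

# Write a minimal job file to a temporary path, then load it.
mktemp() do path, io
    write(
        io,
        """
        [folder]
        target = "papers"
        [fetch]
        also_arxiv = true
        [doi]
        list = ["10.1016/j.aop.2010.09.012"]
        """,
    )
    close(io)  # flush the job file to disk before parsing
    job = BiblioFetch.load_job(path)
    job.also_arxiv  # value of the do-block, shown below
end

true

When the companion fetch fires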
The companion is attempted iff all of these hold:
1. also_arxiv = true in the job config.
2. The primary fetch succeeded.
3. The primary's source symbol is not :arxiv (arxiv-sourced primaries already are the preprint).
4. An arXiv id is discoverable, preferring in order:
   - md["arxiv_id"] populated by the Crossref relation.has-preprint lookup (most common path),
   - a title + first-author search against the arXiv API,
   - the ref itself if it's already arxiv:… (though that path hits condition 3 first and short-circuits).
If step 4 finds nothing, preprint_status is set to "no-arxiv-id" so bibliofetch info tells you the flag was honored but had no target. The primary's status is unaffected — the companion is strictly best-effort.
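
The conditions above can be read as a short chain of guards. The following is a minimal sketch of that reading, not the package's implementation: `companion_decision` and `first_arxiv_id` are hypothetical names, and the lookups are passed in as plain zero-argument functions so the example runs without network access.

```julia
# Hypothetical sketch of the four conditions; not BiblioFetch API.
function companion_decision(; also_arxiv::Bool, primary_ok::Bool,
                            source::Symbol,
                            arxiv_id::Union{Nothing,AbstractString})
    also_arxiv || return :skip                   # condition 1: flag is off
    primary_ok || return :skip                   # condition 2: primary failed
    source === :arxiv && return :skip            # condition 3: primary is the preprint
    arxiv_id === nothing && return :no_arxiv_id  # condition 4: nothing to fetch
    return :fetch
end

# Hypothetical id resolution: try each lookup in order of preference and
# return the first id found, or nothing if every lookup comes up empty.
function first_arxiv_id(lookups...)
    for lookup in lookups
        id = lookup()
        id === nothing || return id
    end
    return nothing
end

# Stand-ins for the Crossref lookup (no hit) and the title search (hit).
crossref_lookup() = nothing
title_search() = "1905.07639"

id = first_arxiv_id(crossref_lookup, title_search)
println(companion_decision(also_arxiv=true, primary_ok=true, source=:aps, arxiv_id=id))
```

In this sketch the `:no_arxiv_id` outcome corresponds to recording preprint_status = "no-arxiv-id", while `:skip` leaves the metadata alone.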
Cache behavior
On a re-run without --force:
- Primary cached + also_arxiv off: the run is a no-op, as before.
- Primary cached + also_arxiv on + companion missing: the companion is fetched now (retroactively upgrading an existing store).
- Primary cached + companion cached: both file presences are confirmed and hashes backfilled if they weren't recorded.
- With --force instead: both PDFs are re-downloaded.
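
One way to picture these cases is as a small decision function. The sketch below assumes the primary PDF is already cached and uses a hypothetical function name and hypothetical action symbols. What --force does when also_arxiv is off is not stated above, so that branch is a guess and is marked as such.

```julia
# Hypothetical sketch of the re-run cases; not BiblioFetch API.
# Assumes the primary PDF is already present in the store.
function rerun_actions(; also_arxiv::Bool, companion_cached::Bool, force::Bool)
    if force
        # "Both PDFs re-downloaded" is documented for the also_arxiv case;
        # re-fetching only the primary when the flag is off is an assumption.
        return also_arxiv ? [:download_primary, :download_companion] :
                            [:download_primary]
    elseif !also_arxiv
        return Symbol[]                        # no-op
    elseif !companion_cached
        return [:download_companion]           # retroactive upgrade
    else
        return [:verify_files, :backfill_hashes]
    end
end

println(rerun_actions(also_arxiv=true, companion_cached=false, force=false))
```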
Real-run smoke
From a Todai shell, bibliofetch run of the one-DOI job above produced these two PDFs with the expected preprint_status = "ok" in metadata:
$ ls papers/
10.1016_j.aop.2010.09.012.pdf
10.1016_j.aop.2010.09.012__preprint.pdf

Downstream integration
- bibliofetch info <key> renders the preprint_* fields alongside the primary trail.
- bibliofetch doctor recognises __preprint.pdf siblings as non-orphans (they're expected artifacts, not stray files).
- BibTeX export ignores the companion: the export cites the primary DOI / arxiv id, not both artifacts of the same reference.
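
To illustrate the doctor rule, here is a minimal sketch of how a `__preprint.pdf` sibling could be mapped back to its reference key before an orphan check. The names `reference_key` and `orphan_pdfs` are hypothetical, and how bibliofetch doctor actually detects orphans is not described here, so treat this as one possible reading.

```julia
const PREPRINT_SUFFIX = "__preprint"

# Hypothetical helper: map a PDF filename to the reference key it belongs
# to, treating "<key>__preprint.pdf" as a sibling of "<key>.pdf".
function reference_key(filename::AbstractString)
    stem, ext = splitext(filename)
    ext == ".pdf" || return nothing
    endswith(stem, PREPRINT_SUFFIX) || return String(stem)
    return String(chop(stem; tail=length(PREPRINT_SUFFIX)))
end

# Hypothetical orphan check: a PDF is an orphan only if its key is unknown.
function orphan_pdfs(files, known_keys)
    return filter(files) do f
        key = reference_key(f)
        key !== nothing && !(key in known_keys)
    end
end

files = [
    "10.1016_j.aop.2010.09.012.pdf",
    "10.1016_j.aop.2010.09.012__preprint.pdf",
    "stray.pdf",
]
known = Set(["10.1016_j.aop.2010.09.012"])

println(orphan_pdfs(files, known))
```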