API Reference

Environment detection

BiblioFetch.detect_environment — Function
detect_environment(; probe = true) -> Runtime

Detect hostname, applicable config profile, effective proxy (env > profile), optionally probe reachability, and classify the operating mode.

source
BiblioFetch.effective_runtime — Function
effective_runtime(; probe = true) -> Runtime

Alias for detect_environment; kept as the public "what should I use right now?" accessor.

source
BiblioFetch.load_config — Function
load_config(; path = ENV["BIBLIOFETCH_CONFIG"] or default)
    -> (config::Dict, path_or_nothing)

Read and parse the global BiblioFetch config TOML. Returns (Dict(), nothing) when no file is present at path. The default location is ~/.config/bibliofetch/config.toml; $BIBLIOFETCH_CONFIG overrides it.
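
A minimal usage sketch (the log messages are illustrative):

using BiblioFetch

config, path = load_config()
if path === nothing
    @info "No config file found; using built-in defaults"
else
    @info "Loaded config from $path"
end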

source

References — parse / classify

BiblioFetch.normalize_key — Function
normalize_key(s) -> String

Normalize a user-provided reference to a canonical key:

  • DOI → lowercase DOI (10.1103/physrevb.xx.yyyy)
  • arXiv → arxiv:<id>

Throws ArgumentError if unrecognized.
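
Illustrative calls (the bare arXiv id form is assumed to be accepted, mirroring is_arxiv):

using BiblioFetch

normalize_key("10.1103/PhysRevB.47.558")   # → "10.1103/physrevb.47.558"
normalize_key("1706.03762")                # → "arxiv:1706.03762"
normalize_key("not-a-reference")           # throws ArgumentError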

source
BiblioFetch.is_doi — Function
is_doi(s) -> Bool

Whether s looks like a DOI (10.xxxx/anything). Strips surrounding whitespace but does not otherwise transform the input.

source
BiblioFetch.is_arxiv — Function
is_arxiv(s) -> Bool

Whether s looks like an arXiv id — both the new-style (1706.03762, optionally with a version suffix v2 and an arxiv: prefix) and the legacy slash form (cond-mat/0608208).

source
BiblioFetch.is_arxiv_versions — Function
is_arxiv_versions(s) -> Bool

Whether s is the multi-version pseudo-ref form arxiv:<id>@all or arxiv:<id>@v1,v3 / arxiv:<id>@1,3. These refs can't be fetched as-is — the run loop expands them into one FetchEntry per version before dispatching to fetch_paper!.

source
BiblioFetch.parse_arxiv_version_spec — Function
parse_arxiv_version_spec(s) -> (base_key, spec)

Parse an arxiv:<id>@… pseudo-ref into its components.

  • base_key — the canonical arxiv:<id> key (lower-cased, no version suffix, with arxiv: prefix).
  • spec — either :all (every known version) or a sorted Vector{Int} of explicit version numbers.

Throws ArgumentError when s is not a well-formed pseudo-ref.
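
For example, using the pseudo-ref forms documented above:

using BiblioFetch

parse_arxiv_version_spec("arxiv:1706.03762@all")     # → ("arxiv:1706.03762", :all)
parse_arxiv_version_spec("arxiv:1706.03762@v1,v3")   # → ("arxiv:1706.03762", [1, 3])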

source

arXiv version discovery

BiblioFetch.arxiv_latest_version — Function
arxiv_latest_version(id; proxy, timeout, base_url = ARXIV_API_URL)
    -> Int or nothing

Return the number of the latest published version of arXiv paper id. arXiv's API answers an id_list=<id> query with the entry's canonical URL in its <id> element, which always carries the current vN suffix — the integer after v is the latest-version number. Missing or unparseable responses return nothing. Strips an arxiv: prefix if passed.

source
BiblioFetch.arxiv_list_versions — Function
arxiv_list_versions(id; kwargs...) -> Vector{Int}

Return every version number an arXiv paper has, in ascending order. arXiv numbers versions sequentially from 1, so this is effectively 1:arxiv_latest_version(id), costing a single API trip. Returns Int[] on lookup failure.

kwargs are forwarded to arxiv_latest_version.
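
A quick sketch, assuming proxy and timeout carry usable defaults:

using BiblioFetch

latest = arxiv_latest_version("1706.03762")    # Int, or nothing when the lookup fails
versions = arxiv_list_versions("1706.03762")   # e.g. [1, 2, ..., latest], or Int[] on failure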

source

Store

BiblioFetch.Store — Type
Store(root)

Handle on a BiblioFetch store directory. Holds the root path; all PDF and metadata paths are derived from it. Construct with open_store — the raw constructor does not create the backing directory layout.

source
BiblioFetch.open_store — Function
open_store(root) -> Store

Create (if needed) the store directory layout under root:

<root>/
  <group>/<safekey>.pdf         # grouped PDFs live inside their group subdir
  <safekey>.pdf                 # (or at the root for ungrouped entries)
  .metadata/<safekey>.toml      # one TOML per paper (editable, hidden)
source
BiblioFetch.list_entries — Function
list_entries(store) -> Vector{String}

Return the filesystem-safe keys of every paper currently tracked in the store, sorted alphabetically. These are the stems of files under <root>/.metadata/, not the canonical DOI/arXiv keys (use entry_info to get the key).

source
BiblioFetch.entry_info — Function
entry_info(store, key) -> NamedTuple | Nothing

Summary record for one entry — key, title, status, source, group, pdf_path, year. Returns nothing when the key has no metadata on disk.
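
A minimal store walk (the store root is illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
for safekey in list_entries(store)
    info = entry_info(store, safekey)
    info === nothing && continue
    println(info.key, "  [", info.status, "]  ", info.title)
end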

source

Project skeleton

BiblioFetch.generate — Function
generate(path; force = false) -> String

Create a BiblioFetch project skeleton under path. Copies every file in the package's template/ directory (job.toml + README.md at present) into path, creating intermediate directories as needed.

  • path — absolute or ~-prefixed; expanded before use. If it's relative, it's resolved against pwd().
  • force — when false (default) and path already exists and is non-empty, generate refuses with an ArgumentError. true overwrites any clashing file unconditionally.

Returns the absolute path of the created project, ready to pass to bibliofetch run <path>/job.toml (with relative-target resolution — see load_job).

source

Fetch

BiblioFetch.fetch_paper! — Function
fetch_paper!(store, key; rt, group = "", force = false,
             sources = DEFAULT_SOURCES, source_policy = :lenient,
             also_arxiv = false, verbose = true) -> FetchResult

Resolve key (DOI or arxiv:…) and try the configured sources in order:

  1. :unpaywall → OA PDF (requires rt.email)
  2. :arxiv → arXiv preprint (always OA)
  3. :direct → doi.org/<doi> through proxy (only when proxy is reachable)

source_policy controls which sources are allowed to produce candidates:

  • :lenient (default) — every source listed in sources is eligible.
  • :strict — only PUBLISHER_SOURCES produce candidates; preprint routes (:arxiv, :s2) are silently dropped, and :unpaywall is only kept when its best_oa_location has host_type = "publisher".

also_arxiv (default false) — after a successful primary fetch whose source is not already :arxiv, BiblioFetch does a companion download of the arXiv preprint (if an arXiv id was discovered from Crossref relation.has-preprint or the title-search fallback) into preprint_pdf_path(store, key; group). Records preprint_* fields in the entry's metadata TOML. Silently no-ops when no arXiv id exists.

The PDF is stored at pdf_path(store, key; group) — i.e. in store.root/<group>/. Per-attempt diagnostics are recorded in the returned FetchResult.attempts.
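
A usage sketch; the DOI and group here are illustrative:

using BiblioFetch

rt = detect_environment()
store = open_store(expanduser("~/papers"))
res = fetch_paper!(store, "10.1103/PhysRevB.47.558"; rt, group = "dft")
@info "fetch finished" res.attempts   # per-source AttemptLog diagnostics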

source
BiblioFetch.sync! — Function
sync!(store; rt = detect_environment(), force = false, verbose = true)
    -> Vector{FetchResult}

Walk the store's metadata directory and (re)fetch entries, preserving each entry's stored group.

  • default (force = false): skip entries that already have status = "ok" and a PDF on disk. Everything else — pending, failed, or status-ok with a missing PDF — is fetched. Useful for resuming a partial run.
  • force = true: every tracked entry is re-downloaded, even ones already on disk. force = true is propagated to fetch_paper!, so its cached fast-path is bypassed and the PDF is overwritten.
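
Typical calls (store root illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
sync!(store)                 # resume: fetch pending / failed / status-ok-but-missing-PDF entries
sync!(store; force = true)   # re-download every tracked entry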
source
BiblioFetch.AttemptLog — Type
AttemptLog

One source attempt during a fetch — useful for diagnosing why a key failed. retry_count is the number of retries burned inside this attempt (driven by retry_statuses / exceptions in _http_get_with_retry); retried_statuses is the list of HTTP statuses that triggered each retry. A 0 in retried_statuses stands for a pre-server / exception retry (no response arrived; the request never reached HTTP). When a source completes on the first try, retry_count == 0 and retried_statuses is empty.

source

Jobs

BiblioFetch.load_job — Function
load_job(path; runtime = detect_environment()) -> FetchJob

Parse a bibliofetch.toml file. Fills in missing fetch.email from runtime, flattens [doi] groups into FetchEntrys, deduplicates keys (lenient by default), and returns the job without performing any network I/O.

source
BiblioFetch.run — Function
BiblioFetch.run(path_or_job; verbose = true) -> FetchJobResult

Execute a job. path_or_job may be a path to a bibliofetch.toml or an already-loaded FetchJob. Writes PDFs into job.target/<group>/, metadata into job.target/.metadata/, and a run log into job.log_file.
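
For example (the job path is illustrative; the call is qualified because Base also exports run):

using BiblioFetch

job = load_job("project/bibliofetch.toml")
result = BiblioFetch.run(job)   # or BiblioFetch.run("project/bibliofetch.toml")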

source
BiblioFetch.FetchEntry — Type
FetchEntry

One reference pulled from a job file: normalized key, assigned group, and (after running) its fetch status and per-source attempt log.

source
BiblioFetch.FetchJob — Type
FetchJob

Parsed bibliofetch.toml — the list of references to pull, where to put them, and which sources / concurrency / overwrite policy to use.

source

BibTeX

BiblioFetch.bibtex_entry — Function
bibtex_entry(md; key = _bibtex_key(md)) -> String

Render one metadata dict as a BibTeX entry string (including trailing newline). Uses @article when md["journal"] is non-empty, @misc otherwise (arXiv preprints, tech reports). Fields are always written in the same order.

source
BiblioFetch.write_bibtex — Function
write_bibtex(store, path; key_filter = nothing) -> Int

Iterate every status = "ok" entry in the store's .metadata/, assign a FirstAuthorSurnameYear citekey (with letter-suffix disambiguation for collisions), and write the combined BibTeX to path. Returns the number of entries written. When key_filter is a Set{String} of normalized keys, only those entries are written.
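
For example (paths and the filter key are illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
write_bibtex(store, "refs.bib")                                               # all ok entries
write_bibtex(store, "dft.bib"; key_filter = Set(["10.1103/physrevb.47.558"]))  # subset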

source
BiblioFetch.parse_bibtex — Function
parse_bibtex(text) -> Vector{BibEntry}

Walk a BibTeX source string and collect every @TYPE{key, fields…} entry. Top-level brace balancing is manual (so nested {…} inside field values don't confuse the scanner); individual field extraction uses a regex that tolerates single-level braces, which covers every real doi / eprint / url value.

Entries that fail to parse (malformed headers, unbalanced braces, etc.) are skipped silently — a single broken entry shouldn't abort the whole import.

source
BiblioFetch.bibentry_to_ref — Function
bibentry_to_ref(entry) -> String | Nothing

Derive the identifier BiblioFetch should queue for a bib entry. Checks, in order:

  1. doi → return the DOI as-is (will be normalized downstream)
  2. eprint → if archivePrefix is arxiv or absent, return arxiv:<eprint>
  3. url → if it's a doi.org/… or arxiv.org/abs/… URL, return the extracted identifier

Returns nothing when nothing usable is found (e.g. an entry with only a title and unstructured publisher).

source
BiblioFetch.import_bib! — Function
import_bib!(store, path) -> (added, skipped)

Parse path as a BibTeX file and queue every entry that yields a recognizable DOI or arXiv id into store. Returns:

  • added::Vector{NamedTuple{(:citekey, :ref, :key)}} — entries successfully queued. citekey is the BibTeX citekey, ref is what we extracted, key is the normalized store key.
  • skipped::Vector{NamedTuple{(:citekey, :reason)}} — entries rejected either because no usable identifier was found or because normalization of the extracted string failed.

Duplicate refs already in the store are treated as success (queued = idempotent).
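
A sketch of the import-then-fetch flow (the .bib path is illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
added, skipped = import_bib!(store, "legacy.bib")
for s in skipped
    @warn "skipped" s.citekey s.reason
end
sync!(store)   # download what was queued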

source
BiblioFetch.BibEntry — Type
BibEntry

One entry scanned out of a .bib file — type, citekey, and a flattened field map. Field keys are lowercased; field values are stripped of their {…} / "…" wrapper but not of nested LaTeX braces (which almost never appear in the identifier fields we care about).

source

Citation graph visualization

BiblioFetch.to_dot — Function
to_dot(store; queued_only = false, include_isolated = false) -> String

Render the store's citation graph as a Graphviz DOT source string. Pipe through dot -Tpng > graph.png (or -Tsvg) to view.

  • queued_only = true — show only the expansion tree (edges from referenced_by), not the full citation fabric.
  • include_isolated = true — keep entries that aren't part of any edge (default: hide them so the graph stays readable).

Node labels are the same citekeys bibliofetch bib emits; node colour/style reflects status (ok / pending / failed).
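
For example (output path illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
write("graph.dot", to_dot(store; queued_only = true))
# render outside Julia:  dot -Tpng graph.dot > graph.png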

source
BiblioFetch.to_mermaid — Function
to_mermaid(store; queued_only = false, include_isolated = false) -> String

Render the store's citation graph as Mermaid (graph LR) source, ready to paste into a Markdown fence on GitHub / Obsidian / Docusaurus. Same edge-policy and filter flags as to_dot.

source

Deduplication

BiblioFetch.find_duplicates — Function
find_duplicates(store) -> Vector{Pair{String,Vector{String}}}

Scan store's metadata directory and return a list of sha256 => keys pairs for every hash held by more than one entry. Each keys vector is sorted lexicographically, so the canonical (kept) key in dedup operations is deterministic.

The SHA-256 comes from the sha256 field written by fetch_paper!. Entries whose metadata lacks that field (very old stores, failed fetches, entries resolved into duplicate_of) are skipped.

source
BiblioFetch.resolve_duplicates! — Function
resolve_duplicates!(store; apply = false) -> NamedTuple

Walk the duplicate groups reported by find_duplicates. For each group keep the lexicographically first key as canonical; the rest are recorded with duplicate_of = "<canonical>" and their pdf_path is redirected to the canonical entry's file. On-disk duplicate PDFs are removed when apply = true; otherwise the function just reports what would happen.

Returns (; groups, bytes_freed, canonicals) — groups is the output of find_duplicates, bytes_freed is the size that would be (or was) recovered, and canonicals is a duplicate_key => canonical_key map.
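
A dry-run-then-apply sketch (store root illustrative; field names as documented above):

using BiblioFetch

store = open_store(expanduser("~/papers"))
report = resolve_duplicates!(store)            # dry run: nothing is deleted
@info "dedup preview" report.bytes_freed length(report.groups)
resolve_duplicates!(store; apply = true)       # actually remove duplicate PDFs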

source

Doctor (store integrity)

BiblioFetch.doctor — Function
doctor(store) -> Vector{StoreIssue}

Inventory the store for operational problems:

  • cross-reference metadata pdf_path vs on-disk files (missing / orphan)
  • flag .part leftover files from interrupted downloads
  • flag 0-byte PDFs
  • when a metadata entry records a sha256, verify the on-disk file still hashes to the same value (:sha_mismatch)

One pass, no network. Returns a flat list sorted first by kind, then by key / path.

source
BiblioFetch.fix! — Function
fix!(store, issues; kinds = (:incomplete_part,)) -> Int

Apply safe auto-fixes to a subset of issues. Returns the number of issues acted on. Safe defaults:

  • :incomplete_part — remove the .part file unconditionally
  • :pdf_missing — clear pdf_path from the metadata entry; don't touch the metadata's other fields, so a subsequent bibliofetch sync --force can re-fetch

Other kinds (:orphan_pdf, :sha_mismatch, :empty_pdf) are opt-in — pass their symbol in kinds to include them. Orphan removal in particular is destructive and should be reviewed first.
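
For example (store root illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
issues = doctor(store)
fix!(store, issues)                                            # default safe kinds only
fix!(store, issues; kinds = (:incomplete_part, :pdf_missing))  # opt in explicitly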

source
BiblioFetch.StoreIssue — Type
StoreIssue

One integrity problem doctor found. kind is one of:

  • :pdf_missing — metadata lists a pdf_path whose file is gone
  • :orphan_pdf — a PDF on disk isn't referenced by any metadata entry
  • :incomplete_part — a .part leftover from an interrupted download
  • :sha_mismatch — metadata has a sha256 that no longer matches the file on disk (PDF was replaced / corrupted)
  • :empty_pdf — pdf_path exists but the file is 0 bytes

key identifies the metadata entry the issue belongs to, if any; orphan disk files have key == "".

source
BiblioFetch.search_entries — Function
search_entries(store, query; fields, group, status, case_sensitive)
    -> Vector{SearchMatch}

Substring-search the store's metadata. By default matches in any of title / authors / abstract / journal / key; override with fields.

  • query — the text to search for (empty ⇒ every entry, useful with filters).
  • fields — tuple / vector of Symbol field names to match against.
  • group — optional group-prefix filter (empty ⇒ all groups).
  • status — optional exact-match filter ("ok" / "failed" / "pending").
  • case_sensitive — default false.

Results are sorted by number of matched fields (desc), key (asc). A paper is returned at most once even if multiple fields hit.
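
A sketch, assuming SearchMatch exposes key and title fields as described below:

using BiblioFetch

store = open_store(expanduser("~/papers"))
hits = search_entries(store, "superconductivity"; fields = (:title, :abstract))
for h in hits
    println(h.key, "  ", h.title)
end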

source
BiblioFetch.SearchMatch — Type
SearchMatch

One row in the result of search_entries — the hit's normalized key, its status/title/year/group for display, which fields contained the query, and a ±40-char snippet around the first match for context.

source

Statistics

BiblioFetch.stats — Function
stats(store) -> StoreStats

Walk the store's .metadata/ directory once and aggregate:

  • per-status / per-source / per-group counts
  • PDF file count and total byte size (counts only files that exist)
  • pdf_missing — entries whose metadata lists a pdf_path that's gone
  • duplicate_resolved — entries linked to a canonical by resolve_duplicates!
  • graph_expanded — entries queued by a citation hop (depth > 0)
  • oldest_fetch / newest_fetch — earliest and latest fetched_at timestamps, nothing when the store has no successful fetches yet

One pass, no network. Safe to call on huge stores; per-entry cost is dominated by TOML.parsefile on the metadata file.
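
A sketch, assuming StoreStats exposes the bullet names above as fields:

using BiblioFetch

store = open_store(expanduser("~/papers"))
s = stats(store)
@info "store overview" s.pdf_missing s.oldest_fetch s.newest_fetch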

source
BiblioFetch.StoreStats — Type
StoreStats

Aggregate counts and sizes for a store, one walk of .metadata/ away. Used by bibliofetch stats for a daily-review dashboard and by any caller who wants to know "what's actually in here?" without enumerating entries by hand.

source

External metadata sources

BiblioFetch.datacite_lookup — Function
datacite_lookup(doi; proxy = nothing, timeout = 15, base_url = DATACITE_URL,
                max_retries, base_delay) -> Dict

Fetch DataCite metadata for a DOI and return it in Crossref's metadata shape (so it slots straight into the existing fetch_paper! extraction). Returns an empty Dict on any failure.

Used as a fallback after Crossref returns nothing — covers dataset DOIs registered through Zenodo, Figshare, institutional DataCite clients, etc.

source
BiblioFetch.s2_lookup — Function
s2_lookup(ref; api_key = ENV["SEMANTIC_SCHOLAR_API_KEY"], proxy = nothing,
          timeout = 15, base_url = S2_URL, max_retries, base_delay)
    -> Dict

Look up a paper on Semantic Scholar. ref is a normalized key (10.xxxx/yyy or arxiv:…).

Returns a Dict{String,Any} with the fields BiblioFetch cares about:

  • "title"String
  • "authors"Vector{String} (display names, one per author)
  • "year"Int or nothing
  • "abstract"String (empty when S2 didn't have one)
  • "journal"String (empty when S2 didn't record one)
  • "oa_pdf_url"String pointing at the publisher / repository PDF, present only when openAccessPdf.url is non-empty
  • "s2_paper_id" — S2's own stable id, useful for follow-ups

Empty Dict on any failure (unreachable, 404, malformed JSON, etc.).

source

Publisher TDM (authenticated)

BiblioFetch.aps_tdm_url — Function
aps_tdm_url(doi; base_url = APS_TDM_URL) -> String

Build the harvest.aps.org URL that returns a PDF for the given APS DOI. Does no validation beyond the 10.1103 prefix check in is_aps_doi.

source
BiblioFetch.is_aps_doi — Function
is_aps_doi(doi) -> Bool

Whether doi is published by the American Physical Society — all APS DOIs live under the 10.1103/ prefix. Checked before dispatching to harvest.aps.org to avoid spraying the endpoint with DOIs that would 404 anyway (and to conserve token quota).

source
BiblioFetch.is_elsevier_doi — Function
is_elsevier_doi(doi) -> Bool

Whether doi is published by Elsevier. 10.1016/* covers ScienceDirect, Cell Press, The Lancet, and the vast majority of Elsevier content.

source
BiblioFetch.elsevier_tdm_auth_headers — Function
elsevier_tdm_auth_headers(; api_key = ENV["ELSEVIER_API_KEY"],
                         insttoken = ENV["ELSEVIER_INSTTOKEN"])
    -> Vector{Pair{String,String}}

Build the header set for an Elsevier TDM request: X-ELS-APIKey whenever a key is configured, plus X-ELS-Insttoken when a token is set too. Returns an empty Vector when no key is configured — the fetch pipeline uses that to skip :elsevier entirely, matching the APS TDM pattern (don't spray requests that will 401 anyway).

source
BiblioFetch.springer_oa_lookup — Function
springer_oa_lookup(doi; api_key = ENV["SPRINGER_API_KEY"],
                   proxy = nothing, timeout = 15,
                   base_url = SPRINGER_OA_URL)
    -> (pdf_url_or_nothing, metadata_dict)

Ask the Springer Nature OpenAccess API whether doi is registered as an OA article. Returns (pdf_url, metadata):

  • pdf_url is the canonical link.springer.com/content/pdf/<DOI>.pdf URL when the API confirms OA registration, else nothing.
  • metadata is the parsed JSON body (empty Dict on hard failure or when the response has no records).

Returns (nothing, Dict()) without a network call when no API key is configured — the fetch pipeline uses that to skip :springer entirely, matching the APS/Elsevier "don't spray un-authenticated requests" pattern.

source
BiblioFetch.is_springer_doi — Function
is_springer_doi(doi) -> Bool

Whether doi is published under a Springer Nature imprint. Covers the four prefixes worth gating on:

  • 10.1007/ — Springer (journals + books, the overwhelming majority)
  • 10.1038/ — Nature portfolio (mix of OA and paywalled; OA API will tell us which)
  • 10.1186/ — BMC / BioMed Central (all OA)
  • 10.1140/ — European Physical Journal (EPJ)

Other Springer-distributed prefixes (10.1057 Palgrave, 10.1023 legacy Kluwer, 10.1134 Allerton) are rare in practice and omitted to keep the guard tight.

source

Network status

BiblioFetch.status — Function
status(; rt = detect_environment(), timeout = 5.0, probes = _STATUS_PROBES)
    -> NetworkStatus

Probe every supported metadata / PDF endpoint concurrently (@async + fetch) and report which ones respond from the current network. Total wall time is roughly timeout plus a small fixed cost, not #probes × timeout.

Exposed so user code (and live-network tests) can ask "is Crossref reachable from here?" before queueing work. The probes kwarg is overridable so integration tests can point it at a local mock server.

source
BiblioFetch.is_reachable — Function
is_reachable(status, source) -> Bool

Quick predicate for test-gating code: is_reachable(status, :crossref) returns true iff the corresponding probe succeeded.
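
For example (the call is qualified only for clarity):

using BiblioFetch

st = BiblioFetch.status()          # probe all endpoints concurrently
if is_reachable(st, :crossref)
    # safe to queue metadata-heavy work from this network
end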

source
BiblioFetch.NetworkStatus — Type
NetworkStatus

Aggregate of all probe results + the derived effective_sources — which of (:unpaywall, :arxiv, :direct) can actually do their job right now.

Useful to distinguish "at the university, full access" from "at home, OA only" before kicking off a long job, and to gate live-network tests so they skip cleanly on CI / offline machines.

source
BiblioFetch.ProbeResult — Type
ProbeResult

One reachability probe record — which endpoint, did it respond, how fast, what HTTP status came back. reachable treats HTTP 2xx-4xx as reachable (the server is up, just may or may not have the thing we probed for); 5xx / connection errors / timeouts are false.

source

Vault (topic-based collection)

BiblioFetch.load_vault_index — Function
load_vault_index(dir) -> VaultIndex

Read vault.toml if present; otherwise treat every *.toml (except vault.toml) in dir as a topic file.

source
BiblioFetch.vault_add_ref! — Function
vault_add_ref!(topic_name, raw_ref; dir) -> String

Append raw_ref to [doi].list in <dir>/<topic_name>.toml, creating the file with an empty [topic] header if it does not exist. Returns the normalized key.

source
BiblioFetch.vault_fetch! — Function
vault_fetch!(index; topic_name, runtime, verbose) -> Dict{String,FetchJobResult}

Fetch papers for all topics (or a named subset) into index.store. Returns a Dict mapping topic name → FetchJobResult.
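
A topic-workflow sketch, assuming topic_name accepts a single topic and runtime / verbose have defaults (the vault directory and ref are illustrative):

using BiblioFetch

dir = expanduser("~/vault")
vault_add_ref!("superconductivity", "10.1038/nature26160"; dir)
idx = load_vault_index(dir)
vault_fetch!(idx; topic_name = "superconductivity")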

source
BiblioFetch.vault_bib — Function
vault_bib(index; topic_name, out) -> Int

Write a BibTeX file for all vault papers (or one topic). Returns entry count.

source
BiblioFetch.vault_search — Function
vault_search(index, query; fields, case_sensitive) -> Vector{SearchMatch}

Search across all papers in the vault store.

source

CLI

Native app build

BiblioFetch.build — Function
build(; sysimage_dir, bindir, force) -> String

Compile BiblioFetch into a sysimage using PackageCompiler.jl (create_sysimage with incremental=true), then write a thin shell wrapper into bindir.

Using a sysimage (rather than create_app) avoids the isolated-build errors that create_app triggers for packages with binary C extensions (HTTP → MbedTLS). It also produces a much smaller artefact (~40 MB vs ~300 MB) because the Julia runtime is not bundled — the system-installed julia is reused.

After a successful build, bibliofetch starts in under a second.

Arguments

  • sysimage_dir: directory where sys.so (Linux/macOS) or sys.dll (Windows) is written. Default: ~/.local/share/bibliofetch
  • bindir: where the bibliofetch wrapper script is installed. Default: ~/.local/bin
  • force: overwrite an existing sysimage. Default: false

Example

using Pkg; Pkg.add("PackageCompiler")   # once
using BiblioFetch
BiblioFetch.build()                      # ~2–4 min, run once per Julia version
BiblioFetch.build(force=true)            # rebuild after Pkg.update()
source