API Reference

Environment detection

BiblioFetch.detect_environment — Function
detect_environment(; probe = true) -> Runtime

Detect hostname, applicable config profile, effective proxy (env > profile), optionally probe reachability, and classify the operating mode.

source
BiblioFetch.effective_runtime — Function
effective_runtime(; probe = true) -> Runtime

Alias for detect_environment; kept as the public "what should I use right now?" accessor.

source
BiblioFetch.load_config — Function
load_config(; path = ENV["BIBLIOFETCH_CONFIG"] or default)
    -> (config::Dict, path_or_nothing)

Read and parse the global BiblioFetch config TOML. Returns (Dict(), nothing) when no file is present at path. The default location is ~/.config/bibliofetch/config.toml; $BIBLIOFETCH_CONFIG overrides it.
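
A minimal usage sketch (the log messages are illustrative):

using BiblioFetch

config, path = load_config()
if path === nothing
    @info "No config file found; using built-in defaults"
else
    @info "Loaded config from $path"
end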

source

References — parse / classify

BiblioFetch.normalize_key — Function
normalize_key(s) -> String

Normalize a user-provided reference to a canonical key:

  • DOI → lowercase DOI (10.1103/physrevb.xx.yyyy)
  • arXiv → arxiv:<id>

Throws ArgumentError if unrecognized.
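
Illustrative calls (the bare arXiv id form is assumed to be accepted, mirroring is_arxiv):

using BiblioFetch

normalize_key("10.1103/PhysRevB.47.558")   # → "10.1103/physrevb.47.558"
normalize_key("1706.03762")                # → "arxiv:1706.03762"
normalize_key("not-a-reference")           # throws ArgumentError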

source
BiblioFetch.is_doi — Function
is_doi(s) -> Bool

Whether s looks like a DOI (10.xxxx/anything). Strips surrounding whitespace but does not otherwise transform the input.

source
BiblioFetch.is_arxiv — Function
is_arxiv(s) -> Bool

Whether s looks like an arXiv id — both the new-style (1706.03762, optionally with a version suffix v2 and an arxiv: prefix) and the legacy slash form (cond-mat/0608208).

source
BiblioFetch.is_arxiv_versions — Function
is_arxiv_versions(s) -> Bool

Whether s is the multi-version pseudo-ref form arxiv:<id>@all or arxiv:<id>@v1,v3 / arxiv:<id>@1,3. These refs can't be fetched as-is — the run loop expands them into one FetchEntry per version before dispatching to fetch_paper!.

source
BiblioFetch.parse_arxiv_version_spec — Function
parse_arxiv_version_spec(s) -> (base_key, spec)

Parse an arxiv:<id>@… pseudo-ref into its components.

  • base_key — the canonical arxiv:<id> key (lower-cased, no version suffix, with arxiv: prefix).
  • spec — either :all (every known version) or a sorted Vector{Int} of explicit version numbers.

Throws ArgumentError when s is not a well-formed pseudo-ref.
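
For example, using the pseudo-ref forms documented above:

using BiblioFetch

parse_arxiv_version_spec("arxiv:1706.03762@all")     # → ("arxiv:1706.03762", :all)
parse_arxiv_version_spec("arxiv:1706.03762@v1,v3")   # → ("arxiv:1706.03762", [1, 3])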

source

arXiv version discovery

BiblioFetch.arxiv_latest_version — Function
arxiv_latest_version(id; proxy, timeout, base_url = ARXIV_API_URL)
    -> Int or nothing

Return the number of the latest published version of arXiv paper id. arXiv's API answers an id_list=<id> query with the entry's canonical URL in its <id> element, which always carries the current vN suffix — the integer after v is the latest-version number. Missing or unparseable responses return nothing. Strips an arxiv: prefix if passed.

source
BiblioFetch.arxiv_list_versions — Function
arxiv_list_versions(id; kwargs...) -> Vector{Int}

Return every version number an arXiv paper has, in ascending order. arXiv numbers versions sequentially from 1, so this is effectively 1:arxiv_latest_version(id), costing a single API trip. Returns Int[] on lookup failure.

kwargs are forwarded to arxiv_latest_version.
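
A quick sketch, assuming proxy and timeout carry usable defaults:

using BiblioFetch

latest = arxiv_latest_version("1706.03762")    # Int, or nothing when the lookup fails
versions = arxiv_list_versions("1706.03762")   # e.g. [1, 2, ..., latest], or Int[] on failure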

source

Store

BiblioFetch.Store — Type
Store(root)

Handle on a BiblioFetch store directory. Holds the root path; all PDF and metadata paths are derived from it. Construct with open_store — the raw constructor does not create the backing directory layout.

source
BiblioFetch.open_store — Function
open_store(root) -> Store

Create (if needed) the store directory layout under root:

<root>/
  <group>/<safekey>.pdf         # grouped PDFs live inside their group subdir
  <safekey>.pdf                 # (or at the root for ungrouped entries)
  .metadata/<safekey>.toml      # one TOML per paper (editable, hidden)
source
BiblioFetch.list_entries — Function
list_entries(store) -> Vector{String}

Return the filesystem-safe keys of every paper currently tracked in the store, sorted alphabetically. These are the stems of files under <root>/.metadata/, not the canonical DOI/arXiv keys (use entry_info to get the key).

source
BiblioFetch.entry_info — Function
entry_info(store, key) -> NamedTuple | Nothing

Summary record for one entry — key, title, status, source, group, pdf_path, year. Returns nothing when the key has no metadata on disk.
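
A minimal store walk (the store root is illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
for safekey in list_entries(store)
    info = entry_info(store, safekey)
    info === nothing && continue
    println(info.key, "  [", info.status, "]  ", info.title)
end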

source

Project skeleton

BiblioFetch.generate — Function
generate(path; force = false) -> String

Create a BiblioFetch project skeleton under path. Copies every file in the package's template/ directory (job.toml + README.md at present) into path, creating intermediate directories as needed.

  • path — absolute or ~-prefixed; expanded before use. If it's relative, it's resolved against pwd().
  • force — when false (default) and path already exists and is non-empty, generate refuses with an ArgumentError. true overwrites any clashing file unconditionally.

Returns the absolute path of the created project, ready to pass to bibliofetch run <path>/job.toml (with relative-target resolution — see load_job).

source

Fetch

BiblioFetch.fetch_paper! — Function
fetch_paper!(store, key; rt, group = "", force = false,
             sources = DEFAULT_SOURCES, source_policy = :lenient,
             also_arxiv = false, verbose = true) -> FetchResult

Resolve key (DOI or arxiv:…) and try the configured sources in order:

  1. :unpaywall → OA PDF (requires rt.email)
  2. :arxiv → arXiv preprint (always OA)
  3. :direct → doi.org/<doi> through proxy (only when proxy is reachable)

source_policy controls which sources are allowed to produce candidates:

  • :lenient (default) — every source listed in sources is eligible.
  • :strict — only PUBLISHER_SOURCES produce candidates; preprint routes (:arxiv, :s2) are silently dropped, and :unpaywall is only kept when its best_oa_location has host_type = "publisher".

also_arxiv (default false) — after a successful primary fetch whose source is not already :arxiv, BiblioFetch does a companion download of the arXiv preprint (if an arXiv id was discovered from Crossref relation.has-preprint or the title-search fallback) into preprint_pdf_path(store, key; group). Records preprint_* fields in the entry's metadata TOML. Silently no-ops when no arXiv id exists.

The PDF is stored at pdf_path(store, key; group) — i.e. in store.root/<group>/. Per-attempt diagnostics are recorded in the returned FetchResult.attempts.
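
A usage sketch; the DOI and group here are illustrative:

using BiblioFetch

rt = detect_environment()
store = open_store(expanduser("~/papers"))
res = fetch_paper!(store, "10.1103/PhysRevB.47.558"; rt, group = "dft")
@info "fetch finished" res.attempts   # per-source AttemptLog diagnostics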

source
BiblioFetch.sync! — Function
sync!(store; rt = detect_environment(), force = false, verbose = true)
    -> Vector{FetchResult}

Walk the store's metadata directory and (re)fetch entries, preserving each entry's stored group.

  • default (force = false): skip entries that already have status = "ok" and a PDF on disk. Everything else — pending, failed, or status-ok with a missing PDF — is fetched. Useful for resuming a partial run.
  • force = true: every tracked entry is re-downloaded, even ones already on disk. force = true is propagated to fetch_paper!, so its cached fast-path is bypassed and the PDF is overwritten.
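
Typical calls (store root illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
sync!(store)                 # resume: fetch pending / failed / status-ok-but-missing-PDF entries
sync!(store; force = true)   # re-download every tracked entry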
source
BiblioFetch.AttemptLog — Type
AttemptLog

One source attempt during a fetch — useful for diagnosing why a key failed. retry_count is the number of retries burned inside this attempt (driven by retry_statuses / exceptions in _http_get_with_retry); retried_statuses is the list of HTTP statuses that triggered each retry. A 0 in retried_statuses stands for a pre-server / exception retry (no response arrived; the request never reached HTTP). When a source completes on the first try, retry_count == 0 and retried_statuses is empty.

source

Jobs

BiblioFetch.load_job — Function
load_job(path; runtime = detect_environment()) -> FetchJob

Parse a bibliofetch.toml file. Fills in missing fetch.email from runtime, flattens [doi] groups into FetchEntrys, deduplicates keys (lenient by default), and returns the job without performing any network I/O.

source
BiblioFetch.run — Function
BiblioFetch.run(path_or_job; verbose = true) -> FetchJobResult

Execute a job. path_or_job may be a path to a bibliofetch.toml or an already-loaded FetchJob. Writes PDFs into job.target/<group>/, metadata into job.target/.metadata/, and a run log into job.log_file.
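
For example (the job path is illustrative; the call is qualified because Base also exports run):

using BiblioFetch

job = load_job("project/bibliofetch.toml")
result = BiblioFetch.run(job)   # or BiblioFetch.run("project/bibliofetch.toml")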

source
BiblioFetch.FetchEntry — Type
FetchEntry

One reference pulled from a job file: normalized key, assigned group, and (after running) its fetch status and per-source attempt log.

source
BiblioFetch.FetchJob — Type
FetchJob

Parsed bibliofetch.toml — the list of references to pull, where to put them, and which sources / concurrency / overwrite policy to use.

source

BibTeX

BiblioFetch.bibtex_entry — Function
bibtex_entry(md; key = _bibtex_key(md)) -> String

Render one metadata dict as a BibTeX entry string (including trailing newline). Uses @article when md["journal"] is non-empty, @misc otherwise (arXiv preprints, tech reports). Fields are always written in the same order.

source
BiblioFetch.write_bibtex — Function
write_bibtex(store, path; key_filter = nothing) -> Int

Iterate every status = "ok" entry in the store's .metadata/, assign a FirstAuthorSurnameYear citekey (with letter-suffix disambiguation for collisions), and write the combined BibTeX to path. Returns the number of entries written. When key_filter is a Set{String} of normalized keys, only those entries are written.
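
For example (paths and the filter key are illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
write_bibtex(store, "refs.bib")                                               # all ok entries
write_bibtex(store, "dft.bib"; key_filter = Set(["10.1103/physrevb.47.558"]))  # subset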

source
BiblioFetch.parse_bibtex — Function
parse_bibtex(text) -> Vector{BibEntry}

Walk a BibTeX source string and collect every @TYPE{key, fields…} entry. Top-level brace balancing is manual (so nested {…} inside field values don't confuse the scanner); individual field extraction uses a regex that tolerates single-level braces, which covers every real doi / eprint / url value.

Entries that fail to parse (malformed headers, unbalanced braces, etc.) are skipped silently — a single broken entry shouldn't abort the whole import.

source
BiblioFetch.bibentry_to_ref — Function
bibentry_to_ref(entry) -> String | Nothing

Derive the identifier BiblioFetch should queue for a bib entry. Checks, in order:

  1. doi → return the DOI as-is (will be normalized downstream)
  2. eprint → if archivePrefix is arxiv or absent, return arxiv:<eprint>
  3. url → if it's a doi.org/… or arxiv.org/abs/… URL, return the extracted identifier

Returns nothing when nothing usable is found (e.g. an entry with only a title and unstructured publisher).

source
BiblioFetch.import_bib! — Function
import_bib!(store, path) -> (added, skipped)

Parse path as a BibTeX file and queue every entry that yields a recognizable DOI or arXiv id into store. Returns:

  • added::Vector{NamedTuple{(:citekey, :ref, :key)}} — entries successfully queued. citekey is the BibTeX citekey, ref is what we extracted, key is the normalized store key.
  • skipped::Vector{NamedTuple{(:citekey, :reason)}} — entries rejected either because no usable identifier was found or because normalization of the extracted string failed.

Duplicate refs already in the store are treated as success (queued = idempotent).
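
A sketch of the import-then-fetch flow (the .bib path is illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
added, skipped = import_bib!(store, "legacy.bib")
for s in skipped
    @warn "skipped" s.citekey s.reason
end
sync!(store)   # download what was queued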

source
BiblioFetch.BibEntry — Type
BibEntry

One entry scanned out of a .bib file — type, citekey, and a flattened field map. Field keys are lowercased; field values are stripped of their {…} / "…" wrapper but not of nested LaTeX braces (which almost never appear in the identifier fields we care about).

source

Citation graph visualization

BiblioFetch.to_dot — Function
to_dot(store; queued_only = false, include_isolated = false) -> String

Render the store's citation graph as a Graphviz DOT source string. Pipe through dot -Tpng > graph.png (or -Tsvg) to view.

  • queued_only = true — show only the expansion tree (edges from referenced_by), not the full citation fabric.
  • include_isolated = true — keep entries that aren't part of any edge (default: hide them so the graph stays readable).

Node labels are the same citekeys bibliofetch bib emits; node colour/style reflects status (ok / pending / failed).
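
For example (output path illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
write("graph.dot", to_dot(store; queued_only = true))
# render outside Julia:  dot -Tpng graph.dot > graph.png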

source
BiblioFetch.to_mermaid — Function
to_mermaid(store; queued_only = false, include_isolated = false) -> String

Render the store's citation graph as Mermaid (graph LR) source, ready to paste into a Markdown fence on GitHub / Obsidian / Docusaurus. Same edge-policy and filter flags as to_dot.

source

Deduplication

BiblioFetch.find_duplicates — Function
find_duplicates(store) -> Vector{Pair{String,Vector{String}}}

Scan store's metadata directory and return a list of sha256 => keys pairs for every hash held by more than one entry. Each keys vector is sorted lexicographically, so the canonical (kept) key in dedup operations is deterministic.

The SHA-256 comes from the sha256 field written by fetch_paper!. Entries whose metadata lacks that field (very old stores, failed fetches, entries resolved into duplicate_of) are skipped.

source
BiblioFetch.resolve_duplicates! — Function
resolve_duplicates!(store; apply = false) -> NamedTuple

Walk the duplicate groups reported by find_duplicates. For each group keep the lexicographically first key as canonical; the rest are recorded with duplicate_of = "<canonical>" and their pdf_path is redirected to the canonical entry's file. On-disk duplicate PDFs are removed when apply = true; otherwise the function just reports what would happen.

Returns (; groups, bytes_freed, canonicals) — groups is the output of find_duplicates, bytes_freed is the size that would be (or was) recovered, and canonicals is a duplicate_key => canonical_key map.
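
A dry-run-then-apply sketch (store root illustrative; field names as documented above):

using BiblioFetch

store = open_store(expanduser("~/papers"))
report = resolve_duplicates!(store)            # dry run: nothing is deleted
@info "dedup preview" report.bytes_freed length(report.groups)
resolve_duplicates!(store; apply = true)       # actually remove duplicate PDFs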

source

Doctor (store integrity)

BiblioFetch.doctor — Function
doctor(store) -> Vector{StoreIssue}

Inventory the store for operational problems:

  • cross-reference metadata pdf_path vs on-disk files (missing / orphan)
  • flag .part leftover files from interrupted downloads
  • flag 0-byte PDFs
  • when a metadata entry records a sha256, verify the on-disk file still hashes to the same value (:sha_mismatch)

One pass, no network. Returns a flat list sorted first by kind, then by key / path.

source
BiblioFetch.fix! — Function
fix!(store, issues; kinds = (:incomplete_part,)) -> Int

Apply safe auto-fixes to a subset of issues. Returns the number of issues acted on. Safe defaults:

  • :incomplete_part — remove the .part file unconditionally
  • :pdf_missing — clear pdf_path from the metadata entry; don't touch the metadata's other fields, so a subsequent bibliofetch sync --force can re-fetch

Other kinds (:orphan_pdf, :sha_mismatch, :empty_pdf) are opt-in — pass their symbol in kinds to include them. Orphan removal in particular is destructive and should be reviewed first.
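
For example (store root illustrative):

using BiblioFetch

store = open_store(expanduser("~/papers"))
issues = doctor(store)
fix!(store, issues)                                            # default safe kinds only
fix!(store, issues; kinds = (:incomplete_part, :pdf_missing))  # opt in explicitly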

source
BiblioFetch.StoreIssue — Type
StoreIssue

One integrity problem doctor found. kind is one of:

  • :pdf_missing — metadata lists a pdf_path whose file is gone
  • :orphan_pdf — a PDF on disk isn't referenced by any metadata entry
  • :incomplete_part — a .part leftover from an interrupted download
  • :sha_mismatch — metadata has a sha256 that no longer matches the file on disk (PDF was replaced / corrupted)
  • :empty_pdf — pdf_path exists but the file is 0 bytes

key identifies the metadata entry the issue belongs to, if any; orphan disk files have key == "".

source
BiblioFetch.search_entries — Function
search_entries(store, query; fields, group, status, case_sensitive)
    -> Vector{SearchMatch}

Substring-search the store's metadata. By default matches in any of title / authors / abstract / journal / key; override with fields.

  • query — the text to search for (empty ⇒ every entry, useful with filters).
  • fields — tuple / vector of Symbol field names to match against.
  • group — optional group-prefix filter (empty ⇒ all groups).
  • status — optional exact-match filter ("ok" / "failed" / "pending").
  • case_sensitive — default false.

Results are sorted by number of matched fields (desc), key (asc). A paper is returned at most once even if multiple fields hit.
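
A sketch, assuming SearchMatch exposes key and title fields as described below:

using BiblioFetch

store = open_store(expanduser("~/papers"))
hits = search_entries(store, "superconductivity"; fields = (:title, :abstract))
for h in hits
    println(h.key, "  ", h.title)
end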

source
BiblioFetch.SearchMatch — Type
SearchMatch

One row in the result of search_entries — the hit's normalized key, its status/title/year/group for display, which fields contained the query, and a ±40-char snippet around the first match for context.

source

Statistics

BiblioFetch.stats — Function
stats(store) -> StoreStats

Walk the store's .metadata/ directory once and aggregate:

  • per-status / per-source / per-group counts
  • PDF file count and total byte size (counts only files that exist)
  • pdf_missing — entries whose metadata lists a pdf_path that's gone
  • duplicate_resolved — entries linked to a canonical by resolve_duplicates!
  • graph_expanded — entries queued by a citation hop (depth > 0)
  • oldest_fetch / newest_fetch — earliest and latest fetched_at timestamps, nothing when the store has no successful fetches yet

One pass, no network. Safe to call on huge stores; per-entry cost is dominated by TOML.parsefile on the metadata file.
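
A sketch, assuming StoreStats exposes the bullet names above as fields:

using BiblioFetch

store = open_store(expanduser("~/papers"))
s = stats(store)
@info "store overview" s.pdf_missing s.oldest_fetch s.newest_fetch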

source
BiblioFetch.StoreStats — Type
StoreStats

Aggregate counts and sizes for a store, one walk of .metadata/ away. Used by bibliofetch stats for a daily-review dashboard and by any caller who wants to know "what's actually in here?" without enumerating entries by hand.

source

External metadata sources

BiblioFetch.datacite_lookup — Function
datacite_lookup(doi; proxy = nothing, timeout = 15, base_url = DATACITE_URL,
                max_retries, base_delay) -> Dict

Fetch DataCite metadata for a DOI and return it in Crossref's metadata shape (so it slots straight into the existing fetch_paper! extraction). Returns an empty Dict on any failure.

Used as a fallback after Crossref returns nothing — covers dataset DOIs registered through Zenodo, Figshare, institutional DataCite clients, etc.

source
BiblioFetch.s2_lookup — Function
s2_lookup(ref; api_key = ENV["SEMANTIC_SCHOLAR_API_KEY"], proxy = nothing,
          timeout = 15, base_url = S2_URL, max_retries, base_delay)
    -> Dict

Look up a paper on Semantic Scholar. ref is a normalized key (10.xxxx/yyy or arxiv:…).

Returns a Dict{String,Any} with the fields BiblioFetch cares about:

  • "title"String
  • "authors"Vector{String} (display names, one per author)
  • "year"Int or nothing
  • "abstract"String (empty when S2 didn't have one)
  • "journal"String (empty when S2 didn't record one)
  • "oa_pdf_url"String pointing at the publisher / repository PDF, present only when openAccessPdf.url is non-empty
  • "s2_paper_id" — S2's own stable id, useful for follow-ups

Empty Dict on any failure (unreachable, 404, malformed JSON, etc.).

source

Publisher TDM (authenticated)

BiblioFetch.aps_tdm_url — Function
aps_tdm_url(doi; base_url = APS_TDM_URL) -> String

Build the harvest.aps.org URL that returns a PDF for the given APS DOI. Does no validation beyond the 10.1103 prefix check in is_aps_doi.

source
BiblioFetch.is_aps_doi — Function
is_aps_doi(doi) -> Bool

Whether doi is published by the American Physical Society — all APS DOIs live under the 10.1103/ prefix. Checked before dispatching to harvest.aps.org to avoid spraying the endpoint with DOIs that would 404 anyway (and to conserve token quota).

source
BiblioFetch.is_elsevier_doi — Function
is_elsevier_doi(doi) -> Bool

Whether doi is published by Elsevier. 10.1016/* covers ScienceDirect, Cell Press, The Lancet, and the vast majority of Elsevier content.

source
BiblioFetch.elsevier_tdm_auth_headers — Function
elsevier_tdm_auth_headers(; api_key = ENV["ELSEVIER_API_KEY"],
                         insttoken = ENV["ELSEVIER_INSTTOKEN"])
    -> Vector{Pair{String,String}}

Build the header set for an Elsevier TDM request: X-ELS-APIKey whenever a key is configured, plus X-ELS-Insttoken when a token is set too. Returns an empty Vector when no key is configured — the fetch pipeline uses that to skip :elsevier entirely, matching the APS TDM pattern (don't spray requests that will 401 anyway).

source
BiblioFetch.springer_oa_lookup — Function
springer_oa_lookup(doi; api_key = ENV["SPRINGER_API_KEY"],
                   proxy = nothing, timeout = 15,
                   base_url = SPRINGER_OA_URL)
    -> (pdf_url_or_nothing, metadata_dict)

Ask the Springer Nature OpenAccess API whether doi is registered as an OA article. Returns (pdf_url, metadata):

  • pdf_url is the canonical link.springer.com/content/pdf/<DOI>.pdf URL when the API confirms OA registration, else nothing.
  • metadata is the parsed JSON body (empty Dict on hard failure or when the response has no records).

Returns (nothing, Dict()) without a network call when no API key is configured — the fetch pipeline uses that to skip :springer entirely, matching the APS/Elsevier "don't spray un-authenticated requests" pattern.

source
BiblioFetch.is_springer_doi — Function
is_springer_doi(doi) -> Bool

Whether doi is published under a Springer Nature imprint. Covers the four prefixes worth gating on:

  • 10.1007/ — Springer (journals + books, the overwhelming majority)
  • 10.1038/ — Nature portfolio (mix of OA and paywalled; OA API will tell us which)
  • 10.1186/ — BMC / BioMed Central (all OA)
  • 10.1140/ — European Physical Journal (EPJ)

Other Springer-distributed prefixes (10.1057 Palgrave, 10.1023 legacy Kluwer, 10.1134 Allerton) are rare in practice and omitted to keep the guard tight.

source

Network status

BiblioFetch.status — Function
status(; rt = detect_environment(), timeout = 5.0, probes = _STATUS_PROBES)
    -> NetworkStatus

Probe every supported metadata / PDF endpoint concurrently (@async + fetch) and report which ones respond from the current network. Total wall time is roughly timeout plus a small fixed cost, not #probes × timeout.

Exposed so user code (and live-network tests) can ask "is Crossref reachable from here?" before queueing work. The probes kwarg is overridable so integration tests can point it at a local mock server.

source
BiblioFetch.is_reachable — Function
is_reachable(status, source) -> Bool

Quick predicate for test-gating code: is_reachable(status, :crossref) returns true iff the corresponding probe succeeded.
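
For example (the call is qualified only for clarity):

using BiblioFetch

st = BiblioFetch.status()          # probe all endpoints concurrently
if is_reachable(st, :crossref)
    # safe to queue metadata-heavy work from this network
end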

source
BiblioFetch.NetworkStatus — Type
NetworkStatus

Aggregate of all probe results + the derived effective_sources — which of (:unpaywall, :arxiv, :direct) can actually do their job right now.

Useful to distinguish "at the university, full access" from "at home, OA only" before kicking off a long job, and to gate live-network tests so they skip cleanly on CI / offline machines.

source
BiblioFetch.ProbeResult — Type
ProbeResult

One reachability probe record — which endpoint, did it respond, how fast, what HTTP status came back. reachable treats HTTP 2xx-4xx as reachable (the server is up, just may or may not have the thing we probed for); 5xx / connection errors / timeouts are false.

source

Vault (topic-based collection)

BiblioFetch.load_vault_index — Function
load_vault_index(dir) -> VaultIndex

Read vault.toml if present; otherwise treat every *.toml (except vault.toml) in dir as a topic file.

source
BiblioFetch.vault_add_ref! — Function
vault_add_ref!(topic_name, raw_ref; dir) -> String

Append raw_ref to [doi].list in <dir>/<topic_name>.toml, creating the file with an empty [topic] header if it does not exist. Returns the normalized key.

source
BiblioFetch.vault_fetch! — Function
vault_fetch!(index; topic_name, runtime, verbose) -> Dict{String,FetchJobResult}

Fetch papers for all topics (or a named subset) into index.store. Returns a Dict mapping topic name → FetchJobResult.
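
A topic-workflow sketch, assuming topic_name accepts a single topic and runtime / verbose have defaults (the vault directory and ref are illustrative):

using BiblioFetch

dir = expanduser("~/vault")
vault_add_ref!("superconductivity", "10.1038/nature26160"; dir)
idx = load_vault_index(dir)
vault_fetch!(idx; topic_name = "superconductivity")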

source
BiblioFetch.vault_bib — Function
vault_bib(index; topic_name, out) -> Int

Write a BibTeX file for all vault papers (or one topic). Returns entry count.

source
BiblioFetch.vault_search — Function
vault_search(index, query; fields, case_sensitive) -> Vector{SearchMatch}

Search across all papers in the vault store.

source

CLI

Native app build

BiblioFetch.build — Function
build(; sysimage_dir, bindir, force) -> String

Compile BiblioFetch into a sysimage using PackageCompiler.jl (create_sysimage with incremental=true), then write a thin shell wrapper into bindir.

Using a sysimage (rather than create_app) avoids the isolated-build errors that create_app triggers for packages with binary C extensions (HTTP → MbedTLS). It also produces a much smaller artefact (~40 MB vs ~300 MB) because the Julia runtime is not bundled — the system-installed julia is reused.

After a successful build, bibliofetch starts in under a second.

Arguments

  • sysimage_dir: directory where sys.so (Linux/macOS) or sys.dll (Windows) is written. Default: ~/.local/share/bibliofetch
  • bindir: where the bibliofetch wrapper script is installed. Default: ~/.local/bin
  • force: overwrite an existing sysimage. Default: false

Example

using Pkg; Pkg.add("PackageCompiler")   # once
using BiblioFetch
BiblioFetch.build()                      # ~2–4 min, run once per Julia version
BiblioFetch.build(force=true)            # rebuild after Pkg.update()
source