Linkhut.Archiving.Pipeline (linkhut v0.1.4)

Orchestrates the archiving pipeline from URL validation through crawler dispatch. Called by the Archiver worker.

Summary

Functions

run(crawl_run, opts \\ [])

Runs the full archiving pipeline for the given crawl run.

Functions

run(crawl_run, opts \\ [])

@spec run(
  Linkhut.Archiving.CrawlRun.t(),
  keyword()
) :: {:ok, map()} | {:error, term()}

Runs the full archiving pipeline for the given crawl run.

  1. Validates the URL (SSRF check)
  2. Sends a preflight request to obtain content_type, final_url, and status
  3. Repeats the SSRF check on final_url, in case a redirect changed the target
  4. Selects eligible crawlers via can_handle?/2
  5. Atomically dispatches crawler jobs and creates pending snapshots

Always-dispatch crawlers (third-party) are selected before preflight and dispatched alongside target crawlers.

Not-archivable outcomes (invalid URL, unsupported scheme, no eligible crawlers, file too large) are finalized as :not_archivable and are not retried.
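The ordering of these steps can be sketched with a `with` chain. This is only an illustration under stated assumptions, not Linkhut's implementation: the module `PipelineSketch`, its helpers, the crawler map shape (`name`, `always`, `can_handle`), the reason atoms, and the `{:ok, %{status: :not_archivable}}` return shape are all hypothetical. The SSRF check here inspects literal hosts only, whereas a production check would also need to resolve DNS, and the "file too large" case is omitted.

```elixir
defmodule PipelineSketch do
  # Illustrative only: names and return shapes are assumptions,
  # not the actual Linkhut.Archiving.Pipeline code.

  # `crawlers` is a list of maps:
  #   %{name: atom, always: boolean, can_handle: (url, preflight -> boolean)}
  # `preflight_fun` stands in for the HTTP preflight request.
  def run(url, crawlers, preflight_fun) do
    # Always-dispatch crawlers are picked before the preflight request.
    {always, targets} = Enum.split_with(crawlers, & &1.always)

    with :ok <- check_url(url),
         {:ok, preflight} <- preflight_fun.(url),
         :ok <- check_url(preflight.final_url),
         {:ok, eligible} <- select(targets, always, url, preflight) do
      # The real pipeline dispatches jobs and creates snapshots atomically;
      # this sketch only reports which crawlers would be dispatched.
      {:ok, %{status: :dispatched, crawlers: Enum.map(eligible, & &1.name)}}
    else
      # Assumed shape: not-archivable outcomes are final, so no error tuple.
      {:not_archivable, reason} -> {:ok, %{status: :not_archivable, reason: reason}}
      {:error, reason} -> {:error, reason}
    end
  end

  # Simplified SSRF check: allowed scheme plus a literal-host blocklist.
  defp check_url(url) do
    %URI{scheme: scheme, host: host} = URI.parse(url)

    cond do
      scheme not in ["http", "https"] -> {:not_archivable, :unsupported_scheme}
      host in [nil, ""] or blocked_host?(host) -> {:not_archivable, :invalid_url}
      true -> :ok
    end
  end

  defp blocked_host?(host) do
    case :inet.parse_address(String.to_charlist(host)) do
      {:ok, {127, _, _, _}} -> true
      {:ok, {10, _, _, _}} -> true
      {:ok, {192, 168, _, _}} -> true
      {:ok, {169, 254, _, _}} -> true
      {:ok, {172, b, _, _}} when b in 16..31 -> true
      {:ok, {0, 0, 0, 0, 0, 0, 0, 1}} -> true
      _ -> host == "localhost"
    end
  end

  # Target crawlers are filtered by their can_handle function; always-dispatch
  # crawlers are included regardless.
  defp select(targets, always, url, preflight) do
    case always ++ Enum.filter(targets, & &1.can_handle.(url, preflight)) do
      [] -> {:not_archivable, :no_eligible_crawlers}
      eligible -> {:ok, eligible}
    end
  end
end
```

Checking `final_url` a second time matters because a public URL can redirect to an internal address; in this sketch a preflight that reports `final_url: "http://10.0.0.5/"` ends as not archivable even though the original URL passed the first check.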

Options:

  • :recrawl - boolean, whether this is a re-crawl attempt
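As a rough sketch of how a caller might pass the :recrawl option and branch on the documented return type, the hypothetical module below takes the pipeline as a 2-arity function. In the real application that function would presumably be `&Linkhut.Archiving.Pipeline.run/2`; the module name `ArchiverSketch` and the `:finished` / `:failed` labels are assumptions made for illustration only.

```elixir
defmodule ArchiverSketch do
  # Hypothetical caller. `pipeline` is any 2-arity function that follows the
  # run/2 contract: it returns {:ok, map()} or {:error, term()}.
  def perform(crawl_run, pipeline, recrawl? \\ false) do
    case pipeline.(crawl_run, recrawl: recrawl?) do
      # Covers both dispatched and finalized outcomes.
      {:ok, result} -> {:finished, result}
      # Left to the caller's retry policy.
      {:error, reason} -> {:failed, reason}
    end
  end
end
```

Injecting the pipeline as a function keeps the sketch runnable without the Linkhut application and makes the option handling easy to exercise with a stub.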