Linkhut.Archiving (linkhut v0.1.4)

Manages link archiving — creating snapshots of bookmarked pages, storing them, and generating time-limited tokens to view them.

Crawling is handled by Linkhut.Archiving.Workers.Archiver and Linkhut.Archiving.Workers.Crawler, which call back into this context to persist results.

Summary

Functions

Returns the list of content types accepted for snapshot uploads.

Returns comprehensive archive statistics for the admin dashboard.

Returns archive statistics for a user.

Returns the number of available slots in the archiver queue. Counts jobs in active states and subtracts from the queue limit.

Returns true if the user can create new archives.

Returns true if the user can view/download existing archives. Any active user can view when archiving isn't disabled.

Cleans up older snapshots of the same (link_id, type) that are superseded by a newly-terminal snapshot.

Creates a new crawl run record.

Creates a new snapshot for a link.

Deletes a single snapshot's storage and database record. Returns :ok on success, {:error, reason} on failure.

Returns a MapSet of domain strings that have had a crawl run created within the cooldown window. Used by the scheduler to skip domains that were recently crawled.

Returns the list of users eligible for archiving based on the current mode.

Enqueues a SnapshotDeleter job for each snapshot in pending_deletion state. Also deletes orphaned crawl runs (terminal state with no remaining snapshots).

Generates a short-lived token for serving a snapshot.

Returns all snapshots for a link (any state), newest first.

Returns a complete snapshot by ID, or {:error, :not_found}.

Returns all complete snapshots for a link, newest first, with crawl_run preloaded.

Returns all crawl runs for a link (excluding pending_deletion), with preloaded snapshots (also excluding pending_deletion), newest first.

Returns all non-deleted snapshots for a link, ordered by format locality then recency.

Returns the latest complete snapshot of a given format for a link.

Returns the latest complete snapshot of a given format and source for a link.

Returns a snapshot by link_id and job_id, or nil.

Gets a snapshot by its ID.

Returns links that have configured sources not covered by a current snapshot with matching version, excluding links with in-flight crawl runs.

Lists unarchived links for a user (links without completed snapshots and without an existing archive).

Marks all snapshots and crawl runs for a link as pending deletion.

Transitions a :processing crawl run to :complete when all its snapshots have reached a terminal state (:complete, :not_available, :failed, or :pending_deletion).

Returns the archiving mode.

Atomically recomputes the total_size_bytes for a single crawl run from its complete snapshots.

Atomically recomputes the total_size_bytes for a crawl run by ID. Uses a single UPDATE ... SET ... = (SELECT ...) statement — no locks needed.

Marks a single snapshot as pending deletion.

Schedules a re-crawl for a link by enqueueing a new Archiver job with the recrawl flag.

Transitions a :pending crawl run to :processing. Idempotent for already-processing crawl runs (safe for Oban retries). Returns {:error, :not_found} if the crawl run doesn't exist or is in an unexpected state.

Returns steps relevant to a single snapshot: orchestration steps (no snapshot_id) plus steps matching the given snapshot_id. Sorted by timestamp.

Returns total storage bytes used across all users (complete snapshots only).

Returns total storage bytes used by a specific user (complete snapshots only).

Updates a crawl run's attributes.

Updates a snapshot's attributes.

Uploads a user-provided snapshot for a link.

Verifies a snapshot serving token, returning the snapshot_id or an error.

Functions

accepted_upload_types()

Returns the list of content types accepted for snapshot uploads.

admin_archive_stats()

Returns comprehensive archive statistics for the admin dashboard.

archive_stats_for_user(user)

Returns archive statistics for a user.

available_archiver_slots()

Returns the number of available slots in the archiver queue. Counts jobs in active states and subtracts from the queue limit.

can_create_archives?(user)

@spec can_create_archives?(Linkhut.Accounts.User.t()) :: boolean()

Returns true if the user can create new archives.

can_view_archives?(arg1)

@spec can_view_archives?(Linkhut.Accounts.User.t()) :: boolean()

Returns true if the user can view/download existing archives. Any active user can view when archiving isn't disabled.

cleanup_superseded_snapshots(snapshot_id, link_id, format, new_state, new_source)

Cleans up older snapshots of the same (link_id, type) that are superseded by a newly-terminal snapshot.

Quality ordering — a new state supersedes older snapshots in these states:

  • :complete → :complete, :not_available, :failed
  • :not_available → :not_available, :failed
  • :failed → :failed

Also marks crawl runs that end up with zero remaining non-deleted snapshots as :pending_deletion.
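The ordering above can be sketched as a pure helper. `SupersessionExample` and its function names are hypothetical, not part of this module's API:

```elixir
defmodule SupersessionExample do
  # For a snapshot that just reached `new_state`, the states of older
  # snapshots (same link_id and type) that it supersedes.
  def superseded_states(:complete), do: [:complete, :not_available, :failed]
  def superseded_states(:not_available), do: [:not_available, :failed]
  def superseded_states(:failed), do: [:failed]

  # An older snapshot is cleaned up when its state appears in that list.
  def supersedes?(new_state, old_state), do: old_state in superseded_states(new_state)
end
```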

create_crawl_run(attrs)

Creates a new crawl run record.

create_snapshot(link_id, user_id, attrs \\ %{})

Creates a new snapshot for a link.

delete_snapshot(snapshot_id)

Deletes a single snapshot's storage and database record. Returns :ok on success, {:error, reason} on failure.

domains_on_cooldown(cooldown_seconds \\ 120)

Returns a MapSet of domain strings that have had a crawl run created within the cooldown window. Used by the scheduler to skip domains that were recently crawled.
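A sketch of how a scheduler might use the returned set. The `MapSet` literal here is stubbed in place of a real `domains_on_cooldown/1` result, and the link maps are illustrative:

```elixir
# Domains crawled within the cooldown window (stubbed).
cooldown = MapSet.new(["example.com"])

candidate_links = [
  %{url: "https://example.com/a"},
  %{url: "https://other.org/b"}
]

# Skip links whose host was crawled recently.
links_to_crawl =
  Enum.reject(candidate_links, fn link ->
    MapSet.member?(cooldown, URI.parse(link.url).host)
  end)
```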

eligible_users()

Returns the list of users eligible for archiving based on the current mode.

  • :disabled → empty list
  • :limited → users with an active supporter subscription
  • :enabled → all active users
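The mode dispatch can be sketched as a pure function; the parameters stand in for the real mode lookup and user queries:

```elixir
defmodule EligibilityExample do
  # Sketch of the mode → users mapping described above; `supporters` and
  # `active_users` are hypothetical stand-ins for the real queries.
  def eligible_users(mode, supporters, active_users) do
    case mode do
      :disabled -> []
      :limited -> supporters
      :enabled -> active_users
    end
  end
end
```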

enqueue_pending_deletions()

Enqueues a SnapshotDeleter job for each snapshot in pending_deletion state. Also deletes orphaned crawl runs (terminal state with no remaining snapshots).

generate_token(snapshot_id)

Generates a short-lived token for serving a snapshot.

get_all_snapshots_by_link(link_id)

Returns all snapshots for a link (any state), newest first.

get_complete_snapshot(id)

Returns a complete snapshot by ID, or {:error, :not_found}.

get_complete_snapshots_by_link(link_id)

Returns all complete snapshots for a link, newest first, with crawl_run preloaded.

get_crawl_runs_by_link(link_id)

Returns all crawl runs for a link (excluding pending_deletion), with preloaded snapshots (also excluding pending_deletion), newest first.

get_current_snapshots_by_link(link_id)

Returns all non-deleted snapshots for a link, ordered by format locality then recency.

get_latest_complete_snapshot(link_id, format)

Returns the latest complete snapshot of a given format for a link.

get_latest_complete_snapshot(link_id, format, source)

Returns the latest complete snapshot of a given format and source for a link.

get_snapshot(link_id, job_id)

Returns a snapshot by link_id and job_id, or nil.

get_snapshot_by_id(id)

Gets a snapshot by its ID.

list_reconcilable_links(user, limit \\ 100)

Returns links that have configured sources not covered by a current snapshot with matching version, excluding links with in-flight crawl runs.

Returns a list of {link, remaining_sources} tuples where remaining_sources is a MapSet of crawler source type strings not yet covered by any snapshot.
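A sketch of consuming those tuples. The list literal stands in for a real `list_reconcilable_links/2` result, and `"singlefile"` is a hypothetical source type string:

```elixir
# Stubbed result: one link with one uncovered crawler source.
reconcilable = [
  {%{id: 1, url: "https://example.com"}, MapSet.new(["singlefile"])}
]

# A worker would enqueue one crawl per uncovered source type.
jobs =
  for {link, remaining} <- reconcilable do
    {link.id, MapSet.to_list(remaining)}
  end
```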

mark_snapshots_for_deletion(link_id)

Marks all snapshots and crawl runs for a link as pending deletion.

maybe_complete_crawl_run(crawl_run_id)

Transitions a :processing crawl run to :complete when all its snapshots have reached a terminal state (:complete, :not_available, :failed, or :pending_deletion).

Uses atomic UPDATE ... WHERE state = :processing to prevent race conditions when concurrent crawlers finish simultaneously.
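The atomic guard can be sketched with `Ecto.Repo.update_all/3` (not the actual implementation; `CrawlRun` and `Repo` are assumed names):

```elixir
import Ecto.Query

# The WHERE clause on state means that when concurrent crawlers finish
# simultaneously, only one caller's UPDATE matches and wins the transition.
from(cr in CrawlRun,
  where: cr.id == ^crawl_run_id and cr.state == :processing
)
|> Repo.update_all(set: [state: :complete])
```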

mode()

Returns the archiving mode.

  • :disabled — no archiving features
  • :enabled — archiving for all active users
  • :limited — archiving only for active paying users

recompute_crawl_run_size(crawl_run)

Atomically recomputes the total_size_bytes for a single crawl run from its complete snapshots.

recompute_crawl_run_size_by_id(crawl_run_id)

Atomically recomputes the total_size_bytes for a crawl run by ID. Uses a single UPDATE ... SET ... = (SELECT ...) statement — no locks needed.
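The statement shape might look as follows; table and column names here are assumptions for illustration, not the actual schema:

```elixir
# Single-statement recompute: the nested SELECT sums complete snapshots,
# so no row locks are needed.
Repo.query!(
  """
  UPDATE crawl_runs
  SET total_size_bytes = (
    SELECT COALESCE(SUM(s.size_bytes), 0)
    FROM snapshots s
    WHERE s.crawl_run_id = crawl_runs.id AND s.state = 'complete'
  )
  WHERE id = $1
  """,
  [crawl_run_id]
)
```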

request_snapshot_deletion(snapshot_id, user_id)

Marks a single snapshot as pending deletion.

Returns {:ok, snapshot} on success, {:error, :active} if the snapshot is in an active state (pending/crawling/retryable), {:error, :not_found} if the snapshot doesn't exist or doesn't belong to the user.
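A minimal sketch of handling the three documented outcomes, e.g. in a controller action (`snapshot_id` and `user` are assumed bindings):

```elixir
case Linkhut.Archiving.request_snapshot_deletion(snapshot_id, user.id) do
  {:ok, _snapshot} -> :deleted
  {:error, :active} -> :still_running
  {:error, :not_found} -> :missing
end
```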

schedule_recrawl(link, opts \\ [])

Schedules a re-crawl for a link by enqueueing a new Archiver job with the recrawl flag.

start_processing(crawl_run_id)

Transitions a :pending crawl run to :processing. Idempotent for already-processing crawl runs (safe for Oban retries). Returns {:error, :not_found} if the crawl run doesn't exist or is in an unexpected state.

steps_for_snapshot(steps, snapshot_id)

Returns steps relevant to a single snapshot: orchestration steps (no snapshot_id) plus steps matching the given snapshot_id. Sorted by timestamp.
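The selection rule can be sketched as a pure filter (`StepsExample` is a hypothetical module, and the step maps assume `snapshot_id` and `timestamp` keys):

```elixir
defmodule StepsExample do
  # Keep orchestration steps (nil snapshot_id) plus steps for the given
  # snapshot, sorted by timestamp.
  def steps_for_snapshot(steps, snapshot_id) do
    steps
    |> Enum.filter(fn s -> is_nil(s.snapshot_id) or s.snapshot_id == snapshot_id end)
    |> Enum.sort_by(& &1.timestamp)
  end
end
```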

storage_used()

Returns total storage bytes used across all users (complete snapshots only).

storage_used_by_user(user_id)

Returns total storage bytes used by a specific user (complete snapshots only).

update_crawl_run(crawl_run, attrs)

Updates a crawl run's attributes.

update_snapshot(snapshot, attrs)

Updates a snapshot's attributes.

upload_snapshot(link_id, user_id, upload)

Uploads a user-provided snapshot for a link.

Validates file size, detects format from content type, stores the file, creates a snapshot record, and runs supersession cleanup.

Returns {:ok, snapshot} on success, {:error, reason} on failure.

verify_token(token)

Verifies a snapshot serving token, returning the snapshot_id or an error.
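A sketch of the intended round trip, assuming the conventional `{:ok, snapshot_id}` / `{:error, reason}` return shapes (`serve_snapshot/1` is a hypothetical caller-side function):

```elixir
token = Linkhut.Archiving.generate_token(snapshot.id)

case Linkhut.Archiving.verify_token(token) do
  {:ok, snapshot_id} -> serve_snapshot(snapshot_id)
  {:error, _reason} -> :forbidden
end
```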