Linkhut.Archiving (linkhut v0.1.4)

Manages link archiving — creating snapshots of bookmarked pages, storing them, and generating time-limited tokens to view them.

Crawling is handled by Linkhut.Archiving.Workers.Archiver and Linkhut.Archiving.Workers.Crawler, which call back into this context to persist results.

Summary

Functions

Returns the list of content types accepted for snapshot uploads.

Returns comprehensive archive statistics for the admin dashboard.

Returns archive statistics for a user.

Returns the number of available slots in the archiver queue. Counts jobs in active states and subtracts from the queue limit.

Returns true if the user can create new archives.

Returns true if the user can view/download existing archives. Any active user can view when archiving isn't disabled.

Cleans up older snapshots of the same (link_id, type) that are superseded by a newly-terminal snapshot.

Creates a new crawl run record.

Creates a new snapshot for a link.

Deletes a single snapshot's storage and database record. Returns :ok on success, {:error, reason} on failure.

Returns a MapSet of domain strings that have had a crawl run created within the cooldown window. Used by the scheduler to skip domains that were recently crawled.

Returns the list of users eligible for archiving based on the current mode.

Enqueues a SnapshotDeleter job for each snapshot in pending_deletion state. Also deletes orphaned crawl runs (terminal state with no remaining snapshots).

Generates a short-lived token for serving a snapshot.

Returns all snapshots for a link (any state), newest first.

Returns a complete snapshot by ID, or {:error, :not_found}.

Returns all complete snapshots for a link, newest first, with crawl_run preloaded.

Returns all crawl runs for a link (excluding pending_deletion), with preloaded snapshots (also excluding pending_deletion), newest first.

Returns all non-deleted snapshots for a link, ordered by format locality then recency.

Returns the latest complete snapshot of a given format for a link.

Returns the latest complete snapshot of a given format and source for a link.

Returns a snapshot by link_id and job_id, or nil.

Gets a snapshot by its ID.

Returns links that have configured sources not covered by a current snapshot with matching version, excluding links with in-flight crawl runs.

Lists unarchived links for a user (links without completed snapshots and without an existing archive).

Marks all snapshots and crawl runs for a link as pending deletion.

Transitions a :processing crawl run to :complete when all its snapshots have reached a terminal state (:complete, :not_available, :failed, or :pending_deletion).

Returns the archiving mode.

Atomically recomputes the total_size_bytes for a single crawl run from its complete snapshots.

Atomically recomputes the total_size_bytes for a crawl run by ID. Uses a single UPDATE ... SET ... = (SELECT ...) statement — no locks needed.

Marks a single snapshot as pending deletion.

Schedules a re-crawl for a link by enqueueing a new Archiver job with the recrawl flag.

Transitions a :pending crawl run to :processing. Idempotent for already-processing crawl runs (safe for Oban retries). Returns {:error, :not_found} if the crawl run doesn't exist or is in an unexpected state.

Returns steps relevant to a single snapshot: orchestration steps (no snapshot_id) plus steps matching the given snapshot_id. Sorted by timestamp.

Returns total storage bytes used across all users (complete snapshots only).

Returns total storage bytes used by a specific user (complete snapshots only).

Updates a crawl run's attributes.

Updates a snapshot's attributes.

Uploads a user-provided snapshot for a link.

Verifies a snapshot serving token, returning the snapshot_id or an error.

Functions

accepted_upload_types()

Returns the list of content types accepted for snapshot uploads.

admin_archive_stats()

Returns comprehensive archive statistics for the admin dashboard.

archive_stats_for_user(user)

Returns archive statistics for a user.

available_archiver_slots()

Returns the number of available slots in the archiver queue. Counts jobs in active states and subtracts from the queue limit.

can_create_archives?(user)

@spec can_create_archives?(Linkhut.Accounts.User.t()) :: boolean()

Returns true if the user can create new archives.

can_view_archives?(arg1)

@spec can_view_archives?(Linkhut.Accounts.User.t()) :: boolean()

Returns true if the user can view/download existing archives. Any active user can view when archiving isn't disabled.

cleanup_superseded_snapshots(snapshot_id, link_id, format, new_state, new_source)

Cleans up older snapshots of the same (link_id, type) that are superseded by a newly-terminal snapshot.

Quality ordering — a new state supersedes older snapshots in these states:

  • :complete → :complete, :not_available, :failed
  • :not_available → :not_available, :failed
  • :failed → :failed

Also marks crawl runs that end up with zero remaining non-deleted snapshots as :pending_deletion.
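The ordering above can be sketched as a pure helper. `SupersessionExample` and its function names are hypothetical, not part of this module's API:

```elixir
defmodule SupersessionExample do
  # For a snapshot that just reached `new_state`, the states of older
  # snapshots (same link_id and type) that it supersedes.
  def superseded_states(:complete), do: [:complete, :not_available, :failed]
  def superseded_states(:not_available), do: [:not_available, :failed]
  def superseded_states(:failed), do: [:failed]

  # An older snapshot is cleaned up when its state appears in that list.
  def supersedes?(new_state, old_state), do: old_state in superseded_states(new_state)
end
```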

create_crawl_run(attrs)

Creates a new crawl run record.

create_snapshot(link_id, user_id, attrs \\ %{})

Creates a new snapshot for a link.

delete_snapshot(snapshot_id)

Deletes a single snapshot's storage and database record. Returns :ok on success, {:error, reason} on failure.

domains_on_cooldown(cooldown_seconds \\ 120)

Returns a MapSet of domain strings that have had a crawl run created within the cooldown window. Used by the scheduler to skip domains that were recently crawled.
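A sketch of how a scheduler might use the returned set. The `MapSet` literal here is stubbed in place of a real `domains_on_cooldown/1` result, and the link maps are illustrative:

```elixir
# Domains crawled within the cooldown window (stubbed).
cooldown = MapSet.new(["example.com"])

candidate_links = [
  %{url: "https://example.com/a"},
  %{url: "https://other.org/b"}
]

# Skip links whose host was crawled recently.
links_to_crawl =
  Enum.reject(candidate_links, fn link ->
    MapSet.member?(cooldown, URI.parse(link.url).host)
  end)
```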

eligible_users()

Returns the list of users eligible for archiving based on the current mode.

  • :disabled → empty list
  • :limited → users with an active supporter subscription
  • :enabled → all active users
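The mode dispatch can be sketched as a pure function; the parameters stand in for the real mode lookup and user queries:

```elixir
defmodule EligibilityExample do
  # Sketch of the mode → users mapping described above; `supporters` and
  # `active_users` are hypothetical stand-ins for the real queries.
  def eligible_users(mode, supporters, active_users) do
    case mode do
      :disabled -> []
      :limited -> supporters
      :enabled -> active_users
    end
  end
end
```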

enqueue_pending_deletions()

Enqueues a SnapshotDeleter job for each snapshot in pending_deletion state. Also deletes orphaned crawl runs (terminal state with no remaining snapshots).

generate_token(snapshot_id)

Generates a short-lived token for serving a snapshot.

get_all_snapshots_by_link(link_id)

Returns all snapshots for a link (any state), newest first.

get_complete_snapshot(id)

Returns a complete snapshot by ID, or {:error, :not_found}.

get_complete_snapshots_by_link(link_id)

Returns all complete snapshots for a link, newest first, with crawl_run preloaded.

get_crawl_runs_by_link(link_id)

Returns all crawl runs for a link (excluding pending_deletion), with preloaded snapshots (also excluding pending_deletion), newest first.

get_current_snapshots_by_link(link_id)

Returns all non-deleted snapshots for a link, ordered by format locality then recency.

get_latest_complete_snapshot(link_id, format)

Returns the latest complete snapshot of a given format for a link.

get_latest_complete_snapshot(link_id, format, source)

Returns the latest complete snapshot of a given format and source for a link.

get_snapshot(link_id, job_id)

Returns a snapshot by link_id and job_id, or nil.

get_snapshot_by_id(id)

Gets a snapshot by its ID.

list_reconcilable_links(user, limit \\ 100)

Returns links that have configured sources not covered by a current snapshot with matching version, excluding links with in-flight crawl runs.

Returns a list of {link, remaining_sources} tuples where remaining_sources is a MapSet of crawler source type strings not yet covered by any snapshot.
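A sketch of consuming those tuples. The list literal stands in for a real `list_reconcilable_links/2` result, and `"singlefile"` is a hypothetical source type string:

```elixir
# Stubbed result: one link with one uncovered crawler source.
reconcilable = [
  {%{id: 1, url: "https://example.com"}, MapSet.new(["singlefile"])}
]

# A worker would enqueue one crawl per uncovered source type.
jobs =
  for {link, remaining} <- reconcilable do
    {link.id, MapSet.to_list(remaining)}
  end
```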

mark_snapshots_for_deletion(link_id)

Marks all snapshots and crawl runs for a link as pending deletion.

maybe_complete_crawl_run(crawl_run_id)

Transitions a :processing crawl run to :complete when all its snapshots have reached a terminal state (:complete, :not_available, :failed, or :pending_deletion).

Uses atomic UPDATE ... WHERE state = :processing to prevent race conditions when concurrent crawlers finish simultaneously.
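The atomic guard can be sketched with `Ecto.Repo.update_all/3` (not the actual implementation; `CrawlRun` and `Repo` are assumed names):

```elixir
import Ecto.Query

# The WHERE clause on state means that when concurrent crawlers finish
# simultaneously, only one caller's UPDATE matches and wins the transition.
from(cr in CrawlRun,
  where: cr.id == ^crawl_run_id and cr.state == :processing
)
|> Repo.update_all(set: [state: :complete])
```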

mode()

Returns the archiving mode.

  • :disabled — no archiving features
  • :enabled — archiving for all active users
  • :limited — archiving only for active paying users

recompute_crawl_run_size(crawl_run)

Atomically recomputes the total_size_bytes for a single crawl run from its complete snapshots.

recompute_crawl_run_size_by_id(crawl_run_id)

Atomically recomputes the total_size_bytes for a crawl run by ID. Uses a single UPDATE ... SET ... = (SELECT ...) statement — no locks needed.
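The statement shape might look as follows; table and column names here are assumptions for illustration, not the actual schema:

```elixir
# Single-statement recompute: the nested SELECT sums complete snapshots,
# so no row locks are needed.
Repo.query!(
  """
  UPDATE crawl_runs
  SET total_size_bytes = (
    SELECT COALESCE(SUM(s.size_bytes), 0)
    FROM snapshots s
    WHERE s.crawl_run_id = crawl_runs.id AND s.state = 'complete'
  )
  WHERE id = $1
  """,
  [crawl_run_id]
)
```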

request_snapshot_deletion(snapshot_id, user_id)

Marks a single snapshot as pending deletion.

Returns {:ok, snapshot} on success, {:error, :active} if the snapshot is in an active state (pending/crawling/retryable), {:error, :not_found} if the snapshot doesn't exist or doesn't belong to the user.
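A minimal sketch of handling the three documented outcomes, e.g. in a controller action (`snapshot_id` and `user` are assumed bindings):

```elixir
case Linkhut.Archiving.request_snapshot_deletion(snapshot_id, user.id) do
  {:ok, _snapshot} -> :deleted
  {:error, :active} -> :still_running
  {:error, :not_found} -> :missing
end
```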

schedule_recrawl(link, opts \\ [])

Schedules a re-crawl for a link by enqueueing a new Archiver job with the recrawl flag.

start_processing(crawl_run_id)

Transitions a :pending crawl run to :processing. Idempotent for already-processing crawl runs (safe for Oban retries). Returns {:error, :not_found} if the crawl run doesn't exist or is in an unexpected state.

steps_for_snapshot(steps, snapshot_id)

Returns steps relevant to a single snapshot: orchestration steps (no snapshot_id) plus steps matching the given snapshot_id. Sorted by timestamp.
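The selection rule can be sketched as a pure filter (`StepsExample` is a hypothetical module, and the step maps assume `snapshot_id` and `timestamp` keys):

```elixir
defmodule StepsExample do
  # Keep orchestration steps (nil snapshot_id) plus steps for the given
  # snapshot, sorted by timestamp.
  def steps_for_snapshot(steps, snapshot_id) do
    steps
    |> Enum.filter(fn s -> is_nil(s.snapshot_id) or s.snapshot_id == snapshot_id end)
    |> Enum.sort_by(& &1.timestamp)
  end
end
```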

storage_used()

Returns total storage bytes used across all users (complete snapshots only).

storage_used_by_user(user_id)

Returns total storage bytes used by a specific user (complete snapshots only).

update_crawl_run(crawl_run, attrs)

Updates a crawl run's attributes.

update_snapshot(snapshot, attrs)

Updates a snapshot's attributes.

upload_snapshot(link_id, user_id, upload)

Uploads a user-provided snapshot for a link.

Validates file size, detects format from content type, stores the file, creates a snapshot record, and runs supersession cleanup.

Returns {:ok, snapshot} on success, {:error, reason} on failure.

verify_token(token)

Verifies a snapshot serving token, returning the snapshot_id or an error.
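A sketch of the intended round trip, assuming the conventional `{:ok, snapshot_id}` / `{:error, reason}` return shapes (`serve_snapshot/1` is a hypothetical caller-side function):

```elixir
token = Linkhut.Archiving.generate_token(snapshot.id)

case Linkhut.Archiving.verify_token(token) do
  {:ok, snapshot_id} -> serve_snapshot(snapshot_id)
  {:error, _reason} -> :forbidden
end
```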