
Semantic Movie Discovery System

5/17/2026
Movies
Semantic Search
PostgreSQL
Plex
Machine Learning
Product
Technical

A semantic curation engine for film discovery, built on a private, million-title movie knowledge base.

**Status:** The core engine and homelab deployment are operational. Consumer-grade “describe what you want” UX and LLM auto-tuning are the primary productization layers still to build.

Executive summary

Most movie systems answer: “What are other people likely to watch?”

This system answers: “What kind of movie experience is the user trying to find?”

It is a large-scale movie discovery and curation engine that helps people find films through meaning, mood, cultural context, quality signals, and personal intent — not only title, genre, or keyword matching.

At a simple level, it answers questions like:

  • “Find me weird dystopian city movies that feel depressing and massive.”
  • “Show me so-bad-it’s-good horror movies from the 80s.”
  • “Create a Plex collection of obscure sci-fi movies with cult appeal.”
  • “Find movies like The Hitchhiker’s Guide to the Galaxy, but more absurd and less action-oriented.”

The core insight: movie discovery is not just a database search problem. People often search with vague, emotional, cultural, or subjective language. Traditional systems struggle because they lean on genres, popularity charts, and collaborative filtering. This platform combines enriched metadata, semantic embeddings, lexical search, rating intelligence, popularity/obscurity controls, and (on the roadmap) an LLM-driven tuning layer that translates human intent into retrieval strategy.

The result is a system that can behave less like a search box and more like a knowledgeable film curator — especially once automatic tuning sits in front of the engine.

Strategic positioning: a semantic curation platform over a constructed movie knowledge base — not merely a search site, and not a traditional recommender.

The problem

Users search with intent, not metadata

People rarely arrive with clean filters. They arrive with feelings, references, eras, tones, and cultural categories:

| User language | What they actually want |
| --- | --- |
| “Depressing movies in giant dystopian cities” | Atmosphere, scale, tone, setting; not the word “city” in the title |
| “Best comedies of the 80s” | Hard decade constraint + genre + high ratings + cultural prominence |
| “So bad it’s good” | Low credible ratings, high awareness, unintentional comedy, cult notoriety |
| “Hidden gems in my Plex library” | Personal inventory overlay + quality + obscurity |

A single fixed ranking algorithm fails because the correct balance of signals changes with intent.

Incumbent tools optimize the wrong objective

| System type | Typical objective | Weakness for subjective discovery |
| --- | --- | --- |
| Streaming recommenders (Netflix, etc.) | Maximize engagement on catalog they license | No long-tail obscure titles; no “vibe” queries; no personal library semantics |
| Database search (TMDB, IMDb browse) | Exact metadata match | Poor at mood, metaphor, cultural framing |
| Letterboxd / community lists | Human curation at scale | Requires cinephile literacy; lists go stale; not generative |
| Plex/Jellyfin built-in search | Title/metadata in your files | Weak semantic discovery; collection building is manual |

Gap: there is no widely available product that combines million-title coverage, semantic retrieval, intent-aware ranking, and personal library overlay in one self-hosted or API-first platform.

What the system is

Movie Index (internal project name) is a locally hosted movie intelligence platform:

  • It builds and owns a private movie catalog (~1.2M+ TMDB identities; bulk snapshots can exceed 1.4M rows when importing large CSV archives).
  • It enriches records from TMDB (and optional bulk sources), indexes them for hybrid search, and serves ranked results via HTTP API and web UI.
  • It optionally overlays a user’s Plex library so discovery can mean “from the whole world” or “from what I already own.”

One-line definition:
A semantic curation engine that turns fuzzy human intent into tunable retrieval over a very large, normalized movie knowledge base.

Figure: Dashboard. Million-scale catalog coverage: total IDs, bulk metadata, API enrichment queue, search documents, and embeddings.

What makes it different

Different question, different product

| Traditional recommender | This system |
| --- | --- |
| “What will people like me watch next?” | “What experience is the user trying to find?” |
| Optimizes engagement on a licensed catalog | Optimizes interpreted intent on a comprehensive catalog |
| Opaque matrix factorization / trending | Explainable ranking signals (semantic match, lexical match, rating, notoriety, obscurity) |
| Weak on “so bad it’s good,” cult context, atmosphere | Designed for multi-objective discovery (quality, trash, fame, hidden) |

Tunable retrieval, not one algorithm

The engine exposes adjustable fusion weights and cutoff strategies so the same query infrastructure can serve opposite goals:

| User goal | Retrieval emphasis (conceptual) |
| --- | --- |
| Famous and good | High ratings, high vote count (notoriety), tighter similarity |
| Obscure and good | High ratings, low vote count (obscurity), semantic breadth |
| Famous and terrible | Low Bayesian-adjusted quality, high notoriety |
| Cult / “so bad it’s good” | Trash-quality signal + notoriety + cultural/contextual semantics (roadmap: dedicated embedding space) |
| Thematic / mood | Semantic similarity, relaxed lexical title match, elbow-based membership |

Key differentiator (product): an automatic curation layer (planned) that sets these knobs from natural language so casual users never see them.

Business opportunities

Each opportunity below reuses the same core: catalog + embeddings + hybrid search + ranking + (optional) library overlay.

Consumer movie discovery app

Product: Public or freemium web app — “describe the movie you want.”

Example queries:

  • “Movies that feel like lonely neon cities at night.”
  • “Absurd British sci-fi comedies.”
  • “Forgotten 90s thrillers that are actually good.”
  • “Bad movies that are fun, not just bad.”

Positioning: Competes with discovery and exploration tools (Letterboxd-adjacent browsing, film Twitter/list culture, niche cinephile search), not with Netflix-style “what to stream tonight on our platform.”

Differentiator: Users do not need to know filters, genres, or metadata vocabulary. They describe vibe, era, tone, or cultural category; the system translates that into retrieval strategy.

Monetization paths: subscription, affiliate links (where legally appropriate), premium collections, API tier for power users.

Plex / Jellyfin collection generator

Product: Connect to a user’s media server; generate curated collections from natural language.

Example queries:

  • “Build a cult sci-fi collection from my library.”
  • “So-bad-it’s-good movies I already own.”
  • “1980s creature-feature playlist.”
  • “Hidden gems I forgot I had.”

Why this niche is strong:

  • Users already maintain large personal libraries and care about organization.
  • They are underserved by semantic discovery inside Plex/Jellyfin.
  • No streaming of copyrighted content required — only metadata analysis and collection instructions returned to the local server.

Technical fit today: Plex sync marks in_library on catalog rows; search and collections APIs accept in_library: true filters. Jellyfin would be a parallel integration.

Monetization paths: one-time license, subscription plugin, homelab “pro” tier.

Movie metadata and semantic search API

Asset: A constructed, normalized, embedded catalog — not raw TMDB dumps.

Potential API customers:

  • Indie app developers
  • Recommendation startups
  • Plex/Jellyfin plugin authors
  • Film researchers and educators
  • AI application builders
  • Media catalog / metadata companies
  • Hobbyists building local movie tools

API capabilities (existing or near-existing):

| Capability | Description |
| --- | --- |
| Normalized metadata | Title, year, overview, genres, keywords, cast, crew, ratings, posters |
| Hybrid search | Semantic + lexical + metadata filters in one request |
| Similarity / “more like this” | Same embedding space as search |
| Saved collections | Store query + filters; re-resolve on demand |
| Library overlay | Restrict to in_library for personal-server use cases |
| Scoring transparency | Per-hit scores: semantic similarity, RRF, Bayesian rating, etc. |

Positioning: A “semantic layer” on top of movie metadata — the hard data engineering and embedding work already done.

Monetization paths: usage-based API, tiered keys, enterprise license, white-label.

Precomputed embedding dataset

Problem: Generating embeddings for 1M+ movies is expensive, slow, and operationally painful (GPU batching, model versioning, index rebuilds).

Product: Licensed dataset bundles:

  • Movie identity (TMDB ID, title, year, etc.)
  • Plot/metadata embedding vectors
  • (Roadmap) Historical/cultural context embeddings
  • Similarity index metadata / version documentation
  • Incremental update packages when the model or enrichment changes

Buyers: AI developers who want movie search or recommendations without building the pipeline.

Licensing must respect TMDB attribution terms and model licenses; vectors are derived works built on permitted metadata.

Automated editorial and content marketing

Because the engine can surface clusters, outliers, and thematic slices, it can power:

  • Listicles (“Weirdest low-budget 90s sci-fi”)
  • Newsletter segments
  • SEO landing pages
  • Social content calendars

Examples:

  • “Movies that accidentally became cult classics”
  • “Dystopian city films before and after Blade Runner”
  • “The best bad shark movies you’ve never heard of”

Monetization paths: ad-supported media property, B2B content tooling for publishers, lead gen for a consumer app.

Technical platform

This section covers architecture and implementation for readers who need credible technical depth — and for technical partners evaluating feasibility.

Architecture at a glance

flowchart TB
  subgraph external [External — ingestion only]
    TMDB_EXP[TMDB Daily ID Exports]
    TMDB_API[TMDB API v3]
    KAGGLE[Kaggle TMDB CSV snapshot — optional bulk]
    PLEX[Plex server — optional]
  end

  subgraph server [Single-server deployment — Docker Compose]
    API[FastAPI + Uvicorn — port 8080]
    WORKER[Enrichment worker — continuous]
    CLI[movie-index CLI / cron]

    subgraph data [PostgreSQL 16 + extensions]
      META[Constructed metadata — ~1M+ rows]
      FTS[Full-text search — tsvector + GIN]
      TRGM[Trigram title match — pg_trgm]
      VEC[Vector index — pgvector HNSW]
      QUEUE[Enrichment queue + phase tracking]
      COLL[Saved collections]
    end

    MODELS[Local model cache — Hugging Face weights]
    ARTIFACTS[Poster cache — local disk]
  end

  TMDB_EXP --> CLI
  TMDB_API --> WORKER
  KAGGLE --> CLI
  PLEX --> API
  CLI --> META
  WORKER --> META
  META --> FTS
  META --> VEC
  API --> FTS
  API --> VEC
  API --> TRGM
  MODELS --> VEC

Design principle: At query time, the system reads only PostgreSQL and local files — not TMDB, not cloud embedding APIs. External services are used during ingestion and sync only. That yields predictable latency, offline-capable search (once built), and no per-query SaaS inference bill.

Software stack

| Layer | Technology | Role |
| --- | --- | --- |
| Database | PostgreSQL 16 | System of record for all metadata, queues, collections |
| Vector search | pgvector (HNSW, cosine distance) | Semantic nearest-neighbor on filtered candidate sets |
| Lexical search | PostgreSQL FTS (tsvector, websearch_to_tsquery) | Overviews, keywords, assembled search documents |
| Fuzzy titles | pg_trgm | Typo-tolerant and partial title matching |
| API | FastAPI + Uvicorn | REST: search, movies, collections, Plex sync, dashboard |
| Runtime | Python 3.12+ | Ingestion, search fusion, embedding jobs, CLI |
| Embeddings | sentence-transformers | Local bi-encoder; batch on CPU, Apple MPS, or NVIDIA CUDA |
| Packaging | Docker Compose | postgres, api, worker services |
| CLI | Click (movie-index command) | Operations, imports, enrichment, embedding, stats |

Deliberate non-choices (v1): No Elasticsearch, Pinecone, or managed vector DB — one database reduces operational cost for homelab and early commercial pilots.

Data sources and catalog construction

No single upstream provider ships a complete, query-ready movie database. The platform merges sources into one constructed catalog:

| Source | What it provides | How it is used |
| --- | --- | --- |
| TMDB Daily ID Exports | Near-complete daily list of valid TMDB movie IDs + export-level popularity | Catalog spine: hundreds of thousands to ~1M+ IDs without months of API paging |
| TMDB API v3 | Full detail: overview, genres, keywords, credits, images, external IDs | Enrichment for searchable depth; rate-limited queue (~4 req/s default, configurable) |
| Kaggle TMDB snapshot (optional) | Bulk CSV (~930k–1.4M rows) with core metadata | Fast bulk bootstrap; API queue backfills gaps |
| Plex (optional) | User’s owned titles | in_library overlay only; not a metadata source of truth |
| IMDb datasets (optional, roadmap) | Supplemental ratings / crosswalk | Offline import; non-commercial license constraints |

Scale targets:

  • ~1.2 million movie identities from TMDB export workflow (project design target).
  • ~1.42 million rows available in bundled Kaggle CSV snapshot (useful for bulk import and gap-fill).
  • Enrichment at 4 TMDB requests/sec ≈ 3–4 days of continuous worker time for a full API detail pass over ~1.2M IDs (order-of-magnitude planning number).
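
The 3–4 day planning number can be reproduced with simple arithmetic. The sketch below assumes one API request per movie; that assumption is not stated in the figures above, and the function name is illustrative only.

```python
SECONDS_PER_DAY = 86_400


def enrichment_days(total_ids: int, requests_per_sec: float,
                    requests_per_movie: int = 1) -> float:
    """Estimate continuous worker time, in days, for one full detail pass."""
    seconds = total_ids * requests_per_movie / requests_per_sec
    return seconds / SECONDS_PER_DAY


# 1.2M IDs at 4 req/s: 300,000 seconds of continuous fetching.
print(f"{enrichment_days(1_200_000, 4):.2f} days")  # 3.47 days
```

If a detail pass needs more than one request per title, the estimate scales linearly with `requests_per_movie`.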

Enrichment pipeline (automated):

  1. Import IDs (export and/or Kaggle).
  2. Queue movies for TMDB API detail fetch (priority: in-library first, then popularity).
  3. After each batch: build search documents and embedding vectors (configurable via ENRICHMENT_AUTO_INDEX).
  4. Track per-movie enrichment phases: catalog spine → core metadata → poster media.

Figure: Enrichment worker. Queue depth, ETA, and the three stages: TMDB metadata fetch, search-document build, and vector embedding.
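
The queue priority in step 2 (in-library first, then popularity) can be expressed as a sort key. This is a minimal sketch: the `QueueItem` fields, the tie-break on ID, and the sample IDs are assumptions for illustration, not the project's actual queue schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueItem:
    tmdb_id: int
    in_library: bool
    popularity: float


def enrichment_order(items):
    """In-library titles first, then descending popularity, then ID as tie-break."""
    return sorted(items, key=lambda m: (not m.in_library, -m.popularity, m.tmdb_id))


queue = [
    QueueItem(101, False, 80.0),
    QueueItem(202, True, 12.5),
    QueueItem(303, False, 95.0),
    QueueItem(404, True, 0.6),
]
print([m.tmdb_id for m in enrichment_order(queue)])  # [202, 404, 303, 101]
```

An obscure title the user owns is fetched before a popular title they do not own, which matches the library-first priority described above.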

Stored metadata per movie (representative):

  • Identity: TMDB ID, IMDb ID, titles, release date/year
  • Text: overview, tagline, search document (derived)
  • Facets: genres, keywords (JSON arrays)
  • People: cast (top billed), crew (directors/writers)
  • Franchise: collection id/name
  • Signals: popularity, vote_average, vote_count, adult flag
  • Media: poster paths, locally cached poster files
  • Library: in_library boolean
  • Provenance: detail_fetched_at, search_doc_built_at, enrichment phase status

Search document and embedding model

Search document (today): A single composed text block per movie used for both FTS indexing and embedding input:

  • Title, original title, year, tagline, overview
  • Genres, keywords (capped), top cast, directors
  • Collection name
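
A rough sketch of how such a document might be assembled from a metadata record. The dictionary keys, the caps, the newline-joined layout, and the de-duplication step are all assumptions; the sample record is illustrative.

```python
def build_search_document(movie: dict, max_keywords: int = 20, max_cast: int = 5) -> str:
    """Compose one text block per movie for both FTS indexing and embedding."""
    year = movie.get("year")
    parts = [
        movie.get("title"),
        movie.get("original_title"),
        str(year) if year else None,
        movie.get("tagline"),
        movie.get("overview"),
        ", ".join(movie.get("genres") or []),
        ", ".join((movie.get("keywords") or [])[:max_keywords]),  # capped
        ", ".join((movie.get("cast") or [])[:max_cast]),          # top billed
        ", ".join(movie.get("directors") or []),
        movie.get("collection_name"),
    ]
    lines = []
    for part in parts:
        # Skip empty fields and exact repeats (e.g. original title == title).
        if part and part not in lines:
            lines.append(part)
    return "\n".join(lines)


movie = {
    "title": "Blade Runner",
    "original_title": "Blade Runner",
    "year": 1982,
    "overview": "A detective hunts rogue androids in a decaying future city.",
    "genres": ["Science Fiction", "Thriller"],
    "keywords": ["dystopia", "android", "neo-noir"],
    "cast": ["Harrison Ford"],
    "directors": ["Ridley Scott"],
}
doc = build_search_document(movie, max_keywords=2)
print(doc)
```

Using one document for both indexes keeps the lexical and semantic branches consistent: a term that matches in FTS is also part of what the encoder saw.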

Embedding model (current default):

| Setting | Value |
| --- | --- |
| Model | Qwen/Qwen3-Embedding-0.6B |
| Dimensions | 1024 |
| Storage | movie_embeddings table with embedding_model + embedding_version for traceability |
| Index | HNSW on cosine distance (<=> operator) |
| Alternatives (supported) | BAAI/bge-small-en-v1.5 (384d), bge-base, bge-large, bge-m3, nomic-ai/nomic-embed-text-v1.5 |

Models run entirely locally — weights cached under data/models/. No OpenAI/Cohere-style per-query embedding fees.

Hardware: Auto-detects CUDA → Apple MPS → CPU. Docker worker/API containers typically use CPU; native runs on Apple Silicon can use MPS for faster embedding batches.

Roadmap — second embedding space (“cultural / context”):
Planned separate representation for production history, reception, cult status, trivia, and “why this movie matters” — critical for queries like “so bad it’s good” where plot text is insufficient. Not yet implemented; today all semantic search uses the plot/metadata document embedding.

Hybrid retrieval pipeline

For each query, the engine:

  1. Embeds the query with the same model (query-specific encoding per model family — e.g. Qwen uses a query prompt name).
  2. Runs lexical branch:
    • PostgreSQL FTS rank on search_tsv (websearch_to_tsquery, English).
    • Trigram similarity on title (threshold > 0.2).
  3. Runs semantic branch:
    • pgvector top-K by cosine distance on movie_embeddings, respecting SQL filters.
  4. Fuses lexical + semantic candidate lists with Reciprocal Rank Fusion (RRF) — default k = 60.
  5. Applies weighted score fusion across multiple signals (see below).
  6. Optionally applies membership cutoff (how many results “belong” in the set).
sequenceDiagram
  participant User
  participant API
  participant Encoder
  participant PG as PostgreSQL

  User->>API: Natural language query + filters
  API->>Encoder: Embed query locally
  Encoder-->>API: Query vector (1024-d)
  par Lexical
    API->>PG: FTS + trigram title search
  and Semantic
    API->>PG: pgvector HNSW nearest neighbors
  end
  PG-->>API: Candidate pools (e.g. top 500 each)
  API->>API: RRF merge + weighted scoring + cutoff
  API-->>User: Ranked hits + explainable scores
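
The RRF merge step can be illustrated in a few lines. This is a sketch of the standard formula (each list contributes 1 / (k + rank) per document) with the default k = 60 mentioned above; the function name and the tie-break on ID are assumptions, not the project's code.

```python
def rrf_fuse(ranked_lists, k: int = 60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list a doc appears in."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


lexical = [11, 22, 33]    # IDs ranked by FTS / trigram
semantic = [22, 44, 11]   # IDs ranked by vector similarity
fused = rrf_fuse([lexical, semantic])
print([doc_id for doc_id, _ in fused])  # [22, 11, 44, 33]
```

RRF uses ranks rather than raw scores, so the lexical and semantic branches do not need to be calibrated against each other before merging.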

Ranking dimensions

The API accepts a ScoringConfig with fusion weights (0–10 scale) and rating/cutoff settings:

| Signal | Meaning | Example use |
| --- | --- | --- |
| RRF | Reciprocal rank fusion across lexical + semantic lists | Default hybrid balance |
| Semantic | Vector similarity (1 − cosine distance) | Mood, theme, atmosphere queries |
| Search rank | Combined channel rank score | Fine-tuning lexical vs semantic emphasis |
| Rating high | Bayesian-shrunk quality (normalized) | “Actually good” lists |
| Rating low (“trash”) | Inverse of quality signal | “So bad it’s good” |
| Notoriety | log1p(vote_count): famous / widely rated | Mainstream, infamous, cult-famous |
| Obscurity | Inverse of notoriety | Hidden gems, deep cuts |
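
One way these signals could combine is a weighted average over normalized values. The sketch below makes several assumptions that the table does not specify: notoriety is scaled against the largest vote count in the candidate pool, "inverse" means 1 − x, and the signal key names (`rating_high`, `obscurity`, and so on) are illustrative.

```python
import math


def notoriety(vote_count: int, pool_max_votes: int) -> float:
    """log1p(vote_count), scaled to [0, 1] against the pool maximum (assumed)."""
    if pool_max_votes <= 0:
        return 0.0
    return min(1.0, math.log1p(vote_count) / math.log1p(pool_max_votes))


def signals_for(semantic, rating_high, vote_count, pool_max_votes):
    fame = notoriety(vote_count, pool_max_votes)
    return {
        "semantic": semantic,
        "rating_high": rating_high,
        "rating_low": 1.0 - rating_high,  # assumed form of "inverse of quality"
        "notoriety": fame,
        "obscurity": 1.0 - fame,          # assumed form of "inverse of notoriety"
    }


def fused_score(signals: dict, weights: dict) -> float:
    """Weighted average of the signals that carry a positive weight."""
    active = {n: w for n, w in weights.items() if w > 0 and n in signals}
    total = sum(active.values())
    if total == 0:
        return 0.0
    return sum(w * signals[n] for n, w in active.items()) / total


hidden_gem_weights = {"semantic": 2.0, "rating_high": 2.0, "obscurity": 1.0}
deep_cut = signals_for(0.80, 0.70, vote_count=150, pool_max_votes=25_000)
blockbuster = signals_for(0.80, 0.70, vote_count=25_000, pool_max_votes=25_000)
# Same match and quality, but the obscurity weight favors the under-voted title.
print(fused_score(deep_cut, hidden_gem_weights) > fused_score(blockbuster, hidden_gem_weights))  # True
```

Swapping the weight on `obscurity` for one on `notoriety` reverses the ordering without touching the retrieval code, which is the point of exposing the weights.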

Bayesian rating:
Raw TMDB vote_average is misleading for low vote_count. The system shrinks ratings toward a catalog mean (configurable prior, default 50 pseudo-votes) so that:

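  • “1.5 stars from 3 people” is pulled most of the way back to the catalog mean, so on the low-rating signal it does not dominate “3.8 stars from 80,000 people,” which barely moves.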

Reversing the quality signal (boost rating_low) surfaces titles that are credibly poorly rated at scale — a key ingredient for “so bad it’s good.”
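
The shrinkage can be written as a weighted mean of the title's own average and the catalog mean. This is a sketch of the standard formula with the default prior of 50 pseudo-votes; the catalog mean of 6.0 is an assumed value for illustration, not a measured one.

```python
def bayesian_rating(vote_average: float, vote_count: int,
                    catalog_mean: float, prior_votes: float = 50.0) -> float:
    """Shrink a raw average toward the catalog mean by `prior_votes` pseudo-votes."""
    return (vote_count * vote_average + prior_votes * catalog_mean) / (vote_count + prior_votes)


CATALOG_MEAN = 6.0  # assumed for illustration

tiny_sample = bayesian_rating(1.5, 3, CATALOG_MEAN)
large_sample = bayesian_rating(3.8, 80_000, CATALOG_MEAN)
print(round(tiny_sample, 2))   # 5.75: three votes barely move the prior
print(round(large_sample, 2))  # 3.8: 80,000 votes overwhelm the prior
```

After shrinkage the widely voted title is the lower-rated of the two, so a boosted rating_low signal ranks it ahead of the three-vote outlier.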

Membership cutoff modes:

| Mode | Behavior |
| --- | --- |
| none | All ranked results count |
| top_n | Fixed cap |
| threshold | Minimum semantic (or other) score |
| elbow | Largest gap in score curve; auto-sized collections |

Elbow mode lets the result set size follow the query: “Chupacabra movies” forms a tight cluster, while “dreamlike loneliness” has a broad semantic spread, so neither is forced to return exactly 100 or 200 titles.
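
A minimal sketch of a largest-gap elbow, assuming scores arrive sorted in descending order. The function name, the `min_keep` parameter, and the choice to keep everything when the curve is flat are assumptions; the real cutoff may smooth or bound the curve differently.

```python
def elbow_cutoff(scores, min_keep: int = 1) -> int:
    """Return how many results to keep: cut just before the largest score drop."""
    if len(scores) <= min_keep:
        return len(scores)
    best_gap, cut = 0.0, len(scores)  # a flat curve keeps everything
    for i in range(max(min_keep, 1), len(scores)):
        gap = scores[i - 1] - scores[i]
        if gap > best_gap:
            best_gap, cut = gap, i
    return cut


tight = [0.91, 0.90, 0.88, 0.61, 0.60, 0.58]        # small, well-separated cluster
broad = [0.80, 0.78, 0.77, 0.75, 0.74, 0.72, 0.55]  # long, gently sloping head
print(elbow_cutoff(tight))  # 3
print(elbow_cutoff(broad))  # 6
```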

Figure: Tunable ranking. Candidate pool, result cap, RRF vs semantic weights, Bayesian rating signals (well-rated, trash quality, notoriety, obscurity), and cutoff modes.

Metadata filters

SQL-level constraints applied before or during retrieval:

| Filter | Use case |
| --- | --- |
| year_min / year_max | Decade constraints (“80s comedies”) |
| genres | Genre enforcement |
| in_library | Plex-only discovery |
| enriched_only | Require full TMDB detail |
| min_vote_count / max_vote_count | Obscurity vs notoriety control |
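
To make "SQL-level constraints" concrete, the sketch below builds a filtered pgvector query as a string plus parameters, without executing it. Only the movie_embeddings table, the `<=>` operator, and the filter names come from this post. The `movies` table, all column names, the psycopg-style `%(name)s` placeholders, and treating in_library as a restriction only when true are assumptions.

```python
import json


def build_semantic_query(query_vector, filters: dict, top_k: int = 500):
    """Return (sql, params) for a filtered nearest-neighbor query (not executed here)."""
    params = {
        # pgvector accepts a '[x,y,...]' text literal cast to vector.
        "qvec": "[" + ",".join(f"{x:g}" for x in query_vector) + "]",
        "top_k": top_k,
    }
    range_filters = {  # hypothetical column names
        "year_min": "m.release_year >= %(year_min)s",
        "year_max": "m.release_year <= %(year_max)s",
        "min_vote_count": "m.vote_count >= %(min_vote_count)s",
        "max_vote_count": "m.vote_count <= %(max_vote_count)s",
    }
    clauses = []
    for name, template in range_filters.items():
        if filters.get(name) is not None:
            clauses.append(template)
            params[name] = filters[name]
    if filters.get("genres"):
        # JSONB containment: the movie must carry every requested genre.
        clauses.append("m.genres @> %(genres)s::jsonb")
        params["genres"] = json.dumps(filters["genres"])
    if filters.get("in_library"):
        clauses.append("m.in_library")
    if filters.get("enriched_only"):
        clauses.append("m.detail_fetched_at IS NOT NULL")
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = f"""
        SELECT m.tmdb_id, 1 - (e.embedding <=> %(qvec)s::vector) AS semantic
        FROM movie_embeddings e
        JOIN movies m ON m.tmdb_id = e.tmdb_id
        {where}
        ORDER BY e.embedding <=> %(qvec)s::vector
        LIMIT %(top_k)s
    """
    return sql, params


sql, params = build_semantic_query(
    [0.1, 0.2],
    {"year_min": 1980, "year_max": 1989, "genres": ["Comedy"], "in_library": False},
)
print(sql)
```

Passing values as parameters rather than interpolating them keeps user-supplied filter values out of the SQL text.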

API surface

| Method | Endpoint | Purpose |
| --- | --- | --- |
| GET | /health | Liveness + DB stats |
| GET | / | Web dashboard (search, enrichment ops, stats) |
| GET | /api/dashboard | Pipeline stats, sync state |
| POST | /search | Hybrid search with filters + scoring config |
| GET | /movies/{tmdb_id} | Full metadata card |
| POST | /collections/preview | Preview a collection query |
| POST | /collections | Save a collection definition |
| GET | /collections/{id}/movies | Resolve saved collection |
| GET | /api/plex/status | Plex config + library counts |
| POST | /api/plex/sync | Refresh in_library from Plex |

API-key authentication is optional: when API_KEY is set, requests must include a matching X-API-Key header. This is suitable for LAN or partner pilots.

Figure: Built-in REST API reference. Health, catalog jobs, enrichment control, hybrid search, movie cards, and saved collections.

Example search request:

{
  "query": "depressing giant dystopian city",
  "limit": 50,
  "filters": { "year_min": 1970, "in_library": false },
  "scoring": {
    "weights": { "semantic": 2.0, "rrf": 1.0, "obscurity": 0.5 },
    "cutoff_mode": "elbow",
    "cutoff_on": "semantic"
  }
}

Plex integration

  • Config: PLEX_BASE_URL, PLEX_TOKEN, optional PLEX_LIBRARY_NAME.
  • Sync lists movie libraries, extracts TMDB/IMDb GUIDs, matches to catalog, sets in_library.
  • Priority boost: titles in the user’s library are enriched and indexed first.
  • Search UI defaults can restrict to library-only for “what should I watch from what I own?”
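
The GUID matching step might look like the sketch below. It assumes Plex exposes external IDs as strings of the form `tmdb://<id>` and `imdb://<id>`, and that the catalog offers a set of known TMDB IDs plus an IMDb-to-TMDB crosswalk. The function names and both lookup structures are hypothetical, and the sample IDs are illustrative.

```python
import re

_GUID_RE = re.compile(r"^(tmdb|imdb)://(.+)$")


def extract_ids(guids):
    """Pick the first tmdb:// and imdb:// identifiers out of a list of GUID strings."""
    ids = {}
    for guid in guids:
        match = _GUID_RE.match(guid.strip())
        if match:
            ids.setdefault(match.group(1), match.group(2))
    return ids


def match_to_catalog(guids, known_tmdb_ids: set, imdb_to_tmdb: dict):
    """Return the catalog TMDB ID for a Plex item, or None if it cannot be matched."""
    ids = extract_ids(guids)
    tmdb = ids.get("tmdb")
    if tmdb and tmdb.isdigit() and int(tmdb) in known_tmdb_ids:
        return int(tmdb)
    imdb = ids.get("imdb")
    return imdb_to_tmdb.get(imdb) if imdb else None  # fall back to the crosswalk


print(match_to_catalog(["imdb://tt0083658", "tmdb://78"], {78}, {}))  # 78
```

Items that return None are the "unmatched titles" case; they would need a fallback such as title and year matching, which is not shown here.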

Jellyfin: Not implemented; the same overlay pattern applies.

How discovery works (end-to-end)

Example A — Thematic / mood query

User: “Depressing movies that take place in giant dystopian cities.”

Ideal retrieval plan (manual today; automatic via LLM later):

| Knob | Setting |
| --- | --- |
| Semantic weight | High |
| Lexical / title | Lower (avoid “city” in title dominating) |
| Filters | Optional sci-fi / thriller genres; year range if implied |
| Rating signals | Neutral (not optimizing for “best”) |
| Cutoff | Elbow on semantic similarity |
| Candidate pool | Broad (e.g. 500+) |

Figure: Mood / theme query. *“depressing giant dystopian city”* with poster-grid results from hybrid semantic + lexical retrieval.

Example B — Canon / best-of query

User: “Best comedies of the 80s.”

| Knob | Setting |
| --- | --- |
| year_min / year_max | 1980–1989 |
| genres | Comedy |
| Rating high + notoriety | High |
| Obscurity | Low |
| Lexical | Moderate (decade + genre terms) |
| Cutoff | Higher top-N or threshold |

Example C — “So bad it’s good”

User: “Funny-bad horror from the 80s.”

| Knob | Setting |
| --- | --- |
| Filters | Horror; 1980–1989; min_vote_count to ensure credibility |
| Rating low (trash) | High |
| Notoriety | Moderate–high |
| Semantic | Theme + (roadmap) cultural/context embedding |
| Bayesian logic | Ensures “bad” means many voters, not data noise |

Figure: Multi-objective query. 1980s horror with trash-quality and notoriety weights raised so infamous low-rated titles surface.

Example D — Hidden gems

User: “Forgotten films that are actually good but not famous.”

| Knob | Setting |
| --- | --- |
| Rating high | High |
| Obscurity | High |
| Notoriety | Low |
| Semantic | Moderate–high |
| Cutoff | Elbow |

Figure: Hidden gems. Semantic query with well-rated and obscurity boosts to favor credible but under-voted titles.

Conversational refinement (roadmap)

Multi-turn dialogue adjusts the same knobs:

  1. “Too many movies with ‘city’ in the title” → reduce lexical/title weight, increase semantic.
  2. “Darker and less mainstream” → increase obscurity, decrease notoriety, refine semantic query expansion.
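
Since refinement is on the roadmap, the following is only a sketch of the mechanical half: applying weight deltas and clamping them to the 0–10 scale. How an utterance becomes a set of deltas (the LLM's job) is not shown, and the weight names and delta values are assumptions.

```python
def adjust_weights(weights: dict, deltas: dict, lo: float = 0.0, hi: float = 10.0) -> dict:
    """Return a new weight dict with deltas applied and clamped to [lo, hi]."""
    updated = dict(weights)
    for name, delta in deltas.items():
        updated[name] = min(hi, max(lo, updated.get(name, 0.0) + delta))
    return updated


weights = {"semantic": 2.0, "search_rank": 3.0, "notoriety": 1.0}

# Turn 1: "Too many movies with 'city' in the title"
turn1 = adjust_weights(weights, {"search_rank": -2.0, "semantic": +1.0})
# Turn 2: "Darker and less mainstream"
turn2 = adjust_weights(turn1, {"obscurity": +2.0, "notoriety": -2.0})
print(turn2)
```

Returning a new dict on each turn keeps the history of configurations, which makes it easy to undo a refinement that went the wrong way.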

The user steers like talking to a film expert — without operating a control panel.

Built today vs roadmap

Honest status framing for partners and investors. Screenshots in this post are from a TMDB-scale catalog (~1.2M titles) and intentionally omit personal-library overlays and media-server sync — those are optional deployment features, not required for semantic discovery.

Built and operational

| Capability | Notes |
| --- | --- |
| Million-scale catalog ingestion | TMDB export + API queue + optional Kaggle bulk |
| Continuous enrichment worker | Docker worker service; auto search-doc + embed |
| Hybrid search (FTS + trigram + pgvector + RRF) | Production code path |
| Multi-signal weighted ranking | Rating high/low, notoriety, obscurity, Bayesian shrinkage |
| Cutoff modes | none, top_n, threshold, elbow |
| Saved collections API | Preview, persist, resolve |
| Plex in_library sync | Library-first enrichment priority |
| Web dashboard | Search, scoring sliders, enrichment controls, ops |
| Self-hosted deployment | Docker Compose; portable data/ directory |
| Local embeddings | No cloud inference dependency |
| Model registry | Switch embedding model with versioned vectors |

In progress / partial

| Capability | Notes |
| --- | --- |
| Full catalog enrichment | ~1.2M API passes take days; many rows may be spine-only until enriched |
| Poster local cache | Pipeline exists; coverage grows with enrichment |
| Cross-encoder rerank | Flag in API (use_rerank); not wired in search path yet |

Roadmap (high value, not yet built)

| Capability | Business impact |
| --- | --- |
| LLM automatic curation layer | Makes product usable by non-experts; core GTM unlock |
| Second embedding space (cultural/context) | Unlocks “so bad it’s good,” cult, production-history queries |
| Conversational refinement | Retention and differentiation vs static search |
| Jellyfin integration | Same market as Plex plugin |
| Cross-encoder rerank | Precision boost for top results |
| Query → filter LLM parser | Natural language to SQL filters (year, genre) |
| Static collection export | Push lists to Plex/Kodi/CSV acquisition workflows |
| IMDb supplemental import | Richer ratings crosswalk (license-dependent) |

Deployment and data ownership

Self-hosted first

  • Typical deployment: one server (homelab Mac mini, NAS VM, small cloud VM).
  • All data under MOVIE_INDEX_DATA_DIR (default ./data): Postgres files, exports, models, posters, backups.
  • Moving hosts: copy repo + .env + entire data/ tree — no re-embedding required.

Why self-hosted matters commercially

| Stakeholder | Benefit |
| --- | --- |
| Power users / Plex community | Data stays on LAN; no upload of library titles to a third party |
| Enterprise pilots | Air-gapped or VPC deployment possible |
| API business | Offer hosted API or on-prem license from the same codebase |

Hosted SaaS (future)

A hosted tier is compatible with the architecture but is a go-to-market choice, not a technical requirement. TMDB attribution and API terms must be reflected in any public UI.

Competitive context

| Alternative | Strength | Gap this system fills |
| --- | --- | --- |
| TMDB / IMDb search | Complete metadata, trusted IDs | Weak subjective/vibe search; no personal library semantics |
| JustWatch, Reelgood | Streaming availability | Not built for obscure/cult/long-tail curation |
| Letterboxd | Community taste graph | Requires social graph and manual list culture |
| Plex Discover | Convenience inside Plex | Limited semantic discovery; no cross-catalog “describe vibe” |
| General RAG over Wikipedia | Flexible | Expensive, inconsistent, no structured rating/obscurity controls |
| Pinecone + raw TMDB embeddings DIY | Custom | Months of pipeline work; no Bayesian/cult ranking logic |

Moat hypothesis: The combination of million-title constructed catalog, hybrid retrieval, intent-aware multi-signal ranking, library overlay, and (when shipped) LLM tuning is harder to replicate than any single component alone.

Risks, dependencies, and compliance

| Area | Consideration |
| --- | --- |
| TMDB | API key required; rate limits; attribution on public surfaces; terms restrict how data is exposed |
| Enrichment cost | Time and API quota to detail-enrich full catalog |
| Compute | Initial embed of 1M+ titles is GPU/time-intensive (one-time + re-embed on model change) |
| Plex | Token security; read-only library access; matching unmatched titles |
| Copyright | System indexes metadata only; does not host films |
| Model licenses | Qwen/BGE/nomic terms for redistribution if selling embedding datasets |
| IMDb datasets | Non-commercial restrictions if used |
| Product risk | Without LLM auto-tuning, power of engine exceeds mainstream UX |

Strongest near-term wedge: a simple interface —

“Describe the kind of movie you want.”

→ curated list → conversational refinement.

| Layer | Role |
| --- | --- |
| UX | Natural language in; ranked posters + explanations out |
| Auto-curation (build) | LLM maps utterances → ScoringConfig + filters + cutoff |
| Engine (exists) | Hybrid search + Plex overlay + collections |
| Power user mode (exists) | Advanced sliders in dashboard for tuning and evaluation |

Secondary wedge: Plex collection generator for the homelab community — passionate users, clear pain, willingness to pay for tooling.

Strategic summary

| Dimension | Statement |
| --- | --- |
| What it is | Semantic curation engine over a private million-movie knowledge base |
| What it is not | A streaming recommender or a basic TMDB clone |
| Core insight | Discovery is multi-objective; intent changes the right ranking strategy |
| Technical foundation | PostgreSQL + pgvector + hybrid RRF + local embeddings + Bayesian rating signals |
| Product unlock | LLM layer that translates human language into retrieval plans |
| Business paths | Consumer discovery, Plex/Jellyfin collections, metadata API, embedding datasets, editorial automation |
| Deployment advantage | Self-hosted, query-time offline, library-aware, explainable scores |

The engine is intentionally complex so the user experience can be simple. The business opportunity is to productize that translation layer — turning a powerful retrieval laboratory into a film discovery assistant that traditional search and recommendation stacks do not provide.

Appendix: query → strategy mapping

| User query | Filters | Weight emphasis | Cutoff |
| --- | --- | --- | --- |
| Lonely neon cities at night | | Semantic ↑, obscurity moderate | Elbow |
| Best 80s comedies | years 1980–1989, genre Comedy | rating_high ↑, notoriety ↑ | top_n |
| So-bad-it’s-good 80s horror | years 1980–1989, genre Horror, min votes | rating_low ↑, notoriety ↑ | Elbow |
| Chupacabra movies | | Lexical ↑ (rare exact term) | Threshold or small top_n |
| Hidden gems | | rating_high ↑, obscurity ↑ | Elbow |
| Cult sci-fi deep cuts | genre Sci-Fi | semantic ↑, notoriety moderate | Elbow |
| Hidden gems in my Plex | in_library: true | rating_high ↑, obscurity ↑ | Elbow |
| Cult sci-fi I own | in_library: true, genre Sci-Fi | semantic ↑, notoriety moderate | Elbow |