Semantic Movie Discovery System

A semantic curation engine for film discovery, built on a private, million-title movie knowledge base.

**Status:** The core engine and homelab deployment are operational. Consumer-grade “describe what you want” UX and LLM auto-tuning are the primary productization layers still to build.

Executive summary

Most movie systems answer: “What are other people likely to watch?”

This system answers: “What kind of movie experience is the user trying to find?”

It is a large-scale movie discovery and curation engine that helps people find films through meaning, mood, cultural context, quality signals, and personal intent — not only title, genre, or keyword matching.

At a simple level, it answers questions like:

“Find me weird dystopian city movies that feel depressing and massive.”
“Show me so-bad-it’s-good horror movies from the 80s.”
“Create a Plex collection of obscure sci-fi movies with cult appeal.”
“Find movies like The Hitchhiker’s Guide to the Galaxy, but more absurd and less action-oriented.”

The core insight: movie discovery is not just a database search problem. People often search with vague, emotional, cultural, or subjective language. Traditional systems struggle because they lean on genres, popularity charts, and collaborative filtering. This platform combines enriched metadata, semantic embeddings, lexical search, rating intelligence, popularity/obscurity controls, and (on the roadmap) an LLM-driven tuning layer that translates human intent into retrieval strategy.

The result is a system that can behave less like a search box and more like a knowledgeable film curator — especially once automatic tuning sits in front of the engine.

Strategic positioning: a semantic curation platform over a constructed movie knowledge base — not merely a search site, and not a traditional recommender.

The problem

Users search with intent, not metadata

People rarely arrive with clean filters. They arrive with feelings, references, eras, tones, and cultural categories:

User language	What they actually want
“Depressing movies in giant dystopian cities”	Atmosphere, scale, tone, setting — not the word “city” in the title
“Best comedies of the 80s”	Hard decade constraint + genre + high ratings + cultural prominence
“So bad it’s good”	Low credible ratings, high awareness, unintentional comedy, cult notoriety
“Hidden gems in my Plex library”	Personal inventory overlay + quality + obscurity

A single fixed ranking algorithm fails because the correct balance of signals changes with intent.

Incumbent tools optimize the wrong objective

System type	Typical objective	Weakness for subjective discovery
Streaming recommenders (Netflix, etc.)	Maximize engagement on the catalog they license	No long-tail obscure titles; no “vibe” queries; no personal library semantics
Database search (TMDB, IMDb browse)	Exact metadata match	Poor at mood, metaphor, cultural framing
Letterboxd / community lists	Human curation at scale	Requires cinephile literacy; lists go stale; not generative
Plex/Jellyfin built-in search	Title/metadata in your files	Weak semantic discovery; collection building is manual

Gap: there is no widely available product that combines million-title coverage, semantic retrieval, intent-aware ranking, and personal library overlay in one self-hosted or API-first platform.

What the system is

Movie Index (internal project name) is a locally hosted movie intelligence platform:

It builds and owns a private movie catalog (~1.2M+ TMDB identities; bulk snapshots can exceed 1.4M rows when importing large CSV archives).
It enriches records from TMDB (and optional bulk sources), indexes them for hybrid search, and serves ranked results via HTTP API and web UI.
It optionally overlays a user’s Plex library so discovery can mean “from the whole world” or “from what I already own.”

One-line definition:
A semantic curation engine that turns fuzzy human intent into tunable retrieval over a very large, normalized movie knowledge base.

Figure: Dashboard. Million-scale catalog coverage: total IDs, bulk metadata, API enrichment queue, search documents, and embeddings.

What makes it different

Different question, different product

Traditional recommender	This system
“What will people like me watch next?”	“What experience is the user trying to find?”
Optimizes engagement on a licensed catalog	Optimizes interpreted intent on a comprehensive catalog
Opaque matrix factorization / trending	Explainable ranking signals (semantic match, lexical match, rating, notoriety, obscurity)
Weak on “so bad it’s good,” cult context, atmosphere	Designed for multi-objective discovery (quality, trash, fame, hidden)

Tunable retrieval, not one algorithm

The engine exposes adjustable fusion weights and cutoff strategies so the same query infrastructure can serve opposite goals:

User goal	Retrieval emphasis (conceptual)
Famous and good	High ratings, high vote count (notoriety), tighter similarity
Obscure and good	High ratings, low vote count (obscurity), semantic breadth
Famous and terrible	Low Bayesian-adjusted quality, high notoriety
Cult / “so bad it’s good”	Trash-quality signal + notoriety + cultural/contextual semantics (roadmap: dedicated embedding space)
Thematic / mood	Semantic similarity, relaxed lexical title match, elbow-based membership

Key differentiator (product): an automatic curation layer (planned) that sets these knobs from natural language so casual users never see them.

Business opportunities

Each opportunity below reuses the same core: catalog + embeddings + hybrid search + ranking + (optional) library overlay.

Consumer movie discovery app

Product: Public or freemium web app — “describe the movie you want.”

Example queries:

“Movies that feel like lonely neon cities at night.”
“Absurd British sci-fi comedies.”
“Forgotten 90s thrillers that are actually good.”
“Bad movies that are fun, not just bad.”

Positioning: Competes with discovery and exploration (Letterboxd-adjacent browsing, film Twitter/list culture, niche cinephile search) — not with Netflix-style “what to stream tonight on our platform.”

Differentiator: Users do not need to know filters, genres, or metadata vocabulary. They describe vibe, era, tone, or cultural category; the system translates that into retrieval strategy.

Monetization paths: subscription, affiliate links (where legally appropriate), premium collections, API tier for power users.

Plex / Jellyfin collection generator

Product: Connect to a user’s media server; generate curated collections from natural language.

Example queries:

“Build a cult sci-fi collection from my library.”
“So-bad-it’s-good movies I already own.”
“1980s creature-feature playlist.”
“Hidden gems I forgot I had.”

Why this niche is strong:

Users already maintain large personal libraries and care about organization.
They are underserved by semantic discovery inside Plex/Jellyfin.
No streaming of copyrighted content required — only metadata analysis and collection instructions returned to the local server.

Technical fit today: Plex sync marks in_library on catalog rows; search and collections APIs accept in_library: true filters. Jellyfin would be a parallel integration.

Monetization paths: one-time license, subscription plugin, homelab “pro” tier.

Movie metadata and semantic search API

Asset: A constructed, normalized, embedded catalog — not raw TMDB dumps.

Potential API customers:

Indie app developers
Recommendation startups
Plex/Jellyfin plugin authors
Film researchers and educators
AI application builders
Media catalog / metadata companies
Hobbyists building local movie tools

API capabilities (existing or near-existing):

Capability	Description
Normalized metadata	Title, year, overview, genres, keywords, cast, crew, ratings, posters
Hybrid search	Semantic + lexical + metadata filters in one request
Similarity / “more like this”	Same embedding space as search
Saved collections	Store query + filters; re-resolve on demand
Library overlay	Restrict to `in_library` for personal-server use cases
Scoring transparency	Per-hit scores: semantic similarity, RRF, Bayesian rating, etc.

Positioning: A “semantic layer” on top of movie metadata — the hard data engineering and embedding work already done.

Monetization paths: usage-based API, tiered keys, enterprise license, white-label.

Precomputed embedding dataset

Problem: Generating embeddings for 1M+ movies is expensive, slow, and operationally painful (GPU batching, model versioning, index rebuilds).

Product: Licensed dataset bundles:

Movie identity (TMDB ID, title, year, etc.)
Plot/metadata embedding vectors
(Roadmap) Historical/cultural context embeddings
Similarity index metadata / version documentation
Incremental update packages when the model or enrichment changes

Buyers: AI developers who want movie search or recommendations without building the pipeline.

Licensing must respect TMDB attribution terms and model licenses; vectors are derived works built on permitted metadata.

Automated editorial and content marketing

Because the engine can surface clusters, outliers, and thematic slices, it can power:

Listicles (“Weirdest low-budget 90s sci-fi”)
Newsletter segments
SEO landing pages
Social content calendars

Examples:

“Movies that accidentally became cult classics”
“Dystopian city films before and after Blade Runner”
“The best bad shark movies you’ve never heard of”

Monetization paths: ad-supported media property, B2B content tooling for publishers, lead gen for a consumer app.

Technical platform

This section covers architecture and implementation for readers who need credible technical depth — and for technical partners evaluating feasibility.

Architecture at a glance

flowchart TB
  subgraph external [External — ingestion only]
    TMDB_EXP[TMDB Daily ID Exports]
    TMDB_API[TMDB API v3]
    KAGGLE[Kaggle TMDB CSV snapshot — optional bulk]
    PLEX[Plex server — optional]
  end

  subgraph server [Single-server deployment — Docker Compose]
    API[FastAPI + Uvicorn — port 8080]
    WORKER[Enrichment worker — continuous]
    CLI[movie-index CLI / cron]

    subgraph data [PostgreSQL 16 + extensions]
      META[Constructed metadata — ~1M+ rows]
      FTS[Full-text search — tsvector + GIN]
      TRGM[Trigram title match — pg_trgm]
      VEC[Vector index — pgvector HNSW]
      QUEUE[Enrichment queue + phase tracking]
      COLL[Saved collections]
    end

    MODELS[Local model cache — Hugging Face weights]
    ARTIFACTS[Poster cache — local disk]
  end

  TMDB_EXP --> CLI
  TMDB_API --> WORKER
  KAGGLE --> CLI
  PLEX --> API
  CLI --> META
  WORKER --> META
  META --> FTS
  META --> VEC
  API --> FTS
  API --> VEC
  API --> TRGM
  MODELS --> VEC

Design principle: At query time, the system reads only PostgreSQL and local files — not TMDB, not cloud embedding APIs. External services are used during ingestion and sync only. That yields predictable latency, offline-capable search (once built), and no per-query SaaS inference bill.

Software stack

Layer	Technology	Role
Database	PostgreSQL 16	System of record for all metadata, queues, collections
Vector search	pgvector (HNSW, cosine distance)	Semantic nearest-neighbor on filtered candidate sets
Lexical search	PostgreSQL FTS (`tsvector`, `websearch_to_tsquery`)	Overviews, keywords, assembled search documents
Fuzzy titles	pg_trgm	Typo-tolerant and partial title matching
API	FastAPI + Uvicorn	REST: search, movies, collections, Plex sync, dashboard
Runtime	Python 3.12+	Ingestion, search fusion, embedding jobs, CLI
Embeddings	sentence-transformers	Local bi-encoder; batch on CPU, Apple MPS, or NVIDIA CUDA
Packaging	Docker Compose	`postgres`, `api`, `worker` services
CLI	Click (`movie-index` command)	Operations, imports, enrichment, embedding, stats

Deliberate non-choices (v1): No Elasticsearch, Pinecone, or managed vector DB — one database reduces operational cost for homelab and early commercial pilots.

Data sources and catalog construction

No single upstream provider ships a complete, query-ready movie database. The platform merges sources into one constructed catalog:

Source	What it provides	How it is used
TMDB Daily ID Exports	Near-complete daily list of valid TMDB movie IDs + export-level popularity	Catalog spine — hundreds of thousands to ~1M+ IDs without months of API paging
TMDB API v3	Full detail: overview, genres, keywords, credits, images, external IDs	Enrichment for searchable depth; rate-limited queue (~4 req/s default, configurable)
Kaggle TMDB snapshot (optional)	Bulk CSV (~930k–1.4M rows) with core metadata	Fast bulk bootstrap; API queue backfills gaps
Plex (optional)	User’s owned titles	`in_library` overlay only — not a metadata source of truth
IMDb datasets (optional, roadmap)	Supplemental ratings / crosswalk	Offline import; non-commercial license constraints

Scale targets:

~1.2 million movie identities from the TMDB export workflow (project design target).
~1.42 million rows available in the bundled Kaggle CSV snapshot (useful for bulk import and gap-fill).
Enrichment at 4 TMDB requests/sec ≈ 3–4 days of continuous worker time for a full API detail pass over ~1.2M IDs (order-of-magnitude planning number).
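
The 3–4 day planning number follows from simple arithmetic. A minimal sketch, assuming one detail request per movie; the function name and the `requests_per_movie` parameter are illustrative and not part of the project:

```python
SECONDS_PER_DAY = 86_400


def enrichment_days(movie_count: int, requests_per_sec: float,
                    requests_per_movie: int = 1) -> float:
    """Continuous worker time, in days, for one full detail pass."""
    total_requests = movie_count * requests_per_movie
    return total_requests / requests_per_sec / SECONDS_PER_DAY


# 1.2M IDs at 4 req/s is 300,000 seconds of fetching.
print(f"{enrichment_days(1_200_000, 4):.2f} days")  # 3.47 days
```

If each movie needed two requests instead of one, the estimate would double, which is why the figure is best read as an order of magnitude.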

Enrichment pipeline (automated):

Import IDs (export and/or Kaggle).
Queue movies for TMDB API detail fetch (priority: in-library first, then popularity).
After each batch: build search documents and embedding vectors (configurable via ENRICHMENT_AUTO_INDEX).
Track per-movie enrichment phases: catalog spine → core metadata → poster media.
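
The queue priority described above (in-library first, then popularity) can be sketched as a sort key. The `QueuedMovie` type and its field names are hypothetical stand-ins for whatever the real queue rows look like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedMovie:
    tmdb_id: int
    in_library: bool
    popularity: float


def enrichment_order(queue):
    """In-library titles first, then descending popularity."""
    # False sorts before True, so `not in_library` puts owned titles first.
    return sorted(queue, key=lambda m: (not m.in_library, -m.popularity))


queue = [
    QueuedMovie(1, False, 90.0),
    QueuedMovie(2, True, 3.5),
    QueuedMovie(3, True, 40.0),
    QueuedMovie(4, False, 12.0),
]
print([m.tmdb_id for m in enrichment_order(queue)])  # [3, 2, 1, 4]
```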

Figure: Enrichment worker. Queue depth, ETA, and the three stages: TMDB metadata fetch, search-document build, and vector embedding.

Stored metadata per movie (representative):

Identity: TMDB ID, IMDb ID, titles, release date/year
Text: overview, tagline, search document (derived)
Facets: genres, keywords (JSON arrays)
People: cast (top billed), crew (directors/writers)
Franchise: collection id/name
Signals: popularity, vote_average, vote_count, adult flag
Media: poster paths, locally cached poster files
Library: in_library boolean
Provenance: detail_fetched_at, search_doc_built_at, enrichment phase status

Search document and embedding model

Search document (today): A single composed text block per movie used for both FTS indexing and embedding input:

Title, original title, year, tagline, overview
Genres, keywords (capped), top cast, directors
Collection name
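
One way to compose such a document is sketched below. The dictionary keys, the labels, and the caps (`max_keywords`, `max_cast`) are assumptions chosen for illustration; the post only states which fields are included and that keywords are capped:

```python
def build_search_document(movie: dict, max_keywords: int = 20,
                          max_cast: int = 5) -> str:
    """Compose one text block from the fields listed above."""
    title = movie.get("title") or ""
    original = movie.get("original_title") or ""
    lines = [
        title,
        # Skip the original title when it only repeats the title.
        original if original != title else "",
        str(movie["year"]) if movie.get("year") else "",
        movie.get("tagline") or "",
        movie.get("overview") or "",
    ]
    labelled = [
        ("Genres", movie.get("genres") or []),
        ("Keywords", (movie.get("keywords") or [])[:max_keywords]),
        ("Cast", (movie.get("cast") or [])[:max_cast]),
        ("Directors", movie.get("directors") or []),
    ]
    for label, values in labelled:
        if values:
            lines.append(f"{label}: {', '.join(values)}")
    if movie.get("collection_name"):
        lines.append(f"Collection: {movie['collection_name']}")
    # Drop empty fields so sparse (spine-only) rows still yield clean text.
    return "\n".join(line for line in lines if line)


example = {
    "title": "Example City",
    "original_title": "Example City",
    "year": 1984,
    "overview": "A clerk gets lost in an endless megacity.",
    "genres": ["Science Fiction"],
    "keywords": ["dystopia", "megacity", "bureaucracy"],
}
print(build_search_document(example, max_keywords=2))
```

Because the same text feeds both FTS and the embedding model, any change to this composition implies rebuilding both indexes.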

Embedding model (current default):

Setting	Value
Model	`Qwen/Qwen3-Embedding-0.6B`
Dimensions	1024
Storage	`movie_embeddings` table with `embedding_model` + `embedding_version` for traceability
Index	HNSW on cosine distance (`<=>` operator)
Alternatives (supported)	`BAAI/bge-small-en-v1.5` (384d), `bge-base`, `bge-large`, `bge-m3`, `nomic-ai/nomic-embed-text-v1.5`

Models run entirely locally — weights cached under data/models/. No OpenAI/Cohere-style per-query embedding fees.

Hardware: Auto-detects CUDA → Apple MPS → CPU. Docker worker/API containers typically use CPU; native runs on Apple Silicon can use MPS for faster embedding batches.
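
The CUDA → MPS → CPU fallback could look roughly like this. It is a sketch that assumes PyTorch is the backend (as it is for sentence-transformers); the function name `pick_device` is hypothetical, and the code falls back to CPU when PyTorch is not installed:

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" in that order of preference."""
    try:
        import torch
    except ImportError:
        # No PyTorch available: CPU is the only option.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```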

Roadmap — second embedding space (“cultural / context”):
Planned separate representation for production history, reception, cult status, trivia, and “why this movie matters” — critical for queries like “so bad it’s good” where plot text is insufficient. Not yet implemented; today all semantic search uses the plot/metadata document embedding.

Hybrid retrieval pipeline

For each query, the engine:

1. Embeds the query with the same model, using query-specific encoding per model family (e.g. Qwen uses a query prompt name).
2. Runs the lexical branch:
   - PostgreSQL FTS rank on search_tsv (websearch_to_tsquery, English).
   - Trigram similarity on title (threshold > 0.2).
3. Runs the semantic branch:
   - pgvector top-K by cosine distance on movie_embeddings, respecting SQL filters.
4. Fuses lexical + semantic candidate lists with Reciprocal Rank Fusion (RRF), default k = 60.
5. Applies weighted score fusion across multiple signals (see below).
6. Optionally applies a membership cutoff (how many results “belong” in the set).
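
The fusion step can be illustrated with the standard RRF formula, where each list contributes 1 / (k + rank) per document. This is a minimal sketch of the general technique with k = 60, not the project's actual code; the function name and the toy IDs are made up:

```python
def rrf_fuse(ranked_lists, k: int = 60):
    """Reciprocal Rank Fusion over several ranked lists of IDs."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank) for the IDs it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


lexical = [101, 102, 103]   # e.g. FTS + trigram order
semantic = [102, 104, 101]  # e.g. pgvector order
fused = rrf_fuse([lexical, semantic])
print([doc_id for doc_id, _ in fused])  # [102, 101, 104, 103]
```

Documents that appear in both lists rise to the top even when neither branch ranked them first, which is the behavior a hybrid query wants.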

sequenceDiagram
  participant User
  participant API
  participant Encoder
  participant PG as PostgreSQL

  User->>API: Natural language query + filters
  API->>Encoder: Embed query locally
  Encoder-->>API: Query vector (1024-d)
  par Lexical
    API->>PG: FTS + trigram title search
  and Semantic
    API->>PG: pgvector HNSW nearest neighbors
  end
  PG-->>API: Candidate pools (e.g. top 500 each)
  API->>API: RRF merge + weighted scoring + cutoff
  API-->>User: Ranked hits + explainable scores

Ranking dimensions

The API accepts a ScoringConfig with fusion weights (0–10 scale) and rating/cutoff settings:

Signal	Meaning	Example use
RRF	Reciprocal rank fusion across lexical + semantic lists	Default hybrid balance
Semantic	Vector similarity (1 − cosine distance)	Mood, theme, atmosphere queries
Search rank	Combined channel rank score	Fine-tuning lexical vs semantic emphasis
Rating high	Bayesian-shrunk quality (normalized)	“Actually good” lists
Rating low (“trash”)	Inverse of quality signal	“So bad it’s good”
Notoriety	`log1p(vote_count)` — famous / widely rated	Mainstream, infamous, cult-famous
Obscurity	Inverse of notoriety	Hidden gems, deep cuts
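
A weighted combination of these signals might be sketched as follows. The normalization (each signal scaled to 0–1, notoriety divided by the catalog maximum, score divided by total weight) and the key names `rating_high` / `rating_low` are assumptions for illustration; the post only specifies the 0–10 weight scale and `log1p(vote_count)`:

```python
import math


def notoriety(vote_count: int, max_vote_count: int) -> float:
    """log1p(vote_count), scaled to 0..1 against the catalog maximum."""
    if max_vote_count <= 0:
        return 0.0
    return min(1.0, math.log1p(vote_count) / math.log1p(max_vote_count))


def signals_for(semantic: float, quality: float,
                vote_count: int, max_vote_count: int) -> dict:
    """Bundle the per-movie signals; inverses are 1 minus the base signal."""
    fame = notoriety(vote_count, max_vote_count)
    return {
        "semantic": semantic,
        "rating_high": quality,
        "rating_low": 1.0 - quality,
        "notoriety": fame,
        "obscurity": 1.0 - fame,
    }


def fused_score(signals: dict, weights: dict) -> float:
    """Weighted mean of the signals, with weights on a 0-10 scale."""
    for name, weight in weights.items():
        if not 0 <= weight <= 10:
            raise ValueError(f"weight for {name!r} must be within 0-10")
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * signals.get(n, 0.0) for n, w in weights.items()) / total


hidden_gems = {"semantic": 4, "rating_high": 5, "obscurity": 5}
obscure = signals_for(0.7, 0.8, vote_count=120, max_vote_count=40_000)
famous = signals_for(0.7, 0.8, vote_count=30_000, max_vote_count=40_000)
print(fused_score(obscure, hidden_gems) > fused_score(famous, hidden_gems))  # True
```

With identical semantic match and quality, the obscurity weight alone decides the order, which is the point of exposing it as a separate knob.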

Bayesian rating:
Raw TMDB vote_average is unreliable when vote_count is low. The system shrinks each rating toward the catalog mean using a configurable prior (default: 50 pseudo-votes), so that:

“1.5 from 3 voters” carries less weight than “3.8 from 80,000 voters.”

Reversing the quality signal (boost rating_low) surfaces titles that are credibly poorly rated at scale — a key ingredient for “so bad it’s good.”
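
The shrinkage can be written as the usual weighted average of a movie's own votes and the prior. This sketch uses the 50 pseudo-vote default from the text; the catalog mean of 6.0 is an assumed value for the example, and the function name is illustrative:

```python
def bayesian_rating(vote_average: float, vote_count: int,
                    catalog_mean: float, prior_votes: float = 50.0) -> float:
    """Shrink a raw average toward the catalog mean."""
    weighted_votes = vote_count * vote_average + prior_votes * catalog_mean
    return weighted_votes / (vote_count + prior_votes)


few = bayesian_rating(1.5, 3, catalog_mean=6.0)
many = bayesian_rating(3.8, 80_000, catalog_mean=6.0)
print(f"{few:.2f} {many:.2f}")  # 5.75 3.80
```

After shrinkage the three-vote title sits near the catalog mean, while the 80,000-vote title keeps its low score. That makes the widely rated title the one that ranks highly when the low-rating signal is boosted.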

Membership cutoff modes:

Mode	Behavior
`none`	All ranked results count
`top_n`	Fixed cap
`threshold`	Minimum semantic (or other) score
`elbow`	Largest gap in score curve — auto-sized collections

Elbow mode sizes the result set to the query: “Chupacabra movies” forms a tight cluster, while “dreamlike loneliness” spreads broadly, so neither is forced to return exactly 100 or 200 titles.
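
A simple reading of “largest gap in score curve” is sketched below. It is one possible implementation under that reading; the function name and the `min_keep` parameter are hypothetical:

```python
def elbow_cutoff(scores, min_keep: int = 1) -> int:
    """Return how many top results to keep, cutting at the largest drop."""
    ordered = sorted(scores, reverse=True)
    best_gap, cut = 0.0, len(ordered)  # a flat curve keeps everything
    for i in range(max(min_keep, 1), len(ordered)):
        gap = ordered[i - 1] - ordered[i]
        if gap > best_gap:
            best_gap, cut = gap, i
    return cut


tight = [0.91, 0.89, 0.88, 0.61, 0.60, 0.58]  # small, well-separated cluster
broad = [0.80, 0.78, 0.76, 0.74, 0.72, 0.50]  # gradual decline, late drop
print(elbow_cutoff(tight), elbow_cutoff(broad))  # 3 5
```

The same rule returns three titles for the tight cluster and five for the broad one, with no fixed cap involved.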

Figure: Tunable ranking. Candidate pool, result cap, RRF vs semantic weights, Bayesian rating signals (well-rated, trash quality, notoriety, obscurity), and cutoff modes.

Metadata filters

SQL-level constraints applied before or during retrieval:

Filter	Use case
`year_min` / `year_max`	Decade constraints (“80s comedies”)
`genres`	Genre enforcement
`in_library`	Plex-only discovery
`enriched_only`	Require full TMDB detail
`min_vote_count` / `max_vote_count`	Obscurity vs notoriety control
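
To show how such filters can sit inside the semantic branch, here is a sketch that builds a parameterized pgvector query. The `movies` table, the `release_year` and `embedding` column names, and the join key are assumptions. Treating `in_library: false` as “no restriction” is also an assumption. The `genres` filter is left out because its storage format would need further guesses:

```python
def build_semantic_sql(filters: dict, top_k: int = 500):
    """Return (sql, params) for a filtered nearest-neighbor query."""
    clauses = []
    params = {"top_k": top_k}
    range_filters = [
        ("year_min", "m.release_year >= %(year_min)s"),
        ("year_max", "m.release_year <= %(year_max)s"),
        ("min_vote_count", "m.vote_count >= %(min_vote_count)s"),
        ("max_vote_count", "m.vote_count <= %(max_vote_count)s"),
    ]
    for key, clause in range_filters:
        if filters.get(key) is not None:
            clauses.append(clause)
            params[key] = filters[key]
    if filters.get("in_library"):
        clauses.append("m.in_library")
    if filters.get("enriched_only"):
        clauses.append("m.detail_fetched_at IS NOT NULL")
    where = " AND ".join(clauses) if clauses else "TRUE"
    # The caller adds params["query_vec"] (the encoded query) before executing.
    sql = (
        "SELECT m.tmdb_id, 1 - (e.embedding <=> %(query_vec)s::vector) AS semantic\n"
        "FROM movie_embeddings e JOIN movies m ON m.tmdb_id = e.tmdb_id\n"
        f"WHERE {where}\n"
        "ORDER BY e.embedding <=> %(query_vec)s::vector\n"
        "LIMIT %(top_k)s"
    )
    return sql, params


sql, params = build_semantic_sql({"year_min": 1980, "year_max": 1989})
print(sql)
```

Values travel as bound parameters rather than being interpolated into the string, so user-supplied filters cannot alter the SQL.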

API surface

Method	Endpoint	Purpose
GET	`/health`	Liveness + DB stats
GET	`/`	Web dashboard (search, enrichment ops, stats)
GET	`/api/dashboard`	Pipeline stats, sync state
POST	`/search`	Hybrid search with filters + scoring config
GET	`/movies/{tmdb_id}`	Full metadata card
POST	`/collections/preview`	Preview a collection query
POST	`/collections`	Save a collection definition
GET	`/collections/{id}/movies`	Resolve saved collection
GET	`/api/plex/status`	Plex config + library counts
POST	`/api/plex/sync`	Refresh `in_library` from Plex

API-key authentication is optional: when API_KEY is set, requests must include a matching X-API-Key header. This is suitable for LAN or partner pilots.

Figure: Built-in REST API reference. Health, catalog jobs, enrichment control, hybrid search, movie cards, and saved collections.

Example search request:

{
  "query": "depressing giant dystopian city",
  "limit": 50,
  "filters": { "year_min": 1970, "in_library": false },
  "scoring": {
    "weights": { "semantic": 2.0, "rrf": 1.0, "obscurity": 0.5 },
    "cutoff_mode": "elbow",
    "cutoff_on": "semantic"
  }
}
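
A client could send that payload using only the Python standard library. This sketch builds the request without sending it; the host and the helper name are assumptions (port 8080 and the `X-API-Key` header come from the text), and the response shape is not assumed:

```python
import json
from urllib import request


def build_search_request(base_url: str, payload: dict, api_key=None):
    """Build a POST /search request; the caller decides when to send it."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return request.Request(
        url=base_url.rstrip("/") + "/search",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


payload = {
    "query": "depressing giant dystopian city",
    "limit": 50,
    "filters": {"year_min": 1970, "in_library": False},
    "scoring": {
        "weights": {"semantic": 2.0, "rrf": 1.0, "obscurity": 0.5},
        "cutoff_mode": "elbow",
        "cutoff_on": "semantic",
    },
}
req = build_search_request("http://localhost:8080", payload, api_key="secret")
# To send it against a running instance:
#     with request.urlopen(req) as resp:
#         body = json.load(resp)
```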

Plex integration

Config: PLEX_BASE_URL, PLEX_TOKEN, optional PLEX_LIBRARY_NAME.
Sync lists movie libraries, extracts TMDB/IMDb GUIDs, matches to catalog, sets in_library.
Priority boost: titles in the user’s library are enriched and indexed first.
Search UI defaults can restrict to library-only for “what should I watch from what I own?”
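
The GUID-matching step might look like the sketch below. It assumes Plex reports external IDs as strings of the form `tmdb://<id>` and `imdb://<id>`, and that the catalog exposes a set of TMDB IDs plus an IMDb-to-TMDB crosswalk. All names here are hypothetical:

```python
import re
from typing import Dict, Iterable, Optional, Set

_GUID = re.compile(r"^(tmdb|imdb)://(\S+)$")


def parse_guids(guids: Iterable[str]) -> Dict[str, str]:
    """Map provider name to external ID, ignoring unknown GUID schemes."""
    ids: Dict[str, str] = {}
    for guid in guids:
        match = _GUID.match(guid.strip())
        if match:
            ids.setdefault(match.group(1), match.group(2))
    return ids


def match_to_catalog(guids: Iterable[str], catalog_tmdb_ids: Set[int],
                     imdb_to_tmdb: Dict[str, int]) -> Optional[int]:
    """Prefer a direct TMDB match, then fall back to the IMDb crosswalk."""
    ids = parse_guids(guids)
    tmdb = ids.get("tmdb", "")
    if tmdb.isdigit() and int(tmdb) in catalog_tmdb_ids:
        return int(tmdb)
    return imdb_to_tmdb.get(ids.get("imdb", ""))


catalog = {12345}
crosswalk = {"tt0000001": 67890}
print(match_to_catalog(["imdb://tt0000002", "tmdb://12345"], catalog, crosswalk))  # 12345
print(match_to_catalog(["imdb://tt0000001"], catalog, crosswalk))  # 67890
```

Titles that return `None` are the unmatched cases that the sync would need to report or skip.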

Jellyfin: Not implemented; the same overlay pattern applies.

How discovery works (end-to-end)

Example A — Thematic / mood query

User: “Depressing movies that take place in giant dystopian cities.”

Ideal retrieval plan (manual today; automatic via LLM later):

Knob	Setting
Semantic weight	High
Lexical / title	Lower (avoid “city” in title dominating)
Filters	Optional sci-fi / thriller genres; year range if implied
Rating signals	Neutral (not optimizing for “best”)
Cutoff	Elbow on semantic similarity
Candidate pool	Broad (e.g. 500+)

Figure: Mood / theme query. *“depressing giant dystopian city”* with poster-grid results from hybrid semantic + lexical retrieval.

Example B — Canon / best-of query

User: “Best comedies of the 80s.”

Knob	Setting
`year_min` / `year_max`	1980–1989
`genres`	Comedy
Rating high + notoriety	High
Obscurity	Low
Lexical	Moderate (decade + genre terms)
Cutoff	Higher top-N or threshold

Example C — “So bad it’s good”

User: “Funny-bad horror from the 80s.”

Knob	Setting
Filters	Horror; 1980–1989; `min_vote_count` to ensure credibility
Rating low (trash)	High
Notoriety	Moderate–high
Semantic	Theme + (roadmap) cultural/context embedding
Bayesian logic	Ensures “bad” means many voters, not data noise

Figure: Multi-objective query. 1980s horror with trash-quality and notoriety weights raised so infamous low-rated titles surface.

Example D — Hidden gems

User: “Forgotten films that are actually good but not famous.”

Knob	Setting
Rating high	High
Obscurity	High
Notoriety	Low
Semantic	Moderate–high
Cutoff	Elbow

Figure: Hidden gems. Semantic query with well-rated and obscurity boosts to favor credible but under-voted titles.

Multi-turn dialogue adjusts the same knobs:

“Too many movies with ‘city’ in the title” → reduce lexical/title weight, increase semantic.
“Darker and less mainstream” → increase obscurity, decrease notoriety, refine semantic query expansion.

The user steers like talking to a film expert — without operating a control panel.

Built today vs roadmap

Status framing for partners and investors: this section separates what is built from what is planned. Screenshots in this post are from a TMDB-scale catalog (~1.2M titles) and intentionally omit personal-library overlays and media-server sync; those are optional deployment features, not required for semantic discovery.

Built and operational

Capability	Notes
Million-scale catalog ingestion	TMDB export + API queue + optional Kaggle bulk
Continuous enrichment worker	Docker `worker` service; auto search-doc + embed
Hybrid search (FTS + trigram + pgvector + RRF)	Production code path
Multi-signal weighted ranking	Rating high/low, notoriety, obscurity, Bayesian shrinkage
Cutoff modes	none, top_n, threshold, elbow
Saved collections API	Preview, persist, resolve
Plex `in_library` sync	Library-first enrichment priority
Web dashboard	Search, scoring sliders, enrichment controls, ops
Self-hosted deployment	Docker Compose; portable `data/` directory
Local embeddings	No cloud inference dependency
Model registry	Switch embedding model with versioned vectors

In progress / partial

Capability	Notes
Full catalog enrichment	~1.2M API passes take days; many rows may be spine-only until enriched
Poster local cache	Pipeline exists; coverage grows with enrichment
Cross-encoder rerank	Flag in API (`use_rerank`); not wired in search path yet

Roadmap (high value, not yet built)

Capability	Business impact
LLM automatic curation layer	Makes product usable by non-experts; core GTM unlock
Second embedding space (cultural/context)	Unlocks “so bad it’s good,” cult, production-history queries
Conversational refinement	Retention and differentiation vs static search
Jellyfin integration	Same market as Plex plugin
Cross-encoder rerank (complete the `use_rerank` wiring)	Precision boost for top results
Query → filter LLM parser	Natural language to SQL filters (year, genre)
Static collection export	Push lists to Plex/Kodi/CSV acquisition workflows
IMDb supplemental import	Richer ratings crosswalk (license-dependent)

Deployment and data ownership

Self-hosted first

Typical deployment: one server (homelab Mac mini, NAS VM, small cloud VM).
All data under MOVIE_INDEX_DATA_DIR (default ./data): Postgres files, exports, models, posters, backups.
Moving hosts: copy repo + .env + entire data/ tree — no re-embedding required.

Why self-hosted matters commercially

Stakeholder	Benefit
Power users / Plex community	Data stays on LAN; no upload of library titles to a third party
Enterprise pilots	Air-gapped or VPC deployment possible
API business	Offer hosted API or on-prem license from the same codebase

Hosted SaaS (future)

A hosted tier is compatible with the architecture but is a go-to-market choice, not a technical requirement. TMDB attribution and API terms must be reflected in any public UI.

Competitive context

Alternative	Strength	Gap this system fills
TMDB / IMDb search	Complete metadata, trusted IDs	Weak subjective/vibe search; no personal library semantics
JustWatch, Reelgood	Streaming availability	Not built for obscure/cult/long-tail curation
Letterboxd	Community taste graph	Requires social graph and manual list culture
Plex Discover	Convenience inside Plex	Limited semantic discovery; no cross-catalog “describe vibe”
General RAG over Wikipedia	Flexible	Expensive, inconsistent, no structured rating/obscurity controls
Pinecone + raw TMDB embeddings DIY	Custom	Months of pipeline work; no Bayesian/cult ranking logic

Moat hypothesis: The combination of million-title constructed catalog, hybrid retrieval, intent-aware multi-signal ranking, library overlay, and (when shipped) LLM tuning is harder to replicate than any single component alone.

Risks, dependencies, and compliance

Area	Consideration
TMDB	API key required; rate limits; attribution on public surfaces; terms restrict how data is exposed
Enrichment cost	Time and API quota to detail-enrich full catalog
Compute	Initial embed of 1M+ titles is GPU/time-intensive (one-time + re-embed on model change)
Plex	Token security; read-only library access; matching unmatched titles
Copyright	System indexes metadata only; does not host films
Model licenses	Qwen/BGE/nomic terms for redistribution if selling embedding datasets
IMDb datasets	Non-commercial restrictions if used
Product risk	Without LLM auto-tuning, the engine’s power exceeds what a mainstream UX can expose

Recommended near-term product

Strongest near-term wedge: a simple interface with a single prompt (“Describe the kind of movie you want.”) that returns a curated list and then supports conversational refinement.

Layer	Role
UX	Natural language in; ranked posters + explanations out
Auto-curation (build)	LLM maps utterances → `ScoringConfig` + filters + cutoff
Engine (exists)	Hybrid search + Plex overlay + collections
Power user mode (exists)	Advanced sliders in dashboard for tuning and evaluation

Secondary wedge: Plex collection generator for the homelab community — passionate users, clear pain, willingness to pay for tooling.

Strategic summary

Dimension	Statement
What it is	Semantic curation engine over a private million-movie knowledge base
What it is not	A streaming recommender or a basic TMDB clone
Core insight	Discovery is multi-objective; intent changes the right ranking strategy
Technical foundation	PostgreSQL + pgvector + hybrid RRF + local embeddings + Bayesian rating signals
Product unlock	LLM layer that translates human language into retrieval plans
Business paths	Consumer discovery, Plex/Jellyfin collections, metadata API, embedding datasets, editorial automation
Deployment advantage	Self-hosted, query-time offline, library-aware, explainable scores

The engine is intentionally complex so the user experience can be simple. The business opportunity is to productize that translation layer — turning a powerful retrieval laboratory into a film discovery assistant that traditional search and recommendation stacks do not provide.

Appendix: query → strategy mapping

User query	Filters	Weight emphasis	Cutoff
Lonely neon cities at night	—	Semantic ↑, obscurity moderate	Elbow
Best 80s comedies	years 1980–1989, genre Comedy	rating_high ↑, notoriety ↑	top_n
So-bad-it’s-good 80s horror	years 1980–1989, genre Horror, min votes	rating_low ↑, notoriety ↑	Elbow
Chupacabra movies	—	Lexical ↑ (rare exact term)	Threshold or small top_n
Hidden gems	—	rating_high ↑, obscurity ↑	Elbow
Cult sci-fi deep cuts	genre Sci-Fi	semantic ↑, notoriety moderate	Elbow
Hidden gems in my Plex	`in_library: true`	rating_high ↑, obscurity ↑	Elbow
Cult sci-fi I own	`in_library: true`, genre Sci-Fi	semantic ↑, notoriety moderate	Elbow
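
Until the LLM layer exists, the mapping above could be prototyped as a table of presets. This is a sketch only: the intent names, the numeric weights, and the `rating_high` / `rating_low` / `notoriety` key names are assumptions, while `limit`, `filters`, `weights`, `cutoff_mode`, and `cutoff_on` follow the example search request:

```python
import copy

STRATEGIES = {
    "mood": {
        "scoring": {"weights": {"semantic": 8, "rrf": 2, "obscurity": 3},
                    "cutoff_mode": "elbow", "cutoff_on": "semantic"},
    },
    "best_of": {
        "scoring": {"weights": {"rrf": 3, "rating_high": 8, "notoriety": 6},
                    "cutoff_mode": "top_n"},
    },
    "so_bad_its_good": {
        "scoring": {"weights": {"semantic": 4, "rating_low": 8, "notoriety": 6},
                    "cutoff_mode": "elbow", "cutoff_on": "semantic"},
    },
    "hidden_gems": {
        "scoring": {"weights": {"semantic": 5, "rating_high": 8, "obscurity": 8},
                    "cutoff_mode": "elbow", "cutoff_on": "semantic"},
    },
}


def plan_for(intent: str, query: str, filters=None,
             in_library: bool = False, limit: int = 50) -> dict:
    """Turn an intent label plus a query into a /search payload."""
    plan = copy.deepcopy(STRATEGIES[intent])  # never mutate the presets
    plan["query"] = query
    plan["limit"] = limit
    plan["filters"] = dict(filters or {})
    if in_library:
        plan["filters"]["in_library"] = True
    return plan


plan = plan_for("hidden_gems", "forgotten films that are actually good",
                in_library=True)
print(plan["filters"], plan["scoring"]["cutoff_mode"])
```

An LLM layer would replace the fixed intent label with a model that chooses weights and filters per utterance. The output contract stays the same.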