Troubleshooting

This is the symptom-first companion to the conceptual footguns guide. When something is misbehaving at 3am, start here: each pattern is indexed by what you observe, then walks Symptom → Cause → Fix → Verify using only shipped, supported seams.

The tone is deliberately blame-free — these are situations the system can land in, not operator mistakes. Where a pattern overlaps a deeper concept, it links to footguns for the "why" rather than restating it, so this guide stays focused on the shortest path from symptom to a verified fix. Each "Verify" step names a public observable — a telemetry event, a Rulestead.Runtime.diagnostics/1 field, a %Rulestead.Result{} reason, or a %Rulestead.Error{} type — so you can confirm the fix landed without reaching into internals.

Flags evaluate but installation or migration seems incomplete

Symptom: A fresh integration raises %Rulestead.Error{type: :repo_not_configured} or %Rulestead.Error{type: :store_not_configured}, or evaluation returns nothing because the package tables are absent.

Cause: The installer step or the Ecto migration has not run yet, so the authoring schema the runtime reads from does not exist in this environment.

Fix: Run the installer, then apply the generated migrations with the same repo your host app uses:

mix rulestead.install
mix ecto.migrate

Package-owned tables land in the rulestead Postgres schema; no special search_path is needed. See the deployment recipe for ordering migrations inside a real deploy.

Verify: Re-run your evaluation path. A correctly configured install no longer raises %Rulestead.Error{type: :repo_not_configured} or %Rulestead.Error{type: :store_not_configured}, and the authoring tables are present for snapshot publication.

Evaluation returns an unexpected shape or rejects your arguments

Symptom: A call you expected to answer "is this on for this actor?" instead returns a full result payload, or raises because the arguments do not match — often after passing a string flag key where a payload or context was expected.

Cause: There are two distinct entry points. Rulestead.evaluate/3 is the payload-first evaluator (you already hold the flag payload); Rulestead.Runtime.enabled?/3 is the keyed runtime lookup for a Phoenix app backed by the snapshot cache and an environment key. Reaching for one while expecting the other's signature produces the mismatch.

Fix: For a keyed lookup against the live snapshot, use the runtime surface with (environment, flag_key, context):

{:ok, enabled?} =
  Rulestead.Runtime.enabled?(context.environment, "checkout-redesign", context)

Use Rulestead.evaluate/3 only when you already have the flag payload (tests, simulations). For the conceptual distinction and the exact anti-call to avoid, see footguns.

Verify: The keyed call returns the boolean tuple shape shown above; the payload call returns a %Rulestead.Result{}. Matching the right return shape to the call site confirms you are on the intended seam.
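As a rough illustration of matching return shapes to call sites, the sketch below classifies a value by shape alone. SeamCheck is a hypothetical helper, not part of Rulestead. It assumes the keyed call returns {:ok, boolean} and the payload call returns a bare %Rulestead.Result{} struct, as described above, and it matches on the __struct__ key so it compiles even without the library loaded.

```elixir
defmodule SeamCheck do
  # Hypothetical helper for tests and debugging; not a Rulestead API.

  # Keyed runtime lookup: a tagged boolean.
  def seam({:ok, enabled?}) when is_boolean(enabled?), do: :keyed_runtime

  # Payload-first evaluator: a %Rulestead.Result{} struct (matched by name only).
  def seam(%{__struct__: Rulestead.Result}), do: :payload_evaluator

  # Anything else means the call site is not on either seam described here.
  def seam(_other), do: :unknown
end
```

Piping the return value of a suspect call site through SeamCheck.seam/1 in a test makes it obvious which entry point that code path is actually using.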

Evaluation looks empty or stale right after a node boots

Symptom: Immediately after a deploy or node restart, evaluations return defaults or raise %Rulestead.Error{type: :snapshot_not_found}, then recover on their own a moment later.

Cause: The node booted before its snapshot was populated or refreshed. This is the expected degraded-mode window, not a failure — the runtime is designed to tolerate startup-order imperfections and serve last-known-good or defaults until refresh completes.

Fix: Lean on supervision and refresh ordering rather than request-time fallbacks. Follow the deployment recipe's degraded-mode expectations: assume a node may boot before the store is reachable and may briefly serve defaults, and observe refresh health explicitly in ops tooling instead of switching to ad-hoc SQL lookups during the window. For why an empty or stale snapshot can mislead, see footguns.

Verify: Call Rulestead.Runtime.diagnostics/1 and inspect the infrastructure_health and environments fields — once refresh has landed, the environment reports healthy and %Rulestead.Error{type: :snapshot_not_found} stops appearing.
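For ops tooling or a post-deploy check, the verify step can be sketched as a bounded poll. BootWait is a hypothetical helper, not part of Rulestead. It takes the diagnostics call and the health predicate as functions, because this guide does not specify the shape of the infrastructure_health and environments fields; what counts as "healthy" is an assumption the caller supplies.

```elixir
defmodule BootWait do
  # Hypothetical polling helper; not a Rulestead API.
  # fetch:    zero-arity function returning the current diagnostics
  # healthy?: one-arity predicate deciding whether refresh has landed
  def await(fetch, healthy?, attempts \\ 10, interval_ms \\ 200)

  # Out of attempts: report that the degraded window has not closed yet.
  def await(_fetch, _healthy?, 0, _interval_ms), do: {:error, :still_degraded}

  def await(fetch, healthy?, attempts, interval_ms) do
    diagnostics = fetch.()

    if healthy?.(diagnostics) do
      {:ok, diagnostics}
    else
      Process.sleep(interval_ms)
      await(fetch, healthy?, attempts - 1, interval_ms)
    end
  end
end
```

In practice, fetch would wrap a call to Rulestead.Runtime.diagnostics/1 and healthy? would inspect the fields named above. This belongs in deploy or ops checks, not in the request path.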

Rollouts flip per request or report a missing targeting key

Symptom: The same actor bounces between variants across requests, or a result carries reason: :targeting_key_missing.

Cause: Context is not being propagated end-to-end, so a stable targeting_key never reaches evaluation. Percentage and variant rollouts hash on the targeting key; without a durable one, bucketing is unstable.

Fix: Build and forward %Rulestead.Context{} explicitly at each boundary using the supported propagation seams: Rulestead.Plug (or Rulestead.Phoenix.context_from_conn/2) in the request pipeline, Rulestead.LiveView.assign_flags/3 into LiveView, and Rulestead.Oban.Middleware.attach/2 into background jobs. The context propagation recipe shows the full chain. For why a stable targeting_key matters, see footguns.

Verify: Inspect the %Rulestead.Context{} targeting_key at the evaluation site — once it is populated and stable per actor, results stop returning reason: :targeting_key_missing and bucketing holds steady across requests.
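To see why bucketing depends on a durable key, here is a minimal sketch of hash-based percentage bucketing. It is an illustration only: BucketSketch is hypothetical, and it uses :erlang.phash2/2 as a stand-in, not Rulestead's actual hashing scheme, which this guide does not describe.

```elixir
defmodule BucketSketch do
  # Illustrative only; Rulestead's real bucketing algorithm may differ.

  # Map a {flag, actor} pair onto a bucket in 0..99.
  # The same inputs always produce the same bucket.
  def bucket(flag_key, targeting_key) do
    :erlang.phash2({flag_key, targeting_key}, 100)
  end

  # An actor is in the rollout when their bucket falls below the percentage.
  def in_rollout?(flag_key, targeting_key, percentage) do
    bucket(flag_key, targeting_key) < percentage
  end
end
```

Because the bucket is a pure function of the key, an actor whose key changes per request (or is absent) can land in a different bucket each time, which is the flipping behavior described in the symptom.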

Admin actions return 403 / unauthorized

Symptom: An operator action fails with %Rulestead.Error{type: :unauthorized} (domain :auth, carrying a plug_status).

Cause: The host actor does not map to a Rulestead operator role with the action they attempted. Authorization is the host's responsibility — Rulestead does not ship a bundled auth stack; it maps host actors onto the canonical operator role model and the specific workflow actions allowed.

Fix: Resolve the action against the policy surface in your can?/4 implementation, and compare the attempted action to the role catalogs Rulestead.Admin.Policy.viewer_actions/0, Rulestead.Admin.Policy.editor_actions/0, Rulestead.Admin.Policy.admin_actions/0, and Rulestead.Admin.Policy.governance_actions/0. Grant the actor a role whose catalog includes the action, or adjust the host mapping so Rulestead.Admin.Policy.can?/4 returns an allow for that actor. Authorization decisions remain host-owned (see the product boundary).

Verify: Re-attempt the action with a correctly mapped actor; an authorized path no longer surfaces %Rulestead.Error{type: :unauthorized}, and the action's plug_status reflects success rather than a denied request.
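When comparing an attempted action against the role catalogs, a small lookup can help. RoleLookup below is a hypothetical helper, not part of Rulestead. It assumes each catalog function returns a list of action atoms, which this guide does not confirm, and the action names in the comment are made up for illustration.

```elixir
defmodule RoleLookup do
  # Hypothetical helper; not a Rulestead API.
  # catalogs: keyword list of role => list of allowed actions, for example
  #   [viewer: Rulestead.Admin.Policy.viewer_actions(),
  #    editor: Rulestead.Admin.Policy.editor_actions(), ...]
  # (assumes each catalog function returns a list of actions)
  def roles_allowing(catalogs, action) do
    # Keep every role whose catalog contains the attempted action,
    # preserving the order in which roles were listed.
    for {role, actions} <- catalogs, action in actions, do: role
  end
end
```

An empty result suggests the action is not in any catalog you passed in; a non-empty one lists the roles the host mapping could grant.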

A mutation is blocked pending a change request

Symptom: A mutation that normally applies is held back, and an operator reports it cannot proceed even though they appear authorized.

Cause: Governance policy requires a change request for this mutation in this environment. The Rulestead.Admin.Policy.change_request_required?/4 callback returned true, so the apply is gated behind review rather than landing directly.

Fix: Route the mutation through your governance flow: confirm whether Rulestead.Admin.Policy.change_request_required?/4 is intended to gate this environment, and have the change reviewed and applied through that governed path instead of expecting a direct write. The gating is policy-driven, so adjust the policy if a given environment should not require review — otherwise treat the block as the review step working as designed.

Verify: Observe the [:rulestead, :admin, :mutation, :stop] telemetry event for the attempted mutation — a change-request-gated attempt surfaces as a blocked-mutation outcome on that event, which confirms the governance path (not an error) is what stopped the write.
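One way to observe that event in a console or test is a small handler attached with :telemetry.attach/4. MutationProbe and its handler id are hypothetical names. The sketch forwards measurements and metadata untouched, because the exact metadata keys that mark a blocked mutation are not listed in this guide.

```elixir
defmodule MutationProbe do
  # Hypothetical probe; only the event name comes from this guide.
  @event [:rulestead, :admin, :mutation, :stop]

  # Attach the handler; events are forwarded to pid (the caller by default).
  # Requires the :telemetry application to be available at runtime.
  def attach(pid \\ self()) do
    :telemetry.attach("mutation-probe", @event, &__MODULE__.handle_event/4, pid)
  end

  # Standard telemetry handler arity: event, measurements, metadata, config.
  def handle_event(@event, measurements, metadata, pid) do
    send(pid, {:mutation_stop, measurements, metadata})
    :ok
  end
end
```

After calling MutationProbe.attach(), re-attempt the mutation and inspect the received message to see how the blocked outcome is reported in your version.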

OpenFeature reads look stale after a Redis-backed change

Symptom: Consumers reading through the open_feature_rulestead provider see an outdated value for a short window after a change, then converge.

Cause: The runtime served a cached snapshot whose freshness lags the latest authored state — an expected cache window, surfaced rather than hidden. The provider is only the consumer boundary; the freshness behavior belongs to the runtime cache.

Fix: Treat this as a snapshot-freshness window. Refresh the cache with the latest runtime snapshots; the package ships a dedicated Mix task, mix rulestead.redis.sync, for this purpose. After the refresh, consumers reading through open_feature_rulestead see the current value. For why readiness and snapshot timing matter, see footguns.

Verify: Watch the public cache telemetry events — [:rulestead, :runtime, :cache, :stale_used] firing indicates a stale read served, while [:rulestead, :runtime, :cache, :miss] and [:rulestead, :runtime, :cache, :refresh] show the refresh cycle. Once Result.cache_age_ms drops back to a fresh value, the staleness window has closed.
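The three cache events can be watched together with :telemetry.attach_many/4. CacheWatch below is a hypothetical sketch: the handler id and message shape are assumptions, and only the event names come from this guide.

```elixir
defmodule CacheWatch do
  # Hypothetical watcher; only the event names come from this guide.
  @events [
    [:rulestead, :runtime, :cache, :stale_used],
    [:rulestead, :runtime, :cache, :miss],
    [:rulestead, :runtime, :cache, :refresh]
  ]

  # Attach one handler to all three events; requires :telemetry at runtime.
  def attach(pid \\ self()) do
    :telemetry.attach_many("cache-watch", @events, &__MODULE__.handle_event/4, pid)
  end

  # Forward only the last segment of the event name (:stale_used, :miss, :refresh).
  def handle_event([:rulestead, :runtime, :cache, kind], _measurements, _metadata, pid) do
    send(pid, {:cache_event, kind})
    :ok
  end

  # Drain pending {:cache_event, kind} messages from the caller's mailbox
  # into a map of counts per kind, without blocking.
  def drain(counts \\ %{}) do
    receive do
      {:cache_event, kind} -> drain(Map.update(counts, kind, 1, &(&1 + 1)))
    after
      0 -> counts
    end
  end
end
```

Calling CacheWatch.drain() after a change gives a quick tally; a run of :stale_used followed by a :refresh is consistent with the window opening and closing as described.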

Where to go next

If none of the seven patterns above matches your symptom, the conceptual footguns guide explains the design choices behind targeting keys, rule order, snapshot readiness, and preview semantics; most surprises trace back to one of those. For propagation specifics, see the context propagation recipe; for boot-order and refresh posture, see the deployment recipe. When you do need to escalate, the public telemetry contract plus Rulestead.Runtime.diagnostics/1 give you an auditable picture of runtime health without scraping internal process or table names.
