AI-Powered Failure Diagnosis: How It Works
Traditional cron monitoring tells you that a job failed. JustRun tells you why it failed and how to fix it. This is the feature we are most excited about, and the one that required the most careful engineering to get right.
The diagnosis pipeline
When a job execution returns a non-2xx status code or times out, we trigger the diagnosis pipeline. It works in three stages:
- Context assembly. We gather the HTTP response status, response headers, response body (first 4KB), request configuration (URL, method, headers with secrets redacted), the last 5 successful executions for comparison, and the job's retry history.
- Pattern matching. Before calling the LLM, we run the failure through a set of deterministic rules. Common patterns like DNS resolution failures, SSL certificate errors, 429 rate limits, and connection timeouts have known causes and fixes. If we match a pattern, we return the diagnosis instantly without using any AI quota.
- LLM analysis. For failures that do not match a known pattern, we send the assembled context to our language model with a carefully tuned system prompt. The prompt instructs the model to explain the failure in plain English, identify the most likely root cause, and suggest a concrete fix with example code or configuration changes.
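The pattern-matching stage can be sketched as an ordered rule table checked before any model call. This is a minimal illustration, not JustRun's actual implementation: the `FailureContext` fields, rule predicates, and diagnosis strings here are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FailureContext:
    status: Optional[int]   # HTTP status code, None on timeout
    error: str              # low-level error string, e.g. "ENOTFOUND"
    body_snippet: str       # first 4KB of the response body

@dataclass
class Diagnosis:
    summary: str
    cause: str
    fix: str

# Each rule is (predicate, diagnosis); the first match wins.
RULES: list[tuple[Callable[[FailureContext], bool], Diagnosis]] = [
    (lambda c: "ENOTFOUND" in c.error,
     Diagnosis("DNS resolution failed",
               "The job's hostname did not resolve.",
               "Check the URL for typos and verify the DNS record exists.")),
    (lambda c: c.status == 429,
     Diagnosis("Rate limited",
               "The endpoint returned 429 Too Many Requests.",
               "Reduce the job's frequency or add backoff, and check the "
               "Retry-After response header for a suggested wait.")),
]

def match_pattern(ctx: FailureContext) -> Optional[Diagnosis]:
    """Stage 2: return a diagnosis instantly if a known pattern matches."""
    for predicate, diagnosis in RULES:
        if predicate(ctx):
            return diagnosis
    return None  # no match: fall through to stage 3, LLM analysis
```

Because the rules are plain predicates over the assembled context, new patterns can be added without touching the LLM path at all.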
Keeping it fast and cheap
Diagnosis needs to be fast. If a developer is woken up at 3am by a failure alert, they do not want to wait 30 seconds for an AI analysis. Our target is under 2 seconds from failure detection to diagnosis delivery.
The deterministic pattern matcher handles roughly 60% of all failures with negligible latency and no AI cost. For the remaining 40%, we use a small, fast model (not GPT-4 class) with a constrained output format. The response is structured JSON with three fields: summary, cause, and fix. This keeps token usage low and response times under 1.5 seconds.
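A constrained output format still needs validation, since a model can occasionally emit malformed JSON or drop a field. A minimal sketch of that guard, assuming the three-field schema described above (the function name and error handling are illustrative):

```python
import json

REQUIRED_FIELDS = ("summary", "cause", "fix")

def parse_diagnosis(raw: str) -> dict:
    """Validate the model's structured output: JSON containing the
    three expected string fields, with anything extra discarded."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = [f for f in REQUIRED_FIELDS if not isinstance(data.get(f), str)]
    if missing:
        raise ValueError(f"malformed diagnosis, missing fields: {missing}")
    return {f: data[f] for f in REQUIRED_FIELDS}
```

On a validation failure the caller can retry the model call once or fall back to a generic alert, so a bad generation never blocks the notification.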
What the output looks like
Here is a real example from our beta testing. A job targeting a WordPress REST API endpoint started failing with a 401 status code after working fine for two weeks:
{
"summary": "Authentication rejected — API key expired",
"cause": "The endpoint returned 401 Unauthorized. The response body contains 'Application password expired.' WordPress application passwords can be set to expire, and this key was created with a 14-day TTL.",
"fix": "Generate a new application password in WordPress → Users → Your Profile → Application Passwords. Update the Authorization header in your JustRun job settings with the new credentials."
}
Without AI diagnosis, this developer would have needed to check their application logs, realize the issue was authentication-related, look up WordPress application password expiry rules, and figure out how to rotate the key. With JustRun, they get the answer in their alert notification.
Privacy and security
We never send full request or response bodies to the LLM. Headers are redacted to remove authorization tokens, API keys, and cookies. The response body is truncated to 4KB and stripped of anything that looks like a secret using regex pattern matching. All diagnosis data is encrypted at rest and automatically deleted after 30 days.
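The redaction step can be sketched as a header denylist plus regex scrubbing over the truncated body. The specific header names and patterns below are illustrative assumptions, not the production rule set:

```python
import re

SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key",
                     "proxy-authorization"}

# Heuristic patterns for secret-like strings in the body (illustrative).
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|secret|password)\s*[=:]\s*\S+"),
    re.compile(r"\bBearer\s+[A-Za-z0-9._-]+"),
]

def redact(headers: dict[str, str], body: str,
           limit: int = 4096) -> tuple[dict[str, str], str]:
    """Redact sensitive headers and scrub secret-like strings from the
    truncated body before anything reaches the LLM."""
    safe_headers = {
        k: ("[REDACTED]" if k.lower() in SENSITIVE_HEADERS else v)
        for k, v in headers.items()
    }
    snippet = body[:limit]  # truncate to 4KB first
    for pattern in SECRET_PATTERNS:
        snippet = pattern.sub("[REDACTED]", snippet)
    return safe_headers, snippet
```

Regex scrubbing is best-effort by nature, which is why the header denylist and the 4KB truncation run unconditionally rather than relying on pattern matching alone.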
For users on the Scale plan who need stricter data controls, we offer the option to disable AI diagnosis entirely and rely only on the deterministic pattern matcher.