BUG: Plugin Daemon Signals Readiness Before Plugins Complete Startup in Kubernetes Deployments #598

@NieRonghua

Description

Self Checks

To make sure we get to you in time, please check the following :)

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please submit issues in English, or they will be closed. Thank you! :)
  • "Please do not modify this template :) and fill in all the required fields."

Versions

  1. dify-plugin-daemon Version: 0.5.1
  2. dify-api Version: 1.11.1

Describe the bug

In Kubernetes environments, the plugin daemon has a race condition during startup:

  1. HTTP service starts immediately (0s) → readiness probe gets 200 ✅
  2. Plugins start asynchronously in background (1-600+ seconds) ⏳
  3. K8s Gateway detects ready status (~5s) → starts forwarding traffic 🚨
  4. Plugins still initializing or startup fails (~30-600s) → requests return errors ❌

Root Causes

| # | Issue | Impact |
|---|-------|--------|
| 1 | No plugin state awareness | Readiness probe only checks the HTTP service, not plugin readiness |
| 2 | Time desynchronization | HTTP starts fast (1-5s); plugins start slowly (30-600s) and asynchronously |
| 3 | Poor observability | Cannot see plugin startup progress or failure reasons |
| 4 | Hardcoded retry limit | 15 retries is fixed in code, while startup can take 600+ seconds |

Current Behavior

0s      : HTTP server running ✅
         readiness probe → 200 OK (but plugins not ready!)

~5s     : K8s Gateway traffic begins flowing
         Requests fail if plugins still loading

~30-600s: Plugins finally ready (or failed)
         Too late! Traffic already flowing
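
The fix the timeline implies can be sketched as gating the readiness endpoint on a startup-complete flag. This is a minimal illustration, not the daemon's actual code: the names `startupDone`, `readinessStatus`, and the `/health/ready` path are assumptions.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// startupDone flips to true once every plugin has either started
// successfully or exhausted its retries. Hypothetical name; the
// daemon's real state tracking will differ.
var startupDone atomic.Bool

// readinessStatus returns the HTTP status the readiness probe should see:
// 503 while plugin startup attempts are still in flight, 200 afterwards.
func readinessStatus() int {
	if !startupDone.Load() {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// readinessHandler exposes readinessStatus over HTTP for the probe.
func readinessHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(readinessStatus())
}

func main() {
	http.HandleFunc("/health/ready", readinessHandler)
	// http.ListenAndServe(":5003", nil) // serving omitted so the sketch runs standalone
}
```

With this gate, the K8s Gateway would only start forwarding traffic after the flag flips, instead of at the ~5s mark in the timeline above.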

Business Constraints

  • Private cloud deployment - no manual intervention possible
  • Partial availability acceptable - service should start even if some plugins fail
  • Fast startup required - K8s gateway needs sufficient time to detect true ready state

To Reproduce

Steps to reproduce the behavior in Kubernetes:

  1. Deploy dify-plugin-daemon to Kubernetes with current readiness probe pointing to /health/check
  2. Observe initial readiness probe response time (returns 200 immediately)
  3. Wait for plugins to fully load (10-600+ seconds depending on plugin count/type)
  4. Monitor traffic during this window:
    • Traffic arrives before plugins fully loaded
    • Requests fail with "plugin not ready" or timeout errors
  5. Check logs to see retry attempts are hardcoded at 15 times

Expected behavior

  • ✅ readiness probe returns 503 while plugins are still starting
  • ✅ readiness probe returns 200 only after all plugin startup attempts complete (success or failure after max retries)
  • ✅ Configurable retry limit instead of hardcoded 15
  • ✅ Visible plugin startup state and error information
  • ✅ Startup time reduced from 600s to ~225s (63% improvement)

Screenshots
The Kubernetes pod health check has passed, and the pod has started accepting API requests forwarded by the gateway.

[Screenshot: pod readiness probe passing]

However, plugins running in local mode are still starting up, so requests that invoke them fail.

[Screenshot: plugin invocation failing during startup]

Additional context
Dify is deployed via Kubernetes.

Labels: bug (Something isn't working)