From 254c0d1557658c9db0e0f57a71b5f986904107e3 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 5 Jun 2025 10:41:00 -0700 Subject: [PATCH 01/19] Initial commit, very alpha design. Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes | 310 ++++++++++++++++++ 1 file changed, 310 insertions(+) create mode 100644 architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes new file mode 100644 index 00000000..26613e0d --- /dev/null +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes @@ -0,0 +1,310 @@ +## Proposal: Integrating Kùzu for Radius Application Graph + +**Version:** 1.0 +**Date:** June 4, 2025 +**Author:** Sylvain Niles + +--- + +### Overview + +Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. Currently Radius relies on Kubernetes, where these graph structures are stored using Kubernetes Custom Resource Definitions (CRDs) and the Kubernetes API server as a backing store. As we are adding support for nested recipes the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. + +This proposal outlines a plan to modify Project Radius to utilize Kùzu, an embedded, high-performance graph database, as the primary store for its application graph. This change aims to decouple the core graph logic and operations from the underlying Kubernetes infrastructure, enabling more powerful graph queries, potentially improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. 
This will also pave the way for Radius to operate more seamlessly in environments beyond Kubernetes or where direct Kubernetes API access for graph storage is not ideal. + +### Terms and Definitions + +* **Radius:** An open-source, cloud-native application platform that helps developers build, deploy, and manage applications across various environments. +* **Application Graph:** A representation of an application's components (resources, services, environments, etc.) as nodes and their interconnections as edges, including metadata on both. +* **Kubernetes (K8s):** An open-source system for automating deployment, scaling, and management of containerized applications. +* **CRDs (Custom Resource Definitions):** Extensions of the Kubernetes API that allow users to create custom resource types. Currently a common way Radius stores graph data in K8s. +* **Kùzu:** An embeddable, transactional, high-performance graph database management system (GDBMS) supporting the Cypher query language and property graphs. +* **Node:** An entity in a graph (e.g., a Radius resource, an environment). +* **Edge:** A relationship between two nodes in a graph (e.g., "connectsTo", "runsIn"). +* **Property:** Key-value pairs associated with nodes or edges, storing metadata. +* **Cypher:** A declarative graph query language. +* **RP (Resource Provider):** A component in Radius responsible for managing a specific type of resource. + +### Objectives + +1. **Decouple Graph Storage:** Abstract the application graph storage from Kubernetes CRDs, allowing Radius to use a dedicated graph database. +2. **Enhance Query Capabilities:** Leverage Kùzu's Cypher query language for more complex and efficient graph traversals and relationship analysis than what is easily achievable with Kubernetes API queries. +3. **Improve Performance:** Potentially improve the performance of graph read and write operations, especially for large or complex application graphs. +4. 
**Increase Portability:** Facilitate the use of Radius in non-Kubernetes environments or scenarios where a dedicated graph store is preferred. +5. **Maintain Existing Functionality:** Ensure that all existing Radius features that rely on the application graph continue to function correctly with Kùzu as the backend. + +### Issue Reference: + +* `radius-project/radius#` (To be created) + +### Goals + +* First step in allowing the Radius control plane to run outside Kubernetes. +* Integrate Kùzu as an embedded graph database within relevant Radius components (e.g., Core RP, Deployment Engine). +* Define a clear schema for the Radius application graph within Kùzu. +* Migrate existing graph data representation (currently in K8s CRDs) to the Kùzu data model. +* Implement a data access layer (DAL) service that abstracts Kùzu operations (CRUD, queries) for other Radius components. +* Update Radius RPs and controllers to use the new Kùzu-backed DAL for all graph-related operations. +* Provide mechanisms for backup and potential restore of the Kùzu database as part of Radius install/upgrade/rollback operations. +* Develop a comprehensive test suite covering graph operations with Kùzu. + +### Non-goals + +* Re-architecting more of Radius to eliminate Kubernetes as a dependency beyond the graph storage and access layer. +* Providing a distributed Kùzu cluster as part of this initial integration (Kùzu is primarily embedded; clustering would be a separate, future consideration if needed). +* Exposing direct Kùzu Cypher query capabilities to end-users of Radius (interaction should remain through Radius APIs and abstractions). +* Replacing all uses of the Kubernetes API server by Radius, only those related to storing and querying the core application graph structure. +* Migrating existing Radius clusters to the new graph, a fresh install is required. 
+ +### User Scenarios (optional) + +* **Scenario 1 (Developer):** A Radius developer needs to implement a new feature that requires finding all resources connected to a specific environment that also have a particular tag. Using Cypher through the Kùzu DAL would be more expressive and potentially faster than multiple K8s API calls and client-side filtering. This is currently a sticking point with nested Recipes. +* **Scenario 2 (Operator):** An operator managing a large Radius deployment experiences performance degradation when listing complex application relationships. Moving to Kùzu could alleviate these bottlenecks. +* **Scenario 3 (Platform Engineer):** A platform engineer wants to run Radius without Kubernetes, this is one of the decoupling features required. + +### User Experience (if applicable) + +* **End-users (Application Developers using Radius CLI/APIs):** The change should be largely transparent. Existing commands and APIs for managing applications and resources should continue to work. Performance improvements might be noticeable. +* **Radius Developers/Contributors:** Will need to learn how to interact with the new Kùzu DAL and potentially understand the Kùzu data model and Cypher for advanced debugging or development. +* **Operators:** Will need to be aware of the new Kùzu database component for backup, monitoring, and troubleshooting purposes. The operational burden of managing K8s CRDs for graph data would be shifted. 
+ +### Sample Input/Output: + +* **Sample Input (Conceptual Cypher Query via DAL):** + ``` + // Find all 'applications' that 'contain' a 'container' resource with 'image' = 'nginx' + MATCH (app:Application)-[:CONTAINS]->(res:Resource {type: 'Applications.Core/container', properties_image: 'nginx'}) + RETURN app.name, res.name + ``` +* **Sample Output (from DAL):** + ```json + [ + { "appName": "myWebApp", "resourceName": "frontendContainer" }, + { "appName": "myService", "resourceName": "workerContainer" } + ] + ``` + +### Design + +#### High-Level Design + +1. **Introduce Kùzu:** Kùzu will be integrated as an embedded library within the primary Radius component(s) responsible for managing the application graph (e.g., the Radius Core RP or a dedicated graph service). +2. **Graph Abstraction Layer:** A new internal service or Data Access Layer (DAL) will be created. This layer will provide an API for all graph operations (e.g., `listDeployments`, `listResources`, and allow new operations like `showDependencies(resource)`). Internally, this DAL will translate these requests into Kùzu operations (Cypher queries, API calls to Kùzu's Go driver). +3. **Schema Definition:** A formal schema for Radius entities (Applications, Environments, Resources, etc. as nodes) and their relationships (as edges with types and properties) will be defined and enforced in Kùzu. +4. **Data Synchronization/Migration:** No migration is planned. +5. **Component Updates:** All Radius components that currently interact with Kubernetes CRDs for graph information will be updated to use the new Graph Abstraction Layer. + +#### Architecture Diagram + +```mermaid +graph TD + A["Radius CLI/API
(User Interactions)"] + B["Radius Core Components<br/>(Control Plane, RPs)"] + C["Recipe Execution<br/>(For non-graph<br/>resource mgmt)"] + D["Graph Abstraction Layer<br/>(DAL)"] + E["Kùzu Embedded DB container
(kuzu.db file on disk allowing backup/restore)"] + + A <--> B + B <--> C + B <-->|"Uses (for graph ops)"| D + D <-->|"Interacts with"| E +``` + +* **Current (Simplified):** Radius Core Components <-> Kubernetes API Server (for CRD-based graph) +* **Proposed:** Radius Core Components <-> Graph Abstraction Layer <-> Kùzu Embedded DB + +#### Detailed Design + +1. **Kùzu Integration:** + * The Kùzu Go driver (`github.com/kuzudb/go-kuzu`) will be used. + * Kùzu database will be initialized during `rad init`. The database file (`radius_app_graph.kuzu`) will be stored on ephemeral or persistent storage accessible to the Radius control plane. (requiring setup of a volume for laptop installations seems excessive) + * Initially CoreRP will move CRUD operations to K8s planes to the DAL without using Kùzu. + * Once the DAL is released Kùzu support will be added in a pluggable way, allowing for future network graph db support vs the embedded solution. + +2. **Schema Management:** + * A Go module will define constants for node labels (e.g., `NodeTypeApplication`, `NodeTypeResource`) and edge labels (e.g., `EdgeTypeContains`, `EdgeTypeConnectsTo`). + * On startup, Radius will ensure the schema (node tables, relationship tables, property definitions) exists in Kùzu, creating or migrating it if necessary. + * Node properties will be strongly typed (string, int, bool, arrays, maps). Complex nested objects might need to be stored as JSON strings if Kùzu's direct support is limited, or flattened, but no current use case should require this. + +3. 
**Graph Abstraction Layer (DAL) API:** + * Example Go interface: + ```go + type GraphStore interface { + // Node operations + CreateNode(ctx context.Context, node Node) error + GetNode(ctx context.Context, nodeID string) (Node, error) + UpdateNodeProperties(ctx context.Context, nodeID string, properties map[string]interface{}) error + DeleteNode(ctx context.Context, nodeID string) error // Handle cascading deletes for owned relationships + + // Edge operations (connections) + CreateEdge(ctx context.Context, edge Edge) error + GetEdge(ctx context.Context, fromNodeID, toNodeID string, edgeType string) (Edge, error) // Or a unique edge ID + UpdateEdgeProperties(ctx context.Context, edgeID string, properties map[string]interface{}) error + DeleteEdge(ctx context.Context, edgeID string) error + + // Query operations + GetOutgoingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) + GetIncomingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) + FindPaths(ctx context.Context, startNodeID, endNodeID string, maxHops int) ([][]Node, error) // More complex queries + ExecuteCypherQuery(ctx context.Context, query string, params map[string]interface{}) ([]map[string]interface{}, error) // For advanced internal use + } + + type Node struct { + ID string + Type string // e.g., "Applications.Core/application" + Properties map[string]interface{} + } + + type Edge struct { + ID string // Optional, could be derived from both nodes + FromNodeID string + ToNodeID string + Type string // e.g., "Connection" + Properties map[string]interface{} + } + ``` + +4. **Data Persistence and State:** + * Kùzu runs embedded, so the Radius process managing it is authoritative. + * If a single DAL node in the cluster is unable to scale to the traffic of the cluster we would have to separate out write traffic to a single instance and reads could scale reasonably. 
+ * For this proposal, we assume a single active Radius DAL node is managing the Kùzu DB file for writes, with potential for read-only file replication for other instances if feasible and performant. + +5. **Transaction Management:** + * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a Kùzu transaction to ensure atomicity. The DAL will manage this. + * Radius upgrades and rollbacks would need to coordinate with the DAL. + +#### Advantages (of Kùzu over K8s CRDs for graph storage) + +* **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering K8s resources by labels/annotations. +* **Performance:** For complex graph traversals (multi-hop queries, pathfinding), Kùzu is likely to be much faster as it's optimized for such operations. For Radius this would be during most recipe execution as the entire graph is rendered. +* **Specialized Data Store:** Kùzu is purpose-built for graph data, leading to efficient storage and indexing for graph structures. +* **Decoupling from Kubernetes:** Reduces dependency on the K8s API server for core graph logic, improving portability and potentially reducing load on the K8s control plane for graph-heavy operations. +* **Not strongly tied to Kùzu:** The DAL will allow Radius users to use other Cypher-compatible graph databases, such as Neo4j or Postgres with the Apache AGE extension. (Cosmos DB's graph API uses Gremlin rather than Cypher, so it would need its own DAL backend.) +* **Transactional Guarantees:** Kùzu provides ACID transactions for graph operations. +* **Schema Enforcement:** Better ability to define and enforce a graph schema within Kùzu. + +#### Disadvantages (of Kùzu integration) + +* **New Dependency:** Introduces Kùzu as a new core dependency for Radius, including its Go driver. +* **Operational Overhead:** + * Managing the Kùzu database file (backups, storage). + * Monitoring Kùzu's performance and health. (LRT Cluster) + * Requires persistent volume for the Kùzu database file. 
(Same as Postgres for production usage) +* **Complexity:** Adds a new layer (DAL, Kùzu integration) to the Radius architecture. +* **Embedded Nature & HA:** Kùzu's primary mode is embedded. Achieving high availability for the Kùzu store in a distributed Radius control plane requires careful design (e.g., leader election for the writer, replication strategy for readers, or a future Kùzu server mode). This proposal initially focuses on a simpler embedded model. +* **Learning Curve:** Radius developers might need to learn Kùzu and Cypher. + +#### Proposed Option + +Integrate **Kùzu as an embedded graph database** managed by a new Radius service. A Graph Abstraction Layer (DAL) will be developed to mediate all graph operations. This approach balances the benefits of a dedicated graph DB with the relative simplicity of an embedded solution for the initial phase. + +### API design + +The primary API change will be internal, within the Graph Abstraction Layer (DAL) as described in "Detailed Design." External Radius APIs (e.g., `rad resource list`, `rad application graph`) should remain functionally the same, but their implementation will now call the DAL instead of directly querying Kubernetes CRDs. + +No changes to the public Radius REST API are anticipated initially, other than potential performance improvements or new (future) API endpoints that leverage advanced graph queries. + +### CLI Design + +* Existing `rad` CLI commands should continue to work transparently. + +### Implementation Details + + +#### Core RP (Resource Provider) + +* Core RP will use the DAL to manage the graph during deployment rendering and query the DAL for any app graph API requests. + +### Error Handling + +* The DAL will be responsible for translating Kùzu-specific errors into standardized Radius errors. +* Errors such as database connection issues, query failures, transaction rollbacks, or schema violations must be handled gracefully. 
+* Retry mechanisms for transient Kùzu errors will be implemented in the DAL. +* The DAL will integrate with the Radius OpenTelemetry implementation. + +### Test plan + +1. **Unit Tests:** + * Test individual functions within the Graph Abstraction Layer (mocking Kùzu Go driver). + * Test Kùzu schema creation and migration logic. +2. **Integration Tests:** + * Test the DAL against an actual embedded Kùzu instance. + * Verify CRUD operations for nodes and edges with various property types. + * Test complex Cypher queries through the DAL. + * Test transactional behavior. + * Test Core RP interacting with the Kùzu-backed DAL. +3. **End-to-End (E2E) Tests:** + * Adapt existing Radius E2E tests to ensure all application deployment and management scenarios function correctly with Kùzu as the graph backend. + * Include tests for data persistence across Radius restarts. + * Test migration from CRD store to Kùzu store (if applicable). +4. **Performance Tests:** + * Benchmark graph read/write operations with Kùzu against the current CRD-based implementation for representative workloads. + * Test concurrent access to the graph. + * Add checks to LRT Cluster for graph operations. +5. **Backup/Restore Tests:** + * Verify that Kùzu database backups can be successfully created and restored. + +### Security + +* **Data at Rest:** The Kùzu database file (`radius_graph.kuzu`) contains the application graph data. It should be protected by appropriate file system permissions on the persistent volume where it's stored. Encryption at rest for the volume should be considered, managed by the underlying infrastructure (e.g., Kubernetes PV encryption). +* **Access Control:** Access to Kùzu is through the embedded Go driver within the Radius process. Standard Radius authentication and authorization mechanisms will protect the Radius APIs that indirectly interact with Kùzu. There is no direct network exposure of Kùzu in the embedded model. 
+* **Input Sanitization:** If any user-provided data is used to construct Cypher queries (even if parameterized), ensure proper parameterization is always used by the DAL to prevent injection vulnerabilities (Kùzu's Go driver should support parameterized queries). +* **Threat Model:** The Radius threat model must be updated to have a section for the DAL. + +### Compatibility (optional) + +* **Backward Compatibility:** + * For existing Radius deployments using Kubernetes CRDs as the graph store, a migration path will be necessary. This could involve a period where Radius supports both backends, or a one-time migration tool. Currently we don't promise backward compatibility so no migration tool is planned. + * The public Radius API and CLI should remain backward compatible. +* **Data Format:** The structure of the application graph (apps, resources, properties) should remain conceptually the same, even though the storage backend changes. + +### Monitoring and Logging + +* **Logging:** + * The Graph Abstraction Layer should log all significant operations (e.g., Kùzu queries, errors, transaction boundaries) at appropriate log levels. +* **Metrics:** + * Expose metrics from the DAL in Radius OpenTelemetry: + * Number of Kùzu queries (per type: read/write). + * Latency of Kùzu queries. + * Error rates for Kùzu operations. + * Kùzu database size. + * Transaction commit/rollback counts. + +### Development plan + +0. **Phase 0: DAL (Milestone 0)** + * Create the DAL. + * Implement CRUD endpoints representing Radius abstraction level graph operations. + * Develop initial unit & integration tests for the DAL. + * Ensure via debug logging that no components are communicating with the planes CRD other than the DAL. +1. **Phase 1: Core Integration (Milestone 1)** + * Research Kùzu Go driver capabilities and limitations in detail. 
+ * Set up Kùzu as an embedded dependency, but with a pluggable architecture in which the embedded calls could be swapped for network calls to any Cypher-compatible graph database. + * Define and implement Kùzu schema creation. + * Implement robust error handling and transaction management in the DAL. + * Add Kùzu-specific tests (schema creation, backup/restore, etc.). + * Modify Radius init and upgrade processes to trigger appropriate behavior in the DAL. +2. **Phase 2: Tooling & Testing (Milestone 2)** + * Implement backup/restore CLI commands. + * Conduct comprehensive E2E testing, performance testing, and security review of DAL threat model. + * Develop documentation for operators and developers. +3. **Phase 3: Query Enhancement (Milestone 3 - optional)** + * Enhance DAL with more advanced query capabilities (pathfinding, complex traversals) to support new user stories defined by product. + +### Open Questions + +1. **Kùzu Performance under Concurrent Goroutines:** How well do Kùzu's Go driver and embedded database handle high concurrency from multiple goroutines within a single Radius process? Are there internal locking mechanisms in Kùzu to be aware of? +2. **Schema Evolution:** How will schema changes in Kùzu (e.g., adding new node/edge types, new properties) be managed over time with Radius updates? +3. **Kùzu Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of an embedded Kùzu instance for representative Radius graph sizes? + +### Alternatives considered + +1. **Continue using Kubernetes CRDs:** + * **Advantages:** Leverages existing Kubernetes infrastructure and expertise. No new database dependency. + * **Disadvantages:** Limited query capabilities, potential performance bottlenecks for complex graph operations, nested rendering logic very manual and complex, tightly coupled to Kubernetes. +2. 
**Other Embedded Graph Databases (e.g., a Go-native one if a mature one exists):** + * **Advantages:** Could offer tighter integration if fully Go-native. + * **Disadvantages:** Kùzu is chosen for its performance, Cypher support, and active development. A pure Go alternative might lack some of these mature features or performance characteristics. +3. **Hosted/Server-based Graph Databases (e.g., Neo4j, Dgraph as a service, NebulaGraph):** + * **Advantages:** Mature, feature-rich, often provide built-in clustering and HA. + * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), network latency between Radius and the DB, cost, and deviates from the goal of a more self-contained/embeddable solution for core graph logic. This proposal prioritizes decoupling and enhancing capabilities with an embedded solution first. \ No newline at end of file From ad9a35de31d18cadaf396982dd51dc14d2cd364a Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 5 Jun 2025 10:44:48 -0700 Subject: [PATCH 02/19] make markdown. 
Signed-off-by: Sylvain Niles --- ...graph-db-replace-planes => 2025-06-graph-db-replace-planes.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename architecture/2025-06-non-k8s-controlplane/{2025-06-graph-db-replace-planes => 2025-06-graph-db-replace-planes.md} (100%) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md similarity index 100% rename from architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes rename to architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md From c4bbd2e82780aae52d3a7e9529ea9c2e5a14072e Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 5 Jun 2025 12:31:59 -0700 Subject: [PATCH 03/19] Integrated Karishma and Nithya's feedback, expanded examples to specific radius scenarios Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 94 +++++++++++++++---- 1 file changed, 76 insertions(+), 18 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index 26613e0d..24a1dac1 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -8,9 +8,11 @@ ### Overview -Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. Currently Radius relies on Kubernetes, where these graph structures are stored using Kubernetes Custom Resource Definitions (CRDs) and the Kubernetes API server as a backing store. As we are adding support for nested recipes the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. 
+Project Radius defines and manages application graphs, which represent resources and their relationships within cloud-native applications. Radius currently relies on Kubernetes to install etcd, the default datastore, where these graph structures are stored as key-value pairs. As we add support for nested connections, the imperative Go code is becoming complex and brittle because it must implement basic graph traversal logic that etcd does not provide. Additionally, we are already hitting performance limits in test environments, which inspired the work @superbeeny has done to swap the key-value operations out to use Postgres. -This proposal outlines a plan to modify Project Radius to utilize Kùzu, an embedded, high-performance graph database, as the primary store for its application graph. This change aims to decouple the core graph logic and operations from the underlying Kubernetes infrastructure, enabling more powerful graph queries, potentially improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. This will also pave the way for Radius to operate more seamlessly in environments beyond Kubernetes or where direct Kubernetes API access for graph storage is not ideal. +This proposal outlines a plan to modify Project Radius to utilize Kùzu, an embedded, high-performance graph database, as the primary store for its application graph. This change aims to decouple the core graph logic and operations from the underlying Kubernetes infrastructure, enabling more powerful graph queries, potentially improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. One benefit to recipe authors is that they will be able to define nested connections between types, both to return complex relationships and properties to a recipe and to see these relationships in the graph/dashboard. 
This will also pave the way for Radius to operate more seamlessly in environments beyond Kubernetes or where direct Kubernetes API access for graph storage is not ideal. + +Delivering this will allow us to shift to a far better user experience where connections become a rich re-usable concept that shares data and exposes deep relationships currently obfuscated by monolithic types with embedded objects (database has a credentials object with username and password properties). ### Terms and Definitions @@ -42,45 +44,68 @@ This proposal outlines a plan to modify Project Radius to utilize Kùzu, an embe * First step in allowing the Radius control plane to run outside Kubernetes. * Integrate Kùzu as an embedded graph database within relevant Radius components (e.g., Core RP, Deployment Engine). * Define a clear schema for the Radius application graph within Kùzu. -* Migrate existing graph data representation (currently in K8s CRDs) to the Kùzu data model. -* Implement a data access layer (DAL) service that abstracts Kùzu operations (CRUD, queries) for other Radius components. +* Migrate existing graph data representation (currently in etcd/postgres) to the Kùzu data model. +* Implement a data access layer (DAL, name TBD) service that abstracts Kùzu operations (CRUD, queries) for other Radius components. * Update Radius RPs and controllers to use the new Kùzu-backed DAL for all graph-related operations. * Provide mechanisms for backup and potential restore of the Kùzu database as part of Radius install/upgrade/rollback operations. * Develop a comprehensive test suite covering graph operations with Kùzu. +* Build this in a way that Kùzu could be swapped out for a network accessible graph database such as Neo4J or Postgres with Apache AGE extension. + ### Non-goals * Re-architecting more of Radius to eliminate Kubernetes as a dependency beyond the graph storage and access layer. 
* Providing a distributed Kùzu cluster as part of this initial integration (Kùzu is primarily embedded; clustering would be a separate, future consideration if needed). * Exposing direct Kùzu Cypher query capabilities to end-users of Radius (interaction should remain through Radius APIs and abstractions). -* Replacing all uses of the Kubernetes API server by Radius, only those related to storing and querying the core application graph structure. -* Migrating existing Radius clusters to the new graph, a fresh install is required. +* Replacing all uses of Kubernetes CRDs by Radius. +* Migrating existing Radius clusters to the new graph; a fresh install is required (up for debate). ### User Scenarios (optional) -* **Scenario 1 (Developer):** A Radius developer needs to implement a new feature that requires finding all resources connected to a specific environment that also have a particular tag. Using Cypher through the Kùzu DAL would be more expressive and potentially faster than multiple K8s API calls and client-side filtering. This is currently a sticking point with nested Recipes. -* **Scenario 2 (Operator):** An operator managing a large Radius deployment experiences performance degradation when listing complex application relationships. Moving to Kùzu could alleviate these bottlenecks. -* **Scenario 3 (Platform Engineer):** A platform engineer wants to run Radius without Kubernetes, this is one of the decoupling features required. +* **Scenario 1 (Developer):** A Radius developer needs to address performance issues for application graphs containing many resources. Retrieving all resources connected to a specific environment is a very expensive operation. Using Cypher through the Kùzu DAL would be more expressive and potentially orders of magnitude faster than multiple etcd calls and client-side filtering. +* **Scenario 2 (Platform Engineer):** A platform engineer wants to run Radius without Kubernetes; this is one of the decoupling features required. 
### User Experience (if applicable) * **End-users (Application Developers using Radius CLI/APIs):** The change should be largely transparent. Existing commands and APIs for managing applications and resources should continue to work. Performance improvements might be noticeable. * **Radius Developers/Contributors:** Will need to learn how to interact with the new Kùzu DAL and potentially understand the Kùzu data model and Cypher for advanced debugging or development. -* **Operators:** Will need to be aware of the new Kùzu database component for backup, monitoring, and troubleshooting purposes. The operational burden of managing K8s CRDs for graph data would be shifted. +* **Operators:** Will need to be aware of the new Kùzu database component for backup, monitoring, and troubleshooting purposes. The operational burden of managing etcd for graph data would be shifted. ### Sample Input/Output: -* **Sample Input (Conceptual Cypher Query via DAL):** +* **Sample Input (Conceptual Cypher Query via DAL, replicating `rad app graph todo`):** ``` - // Find all 'applications' that 'contain' a 'container' resource with 'image' = 'nginx' - MATCH (app:Application)-[:CONTAINS]->(res:Resource {type: 'Applications.Core/container', properties_image: 'nginx'}) - RETURN app.name, res.name + // Show the full graph for the "todo" application + MATCH (app:Application {name: 'todo'})-[rel]->(res:Resource) + RETURN app.name AS application, type(rel) AS relationship, res.name AS resource, res.type AS resourceType ``` * **Sample Output (from DAL):** ```json [ - { "appName": "myWebApp", "resourceName": "frontendContainer" }, - { "appName": "myService", "resourceName": "workerContainer" } + { + "application": "todo", + "relationship": "CONTAINS", + "resource": "frontend", + "resourceType": "Applications.Core/container" + }, + { + "application": "todo", + "relationship": "CONTAINS", + "resource": "backend", + "resourceType": "Applications.Core/container" + }, + { + "application": "todo", + 
"relationship": "CONNECTS_TO", + "resource": "todo-db", + "resourceType": "Applications.Core/postgres" + }, + { + "application": "todo", + "relationship": "CONNECTS_TO", + "resource": "redis-cache", + "resourceType": "Applications.Core/redis" + } ] ``` @@ -118,7 +143,7 @@ graph TD 1. **Kùzu Integration:** * The Kùzu Go driver (`github.com/kuzudb/go-kuzu`) will be used. * Kùzu database will be initialized during `rad init`. The database file (`radius_app_graph.kuzu`) will be stored on ephemeral or persistent storage accessible to the Radius control plane. (requiring setup of a volume for laptop installations seems excessive) - * Initially CoreRP will move CRUD operations to K8s planes to the DAL without using Kùzu. + * Initially CoreRP will move CRUD operations using etcd to the DAL without using Kùzu. * Once the DAL is released Kùzu support will be added in a pluggable way, allowing for future network graph db support vs the embedded solution. 2. **Schema Management:** @@ -175,7 +200,7 @@ graph TD #### Advantages (of Kùzu over K8s CRDs for graph storage) -* **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering K8s resources by labels/annotations. +* **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering etcd values client side. * **Performance:** For complex graph traversals (multi-hop queries, pathfinding), Kùzu is likely to be much faster as it's optimized for such operations. For Radius this would be during most recipe execution as the entire graph is rendered. * **Specialized Data Store:** Kùzu is purpose-built for graph data, leading to efficient storage and indexing for graph structures. * **Decoupling from Kubernetes:** Reduces dependency on the K8s API server for core graph logic, improving portability and potentially reducing load on the K8s control plane for graph-heavy operations. 
@@ -183,6 +208,39 @@ graph TD * **Transactional Guarantees:** Kùzu provides ACID transactions for graph operations. * **Schema Enforcement:** Better ability to define and enforce a graph schema within Kùzu. +--- + +**Modeling Deep Relationships Instead of Monolithic Types** + +Currently, many Radius resource types (such as databases) are modeled as monolithic objects with embedded properties or sub-objects. For example, a database resource might have a `credentials` object, which itself contains `username` and `password` properties. This approach makes it difficult to express and traverse relationships between resources, and limits reusability and visibility in the application graph. + +With a graph database like Kùzu, these relationships can be modeled explicitly. Instead of embedding credentials as an object within the database resource, the database node can be connected to a separate `credentials` node (e.g., named `db_creds` of type `Credentials`). This credentials node can then be connected to two `secret` nodes representing the username and password. This approach enables: + +- **Reusability:** Credentials or secrets can be shared across multiple resources. +- **Visibility:** Relationships between resources, credentials, and secrets are explicit and queryable. +- **Extensibility:** New types of relationships or properties can be added without changing the monolithic resource schema. Future API versions could allow some properties to be private (not exposed by connection). + + +**Example Graph Structure:** +```mermaid +graph TD + db[Database] + creds[Credentials: db_creds] + user[Secret: username] + pass[Secret: password] + + db -- "CONNECTION" --> creds + creds -- "CONNECTION" --> user + creds -- "CONNECTION" --> pass +``` + +In this model: +- The `Database` node is connected to a `Credentials` node via a connection. +- The `Credentials` node is connected to two `Secret` nodes (for username and password) via connections. 
+ +This structure enables richer queries and recipe author use cases like `context.connected_resources.database.credentials.username` instead of only being able to access the embedded `credentials` object and requiring the recipe author to parse. +Additionally it provides a better separation of concerns, and a more flexible, maintainable application graph. + #### Disadvantages (of Kùzu integration) * **New Dependency:** Introduces Kùzu as a new core dependency for Radius, including its Go driver. From 654251225979734e352de21829172393246c8aef Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 5 Jun 2025 12:38:11 -0700 Subject: [PATCH 04/19] fixed CRD references with correct etcd references. Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 25 +++++++++---------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index 24a1dac1..0f7583e0 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -29,7 +29,7 @@ Delivering this will allow us to shift to a far better user experience where con ### Objectives -1. **Decouple Graph Storage:** Abstract the application graph storage from Kubernetes CRDs, allowing Radius to use a dedicated graph database. +1. **Decouple Graph Storage:** Abstract the application graph storage from etcd, allowing Radius to use a dedicated graph database. 2. **Enhance Query Capabilities:** Leverage Kùzu's Cypher query language for more complex and efficient graph traversals and relationship analysis than what is easily achievable with Kubernetes API queries. 3. **Improve Performance:** Potentially improve the performance of graph read and write operations, especially for large or complex application graphs. 4. 
**Increase Portability:** Facilitate the use of Radius in non-Kubernetes environments or scenarios where a dedicated graph store is preferred. @@ -117,7 +117,7 @@ Delivering this will allow us to shift to a far better user experience where con 2. **Graph Abstraction Layer:** A new internal service or Data Access Layer (DAL) will be created. This layer will provide an API for all graph operations (e.g., `listDeployments`, `listResources`, and allow new operations like `showDependencies(resource)`). Internally, this DAL will translate these requests into Kùzu operations (Cypher queries, API calls to Kùzu's Go driver). 3. **Schema Definition:** A formal schema for Radius entities (Applications, Environments, Resources, etc. as nodes) and their relationships (as edges with types and properties) will be defined and enforced in Kùzu. 4. **Data Synchronization/Migration:** No migration is planned. -5. **Component Updates:** All Radius components that currently interact with Kubernetes CRDs for graph information will be updated to use the new Graph Abstraction Layer. +5. **Component Updates:** All Radius components that currently interact with etcd or postgrest for graph information will be updated to use the new Graph Abstraction Layer. #### Architecture Diagram @@ -135,7 +135,7 @@ graph TD D <-->|"Interacts with"| E ``` -* **Current (Simplified):** Radius Core Components <-> Kubernetes API Server (for CRD-based graph) +* **Current (Simplified):** Radius Core Components <-> etcd * **Proposed:** Radius Core Components <-> Graph Abstraction Layer <-> Kùzu Embedded DB #### Detailed Design @@ -198,7 +198,7 @@ graph TD * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a Kùzu transaction to ensure atomicity. The DAL will manage this. * Radius upgrades and rollbacks would need to coordinate with the DAL. 
-#### Advantages (of Kùzu over K8s CRDs for graph storage) +#### Advantages (of Kùzu over etcd/key value stores for graph storage) * **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering etcd values client side. * **Performance:** For complex graph traversals (multi-hop queries, pathfinding), Kùzu is likely to be much faster as it's optimized for such operations. For Radius this would be during most recipe execution as the entire graph is rendered. @@ -258,7 +258,7 @@ Integrate **Kùzu as an embedded graph database** managed by a new Radius servic ### API design -The primary API change will be internal, within the Graph Abstraction Layer (DAL) as described in "Detailed Design." External Radius APIs (e.g., `rad resource list`, `rad application graph`) should remain functionally the same, but their implementation will now call the DAL instead of directly querying Kubernetes CRDs. +The primary API change will be internal, within the Graph Abstraction Layer (DAL) as described in "Detailed Design." External Radius APIs (e.g., `rad resource list`, `rad application graph`) should remain functionally the same, but their implementation will now call the DAL instead of directly querying etcd. No changes to the public Radius REST API are anticipated initially, other than potential performance improvements or new (future) API endpoints that leverage advanced graph queries. @@ -294,9 +294,8 @@ No changes to the public Radius REST API are anticipated initially, other than p 3. **End-to-End (E2E) Tests:** * Adapt existing Radius E2E tests to ensure all application deployment and management scenarios function correctly with Kùzu as the graph backend. * Include tests for data persistence across Radius restarts. - * Test migration from CRD store to Kùzu store (if applicable). 4. 
**Performance Tests:** - * Benchmark graph read/write operations with Kùzu against the current CRD-based implementation for representative workloads. + * Benchmark graph read/write operations with Kùzu against the current key/value based implementation for representative workloads. * Test concurrent access to the graph. * Add checks to LRT Cluster for graph operations. 5. **Backup/Restore Tests:** @@ -312,7 +311,7 @@ No changes to the public Radius REST API are anticipated initially, other than p ### Compatibility (optional) * **Backward Compatibility:** - * For existing Radius deployments using Kubernetes CRDs as the graph store, a migration path will be necessary. This could involve a period where Radius supports both backends, or a one-time migration tool. Currently we don't promise backward compatibility so no migration tool is planned. + * For existing Radius deployments using etcd as the graph store, a migration path will be necessary. This could involve a period where Radius supports both backends with concurrent writes, or a one-time migration tool. Currently we don't promise backward compatibility so no migration tool is planned. * The public Radius API and CLI should remain backward compatible. * **Data Format:** The structure of the application graph (apps, resources, properties) should remain conceptually the same, even though the storage backend changes. @@ -334,7 +333,7 @@ No changes to the public Radius REST API are anticipated initially, other than p * Create the DAL. * Implement CRUD endpoints representing Radius abstraction level graph operations. * Develop initial unit & integration tests for the DAL. - * Ensure via debug logging that no components are communicating with the planes CRD other than the DAL. + * Ensure via debug logging that no components are communicating with etcd other than the DAL. 1. **Phase 1: Core Integration (Milestone 1)** * Research Kùzu Go driver capabilities and limitations in detail. 
* Set up Kùzu as an embedded dependency - but as a pluggable architecture where the embedded calls could be swapped for network calls to any Cypher compatible graph database. @@ -352,14 +351,14 @@ No changes to the public Radius REST API are anticipated initially, other than p ### Open Questions 1. **Kùzu Performance under Concurrent Go Routines:** How well does Kùzu's Go driver and embedded database handle high concurrency from multiple goroutines within a single Radius process? Are there internal locking mechanisms in Kùzu to be aware of? -2. **Schema Evolution:** How will schema changes in Kùzu (e.g., adding new node/edge types, new properties) be managed over time with Radius updates? +2. **Schema Evolution:** How will schema changes in Kùzu (e.g., adding new node/edge types, new properties) be managed over time with Radius updates? With the introduction of Radius Types this should remain fairly stable as aside from Environment and Application all nodes would be resources with a type property based on the backing type. 3. **Kùzu Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of an embedded Kùzu instance for representative Radius graph sizes? ### Alternatives considered -1. **Continue using Kubernetes CRDs:** - * **Advantages:** Leverages existing Kubernetes infrastructure and expertise. No new database dependency. - * **Disadvantages:** Limited query capabilities, potential performance bottlenecks for complex graph operations, nested rendering logic very manual and complex, tightly coupled to Kubernetes. +1. **Continue using etcd:** + * **Advantages:** Leverages existing Kubernetes provided etcd installation and expertise. No new database dependency. + * **Disadvantages:** Limited query capabilities, known performance bottlenecks for sizeable application graphs, nested rendering logic very manual and complex, tightly coupled to key value stores. 2. 
**Other Embedded Graph Databases (e.g., a Go-native one if a mature one exists):** * **Advantages:** Could offer tighter integration if fully Go-native. * **Disadvantages:** Kùzu is chosen for its performance, Cypher support, and active development. A pure Go alternative might lack some of these mature features or performance characteristics. From 0d52e3fd63ae6244547a2739495d4c883d01c073 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Mon, 9 Jun 2025 16:00:22 -0700 Subject: [PATCH 05/19] added dashboard question Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 1 + 1 file changed, 1 insertion(+) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index 0f7583e0..ef2e1f79 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -353,6 +353,7 @@ No changes to the public Radius REST API are anticipated initially, other than p 1. **Kùzu Performance under Concurrent Go Routines:** How well does Kùzu's Go driver and embedded database handle high concurrency from multiple goroutines within a single Radius process? Are there internal locking mechanisms in Kùzu to be aware of? 2. **Schema Evolution:** How will schema changes in Kùzu (e.g., adding new node/edge types, new properties) be managed over time with Radius updates? With the introduction of Radius Types this should remain fairly stable as aside from Environment and Application all nodes would be resources with a type property based on the backing type. 3. **Kùzu Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of an embedded Kùzu instance for representative Radius graph sizes? +4. 
**Dashboard:** The changes proposed here such as nested types and expanded use of connections will make the app graph both richer and larger, the existing dashboard will probably need some UX design and work in order to leverage that effectively and intuitively (I would want to click a gateway to see the types it depends on and their sources in an environment). ### Alternatives considered From 86a9a5492f8a42b8705268101f690af77bb30ec6 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Wed, 11 Jun 2025 10:58:22 -0700 Subject: [PATCH 06/19] Addressed feedback from a few people, called out the production path clearly, made migration tool a requirement. Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index ef2e1f79..6f6a3b84 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -8,9 +8,9 @@ ### Overview -Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. Currently Radius relies on Kubernetes to install etcd, the default datastore, where these graph structures are stored as key value pairs. As we are adding support for nested connections the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. Additionally we are already hitting performance limits in test environments which inspired the work @superbeeny has done to swap the key value operations out to use postgres. +Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. 
Currently Radius relies on Kubernetes to install etcd, the default datastore, where these graph structures are stored as key value pairs. As we are adding support for nested connections the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. Additionally we are already hitting performance limits in test environments (N+ resources take X time) which inspired the work @superbeeny has done to swap the key value operations out to use postgres. Radius users could not use Drasi today to act on changes to their environments as there's no current support for key value stores and if that was added the client side filtering requirements might be a challenge in Drasi. -This proposal outlines a plan to modify Project Radius to utilize Kùzu, an embedded, high-performance graph database, as the primary store for its application graph. This change aims to decouple the core graph logic and operations from the underlying Kubernetes infrastructure, enabling more powerful graph queries, potentially improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. This will also pave the way for Radius to operate more seamlessly in environments beyond Kubernetes or where direct Kubernetes API access for graph storage is not ideal. +This proposal outlines a plan to modify Project Radius to utilize Kùzu, an embedded, high-performance graph database, as the primary store for its application graph. 
This change aims to decouple the core graph logic and operations from the underlying Kubernetes infrastructure, enabling more powerful graph queries, potentially improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. Kùzu is intended for developer and POC/test environments, for production environments the DB connection object would simply be replaced with a cypher compliant DB client to something such as Postgres with the Apache AGE plugin. Delivering this will allow us to shift to a far better user experience where connections become a rich re-usable concept that shares data and exposes deep relationships currently obfuscated by monolithic types with embedded objects (database has a credentials object with username and password properties). @@ -47,6 +47,7 @@ Delivering this will allow us to shift to a far better user experience where con * Migrate existing graph data representation (currently in etcd/postgres) to the Kùzu data model. * Implement a data access layer (DAL, name TBD) service that abstracts Kùzu operations (CRUD, queries) for other Radius components. * Update Radius RPs and controllers to use the new Kùzu-backed DAL for all graph-related operations. +* Provide migration tool for moving to graph db. * Provide mechanisms for backup and potential restore of the Kùzu database as part of Radius install/upgrade/rollback operations. * Develop a comprehensive test suite covering graph operations with Kùzu. * Build this in a way that Kùzu could be swapped out for a network accessible graph database such as Neo4J or Postgres with Apache AGE extension. @@ -116,7 +117,7 @@ Delivering this will allow us to shift to a far better user experience where con 1. 
**Introduce Kùzu:** Kùzu will be integrated as an embedded library within the primary Radius component(s) responsible for managing the application graph (e.g., the Radius Core RP or a dedicated graph service). 2. **Graph Abstraction Layer:** A new internal service or Data Access Layer (DAL) will be created. This layer will provide an API for all graph operations (e.g., `listDeployments`, `listResources`, and allow new operations like `showDependencies(resource)`). Internally, this DAL will translate these requests into Kùzu operations (Cypher queries, API calls to Kùzu's Go driver). 3. **Schema Definition:** A formal schema for Radius entities (Applications, Environments, Resources, etc. as nodes) and their relationships (as edges with types and properties) will be defined and enforced in Kùzu. -4. **Data Synchronization/Migration:** No migration is planned. +4. **Data Synchronization/Migration:** We will need to provide a simple migration tool that copies existing Radius environments/app graph to the newly installed graph db. 5. **Component Updates:** All Radius components that currently interact with etcd or postgrest for graph information will be updated to use the new Graph Abstraction Layer. #### Architecture Diagram @@ -191,7 +192,7 @@ graph TD 4. **Data Persistence and State:** * Kùzu runs embedded, so the Radius process managing it is authoritative. - * If a single DAL node in the cluster is unable to scale to the traffic of the cluster we would have to separate out write traffic to a single instance and reads could scale reasonably. + * If a single DAL node in the cluster is unable to scale to the traffic of the cluster we would have to separate out write traffic to a single instance and reads could scale reasonably, but moving to a network cypher client would be expected. 
* For this proposal, we assume a single active Radius DAL node is managing the Kùzu DB file for writes, with potential for read-only file replication for other instances if feasible and performant. 5. **Transaction Management:** @@ -207,6 +208,7 @@ graph TD * **Not strongly tied to Kùzu** The DAL will allow for Radius users to use any Cypher compatible graph database such as Neo4J or CosmosDB with Gremlin. * **Transactional Guarantees:** Kùzu provides ACID transactions for graph operations. * **Schema Enforcement:** Better ability to define and enforce a graph schema within Kùzu. +* **Support for streaming monitoring of graph changes:** A project like Drasi cannot consume a change feed of the Radius app graph because it can't replicate the client side filtering necessary for identifying the changes desired. --- @@ -311,7 +313,7 @@ No changes to the public Radius REST API are anticipated initially, other than p ### Compatibility (optional) * **Backward Compatibility:** - * For existing Radius deployments using etcd as the graph store, a migration path will be necessary. This could involve a period where Radius supports both backends with concurrent writes, or a one-time migration tool. Currently we don't promise backward compatibility so no migration tool is planned. + * For existing Radius deployments using etcd as the graph store, a migration path will be necessary. * The public Radius API and CLI should remain backward compatible. * **Data Format:** The structure of the application graph (apps, resources, properties) should remain conceptually the same, even though the storage backend changes. @@ -333,7 +335,7 @@ No changes to the public Radius REST API are anticipated initially, other than p * Create the DAL. * Implement CRUD endpoints representing Radius abstraction level graph operations. * Develop initial unit & integration tests for the DAL. - * Ensure via debug logging that no components are communicating with etcd other than the DAL. 
+ * Ensure via debug logging that no components are communicating directly with etcd other than the DAL. 1. **Phase 1: Core Integration (Milestone 1)** * Research Kùzu Go driver capabilities and limitations in detail. * Set up Kùzu as an embedded dependency - but as a pluggable architecture where the embedded calls could be swapped for network calls to any Cypher compatible graph database. @@ -353,7 +355,7 @@ No changes to the public Radius REST API are anticipated initially, other than p 1. **Kùzu Performance under Concurrent Go Routines:** How well does Kùzu's Go driver and embedded database handle high concurrency from multiple goroutines within a single Radius process? Are there internal locking mechanisms in Kùzu to be aware of? 2. **Schema Evolution:** How will schema changes in Kùzu (e.g., adding new node/edge types, new properties) be managed over time with Radius updates? With the introduction of Radius Types this should remain fairly stable as aside from Environment and Application all nodes would be resources with a type property based on the backing type. 3. **Kùzu Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of an embedded Kùzu instance for representative Radius graph sizes? -4. **Dashboard:** The changes proposed here such as nested types and expanded use of connections will make the app graph both richer and larger, the existing dashboard will probably need some UX design and work in order to leverage that effectively and intuitively (I would want to click a gateway to see the types it depends on and their sources in an environment). +4. 
**Dashboard:** The changes proposed here such as nested types and expanded use of connections will make the app graph both richer and larger, the existing dashboard will probably need some UX design and work in order to leverage that effectively and intuitively (I would want to click a gateway to see the types it depends on and their sources in an environment) ### Alternatives considered From f6f49fbfc7f17467bcbbc937b449629ed458c29f Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Wed, 11 Jun 2025 21:40:46 -0700 Subject: [PATCH 07/19] Addressed all feedback from design session Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 332 ++++++++++-------- 1 file changed, 193 insertions(+), 139 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index 6f6a3b84..cd1f4cd7 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -1,39 +1,41 @@ -## Proposal: Integrating Kùzu for Radius Application Graph +## Proposal: Integrating a Graph Database for Radius Application Graph -**Version:** 1.0 -**Date:** June 4, 2025 +**Version:** 1.2 +**Date:** June 11, 2025 **Author:** Sylvain Niles --- ### Overview -Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. Currently Radius relies on Kubernetes to install etcd, the default datastore, where these graph structures are stored as key value pairs. As we are adding support for nested connections the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. 
Additionally we are already hitting performance limits in test environments (N+ resources take X time) which inspired the work @superbeeny has done to swap the key value operations out to use postgres. Radius users could not use Drasi today to act on changes to their environments as there's no current support for key value stores and if that was added the client side filtering requirements might be a challenge in Drasi. +Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. Radius relies on Kubernetes to install etcd, the default datastore, where these graph structures are stored as key-value pairs. As we add support for nested connections, the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. Additionally, we are already hitting performance limits in test environments (under a hundred resources were reported to slow things to a crawl), which inspired the work @superbeeny has done to swap the key-value operations out to use Postgres. Radius users cannot use Drasi today to act on changes to their environments because there is currently no support for key-value stores; even if that support were added, the client-side filtering requirements would be a challenge in Drasi, requiring extensive middleware to parse the custom Radius data structures. -This proposal outlines a plan to modify Project Radius to utilize Kùzu, an embedded, high-performance graph database, as the primary store for its application graph. This change aims to decouple the core graph logic and operations from the underlying Kubernetes infrastructure, enabling more powerful graph queries, potentially improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer.
One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. Kùzu is intended for developer and POC/test environments, for production environments the DB connection object would simply be replaced with a cypher compliant DB client to something such as Postgres with the Apache AGE plugin. +This proposal outlines a plan to modify Project Radius to utilize a **graph database** as the primary store for its application graph, accessed via a new **Graph Access Layer (GAL)**. This change aims to decouple the core graph logic and operations from etcd, enabling more powerful graph queries, improving performance for complex relationship traversals, providing comparable performance to key-value stores for non-graph storage/retrieval, and offering a more specialized and efficient graph persistence layer. One benefit to recipe authors is that they will be able to define nested connections of types, both to return complex relationships and properties to a recipe and to see these relationships in the graph/dashboard. -Delivering this will allow us to shift to a far better user experience where connections become a rich re-usable concept that shares data and exposes deep relationships currently obfuscated by monolithic types with embedded objects (database has a credentials object with username and password properties). +For development, testing, and proof-of-concept environments, we will provide Kùzu as the default embedded graph database. For production environments, the plan is to support Postgres with the Apache AGE plugin, which provides Cypher compatibility and is suitable for scalable, production-grade deployments. The GAL will be designed to support pluggable backends, so that any Cypher-compatible graph database can be configured as the provider.
+ +Delivering this will allow us to shift to a far better user experience where connections become a rich re-usable concept that shares data and exposes deep relationships currently obfuscated by monolithic types with embedded objects (ex: database has a credentials object with username and password properties). ### Terms and Definitions * **Radius:** An open-source, cloud-native application platform that helps developers build, deploy, and manage applications across various environments. * **Application Graph:** A representation of an application's components (resources, services, environments, etc.) as nodes and their interconnections as edges, including metadata on both. * **Kubernetes (K8s):** An open-source system for automating deployment, scaling, and management of containerized applications. -* **CRDs (Custom Resource Definitions):** Extensions of the Kubernetes API that allow users to create custom resource types. Currently a common way Radius stores graph data in K8s. -* **Kùzu:** An embeddable, transactional, high-performance graph database management system (GDBMS) supporting the Cypher query language and property graphs. +* **Graph Access Layer (GAL):** The internal abstraction layer that mediates all graph operations between Radius components and the underlying graph database. +* **Kùzu:** An embeddable, transactional, high-performance graph database management system (GDBMS) supporting the Cypher query language and property graphs. Provided for dev/test/POC. +* **Postgres with Apache AGE:** A production-ready, scalable graph database solution supporting Cypher queries, built on PostgreSQL. * **Node:** An entity in a graph (e.g., a Radius resource, an environment). -* **Edge:** A relationship between two nodes in a graph (e.g., "connectsTo", "runsIn"). +* **Edge:** A relationship between two nodes in a graph (e.g., "connectsTo", "runsIn"). All edges in Radius will be of type "CONNECTION". 
* **Property:** Key-value pairs associated with nodes or edges, storing metadata. * **Cypher:** A declarative graph query language. * **RP (Resource Provider):** A component in Radius responsible for managing a specific type of resource. ### Objectives -1. **Decouple Graph Storage:** Abstract the application graph storage from etcd, allowing Radius to use a dedicated graph database. -2. **Enhance Query Capabilities:** Leverage Kùzu's Cypher query language for more complex and efficient graph traversals and relationship analysis than what is easily achievable with Kubernetes API queries. +1. **Decouple Graph Storage:** Abstract the application graph storage from etcd, allowing Radius to use a dedicated graph database via the Graph Access Layer. +2. **Enhance Query Capabilities:** Leverage Cypher query language for more complex and efficient graph traversals and relationship analysis than what is easily achievable with etcd retrieval and client side filtering. 3. **Improve Performance:** Potentially improve the performance of graph read and write operations, especially for large or complex application graphs. -4. **Increase Portability:** Facilitate the use of Radius in non-Kubernetes environments or scenarios where a dedicated graph store is preferred. -5. **Maintain Existing Functionality:** Ensure that all existing Radius features that rely on the application graph continue to function correctly with Kùzu as the backend. +4. **Maintain Existing Functionality:** Ensure that all existing Radius features that rely on the application graph continue to function correctly with the new backend. ### Issue Reference: @@ -41,118 +43,156 @@ Delivering this will allow us to shift to a far better user experience where con ### Goals -* First step in allowing the Radius control plane to run outside Kubernetes. -* Integrate Kùzu as an embedded graph database within relevant Radius components (e.g., Core RP, Deployment Engine). 
-* Define a clear schema for the Radius application graph within Kùzu. -* Migrate existing graph data representation (currently in etcd/postgres) to the Kùzu data model. -* Implement a data access layer (DAL, name TBD) service that abstracts Kùzu operations (CRUD, queries) for other Radius components. -* Update Radius RPs and controllers to use the new Kùzu-backed DAL for all graph-related operations. +* Implement a Graph Access Layer (GAL) that abstracts graph operations with pluggable backend support. +* Integrate Kùzu as an embedded graph database for development, testing, and proof-of-concept environments. +* Develop Postgres with Apache AGE plugin support for production environments. +* Define a clear schema for the Radius application graph within both backends. +* Migrate existing graph data representation (currently in etcd/postgres) to the graph database data model. +* Update Radius RPs and controllers to use the new GAL for all graph-related operations. * Provide migration tool for moving to graph db. -* Provide mechanisms for backup and potential restore of the Kùzu database as part of Radius install/upgrade/rollback operations. -* Develop a comprehensive test suite covering graph operations with Kùzu. -* Build this in a way that Kùzu could be swapped out for a network accessible graph database such as Neo4J or Postgres with Apache AGE extension. +* Provide mechanisms for backup and potential restore of the graph database as part of Radius install/upgrade/rollback operations. +* Develop a comprehensive test suite covering graph operations with both backends. +* Add configuration options to select between graph database providers. ### Non-goals -* Re-architecting more of Radius to eliminate Kubernetes as a dependency beyond the graph storage and access layer. * Providing a distributed Kùzu cluster as part of this initial integration (Kùzu is primarily embedded; clustering would be a separate, future consideration if needed). 
-* Exposing direct Kùzu Cypher query capabilities to end-users of Radius (interaction should remain through Radius APIs and abstractions). -* Replacing all uses of the Kubernetes CRDs by Radius. -* Migrating existing Radius clusters to the new graph, a fresh install is required. (up for debate) +* Exposing direct Cypher query capabilities to end-users of Radius (interaction should remain through Radius APIs and abstractions). +* Supporting zero-downtime migration from etcd to the graph database (migration will require a maintenance window). ### User Scenarios (optional) -* **Scenario 1 (Developer):** A Radius developer needs to address performance issues for application graphs containing many resources. Retriving all resources connected to a specific environment, which is a very expensive operation. Using Cypher through the Kùzu DAL would be more expressive and orders of magnitude faster than multiple etcd calls and client-side filtering. -* **Scenario 2 (Platform Engineer):** A platform engineer wants to run Radius without Kubernetes, this is one of the decoupling features required. +* **Scenario 1 (Developer):** A Radius developer needs to address performance issues for application graphs containing many resources. Retrieving all resources connected to a specific environment is currently a very expensive operation. Using Cypher through the GAL would be more expressive and is expected to be orders of magnitude faster than multiple etcd calls and client-side filtering. +* **Scenario 2 (Developer):** A developer is troubleshooting a deployment that isn't working correctly. By examining the new deeply nested application graph, they can see that their gateway resource is connected to two secret resources (crt and key), and upon inspection, they discover this certificate is for the wrong domain, explaining why their HTTPS connections are failing. 
### User Experience (if applicable) * **End-users (Application Developers using Radius CLI/APIs):** The change should be largely transparent. Existing commands and APIs for managing applications and resources should continue to work. Performance improvements might be noticeable. -* **Radius Developers/Contributors:** Will need to learn how to interact with the new Kùzu DAL and potentially understand the Kùzu data model and Cypher for advanced debugging or development. -* **Operators:** Will need to be aware of the new Kùzu database component for backup, monitoring, and troubleshooting purposes. The operational burden of managing etcd for graph data would be shifted. +* **Radius Developers/Contributors:** Will need to learn how to interact with the new GAL and potentially understand the graph data model and Cypher for advanced debugging or development. The graph implementation is fairly simple, with a small number of queries that should remain static; only new queries that support new functionality would require ramp-up on Cypher. Non-cloud test suites may speed up significantly. +* **Operators:** Will need to be aware of the new graph database component for backup, monitoring, and troubleshooting purposes. The operational burden of managing etcd for graph data would shift to the graph database. 
### Sample Input/Output: -* **Sample Input (Conceptual Cypher Query via DAL, replicating `rad app graph todo`):** +**Example 1: Application Graph Query (replicating `rad app graph todo`)** + +* **Sample Input:** ``` - // Show the full graph for the "todo" application - MATCH (app:Application {name: 'todo'})-[rel]->(res:Resource) - RETURN app.name AS application, type(rel) AS relationship, res.name AS resource, res.type AS resourceType + // Show the full graph for the "todo" application with connections + MATCH (app:Application {name: 'todo'})-[rel:CONNECTION]->(res:Resource) + OPTIONAL MATCH (res)-[conn:CONNECTION]->(connected:Resource) + RETURN app.name AS application, res.name AS resource, res.type AS resourceType, + collect(connected.name) AS connections ``` -* **Sample Output (from DAL):** + +* **Sample Output:** ```json [ { "application": "todo", - "relationship": "CONTAINS", "resource": "frontend", - "resourceType": "Applications.Core/container" + "resourceType": "Applications.Core/container", + "connections": ["backend"] }, { "application": "todo", - "relationship": "CONTAINS", "resource": "backend", - "resourceType": "Applications.Core/container" + "resourceType": "Applications.Core/container", + "connections": ["todo-db", "redis-cache"] }, { "application": "todo", - "relationship": "CONNECTS_TO", "resource": "todo-db", "resourceType": "Applications.Core/postgres" }, { "application": "todo", - "relationship": "CONNECTS_TO", "resource": "redis-cache", "resourceType": "Applications.Core/redis" } ] ``` +**Example 2: Troubleshooting Query (Scenario 2 - Finding certificate domains)** + +* **Sample Input:** + ``` + // Find all secrets connected to a gateway resource for certificate inspection + MATCH (gateway:Resource {name: 'api-gateway', type: 'Applications.Core/gateway'}) + -[conn:CONNECTION]->(secret:Resource {type: 'Applications.Core/secret'}) + RETURN gateway.name AS gateway, secret.name AS secretName, + secret.properties.domain AS certificateDomain + ``` + 
+* **Sample Output:** + ```json + [ + { + "gateway": "api-gateway", + "secretName": "tls-cert", + "certificateDomain": "wrong-domain.com" + }, + { + "gateway": "api-gateway", + "secretName": "tls-key", + "certificateDomain": "wrong-domain.com" + } + ] + ``` + ### Design #### High-Level Design -1. **Introduce Kùzu:** Kùzu will be integrated as an embedded library within the primary Radius component(s) responsible for managing the application graph (e.g., the Radius Core RP or a dedicated graph service). -2. **Graph Abstraction Layer:** A new internal service or Data Access Layer (DAL) will be created. This layer will provide an API for all graph operations (e.g., `listDeployments`, `listResources`, and allow new operations like `showDependencies(resource)`). Internally, this DAL will translate these requests into Kùzu operations (Cypher queries, API calls to Kùzu's Go driver). -3. **Schema Definition:** A formal schema for Radius entities (Applications, Environments, Resources, etc. as nodes) and their relationships (as edges with types and properties) will be defined and enforced in Kùzu. -4. **Data Synchronization/Migration:** We will need to provide a simple migration tool that copies existing Radius environments/app graph to the newly installed graph db. -5. **Component Updates:** All Radius components that currently interact with etcd or postgrest for graph information will be updated to use the new Graph Abstraction Layer. +1. **Introduce Graph Access Layer (GAL):** The GAL will be integrated as an internal service within the primary Radius component(s) responsible for managing the application graph (e.g., the Radius Core RP or a dedicated graph service). +2. **Pluggable Backend Support:** The GAL will support both Kùzu (for dev/test/POC) and Postgres+AGE (for production), with configuration to select the backend. +3. **Schema Definition:** A formal schema for Radius entities (Applications, Environments, Resources, etc. 
as nodes) and their relationships (as edges with types and properties) will be defined and enforced in both backends. +4. **Data Synchronization/Migration:** Provide a migration tool to copy existing Radius environments/app graph to the newly installed graph db. +5. **Component Updates:** All Radius components that currently interact with etcd or Postgres for graph information will be updated to use the new GAL for all graph-related operations. #### Architecture Diagram ```mermaid graph TD A["Radius CLI/API
(User Interactions)"] - B["Radius Core Components
(Control Plane, RPs)"] - C["Recipe Execution
(For non-graph
resource mgmt)"] - D["Graph Abstraction Layer
(DAL)"] - E["Kùzu Embedded DB container
(kuzu.db file on disk allowing backup/restore)"] + B["Radius CoreRP
"] + D["Graph Access Layer
(GAL)"] + E1["Kùzu Embedded DB
(Dev/Test/POC)"] + E2["Postgres + Apache AGE
(Production)"] A <--> B - B <--> C - B <-->|"Uses (for graph ops)"| D - D <-->|"Interacts with"| E + B <-->|"storage/query"| D + D -- "dev/test/POC" --> E1 + D -- "production" --> E2 ``` * **Current (Simplified):** Radius Core Components <-> etcd -* **Proposed:** Radius Core Components <-> Graph Abstraction Layer <-> Kùzu Embedded DB +* **Proposed:** Radius Core Components <-> Graph Access Layer <-> Graph Database (Kùzu or Postgres+AGE) #### Detailed Design -1. **Kùzu Integration:** +1. **Graph Access Layer (GAL) Implementation:** + * The GAL will be implemented as a Go service that abstracts all graph operations. + * It will provide a pluggable interface allowing different graph database backends. + * Configuration will determine which backend to use (Kùzu for dev/test, Postgres+AGE for production). + +2. **Dev/Test/POC Backend - Kùzu Integration:** * The Kùzu Go driver (`github.com/kuzudb/go-kuzu`) will be used. - * Kùzu database will be initialized during `rad init`. The database file (`radius_app_graph.kuzu`) will be stored on ephemeral or persistent storage accessible to the Radius control plane. (requiring setup of a volume for laptop installations seems excessive) - * Initially CoreRP will move CRUD operations using etcd to the DAL without using Kùzu. - * Once the DAL is released Kùzu support will be added in a pluggable way, allowing for future network graph db support vs the embedded solution. + * Kùzu database will be initialized during `rad init`. The database file (`radius_app_graph.kuzu`) will be stored on persistent storage accessible to the Radius control plane. + * Initially CoreRP will move existing etcd CRUD operations to the GAL without using Kùzu. + * Once the GAL is released Kùzu support will be added. + +3. **Production Backend - Postgres with Apache AGE:** + * Postgres with Apache AGE plugin will be supported for production deployments. + * Network-based connection using standard PostgreSQL drivers with AGE extensions. 
+ * Support for connection pooling, high availability, and scaling. (Can be a later phase) -2. **Schema Management:** - * A Go module will define constants for node labels (e.g., `NodeTypeApplication`, `NodeTypeResource`) and edge labels (e.g., `EdgeTypeContains`, `EdgeTypeConnectsTo`). - * On startup, Radius will ensure the schema (node tables, relationship tables, property definitions) exists in Kùzu, creating or migrating it if necessary. - * Node properties will be strongly typed (string, int, bool, arrays, maps). Complex nested objects might need to be stored as JSON strings if Kùzu's direct support is limited, or flattened, but no current use case should require this. +4. **Schema Management:** + * A Go module will define constants for node labels (e.g., `NodeTypeApplication`, `NodeTypeResource`) and edge labels (all edges will be type `CONNECTION`). + * On startup, Radius will ensure the schema (node tables, relationship tables, property definitions) exists in the backend, creating or migrating it if necessary. + * Node properties will be strongly typed (string, int, bool, arrays, maps). Complex nested objects might need to be stored as JSON strings if direct support is limited, or flattened, but no current use case should require this. -3. **Graph Abstraction Layer (DAL) API:** +5. **Graph Access Layer (GAL) API:** * Example Go interface: ```go type GraphStore interface { @@ -191,24 +231,23 @@ graph TD ``` 4. **Data Persistence and State:** - * Kùzu runs embedded, so the Radius process managing it is authoritative. - * If a single DAL node in the cluster is unable to scale to the traffic of the cluster we would have to separate out write traffic to a single instance and reads could scale reasonably, but moving to a network cypher client would be expected. - * For this proposal, we assume a single active Radius DAL node is managing the Kùzu DB file for writes, with potential for read-only file replication for other instances if feasible and performant. 
+ * **Kùzu (Dev/Test/POC):** Kùzu runs embedded, so the Radius process managing it is authoritative. The database file (`radius_app_graph.kuzu`) is stored on persistent storage accessible to the Radius control plane. + * **Postgres+AGE (Production):** Network-based connection with standard PostgreSQL high availability, clustering, and backup mechanisms. 5. **Transaction Management:** - * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a Kùzu transaction to ensure atomicity. The DAL will manage this. - * Radius upgrades and rollbacks would need to coordinate with the DAL. + * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a database transaction to ensure atomicity. The GAL will manage this. + * Radius upgrades and rollbacks would need to coordinate with the GAL. -#### Advantages (of Kùzu over etcd/key value stores for graph storage) +#### Advantages (of Graph Database via GAL over etcd/key value stores for graph storage) * **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering etcd values client side. -* **Performance:** For complex graph traversals (multi-hop queries, pathfinding), Kùzu is likely to be much faster as it's optimized for such operations. For Radius this would be during most recipe execution as the entire graph is rendered. -* **Specialized Data Store:** Kùzu is purpose-built for graph data, leading to efficient storage and indexing for graph structures. -* **Decoupling from Kubernetes:** Reduces dependency on the K8s API server for core graph logic, improving portability and potentially reducing load on the K8s control plane for graph-heavy operations. -* **Not strongly tied to Kùzu** The DAL will allow for Radius users to use any Cypher compatible graph database such as Neo4J or CosmosDB with Gremlin. 
-* **Transactional Guarantees:** Kùzu provides ACID transactions for graph operations. -* **Schema Enforcement:** Better ability to define and enforce a graph schema within Kùzu. -* **Support for streaming monitoring of graph changes:** A project like Drasi cannot consume a change feed of the Radius app graph because it can't replicate the client side filtering necessary for identifying the changes desired. +* **Performance:** For complex graph traversals (multi-hop queries, pathfinding), a graph database is likely to be much faster as it's optimized for such operations. In Radius this applies to most recipe executions, since the entire graph is rendered. +* **Specialized Data Store:** Graph databases are purpose-built for graph data, leading to efficient storage and indexing for graph structures while remaining performant for standard storage and retrieval operations of non-graph data. +* **Pluggable Backend:** The GAL will allow Radius users to use any Cypher-compatible graph database such as Neo4j; a Gremlin-based store such as CosmosDB is not Cypher-compatible and would need its own backend implementation. +* **Transactional Guarantees:** Both Kùzu and Postgres with Apache AGE provide ACID transactions for graph operations. We've already encountered many situations where the graph is left in a bad state, for example by a test failing to clean up properly. We may be able to move the entire test framework to a transaction and rollback per test, eliminating this problem entirely. +* **Schema Enforcement:** Better ability to define and enforce a graph schema. +* **Support for streaming monitoring of graph changes:** A project like Drasi cannot consume a change feed of the Radius app graph because it doesn't support key value stores and would require extensive middleware custom to Radius data structures to replicate the client side filtering necessary for identifying the changes desired. 
+* **Simplified Go Code:** Eliminates the complex imperative Go code currently required for creating and traversing key-value structures in etcd, replacing it with declarative Cypher queries that are more maintainable and less error-prone. --- @@ -216,11 +255,11 @@ graph TD Currently, many Radius resource types (such as databases) are modeled as monolithic objects with embedded properties or sub-objects. For example, a database resource might have a `credentials` object, which itself contains `username` and `password` properties. This approach makes it difficult to express and traverse relationships between resources, and limits reusability and visibility in the application graph. -With a graph database like Kùzu, these relationships can be modeled explicitly. Instead of embedding credentials as an object within the database resource, the database node can be connected to a separate `credentials` node (e.g., named `db_creds` of type `Credentials`). This credentials node can then be connected to two `secret` nodes representing the username and password. This approach enables: +With a graph database, these relationships can be modeled explicitly. Instead of embedding credentials as an object within the database resource, the database resource can be connected to a separate `credentials` resource (e.g., named `db_creds` of type `Credentials`). This credentials resource can then be connected to two `secret` resources representing the username and password. This approach enables: -- **Reusability:** Credentials or secrets can be shared across multiple resources. -- **Visibility:** Relationships between resources, credentials, and secrets are explicit and queryable. -- **Extensibility:** New types of relationships or properties can be added without changing the monolithic resource schema. Future API versions could allow some properties to be private (not exposed by connection). 
+- **Intuitive Use:** Resources can be accessed within the recipe context in an intuitive manner. +- **Visibility:** Relationships between deeply nested resources are explicit and queryable. +- **Extensibility:** New types of relationships or properties can be added without changing the monolithic resource schema. Future API versions could allow some properties to be private (not exposed by connection). For example, a new edge type "USED_BY" could link a recipe to all the resources using it across multiple environments and applications, enabling infrastructure operators to understand the impact of registering a new version of a recipe by querying which resources would be affected. **Example Graph Structure:** @@ -237,30 +276,29 @@ graph TD ``` In this model: -- The `Database` node is connected to a `Credentials` node via a connection. -- The `Credentials` node is connected to two `Secret` nodes (for username and password) via connections. +- The `Database` node is connected to a `Credentials` node via a CONNECTION. +- The `Credentials` node is connected to two `Secret` nodes (for username and password) via CONNECTIONs. This structure enables richer queries and recipe author use cases like `context.connected_resources.database.credentials.username` instead of only being able to access the embedded `credentials` object and requiring the recipe author to parse. Additionally it provides a better separation of concerns, and a more flexible, maintainable application graph. -#### Disadvantages (of Kùzu integration) +#### Disadvantages (of Graph Database integration) -* **New Dependency:** Introduces Kùzu as a new core dependency for Radius, including its Go driver. +* **New Dependency:** Introduces a graph database as a new core dependency for Radius, including its Go driver(s). * **Operational Overhead:** - * Managing the Kùzu database file (backups, storage). - * Monitoring Kùzu's performance and health. 
(LRT Cluster) - * Requires persistent volume for the Kùzu database file. (Same as Postgres for production usage) -* **Complexity:** Adds a new layer (DAL, Kùzu integration) to the Radius architecture. -* **Embedded Nature & HA:** Kùzu's primary mode is embedded. Achieving high availability for the Kùzu store in a distributed Radius control plane requires careful design (e.g., leader election for the writer, replication strategy for readers, or a future Kùzu server mode). This proposal initially focuses on a simpler embedded model. -* **Learning Curve:** Radius developers might need to learn Kùzu and Cypher. + * Managing the graph database file or instance (backups, storage). + * Monitoring performance and health. + * Requires persistent volume or managed database for production usage. +* **Complexity:** Adds a new layer (GAL, graph database integration) to the Radius architecture. +* **Learning Curve:** Radius developers might need to learn Cypher and the specifics of the chosen graph database(s). #### Proposed Option -Integrate **Kùzu as an embedded graph database** managed by a new Radius service. A Graph Abstraction Layer (DAL) will be developed to mediate all graph operations. This approach balances the benefits of a dedicated graph DB with the relative simplicity of an embedded solution for the initial phase. +Integrate a **Graph Database** behind a **Graph Access Layer (GAL)** with pluggable backend support. Initially provide **Kùzu as an embedded graph database** for dev/test/POC environments and **Postgres with Apache AGE** for production environments. This approach balances the benefits of dedicated graph DB capabilities with flexibility in deployment scenarios. ### API design -The primary API change will be internal, within the Graph Abstraction Layer (DAL) as described in "Detailed Design." 
External Radius APIs (e.g., `rad resource list`, `rad application graph`) should remain functionally the same, but their implementation will now call the DAL instead of directly querying etcd. +The primary API change will be internal, within the Graph Access Layer (GAL) as described in "Detailed Design." External Radius APIs (e.g., `rad resource list`, `rad application graph`) should remain functionally the same, but their implementation will now call the GAL instead of directly querying etcd. No changes to the public Radius REST API are anticipated initially, other than potential performance improvements or new (future) API endpoints that leverage advanced graph queries. @@ -273,42 +311,51 @@ No changes to the public Radius REST API are anticipated initially, other than p #### Core RP (Resource Provider) -* Core RP will use the DAL to manage the graph during deployment rendering and query the DAL for any app graph API requests. +* Core RP will use the GAL to manage the graph during deployment rendering and query the GAL for any app graph API requests. ### Error Handling -* The DAL will be responsible for translating Kùzu-specific errors into standardized Radius errors. +* The GAL will be responsible for translating backend-specific errors into standardized Radius errors. * Errors such as database connection issues, query failures, transaction rollbacks, or schema violations must be handled gracefully. -* Retry mechanisms for transient Kùzu errors will be implemented in the DAL. -* The DAL will integrate with the Radius OpenTelemetry implementation. +* Retry mechanisms for transient errors will be implemented in the GAL. +* The GAL will integrate with the Radius OpenTelemetry implementation. ### Test plan 1. **Unit Tests:** - * Test individual functions within the Graph Abstraction Layer (mocking Kùzu Go driver). - * Test Kùzu schema creation and migration logic. + * Test individual functions within the Graph Access Layer (mocking graph database drivers). 
+ * Test schema creation and migration logic for both backends. 2. **Integration Tests:** - * Test the DAL against an actual embedded Kùzu instance. + * Test the GAL against actual backend instances (both Kùzu and Postgres with Apache AGE). * Verify CRUD operations for nodes and edges with various property types. - * Test complex Cypher queries through the DAL. - * Test transactional behavior. - * Test Core RP interacting with the Kùzu-backed DAL. + * Test transactional behavior for both backends. + * Test Core RP interacting with the GAL-backed graph stores. 3. **End-to-End (E2E) Tests:** - * Adapt existing Radius E2E tests to ensure all application deployment and management scenarios function correctly with Kùzu as the graph backend. - * Include tests for data persistence across Radius restarts. + * Adapt existing Radius E2E tests to ensure all application deployment and management scenarios function correctly with both graph backends. + * Include tests for data persistence across Radius restarts and upgrades/rollbacks. 4. **Performance Tests:** - * Benchmark graph read/write operations with Kùzu against the current key/value based implementation for representative workloads. - * Test concurrent access to the graph. + * Benchmark graph read/write operations with both backends against the current key/value based implementation for representative workloads. + * Validate performance claims from the advantages section, specifically: + * Complex graph traversal performance compared to etcd + client-side filtering + * Recipe execution performance when rendering the entire graph + * Query performance for large application graphs (100+ resources) + * Test concurrent access to the graph for both backends. * Add checks to LRT Cluster for graph operations. 5. **Backup/Restore Tests:** - * Verify that Kùzu database backups can be successfully created and restored. + * Verify that database backups can be successfully created and restored for both backends. +6. 
**Backend Compatibility Tests:** + * Ensure identical behavior and results across Kùzu and Postgres with Apache AGE backends. + * Test configuration switching between backends. ### Security -* **Data at Rest:** The Kùzu database file (`radius_graph.kuzu`) contains the application graph data. It should be protected by appropriate file system permissions on the persistent volume where it's stored. Encryption at rest for the volume should be considered, managed by the underlying infrastructure (e.g., Kubernetes PV encryption). -* **Access Control:** Access to Kùzu is through the embedded Go driver within the Radius process. Standard Radius authentication and authorization mechanisms will protect the Radius APIs that indirectly interact with Kùzu. There is no direct network exposure of Kùzu in the embedded model. -* **Input Sanitization:** If any user-provided data is used to construct Cypher queries (even if parameterized), ensure proper parameterization is always used by the DAL to prevent injection vulnerabilities (Kùzu's Go driver should support parameterized queries). -* **Threat Model:** The Radius threat model must be updated to have a section for the DAL. +* **Data at Rest:** + * **Kùzu:** The database file (`radius_app_graph.kuzu`) contains the application graph data. It should be protected by appropriate file system permissions on the persistent volume where it's stored. + * **Postgres with Apache AGE:** Standard PostgreSQL security practices apply, including encryption at rest, access controls, and network security. + * Encryption at rest for storage should be considered, managed by the underlying infrastructure (e.g., Kubernetes PV encryption). +* **Access Control:** Access to the graph database is through the GAL within the Radius process. Standard Radius authentication and authorization mechanisms (when implemented) will protect the Radius APIs that indirectly interact with the graph database. 
There is no direct network exposure of Kùzu in the embedded model. PostgreSQL with Apache AGE will use a standard networked database access model. +* **Input Sanitization:** If any user-provided data is used to construct Cypher queries (even if parameterized), ensure proper parameterization is always used by the GAL to prevent injection vulnerabilities. +* **Threat Model:** The Radius threat model must be updated to have a section for the GAL. ### Compatibility (optional) @@ -316,46 +363,53 @@ No changes to the public Radius REST API are anticipated initially, other than p * For existing Radius deployments using etcd as the graph store, a migration path will be necessary. * The public Radius API and CLI should remain backward compatible. * **Data Format:** The structure of the application graph (apps, resources, properties) should remain conceptually the same, even though the storage backend changes. +* **Backend Compatibility:** The GAL ensures that both Kùzu and Postgres with Apache AGE backends provide identical functionality and behavior to Radius components. ### Monitoring and Logging * **Logging:** - * The Graph Abstraction Layer should log all significant operations (e.g., Kùzu queries, errors, transaction boundaries) at appropriate log levels. + * The Graph Access Layer should log all significant operations (e.g., graph queries, errors, transaction boundaries) at appropriate log levels. * **Metrics:** - * Expose metrics from the DAL in Radius OpenTelemetry: - * Number of Kùzu queries (per type: read/write). - * Latency of Kùzu queries. - * Error rates for Kùzu operations. - * Kùzu database size. - * Transaction commit/rollback counts. + * Expose metrics from the GAL in Radius OpenTelemetry: + * Number of graph queries (per type: read/write, per backend). + * Latency of graph queries (per backend). + * Error rates for graph operations (per backend). + * Database size and health metrics. + * Transaction commit/rollback counts (per backend). 
### Development plan -0. **Phase 0: DAL (Milestone 0)** - * Create the DAL. +0. **Phase 0: GAL (Milestone 0)** + * Create the GAL with pluggable backend interface. * Implement CRUD endpoints representing Radius abstraction level graph operations. - * Develop initial unit & integration tests for the DAL. - * Ensure via debug logging that no components are communicating directly with etcd other than the DAL. -1. **Phase 1: Core Integration (Milestone 1)** - * Research Kùzu Go driver capabilities and limitations in detail. - * Set up Kùzu as an embedded dependency - but as a pluggable architecture where the embedded calls could be swapped for network calls to any Cypher compatible graph database. + * Develop initial unit & integration tests for the GAL. + * Ensure via debug logging that no components are communicating directly with etcd other than the GAL. +1. **Phase 1: Kùzu Integration (Milestone 1)** + * Phase 1 will be worked on with etcd code still in use, a configuration flag will be used to use the graph backend so we can ship smaller change sets and users can test if desired. + * Set up Kùzu as an embedded dependency with pluggable architecture. * Define and implement Kùzu schema creation. - * Implement robust error handling and transaction management in the DAL. - * Add Kùzu specific tests (schema creation, backup/restore, etc) - * Modify Radius init and upgrade processes to trigger appropriate behavior in the DAL. -4. **Phase 2: Tooling & Testing (Milestone 2)** - * Implement backup/restore CLI commands. - * Conduct comprehensive E2E testing, performance testing, and security review of DAL threat model. + * Implement robust error handling and transaction management in the GAL for Kùzu backend. + * Add Kùzu specific tests (schema creation, backup/restore, etc). + * Modify Radius init and upgrade processes to trigger appropriate behavior in the GAL. + * Write idempotent migration tool for etcd => graph db. +2. 
**Phase 2: Postgres with Apache AGE Integration (Milestone 2)** + * Research Postgres with Apache AGE capabilities and integration requirements. + * Implement Postgres with Apache AGE backend for the GAL. + * Ensure feature parity between Kùzu and Postgres with Apache AGE backends. + * Add comprehensive tests for Postgres with Apache AGE backend. + * Implement configuration options for backend selection. +3. **Phase 3: Testing & Documentation (Milestone 3)** + * Implement backup/restore CLI commands for both backends. + * Conduct comprehensive E2E testing, performance testing, and security review. * Develop documentation for operators and developers. -3. **Phase 2: Query Enhancement (Milestone 3 - optional)** - * Enhance DAL with more advanced query capabilities (pathfinding, complex traversals, to support new User Stories defined by product). +4. **Phase 4: Query Enhancement (Milestone 4 - optional)** + * Enhance GAL with more advanced query capabilities (pathfinding, complex traversals, to support new User Stories defined by product). ### Open Questions -1. **Kùzu Performance under Concurrent Go Routines:** How well does Kùzu's Go driver and embedded database handle high concurrency from multiple goroutines within a single Radius process? Are there internal locking mechanisms in Kùzu to be aware of? -2. **Schema Evolution:** How will schema changes in Kùzu (e.g., adding new node/edge types, new properties) be managed over time with Radius updates? With the introduction of Radius Types this should remain fairly stable as aside from Environment and Application all nodes would be resources with a type property based on the backing type. -3. **Kùzu Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of an embedded Kùzu instance for representative Radius graph sizes? -4. 
**Dashboard:** The changes proposed here such as nested types and expanded use of connections will make the app graph both richer and larger, the existing dashboard will probably need some UX design and work in order to leverage that effectively and intuitively (I would want to click a gateway to see the types it depends on and their sources in an environment) +1. **Schema Evolution:** How will schema changes (e.g., adding new node/edge types, new properties) be managed over time with Radius upgrades? This will be critical for the GAL to handle gracefully. +2. **Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of each backend for representative Radius graph sizes? +3. **Dashboard:** The changes proposed here such as nested types and expanded use of connections will make the app graph both richer and larger, the existing dashboard will probably need some UX design and work in order to leverage that effectively and intuitively. ### Alternatives considered @@ -364,7 +418,7 @@ No changes to the public Radius REST API are anticipated initially, other than p * **Disadvantages:** Limited query capabilities, known performance bottlenecks for sizeable application graphs, nested rendering logic very manual and complex, tightly coupled to key value stores. 2. **Other Embedded Graph Databases (e.g., a Go-native one if a mature one exists):** * **Advantages:** Could offer tighter integration if fully Go-native. - * **Disadvantages:** Kùzu is chosen for its performance, Cypher support, and active development. A pure Go alternative might lack some of these mature features or performance characteristics. + * **Disadvantages:** Kùzu is chosen for dev/test for its performance, Cypher support, and active development. A pure Go alternative might lack some of these mature features or performance characteristics. 3. 
**Hosted/Server-based Graph Databases (e.g., Neo4j, Dgraph as a service, NebulaGraph):** * **Advantages:** Mature, feature-rich, often provide built-in clustering and HA. - * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), network latency between Radius and the DB, cost, and deviates from the goal of a more self-contained/embeddable solution for core graph logic. This proposal prioritizes decoupling and enhancing capabilities with an embedded solution first. \ No newline at end of file + * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), network latency between Radius and the DB, cost, and deviates from the goal of a more self-contained/embeddable solution for core graph logic. This proposal prioritizes decoupling and enhancing capabilities with a pluggable solution first. \ No newline at end of file From d814c3e001c988a847e067d6aa612cd92acbeb26 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Wed, 11 Jun 2025 23:12:18 -0700 Subject: [PATCH 08/19] Major rework to highlight moving graph only and the challenges with doing all data storage via graph today with multiple RPs writing to the same data store. Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 183 ++++++++++++------ 1 file changed, 124 insertions(+), 59 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index cd1f4cd7..b8079171 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -10,10 +10,12 @@ Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. 
Currently Radius relies on Kubernetes to install etcd, the default datastore, where these graph structures are stored as key value pairs. As we are adding support for nested connections the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. Additionally we are already hitting performance limits in test environments (under a hundred resources were reported to slow things to a crawl) which inspired the work @superbeeny has done to swap the key value operations out to use postgres. Radius users could not use Drasi today to act on changes to their environments as there's no current support for key value stores and if that was added the client side filtering requirements would be a challenge in Drasi, requiring extensive middleware to parse the custom Radius data structures. -This proposal outlines a plan to modify Project Radius to utilize a **graph database** as the primary store for its application graph, accessed via a new **Graph Access Layer (GAL)**. This change aims to decouple the core graph logic and operations from etcd, enabling more powerful graph queries, improving performance for complex relationship traversals, comparable performance to key value stores for non-graph storage/retrieval, and offering a more specialized and efficient graph persistence layer. One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. +This proposal outlines a plan to modify Project Radius to utilize a **graph database** as the primary store for its **application graph data specifically**, accessed via a new **Graph Access Layer (GAL)**. 
This change aims to decouple the core graph logic and operations from etcd, enabling more powerful graph queries, improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. **This proposal is scoped specifically to application graph operations - individual resource metadata and configuration will continue to use the existing `database.Client` interface with the current storage backends (etcd via Kubernetes APIServer, or Postgres for production deployments).** One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. For development, testing, and proof-of-concept environments, we will provide Kùzu as the default embedded graph database. For production environments, the plan is to support Postgres with the Apache AGE plugin, which provides Cypher compatibility and is suitable for scalable, production-grade deployments. The GAL will be designed to support pluggable backends, allowing configuration of the graph database provider to be any supporting Cypher. +**Storage Strategy Rationale:** This proposal focuses specifically on application graph data because: (1) Graph databases excel at relationship queries but may be overkill for simple key-value resource storage; (2) The existing `database.Client` interface and Postgres backend already provide excellent performance for individual resource operations; (3) This allows incremental adoption with lower risk; (4) Different data types (graph relationships vs. resource metadata) have different access patterns and requirements; (5) Production deployments can continue using battle-tested Postgres for resource storage while gaining graph capabilities for application relationship queries. 
+ Delivering this will allow us to shift to a far better user experience where connections become a rich re-usable concept that shares data and exposes deep relationships currently obfuscated by monolithic types with embedded objects (ex: database has a credentials object with username and password properties). ### Terms and Definitions @@ -32,10 +34,10 @@ Delivering this will allow us to shift to a far better user experience where con ### Objectives -1. **Decouple Graph Storage:** Abstract the application graph storage from etcd, allowing Radius to use a dedicated graph database via the Graph Access Layer. +1. **Decouple Graph Storage:** Abstract the application graph storage from etcd, allowing Radius to use a dedicated graph database via the Graph Access Layer while maintaining existing resource storage through the current `database.Client` interface. 2. **Enhance Query Capabilities:** Leverage Cypher query language for more complex and efficient graph traversals and relationship analysis than what is easily achievable with etcd retrieval and client side filtering. -3. **Improve Performance:** Potentially improve the performance of graph read and write operations, especially for large or complex application graphs. -4. **Maintain Existing Functionality:** Ensure that all existing Radius features that rely on the application graph continue to function correctly with the new backend. +3. **Improve Performance:** Improve the performance of graph read and write operations, especially for large or complex application graphs, while maintaining current performance for individual resource operations. +4. **Maintain Existing Functionality:** Ensure that all existing Radius features that rely on the application graph continue to function correctly with the new backend, and that all resource management operations continue unchanged. 
### Issue Reference: @@ -43,16 +45,17 @@ Delivering this will allow us to shift to a far better user experience where con ### Goals -* Implement a Graph Access Layer (GAL) that abstracts graph operations with pluggable backend support. +* Implement a Graph Access Layer (GAL) that abstracts application graph operations with pluggable backend support, working alongside the existing `database.Client` interface for resource storage. * Integrate Kùzu as an embedded graph database for development, testing, and proof-of-concept environments. * Develop Postgres with Apache AGE plugin support for production environments. * Define a clear schema for the Radius application graph within both backends. -* Migrate existing graph data representation (currently in etcd/postgres) to the graph database data model. -* Update Radius RPs and controllers to use the new GAL for all graph-related operations. -* Provide migration tool for moving to graph db. -* Provide mechanisms for backup and potential restore of the graph database as part of Radius install/upgrade/rollback operations. +* Migrate existing application graph data representation (currently derived from resource relationships in etcd/postgres) to the dedicated graph database data model. +* Update Radius components to use the new GAL for all application graph operations while continuing to use `database.Client` for individual resource storage. +* Provide migration tools for moving application graph data to the graph database. +* Provide mechanisms for backup and restore of the graph database as part of Radius install/upgrade/rollback operations. * Develop a comprehensive test suite covering graph operations with both backends. * Add configuration options to select between graph database providers. +* Ensure the GAL can reconstruct application graphs from existing resource data during migration. 
### Non-goals @@ -60,6 +63,8 @@ Delivering this will allow us to shift to a far better user experience where con * Providing a distributed Kùzu cluster as part of this initial integration (Kùzu is primarily embedded; clustering would be a separate, future consideration if needed). * Exposing direct Cypher query capabilities to end-users of Radius (interaction should remain through Radius APIs and abstractions). * Supporting zero-downtime migration from etcd to graph database (migration will require a maintenance window). +* Replacing the existing `database.Client` interface or resource storage mechanisms - individual resource metadata will continue to use the current storage backends (etcd via Kubernetes APIServer or Postgres). +* Migrating non-graph data (individual resource configurations, secrets, etc.) to the graph database - this proposal is specifically scoped to application graph relationships and traversal operations. ### User Scenarios (optional) @@ -113,42 +118,70 @@ Delivering this will allow us to shift to a far better user experience where con ] ``` -**Example 2: Troubleshooting Query (Scenario 2 - Finding certificate domains)** +**Example 2: Troubleshooting Gateway Certificates (Realistic User Interaction)** -* **Sample Input:** +* **User Command:** + ```bash + rad resource inspect gateway api-gateway --show-dependencies --type secret + ``` + +* **Internal GAL Query (not exposed to user):** ``` - // Find all secrets connected to a gateway resource for certificate inspection + // GAL finds secrets connected to the gateway MATCH (gateway:Resource {name: 'api-gateway', type: 'Applications.Core/gateway'}) -[conn:CONNECTION]->(secret:Resource {type: 'Applications.Core/secret'}) - RETURN gateway.name AS gateway, secret.name AS secretName, - secret.properties.domain AS certificateDomain + RETURN secret.id AS secretId, secret.name AS secretName ``` -* **Sample Output:** +* **GAL Response to CoreRP:** ```json [ - { - "gateway": "api-gateway", - "secretName": 
"tls-cert", - "certificateDomain": "wrong-domain.com" - }, - { - "gateway": "api-gateway", - "secretName": "tls-key", - "certificateDomain": "wrong-domain.com" - } + {"secretId": "/subscriptions/.../secrets/tls-cert", "secretName": "tls-cert"}, + {"secretId": "/subscriptions/.../secrets/tls-key", "secretName": "tls-key"} ] ``` +* **CoreRP then retrieves full secret data via database.Client and returns to user:** + ```json + { + "gateway": "api-gateway", + "connected_secrets": [ + { + "name": "tls-cert", + "type": "Applications.Core/secret", + "properties": { + "type": "certificate", + "domain": "wrong-domain.com", + "expires": "2024-12-31T23:59:59Z" + } + }, + { + "name": "tls-key", + "type": "Applications.Core/secret", + "properties": { + "type": "private-key", + "domain": "wrong-domain.com" + } + } + ] + } + ``` + +* **User Benefit:** User immediately sees all certificate secrets connected to their gateway and discovers the wrong domain configuration, enabling quick troubleshooting without manually checking each secret individually. + ### Design #### High-Level Design -1. **Introduce Graph Access Layer (GAL):** The GAL will be integrated as an internal service within the primary Radius component(s) responsible for managing the application graph (e.g., the Radius Core RP or a dedicated graph service). -2. **Pluggable Backend Support:** The GAL will support both Kùzu (for dev/test/POC) and Postgres+AGE (for production), with configuration to select the backend. -3. **Schema Definition:** A formal schema for Radius entities (Applications, Environments, Resources, etc. as nodes) and their relationships (as edges with types and properties) will be defined and enforced in both backends. -4. **Data Synchronization/Migration:** Provide a migration tool to copy existing Radius environments/app graph to the newly installed graph db. -5. 
**Component Updates:** All Radius components that currently interact with etcd or Postgres for graph information will be updated to use the new GAL for all graph-related operations. +1. **Introduce Graph Access Layer (GAL):** The GAL will be integrated as an internal service within Radius components responsible for managing the application graph, working alongside the existing `database.Client` interface for resource storage. +2. **Dual Storage Architecture:** Resources will continue to be stored using the current `database.Client` interface (etcd via Kubernetes APIServer, or Postgres), while application graph relationships will be managed by the GAL with graph database backends. +3. **Pluggable Backend Support:** The GAL will support both Kùzu (for dev/test/POC) and Postgres with Apache AGE (for production), with configuration to select the backend. +4. **Schema Definition:** A formal schema for application graph entities (Applications, Environments, Resources as nodes) and their relationships (as edges with types and properties) will be defined and enforced in both graph backends. +5. **Minimal Graph Data Storage:** The GAL will store only essential resource metadata in the graph database (resource ID, type, connections, and optional properties needed for query filtering), with full resource data retrieval continuing through the existing `database.Client` interface based on graph query results. +6. **Data Synchronization:** The GAL will maintain synchronization between resource changes (via `database.Client`) and their corresponding minimal graph representations, ensuring the application graph accurately reflects the current state of resource relationships and queryable properties. +7. 
**Component Updates:** Radius components that currently perform application graph operations (traversals, relationship queries) will be updated to use the GAL for graph queries, then retrieve full resource data via `database.Client` based on the graph results, while continuing to use `database.Client` directly for individual resource CRUD operations. + +**Storage Strategy Rationale:** We chose to maintain the existing `database.Client` interface for resource storage because: (1) The current Postgres and etcd backends already provide excellent performance for individual resource operations; (2) Graph databases are optimized for relationship queries, not necessarily single-record lookups; (3) This approach allows incremental migration with lower risk; (4) Different data types (individual resources vs. relationships) have different access patterns and requirements; (5) Production deployments can continue using battle-tested storage patterns while gaining graph capabilities for application relationship queries. **The GAL will store only minimal resource metadata (ID, type, connections, and optional query-filtering properties) in the graph database, with complete resource data retrieval continuing through the existing `database.Client` based on graph query results.** #### Architecture Diagram @@ -156,31 +189,38 @@ Delivering this will allow us to shift to a far better user experience where con graph TD A["Radius CLI/API
<br/>(User Interactions)"] B["Radius CoreRP<br/>"] - D["Graph Access Layer<br/>(GAL)"] - E1["Kùzu Embedded DB<br/>(Dev/Test/POC)"] - E2["Postgres + Apache AGE<br/>(Production)"] + C["database.Client<br/>(Resource Storage)"] + D["Graph Access Layer<br/>(Application Graph)"] + E1["etcd (Current) | Postgres<br/>(Production)"] + F1["Kùzu Embedded DB<br/>(Dev/Test/POC)"] + F2["Postgres + Apache AGE<br/>
(Production)"] A <--> B - B <-->|"storage/query"| D - D -- "dev/test/POC" --> E1 - D -- "production" --> E2 + B <-->|"resource CRUD"| C + B <-->|"graph queries"| D + C --> E1 + D -- "dev/test/POC" --> F1 + D -- "production" --> F2 ``` -* **Current (Simplified):** Radius Core Components <-> etcd -* **Proposed:** Radius Core Components <-> Graph Access Layer <-> Graph Database (Kùzu or Postgres+AGE) +* **Current (Simplified):** Radius Core Components <-> database.Client <-> etcd/Postgres +* **Proposed:** + * Resource Storage: Radius Core Components <-> database.Client <-> etcd/Postgres (unchanged) + * Application Graph: Radius Core Components <-> Graph Access Layer <-> Graph Database (Kùzu or Postgres+AGE) #### Detailed Design 1. **Graph Access Layer (GAL) Implementation:** - * The GAL will be implemented as a Go service that abstracts all graph operations. + * The GAL will be implemented as a Go service that abstracts all application graph operations, working alongside the existing `database.Client` interface. * It will provide a pluggable interface allowing different graph database backends. * Configuration will determine which backend to use (Kùzu for dev/test, Postgres+AGE for production). + * The GAL will be responsible for maintaining synchronization between resource changes and their graph representations. 2. **Dev/Test/POC Backend - Kùzu Integration:** * The Kùzu Go driver (`github.com/kuzudb/go-kuzu`) will be used. * Kùzu database will be initialized during `rad init`. The database file (`radius_app_graph.kuzu`) will be stored on persistent storage accessible to the Radius control plane. - * Initially CoreRP will move existing etcd CRUD operations to the GAL without using Kùzu. - * Once the GAL is released Kùzu support will be added. + * The GAL will populate the graph database by analyzing existing resource relationships stored via `database.Client`. 
+ * All application graph queries will use the GAL, while individual resource operations continue through `database.Client`. 3. **Production Backend - Postgres with Apache AGE:** * Postgres with Apache AGE plugin will be supported for production deployments. @@ -189,8 +229,9 @@ graph TD 4. **Schema Management:** * A Go module will define constants for node labels (e.g., `NodeTypeApplication`, `NodeTypeResource`) and edge labels (all edges will be type `CONNECTION`). - * On startup, Radius will ensure the schema (node tables, relationship tables, property definitions) exists in the backend, creating or migrating it if necessary. - * Node properties will be strongly typed (string, int, bool, arrays, maps). Complex nested objects might need to be stored as JSON strings if direct support is limited, or flattened, but no current use case should require this. + * On startup, the GAL will ensure the schema (node tables, relationship tables, property definitions) exists in the backend, creating or migrating it if necessary. + * **Minimal Data Storage:** Graph nodes will contain only essential resource metadata (ID, type, and optional properties needed for query filtering), with complete resource data remaining in the existing `database.Client` storage. API responses will use graph queries to identify relevant resources, then retrieve full resource data via `database.Client`. + * The GAL will handle synchronization between resource updates (via `database.Client`) and their corresponding minimal graph representations. 5. 
**Graph Access Layer (GAL) API:** * Example Go interface: @@ -213,12 +254,10 @@ graph TD GetIncomingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) FindPaths(ctx context.Context, startNodeID, endNodeID string, maxHops int) ([][]Node, error) // More complex queries ExecuteCypherQuery(ctx context.Context, query string, params map[string]interface{}) ([]map[string]interface{}, error) // For advanced internal use - } - - type Node struct { + } type Node struct { ID string Type string // e.g., "Applications.Core/application" - Properties map[string]interface{} + Properties map[string]interface{} // Minimal properties for query filtering only } type Edge struct { @@ -238,16 +277,17 @@ graph TD * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a database transaction to ensure atomicity. The GAL will manage this. * Radius upgrades and rollbacks would need to coordinate with the GAL. -#### Advantages (of Graph Database via GAL over etcd/key value stores for graph storage) +#### Advantages (of Graph Database via GAL for application graph operations) -* **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering etcd values client side. +* **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering etcd values client side for application graph traversals. * **Performance:** For complex graph traversals (multi-hop queries, pathfinding), a graph database is likely to be much faster as it's optimized for such operations. For Radius this would be during most recipe execution as the entire graph is rendered. -* **Specialized Data Store:** Graph databases are purpose-built for graph data, leading to efficient storage and indexing for graph structures while remaining performant for standard storage and retrieval operations of non-graph data. 
-* **Pluggable Backend:** The GAL will allow for Radius users to use any Cypher compatible graph database such as Neo4J or CosmosDB with Gremlin. -* **Transactional Guarantees:** Both Kùzu and Postgres with Apache AGE provide ACID transactions for graph operations, we've already encountered many situations where the graph is in a bad state from a test failing to clean up properly etc. We may be able to move the entire test framework to using transaction and rollback per test eliminating this problem entirely. -* **Schema Enforcement:** Better ability to define and enforce a graph schema. -* **Support for streaming monitoring of graph changes:** A project like Drasi cannot consume a change feed of the Radius app graph because it doesn't support key value stores and would require extensive middleware custom to Radius data structures to replicate the client side filtering necessary for identifying the changes desired. -* **Simplified Go Code:** Eliminates the complex imperative Go code currently required for creating and traversing key-value structures in etcd, replacing it with declarative Cypher queries that are more maintainable and less error-prone. +* **Specialized Data Store:** Graph databases are purpose-built for graph data, leading to efficient storage and indexing for application graph structures while allowing the existing `database.Client` to continue handling individual resource storage efficiently. +* **Pluggable Backend:** The GAL will allow for Radius users to use any Cypher compatible graph database such as Neo4J or CosmosDB with Gremlin for application graph operations. +* **Transactional Guarantees:** Both Kùzu and Postgres with Apache AGE provide ACID transactions for graph operations, ensuring application graph consistency during complex deployments. +* **Schema Enforcement:** Better ability to define and enforce an application graph schema separate from individual resource schemas. 
+* **Support for streaming monitoring of graph changes:** A project like Drasi can consume a change feed of the Radius application graph because graph databases can provide relationship-aware change streams, eliminating the need for extensive middleware to parse custom Radius data structures. +* **Simplified Go Code:** Eliminates the complex imperative Go code currently required for creating and traversing application graph relationships, replacing it with declarative Cypher queries that are more maintainable and less error-prone. +* **Separation of Concerns:** Application graph operations and individual resource storage can be optimized independently, with each using the most appropriate storage technology. --- @@ -311,7 +351,10 @@ No changes to the public Radius REST API are anticipated initially, other than p #### Core RP (Resource Provider) -* Core RP will use the GAL to manage the graph during deployment rendering and query the GAL for any app graph API requests. +* Core RP will continue to use the existing `database.Client` interface for all individual resource storage, retrieval, and full CRUD operations. +* Core RP will use the GAL to perform application graph queries (finding connected resources, traversing relationships), then retrieve complete resource data via `database.Client` based on the resource IDs returned from graph queries. +* The GAL will be responsible for maintaining synchronization between resource changes (via `database.Client`) and their corresponding minimal representations in the application graph (ID, type, connections, and essential query properties only). +* **API Response Pattern:** Graph queries identify relevant resources → Full resource data retrieved via `database.Client` → Complete API responses assembled from full resource data. ### Error Handling @@ -328,10 +371,14 @@ No changes to the public Radius REST API are anticipated initially, other than p 2. 
**Integration Tests:** * Test the GAL against actual backend instances (both Kùzu and Postgres with Apache AGE). * Verify CRUD operations for nodes and edges with various property types. - * Test transactional behavior for both backends. - * Test Core RP interacting with the GAL-backed graph stores. + * Test transactional behavior for both backends. * Test Core RP interacting with the GAL-backed graph stores. + * Verify that graph queries return correct resource IDs and that subsequent `database.Client` retrievals return complete resource data. 3. **End-to-End (E2E) Tests:** * Adapt existing Radius E2E tests to ensure all application deployment and management scenarios function correctly with both graph backends. + * Verify that resource operations via `database.Client` continue to work unchanged. + * Test that application graph operations via GAL work correctly alongside resource operations. + * Verify that API responses contain complete resource data assembled from graph queries + `database.Client` retrieval. + * Include tests for data synchronization between resource storage and minimal graph representation. * Include tests for data persistence across Radius restarts and upgrades/rollbacks. 4. **Performance Tests:** * Benchmark graph read/write operations with both backends against the current key/value based implementation for representative workloads. @@ -421,4 +468,22 @@ No changes to the public Radius REST API are anticipated initially, other than p * **Disadvantages:** Kùzu is chosen for dev/test for its performance, Cypher support, and active development. A pure Go alternative might lack some of these mature features or performance characteristics. 3. **Hosted/Server-based Graph Databases (e.g., Neo4j, Dgraph as a service, NebulaGraph):** * **Advantages:** Mature, feature-rich, often provide built-in clustering and HA. 
- * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), network latency between Radius and the DB, cost, and deviates from the goal of a more self-contained/embeddable solution for core graph logic. This proposal prioritizes decoupling and enhancing capabilities with a pluggable solution first. \ No newline at end of file + * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), network latency between Radius and the DB, cost, and deviates from the goal of a more self-contained/embeddable solution for core graph logic. This proposal prioritizes decoupling and enhancing capabilities with a pluggable solution first. + +### Full Storage Migration Effort Evaluation + +During the design phase, we evaluated the effort required to move ALL Radius storage (not just application graph operations) to the graph database. This analysis revealed that such an approach would require: + +**Effort Assessment:** +- **Timeline**: 12-18 months of development effort +- **Team Size**: 6-8 engineers +- **Risk Level**: Very High - no rollback path, unknown performance characteristics + +**Key Challenges Identified:** +1. **Database Client Interface Replacement**: The unified `database.Client` interface is used by all 5 resource providers (CoreRP, DatastoresRP, DaprRP, MessagingRP, DynamicRP) and supports complex query patterns, optimistic concurrency control, and resource metadata operations that would require significant re-engineering for storage operations to be via a single component. + +2. **Data Model Transformation**: Current resources are optimized for key-value storage with complex nested JSON structures, scope-based organization, and rich metadata that would require fundamental restructuring for graph storage. + +3. **Performance Trade-offs**: Most Radius operations are simple CRUD operations on individual resources, where traditional databases excel. 
Graph databases optimize for relationship traversals but may perform worse for Radius's retrieval workload patterns. + +**Conclusion**: The scoped approach (application graph only) provides the core benefits of graph database technology (enhanced relationship querying, better performance for graph traversals, support for complex connections) while maintaining the proven, optimized storage patterns for individual resource operations. This delivers significant value with much lower risk and development investment. Once Radius Extensibility has shipped many of the existing RPs will be deprecated in favor of Core Types in DynamicRP, making the work to transition to a single data store viable if we decide to do it to simplify the architecture. From 18b711451d12bf9901a160065131b35182886346 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 12 Jun 2025 09:31:46 -0700 Subject: [PATCH 09/19] minor update per brooke's comment Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index b8079171..552e1bd4 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -12,7 +12,7 @@ Project Radius currently defines and manages application graphs, representing re This proposal outlines a plan to modify Project Radius to utilize a **graph database** as the primary store for its **application graph data specifically**, accessed via a new **Graph Access Layer (GAL)**. This change aims to decouple the core graph logic and operations from etcd, enabling more powerful graph queries, improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. 
**This proposal is scoped specifically to application graph operations - individual resource metadata and configuration will continue to use the existing `database.Client` interface with the current storage backends (etcd via Kubernetes APIServer, or Postgres for production deployments).** One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. -For development, testing, and proof-of-concept environments, we will provide Kùzu as the default embedded graph database. For production environments, the plan is to support Postgres with the Apache AGE plugin, which provides Cypher compatibility and is suitable for scalable, production-grade deployments. The GAL will be designed to support pluggable backends, allowing configuration of the graph database provider to be any supporting Cypher. +The **Graph Access Layer (GAL)** will be designed with a pluggable backend architecture that supports any Cypher-compliant graph database, enabling flexible deployment scenarios and allowing users to choose the graph database that best fits their operational requirements and constraints. This approach provides the freedom to leverage different graph database technologies while maintaining a consistent abstraction layer for Radius components. **Storage Strategy Rationale:** This proposal focuses specifically on application graph data because: (1) Graph databases excel at relationship queries but may be overkill for simple key-value resource storage; (2) The existing `database.Client` interface and Postgres backend already provide excellent performance for individual resource operations; (3) This allows incremental adoption with lower risk; (4) Different data types (graph relationships vs. 
resource metadata) have different access patterns and requirements; (5) Production deployments can continue using battle-tested Postgres for resource storage while gaining graph capabilities for application relationship queries. From 3419fdd125915957e4533a52147da8256a77954e Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 12 Jun 2025 09:48:34 -0700 Subject: [PATCH 10/19] Added section on installation config and helm config Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 146 +++++++++++++++++- 1 file changed, 141 insertions(+), 5 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index 552e1bd4..b4570947 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -192,21 +192,19 @@ graph TD
 C["database.Client<br/>(Resource Storage)"]
 D["Graph Access Layer<br/>(Application Graph)"]
 E1["etcd (Current) | Postgres<br/>(Production)"]
- F1["Kùzu Embedded DB<br/>(Dev/Test/POC)"]
- F2["Postgres + Apache AGE<br/>(Production)"]
+ F["Cypher-compliant Graph DB<br/>(Kùzu, Postgres+AGE, Neo4j, etc.)"]
 A <--> B B <-->|"resource CRUD"| C B <-->|"graph queries"| D C --> E1 - D -- "dev/test/POC" --> F1 - D -- "production" --> F2 + D --> F ``` * **Current (Simplified):** Radius Core Components <-> database.Client <-> etcd/Postgres * **Proposed:** * Resource Storage: Radius Core Components <-> database.Client <-> etcd/Postgres (unchanged) - * Application Graph: Radius Core Components <-> Graph Access Layer <-> Graph Database (Kùzu or Postgres+AGE) + * Application Graph: Radius Core Components <-> Graph Access Layer <-> Cypher-compliant Graph Database #### Detailed Design @@ -219,6 +217,7 @@ graph TD 2. **Dev/Test/POC Backend - Kùzu Integration:** * The Kùzu Go driver (`github.com/kuzudb/go-kuzu`) will be used. * Kùzu database will be initialized during `rad init`. The database file (`radius_app_graph.kuzu`) will be stored on persistent storage accessible to the Radius control plane. + * **Performance Advantage:** As an embedded graph database, Kùzu eliminates network connection overhead between the GAL and the graph database, providing significantly faster performance for development and test scenarios compared to networked database solutions. This enables rapid iteration during development and faster test suite execution. * The GAL will populate the graph database by analyzing existing resource relationships stored via `database.Client`. * All application graph queries will use the GAL, while individual resource operations continue through `database.Client`. @@ -487,3 +486,140 @@ During the design phase, we evaluated the effort required to move ALL Radius sto 3. **Performance Trade-offs**: Most Radius operations are simple CRUD operations on individual resources, where traditional databases excel. Graph databases optimize for relationship traversals but may perform worse for Radius's retrieval workload patterns.
**Conclusion**: The scoped approach (application graph only) provides the core benefits of graph database technology (enhanced relationship querying, better performance for graph traversals, support for complex connections) while maintaining the proven, optimized storage patterns for individual resource operations. This delivers significant value with much lower risk and development investment. Once Radius Extensibility has shipped many of the existing RPs will be deprecated in favor of Core Types in DynamicRP, making the work to transition to a single data store viable if we decide to do it to simplify the architecture. + +### Graph Database Selection Methodology + +**Based on established Radius configuration patterns, the following two-tier approach provides consistent user experience for selecting which graph database to use for a Radius installation:** + +#### Tier 1: Interactive Installation Configuration + +**Command:** `rad init --full` + +Users are presented with an interactive menu during the full initialization process to select their preferred graph database backend: + +``` +? 
Select graph database for application graph operations: + > Kùzu (Embedded - recommended for development/testing) + PostgreSQL with Apache AGE (Network-based - recommended for production) + Custom Cypher-compatible database (advanced configuration) +``` + +**Implementation Details:** +- Follows established pattern from `rad init --full` for AWS IRSA vs access keys, Azure Service Principal vs Workload Identity +- Default selection: Kùzu for simplicity and zero external dependencies +- Stores selection in Radius configuration for subsequent `rad install` operations +- Advanced option allows users to specify custom connection strings for other Cypher-compatible databases + +#### Tier 2: Non-Interactive Installation Parameters + +**Command:** `rad install kubernetes --set global.graphDatabase.*` + +For automated deployments and GitOps scenarios, users can specify graph database configuration via Helm chart parameters: + +**Kùzu Configuration (Default):** +```bash +rad install kubernetes --set global.graphDatabase.type=kuzu \ + --set global.graphDatabase.kuzu.persistentVolume.enabled=true \ + --set global.graphDatabase.kuzu.persistentVolume.size=10Gi +``` + +**PostgreSQL with Apache AGE Configuration:** +```bash +rad install kubernetes --set global.graphDatabase.type=postgresql-age \ + --set global.graphDatabase.postgresql.host=postgres.example.com \ + --set global.graphDatabase.postgresql.port=5432 \ + --set global.graphDatabase.postgresql.database=radius_graph \ + --set global.graphDatabase.postgresql.username=radius_user \ + --set global.graphDatabase.postgresql.passwordSecretName=postgres-credentials \ + --set global.graphDatabase.postgresql.sslMode=require +``` + +**Custom Cypher Database Configuration:** +```bash +rad install kubernetes --set global.graphDatabase.type=custom \ + --set global.graphDatabase.custom.connectionString="bolt://neo4j.example.com:7687" \ + --set global.graphDatabase.custom.credentialsSecretName=neo4j-credentials \ + --set 
global.graphDatabase.custom.dialect=neo4j +``` + +#### Configuration Schema + +The GAL will support the following configuration structure in Helm values: + +```yaml +global: + graphDatabase: + type: kuzu # kuzu | postgresql-age | custom + + # Kùzu-specific configuration + kuzu: + persistentVolume: + enabled: true + size: 10Gi + storageClass: "" + dataDirectory: "/data/kuzu" + + # PostgreSQL with Apache AGE configuration + postgresql: + host: "localhost" + port: 5432 + database: "radius_graph" + username: "radius_user" + passwordSecretName: "postgres-credentials" + sslMode: "prefer" # disable | prefer | require + connectionPoolSize: 10 + + # Custom Cypher database configuration + custom: + connectionString: "" + credentialsSecretName: "" + dialect: "neo4j" # neo4j | amazon-neptune | others + additionalParams: {} +``` + +#### Environment Variable Mapping + +The GAL will read configuration from these environment variables (set by Helm chart): + +```bash +# Graph database type selection +RADIUS_GRAPH_DATABASE_TYPE=kuzu + +# Kùzu configuration +RADIUS_GRAPH_KUZU_DATA_DIR=/data/kuzu + +# PostgreSQL configuration +RADIUS_GRAPH_POSTGRESQL_HOST=postgres.example.com +RADIUS_GRAPH_POSTGRESQL_PORT=5432 +RADIUS_GRAPH_POSTGRESQL_DATABASE=radius_graph +RADIUS_GRAPH_POSTGRESQL_USERNAME=radius_user +RADIUS_GRAPH_POSTGRESQL_PASSWORD_FILE=/etc/secrets/postgres/password +RADIUS_GRAPH_POSTGRESQL_SSL_MODE=require + +# Custom configuration +RADIUS_GRAPH_CUSTOM_CONNECTION_STRING=bolt://neo4j.example.com:7687 +RADIUS_GRAPH_CUSTOM_CREDENTIALS_FILE=/etc/secrets/custom/credentials +RADIUS_GRAPH_CUSTOM_DIALECT=neo4j +``` + +#### Design Rationale + +This approach follows **established Radius patterns**: + +1. **Interactive Configuration Pattern**: Mirrors `rad init --full` interactive flows for cloud provider credential configuration +2. **Installation-time Parameters**: Consistent with `rad install kubernetes --set` usage for Helm chart customization +3. 
**Credential Management**: Follows existing patterns for handling sensitive configuration via Kubernetes secrets +4. **Default Behavior**: Provides sensible defaults (Kùzu) while allowing production-ready alternatives (PostgreSQL+AGE) +5. **Extensibility**: Supports future Cypher-compatible databases through custom configuration + +**Benefits:** +- **Consistency**: Aligns with existing Radius user experience patterns +- **Flexibility**: Supports both development (embedded Kùzu) and production (networked PostgreSQL) scenarios +- **Automation-Friendly**: Non-interactive configuration supports GitOps and CI/CD workflows +- **Progressive Disclosure**: Simple defaults with advanced options for power users + +**Implementation Notes:** +- GAL initialization logic will read configuration from environment variables +- Database connections will be established during GAL startup with appropriate error handling +- Connection pooling and retry logic will be implemented for networked database options +- Schema initialization will be database-specific but abstracted through the GAL interface From 4f9f68546e0958574d6e050995163c78af769a38 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 12 Jun 2025 09:53:18 -0700 Subject: [PATCH 11/19] added links Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 24 ++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index b4570947..6f4a36e9 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -24,8 +24,8 @@ Delivering this will allow us to shift to a far better user experience where con * **Application Graph:** A representation of an application's components (resources, services, environments, etc.) 
as nodes and their interconnections as edges, including metadata on both. * **Kubernetes (K8s):** An open-source system for automating deployment, scaling, and management of containerized applications. * **Graph Access Layer (GAL):** The internal abstraction layer that mediates all graph operations between Radius components and the underlying graph database. -* **Kùzu:** An embeddable, transactional, high-performance graph database management system (GDBMS) supporting the Cypher query language and property graphs. Provided for dev/test/POC. -* **Postgres with Apache AGE:** A production-ready, scalable graph database solution supporting Cypher queries, built on PostgreSQL. +* **Kùzu:** An embeddable, transactional, high-performance graph database management system (GDBMS) supporting the Cypher query language and property graphs. Provided for dev/test/POC. [GitHub](https://github.com/kuzudb/kuzu) +* **Postgres with Apache AGE:** A production-ready, scalable graph database solution supporting Cypher queries, built on PostgreSQL. [Website](https://age.apache.org/) | [GitHub](https://github.com/apache/age) * **Node:** An entity in a graph (e.g., a Radius resource, an environment). * **Edge:** A relationship between two nodes in a graph (e.g., "connectsTo", "runsIn"). All edges in Radius will be of type "CONNECTION". * **Property:** Key-value pairs associated with nodes or edges, storing metadata. @@ -272,7 +272,25 @@ graph TD * **Kùzu (Dev/Test/POC):** Kùzu runs embedded, so the Radius process managing it is authoritative. The database file (`radius_app_graph.kuzu`) is stored on persistent storage accessible to the Radius control plane. * **Postgres+AGE (Production):** Network-based connection with standard PostgreSQL high availability, clustering, and backup mechanisms. -5. **Transaction Management:** +5. 
**Graph Database Recovery and Regeneration:** + * **Resilient Design:** Since the graph database stores only application relationship data (not the authoritative resource records), it can be safely rebuilt at any time from the canonical resource data stored via `database.Client`. + * **Migration Tool as Recovery Tool:** The same migration tool used for initial etcd-to-graph migration can be executed at any time to regenerate the complete application graph from existing resource relationships stored in etcd/Postgres. + * **Zero Data Loss Scenario:** If the graph database is lost or corrupted: + 1. Radius continues operating for individual resource operations via `database.Client` + 2. Application graph queries will fail gracefully with appropriate error messages + 3. The migration/rebuild tool can be executed to recreate the graph database from scratch + 4. All historical relationship data is preserved because resources maintain connection metadata in their stored definitions + * **Operational Benefits:** This design eliminates the need for complex graph database backup strategies since the authoritative data remains in the proven `database.Client` storage layer. Graph database backups become a performance optimization rather than a data protection requirement. + * **Recovery Commands:** + ```bash + # Detect and rebuild corrupted/missing graph database + rad admin graph rebuild --from-resources + + # Verify graph database integrity against resource storage + rad admin graph verify --repair-if-needed + ``` + +6. **Transaction Management:** * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a database transaction to ensure atomicity. The GAL will manage this. * Radius upgrades and rollbacks would need to coordinate with the GAL. 
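
The rebuild path described under "Graph Database Recovery and Regeneration" can be sketched roughly as follows. This is a minimal illustration and not the actual migration tool: `Resource`, `Edge`, `Graph`, and `RebuildGraph` are hypothetical names, the resource records are hard-coded rather than read through `database.Client`, and the result is held in memory rather than written through the GAL. It only shows the core idea that the graph is derived data, rebuilt from the connection metadata each resource already stores.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical, simplified view of a stored resource record:
// its ID, its type, and the IDs of the resources it connects to.
type Resource struct {
	ID          string
	Type        string
	Connections []string
}

// Edge is a directed CONNECTION edge between two resource IDs.
type Edge struct {
	From, To string
}

// Graph is the minimal graph representation: node ID -> type, plus edges.
type Graph struct {
	Nodes map[string]string
	Edges []Edge
}

// RebuildGraph derives the graph purely from resource records. Connections
// that point at unknown resources are reported as dangling instead of
// being added as edges, which is what a verify step could flag.
func RebuildGraph(resources []Resource) (Graph, []Edge) {
	g := Graph{Nodes: make(map[string]string, len(resources))}
	// First pass: every resource becomes a node.
	for _, r := range resources {
		g.Nodes[r.ID] = r.Type
	}
	// Second pass: every connection to a known node becomes an edge.
	var dangling []Edge
	for _, r := range resources {
		for _, target := range r.Connections {
			e := Edge{From: r.ID, To: target}
			if _, ok := g.Nodes[target]; ok {
				g.Edges = append(g.Edges, e)
			} else {
				dangling = append(dangling, e)
			}
		}
	}
	// Sort edges so repeated rebuilds produce identical output.
	sort.Slice(g.Edges, func(i, j int) bool {
		if g.Edges[i].From != g.Edges[j].From {
			return g.Edges[i].From < g.Edges[j].From
		}
		return g.Edges[i].To < g.Edges[j].To
	})
	return g, dangling
}

func main() {
	// Illustrative records only; IDs are shortened for readability.
	resources := []Resource{
		{ID: "app/frontend", Type: "Applications.Core/containers", Connections: []string{"app/backend"}},
		{ID: "app/backend", Type: "Applications.Core/containers", Connections: []string{"app/db", "app/cache"}},
		{ID: "app/db", Type: "Applications.Datastores/sqlDatabases"},
	}
	g, dangling := RebuildGraph(resources)
	fmt.Printf("nodes=%d edges=%d dangling=%d\n", len(g.Nodes), len(g.Edges), len(dangling))
}
```

Because the function is deterministic and reads only resource records, running it twice over the same input yields the same graph, which is the property that lets a rebuild double as a recovery step.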
From c6e66aea180687183150289d7a73d1e07c4a969a Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 12 Jun 2025 10:12:41 -0700 Subject: [PATCH 12/19] updated persistent volume info, alternatives considered Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 21 +++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index 6f4a36e9..3a3fc75b 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -216,7 +216,7 @@ graph TD 2. **Dev/Test/POC Backend - Kùzu Integration:** * The Kùzu Go driver (`github.com/kuzudb/go-kuzu`) will be used. - * Kùzu database will be initialized during `rad init`. The database file (`radius_app_graph.kuzu`) will be stored on persistent storage accessible to the Radius control plane. + * Kùzu database will be initialized during `rad install`. The database file (`radius_app_graph.kuzu`) will be stored on persistent storage accessible to the Radius control plane. * **Performance Advantage:** As an embedded graph database, Kùzu eliminates network connection overhead between the GAL and the graph database, providing significantly faster performance for development and test scenarios compared to networked database solutions. This enables rapid iteration during development and faster test suite execution. * The GAL will populate the graph database by analyzing existing resource relationships stored via `database.Client`. * All application graph queries will use the GAL, while individual resource operations continue through `database.Client`. @@ -486,6 +486,9 @@ No changes to the public Radius REST API are anticipated initially, other than p 3. 
**Hosted/Server-based Graph Databases (e.g., Neo4j, Dgraph as a service, NebulaGraph):** * **Advantages:** Mature, feature-rich, often provide built-in clustering and HA. * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), network latency between Radius and the DB, cost, and deviates from the goal of a more self-contained/embeddable solution for core graph logic. This proposal prioritizes decoupling and enhancing capabilities with a pluggable solution first. +4. **Cayley Graph Database:** + * **Advantages:** Open-source graph database with support for multiple query languages and storage backends. + * **Disadvantages:** Not designed as an embedded-first solution, requiring additional configuration and operational overhead compared to Kùzu. Would require setting up and managing a separate service instance, network connections, and handling deployment complexity that diverges from our goal of a streamlined, embeddable solution for development and testing environments. 
### Full Storage Migration Effort Evaluation @@ -538,7 +541,8 @@ For automated deployments and GitOps scenarios, users can specify graph database ```bash rad install kubernetes --set global.graphDatabase.type=kuzu \ --set global.graphDatabase.kuzu.persistentVolume.enabled=true \ - --set global.graphDatabase.kuzu.persistentVolume.size=10Gi + --set global.graphDatabase.kuzu.persistentVolume.size=10Gi \ + --set global.graphDatabase.kuzu.persistentVolume.storageClass=fast-ssd ``` **PostgreSQL with Apache AGE Configuration:** @@ -568,14 +572,23 @@ The GAL will support the following configuration structure in Helm values: global: graphDatabase: type: kuzu # kuzu | postgresql-age | custom - - # Kùzu-specific configuration + # Kùzu-specific configuration kuzu: persistentVolume: enabled: true size: 10Gi storageClass: "" + accessModes: ["ReadWriteOnce"] + annotations: {} + labels: {} dataDirectory: "/data/kuzu" + resources: + requests: + cpu: "100m" + memory: "256Mi" + limits: + cpu: "500m" + memory: "1Gi" # PostgreSQL with Apache AGE configuration postgresql: From fb418af7abccd89a5f9acb4a970f8c93c0eb2b83 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Thu, 12 Jun 2025 11:35:56 -0700 Subject: [PATCH 13/19] updated spellcheck Signed-off-by: Sylvain Niles --- .github/config/en-custom.txt | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/.github/config/en-custom.txt b/.github/config/en-custom.txt index f967a11d..d8347cec 100644 --- a/.github/config/en-custom.txt +++ b/.github/config/en-custom.txt @@ -935,3 +935,19 @@ GitRepository pyspelling Pyspelling CNCF +Kùzu +Cypher +Drasi +GraphStore +connectsTo +runsIn +GAL +Nebulagraph +NebulaGraph +Dgraph +neptune +CoreRP +DatastoresRP +DaprRP +MessagingRP +DynamicRP From facb19eed780352ac6c76ccc2083e879d04aa0bd Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Mon, 16 Jun 2025 18:49:32 -0700 Subject: [PATCH 14/19] revamp to use postgres + age instead of kuzu. 
Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 510 ++++++++---------- 1 file changed, 234 insertions(+), 276 deletions(-) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md index 3a3fc75b..e5fa6113 100644 --- a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md @@ -1,7 +1,7 @@ -## Proposal: Integrating a Graph Database for Radius Application Graph +## Proposal: Integrating PostgreSQL with Apache AGE for Radius Application Graph -**Version:** 1.2 -**Date:** June 11, 2025 +**Version:** 2.0 +**Date:** June 16, 2025 **Author:** Sylvain Niles --- @@ -10,11 +10,11 @@ Project Radius currently defines and manages application graphs, representing resources and their relationships within cloud-native applications. Currently Radius relies on Kubernetes to install etcd, the default datastore, where these graph structures are stored as key value pairs. As we are adding support for nested connections the imperative Go code is becoming complex and brittle because it needs to implement basic graph traversal logic not present in etcd. Additionally we are already hitting performance limits in test environments (under a hundred resources were reported to slow things to a crawl) which inspired the work @superbeeny has done to swap the key value operations out to use postgres. Radius users could not use Drasi today to act on changes to their environments as there's no current support for key value stores and if that was added the client side filtering requirements would be a challenge in Drasi, requiring extensive middleware to parse the custom Radius data structures. 
-This proposal outlines a plan to modify Project Radius to utilize a **graph database** as the primary store for its **application graph data specifically**, accessed via a new **Graph Access Layer (GAL)**. This change aims to decouple the core graph logic and operations from etcd, enabling more powerful graph queries, improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. **This proposal is scoped specifically to application graph operations - individual resource metadata and configuration will continue to use the existing `database.Client` interface with the current storage backends (etcd via Kubernetes APIServer, or Postgres for production deployments).** One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. +This proposal outlines a plan to modify Project Radius to utilize **PostgreSQL with Apache AGE** as the primary store for its **application graph data specifically**, accessed via a new **Graph Access Layer (GAL)**. This change aims to decouple the core graph logic and operations from etcd, enabling more powerful graph queries, improving performance for complex relationship traversals, and offering a more specialized and efficient graph persistence layer. PostgreSQL with AGE will be deployed as a separate container in the Radius control plane, providing both traditional relational database capabilities and graph database functionality through Cypher queries. One of the benefits to recipe authors is this will allow them to define nested connections of types to both return complex relationships and properties to a recipe as well as see these relationships in the graph/dashboard. 
-The **Graph Access Layer (GAL)** will be designed with a pluggable backend architecture that supports any Cypher-compliant graph database, enabling flexible deployment scenarios and allowing users to choose the graph database that best fits their operational requirements and constraints. This approach provides the freedom to leverage different graph database technologies while maintaining a consistent abstraction layer for Radius components. +The **Graph Access Layer (GAL)** will provide a consistent abstraction layer for Radius components to interact with the PostgreSQL + AGE backend, handling both application graph operations and potentially serving as a replacement for etcd-based storage in future iterations. -**Storage Strategy Rationale:** This proposal focuses specifically on application graph data because: (1) Graph databases excel at relationship queries but may be overkill for simple key-value resource storage; (2) The existing `database.Client` interface and Postgres backend already provide excellent performance for individual resource operations; (3) This allows incremental adoption with lower risk; (4) Different data types (graph relationships vs. resource metadata) have different access patterns and requirements; (5) Production deployments can continue using battle-tested Postgres for resource storage while gaining graph capabilities for application relationship queries. +**Storage Strategy Rationale:** This proposal focuses on PostgreSQL with Apache AGE as a unified solution that can handle both application graph operations and serve as a potential replacement for etcd-based storage. 
PostgreSQL provides: (1) Proven production reliability and operational tooling; (2) Native support for both relational and graph data through Apache AGE; (3) Simplified deployment architecture with a single database technology; (4) Established backup, monitoring, and scaling patterns; (5) The flexibility to migrate away from etcd entirely in future phases while maintaining data consistency and operational simplicity. Delivering this will allow us to shift to a far better user experience where connections become a rich re-usable concept that shares data and exposes deep relationships currently obfuscated by monolithic types with embedded objects (ex: database has a credentials object with username and password properties). @@ -23,9 +23,8 @@ Delivering this will allow us to shift to a far better user experience where con * **Radius:** An open-source, cloud-native application platform that helps developers build, deploy, and manage applications across various environments. * **Application Graph:** A representation of an application's components (resources, services, environments, etc.) as nodes and their interconnections as edges, including metadata on both. * **Kubernetes (K8s):** An open-source system for automating deployment, scaling, and management of containerized applications. -* **Graph Access Layer (GAL):** The internal abstraction layer that mediates all graph operations between Radius components and the underlying graph database. -* **Kùzu:** An embeddable, transactional, high-performance graph database management system (GDBMS) supporting the Cypher query language and property graphs. Provided for dev/test/POC. [GitHub](https://github.com/kuzudb/kuzu) -* **Postgres with Apache AGE:** A production-ready, scalable graph database solution supporting Cypher queries, built on PostgreSQL. 
[Website](https://age.apache.org/) | [GitHub](https://github.com/apache/age) +* **Graph Access Layer (GAL):** The internal abstraction layer that mediates all graph operations between Radius components and the underlying PostgreSQL + Apache AGE database. +* **PostgreSQL with Apache AGE:** A production-ready, scalable graph database solution supporting Cypher queries, built on PostgreSQL. This will be deployed as a separate container in the Radius control plane. [Website](https://age.apache.org/) | [GitHub](https://github.com/apache/age) * **Node:** An entity in a graph (e.g., a Radius resource, an environment). * **Edge:** A relationship between two nodes in a graph (e.g., "connectsTo", "runsIn"). All edges in Radius will be of type "CONNECTION". * **Property:** Key-value pairs associated with nodes or edges, storing metadata. @@ -34,10 +33,11 @@ Delivering this will allow us to shift to a far better user experience where con ### Objectives -1. **Decouple Graph Storage:** Abstract the application graph storage from etcd, allowing Radius to use a dedicated graph database via the Graph Access Layer while maintaining existing resource storage through the current `database.Client` interface. +1. **Decouple Graph Storage:** Abstract the application graph storage from etcd, allowing Radius to use PostgreSQL with Apache AGE as a dedicated graph database via the Graph Access Layer. 2. **Enhance Query Capabilities:** Leverage Cypher query language for more complex and efficient graph traversals and relationship analysis than what is easily achievable with etcd retrieval and client side filtering. -3. **Improve Performance:** Improve the performance of graph read and write operations, especially for large or complex application graphs, while maintaining current performance for individual resource operations. -4. 
**Maintain Existing Functionality:** Ensure that all existing Radius features that rely on the application graph continue to function correctly with the new backend, and that all resource management operations continue unchanged. +3. **Improve Performance:** Improve the performance of graph read and write operations, especially for large or complex application graphs, using PostgreSQL's proven performance characteristics. +4. **Maintain Existing Functionality:** Ensure that all existing Radius features that rely on the application graph continue to function correctly with the new PostgreSQL + AGE backend. +5. **Enable Future etcd Migration:** Establish PostgreSQL + AGE as a foundation for potentially migrating away from etcd entirely in future development phases. ### Issue Reference: @@ -45,26 +45,24 @@ Delivering this will allow us to shift to a far better user experience where con ### Goals -* Implement a Graph Access Layer (GAL) that abstracts application graph operations with pluggable backend support, working alongside the existing `database.Client` interface for resource storage. -* Integrate Kùzu as an embedded graph database for development, testing, and proof-of-concept environments. -* Develop Postgres with Apache AGE plugin support for production environments. -* Define a clear schema for the Radius application graph within both backends. -* Migrate existing application graph data representation (currently derived from resource relationships in etcd/postgres) to the dedicated graph database data model. -* Update Radius components to use the new GAL for all application graph operations while continuing to use `database.Client` for individual resource storage. -* Provide migration tools for moving application graph data to the graph database. -* Provide mechanisms for backup and restore of the graph database as part of Radius install/upgrade/rollback operations. -* Develop a comprehensive test suite covering graph operations with both backends. 
-* Add configuration options to select between graph database providers. +* Implement a Graph Access Layer (GAL) that abstracts application graph operations with PostgreSQL + Apache AGE backend support. +* Deploy PostgreSQL with Apache AGE as a separate container in the Radius control plane architecture. +* Define a clear schema for the Radius application graph within PostgreSQL + AGE. +* Migrate existing application graph data representation (currently derived from resource relationships in etcd) to the PostgreSQL + AGE data model. +* Update Radius components to use the new GAL for all application graph operations. +* Provide migration tools for moving application graph data from etcd to PostgreSQL + AGE. +* Provide mechanisms for backup and restore of the PostgreSQL + AGE database as part of Radius install/upgrade/rollback operations. +* Develop a comprehensive test suite covering graph operations. * Ensure the GAL can reconstruct application graphs from existing resource data during migration. ### Non-goals -* Providing a distributed Kùzu cluster as part of this initial integration (Kùzu is primarily embedded; clustering would be a separate, future consideration if needed). +* Providing a distributed PostgreSQL cluster as part of this initial integration (PostgreSQL clustering and high availability would be a separate, future consideration if needed). * Exposing direct Cypher query capabilities to end-users of Radius (interaction should remain through Radius APIs and abstractions). -* Supporting zero-downtime migration from etcd to graph database (migration will require a maintenance window). -* Replacing the existing `database.Client` interface or resource storage mechanisms - individual resource metadata will continue to use the current storage backends (etcd via Kubernetes APIServer or Postgres). -* Migrating non-graph data (individual resource configurations, secrets, etc.) 
to the graph database - this proposal is specifically scoped to application graph relationships and traversal operations. +* Supporting zero-downtime migration from etcd to PostgreSQL + AGE (migration will require a maintenance window). +* Migrating all Radius storage to PostgreSQL + AGE in the initial phase - this proposal focuses specifically on application graph relationships and traversal operations. +* Moving existing key-value operations (individual resource CRUD) to the GAL - this proposal is scoped specifically to graph operations. In a future phase, we would migrate key-value operations to PostgreSQL and rename the GAL to DAL (Data Access Layer) to reflect its broader responsibility for all Radius data operations. ### User Scenarios (optional) @@ -141,7 +139,7 @@ Delivering this will allow us to shift to a far better user experience where con ] ``` -* **CoreRP then retrieves full secret data via database.Client and returns to user:** +* **CoreRP then retrieves full secret data and returns to user:** ```json { "gateway": "api-gateway", @@ -173,15 +171,13 @@ Delivering this will allow us to shift to a far better user experience where con #### High-Level Design -1. **Introduce Graph Access Layer (GAL):** The GAL will be integrated as an internal service within Radius components responsible for managing the application graph, working alongside the existing `database.Client` interface for resource storage. -2. **Dual Storage Architecture:** Resources will continue to be stored using the current `database.Client` interface (etcd via Kubernetes APIServer, or Postgres), while application graph relationships will be managed by the GAL with graph database backends. -3. **Pluggable Backend Support:** The GAL will support both Kùzu (for dev/test/POC) and Postgres with Apache AGE (for production), with configuration to select the backend. -4. 
**Schema Definition:** A formal schema for application graph entities (Applications, Environments, Resources as nodes) and their relationships (as edges with types and properties) will be defined and enforced in both graph backends. -5. **Minimal Graph Data Storage:** The GAL will store only essential resource metadata in the graph database (resource ID, type, connections, and optional properties needed for query filtering), with full resource data retrieval continuing through the existing `database.Client` interface based on graph query results. -6. **Data Synchronization:** The GAL will maintain synchronization between resource changes (via `database.Client`) and their corresponding minimal graph representations, ensuring the application graph accurately reflects the current state of resource relationships and queryable properties. -7. **Component Updates:** Radius components that currently perform application graph operations (traversals, relationship queries) will be updated to use the GAL for graph queries, then retrieve full resource data via `database.Client` based on the graph results, while continuing to use `database.Client` directly for individual resource CRUD operations. +1. **Introduce Graph Access Layer (GAL):** The GAL will be integrated as an internal service within Radius components responsible for managing the application graph, initially working alongside existing storage mechanisms with a path toward full database consolidation. +2. **PostgreSQL + AGE Container:** PostgreSQL with Apache AGE will be deployed as a separate container in the Radius control plane, providing both traditional database capabilities and graph functionality through Cypher queries. +3. **Unified Database Architecture:** This approach establishes PostgreSQL as the foundation for both graph operations and potential future migration of all Radius data storage, simplifying the overall architecture. +4. 
**Schema Definition:** A formal schema for application graph entities (Applications, Environments, Resources as nodes) and their relationships (as edges with types and properties) will be defined and enforced in PostgreSQL + AGE. +5. **Component Updates:** Radius components that currently perform application graph operations (traversals, relationship queries) will be updated to use the GAL for graph queries while maintaining existing functionality through current storage mechanisms during the transition period. -**Storage Strategy Rationale:** We chose to maintain the existing `database.Client` interface for resource storage because: (1) The current Postgres and etcd backends already provide excellent performance for individual resource operations; (2) Graph databases are optimized for relationship queries, not necessarily single-record lookups; (3) This approach allows incremental migration with lower risk; (4) Different data types (individual resources vs. relationships) have different access patterns and requirements; (5) Production deployments can continue using battle-tested storage patterns while gaining graph capabilities for application relationship queries. **The GAL will store only minimal resource metadata (ID, type, connections, and optional query-filtering properties) in the graph database, with complete resource data retrieval continuing through the existing `database.Client` based on graph query results.** +**Architecture Rationale:** PostgreSQL with Apache AGE provides a strategic foundation that can serve both immediate graph database needs and future storage consolidation. 
Unlike embedded solutions, a containerized PostgreSQL deployment offers: (1) Production-ready operational patterns familiar to most teams; (2) Native support for both relational and graph data models; (3) Established ecosystem of monitoring, backup, and scaling tools; (4) The flexibility to migrate additional Radius storage needs to the same technology stack over time, reducing operational complexity. **Note: This proposal is scoped specifically to graph operations via the GAL. Future work would involve migrating key-value operations from etcd to PostgreSQL and renaming the GAL to DAL (Data Access Layer) to reflect its expanded role as the unified data access interface for all Radius storage operations.** #### Architecture Diagram @@ -189,48 +185,63 @@ Delivering this will allow us to shift to a far better user experience where con graph TD A["Radius CLI/API
(User Interactions)"]
 B["Radius CoreRP<br/>"]
- C["database.Client<br/>(Resource Storage)"]
- D["Graph Access Layer<br/>(Application Graph)"]
- E1["etcd (Current) | Postgres<br/>(Production)"]
- F["Cypher-compliant Graph DB<br/>(Kùzu, Postgres+AGE, Neo4j, etc.)"]
+ C["DatastoresRP"]
+ D["DaprRP"]
+ E["MessagingRP"]
+ F["DynamicRP"]
+ G["Graph Access Layer<br/>(Application Graph)"]
+ H["PostgreSQL + Apache AGE<br/>(Container)"]
+ I["etcd (Current)
(All Resource Operations)"] A <--> B - B <-->|"resource CRUD"| C - B <-->|"graph queries"| D - C --> E1 - D --> F + A <--> C + A <--> D + A <--> E + A <--> F + + B <-->|"graph queries"| G + B <-->|"resource operations"| I + C <-->|"resource operations"| I + D <-->|"resource operations"| I + E <-->|"resource operations"| I + F <-->|"resource operations"| I + + G --> H ``` -* **Current (Simplified):** Radius Core Components <-> database.Client <-> etcd/Postgres +* **Current (Simplified):** All Radius RPs independently connect to etcd directly. * **Proposed:** - * Resource Storage: Radius Core Components <-> database.Client <-> etcd/Postgres (unchanged) - * Application Graph: Radius Core Components <-> Graph Access Layer <-> Cypher-compliant Graph Database + * Graph Operations: CoreRP <-> Graph Access Layer <-> PostgreSQL + Apache AGE Container directly. + * Resource Operations: All RPs (CoreRP, DatastoresRP, DaprRP, MessagingRP, DynamicRP) <-> etcd independently + * **Key Insight:** Only CoreRP performs graph operations; other RPs continue using etcd for their resource management #### Detailed Design 1. **Graph Access Layer (GAL) Implementation:** - * The GAL will be implemented as a Go service that abstracts all application graph operations, working alongside the existing `database.Client` interface. - * It will provide a pluggable interface allowing different graph database backends. - * Configuration will determine which backend to use (Kùzu for dev/test, Postgres+AGE for production). - * The GAL will be responsible for maintaining synchronization between resource changes and their graph representations. - -2. **Dev/Test/POC Backend - Kùzu Integration:** - * The Kùzu Go driver (`github.com/kuzudb/go-kuzu`) will be used. - * Kùzu database will be initialized during `rad install`. The database file (`radius_app_graph.kuzu`) will be stored on persistent storage accessible to the Radius control plane. 
- * **Performance Advantage:** As an embedded graph database, Kùzu eliminates network connection overhead between the GAL and the graph database, providing significantly faster performance for development and test scenarios compared to networked database solutions. This enables rapid iteration during development and faster test suite execution. - * The GAL will populate the graph database by analyzing existing resource relationships stored via `database.Client`. - * All application graph queries will use the GAL, while individual resource operations continue through `database.Client`. - -3. **Production Backend - Postgres with Apache AGE:** - * Postgres with Apache AGE plugin will be supported for production deployments. + * The GAL will be implemented as a Go service that abstracts all application graph operations, initially working alongside existing storage mechanisms. + * It will provide a consistent interface for PostgreSQL + Apache AGE backend operations. + * The GAL will handle all database connections, schema management, and query optimization for the PostgreSQL + AGE backend. + +2. **PostgreSQL + Apache AGE Container Integration:** + * PostgreSQL with Apache AGE will be deployed as a separate container in the Radius control plane. + * **AGE Installation Options:** + * **Init Container Approach (Recommended)**: Use an init container to compile and install Apache AGE during pod startup, ensuring version compatibility and reducing image maintenance overhead. + * **Custom Image Approach**: Maintain a Radius-specific PostgreSQL+AGE Docker image with pre-installed AGE extension, providing faster startup but requiring image maintenance. + * **Runtime Installation**: Install AGE via package manager (apt/apk) after container starts, offering flexibility but with startup time overhead. + * Standard PostgreSQL connection patterns will be used with AGE-specific Cypher query capabilities. 
+ * Container configuration will include persistent volume mounting for data durability. + * The GAL will establish connection pools and handle database lifecycle management. + +3. **Production Deployment Architecture:** * Network-based connection using standard PostgreSQL drivers with AGE extensions. - * Support for connection pooling, high availability, and scaling. (Can be a later phase) + * Support for connection pooling, high availability, and scaling patterns. + * Integration with existing Kubernetes deployment patterns and service discovery. + * Backup and restore capabilities through standard PostgreSQL tooling. 4. **Schema Management:** * A Go module will define constants for node labels (e.g., `NodeTypeApplication`, `NodeTypeResource`) and edge labels (all edges will be type `CONNECTION`). - * On startup, the GAL will ensure the schema (node tables, relationship tables, property definitions) exists in the backend, creating or migrating it if necessary. - * **Minimal Data Storage:** Graph nodes will contain only essential resource metadata (ID, type, and optional properties needed for query filtering), with complete resource data remaining in the existing `database.Client` storage. API responses will use graph queries to identify relevant resources, then retrieve full resource data via `database.Client`. - * The GAL will handle synchronization between resource updates (via `database.Client`) and their corresponding minimal graph representations. + * On startup, the GAL will ensure the schema (node tables, relationship tables, property definitions) exists in PostgreSQL + AGE, creating or migrating it if necessary. + * The GAL will handle data synchronization between any remaining etcd operations and the PostgreSQL + AGE database during the transition period. 5. **Graph Access Layer (GAL) API:** * Example Go interface: @@ -269,42 +280,46 @@ graph TD ``` 4. 
**Data Persistence and State:** - * **Kùzu (Dev/Test/POC):** Kùzu runs embedded, so the Radius process managing it is authoritative. The database file (`radius_app_graph.kuzu`) is stored on persistent storage accessible to the Radius control plane. - * **Postgres+AGE (Production):** Network-based connection with standard PostgreSQL high availability, clustering, and backup mechanisms. + * **PostgreSQL + Apache AGE Container:** Network-based connection with standard PostgreSQL high availability, clustering, and backup mechanisms. + * Container deployment includes persistent volume configuration for data durability. + * Standard Kubernetes patterns for service discovery, health checks, and restart policies. 5. **Graph Database Recovery and Regeneration:** - * **Resilient Design:** Since the graph database stores only application relationship data (not the authoritative resource records), it can be safely rebuilt at any time from the canonical resource data stored via `database.Client`. - * **Migration Tool as Recovery Tool:** The same migration tool used for initial etcd-to-graph migration can be executed at any time to regenerate the complete application graph from existing resource relationships stored in etcd/Postgres. - * **Zero Data Loss Scenario:** If the graph database is lost or corrupted: - 1. Radius continues operating for individual resource operations via `database.Client` - 2. Application graph queries will fail gracefully with appropriate error messages - 3. The migration/rebuild tool can be executed to recreate the graph database from scratch - 4. All historical relationship data is preserved because resources maintain connection metadata in their stored definitions - * **Operational Benefits:** This design eliminates the need for complex graph database backup strategies since the authoritative data remains in the proven `database.Client` storage layer. Graph database backups become a performance optimization rather than a data protection requirement. 
- * **Recovery Commands:** + * **Resilient Design:** The PostgreSQL + AGE container provides proven database reliability and recovery mechanisms through standard PostgreSQL tooling. + * **Backup and Restore Responsibility:** DBAs are responsible for PostgreSQL backup and restore operations using standard tooling (pg_dump, continuous archiving, etc.). Radius coordinates with these procedures during upgrades and rollbacks. + * **Radius Upgrade Integration:** Radius upgrade processes (currently under development) will be enhanced to trigger backup checkpoints and rollback during failure scenarios. + * **Migration Tool as Recovery Tool:** Migration tools can be executed to regenerate or verify graph data consistency. + * **Operational Benefits:** Leverages existing PostgreSQL operational expertise and tooling rather than introducing novel backup strategies. * **Recovery Commands:** ```bash - # Detect and rebuild corrupted/missing graph database - rad admin graph rebuild --from-resources + # DBA-managed PostgreSQL backup/restore (standard operations) + # DBAs use standard PostgreSQL tooling: pg_dump, pg_restore, continuous archiving - # Verify graph database integrity against resource storage + # Radius upgrade/rollback coordination + # Integration with upgrade commands currently under development to: + # - Trigger backup checkpoints before upgrades + # - Trigger restore procedures during rollbacks + + # Graph-specific verification and rebuild (Radius-managed) rad admin graph verify --repair-if-needed + rad admin graph rebuild --from-etcd ``` 6. **Transaction Management:** * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a database transaction to ensure atomicity. The GAL will manage this. * Radius upgrades and rollbacks would need to coordinate with the GAL. 
-#### Advantages (of Graph Database via GAL for application graph operations) +#### Advantages (of PostgreSQL + AGE for application graph operations) * **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering etcd values client side for application graph traversals. -* **Performance:** For complex graph traversals (multi-hop queries, pathfinding), a graph database is likely to be much faster as it's optimized for such operations. For Radius this would be during most recipe execution as the entire graph is rendered. -* **Specialized Data Store:** Graph databases are purpose-built for graph data, leading to efficient storage and indexing for application graph structures while allowing the existing `database.Client` to continue handling individual resource storage efficiently. -* **Pluggable Backend:** The GAL will allow for Radius users to use any Cypher compatible graph database such as Neo4J or CosmosDB with Gremlin for application graph operations. -* **Transactional Guarantees:** Both Kùzu and Postgres with Apache AGE provide ACID transactions for graph operations, ensuring application graph consistency during complex deployments. -* **Schema Enforcement:** Better ability to define and enforce an application graph schema separate from individual resource schemas. -* **Support for streaming monitoring of graph changes:** A project like Drasi can consume a change feed of the Radius application graph because graph databases can provide relationship-aware change streams, eliminating the need for extensive middleware to parse custom Radius data structures. +* **Performance:** For complex graph traversals (multi-hop queries, pathfinding), PostgreSQL + AGE is optimized for such operations and likely to be much faster than current etcd-based approaches. +* **Production Ready:** PostgreSQL is a mature, battle-tested database with extensive operational tooling, monitoring, and expertise available. 
+* **Unified Technology Stack:** Using PostgreSQL + AGE establishes a foundation for potentially consolidating all Radius storage needs, reducing operational complexity over time. +* **Container Architecture:** Deploying as a separate container provides clear separation of concerns while enabling standard Kubernetes deployment patterns. +* **Transactional Guarantees:** PostgreSQL provides ACID transactions for both relational and graph operations, ensuring data consistency during complex deployments. +* **Schema Enforcement:** Better ability to define and enforce an application graph schema with PostgreSQL's robust schema management capabilities. +* **Operational Familiarity:** Most operations teams already have PostgreSQL expertise, reducing the learning curve compared to specialized graph databases. * **Simplified Go Code:** Eliminates the complex imperative Go code currently required for creating and traversing application graph relationships, replacing it with declarative Cypher queries that are more maintainable and less error-prone. -* **Separation of Concerns:** Application graph operations and individual resource storage can be optimized independently, with each using the most appropriate storage technology. +* **Future Migration Path:** Establishes PostgreSQL as a foundation for potentially migrating away from etcd entirely, simplifying the overall Radius architecture. --- @@ -339,19 +354,19 @@ In this model: This structure enables richer queries and recipe author use cases like `context.connected_resources.database.credentials.username` instead of only being able to access the embedded `credentials` object and requiring the recipe author to parse. Additionally it provides a better separation of concerns, and a more flexible, maintainable application graph. 
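To illustrate the access pattern described above, here is a minimal sketch of how a dotted path such as `database.credentials.username` could be resolved by walking connections between nodes instead of parsing an embedded object. The `Node` type, the `Resolve` function, and the sample values are hypothetical; the proposal does not specify how the recipe context is actually assembled, and a real implementation would presumably issue a Cypher query through the GAL rather than walk an in-memory structure.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical in-memory stand-in for a graph node.
type Node struct {
	Properties map[string]string
	// Connected is keyed by connection name; every edge is a CONNECTION.
	Connected map[string]*Node
}

// Resolve follows connections for every path segment except the last,
// then reads the last segment as a property of the node it arrived at.
func Resolve(start *Node, path string) (string, error) {
	segments := strings.Split(path, ".")
	current := start
	for _, name := range segments[:len(segments)-1] {
		next, ok := current.Connected[name]
		if !ok {
			return "", fmt.Errorf("no connection named %q", name)
		}
		current = next
	}
	key := segments[len(segments)-1]
	value, ok := current.Properties[key]
	if !ok {
		return "", fmt.Errorf("no property named %q", key)
	}
	return value, nil
}

func main() {
	// Illustrative data only: a container connected to a database,
	// which is connected to a separate credentials node.
	credentials := &Node{Properties: map[string]string{"username": "app_user"}}
	database := &Node{Connected: map[string]*Node{"credentials": credentials}}
	container := &Node{Connected: map[string]*Node{"database": database}}

	username, err := Resolve(container, "database.credentials.username")
	fmt.Println(username, err)
}
```

Because credentials are a node of their own rather than a nested object, the same traversal works for any resource that connects to them, which is the reuse the paragraph above is pointing at.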
-#### Disadvantages (of Graph Database integration) +#### Disadvantages (of PostgreSQL + AGE integration) -* **New Dependency:** Introduces a graph database as a new core dependency for Radius, including its Go driver(s). +* **Enhanced Container Dependency:** Builds upon the existing optional PostgreSQL container by adding the Apache AGE plugin, requiring AGE plugin maintenance, version compatibility testing, and inclusion in our CI/CD pipeline to ensure plugin currency and compatibility with PostgreSQL updates. * **Operational Overhead:** - * Managing the graph database file or instance (backups, storage). - * Monitoring performance and health. - * Requires persistent volume or managed database for production usage. -* **Complexity:** Adds a new layer (GAL, graph database integration) to the Radius architecture. -* **Learning Curve:** Radius developers might need to learn Cypher and the specifics of the chosen graph database(s). + * AGE plugin-specific monitoring and health checks (PostgreSQL operations are already supported). + * Learning Cypher query language and AGE-specific graph operations. + * Plugin version management and compatibility testing with PostgreSQL updates. +* **Complexity:** Adds a new layer (GAL, Apache AGE plugin functionality) to the existing PostgreSQL architecture. +* **Learning Curve:** Radius developers will need to learn Cypher and Apache AGE specifics, though the underlying PostgreSQL knowledge and operational patterns are already established. #### Proposed Option -Integrate a **Graph Database** behind a **Graph Access Layer (GAL)** with pluggable backend support. Initially provide **Kùzu as an embedded graph database** for dev/test/POC environments and **Postgres with Apache AGE** for production environments. This approach balances the benefits of dedicated graph DB capabilities with flexibility in deployment scenarios. +Integrate **PostgreSQL with Apache AGE** as a containerized graph database behind a **Graph Access Layer (GAL)**. 
This approach provides production-ready graph database capabilities while establishing a foundation for potential future consolidation of all Radius storage needs, reducing long-term architectural complexity. ### API design @@ -368,10 +383,9 @@ No changes to the public Radius REST API are anticipated initially, other than p #### Core RP (Resource Provider) -* Core RP will continue to use the existing `database.Client` interface for all individual resource storage, retrieval, and full CRUD operations. -* Core RP will use the GAL to perform application graph queries (finding connected resources, traversing relationships), then retrieve complete resource data via `database.Client` based on the resource IDs returned from graph queries. -* The GAL will be responsible for maintaining synchronization between resource changes (via `database.Client`) and their corresponding minimal representations in the application graph (ID, type, connections, and essential query properties only). -* **API Response Pattern:** Graph queries identify relevant resources → Full resource data retrieved via `database.Client` → Complete API responses assembled from full resource data. +* Core RP will use the GAL for all application graph operations and queries. +* The GAL will serve as the primary interface for both graph operations and transitional resource storage during migration periods. +* **API Response Pattern:** Graph queries identify and retrieve relevant resources → Complete API responses assembled from PostgreSQL + AGE data. ### Error Handling @@ -383,51 +397,50 @@ No changes to the public Radius REST API are anticipated initially, other than p ### Test plan 1. **Unit Tests:** - * Test individual functions within the Graph Access Layer (mocking graph database drivers). - * Test schema creation and migration logic for both backends. + * Test individual functions within the Graph Access Layer (mocking PostgreSQL + AGE database drivers). 
+ * Test schema creation and migration logic for PostgreSQL + AGE backend. 2. **Integration Tests:** - * Test the GAL against actual backend instances (both Kùzu and Postgres with Apache AGE). + * Test the GAL against actual PostgreSQL + Apache AGE backend instances. * Verify CRUD operations for nodes and edges with various property types. - * Test transactional behavior for both backends. * Test Core RP interacting with the GAL-backed graph stores. - * Verify that graph queries return correct resource IDs and that subsequent `database.Client` retrievals return complete resource data. + * Test transactional behavior for PostgreSQL + AGE backend. + * Test Core RP interacting with the GAL-backed graph stores. + * Verify that graph queries return correct resource IDs and that subsequent data retrieval returns complete resource data. 3. **End-to-End (E2E) Tests:** - * Adapt existing Radius E2E tests to ensure all application deployment and management scenarios function correctly with both graph backends. - * Verify that resource operations via `database.Client` continue to work unchanged. - * Test that application graph operations via GAL work correctly alongside resource operations. - * Verify that API responses contain complete resource data assembled from graph queries + `database.Client` retrieval. - * Include tests for data synchronization between resource storage and minimal graph representation. + * Adapt existing Radius E2E tests to ensure all application deployment and management scenarios function correctly with PostgreSQL + AGE backend. + * Test that application graph operations via GAL work correctly with the containerized PostgreSQL + AGE deployment. + * Include tests for container lifecycle management, health checks, and recovery scenarios. * Include tests for data persistence across Radius restarts and upgrades/rollbacks. 4. 
**Performance Tests:** - * Benchmark graph read/write operations with both backends against the current key/value based implementation for representative workloads. + * Benchmark graph read/write operations with PostgreSQL + AGE against the current key/value based implementation for representative workloads. * Validate performance claims from the advantages section, specifically: * Complex graph traversal performance compared to etcd + client-side filtering * Recipe execution performance when rendering the entire graph * Query performance for large application graphs (100+ resources) - * Test concurrent access to the graph for both backends. + * Test concurrent access to the graph database. * Add checks to LRT Cluster for graph operations. -5. **Backup/Restore Tests:** - * Verify that database backups can be successfully created and restored for both backends. -6. **Backend Compatibility Tests:** - * Ensure identical behavior and results across Kùzu and Postgres with Apache AGE backends. - * Test configuration switching between backends. +5. **Container Integration Tests:** + * Verify PostgreSQL + AGE container deployment and configuration in Kubernetes environments. + * Test container networking, service discovery, and connection pooling. + * Validate persistent volume mounting and data durability. ### Security * **Data at Rest:** - * **Kùzu:** The database file (`radius_app_graph.kuzu`) contains the application graph data. It should be protected by appropriate file system permissions on the persistent volume where it's stored. - * **Postgres with Apache AGE:** Standard PostgreSQL security practices apply, including encryption at rest, access controls, and network security. - * Encryption at rest for storage should be considered, managed by the underlying infrastructure (e.g., Kubernetes PV encryption). -* **Access Control:** Access to the graph database is through the GAL within the Radius process. 
Standard Radius authentication and authorization mechanisms (when implemented) will protect the Radius APIs that indirectly interact with the graph database. There is no direct network exposure of Kùzu in the embedded model. PostgreSQL with Apache AGE will use a standard networked database access model. -* **Input Sanitization:** If any user-provided data is used to construct Cypher queries (even if parameterized), ensure proper parameterization is always used by the GAL to prevent injection vulnerabilities. -* **Threat Model:** The Radius threat model must be updated to have a section for the GAL. + * **PostgreSQL with Apache AGE:** Standard PostgreSQL security practices apply, including encryption at rest, access controls, and network security. + * Container security follows standard Kubernetes patterns with appropriate security contexts and network policies. + * Encryption at rest for persistent volumes should be configured according to deployment requirements. +* **Access Control:** Access to the PostgreSQL + AGE database is through the GAL within the Radius process and standard PostgreSQL network access controls. Standard Radius authentication and authorization mechanisms (when implemented) will protect the Radius APIs that interact with the graph database. +* **Input Sanitization:** All user-provided data used to construct Cypher queries will use proper parameterization to prevent injection vulnerabilities. +* **Network Security:** Container-to-container communication will use Kubernetes network policies and service mesh patterns as appropriate. +* **Threat Model:** The Radius threat model must be updated to include the PostgreSQL + AGE container and GAL components. -### Compatibility (optional) +### Compatibility * **Backward Compatibility:** - * For existing Radius deployments using etcd as the graph store, a migration path will be necessary. 
+ * For existing Radius deployments using etcd, a migration path will be necessary to move application graph data to PostgreSQL + AGE. * The public Radius API and CLI should remain backward compatible. * **Data Format:** The structure of the application graph (apps, resources, properties) should remain conceptually the same, even though the storage backend changes. -* **Backend Compatibility:** The GAL ensures that both Kùzu and Postgres with Apache AGE backends provide identical functionality and behavior to Radius components. +* **Container Compatibility:** The PostgreSQL + AGE container deployment ensures consistent behavior across different Kubernetes environments and cloud providers. ### Monitoring and Logging @@ -435,60 +448,68 @@ No changes to the public Radius REST API are anticipated initially, other than p * The Graph Access Layer should log all significant operations (e.g., graph queries, errors, transaction boundaries) at appropriate log levels. * **Metrics:** * Expose metrics from the GAL in Radius OpenTelemetry: - * Number of graph queries (per type: read/write, per backend). - * Latency of graph queries (per backend). - * Error rates for graph operations (per backend). - * Database size and health metrics. - * Transaction commit/rollback counts (per backend). + * Number of graph queries (per type: read/write). + * Latency of graph queries. + * Error rates for graph operations. + * PostgreSQL connection pool metrics. + * Container health and resource utilization metrics. + * Transaction commit/rollback counts. ### Development plan -0. **Phase 0: GAL (Milestone 0)** - * Create the GAL with pluggable backend interface. +0. **Phase 0: GAL Foundation (Milestone 0)** + * Create the GAL with PostgreSQL + Apache AGE backend interface. * Implement CRUD endpoints representing Radius abstraction level graph operations. * Develop initial unit & integration tests for the GAL. 
- * Ensure via debug logging that no components are communicating directly with etcd other than the GAL. -1. **Phase 1: Kùzu Integration (Milestone 1)** - * Phase 1 will be developed while the etcd code is still in use; a configuration flag will select the graph backend so we can ship smaller change sets and users can test if desired. - * Set up Kùzu as an embedded dependency with pluggable architecture. - * Define and implement Kùzu schema creation. - * Implement robust error handling and transaction management in the GAL for the Kùzu backend. - * Add Kùzu-specific tests (schema creation, backup/restore, etc.). - * Modify Radius init and upgrade processes to trigger appropriate behavior in the GAL. - * Write an idempotent migration tool for etcd => graph DB. -2. **Phase 2: Postgres with Apache AGE Integration (Milestone 2)** - * Research Postgres with Apache AGE capabilities and integration requirements. - * Implement Postgres with Apache AGE backend for the GAL. - * Ensure feature parity between Kùzu and Postgres with Apache AGE backends. - * Add comprehensive tests for Postgres with Apache AGE backend. - * Implement configuration options for backend selection. + * Create PostgreSQL + AGE container configuration and deployment manifests. +1. **Phase 1: Container Integration (Milestone 1)** + * Set up PostgreSQL + Apache AGE as a containerized dependency. + * Define and implement schema creation and migration logic. + * Implement robust error handling and transaction management in the GAL. + * Add container-specific tests (deployment, networking, persistence, etc.). + * Modify Radius init and upgrade processes to deploy and manage the PostgreSQL + AGE container. +2. **Phase 2: Migration Tooling (Milestone 2)** + * Write an idempotent migration tool for etcd => PostgreSQL + AGE. + * Implement data validation and consistency checking tools.
+ * Integrate backup checkpoint coordination with upgrade commands currently under development (DBAs maintain responsibility for actual PostgreSQL backup/restore operations outside an upgrade). + * Develop rollback procedures for failed migrations. 3. **Phase 3: Testing & Documentation (Milestone 3)** - * Implement backup/restore CLI commands for both backends. * Conduct comprehensive E2E testing, performance testing, and security review. + * Implement container orchestration and operational procedures. * Develop documentation for operators and developers. + * Create operational runbooks for common scenarios. 4. **Phase 4: Query Enhancement (Milestone 4 - optional)** * Enhance GAL with more advanced query capabilities (pathfinding, complex traversals, to support new User Stories defined by product). + * Evaluate opportunities for migrating additional Radius storage needs to PostgreSQL. +5. **Phase 5: Full Storage Migration (Future - out of scope)** + * Migrate existing key-value operations from etcd to PostgreSQL. + * Rename GAL to DAL (Data Access Layer) to reflect unified data access responsibilities. + * Deprecate etcd dependency entirely. + * Implement unified backup, monitoring, and operational procedures for all Radius data. ### Open Questions 1. **Schema Evolution:** How will schema changes (e.g., adding new node/edge types, new properties) be managed over time with Radius upgrades? This will be critical for the GAL to handle gracefully. -2. **Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of each backend for representative Radius graph sizes? +2. **Resource Footprint:** What is the typical CPU, memory, and disk I/O footprint of the PostgreSQL + AGE container for representative Radius graph sizes? 3. 
**Dashboard:** The changes proposed here, such as nested types and expanded use of connections, will make the app graph both richer and larger; the existing dashboard will probably need UX design work to leverage that effectively and intuitively. +4. **etcd Deprecation Strategy:** What are the long-term benefits and challenges of completely deprecating etcd usage in Radius in favor of PostgreSQL + AGE? This could include: * **Benefits:** Unified storage architecture, reduced operational complexity, single database technology to manage, consistent backup/restore procedures, simplified monitoring and alerting + * **Challenges:** Migration complexity for existing deployments, potential performance implications for key-value workloads, increased PostgreSQL container resource requirements, dependency on PostgreSQL expertise rather than Kubernetes-native etcd * **Timeline:** Should etcd deprecation be a stated goal of this project, or evaluated as a separate future initiative based on the success of graph operations? + * **Migration Timing Advantage:** Implementing etcd deprecation now would require relatively few users to use the migration tool, as Radius is still in early adoption. Once Radius gains more production users, the operational burden of supporting migrations will grow significantly, making early migration more strategically advantageous for both the project and users. ### Alternatives considered 1. **Continue using etcd:** * **Advantages:** Leverages the existing Kubernetes-provided etcd installation and expertise. No new database dependency. * **Disadvantages:** Limited query capabilities, known performance bottlenecks for sizeable application graphs, nested rendering logic that is very manual and complex, tightly coupled to key-value stores. -2. **Other Embedded Graph Databases (e.g., a Go-native one if a mature one exists):** - * **Advantages:** Could offer tighter integration if fully Go-native.
- * **Disadvantages:** Kùzu is chosen for dev/test for its performance, Cypher support, and active development. A pure Go alternative might lack some of these mature features or performance characteristics. -3. **Hosted/Server-based Graph Databases (e.g., Neo4j, Dgraph as a service, NebulaGraph):** +2. **Embedded Graph Databases (e.g., Kùzu, DuckDB with graph extensions):** + * **Advantages:** Could offer lower latency by eliminating network calls, simpler deployment with no separate container. + * **Disadvantages:** Limited operational tooling compared to PostgreSQL, less familiar to most operations teams, harder to scale or make highly available, complicates backup and monitoring procedures. +3. **Hosted/Server-based Graph Databases (e.g., Neo4j, Amazon Neptune):** * **Advantages:** Mature, feature-rich, often provide built-in clustering and HA. - * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), network latency between Radius and the DB, cost, and deviates from the goal of a more self-contained/embeddable solution for core graph logic. This proposal prioritizes decoupling and enhancing capabilities with a pluggable solution first. -4. **Cayley Graph Database:** - * **Advantages:** Open-source graph database with support for multiple query languages and storage backends. - * **Disadvantages:** Not designed as an embedded-first solution, requiring additional configuration and operational overhead compared to Kùzu. Would require setting up and managing a separate service instance, network connections, and handling deployment complexity that diverges from our goal of a streamlined, embeddable solution for development and testing environments. + * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), vendor lock-in for managed services, additional costs, and deviates from the goal of a self-contained solution. +4. 
**Other PostgreSQL Extensions (e.g., PostgREST with custom graph logic):** + * **Advantages:** Uses familiar PostgreSQL but with different graph capabilities. + * **Disadvantages:** Apache AGE provides native Cypher support and is specifically designed for graph workloads, offering better performance and more comprehensive graph features than custom solutions. ### Full Storage Migration Effort Evaluation @@ -508,149 +529,86 @@ During the design phase, we evaluated the effort required to move ALL Radius sto **Conclusion**: The scoped approach (application graph only) provides the core benefits of graph database technology (enhanced relationship querying, better performance for graph traversals, support for complex connections) while maintaining the proven, optimized storage patterns for individual resource operations. This delivers significant value with much lower risk and development investment. Once Radius Extensibility has shipped, many of the existing RPs will be deprecated in favor of Core Types in DynamicRP, making a transition to a single data store viable should we decide to simplify the architecture. -### Graph Database Selection Methodology - -**Based on established Radius configuration patterns, the following two-tier approach provides a consistent user experience for selecting which graph database to use for a Radius installation:** +### PostgreSQL + AGE Configuration -#### Tier 1: Interactive Installation Configuration +**Container Deployment Configuration:** -**Command:** `rad init --full` - -Users are presented with an interactive menu during the full initialization process to select their preferred graph database backend: - -``` -? 
Select graph database for application graph operations: - > Kùzu (Embedded - recommended for development/testing) - PostgreSQL with Apache AGE (Network-based - recommended for production) - Custom Cypher-compatible database (advanced configuration) -``` +The PostgreSQL + Apache AGE container will be deployed as part of the Radius control plane with the following configuration approach: -**Implementation Details:** -- Follows established pattern from `rad init --full` for AWS IRSA vs access keys, Azure Service Principal vs Workload Identity -- Default selection: Kùzu for simplicity and zero external dependencies -- Stores selection in Radius configuration for subsequent `rad install` operations -- Advanced option allows users to specify custom connection strings for other Cypher-compatible databases +#### Helm Chart Configuration -#### Tier 2: Non-Interactive Installation Parameters - -**Command:** `rad install kubernetes --set global.graphDatabase.*` - -For automated deployments and GitOps scenarios, users can specify graph database configuration via Helm chart parameters: - -**Kùzu Configuration (Default):** +**PostgreSQL + AGE Container Configuration:** ```bash -rad install kubernetes --set global.graphDatabase.type=kuzu \ - --set global.graphDatabase.kuzu.persistentVolume.enabled=true \ - --set global.graphDatabase.kuzu.persistentVolume.size=10Gi \ - --set global.graphDatabase.kuzu.persistentVolume.storageClass=fast-ssd -``` - -**PostgreSQL with Apache AGE Configuration:** -```bash -rad install kubernetes --set global.graphDatabase.type=postgresql-age \ - --set global.graphDatabase.postgresql.host=postgres.example.com \ - --set global.graphDatabase.postgresql.port=5432 \ - --set global.graphDatabase.postgresql.database=radius_graph \ - --set global.graphDatabase.postgresql.username=radius_user \ - --set global.graphDatabase.postgresql.passwordSecretName=postgres-credentials \ - --set global.graphDatabase.postgresql.sslMode=require -``` - -**Custom Cypher 
Database Configuration:** -```bash -rad install kubernetes --set global.graphDatabase.type=custom \ - --set global.graphDatabase.custom.connectionString="bolt://neo4j.example.com:7687" \ - --set global.graphDatabase.custom.credentialsSecretName=neo4j-credentials \ - --set global.graphDatabase.custom.dialect=neo4j +rad install kubernetes --set global.database.type=postgresql-age \ + --set global.database.postgresql.persistence.enabled=true \ + --set global.database.postgresql.persistence.size=20Gi \ + --set global.database.postgresql.persistence.storageClass=fast-ssd \ + --set global.database.postgresql.resources.requests.cpu=500m \ + --set global.database.postgresql.resources.requests.memory=1Gi \ + --set global.database.postgresql.resources.limits.cpu=2 \ + --set global.database.postgresql.resources.limits.memory=4Gi ``` #### Configuration Schema -The GAL will support the following configuration structure in Helm values: - ```yaml global: - graphDatabase: - type: kuzu # kuzu | postgresql-age | custom - # Kùzu-specific configuration - kuzu: - persistentVolume: + database: + type: postgresql-age + postgresql: # Container configuration + image: + repository: postgres + tag: "15-alpine" + pullPolicy: IfNotPresent + + # AGE extension configuration + age: + enabled: true + version: "1.5.0" + installMethod: "init-container" # Options: init-container, custom-image, runtime-install + # When installMethod is "init-container", AGE will be compiled and installed during container startup + # When installMethod is "custom-image", we maintain a Radius-specific PostgreSQL+AGE image + # When installMethod is "runtime-install", AGE is installed via apt/apk after container starts + + # Persistence configuration + persistence: enabled: true - size: 10Gi + size: 20Gi storageClass: "" accessModes: ["ReadWriteOnce"] annotations: {} labels: {} - dataDirectory: "/data/kuzu" + + # Resource allocation resources: requests: - cpu: "100m" - memory: "256Mi" - limits: cpu: "500m" memory: "1Gi" - 
- # PostgreSQL with Apache AGE configuration - postgresql: - host: "localhost" - port: 5432 - database: "radius_graph" - username: "radius_user" - passwordSecretName: "postgres-credentials" - sslMode: "prefer" # disable | prefer | require - connectionPoolSize: 10 - - # Custom Cypher database configuration - custom: - connectionString: "" - credentialsSecretName: "" - dialect: "neo4j" # neo4j | amazon-neptune | others - additionalParams: {} + limits: + cpu: "2" + memory: "4Gi" + + # Database configuration + database: "radius" + username: "radius" + passwordSecretName: "postgresql-credentials" + + # Connection configuration + connectionPoolSize: 20 + maxIdleConnections: 5 + maxOpenConnections: 100 ``` #### Environment Variable Mapping -The GAL will read configuration from these environment variables (set by Helm chart): - ```bash -# Graph database type selection -RADIUS_GRAPH_DATABASE_TYPE=kuzu - -# Kùzu configuration -RADIUS_GRAPH_KUZU_DATA_DIR=/data/kuzu - -# PostgreSQL configuration -RADIUS_GRAPH_POSTGRESQL_HOST=postgres.example.com -RADIUS_GRAPH_POSTGRESQL_PORT=5432 -RADIUS_GRAPH_POSTGRESQL_DATABASE=radius_graph -RADIUS_GRAPH_POSTGRESQL_USERNAME=radius_user -RADIUS_GRAPH_POSTGRESQL_PASSWORD_FILE=/etc/secrets/postgres/password -RADIUS_GRAPH_POSTGRESQL_SSL_MODE=require - -# Custom configuration -RADIUS_GRAPH_CUSTOM_CONNECTION_STRING=bolt://neo4j.example.com:7687 -RADIUS_GRAPH_CUSTOM_CREDENTIALS_FILE=/etc/secrets/custom/credentials -RADIUS_GRAPH_CUSTOM_DIALECT=neo4j +# PostgreSQL container configuration +RADIUS_DATABASE_TYPE=postgresql-age +RADIUS_POSTGRESQL_HOST=radius-postgresql +RADIUS_POSTGRESQL_PORT=5432 +RADIUS_POSTGRESQL_DATABASE=radius +RADIUS_POSTGRESQL_USERNAME=radius +RADIUS_POSTGRESQL_PASSWORD_FILE=/etc/secrets/postgresql/password +RADIUS_POSTGRESQL_SSL_MODE=prefer +RADIUS_POSTGRESQL_CONNECTION_POOL_SIZE=20 ``` - -#### Design Rationale - -This approach follows **established Radius patterns**: - -1. 
**Interactive Configuration Pattern**: Mirrors `rad init --full` interactive flows for cloud provider credential configuration -2. **Installation-time Parameters**: Consistent with `rad install kubernetes --set` usage for Helm chart customization -3. **Credential Management**: Follows existing patterns for handling sensitive configuration via Kubernetes secrets -4. **Default Behavior**: Provides sensible defaults (Kùzu) while allowing production-ready alternatives (PostgreSQL+AGE) -5. **Extensibility**: Supports future Cypher-compatible databases through custom configuration - -**Benefits:** -- **Consistency**: Aligns with existing Radius user experience patterns -- **Flexibility**: Supports both development (embedded Kùzu) and production (networked PostgreSQL) scenarios -- **Automation-Friendly**: Non-interactive configuration supports GitOps and CI/CD workflows -- **Progressive Disclosure**: Simple defaults with advanced options for power users - -**Implementation Notes:** -- GAL initialization logic will read configuration from environment variables -- Database connections will be established during GAL startup with appropriate error handling -- Connection pooling and retry logic will be implemented for networked database options -- Schema initialization will be database-specific but abstracted through the GAL interface From 1767a96fa4a1463be2ef660e3b02a35e63331a4f Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Tue, 17 Jun 2025 12:11:34 -0700 Subject: [PATCH 15/19] removing confusing directory. 
Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename architecture/{2025-06-non-k8s-controlplane => }/2025-06-graph-db-replace-planes.md (100%) diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-graph-db-replace-planes.md similarity index 100% rename from architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md rename to architecture/2025-06-graph-db-replace-planes.md From e069f7bd6518b72de1f0afcfc13bce0e103157e1 Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Tue, 17 Jun 2025 12:52:48 -0700 Subject: [PATCH 16/19] fix code block indentation for spellcheck. Signed-off-by: Sylvain Niles --- .../2025-06-graph-db-replace-planes.md | 67 ++++++++++--------- .../2025-06-graph-db-replace-planes.md | 0 ...-09-handle-aws-non-idempotent-resources.md | 4 +- 3 files changed, 37 insertions(+), 34 deletions(-) create mode 100644 architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md diff --git a/architecture/2025-06-graph-db-replace-planes.md b/architecture/2025-06-graph-db-replace-planes.md index e5fa6113..9d9a17cd 100644 --- a/architecture/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-graph-db-replace-planes.md @@ -245,39 +245,42 @@ graph TD 5. 
**Graph Access Layer (GAL) API:** * Example Go interface: - ```go - type GraphStore interface { - // Node operations - CreateNode(ctx context.Context, node Node) error - GetNode(ctx context.Context, nodeID string) (Node, error) - UpdateNodeProperties(ctx context.Context, nodeID string, properties map[string]interface{}) error - DeleteNode(ctx context.Context, nodeID string) error // Handle cascading deletes for owned relationships - - // Edge operations (connections) - CreateEdge(ctx context.Context, edge Edge) error - GetEdge(ctx context.Context, fromNodeID, toNodeID string, edgeType string) (Edge, error) // Or a unique edge ID - UpdateEdgeProperties(ctx context.Context, edgeID string, properties map[string]interface{}) error - DeleteEdge(ctx context.Context, edgeID string) error - - // Query operations - GetOutgoingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) - GetIncomingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) - FindPaths(ctx context.Context, startNodeID, endNodeID string, maxHops int) ([][]Node, error) // More complex queries - ExecuteCypherQuery(ctx context.Context, query string, params map[string]interface{}) ([]map[string]interface{}, error) // For advanced internal use - } type Node struct { - ID string - Type string // e.g., "Applications.Core/application" - Properties map[string]interface{} // Minimal properties for query filtering only - } + + ```go + type GraphStore interface { + // Node operations + CreateNode(ctx context.Context, node Node) error + GetNode(ctx context.Context, nodeID string) (Node, error) + UpdateNodeProperties(ctx context.Context, nodeID string, properties map[string]interface{}) error + DeleteNode(ctx context.Context, nodeID string) error // Handle cascading deletes for owned relationships + + // Edge operations (connections) + CreateEdge(ctx context.Context, edge Edge) error + GetEdge(ctx context.Context, fromNodeID, toNodeID string, edgeType 
string) (Edge, error) // Or a unique edge ID + UpdateEdgeProperties(ctx context.Context, edgeID string, properties map[string]interface{}) error + DeleteEdge(ctx context.Context, edgeID string) error + + // Query operations + GetOutgoingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) + GetIncomingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) + FindPaths(ctx context.Context, startNodeID, endNodeID string, maxHops int) ([][]Node, error) // More complex queries + ExecuteCypherQuery(ctx context.Context, query string, params map[string]interface{}) ([]map[string]interface{}, error) // For advanced internal use + } + + type Node struct { + ID string + Type string // e.g., "Applications.Core/application" + Properties map[string]interface{} // Minimal properties for query filtering only + } - type Edge struct { - ID string // Optional, could be derived from both nodes - FromNodeID string - ToNodeID string - Type string // e.g., "Connection" - Properties map[string]interface{} - } - ``` + type Edge struct { + ID string // Optional, could be derived from both nodes + FromNodeID string + ToNodeID string + Type string // e.g., "Connection" + Properties map[string]interface{} + } + ``` 4. **Data Persistence and State:** * **PostgreSQL + Apache AGE Container:** Network-based connection with standard PostgreSQL high availability, clustering, and backup mechanisms. 
diff --git a/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md b/architecture/2025-06-non-k8s-controlplane/2025-06-graph-db-replace-planes.md new file mode 100644 index 00000000..e69de29b diff --git a/ucp/aws/handle-non-idempotent-resources/2023-09-handle-aws-non-idempotent-resources.md b/ucp/aws/handle-non-idempotent-resources/2023-09-handle-aws-non-idempotent-resources.md index 9bc7a45e..7bd64107 100644 --- a/ucp/aws/handle-non-idempotent-resources/2023-09-handle-aws-non-idempotent-resources.md +++ b/ucp/aws/handle-non-idempotent-resources/2023-09-handle-aws-non-idempotent-resources.md @@ -1,6 +1,6 @@ # Handling Non-Idempotent AWS Resources in UCP -**Note: This design doc has been ported over from an old design doc and might not match the template completely** +**Note: This design doc has been ported over from an old design doc and might not match Note that here we are using the Radius resource group to construct the key of the tracking entry instead of the AWS `resourceID`. The deployment engine provides the resource group information from the deployment's resource group during the POST call that it makes to UCP to create/update the resource. This will associate the AWS resource with the Radius resource group, and thus allow the user to list AWS resources deployed within a Radius resource group.he template completely** * **Author**: Vinaya Damle (vinayada) @@ -120,7 +120,7 @@ Note that for now, we will add the “alias” property only for AWS resources a #### UCP/DE Changes -We will make changes to UCP to handle AWS resources similarly to Radius resources. UCP will store the AWS resource metadata in the same way that Radius resources are stored today (that is, the key will be the resourceID). We will store three pieces of metadata for each of these tracked AWS resources: +We will make changes to UCP to handle AWS resources similarly to Radius resources. 
UCP will store the AWS resource metadata in the same way that Radius resources are stored today (that is, the key will be the `resourceID`). We will store three pieces of metadata for each of these tracked AWS resources: - Scope: Describes the AWS resource scope, including the account ID and the region ID for the resource deployment - PrimaryIdentifier: Lists the AWS primary identifier, which is required to uniquely identify the resource From 204cde6128b64d24f99295d9fe68f741fc1bcf1a Mon Sep 17 00:00:00 2001 From: Sylvain Niles Date: Tue, 17 Jun 2025 12:57:16 -0700 Subject: [PATCH 17/19] more spellcheck fixes. Signed-off-by: Sylvain Niles --- .github/config/en-custom.txt | 12 +++ .../2025-06-graph-db-replace-planes.md | 100 +++++++++--------- 2 files changed, 63 insertions(+), 49 deletions(-) diff --git a/.github/config/en-custom.txt b/.github/config/en-custom.txt index b7f20e6c..6c7d6366 100644 --- a/.github/config/en-custom.txt +++ b/.github/config/en-custom.txt @@ -983,3 +983,15 @@ tolerations unopinionated WVKgHularsO youtu +Niles +Sylvain +Drasi +middleware +postgres +DAL +apk +CreateNode +DeleteNode +GetNode +UpdateNodeProperties +nodeID diff --git a/architecture/2025-06-graph-db-replace-planes.md b/architecture/2025-06-graph-db-replace-planes.md index 9d9a17cd..e15bbca4 100644 --- a/architecture/2025-06-graph-db-replace-planes.md +++ b/architecture/2025-06-graph-db-replace-planes.md @@ -245,42 +245,42 @@ graph TD 5. 
**Graph Access Layer (GAL) API:** * Example Go interface: - - ```go - type GraphStore interface { - // Node operations - CreateNode(ctx context.Context, node Node) error - GetNode(ctx context.Context, nodeID string) (Node, error) - UpdateNodeProperties(ctx context.Context, nodeID string, properties map[string]interface{}) error - DeleteNode(ctx context.Context, nodeID string) error // Handle cascading deletes for owned relationships - - // Edge operations (connections) - CreateEdge(ctx context.Context, edge Edge) error - GetEdge(ctx context.Context, fromNodeID, toNodeID string, edgeType string) (Edge, error) // Or a unique edge ID - UpdateEdgeProperties(ctx context.Context, edgeID string, properties map[string]interface{}) error - DeleteEdge(ctx context.Context, edgeID string) error - - // Query operations - GetOutgoingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) - GetIncomingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) - FindPaths(ctx context.Context, startNodeID, endNodeID string, maxHops int) ([][]Node, error) // More complex queries - ExecuteCypherQuery(ctx context.Context, query string, params map[string]interface{}) ([]map[string]interface{}, error) // For advanced internal use - } - - type Node struct { - ID string - Type string // e.g., "Applications.Core/application" - Properties map[string]interface{} // Minimal properties for query filtering only - } - type Edge struct { - ID string // Optional, could be derived from both nodes - FromNodeID string - ToNodeID string - Type string // e.g., "Connection" - Properties map[string]interface{} - } - ``` +```go +type GraphStore interface { + // Node operations + CreateNode(ctx context.Context, node Node) error + GetNode(ctx context.Context, nodeID string) (Node, error) + UpdateNodeProperties(ctx context.Context, nodeID string, properties map[string]interface{}) error + DeleteNode(ctx context.Context, nodeID string) error // Handle 
cascading deletes for owned relationships + + // Edge operations (connections) + CreateEdge(ctx context.Context, edge Edge) error + GetEdge(ctx context.Context, fromNodeID, toNodeID string, edgeType string) (Edge, error) // Or a unique edge ID + UpdateEdgeProperties(ctx context.Context, edgeID string, properties map[string]interface{}) error + DeleteEdge(ctx context.Context, edgeID string) error + + // Query operations + GetOutgoingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) + GetIncomingNeighbors(ctx context.Context, nodeID string, edgeTypePattern string) ([]Node, error) + FindPaths(ctx context.Context, startNodeID, endNodeID string, maxHops int) ([][]Node, error) // More complex queries + ExecuteCypherQuery(ctx context.Context, query string, params map[string]interface{}) ([]map[string]interface{}, error) // For advanced internal use +} + +type Node struct { + ID string + Type string // e.g., "Applications.Core/application" + Properties map[string]interface{} // Minimal properties for query filtering only +} + +type Edge struct { + ID string // Optional, could be derived from both nodes + FromNodeID string + ToNodeID string + Type string // e.g., "Connection" + Properties map[string]interface{} +} +``` 4. **Data Persistence and State:** * **PostgreSQL + Apache AGE Container:** Network-based connection with standard PostgreSQL high availability, clustering, and backup mechanisms. @@ -292,20 +292,22 @@ graph TD * **Backup and Restore Responsibility:** DBAs are responsible for PostgreSQL backup and restore operations using standard tooling (pg_dump, continuous archiving, etc.). Radius coordinates with these procedures during upgrades and rollbacks. * **Radius Upgrade Integration:** Radius upgrade processes (currently under development) will be enhanced to trigger backup checkpoints and rollback during failure scenarios. 
     * **Migration Tool as Recovery Tool:** Migration tools can be executed to regenerate or verify graph data consistency.
-    * **Operational Benefits:** Leverages existing PostgreSQL operational expertise and tooling rather than introducing novel backup strategies.     * **Recovery Commands:**
-      ```bash
-      # DBA-managed PostgreSQL backup/restore (standard operations)
-      # DBAs use standard PostgreSQL tooling: pg_dump, pg_restore, continuous archiving
-
-      # Radius upgrade/rollback coordination
-      # Integration with upgrade commands currently under development to:
-      # - Trigger backup checkpoints before upgrades
-      # - Trigger restore procedures during rollbacks
-
-      # Graph-specific verification and rebuild (Radius-managed)
-      rad admin graph verify --repair-if-needed
-      rad admin graph rebuild --from-etcd
-      ```
+    * **Operational Benefits:** Leverages existing PostgreSQL operational expertise and tooling rather than introducing novel backup strategies.
+    * **Recovery Commands:**
+
+```bash
+# DBA-managed PostgreSQL backup/restore (standard operations)
+# DBAs use standard PostgreSQL tooling: pg_dump, pg_restore, continuous archiving
+
+# Radius upgrade/rollback coordination
+# Integration with upgrade commands currently under development to:
+# - Trigger backup checkpoints before upgrades
+# - Trigger restore procedures during rollbacks
+
+# Graph-specific verification and rebuild (Radius-managed)
+rad admin graph verify --repair-if-needed
+rad admin graph rebuild --from-etcd
+```
 
 6. **Transaction Management:**
     * All compound operations (e.g., creating a resource node and its relationship edge) must be performed within a database transaction to ensure atomicity. The GAL will manage this.
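Reviewer note on the transaction-management item that closes the hunk above: one way to read "the GAL will manage this" is a helper that runs a compound operation against buffered writes and applies them only if every step succeeds. The sketch below illustrates that idea with a hypothetical in-memory store; `memStore`, `tx`, `withTx`, and `createResourceWithConnection` are illustrative names, not part of the proposal, and the resource type strings are placeholders. A PostgreSQL-backed GAL would presumably delegate to the database's own transactions (for example `BeginTx`, `Commit`, and `Rollback` from Go's `database/sql`) rather than buffering in memory.

```go
package main

import (
	"errors"
	"fmt"
)

// Node and Edge are trimmed-down versions of the shapes in the proposal's GraphStore sketch.
type Node struct {
	ID   string
	Type string
}

type Edge struct {
	FromNodeID string
	ToNodeID   string
	Type       string
}

// memStore is a hypothetical in-memory stand-in for the graph backend.
// It is not safe for concurrent use; a real store would need locking or DB transactions.
type memStore struct {
	nodes map[string]Node
	edges []Edge
}

// tx buffers writes; nothing reaches the store until withTx commits.
type tx struct {
	store *memStore
	nodes map[string]Node
	edges []Edge
}

var errMissingNode = errors.New("edge endpoint does not exist")

func newMemStore() *memStore { return &memStore{nodes: map[string]Node{}} }

// hasNode checks both committed nodes and nodes created earlier in this transaction.
func (t *tx) hasNode(id string) bool {
	if _, ok := t.nodes[id]; ok {
		return true
	}
	_, ok := t.store.nodes[id]
	return ok
}

func (t *tx) createNode(n Node) error {
	if t.hasNode(n.ID) {
		return fmt.Errorf("node %q already exists", n.ID)
	}
	t.nodes[n.ID] = n
	return nil
}

func (t *tx) createEdge(e Edge) error {
	if !t.hasNode(e.FromNodeID) || !t.hasNode(e.ToNodeID) {
		return errMissingNode
	}
	t.edges = append(t.edges, e)
	return nil
}

// withTx runs fn against a fresh transaction and applies the buffered
// writes only if fn returns nil; on error the buffer is simply discarded.
func (s *memStore) withTx(fn func(*tx) error) error {
	t := &tx{store: s, nodes: map[string]Node{}}
	if err := fn(t); err != nil {
		return err
	}
	for id, n := range t.nodes {
		s.nodes[id] = n
	}
	s.edges = append(s.edges, t.edges...)
	return nil
}

// createResourceWithConnection is the compound operation from the proposal:
// a resource node plus its relationship edge, created atomically.
func createResourceWithConnection(s *memStore, n Node, targetID string) error {
	return s.withTx(func(t *tx) error {
		if err := t.createNode(n); err != nil {
			return err
		}
		return t.createEdge(Edge{FromNodeID: n.ID, ToNodeID: targetID, Type: "Connection"})
	})
}

func main() {
	s := newMemStore()
	s.nodes["db"] = Node{ID: "db", Type: "example/database"}

	// Both writes succeed, so both are committed.
	err := createResourceWithConnection(s, Node{ID: "web", Type: "example/container"}, "db")
	fmt.Println(err, len(s.nodes), len(s.edges))

	// The edge target is missing, so the node write is rolled back too.
	err = createResourceWithConnection(s, Node{ID: "api", Type: "example/container"}, "missing")
	fmt.Println(err, len(s.nodes), len(s.edges))
}
```

The point of the second call is that a failed edge creation leaves no orphaned `api` node behind, which is the atomicity property the design asks the GAL to guarantee.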
From 1fb56a64332b8c1e1830e31bfe145c715b4bd827 Mon Sep 17 00:00:00 2001
From: Sylvain Niles
Date: Tue, 17 Jun 2025 13:07:51 -0700
Subject: [PATCH 18/19] more spellcheck

Signed-off-by: Sylvain Niles
---
 .github/config/en-custom.txt                    | 10 ++++++++++
 architecture/2025-06-graph-db-replace-planes.md |  7 ++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/.github/config/en-custom.txt b/.github/config/en-custom.txt
index 6c7d6366..bf51f386 100644
--- a/.github/config/en-custom.txt
+++ b/.github/config/en-custom.txt
@@ -995,3 +995,13 @@ DeleteNode
 GetNode
 UpdateNodeProperties
 nodeID
+atomicity
+reusability
+queryable
+OpenTelemetry
+LRT
+Sanitization
+parameterization
+runbooks
+sizeable
+DuckDB
\ No newline at end of file
diff --git a/architecture/2025-06-graph-db-replace-planes.md b/architecture/2025-06-graph-db-replace-planes.md
index e15bbca4..ad61440f 100644
--- a/architecture/2025-06-graph-db-replace-planes.md
+++ b/architecture/2025-06-graph-db-replace-planes.md
@@ -316,7 +316,7 @@ rad admin graph rebuild --from-etcd
 #### Advantages (of PostgreSQL + AGE for application graph operations)
 
 * **Rich Querying:** Cypher provides significantly more powerful and expressive graph query capabilities than filtering etcd values client side for application graph traversals.
-* **Performance:** For complex graph traversals (multi-hop queries, pathfinding), PostgreSQL + AGE is optimized for such operations and likely to be much faster than current etcd-based approaches.
+* **Performance:** For complex graph traversals (multi-hop queries, path finding), PostgreSQL + AGE is optimized for such operations and likely to be much faster than current etcd-based approaches.
 * **Production Ready:** PostgreSQL is a mature, battle-tested database with extensive operational tooling, monitoring, and expertise available.
 * **Unified Technology Stack:** Using PostgreSQL + AGE establishes a foundation for potentially consolidating all Radius storage needs, reducing operational complexity over time.
 * **Container Architecture:** Deploying as a separate container provides clear separation of concerns while enabling standard Kubernetes deployment patterns.
@@ -354,7 +354,7 @@ graph TD
 
 In this model:
 - The `Database` node is connected to a `Credentials` node via a CONNECTION.
-- The `Credentials` node is connected to two `Secret` nodes (for username and password) via CONNECTIONs.
+- The `Credentials` node is connected to two `Secret` nodes (for username and password) via connections.
 
 This structure enables richer queries and recipe author use cases like `context.connected_resources.database.credentials.username` instead of only being able to access the embedded `credentials` object and requiring the recipe author to parse. Additionally it provides a better separation of concerns, and a more flexible, maintainable application graph.
 
@@ -512,9 +512,6 @@ No changes to the public Radius REST API are anticipated initially, other than p
 3. **Hosted/Server-based Graph Databases (e.g., Neo4j, Amazon Neptune):**
     * **Advantages:** Mature, feature-rich, often provide built-in clustering and HA.
     * **Disadvantages:** Adds significant operational complexity (managing a separate database cluster), vendor lock-in for managed services, additional costs, and deviates from the goal of a self-contained solution.
-4. **Other PostgreSQL Extensions (e.g., PostgREST with custom graph logic):**
-    * **Advantages:** Uses familiar PostgreSQL but with different graph capabilities.
-    * **Disadvantages:** Apache AGE provides native Cypher support and is specifically designed for graph workloads, offering better performance and more comprehensive graph features than custom solutions.
 
 ### Full Storage Migration Effort Evaluation
 
From 7204862d5bf3082f91c19169c639e047a6e08fa8 Mon Sep 17 00:00:00 2001
From: Sylvain Niles
Date: Tue, 17 Jun 2025 13:10:33 -0700
Subject: [PATCH 19/19] last spellcheck!

Signed-off-by: Sylvain Niles
---
 architecture/2025-06-graph-db-replace-planes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/architecture/2025-06-graph-db-replace-planes.md b/architecture/2025-06-graph-db-replace-planes.md
index ad61440f..4bb30df2 100644
--- a/architecture/2025-06-graph-db-replace-planes.md
+++ b/architecture/2025-06-graph-db-replace-planes.md
@@ -484,7 +484,7 @@ No changes to the public Radius REST API are anticipated initially, other than p
     * Develop documentation for operators and developers.
     * Create operational runbooks for common scenarios.
 4. **Phase 4: Query Enhancement (Milestone 4 - optional)**
-    * Enhance GAL with more advanced query capabilities (pathfinding, complex traversals, to support new User Stories defined by product).
+    * Enhance GAL with more advanced query capabilities (path finding, complex traversals, to support new User Stories defined by product).
     * Evaluate opportunities for migrating additional Radius storage needs to PostgreSQL.
 5. **Phase 5: Full Storage Migration (Future - out of scope)**
     * Migrate existing key-value operations from etcd to PostgreSQL.