## Summary

## Causal Multi-Head Attention Forward Pass (CUDA)

This PR implements the CUDA forward pass for causal multi-head attention (`attention_forward`). It includes the core GPU kernel, custom block-level reduction primitives, and tensor validation helpers.

## Core Attention Kernel

`attention_forward_kernel`:

- Computes scaled dot-product attention on an interleaved QKV input tensor structured as `[Batch, Time, 3 * Channels]`.
- Causal masking: enforces autoregressive constraints by preventing the token at time step $t$ from attending to future time steps ($t_2 > t$).
- Implements parallelized `block_max` and `block_sum` device functions.
- Leverages cooperative warp shuffles (`warp_max`, `warp_sum`) and shared memory for numerically stable online softmax normalization.

#52 #11 #12 #14 #29
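As a way to check the kernel's output on small inputs, the behaviour described above can be sketched as a CPU reference in plain Python. This is an illustrative sketch, not the PR's code: the function name `attention_forward_reference`, the head count parameter `NH`, and the assumption that the last axis is laid out as Q, then K, then V (each `C` floats, split evenly across heads) are not stated in the PR. The max and sum steps stand in for what `block_max` and `block_sum` would compute per row.

```python
import math

def attention_forward_reference(inp, B, T, C, NH):
    """CPU reference for causal multi-head attention (hypothetical helper).

    inp is a flat list of length B*T*3*C, indexed as [B, T, 3*C].
    Assumed layout of the last axis: Q (C floats), K (C floats), V (C floats).
    Returns a flat list of length B*T*C, indexed as [B, T, C].
    """
    assert C % NH == 0, "channels must divide evenly across heads"
    hs = C // NH                      # per-head channel count
    scale = 1.0 / math.sqrt(hs)       # scaled dot-product factor
    out = [0.0] * (B * T * C)

    for b in range(B):
        for t in range(T):
            for h in range(NH):
                q_off = (b * T + t) * 3 * C + h * hs

                # Causal mask: only keys at t2 <= t contribute.
                scores = []
                for t2 in range(t + 1):
                    k_off = (b * T + t2) * 3 * C + C + h * hs
                    dot = sum(inp[q_off + i] * inp[k_off + i] for i in range(hs))
                    scores.append(dot * scale)

                # Stable softmax: subtract the row max before exponentiating.
                row_max = max(scores)                         # role of block_max
                exps = [math.exp(s - row_max) for s in scores]
                row_sum = sum(exps)                           # role of block_sum

                # Weighted sum of the value vectors.
                o_off = (b * T + t) * C + h * hs
                for t2, e in enumerate(exps):
                    v_off = (b * T + t2) * 3 * C + 2 * C + h * hs
                    w = e / row_sum
                    for i in range(hs):
                        out[o_off + i] += w * inp[v_off + i]
    return out
```

Because the first token can only attend to itself, its output equals its own value vector, which makes a convenient sanity check when comparing against the GPU result.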
# Pull Request Engineering Summary

## Core LLM Pipeline Modernization & Architectural Overhaul

> **Executive Summary:** This pull request aggregates a sequence of engineering upgrades that move the standalone modeling stack to an optimized, production-ready decoder-only autoregressive Transformer engine. The updates cover the front-end web UI components, SIMD-accelerated CPU tensor math kernels, and multi-GPU training with experiment telemetry.

---

## 1. Pull Request Core Metadata

| Metadata Field | Description |
| :--- | :--- |
| **PR Target Branch / Title** | `refactor/core-engine` $\rightarrow$ `main` \| Upgrade Core LLM Infrastructure to Decoder-Only Pipeline & Analytics |
| **Primary Changes** | Architecture migration (Decoder-Only), Telemetry implementation (WandB), UI Overhaul (Inline Styles), Native Optimization (AVX/SSE) |
| **Impact Scope** | Core Neural Network Engine, Cluster Training Primitives, Cross-Platform Frontend Subsystems, Vector Math Backends |
| **Telemetry & Tokenization** | Weights & Biases runtime tracking integration; `tiktoken` (`o200k_base` Byte-Pair Encoding) backend migration |
| **Hardware Optimization** | Unaligned 256-bit vector intrinsics (`__AVX__`) and 128-bit vector intrinsics (`__SSE__`) with a scalar fallback |

---

## 2. Core Neural Network & Architectural Shifts

The changes consolidate several independent core layers (`Embedding`, `LayerNorm`, `Linear`) into a unified autoregressive decoder-only Transformer configuration:

* **Decoder-Only Refactor:** Phased out the legacy sequence-to-sequence (seq2seq) architecture in favor of a fully causal autoregressive structure. Causal masking is applied during the forward pass to prevent the model from looking at future tokens.
* **Token & Absolute Position Embeddings:** The core `Embedding` layer maps flat input token sequences into 3D hidden tensors of shape $[B, T, D]$. A separate absolute positional embedding path (`forward_pos`) produces position embeddings for variable context lengths ($T$).
* **Numerical Loss & Optimization Stability:** The `cross_entropy` implementation subtracts the maximum logit before exponentiation (max value normalization) to protect the log-softmax against underflow and overflow. The stateful `AdamW` optimizer registers pointers to the parameter buffers and updates the raw weight vectors in place, without copying them.

---

## 3. Low-Level Core Optimizations (C++ Tensor Kernel)

To speed up element-wise arithmetic in native code, the passes over raw vectors (`add`, `add_inplace`) have been split into architecture-specific paths selected at compile time through preprocessor macros:

* **256-Bit AVX Intrinsics:** Uses explicit unaligned loads (`_mm256_loadu_ps`) and vector additions (`_mm256_add_ps`) to process eight single-precision floats per instruction.
* **128-Bit SSE Fallback:** Provides explicit 128-bit vector loops (`_mm_loadu_ps`, `_mm_add_ps`) that process four floats per instruction on older hosts.
* **Flat Binary Serialization:** All layer components (`Linear`, `LayerNorm`, `Embedding`) write and read their parameters as raw byte blocks via `reinterpret_cast<char*>`, so checkpoints are saved and loaded without any structural serialization metadata.

---

## 4. Distributed Orchestration & Cluster Telemetry

The Python training orchestration code has been upgraded to support large-scale training across distributed multi-node hardware:

* **Multi-GPU DDP Architecture:** Integrates NCCL-backed `DistributedDataParallel`, with rank-based filtering, master process controls, and seed offsetting logic for reproducible runs.
* **Mixed-Precision Execution (AMP):** Uses runtime autocasting (`torch.amp.autocast`) with either pure `bfloat16` or gradient-scaled `float16`, preventing numerical underflow while keeping Tensor Core throughput.
* **Sub-word Tokenization Backends:** Replaces the slow legacy text split-parsers with byte-pair encoding (`tiktoken` with the `o200k_base` encoding), improving token density per context window and reducing vocabulary padding overhead.
* **WandB Experiment Telemetry:** Hooks up centralized Weights & Biases tracking for real-time convergence monitoring, loss diagnostics, and hardware health metrics.

---

## 5. Frontend Framework Refactor (React Web Component Tree)

The web application dashboard migrates entirely from the global utility-first Tailwind configuration to explicit, typed inline styles (`React.CSSProperties`) combined with React mouse and focus event handlers to manage high-frequency interface states:

* **Component Modularity Overhauls:** The structural view layers (`AppLayout` shell, `Sidebar`, `Topbar`, `SessionItem`, `StatsPanel`, `SettingsPanel`, and `ModelBadge`) have been rewritten to rely on atomic design tokens and explicit flexbox layout boundaries.
* **Dynamic Event Interactivity:** Replaces utility hover classes with micro-interactions driven by event handlers (`onMouseEnter`, `onMouseLeave`, `onFocusCapture`, `onBlurCapture`) that control component border glows, state transitions, and translucent background overlays.
* **Layout & Responsive Edge-Case Safety:** Enforces predictable rendering across devices using concrete layout rules (`flexShrink: 0`, `minWidth: 0`, `wordBreak: 'break-all'`, and explicit text ellipsis clamping) so the interface holds up on both desktop and mobile screens.
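The max value normalization mentioned for `cross_entropy` in section 2 can be illustrated with a short Python sketch. This is not the PR's C++ implementation; the function names and the list-of-logits interface are assumptions made for the example. Subtracting the largest logit before calling `exp` keeps every exponent at or below zero, so the exponentials cannot overflow and at least one term in the sum equals 1.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits).

    Hypothetical helper: `logits` is a list of floats for one position,
    `target` is the index of the correct class.
    """
    m = max(logits)                                   # max value normalization
    # Every exponent is <= 0, so exp() stays in [0, 1] and cannot overflow.
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    log_prob = (logits[target] - m) - log_sum         # log-softmax at target
    return -log_prob

def mean_cross_entropy(batch_logits, targets):
    """Average loss over a batch of positions (hypothetical helper)."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)
```

A naive version that computes `math.exp(1000.0)` directly would raise `OverflowError` in Python; the shifted form handles the same logits without trouble.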
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* feat(ci): optimize workflow pipeline and update docker configurations
* refactor(ci): optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* refactor : optimize workflow pipeline and update docker configurations
* Added MIT LICENSE to this project Quadtrix.cpp
* Refactor Dockerfile to use ARG for CUDA version
* Refactor Dockerfile for backend dependencies
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* Delete .devops/Dockerfile.frontend
* Delete .devops/Dockerfile.dev.frontend
* refactor : Dockerfile.backend optimize workflow pipeline
* refactor : Dockerfile.backend optimize workflow pipeline
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactored (CI): consolidated manual Docker build jobs into a matrix strategy to reduce duplication
* refactor(ui): rewrite ThinkingIndicator to use inline styles and CSS keyframes
* refactor : message bubble layout to use inline styles
* refactor(ui): complete inline-style migration and update auto-scroll implementation
* refactor(ui): complete inline-style migration for MessageAvatar component
* refactor(ui): rewrite EmptyState component using pure inline styles
* refactored(tensor): vectorize element-wise addition and scalar scaling using AVX/SSE
  - Added SIMD vectorization support (`__AVX__` and `__SSE__`) for element-wise `add`, `add_inplace`, and `scale` operations.
  - Maintained scalar fallback paths for non-vectorized bounds and platforms lacking hardware extensions.
  - Explicitly defined rule-of-five constructors (`default` and `noexcept` moves) within the `Tensor` struct layout.
  - Optimized vector initialization across the core construct layer via `std::move` and `std::vector::reserve`.
* refactor(main): redesign training loop to log per-step and sample during evaluation
  - Replaced the periodic block evaluation layout with standard, per-step logging metrics (`loss`, `ms`, and `tok/s`).
  - Shifted initial validation loss calculation out of the iteration cycle to establish a zero-state baseline.
  - Restructured token streaming so that generations are triggered conditionally inside the training loop post-evaluation windows.
  - Streamlined architecture parameter reporting and consolidated command-line configuration visual prints.
* feat: implement GPT training loop with multi-GPU and memory optimizations
  - Add advanced memory footprint optimization using forward-activation recomputation for LayerNorm and GeLU.
  - Optimize layer-wise activation buffer layout using a centralized `TensorSpec` registry to support large batch scaling.
  - Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
  - Fall back gracefully to `cudaMallocManaged` under heavy loads to prevent Outlier/OOM crashes.
* Update README.md with new banner for qudtrix.cpp

---------

Co-authored-by: Max <eamon5174@gmail.com>