Skip to content

An implementation of an all-new foundation model architecture that trains on byte sequences from multiple modalities to handle omni-modal generation of text, video, images and more.

License

Notifications You must be signed in to change notification settings

Agora-Lab-AI/OmniByteGPT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OmniByteGPT

Join our Discord Subscribe on YouTube Connect on LinkedIn Follow on X.com

Abstract

We present BytePredictor, a novel architecture for universal sequence modeling that operates at the byte level across multiple modalities. By treating all data types as raw byte sequences, our model can learn and generate diverse content types including text, images, audio, and their combinations. The architecture incorporates state-of-the-art advances such as Multi-Query Attention (MQA) and Rotary Position Embeddings (RoPE), while introducing novel optimizations for byte-level prediction tasks.

Architecture

Core Components

  • Byte-Level Processing: Operates on raw bytes (0-255) enabling universal data handling
  • Enhanced Multi-Query Attention: Modified MQA mechanism with fewer key/value heads
  • Rotary Position Embeddings: Position-aware representations without sequence length limitation
  • QK-Normalization: Improved attention mechanism stability
  • Modality-Agnostic Training: Unified approach to multi-modal learning

Technical Specifications

@dataclass
class ModelConfig:
    vocab_size: int = 256  # Byte range
    hidden_size: int = 1024
    num_layers: int = 12
    num_key_value_heads: int = 8
    num_query_heads: int = 32
    max_sequence_length: int = 8192

Innovations

Multi-Modal Byte-Level Processing

Our model introduces several key innovations:

  1. Universal Tokenization: Direct byte-level processing eliminating the need for modality-specific tokenizers
  2. Automatic Modality Detection: Novel algorithms for identifying data types in generated sequences
  3. Boundary-Aware Generation: Specialized attention mechanisms for handling modal transitions

Performance Optimizations

  • Reduced memory footprint through MQA
  • Efficient rotary embeddings implementation
  • Optimized QK normalization for byte-level attention

Results

Quality Metrics

Preliminary evaluation shows promising results across modalities:

  • Text Generation: Comparable to specialized models
  • Image Synthesis: Effective for various formats
  • Multi-Modal Generation: Novel capabilities in cross-modal transitions

Computational Efficiency

Metric Value
Parameters 1B
MQA Memory Reduction 47%
Training FLOPs 3.2e18
Inference Speed 32K bytes/sec

Implementation Details

Attention Mechanism

q = self.q_proj(hidden_states)
k = self.k_proj(hidden_states)
v = self.v_proj(hidden_states)

# Apply rotary embeddings
q, k = self.rotary(q, k, seq_length)

# Multi-query attention
if self.num_key_value_heads != self.num_query_heads:
    k = k.repeat_interleave(
        self.num_query_heads // self.num_key_value_heads, 
        dim=1
    )

Modality Detection

Novel algorithm for automatic detection of generated content types:

  1. Byte pattern analysis
  2. Entropy-based classification
  3. Format signature matching
  4. Boundary detection for mixed content

Applications

Current Use Cases

  • Universal data compression
  • Multi-modal content generation
  • Format conversion and transformation
  • Anomaly detection in byte sequences

Future Directions

  1. Streaming byte prediction
  2. Adaptive modality switching
  3. Cross-modal translation
  4. Compression-aware generation

Citation

@article{bytepredictor2024,
  title={BytePredictor: Universal Next-Byte Prediction for Multi-Modal Generation},
  author={Kye Gomez},
  journal={arXiv preprint},
  year={2024}
}

Installation

pip install bytepredictor

Usage Example

from bytepredictor import BytePredictor, ModelConfig

# Initialize model
config = ModelConfig(hidden_size=1024)
model = BytePredictor(config)

# Generate content
output = model.generate(
    prompt_bytes,
    max_new_tokens=1000,
    temperature=0.8
)

# Auto-detect and decode
detector = ModalityDetector()
result = detector.detect_modality(output)

Contributors

  • Kye Gomez
  • Claude

License

MIT License

Acknowledgments

We thank the research community for their contributions to the advancement of universal sequence modeling and multi-modal generation.

About

An implementation of an all-new foundation model architecture that trains on byte sequences from multiple modalities to handle omni-modal generation of text, video, images and more.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

 

Packages

No packages published