
bpe_framework

Byte Pair Encoding Framework

Large Language Model for Agentic AI

Fully internationalized framework for Agentic AI research

Requires:

  1. nlohmann/json (https://github.com/nlohmann/json)
  2. ICU, the International Components for Unicode library (https://github.com/unicode-org/icu)
  3. OpenNMT Tokenizer (https://github.com/OpenNMT/Tokenizer)
  4. Eigen header files (https://github.com/PX4/eigen)

Build: cmake -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DEIGEN_LOC=<path to Eigen headers> ..

The test_bpe application is a comprehensive test program that validates the functionality of the BPE tokenizer implementation in the LM Framework. Here's how it works:

  1. Initialization: Creates an instance of BPETokenizer and defines a training corpus of sample English text.

  2. Training Process: Calls tokenizer.train(corpus, 500) to train the tokenizer. The training process:
       • Initializes with the byte-level vocabulary (0-255)
       • Analyzes word frequencies in the corpus
       • Iteratively merges the most frequent character pairs
       • Builds a vocabulary of 500 tokens (as specified)

  3. Encoding Test: Encodes the test string "the quick brown fox". The encoding process:
       • Splits text into words
       • Converts each character to its initial token ID
       • Applies learned BPE merges to combine tokens
       • Returns a sequence of integer token IDs

  4. Decoding Test: Decodes the token IDs back to text. The decoding process:
       • Converts each token ID back to its string representation
       • Concatenates the strings to reconstruct the original text

  5. Serialization Test: Saves the trained tokenizer to "bpe_model.txt". The serialization process:
       • Writes the vocabulary size and token-ID mappings
       • Records all learned merge rules

  6. Deserialization Test: Loads the tokenizer from "bpe_model.txt", verifies that the loaded tokenizer has the same vocabulary size, and confirms that it can still perform encoding and decoding.

Expected Output:

```text
Training tokenizer...
Vocabulary size: 500
Original: the quick brown fox
Tokens: [list of token IDs]
Decoded: the quick brown fox
Successfully loaded tokenizer
Loaded vocabulary size: 500
```

Key Validations:

  • Training completes without errors
  • Encoding/decoding round-trip preserves the original text
  • Serialization/deserialization maintains tokenizer state
  • Vocabulary size matches the specified target (500)
  • Token IDs are consistent between sessions

test_unicode.cpp

Lower-level Unicode-specific tests:

  • Unicode normalization functions
  • Character boundary detection
  • Grapheme cluster handling
  • Encoding conversion utilities
  • Validation of Unicode compliance

BPE Tokenizer Performance Test Suite

Overview

The performance test application is a comprehensive benchmarking tool designed to evaluate the efficiency and scalability of the Byte Pair Encoding (BPE) tokenizer implementation. The test suite measures critical performance metrics including training time, memory usage, encoding/decoding speed, and serialization performance across various configurations.

Key Features

1. Corpus Generation

  • Automatically generates realistic test corpora using common AI/ML terminology
  • Configurable sentence count and word range parameters
  • Creates diverse text samples that mimic real-world language patterns

2. Multi-Dimensional Testing

  • Tests multiple corpus sizes (100, 1000, 5000 sentences)
  • Evaluates different vocabulary sizes (500, 1000, 2000 tokens)
  • Measures performance across various workload scenarios

3. Comprehensive Performance Metrics

  • Training Time: Measures how long it takes to build the BPE vocabulary from a corpus
  • Memory Usage: Tracks peak memory consumption during training (Linux-specific)
  • Encoding Speed: Calculates processing time per token during text encoding
  • Round-Trip Verification: Ensures encoding/decoding preserves original content
  • Serialization Performance: Measures model save/load operations

4. Validation Checks

  • Verifies encoding/decoding consistency
  • Detects potential data corruption issues
  • Validates vocabulary construction

Usage Scenarios

This performance test is ideal for:

  • Benchmarking different BPE implementations
  • Evaluating hardware suitability for language processing tasks
  • Identifying performance bottlenecks in tokenization pipelines
  • Testing scalability of tokenizer implementations
  • Comparing optimization techniques

Technical Implementation

The test suite utilizes:

  • High-resolution timing with <chrono> for precise measurements
  • Linux-specific memory tracking via /proc/self/status
  • Randomized corpus generation with configurable parameters
  • Exception handling for robust testing
  • Automatic cleanup of temporary files

Output Metrics

The application provides detailed performance reports including:

  • Training duration in milliseconds
  • Peak memory usage in megabytes
  • Encoding speed in microseconds per token
  • Serialization/deserialization times
  • Vocabulary size validation
  • Round-trip integrity verification

This test framework serves as an essential tool for developers and researchers working with BPE tokenizers, providing quantitative data to guide optimization efforts and implementation choices.

Technical Summary: BPE Framework

Overview

The BPE Framework is a C++-based neural network framework designed for building and training language models with Byte Pair Encoding (BPE) tokenization. It implements a complete deep learning stack with automatic differentiation, optimization, and model serialization capabilities.

Core Components

1. Tensor Operations with Autograd

Header-only Tensor class with Eigen backend for efficient linear algebra

Automatic differentiation with backward propagation

Comprehensive operator support: element-wise operations, matrix multiplication, reductions

Activation functions: ReLU, GELU, Softmax, Sigmoid with gradient support

Memory-efficient implementation with shape-aware operations

2. BPE Tokenizer

PIMPL pattern implementation for API stability

Efficient vocabulary management with merge operations

Encoding/decoding support for text processing

Non-copyable design (uses unique_ptr) for proper resource management

3. Neural Network Architecture

Transformer-based language model implementation

Configurable dimensions: embedding size, hidden layers, attention heads

Parameter management with named parameters for serialization

Training/inference modes support

4. Training Infrastructure

Adam optimizer with configurable hyperparameters

Gradient accumulation and moment estimation

Batch processing with sequence padding

Loss computation (cross-entropy) with masking support

5. Model Serialization

Binary format with versioning and magic number validation

Parameter-by-name storage and retrieval

Shape preservation and data integrity checks

Error handling for file operations and format validation

Key Technical Features

Memory Management

Eigen integration for optimized matrix operations

Shape-aware memory allocation preventing unnecessary copies

RAII principles for resource management

Performance Considerations

Header-only design for Tensor class enabling compiler optimizations

Batch processing for efficient training

In-place operations where possible to reduce memory overhead

Extensibility

Modular architecture allowing component replacement

Clear interfaces between tokenizer, model, and training components

Parameter naming convention supporting complex architectures

Architecture Patterns

PIMPL Idiom: Used in tokenizer for stable ABI

RAII: Comprehensive resource management throughout

Builder Pattern: Model configuration through constructor parameters

Strategy Pattern: Optimizer implementation allowing algorithm changes

Current Capabilities

* Automatic differentiation with reverse-mode autograd

* BPE tokenization with vocabulary learning

* Transformer language model training

* Adam optimization with moment estimation

* Model serialization/deserialization

* Configurable network architectures

* Batch processing with padding

Technical Stack

C++17 with standard library components

Eigen for linear algebra operations

CMake for build system management

Header-only design for core components

Usage Example

```cpp
// Initialize components
BPETokenizer tokenizer(corpus);
LanguageModel model(tokenizer.vocab_size(), 512, 2048, 8);
LanguageModelTrainer trainer(tokenizer, 512, 2048, 8);

// Train model
trainer.train(training_corpus, 10, 32, 256);

// Save the trained model
trainer.save_model("language_model.bin");
```

Based on the research of Timothy O'Neil, Frederick Warren, et al.