Large Language Model for Agentic AI

Tim O'Neil d83d07a823 Initial commit	2025-08-21 10:44:50 -07:00
configs	Initial commit	2025-08-21 10:44:50 -07:00
include/lm	Initial commit	2025-08-21 10:44:50 -07:00
src	Initial commit	2025-08-21 10:44:50 -07:00
.gitignore	Initial commit	2025-08-21 11:43:08 -06:00
CMakeLists.txt	Initial commit	2025-08-21 10:44:50 -07:00
LICENSE	Initial commit	2025-08-21 10:44:50 -07:00
README.md	Initial commit	2025-08-21 10:44:50 -07:00

README.md

bpe_framework

Build: cmake -DCMAKE_POLICY_VERSION_MINIMUM=3.5 ..

The test_bpe application does the following:

Includes necessary headers and defines the main function.
Creates an instance of the BPETokenizer.
Defines a training corpus (a vector of strings).
Trains the tokenizer on the corpus with a specified vocabulary size (500 in this case).
Tests the tokenizer by encoding a sample string ("the quick brown fox").
Decodes the tokens back to a string and prints the original, tokens, and decoded string.
Saves the tokenizer to a file ("bpe_model.txt").
Loads the tokenizer from the file and verifies the loaded tokenizer's vocabulary size.

The purpose of this test is to verify that the BPE tokenizer can train, encode, decode, and serialize/deserialize correctly.

test_bpe Application Overview

The test_bpe application exercises the BPE tokenizer implementation in the LM Framework end to end: training, encoding, decoding, and serialization. It works as follows:

Initialization

Creates an instance of BPETokenizer

Defines a training corpus with sample English text

Training Process

Calls tokenizer.train(corpus, 500) to train the tokenizer

The training process:

 Initializes with byte-level vocabulary (0-255)

 Analyzes word frequencies in the corpus

 Iteratively merges the most frequent character pairs

 Builds a vocabulary of 500 tokens (as specified)
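
The training loop above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework's code: `train_bpe`, `merge_pair`, the whitespace word split, the tie-breaking rule, and the convention that the k-th merge receives token ID 256 + k are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using TokenSeq = std::vector<int>;
using Pair = std::pair<int, int>;

// Hypothetical helper: replace each adjacent occurrence of `rule` with `new_id`.
TokenSeq merge_pair(const TokenSeq& seq, const Pair& rule, int new_id) {
    TokenSeq out;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        if (i + 1 < seq.size() && seq[i] == rule.first && seq[i + 1] == rule.second) {
            out.push_back(new_id);
            ++i;  // skip the second half of the merged pair
        } else {
            out.push_back(seq[i]);
        }
    }
    return out;
}

// Learns merge rules until the vocabulary reaches `target_vocab` or no pairs remain.
// Assumed numbering: IDs 0-255 are raw bytes, merge k creates token ID 256 + k.
std::vector<Pair> train_bpe(const std::vector<std::string>& corpus, int target_vocab) {
    assert(target_vocab >= 256);

    // Word frequencies (assumption: words are separated by whitespace).
    std::map<std::string, int> word_freq;
    for (const std::string& line : corpus) {
        std::istringstream in(line);
        std::string word;
        while (in >> word) ++word_freq[word];
    }

    // Each distinct word as a sequence of byte IDs, paired with its count.
    std::vector<std::pair<TokenSeq, int>> words;
    for (const auto& entry : word_freq) {
        TokenSeq seq;
        for (unsigned char byte : entry.first) seq.push_back(byte);
        words.emplace_back(seq, entry.second);
    }

    std::vector<Pair> merges;
    for (int next_id = 256; next_id < target_vocab; ++next_id) {
        // Count adjacent pairs, weighted by word frequency.
        std::map<Pair, int> pair_freq;
        for (const auto& w : words)
            for (std::size_t i = 0; i + 1 < w.first.size(); ++i)
                pair_freq[Pair(w.first[i], w.first[i + 1])] += w.second;
        if (pair_freq.empty()) break;  // every word is already a single token

        // Most frequent pair; ties go to the smallest pair in map order.
        Pair best = pair_freq.begin()->first;
        int best_count = 0;
        for (const auto& p : pair_freq)
            if (p.second > best_count) { best = p.first; best_count = p.second; }

        for (auto& w : words) w.first = merge_pair(w.first, best, next_id);
        merges.push_back(best);
    }
    return merges;
}
```

Note that on a small corpus the loop can run out of pairs before reaching the target, so a sketch like this may return fewer merges than requested.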

Encoding Test

Encodes the test string "the quick brown fox"

The encoding process:

 Splits text into words

 Converts each character to its initial token ID

 Applies learned BPE merges to combine tokens

 Returns a sequence of integer token IDs
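
As an illustration of the encoding steps above, here is a minimal sketch. The function name `encode`, the merge list passed as a parameter, and the 256 + k ID numbering are assumptions for the example. The sketch emits no tokens for whitespace; how the framework itself handles spaces between words is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Encodes `text` with merge rules given in the order they were learned.
// Assumed numbering: merges[k] produces token ID 256 + k.
std::vector<int> encode(const std::string& text, const std::vector<Pair>& merges) {
    std::vector<int> ids;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {  // 1. split text into words
        // 2. each character starts as its byte value
        std::vector<int> seq;
        for (unsigned char byte : word) seq.push_back(byte);

        // 3. apply every learned merge, oldest first
        for (std::size_t k = 0; k < merges.size(); ++k) {
            const int new_id = 256 + static_cast<int>(k);
            // A rule may only refer to tokens that existed before it was learned.
            assert(merges[k].first < new_id && merges[k].second < new_id);
            std::vector<int> merged;
            for (std::size_t i = 0; i < seq.size(); ++i) {
                if (i + 1 < seq.size() && seq[i] == merges[k].first &&
                    seq[i + 1] == merges[k].second) {
                    merged.push_back(new_id);
                    ++i;
                } else {
                    merged.push_back(seq[i]);
                }
            }
            seq.swap(merged);
        }
        // 4. append this word's token IDs to the result
        ids.insert(ids.end(), seq.begin(), seq.end());
    }
    return ids;
}
```

Applying the rules in learned order is what lets later merges build on earlier ones, for example a rule that joins token 256 with another byte.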

Decoding Test

Decodes the token IDs back to text

The decoding process:

 Converts each token ID back to its string representation

 Concatenates the strings to reconstruct the original text
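
Decoding can be sketched as a table lookup followed by concatenation. The names `build_vocab` and `decode`, and the rule that token 256 + k is the concatenation of the two tokens in merge k, are assumptions made for this example rather than the framework's API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Builds the ID -> string table: 256 single-byte entries, then one entry per merge.
std::vector<std::string> build_vocab(const std::vector<Pair>& merges) {
    std::vector<std::string> vocab;
    for (int b = 0; b < 256; ++b) vocab.push_back(std::string(1, static_cast<char>(b)));
    for (const Pair& rule : merges) {
        const int size = static_cast<int>(vocab.size());
        // Both halves must already be in the table.
        assert(rule.first >= 0 && rule.first < size);
        assert(rule.second >= 0 && rule.second < size);
        vocab.push_back(vocab[rule.first] + vocab[rule.second]);
    }
    return vocab;
}

// Maps each token ID to its string and concatenates the results.
// vector::at throws std::out_of_range for an unknown ID.
std::string decode(const std::vector<int>& ids, const std::vector<std::string>& vocab) {
    std::string text;
    for (int id : ids) text += vocab.at(static_cast<std::size_t>(id));
    return text;
}
```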

Serialization Test

Saves the trained tokenizer to "bpe_model.txt"

The serialization process:

 Writes vocabulary size and token-ID mappings

 Records all learned merge rules

Deserialization Test

Loads the tokenizer from "bpe_model.txt"

Verifies the loaded tokenizer has the same vocabulary size

Confirms the tokenizer can perform encoding/decoding
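
The save/load cycle can be illustrated with a small sketch. The on-disk layout of bpe_model.txt is not documented here, so the text format below (vocabulary size on the first line, then one merge rule per line) is invented for the example, as are the names `save_model`, `load_model`, and `round_trips`. The sketch stores only the merge rules, on the assumption that the token-ID mappings can be rebuilt from them.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

// Hypothetical format: "<vocab size>\n" followed by "<left> <right>\n" per merge.
void save_model(std::ostream& out, const std::vector<Pair>& merges) {
    out << 256 + merges.size() << '\n';
    for (const Pair& rule : merges) out << rule.first << ' ' << rule.second << '\n';
}

// Returns false (leaving `merges` untouched) if the stream is malformed or truncated.
bool load_model(std::istream& in, std::vector<Pair>& merges) {
    std::size_t vocab_size = 0;
    if (!(in >> vocab_size) || vocab_size < 256) return false;
    std::vector<Pair> loaded;
    for (std::size_t id = 256; id < vocab_size; ++id) {
        Pair rule;
        if (!(in >> rule.first >> rule.second)) return false;
        loaded.push_back(rule);
    }
    merges.swap(loaded);
    return true;
}

// Mirrors the verification step: save, load, and compare the state.
bool round_trips(const std::vector<Pair>& merges) {
    std::stringstream buffer;
    save_model(buffer, merges);
    std::vector<Pair> loaded;
    return load_model(buffer, loaded) && loaded == merges;
}
```

Writing to `std::ostream` rather than a file name keeps the sketch testable in memory; passing a `std::ofstream` would write to a file instead.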

Expected Output

Training tokenizer...
Vocabulary size: 500
Original: the quick brown fox
Tokens: [list of token IDs]
Decoded: the quick brown fox
Successfully loaded tokenizer
Loaded vocabulary size: 500

Key Validations

Training completes without errors

Encoding/decoding round trip preserves the original text

Serialization/deserialization maintains tokenizer state

Vocabulary size matches the specified target (500)

Token IDs are consistent between sessions

BPE Tokenizer Performance Test Suite

Overview

This performance test application benchmarks the efficiency and scalability of the Byte Pair Encoding (BPE) tokenizer implementation. It measures training time, memory usage, encoding/decoding speed, and serialization performance across several corpus and vocabulary sizes.

Key Features

1. Corpus Generation

Automatically generates realistic test corpora using common AI/ML terminology
Configurable sentence count and word range parameters
Creates diverse text samples that mimic real-world language patterns
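
A corpus generator along these lines might look like the sketch below. The function name `generate_corpus`, the term list, and the explicit `seed` parameter are hypothetical; the actual generator's vocabulary and parameters may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Builds `num_sentences` sentences, each with between `min_words` and `max_words`
// words drawn at random from a small list of AI/ML terms.
std::vector<std::string> generate_corpus(std::size_t num_sentences, int min_words,
                                         int max_words, unsigned seed) {
    assert(min_words >= 1 && min_words <= max_words);
    // Illustrative term list; not the list used by the real test suite.
    static const std::vector<std::string> terms = {
        "model", "token", "training", "neural", "network",
        "gradient", "attention", "transformer", "embedding", "inference"};

    std::mt19937 rng(seed);  // fixed seed makes a run repeatable
    std::uniform_int_distribution<int> sentence_len(min_words, max_words);
    std::uniform_int_distribution<std::size_t> pick(0, terms.size() - 1);

    std::vector<std::string> corpus;
    corpus.reserve(num_sentences);
    for (std::size_t s = 0; s < num_sentences; ++s) {
        std::string sentence;
        const int n = sentence_len(rng);
        for (int w = 0; w < n; ++w) {
            if (w > 0) sentence += ' ';
            sentence += terms[pick(rng)];
        }
        corpus.push_back(sentence);
    }
    return corpus;
}
```

Seeding the generator explicitly is a design choice for benchmarking: two runs with the same seed see the same corpus, so timing differences come from the tokenizer rather than the input.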

2. Multi-Dimensional Testing

Tests multiple corpus sizes (100, 1000, 5000 sentences)
Evaluates different vocabulary sizes (500, 1000, 2000 tokens)
Measures performance across various workload scenarios
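
The combinations of corpus and vocabulary sizes listed above could be enumerated as in this sketch. `BenchConfig` and `make_test_matrix` are hypothetical names, and running the full cross product (nine configurations) is an assumption; the suite may pair the sizes differently.

```cpp
#include <cstddef>
#include <vector>

// One benchmark run: how many sentences to generate and the target vocabulary size.
struct BenchConfig {
    std::size_t sentences;
    std::size_t vocab_size;
};

// Enumerates every corpus-size / vocabulary-size combination.
std::vector<BenchConfig> make_test_matrix() {
    const std::size_t corpus_sizes[] = {100, 1000, 5000};
    const std::size_t vocab_sizes[] = {500, 1000, 2000};
    std::vector<BenchConfig> configs;
    for (std::size_t sentences : corpus_sizes)
        for (std::size_t vocab : vocab_sizes)
            configs.push_back({sentences, vocab});
    return configs;
}
```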

3. Comprehensive Performance Metrics

Training Time: Measures how long it takes to build the BPE vocabulary from a corpus
Memory Usage: Tracks peak memory consumption during training (Linux-specific)
Encoding Speed: Calculates processing time per token during text encoding
Round-Trip Verification: Ensures encoding/decoding preserves original content
Serialization Performance: Measures model save/load operations

4. Validation Checks

Verifies encoding/decoding consistency
Detects potential data corruption issues
Validates vocabulary construction

Usage Scenarios

This performance test is ideal for:

Benchmarking different BPE implementations
Evaluating hardware suitability for language processing tasks
Identifying performance bottlenecks in tokenization pipelines
Testing scalability of tokenizer implementations
Comparing optimization techniques

Technical Implementation

The test suite utilizes:

High-resolution timing with <chrono> for precise measurements
Linux-specific memory tracking via /proc/self/status
Randomized corpus generation with configurable parameters
Exception handling for robust testing
Automatic cleanup of temporary files
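
The timing and memory techniques named above can be sketched as follows. The helper names `elapsed_ms`, `parse_status_kb`, and `peak_rss_kb` are hypothetical, and reading the `VmHWM` field (peak resident set size) is an assumption about which line of /proc/self/status the suite uses.

```cpp
#include <cassert>
#include <chrono>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Finds a "Key:   <number> kB" line in status-style text and returns the number,
// or -1 if the key is missing or has no numeric value.
long parse_status_kb(std::istream& in, const std::string& key) {
    assert(!key.empty());
    std::string line;
    while (std::getline(in, line)) {
        if (line.size() > key.size() && line.compare(0, key.size(), key) == 0 &&
            line[key.size()] == ':') {
            std::istringstream fields(line.substr(key.size() + 1));
            long kb = -1;
            return (fields >> kb) ? kb : -1;
        }
    }
    return -1;
}

// Peak resident memory of this process in kB; -1 where /proc is unavailable.
long peak_rss_kb() {
    std::ifstream status("/proc/self/status");
    if (!status) return -1;
    return parse_status_kb(status, "VmHWM");
}
```

`steady_clock` is used here because it never jumps backwards, which matters when measuring intervals. A call such as `elapsed_ms([&] { tokenizer.train(corpus, 500); })` would time a training run.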

Output Metrics

The application provides detailed performance reports including:

Training duration in milliseconds
Peak memory usage in megabytes
Encoding speed in microseconds per token
Serialization/deserialization times
Vocabulary size validation
Round-trip integrity verification
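
The unit conversions behind these figures are simple; a minimal sketch follows. The function names are hypothetical, and treating 1 MB as 1024 kB is an assumption about how the report converts the kB values from /proc/self/status.

```cpp
#include <cassert>
#include <cstddef>

// Encoding speed: total time in milliseconds divided across the tokens produced,
// expressed in microseconds per token. Returns 0 when no tokens were produced.
double micros_per_token(double elapsed_ms, std::size_t token_count) {
    assert(elapsed_ms >= 0.0);
    if (token_count == 0) return 0.0;
    return elapsed_ms * 1000.0 / static_cast<double>(token_count);
}

// Peak memory: kB to MB, assuming 1 MB = 1024 kB.
double kb_to_mb(long kilobytes) {
    return static_cast<double>(kilobytes) / 1024.0;
}
```

For example, encoding 5000 tokens in 12.5 ms works out to 12.5 × 1000 / 5000 = 2.5 microseconds per token.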

This test framework gives developers and researchers working with BPE tokenizers quantitative data to guide optimization efforts and implementation choices.