Sourcify
Effortlessly find and manage open-source dependencies for your projects.

Inference Llama 2 in one file of pure C.

llama2.c is a minimalist, educational "fullstack" solution for training Llama 2 LLMs and running inference on them. The core component is `run.c`, a single ~700-line C file that implements an inference engine for the Llama 2 architecture with no external dependencies. Small Llama 2 models run comfortably; larger checkpoints (up to Meta's 7B) can be loaded but are slow, since inference currently runs in float32 throughout. The repository also provides PyTorch code for training models from scratch. The focus is on simplicity and transparency, making the project well suited both to learning the inner workings of LLMs and to deploying small, specialized models on resource-constrained devices.
The core of llama2.c is a 700-line C file (run.c) that implements the entire Llama 2 inference process, offering a clear and understandable implementation.
The inference engine has no external dependencies, simplifying deployment and reducing the risk of compatibility issues.
Inference currently supports fp32 (single-precision floating point) only, favoring simplicity and accuracy over speed and memory footprint.
The `export.py` script converts standard Llama 2 model checkpoints into a format compatible with the C inference engine.
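On the C side, loading such a converted checkpoint starts with reading a small binary config header. The sketch below assumes a header of seven 32-bit ints (dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len); treat that field list as an assumption to verify against `export.py` and `run.c`, since the exact layout is defined there.

```c
#include <stdio.h>

/* Assumed header layout of a llama2.c-style .bin checkpoint: seven
 * int32 fields, followed by the fp32 weight tensors. */
typedef struct {
    int dim;        /* transformer width */
    int hidden_dim; /* feed-forward width */
    int n_layers;
    int n_heads;
    int n_kv_heads;
    int vocab_size;
    int seq_len;    /* maximum sequence length */
} Config;

/* Read the config header from the start of a checkpoint file.
 * Returns 0 on success, -1 on failure. */
int read_config(const char *path, Config *cfg) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(cfg, sizeof(Config), 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

Keeping the format this simple is what lets the engine stay dependency-free: a handful of `fread` calls (or an `mmap`) is the entire model loader.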
The repository includes PyTorch code for training Llama 2 models from scratch, offering a complete training and inference solution.
Active development is focused on adding model quantization techniques, which will reduce model size and improve inference speed.
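For intuition on what that work involves, here is a generic sketch of symmetric int8 quantization with a single per-tensor scale; it illustrates the technique, not the specific scheme llama2.c will adopt (which may use per-group scales or a different rounding strategy).

```c
#include <math.h>
#include <stdint.h>

/* Quantize n floats to int8 with one shared scale; returns the scale
 * needed to dequantize. Symmetric: zero maps to zero. */
float quantize_i8(const float *x, int8_t *q, int n) {
    float maxabs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > maxabs) maxabs = a;
    }
    float scale = maxabs / 127.0f;
    if (scale == 0.0f) scale = 1.0f;   /* all-zero tensor */
    for (int i = 0; i < n; i++) {
        float v = x[i] / scale;
        q[i] = (int8_t)(v < 0 ? v - 0.5f : v + 0.5f);  /* round to nearest */
    }
    return scale;
}

/* Recover approximate floats from the int8 values and their scale. */
void dequantize_i8(const int8_t *q, float *x, int n, float scale) {
    for (int i = 0; i < n; i++) x[i] = q[i] * scale;
}
```

Storing weights as int8 cuts checkpoint size roughly 4x versus fp32, and integer arithmetic in the matmuls is typically faster on CPUs, at the cost of a small, bounded rounding error per value.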
Clone the repository: `git clone https://github.com/karpathy/llama2.c.git`
Navigate to the repository folder: `cd llama2.c`
Download a pre-trained model (e.g., TinyStories 15M): `wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin`
Compile the C code: `make run`
Run the inference engine with the downloaded model: `./run stories15M.bin`
For Meta's Llama 2 models, install Python dependencies: `pip install -r requirements.txt`
Convert the Meta Llama 2 checkpoints using `export.py`: `python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B`
Run the converted model: `./run llama2_7b.bin`
User feedback:
"Generally well-received for its simplicity, educational value, and performance on small models. Quantization is a requested improvement."