Tech Package

A hands-on lab of LLM engineering techniques I've implemented end-to-end — parameter-efficient fine-tuning, local model serving, and running frontier open-weight models under aggressive quantization. Reproducible notebooks, real datasets, and the un-glamorous glue that turns a research idea into a running system.

Toolchain
LoRA · QLoRA · Parameter-Efficient Fine-Tuning

LLM Fine-Tuning Suite Featured

End-to-end supervised fine-tuning of Llama-2 (7B, 13B) and Llama-3.1 (8B, 70B) with LoRA low-rank adapters layered on 4-bit NF4 quantization (QLoRA via bitsandbytes), so multi-billion-parameter checkpoints train on a single GPU. Built on Hugging Face transformers + peft and trl's SFTTrainer, with Neptune experiment tracking and side-by-side full fine-tuning baselines to measure exactly what the adapters buy. Includes a CPU-only run of Llama-3.1-8B for hardware-constrained settings, plus controlled generation and fine-grained performance analysis of the tuned models.

Applied to
  • Finance-Alpaca QA Instruction-tuning on financial knowledge — down to a Bitcoin buy/sell read from CNBC headlines
  • Implicit Hate Speech · NLE Fine-tuning to generate natural-language explanations of implicit hate
LoRA QLoRA 4-bit NF4 Llama-2 Llama-3.1 SFTTrainer PEFT bitsandbytes Neptune

Local inference & frontier open models

Ollama · Python client

Local Model Serving

Stand up local LLMs (Llama 3.1 / 3.2) behind an Ollama server and drive them from the ollama Python chat API — including structured extraction, like pulling dataset names out of research-paper text.

Ollama Llama 3.2 Chat API Self-hosted
Ollama · deepseek-r1:671b

DeepSeek-R1 · 671B, Local

Pulled and ran the full DeepSeek-R1 (671B) reasoning model locally through Ollama's library — reproducing frontier open-weight reasoning on self-hosted hardware.

DeepSeek-R1 671B Reasoning Ollama
Unsloth GGUF · llama.cpp

DeepSeek-R1 · 1.58-bit Quant

Ran DeepSeek-R1 under Unsloth's 1.58-bit dynamic quantization (UD-IQ1_S GGUF) with llama.cpp and partial GPU offload — shrinking a 671B model enough to run on modest hardware while keeping it coherent.

llama.cpp GGUF 1.58-bit Quantization
conda · environment.yaml

Reproducible Environments

Pinned conda environments and dependency-conflict fixes (the datasets/fsspec cache issue, PEFT-in-pipeline quirks) so every notebook reruns cleanly — the plumbing that makes experiments reproducible.

conda Reproducibility MLOps Jupyter

These are personal engineering labs built as Jupyter notebooks. The LoRA reference implementation is public on GitHub; the rest are available on request.