Part 05 of 18
AI Engineering and AI Research
1. Purpose of This Part
This part defines the AI roadmap.
AI is one of the most important domains in the master plan because it connects software development, research, automation, mathematics, philosophy, product design, and future scientific work.
But this section must be understood carefully.
The goal is not to become someone who merely “uses ChatGPT well.”
The goal is also not to become someone who hides behind AI-generated work.
The goal is:
To understand, build, evaluate, improve, deploy, and research AI systems with enough depth that AI becomes a serious engineering and intellectual tool, not a shortcut around competence.
This directly connects to the original life-plan brief: you want to move from simple custom agents, to LangChain/DSPy-like systems, to TensorFlow/PyTorch development, to LoRA and similar optimization concepts, and eventually to understanding serious AI research papers at a deep level.
AI in this plan has three roles:
- AI as a tool — used to accelerate learning, debugging, writing, research, and building.
- AI as a product layer — used inside applications, agents, automations, research tools, and SaaS systems.
- AI as a research domain — studied through machine learning, deep learning, LLMs, fine-tuning, evals, papers, and experiments.
The final standard is:
I can build AI systems that are useful, evaluated, documented, and understood — and I can read, reproduce, and eventually contribute to AI research.
2. What AI Competence Actually Means
AI competence is not prompting alone.
Prompting is a useful starting skill, but it is not enough.
Real AI competence includes:
- understanding what models can and cannot do
- designing prompts and structured outputs
- building AI workflows
- using model APIs
- using tools and function calling
- building retrieval systems
- building RAG pipelines
- evaluating output quality
- measuring failure cases
- creating datasets
- managing context
- designing agent workflows
- understanding embeddings
- understanding tokenization
- understanding transformers at a conceptual level
- training small models
- fine-tuning models
- using open-source models
- understanding LoRA and PEFT
- deploying AI systems
- monitoring AI behavior
- reading research papers
- reproducing experiments
The standard is not:
“Can I ask an AI to make something?”
The standard is:
Can I build an AI system, test whether it works, understand why it fails, improve it, and explain the tradeoffs?
3. The AI Builder Identity
The identity to build here is:
AI systems engineer-researcher.
That means you are not merely a user of AI tools.
You are someone who can design AI-powered systems.
You are someone who can ask:
- What is the task?
- Does this task need AI?
- What model is appropriate?
- What data is needed?
- What is the failure mode?
- What should be deterministic code instead of AI?
- What should be retrieved instead of memorized?
- What should be evaluated?
- What should be logged?
- What should the user verify?
- How do we prevent hallucinated authority?
- How do we measure improvement?
- How do we know this is useful?
A serious AI engineer does not worship the model.
A serious AI engineer builds the system around the model.
The model is one component.
The product, data, interface, tools, evals, retrieval, logging, security, and human workflow matter just as much.
4. The Research-Backed Source Spine
The AI roadmap should be built from official documentation, practical books, research papers, and reproducible projects.
The main source spine is:
- PyTorch tutorials and documentation for deep learning implementation. PyTorch’s beginner tutorial introduces the complete ML workflow: working with data, creating models, optimizing parameters, and saving trained models. (PyTorch Documentation)
- TensorFlow and Keras tutorials for deep learning from the TensorFlow ecosystem. TensorFlow’s beginner quickstart uses Keras to load a dataset, build a neural network, train it, and evaluate accuracy. (TensorFlow)
- LangChain and LangGraph documentation for LLM applications, workflows, and agents. LangChain provides model integrations and agent/application architecture, while LangGraph provides infrastructure for long-running, stateful workflows and agents. (LangChain Docs)
- DSPy documentation and paper for programming language-model systems more systematically instead of relying only on brittle prompt strings. DSPy describes itself as a declarative framework for modular AI software, and the DSPy paper argues for moving LM pipeline construction away from manual free-form prompt manipulation. (dspy.ai)
- OpenAI official API documentation for model APIs, tools, agents, embeddings, fine-tuning, and evals. OpenAI’s API docs cover tool use, agent workflows, supervised fine-tuning, vector embeddings, and evaluation workflows. (OpenAI Developers)
- Hugging Face Transformers and PEFT documentation for open-source model usage, inference pipelines, training, and parameter-efficient fine-tuning. Hugging Face’s Transformers documentation describes pipelines as simple optimized inference interfaces, and PEFT is documented as a library for adapting large pretrained models without fine-tuning all parameters. (Hugging Face)
- LoRA original paper and Hugging Face LoRA documentation for understanding parameter-efficient fine-tuning. The LoRA paper proposes freezing pretrained model weights and injecting trainable low-rank matrices, while Hugging Face’s LoRA documentation describes LoRA as reducing trainable parameters by decomposing large matrices into smaller low-rank matrices. (arXiv)
- NIST AI Risk Management Framework for trustworthy AI thinking. NIST identifies trustworthy AI characteristics such as validity, reliability, safety, security, resilience, accountability, transparency, explainability, interpretability, privacy enhancement, and fairness with harmful bias managed. (NIST AI Resource Center)
- Deep Learning with Python for Keras/deep learning practice. Manning describes the second edition as an introduction to deep learning using Python and Keras, with practical techniques and important theory for neural networks. (Manning Publications)
- Hands-On Large Language Models for practical LLM understanding. The official GitHub repository contains code examples for the book, and the official book site describes it as an illustrated guide to large language models. (GitHub)
The rule is:
Use AI tools, but learn AI systems from source: documentation, code, papers, experiments, and evaluations.
5. The AI Roadmap Ladder
The AI roadmap has layers.
Each layer should produce artifacts.
The goal is not to rush to fine-tuning or agents before the foundation exists.
The goal is to build a serious stack of capability.
Layer 0 — Correct AI Usage and Mental Discipline
Purpose
Before learning AI engineering, the first layer is learning how not to be destroyed by AI.
AI can create fake progress faster than almost any other tool.
It can write code you do not understand.
It can summarize papers you never read.
It can generate essays that contain no real thought.
It can create the feeling of productivity while weakening the person using it.
Therefore, the first layer is discipline.
Core Rule
AI may accelerate the work, but it must not replace contact with the work.
This means:
- use AI to clarify, not to avoid understanding
- use AI to review, not to replace judgment
- use AI to generate tests, not to avoid testing
- use AI to explain papers, not to avoid reading papers
- use AI to debug with you, not to stop you from debugging
- use AI to generate alternatives, not to make decisions blindly
- use AI to challenge you, not to flatter you
Good AI Use
Use AI as:
- tutor
- Socratic examiner
- code reviewer
- debugger
- paper explainer
- architecture critic
- test generator
- documentation assistant
- research assistant
- opposing argument generator
- project planner
- failure-mode finder
- study partner
Bad AI Use
Do not use AI to:
- generate full projects you cannot explain
- avoid learning Python
- avoid learning math
- avoid reading documentation
- avoid debugging
- avoid writing tests
- fake research
- fabricate citations
- submit work you do not understand
- create a portfolio you cannot defend
Required Artifact
Create an “AI Usage Constitution” document.
It should include:
- what AI is allowed to do
- what AI is not allowed to do
- rules for AI-generated code
- rules for AI-assisted research
- rules for AI-assisted writing
- rules for AI-assisted math
- rules for AI-assisted debugging
- self-audit checklist
Completion Standard
This layer is complete when:
- AI is being used deliberately
- AI outputs are verified
- you can explain AI-assisted work
- you do not treat generated work as mastery
- every serious AI-assisted output has a human verification step
Layer 1 — Python, Data, Notebooks, and Experiment Workflow
Purpose
AI engineering requires a strong Python workflow.
Python is the main practical language for machine learning, deep learning, notebooks, data processing, experiments, and AI research reproduction.
This layer is about becoming operational in AI experimentation.
Topics
- Python fundamentals
- virtual environments
- package management
- Jupyter notebooks
- NumPy
- pandas
- Matplotlib
- data loading
- data cleaning
- train/test split
- basic statistics
- plotting
- experiment folders
- reproducible notebooks
- random seeds
- saving results
- reading CSV/JSON/parquet
- command-line scripts for experiments
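Several of the topics above (train/test split, random seeds, saving results) fit in one small script. The following is a minimal sketch using only the standard library; the function names `train_test_split` and `run_record` are illustrative choices, not part of any library named in this plan.

```python
import json
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test parts."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def run_record(seed, test_fraction, train, test):
    """Collect the settings needed to repeat the run, as a JSON string."""
    return json.dumps(
        {
            "seed": seed,
            "test_fraction": test_fraction,
            "n_train": len(train),
            "n_test": len(test),
        },
        sort_keys=True,
    )


rows = list(range(10))  # stand-in for rows loaded from a CSV file
train, test = train_test_split(rows, test_fraction=0.2, seed=42)
record = run_record(42, 0.2, train, test)
print(record)
```

Because the seed is passed in and written to the record, rerunning the script with the same record reproduces the same split, which is the property the "results can be reproduced" standard asks for.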
Required Projects
Build:
- CSV data cleaner
- Dataset explorer notebook
- Data visualization notebook
- Simple statistics notebook
- Train/test split demo
- Experiment logging template
- Reproducible ML project template
- Python package for data utilities
- Notebook-to-script conversion exercise
- Data report generator
Artifact Requirements
Each experiment should include:
- dataset description
- problem statement
- preprocessing steps
- notebook
- script version if appropriate
- results
- limitations
- README
- environment file
Completion Standard
This layer is complete when:
- Python notebooks are comfortable
- data can be loaded and inspected
- visualizations can be created
- experiments are organized
- results can be reproduced
- GitHub contains clean AI/data project templates
Layer 2 — Machine Learning Foundations
Purpose
Before deep learning and LLMs, learn the basic machine learning workflow.
This layer teaches the structure of learning from data.
Topics
- supervised learning
- unsupervised learning
- classification
- regression
- clustering
- train/test/validation split
- overfitting
- underfitting
- loss functions
- metrics
- confusion matrix
- precision
- recall
- F1 score
- ROC/AUC
- feature engineering
- cross-validation
- baseline models
- error analysis
Required Projects
Build:
- Linear regression from scratch
- Logistic regression from scratch
- k-nearest neighbors from scratch
- Decision tree using a library
- Random forest experiment
- Clustering experiment
- Classification evaluation notebook
- Imbalanced classification experiment
- Feature engineering case study
- Model comparison report
Completion Standard
This layer is complete when:
- you understand the basic ML workflow
- metrics are chosen intentionally
- baseline models are created before complex models
- errors are analyzed
- notebooks explain what happened and why
- you can explain overfitting and generalization clearly
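The metrics named in this layer (confusion matrix, precision, recall, F1) are simple enough to compute by hand before reaching for a library. A minimal sketch for binary labels, with function names chosen here for illustration:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Imbalanced case: 9 negatives, 1 positive, and a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, recall)  # high accuracy, zero recall
```

The example is deliberately imbalanced: accuracy looks strong while recall shows that the positive class is never found, which is why metrics should be chosen intentionally.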
Layer 3 — Deep Learning Fundamentals
Purpose
Deep learning is the foundation for modern AI systems, including computer vision, NLP, speech, multimodal systems, and LLMs.
This layer is about understanding neural networks as implemented systems, not as magic.
PyTorch and TensorFlow/Keras are both valid ecosystems. PyTorch’s beginner material introduces a full ML workflow with data, models, optimization, and saving models; TensorFlow’s beginner quickstart uses Keras to build, train, and evaluate a neural network. (PyTorch Documentation)
Topics
- tensors
- automatic differentiation
- neural network layers
- activation functions
- loss functions
- optimizers
- backpropagation
- training loops
- validation loops
- batching
- datasets
- dataloaders
- regularization
- dropout
- batch normalization
- learning rates
- checkpoints
- saving/loading models
- GPU basics
- experiment tracking
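The core of the list above (loss function, gradients, optimizer step, training loop, validation loop) can be seen in a few lines without any framework. This is a sketch under simplifying assumptions: a one-feature linear model, mean squared error, hand-derived gradients, and plain gradient descent. PyTorch and Keras automate the gradient and update steps, but the loop has the same shape.

```python
def mse(w, b, data):
    """Mean squared error of the model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def gradients(w, b, data):
    """Hand-derived gradients of the MSE with respect to w and b."""
    n = len(data)
    dw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    db = sum(2 * (w * x + b - y) for x, y in data) / n
    return dw, db


def train(train_data, val_data, lr=0.05, epochs=500):
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        dw, db = gradients(w, b, train_data)  # "backward pass" on training data only
        w -= lr * dw                          # optimizer step (plain gradient descent)
        b -= lr * db
        # Validation data is only measured, never used to update the parameters.
        history.append((mse(w, b, train_data), mse(w, b, val_data)))
    return w, b, history


# Synthetic data generated from y = 2x + 1, so the correct answer is known.
train_data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
val_data = [(x, 2.0 * x + 1.0) for x in (0.5, 2.5)]
w, b, history = train(train_data, val_data)
print(round(w, 3), round(b, 3))
```

Tracking training and validation loss side by side in `history` is what later makes overfitting visible: training loss keeps falling while validation loss stops improving.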
PyTorch Path
Use PyTorch to understand lower-level deep learning workflows.
Required projects:
- Tensor operations notebook
- Autograd notebook
- Neural network from scratch using NumPy
- Simple PyTorch classifier
- Custom training loop
- CNN image classifier
- RNN or sequence model experiment
- Transfer learning experiment
- Model saving/loading experiment
- Experiment comparison report
TensorFlow/Keras Path
Use Keras for clean high-level experimentation.
Keras is described by TensorFlow as the high-level API of the TensorFlow platform, designed to provide an approachable and productive interface for machine learning problems, from data processing to hyperparameter tuning and deployment. (TensorFlow)
Required projects:
- Keras Sequential model
- Keras Functional API model
- Image classification notebook
- Text classification notebook
- Model checkpointing experiment
- Hyperparameter experiment
- TensorBoard logging experiment
- Transfer learning project
- Overfitting/regularization report
- Comparison with PyTorch implementation
Completion Standard
This layer is complete when:
- tensors are understood
- training loops are not mysterious
- loss and optimization are understandable
- simple neural networks can be built
- overfitting can be detected
- model performance can be evaluated
- saved models can be reused
- results are documented clearly
Layer 4 — LLM Fundamentals and Application Engineering
Purpose
This layer introduces large language models as programmable components inside applications.
The goal is not to become a “prompt wizard.”
The goal is to understand how to build reliable systems around LLMs.
OpenAI’s API documentation covers model usage, structured outputs, tools, embeddings, fine-tuning, and evals, while Hugging Face Transformers provides open-source model usage through pipelines, trainers, and model tooling. (OpenAI Developers)
Topics
- model APIs
- prompts
- system instructions
- structured outputs
- JSON schemas
- tool calling
- function calling
- embeddings
- context windows
- tokens
- temperature
- top-p
- latency
- cost
- retries
- rate limits
- streaming
- safety filters
- logging
- failure modes
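Temperature and top-p are easier to reason about once you have computed them on a toy distribution. The sketch below shows the usual textbook definitions (temperature-scaled softmax, then nucleus filtering); how a given provider implements sampling internally is not specified here, and the function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
base = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
nucleus = top_p_filter(base, top_p=0.9)
print([round(p, 3) for p in base], [round(p, 3) for p in nucleus])
```

With these logits, lowering the temperature concentrates probability on the top token, raising it flattens the distribution, and `top_p=0.9` removes the least likely token entirely.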
Required Projects
Build:
- Simple LLM API caller
- Structured JSON extractor
- Document summarizer
- Email drafting assistant
- Study question generator
- Flashcard generator
- ICS revision assistant
- AI code review assistant
- Prompt comparison notebook
- LLM cost/latency tracker
Artifact Requirements
Each LLM app should include:
- prompt design notes
- input/output examples
- failure cases
- test cases
- cost notes
- latency notes
- limitations
- README
- evaluation plan
Completion Standard
This layer is complete when:
- model calls can be integrated into apps
- structured outputs can be requested and validated
- prompts are versioned
- outputs are tested
- failure cases are documented
- AI features are not treated as magic
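"Structured outputs can be requested and validated" implies a validation step that lives in ordinary code, outside the model. A minimal sketch follows, assuming a hypothetical three-field schema and a `call_model` function that stands in for whatever API client you use; neither comes from a specific provider.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"title": str, "tags": list, "confidence": float}


def validate_reply(raw_text):
    """Return (data, errors); data is None when the reply is unusable."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"wrong type for {name}")
    return (data if not errors else None), errors


def extract_with_retries(call_model, prompt, max_attempts=3):
    """call_model is any function str -> str (a stand-in for a real API client)."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        data, errors = validate_reply(call_model(prompt))
        if data is not None:
            return data, failures
        failures.append({"attempt": attempt, "errors": errors})  # keep failure cases
    raise ValueError(f"no valid reply after {max_attempts} attempts: {failures}")


# A scripted fake model: two bad replies, then a good one.
replies = iter([
    "Sure! Here is the JSON you asked for.",
    '{"title": "Notes", "tags": ["ml"]}',
    '{"title": "Notes", "tags": ["ml"], "confidence": 0.9}',
])
data, failures = extract_with_retries(lambda prompt: next(replies), "Extract metadata.")
print(data, len(failures))
```

The recorded `failures` list doubles as raw material for the "failure cases" artifact: every rejected reply is kept with the reason it was rejected.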
Layer 5 — Embeddings, Semantic Search, and RAG
Purpose
Retrieval-Augmented Generation is one of the most practical AI engineering patterns.
Instead of expecting a model to “know everything,” you retrieve relevant information and provide it as context. This is essential for document assistants, study tools, knowledge bases, company-data assistants, research assistants, and AI systems that need grounded answers.
OpenAI’s embeddings documentation describes embeddings as turning text into numbers, unlocking use cases such as search and clustering. Hugging Face’s Transformers documentation also supports model-based inference workflows, including feature extraction and question answering through pipelines. (OpenAI Developers)
Topics
- embeddings
- vector similarity
- chunking
- metadata
- vector databases
- retrieval
- reranking
- prompt assembly
- citations
- source grounding
- hallucination reduction
- retrieval evaluation
- answer evaluation
- document ingestion
- PDF parsing
- semantic search UI
- hybrid search
- query rewriting
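The retrieval half of a RAG pipeline (chunking, vector similarity, retrieval, prompt assembly with citations) can be sketched end to end without a model. The important assumption: `toy_embed` below is a word-count stand-in, not a real embedding model, so it only matches shared words. All function names are illustrative.

```python
import math
from collections import Counter


def chunk_words(text, size=8, overlap=2):
    """Split text into word windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def toy_embed(text):
    """Stand-in for an embedding model: lowercase word counts as a sparse vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k best (chunk_index, chunk_text, score) triples for the query."""
    q = toy_embed(query)
    scored = sorted(((cosine(q, toy_embed(c)), i) for i, c in enumerate(chunks)), reverse=True)
    return [(i, chunks[i], score) for score, i in scored[:k]]


def build_prompt(query, hits):
    """Assemble a prompt that numbers each source so the answer can cite it."""
    context = "\n".join(f"[{i}] {text}" for i, text, _ in hits)
    return ("Answer using only the sources below and cite them by number.\n"
            f"{context}\nQuestion: {query}")


chunks = ["LoRA freezes pretrained weights",
          "Cosine similarity compares vectors",
          "Chunking splits documents"]
hits = retrieve("how does lora treat pretrained weights", chunks)
prompt = build_prompt("how does lora treat pretrained weights", hits)
print(prompt)
```

Swapping `toy_embed` for a real embedding call changes the quality of the matches but not the structure, which makes this a convenient harness for inspecting retrieval results before any generation happens.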
Required Projects
Build:
- Embedding playground
- Semantic search over notes
- PDF question-answering tool
- Study document assistant
- Research paper search system
- RAG system with citations
- RAG evaluation notebook
- Chunking strategy comparison
- Retrieval failure analysis
- Multi-document knowledge assistant
The OpenAI Cookbook includes an example focused on building and evaluating a RAG pipeline with LlamaIndex, while Hugging Face’s cookbook includes RAG evaluation workflows using synthetic evaluation data and LLM-as-judge-style scoring. (OpenAI Developers)
Completion Standard
This layer is complete when:
- embeddings are understood conceptually
- documents can be chunked and indexed
- retrieval results can be inspected
- answers include source grounding
- bad retrieval can be diagnosed
- RAG quality can be evaluated
- a document assistant can be built end-to-end
Layer 6 — Agents, Tools, and Workflows
Purpose
Agents are useful when a system must plan, call tools, maintain state, collaborate across steps, or handle long-running workflows.
But agents are also easy to overuse.
Many problems do not need agents.
Some problems need simple code.
Some need a workflow.
Some need retrieval.
Some need a model call.
Only some need agentic behavior.
LangChain’s agent documentation describes agents as graph-based runtimes using LangGraph, and OpenAI’s Agents SDK documentation describes agents as applications that plan, call tools, collaborate across specialists, and keep enough state to complete multi-step work. (LangChain Docs)
Key Distinction
A workflow follows a predetermined path.
An agent dynamically decides steps and tool usage.
LangGraph’s documentation explicitly distinguishes workflows with predetermined code paths from agents that define their own processes and tool usage. (LangChain Docs)
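The workflow/agent distinction can be made concrete with two tiny functions. This is a framework-free sketch, not LangGraph or Agents SDK code: `scripted_policy` is a hard-coded stand-in for the model that a real agent would consult, and the tool names are invented for the example.

```python
def add(a, b):
    return a + b


def multiply(a, b):
    return a * b


TOOLS = {"add": add, "multiply": multiply}


def workflow(x):
    """Workflow: the path is fixed in code. Always add 2, then multiply by 10."""
    return multiply(add(x, 2), 10)


def run_agent(choose_action, goal, max_steps=5):
    """Agent loop: a policy picks the next tool (or stops) based on the state so far."""
    state = {"goal": goal, "observations": []}
    for _ in range(max_steps):
        action = choose_action(state)
        if action["tool"] == "finish":
            return action["answer"], state["observations"]
        result = TOOLS[action["tool"]](*action["args"])
        state["observations"].append((action["tool"], action["args"], result))
    raise RuntimeError("step budget exhausted")  # agents need a hard stop


def scripted_policy(state):
    """Stand-in for a model: a real agent would ask an LLM to choose here."""
    obs = state["observations"]
    if not obs:
        return {"tool": "add", "args": (state["goal"], 2)}
    if len(obs) == 1:
        return {"tool": "multiply", "args": (obs[-1][2], 10)}
    return {"tool": "finish", "answer": obs[-1][2]}


print(workflow(3))
answer, trace = run_agent(scripted_policy, 3)
print(answer, trace)
```

Both paths compute the same result here, which is the point: when the steps are known in advance, the workflow is shorter, cheaper, and easier to test. The agent loop earns its extra state, step budget, and trace only when the next step genuinely depends on what was observed.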
Topics
- tools
- function calling
- workflow graphs
- agent state
- memory
- planning
- tool errors
- retries
- human-in-the-loop
- guardrails
- multi-agent systems
- task decomposition
- tool authorization
- sandboxing
- observability
- agent evaluation
Required Projects
Build:
- Tool-calling calculator agent
- File-search assistant
- Calendar/task planning workflow
- Research assistant workflow
- Coding assistant with limited tools
- Customer support triage agent
- Multi-step study planner agent
- RAG + tool-use agent
- Human-in-the-loop approval agent
- Agent failure-mode report
Agent Design Rule
Every agent must have a reason to exist.
Before building an agent, ask:
- What tools does it need?
- What state does it need?
- What can go wrong?
- What should require human approval?
- What should be logged?
- What should be deterministic?
- What should be evaluated?
Completion Standard
This layer is complete when:
- workflows and agents are not confused
- tools are designed safely
- agent state is understandable
- failures are logged
- agent outputs are evaluated
- human approval is used where needed
- agents are built because the task requires them, not because they sound impressive
Layer 6.5 — OpenClaw and Personal Agent Infrastructure
Purpose
OpenClaw belongs in the AI roadmap as a practical case study in personal agent infrastructure.
The purpose of learning OpenClaw is not merely to install a trendy AI assistant.
The purpose is to understand how agentic systems connect models, tools, messaging interfaces, local machines, permissions, plugins, workflows, and memory into a working personal assistant architecture.
OpenClaw is especially relevant because it represents a real-world example of the shift from chatbots to agents that can do things across tools and communication surfaces.
The official OpenClaw documentation describes it as a self-hosted gateway that connects chat apps and channel surfaces to AI coding agents through a gateway process running on your own machine or server. Its GitHub repository describes OpenClaw as a personal AI assistant that runs on your own devices and can answer through channels you already use.
What to Learn
Topics:
- self-hosted agent gateways
- chat-based agent interfaces
- channel integrations
- tool calling
- skills
- plugins
- local execution
- memory
- agent permissions
- human approval
- workflow automation
- file access
- shell access
- browser/web tools
- messaging integrations
- security hardening
- privacy risks
- audit logs
- agent failure modes
OpenClaw’s own tool documentation describes three layers: tools, skills, and plugins. Tools are typed functions the agent can invoke, skills teach the agent when and how to use capabilities, and plugins can register additional tools.
Why OpenClaw Matters
OpenClaw is useful as a study object because it forces several serious AI-engineering questions:
- What should an agent be allowed to do?
- What tools should require approval?
- What should never be automated?
- How should tool calls be logged?
- How should private data be protected?
- How should agent memory be controlled?
- How should shell/file/browser access be sandboxed?
- What happens if the model misunderstands the user?
- What happens if a plugin is malicious?
- What happens if the agent acts at the wrong time?
This makes OpenClaw part of both:
- AI engineering
- cybersecurity / AI safety
Required Projects
Build or study:
- OpenClaw architecture notes
- Local OpenClaw setup log
- Tool/skill/plugin concept map
- Safe personal-assistant use case
- OpenClaw security threat model
- Human-approval workflow design
- OpenClaw + GitHub issue triage experiment
- OpenClaw + research assistant workflow
- OpenClaw + calendar/email mock workflow
- OpenClaw failure-mode report
Security and Safety Focus
OpenClaw should be studied carefully because agentic tools can access real systems, files, messages, emails, calendars, and shell commands.
The security focus should include:
- least privilege
- allowlisted tools
- sandboxing
- approval before destructive actions
- secrets management
- logging
- plugin trust
- local data boundaries
- safe defaults
- separation between experiments and real personal accounts
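Three items in the list above (allowlisted tools, approval before destructive actions, logging) combine naturally into one gate that every tool call passes through. The sketch below is generic Python written for this plan; it is not OpenClaw's configuration format or API, and the tool names and policy shape are hypothetical.

```python
# Hypothetical policy: only listed tools may run; some also need human approval.
TOOL_POLICY = {
    "read_note": {"needs_approval": False},
    "send_email": {"needs_approval": True},
}


def authorize(tool_name, approve):
    """Return 'allowed', 'rejected' (human said no), or 'denied' (not on the allowlist)."""
    policy = TOOL_POLICY.get(tool_name)
    if policy is None:
        return "denied"
    if policy["needs_approval"] and not approve(tool_name):
        return "rejected"
    return "allowed"


def guarded_call(tool_name, args, tools, approve, audit_log):
    """Log every attempt, then run the tool only if the policy allows it."""
    decision = authorize(tool_name, approve)
    audit_log.append({"tool": tool_name, "args": list(args), "decision": decision})
    if decision != "allowed":
        return None
    return tools[tool_name](*args)


tools = {
    "read_note": lambda name: f"contents of {name}",
    "send_email": lambda to, body: f"sent to {to}",
    "run_shell": lambda cmd: "should never run",  # installed but not allowlisted
}
log = []
never_approve = lambda tool_name: False  # stand-in for a human who declines
r1 = guarded_call("read_note", ("todo.txt",), tools, never_approve, log)
r2 = guarded_call("send_email", ("a@example.com", "hi"), tools, never_approve, log)
r3 = guarded_call("run_shell", ("ls",), tools, never_approve, log)
print([entry["decision"] for entry in log])
```

The default is denial: a tool that exists but is missing from the policy never runs, and the attempt is still logged. That ordering (log first, then decide whether to execute) is what makes the audit log useful for a failure-mode report.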
This is important because agentic systems introduce risks beyond ordinary chatbot use. Reports around OpenClaw-style agents have specifically raised concerns about autonomous access to emails, files, code execution, and corporate data.
Completion Standard
This layer is complete when:
- OpenClaw’s architecture can be explained
- tools, skills, and plugins are understood
- a safe local setup has been documented
- at least one limited workflow has been tested
- a threat model has been written
- human approval boundaries are defined
- OpenClaw is understood as agent infrastructure, not magic
Layer 7 — Evals, Testing, and Observability
Purpose
AI systems must be evaluated.
Without evaluation, AI engineering becomes vibes.
You cannot improve what you do not measure.
OpenAI’s evals documentation describes a three-step process for building and running evals for LLM applications, and OpenAI’s agent-evals documentation covers traces, graders, datasets, and eval runs for improving agent quality. (OpenAI Developers)
Topics
- test datasets
- golden examples
- unit tests around prompts
- regression tests
- LLM-as-judge
- human grading
- retrieval metrics
- answer faithfulness
- tool-call accuracy
- latency
- cost
- refusal behavior
- hallucination tracking
- trace inspection
- failure taxonomies
- prompt versioning
- model comparison
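Golden examples, regression tests, and prompt versioning come together in a very small harness. This is a sketch with an exact-match grader and two dictionary-backed fake systems standing in for two prompt versions; a real suite would call the model and would often need a softer grader. The names are illustrative and this is not the OpenAI evals API.

```python
def exact_match(output, expected):
    """Simplest possible grader: case-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(system, golden_examples, grader=exact_match):
    """Run every golden example through the system and return (pass_rate, results)."""
    results = []
    for example in golden_examples:
        output = system(example["input"])
        results.append({
            "id": example["id"],
            "passed": grader(output, example["expected"]),
            "output": output,
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results


def regressions(before, after):
    """IDs that passed in the earlier run but fail in the later one."""
    passed_before = {r["id"] for r in before if r["passed"]}
    return sorted(r["id"] for r in after if not r["passed"] and r["id"] in passed_before)


golden = [
    {"id": "cap-fr", "input": "Capital of France?", "expected": "Paris"},
    {"id": "cap-jp", "input": "Capital of Japan?", "expected": "Tokyo"},
]
# Fake systems standing in for two versions of a prompt.
answers_v1 = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
answers_v2 = {"Capital of France?": "paris", "Capital of Japan?": "Kyoto"}
rate_v1, results_v1 = run_eval(answers_v1.get, golden)
rate_v2, results_v2 = run_eval(answers_v2.get, golden)
print(rate_v1, rate_v2, regressions(results_v1, results_v2))
```

The pass rate answers "is it better overall?", while `regressions` answers the more useful question "what did this change break?".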
Required Projects
Build:
- Prompt regression test suite
- RAG evaluation dataset
- Human grading spreadsheet
- LLM-as-judge experiment
- Agent trace analysis
- Model comparison report
- Cost/latency dashboard
- Failure taxonomy document
- Evaluation-driven prompt improvement project
- Before/after AI system quality report
Completion Standard
This layer is complete when:
- AI outputs are no longer judged only by feeling
- eval datasets exist
- prompts are versioned
- regressions are caught
- model changes are compared
- RAG retrieval is tested
- agent trajectories are inspected
- failure modes are categorized
Layer 8 — Hugging Face and Open-Source Models
Purpose
Closed model APIs are useful, but serious AI work also requires familiarity with open-source models.
Open-source models give direct exposure to tokenizers, model weights, inference pipelines, fine-tuning, hardware constraints, and the wider ML ecosystem.
Hugging Face Transformers provides pipelines for inference tasks such as text generation, image segmentation, automatic speech recognition, document question answering, sentiment analysis, feature extraction, and question answering. (Hugging Face)
Topics
- model hub
- model cards
- datasets
- tokenizers
- pipelines
- inference
- text classification
- embeddings
- question answering
- text generation
- model loading
- GPU memory
- quantization basics
- local inference
- licensing
- safety notes
- benchmarking
Required Projects
Build:
- Sentiment analysis with pipeline
- Text classification with open model
- Embedding comparison notebook
- Local text-generation demo
- Model-card reading exercise
- Tokenizer visualization notebook
- Open-source RAG system
- Open-source summarizer
- Model benchmark notebook
- Closed vs open model comparison report
Completion Standard
This layer is complete when:
- Hugging Face pipelines are usable
- model cards can be read critically
- tokenization is understood at a basic level
- open models can be run locally or in notebooks
- hardware limits are understood
- model selection is justified by task, cost, quality, and constraints
Layer 8.5 — Local LLMs, Ollama, and Private AI Experimentation
Purpose
Ollama belongs in the AI roadmap as the main practical tool for running large language models locally.
The purpose of learning Ollama is not merely to chat with local models.
The purpose is to understand local inference, open-source model behavior, privacy tradeoffs, offline experimentation, embeddings, RAG, structured outputs, tool use, and the limits of running AI on personal hardware.
Ollama should be treated as the bridge between:
- open-source models
- local AI experimentation
- private document assistants
- local RAG systems
- model comparison
- embeddings
- structured outputs
- offline AI workflows
- lightweight AI deployment experiments
Ollama’s API allows models to be run and interacted with programmatically, and its embeddings capability can generate vectors for semantic search, retrieval, and RAG pipelines.
What to Learn
Topics:
- installing and running Ollama
- pulling models
- listing local models
- model sizes and hardware limits
- local inference
- prompt testing
- REST API usage
- Python/JavaScript integration
- embeddings
- local semantic search
- local RAG
- structured outputs
- JSON schema outputs
- tool/function calling limits
- context window limits
- latency
- memory usage
- CPU vs GPU performance
- model comparison
- privacy and data boundaries
Ollama also supports structured outputs, allowing model responses to be constrained to a JSON schema, which is useful for document parsing, extraction, structured responses, and more reliable AI application behavior.
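Calling Ollama from code with a schema-constrained reply might look like the sketch below. It assumes a local server at the default address `http://localhost:11434`, the `/api/generate` endpoint with `stream` set to false, and a `format` field carrying a JSON schema; check these against the Ollama API reference for your installed version. The model name `llama3.2` and the book schema are placeholders.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_request(model, prompt, schema=None):
    """Build the JSON payload; `format` is only sent when a schema is given."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if schema is not None:
        payload["format"] = schema
    return payload


def generate(payload, url=OLLAMA_URL, timeout=60):
    """POST the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Placeholder schema for a structured-extraction task.
BOOK_SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["title", "year"],
}
payload = build_request(
    "llama3.2",  # placeholder: use a model you have pulled locally
    "Extract the title and year: 'Deep Learning with Python, 2021'.",
    BOOK_SCHEMA,
)
print(json.dumps(payload, indent=2))
# With a server running, the call itself would be:
#   text = generate(payload)
#   book = json.loads(text)  # still validate: a schema narrows output, it does not verify facts
```

Keeping payload construction separate from the network call means the request shape can be unit-tested without a model, and the same builder can be reused when comparing several local models.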
Required Projects
Build:
- Local model playground
- Ollama API caller in Python
- Ollama API caller in TypeScript
- Local summarizer
- Local structured-data extractor
- Local embeddings demo
- Local semantic search over notes
- Local RAG assistant over personal documents
- OpenAI API vs Ollama comparison
- Local model benchmark report
Artifact Requirements
Each Ollama project should include:
- model used
- model size
- hardware used
- latency notes
- memory notes
- prompt examples
- structured-output examples if relevant
- failure cases
- comparison with cloud models where useful
- privacy notes
- README
- limitations
Completion Standard
This layer is complete when:
- local models can be run confidently
- Ollama can be called from code
- embeddings can be generated locally
- a local RAG system can be built
- structured outputs can be tested
- model quality, latency, and hardware limits can be explained
- Ollama is understood as a local AI engineering tool, not just a chatbot
Layer 9 — Fine-Tuning, PEFT, and LoRA
Purpose
Fine-tuning is used when prompting and retrieval are not enough.
But fine-tuning should not be the default solution.
First ask:
- Can the task be solved with better prompting?
- Can it be solved with retrieval?
- Can it be solved with deterministic code?
- Is there enough data?
- Is the behavior stable enough to learn?
- How will improvement be evaluated?
OpenAI’s fine-tuning documentation describes fine-tuning as taking a base model, providing examples of expected inputs and outputs, and producing a model that performs better for the target task. (OpenAI Developers)
PEFT and LoRA
PEFT stands for parameter-efficient fine-tuning.
Hugging Face documents PEFT as adapting large pretrained models without fine-tuning all parameters, reducing computational and storage costs while maintaining comparable performance in many cases. (Hugging Face)
LoRA is one of the most important PEFT methods.
The original LoRA paper proposes freezing pretrained model weights and injecting trainable low-rank decomposition matrices into transformer layers, greatly reducing trainable parameters for downstream tasks. (arXiv)
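The parameter saving follows from simple arithmetic. For a frozen weight matrix W0 of shape d x k, LoRA trains two small matrices, B (d x r) and A (r x k), and computes h = W0 x + (alpha / r) * B (A x), with B initialized to zero so training starts from the pretrained behavior. The sketch below uses plain Python lists and tiny shapes purely for illustration; real training would use PyTorch with the PEFT library.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: full fine-tuning of a d x k matrix versus a rank-r LoRA update."""
    full = d * k
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x). W0 stays frozen; only A and B would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, f"{100 * lora / full:.2f}%")

W0 = [[1, 0], [0, 1]]   # frozen "pretrained" weights (identity, for readability)
A = [[1, 1]]            # r x k with r = 1
B_zero = [[0], [0]]     # d x r, zero at initialization
B_trained = [[1], [2]]  # pretend values after some training
x = [3, 4]
h_init = lora_forward(W0, A, B_zero, x, alpha=2, r=1)
h_tuned = lora_forward(W0, A, B_trained, x, alpha=2, r=1)
print(h_init, h_tuned)
```

For a 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of 16,777,216, which is why LoRA fits on hardware where full fine-tuning does not.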
Topics
- supervised fine-tuning
- dataset preparation
- instruction tuning
- train/validation splits
- formatting examples
- evaluation before training
- evaluation after training
- overfitting
- catastrophic forgetting basics
- adapters
- LoRA
- QLoRA (at a later stage)
- PEFT
- hyperparameters
- GPU memory constraints
- model checkpoints
- model deployment
- model comparison
Required Projects
Build:
- Fine-tuning dataset formatter
- Small text classifier fine-tuning project
- Instruction dataset cleaning project
- LoRA fine-tuning notebook
- Before/after evaluation report
- Overfitting demonstration
- Prompting vs RAG vs fine-tuning comparison
- Domain-specific assistant fine-tune experiment
- Cost and hardware report
- Model card for your fine-tuned model
Completion Standard
This layer is complete when:
- fine-tuning is not used blindly
- training data is inspected
- evaluation exists before training
- before/after performance is compared
- LoRA is understood conceptually
- fine-tuned models are documented
- limitations and risks are stated clearly
Layer 10 — Deployment, Inference, and Optimization
Purpose
AI systems must eventually run somewhere.
A notebook is not a product.
A model demo is not a production system.
This layer is about serving AI systems reliably, economically, and safely.
Topics
- API deployment
- model serving
- batching
- streaming
- latency
- caching
- retries
- timeouts
- rate limits
- cost tracking
- GPU vs CPU inference
- quantization basics
- monitoring
- logging
- model fallback
- prompt/version management
- deployment security
- privacy boundaries
- data retention
- user feedback loops
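Retries and model fallback from the list above are ordinary control flow around the model call. A minimal sketch, assuming a hypothetical `ModelError` exception and models represented as plain functions; a real client would raise its own error types and should also set a request timeout.

```python
import time


class ModelError(Exception):
    """Hypothetical error type standing in for a provider's transient failures."""


def call_with_retries(fn, prompt, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(prompt), retrying on ModelError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn(prompt)
        except ModelError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


def call_with_fallback(models, prompt, **retry_kwargs):
    """Try each (name, fn) in order; return (name, reply) from the first that succeeds."""
    errors = []
    for name, fn in models:
        try:
            return name, call_with_retries(fn, prompt, **retry_kwargs)
        except ModelError as exc:
            errors.append((name, str(exc)))
    raise ModelError(f"all models failed: {errors}")


def flaky_primary(prompt):
    raise ModelError("rate limited")


def small_backup(prompt):
    return f"ok: {prompt}"


delays = []  # record requested sleeps instead of actually waiting
used, reply = call_with_fallback(
    [("primary-model", flaky_primary), ("small-model", small_backup)],
    "hi",
    sleep=delays.append,
)
print(used, reply, delays)
```

Passing `sleep` as a parameter keeps the backoff logic testable without real waiting, and returning the model name alongside the reply gives the logs what they need to show how often the fallback is actually used.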
Required Projects
Build:
- AI API endpoint
- Streaming LLM response app
- RAG API service
- Background summarization worker
- Cost/latency tracker
- Prompt version manager
- Model fallback system
- AI app with logging and feedback
- AI deployment runbook
- Production-readiness checklist
Completion Standard
This layer is complete when:
- AI systems can be deployed
- latency and cost are tracked
- retries and failures are handled
- logs are useful
- user feedback is collected
- model behavior can be monitored
- deployment decisions are documented
Layer 11 — AI Research Paper Reading and Reproduction
Purpose
The long-term goal is not only to use AI tools.
The goal is to understand AI research deeply enough to reproduce papers, critique methods, and eventually contribute original work.
This requires math, coding, patience, and writing.
Paper Reading Method
For each paper, produce:
- Citation
- Problem statement
- Main claim
- Prior work
- Method
- Dataset
- Experiments
- Metrics
- Results
- Limitations
- What you understood
- What you did not understand
- Implementation notes
- Reproduction plan
- Possible extension
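The reading method above can be kept honest with a fixed template that reports which sections are still empty. This is a minimal sketch; the class name `PaperNote` and its field names are illustrative choices that mirror the list, not an established format.

```python
from dataclasses import dataclass, fields


@dataclass
class PaperNote:
    citation: str = ""
    problem_statement: str = ""
    main_claim: str = ""
    prior_work: str = ""
    method: str = ""
    dataset: str = ""
    experiments: str = ""
    metrics: str = ""
    results: str = ""
    limitations: str = ""
    understood: str = ""
    not_understood: str = ""
    implementation_notes: str = ""
    reproduction_plan: str = ""
    possible_extension: str = ""

    def missing(self) -> list:
        """Names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def to_text(self) -> str:
        """Render the note as plain text, one labelled line per section."""
        lines = []
        for f in fields(self):
            label = f.name.replace("_", " ").capitalize()
            lines.append(f"{label}: {getattr(self, f.name) or 'TODO'}")
        return "\n".join(lines)


# Placeholder content, not a real paper.
note = PaperNote(citation="Placeholder citation", main_claim="Placeholder claim")
print(note.missing())
```

A note with a non-empty `missing()` list is a draft, not a finished reading.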
Paper Reproduction Ladder Start small.
- Reproduce a simple ML paper result
- Reimplement a known algorithm
- Reproduce a small deep learning experiment
- Reproduce an NLP paper component
- Reproduce a RAG evaluation method
- Reproduce a LoRA-style fine-tuning experiment
- Reproduce an ablation table
- Write a failed reproduction report
- Extend a paper with a small experiment
- Publish a technical report or preprint
Completion Standard This layer is complete when:
- papers can be read structurally
- equations are not skipped blindly
- methods can be translated into code
- experiments can be partially reproduced
- failed reproductions are documented honestly
- paper notes become research ideas
6. AI Project Ladder
The AI project ladder should move from small experiments to serious systems.
Level 1 — Small AI Utilities Purpose: learn model APIs and basic workflows.
Examples:
- summarizer
- flashcard generator
- grammar assistant
- study question generator
- code explainer
- text classifier
- document tagger
- meeting note cleaner
- simple chatbot
- prompt playground
Requirements:
- README
- prompt examples
- failure cases
- limitations
- small test set
Level 2 — Structured AI Applications Purpose: build AI features inside proper software.
Examples:
- AI study planner
- AI writing critic
- AI code review tool
- AI document organizer
- AI research assistant
- AI email assistant
- AI task prioritizer
- AI flashcard/Anki generator
- AI PDF summarizer
- AI legal/marine-insurance study helper with strict source grounding
Requirements:
- frontend
- backend
- model API
- structured output validation
- logging
- tests
- README
- user flow
- limitations
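"Structured output validation" means the application never trusts raw model text. The sketch below uses only the standard library and assumes a hypothetical flashcard schema (`question`, `answer`, `difficulty`); a schema library could replace the hand-written checks.

```python
import json

# Hypothetical schema for one flashcard returned by a model.
REQUIRED_FIELDS = {"question": str, "answer": str, "difficulty": int}


def validate_flashcard(raw: str) -> dict:
    """Parse model output as JSON and check it against the expected schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} must be {expected_type.__name__}")
    if not 1 <= data["difficulty"] <= 5:
        raise ValueError("difficulty must be between 1 and 5")
    return data


card = validate_flashcard('{"question": "What is RAG?", "answer": "Retrieval plus generation.", "difficulty": 2}')
print(card["question"])
```

On a `ValueError`, the calling code can log the raw output as a failure case and retry or fall back, rather than passing bad data downstream.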
Level 3 — RAG and Knowledge Systems Purpose: ground AI in documents and sources.
Examples:
- personal knowledge assistant
- research paper assistant
- ICS study document assistant
- electronics datasheet assistant
- quantum paper search assistant
- company knowledge assistant
- bug bounty notes assistant
- legal clause search tool
- technical documentation Q&A system
- multi-document source-grounded tutor
- local Ollama-powered RAG assistant
- private document assistant using local embeddings
- cloud-model vs local-model RAG comparison
Requirements:
- ingestion pipeline
- chunking
- embeddings
- vector search
- source citations
- retrieval evaluation
- answer evaluation
- failure analysis
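The requirements above (chunking, embeddings, vector search, retrieval evaluation) can be practiced end to end before touching a real embedding model. In the sketch below the "embedding" is a toy bag-of-words count vector, an explicit stand-in for a learned model, and the chunks and queries are made-up examples.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=40, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the indices of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(chunks)), key=lambda i: cosine(q, embed(chunks[i])), reverse=True)
    return ranked[:k]


def hit_rate_at_k(examples, chunks, k=2):
    """Fraction of (query, relevant_index) pairs whose chunk appears in the top k."""
    hits = sum(1 for query, relevant in examples if relevant in retrieve(query, chunks, k))
    return hits / len(examples)


# Made-up chunks and labelled queries for illustration.
chunks = [
    "LoRA adds low-rank adapter matrices to frozen weights",
    "Retrieval quality depends on chunking and embeddings",
    "Streaming responses reduce perceived latency for users",
]
examples = [
    ("how does chunking affect retrieval", 1),
    ("what reduces latency", 2),
    ("what are low-rank adapter matrices", 0),
]
print(hit_rate_at_k(examples, chunks, k=1))
```

The labelled `(query, relevant_index)` pairs are the retrieval evaluation set; printing the retrieved chunks for each miss is the start of failure analysis.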
Level 4 — Agentic Workflows Purpose: build multi-step AI systems.
Examples:
- research workflow agent
- coding workflow agent
- study planning agent
- customer support triage agent
- bug bounty recon note organizer
- document-processing pipeline agent
- AI project manager with human approval
- AI lab assistant for electronics notes
- AI paper-reading workflow
- AI curriculum planner
- OpenClaw personal assistant workflow
- OpenClaw safety and tool-permission experiment
- OpenClaw messaging-interface automation prototype
Requirements:
- tools
- state
- logs
- human approval points
- failure handling
- evals
- trace analysis
- security boundaries
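Tools, logs, human approval points, and failure handling fit together in one small runner. This is a sketch under stated assumptions: `Tool`, `ToolRunner`, `search_notes`, and `send_email` are hypothetical names, and the approval callback here is a stub where a real system would prompt a person.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    func: Callable[..., str]
    needs_approval: bool = False  # gate side-effecting tools behind a human


@dataclass
class ToolRunner:
    tools: dict
    approve: Callable[[str, dict], bool]     # human approval callback
    log: list = field(default_factory=list)  # trace of every attempted call

    def run(self, name: str, **kwargs) -> str:
        tool = self.tools.get(name)
        if tool is None:
            self.log.append({"tool": name, "args": kwargs, "status": "unknown_tool"})
            return f"error: unknown tool {name!r}"
        if tool.needs_approval and not self.approve(name, kwargs):
            self.log.append({"tool": name, "args": kwargs, "status": "denied"})
            return "error: call denied by reviewer"
        try:
            result = tool.func(**kwargs)
        except Exception as exc:  # report the failure instead of crashing the agent
            self.log.append({"tool": name, "args": kwargs, "status": "failed", "error": str(exc)})
            return f"error: {exc}"
        self.log.append({"tool": name, "args": kwargs, "status": "ok"})
        return result


# Hypothetical tools for demonstration.
def search_notes(query):
    return f"3 notes match {query!r}"


def send_email(to, body):
    return f"sent to {to}"


runner = ToolRunner(
    tools={t.name: t for t in [Tool("search_notes", search_notes),
                               Tool("send_email", send_email, needs_approval=True)]},
    approve=lambda name, args: False,  # demo reviewer rejects every request
)
print(runner.run("search_notes", query="LoRA"))
print(runner.run("send_email", to="a@example.com", body="hi"))
```

The `log` list is the trace: every call, including denied and failed ones, leaves a record that can be analyzed later.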
Level 5 — Fine-Tuning and Model Adaptation Purpose: adapt models for specific behavior.
Examples:
- domain-specific classifier
- writing-style classifier
- support-ticket router
- study-question quality classifier
- fine-tuned small model for structured extraction
- LoRA experiment on small open model
- domain-specific assistant experiment
- prompt vs RAG vs fine-tune comparison
- evaluation report
- model card
Requirements:
- dataset
- training script/notebook
- evaluation set
- before/after comparison
- failure analysis
- model card
- reproducibility notes
Level 6 — Research Reproduction and Original Work Purpose: move toward research contribution.
Examples:
- reproduce a RAG evaluation paper
- reproduce a small transformer experiment
- reproduce a LoRA experiment
- compare chunking strategies
- compare embedding models
- evaluate hallucination mitigation methods
- test agent failure modes
- study prompt robustness
- write a review paper
- publish an experimental report
- local LLM evaluation report using Ollama
- agent safety case study using OpenClaw
- comparison of cloud agents vs self-hosted agents
Requirements:
- paper notes
- code
- dataset
- reproduction attempt
- results
- limitations
- writeup
- possible extensions
7. GitHub Strategy for AI
AI GitHub work must be serious.
Do not fill GitHub with empty “AI wrapper” projects.
Each AI repo should show:
- problem statement
- model used
- why that model was chosen
- data used
- prompt or system design
- architecture
- evaluation method
- failure cases
- cost/latency notes
- limitations
- setup instructions
- reproducibility notes
- screenshots or demo
- future improvements
AI Repository Categories Create several categories of AI repos:
- ai-experiments — notebooks and small experiments
- llm-apps — practical AI applications
- rag-lab — retrieval and document-grounded systems
- agent-lab — agent workflows and tool-use systems
- deep-learning-lab — PyTorch/TensorFlow projects
- fine-tuning-lab — LoRA, PEFT, and fine-tuning experiments
- paper-reproductions — research paper implementations
- ai-evals — evaluation datasets, graders, and reports
- ai-safety-notes — responsible AI and failure-mode analysis
- ai-product-case-studies — full writeups of AI products
The GitHub goal is:
Make it obvious that AI is not being used as magic. It is being engineered, evaluated, documented, and understood.
8. Responsible AI and Safety Layer
Responsible AI is not optional.
AI systems can mislead people, leak data, amplify bias, produce false confidence, and fail unpredictably. NIST’s AI Risk Management Framework was developed to help manage risks to individuals, organizations, and society, and its trustworthiness characteristics include validity and reliability, safety, security and resilience, accountability and transparency, explainability and interpretability, privacy enhancement, and fairness with harmful bias managed. (NIST)
Responsible AI Checklist For every serious AI project, ask:
- What harm could this cause?
- What happens if the output is wrong?
- Who might overtrust it?
- What data is being used?
- Is private information involved?
- Are sources shown?
- Are limitations shown?
- Can the user verify the output?
- Is there a human review step?
- What logs are stored?
- What should not be stored?
- What biases might appear?
- How will failures be reported?
- How will the system be improved?
Required Artifact Create a responsible-AI review for every serious AI project.
It should include:
- intended use
- prohibited use
- data sources
- privacy concerns
- failure modes
- evaluation method
- human review requirements
- user-facing limitations
- security notes
- improvement plan
Standard An AI system is not complete until its risks and limitations are documented.
9. How AI Should Be Used to Learn AI
This is a special case.
You are allowed to use AI heavily while learning AI.
But the usage must be disciplined.
Correct Use Ask AI to:
- explain concepts at multiple levels
- quiz you
- generate exercises
- review your code
- compare frameworks
- explain papers section by section
- generate implementation plans
- create debugging hypotheses
- produce failure-mode checklists
- help design evals
- challenge your assumptions
Incorrect Use Do not ask AI to:
- read a paper so you do not have to
- write code you cannot explain
- generate fake experiment results
- create citations without verification
- invent benchmarks
- claim a model improved without evals
- write research conclusions before results exist
The AI-Learning Rule For every AI explanation, produce your own artifact.
Examples:
- concept note
- code implementation
- experiment
- diagram
- quiz answers
- paper summary
- evaluation dataset
- failure analysis
10. Common AI Traps
Trap 1 — Prompt Engineering as Identity Prompting is useful, but it is not enough.
Rule:
Learn prompting, then move into systems, tools, data, evals, and model behavior.
Trap 2 — Wrappers Without Engineering Many AI apps are just a textbox connected to an API.
That is not enough.
Rule:
Add structure, workflow, memory, retrieval, evaluation, and product usefulness.
Trap 3 — No Evaluation If there is no eval, there is no engineering.
Rule:
Every serious AI system needs test cases.
Trap 4 — RAG Without Retrieval Inspection A RAG system can fail because retrieval is bad, even if the model is good.
Rule:
Always inspect retrieved chunks.
Trap 5 — Agents for Everything Agents are not always needed.
Rule:
Use deterministic code where deterministic code is enough.
Trap 6 — Fine-Tuning Too Early Fine-tuning is often not the first solution.
Rule:
Try prompting, structured outputs, retrieval, and better workflow before fine-tuning.
Trap 7 — No Data Discipline Bad data creates bad AI systems.
Rule:
Inspect, clean, split, version, and document datasets.
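Two parts of this rule, "split" and "version", are easy to get wrong by hand. One possible approach, sketched below with the standard library only, assigns each example to a split from a stable hash of its ID (so the split never changes between runs) and labels a dataset by a hash of its contents. The function names and the 80/10/10 percentages are assumptions for illustration.

```python
import hashlib


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Assign an example to train/val/test from a stable hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def dataset_fingerprint(rows) -> str:
    """Short content hash of the dataset, usable as a version label."""
    h = hashlib.sha256()
    for row in sorted(rows):  # sort so row order does not change the label
        h.update(row.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()[:12]


ids = [f"example-{i}" for i in range(1000)]
counts = {"train": 0, "val": 0, "test": 0}
for example_id in ids:
    counts[assign_split(example_id)] += 1
print(counts, dataset_fingerprint(ids))
```

Because the split depends only on the ID, adding new examples later never moves an old example from train into test.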
Trap 8 — Believing Model Output Because It Sounds Good Language models can sound confident while being wrong.
Rule:
Verify important outputs against sources, tests, or reality.
11. First 17 Serious AI Artifacts
These are the first serious AI artifacts to build.
Artifact 1 — AI Usage Constitution A written rulebook for using AI without destroying learning.
Artifact 2 — Python AI Experiment Template A reusable project template for notebooks, scripts, data, results, and README files.
Artifact 3 — ML Basics Repository Small classical ML experiments with metrics and explanations.
Artifact 4 — Deep Learning Lab PyTorch and Keras notebooks covering tensors, training loops, image classification, text classification, and model saving.
Artifact 5 — LLM API Playground A clean repo for testing prompts, structured outputs, costs, latency, and model comparisons.
Artifact 6 — Study Flashcard Generator A practical AI tool that converts notes into flashcards, with quality checks.
Artifact 7 — Source-Grounded Document Assistant A RAG system that answers questions from uploaded documents with citations.
Artifact 8 — ICS Revision AI Assistant A study assistant for your ICS-style exam preparation, with strict source grounding and no unsupported answers.
Artifact 9 — AI Evaluation Lab A repo for eval datasets, graders, prompt tests, RAG tests, and model comparisons.
Artifact 10 — Agent Workflow Lab A collection of agents and workflows with tools, logs, human approval points, and failure analysis.
Artifact 11 — Research Paper Tracker AI A tool for storing papers, summaries, tags, claims, methods, and possible research ideas.
Artifact 12 — Hugging Face Open-Model Lab Experiments using open-source models for classification, embeddings, generation, and comparison.
Artifact 13 — LoRA / PEFT Experiment A small, well-documented parameter-efficient fine-tuning experiment.
Artifact 14 — Paper Reproduction Repo A serious attempt to reproduce one AI paper or one part of a paper.
Artifact 15 — AI Product Case Study A full writeup of one AI system covering problem, design, data, model, architecture, evals, failure cases, risks, and improvements.
Artifact 16 — Ollama Local Model Lab A repository for running, comparing, and documenting local models through Ollama.
Includes:
- model setup notes
- API examples
- embedding examples
- structured output examples
- local RAG demo
- latency/memory benchmarks
- comparison with cloud models
- limitations
Artifact 17 — OpenClaw Agent Infrastructure Study A repository or long-form case study documenting OpenClaw as a personal agent system.
Includes:
- setup notes
- architecture map
- tools/skills/plugins explanation
- safe workflow experiments
- security threat model
- permission boundaries
- failure cases
- lessons for building future agents
12. When to Move Forward
Do not move forward because you watched videos or copied notebooks.
Move forward when artifacts show competence.
Move past AI tool usage when:
- you can explain what AI did and did not do
- you verify outputs
- you can identify hallucinations
- you use AI without outsourcing understanding
Move past Python/data basics when:
- datasets can be loaded, cleaned, explored, and visualized
- notebooks are reproducible
- experiments are organized
Move past ML basics when:
- baseline models are built
- metrics are understood
- overfitting can be diagnosed
- error analysis is performed
Move past deep learning basics when:
- tensors and training loops are understandable
- simple models can be trained
- model checkpoints can be saved and loaded
- results are evaluated and documented
Move past LLM app basics when:
- model APIs are integrated into software
- structured outputs are validated
- prompts are versioned
- failures are documented
Move past RAG basics when:
- documents are chunked and indexed
- retrieval results are inspected
- answers include sources
- retrieval and answer quality are evaluated
Move past agents when:
- workflows and agents are distinguished
- tools are safe and logged
- traces can be inspected
- human approval exists where needed
Move past fine-tuning basics when:
- training data is clean
- evaluation exists before and after training
- LoRA/PEFT is understood conceptually
- model behavior improvements are measured
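"Evaluation exists before and after training" implies scoring both models on the same fixed evaluation set and looking at individual examples, not only the headline number. The sketch below assumes a simple classification setup; the labels and predictions are made-up illustration data, not results.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def compare_runs(before, after, labels):
    """Compare two prediction lists on the same labels, example by example."""
    rows = list(zip(before, after, labels))
    return {
        "accuracy_before": accuracy(before, labels),
        "accuracy_after": accuracy(after, labels),
        # Examples the adapted model fixed, and examples it broke.
        "fixed": [i for i, (b, a, y) in enumerate(rows) if b != y and a == y],
        "regressed": [i for i, (b, a, y) in enumerate(rows) if b == y and a != y],
    }


# Made-up data for illustration only.
labels = ["a", "b", "a", "c", "b"]
before = ["a", "a", "a", "b", "b"]  # base model predictions
after = ["a", "b", "c", "c", "b"]   # fine-tuned model predictions
report = compare_runs(before, after, labels)
print(report)
```

The `regressed` list matters as much as the accuracy gain: a fine-tune that fixes two examples and breaks one has changed behavior in ways the average hides.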
Move into research when:
- papers can be read structurally
- code can reproduce parts of papers
- failed reproductions can be documented honestly
- research questions begin emerging from experiments
13. The AI Standard
The final standard for this domain is:
I can build AI systems that are useful, evaluated, safe enough for their context, documented, and technically understood. I can use existing models, build applications around them, evaluate their behavior, adapt them when justified, and read research papers deeply enough to reproduce and eventually contribute.
AI is not the replacement for the life plan.
AI is one of the tools and domains inside the life plan.
It must make the builder stronger, not weaker.
It must increase contact with reality, not reduce it.
It must help produce better systems, better research, better explanations, better decisions, and better service.