Part 05 of 18

AI Engineering and AI Research

1. Purpose of This Part

This part defines the AI roadmap.

AI is one of the most important domains in the master plan because it connects software development, research, automation, mathematics, philosophy, product design, and future scientific work.

But this section must be understood carefully.

The goal is not to become someone who merely “uses ChatGPT well.”

The goal is also not to become someone who hides behind AI-generated work.

The goal is:

To understand, build, evaluate, improve, deploy, and research AI systems with enough depth that AI becomes a serious engineering and intellectual tool, not a shortcut around competence.

This directly connects to the original life-plan brief: you want to move from simple custom agents, to LangChain/DSPy-like systems, to TensorFlow/PyTorch development, to LoRA and similar optimization concepts, and eventually to understanding serious AI research papers at a deep level.

AI in this plan has three roles:

AI as a tool — used to accelerate learning, debugging, writing, research, and building.
AI as a product layer — used inside applications, agents, automations, research tools, and SaaS systems.
AI as a research domain — studied through machine learning, deep learning, LLMs, fine-tuning, evals, papers, and experiments.

The final standard is:

I can build AI systems that are useful, evaluated, documented, and understood — and I can read, reproduce, and eventually contribute to AI research.

2. What AI Competence Actually Means

AI competence is not prompting alone.

Prompting is a useful starting skill, but it is not enough.

Real AI competence includes:

● understanding what models can and cannot do
● designing prompts and structured outputs
● building AI workflows
● using model APIs
● using tools and function calling
● building retrieval systems
● building RAG pipelines
● evaluating output quality
● measuring failure cases
● creating datasets
● managing context
● designing agent workflows
● understanding embeddings
● understanding tokenization
● understanding transformers at a conceptual level
● training small models
● fine-tuning models
● using open-source models
● understanding LoRA and PEFT
● deploying AI systems
● monitoring AI behavior
● reading research papers
● reproducing experiments

The standard is not:

     “Can I ask an AI to make something?”

The standard is:

Can I build an AI system, test whether it works, understand why it fails, improve it, and explain the tradeoffs?

3. The AI Builder Identity

The identity to build here is:

     AI systems engineer-researcher.

That means you are not merely a user of AI tools.

You are someone who can design AI-powered systems.

You are someone who can ask:

● What is the task?
● Does this task need AI?
● What model is appropriate?
● What data is needed?
● What is the failure mode?
● What should be deterministic code instead of AI?
● What should be retrieved instead of memorized?
● What should be evaluated?
● What should be logged?
● What should the user verify?
● How do we prevent hallucinated authority?
● How do we measure improvement?
● How do we know this is useful?

A serious AI engineer does not worship the model.

A serious AI engineer builds the system around the model.

The model is one component.

The product, data, interface, tools, evals, retrieval, logging, security, and human workflow matter just as much.

4. The Research-Backed Source Spine

The AI roadmap should be built from official documentation, practical books, research papers, and reproducible projects.

The main source spine is:

● PyTorch tutorials and documentation for deep learning implementation. PyTorch’s beginner tutorial introduces the complete ML workflow: working with data, creating models, optimizing parameters, and saving trained models. (PyTorch Documentation)

● TensorFlow and Keras tutorials for deep learning from the TensorFlow ecosystem. TensorFlow’s beginner quickstart uses Keras to load a dataset, build a neural network, train it, and evaluate accuracy. (TensorFlow)

● LangChain and LangGraph documentation for LLM applications, workflows, and agents. LangChain provides model integrations and agent/application architecture, while LangGraph provides infrastructure for long-running, stateful workflows and agents. (LangChain Docs)

● DSPy documentation and paper for programming language-model systems more systematically instead of relying only on brittle prompt strings. DSPy describes itself as a declarative framework for modular AI software, and the DSPy paper argues for moving LM pipeline construction away from manual free-form prompt manipulation. (dspy.ai)

● OpenAI official API documentation for model APIs, tools, agents, embeddings, fine-tuning, and evals. OpenAI’s API docs cover tool use, agent workflows, supervised fine-tuning, vector embeddings, and evaluation workflows. (OpenAI Developers)

● Hugging Face Transformers and PEFT documentation for open-source model usage, inference pipelines, training, and parameter-efficient fine-tuning. Hugging Face’s Transformers documentation describes pipelines as simple optimized inference interfaces, and PEFT is documented as a library for adapting large pretrained models without fine-tuning all parameters. (Hugging Face)

● LoRA original paper and Hugging Face LoRA documentation for understanding parameter-efficient fine-tuning. The LoRA paper proposes freezing pretrained model weights and injecting trainable low-rank matrices, while Hugging Face’s LoRA documentation describes LoRA as reducing trainable parameters by decomposing large matrices into smaller low-rank matrices. (arXiv)

● NIST AI Risk Management Framework for trustworthy AI thinking. NIST identifies trustworthy AI characteristics such as validity, reliability, safety, security, resilience, accountability, transparency, explainability, interpretability, privacy enhancement, and fairness with harmful bias managed. (NIST AI Resource Center)

● Deep Learning with Python for Keras/deep learning practice. Manning describes the second edition as an introduction to deep learning using Python and Keras, with practical techniques and important theory for neural networks. (Manning Publications)

● Hands-On Large Language Models for practical LLM understanding. The official GitHub repository contains code examples for the book, and the official book site describes it as an illustrated guide to large language models. (GitHub)

The rule is:

Use AI tools, but learn AI systems from source: documentation, code, papers, experiments, and evaluations.

5. The AI Roadmap Ladder

The AI roadmap has layers.

Each layer should produce artifacts.

The goal is not to rush to fine-tuning or agents before the foundation exists.

The goal is to build a serious stack of capability.

Layer 0 — Correct AI Usage and Mental Discipline

Purpose

Before learning AI engineering, the first layer is learning how not to be destroyed by AI.

AI can create fake progress faster than almost any other tool.

It can write code you do not understand.

It can summarize papers you never read.

It can generate essays that contain no real thought.

It can create the feeling of productivity while weakening the person using it.

Therefore, the first layer is discipline.

Core Rule

AI may accelerate the work, but it must not replace contact with the work.

This means:

● use AI to clarify, not to avoid understanding
● use AI to review, not to replace judgment
● use AI to generate tests, not to avoid testing
● use AI to explain papers, not to avoid reading papers
● use AI to debug with you, not to stop you from debugging
● use AI to generate alternatives, not to make decisions blindly
● use AI to challenge you, not to flatter you

Good AI Use

Use AI as:

tutor
Socratic examiner
code reviewer
debugger
paper explainer
architecture critic
test generator
documentation assistant
research assistant
opposing argument generator
project planner
failure-mode finder
study partner

Bad AI Use

Do not use AI to:

generate full projects you cannot explain
avoid learning Python
avoid learning math
avoid reading documentation
avoid debugging
avoid writing tests
fake research
fabricate citations
submit work you do not understand
create a portfolio you cannot defend

Required Artifact

Create an “AI Usage Constitution” document.

It should include:

what AI is allowed to do
what AI is not allowed to do
rules for AI-generated code
rules for AI-assisted research
rules for AI-assisted writing
rules for AI-assisted math
rules for AI-assisted debugging
self-audit checklist

Completion Standard

This layer is complete when:

AI is being used deliberately
AI outputs are verified
you can explain AI-assisted work
you do not treat generated work as mastery
every serious AI-assisted output has a human verification step

Layer 1 — Python, Data, Notebooks, and Experiment Workflow

Purpose

AI engineering requires a strong Python workflow.

Python is the main practical language for machine learning, deep learning, notebooks, data processing, experiments, and AI research reproduction.

This layer is about becoming operational in AI experimentation.

Topics

Python fundamentals
virtual environments
package management
Jupyter notebooks
NumPy
pandas
Matplotlib
data loading
data cleaning
train/test split
basic statistics
plotting
experiment folders
reproducible notebooks
random seeds
saving results
reading CSV/JSON/parquet
command-line scripts for experiments

Required Projects

Build:

CSV data cleaner
Dataset explorer notebook
Data visualization notebook
Simple statistics notebook
Train/test split demo
Experiment logging template
Reproducible ML project template
Python package for data utilities
Notebook-to-script conversion exercise
Data report generator

Artifact Requirements

Each experiment should include:

dataset description
problem statement
preprocessing steps
notebook
script version if appropriate
results
limitations
README
environment file

Completion Standard

This layer is complete when:

Python notebooks are comfortable
data can be loaded and inspected
visualizations can be created
experiments are organized
results can be reproduced
GitHub contains clean AI/data project templates
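The reproducibility items in this layer (random seeds, train/test split, saving results) can be sketched with the standard library alone. This is a minimal illustration, not a recommended framework: the names `seeded_split` and `save_run` and the `run.json` file name are assumptions made for this example.

```python
import json
import random
from pathlib import Path


def seeded_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then split into (train, test)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def save_run(out_dir, config, results):
    """Store the settings and results of one run side by side as JSON."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / "run.json"  # hypothetical file name for this sketch
    record = {"config": config, "results": results}
    target.write_text(json.dumps(record, indent=2, sort_keys=True))
    return target


config = {"seed": 7, "test_fraction": 0.2}
train, test = seeded_split(range(10), config["test_fraction"], config["seed"])
print(len(train), len(test))  # 8 2
```

Keeping the seed inside the saved config means a later run can rebuild the same split from `run.json` instead of relying on memory.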

Layer 2 — Machine Learning Foundations

Purpose

Before deep learning and LLMs, learn the basic machine learning workflow.

This layer teaches the structure of learning from data.

Topics

supervised learning
unsupervised learning
classification
regression
clustering
train/test/validation split
overfitting
underfitting
loss functions
metrics
confusion matrix
precision
recall
F1 score
ROC/AUC
feature engineering
cross-validation
baseline models
error analysis

Required Projects

Build:

Linear regression from scratch
Logistic regression from scratch
k-nearest neighbors from scratch
Decision tree using a library
Random forest experiment
Clustering experiment
Classification evaluation notebook
Imbalanced classification experiment
Feature engineering case study
Model comparison report

Completion Standard

This layer is complete when:

you understand the basic ML workflow
metrics are chosen intentionally
baseline models are created before complex models
errors are analyzed
notebooks explain what happened and why
you can explain overfitting and generalization clearly
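To make "metrics are chosen intentionally" concrete, the sketch below computes the confusion-matrix counts and then accuracy, precision, recall, and F1 from them. It is a plain-Python illustration for binary labels; the function names are chosen for this example and are not taken from any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1; each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Imbalanced case: one positive in ten, and a model that always predicts 0.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, fn, tn))         # 0.9
print(precision_recall_f1(tp, fp, fn))  # (0.0, 0.0, 0.0)
```

The always-negative model scores 0.9 accuracy while finding none of the positives, which is why the imbalanced classification experiment in this layer should report recall and F1 alongside accuracy.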

Layer 3 — Deep Learning Fundamentals

Purpose

Deep learning is the foundation for modern AI systems, including computer vision, NLP, speech, multimodal systems, and LLMs.

This layer is about understanding neural networks as implemented systems, not as magic.

PyTorch and TensorFlow/Keras are both valid ecosystems. PyTorch’s beginner material introduces a full ML workflow with data, models, optimization, and saving models; TensorFlow’s beginner quickstart uses Keras to build, train, and evaluate a neural network. (PyTorch Documentation)

Topics

tensors
automatic differentiation
neural network layers
activation functions
loss functions
optimizers
backpropagation
training loops
validation loops
batching
datasets
dataloaders
regularization
dropout
batch normalization
learning rates
checkpoints
saving/loading models
GPU basics
experiment tracking

PyTorch Path

Use PyTorch to understand lower-level deep learning workflows.

Required projects:

Tensor operations notebook
Autograd notebook
Neural network from scratch using NumPy
Simple PyTorch classifier
Custom training loop
CNN image classifier
RNN or sequence model experiment
Transfer learning experiment
Model saving/loading experiment
Experiment comparison report

TensorFlow/Keras Path

Use Keras for clean high-level experimentation.

Keras is described by TensorFlow as the high-level API of the TensorFlow platform, designed to provide an approachable and productive interface for machine learning problems, from data processing to hyperparameter tuning and deployment. (TensorFlow)

Required projects:

Keras Sequential model
Keras Functional API model
Image classification notebook
Text classification notebook
Model checkpointing experiment
Hyperparameter experiment
TensorBoard logging experiment
Transfer learning project
Overfitting/regularization report
Comparison with PyTorch implementation

Completion Standard

This layer is complete when:

tensors are understood
training loops are not mysterious
loss and optimization are understandable
simple neural networks can be built
overfitting can be detected
model performance can be evaluated
saved models can be reused
results are documented clearly
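The anatomy of a training loop (batching, loss, gradients, optimizer step, validation pass) can be shown without any framework. The sketch below fits a one-variable linear model with hand-written mini-batch gradient descent; it is a deliberately tiny stand-in for what PyTorch or Keras automate, and the data, learning rate, and function names are assumptions chosen for illustration.

```python
import random


def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(train_data, val_data, lr=0.1, epochs=200, batch_size=4, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        order = list(train_data)
        rng.shuffle(order)  # new batch order every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradients of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w  # optimizer step (plain SGD)
            b -= lr * grad_b
        # Validation pass: measure loss on held-out data, no parameter updates.
        history.append((mse(w, b, train_data), mse(w, b, val_data)))
    return w, b, history


# Noise-free synthetic data from y = 3x + 1, split into train and validation.
points = [(i / 10, 3 * (i / 10) + 1) for i in range(-10, 11)]
train_points, val_points = points[::2], points[1::2]
w, b, history = train(train_points, val_points)
print(round(w, 3), round(b, 3))
```

In a framework, automatic differentiation replaces the two hand-derived gradient lines and an optimizer object replaces the two update lines; the loop structure stays the same.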

Layer 4 — LLM Fundamentals and Application Engineering

Purpose

This layer introduces large language models as programmable components inside applications.

The goal is not to become a “prompt wizard.”

The goal is to understand how to build reliable systems around LLMs.

OpenAI’s API documentation covers model usage, structured outputs, tools, embeddings, fine-tuning, and evals, while Hugging Face Transformers provides open-source model usage through pipelines, trainers, and model tooling. (OpenAI Developers)

Topics

model APIs
prompts
system instructions
structured outputs
JSON schemas
tool calling
function calling
embeddings
context windows
tokens
temperature
top-p
latency
cost
retries
rate limits
streaming
safety filters
logging
failure modes

Required Projects

Build:

Simple LLM API caller
Structured JSON extractor
Document summarizer
Email drafting assistant
Study question generator
Flashcard generator
ICS revision assistant
AI code review assistant
Prompt comparison notebook
LLM cost/latency tracker

Artifact Requirements

Each LLM app should include:

prompt design notes
input/output examples
failure cases
test cases
cost notes
latency notes
limitations
README
evaluation plan

Completion Standard

This layer is complete when:

model calls can be integrated into apps
structured outputs can be requested and validated
prompts are versioned
outputs are tested
failure cases are documented
AI features are not treated as magic
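"Structured outputs can be requested and validated" implies that code, not trust, decides whether a model reply is usable. The sketch below validates a JSON reply against a small hand-written schema and retries on failure. The schema (`title`, `year`) and the `call_model` parameter are hypothetical: `call_model` stands for any function that takes a prompt string and returns the model's raw text, and no specific provider API is assumed.

```python
import json

# Hypothetical schema for this example: field name -> required Python type.
REQUIRED_FIELDS = {"title": str, "year": int}


def parse_and_validate(raw_text):
    """Return (record, error); record is None when validation fails."""
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "expected a JSON object"
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            return None, f"missing field: {name}"
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None, f"wrong type for field: {name}"
    return record, None


def extract_with_retries(call_model, prompt, max_attempts=3):
    """Call the model until its reply validates; keep every failure for logging."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        record, error = parse_and_validate(raw)
        if record is not None:
            return record, failures
        failures.append({"attempt": attempt, "error": error, "raw": raw})
    raise ValueError(f"no valid output after {max_attempts} attempts: {failures}")
```

Returning the failure list alongside the result gives the "failure cases" artifact something concrete to draw from: each rejected reply is recorded with the reason it was rejected.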

Layer 5 — Embeddings, Semantic Search, and RAG

Purpose

Retrieval-Augmented Generation is one of the most practical AI engineering patterns.

Instead of expecting a model to “know everything,” you retrieve relevant information and provide it as context. This is essential for document assistants, study tools, knowledge bases, company-data assistants, research assistants, and AI systems that need grounded answers.

OpenAI’s embeddings documentation describes embeddings as turning text into numbers, unlocking use cases such as search and clustering. Hugging Face’s Transformers documentation also supports model-based inference workflows, including feature extraction and question answering through pipelines. (OpenAI Developers)
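The retrieve-then-provide-context pattern described above can be sketched end to end with a toy embedding. In the example below, `embed` is a word-count vector standing in for a real embedding model, which is an assumption made so the code runs offline; chunk sizes, source labels, and function names are likewise illustrative.

```python
import math
import re
from collections import Counter


def chunk_words(text, chunk_size=40, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts. A real system would call a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k highest-scoring (score, chunk) pairs for the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query, hits):
    """Assemble retrieved chunks, labeled by source, into the model prompt."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for _, c in hits)
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\n\nQuestion: {query}"
    )
```

Because `retrieve` returns scores alongside chunks, retrieval results can be inspected directly, which is the starting point for the retrieval failure analysis project listed below.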

Topics

embeddings
vector similarity
chunking
metadata
vector databases
retrieval
reranking
prompt assembly
citations
source grounding
hallucination reduction
retrieval evaluation
answer evaluation
document ingestion
PDF parsing
semantic search UI
hybrid search
query rewriting

Required Projects

Build:

Embedding playground
Semantic search over notes
PDF question-answering tool
Study document assistant
Research paper search system
RAG system with citations
RAG evaluation notebook
Chunking strategy comparison
Retrieval failure analysis
Multi-document knowledge assistant

The OpenAI Cookbook includes an example focused on building and evaluating a RAG pipeline with LlamaIndex, while Hugging Face’s cookbook includes RAG evaluation workflows using synthetic evaluation data and LLM-as-judge-style scoring. (OpenAI Developers)

Completion Standard

This layer is complete when:

embeddings are understood conceptually
documents can be chunked and indexed
retrieval results can be inspected
answers include source grounding
bad retrieval can be diagnosed
RAG quality can be evaluated
a document assistant can be built end-to-end

Layer 6 — Agents, Tools, and Workflows

Purpose

Agents are useful when a system must plan, call tools, maintain state, collaborate across steps, or handle long-running workflows.

But agents are also easy to overuse.

Many problems do not need agents.

Some problems need simple code.

Some need a workflow.

Some need retrieval.

Some need a model call.

Only some need agentic behavior.

LangChain’s agent documentation describes agents as graph-based runtimes using LangGraph, and OpenAI’s Agents SDK documentation describes agents as applications that plan, call tools, collaborate across specialists, and keep enough state to complete multi-step work. (LangChain Docs)

Key Distinction

A workflow follows a predetermined path.

An agent dynamically decides steps and tool usage.

LangGraph’s documentation explicitly distinguishes workflows with predetermined code paths from agents that define their own processes and tool usage. (LangChain Docs)
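The distinction can be illustrated in plain Python without any agent framework. In this sketch the workflow hard-codes its steps, while the agent loop asks a decision function which tool to call next. The `scripted_policy` function is a hypothetical stand-in for a model call, and the tool names are invented for the example; this is not how LangGraph or any SDK implements agents.

```python
def run_workflow(text, tools):
    """Workflow: the code path is fixed in advance."""
    cleaned = tools["clean"](text)
    return tools["count_words"](cleaned)


def run_agent(goal, tools, choose_next_step, max_steps=5):
    """Agent: a decision function (normally a model) picks each next tool."""
    state = {"goal": goal, "history": []}
    for _ in range(max_steps):
        action = choose_next_step(state)
        if action["tool"] == "finish":
            return action["input"], state["history"]
        if action["tool"] not in tools:
            # Record the bad request so the decision function can react to it.
            state["history"].append({"tool": action["tool"], "error": "unknown tool"})
            continue
        result = tools[action["tool"]](action["input"])
        state["history"].append({"tool": action["tool"], "result": result})
    raise RuntimeError("agent exceeded max_steps without finishing")


def scripted_policy(state):
    """Hypothetical stand-in for a model: decides from the history so far."""
    if not state["history"]:
        return {"tool": "clean", "input": state["goal"]}
    if len(state["history"]) == 1:
        return {"tool": "count_words", "input": state["history"][0]["result"]}
    return {"tool": "finish", "input": state["history"][-1]["result"]}


tools = {
    "clean": lambda s: " ".join(s.split()),
    "count_words": lambda s: len(s.split()),
}
print(run_workflow("  a  b   c ", tools))                  # 3
print(run_agent("  a  b   c ", tools, scripted_policy)[0])  # 3
```

Both produce the same answer here, which is the point: when the path is known in advance, the workflow is simpler, cheaper, and easier to test. The agent loop earns its step limit, state, and history only when the path cannot be fixed beforehand.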

Topics

tools
function calling
workflow graphs
agent state
memory
planning
tool errors
retries
human-in-the-loop
guardrails
multi-agent systems
task decomposition
tool authorization
sandboxing
observability
agent evaluation

Required Projects

Build:

Tool-calling calculator agent
File-search assistant
Calendar/task planning workflow
Research assistant workflow
Coding assistant with limited tools
Customer support triage agent
Multi-step study planner agent
RAG + tool-use agent
Human-in-the-loop approval agent
Agent failure-mode report

Agent Design Rule

Every agent must have a reason to exist.

Before building an agent, ask:

What tools does it need?
What state does it need?
What can go wrong?
What should require human approval?
What should be logged?
What should be deterministic?
What should be evaluated?

Completion Standard

This layer is complete when:

workflows and agents are not confused
tools are designed safely
agent state is understandable
failures are logged
agent outputs are evaluated
human approval is used where needed
agents are built because the task requires them, not because they sound impressive

Layer 6.5 — OpenClaw and Personal Agent Infrastructure

Purpose

OpenClaw belongs in the AI roadmap as a practical case study in personal agent infrastructure.

The purpose of learning OpenClaw is not merely to install a trendy AI assistant.

The purpose is to understand how agentic systems connect models, tools, messaging interfaces, local machines, permissions, plugins, workflows, and memory into a working personal assistant architecture.

OpenClaw is especially relevant because it represents a real-world example of the shift from chatbots to agents that can do things across tools and communication surfaces.

The official OpenClaw documentation describes it as a self-hosted gateway that connects chat apps and channel surfaces to AI coding agents through a gateway process running on your own machine or server. Its GitHub repository describes OpenClaw as a personal AI assistant that runs on your own devices and can answer through channels you already use.

What to Learn

Topics:

self-hosted agent gateways
chat-based agent interfaces
channel integrations
tool calling
skills
plugins
local execution
memory
agent permissions
human approval
workflow automation
file access
shell access
browser/web tools
messaging integrations
security hardening
privacy risks
audit logs
agent failure modes

OpenClaw’s own tool documentation describes three layers: tools, skills, and plugins. Tools are typed functions the agent can invoke, skills teach the agent when and how to use capabilities, and plugins can register additional tools.

Why OpenClaw Matters

OpenClaw is useful as a study object because it forces several serious AI-engineering questions:

What should an agent be allowed to do?
What tools should require approval?
What should never be automated?
How should tool calls be logged?
How should private data be protected?
How should agent memory be controlled?
How should shell/file/browser access be sandboxed?
What happens if the model misunderstands the user?
What happens if a plugin is malicious?
What happens if the agent acts at the wrong time?

This makes OpenClaw part of both:

AI engineering
cybersecurity / AI safety

Required Projects

Build or study:

OpenClaw architecture notes
Local OpenClaw setup log
Tool/skill/plugin concept map
Safe personal-assistant use case
OpenClaw security threat model
Human-approval workflow design
OpenClaw + GitHub issue triage experiment
OpenClaw + research assistant workflow
OpenClaw + calendar/email mock workflow
OpenClaw failure-mode report

Security and Safety Focus

OpenClaw should be studied carefully because agentic tools can access real systems, files, messages, emails, calendars, and shell commands.

The security focus should include:

least privilege
allowlisted tools
sandboxing
approval before destructive actions
secrets management
logging
plugin trust
local data boundaries
safe defaults
separation between experiments and real personal accounts
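Several of these controls (allowlisted tools, approval before destructive actions, logging) can be expressed as one small gate that every tool call must pass through. The sketch below is a generic illustration and does not use or describe OpenClaw's actual API; the `ToolPolicy` class, the tool names, and the `approve` callback are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    """Gate for tool calls: allowlist first, then approval, always logged."""
    allowed: set
    needs_approval: set
    audit_log: list = field(default_factory=list)

    def execute(self, tool_name, args, tools, approve):
        # Least privilege: anything not explicitly allowlisted is refused.
        if tool_name not in self.allowed:
            self.audit_log.append((tool_name, args, "denied: not allowlisted"))
            raise PermissionError(f"{tool_name} is not allowlisted")
        # Destructive tools run only after a human (or stricter check) agrees.
        if tool_name in self.needs_approval and not approve(tool_name, args):
            self.audit_log.append((tool_name, args, "denied: approval refused"))
            raise PermissionError(f"{tool_name} was not approved")
        result = tools[tool_name](**args)
        self.audit_log.append((tool_name, args, "executed"))
        return result


# Hypothetical tools that only return strings, so nothing real is touched.
tools = {
    "read_note": lambda path: f"contents of {path}",
    "delete_note": lambda path: f"deleted {path}",
}
policy = ToolPolicy(
    allowed={"read_note", "delete_note"},
    needs_approval={"delete_note"},
)
print(policy.execute("read_note", {"path": "todo.txt"}, tools, approve=lambda n, a: False))
```

Denials are logged as well as executions, so the audit log shows what the agent attempted, not only what it was permitted to do.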

This is important because agentic systems introduce risks beyond ordinary chatbot use. Reports around OpenClaw-style agents have specifically raised concerns about autonomous access to emails, files, code execution, and corporate data.

Completion Standard

This layer is complete when:

OpenClaw’s architecture can be explained
tools, skills, and plugins are understood
a safe local setup has been documented
at least one limited workflow has been tested
a threat model has been written
human approval boundaries are defined
OpenClaw is understood as agent infrastructure, not magic

Layer 7 — Evals, Testing, and Observability

Purpose

AI systems must be evaluated.

Without evaluation, AI engineering becomes vibes.

You cannot improve what you do not measure.

OpenAI’s evals documentation describes a three-step process for building and running evals for LLM applications, and OpenAI’s agent-evals documentation covers traces, graders, datasets, and eval runs for improving agent quality. (OpenAI Developers)

Topics

test datasets
golden examples
unit tests around prompts
regression tests
LLM-as-judge
human grading
retrieval metrics
answer faithfulness
tool-call accuracy
latency
cost
refusal behavior
hallucination tracking
trace inspection
failure taxonomies
prompt versioning
model comparison

Required Projects

Build:

Prompt regression test suite
RAG evaluation dataset
Human grading spreadsheet
LLM-as-judge experiment
Agent trace analysis
Model comparison report
Cost/latency dashboard
Failure taxonomy document
Evaluation-driven prompt improvement project
Before/after AI system quality report

Completion Standard

This layer is complete when:

AI outputs are no longer judged only by feeling
eval datasets exist
prompts are versioned
regressions are caught
model changes are compared
RAG retrieval is tested
agent trajectories are inspected
failure modes are categorized
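A prompt regression suite can start as a list of golden examples, a grader, and a comparison between two versions. The sketch below is one minimal way to structure that, assuming each "system" is just a function from input text to output text; the grader, dataset, and canned answers are invented for illustration, and real graders are often fuzzier than exact match.

```python
def exact_match(output, expected):
    """Simplest possible grader: case-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(system, dataset, grader):
    """Run every case through the system and grade the output."""
    results = []
    for case in dataset:
        output = system(case["input"])
        results.append(
            {"id": case["id"], "output": output, "passed": grader(output, case["expected"])}
        )
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results


def find_regressions(old_results, new_results):
    """Case ids that passed under the old version and fail under the new one."""
    old_passed = {r["id"] for r in old_results if r["passed"]}
    return [r["id"] for r in new_results if r["id"] in old_passed and not r["passed"]]


golden = [
    {"id": "g1", "input": "2+2", "expected": "4"},
    {"id": "g2", "input": "capital of France", "expected": "Paris"},
    {"id": "g3", "input": "3*3", "expected": "9"},
]
# Stand-ins for two prompt versions: canned answers instead of model calls.
v1_answers = {"2+2": "4", "capital of France": "paris ", "3*3": "6"}
v2_answers = {"2+2": "5", "capital of France": "Paris", "3*3": "9"}

rate_v1, results_v1 = run_eval(v1_answers.get, golden, exact_match)
rate_v2, results_v2 = run_eval(v2_answers.get, golden, exact_match)
print(find_regressions(results_v1, results_v2))  # ['g1']
```

Both versions pass two of three cases here, so the aggregate pass rate hides the change; the per-case comparison is what surfaces that version 2 broke `g1` while fixing `g3`.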

Layer 8 — Hugging Face and Open-Source Models

Purpose

Closed model APIs are useful, but serious AI work also requires familiarity with open-source models.

Open-source models give direct exposure to tokenizers, model weights, inference pipelines, fine-tuning, hardware constraints, and the wider ML ecosystem.

Hugging Face Transformers provides pipelines for inference tasks such as text generation, image segmentation, automatic speech recognition, document question answering, sentiment analysis, feature extraction, and question answering. (Hugging Face)

Topics

model hub
model cards
datasets
tokenizers
pipelines
inference
text classification
embeddings
question answering
text generation
model loading
GPU memory
quantization basics
local inference
licensing
safety notes
benchmarking

Required Projects

Build:

Sentiment analysis with pipeline
Text classification with open model
Embedding comparison notebook
Local text-generation demo
Model-card reading exercise
Tokenizer visualization notebook
Open-source RAG system
Open-source summarizer
Model benchmark notebook
Closed vs open model comparison report

Completion Standard

This layer is complete when:

Hugging Face pipelines are usable
model cards can be read critically
tokenization is understood at a basic level
open models can be run locally or in notebooks
hardware limits are understood
model selection is justified by task, cost, quality, and constraints
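One way to make "hardware limits are understood" concrete is a back-of-envelope estimate of the memory the weights alone require: parameter count times bytes per parameter. The sketch below does that arithmetic. The 7-billion-parameter figure is only an example size, and the estimate ignores activations, KV cache, and runtime overhead, so real usage is higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params, dtype):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 7_000_000_000  # example model size, not a specific model
for dtype in ("fp32", "fp16", "int8", "int4"):
    print(dtype, weight_memory_gb(n_params, dtype))
# fp32 28.0
# fp16 14.0
# int8 7.0
# int4 3.5
```

The same model drops from 28 GB to 3.5 GB of weights between fp32 and 4-bit storage, which is the basic motivation for the quantization topic above.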

Layer 8.5 — Local LLMs, Ollama, and Private AI Experimentation

Purpose

Ollama belongs in the AI roadmap as the main practical tool for running large language models locally.

The purpose of learning Ollama is not merely to chat with local models.

The purpose is to understand local inference, open-source model behavior, privacy tradeoffs, offline experimentation, embeddings, RAG, structured outputs, tool use, and the limits of running AI on personal hardware.

Ollama should be treated as the bridge between:

open-source models
local AI experimentation
private document assistants
local RAG systems
model comparison
embeddings
structured outputs
offline AI workflows
lightweight AI deployment experiments

Ollama’s API allows models to be run and interacted with programmatically, and its embeddings capability can generate vectors for semantic search, retrieval, and RAG pipelines.

What to Learn

Topics:

installing and running Ollama
pulling models
listing local models
model sizes and hardware limits
local inference
prompt testing
REST API usage
Python/JavaScript integration
embeddings
local semantic search
local RAG
structured outputs
JSON schema outputs
tool/function calling limits
context window limits
latency
memory usage
CPU vs GPU performance
model comparison
privacy and data boundaries

Ollama also supports structured outputs, allowing model responses to be constrained to a JSON schema, which is useful for document parsing, extraction, structured responses, and more reliable AI application behavior.
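A first "Ollama API caller in Python" can be written with the standard library only. The sketch below assumes Ollama's local REST endpoint `POST /api/generate` on the default address `http://localhost:11434`, with `model`, `prompt`, `stream`, and `format` request fields and a `response` field in the reply; these should be checked against the current Ollama API documentation before relying on them. The function names and the model name in the usage note are placeholders.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, schema=None, url=OLLAMA_URL):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if schema is not None:
        # Assumption: a JSON schema passed as "format" constrains the output.
        payload["format"] = schema
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, schema=None, timeout=60):
    """Send the request to a running Ollama server and return the reply text."""
    request = build_generate_request(model, prompt, schema)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.loads(reply.read().decode("utf-8"))
    return body["response"]
```

With a server running and a model pulled, a call would look like `generate("your-model-name", "Summarize: ...")`. Separating request construction from sending keeps the payload testable without a server, and even schema-constrained replies should still be parsed and validated in code.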

Required Projects

Build:

Local model playground
Ollama API caller in Python
Ollama API caller in TypeScript
Local summarizer
Local structured-data extractor
Local embeddings demo
Local semantic search over notes
Local RAG assistant over personal documents
OpenAI API vs Ollama comparison
Local model benchmark report

Artifact Requirements

Each Ollama project should include:

model used
model size
hardware used
latency notes
memory notes
prompt examples
structured-output examples if relevant
failure cases
comparison with cloud models where useful
privacy notes
README
limitations

Completion Standard

This layer is complete when:

● local models can be run confidently
● Ollama can be called from code
● embeddings can be generated locally
● a local RAG system can be built
● structured outputs can be tested
● model quality, latency, and hardware limits can be explained
● Ollama is understood as a local AI engineering tool, not just a chatbot

Layer 9 — Fine-Tuning, PEFT, and LoRA

Purpose

Fine-tuning is used when prompting and retrieval are not enough.

But fine-tuning should not be the default solution.

First ask:

● Can the task be solved with better prompting?
● Can it be solved with retrieval?
● Can it be solved with deterministic code?
● Is there enough data?
● Is the behavior stable enough to learn?
● How will improvement be evaluated?

OpenAI’s fine-tuning documentation describes fine-tuning as taking a base model, providing examples of expected inputs and outputs, and producing a model that performs better for the target task. (OpenAI Developers)

PEFT and LoRA

PEFT stands for parameter-efficient fine-tuning.

Hugging Face documents PEFT as adapting large pretrained models without fine-tuning all parameters, reducing computational and storage costs while maintaining comparable performance in many cases. (Hugging Face)

LoRA is one of the most important PEFT methods.

The original LoRA paper proposes freezing pretrained model weights and injecting trainable low-rank decomposition matrices into transformer layers, greatly reducing trainable parameters for downstream tasks. (arXiv)
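The parameter saving follows from simple arithmetic. For a frozen weight matrix W0 of shape d x k, LoRA trains two small matrices, B (d x r) and A (r x k), and computes W0 x + (alpha / r) B A x, so the trainable count is r(d + k) instead of d k. The sketch below works through that count and a tiny forward pass in plain Python; the dimensions and values are illustrative, and this is not the PEFT library's implementation.

```python
def lora_trainable_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha, r):
    """Compute W0 x + (alpha / r) * B (A x) without forming the d x k update."""
    base = matvec(w0, x)               # frozen pretrained path
    update = matvec(b, matvec(a, x))   # low-rank trainable path
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]


d = k = 4096
r = 8
print(lora_trainable_params(d, k, r))            # 65536
print(lora_trainable_params(d, k, r) / (d * k))  # 0.00390625

w0 = [[1, 0], [0, 1]]
a = [[1, 1]]       # r x k with r = 1
b = [[0], [0]]     # d x r, zero as at the start of training
print(lora_forward(w0, a, b, [1, 2], alpha=2, r=1))  # [1.0, 2.0]
```

For a 4096 x 4096 matrix at rank 8, the adapter trains under 0.4% of the parameters a full update would. With B at zero, the output equals the frozen model's output, which matches the paper's choice to initialize B to zero so training starts from the pretrained behavior.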

Topics

supervised fine-tuning
dataset preparation
instruction tuning
train/validation splits
formatting examples
evaluation before training
evaluation after training
overfitting
catastrophic forgetting basics
adapters
LoRA
QLoRA later
PEFT
hyperparameters
GPU memory constraints
model checkpoints
model deployment
model comparison

Required Projects

Build:

Fine-tuning dataset formatter
Small text classifier fine-tuning project
Instruction dataset cleaning project
LoRA fine-tuning notebook
Before/after evaluation report
Overfitting demonstration
Prompting vs RAG vs fine-tuning comparison
Domain-specific assistant fine-tune experiment
Cost and hardware report
Model card for your fine-tuned model

Completion Standard

This layer is complete when:

fine-tuning is not used blindly
training data is inspected
evaluation exists before training
before/after performance is compared
LoRA is understood conceptually
fine-tuned models are documented
limitations and risks are stated clearly

Layer 10 — Deployment, Inference, and Optimization

Purpose

AI systems must eventually run somewhere.

A notebook is not a product.

A model demo is not a production system.

This layer is about serving AI systems reliably, economically, and safely.

Topics

API deployment
model serving
batching
streaming
latency
caching
retries
timeouts
rate limits
cost tracking
GPU vs CPU inference
quantization basics
monitoring
logging
model fallback
prompt/version management
deployment security
privacy boundaries
data retention
user feedback loops

Required Projects

Build:

AI API endpoint
Streaming LLM response app
RAG API service
Background summarization worker
Cost/latency tracker
Prompt version manager
Model fallback system
AI app with logging and feedback
AI deployment runbook
Production-readiness checklist

Completion Standard

This layer is complete when:

AI systems can be deployed
latency and cost are tracked
retries and failures are handled
logs are useful
user feedback is collected
model behavior can be monitored
deployment decisions are documented
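"Retries and failures are handled" usually means bounded retries with increasing delays, plus a fallback when the primary model stays unavailable. The sketch below shows one common shape for that, exponential backoff followed by a fallback call. The function names are chosen for this example, and catching every `Exception` is a simplification: production code would retry only on errors known to be transient, such as timeouts and rate limits.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn; on failure wait base_delay, then 2x, 4x, ... before retrying."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # simplification: see the note above
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


def call_with_fallback(primary, fallback, **retry_kwargs):
    """Try the primary callable with retries, then the fallback with retries."""
    try:
        return call_with_retries(primary, **retry_kwargs), "primary"
    except Exception:
        return call_with_retries(fallback, **retry_kwargs), "fallback"
```

Passing `sleep` as a parameter lets tests record the delays instead of waiting, and returning which path answered ("primary" or "fallback") gives the logging layer something to track when measuring how often the fallback is used.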

Layer 11 — AI Research Paper Reading and Reproduction

Purpose

The long-term goal is not only to use AI tools.

The goal is to understand AI research deeply enough to reproduce papers, critique methods, and eventually contribute original work.

This requires math, coding, patience, and writing.

Paper Reading Method

For each paper, produce:

Citation
Problem statement
Main claim
Prior work
Method
Dataset
Experiments
Metrics
Results
Limitations
What you understood
What you did not understand
Implementation notes
Reproduction plan
Possible extension

Paper Reproduction Ladder

Start small.

Reproduce a simple ML paper result
Reimplement a known algorithm
Reproduce a small deep learning experiment
Reproduce an NLP paper component
Reproduce a RAG evaluation method
Reproduce a LoRA-style fine-tuning experiment
Reproduce an ablation table
Write a failed reproduction report
Extend a paper with a small experiment
Publish a technical report or preprint

Completion Standard

This layer is complete when:

papers can be read structurally
equations are not skipped blindly
methods can be translated into code
experiments can be partially reproduced
failed reproductions are documented honestly
paper notes become research ideas

6. AI Project Ladder

The AI project ladder should move from small experiments to serious systems.

Level 1 — Small AI Utilities Purpose: learn model APIs and basic workflows.

Examples:

summarizer
flashcard generator
grammar assistant
study question generator
code explainer
text classifier
document tagger
meeting note cleaner
simple chatbot
prompt playground

Requirements:

README
prompt examples
failure cases
limitations
small test set

Level 2 — Structured AI Applications Purpose: build AI features inside proper software.

Examples:

AI study planner
AI writing critic
AI code review tool
AI document organizer
AI research assistant
AI email assistant
AI task prioritizer
AI flashcard/Anki generator
AI PDF summarizer
AI legal/marine-insurance study helper with strict source grounding

Requirements:

frontend
backend
model API
structured output validation
logging
tests
README
user flow
limitations
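
Of the requirements above, structured output validation is the easiest to skip and the easiest to sketch. The example below assumes the model was asked to return a JSON object with `question`, `answer`, and `difficulty` fields; that schema and the `parse_flashcard` helper are hypothetical, and only the standard library is used.

```python
import json
from dataclasses import dataclass


@dataclass
class Flashcard:
    question: str
    answer: str
    difficulty: int  # 1 (easy) to 5 (hard)


def parse_flashcard(raw: str) -> Flashcard:
    """Parse model output and reject anything that does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in ("question", "answer"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{key}' must be a non-empty string")
    difficulty = data.get("difficulty")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(difficulty, bool) or not isinstance(difficulty, int):
        raise ValueError("'difficulty' must be an integer")
    if not 1 <= difficulty <= 5:
        raise ValueError("'difficulty' must be between 1 and 5")
    return Flashcard(data["question"], data["answer"], difficulty)
```

A rejected output is useful data: logging the raw string alongside the `ValueError` message gives a ready-made list of failure cases for the README.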

Level 3 — RAG and Knowledge Systems Purpose: ground AI in documents and sources.

Examples:

personal knowledge assistant
research paper assistant
ICS study document assistant
electronics datasheet assistant
quantum paper search assistant
company knowledge assistant
bug bounty notes assistant
legal clause search tool
technical documentation Q&A system
multi-document source-grounded tutor
local Ollama-powered RAG assistant
private document assistant using local embeddings
cloud-model vs local-model RAG comparison

Requirements:

ingestion pipeline
chunking
embeddings
vector search
source citations
retrieval evaluation
answer evaluation
failure analysis
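
Two of the requirements above, chunking and retrieval evaluation, can be prototyped before any embedding model or vector store is chosen. The sketch below uses fixed-size character chunks with overlap and a recall@k metric; the function names and the chunk IDs in the usage example are hypothetical, and character-based chunking is just one simple strategy among several.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("need at least one relevant id")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Usage: the retriever returned three chunk IDs; two chunks were truly relevant.
print(chunk_text("abcdefghij", size=4, overlap=2))
print(recall_at_k(["c1", "c7", "c3"], {"c3", "c9"}, k=3))
```

Computing recall@k over a small hand-labeled question set separates retrieval failures from generation failures, which is the distinction the requirements list draws between retrieval evaluation and answer evaluation.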

Level 4 — Agentic Workflows Purpose: build multi-step AI systems.

Examples:

research workflow agent
coding workflow agent
study planning agent
customer support triage agent
bug bounty recon note organizer
document-processing pipeline agent
AI project manager with human approval
AI lab assistant for electronics notes
AI paper-reading workflow
AI curriculum planner
OpenClaw personal assistant workflow
OpenClaw safety and tool-permission experiment
OpenClaw messaging-interface automation prototype

Requirements:

tools
state
logs
human approval points
failure handling
evals
trace analysis
security boundaries
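
The requirements above (tools, logs, human approval points, failure handling) can be combined in one small wrapper around tool calls. This is a sketch under stated assumptions: `Tool`, `ToolRunner`, and the `approve` callback are hypothetical names, and in a real system `approve` would prompt a person rather than run a lambda.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    needs_approval: bool = False  # True for risky actions (send, delete, pay)


@dataclass
class ToolRunner:
    approve: Callable[[str, str], bool]  # (tool name, argument) -> allowed?
    log: list[dict] = field(default_factory=list)

    def call(self, tool: Tool, arg: str) -> str:
        """Run a tool, gating risky ones on approval and logging every outcome."""
        if tool.needs_approval and not self.approve(tool.name, arg):
            self.log.append({"tool": tool.name, "arg": arg, "status": "denied"})
            return "DENIED: human reviewer rejected this action"
        try:
            result = tool.run(arg)
            status = "ok"
        except Exception as exc:  # a failing tool must not crash the agent loop
            result = f"ERROR: {exc}"
            status = "error"
        self.log.append({"tool": tool.name, "arg": arg, "status": status})
        return result


# Usage: reads run freely; sending email is blocked by a reviewer who says no.
runner = ToolRunner(approve=lambda name, arg: False)
search = Tool("search_notes", lambda q: f"3 notes match '{q}'")
send = Tool("send_email", lambda body: "sent", needs_approval=True)
print(runner.call(search, "LoRA"))
print(runner.call(send, "hello"))
```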

Level 5 — Fine-Tuning and Model Adaptation Purpose: adapt models for specific behavior.

Examples:

domain-specific classifier
writing-style classifier
support-ticket router
study-question quality classifier
fine-tuned small model for structured extraction
LoRA experiment on small open model
domain-specific assistant experiment
prompt vs RAG vs fine-tune comparison
evaluation report
model card

Requirements:

dataset
training script/notebook
evaluation set
before/after comparison
failure analysis
model card
reproducibility notes

Level 6 — Research Reproduction and Original Work Purpose: move toward research contribution.

Examples:

reproduce a RAG evaluation paper
reproduce a small transformer experiment
reproduce a LoRA experiment
compare chunking strategies
compare embedding models
evaluate hallucination mitigation methods
test agent failure modes
study prompt robustness
write a review paper
publish an experimental report
local LLM evaluation report using Ollama
agent safety case study using OpenClaw
comparison of cloud agents vs self-hosted agents

Requirements:

paper notes
code
dataset
reproduction attempt
results
limitations
writeup
possible extensions

7. GitHub Strategy for AI

AI GitHub work must be serious.

Do not fill GitHub with empty “AI wrapper” projects.

Each AI repo should show:

problem statement
model used
why that model was chosen
data used
prompt or system design
architecture
evaluation method
failure cases
cost/latency notes
limitations
setup instructions
reproducibility notes
screenshots or demo
future improvements

AI Repository Categories Create several categories of AI repos:

ai-experiments — notebooks and small experiments
llm-apps — practical AI applications
rag-lab — retrieval and document-grounded systems
agent-lab — agent workflows and tool-use systems
deep-learning-lab — PyTorch/TensorFlow projects
fine-tuning-lab — LoRA, PEFT, and fine-tuning experiments
paper-reproductions — research paper implementations
ai-evals — evaluation datasets, graders, and reports
ai-safety-notes — responsible AI and failure-mode analysis
ai-product-case-studies — full writeups of AI products

The GitHub goal is:

Make it obvious that AI is not being used as magic. It is being engineered, evaluated, documented, and understood.

8. Responsible AI and Safety Layer

Responsible AI is not optional.

AI systems can mislead people, leak data, amplify bias, produce false confidence, and fail unpredictably. NIST’s AI Risk Management Framework was developed to help manage risks to individuals, organizations, and society, and its trustworthiness characteristics include validity and reliability, safety, security and resilience, accountability and transparency, explainability and interpretability, privacy enhancement, and fairness with harmful bias managed. (NIST)

Responsible AI Checklist For every serious AI project, ask:

What harm could this cause?
What happens if the output is wrong?
Who might overtrust it?
What data is being used?
Is private information involved?
Are sources shown?
Are limitations shown?
Can the user verify the output?
Is there a human review step?
What logs are stored?
What should not be stored?
What biases might appear?
How will failures be reported?
How will the system be improved?

Required Artifact Create a responsible-AI review for every serious AI project.

It should include:

intended use
prohibited use
data sources
privacy concerns
failure modes
evaluation method
human review requirements
user-facing limitations
security notes
improvement plan

Standard An AI system is not complete until its risks and limitations are documented.

9. How AI Should Be Used to Learn AI

This is a special case.

You are allowed to use AI heavily while learning AI.

But the usage must be disciplined.

Correct Use Ask AI to:

explain concepts at multiple levels
quiz you
generate exercises
review your code
compare frameworks
explain papers section by section
generate implementation plans
create debugging hypotheses
produce failure-mode checklists
help design evals
challenge your assumptions

Incorrect Use Do not ask AI to:

read a paper so you do not have to
write code you cannot explain
generate fake experiment results
create citations without verification
invent benchmarks
claim a model improved without evals
write research conclusions before results exist

The AI-Learning Rule For every AI explanation, produce your own artifact.

Examples:

concept note
code implementation
experiment
diagram
quiz answers
paper summary
evaluation dataset
failure analysis

10. Common AI Traps

Trap 1 — Prompt Engineering as Identity Prompting is useful, but it is not enough.

Rule:

    Learn prompting, then move into systems, tools, data, evals, and model behavior.

Trap 2 — Wrappers Without Engineering Many AI apps are just a textbox connected to an API.

That is not enough.

Rule:

    Add structure, workflow, memory, retrieval, evaluation, and product usefulness.

Trap 3 — No Evaluation If there is no eval, there is no engineering.

Rule:

    Every serious AI system needs test cases.
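
A test case for an AI system can be as small as an input plus a check on the output. The harness below is a minimal sketch: `run_evals` is a hypothetical helper, the "system" is assumed to be any callable from string to string, and the toy system in the usage example stands in for a real model call.

```python
from typing import Callable


def run_evals(system: Callable[[str], str], cases: list[dict]) -> dict:
    """Run each case through the system and collect failures for inspection."""
    failures = []
    for case in cases:
        output = system(case["input"])
        if not case["check"](output):
            failures.append(
                {"name": case["name"], "input": case["input"], "output": output}
            )
    total = len(cases)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": failures,
    }


# Usage with a toy "system" that only uppercases its input:
cases = [
    {"name": "keeps content", "input": "hello", "check": lambda out: "HELLO" in out},
    {"name": "stays short", "input": "hello", "check": lambda out: len(out) < 3},
]
report = run_evals(lambda text: text.upper(), cases)
print(report["pass_rate"], [f["name"] for f in report["failures"]])
```

Keeping the failing inputs and outputs in the report matters more than the pass rate itself, because those records feed directly into failure analysis.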

Trap 4 — RAG Without Retrieval Inspection A RAG system can fail because retrieval is bad, even if the model is good.

Rule:

    Always inspect retrieved chunks.

Trap 5 — Agents for Everything Agents are not always needed.

Rule:

    Use deterministic code where deterministic code is enough.

Trap 6 — Fine-Tuning Too Early Fine-tuning is often not the first solution.

Rule:

    Try prompting, structured outputs, retrieval, and better workflow before fine-tuning.

Trap 7 — No Data Discipline Bad data creates bad AI systems.

Rule:

    Inspect, clean, split, version, and document datasets.
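
Two parts of this rule, splitting and versioning, can be made deterministic with hashing. The sketch below is one possible approach, not the only one: `assign_split` and `dataset_fingerprint` are hypothetical helpers, and the example assumes every row has a stable ID and is JSON-serializable.

```python
import hashlib
import json


def assign_split(example_id: str, test_fraction: float = 0.2) -> str:
    """Assign an example to train or test based only on a hash of its ID.

    The same ID always lands in the same split, even after the dataset grows.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def dataset_fingerprint(rows: list[dict]) -> str:
    """Short content hash to record next to every experiment result."""
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Usage: any edit to the data changes the fingerprint.
rows = [{"id": "q1", "text": "What is LoRA?"}, {"id": "q2", "text": "Define RAG."}]
print({r["id"]: assign_split(r["id"]) for r in rows})
print(dataset_fingerprint(rows))
```

Hash-based splitting avoids a common leak: with a random shuffle, adding new rows can silently move old test examples into the training set.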

Trap 8 — Believing Model Output Because It Sounds Good Language models can sound confident while being wrong.

Rule:

    Verify important outputs against sources, tests, or reality.

11. First 17 Serious AI Artifacts

These are the first serious AI artifacts to build.

Artifact 1 — AI Usage Constitution A written rulebook for using AI without destroying learning.

Artifact 2 — Python AI Experiment Template A reusable project template for notebooks, scripts, data, results, and README files.

Artifact 3 — ML Basics Repository Small classical ML experiments with metrics and explanations.

Artifact 4 — Deep Learning Lab PyTorch and Keras notebooks covering tensors, training loops, image classification, text classification, and model saving.

Artifact 5 — LLM API Playground A clean repo for testing prompts, structured outputs, costs, latency, and model comparisons.

Artifact 6 — Study Flashcard Generator A practical AI tool that converts notes into flashcards, with quality checks.

Artifact 7 — Source-Grounded Document Assistant A RAG system that answers questions from uploaded documents with citations.

Artifact 8 — ICS Revision AI Assistant A study assistant for your ICS-style exam preparation, with strict source grounding and no unsupported answers.

Artifact 9 — AI Evaluation Lab A repo for eval datasets, graders, prompt tests, RAG tests, and model comparisons.

Artifact 10 — Agent Workflow Lab A collection of agents and workflows with tools, logs, human approval points, and failure analysis.

Artifact 11 — Research Paper Tracker AI A tool for storing papers, summaries, tags, claims, methods, and possible research ideas.

Artifact 12 — Hugging Face Open-Model Lab Experiments using open-source models for classification, embeddings, generation, and comparison.

Artifact 13 — LoRA / PEFT Experiment A small, well-documented parameter-efficient fine-tuning experiment.

Artifact 14 — Paper Reproduction Repo A serious attempt to reproduce one AI paper or one part of a paper.

Artifact 15 — AI Product Case Study A full writeup of one AI system covering problem, design, data, model, architecture, evals, failure cases, risks, and improvements.

Artifact 16 — Ollama Local Model Lab A repository for running, comparing, and documenting local models through Ollama.

Includes:

model setup notes
API examples
embedding examples
structured output examples
local RAG demo
latency/memory benchmarks
comparison with cloud models
limitations

Artifact 17 — OpenClaw Agent Infrastructure Study A repository or long-form case study documenting OpenClaw as a personal agent system.

Includes:

setup notes
architecture map
tools/skills/plugins explanation
safe workflow experiments
security threat model
permission boundaries
failure cases
lessons for building future agents

12. When to Move Forward

Do not move forward because you watched videos or copied notebooks.

Move forward when artifacts show competence.

Move past AI tool usage when:

you can explain what AI did and did not do
you verify outputs
you can identify hallucinations
you use AI without outsourcing understanding

Move past Python/data basics when:

datasets can be loaded, cleaned, explored, and visualized
notebooks are reproducible
experiments are organized

Move past ML basics when:

baseline models are built
metrics are understood
overfitting can be diagnosed
error analysis is performed

Move past deep learning basics when:

tensors and training loops are understandable
simple models can be trained
model checkpoints can be saved and loaded
results are evaluated and documented

Move past LLM app basics when:

model APIs are integrated into software
structured outputs are validated
prompts are versioned
failures are documented

Move past RAG basics when:

documents are chunked and indexed
retrieval results are inspected
answers include sources
retrieval and answer quality are evaluated

Move past agents when:

workflows and agents are distinguished
tools are safe and logged
traces can be inspected
human approval exists where needed

Move past fine-tuning basics when:

training data is clean
evaluation exists before and after training
LoRA/PEFT is understood conceptually
model behavior improvements are measured
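
A quick check of conceptual understanding of LoRA is being able to count its trainable parameters. LoRA freezes a weight matrix W of shape d_out × d_in and learns a low-rank update BA, where B has shape d_out × r and A has shape r × d_in. The sketch below compares the two counts for a single layer; `lora_param_counts` is a hypothetical helper and the layer size in the usage example is illustrative only.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> dict:
    """Compare trainable parameters: full fine-tuning vs a LoRA update B @ A."""
    if not 1 <= rank <= min(d_in, d_out):
        raise ValueError("rank must be between 1 and min(d_in, d_out)")
    full = d_in * d_out             # every entry of W is trainable
    lora = rank * (d_in + d_out)    # A is (rank x d_in), B is (d_out x rank)
    return {"full": full, "lora": lora, "fraction": lora / full}


# Usage: one square 4096 x 4096 projection with rank 8.
counts = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(counts)  # {'full': 16777216, 'lora': 65536, 'fraction': 0.00390625}
```

For this layer the adapter trains under half a percent of the parameters that full fine-tuning would, which is why LoRA-style experiments are feasible on modest hardware. Whether that adapter actually improves behavior still has to be measured with the before/after evaluation listed above.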

Move into research when:

papers can be read structurally
code can reproduce parts of papers
failed reproductions can be documented honestly
research questions begin emerging from experiments

13. The AI Standard

The final standard for this domain is:

I can build AI systems that are useful, evaluated, safe enough for their context, documented, and technically understood. I can use existing models, build applications around them, evaluate their behavior, adapt them when justified, and read research papers deeply enough to reproduce and eventually contribute.

AI is not the replacement for the life plan.

AI is one of the tools and domains inside the life plan.

It must make the builder stronger, not weaker.

It must increase contact with reality, not reduce it.

It must help produce better systems, better research, better explanations, better decisions, and better service.