Generative AI projects produce massive amounts of context - training data lineage, model configurations, experiment results, evaluation metrics. GenAI.md gives you structured markdown templates to capture this context so your ML team can iterate faster and avoid repeating mistakes.
Document model architectures, hyperparameter decisions, dataset characteristics, and deployment configurations in versioned markdown. When experiments are captured as structured context, your entire team learns from every iteration.
The gap between ML research and ML production is a context gap. Bridge it with documentation that tracks the full journey from training data to production deployment.
Structure your ML experiments, model configurations, and deployment pipelines as versioned markdown for reproducible, collaborative AI development.
Document each experiment with hypothesis, configuration, results, and learnings in markdown. Failed experiments are valuable context - they prevent your team from re-running dead ends and reveal which variables matter most.
Capture every hyperparameter change with reasoning and results. Create a decision log that explains why learning rate was adjusted, why batch size was changed, and what impact each change had on model performance.
Trace your training data from source to processed dataset. Include cleaning steps, filtering criteria, augmentation techniques, and known biases. Data provenance is critical for debugging model behavior and ensuring compliance.
Define your evaluation suite in markdown - which metrics, which benchmarks, which test sets. Consistent evaluation across experiments enables meaningful comparison. Document why specific metrics were chosen for your use case.
Document the full deployment pipeline - model packaging, serving infrastructure, rollout strategy, canary criteria, and rollback procedures. Production ML failures are expensive - checklists prevent the preventable ones.
Specify model performance thresholds, data drift detection rules, and alerting conditions in markdown. Document what "degraded performance" looks like and the decision tree for when to retrain versus when to roll back.
Capture fairness criteria, bias testing results, and safety constraints for each model. These guardrails should be reviewed with every model update and documented alongside model cards for transparency.
Create a model card for every deployed model - intended use, known limitations, training data summary, performance characteristics, and update history. Model cards are the interface between ML teams and the rest of the organization.
If you cannot reproduce an experiment from its documentation alone, the documentation is incomplete. Every GenAI experiment should be reproducible by a team member who was not involved in the original work. This means documenting not just results, but the complete environment - random seeds, library versions, data snapshots, and hardware configuration. Reproducibility is not overhead; it is the foundation of reliable ML engineering.
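Reproducibility starts with capturing that environment automatically. A minimal sketch of such a snapshot, written alongside each run's results - file paths and field names here are illustrative, not part of any GenAI.md tooling:

```python
# Hypothetical helper: record what a teammate would need to reproduce this run.
import json
import platform
import subprocess
import sys

def snapshot_environment(seed: int, data_snapshot_uri: str) -> dict:
    """Record seed, library versions, data snapshot, and hardware for one run."""
    return {
        "random_seed": seed,
        "python_version": sys.version,
        "platform": platform.platform(),            # OS / CPU architecture
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "installed_packages": subprocess.check_output(
            [sys.executable, "-m", "pip", "freeze"], text=True
        ).splitlines(),
        "data_snapshot": data_snapshot_uri,          # pin the exact dataset version
    }

with open("results/environment.json", "w") as f:
    json.dump(
        snapshot_environment(seed=42, data_snapshot_uri="s3://bucket/dataset-v3.jsonl"),
        f,
        indent=2,
    )
```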
# GenAI.md - Generative AI Project Context
<!-- Configuration, training, evaluation, and deployment for GenAI projects -->
<!-- Covers model setup, data pipelines, experiment tracking, and monitoring -->
<!-- Last updated: YYYY-MM-DD -->
## Model Configuration
### Primary Model
**Project**: Claro - Customer Support AI Agent
**Model**: Claude 3.5 Sonnet (via Anthropic API)
**Use Case**: Automated customer support ticket triage and response drafting
**Environment**: Production - handling 2,400 tickets/day across 3 product lines
### Model Selection Rationale
We evaluated 5 models over 3 weeks using a benchmark of 500 labeled support tickets:
| Model | Accuracy | Latency (p95) | Cost/1K tickets | Selected |
|-------|----------|---------------|-----------------|----------|
| Claude 3.5 Sonnet | 94.2% | 1.8s | $4.20 | Yes |
| GPT-4o | 93.8% | 2.1s | $6.80 | No |
| GPT-4o-mini | 88.1% | 0.9s | $0.90 | Fallback |
| Claude 3.5 Haiku | 89.5% | 0.7s | $1.10 | Triage only |
| Llama 3 70B | 86.3% | 3.4s | $0.00 (self-hosted) | No |
**Decision**: Claude 3.5 Sonnet for response drafting (accuracy matters most). Claude 3.5 Haiku for initial ticket triage (speed matters, accuracy acceptable). GPT-4o-mini as fallback if Anthropic API is unavailable.
### API Configuration
```python
# config/model_config.py
import os

import anthropic

client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

# Primary model - response drafting
PRIMARY_CONFIG = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "temperature": 0.3,  # Low temp for consistent, accurate responses
    "top_p": 0.9,
    "stop_sequences": ["---"],  # Stop at section delimiter
}

# Fast model - ticket triage and classification
TRIAGE_CONFIG = {
    "model": "claude-3-5-haiku-20241022",
    "max_tokens": 256,
    "temperature": 0.0,  # Zero temp for deterministic classification
}

# Rate limiting
RATE_LIMITS = {
    "requests_per_minute": 1000,
    "tokens_per_minute": 400000,
    "max_concurrent": 50,
    "retry_attempts": 3,
    "retry_backoff_base": 2,  # Exponential: 2s, 4s, 8s
}
```
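Building on the configs above, a minimal sketch of the triage-then-draft routing and provider fallback described in the selection rationale - the `draft_with_gpt4o_mini` helper is hypothetical and not shown, and production error handling is more involved:

```python
# Sketch only: Haiku classifies, Sonnet drafts, GPT-4o-mini is the provider fallback.
def handle_ticket(ticket_text: str, product_line: str) -> str:
    # 1. Classify with the cheap triage model
    triage_msg = client.messages.create(
        system=(
            "Classify this ticket into one of: billing, technical_bug, "
            "feature_request, how_to, account_access. Reply with the label only."
        ),
        messages=[{"role": "user", "content": ticket_text}],
        **TRIAGE_CONFIG,
    )
    category = triage_msg.content[0].text.strip()

    # 2. Draft with the primary model; fall back if the Anthropic API is unavailable
    try:
        draft_msg = client.messages.create(
            system=build_system_prompt(product_line, category),  # see prompts/system_prompt.py
            messages=[{"role": "user", "content": ticket_text}],
            **PRIMARY_CONFIG,
        )
        return draft_msg.content[0].text
    except (anthropic.APIConnectionError, anthropic.APIStatusError):
        return draft_with_gpt4o_mini(ticket_text, category)  # hypothetical fallback helper
```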
### Cost Management
```python
# Budget tracking and alerting
BUDGET = {
"monthly_limit_usd": 3000,
"daily_alert_threshold_usd": 150, # Alert if daily spend exceeds this
"cost_per_input_token": 0.003 / 1000, # Sonnet pricing
"cost_per_output_token": 0.015 / 1000,
}
# Cost optimization strategies in production:
# 1. Triage with Haiku first ($1.10/1K tickets vs $4.20) - only send complex tickets to Sonnet
# 2. Cache identical ticket responses (Redis, 1-hour TTL)
# 3. Truncate ticket history to last 5 messages (reduces input tokens by ~40%)
# 4. Use stop_sequences to prevent overly long responses
```
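A minimal sketch of strategies 2 and 3 above - a Redis response cache with a 1-hour TTL plus history truncation to the last 5 messages. The `generate_draft` wrapper and the cache key format are illustrative:

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_draft(ticket_text: str, history: list[dict]) -> str:
    history = history[-5:]  # strategy 3: truncate history to cut input tokens
    key = "draft:" + hashlib.sha256(
        (ticket_text + json.dumps(history, sort_keys=True)).encode()
    ).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()                    # strategy 2: identical tickets reuse the draft
    draft = generate_draft(ticket_text, history)  # hypothetical wrapper around the Sonnet call
    cache.set(key, draft, ex=3600)                # 1-hour TTL
    return draft
```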
## Training Data and Prompts
### System Prompt Architecture
The system prompt is assembled from modular components at request time:
```python
# prompts/system_prompt.py
def build_system_prompt(product_line: str, ticket_category: str) -> str:
    """Assemble the system prompt from components."""
    return f"""
{ROLE_DEFINITION}
{PRODUCT_CONTEXT[product_line]}
{RESPONSE_GUIDELINES}
{CATEGORY_SPECIFIC_INSTRUCTIONS[ticket_category]}
{SAFETY_GUARDRAILS}
"""
ROLE_DEFINITION = """
You are a customer support agent for Claro, a project management platform.
Your role is to draft helpful, accurate responses to customer support tickets.
You have access to the customer's account information, subscription tier, and
recent activity. Use this context to personalize your responses.
Important:
- Be empathetic but concise. Aim for 2-4 paragraphs maximum.
- If you are not confident in your answer, say so and escalate to a human agent.
- Never make up product features or pricing. Only reference verified information.
- Never share one customer's data with another customer.
"""
RESPONSE_GUIDELINES = """
Tone: Professional, warm, solution-oriented
Structure:
1. Acknowledge the customer's issue (1 sentence)
2. Provide the solution or explanation (1-2 paragraphs)
3. Offer next steps or ask a clarifying question (1 sentence)
4. Close with availability to help further (1 sentence)
Do NOT:
- Use marketing language or upsell
- Promise features that do not exist
- Provide legal, medical, or financial advice
- Share internal system details or architecture
"""
```
### Few-Shot Examples
```python
# prompts/examples.py
# These examples are injected into the prompt for consistent output formatting
EXAMPLES = [
{
"ticket": "I cannot export my project data as CSV. The button is grayed out.",
"context": "Customer is on Free tier. CSV export is a Pro feature.",
"category": "feature_access",
"response": (
"I understand the frustration of not being able to export your data "
"when you need it.\n\n"
"CSV export is available on our Pro plan and above. Your account is "
"currently on the Free tier, which includes JSON export. You can "
"access JSON export from Project Settings > Data > Export.\n\n"
"If you would like to try CSV export, you can start a 14-day free "
"trial of Pro from your Account Settings page. No credit card "
"required.\n\n"
"Let me know if you have any questions about the export options "
"or the Pro plan features."
),
"draft_quality": "approved",
},
{
"ticket": "Your app charged me twice this month. I want a refund immediately.",
"context": "Billing shows two charges: subscription renewal + add-on purchase.",
"category": "billing",
"response": (
"I completely understand your concern about seeing two charges, and "
"I want to help resolve this right away.\n\n"
"Looking at your account, I can see two separate charges this month: "
"your regular subscription renewal of $29/month (charged on the 1st) "
"and a one-time charge of $15 for the additional storage add-on you "
"purchased on the 12th. These are two distinct charges, not a "
"duplicate billing.\n\n"
"If the add-on purchase was not intentional, I can process a refund "
"for that $15 charge and remove the add-on from your account. Would "
"you like me to go ahead with that?\n\n"
"I am here to help if you have any other questions about your billing."
),
"draft_quality": "approved",
},
]
```
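One plausible way to inject these examples is as prior user/assistant turns ahead of the live ticket; a minimal sketch, with the message formatting assumed rather than taken from the production code:

```python
def build_messages(ticket_text: str, context: str) -> list[dict]:
    """Prepend the approved examples as prior turns, then add the live ticket."""
    messages = []
    for ex in EXAMPLES:
        messages.append({
            "role": "user",
            "content": f"Ticket: {ex['ticket']}\nContext: {ex['context']}",
        })
        messages.append({"role": "assistant", "content": ex["response"]})
    messages.append({
        "role": "user",
        "content": f"Ticket: {ticket_text}\nContext: {context}",
    })
    return messages
```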
### Evaluation Dataset
```python
# data/eval_dataset.py
# Gold-standard evaluation set: 500 tickets with human-verified responses
# Updated monthly by the support team lead
EVAL_DATASET = {
"name": "claro-support-eval-v3",
"size": 500,
"format": "JSONL",
"location": "s3://claro-ml/datasets/eval/eval_v3.jsonl",
"categories": {
"billing": 120,
"technical_bug": 150,
"feature_request": 80,
"how_to": 100,
"account_access": 50,
},
"last_updated": "YYYY-MM-DD",
"labeled_by": "Senior support team (3 annotators, majority vote)",
}
```
## Experiment Tracking
### Experiment Log
Track every prompt change, model swap, or parameter adjustment:
| ID | Date | Change | Metric Before | Metric After | Decision |
|----|------|--------|---------------|--------------|----------|
| EXP-042 | YYYY-MM-DD | Added product-specific context to system prompt | 91.2% accuracy | 94.2% accuracy | Shipped |
| EXP-041 | YYYY-MM-DD | Reduced temperature from 0.7 to 0.3 | 89.8% consistency | 96.1% consistency | Shipped |
| EXP-040 | YYYY-MM-DD | Switched from GPT-4o to Claude 3.5 Sonnet | $6.80/1K | $4.20/1K | Shipped |
| EXP-039 | YYYY-MM-DD | Added few-shot examples for billing category | 82% billing accuracy | 91% billing accuracy | Shipped |
| [EXP-ID] | [Date] | [Description of change] | [Before] | [After] | [Ship/Revert/Iterate] |
### Running an Experiment
```bash
# 1. Create experiment branch
git checkout -b exp/EXP-043-add-escalation-rules
# 2. Make prompt/config changes
# Edit prompts/system_prompt.py or config/model_config.py
# 3. Run evaluation against the gold dataset
python scripts/evaluate.py \
--dataset s3://claro-ml/datasets/eval/eval_v3.jsonl \
--config config/model_config.py \
--output results/EXP-043.json
# 4. Compare results against baseline
python scripts/compare_experiments.py \
--baseline results/EXP-042.json \
--experiment results/EXP-043.json
# 5. If results are positive, merge and deploy
# If negative, document findings and revert
```
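A hypothetical sketch of what `scripts/compare_experiments.py` might do in step 4 - the metric names and the regression rule are assumptions, not the actual script:

```python
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--baseline", required=True)
parser.add_argument("--experiment", required=True)
args = parser.parse_args()

with open(args.baseline) as f:
    baseline = json.load(f)
with open(args.experiment) as f:
    experiment = json.load(f)

regressed = False
for metric in ("accuracy", "consistency", "latency_p95", "cost_per_response"):
    before, after = baseline[metric], experiment[metric]
    delta = after - before
    print(f"{metric}: {before} -> {after} ({delta:+.4f})")
    if metric in ("latency_p95", "cost_per_response"):
        regressed = regressed or delta > 0   # these should not go up
    else:
        regressed = regressed or delta < 0   # these should not go down

sys.exit(1 if regressed else 0)  # a nonzero exit can block the merge in CI
```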
## Evaluation Metrics
### Automated Metrics (Run on Every PR)
```python
metrics = {
# Accuracy: Does the response correctly address the ticket?
"accuracy": exact_category_match + response_relevance_score,
# Consistency: Given the same ticket twice, do we get similar responses?
"consistency": cosine_similarity_between_runs,
# Safety: Does the response violate any guardrails?
"safety_violations": count_of_pii_leaks + unauthorized_promises + hallucinations,
# Latency: How fast is the response generated?
"latency_p50": median_response_time_ms,
"latency_p95": p95_response_time_ms,
# Cost: How much does each response cost?
"cost_per_response": (input_tokens * input_price + output_tokens * output_price),
}
```
### Human Evaluation (Monthly)
The support team lead reviews 50 randomly sampled AI-drafted responses each month:
- **Accuracy** (1-5): Is the information correct and complete?
- **Tone** (1-5): Does it match our brand voice? Is it empathetic?
- **Actionability** (1-5): Does it give the customer a clear next step?
- **Safety** (pass/fail): Any policy violations, PII exposure, or hallucinations?
Target scores: Accuracy >= 4.2, Tone >= 4.0, Actionability >= 4.0, Safety = 100% pass
### A/B Testing Framework
```python
# A/B test configuration for prompt changes
ab_test = {
"name": "EXP-043-escalation-rules",
"start_date": "YYYY-MM-DD",
"end_date": "YYYY-MM-DD",
"control": {
"prompt_version": "v2.8",
"traffic_percentage": 50,
},
"treatment": {
"prompt_version": "v2.9-escalation",
"traffic_percentage": 50,
},
"success_metrics": [
"customer_satisfaction_score",
"human_edit_rate", # How often agents modify the AI draft
"response_time_total", # Time from ticket creation to response sent
],
"guardrail_metrics": [
"safety_violation_rate", # Must not increase
"escalation_rate", # Should decrease (fewer wrong auto-responses)
],
}
```
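A minimal sketch of how tickets might be assigned to an arm - hashing the ticket ID keeps each ticket in the same arm across retries. The assignment scheme is an assumption, not the production traffic splitter:

```python
import hashlib

def assign_arm(ticket_id: str, test: dict) -> str:
    """Deterministically bucket a ticket into control or treatment."""
    digest = hashlib.sha256(f"{test['name']}:{ticket_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "control" if bucket < test["control"]["traffic_percentage"] else "treatment"

arm = assign_arm("TICKET-12345", ab_test)
prompt_version = ab_test[arm]["prompt_version"]
```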
## Deployment Pipeline
### Model Deployment Workflow
```
1. Prompt/config change committed to feature branch
2. CI runs automated evaluation against gold dataset
3. Results posted as PR comment (accuracy, latency, cost, safety)
4. Human review of 10 sampled responses (manual spot-check)
5. If metrics meet thresholds -> merge to main
6. Main branch auto-deploys to staging
7. 24-hour soak test on staging with live traffic shadow
8. Manual promotion to production (requires team lead approval)
```
### Rollback Procedure
```bash
# Prompt rollback (instant - config change only)
git revert [commit-hash]
git push origin main
# Config re-deploys in under 2 minutes via CI/CD
# Model rollback (if switching models)
# Update PRIMARY_CONFIG["model"] in config/model_config.py to the previous model ID
# Fallback chain: Sonnet -> Haiku -> GPT-4o-mini -> Human-only mode
```
## Monitoring and Alerting
### Production Dashboards
| Metric | Threshold | Alert Channel |
|--------|-----------|---------------|
| Error rate | > 2% over 5 min | PagerDuty (P2) |
| Latency p95 | > 5s over 10 min | Slack #claro-alerts |
| Safety violations | > 0 per hour | PagerDuty (P1) |
| Daily cost | > $150 | Slack #claro-costs |
| Hallucination rate | > 1% over 1 hour | PagerDuty (P2) |
| API availability | < 99% over 15 min | PagerDuty (P1) |
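These rules could be encoded for whatever metrics pipeline backs the dashboards; a minimal sketch, with the metric reader and alert sender passed in as callables (all names here are illustrative):

```python
import operator

ALERT_RULES = [
    {"metric": "error_rate",         "op": ">", "threshold": 0.02, "window_min": 5,    "route": "pagerduty_p2"},
    {"metric": "latency_p95_s",      "op": ">", "threshold": 5.0,  "window_min": 10,   "route": "slack_claro_alerts"},
    {"metric": "safety_violations",  "op": ">", "threshold": 0,    "window_min": 60,   "route": "pagerduty_p1"},
    {"metric": "daily_cost_usd",     "op": ">", "threshold": 150,  "window_min": 1440, "route": "slack_claro_costs"},
    {"metric": "hallucination_rate", "op": ">", "threshold": 0.01, "window_min": 60,   "route": "pagerduty_p2"},
    {"metric": "api_availability",   "op": "<", "threshold": 0.99, "window_min": 15,   "route": "pagerduty_p1"},
]

OPS = {">": operator.gt, "<": operator.lt}

def evaluate_rules(read_metric, send_alert) -> None:
    """Fire an alert for every rule whose windowed value breaches its threshold."""
    for rule in ALERT_RULES:
        value = read_metric(rule["metric"], rule["window_min"])
        if OPS[rule["op"]](value, rule["threshold"]):
            send_alert(rule["route"], rule["metric"], value)
```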
### Logging Structure
```python
# Every model invocation is logged for debugging and audit
logger.info("model_invocation", extra={
    "request_id": request_id,
    "ticket_id": ticket_id,
    "model": model_name,
    "prompt_version": prompt_version,
    "input_tokens": usage.input_tokens,
    "output_tokens": usage.output_tokens,
    "latency_ms": elapsed_ms,
    "cost_usd": calculated_cost,
    "category": predicted_category,
    "confidence": confidence_score,
    "safety_check": "pass",  # or "fail" with details
})
```
## Safety and Compliance
### Content Guardrails
```python
# Post-processing safety checks applied to every AI response
SAFETY_CHECKS = [
"pii_detection", # Scan for emails, phone numbers, SSNs in output
"competitor_mention", # Flag responses that mention competitor products
"price_verification", # Verify any prices match current pricing table
"feature_verification", # Verify mentioned features actually exist
"tone_check", # Flag aggressive, dismissive, or overly casual tone
"legal_disclaimer_check", # Flag responses that could be construed as legal advice
]
# If any safety check fails:
# 1. Response is NOT sent to customer
# 2. Ticket is routed to human agent queue
# 3. Incident is logged for review
# 4. Alert fires if failure rate exceeds threshold
```
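A minimal sketch of that failure-handling flow - the check registry and the queue/alert helpers are illustrative stand-ins for the real services:

```python
def run_safety_checks(response_text: str, ticket: dict) -> list[str]:
    """Return the names of any checks the drafted response fails."""
    failures = []
    for check_name in SAFETY_CHECKS:
        check_fn = CHECK_REGISTRY[check_name]  # hypothetical: check name -> callable
        if not check_fn(response_text, ticket):
            failures.append(check_name)
    return failures

def dispatch(response_text: str, ticket: dict) -> None:
    failures = run_safety_checks(response_text, ticket)
    if failures:
        route_to_human_queue(ticket)                 # 2. route to a human agent
        log_safety_incident(ticket["id"], failures)  # 3. log the incident for review
        maybe_alert_on_failure_rate()                # 4. alert if the failure rate is too high
    else:
        send_to_customer(ticket["id"], response_text)  # only clean responses are sent
```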
### Data Privacy
- Customer PII is redacted from logs and training data
- Model API calls go through a proxy that strips sensitive fields
- No customer data is used for model fine-tuning without explicit consent
- All AI-generated responses are marked as "AI-drafted" in the support tool
- Responses are stored for 90 days for quality review, then purged
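A minimal sketch of the log-redaction step in the first bullet above - the patterns are simplified and not the full production rule set:

```python
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(text: str) -> str:
    """Replace common PII patterns before text reaches logs or training data."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text
```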
## Troubleshooting
### High Latency
- Check API provider status page for outages
- Review recent prompt changes (longer prompts = slower responses)
- Check if rate limits are being hit (exponential backoff kicks in)
- Consider routing to faster model (Haiku) for simple tickets
### Low Quality Responses
- Review the last 5 experiment changes - was something reverted incorrectly?
- Check if the evaluation dataset is still representative of current tickets
- Look for category drift - are customers asking new types of questions?
- Review few-shot examples - do they still match current product state?
### Cost Spikes
- Check for retry loops (failed requests being retried excessively)
- Look for unusually long tickets inflating input token counts
- Verify caching is working (Redis hit rate should be > 30%)
- Check if the triage model is routing too many tickets to the expensive model
Generative AI projects succeed or fail on context quality. GenAI.md documents your model configurations, training data lineage, and evaluation metrics in markdown. Context flows from data to deployment, and the context behind your AI systems becomes versioned infrastructure.
Every model iteration represents learning. GenAI.md captures experiment results, parameter decisions, and performance trade-offs in structured markdown. Your AI assistants help analyze patterns across experiments. Institutional memory prevents repeated mistakes.
Deploying generative AI requires more than code - you need deployment pipelines, monitoring strategies, and incident response procedures. GenAI.md documents the operational context that keeps AI systems reliable. From training to production, context is infrastructure.
"Generative AI projects generate massive amounts of context - training data, model configs, experiment results, deployment procedures. GenAI.md helps teams structure this context as versioned markdown, making AI projects manageable and knowledge transferable."
Built by ML engineers who know that the hardest part of AI isn't the model - it's the context.
We understand that generative AI projects generate massive amounts of context - training data lineage, model configurations, experiment results, deployment procedures. All of this deserves to be structured as markdown, not scattered across notebooks and Slack. GenAI.md helps ML teams capture their model development journey in a format that both current team members and future AI assistants can learn from.
Our mission is to help AI teams treat their project context as seriously as their model code. When experiment learnings, configuration decisions, and operational procedures live in versioned .md files, AI projects become manageable, knowledge becomes transferable, and teams avoid repeating mistakes. Context is the difference between ML research and ML production.
LLMs parse markdown better than any other format. Fewer tokens, cleaner structure, better results.
Context evolves with code. Git tracks changes, PRs enable review, history preserves decisions.
No special tools needed. Plain text that works everywhere. Documentation humans actually read.
Building generative AI systems? Need advice on model documentation? Share your challenges - we've been there.