Generative AI projects produce massive amounts of context - training data lineage, model configurations, experiment results, evaluation metrics. GenAI.md gives you structured markdown templates to capture this context so your ML team can iterate faster and avoid repeating mistakes.
Document model architectures, hyperparameter decisions, dataset characteristics, and deployment configurations in versioned markdown. When experiments are captured as structured context, your entire team learns from every iteration.
The gap between ML research and ML production is a context gap. Bridge it with documentation that tracks the full journey from training data to production deployment.
Structure your ML experiments, model configurations, and deployment pipelines as versioned markdown for reproducible, collaborative AI development.
Document each experiment with hypothesis, configuration, results, and learnings in markdown. Failed experiments are valuable context - they prevent your team from re-running dead ends and reveal which variables matter most.
Capture every hyperparameter change with reasoning and results. Create a decision log that explains why learning rate was adjusted, why batch size was changed, and what impact each change had on model performance.
Trace your training data from source to processed dataset. Include cleaning steps, filtering criteria, augmentation techniques, and known biases. Data provenance is critical for debugging model behavior and ensuring compliance.
Define your evaluation suite in markdown - which metrics, which benchmarks, which test sets. Consistent evaluation across experiments enables meaningful comparison. Document why specific metrics were chosen for your use case.
Document the full deployment pipeline - model packaging, serving infrastructure, rollout strategy, canary criteria, and rollback procedures. Production ML failures are expensive - checklists prevent the preventable ones.
Specify model performance thresholds, data drift detection rules, and alerting conditions in markdown. Document what "degraded performance" looks like and the decision tree for when to retrain versus when to roll back.
Capture fairness criteria, bias testing results, and safety constraints for each model. These guardrails should be reviewed with every model update and documented alongside model cards for transparency.
Create a model card for every deployed model - intended use, known limitations, training data summary, performance characteristics, and update history. Model cards are the interface between ML teams and the rest of the organization.
If you cannot reproduce an experiment from its documentation alone, the documentation is incomplete. Every GenAI experiment should be reproducible by a team member who was not involved in the original work. This means documenting not just results, but the complete environment - random seeds, library versions, data snapshots, and hardware configuration. Reproducibility is not overhead; it is the foundation of reliable ML engineering.
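Reproducibility starts with capturing that environment automatically. A minimal sketch of such a snapshot, written alongside each run's results - file paths and field names here are illustrative, not part of any GenAI.md tooling:

```python
# Hypothetical helper: record what a teammate would need to reproduce this run.
import json
import platform
import subprocess
import sys

def snapshot_environment(seed: int, data_snapshot_uri: str) -> dict:
    """Record seed, library versions, data snapshot, and hardware for one run."""
    return {
        "random_seed": seed,
        "python_version": sys.version,
        "platform": platform.platform(),            # OS / CPU architecture
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "installed_packages": subprocess.check_output(
            [sys.executable, "-m", "pip", "freeze"], text=True
        ).splitlines(),
        "data_snapshot": data_snapshot_uri,          # pin the exact dataset version
    }

with open("results/environment.json", "w") as f:
    json.dump(
        snapshot_environment(seed=42, data_snapshot_uri="s3://bucket/dataset-v3.jsonl"),
        f,
        indent=2,
    )
```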
# GenAI.md - Generative AI Project Context
<!-- Configuration, training, evaluation, and deployment for GenAI projects -->
<!-- Covers model setup, data pipelines, experiment tracking, and monitoring -->
<!-- Last updated: YYYY-MM-DD -->
## Model Configuration
### Primary Model
**Project**: Claro - Customer Support AI Agent
**Model**: Claude 3.5 Sonnet (via Anthropic API)
**Use Case**: Automated customer support ticket triage and response drafting
**Environment**: Production - handling 2,400 tickets/day across 3 product lines
### Model Selection Rationale
We evaluated 5 models over 3 weeks using a benchmark of 500 labeled support tickets:
| Model | Accuracy | Latency (p95) | Cost/1K tickets | Selected |
|-------|----------|---------------|-----------------|----------|
| Claude 3.5 Sonnet | 94.2% | 1.8s | $4.20 | Yes |
| GPT-4o | 93.8% | 2.1s | $6.80 | No |
| GPT-4o-mini | 88.1% | 0.9s | $0.90 | Fallback |
| Claude 3.5 Haiku | 89.5% | 0.7s | $1.10 | Triage only |
| Llama 3 70B | 86.3% | 3.4s | $0.00 (self-hosted) | No |
**Decision**: Claude 3.5 Sonnet for response drafting (accuracy matters most). Claude 3.5 Haiku for initial ticket triage (speed matters, accuracy acceptable). GPT-4o-mini as fallback if Anthropic API is unavailable.
### API Configuration
```python
# config/model_config.py
import os

import anthropic

client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

# Primary model - response drafting
PRIMARY_CONFIG = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "temperature": 0.3,  # Low temp for consistent, accurate responses
    "top_p": 0.9,
    "stop_sequences": ["---"],  # Stop at section delimiter
}

# Fast model - ticket triage and classification
TRIAGE_CONFIG = {
    "model": "claude-3-5-haiku-20241022",
    "max_tokens": 256,
    "temperature": 0.0,  # Zero temp for deterministic classification
}

# Rate limiting
RATE_LIMITS = {
    "requests_per_minute": 1000,
    "tokens_per_minute": 400000,
    "max_concurrent": 50,
    "retry_attempts": 3,
    "retry_backoff_base": 2,  # Exponential: 2s, 4s, 8s
}
```
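Building on the configs above, a minimal sketch of the triage-then-draft routing and provider fallback described in the selection rationale - the `draft_with_gpt4o_mini` helper is hypothetical and not shown, and production error handling is more involved:

```python
# Sketch only: Haiku classifies, Sonnet drafts, GPT-4o-mini is the provider fallback.
def handle_ticket(ticket_text: str, product_line: str) -> str:
    # 1. Classify with the cheap triage model
    triage_msg = client.messages.create(
        system=(
            "Classify this ticket into one of: billing, technical_bug, "
            "feature_request, how_to, account_access. Reply with the label only."
        ),
        messages=[{"role": "user", "content": ticket_text}],
        **TRIAGE_CONFIG,
    )
    category = triage_msg.content[0].text.strip()

    # 2. Draft with the primary model; fall back if the Anthropic API is unavailable
    try:
        draft_msg = client.messages.create(
            system=build_system_prompt(product_line, category),  # see prompts/system_prompt.py
            messages=[{"role": "user", "content": ticket_text}],
            **PRIMARY_CONFIG,
        )
        return draft_msg.content[0].text
    except (anthropic.APIConnectionError, anthropic.APIStatusError):
        return draft_with_gpt4o_mini(ticket_text, category)  # hypothetical fallback helper
```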
### Cost Management
```python
# Budget tracking and alerting
BUDGET = {
"monthly_limit_usd": 3000,
"daily_alert_threshold_usd": 150, # Alert if daily spend exceeds this
"cost_per_input_token": 0.003 / 1000, # Sonnet pricing
"cost_per_output_token": 0.015 / 1000,
}
# Cost optimization strategies in production:
# 1. Triage with Haiku first ($1.10/1K tickets vs $4.20) - only send complex tickets to Sonnet
# 2. Cache identical ticket responses (Redis, 1-hour TTL)
# 3. Truncate ticket history to last 5 messages (reduces input tokens by ~40%)
# 4. Use stop_sequences to prevent overly long responses
```
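A minimal sketch of strategies 2 and 3 above - a Redis response cache with a 1-hour TTL plus history truncation to the last 5 messages. The `generate_draft` wrapper and the cache key format are illustrative:

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_draft(ticket_text: str, history: list[dict]) -> str:
    history = history[-5:]  # strategy 3: truncate history to cut input tokens
    key = "draft:" + hashlib.sha256(
        (ticket_text + json.dumps(history, sort_keys=True)).encode()
    ).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()                    # strategy 2: identical tickets reuse the draft
    draft = generate_draft(ticket_text, history)  # hypothetical wrapper around the Sonnet call
    cache.set(key, draft, ex=3600)                # 1-hour TTL
    return draft
```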
## Training Data and Prompts
### System Prompt Architecture
The system prompt is assembled from modular components at request time:
```python
# prompts/system_prompt.py
def build_system_prompt(product_line: str, ticket_category: str) -> str:
    """Assemble the system prompt from components."""
    return f"""
{ROLE_DEFINITION}
{PRODUCT_CONTEXT[product_line]}
{RESPONSE_GUIDELINES}
{CATEGORY_SPECIFIC_INSTRUCTIONS[ticket_category]}
{SAFETY_GUARDRAILS}
"""
ROLE_DEFINITION = """
You are a customer support agent for Claro, a project management platform.
Your role is to draft helpful, accurate responses to customer support tickets.
You have access to the customer's account information, subscription tier, and
recent activity. Use this context to personalize your responses.
Important:
- Be empathetic but concise. Aim for 2-4 paragraphs maximum.
- If you are not confident in your answer, say so and escalate to a human agent.
- Never make up product features or pricing. Only reference verified information.
- Never share one customer's data with another customer.
"""
RESPONSE_GUIDELINES = """
Tone: Professional, warm, solution-oriented
Structure:
1. Acknowledge the customer's issue (1 sentence)
2. Provide the solution or explanation (1-2 paragraphs)
3. Offer next steps or ask a clarifying question (1 sentence)
4. Close with availability to help further (1 sentence)
Do NOT:
- Use marketing language or upsell
- Promise features that do not exist
- Provide legal, medical, or financial advice
- Share internal system details or architecture
"""
```
### Few-Shot Examples
```python
# prompts/examples.py
# These examples are injected into the prompt for consistent output formatting
EXAMPLES = [
{
"ticket": "I cannot export my project data as CSV. The button is grayed out.",
"context": "Customer is on Free tier. CSV export is a Pro feature.",
"category": "feature_access",
"response": (
"I understand the frustration of not being able to export your data "
"when you need it.\n\n"
"CSV export is available on our Pro plan and above. Your account is "
"currently on the Free tier, which includes JSON export. You can "
"access JSON export from Project Settings > Data > Export.\n\n"
"If you would like to try CSV export, you can start a 14-day free "
"trial of Pro from your Account Settings page. No credit card "
"required.\n\n"
"Let me know if you have any questions about the export options "
"or the Pro plan features."
),
"draft_quality": "approved",
},
{
"ticket": "Your app charged me twice this month. I want a refund immediately.",
"context": "Billing shows two charges: subscription renewal + add-on purchase.",
"category": "billing",
"response": (
"I completely understand your concern about seeing two charges, and "
"I want to help resolve this right away.\n\n"
"Looking at your account, I can see two separate charges this month: "
"your regular subscription renewal of $29/month (charged on the 1st) "
"and a one-time charge of $15 for the additional storage add-on you "
"purchased on the 12th. These are two distinct charges, not a "
"duplicate billing.\n\n"
"If the add-on purchase was not intentional, I can process a refund "
"for that $15 charge and remove the add-on from your account. Would "
"you like me to go ahead with that?\n\n"
"I am here to help if you have any other questions about your billing."
),
"draft_quality": "approved",
},
]
```
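One plausible way to inject these examples is as prior user/assistant turns ahead of the live ticket; a minimal sketch, with the message formatting assumed rather than taken from the production code:

```python
def build_messages(ticket_text: str, context: str) -> list[dict]:
    """Prepend the approved examples as prior turns, then add the live ticket."""
    messages = []
    for ex in EXAMPLES:
        messages.append({
            "role": "user",
            "content": f"Ticket: {ex['ticket']}\nContext: {ex['context']}",
        })
        messages.append({"role": "assistant", "content": ex["response"]})
    messages.append({
        "role": "user",
        "content": f"Ticket: {ticket_text}\nContext: {context}",
    })
    return messages
```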
### Evaluation Dataset
```python
# data/eval_dataset.py
# Gold-standard evaluation set: 500 tickets with human-verified responses
# Updated monthly by the support team lead
EVAL_DATASET = {
"name": "claro-support-eval-v3",
"size": 500,
"format": "JSONL",
"location": "s3://claro-ml/datasets/eval/eval_v3.jsonl",
"categories": {
"billing": 120,
"technical_bug": 150,
"feature_request": 80,
"how_to": 100,
"account_access": 50,
},
"last_updated": "YYYY-MM-DD",
"labeled_by": "Senior support team (3 annotators, majority vote)",
}
```
## Experiment Tracking
### Experiment Log
Track every prompt change, model swap, or parameter adjustment:
| ID | Date | Change | Metric Before | Metric After | Decision |
|----|------|--------|---------------|--------------|----------|
| EXP-042 | YYYY-MM-DD | Added product-specific context to system prompt | 91.2% accuracy | 94.2% accuracy | Shipped |
| EXP-041 | YYYY-MM-DD | Reduced temperature from 0.7 to 0.3 | 89.8% consistency | 96.1% consistency | Shipped |
| EXP-040 | YYYY-MM-DD | Switched from GPT-4o to Claude 3.5 Sonnet | $6.80/1K | $4.20/1K | Shipped |
| EXP-039 | YYYY-MM-DD | Added few-shot examples for billing category | 82% billing accuracy | 91% billing accuracy | Shipped |
| [EXP-ID] | [Date] | [Description of change] | [Before] | [After] | [Ship/Revert/Iterate] |
### Running an Experiment
```bash
# 1. Create experiment branch
git checkout -b exp/EXP-043-add-escalation-rules
# 2. Make prompt/config changes
# Edit prompts/system_prompt.py or config/model_config.py
# 3. Run evaluation against the gold dataset
python scripts/evaluate.py \
--dataset s3://claro-ml/datasets/eval/eval_v3.jsonl \
--config config/model_config.py \
--output results/EXP-043.json
# 4. Compare results against baseline
python scripts/compare_experiments.py \
--baseline results/EXP-042.json \
--experiment results/EXP-043.json
# 5. If results are positive, merge and deploy
# If negative, document findings and revert
```
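A hypothetical sketch of what `scripts/compare_experiments.py` might do in step 4 - the metric names and the regression rule are assumptions, not the actual script:

```python
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--baseline", required=True)
parser.add_argument("--experiment", required=True)
args = parser.parse_args()

with open(args.baseline) as f:
    baseline = json.load(f)
with open(args.experiment) as f:
    experiment = json.load(f)

regressed = False
for metric in ("accuracy", "consistency", "latency_p95", "cost_per_response"):
    before, after = baseline[metric], experiment[metric]
    delta = after - before
    print(f"{metric}: {before} -> {after} ({delta:+.4f})")
    if metric in ("latency_p95", "cost_per_response"):
        regressed = regressed or delta > 0   # these should not go up
    else:
        regressed = regressed or delta < 0   # these should not go down

sys.exit(1 if regressed else 0)  # a nonzero exit can block the merge in CI
```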
## Evaluation Metrics
### Automated Metrics (Run on Every PR)
```python
metrics = {
# Accuracy: Does the response correctly address the ticket?
"accuracy": exact_category_match + response_relevance_score,
# Consistency: Given the same ticket twice, do we get similar responses?
"consistency": cosine_similarity_between_runs,
# Safety: Does the response violate any guardrails?
"safety_violations": count_of_pii_leaks + unauthorized_promises + hallucinations,
# Latency: How fast is the response generated?
"latency_p50": median_response_time_ms,
"latency_p95": p95_response_time_ms,
# Cost: How much does each response cost?
"cost_per_response": (input_tokens * input_price + output_tokens * output_price),
}
```
### Human Evaluation (Monthly)
The support team lead reviews 50 randomly sampled AI-drafted responses each month:
- **Accuracy** (1-5): Is the information correct and complete?
- **Tone** (1-5): Does it match our brand voice? Is it empathetic?
- **Actionability** (1-5): Does it give the customer a clear next step?
- **Safety** (pass/fail): Any policy violations, PII exposure, or hallucinations?
Target scores: Accuracy >= 4.2, Tone >= 4.0, Actionability >= 4.0, Safety = 100% pass
### A/B Testing Framework
```python
# A/B test configuration for prompt changes
ab_test = {
"name": "EXP-043-escalation-rules",
"start_date": "YYYY-MM-DD",
"end_date": "YYYY-MM-DD",
"control": {
"prompt_version": "v2.8",
"traffic_percentage": 50,
},
"treatment": {
"prompt_version": "v2.9-escalation",
"traffic_percentage": 50,
},
"success_metrics": [
"customer_satisfaction_score",
"human_edit_rate", # How often agents modify the AI draft
"response_time_total", # Time from ticket creation to response sent
],
"guardrail_metrics": [
"safety_violation_rate", # Must not increase
"escalation_rate", # Should decrease (fewer wrong auto-responses)
],
}
```
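A minimal sketch of how tickets might be assigned to an arm - hashing the ticket ID keeps each ticket in the same arm across retries. The assignment scheme is an assumption, not the production traffic splitter:

```python
import hashlib

def assign_arm(ticket_id: str, test: dict) -> str:
    """Deterministically bucket a ticket into control or treatment."""
    digest = hashlib.sha256(f"{test['name']}:{ticket_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "control" if bucket < test["control"]["traffic_percentage"] else "treatment"

arm = assign_arm("TICKET-12345", ab_test)
prompt_version = ab_test[arm]["prompt_version"]
```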
## Deployment Pipeline
### Model Deployment Workflow
```
1. Prompt/config change committed to feature branch
2. CI runs automated evaluation against gold dataset
3. Results posted as PR comment (accuracy, latency, cost, safety)
4. Human review of 10 sampled responses (manual spot-check)
5. If metrics meet thresholds -> merge to main
6. Main branch auto-deploys to staging
7. 24-hour soak test on staging with live traffic shadow
8. Manual promotion to production (requires team lead approval)
```
### Rollback Procedure
```bash
# Prompt rollback (instant - config change only)
git revert [commit-hash]
git push origin main
# Config re-deploys in under 2 minutes via CI/CD
# Model rollback (if switching models)
# Update PRIMARY_CONFIG["model"] in config/model_config.py to the previous model ID
# Fallback chain: Sonnet -> Haiku -> GPT-4o-mini -> Human-only mode
```
## Monitoring and Alerting
### Production Dashboards
| Metric | Threshold | Alert Channel |
|--------|-----------|---------------|
| Error rate | > 2% over 5 min | PagerDuty (P2) |
| Latency p95 | > 5s over 10 min | Slack #claro-alerts |
| Safety violations | > 0 per hour | PagerDuty (P1) |
| Daily cost | > $150 | Slack #claro-costs |
| Hallucination rate | > 1% over 1 hour | PagerDuty (P2) |
| API availability | < 99% over 15 min | PagerDuty (P1) |
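These rules could be encoded for whatever metrics pipeline backs the dashboards; a minimal sketch, with the metric reader and alert sender passed in as callables (all names here are illustrative):

```python
import operator

ALERT_RULES = [
    {"metric": "error_rate",         "op": ">", "threshold": 0.02, "window_min": 5,    "route": "pagerduty_p2"},
    {"metric": "latency_p95_s",      "op": ">", "threshold": 5.0,  "window_min": 10,   "route": "slack_claro_alerts"},
    {"metric": "safety_violations",  "op": ">", "threshold": 0,    "window_min": 60,   "route": "pagerduty_p1"},
    {"metric": "daily_cost_usd",     "op": ">", "threshold": 150,  "window_min": 1440, "route": "slack_claro_costs"},
    {"metric": "hallucination_rate", "op": ">", "threshold": 0.01, "window_min": 60,   "route": "pagerduty_p2"},
    {"metric": "api_availability",   "op": "<", "threshold": 0.99, "window_min": 15,   "route": "pagerduty_p1"},
]

OPS = {">": operator.gt, "<": operator.lt}

def evaluate_rules(read_metric, send_alert) -> None:
    """Fire an alert for every rule whose windowed value breaches its threshold."""
    for rule in ALERT_RULES:
        value = read_metric(rule["metric"], rule["window_min"])
        if OPS[rule["op"]](value, rule["threshold"]):
            send_alert(rule["route"], rule["metric"], value)
```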
### Logging Structure
```python
# Every model invocation is logged for debugging and audit
logger.info("model_invocation", extra={
    "request_id": request_id,
    "ticket_id": ticket_id,
    "model": model_name,
    "prompt_version": prompt_version,
    "input_tokens": usage.input_tokens,
    "output_tokens": usage.output_tokens,
    "latency_ms": elapsed_ms,
    "cost_usd": calculated_cost,
    "category": predicted_category,
    "confidence": confidence_score,
    "safety_check": "pass",  # or "fail" with details
})
```
## Safety and Compliance
### Content Guardrails
```python
# Post-processing safety checks applied to every AI response
SAFETY_CHECKS = [
"pii_detection", # Scan for emails, phone numbers, SSNs in output
"competitor_mention", # Flag responses that mention competitor products
"price_verification", # Verify any prices match current pricing table
"feature_verification", # Verify mentioned features actually exist
"tone_check", # Flag aggressive, dismissive, or overly casual tone
"legal_disclaimer_check", # Flag responses that could be construed as legal advice
]
# If any safety check fails:
# 1. Response is NOT sent to customer
# 2. Ticket is routed to human agent queue
# 3. Incident is logged for review
# 4. Alert fires if failure rate exceeds threshold
```
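A minimal sketch of that failure-handling flow - the check registry and the queue/alert helpers are illustrative stand-ins for the real services:

```python
def run_safety_checks(response_text: str, ticket: dict) -> list[str]:
    """Return the names of any checks the drafted response fails."""
    failures = []
    for check_name in SAFETY_CHECKS:
        check_fn = CHECK_REGISTRY[check_name]  # hypothetical: check name -> callable
        if not check_fn(response_text, ticket):
            failures.append(check_name)
    return failures

def dispatch(response_text: str, ticket: dict) -> None:
    failures = run_safety_checks(response_text, ticket)
    if failures:
        route_to_human_queue(ticket)                 # 2. route to a human agent
        log_safety_incident(ticket["id"], failures)  # 3. log the incident for review
        maybe_alert_on_failure_rate()                # 4. alert if the failure rate is too high
    else:
        send_to_customer(ticket["id"], response_text)  # only clean responses are sent
```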
### Data Privacy
- Customer PII is redacted from logs and training data
- Model API calls go through a proxy that strips sensitive fields
- No customer data is used for model fine-tuning without explicit consent
- All AI-generated responses are marked as "AI-drafted" in the support tool
- Responses are stored for 90 days for quality review, then purged
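A minimal sketch of the log-redaction step in the first bullet above - the patterns are simplified and not the full production rule set:

```python
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(text: str) -> str:
    """Replace common PII patterns before text reaches logs or training data."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text
```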
## Troubleshooting
### High Latency
- Check API provider status page for outages
- Review recent prompt changes (longer prompts = slower responses)
- Check if rate limits are being hit (exponential backoff kicks in)
- Consider routing to faster model (Haiku) for simple tickets
### Low Quality Responses
- Review the last 5 experiment changes - was something reverted incorrectly?
- Check if the evaluation dataset is still representative of current tickets
- Look for category drift - are customers asking new types of questions?
- Review few-shot examples - do they still match current product state?
### Cost Spikes
- Check for retry loops (failed requests being retried excessively)
- Look for unusually long tickets inflating input token counts
- Verify caching is working (Redis hit rate should be > 30%)
- Check if the triage model is routing too many tickets to the expensive model
Generative AI projects succeed or fail on context quality. GenAI.md documents your model configurations, training data lineage, and evaluation metrics in markdown. Context flows from data to deployment, and the context behind your AI systems becomes versioned infrastructure.
Every model iteration represents learning. GenAI.md captures experiment results, parameter decisions, and performance trade-offs in structured markdown. Your AI assistants help analyze patterns across experiments. Institutional memory prevents repeated mistakes.
Deploying generative AI requires more than code - you need deployment pipelines, monitoring strategies, and incident response procedures. GenAI.md documents the operational context that keeps AI systems reliable. From training to production, context is infrastructure.
"Generative AI projects generate massive amounts of context - training data, model configs, experiment results, deployment procedures. GenAI.md helps teams structure this context as versioned markdown, making AI projects manageable and knowledge transferable."
Built by ML engineers who know that the hardest part of AI isn't the model - it's the context.
We understand that generative AI projects generate massive amounts of context - training data lineage, model configurations, experiment results, deployment procedures. All of this deserves to be structured as markdown, not scattered across notebooks and Slack. GenAI.md helps ML teams capture their model development journey in a format that both current team members and future AI assistants can learn from.
Our mission is to help AI teams treat their project context as seriously as their model code. When experiment learnings, configuration decisions, and operational procedures live in versioned .md files, AI projects become manageable, knowledge becomes transferable, and teams avoid repeating mistakes. Context is the difference between ML research and ML production.
LLMs parse markdown better than any other format. Fewer tokens, cleaner structure, better results.
Context evolves with code. Git tracks changes, PRs enable review, history preserves decisions.
No special tools needed. Plain text that works everywhere. Documentation humans actually read.
Building generative AI systems? Need advice on model documentation? Share your challenges - we've been there.