# Troubleshooting
Common issues and their solutions.
## Quick Diagnostics

### Health Check

```bash
# Backend health
curl https://your-domain.com/health
# Expected: {"status": "ok", "message": "Service is healthy"}
```
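If `curl` isn't handy, the same check can be scripted; a minimal sketch in Python, assuming only the response shape shown above (`{"status": "ok", ...}`):

```python
# Sketch: decide whether a /health response body indicates a healthy service.
# The expected JSON shape is taken from the curl example above.
import json

def is_healthy(body: str) -> bool:
    """Return True if a /health response body reports an ok status."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Non-JSON bodies (e.g. an HTML 502 page from the LB) are unhealthy
        return False
    return payload.get("status") == "ok"

print(is_healthy('{"status": "ok", "message": "Service is healthy"}'))  # True
print(is_healthy("<html>502 Bad Gateway</html>"))                       # False
```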
### Pod Status

```bash
kubectl get pods
kubectl logs deployment/learnpanta-backend --tail=100
```

### Temporal Status

```bash
kubectl exec -it deployment/temporal-server -- tctl namespace list
```
## Common Issues

### Feedback Generation Times Out

**Symptoms:**
- Session completes but feedback never appears
- Worker logs show `asyncio.exceptions.CancelledError`
- Activity retries multiple times (attempt: 4, 5, 6...)
**Cause:** The Gemini API call exceeds the activity timeout.

**Solution:**

- Check which model is being used:

```python
# In llm.py - should be flash, not pro
model=self.flash_model_id  # Fast
# NOT: model=self.pro_model_id  # Slow
```

- Ensure thinking mode is disabled:

```python
# Remove this line:
thinking_config=self._get_thinking_config()
```

- Increase the timeout if needed:

```python
# In workflows.py
start_to_close_timeout=timedelta(minutes=3)  # Increase from 2
```
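When tuning the timeout, it helps to compare it against observed call durations from the worker logs; a small sketch using `timedelta` (the 3-minute figure comes from the snippet above, the sample durations are illustrative):

```python
# Sketch: check whether an observed LLM call duration fits inside the
# activity's start_to_close timeout before it would be cancelled.
from datetime import timedelta

def fits_timeout(observed_seconds: float,
                 start_to_close: timedelta = timedelta(minutes=3)) -> bool:
    """True if a call of this duration would finish inside the activity timeout."""
    return observed_seconds < start_to_close.total_seconds()

print(fits_timeout(95))   # True  - a 95s flash-model call fits in 3 minutes
print(fits_timeout(200))  # False - a 200s call would be cancelled and retried
```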
### Questions From Wrong Exam

**Symptoms:**
- User takes a PMP exam but sees AWS questions
- Adaptive papers contain unrelated content

**Cause:** Bug in adaptive paper creation (questions not filtered by exam).

**Solution:** Ensure `academic.py` filters questions by exam:

```python
query = db.query(models.Question).join(
    models.SectionQuestion
).join(
    models.Section
).join(
    models.Paper
).filter(
    models.Paper.exam_id == exam_id  # Critical filter
).distinct()
```
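As a quick regression check, the returned questions can be verified against the requested exam; a sketch operating on plain dicts rather than ORM rows (the field names here are illustrative, not the actual model schema):

```python
# Sketch: verify that every question returned for an adaptive paper
# belongs to the requested exam. Field names are illustrative.
def all_from_exam(questions, exam_id):
    """True if every question row carries the requested exam_id."""
    return all(q["exam_id"] == exam_id for q in questions)

rows = [{"id": 1, "exam_id": "pmp"}, {"id": 2, "exam_id": "pmp"}]
print(all_from_exam(rows, "pmp"))   # True

rows.append({"id": 3, "exam_id": "aws"})  # a stray AWS question slipped in
print(all_from_exam(rows, "pmp"))   # False - the filter is missing
```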
### WebSocket Connection Fails

**Symptoms:**
- Telemetry not being captured
- Console shows WebSocket errors
- No metrics in workflow state

**Diagnosis:**

```
// Check browser console
WebSocket connection to 'wss://...' failed
```
**Solutions:**

- **CORS issue:** Add the WebSocket origin to allowed origins:

```python
origins.append("wss://your-frontend-domain.com")
```

- **Load balancer:** Ensure the LB supports WebSocket upgrades:

```yaml
# GKE Ingress annotation
kubernetes.io/ingress.class: "gce"
cloud.google.com/backend-config: '{"default": "ws-backend-config"}'
```

- **Timeout:** Increase the idle timeout on the load balancer
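The origin comparison itself can be sketched as below; treating `wss`/`https` (and `ws`/`http`) as equivalent is an assumption of this sketch about how the allow-list is matched, so adapt it to the actual CORS setup:

```python
# Sketch: check an incoming Origin against an allow-list, normalizing
# WebSocket schemes to their HTTP counterparts before comparing.
from urllib.parse import urlparse

def origin_allowed(origin: str, allowed: list[str]) -> bool:
    """Compare origins by scheme+host, folding wss->https and ws->http."""
    def key(url: str):
        p = urlparse(url)
        scheme = {"wss": "https", "ws": "http"}.get(p.scheme, p.scheme)
        return (scheme, p.netloc)
    return key(origin) in {key(a) for a in allowed}

allowed = ["https://your-frontend-domain.com", "wss://your-frontend-domain.com"]
print(origin_allowed("wss://your-frontend-domain.com", allowed))  # True
print(origin_allowed("wss://evil.example.com", allowed))          # False
```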
### Database Connection Refused

**Symptoms:**

```
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError)
connection refused
```

**Solutions:**

- **Local:** Ensure PostgreSQL is running:

```bash
docker-compose up -d
```

- **Cloud SQL:** Check the connection string:

```bash
# Via proxy
cloud-sql-proxy project:region:instance &
DATABASE_URL=postgresql://user:[email protected]:5432/db
```

- **Private IP:** Ensure the pod can reach Cloud SQL:

```bash
kubectl exec -it pod-name -- nc -zv CLOUD_SQL_IP 5432
```
### Temporal Workflow Not Found

**Symptoms:**

```
HTTPException: Active session workflow not found
```

**Causes:**

- **Worker not running:** Start the worker:

```bash
python -m app.worker
```

- **Wrong task queue:** Verify the task queue name matches:

```python
# In sessions.py
task_queue="marathon-session-queue"  # Must match worker
```

- **Workflow completed:** Check whether the session was already finalized
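The workflow ID convention used throughout this page is `marathon-{session_id}`, and a starter/worker task-queue mismatch is the usual cause of "not found"; a small sketch that keeps both consistent (the queue name is copied from the snippet above, the comparison is illustrative):

```python
# Sketch: centralize the workflow-ID and task-queue conventions so the
# API starter and the worker cannot drift apart.
TASK_QUEUE = "marathon-session-queue"  # must match the worker's queue

def workflow_id(session_id: str) -> str:
    """Build the Temporal workflow ID for a marathon session."""
    return f"marathon-{session_id}"

worker_queue = "marathon-session-queue"   # value the worker was started with
print(TASK_QUEUE == worker_queue)         # True - signals will be picked up
print(workflow_id("abc123"))              # marathon-abc123
```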
### AI Features Return Mock Data

**Symptoms:**
- Feedback says "Mock response"
- TTS returns placeholder audio

**Cause:** `GOOGLE_API_KEY` not set or invalid.

**Solution:**

```bash
# Check if set
kubectl exec -it deployment/learnpanta-backend -- env | grep GOOGLE
# Set it
kubectl create secret generic backend-secrets \
  --from-literal=google-api-key="AIzaSy..."
```
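The same probe can run in-process; a sketch mirroring the `kubectl` check above (the placeholder detection is a heuristic of this sketch, not actual app behaviour, and the key value is an illustrative dummy):

```python
# Sketch: decide whether AI features should be live or fall back to mocks,
# based on whether GOOGLE_API_KEY is present and non-placeholder.
def ai_enabled(env: dict) -> bool:
    """True if GOOGLE_API_KEY looks set to a real value."""
    key = env.get("GOOGLE_API_KEY", "").strip()
    # Treat empty or obvious placeholder values as "not configured"
    return bool(key) and not key.lower().startswith(("mock", "changeme"))

print(ai_enabled({"GOOGLE_API_KEY": "AIzaSyExampleKey"}))  # True
print(ai_enabled({}))                                      # False
print(ai_enabled({"GOOGLE_API_KEY": "changeme"}))          # False
```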
### Curation Fails Silently

**Symptoms:**
- `/curator/batch-curate` returns 200 but no questions are created
- Completion percentage stays at 0%

**Diagnosis:**

```bash
# Check worker logs during curation
kubectl logs deployment/learnpanta-backend -c worker -f
```
**Common causes:**

- **Rate limiting:** Gemini API rate limited
  - Solution: Reduce concurrency (`concurrency=3`)
- **Invalid exam names:** Search grounding can't find the syllabus
  - Solution: Check exam name formatting
- **No active papers:** Exam has no paper to attach questions to
  - Solution: Create the paper first
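Capping concurrency as suggested above is typically done with a semaphore; a minimal self-contained sketch (the `concurrency=3` figure comes from the list above, `asyncio.sleep(0)` stands in for the real Gemini call):

```python
# Sketch: run curation tasks with at most `concurrency` in flight at once,
# tracking the peak to confirm the cap holds.
import asyncio

async def curate_all(items, concurrency: int = 3):
    """Run one coroutine per item, limited by a semaphore."""
    sem = asyncio.Semaphore(concurrency)
    peak = 0    # highest number of tasks observed inside the semaphore
    active = 0

    async def one(item):
        nonlocal peak, active
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # stand-in for the rate-limited API call
            active -= 1
            return item

    results = await asyncio.gather(*(one(i) for i in items))
    return results, peak

results, peak = asyncio.run(curate_all(range(10)))
print(len(results), peak <= 3)  # 10 True
```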
### Frontend Shows Blank Page

**Symptoms:**
- White screen after login
- Console shows API errors

**Solutions:**

- **API URL wrong:**

```bash
# .env.local
NEXT_PUBLIC_API_URL=https://api.yourdomain.com/api/v1
```

- **CORS blocking:**

```bash
# Backend
ALLOWED_ORIGINS=https://app.yourdomain.com
```

- **API key mismatch:** Frontend and backend using different keys
### Temporal Workflow Stuck / Not Advancing

**Symptoms:**
- `Active session workflow not found`, or the workflow stays in `ACTIVE`
- No new metrics processed; continue-as-new not triggered

**Checks:**

```bash
tctl workflow describe -w marathon-{session_id}
```

**Fixes:**

- If the history is huge: lower the continue-as-new threshold in the workflow and redeploy the worker.
- If the workflow completed: reconnect via `/ws/stream/{session_id}` (ingestion auto-starts) or start it manually.
- To force close: `tctl workflow terminate -w marathon-{session_id}`, then recreate the session.
### WebSocket Telemetry Not Arriving

**Symptoms:**
- No metrics in Temporal state
- Browser shows WS close codes 1006/1015

**Fixes:**

- Verify `NEXT_PUBLIC_WS_URL` (wss) and that the LB supports upgrades.
- Add the WSS origin to `ALLOWED_ORIGINS`.
- Check backend logs for `Signal retry failed`; if present, Temporal is unreachable.
- Keep message payloads under 64 KB.
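The 64 KB cap can be enforced before a message is sent; a sketch (the limit is taken from the list above, the sample payload is illustrative):

```python
# Sketch: reject telemetry messages that exceed the WebSocket payload cap
# before they are sent, instead of letting the connection drop.
MAX_WS_PAYLOAD = 64 * 1024  # 64 KB, per the guidance above

def payload_ok(message: str) -> bool:
    """True if the UTF-8 encoded message fits under the payload cap."""
    return len(message.encode("utf-8")) < MAX_WS_PAYLOAD

print(payload_ok('{"metric": "gaze", "value": 0.42}'))  # True
print(payload_ok("x" * (64 * 1024)))                    # False - at the cap
```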
### Frontend Build Fails (CI)

**Symptoms:**
- Cloud Build fails during `next build`
- Missing env vars in the logs

**Fixes:**

- Ensure the `_FIREBASE_*` and `_TLDRAW_LICENSE_KEY` substitutions are set in `cloudbuild-frontend.yaml`.
- Confirm the docs are copied to `frontend/content/docs` (step 0 in Cloud Build).
- Run `pnpm build` locally to surface type errors before pushing.
## Kubernetes Issues

### Pod CrashLoopBackOff

```bash
# Get events
kubectl describe pod POD_NAME
# Get logs from crashed container
kubectl logs POD_NAME --previous
```

**Common causes:**
- Missing environment variables
- Database connection failure
- Invalid configuration
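Missing environment variables are the most common of these causes, and a startup check surfaces them before the pod crash-loops; a sketch (the required-variable names are examples drawn from this page, not an exhaustive list):

```python
# Sketch: report which required environment variables are absent or empty,
# so the failure is logged once at startup instead of as a crash loop.
def missing_env(required, env):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

required = ["DATABASE_URL", "GOOGLE_API_KEY"]
print(missing_env(required, {"DATABASE_URL": "postgresql://..."}))
# ['GOOGLE_API_KEY']
```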
### Out of Memory

```bash
# Check resource usage
kubectl top pods
```

Increase the limits in the deployment spec:

```yaml
resources:
  limits:
    memory: "2Gi"
```

### Image Pull Errors

```bash
# Check image exists
gcloud artifacts docker images list REPO
# Check pull secrets
kubectl get secrets
```
## Performance Issues

### Slow API Responses

**Diagnosis:**

```bash
# Time a request
time curl https://api.example.com/api/v1/exams
```

**Solutions:**

- **Database indexes:** Add indexes for common queries:

```sql
CREATE INDEX idx_sessions_candidate ON sessions(candidate_id);
CREATE INDEX idx_papers_exam ON papers(exam_id);
```

- **Connection pooling:** Increase the pool size:

```python
engine = create_engine(DATABASE_URL, pool_size=20)
```

- **Caching:** Add Redis for frequently accessed data
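Redis is the right tool in production; purely to illustrate the caching pattern (cache hits within a TTL, expiry after it), here is a minimal in-process sketch with an illustrative key and a deliberately short TTL:

```python
# Sketch: tiny in-process cache with per-entry expiry. Illustrates the
# TTL-caching pattern only; use Redis for anything shared across pods.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # evict stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=0.05)
cache.set("exams", ["pmp", "aws"])
print(cache.get("exams"))   # ['pmp', 'aws'] - fresh hit
time.sleep(0.1)
print(cache.get("exams"))   # None - expired
```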
### High Memory Usage

**Cause:** Temporal workflow history growing too large.

**Solution:** Trigger continue-as-new more aggressively:

```python
if loop_count > 30:  # Reduce from 50
    workflow.continue_as_new(...)
```
## Logs Reference

### Backend Logs

```bash
# All logs
kubectl logs deployment/learnpanta-backend
# Follow logs
kubectl logs -f deployment/learnpanta-backend
# Filter by level
kubectl logs deployment/learnpanta-backend | grep ERROR
```

### Worker Logs

```bash
kubectl logs deployment/learnpanta-backend -c worker
```
### Key Log Patterns

| Pattern | Meaning |
|---|---|
| `Feedback Agent generating report` | AI feedback starting |
| `CancelledError` | Activity timed out |
| `Session Finished by User Signal` | Normal completion |
| `continue_as_new` | Workflow history reset |
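The pattern table above can be applied mechanically when scanning logs; a sketch (patterns and meanings copied from the table, matching by substring is an assumption of this sketch):

```python
# Sketch: map a log line to its meaning using the key-pattern table above.
LOG_PATTERNS = {
    "Feedback Agent generating report": "AI feedback starting",
    "CancelledError": "Activity timed out",
    "Session Finished by User Signal": "Normal completion",
    "continue_as_new": "Workflow history reset",
}

def explain(line: str):
    """Return the meaning of the first matching pattern, or None."""
    for pattern, meaning in LOG_PATTERNS.items():
        if pattern in line:
            return meaning
    return None

print(explain("asyncio.exceptions.CancelledError"))  # Activity timed out
print(explain("INFO startup complete"))              # None
```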
## Getting Help

### Information to Include

When reporting issues, include:
- Steps to reproduce
- Expected vs. actual behavior
- Relevant logs (sanitized)
- Environment (local/staging/prod)
- Recent changes
### Log Collection Script

```bash
#!/bin/bash
echo "=== Pod Status ===" > debug.txt
kubectl get pods >> debug.txt
echo "=== Backend Logs ===" >> debug.txt
kubectl logs deployment/learnpanta-backend --tail=200 >> debug.txt
echo "=== Worker Logs ===" >> debug.txt
kubectl logs deployment/learnpanta-backend -c worker --tail=200 >> debug.txt
```
## Next Steps

- **Configuration** - Review settings
- **Deployment** - Deployment procedures
- **Architecture** - System overview