# Troubleshooting
Common issues and their solutions.
## Quick Diagnostics

### Health Check

```bash
# Backend health
curl https://your-domain.com/health
# Expected: {"status": "ok", "message": "Service is healthy"}
```
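If `curl` isn't handy, the same check can be scripted; a minimal sketch in Python, assuming only the response shape shown above (`{"status": "ok", ...}`):

```python
# Sketch: decide whether a /health response body indicates a healthy service.
# The expected JSON shape is taken from the curl example above.
import json

def is_healthy(body: str) -> bool:
    """Return True if a /health response body reports an ok status."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Non-JSON bodies (e.g. an HTML 502 page from the LB) are unhealthy
        return False
    return payload.get("status") == "ok"

print(is_healthy('{"status": "ok", "message": "Service is healthy"}'))  # True
print(is_healthy("<html>502 Bad Gateway</html>"))                       # False
```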
### Pod Status

```bash
kubectl get pods
kubectl logs deployment/learnpanta-backend --tail=100
```

### Temporal Status

```bash
kubectl exec -it deployment/temporal-server -- tctl namespace list
```
## Common Issues

### Feedback Generation Times Out

**Symptoms:**
- Session completes but feedback never appears
- Worker logs show `asyncio.exceptions.CancelledError`
- Activity retries multiple times (attempt: 4, 5, 6...)
**Cause:** The Gemini API call exceeds the activity timeout.

**Solution:**

- Check which model is being used:

```python
# In llm.py - should be flash, not pro
model=self.flash_model_id  # Fast
# NOT: model=self.pro_model_id  # Slow
```

- Ensure thinking mode is disabled:

```python
# Remove this line:
thinking_config=self._get_thinking_config()
```

- Increase the timeout if needed:

```python
# In workflows.py
start_to_close_timeout=timedelta(minutes=3)  # Increase from 2
```
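When tuning the timeout, it helps to compare it against observed call durations from the worker logs; a small sketch using `timedelta` (the 3-minute figure comes from the snippet above, the sample durations are illustrative):

```python
# Sketch: check whether an observed LLM call duration fits inside the
# activity's start_to_close timeout before it would be cancelled.
from datetime import timedelta

def fits_timeout(observed_seconds: float,
                 start_to_close: timedelta = timedelta(minutes=3)) -> bool:
    """True if a call of this duration would finish inside the activity timeout."""
    return observed_seconds < start_to_close.total_seconds()

print(fits_timeout(95))   # True  - a 95s flash-model call fits in 3 minutes
print(fits_timeout(200))  # False - a 200s call would be cancelled and retried
```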
### Questions From Wrong Exam

**Symptoms:**
- User takes a PMP exam but sees AWS questions
- Adaptive papers contain unrelated content

**Cause:** Bug in adaptive paper creation (questions not filtered by exam).

**Solution:** Ensure `academic.py` filters questions by exam:

```python
query = db.query(models.Question).join(
    models.SectionQuestion
).join(
    models.Section
).join(
    models.Paper
).filter(
    models.Paper.exam_id == exam_id  # Critical filter
).distinct()
```
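As a quick regression check, the returned questions can be verified against the requested exam; a sketch operating on plain dicts rather than ORM rows (the field names here are illustrative, not the actual model schema):

```python
# Sketch: verify that every question returned for an adaptive paper
# belongs to the requested exam. Field names are illustrative.
def all_from_exam(questions, exam_id):
    """True if every question row carries the requested exam_id."""
    return all(q["exam_id"] == exam_id for q in questions)

rows = [{"id": 1, "exam_id": "pmp"}, {"id": 2, "exam_id": "pmp"}]
print(all_from_exam(rows, "pmp"))   # True

rows.append({"id": 3, "exam_id": "aws"})  # a stray AWS question slipped in
print(all_from_exam(rows, "pmp"))   # False - the filter is missing
```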
### WebSocket Connection Fails

**Symptoms:**
- Telemetry not being captured
- Console shows WebSocket errors
- No metrics in workflow state

**Diagnosis:**

```
// Check browser console
WebSocket connection to 'wss://...' failed
```
**Solutions:**

- **CORS issue:** Add the WebSocket origin to allowed origins:

```python
origins.append("wss://your-frontend-domain.com")
```

- **Load balancer:** Ensure the LB supports WebSocket upgrades:

```yaml
# GKE Ingress annotation
kubernetes.io/ingress.class: "gce"
cloud.google.com/backend-config: '{"default": "ws-backend-config"}'
```

- **Timeout:** Increase the idle timeout on the load balancer
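The origin comparison itself can be sketched as below; treating `wss`/`https` (and `ws`/`http`) as equivalent is an assumption of this sketch about how the allow-list is matched, so adapt it to the actual CORS setup:

```python
# Sketch: check an incoming Origin against an allow-list, normalizing
# WebSocket schemes to their HTTP counterparts before comparing.
from urllib.parse import urlparse

def origin_allowed(origin: str, allowed: list[str]) -> bool:
    """Compare origins by scheme+host, folding wss->https and ws->http."""
    def key(url: str):
        p = urlparse(url)
        scheme = {"wss": "https", "ws": "http"}.get(p.scheme, p.scheme)
        return (scheme, p.netloc)
    return key(origin) in {key(a) for a in allowed}

allowed = ["https://your-frontend-domain.com", "wss://your-frontend-domain.com"]
print(origin_allowed("wss://your-frontend-domain.com", allowed))  # True
print(origin_allowed("wss://evil.example.com", allowed))          # False
```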
### Database Connection Refused

**Symptoms:**

```
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError)
connection refused
```

**Solutions:**

- **Local:** Ensure PostgreSQL is running:

```bash
docker-compose up -d
```

- **Cloud SQL:** Check the connection string:

```bash
# Via proxy
cloud-sql-proxy project:region:instance &
DATABASE_URL=postgresql://user:[email protected]:5432/db
```

- **Private IP:** Ensure the pod can reach Cloud SQL:

```bash
kubectl exec -it pod-name -- nc -zv CLOUD_SQL_IP 5432
```
### Temporal Workflow Not Found

**Symptoms:**

```
HTTPException: Active session workflow not found
```

**Causes:**

- **Worker not running:** Start the worker:

```bash
python -m app.worker
```

- **Wrong task queue:** Verify the task queue name matches:

```python
# In sessions.py
task_queue="marathon-session-queue"  # Must match worker
```

- **Workflow completed:** Check whether the session was already finalized
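The workflow ID convention used throughout this page is `marathon-{session_id}`, and a starter/worker task-queue mismatch is the usual cause of "not found"; a small sketch that keeps both consistent (the queue name is copied from the snippet above, the comparison is illustrative):

```python
# Sketch: centralize the workflow-ID and task-queue conventions so the
# API starter and the worker cannot drift apart.
TASK_QUEUE = "marathon-session-queue"  # must match the worker's queue

def workflow_id(session_id: str) -> str:
    """Build the Temporal workflow ID for a marathon session."""
    return f"marathon-{session_id}"

worker_queue = "marathon-session-queue"   # value the worker was started with
print(TASK_QUEUE == worker_queue)         # True - signals will be picked up
print(workflow_id("abc123"))              # marathon-abc123
```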
### AI Features Return Mock Data

**Symptoms:**
- Feedback says "Mock response"
- TTS returns placeholder audio

**Cause:** `GOOGLE_API_KEY` not set or invalid.

**Solution:**

```bash
# Check if set
kubectl exec -it deployment/learnpanta-backend -- env | grep GOOGLE
# Set it
kubectl create secret generic backend-secrets \
  --from-literal=google-api-key="AIzaSy..."
```
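The same probe can run in-process; a sketch mirroring the `kubectl` check above (the placeholder detection is a heuristic of this sketch, not actual app behaviour, and the key value is an illustrative dummy):

```python
# Sketch: decide whether AI features should be live or fall back to mocks,
# based on whether GOOGLE_API_KEY is present and non-placeholder.
def ai_enabled(env: dict) -> bool:
    """True if GOOGLE_API_KEY looks set to a real value."""
    key = env.get("GOOGLE_API_KEY", "").strip()
    # Treat empty or obvious placeholder values as "not configured"
    return bool(key) and not key.lower().startswith(("mock", "changeme"))

print(ai_enabled({"GOOGLE_API_KEY": "AIzaSyExampleKey"}))  # True
print(ai_enabled({}))                                      # False
print(ai_enabled({"GOOGLE_API_KEY": "changeme"}))          # False
```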
### Curation Fails Silently

**Symptoms:**
- `/curator/batch-curate` returns 200 but no questions are created
- Completion percentage stays at 0%

**Diagnosis:**

```bash
# Check worker logs during curation
kubectl logs deployment/learnpanta-backend -c worker -f
```
**Common causes:**

- **Rate limiting:** Gemini API rate limited
  - Solution: Reduce concurrency (`concurrency=3`)
- **Invalid exam names:** Search grounding can't find the syllabus
  - Solution: Check exam name formatting
- **No active papers:** Exam has no paper to attach questions to
  - Solution: Create the paper first
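Capping concurrency as suggested above is typically done with a semaphore; a minimal self-contained sketch (the `concurrency=3` figure comes from the list above, `asyncio.sleep(0)` stands in for the real Gemini call):

```python
# Sketch: run curation tasks with at most `concurrency` in flight at once,
# tracking the peak to confirm the cap holds.
import asyncio

async def curate_all(items, concurrency: int = 3):
    """Run one coroutine per item, limited by a semaphore."""
    sem = asyncio.Semaphore(concurrency)
    peak = 0    # highest number of tasks observed inside the semaphore
    active = 0

    async def one(item):
        nonlocal peak, active
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # stand-in for the rate-limited API call
            active -= 1
            return item

    results = await asyncio.gather(*(one(i) for i in items))
    return results, peak

results, peak = asyncio.run(curate_all(range(10)))
print(len(results), peak <= 3)  # 10 True
```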
### Frontend Shows Blank Page

**Symptoms:**
- White screen after login
- Console shows API errors

**Solutions:**

- **API URL wrong:**

```bash
# .env.local
NEXT_PUBLIC_API_URL=https://api.yourdomain.com/api/v1
```

- **CORS blocking:**

```bash
# Backend
ALLOWED_ORIGINS=https://app.yourdomain.com
```

- **API key mismatch:** Frontend and backend using different keys
### Temporal Workflow Stuck / Not Advancing

**Symptoms:**
- `Active session workflow not found`, or the workflow stays in `ACTIVE`
- No new metrics processed; continue-as-new not triggered

**Checks:**

```bash
tctl workflow describe -w marathon-{session_id}
```

**Fixes:**

- If the history is huge: lower the continue-as-new threshold in the workflow and redeploy the worker.
- If the workflow completed: reconnect via `/ws/stream/{session_id}` (ingestion auto-starts) or start it manually.
- To force close: `tctl workflow terminate -w marathon-{session_id}`, then recreate the session.
### WebSocket Telemetry Not Arriving

**Symptoms:**
- No metrics in Temporal state
- Browser shows WS close codes 1006/1015

**Fixes:**

- Verify `NEXT_PUBLIC_WS_URL` (wss) and that the LB supports upgrades.
- Add the WSS origin to `ALLOWED_ORIGINS`.
- Check backend logs for `Signal retry failed`; if present, Temporal is unreachable.
- Keep message payloads under 64 KB.
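The 64 KB cap can be enforced before a message is sent; a sketch (the limit is taken from the list above, the sample payload is illustrative):

```python
# Sketch: reject telemetry messages that exceed the WebSocket payload cap
# before they are sent, instead of letting the connection drop.
MAX_WS_PAYLOAD = 64 * 1024  # 64 KB, per the guidance above

def payload_ok(message: str) -> bool:
    """True if the UTF-8 encoded message fits under the payload cap."""
    return len(message.encode("utf-8")) < MAX_WS_PAYLOAD

print(payload_ok('{"metric": "gaze", "value": 0.42}'))  # True
print(payload_ok("x" * (64 * 1024)))                    # False - at the cap
```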
### Frontend Build Fails (CI)

**Symptoms:**
- Cloud Build fails during `next build`
- Missing env vars in the logs

**Fixes:**

- Ensure the `_FIREBASE_*` and `_TLDRAW_LICENSE_KEY` substitutions are set in `cloudbuild-frontend.yaml`.
- Confirm the docs are copied to `frontend/content/docs` (step 0 in Cloud Build).
- Run `pnpm build` locally to surface type errors before pushing.
## Kubernetes Issues

### Pod CrashLoopBackOff

```bash
# Get events
kubectl describe pod POD_NAME
# Get logs from crashed container
kubectl logs POD_NAME --previous
```

**Common causes:**
- Missing environment variables
- Database connection failure
- Invalid configuration
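Missing environment variables are the most common of these causes, and a startup check surfaces them before the pod crash-loops; a sketch (the required-variable names are examples drawn from this page, not an exhaustive list):

```python
# Sketch: report which required environment variables are absent or empty,
# so the failure is logged once at startup instead of as a crash loop.
def missing_env(required, env):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

required = ["DATABASE_URL", "GOOGLE_API_KEY"]
print(missing_env(required, {"DATABASE_URL": "postgresql://..."}))
# ['GOOGLE_API_KEY']
```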
### Out of Memory

```bash
# Check resource usage
kubectl top pods
```

Increase the limits in the deployment spec:

```yaml
resources:
  limits:
    memory: "2Gi"
```

### Image Pull Errors

```bash
# Check image exists
gcloud artifacts docker images list REPO
# Check pull secrets
kubectl get secrets
```
## Performance Issues

### Slow API Responses

**Diagnosis:**

```bash
# Time a request
time curl https://api.example.com/api/v1/exams
```

**Solutions:**

- **Database indexes:** Add indexes for common queries:

```sql
CREATE INDEX idx_sessions_candidate ON sessions(candidate_id);
CREATE INDEX idx_papers_exam ON papers(exam_id);
```

- **Connection pooling:** Increase the pool size:

```python
engine = create_engine(DATABASE_URL, pool_size=20)
```

- **Caching:** Add Redis for frequently accessed data
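Redis is the right tool in production; purely to illustrate the caching pattern (cache hits within a TTL, expiry after it), here is a minimal in-process sketch with an illustrative key and a deliberately short TTL:

```python
# Sketch: tiny in-process cache with per-entry expiry. Illustrates the
# TTL-caching pattern only; use Redis for anything shared across pods.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # evict stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=0.05)
cache.set("exams", ["pmp", "aws"])
print(cache.get("exams"))   # ['pmp', 'aws'] - fresh hit
time.sleep(0.1)
print(cache.get("exams"))   # None - expired
```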
### High Memory Usage

**Cause:** Temporal workflow history growing too large.

**Solution:** Trigger continue-as-new more aggressively:

```python
if loop_count > 30:  # Reduce from 50
    workflow.continue_as_new(...)
```
## Logs Reference

### Backend Logs

```bash
# All logs
kubectl logs deployment/learnpanta-backend
# Follow logs
kubectl logs -f deployment/learnpanta-backend
# Filter by level
kubectl logs deployment/learnpanta-backend | grep ERROR
```

### Worker Logs

```bash
kubectl logs deployment/learnpanta-backend -c worker
```
### Key Log Patterns

| Pattern | Meaning |
|---|---|
| `Feedback Agent generating report` | AI feedback starting |
| `CancelledError` | Activity timed out |
| `Session Finished by User Signal` | Normal completion |
| `continue_as_new` | Workflow history reset |
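The pattern table above can be applied mechanically when scanning logs; a sketch (patterns and meanings copied from the table, matching by substring is an assumption of this sketch):

```python
# Sketch: map a log line to its meaning using the key-pattern table above.
LOG_PATTERNS = {
    "Feedback Agent generating report": "AI feedback starting",
    "CancelledError": "Activity timed out",
    "Session Finished by User Signal": "Normal completion",
    "continue_as_new": "Workflow history reset",
}

def explain(line: str):
    """Return the meaning of the first matching pattern, or None."""
    for pattern, meaning in LOG_PATTERNS.items():
        if pattern in line:
            return meaning
    return None

print(explain("asyncio.exceptions.CancelledError"))  # Activity timed out
print(explain("INFO startup complete"))              # None
```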
## Getting Help

### Information to Include

When reporting issues, include:
- Steps to reproduce
- Expected vs. actual behavior
- Relevant logs (sanitized)
- Environment (local/staging/prod)
- Recent changes
### Log Collection Script

```bash
#!/bin/bash
echo "=== Pod Status ===" > debug.txt
kubectl get pods >> debug.txt
echo "=== Backend Logs ===" >> debug.txt
kubectl logs deployment/learnpanta-backend --tail=200 >> debug.txt
echo "=== Worker Logs ===" >> debug.txt
kubectl logs deployment/learnpanta-backend -c worker --tail=200 >> debug.txt
```
## Next Steps

- **Configuration** - Review settings
- **Deployment** - Deployment procedures
- **Architecture** - System overview