Troubleshooting

Common issues and their solutions.


Quick Diagnostics

Health Check

# Backend health
curl https://your-domain.com/health

# Expected: {"status": "ok", "message": "Service is healthy"}

Pod Status

kubectl get pods
kubectl logs deployment/learnpanta-backend --tail=100

Temporal Status

kubectl exec -it deployment/temporal-server -- tctl namespace list

Common Issues

Feedback Generation Times Out

Symptoms:

  • Session completes but feedback never appears
  • Worker logs show asyncio.exceptions.CancelledError
  • Activity retries multiple times (attempt: 4, 5, 6...)

Cause: Gemini API call exceeds activity timeout.

Solution:

  1. Check which model is being used:
# In llm.py - should be flash, not pro
model=self.flash_model_id  # Fast
# NOT: model=self.pro_model_id  # Slow
  2. Ensure thinking mode is disabled:
# Remove this line:
thinking_config=self._get_thinking_config()
  3. Increase the timeout if needed:
# In workflows.py
start_to_close_timeout=timedelta(minutes=3)  # Increase from 2
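The CancelledError in the worker logs comes from Temporal cancelling the activity's coroutine when start_to_close_timeout elapses. A minimal stand-alone sketch of the same mechanism, using asyncio.wait_for in place of the Temporal runtime (slow_gemini_call is a hypothetical stand-in, not the real client code):

```python
import asyncio

async def slow_gemini_call() -> str:
    # Hypothetical stand-in for the real Gemini request.
    try:
        await asyncio.sleep(5)  # slower than the timeout below
        return "feedback"
    except asyncio.CancelledError:
        # This is the asyncio.exceptions.CancelledError that shows up
        # in the worker logs when the activity deadline fires.
        raise

async def main() -> None:
    # Plays the role of start_to_close_timeout.
    await asyncio.wait_for(slow_gemini_call(), timeout=0.1)

try:
    asyncio.run(main())
except asyncio.TimeoutError:
    print("timed out; Temporal would retry the activity")
```

Raising the timeout (step 3) widens the window; switching to the flash model (step 1) shrinks the call so it fits inside it.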

Questions From Wrong Exam

Symptoms:

  • User takes PMP exam but sees AWS questions
  • Adaptive papers contain unrelated content

Cause: Bug in adaptive paper creation (questions not filtered by exam).

Solution: Ensure academic.py filters questions by exam:

query = db.query(models.Question).join(
    models.SectionQuestion
).join(
    models.Section
).join(
    models.Paper
).filter(
    models.Paper.exam_id == exam_id  # Critical filter
).distinct()

WebSocket Connection Fails

Symptoms:

  • Telemetry not being captured
  • Console shows WebSocket errors
  • No metrics in workflow state

Diagnosis:

// Check browser console
WebSocket connection to 'wss://...' failed

Solutions:

  1. CORS issue: Add the WebSocket origin to allowed origins
origins.append("wss://your-frontend-domain.com")
  2. Load balancer: Ensure the LB supports WebSocket upgrades
# GKE Ingress annotation
kubernetes.io/ingress.class: "gce"
cloud.google.com/backend-config: '{"default": "ws-backend-config"}'
  3. Timeout: Increase the idle timeout on the load balancer

Database Connection Refused

Symptoms:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) 
connection refused

Solutions:

  1. Local: Ensure PostgreSQL is running
docker-compose up -d
  2. Cloud SQL: Check the connection string
# Via proxy
cloud-sql-proxy project:region:instance &
DATABASE_URL=postgresql://user:PASSWORD@127.0.0.1:5432/db
  3. Private IP: Ensure the pod can reach Cloud SQL
kubectl exec -it pod-name -- nc -zv CLOUD_SQL_IP 5432
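Before debugging the network path, it can help to sanity-check the connection string itself. A small standard-library sketch (the DSN values are placeholders; check_dsn is an illustrative helper, not project code):

```python
from urllib.parse import urlparse

def check_dsn(dsn: str) -> tuple:
    """Return (host, port) from a postgresql:// DSN, with basic validation."""
    url = urlparse(dsn)
    if url.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {url.scheme!r}")
    if not url.hostname or not url.port:
        raise ValueError("DSN is missing host or port")
    return url.hostname, url.port

# Placeholder DSN in the shape used with the Cloud SQL proxy:
host, port = check_dsn("postgresql://user:PASSWORD@127.0.0.1:5432/db")
print(host, port)  # 127.0.0.1 5432
```

A malformed scheme or a missing port is a surprisingly common cause of "connection refused" that looks like a network problem.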

Temporal Workflow Not Found

Symptoms:

HTTPException: Active session workflow not found

Causes:

  1. Worker not running: Start the worker
python -m app.worker
  2. Wrong task queue: Verify the task queue name matches
# In sessions.py
task_queue="marathon-session-queue"  # Must match worker
  3. Workflow completed: Check whether the session was already finalized
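Task-queue mismatches are easiest to avoid by defining the queue name once and importing it on both sides. A sketch of that pattern (the module layout and comments are assumptions about how the worker and starter would use it, not the project's actual structure):

```python
# Hypothetical shared module, e.g. app/queues.py
TASK_QUEUE = "marathon-session-queue"

# Worker side (sketch):
#   Worker(client, task_queue=TASK_QUEUE, workflows=[...], activities=[...])
# Starter side, in sessions.py (sketch):
#   client.start_workflow(..., task_queue=TASK_QUEUE)

print(TASK_QUEUE)
```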

AI Features Return Mock Data

Symptoms:

  • Feedback says "Mock response"
  • TTS returns placeholder audio

Cause: GOOGLE_API_KEY not set or invalid.

Solution:

# Check if set
kubectl exec -it deployment/learnpanta-backend -- env | grep GOOGLE

# Set it
kubectl create secret generic backend-secrets \
  --from-literal=google-api-key="AIzaSy..."
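A quick way to reproduce the mock-data behavior locally is to check the key the same way the backend presumably does. This helper illustrates that assumed fallback logic; it is not the actual implementation:

```python
import os

def llm_is_mocked() -> bool:
    # Assumed fallback: with no key (or one that doesn't look like a
    # Google API key), the backend serves mock responses instead of
    # calling Gemini.
    key = os.environ.get("GOOGLE_API_KEY", "")
    return not key.startswith("AIzaSy")

os.environ.pop("GOOGLE_API_KEY", None)
print(llm_is_mocked())  # True: no key set, mock responses expected
```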

Curation Fails Silently

Symptoms:

  • /curator/batch-curate returns 200 but no questions created
  • Completion percentage stays at 0%

Diagnosis:

# Check worker logs during curation
kubectl logs deployment/learnpanta-backend -c worker -f

Common causes:

  1. Rate limiting: Gemini API rate limited

    • Solution: Reduce concurrency (concurrency=3)
  2. Invalid exam names: Search grounding can't find syllabus

    • Solution: Check exam name formatting
  3. No active papers: Exam has no paper to attach questions to

    • Solution: Create paper first
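The concurrency=3 suggestion for rate limiting can be implemented with a semaphore around the API call. A minimal sketch (curate_topic is a hypothetical stand-in for the real curation activity):

```python
import asyncio

async def curate_topic(topic: str, sem: asyncio.Semaphore) -> str:
    async with sem:
        # Stand-in for the rate-limited Gemini call.
        await asyncio.sleep(0)
        return f"questions for {topic}"

async def batch_curate(topics: list, concurrency: int = 3) -> list:
    # At most `concurrency` calls in flight at once.
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(curate_topic(t, sem) for t in topics))

results = asyncio.run(batch_curate(["risk", "scope", "cost", "quality"]))
print(len(results))  # 4
```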

Frontend Shows Blank Page

Symptoms:

  • White screen after login
  • Console shows API errors

Solutions:

  1. API URL wrong:
# .env.local
NEXT_PUBLIC_API_URL=https://api.yourdomain.com/api/v1
  2. CORS blocking:
# Backend
ALLOWED_ORIGINS=https://app.yourdomain.com
  3. API key mismatch: Frontend and backend are using different keys

Temporal Workflow Stuck / Not Advancing

Symptoms:

  • Active session workflow not found or workflow stays in ACTIVE
  • No new metrics processed; continue-as-new not triggered

Checks:

tctl workflow describe -w marathon-{session_id}

Fixes:

  1. If history is huge: lower continue-as-new threshold in workflow and redeploy worker.
  2. If workflow completed: reconnect via /ws/stream/{session_id} (ingestion auto-starts) or start manually.
  3. To force close: tctl workflow terminate -w marathon-{session_id} then recreate session.
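The tctl commands above assume the marathon-{session_id} workflow-id convention. A one-line helper keeps application code and operator commands in sync (the convention is taken from the commands above; the helper name is hypothetical):

```python
def marathon_workflow_id(session_id: str) -> str:
    # Same id the tctl commands target: marathon-{session_id}
    return f"marathon-{session_id}"

print(marathon_workflow_id("42"))  # marathon-42
```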

WebSocket Telemetry Not Arriving

Symptoms:

  • No metrics in Temporal state
  • Browser shows WS close codes 1006/1015

Fixes:

  1. Verify NEXT_PUBLIC_WS_URL (wss) and LB supports upgrades.
  2. Add WSS origin to ALLOWED_ORIGINS.
  3. Backend logs: look for Signal retry failed; if present, the backend cannot reach Temporal.
  4. Keep message payloads <64KB.
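The 64 KB limit in step 4 can be checked before sending on either side of the socket. A small sketch of that check (the limit comes from the list above; names are illustrative):

```python
import json

MAX_WS_PAYLOAD = 64 * 1024  # bytes, per the guidance above

def payload_fits(message: dict) -> bool:
    # Measure the serialized size, since that is what travels over the wire.
    return len(json.dumps(message).encode("utf-8")) < MAX_WS_PAYLOAD

print(payload_fits({"type": "telemetry", "metrics": [1, 2, 3]}))  # True
```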

Frontend Build Fails (CI)

Symptoms:

  • Cloud Build fails during next build
  • Missing env vars in logs

Fixes:

  1. Ensure _FIREBASE_* and _TLDRAW_LICENSE_KEY substitutions set in cloudbuild-frontend.yaml.
  2. Confirm docs copied to frontend/content/docs (step 0 in Cloud Build).
  3. Run pnpm build locally to surface type errors before pushing.

Kubernetes Issues

Pod CrashLoopBackOff

# Get events
kubectl describe pod POD_NAME

# Get logs from crashed container
kubectl logs POD_NAME --previous

Common causes:

  • Missing environment variables
  • Database connection failure
  • Invalid configuration

Out of Memory

# Check resource usage
kubectl top pods

# Increase limits
resources:
  limits:
    memory: "2Gi"

Image Pull Errors

# Check image exists
gcloud artifacts docker images list REPO

# Check pull secrets
kubectl get secrets

Performance Issues

Slow API Responses

Diagnosis:

# Time a request
time curl https://api.example.com/api/v1/exams

Solutions:

  1. Database indexes: Add indexes for common queries
CREATE INDEX idx_sessions_candidate ON sessions(candidate_id);
CREATE INDEX idx_papers_exam ON papers(exam_id);
  2. Connection pooling: Increase the pool size
engine = create_engine(DATABASE_URL, pool_size=20)
  3. Caching: Add Redis for frequently accessed data
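Before reaching for indexes or caching, confirm which handler is actually slow. A minimal timing decorator for bracketing suspect code paths (illustrative, not part of the codebase):

```python
import functools
import time

def timed(fn):
    """Print how long a call takes; useful for narrowing down slow endpoints."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{fn.__name__} took {elapsed:.3f}s")
    return wrapper

@timed
def list_exams():
    # Hypothetical stand-in for the real query behind /api/v1/exams.
    return ["PMP", "AWS"]

list_exams()
```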

High Memory Usage

Cause: Temporal workflow history growing too large.

Solution: Implement continue-as-new more aggressively:

if loop_count > 30:  # Reduce from 50
    workflow.continue_as_new(...)

Logs Reference

Backend Logs

# All logs
kubectl logs deployment/learnpanta-backend

# Follow logs
kubectl logs -f deployment/learnpanta-backend

# Filter by level
kubectl logs deployment/learnpanta-backend | grep ERROR

Worker Logs

kubectl logs deployment/learnpanta-backend -c worker

Key Log Patterns

Pattern                          | Meaning
Feedback Agent generating report | AI feedback starting
CancelledError                   | Activity timed out
Session Finished by User Signal  | Normal completion
continue_as_new                  | Workflow history reset
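When scanning large logs, these patterns can be folded into a small classifier. A sketch (patterns and meanings copied from the table above; the function name is illustrative):

```python
# Pattern -> meaning, from the table above.
LOG_PATTERNS = {
    "Feedback Agent generating report": "AI feedback starting",
    "CancelledError": "Activity timed out",
    "Session Finished by User Signal": "Normal completion",
    "continue_as_new": "Workflow history reset",
}

def classify_line(line: str):
    """Return the meaning of the first known pattern in the line, else None."""
    for pattern, meaning in LOG_PATTERNS.items():
        if pattern in line:
            return meaning
    return None

print(classify_line("worker: asyncio.exceptions.CancelledError"))  # Activity timed out
```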

Getting Help

Information to Include

When reporting issues, include:

  1. Steps to reproduce
  2. Expected vs actual behavior
  3. Relevant logs (sanitized)
  4. Environment (local/staging/prod)
  5. Recent changes

Log Collection Script

#!/bin/bash
echo "=== Pod Status ===" > debug.txt
kubectl get pods >> debug.txt
echo "=== Backend Logs ===" >> debug.txt
kubectl logs deployment/learnpanta-backend --tail=200 >> debug.txt
echo "=== Worker Logs ===" >> debug.txt
kubectl logs deployment/learnpanta-backend -c worker --tail=200 >> debug.txt
