106 lines
2.6 KiB
Markdown
106 lines
2.6 KiB
Markdown
# Cross-Spec: GPU Resource Coordination
|
|
|
|
**Date**: 2026-05-16
|
|
**Hardware**: RTX 2060 Super 8GB (Desk-5439)
|
|
**Target Peak**: ~4.5GB VRAM
|
|
**Document**: GPU scheduling strategy for AI workloads
|
|
|
|
---
|
|
|
|
## GPU Workload Overview
|
|
|
|
| Feature | Queue | GPU Usage | Duration | Frequency |
|
|
|---------|-------|-----------|----------|-----------|
|
|
| AI Model Revision | ai-realtime | High (gemma4:e4b) | 5-30s | On user action |
|
|
| AI Model Revision | ai-batch | High (gemma4:e4b) | 30-120s | Background |
|
|
| RFA Approval | rfa-reminders | None | - | - |
|
|
| RFA Approval | rfa-distribution | None | - | - |
|
|
|
|
---
|
|
|
|
## Scheduling Strategy
|
|
|
|
### 1. Time-Based Scheduling
|
|
|
|
```
|
|
Peak Hours (09:00-18:00):
|
|
├── ai-realtime: ACTIVE (user requests)
|
|
└── ai-batch: PAUSED (defer to off-peak)
|
|
|
|
Off-Peak Hours (18:00-09:00):
|
|
├── ai-realtime: ACTIVE (reduced load)
|
|
└── ai-batch: ACTIVE (background processing)
|
|
```
|
|
|
|
### 2. Dynamic Pause/Resume
|
|
|
|
```typescript
|
|
// AiRealtimeProcessor auto-manages ai-batch
|
|
@Processor(QUEUE_AI_REALTIME, { concurrency: 1 })
|
|
export class AiRealtimeProcessor {
|
|
@OnWorkerEvent('active')
|
|
async pauseBatch() {
|
|
await this.aiBatchQueue.pause();
|
|
this.logger.log('Paused ai-batch for realtime job');
|
|
}
|
|
|
|
@OnWorkerEvent('completed')
|
|
async resumeBatch() {
|
|
const activeCount = await this.aiRealtimeQueue.getActiveCount();
|
|
if (activeCount === 0) {
|
|
await this.aiBatchQueue.resume();
|
|
this.logger.log('Resumed ai-batch (no active realtime jobs)');
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### 3. VRAM Budget Management
|
|
|
|
| Model | VRAM Usage | Context |
|
|
|-------|------------|---------|
|
|
| gemma4:e4b Q8_0 | ~4.5GB peak | Main inference |
|
|
| nomic-embed-text | ~0.5GB | Embedding only |
|
|
| **Total Budget** | **~5GB** | Safety margin 3GB |
|
|
|
|
### 4. Contention Prevention
|
|
|
|
- **Single Model Loading**: Only gemma4:e4b loaded at a time
|
|
- **No Concurrent GPU Jobs**: concurrency=1 for both AI queues
|
|
- **Memory Cleanup**: Explicit cleanup after each job
|
|
- **Queue Draining**: ai-batch pauses when ai-realtime active
|
|
|
|
---
|
|
|
|
## Monitoring Commands
|
|
|
|
```bash
|
|
# Monitor GPU usage on Desk-5439
|
|
watch -n 1 nvidia-smi
|
|
|
|
# Check Ollama model status
|
|
curl http://192.168.10.100:11434/api/ps
|
|
|
|
# Monitor queue states
|
|
redis-cli KEYS "bull:*:meta"
|
|
```
|
|
|
|
---
|
|
|
|
## Fallback Strategy
|
|
|
|
If GPU unavailable:
|
|
1. ai-realtime: Return "AI service temporarily unavailable"
|
|
2. ai-batch: Queue jobs with delay, retry every 5 minutes
|
|
3. RFA features: Unaffected (no GPU usage)
|
|
|
|
---
|
|
|
|
## Verification Checklist
|
|
|
|
- [x] ai-realtime has auto-pause for ai-batch
|
|
- [x] concurrency=1 for both AI queues
|
|
- [x] VRAM monitoring in place
|
|
- [x] Fallback handling for GPU unavailability
|
|
- [x] RFA queues don't use GPU
|