GPU vs CPU Benchmark
This example demonstrates the performance difference between GPU and CPU training for QCMLClassifier. Understanding these performance characteristics is crucial for optimizing training time and resource utilization in production environments.
Overview
This benchmark comparison includes:
Direct runtime comparison between CPU and GPU devices
Proper CUDA synchronization for accurate timing measurements
Complete workflow timing (fit, predict, predict_proba)
Real performance metrics on identical datasets
Practical hardware optimization guidance
On this workload, the analysis shows a roughly 5x runtime advantage when using GPU acceleration with QCML.
Complete Example
from sklearn import datasets
import time
import torch

from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

# Benchmark both devices; 'cuda' requires a CUDA-capable GPU
# and a CUDA-enabled PyTorch build
device_list = ['cpu', 'cuda']
max_obs = 1000  # use only the first 1000 observations for the community edition

X, y = datasets.load_digits(return_X_y=True)
X = X[:max_obs]
y = y[:max_obs]

elapsed_time_dict = {}
for device in device_list:
    if device == 'cuda':
        torch.cuda.synchronize()  # flush pending GPU work before timing starts
    start_time = time.time()

    model = QCMLClassifier(device=device)
    model.fit(X, y)
    label_forecasts = model.predict(X)
    label_forecasts_prob = model.predict_proba(X)

    if device == 'cuda':
        torch.cuda.synchronize()  # wait for all GPU work to finish before timing ends
    end_time = time.time()
    elapsed_time = end_time - start_time
    elapsed_time_dict[device] = elapsed_time

print(f"Runtime comparison: {elapsed_time_dict}")
Benchmark Setup
- Dataset Configuration
1,000 samples from the digits dataset (capped for the community edition)
64 features per sample (8x8 pixel images)
10 classes for multiclass classification
- Timing Methodology
Complete workflow timing (initialization, training, prediction)
CUDA synchronization ensures accurate GPU measurements (an event-based alternative is sketched after this section)
Identical operations performed on both devices
- Hardware Requirements
CUDA-compatible GPU required for GPU benchmarking
Sufficient GPU memory for model and data
PyTorch CUDA support properly installed
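As an aside, PyTorch also supports timing GPU work with CUDA events, which are recorded on the CUDA stream and so capture asynchronous kernel execution directly. Below is a minimal sketch of this alternative; the matrix multiplication is just a stand-in workload, not part of the QCML benchmark:

import torch

def time_gpu(fn):
    """Time a GPU operation with CUDA events; events are recorded on the
    CUDA stream, so they account for asynchronous kernel execution."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / 1000.0  # elapsed_time() reports milliseconds

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    print(f"GPU matmul: {time_gpu(lambda: a @ a):.4f} s")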
Training Output
During the benchmark, you’ll see community edition warnings for both devices:
2025-08-07 11:34:04 [warning ]
You are using the community edition of honeio.
There are some limitations that can be lifted by purchasing a commercial license.
Please contact support@qognitive.io for more information.
[Multiple warnings during CPU training phase]
[Multiple warnings during GPU training phase]
Performance Results
The benchmark results demonstrate significant GPU acceleration:
elapsed_time_dict
# Output: {'cpu': 19.073538541793823, 'cuda': 3.767685651779175}
| Device | Runtime (seconds) | Relative Performance | Speedup Factor |
|---|---|---|---|
| GPU (CUDA) | 3.77 | 100% | 5.1x faster |
| CPU | 19.07 | 19.8% | 1.0x (baseline) |
Key Findings
- GPU Performance Advantage
5.1x speedup with GPU acceleration
80% reduction in training time
Significant efficiency gains for identical operations
- Time Savings Analysis
CPU training: ~19 seconds for complete workflow
GPU training: ~3.8 seconds for complete workflow
Time saved: ~15.3 seconds (80% reduction)
- Scalability Implications
Dataset scaling: Larger datasets show even greater GPU benefits
Batch processing: GPU advantages increase with batch size
Production efficiency: Significant resource optimization potential
Performance Considerations
- GPU Advantages
Parallel processing: Quantum operations benefit from GPU parallelization
Memory bandwidth: High-throughput data movement
Matrix operations: Optimized linear algebra computations
- When to Use GPU
Large datasets: Benefits increase with data size
Frequent training: Repeated model training scenarios
Production workloads: Time-critical applications
Batch processing: Multiple model training tasks
- When CPU May Suffice
Small datasets: Overhead may offset benefits
Occasional use: Infrequent training scenarios
Limited GPU access: Hardware constraints
Development/testing: Quick prototyping phases (a toy device-selection heuristic is sketched below)
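To make this trade-off concrete, here is a toy device-selection heuristic; the 500-sample threshold is an arbitrary assumption for illustration, not a measured or library-recommended cutoff:

import torch

def pick_device(n_samples: int, gpu_threshold: int = 500) -> str:
    """Hypothetical heuristic: use the GPU only when one is available and
    the dataset is large enough to amortize initialization overhead."""
    if torch.cuda.is_available() and n_samples >= gpu_threshold:
        return "cuda"
    return "cpu"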
Hardware Requirements
- GPU Requirements
CUDA-compatible NVIDIA GPU
Sufficient VRAM (typically 2GB+ recommended)
CUDA drivers properly installed
PyTorch with CUDA support
- Software Dependencies
PyTorch CUDA build installed
NVIDIA drivers compatible with CUDA version
Sufficient system memory for data loading
Checking GPU Availability
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("GPU not available - using CPU")
Best Practices
- Benchmark Timing
CUDA synchronization essential for accurate GPU timing
Multiple runs recommended for reliable averages
Warm-up runs to account for GPU initialization overhead (see the sketch after this section)
- Memory Optimization
Batch size tuning for optimal GPU utilization
Memory monitoring to avoid out-of-memory errors (a monitoring sketch follows the device-selection pattern below)
Data loading optimization for GPU workflows
- Production Deployment
Device selection based on availability and performance requirements
Fallback strategies when GPU unavailable
Resource monitoring for optimal utilization
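Below is a minimal sketch of the warm-up-and-average pattern, reusing the fit/predict workflow from the complete example above; the helper name and the n_runs value are illustrative:

import time
import statistics
import torch
from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

def benchmark(device, X, y, n_runs=3):
    """Run the fit/predict workflow n_runs + 1 times and average the
    runtimes, discarding the first (warm-up) run to absorb one-time
    initialization costs such as CUDA context creation."""
    timings = []
    for run in range(n_runs + 1):
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        model = QCMLClassifier(device=device)
        model.fit(X, y)
        model.predict(X)
        if device == "cuda":
            torch.cuda.synchronize()
        if run > 0:  # discard the warm-up run
            timings.append(time.time() - start)
    return statistics.mean(timings)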
Code Pattern for Device Selection
import torch
from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

# Automatic device selection
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Initialize model with optimal device
model = QCMLClassifier(device=device)

# Optional: Print device info
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
Performance Scaling
- Expected Scaling Patterns
Small datasets (< 1000 samples): 2-5x speedup
Medium datasets (1000-10000 samples): 5-10x speedup
Large datasets (> 10000 samples): 10x+ speedup
- Factors Affecting Performance
Dataset size: Larger datasets show greater GPU benefits (see the scaling sketch after this list)
Model complexity: More complex models benefit more from parallelization
Batch size: Optimal batch sizes maximize GPU utilization
Hardware specifications: Newer GPUs provide better performance
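To verify these patterns on your own hardware, the timing loop from the complete example can be swept over dataset sizes. A minimal sketch using synthetic data; the sample sizes are illustrative, and synthetic data is used because the digits dataset itself only has 1,797 samples:

import time
import torch
from sklearn.datasets import make_classification
from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

for n_samples in [500, 1000, 2000]:  # illustrative sizes
    X, y = make_classification(n_samples=n_samples, n_features=64,
                               n_informative=32, n_classes=10, random_state=0)
    runtimes = {}
    for device in ["cpu", "cuda"]:
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        model = QCMLClassifier(device=device)
        model.fit(X, y)
        model.predict(X)
        if device == "cuda":
            torch.cuda.synchronize()
        runtimes[device] = time.time() - start
    print(f"{n_samples} samples: {runtimes['cpu'] / runtimes['cuda']:.1f}x speedup")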
Next Steps
- Optimization Strategies
Batch size tuning: Experiment with different batch sizes for optimal performance
Memory profiling: Monitor GPU memory usage for efficiency
Multi-GPU scaling: Explore distributed training for very large datasets
- Production Considerations
Cost-benefit analysis: Compare GPU costs vs time savings
Infrastructure planning: GPU resource allocation and scheduling
Monitoring and alerting: Performance tracking in production
- Extended Benchmarking
Different model sizes: Test performance across various configurations
Multiple datasets: Benchmark on different problem types
Hardware comparison: Compare different GPU models and generations
- Related Examples
See Intro to QCML for an introduction to QCML
Check Binary Classification for performance comparison methodology
Try Multiclass Classification for complex classification scenarios
Explore Regression for continuous target prediction
Review Scikit-learn Integration for parameter optimization