GPU vs CPU Benchmark
This example demonstrates the performance difference between GPU and CPU training for QCMLClassifier. Understanding these performance characteristics is crucial for optimizing training time and resource utilization in production environments.
Overview
This benchmark comparison includes:
Direct runtime comparison between CPU and GPU devices
Proper CUDA synchronization for accurate timing measurements
Complete workflow timing (fit, predict, predict_proba)
Real performance metrics on identical datasets
Practical hardware optimization guidance
On this workload, the analysis shows a roughly 5x runtime advantage when using GPU acceleration with QCML.
Complete Example
from sklearn import datasets
import time
import torch

from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

# Benchmark both devices; 'cuda' requires a CUDA-capable GPU
# and a CUDA-enabled PyTorch build
device_list = ['cpu', 'cuda']
max_obs = 1000  # use only the first 1000 observations for the community edition

X, y = datasets.load_digits(return_X_y=True)
X = X[:max_obs]
y = y[:max_obs]

elapsed_time_dict = {}
for device in device_list:
    if device == 'cuda':
        torch.cuda.synchronize()  # flush pending GPU work before timing starts
    start_time = time.time()

    model = QCMLClassifier(device=device)
    model.fit(X, y)
    label_forecasts = model.predict(X)
    label_forecasts_prob = model.predict_proba(X)

    if device == 'cuda':
        torch.cuda.synchronize()  # wait for all GPU work to finish before timing ends
    end_time = time.time()
    elapsed_time = end_time - start_time
    elapsed_time_dict[device] = elapsed_time

print(f"Runtime comparison: {elapsed_time_dict}")
Benchmark Setup
- Dataset Configuration
1,000 samples from the digits dataset (capped for the community edition)
64 features per sample (8x8 pixel images)
10 classes for multiclass classification
- Timing Methodology
Complete workflow timing (initialization, training, prediction)
CUDA synchronization ensures accurate GPU measurements (an event-based alternative is sketched after this section)
Identical operations performed on both devices
- Hardware Requirements
CUDA-compatible GPU required for GPU benchmarking
Sufficient GPU memory for model and data
PyTorch CUDA support properly installed
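As an aside, PyTorch also supports timing GPU work with CUDA events, which are recorded on the CUDA stream and so capture asynchronous kernel execution directly. Below is a minimal sketch of this alternative; the matrix multiplication is just a stand-in workload, not part of the QCML benchmark:

import torch

def time_gpu(fn):
    """Time a GPU operation with CUDA events; events are recorded on the
    CUDA stream, so they account for asynchronous kernel execution."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return start.elapsed_time(end) / 1000.0  # elapsed_time() reports milliseconds

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    print(f"GPU matmul: {time_gpu(lambda: a @ a):.4f} s")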
Training Output
During the benchmark, you’ll see community edition warnings for both devices:
2025-08-07 11:34:04 [warning ]
You are using the community edition of honeio.
There are some limitations that can be lifted by purchasing a commercial license.
Please contact support@qognitive.io for more information.
[Multiple warnings during CPU training phase]
[Multiple warnings during GPU training phase]
Performance Results
The benchmark results demonstrate significant GPU acceleration:
elapsed_time_dict
# Output: {'cpu': 19.073538541793823, 'cuda': 3.767685651779175}
| Device | Runtime (seconds) | Relative Performance | Speedup Factor |
|---|---|---|---|
| GPU (CUDA) | 3.77 | 100% | 5.1x faster |
| CPU | 19.07 | 19.8% | 1.0x (baseline) |
Key Findings
- GPU Performance Advantage
5.1x speedup with GPU acceleration
80% reduction in training time
Significant efficiency gains for identical operations
- Time Savings Analysis
CPU training: ~19 seconds for complete workflow
GPU training: ~3.8 seconds for complete workflow
Time saved: ~15.3 seconds (80% reduction)
- Scalability Implications
Dataset scaling: Larger datasets show even greater GPU benefits
Batch processing: GPU advantages increase with batch size
Production efficiency: Significant resource optimization potential
Performance Considerations
- GPU Advantages
Parallel processing: Quantum operations benefit from GPU parallelization
Memory bandwidth: High-throughput data movement
Matrix operations: Optimized linear algebra computations
- When to Use GPU
Large datasets: Benefits increase with data size
Frequent training: Repeated model training scenarios
Production workloads: Time-critical applications
Batch processing: Multiple model training tasks
- When CPU May Suffice
Small datasets: Overhead may offset benefits
Occasional use: Infrequent training scenarios
Limited GPU access: Hardware constraints
Development/testing: Quick prototyping phases (a toy device-selection heuristic is sketched below)
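To make this trade-off concrete, here is a toy device-selection heuristic; the 500-sample threshold is an arbitrary assumption for illustration, not a measured or library-recommended cutoff:

import torch

def pick_device(n_samples: int, gpu_threshold: int = 500) -> str:
    """Hypothetical heuristic: use the GPU only when one is available and
    the dataset is large enough to amortize initialization overhead."""
    if torch.cuda.is_available() and n_samples >= gpu_threshold:
        return "cuda"
    return "cpu"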
Hardware Requirements
- GPU Requirements
CUDA-compatible NVIDIA GPU
Sufficient VRAM (typically 2GB+ recommended)
CUDA drivers properly installed
PyTorch with CUDA support
- Software Dependencies
PyTorch CUDA build installed
NVIDIA drivers compatible with CUDA version
Sufficient system memory for data loading
Checking GPU Availability
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("GPU not available - using CPU")
Best Practices
- Benchmark Timing
CUDA synchronization essential for accurate GPU timing
Multiple runs recommended for reliable averages
Warm-up runs to account for GPU initialization overhead (see the sketch after this section)
- Memory Optimization
Batch size tuning for optimal GPU utilization
Memory monitoring to avoid out-of-memory errors (a monitoring sketch follows the device-selection pattern below)
Data loading optimization for GPU workflows
- Production Deployment
Device selection based on availability and performance requirements
Fallback strategies when GPU unavailable
Resource monitoring for optimal utilization
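Below is a minimal sketch of the warm-up-and-average pattern, reusing the fit/predict workflow from the complete example above; the helper name and the n_runs value are illustrative:

import time
import statistics
import torch
from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

def benchmark(device, X, y, n_runs=3):
    """Run the fit/predict workflow n_runs + 1 times and average the
    runtimes, discarding the first (warm-up) run to absorb one-time
    initialization costs such as CUDA context creation."""
    timings = []
    for run in range(n_runs + 1):
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        model = QCMLClassifier(device=device)
        model.fit(X, y)
        model.predict(X)
        if device == "cuda":
            torch.cuda.synchronize()
        if run > 0:  # discard the warm-up run
            timings.append(time.time() - start)
    return statistics.mean(timings)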
Code Pattern for Device Selection
import torch
from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

# Automatic device selection
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Initialize model with optimal device
model = QCMLClassifier(device=device)

# Optional: Print device info
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
Performance Scaling
- Expected Scaling Patterns
Small datasets (< 1000 samples): 2-5x speedup
Medium datasets (1000-10000 samples): 5-10x speedup
Large datasets (> 10000 samples): 10x+ speedup
- Factors Affecting Performance
Dataset size: Larger datasets show greater GPU benefits (see the scaling sketch after this list)
Model complexity: More complex models benefit more from parallelization
Batch size: Optimal batch sizes maximize GPU utilization
Hardware specifications: Newer GPUs provide better performance
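To verify these patterns on your own hardware, the timing loop from the complete example can be swept over dataset sizes. A minimal sketch using synthetic data; the sample sizes are illustrative, and synthetic data is used because the digits dataset itself only has 1,797 samples:

import time
import torch
from sklearn.datasets import make_classification
from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

for n_samples in [500, 1000, 2000]:  # illustrative sizes
    X, y = make_classification(n_samples=n_samples, n_features=64,
                               n_informative=32, n_classes=10, random_state=0)
    runtimes = {}
    for device in ["cpu", "cuda"]:
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        model = QCMLClassifier(device=device)
        model.fit(X, y)
        model.predict(X)
        if device == "cuda":
            torch.cuda.synchronize()
        runtimes[device] = time.time() - start
    print(f"{n_samples} samples: {runtimes['cpu'] / runtimes['cuda']:.1f}x speedup")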
Next Steps
- Optimization Strategies
Batch size tuning: Experiment with different batch sizes for optimal performance
Memory profiling: Monitor GPU memory usage for efficiency
Multi-GPU scaling: Explore distributed training for very large datasets
- Production Considerations
Cost-benefit analysis: Compare GPU costs vs time savings
Infrastructure planning: GPU resource allocation and scheduling
Monitoring and alerting: Performance tracking in production
- Extended Benchmarking
Different model sizes: Test performance across various configurations
Multiple datasets: Benchmark on different problem types
Hardware comparison: Compare different GPU models and generations
- Related Examples
See Intro to QCML for an introduction to QCML
Check Binary Classification for performance comparison methodology
Try Multiclass Classification for complex classification scenarios
Explore Regression for continuous target prediction
Review Scikit-learn Integration for parameter optimization