GPU vs CPU Benchmark

This example demonstrates the performance difference between GPU and CPU training for QCMLClassifier. Understanding these performance characteristics is crucial for optimizing training time and resource utilization in production environments.

Overview

This benchmark comparison includes:

  • Direct runtime comparison between CPU and GPU devices

  • Proper CUDA synchronization for accurate timing measurements

  • Complete workflow timing (fit, predict, predict_proba)

  • Real performance metrics on identical datasets

  • Practical hardware optimization guidance

The analysis demonstrates significant performance advantages when using GPU acceleration with QCML.

Complete Example

from sklearn import datasets
import time
import torch

from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

device_list = ['cpu']
if torch.cuda.is_available():
    device_list.append('cuda')  # benchmark the GPU only when one is present

max_obs = 1000  # use only first 1000 observations for community edition

X, y = datasets.load_digits(return_X_y=True)
X = X[:max_obs]
y = y[:max_obs]

elapsed_time_dict = {}
for device in device_list:
    if device == 'cuda':
        # wait for any pending GPU work before starting the clock
        torch.cuda.synchronize()

    start_time = time.perf_counter()  # monotonic clock, preferred for benchmarking
    model = QCMLClassifier(device=device)
    model.fit(X, y)
    label_forecasts = model.predict(X)
    label_forecasts_prob = model.predict_proba(X)

    if device == 'cuda':
        # CUDA kernels launch asynchronously; synchronize so the timing
        # includes all GPU work, not just the kernel launches
        torch.cuda.synchronize()

    end_time = time.perf_counter()
    elapsed_time = end_time - start_time
    elapsed_time_dict[device] = elapsed_time

print(f"Runtime comparison: {elapsed_time_dict}")

Benchmark Setup

Dataset Configuration
  • 1000 samples from the scikit-learn digits dataset (truncated for the community edition)

  • 64 features per sample (8x8 pixel images)

  • 10 classes for multiclass classification (see the check after this list)
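
A quick way to confirm these numbers yourself, a minimal sketch using the same scikit-learn loader as the benchmark:

import numpy as np
from sklearn import datasets

X, y = datasets.load_digits(return_X_y=True)
X, y = X[:1000], y[:1000]  # community edition limit used in the benchmark

print(X.shape)             # (1000, 64): 1000 samples, 8x8 = 64 pixel features
print(np.unique(y).size)   # 10 classes (digits 0-9)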

Timing Methodology
  • Complete workflow timing (initialization, training, prediction)

  • CUDA synchronization ensures accurate GPU measurements (a reusable timing helper is sketched after this list)

  • Identical operations performed on both devices
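
One way to package the synchronization pattern for reuse is a small context manager. The sync_timer helper below is illustrative only; it is not part of honeio or PyTorch:

import time
from contextlib import contextmanager

import torch

@contextmanager
def sync_timer(device, timings, key):
    """Time a code block, synchronizing CUDA before and after so that
    asynchronously launched kernels are fully included in the measurement."""
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield
    finally:
        if device == 'cuda':
            torch.cuda.synchronize()
        timings[key] = time.perf_counter() - start

# Usage:
# timings = {}
# with sync_timer('cuda', timings, 'fit'):
#     model.fit(X, y)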

Hardware Requirements
  • CUDA-compatible GPU required for GPU benchmarking

  • Sufficient GPU memory for model and data

  • PyTorch CUDA support properly installed

Training Output

During the benchmark, you’ll see community edition warnings for both devices:

2025-08-07 11:34:04 [warning  ]
You are using the community edition of honeio.
There are some limitations that can be lifted by purchasing a commercial license.
Please contact support@qognitive.io for more information.

[Multiple warnings during CPU training phase]
[Multiple warnings during GPU training phase]

Performance Results

The benchmark results demonstrate significant GPU acceleration:

elapsed_time_dict
# Output: {'cpu': 19.073538541793823, 'cuda': 3.767685651779175}
GPU vs CPU Performance Comparison

  Device       Runtime (seconds)   Relative Performance (GPU = 100%)   Speedup Factor
  GPU (CUDA)   3.77                100%                                5.1x faster
  CPU          19.07               19.8%                               1.0x (baseline)

Key Findings

GPU Performance Advantage
  • 5.1x speedup with GPU acceleration

  • 80% reduction in training time

  • Significant efficiency gains for identical operations

Time Savings Analysis
  • CPU training: ~19 seconds for complete workflow

  • GPU training: ~3.8 seconds for complete workflow

  • Time saved: ~15.3 seconds (80% reduction; checked in the snippet below)
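
These figures follow directly from the elapsed_time_dict produced by the benchmark; a quick check:

cpu_time = elapsed_time_dict['cpu']   # ~19.07 s
gpu_time = elapsed_time_dict['cuda']  # ~3.77 s

print(f"speedup:   {cpu_time / gpu_time:.1f}x")     # ~5.1x
print(f"reduction: {1 - gpu_time / cpu_time:.0%}")  # ~80%
print(f"saved:     {cpu_time - gpu_time:.1f} s")    # ~15.3 s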

Scalability Implications
  • Dataset scaling: larger datasets generally show even greater GPU benefits

  • Batch processing: GPU advantages increase with batch size

  • Production efficiency: Significant resource optimization potential

Performance Considerations

GPU Advantages
  • Parallel processing: Quantum operations benefit from GPU parallelization

  • Memory bandwidth: High-throughput data movement

  • Matrix operations: Optimized linear algebra computations

When to Use GPU
  • Large datasets: Benefits increase with data size

  • Frequent training: Repeated model training scenarios

  • Production workloads: Time-critical applications

  • Batch processing: Multiple model training tasks

When CPU May Suffice
  • Small datasets: Overhead may offset benefits

  • Occasional use: Infrequent training scenarios

  • Limited GPU access: Hardware constraints

  • Development/testing: Quick prototyping phases

Hardware Requirements

GPU Requirements
  • CUDA-compatible NVIDIA GPU

  • Sufficient VRAM (typically 2GB+ recommended)

  • CUDA drivers properly installed

  • PyTorch with CUDA support

Software Dependencies
  • PyTorch CUDA build installed

  • NVIDIA drivers compatible with CUDA version

  • Sufficient system memory for data loading

Checking GPU Availability

import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("GPU not available - using CPU")

Best Practices

Benchmark Timing
  • CUDA synchronization essential for accurate GPU timing

  • Multiple runs recommended for reliable averages

  • Warm-up runs to account for GPU initialization overhead (see the sketch after this list)
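
The sketch below combines all three practices. The benchmark function is illustrative, not a honeio API, and its defaults (5 measured runs, 1 warm-up) are arbitrary:

import time

import torch

from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

def benchmark(device, X, y, n_runs=5, n_warmup=1):
    """Time the full fit/predict workflow repeatedly and return per-run
    durations; warm-up runs absorb one-time costs such as CUDA context
    creation and are discarded."""
    times = []
    for i in range(n_warmup + n_runs):
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.perf_counter()
        model = QCMLClassifier(device=device)
        model.fit(X, y)
        model.predict(X)
        if device == 'cuda':
            torch.cuda.synchronize()
        if i >= n_warmup:  # keep only the measured runs
            times.append(time.perf_counter() - start)
    return times

# times = benchmark('cuda', X, y)
# print(f"mean runtime: {sum(times) / len(times):.2f} s over {len(times)} runs")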

Memory Optimization
  • Batch size tuning for optimal GPU utilization

  • Memory monitoring to avoid out-of-memory errors (see the snippet after this list)

  • Data loading optimization for GPU workflows
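
PyTorch's CUDA allocator statistics make the monitoring step straightforward; a minimal sketch:

import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()

    # ... run the GPU workload here, e.g. model.fit(X, y) with device='cuda' ...

    current_mib = torch.cuda.memory_allocated() / 1024**2
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    print(f"current allocation: {current_mib:.1f} MiB, peak: {peak_mib:.1f} MiB")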

Production Deployment
  • Device selection based on availability and performance requirements

  • Fallback strategies when GPU unavailable (sketched after this list)

  • Resource monitoring for optimal utilization
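
A simple fallback pattern might look like the sketch below. fit_with_fallback is an illustrative helper, not part of honeio, and the exact exceptions worth catching depend on your stack (torch.cuda.OutOfMemoryError requires PyTorch 1.13+):

import torch

from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

def fit_with_fallback(X, y):
    """Try the GPU first; fall back to CPU when CUDA is unavailable
    or the GPU runs out of memory."""
    if torch.cuda.is_available():
        try:
            model = QCMLClassifier(device='cuda')
            model.fit(X, y)
            return model
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying on CPU
    model = QCMLClassifier(device='cpu')
    model.fit(X, y)
    return model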

Code Pattern for Device Selection

import torch
from honeio.integrations.sklearn.qcmlsklearn import QCMLClassifier

# Automatic device selection
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Initialize model with optimal device
model = QCMLClassifier(device=device)

# Optional: Print device info
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

Performance Scaling

Expected Scaling Patterns
  • Small datasets (< 1000 samples): 2-5x speedup

  • Medium datasets (1000-10000 samples): 5-10x speedup

  • Large datasets (> 10000 samples): 10x+ speedup

Factors Affecting Performance
  • Dataset size: Larger datasets show greater GPU benefits

  • Model complexity: More complex models benefit more from parallelization

  • Batch size: Optimal batch sizes maximize GPU utilization

  • Hardware specifications: Newer GPUs provide better performance

Next Steps

Optimization Strategies
  • Batch size tuning: Experiment with different batch sizes for optimal performance

  • Memory profiling: Monitor GPU memory usage for efficiency

  • Multi-GPU scaling: Explore distributed training for very large datasets

Production Considerations
  • Cost-benefit analysis: Compare GPU costs vs time savings

  • Infrastructure planning: GPU resource allocation and scheduling

  • Monitoring and alerting: Performance tracking in production

Extended Benchmarking
  • Different model sizes: Test performance across various configurations

  • Multiple datasets: Benchmark on different problem types

  • Hardware comparison: Compare different GPU models and generations

Related Examples