Regression

This example evaluates QCMLRegressor on a regression task using the California housing dataset and compares it against traditional machine learning methods, showcasing QCMLRegressor's capabilities for continuous target prediction.

Overview

This regression evaluation includes:

  • Real-world dataset with California housing prices

  • Comprehensive metrics covering multiple aspects of regression performance

  • GPU acceleration when available, combined with dropout regularization

  • Cross-validation comparison with established sklearn regressors

  • Superior performance across all regression metrics

The analysis demonstrates QCMLRegressor achieving excellent performance on this challenging regression problem.

Complete Example

from honeio.integrations.sklearn.qcmlsklearn import QCMLRegressor

import pandas as pd
import torch

from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Load California housing dataset and set up CV
SEED = 0
K_FOLDS = 5
max_obs = 1000  # use only first 1000 observations for community edition

X, y = datasets.fetch_california_housing(return_X_y=True)
X = X[:max_obs]
y = y[:max_obs]

kf = KFold(n_splits=K_FOLDS, shuffle=True, random_state=SEED)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

Dataset Information

The California housing dataset provides a realistic regression benchmark:

X shape: (1000, 8)
y shape: (1000,)

  • 1000 samples with 8 features (housing characteristics)

  • Continuous target (median house values in hundreds of thousands of dollars)

  • Real-world data with practical applications

  • Community edition optimized sample size

Dataset features include:

  • Median income in the block group

  • Housing median age

  • Average number of rooms per household

  • Average number of bedrooms per household

  • Population of the block group

  • Average number of household members

  • Latitude and longitude coordinates
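
The feature order can be confirmed directly from the scikit-learn loader; this short check uses the standard Bunch attributes returned by fetch_california_housing:

housing = datasets.fetch_california_housing()
print(housing.feature_names)
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
#  'Population', 'AveOccup', 'Latitude', 'Longitude']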

Cross-Validation Setup

# Initialize models and run 5-fold CV
model_list = [
    QCMLRegressor(
        device="cuda" if torch.cuda.is_available() else "cpu",
        dropout_rate=0.3,
    ),
    LinearRegression(),
    RandomForestRegressor(),
]

error_funcs = [
    mean_absolute_percentage_error,
    root_mean_squared_error,
    mean_absolute_error,
    r2_score,
]

error_stats = {}
for model in model_list:
    model_name = model.__class__.__name__
    print(f"Training {model_name}...")
    for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Standardize the features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Train the model using the training sets
        model.fit(X_train_scaled, y_train)

        # Make predictions using the testing set
        y_pred = model.predict(X_test_scaled)

        for error_func in error_funcs:
            error_stats.setdefault((model_name, fold), {})[error_func.__name__] = error_func(y_test, y_pred)

Advanced Configuration

The QCMLRegressor is configured with advanced options:

Device Selection

Selects device="cuda" for GPU acceleration when torch.cuda.is_available() returns True, falling back to the CPU otherwise

Regularization

Includes dropout_rate=0.3 for robust generalization

Metric Selection

Comprehensive evaluation with four key regression metrics

Training Output

During training, community edition warnings appear throughout the process:

2025-08-07 11:32:45 [warning  ]
You are using the community edition of honeio.
There are some limitations that can be lifted by purchasing a commercial license.
Please contact support@qognitive.io for more information.

Training QCMLRegressor...
[Multiple community edition warnings during 5-fold CV]
Training LinearRegression...
Training RandomForestRegressor...

Results Analysis

# Summarize results
error_stats_df = pd.DataFrame(error_stats).T
average_error_stats = error_stats_df.groupby(level=0).mean()
average_error_stats.sort_values('mean_absolute_percentage_error', ascending=True, inplace=True)
print(average_error_stats)
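
Averages alone can hide fold-to-fold variation; since error_stats_df is already indexed by (model, fold), the standard deviation across folds is one more groupby away:

# Spread of each metric across the 5 folds
fold_std = error_stats_df.groupby(level=0).std()
print(fold_std)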

Performance Results

The cross-validation results demonstrate excellent regression performance:

Regression Model Comparison Results

Model                   MAPE     RMSE     MAE      R² Score
QCMLRegressor           0.1226   0.3918   0.2492   0.8034
RandomForestRegressor   0.1438   0.4273   0.2728   0.7668
LinearRegression        0.2143   0.5439   0.3834   0.6219

Key Findings

QCMLRegressor Excellence
  • Best MAPE at 12.26% (vs 14.38% for RandomForest)

  • Lowest RMSE at 0.3918 (vs 0.4273 for RandomForest)

  • Best MAE at 0.2492 (vs 0.2728 for RandomForest)

  • Highest R² at 0.8034 (vs 0.7668 for RandomForest)

Performance Advantages
  • ≈15% lower MAPE than RandomForestRegressor (0.1226 vs 0.1438)

  • ≈43% lower MAPE than LinearRegression (0.1226 vs 0.2143)

  • Superior across all metrics consistently

Regression Capabilities
  • Excellent predictive accuracy on continuous targets

  • Strong generalization with dropout regularization

  • GPU acceleration for faster training when available

Metric Interpretation

Mean Absolute Percentage Error (MAPE)
  • 12.26% average percentage error for QCMLRegressor

  • Ideal for comparing models across different scales

  • Lower values indicate better performance

Root Mean Squared Error (RMSE)
  • 0.3918 in hundreds of thousands of dollars

  • Penalizes larger errors more heavily

  • Scale: ~$39,180 root-mean-squared prediction error (conversion shown below)

Mean Absolute Error (MAE)
  • 0.2492 in hundreds of thousands of dollars

  • Robust to outliers, easier to interpret

  • Scale: ~$24,920 average absolute error

R² Score (Coefficient of Determination)
  • 0.8034 explains ~80% of target variance

  • Higher values indicate better model fit

  • Excellent performance for regression tasks
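
As a sanity check on the units, converting from the target's scale (hundreds of thousands of dollars) to dollars is a single multiplication; the figures below reproduce the dollar amounts quoted above:

# Targets are in units of $100,000, so multiply by 1e5 for dollars
rmse, mae = 0.3918, 0.2492
print(f"RMSE: ${rmse * 1e5:,.0f}")  # RMSE: $39,180
print(f"MAE:  ${mae * 1e5:,.0f}")   # MAE:  $24,920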

Configuration Insights

QCMLRegressor Parameters
  • device: Automatically selects best available (CPU/GPU)

  • dropout_rate=0.3: Prevents overfitting, improves generalization

  • Default epochs: Sufficient for convergence on this dataset

Cross-Validation Strategy
  • KFold (not stratified): Appropriate for continuous targets

  • 5-fold CV: Balance of bias-variance trade-off

  • Fixed seed: Ensures reproducible results

Preprocessing Standards
  • StandardScaler: Critical for neural network approaches

  • Feature scaling: Normalizes different feature ranges

  • Consistent preprocessing: Applied to all models fairly (see the Pipeline sketch below)
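
A Pipeline guarantees the scaler is fit only on each training fold and removes the manual fit_transform/transform bookkeeping. This is a minimal sketch assuming QCMLRegressor is fully scikit-learn compatible (it already exposes fit/predict above, so this should hold); Pipeline and cross_val_score are standard scikit-learn utilities:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Bundling scaling with the estimator prevents test-fold leakage
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", QCMLRegressor(
        device="cuda" if torch.cuda.is_available() else "cpu",
        dropout_rate=0.3,
    )),
])
scores = cross_val_score(pipe, X, y, cv=kf, scoring="r2")
print(f"R² per fold: {scores}")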

Best Practices Demonstrated

Regression Evaluation
  • Multiple metrics provide comprehensive assessment

  • Cross-validation essential for reliable estimates

  • Proper scaling critical for neural approaches

GPU Optimization
  • Automatic device detection for optimal performance

  • CUDA acceleration when available

  • Fallback to CPU ensures universal compatibility

Regularization Strategy
  • Dropout regularization prevents overfitting

  • Appropriate dropout rate (0.3) for this problem size

  • Balanced complexity for generalization

Next Steps

Parameter Tuning
  • Experiment with different dropout rates (0.1, 0.5, 0.7); a grid search sketch follows below

  • Try various epoch counts and learning rates

  • Explore hilbert_space_dim optimization
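
The pipe from the Pipeline sketch above drops straight into GridSearchCV. Note that hilbert_space_dim is only mentioned by name in this guide, so confirm the exact parameter name and its valid values against the honeio documentation before uncommenting that line:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__dropout_rate": [0.1, 0.3, 0.5, 0.7],
    # "model__hilbert_space_dim": [...],  # confirm name/values in the honeio docs
}
search = GridSearchCV(pipe, param_grid, cv=kf, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # best config and its RMSE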

Extended Comparisons
  • Include XGBoost and Support Vector Regression (see the snippet below)

  • Test on other regression datasets

  • Compare training times and memory usage
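
Extending the comparison is a one-line change per model, since the CV loop iterates over model_list. SVR ships with scikit-learn; XGBoost is an optional third-party dependency:

from sklearn.svm import SVR

# Append additional baselines; the CV loop above needs no other changes
model_list.append(SVR())

# XGBoost requires a separate install (pip install xgboost):
# from xgboost import XGBRegressor
# model_list.append(XGBRegressor(random_state=SEED))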

Advanced Analysis
  • Residual analysis and error distribution (sketched below)

  • Feature importance and sensitivity analysis

  • Learning curves and convergence behavior
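
A residual plot on a single held-out fold is a quick first diagnostic: a flat, centred cloud suggests the remaining error is unstructured. This sketch refits the first model in model_list (the QCMLRegressor) on one split and assumes matplotlib is installed:

import matplotlib.pyplot as plt

# Take the first CV split and refit on it
train_idx, test_idx = next(iter(kf.split(X, y)))
scaler = StandardScaler()
model = model_list[0]  # QCMLRegressor
model.fit(scaler.fit_transform(X[train_idx]), y[train_idx])
y_pred = model.predict(scaler.transform(X[test_idx]))
residuals = y[test_idx] - y_pred

plt.scatter(y_pred, residuals, s=8, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value ($100k)")
plt.ylabel("Residual ($100k)")
plt.show()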

Dataset Scaling
  • Full California housing dataset (20,640 samples) with commercial license

  • Other regression datasets (e.g. the diabetes dataset, shown below; note that the Boston housing dataset was removed from scikit-learn 1.2)

  • High-dimensional regression problems
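
The harness above applies unchanged to any (X, y) regression pair; for example, the scikit-learn diabetes dataset (442 samples, 10 features) loads the same way:

X, y = datasets.load_diabetes(return_X_y=True)
print(X.shape, y.shape)  # (442, 10) (442,)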

Related Examples