Regression

This example evaluates QCMLRegressor on a regression task using the California housing dataset and compares it against traditional machine learning methods, showcasing QCMLRegressor's capabilities for continuous target prediction.

Overview

This regression evaluation includes:

  • Real-world dataset with California housing prices

  • Comprehensive metrics covering multiple aspects of regression performance

  • GPU acceleration when available, combined with dropout regularization

  • Cross-validation comparison with established sklearn regressors

  • Superior performance across all regression metrics

The analysis demonstrates QCMLRegressor achieving excellent performance on this challenging regression problem.

Complete Example

from honeio.integrations.sklearn.qcmlsklearn import QCMLRegressor

import pandas as pd
import torch

from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Load California housing dataset and set up CV
SEED = 0
K_FOLDS = 5
max_obs = 1000  # use only first 1000 observations for community edition

X, y = datasets.fetch_california_housing(return_X_y=True)
X = X[:max_obs]
y = y[:max_obs]

kf = KFold(n_splits=K_FOLDS, shuffle=True, random_state=SEED)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

Dataset Information

The California housing dataset provides a realistic regression benchmark:

X shape: (1000, 8)
y shape: (1000,)

  • 1000 samples with 8 features (housing characteristics)

  • Continuous target (median house values in hundreds of thousands of dollars)

  • Real-world data with practical applications

  • Community edition optimized sample size

Dataset features include:

  • Median income in the block group

  • Housing median age

  • Average number of rooms per household

  • Average number of bedrooms per household

  • Population of the block group

  • Average number of household members

  • Latitude and longitude coordinates
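
The feature order can be confirmed directly from the scikit-learn loader; this short check uses the standard Bunch attributes returned by fetch_california_housing:

housing = datasets.fetch_california_housing()
print(housing.feature_names)
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
#  'Population', 'AveOccup', 'Latitude', 'Longitude']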

Cross-Validation Setup

# Initialize models and run 5-fold CV
model_list = [
    QCMLRegressor(
        device="cuda" if torch.cuda.is_available() else "cpu",
        dropout_rate=0.3,
    ),
    LinearRegression(),
    RandomForestRegressor(),
]

error_funcs = [
    mean_absolute_percentage_error,
    root_mean_squared_error,
    mean_absolute_error,
    r2_score,
]

error_stats = {}
for model in model_list:
    model_name = model.__class__.__name__
    print(f"Training {model_name}...")
    for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Standardize the features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Train the model using the training sets
        model.fit(X_train_scaled, y_train)

        # Make predictions using the testing set
        y_pred = model.predict(X_test_scaled)

        for error_func in error_funcs:
            error_stats.setdefault((model_name, fold), {})[error_func.__name__] = error_func(y_test, y_pred)

Advanced Configuration

The QCMLRegressor is configured with advanced options:

Device Selection

Selects device="cuda" for GPU acceleration when torch.cuda.is_available() returns True, falling back to the CPU otherwise

Regularization

Includes dropout_rate=0.3 for robust generalization

Metric Selection

Comprehensive evaluation with four key regression metrics

Training Output

During training, community edition warnings appear throughout the process:

2025-08-07 11:32:45 [warning  ]
You are using the community edition of honeio.
There are some limitations that can be lifted by purchasing a commercial license.
Please contact support@qognitive.io for more information.

Training QCMLRegressor...
[Multiple community edition warnings during 5-fold CV]
Training LinearRegression...
Training RandomForestRegressor...

Results Analysis

# Summarize results
error_stats_df = pd.DataFrame(error_stats).T
average_error_stats = error_stats_df.groupby(level=0).mean()
average_error_stats.sort_values('mean_absolute_percentage_error', ascending=True, inplace=True)
print(average_error_stats)
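
Averages alone can hide fold-to-fold variation; since error_stats_df is already indexed by (model, fold), the standard deviation across folds is one more groupby away:

# Spread of each metric across the 5 folds
fold_std = error_stats_df.groupby(level=0).std()
print(fold_std)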

Performance Results

The cross-validation results demonstrate excellent regression performance:

Regression Model Comparison Results

Model                   MAPE     RMSE     MAE      R² Score
QCMLRegressor           0.1226   0.3918   0.2492   0.8034
RandomForestRegressor   0.1438   0.4273   0.2728   0.7668
LinearRegression        0.2143   0.5439   0.3834   0.6219

Key Findings

QCMLRegressor Excellence
  • Best MAPE at 12.26% (vs 14.38% for RandomForest)

  • Lowest RMSE at 0.3918 (vs 0.4273 for RandomForest)

  • Best MAE at 0.2492 (vs 0.2728 for RandomForest)

  • Highest R² at 0.8034 (vs 0.7668 for RandomForest)

Performance Advantages
  • ≈15% lower MAPE than RandomForestRegressor (0.1226 vs 0.1438)

  • ≈43% lower MAPE than LinearRegression (0.1226 vs 0.2143)

  • Superior across all metrics consistently

Regression Capabilities
  • Excellent predictive accuracy on continuous targets

  • Strong generalization with dropout regularization

  • GPU acceleration for faster training when available

Metric Interpretation

Mean Absolute Percentage Error (MAPE)
  • 12.26% average percentage error for QCMLRegressor

  • Ideal for comparing models across different scales

  • Lower values indicate better performance

Root Mean Squared Error (RMSE)
  • 0.3918 in hundreds of thousands of dollars

  • Penalizes larger errors more heavily

  • Scale: ~$39,180 root-mean-squared prediction error (conversion shown below)

Mean Absolute Error (MAE)
  • 0.2492 in hundreds of thousands of dollars

  • Robust to outliers, easier to interpret

  • Scale: ~$24,920 average absolute error

R² Score (Coefficient of Determination)
  • 0.8034 explains ~80% of target variance

  • Higher values indicate better model fit

  • Excellent performance for regression tasks
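
As a sanity check on the units, converting from the target's scale (hundreds of thousands of dollars) to dollars is a single multiplication; the figures below reproduce the dollar amounts quoted above:

# Targets are in units of $100,000, so multiply by 1e5 for dollars
rmse, mae = 0.3918, 0.2492
print(f"RMSE: ${rmse * 1e5:,.0f}")  # RMSE: $39,180
print(f"MAE:  ${mae * 1e5:,.0f}")   # MAE:  $24,920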

Configuration Insights

QCMLRegressor Parameters
  • device: Automatically selects best available (CPU/GPU)

  • dropout_rate=0.3: Prevents overfitting, improves generalization

  • Default epochs: Sufficient for convergence on this dataset

Cross-Validation Strategy
  • KFold (not stratified): Appropriate for continuous targets

  • 5-fold CV: Balance of bias-variance trade-off

  • Fixed seed: Ensures reproducible results

Preprocessing Standards
  • StandardScaler: Critical for neural network approaches

  • Feature scaling: Normalizes different feature ranges

  • Consistent preprocessing: Applied to all models fairly (see the Pipeline sketch below)
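
A Pipeline guarantees the scaler is fit only on each training fold and removes the manual fit_transform/transform bookkeeping. This is a minimal sketch assuming QCMLRegressor is fully scikit-learn compatible (it already exposes fit/predict above, so this should hold); Pipeline and cross_val_score are standard scikit-learn utilities:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Bundling scaling with the estimator prevents test-fold leakage
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", QCMLRegressor(
        device="cuda" if torch.cuda.is_available() else "cpu",
        dropout_rate=0.3,
    )),
])
scores = cross_val_score(pipe, X, y, cv=kf, scoring="r2")
print(f"R² per fold: {scores}")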

Best Practices Demonstrated

Regression Evaluation
  • Multiple metrics provide comprehensive assessment

  • Cross-validation essential for reliable estimates

  • Proper scaling critical for neural approaches

GPU Optimization
  • Automatic device detection for optimal performance

  • CUDA acceleration when available

  • Fallback to CPU ensures universal compatibility

Regularization Strategy
  • Dropout regularization prevents overfitting

  • Appropriate dropout rate (0.3) for this problem size

  • Balanced complexity for generalization

Next Steps

Parameter Tuning
  • Experiment with different dropout rates (0.1, 0.5, 0.7); a grid search sketch follows below

  • Try various epoch counts and learning rates

  • Explore hilbert_space_dim optimization
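
The pipe from the Pipeline sketch above drops straight into GridSearchCV. Note that hilbert_space_dim is only mentioned by name in this guide, so confirm the exact parameter name and its valid values against the honeio documentation before uncommenting that line:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__dropout_rate": [0.1, 0.3, 0.5, 0.7],
    # "model__hilbert_space_dim": [...],  # confirm name/values in the honeio docs
}
search = GridSearchCV(pipe, param_grid, cv=kf, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # best config and its RMSE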

Extended Comparisons
  • Include XGBoost and Support Vector Regression (see the snippet below)

  • Test on other regression datasets

  • Compare training times and memory usage
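
Extending the comparison is a one-line change per model, since the CV loop iterates over model_list. SVR ships with scikit-learn; XGBoost is an optional third-party dependency:

from sklearn.svm import SVR

# Append additional baselines; the CV loop above needs no other changes
model_list.append(SVR())

# XGBoost requires a separate install (pip install xgboost):
# from xgboost import XGBRegressor
# model_list.append(XGBRegressor(random_state=SEED))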

Advanced Analysis
  • Residual analysis and error distribution (sketched below)

  • Feature importance and sensitivity analysis

  • Learning curves and convergence behavior
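
A residual plot on a single held-out fold is a quick first diagnostic: a flat, centred cloud suggests the remaining error is unstructured. This sketch refits the first model in model_list (the QCMLRegressor) on one split and assumes matplotlib is installed:

import matplotlib.pyplot as plt

# Take the first CV split and refit on it
train_idx, test_idx = next(iter(kf.split(X, y)))
scaler = StandardScaler()
model = model_list[0]  # QCMLRegressor
model.fit(scaler.fit_transform(X[train_idx]), y[train_idx])
y_pred = model.predict(scaler.transform(X[test_idx]))
residuals = y[test_idx] - y_pred

plt.scatter(y_pred, residuals, s=8, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value ($100k)")
plt.ylabel("Residual ($100k)")
plt.show()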

Dataset Scaling
  • Full California housing dataset (20,640 samples) with commercial license

  • Other regression datasets (e.g. the diabetes dataset, shown below; note that the Boston housing dataset was removed from scikit-learn 1.2)

  • High-dimensional regression problems
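
The harness above applies unchanged to any (X, y) regression pair; for example, the scikit-learn diabetes dataset (442 samples, 10 features) loads the same way:

X, y = datasets.load_diabetes(return_X_y=True)
print(X.shape, y.shape)  # (442, 10) (442,)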

Related Examples