Regression
This example demonstrates how to evaluate QCMLRegressor performance on a regression task using the California housing dataset and compare it with traditional machine learning methods, showcasing QCMLRegressor's capabilities for continuous target prediction.
Overview
This regression evaluation includes:
Real-world dataset with California housing prices
Comprehensive metrics covering multiple aspects of regression performance
GPU acceleration when available with dropout regularization
Cross-validation comparison with established sklearn regressors
Best scores across all four regression metrics in this benchmark
The analysis demonstrates QCMLRegressor achieving excellent performance on this challenging regression problem.
Complete Example
from honeio.integrations.sklearn.qcmlsklearn import QCMLRegressor
import pandas as pd
import torch
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
# Load California housing dataset and set up CV
SEED = 0
K_FOLDS = 5
max_obs = 1000 # use only first 1000 observations for community edition
X, y = datasets.fetch_california_housing(return_X_y=True)
X = X[:max_obs]
y = y[:max_obs]
kf = KFold(n_splits=K_FOLDS, shuffle=True, random_state=SEED)
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
Dataset Information
The California housing dataset provides a realistic regression benchmark:
X shape: (1000, 8)
y shape: (1000,)
1000 samples with 8 features (housing characteristics)
Continuous target (median house values in hundreds of thousands of dollars)
Real-world data with practical applications
Sample size capped at 1,000 observations for the community edition
Dataset features include:
- Median income in the block group
- Housing median age
- Average number of rooms per household
- Average number of bedrooms per household
- Population of the block group
- Average number of household members
- Latitude and longitude coordinates
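The feature names and their order can be confirmed programmatically from the scikit-learn dataset bundle; a minimal sketch using the standard fetch_california_housing return object:

# Inspect feature names directly from the dataset bundle
housing = datasets.fetch_california_housing()
print(housing.feature_names)
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
#  'AveOccup', 'Latitude', 'Longitude']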
Cross-Validation Setup
# Initialize models and run 5-fold CV
model_list = [
    QCMLRegressor(
        device="cuda" if torch.cuda.is_available() else "cpu",
        dropout_rate=0.3,
    ),
    LinearRegression(),
    RandomForestRegressor(),
]
error_funcs = [
    mean_absolute_percentage_error,
    root_mean_squared_error,
    mean_absolute_error,
    r2_score,
]
error_stats = {}
for model in model_list:
    model_name = model.__class__.__name__
    print(f"Training {model_name}...")
    for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # Standardize the features (fit on the training fold only to avoid leakage)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        # Train the model using the training set
        model.fit(X_train_scaled, y_train)
        # Make predictions on the held-out test set
        y_pred = model.predict(X_test_scaled)
        # Record every metric for this (model, fold) pair
        for error_func in error_funcs:
            error_stats.setdefault((model_name, fold), {})[error_func.__name__] = error_func(y_test, y_pred)
Advanced Configuration
The QCMLRegressor is configured with advanced options:
- Device Selection: automatically uses GPU acceleration when available via device="cuda", falling back to CPU otherwise
- Regularization: includes dropout_rate=0.3 for robust generalization
- Metric Selection: comprehensive evaluation with four key regression metrics
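Because QCMLRegressor follows the scikit-learn estimator interface, its full configuration can be inspected with the usual get_params() convention; a minimal sketch, assuming QCMLRegressor implements the standard sklearn parameter API:

# Inspect all configurable parameters (scikit-learn convention)
model = QCMLRegressor(
    device="cuda" if torch.cuda.is_available() else "cpu",
    dropout_rate=0.3,
)
print(model.get_params())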
Training Output
During training, community edition warnings appear throughout the process:
2025-08-07 11:32:45 [warning ]
You are using the community edition of honeio.
There are some limitations that can be lifted by purchasing a commercial license.
Please contact support@qognitive.io for more information.
Training QCMLRegressor...
[Multiple community edition warnings during 5-fold CV]
Training LinearRegression...
Training RandomForestRegressor...
Results Analysis
# Summarize results
error_stats_df = pd.DataFrame(error_stats).T
average_error_stats = error_stats_df.groupby(level=0).mean()
average_error_stats.sort_values('mean_absolute_percentage_error', ascending=True, inplace=True)
print(average_error_stats)
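Mean scores alone hide fold-to-fold variability. A one-line extension of the summary above reports the standard deviation across folds, using only the pandas objects already built:

# Fold-to-fold spread of each metric (same groupby key as the means)
fold_std = error_stats_df.groupby(level=0).std()
print(fold_std)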
Performance Results
The cross-validation results demonstrate excellent regression performance:
| Model | MAPE | RMSE | MAE | R² Score |
|---|---|---|---|---|
| QCMLRegressor | 0.1226 | 0.3918 | 0.2492 | 0.8034 |
| RandomForestRegressor | 0.1438 | 0.4273 | 0.2728 | 0.7668 |
| LinearRegression | 0.2143 | 0.5439 | 0.3834 | 0.6219 |
Key Findings
- QCMLRegressor Excellence
Best MAPE at 12.26% (vs 14.38% for RandomForest)
Lowest RMSE at 0.3918 (vs 0.4273 for RandomForest)
Best MAE at 0.2492 (vs 0.2728 for RandomForest)
Highest R² at 0.8034 (vs 0.7668 for RandomForest)
- Performance Advantages
~15% lower MAPE than RandomForestRegressor
~43% lower MAPE than LinearRegression
Best score on all four metrics in this comparison
- Regression Capabilities
Excellent predictive accuracy on continuous targets
Strong generalization with dropout regularization
GPU acceleration for faster training when available
Metric Interpretation
- Mean Absolute Percentage Error (MAPE)
12.26% average percentage error for QCMLRegressor
Ideal for comparing models across different scales
Lower values indicate better performance
- Root Mean Squared Error (RMSE)
0.3918 in hundreds of thousands of dollars
Penalizes larger errors more heavily
Scale: ~$39,180 root-mean-square prediction error in dollar terms
- Mean Absolute Error (MAE)
0.2492 in hundreds of thousands of dollars
Robust to outliers, easier to interpret
Scale: ~$24,920 average absolute error
- R² Score (Coefficient of Determination)
R² of 0.8034: the model explains ~80% of the target variance
Higher values indicate better model fit
Excellent fit for this regression task (a worked metric example follows)
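To make the four metrics concrete, here is a small worked example on hand-written arrays, reusing the sklearn functions imported in the complete example:

import numpy as np

y_true = np.array([2.0, 3.5, 1.0, 4.0])
y_pred = np.array([2.2, 3.0, 1.3, 3.6])

# MAPE: mean of |error| / |true| -- scale-free, lower is better
print(mean_absolute_percentage_error(y_true, y_pred))  # ~0.161
# RMSE: sqrt of mean squared error -- penalizes large misses
print(root_mean_squared_error(y_true, y_pred))         # ~0.367
# MAE: mean of |error| -- same units as the target
print(mean_absolute_error(y_true, y_pred))             # 0.35
# R^2: 1 - SS_res / SS_tot -- fraction of variance explained
print(r2_score(y_true, y_pred))                        # ~0.905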
Configuration Insights
- QCMLRegressor Parameters
device: Automatically selects best available (CPU/GPU)
dropout_rate=0.3: Prevents overfitting, improves generalization
Default epochs: Sufficient for convergence on this dataset
- Cross-Validation Strategy
KFold (not stratified): Appropriate for continuous targets
5-fold CV: balances bias and variance in the error estimates
Fixed seed: Ensures reproducible results
- Preprocessing Standards
StandardScaler: Critical for neural network approaches
Feature scaling: Normalizes different feature ranges
Consistent preprocessing: applied identically to all models for a fair comparison (see the Pipeline sketch below)
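The per-fold scaling in the CV loop can equivalently be expressed as a scikit-learn Pipeline, which refits the scaler inside every fold automatically and removes any risk of leakage; a sketch, assuming QCMLRegressor is Pipeline-compatible as its sklearn integration suggests:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Scaler + model bundled so cross_val_score refits both per fold
pipeline = make_pipeline(
    StandardScaler(),
    QCMLRegressor(
        device="cuda" if torch.cuda.is_available() else "cpu",
        dropout_rate=0.3,
    ),
)
scores = cross_val_score(pipeline, X, y, cv=kf, scoring="neg_root_mean_squared_error")
print(-scores.mean())  # average RMSE across the 5 folds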
Best Practices Demonstrated
- Regression Evaluation
Multiple metrics provide comprehensive assessment
Cross-validation essential for reliable estimates
Proper scaling critical for neural approaches
- GPU Optimization
Automatic device detection for optimal performance
CUDA acceleration when available
Fallback to CPU ensures universal compatibility
- Regularization Strategy
Dropout regularization prevents overfitting
Appropriate dropout rate (0.3) for this problem size
Balanced complexity for generalization
Next Steps
- Parameter Tuning
Experiment with different dropout rates (0.1, 0.5, 0.7)
Try various epoch counts and learning rates
Explore hilbert_space_dim optimization (a grid-search sketch follows)
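A grid search over those settings is a natural starting point. In this sketch the dropout_rate values come from the list above, while the hilbert_space_dim candidates are hypothetical placeholders; check the QCMLRegressor parameter documentation for valid values:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("qcml", QCMLRegressor(device="cuda" if torch.cuda.is_available() else "cpu")),
])
param_grid = {
    "qcml__dropout_rate": [0.1, 0.3, 0.5, 0.7],
    "qcml__hilbert_space_dim": [8, 16, 32],  # hypothetical values
}
search = GridSearchCV(pipe, param_grid, cv=kf, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)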
- Extended Comparisons
Include XGBoost and Support Vector Regression (sketched below)
Test on other regression datasets
Compare training times and memory usage
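Both additions slot into the existing evaluation loop without changes; SVR ships with scikit-learn, while XGBoost is an optional third-party dependency (pip install xgboost):

from sklearn.svm import SVR
from xgboost import XGBRegressor  # requires the optional xgboost package

# Extend the earlier model_list; the CV loop needs no other changes
model_list += [
    SVR(),
    XGBRegressor(random_state=SEED),
]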
- Advanced Analysis
Residual analysis and error distribution (see the plot sketch below)
Feature importance and sensitivity analysis
Learning curves and convergence behavior
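A residual plot is the quickest of these checks; this sketch reuses the y_test and y_pred left over from the last fold of the CV loop, so rerun the loop for the model of interest first:

import matplotlib.pyplot as plt

# Residuals vs. predictions for the most recently evaluated fold
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0.0, color="red", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residuals vs. predictions")
plt.show()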
- Dataset Scaling
Full California housing dataset (20,640 samples) with commercial license
Other regression datasets, e.g. diabetes (note that recent scikit-learn releases removed the Boston housing loader); see the loading sketch below
High-dimensional regression problems
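Loading the larger datasets follows the same pattern as the complete example above:

# Full California housing dataset (training on all rows requires a commercial license)
X_full, y_full = datasets.fetch_california_housing(return_X_y=True)
print(X_full.shape)  # (20640, 8)

# Diabetes regression dataset, bundled with scikit-learn
X_diab, y_diab = datasets.load_diabetes(return_X_y=True)
print(X_diab.shape)  # (442, 10)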
- Related Examples
See Intro to QCML for an introduction to QCML
Check Binary Classification for binary classification comparison
Try Multiclass Classification for 10-class classification examples
Explore GPU vs CPU Benchmark for hardware performance optimization
Review Scikit-learn Integration for parameter details