Now that we have a conversational agent that understands natural language and can provide a simple, heuristic-based, hotel price recommendation, we need to take the solution one step further: we want the recommendation to be based on the actual hotel data, to look at competitor prices, season, occupancy rate, etc. and provide an accurate pricing recommendation. For that, we need data. And a predictive model.

The dataset for Lemon Lagos Hotel

The granularity of the data is daily and per room type. The hotel offers three room types: standard, sea view and suite. The data, extracted from various sources, is described by the following data dictionary:

Time and calendar dimension

| Column | Type | Example | Description |
| --- | --- | --- | --- |
| date | Date | 2025-07-15 | Reference date of the record |
| weekday | Integer (0–6) | 1 | Day of the week (0 = Monday, 6 = Sunday) |
| season | Categorical | high | Season: low, shoulder, high |
| is_weekend | Binary (0/1) | 1 | Indicates a leisure weekend day (Fri/Sat) |
| is_holiday | Binary (0/1) | 0 | National holiday flag |
| event_flag | Binary (0/1) | 1 | Local event indicator affecting demand |

Product dimension (rooms)

| Column | Type | Example | Description |
| --- | --- | --- | --- |
| room_type | Categorical | sea_view | Room category (standard, sea_view, suite) |
| rooms_capacity | Integer | 20 | Total number of rooms of that type |
| rooms_occupied | Integer | 15 | Number of rooms occupied on that day |

Operations & Demand

| Column | Type | Example | Description |
| --- | --- | --- | --- |
| occupancy_rate | Float (0–1) | 0.75 | Occupancy rate per room type |
| demand_index | Float (0.05–1.0) | 0.82 | Normalized demand index |
| lead_time_avg | Float (days) | 18.4 | Average booking lead time |

Pricing & Market

| Column | Type | Example (€) | Description |
| --- | --- | --- | --- |
| avg_price | Float | 168.40 | Average selling price of the hotel |
| competitor_avg_price | Float | 162.30 | Estimated competitor average price |

Exploratory data analysis

Let’s explore this data visually and try to find out how the hotel works. I’ve imported the data from a CSV file in my GitHub repo here and created a single-page Power BI report to analyze the data here. Below, you may see the embedded report:

As you can see, demand is structurally seasonal, with peaks in the summer, plus some booking volatility driven by fierce competition, cancellations and other factors we can’t explain from the data.

In the low season, the occupancy rate is around 40–50%, whereas in the high season it reaches 85–98%. In the shoulder season it sits at 65–75%. These ranges are typical for the sector.

Comparing competitors’ prices with the hotel’s own prices, we once again notice the fierce competition: prices are almost identical all year round. This means the hotel follows a market-aligned pricing strategy with no aggressive undercutting. Premium pricing is only applied when demand is structurally high and capacity risk is minimal.

So, by now, we can see the implication for the agent: it must learn controlled price leadership, not aggressive overpricing.

Model architecture and training pipeline

Since our goal is not to forecast demand, we’re not going to use time series forecasting this time (like we did in a previous article). This time, the objective is to model the causal relationship between price and occupancy under different market conditions and then optimize the price to maximize revenue and occupancy. This reframes pricing as a controlled optimization problem, not a passive forecasting task.

Here’s the high-level architecture. Should you wish to analyze or download the code, you may find it in my GitHub repo here.

  • Target variable (label) y: occupancy_rate (continuous, between 0 and 1)
  • Learning task: supervised regression
  • Role of the model: learn price elasticity conditioned on market context

Core features:

  • avg_price
  • competitor_avg_price
  • season
  • is_weekend
  • is_holiday
  • event_flag
  • lead_time_avg
  • room_type
  • rooms_capacity

Feature engineering:

  • price_gap = avg_price - competitor_avg_price
  • price_ratio = avg_price / competitor_avg_price
  • month = month(date)

Encoding and preprocessing

  • season: one-hot encoding
  • room_type: one-hot encoding
  • numerical variables: we may use normalization for linear models, or keep them as-is for tree-based models (we’ll decide on that later)
  • missing-value strategy: an explicit “unknown” category for categorical variables, and median imputation for numerical variables.

To respect the temporal structure and avoid look-ahead bias:

  • Training set: 2023–2024
  • Validation set: early 2025
  • Test set: late 2025

Training the regression model

We’ll train a HistGradientBoostingRegressor, a tree-based model from the family of gradient-boosted decision trees. It builds many decision trees sequentially, each new tree focusing on correcting the errors of the previous ones, so an ensemble of many small models becomes very efficient and accurate.

Decision trees are particularly effective at learning rule-based patterns, such as “if the season is low, then the occupancy rate tends to be lower”, and at capturing non-linear relationships between the input features and the target variable. This makes them especially well-suited for pricing problems, where the impact of price on demand is rarely linear and depends strongly on context.

Here’s the full Python code with comments:

# train_pricing_model.py

import os
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# ---------------------------------------------------
# 1. Configuration
# ---------------------------------------------------

CSV_PATH = r"C:\Projetos\hotel_pricing\data\hotel_lagos_daily_rooms.csv"

# By default we model occupancy as a function of price and context.
# If you really want to estimate price instead, set this to "avg_price".
TARGET_COLUMN = "occupancy_rate"

MODEL_OUTPUT_PATH = r"C:\Projetos\hotel_pricing\regression_model\pricing_elasticity_model.pkl"

# ---------------------------------------------------
# 2. Load data
# ---------------------------------------------------

df = pd.read_csv(CSV_PATH, parse_dates=["date"])

# Basic sanity checks and type normalization
required_cols = [
    "date",
    "season",
    "room_type",
    "avg_price",
    "competitor_avg_price",
    "occupancy_rate",
    "lead_time_avg",
    "rooms_capacity",
    "is_weekend",
    "is_holiday",
    "event_flag",
]

missing_required = [c for c in required_cols if c not in df.columns]
if missing_required:
    raise ValueError(f"Missing required columns in CSV: {missing_required}")

# ---------------------------------------------------
# 3. Feature engineering
# ---------------------------------------------------

# Clip occupancy and lead time to realistic ranges
df["occupancy_rate"] = df["occupancy_rate"].clip(0.0, 1.0)
df["lead_time_avg"] = df["lead_time_avg"].clip(1, 40)

# Relative price features
df["price_gap"] = df["avg_price"] - df["competitor_avg_price"]
df["price_ratio"] = df["avg_price"] / df["competitor_avg_price"]

# Calendar features
df["month"] = df["date"].dt.month.astype(int)

# Ensure binary flags are integers (0/1)
for col in ["is_weekend", "is_holiday", "event_flag"]:
    df[col] = df[col].astype(int)

# ---------------------------------------------------
# 4. Train / validation / test split (time-based)
# ---------------------------------------------------

# Example: train on 2023–2024, validate on 2025-H1, test on 2025-H2
train_end = pd.Timestamp("2024-12-31")
val_end = pd.Timestamp("2025-06-30")

train_mask = df["date"] <= train_end
val_mask = (df["date"] > train_end) & (df["date"] <= val_end)
test_mask = df["date"] > val_end

df_train = df[train_mask].copy()
df_val = df[val_mask].copy()
df_test = df[test_mask].copy()

print(f"Train rows: {len(df_train)}, Val rows: {len(df_val)}, Test rows: {len(df_test)}")

# ---------------------------------------------------
# 5. Define features and target
# ---------------------------------------------------

feature_cols_numeric = [
    "avg_price",
    "competitor_avg_price",
    "price_gap",
    "price_ratio",
    "lead_time_avg",
    "month",
    "rooms_capacity",
    "is_weekend",
    "is_holiday",
    "event_flag",
]

feature_cols_categorical = [
    "season",
    "room_type",
]

X_train = df_train[feature_cols_numeric + feature_cols_categorical]
y_train = df_train[TARGET_COLUMN]

X_val = df_val[feature_cols_numeric + feature_cols_categorical]
y_val = df_val[TARGET_COLUMN]

X_test = df_test[feature_cols_numeric + feature_cols_categorical]
y_test = df_test[TARGET_COLUMN]

# ---------------------------------------------------
# 6. Preprocessing pipeline
# ---------------------------------------------------

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        # No scaler needed for tree-based models; add StandardScaler if using linear models
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # Dense output: HistGradientBoostingRegressor does not accept sparse input
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, feature_cols_numeric),
        ("cat", categorical_transformer, feature_cols_categorical),
    ]
)

# ---------------------------------------------------
# 7. Regression model
# ---------------------------------------------------

regressor = HistGradientBoostingRegressor(
    random_state=42,
    max_depth=None,
    learning_rate=0.1,
    max_iter=300,
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("regressor", regressor),
    ]
)

# ---------------------------------------------------
# 8. Train model
# ---------------------------------------------------

print("Training model...")
model.fit(X_train, y_train)

# ---------------------------------------------------
# 9. Evaluation helper
# ---------------------------------------------------

def evaluate(split_name: str, y_true, y_pred):
    # If we are modeling occupancy, clip to [0, 1] for interpretability
    if TARGET_COLUMN == "occupancy_rate":
        y_pred = np.clip(y_pred, 0.0, 1.0)

    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)

    print(f"\n[{split_name}]")
    print(f"MAE : {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R²  : {r2:.4f}")

# ---------------------------------------------------
# 10. Evaluate on validation and test
# ---------------------------------------------------

y_val_pred = model.predict(X_val)
evaluate("Validation", y_val, y_val_pred)

y_test_pred = model.predict(X_test)
evaluate("Test", y_test, y_test_pred)

# ---------------------------------------------------
# 11. Save trained model
# ---------------------------------------------------

os.makedirs(os.path.dirname(MODEL_OUTPUT_PATH), exist_ok=True)
joblib.dump(model, MODEL_OUTPUT_PATH)
print(f"\nModel saved to: {MODEL_OUTPUT_PATH}")

We train the model and test it by running the script above and get the following result:

Train rows: 2193, Val rows: 543, Test rows: 459

Training model...

[Validation]
MAE : 0.0268
RMSE: 0.0348
R²  : 0.9801

[Test]
MAE : 0.0247
RMSE: 0.0333
R²  : 0.9767

Model saved to: C:\Projetos\hotel_pricing\regression_model\pricing_elasticity_model.pkl

After extending the feature set to explicitly include weekend, holiday, and local event indicators, the model maintained consistently strong performance across both validation and out-of-time test sets. It achieved a Mean Absolute Error of approximately 0.027 on the validation set and 0.025 on the test set, corresponding to an average error of about 2.5 percentage points in the occupancy rate. For typical room capacities, this translates into less than one room of average prediction error.

The R² values of 0.98 on the validation set and 0.97 on the test set confirm that the model explains most of the variability in occupancy using price and market context alone. The fact that performance remained stable after adding new behavioral features indicates that the model structure is robust and that it has successfully captured the core price–demand relationships required for reliable pricing optimization.

Time to kick the tires

Ok, let’s think of a scenario where this model could be useful: a low-season date with no special events, for which we need to predict the occupancy rate.

We can tell Python to test the local model with:

{
        "name": "Low season – standard room – no event",
        "avg_price": 95,
        "competitor_avg_price": 100,
        "season": "low",
        "room_type": "standard",
        "lead_time_avg": 8,
        "rooms_capacity": 30,
        "is_weekend": 0,
        "is_holiday": 0,
        "event_flag": 0,
        "month": 1,
    }

And the output will be:

============================================================
Scenario: Low season – standard room – no event
Price: €95.00
Competitor price: €100.00
Season: low | Room type: standard | Month: 1 | Weekend: 0 | Holiday: 0 | Event: 0
Lead time: 8 days
Predicted occupancy: 46.89%
Expected revenue: €1,336.39

This is what we would expect from a low-season scenario. With a weekday in January, no events, and a slightly discounted price compared to the market, the model does not “invent” full occupancy. Instead, it predicts that around 47% of the standard rooms will be sold, which translates into roughly 14 out of 30 rooms and about €1.3k in revenue for that night.

The important point here is not the exact number, but the behavior: the model reacts to price and context in a way that is economically reasonable and consistent with the historical pattern we found in the data.
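The scoring step itself can be sketched as follows. This is a minimal sketch, not the repo’s exact script: `scenario_to_frame` and `score_scenario` are illustrative names, and the revenue formula assumes price × capacity × predicted occupancy, which is consistent with the figures above (€95 × 30 × 46.89% ≈ €1,336).

```python
# score_scenario.py — minimal sketch of scoring one scenario (names illustrative)
import numpy as np
import pandas as pd

# Columns the training pipeline above expects
FEATURES = [
    "avg_price", "competitor_avg_price", "price_gap", "price_ratio",
    "lead_time_avg", "month", "rooms_capacity",
    "is_weekend", "is_holiday", "event_flag",
    "season", "room_type",
]

def scenario_to_frame(scenario: dict) -> pd.DataFrame:
    """Build the one-row feature frame with the engineered price features."""
    row = dict(scenario)
    row.pop("name", None)
    row["price_gap"] = row["avg_price"] - row["competitor_avg_price"]
    row["price_ratio"] = row["avg_price"] / row["competitor_avg_price"]
    return pd.DataFrame([row])[FEATURES]

def score_scenario(model, scenario: dict) -> dict:
    """Predict occupancy for one scenario and derive expected revenue."""
    X = scenario_to_frame(scenario)
    occupancy = float(np.clip(model.predict(X)[0], 0.0, 1.0))
    revenue = scenario["avg_price"] * scenario["rooms_capacity"] * occupancy
    return {"predicted_occupancy": occupancy, "expected_revenue": revenue}

# Usage (path as in the training script):
# model = joblib.load(MODEL_OUTPUT_PATH)
# print(score_scenario(model, scenario))
```

Because the pipeline selects columns by name, the same helper works for any of the three scenarios.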

Let’s test the model with two additional scenarios:

{
        "name": "High season – sea view – strong demand (weekend + event)",
        "avg_price": 195,
        "competitor_avg_price": 190,
        "season": "high",
        "room_type": "sea_view",
        "lead_time_avg": 28,
        "rooms_capacity": 20,
        "is_weekend": 1,
        "is_holiday": 0,
        "event_flag": 1,
        "month": 8,
    },
    {
        "name": "Shoulder season – suite – weekend, no event",
        "avg_price": 220,
        "competitor_avg_price": 210,
        "season": "shoulder",
        "room_type": "suite",
        "lead_time_avg": 18,
        "rooms_capacity": 10,
        "is_weekend": 1,
        "is_holiday": 0,
        "event_flag": 0,
        "month": 5,
    }

What’s the model response?

============================================================
Scenario: High season – sea view – strong demand (weekend + event)
Price: €195.00
Competitor price: €190.00
Season: high | Room type: sea_view | Month: 8 | Weekend: 1 | Holiday: 0 | Event: 1
Lead time: 28 days
Predicted occupancy: 100.00%
Expected revenue: €3,900.00

============================================================
Scenario: Shoulder season – suite – weekend, no event
Price: €220.00
Competitor price: €210.00
Season: shoulder | Room type: suite | Month: 5 | Weekend: 1 | Holiday: 0 | Event: 0
Lead time: 18 days
Predicted occupancy: 68.59%
Expected revenue: €1,509.02

These results, too, are consistent with the exploratory analysis we did before: the model reacts to the season and to the competitor price.

In the next step, we will stop testing single prices and start simulating multiple price levels for the same scenario, so that we can search for the price that maximizes expected revenue.

We test the model on the same three scenarios with a price range around the competitor’s price:

  • A minimum of 50% of the competitor’s price
  • A maximum of 150% of the competitor’s price
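The grid search can be sketched along these lines. This is a minimal sketch with illustrative names (`optimize_price`); the 25-point grid is an assumption that matches the roughly €4.17 step visible between price points in the output below (e.g. €70.83, €75.00, €79.17).

```python
# price_grid_search.py — minimal sketch of revenue-maximizing price search
# (illustrative names; grid size is an assumption)
import numpy as np
import pandas as pd

def optimize_price(model, scenario: dict, low=0.5, high=1.5,
                   n_points=25) -> pd.DataFrame:
    """Evaluate a grid of prices around the competitor price, rank by revenue."""
    base = scenario["competitor_avg_price"]
    prices = np.linspace(low * base, high * base, n_points)
    rows = []
    for price in prices:
        row = dict(scenario)
        row.pop("name", None)
        # Re-derive the engineered price features for each candidate price
        row["avg_price"] = price
        row["price_gap"] = price - base
        row["price_ratio"] = price / base
        rows.append(row)
    X = pd.DataFrame(rows)
    occupancy = np.clip(model.predict(X), 0.0, 1.0)
    revenue = prices * occupancy * scenario["rooms_capacity"]
    grid = pd.DataFrame({"price": prices,
                         "predicted_occupancy": occupancy,
                         "expected_revenue": revenue})
    return grid.sort_values("expected_revenue", ascending=False)

# Usage: best = optimize_price(model, scenario).iloc[0]
```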

And here are the results:

==================================================================
Scenario: Low season – standard room – no event
Competitor price: €100.00 | Price range tested: €50.00 → €150.00
------------------------------------------------------
Top 5 price points by expected revenue:

 price predicted_occupancy expected_revenue
€75.00              95.41%        €2,146.76
€70.83              96.07%        €2,041.50
€66.67              98.61%        €1,972.23
€79.17              82.34%        €1,955.66
€62.50              97.43%        €1,826.88

Recommended price (argmax revenue):
Price: €75.00 | Predicted occupancy: 95.41% | Expected revenue: €2,146.76

==================================================================
Scenario: High season – sea view – strong demand (weekend + event)
Competitor price: €190.00 | Price range tested: €95.00 → €285.00
------------------------------------------------------
Top 5 price points by expected revenue:

  price predicted_occupancy expected_revenue
€261.25              86.02%        €4,494.47
€253.33              88.64%        €4,490.93
€269.17              82.48%        €4,440.23
€229.58              95.93%        €4,404.56
€285.00              77.26%        €4,403.54

Recommended price (argmax revenue):
Price: €261.25 | Predicted occupancy: 86.02% | Expected revenue: €4,494.47

==================================================================
Scenario: Shoulder season – suite – weekend, no event
Competitor price: €210.00 | Price range tested: €105.00 → €315.00
------------------------------------------------------
Top 5 price points by expected revenue:

  price predicted_occupancy expected_revenue
€183.75              90.93%        €1,670.78
€175.00              93.39%        €1,634.27
€201.25              79.47%        €1,599.36
€210.00              75.92%        €1,594.35
€166.25              95.81%        €1,592.89

Recommended price (argmax revenue):
Price: €183.75 | Predicted occupancy: 90.93% | Expected revenue: €1,670.78

These results are compelling. The first clear pattern across the three scenarios is that they reveal not only different optimal price points but also distinct levels of price elasticity. In low season, demand is highly elastic: small price reductions translate into disproportionately higher occupancy, which explains why the optimal price drops to €75, well below the benchmark. In contrast, during high season with strong demand, the market becomes markedly inelastic. Occupancy declines very slowly as prices rise, allowing an optimal price of €261.25 – far above the competitor’s €190 – while still delivering the highest expected revenue. This contrast shows how the same hotel can operate under completely different demand regimes depending on context.

The shoulder-season suite scenario sits between these two extremes, showing moderate elasticity. Lowering the price from €210 to around €183.75 boosts occupancy meaningfully, but not as sharply as in low season. Revenue optimization here depends on balancing a softer demand curve with the higher intrinsic value of the room. Taken together, the scenarios confirm that optimal pricing is fundamentally shaped by the elasticity regime: elastic under low demand, inelastic when demand is strong, and mixed during transitional periods. This reinforces the central argument of the article: intuition alone is insufficient because elasticity is dynamic and context-dependent, making algorithmic, data-driven pricing essential for consistent revenue optimization.
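These elasticity regimes can also be quantified directly from the model. Below is a minimal sketch (not one of the article’s scripts; `point_elasticity` and its helper are illustrative names) that estimates point price elasticity of occupancy as (% change in occupancy) / (% change in price) for a small price bump:

```python
# elasticity_probe.py — minimal sketch of a point-elasticity estimate
# (illustrative names; not part of the repo's scripts)
import numpy as np
import pandas as pd

def point_elasticity(model, scenario: dict, bump: float = 0.05) -> float:
    """(% change in predicted occupancy) / (% change in price)."""
    def occ_at(price: float) -> float:
        row = dict(scenario)
        row.pop("name", None)
        row["avg_price"] = price
        row["price_gap"] = price - row["competitor_avg_price"]
        row["price_ratio"] = price / row["competitor_avg_price"]
        return float(np.clip(model.predict(pd.DataFrame([row]))[0], 0.0, 1.0))

    p1 = scenario["avg_price"]
    p2 = p1 * (1.0 + bump)
    occ1, occ2 = occ_at(p1), occ_at(p2)
    return ((occ2 - occ1) / occ1) / ((p2 - p1) / p1)
```

A value near −1 marks the boundary between the elastic (low-season) and inelastic (high-season) regimes discussed above.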

Finally, to go a step further, we can investigate feature importance: what drives these outcomes? Let me run the feature-importance script and examine the results.
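The feature-importance script lives in the repo; a minimal sketch of the computation with scikit-learn’s permutation_importance could look like this (the function name is illustrative). Passing the whole pipeline means each original column, including season and room_type, is shuffled as a single feature, which matches the grouped names in the output below.

```python
# feature_importance.py — minimal sketch of permutation importance
# (illustrative function name; assumes the fitted pipeline and test split above)
import pandas as pd
from sklearn.inspection import permutation_importance

def importance_by_feature(model, X: pd.DataFrame, y,
                          n_repeats: int = 10, seed: int = 42) -> pd.Series:
    """Mean decrease in R² when each original column is shuffled."""
    result = permutation_importance(
        model, X, y, scoring="r2", n_repeats=n_repeats, random_state=seed
    )
    return (pd.Series(result.importances_mean, index=X.columns)
              .sort_values(ascending=False))

# Usage: print(importance_by_feature(model, X_test, y_test))
```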

Computing permutation feature importance...

Feature importance (by mean decrease in R²):

price_gap               0.524692
is_weekend              0.365607
season                  0.306367
price_ratio             0.221113
event_flag              0.140881
rooms_capacity          0.069459
competitor_avg_price    0.010325
room_type               0.009927
avg_price               0.009160
lead_time_avg           0.004647
is_holiday              0.004448
month                   0.001936

Permutation feature importance results

The permutation-based feature importance analysis shows that price_gap (the difference between our tested price and the competitor price) is the main driver, confirming that competitive positioning plays a central role in shaping demand. It is followed by is_weekend, season, and price_ratio, all of which materially influence occupancy by capturing structural patterns of demand intensity. The presence of an event also contributes meaningfully, though to a lesser extent, reinforcing that external spikes in demand matter but do not overpower baseline factors such as price dynamics and seasonality.

Interestingly, variables often assumed to be major drivers, such as competitor average price, room type, average historical price, lead time, holidays, or even the month, show very low importance in this model.

This suggests that, for the synthetic dataset and scenarios tested, relative pricing and timing (weekend/season) dominate the demand signal, while more granular attributes contribute marginally. This reinforces the broader conclusion: pricing optimization is primarily a function of relative market position and temporal context, rather than static room characteristics or calendar effects.

Want a tailored pricing strategy for your hotel?
Contact Swell AI Lab to schedule a quick diagnostic.