Retail Demand Forecasting with Python: The First Step Toward Smarter Inventory Management

Inventory management is a critical challenge for most retail businesses. Overstocking ties up capital and increases the risk of obsolescence. Understocking, on the other hand, leads to lost sales, customer dissatisfaction, and potential long-term brand damage. That’s why accurate demand forecasting is essential—not just for operational efficiency but for strategic decision-making.

In this article, we’ll build a machine learning model in Python to forecast daily product demand using a real retail dataset. In the next article, we’ll deploy this model as a REST API using Azure Functions, enabling real-time prediction at scale. We’ll later use Power BI to visualize the forecast and Power Automate to trigger alerts when a stockout is predicted.

📦 Dataset and Business Problem

The data comes from Kaggle’s Retail Store Inventory Forecasting dataset. It includes over 73,000 records of product sales across multiple stores. Key columns include:

  • Date, Store ID, Product ID
  • Category, Region
  • Weather Condition, Seasonality
  • Inventory Level, Units Sold, Demand Forecast, Units Ordered
  • Price, Discount, Competitor Pricing
  • Holiday/Promotion indicators

Our goal is to predict daily Units Sold based on historical data and external factors. With this forecast, we’ll be able to detect whether the inventory is sufficient for the coming days and anticipate stockouts before they occur.
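
The stockout check described above can be sketched roughly as follows. This is an illustration only and not part of the original notebook: the helper name `days_until_stockout` and the sample numbers are hypothetical, and the sketch assumes no replenishment arrives during the forecast horizon.

```python
from typing import Optional, Sequence


def days_until_stockout(inventory: float, forecast: Sequence[float]) -> Optional[int]:
    """Return the 1-based day on which cumulative forecast demand first
    exceeds the current inventory, or None if stock lasts the whole horizon.

    Hypothetical helper: assumes no replenishment during the horizon.
    """
    remaining = inventory
    for day, demand in enumerate(forecast, start=1):
        remaining -= demand
        if remaining < 0:
            return day
    return None


# Made-up numbers: 300 units on hand and a few days of forecast demand.
print(days_until_stockout(300, [120, 110, 90, 130, 100]))  # 3
print(days_until_stockout(300, [50, 50, 50]))              # None
```

In a real pipeline the `forecast` values would come from the trained model's predictions for the coming days.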

🤖 Why This Matters

By using a machine learning model to predict demand, we unlock smarter inventory decisions. This approach scales well across thousands of products and locations, adapts to seasonal and regional patterns, and gives teams a proactive advantage.

We’re not just creating a report—we’re building an intelligent decision support system that connects data, automation, and operations.

🧠 Step 1 – Load Libraries and Define Context

Let’s begin by importing the libraries we’ll use and setting the stage for the model training.

# ===========================================
# DEMAND FORECASTING WITH RANDOM FOREST
# Dataset: Retail Store Inventory Forecasting
# ===========================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from scipy.stats import randint

This script sets up all the tools we’ll need:

  • pandas, numpy: For data manipulation and numerical operations.
  • matplotlib, seaborn: For visualization of distributions and trends.
  • scikit-learn: For model building, training/testing splits, tuning and evaluation.
  • scipy.stats: To support randomized search during model tuning.

These are well-established libraries that are widely used across the Python data science ecosystem.

📂 Step 2 – Load the Dataset and Extract Temporal Features

Our dataset is stored in Google Drive, so the first step is to mount the drive in our Colab environment and read the CSV file. Once the data is loaded, we perform basic preprocessing by extracting temporal features like the month, weekday, and year from the date column.

This is important because retail demand often follows strong seasonal and temporal patterns. For example:

  • Weekends may have higher sales than weekdays.
  • Certain months may reflect seasonal demand (e.g. holidays, summer).
  • The year indicator helps us capture long-term trends or multi-year seasonality.

Here’s the code that does this:

# -------------------------------------------
# Mount Google Drive
# -------------------------------------------
from google.colab import drive
drive.mount('/content/drive')

# -------------------------------------------
# STEP 2: Load dataset and initial processing
# -------------------------------------------
caminho_csv = "/content/drive/MyDrive/Datasets/retail_store_inventory.csv"
df = pd.read_csv(caminho_csv, parse_dates=["Date"])

# 'Date' is already parsed by read_csv (parse_dates); extract temporal features
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Year'] = df['Date'].dt.year

These new columns (Month, DayOfWeek, Year) will be used later as input variables for our machine learning model.

By including these temporal variables, the model can learn patterns like “sales spike on Fridays” or “demand drops in January,” which can improve accuracy when such patterns are present in the data.
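
One quick way to check whether a weekly pattern exists is to average the target per weekday. The sketch below uses a small made-up frame (the `toy` data is not from the Kaggle dataset) with the same `Date`, `Units Sold` and `DayOfWeek` column names as the article.

```python
import pandas as pd

# Toy data (made up): two weeks of daily sales for one product.
toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "Units Sold": [100, 95, 98, 102, 150, 160, 140,
                   101, 97, 99, 100, 155, 158, 138],
})
toy["DayOfWeek"] = toy["Date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Average sales per weekday; a large gap between days suggests a weekly pattern.
weekday_means = toy.groupby("DayOfWeek")["Units Sold"].mean()
print(weekday_means)
```

If the weekday averages are nearly flat on the real data, `DayOfWeek` is unlikely to add much signal.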

📊 Step 3 – Exploratory Data Analysis (EDA)

Before jumping into model training, it’s important to explore the dataset to understand the patterns, detect potential issues, and validate our assumptions.

We start by printing:

  • Data types of all columns
  • Missing values
  • Summary statistics for numerical fields

Then, we create three important visualizations:

# -------------------------------------------
# STEP 3: Exploratory Data Analysis (EDA)
# -------------------------------------------

# Overview
print("Data types:\n", df.dtypes)
print("\nMissing values:\n", df.isnull().sum())
print("\nSummary statistics:\n", df.describe())

# Distribution of target variable
sns.histplot(df['Units Sold'], bins=30, kde=True)
plt.title("Distribution of Units Sold")
plt.xlabel("Units Sold")
plt.ylabel("Frequency")
plt.show()

# Units Sold by Category
plt.figure(figsize=(10, 5))
sns.boxplot(x='Category', y='Units Sold', data=df)
plt.title("Units Sold by Category")
plt.xticks(rotation=45)
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

🔍 Insights from the Data

  • Summary statistics reveal that on average, around 136 units are sold per entry, with a standard deviation of 108 units. Prices range from €10 to €100, and inventory levels go up to 600 units. Promotions are active about 50% of the time.
  • The histogram of Units Sold shows a right-skewed distribution. Most products sell under 100 units per day, but a long tail stretches up to 500 units. This skewness can affect model performance and may benefit from log-transformation in more advanced versions.
  • The boxplot by Category indicates that the median demand is similar across product types, but the spread and number of outliers vary. Electronics and Furniture categories show higher variance, while Clothing and Toys are more stable.
  • The correlation heatmap reveals a strong correlation between Units Sold, Inventory Level, and Demand Forecast, which is expected and useful for predictive modeling. However, Price and Competitor Pricing are almost perfectly correlated (0.99), which might introduce multicollinearity if both are used as features.
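
The multicollinearity point in the last bullet can also be checked programmatically instead of by reading the heatmap. The sketch below is an assumption-laden illustration: the helper `highly_correlated_pairs`, the 0.95 threshold and the synthetic data are all hypothetical and only mimic the Price / Competitor Pricing relationship.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.95):
    """List (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs


# Synthetic stand-in for the real data: competitor price tracks price closely.
rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=500)
toy = pd.DataFrame({
    "Price": price,
    "Competitor Pricing": price + rng.normal(0, 2, size=500),
    "Discount": rng.integers(0, 21, size=500),
})
print(highly_correlated_pairs(toy))
```

For tree-based models such as Random Forests, correlated features mainly blur feature-importance scores rather than break predictions, so dropping one of the pair is optional.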

These EDA steps help validate our intuition and provide the foundation for feature engineering and model selection.
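
The log-transformation mentioned in the insights is not applied in this article, but one possible way to try it is scikit-learn's `TransformedTargetRegressor`. The sketch below runs on synthetic data; the variable names `X_toy`, `y_toy` and `log_model` are placeholders, not part of the original notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic right-skewed target (not the article's data).
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(300, 3))
y_toy = np.expm1(3 * X_toy[:, 0] + rng.normal(0, 0.1, size=300))

# The regressor is fitted on log1p(y); predictions are mapped back with expm1.
log_model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_toy, y_toy)
preds = log_model.predict(X_toy[:5])
print(preds)
```

Because predictions come back on the original scale, RMSE can be computed exactly as in the untransformed case.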

🧼 Step 4 – Data Preprocessing

After understanding the data, we need to prepare it for machine learning. Most ML algorithms, including scikit-learn's Random Forest implementation, expect numerical inputs. That means we need to transform categorical variables into a format the model can understand.

We use one-hot encoding to convert categories (like region or product type) into binary columns, where each possible value becomes a new feature.

Then, we define our input features (X) and the target variable (y), which in this case is the number of units sold.

# -------------------------------------------
# STEP 4: Data preprocessing
# -------------------------------------------

# One-hot encoding
df_model = pd.get_dummies(df, columns=[
    'Store ID', 'Product ID', 'Category', 'Region',
    'Weather Condition', 'Seasonality'
])

# Define features (X) and target (y)
X = df_model.drop(columns=['Units Sold', 'Date'])
y = df_model['Units Sold']

Now we have a clean and structured dataset, fully numerical and ready to feed into a machine learning model. This is one of the most important steps to ensure accurate predictions.
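
To make the one-hot step concrete, here is a toy example of what `pd.get_dummies` produces. It also shows one way to align new data to the training columns, which becomes relevant once the model serves predictions on single rows. The frames `train` and `new` are made up for illustration.

```python
import pandas as pd

# Toy training data with one categorical column.
train = pd.DataFrame({"Region": ["North", "South", "East"],
                      "Price": [10.0, 12.5, 9.0]})
train_encoded = pd.get_dummies(train, columns=["Region"])
print(train_encoded.columns.tolist())
# ['Price', 'Region_East', 'Region_North', 'Region_South']

# A new row contains only one region, so get_dummies alone yields fewer columns.
new = pd.DataFrame({"Region": ["South"], "Price": [11.0]})
new_encoded = pd.get_dummies(new, columns=["Region"])

# Align to the training layout; categories missing from `new` are filled with 0.
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(new_aligned.columns.tolist())
```

Keeping the list of training columns alongside the saved model is a simple way to make this alignment reproducible.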

🌲 Step 5 – Train/Test Split and Model Training

With our features and target defined, we now split the dataset into a training set and a test set. This allows us to train the model on one part of the data and evaluate it on unseen data to measure generalization.

We use an 80/20 split and train a Random Forest Regressor using the best hyperparameters previously found through randomized search.

# -------------------------------------------
# STEP 5: Train/test split and model training
# -------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

best_model = RandomForestRegressor(
    n_estimators=124,
    max_depth=9,
    min_samples_split=8,
    min_samples_leaf=3,
    random_state=42
)
best_model.fit(X_train, y_train)

The selected hyperparameters balance model complexity and overfitting prevention:

  • n_estimators=124: Number of trees in the forest
  • max_depth=9: Limits the depth of each tree to control overfitting
  • min_samples_split=8: Minimum number of samples required to split an internal node
  • min_samples_leaf=3: Minimum number of samples required to be at a leaf node

These parameters were selected through a RandomizedSearchCV tuning process.

[Image: the resulting model configuration in Colab]
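
The tuning run itself is not shown in this article. A minimal sketch of what such a search could look like follows. The search ranges, `n_iter`, `cv` and the synthetic data are assumptions chosen to keep the example fast, not the settings that produced the values above.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data so the sketch runs on its own.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 5, size=200)

# Hypothetical search ranges; the article does not state the ones it used.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(3, 10),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                               # small on purpose, to keep it fast
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

On the real data, `X_train` and `y_train` would replace the synthetic arrays, and `search.best_params_` would supply the values passed to `RandomForestRegressor`.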

Next, we’ll evaluate the model’s performance using RMSE and analyze the prediction results.

📏 Step 6 – Evaluate Model Performance

Once the model is trained, we need to evaluate how well it performs on unseen data. To do this, we use the test set to generate predictions and compute the Root Mean Squared Error (RMSE)—a standard metric for regression problems.

RMSE gives us a sense of the average prediction error in the same units as the target variable (units sold).

# -------------------------------------------
# STEP 6: Evaluate model performance
# -------------------------------------------
y_pred = best_model.predict(X_test)

# Calculate Mean Squared Error and then take the square root for RMSE
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"✅ Root Mean Squared Error: {rmse:.2f}")

✅ Root Mean Squared Error: 8.37

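This means a typical forecast error is around 8.37 units per record. RMSE squares each error before averaging, so large misses weigh more than they would in a simple average. Given that daily sales range from 0 to 500 and the standard deviation is above 100, this is a very good result for a first iteration. Note that the feature set includes the dataset's own Demand Forecast column, which the heatmap showed to be strongly correlated with Units Sold, so much of this accuracy may come from that column rather than from the historical and calendar features alone.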
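
To put an RMSE figure in context, it helps to compare it with a naive baseline and with the mean absolute error. The sketch below uses made-up numbers (`y_true` and `y_hat` are illustrative, not the article's results) and shows the calculation only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actuals and predictions, only to illustrate the calculation.
y_true = np.array([100.0, 150.0, 80.0, 200.0, 120.0])
y_hat = np.array([108.0, 141.0, 85.0, 192.0, 126.0])

rmse_model = np.sqrt(mean_squared_error(y_true, y_hat))
mae_model = mean_absolute_error(y_true, y_hat)

# Naive baseline: always predict the mean. For simplicity the mean of the
# actuals is used here; in practice it would be the training-set mean.
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

print(f"Model RMSE:    {rmse_model:.2f}")
print(f"Model MAE:     {mae_model:.2f}")
print(f"Baseline RMSE: {rmse_baseline:.2f}")
```

A model is only useful if its RMSE is clearly below the baseline, whose RMSE is roughly the standard deviation of the target.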

In the next step, we’ll save the model and begin the process of operationalizing it as a REST API using Azure Functions.

💾 Step 7 – Export the Model

After training and validating our model, we export it using joblib. This allows us to reuse the trained model in a production environment, such as a web service or an API endpoint.

We’ll deploy this model in the next article using Azure Functions to serve real-time predictions via HTTP requests.

# -------------------------------------------
# STEP 7: Export model to .pkl
# -------------------------------------------
import joblib
joblib.dump(best_model, "modelo_previsao_vendas.pkl")

This command serializes the entire model object into a binary file called modelo_previsao_vendas.pkl, which can be loaded later in another Python environment without retraining, provided compatible versions of scikit-learn and joblib are installed there.
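
Loading the file back is the mirror image of the export. The sketch below round-trips a small stand-in model through a temporary directory so it can run on its own. The stand-in model and data are assumptions; only the file name comes from the article.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model so the sketch is self-contained.
rng = np.random.default_rng(0)
X_small = rng.uniform(0, 1, size=(50, 2))
y_small = 10 * X_small[:, 0]
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modelo_previsao_vendas.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X_small), restored.predict(X_small))
print(same)  # True
```

At prediction time the input must have the same columns, in the same order, as the training matrix `X`.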

In the next article, we’ll show you how to:

  • Deploy this model in a serverless environment with Azure Functions
  • Create a REST API to serve predictions
  • Consume the API from Power BI or Power Apps

🚀 Stay tuned for Part 2: “Deploying Your Demand Forecast Model with Azure Functions”

Nuno Nogueira