Using Categorical Variables as Explanatory Variables

June 8, 2024

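When a categorical variable is used as an explanatory variable, it usually has to be converted to numeric data first, typically by turning it into dummy variables (one-hot encoding). The following sample code fits a regression model whose features include a categorical variable.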
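For intuition, here is a minimal sketch of what dummy encoding produces on a three-row toy column. The column name `color` and its values are made up for illustration and are not part of the sample code.

```python
import pandas as pd

# Toy column; the name `color` and its values are hypothetical.
toy = pd.DataFrame({"color": ["red", "green", "red"]})

# One 0/1 column per distinct category value.
# dtype=int forces integers (recent pandas versions default to booleans).
encoded = pd.get_dummies(toy["color"], prefix="color", dtype=int)
print(encoded)
```

Each row has a 1 in the column that matches its category and 0 everywhere else, which is the form a linear model can consume.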

python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Settings for the synthetic data
np.random.seed(0)
n_samples = 1000
n_categories = 3
n_features = 5

# Generate synthetic data: one categorical column and five numeric columns
data = {
    'category': np.random.choice(['A', 'B', 'C'], size=n_samples),
    'feature1': np.random.rand(n_samples),
    'feature2': np.random.rand(n_samples),
    'feature3': np.random.rand(n_samples),
    'feature4': np.random.rand(n_samples),
    'feature5': np.random.rand(n_samples)
}
df = pd.DataFrame(data)

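# Generate the target variable from the numeric features only;
# the category has no effect on y in this synthetic data
coefficients = np.array([3, 1.5, 2, 0.5, 1])
intercept = 2
# One-hot encode the categorical column (one 0/1 column per category)
X_categorical = pd.get_dummies(df['category'], prefix='category', dtype=float)
X_numeric = df.drop(columns=['category'])
X = pd.concat([X_categorical, X_numeric], axis=1)
y = np.dot(X_numeric.values, coefficients) + intercept + np.random.normal(0, 0.1, n_samples)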

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model (R^2 on the training and test sets)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train R^2 score: {train_score:.3f}")
print(f"Test R^2 score: {test_score:.3f}")

# Predict on the test data
y_pred = model.predict(X_test)

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.3f}")

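This code generates synthetic data with NumPy, stores it in a pandas DataFrame, and converts the categorical variable into dummy variables with the get_dummies function. It then concatenates the dummy columns with the numeric columns to build the feature matrix X and trains a regression model on it. Finally, it evaluates the model with the R^2 score and the mean squared error. Note that the target in this example is built from the numeric features only, so the dummy columns carry no real signal; the example shows the mechanics of the encoding rather than a category effect.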
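The sample code imports OneHotEncoder, ColumnTransformer, and Pipeline but never uses them. As a rough sketch of how they could fit together, the variant below does the encoding inside a scikit-learn pipeline. It is not the code above: the per-category offsets in `category_effect` are made-up values, added so that the category actually influences the target, and only two numeric features are used to keep it short. `drop="first"` omits one dummy column, which avoids the redundancy between a full set of dummies and the intercept.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_samples = 1000

df = pd.DataFrame({
    "category": rng.choice(["A", "B", "C"], size=n_samples),
    "feature1": rng.random(n_samples),
    "feature2": rng.random(n_samples),
})

# Hypothetical per-category offsets so the category really affects y.
category_effect = {"A": 0.0, "B": 1.0, "C": -2.0}
y = (
    3.0 * df["feature1"]
    + 1.5 * df["feature2"]
    + df["category"].map(category_effect)
    + 2.0
    + rng.normal(0, 0.1, n_samples)
).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42
)

# Encode only the categorical column; pass the numeric columns through.
# drop="first" uses category A as the baseline level.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(drop="first"), ["category"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
model.fit(X_train, y_train)

test_score = model.score(X_test, y_test)
coef = model.named_steps["regressor"].coef_
print(f"Test R^2 score: {test_score:.3f}")
print("Coefficients (B vs A, C vs A, feature1, feature2):", coef)
```

Because the encoder is fitted inside the pipeline, the same encoding learned from the training data is applied automatically at prediction time. The first two coefficients should land close to the assumed offsets of B and C relative to A.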


Posted by ぼっち
