Objective
- Show the real-world difference between a generative model (text) and a reasoning model (ML on tabular data).
- Demonstrate that reasoning models give accurate, explainable results, while generative models can sound right but be wrong.
- Use LIME — lime_tabular for per-case reasoning, lime_text to peek into generative behavior.
Learning Outcomes
- Understand why decision models are a better fit for structured analytics tasks.
- Learn how to engineer features, build preprocessing pipelines, and train a calibrated classifier for reliable probabilities.
- Apply LIME for local interpretability on tabular predictions (and a heuristic for text).
- See controlled examples where generative models produce convincing but wrong answers versus accurate reasoning-based answers tied to data.
Dataset Overview
- Dataset: Titanic passenger dataset from a public GitHub mirror.
- Target: Survived (0/1).
- Columns: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, plus engineered features FamilySize, IsAlone, Title, TicketPrefix, CabinInitial.
- Balance: approximately 38.4% survived, 61.6% did not survive.
Step 0 — Configuration and environment checks
Ensure reproducibility, check versions, and verify dependencies for smooth execution.
# Configuration switches
USE_GENERATIVE = True # Set False if local downloads are not allowed / to skip GPT-2
RANDOM_STATE = 42
# Environment info
import sys, platform, sklearn
print("Python:", sys.version.splitlines()[0])
print("Platform:", platform.platform())
print("scikit-learn:", sklearn.__version__)
# Dependency checks (no installs; assume packages are present)
missing = []
try:
    import numpy as np
except Exception:
    missing.append("numpy")
try:
    import pandas as pd
except Exception:
    missing.append("pandas")
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
except Exception:
    missing.append("matplotlib/seaborn")
try:
    # Core scikit-learn components
    from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, classification_report, confusion_matrix
except Exception:
    missing.append("scikit-learn core")
try:
    # LIME explainers
    from lime.lime_tabular import LimeTabularExplainer
    from lime.lime_text import LimeTextExplainer
except Exception:
    missing.append("lime")
if USE_GENERATIVE:
    try:
        # Lightweight local text model
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
    except Exception:
        missing.append("transformers/torch (only needed if USE_GENERATIVE=True)")
if missing:
    print("WARNING: Missing packages detected:", missing)
else:
    print("All required packages are importable.")
# Reproducibility
import random
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)
try:
    from transformers import set_seed as hf_set_seed
    hf_set_seed(RANDOM_STATE)
except Exception:
    pass
# Display options for pandas
import pandas as pd
pd.set_option('display.max_columns', None)
Output:

- Version info plus “All required packages are importable.” confirms a stable base for everything that follows.
Step 1 — Project objective metadata (reference)
Capture the plan as structured text so the notebook stays self‑documented.
import json
objective = {
    "goal": "Show the difference between a generative model (text) and a reasoning/decision model (tabular ML) on analytics tasks with LIME explanations.",
    "dataset": "Titanic (public CSV from GitHub mirror)",
    "use_cases": [
        "UC1: Predict survival for specific passengers (classification).",
        "UC2: Answer factual, data-driven questions (counts/rates) directly from the data."
    ],
    "explainability": "Use LIME Tabular for the reasoning model; use LIME Text heuristics for the generative model."
}
print(json.dumps(objective, indent=2))
Output:

- Clear, concise project scope.
Step 2 — Load dataset
Download Titanic CSV and preview.
import requests
# Download the dataset from a public source
URL = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
resp = requests.get(URL, timeout=30)
resp.raise_for_status()
with open("titanic.csv", "wb") as f:
    f.write(resp.content)
# Load into DataFrame
df = pd.read_csv("titanic.csv")
print("Loaded titanic.csv with shape:", df.shape)
# Quick peek at the data (head shown in the notebook)
df.head()
Output:

- Shape: (891, 12). Confirms data is in place.
Step 3 — Quick EDA
Check columns, missing data, and target balance; plot three charts for intuition.
import matplotlib.pyplot as plt
import seaborn as sns
# Basic info for orientation
print("Columns:", df.columns.tolist())
print("Missing values:\n", df.isna().sum())
print("Target distribution:\n", df["Survived"].value_counts(normalize=True))
# 1x3 subplot layout to avoid axes confusion
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(14, 4))
sns.countplot(data=df, x="Survived", ax=ax0)
ax0.set_title("Survived distribution")
sns.histplot(data=df, x="Age", kde=True, ax=ax1)
ax1.set_title("Age distribution")
sns.countplot(data=df, x="Pclass", hue="Survived", ax=ax2)
ax2.set_title("Pclass vs Survived")
plt.tight_layout()
plt.show()
Output:


- Class balance around 38% survived.
- Visual cues that Sex and Pclass matter, Age has missing values.
- Highlight: sets the stage for feature engineering and modeling.
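To back the visual cue that Sex and Pclass matter with numbers, a quick group-wise survival-rate table can be computed (a minimal sketch using the already loaded df; the exact figures depend on the downloaded CSV):
# Survival rate by Sex and Pclass (rows: Sex, columns: Pclass)
rate_by_group = df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack().round(3)
print(rate_by_group)
# Group sizes, for context
print(df.groupby(["Sex", "Pclass"])["Survived"].count().unstack())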
Step 4 — Feature engineering and preprocessing
Derive informative features, build robust preprocessing, and run a calibrated logistic regression pipeline.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
data = df.copy()
# --- Feature engineering ---
# Family group signals
data["FamilySize"] = data["SibSp"] + data["Parch"] + 1
data["IsAlone"] = (data["FamilySize"] == 1).astype(int)
# Title extraction (Mr, Mrs, Miss, Master, etc.) normalized to reduce sparsity
data["Title"] = data["Name"].str.extract(r',\s*([^\.]+)\.')
title_map = {
    "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
    "Lady": "Royalty", "Countess": "Royalty", "Sir": "Royalty", "Jonkheer": "Royalty", "Dona": "Royalty",
    "Capt": "Officer", "Col": "Officer", "Dr": "Officer", "Major": "Officer", "Rev": "Officer"
}
data["Title"] = data["Title"].replace(title_map)
rare_titles = data["Title"].value_counts()[data["Title"].value_counts() < 10].index
data["Title"] = data["Title"].replace({t: "Rare" for t in rare_titles})
# Ticket prefix (string part) as a coarse grouping proxy
data["TicketPrefix"] = data["Ticket"].astype(str).str.replace(r'[^A-Za-z]', '', regex=True).str.upper()
data["TicketPrefix"] = data["TicketPrefix"].replace('', 'NONE')
# Cabin initial (first letter); 'U' for unknown
data["CabinInitial"] = data["Cabin"].astype(str).str.slice(0,1).replace('n', 'U')
# --- Train/test split ---
target = "Survived"
numeric_features = ["Age", "SibSp", "Parch", "Fare", "FamilySize"]
categorical_features = ["Pclass", "Sex", "Embarked", "Title", "IsAlone", "TicketPrefix", "CabinInitial"]
X = data[numeric_features + categorical_features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)
# --- Preprocessing pipelines ---
# OneHotEncoder compatibility across sklearn versions
onehot_kwargs = dict(handle_unknown="ignore")
try:
    ohe = OneHotEncoder(**onehot_kwargs, sparse_output=False)  # sklearn >= 1.2
except TypeError:
    ohe = OneHotEncoder(**onehot_kwargs, sparse=False)  # older versions
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # robust to skew and outliers
    ("scaler", StandardScaler())                    # put numeric features on comparable scale
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill missing categories
    ("onehot", ohe)                                         # one-hot encode categories
])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
# --- Reasoning model with calibrated probabilities ---
log_reg = LogisticRegression(max_iter=200, class_weight="balanced", solver="lbfgs")
try:
    calibrated = CalibratedClassifierCV(estimator=log_reg, method="sigmoid", cv=3)  # sklearn >= 1.2
except TypeError:
    calibrated = CalibratedClassifierCV(base_estimator=log_reg, method="sigmoid", cv=3)
clf_pipeline = Pipeline(steps=[("preprocess", preprocessor),
                               ("clf", calibrated)])
print("Pipelines prepared.")
Output:

- “Pipelines prepared.” indicates the reasoning pipeline is ready.
Step 5 — Train, cross-validate, and evaluate the reasoning model
Quantify predictive performance and inspect the confusion matrix.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Cross-validation to check stability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
acc = cross_val_score(clf_pipeline, X_train, y_train, cv=cv, scoring="accuracy")
roc = cross_val_score(clf_pipeline, X_train, y_train, cv=cv, scoring="roc_auc")
f1s = cross_val_score(clf_pipeline, X_train, y_train, cv=cv, scoring="f1")
print(f"CV Accuracy: {acc.mean():.3f} ± {acc.std():.3f}")
print(f"CV ROC AUC : {roc.mean():.3f} ± {roc.std():.3f}")
print(f"CV F1 : {f1s.mean():.3f} ± {f1s.std():.3f}")
# Fit on training and evaluate on holdout test
clf_pipeline.fit(X_train, y_train)
y_pred = clf_pipeline.predict(X_test)
y_proba = clf_pipeline.predict_proba(X_test)[:, 1]
print("\nTest metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC :", roc_auc_score(y_test, y_proba))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))
# Confusion matrix for error types
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix (Reasoning Model)")
plt.xlabel("Predicted"); plt.ylabel("Actual")
plt.show()
Output:


- CV Accuracy ≈ 0.817, ROC AUC ≈ 0.868; Test Accuracy ≈ 0.812, ROC AUC ≈ 0.859.
- The reasoning model is accurate and measurable, addressing the “decision model” side of the objective.
- The confusion matrix shows 117 true negatives and 64 true positives, with 20 false positives and 22 false negatives (unpacked programmatically below).
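To make that confusion-matrix reading reproducible, the four cells can be taken straight from cm (a minimal sketch; sklearn's confusion_matrix puts actual classes in rows and predicted classes in columns):
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# Derived rates, useful alongside accuracy/F1
print(f"Recall (sensitivity): {tp / (tp + fn):.3f}")
print(f"Specificity         : {tn / (tn + fp):.3f}")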
Step 6 — LIME Tabular on transformed features
Prepare a LIME explainer that operates directly in the pipeline’s transformed feature space (dense arrays); a separate classifier is trained on those arrays just for LIME, which avoids inverse transforms.
import numpy as np
from sklearn.pipeline import Pipeline
from lime.lime_tabular import LimeTabularExplainer
# Preprocess and transform using the same pipeline stage
preprocess_only = Pipeline(steps=[("preprocess", preprocessor)])
preprocess_only.fit(X_train)
X_train_transformed = preprocess_only.transform(X_train)
X_test_transformed = preprocess_only.transform(X_test)
# Ensure dense float arrays for LIME
if hasattr(X_train_transformed, "toarray"):
    X_train_transformed = X_train_transformed.toarray()
if hasattr(X_test_transformed, "toarray"):
    X_test_transformed = X_test_transformed.toarray()
X_train_transformed = np.asarray(X_train_transformed, dtype=np.float32)
X_test_transformed = np.asarray(X_test_transformed, dtype=np.float32)
# Train a calibrated LR directly on transformed arrays for LIME (keeps interfaces simple)
log_reg_lime = LogisticRegression(max_iter=200, class_weight="balanced", solver="lbfgs")
try:
    calibrated_lime = CalibratedClassifierCV(estimator=log_reg_lime, method="sigmoid", cv=3)
except TypeError:
    calibrated_lime = CalibratedClassifierCV(base_estimator=log_reg_lime, method="sigmoid", cv=3)
calibrated_lime.fit(X_train_transformed, y_train)
# Build transformed feature names for LIME (numeric + one-hot)
feature_names_num = numeric_features
ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
try:
    cat_names = ohe.get_feature_names_out(categorical_features).tolist()
except AttributeError:
    cat_names = ohe.get_feature_names(categorical_features).tolist()
feature_names_transformed = feature_names_num + cat_names
class_names = ["Not Survived", "Survived"]
# LIME explainer configured for transformed numeric space
explainer_tabular = LimeTabularExplainer(
    training_data=X_train_transformed,
    mode="classification",
    feature_names=feature_names_transformed,
    class_names=class_names,
    categorical_features=None,   # transformed space is all numeric
    discretize_continuous=True,
    random_state=RANDOM_STATE
)
# Predict function that accepts transformed arrays
def predict_proba_on_transformed(X_trans):
    arr = np.asarray(X_trans, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)
    return calibrated_lime.predict_proba(arr)
print("LIME ready. Shapes:", X_train_transformed.shape, X_test_transformed.shape)
Output:

- “LIME ready. Shapes: (668, 54) (223, 54)”. Confirms feature space; we’re ready to explain.
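As a sanity check that the LIME-specific model tracks the main pipeline, their predictions can be compared on the holdout set and the predict function verified to return valid probability rows (a minimal sketch; the two models are trained separately, so agreement is expected to be high but not necessarily perfect):
# Agreement between the pipeline and the LIME surrogate on the holdout set
agree = np.mean(calibrated_lime.predict(X_test_transformed) == clf_pipeline.predict(X_test))
print(f"Prediction agreement: {agree:.3f}")
# Probability rows should sum to 1
probs = predict_proba_on_transformed(X_test_transformed[:5])
print("Row sums:", probs.sum(axis=1))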
Step 7 — Explain a few UC1 predictions with LIME Tabular
Show per‑instance feature contributions backing the reasoning model’s predictions.
# Select a few test instances to explain
indices = np.random.choice(range(len(X_test)), size=3, replace=False)
# Number of transformed features for reference
n_features = X_train_transformed.shape[1]
print("n_features (transformed columns):", n_features)
for pos, idx in enumerate(indices, 1):
    x0_df = X_test.iloc[idx:idx+1]
    y_true = y_test.iloc[idx]
    # Reasoning model predictions (for context)
    r_pred = clf_pipeline.predict(x0_df)
    r_proba = clf_pipeline.predict_proba(x0_df)[0, 1]
    # Transform row and flatten to 1D vector for LIME
    x0_trans = preprocess_only.transform(x0_df)
    if hasattr(x0_trans, "toarray"):
        x0_trans = x0_trans.toarray()
    x0_vec = np.asarray(x0_trans, dtype=np.float32).ravel()
    print(f"\n[{pos}/3] idx={idx} x0_df={x0_df.shape}, x0_vec={x0_vec.shape}, n_features={n_features}")
    print(f"true={y_true}, pred={r_pred}, proba={r_proba:.3f}")
    # LIME explanation visual
    exp = explainer_tabular.explain_instance(
        data_row=x0_vec,
        predict_fn=predict_proba_on_transformed,
        num_features=10
    )
    display(x0_df)
    exp.show_in_notebook(show_table=True, show_all=False)
Output:



- LIME contributions often include Sex_female (positive), Pclass_3 (negative), Title groups, etc.
- Observation: this is where the “reasoning models decide” claim becomes visible—clear, feature-based rationale for decisions.
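If the notebook widget is not available (for example when exporting to plain text), the same contributions can be pulled out programmatically with the explanation’s as_list() method (a minimal sketch reusing the last exp from the loop above):
# Text form of the last explanation: (condition, weight) pairs;
# positive weights push toward "Survived", negative toward "Not Survived"
for condition, weight in exp.as_list():
    print(f"{condition:<40s} {weight:+.3f}")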
Step 8 — Generative model and dataset card
Load GPT‑2 and create a compact “data card” to prompt it.
# local generative model for text answers
if USE_GENERATIVE:
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
    MODEL_NAME = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    gen_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    gen_model.to(device)
    def generate_text(prompt, max_new_tokens=120, temperature=0.9, top_p=0.95):
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = gen_model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                pad_token_id=tokenizer.eos_token_id
            )
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return text[len(prompt):].strip()
else:
    print("Generative model disabled (USE_GENERATIVE=False).")
# Summarize a few dataset facts for prompting (intentionally incomplete)
def build_dataset_card(df_in):
    lines = []
    lines.append("You are given aggregated facts from a Titanic passenger dataset.")
    lines.append(f"Total rows: {len(df_in)}")
    lines.append(f"Columns: {', '.join(df_in.columns)}")
    lines.append(f"Overall survival rate: {df_in['Survived'].mean():.3f}")
    lines.append(f"Mean age: {df_in['Age'].mean():.2f}")
    lines.append(f"Median fare: {df_in['Fare'].median():.2f}")
    return "\n".join(lines)
dataset_card = build_dataset_card(df)
print(dataset_card)
Output:

- The card prints aggregate facts; it does not contain the answers to group queries—this sets up the contrast.
Step 9 — UC1: Predict survival with reasoning vs generative
Compare probability‑backed predictions to free‑form text generation on the same passenger profiles.
# Format a passenger row for prompting
def format_passenger_for_prompt(row):
    return (f"Pclass={row['Pclass']}, Sex={row['Sex']}, Age={row['Age']}, "
            f"SibSp={row['SibSp']}, Parch={row['Parch']}, Fare={row['Fare']:.2f}, "
            f"Embarked={row['Embarked']}, Title={row['Title']}, FamilySize={row['FamilySize']}, "
            f"IsAlone={row['IsAlone']}, TicketPrefix={row['TicketPrefix']}, CabinInitial={row['CabinInitial']}")
# Reasoning model: numeric prediction + calibrated probability
def reasoning_predict(row_df):
    proba = clf_pipeline.predict_proba(row_df)[0, 1]
    return int(proba >= 0.5), float(proba)
# Generative model: write an answer based on dataset card and profile text
def generative_predict_text(row):
    prompt = dataset_card + "\n\n" + \
        "Task: Based on the passenger profile, predict whether the passenger survived (Yes/No) and explain briefly.\n" + \
        "Passenger: " + format_passenger_for_prompt(row) + "\n" + \
        "Answer with 'Prediction: Yes' or 'Prediction: No' and a short reason.\n"
    return generate_text(prompt)
# Sample a few test instances and compare
test_indices = np.random.choice(X_test.index, size=3, replace=False)
results_uc1 = []
for idx in test_indices:
    row_df = X_test.loc[[idx]]
    row_series = X_test.loc[idx]
    true_label = int(y_test.loc[idx])
    r_pred, r_proba = reasoning_predict(row_df)
    g_text = generative_predict_text(row_series) if USE_GENERATIVE else "(Generative disabled)"
    results_uc1.append((idx, true_label, r_pred, r_proba, g_text))
# Display paired outputs
for r in results_uc1:
    idx, true_label, r_pred, r_proba, g_text = r
    print(f"\nUC1 - Instance {idx}")
    print(f"True label: {true_label}")
    print(f"Reasoning -> pred: {r_pred}, proba_survived: {r_proba:.3f}")
    print("Generative ->", g_text)
Output:

- Reasoning outputs usually match the true labels and always come with calibrated probabilities.
- Generative outputs can be off-topic or incorrect.
- Observation: this is the first clear place where generative models are “guessing” (free text, not data‑grounded) and reasoning models are “deciding” (probability‑backed, data‑grounded).
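To turn that free-text comparison into a rough score, the generative answers can be parsed for a "Prediction: Yes/No" pattern and compared with the true labels alongside the reasoning model (a minimal sketch; the regex and the choice to count unparseable answers as incorrect are assumptions, not part of the notebook above):
import re
def parse_yes_no(text):
    # Look for "Prediction: Yes/No"; fall back to a bare Yes/No; None if neither is found
    m = re.search(r'prediction\s*:\s*(yes|no)', text or "", flags=re.IGNORECASE)
    if not m:
        m = re.search(r'\b(yes|no)\b', text or "", flags=re.IGNORECASE)
    return None if m is None else int(m.group(1).lower() == "yes")

gen_correct, reason_correct = 0, 0
for idx, true_label, r_pred, r_proba, g_text in results_uc1:
    g_pred = parse_yes_no(g_text)
    gen_correct += int(g_pred == true_label)      # unparseable counts as incorrect
    reason_correct += int(r_pred == true_label)
print(f"Reasoning correct : {reason_correct}/{len(results_uc1)}")
print(f"Generative correct: {gen_correct}/{len(results_uc1)}")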
Step 10 — LIME explanations for the UC1 instances
Visualize why the reasoning model predicted Yes/No per passenger.
# Explain the UC1 instances with LIME Tabular
for idx, true_label, r_pred, r_proba, _ in results_uc1:
    x0_df = X_test.loc[[idx]]
    x0_trans = preprocess_only.transform(x0_df)
    if hasattr(x0_trans, "toarray"):
        x0_trans = x0_trans.toarray()
    x0_vec = np.asarray(x0_trans, dtype=np.float32).ravel()
    exp = explainer_tabular.explain_instance(
        data_row=x0_vec,
        predict_fn=predict_proba_on_transformed,
        num_features=10
    )
    print(f"\nLIME explanation for UC1 instance {idx} (Reasoning model)")
    exp.show_in_notebook(show_table=True, show_all=False)
Output:



- Feature contributions (e.g., Sex_female positive; Pclass_3 negative) align with intuition and the model’s behavior.
- Observation: this is where the “reasoning models decide” claim becomes transparent—decisions are traceable to features.
Step 11 — UC2: Compute ground truths and reasoning answers
Compute exact answers to factual questions directly from the data.
# Define factual questions and compute ground truths
def gt_over_50_survived(df_in):
    mask = df_in["Age"] > 50
    return int(df_in.loc[mask, "Survived"].sum())
def gt_survival_rate_female_1st(df_in):
    mask = (df_in["Sex"] == "female") & (df_in["Pclass"] == 1)
    if mask.sum() == 0:
        return float("nan")
    return float(df_in.loc[mask, "Survived"].mean())
q1 = "Among passengers over age 50, how many survived?"
q2 = "What is the survival rate among females in 1st class?"
gt1 = gt_over_50_survived(df)
gt2 = gt_survival_rate_female_1st(df)
print("Ground truths:")
print(f"Q1 -> {gt1}")
print(f"Q2 -> {gt2:.3f}")
# Reasoning answers are exact computations from the data
ra1 = gt1
ra2 = gt2
print("Reasoning answers (computed from data):")
print(f"Q1 -> {ra1}")
print(f"Q2 -> {ra2:.3f}")
Output:


- Exact numbers from data (e.g., Q1=22; Q2=0.968).
- Observation: this is a concrete demonstration of “deciding” via direct computation.
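As a quick cross-check on the helper functions, the same two answers can be recomputed with DataFrame.query (a minimal sketch; rows with missing Age drop out of the Age > 50 filter automatically):
q1_check = int(df.query("Age > 50")["Survived"].sum())
q2_check = float(df.query("Sex == 'female' and Pclass == 1")["Survived"].mean())
print(f"Q1 cross-check -> {q1_check}")
print(f"Q2 cross-check -> {q2_check:.3f}")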
Step 12 — Generative answers for UC2 and evaluation
Ask the same questions to GPT‑2 using the dataset card; parse numbers and score against ground truth.
import re, math
# Ask the generative model the same factual questions
def generative_answer(question):
    if not USE_GENERATIVE:
        return "(Generative disabled)"
    prompt = dataset_card + "\n\n" + \
        "Task: Answer the question based only on the given facts and general knowledge. If uncertain, estimate.\n" + \
        f"Question: {question}\n" + \
        "Answer succinctly:"
    return generate_text(prompt)
# Parsers to extract numbers from free text
def extract_first_int(text):
    m = re.search(r'(-?\d+)', text or "")
    return int(m.group(1)) if m else None
def extract_first_float(text):
    m = re.search(r'(\d+(\.\d+)?)', text or "")
    if m:
        val = float(m.group(1))
        # Treat percentages as rates if needed
        if val > 1 and val <= 100:
            val = val / 100.0
        return val
    return None
# Simple correctness checks
def correctness_q1(pred, gt):
    if pred is None:
        return 0
    return int(pred == gt)
def correctness_q2(pred, gt, tol=0.02):
    if pred is None or math.isnan(gt):
        return 0
    return int(abs(pred - gt) <= tol)
# Get answers and score
ga1 = generative_answer(q1)
ga2 = generative_answer(q2)
ga1_num = extract_first_int(ga1) if USE_GENERATIVE else None
ga2_num = extract_first_float(ga2) if USE_GENERATIVE else None
c1 = correctness_q1(ga1_num, gt1) if USE_GENERATIVE else 0
c2 = correctness_q2(ga2_num, gt2) if USE_GENERATIVE else 0
print("Generative answers:")
print(f"Q1 -> {ga1}")
print(f"Q2 -> {ga2}")
print("\nEvaluation of generative answers:")
print(f"Q1 parsed: {ga1_num}, correct={bool(c1)} (gt={gt1})")
print(f"Q2 parsed: {ga2_num}, correct={bool(c2)} (gt={gt2:.3f})")
Output:

- Example: Q1 guessed 18 (incorrect vs 22); Q2 parsed as 1.0 (incorrect vs 0.968).
- Observation: this is the clearest demonstration that “generative models are guessing”—fluency without grounded calculation.
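One way to make the guessing visible is to ask the same question several times under different random seeds and watch the parsed numbers drift while the ground truth stays fixed (a minimal sketch reusing generative_answer and transformers' set_seed; it only runs when USE_GENERATIVE is True):
if USE_GENERATIVE:
    from transformers import set_seed
    answers_q1 = []
    for seed in (0, 1, 2):
        set_seed(seed)                      # different sampling seed each run
        ans = generative_answer(q1)
        answers_q1.append(extract_first_int(ans))
    print(f"Parsed Q1 answers across seeds: {answers_q1} (ground truth: {gt1})")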
Step 13 — LIME Text (fast heuristic) for prompt sensitivity
An illustrative, fast version of LIME Text to see which words in the question might sway outputs; uses caching and low num_samples.
# Fast LIME Text to avoid very long runtimes; heuristic evaluation
if USE_GENERATIVE:
    from lime.lime_text import LimeTextExplainer
    import time
    labels = ["Incorrect", "Correct"]
    explainer_text = LimeTextExplainer(class_names=labels, random_state=RANDOM_STATE)
    _gen_cache = {}
    # Cached generation to speed up repeated calls
    def safe_generate(text, max_time_sec=5.0):
        if text in _gen_cache:
            return _gen_cache[text]
        try:
            out = generate_text(text, max_new_tokens=80, temperature=0.9, top_p=0.95)
        except Exception:
            out = ""
        _gen_cache[text] = out
        return out
    # Wrap the model: return a "probability of correctness" based on parsed value vs ground truth
    def gen_predict_proba_for_lime(base_question, gt_int=None, gt_float=None):
        def predict(texts):
            probs = []
            for t in texts:
                ans = safe_generate(t)
                if gt_int is not None:
                    pred = extract_first_int(ans)
                    correct = correctness_q1(pred, gt_int)
                else:
                    pred = extract_first_float(ans)
                    correct = correctness_q2(pred, gt_float)
                p_correct = 0.7 if correct == 1 else 0.3
                probs.append([1 - p_correct, p_correct])
            return np.array(probs)
        return predict
    # Smaller num_samples for speed
    predict_fn_q1 = gen_predict_proba_for_lime(q1, gt_int=gt1)
    exp_q1 = explainer_text.explain_instance(
        text_instance=q1,
        classifier_fn=predict_fn_q1,
        num_features=8,
        num_samples=100
    )
    print("\nLIME Text explanation for Generative Q1 (fast):")
    exp_q1.show_in_notebook(text=True)
    predict_fn_q2 = gen_predict_proba_for_lime(q2, gt_float=gt2)
    exp_q2 = explainer_text.explain_instance(
        text_instance=q2,
        classifier_fn=predict_fn_q2,
        num_features=8,
        num_samples=100
    )
    print("\nLIME Text explanation for Generative Q2 (fast):")
    exp_q2.show_in_notebook(text=True)
else:
    print("Skipping LIME Text (generative disabled).")
Output:


- Highlights question tokens; illustrates why phrasing can sway the model.
- Observation: LIME Text is heuristic here; useful for intuition, not for numeric guarantees.
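The token weights behind those visuals can also be dumped as plain text, which helps when the notebook widget is not rendered (a minimal sketch reusing exp_q1 and exp_q2; positive weights lean toward the heuristic “Correct” label):
if USE_GENERATIVE:
    print("Q1 token weights:", exp_q1.as_list())
    print("Q2 token weights:", exp_q2.as_list())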
Step 14 — Side‑by‑side comparison and wrap‑up metrics
Put the story together—questions, ground truths, reasoning answers, generative answers, parsed numerics, correctness; then print overall metrics.
def compare_show(question, gt_display, reasoning_value, gen_text, gen_num, correct_flag):
    print("\n" + "-"*80)
    print("Question:", question)
    print(f"Ground truth: {gt_display}")
    print("Reasoning answer:", reasoning_value)
    print("Generative answer:", gen_text)
    print("Parsed numeric:", gen_num)
    print("Generative correctness:", "Correct" if correct_flag else "Incorrect")
compare_show(q1, gt1, ra1, ga1, ga1_num, c1 if USE_GENERATIVE else 0)
compare_show(q2, f"{gt2:.3f}", f"{ra2:.3f}", ga2, ga2_num, c2 if USE_GENERATIVE else 0)
from sklearn.metrics import accuracy_score, roc_auc_score
summary = {
    "UC1_reasoning_accuracy": float(accuracy_score(y_test, clf_pipeline.predict(X_test))),
    "UC1_reasoning_auc": float(roc_auc_score(y_test, clf_pipeline.predict_proba(X_test)[:, 1])),
    "UC1_generative_behavior": "Textual justification; may be plausible but wrong (hallucination risk).",
    "UC2_reasoning": "Exact computations from data (counts/rates).",
    "UC2_generative": "Estimates may deviate from ground truth; sensitive to prompt phrasing.",
    "Explainability_LIME": {
        "Reasoning": "LIME Tabular highlights feature contributions in transformed space.",
        "Generative": "LIME Text provides heuristic insights into phrasing influence (approximate)."
    }
}
import json
print("\nSummary:")
print(json.dumps(summary, indent=2))
Output:


- UC1 metrics reaffirm strong performance of the reasoning model.
- UC2 shows the reasoning answers exactly match ground truth; the generative answers deviate.
- Observations:
- Generative models are guessing: UC2 “Evaluation of generative answers” where the parsed numbers differ from ground truth despite fluent text.
- Reasoning models are deciding: UC1 metrics plus LIME Tabular explanations (Steps 5–7) where feature‑based logic is explicit and tied to data.
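For reports, the UC2 comparison can also be collected into a small DataFrame instead of printed blocks (a minimal sketch built from the variables already defined in Steps 11 and 12):
uc2_table = pd.DataFrame({
    "question": [q1, q2],
    "ground_truth": [gt1, round(gt2, 3)],
    "reasoning_answer": [ra1, round(ra2, 3)],
    "generative_parsed": [ga1_num, ga2_num],
    "generative_correct": [bool(c1), bool(c2)],
})
print(uc2_table.to_string(index=False))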
Conclusion
- On structured analytics tasks, the reasoning/decision pipeline (preprocessing + calibrated classifier) delivers accurate, explainable predictions. LIME Tabular reveals which features drive each decision, making the model’s behavior auditable.
- Generative models excel at producing fluent narratives but will guess when asked to compute facts they were not given directly, leading to confident but wrong answers. Even with a dataset card, they can improvise numbers or misinterpret requests.
- For analytics that require correctness and interpretability, use decision models and use LIME for transparency. Use generative models for narrative framing, explanations, or brainstorming—not as a replacement for computation on data.