Objective
- Show the real-world difference between a generative model (text) and a reasoning model (ML on tabular data).
- Demonstrate that reasoning models give accurate, explainable results, while generative models can sound right but be wrong.
- Use LIME — lime_tabular for per-case reasoning, lime_text to peek into generative behavior.
Learning Outcomes
- Understand why decision models are a better fit for structured analytics tasks.
- Learn how to engineer features, build preprocessing pipelines, and train a calibrated classifier for reliable probabilities.
- Apply LIME for local interpretability on tabular predictions (and a heuristic for text).
- See controlled examples where generative models produce convincing but wrong answers versus accurate reasoning-based answers tied to data.
Dataset Overview
- Dataset: Titanic passenger dataset from a public GitHub mirror.
- Target: Survived (0/1).
- Columns: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, plus engineered features FamilySize, IsAlone, Title, TicketPrefix, CabinInitial.
- Balance: approximately 38.4% survived, 61.6% did not survive.
Step 0 — Configuration and environment checks
Ensure reproducibility, check versions, and verify dependencies for smooth execution.
# Configuration switches
USE_GENERATIVE = True # Set False if local downloads are not allowed / to skip GPT-2
RANDOM_STATE = 42
# Environment info
import sys, platform, sklearn
print("Python:", sys.version.splitlines()[0])
print("Platform:", platform.platform())
print("scikit-learn:", sklearn.__version__)
# Dependency checks (no installs; assume packages are present)
missing = []
try:
    import numpy as np
except Exception:
    missing.append("numpy")
try:
    import pandas as pd
except Exception:
    missing.append("pandas")
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
except Exception:
    missing.append("matplotlib/seaborn")
try:
    # Core scikit-learn components
    from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, classification_report, confusion_matrix
except Exception:
    missing.append("scikit-learn core")
try:
    # LIME explainers
    from lime.lime_tabular import LimeTabularExplainer
    from lime.lime_text import LimeTextExplainer
except Exception:
    missing.append("lime")
if USE_GENERATIVE:
    try:
        # Lightweight local text model
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
    except Exception:
        missing.append("transformers/torch (only needed if USE_GENERATIVE=True)")
if missing:
    print("WARNING: Missing packages detected:", missing)
else:
    print("All required packages are importable.")
# Reproducibility
import random
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)
try:
    from transformers import set_seed as hf_set_seed
    hf_set_seed(RANDOM_STATE)
except Exception:
    pass
# Display options for pandas
import pandas as pd
pd.set_option('display.max_columns', None)
Output:

- Version info plus “All required packages are importable.” confirms a stable base for everything that follows.
Step 1 — Project objective metadata (reference)
Capture the plan as structured text so the notebook stays self‑documented.
import json
objective = {
    "goal": "Show the difference between a generative model (text) and a reasoning/decision model (tabular ML) on analytics tasks with LIME explanations.",
    "dataset": "Titanic (public CSV from GitHub mirror)",
    "use_cases": [
        "UC1: Predict survival for specific passengers (classification).",
        "UC2: Answer factual, data-driven questions (counts/rates) directly from the data."
    ],
    "explainability": "Use LIME Tabular for the reasoning model; use LIME Text heuristics for the generative model."
}
print(json.dumps(objective, indent=2))
Output:

- Clear, concise project scope.
Step 2 — Load dataset
Download Titanic CSV and preview.
import requests
# Download the dataset from a public source
URL = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
resp = requests.get(URL, timeout=30)
resp.raise_for_status()
with open("titanic.csv", "wb") as f:
    f.write(resp.content)
# Load into DataFrame
df = pd.read_csv("titanic.csv")
print("Loaded titanic.csv with shape:", df.shape)
# Quick peek at the data (head shown in the notebook)
df.head()
Output:

- Shape: (891, 12). Confirms data is in place.
Step 3 — Quick EDA
Check columns, missing data, and target balance; plot three charts for intuition.
import matplotlib.pyplot as plt
import seaborn as sns
# Basic info for orientation
print("Columns:", df.columns.tolist())
print("Missing values:\n", df.isna().sum())
print("Target distribution:\n", df["Survived"].value_counts(normalize=True))
# 1x3 subplot layout to avoid axes confusion
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(14, 4))
sns.countplot(data=df, x="Survived", ax=ax0)
ax0.set_title("Survived distribution")
sns.histplot(data=df, x="Age", kde=True, ax=ax1)
ax1.set_title("Age distribution")
sns.countplot(data=df, x="Pclass", hue="Survived", ax=ax2)
ax2.set_title("Pclass vs Survived")
plt.tight_layout()
plt.show()
Output:


- Class balance around 38% survived.
- Visual cues that Sex and Pclass matter, Age has missing values.
- Highlight: sets the stage for feature engineering and modeling.
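To back the visual cue that Sex and Pclass matter with numbers, a quick group-wise survival-rate table can be computed (a minimal sketch using the already loaded df; the exact figures depend on the downloaded CSV):
# Survival rate by Sex and Pclass (rows: Sex, columns: Pclass)
rate_by_group = df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack().round(3)
print(rate_by_group)
# Group sizes, for context
print(df.groupby(["Sex", "Pclass"])["Survived"].count().unstack())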
Step 4 — Feature engineering and preprocessing
Derive informative features, build robust preprocessing, and run a calibrated logistic regression pipeline.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
data = df.copy()
# --- Feature engineering ---
# Family group signals
data["FamilySize"] = data["SibSp"] + data["Parch"] + 1
data["IsAlone"] = (data["FamilySize"] == 1).astype(int)
# Title extraction (Mr, Mrs, Miss, Master, etc.) normalized to reduce sparsity
data["Title"] = data["Name"].str.extract(r',\s*([^\.]+)\.')
title_map = {
    "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
    "Lady": "Royalty", "Countess": "Royalty", "Sir": "Royalty", "Jonkheer": "Royalty", "Dona": "Royalty",
    "Capt": "Officer", "Col": "Officer", "Dr": "Officer", "Major": "Officer", "Rev": "Officer"
}
data["Title"] = data["Title"].replace(title_map)
rare_titles = data["Title"].value_counts()[data["Title"].value_counts() < 10].index
data["Title"] = data["Title"].replace({t: "Rare" for t in rare_titles})
# Ticket prefix (string part) as a coarse grouping proxy
data["TicketPrefix"] = data["Ticket"].astype(str).str.replace(r'[^A-Za-z]', '', regex=True).str.upper()
data["TicketPrefix"] = data["TicketPrefix"].replace('', 'NONE')
# Cabin initial (first letter); 'U' for unknown
data["CabinInitial"] = data["Cabin"].astype(str).str.slice(0,1).replace('n', 'U')
# --- Train/test split ---
target = "Survived"
numeric_features = ["Age", "SibSp", "Parch", "Fare", "FamilySize"]
categorical_features = ["Pclass", "Sex", "Embarked", "Title", "IsAlone", "TicketPrefix", "CabinInitial"]
X = data[numeric_features + categorical_features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)
# --- Preprocessing pipelines ---
# OneHotEncoder compatibility across sklearn versions
onehot_kwargs = dict(handle_unknown="ignore")
try:
    ohe = OneHotEncoder(**onehot_kwargs, sparse_output=False)  # sklearn >= 1.2
except TypeError:
    ohe = OneHotEncoder(**onehot_kwargs, sparse=False)  # older versions
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # robust to skew and outliers
    ("scaler", StandardScaler())                    # put numeric features on comparable scale
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill missing categories
    ("onehot", ohe)                                         # one-hot encode categories
])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
# --- Reasoning model with calibrated probabilities ---
log_reg = LogisticRegression(max_iter=200, class_weight="balanced", solver="lbfgs")
try:
    calibrated = CalibratedClassifierCV(estimator=log_reg, method="sigmoid", cv=3)  # sklearn >= 1.2
except TypeError:
    calibrated = CalibratedClassifierCV(base_estimator=log_reg, method="sigmoid", cv=3)
clf_pipeline = Pipeline(steps=[("preprocess", preprocessor),
                               ("clf", calibrated)])
print("Pipelines prepared.")
Output:

- “Pipelines prepared.” indicates the reasoning pipeline is ready.
Step 5 — Train, cross-validate, and evaluate the reasoning model
Quantify predictive performance and inspect the confusion matrix.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Cross-validation to check stability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
acc = cross_val_score(clf_pipeline, X_train, y_train, cv=cv, scoring="accuracy")
roc = cross_val_score(clf_pipeline, X_train, y_train, cv=cv, scoring="roc_auc")
f1s = cross_val_score(clf_pipeline, X_train, y_train, cv=cv, scoring="f1")
print(f"CV Accuracy: {acc.mean():.3f} ± {acc.std():.3f}")
print(f"CV ROC AUC : {roc.mean():.3f} ± {roc.std():.3f}")
print(f"CV F1 : {f1s.mean():.3f} ± {f1s.std():.3f}")
# Fit on training and evaluate on holdout test
clf_pipeline.fit(X_train, y_train)
y_pred = clf_pipeline.predict(X_test)
y_proba = clf_pipeline.predict_proba(X_test)[:, 1]
print("\nTest metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC :", roc_auc_score(y_test, y_proba))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))
# Confusion matrix for error types
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix (Reasoning Model)")
plt.xlabel("Predicted"); plt.ylabel("Actual")
plt.show()
Output:


- CV Accuracy ≈ 0.817, ROC AUC ≈ 0.868; Test Accuracy ≈ 0.812, ROC AUC ≈ 0.859.
- The reasoning model is accurate and measurable, addressing the “decision model” side of the objective.
- The confusion matrix shows 117 true negatives and 64 true positives, with 20 false positives and 22 false negatives (unpacked programmatically below).
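To make that confusion-matrix reading reproducible, the four cells can be taken straight from cm (a minimal sketch; sklearn's confusion_matrix puts actual classes in rows and predicted classes in columns):
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# Derived rates, useful alongside accuracy/F1
print(f"Recall (sensitivity): {tp / (tp + fn):.3f}")
print(f"Specificity         : {tn / (tn + fp):.3f}")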
Step 6 — LIME Tabular on transformed features
Prepare a LIME explainer that operates directly in the pipeline’s transformed feature space (dense arrays); a separate classifier is trained on those arrays just for LIME, which avoids inverse transforms.
import numpy as np
from sklearn.pipeline import Pipeline
from lime.lime_tabular import LimeTabularExplainer
# Preprocess and transform using the same pipeline stage
preprocess_only = Pipeline(steps=[("preprocess", preprocessor)])
preprocess_only.fit(X_train)
X_train_transformed = preprocess_only.transform(X_train)
X_test_transformed = preprocess_only.transform(X_test)
# Ensure dense float arrays for LIME
if hasattr(X_train_transformed, "toarray"):
    X_train_transformed = X_train_transformed.toarray()
if hasattr(X_test_transformed, "toarray"):
    X_test_transformed = X_test_transformed.toarray()
X_train_transformed = np.asarray(X_train_transformed, dtype=np.float32)
X_test_transformed = np.asarray(X_test_transformed, dtype=np.float32)
# Train a calibrated LR directly on transformed arrays for LIME (keeps interfaces simple)
log_reg_lime = LogisticRegression(max_iter=200, class_weight="balanced", solver="lbfgs")
try:
    calibrated_lime = CalibratedClassifierCV(estimator=log_reg_lime, method="sigmoid", cv=3)
except TypeError:
    calibrated_lime = CalibratedClassifierCV(base_estimator=log_reg_lime, method="sigmoid", cv=3)
calibrated_lime.fit(X_train_transformed, y_train)
# Build transformed feature names for LIME (numeric + one-hot)
feature_names_num = numeric_features
ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
try:
    cat_names = ohe.get_feature_names_out(categorical_features).tolist()
except AttributeError:
    cat_names = ohe.get_feature_names(categorical_features).tolist()
feature_names_transformed = feature_names_num + cat_names
class_names = ["Not Survived", "Survived"]
# LIME explainer configured for transformed numeric space
explainer_tabular = LimeTabularExplainer(
    training_data=X_train_transformed,
    mode="classification",
    feature_names=feature_names_transformed,
    class_names=class_names,
    categorical_features=None,   # transformed space is all numeric
    discretize_continuous=True,
    random_state=RANDOM_STATE
)
# Predict function that accepts transformed arrays
def predict_proba_on_transformed(X_trans):
    arr = np.asarray(X_trans, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)
    return calibrated_lime.predict_proba(arr)
print("LIME ready. Shapes:", X_train_transformed.shape, X_test_transformed.shape)
Output:

- “LIME ready. Shapes: (668, 54) (223, 54)”. Confirms feature space; we’re ready to explain.
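As a sanity check that the LIME-specific model tracks the main pipeline, their predictions can be compared on the holdout set and the predict function verified to return valid probability rows (a minimal sketch; the two models are trained separately, so agreement is expected to be high but not necessarily perfect):
# Agreement between the pipeline and the LIME surrogate on the holdout set
agree = np.mean(calibrated_lime.predict(X_test_transformed) == clf_pipeline.predict(X_test))
print(f"Prediction agreement: {agree:.3f}")
# Probability rows should sum to 1
probs = predict_proba_on_transformed(X_test_transformed[:5])
print("Row sums:", probs.sum(axis=1))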
Step 7 — Explain a few UC1 predictions with LIME Tabular
Show per‑instance feature contributions backing the reasoning model’s predictions.
# Select a few test instances to explain
indices = np.random.choice(range(len(X_test)), size=3, replace=False)
# Number of transformed features for reference
n_features = X_train_transformed.shape[1]
print("n_features (transformed columns):", n_features)
for pos, idx in enumerate(indices, 1):
    x0_df = X_test.iloc[idx:idx+1]
    y_true = y_test.iloc[idx]
    # Reasoning model predictions (for context)
    r_pred = clf_pipeline.predict(x0_df)
    r_proba = clf_pipeline.predict_proba(x0_df)[0, 1]
    # Transform row and flatten to 1D vector for LIME
    x0_trans = preprocess_only.transform(x0_df)
    if hasattr(x0_trans, "toarray"):
        x0_trans = x0_trans.toarray()
    x0_vec = np.asarray(x0_trans, dtype=np.float32).ravel()
    print(f"\n[{pos}/3] idx={idx} x0_df={x0_df.shape}, x0_vec={x0_vec.shape}, n_features={n_features}")
    print(f"true={y_true}, pred={r_pred}, proba={r_proba:.3f}")
    # LIME explanation visual
    exp = explainer_tabular.explain_instance(
        data_row=x0_vec,
        predict_fn=predict_proba_on_transformed,
        num_features=10
    )
    display(x0_df)
    exp.show_in_notebook(show_table=True, show_all=False)
Output:



- LIME contributions often include Sex_female (positive), Pclass_3 (negative), Title groups, etc.
- Observation: this is where the “reasoning models decide” claim becomes visible—clear, feature-based rationale for decisions.
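If the notebook widget is not available (for example when exporting to plain text), the same contributions can be pulled out programmatically with the explanation’s as_list() method (a minimal sketch reusing the last exp from the loop above):
# Text form of the last explanation: (condition, weight) pairs;
# positive weights push toward "Survived", negative toward "Not Survived"
for condition, weight in exp.as_list():
    print(f"{condition:<40s} {weight:+.3f}")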
Step 8 — Generative model and dataset card
Load GPT‑2 and create a compact “data card” to prompt it.
# local generative model for text answers
if USE_GENERATIVE:
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
    MODEL_NAME = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    gen_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    gen_model.to(device)
    def generate_text(prompt, max_new_tokens=120, temperature=0.9, top_p=0.95):
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = gen_model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                pad_token_id=tokenizer.eos_token_id
            )
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return text[len(prompt):].strip()
else:
    print("Generative model disabled (USE_GENERATIVE=False).")
# Summarize a few dataset facts for prompting (intentionally incomplete)
def build_dataset_card(df_in):
    lines = []
    lines.append("You are given aggregated facts from a Titanic passenger dataset.")
    lines.append(f"Total rows: {len(df_in)}")
    lines.append(f"Columns: {', '.join(df_in.columns)}")
    lines.append(f"Overall survival rate: {df_in['Survived'].mean():.3f}")
    lines.append(f"Mean age: {df_in['Age'].mean():.2f}")
    lines.append(f"Median fare: {df_in['Fare'].median():.2f}")
    return "\n".join(lines)
dataset_card = build_dataset_card(df)
print(dataset_card)
Output:

- The card prints aggregate facts; it does not contain the answers to group queries—this sets up the contrast.
Step 9 — UC1: Predict survival with reasoning vs generative
Compare probability‑backed predictions to free‑form text generation on the same passenger profiles.
# Format a passenger row for prompting
def format_passenger_for_prompt(row):
    return (f"Pclass={row['Pclass']}, Sex={row['Sex']}, Age={row['Age']}, "
            f"SibSp={row['SibSp']}, Parch={row['Parch']}, Fare={row['Fare']:.2f}, "
            f"Embarked={row['Embarked']}, Title={row['Title']}, FamilySize={row['FamilySize']}, "
            f"IsAlone={row['IsAlone']}, TicketPrefix={row['TicketPrefix']}, CabinInitial={row['CabinInitial']}")
# Reasoning model: numeric prediction + calibrated probability
def reasoning_predict(row_df):
    proba = clf_pipeline.predict_proba(row_df)[0, 1]
    return int(proba >= 0.5), float(proba)
# Generative model: write an answer based on dataset card and profile text
def generative_predict_text(row):
    prompt = dataset_card + "\n\n" + \
        "Task: Based on the passenger profile, predict whether the passenger survived (Yes/No) and explain briefly.\n" + \
        "Passenger: " + format_passenger_for_prompt(row) + "\n" + \
        "Answer with 'Prediction: Yes' or 'Prediction: No' and a short reason.\n"
    return generate_text(prompt)
# Sample a few test instances and compare
test_indices = np.random.choice(X_test.index, size=3, replace=False)
results_uc1 = []
for idx in test_indices:
    row_df = X_test.loc[[idx]]
    row_series = X_test.loc[idx]
    true_label = int(y_test.loc[idx])
    r_pred, r_proba = reasoning_predict(row_df)
    g_text = generative_predict_text(row_series) if USE_GENERATIVE else "(Generative disabled)"
    results_uc1.append((idx, true_label, r_pred, r_proba, g_text))
# Display paired outputs
for r in results_uc1:
    idx, true_label, r_pred, r_proba, g_text = r
    print(f"\nUC1 - Instance {idx}")
    print(f"True label: {true_label}")
    print(f"Reasoning -> pred: {r_pred}, proba_survived: {r_proba:.3f}")
    print("Generative ->", g_text)
Output:

- Reasoning outputs usually match the true labels and always come with calibrated probabilities.
- Generative outputs can be off-topic or incorrect.
- Observation: this is the first clear place where generative models are “guessing” (free text, not data‑grounded) and reasoning models are “deciding” (probability‑backed, data‑grounded).
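To turn that free-text comparison into a rough score, the generative answers can be parsed for a "Prediction: Yes/No" pattern and compared with the true labels alongside the reasoning model (a minimal sketch; the regex and the choice to count unparseable answers as incorrect are assumptions, not part of the notebook above):
import re
def parse_yes_no(text):
    # Look for "Prediction: Yes/No"; fall back to a bare Yes/No; None if neither is found
    m = re.search(r'prediction\s*:\s*(yes|no)', text or "", flags=re.IGNORECASE)
    if not m:
        m = re.search(r'\b(yes|no)\b', text or "", flags=re.IGNORECASE)
    return None if m is None else int(m.group(1).lower() == "yes")

gen_correct, reason_correct = 0, 0
for idx, true_label, r_pred, r_proba, g_text in results_uc1:
    g_pred = parse_yes_no(g_text)
    gen_correct += int(g_pred == true_label)      # unparseable counts as incorrect
    reason_correct += int(r_pred == true_label)
print(f"Reasoning correct : {reason_correct}/{len(results_uc1)}")
print(f"Generative correct: {gen_correct}/{len(results_uc1)}")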
Step 10 — LIME explanations for the UC1 instances
Visualize why the reasoning model predicted Yes/No per passenger.
# Explain the UC1 instances with LIME Tabular
for idx, true_label, r_pred, r_proba, _ in results_uc1:
    x0_df = X_test.loc[[idx]]
    x0_trans = preprocess_only.transform(x0_df)
    if hasattr(x0_trans, "toarray"):
        x0_trans = x0_trans.toarray()
    x0_vec = np.asarray(x0_trans, dtype=np.float32).ravel()
    exp = explainer_tabular.explain_instance(
        data_row=x0_vec,
        predict_fn=predict_proba_on_transformed,
        num_features=10
    )
    print(f"\nLIME explanation for UC1 instance {idx} (Reasoning model)")
    exp.show_in_notebook(show_table=True, show_all=False)
Output:



- Feature contributions (e.g., Sex_female positive; Pclass_3 negative) align with intuition and the model’s behavior.
- Observation: this is where the “reasoning models decide” claim becomes transparent—decisions are traceable to features.
Step 11 — UC2: Compute ground truths and reasoning answers
Compute exact answers to factual questions directly from the data.
# Define factual questions and compute ground truths
def gt_over_50_survived(df_in):
    mask = df_in["Age"] > 50
    return int(df_in.loc[mask, "Survived"].sum())
def gt_survival_rate_female_1st(df_in):
    mask = (df_in["Sex"] == "female") & (df_in["Pclass"] == 1)
    if mask.sum() == 0:
        return float("nan")
    return float(df_in.loc[mask, "Survived"].mean())
q1 = "Among passengers over age 50, how many survived?"
q2 = "What is the survival rate among females in 1st class?"
gt1 = gt_over_50_survived(df)
gt2 = gt_survival_rate_female_1st(df)
print("Ground truths:")
print(f"Q1 -> {gt1}")
print(f"Q2 -> {gt2:.3f}")
# Reasoning answers are exact computations from the data
ra1 = gt1
ra2 = gt2
print("Reasoning answers (computed from data):")
print(f"Q1 -> {ra1}")
print(f"Q2 -> {ra2:.3f}")
Output:


- Exact numbers from data (e.g., Q1=22; Q2=0.968).
- Observation: this is a concrete demonstration of “deciding” via direct computation.
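As a quick cross-check on the helper functions, the same two answers can be recomputed with DataFrame.query (a minimal sketch; rows with missing Age drop out of the Age > 50 filter automatically):
q1_check = int(df.query("Age > 50")["Survived"].sum())
q2_check = float(df.query("Sex == 'female' and Pclass == 1")["Survived"].mean())
print(f"Q1 cross-check -> {q1_check}")
print(f"Q2 cross-check -> {q2_check:.3f}")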
Step 12 — Generative answers for UC2 and evaluation
Ask the same questions to GPT‑2 using the dataset card; parse numbers and score against ground truth.
import re, math
# Ask the generative model the same factual questions
def generative_answer(question):
    if not USE_GENERATIVE:
        return "(Generative disabled)"
    prompt = dataset_card + "\n\n" + \
        "Task: Answer the question based only on the given facts and general knowledge. If uncertain, estimate.\n" + \
        f"Question: {question}\n" + \
        "Answer succinctly:"
    return generate_text(prompt)
# Parsers to extract numbers from free text
def extract_first_int(text):
    m = re.search(r'(-?\d+)', text or "")
    return int(m.group(1)) if m else None
def extract_first_float(text):
    m = re.search(r'(\d+(\.\d+)?)', text or "")
    if m:
        val = float(m.group(1))
        # Treat percentages as rates if needed
        if val > 1 and val <= 100:
            val = val / 100.0
        return val
    return None
# Simple correctness checks
def correctness_q1(pred, gt):
    if pred is None:
        return 0
    return int(pred == gt)
def correctness_q2(pred, gt, tol=0.02):
    if pred is None or math.isnan(gt):
        return 0
    return int(abs(pred - gt) <= tol)
# Get answers and score
ga1 = generative_answer(q1)
ga2 = generative_answer(q2)
ga1_num = extract_first_int(ga1) if USE_GENERATIVE else None
ga2_num = extract_first_float(ga2) if USE_GENERATIVE else None
c1 = correctness_q1(ga1_num, gt1) if USE_GENERATIVE else 0
c2 = correctness_q2(ga2_num, gt2) if USE_GENERATIVE else 0
print("Generative answers:")
print(f"Q1 -> {ga1}")
print(f"Q2 -> {ga2}")
print("\nEvaluation of generative answers:")
print(f"Q1 parsed: {ga1_num}, correct={bool(c1)} (gt={gt1})")
print(f"Q2 parsed: {ga2_num}, correct={bool(c2)} (gt={gt2:.3f})")
Output:

- Example: Q1 guessed 18 (incorrect vs 22); Q2 parsed as 1.0 (incorrect vs 0.968).
- Observation: this is the clearest demonstration that “generative models are guessing”—fluency without grounded calculation.
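One way to make the guessing visible is to ask the same question several times under different random seeds and watch the parsed numbers drift while the ground truth stays fixed (a minimal sketch reusing generative_answer and transformers' set_seed; it only runs when USE_GENERATIVE is True):
if USE_GENERATIVE:
    from transformers import set_seed
    answers_q1 = []
    for seed in (0, 1, 2):
        set_seed(seed)                      # different sampling seed each run
        ans = generative_answer(q1)
        answers_q1.append(extract_first_int(ans))
    print(f"Parsed Q1 answers across seeds: {answers_q1} (ground truth: {gt1})")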
Step 13 — LIME Text (fast heuristic) for prompt sensitivity
An illustrative, fast version of LIME Text to see which words in the question might sway outputs; uses caching and low num_samples.
# Fast LIME Text to avoid very long runtimes; heuristic evaluation
if USE_GENERATIVE:
    from lime.lime_text import LimeTextExplainer
    import time
    labels = ["Incorrect", "Correct"]
    explainer_text = LimeTextExplainer(class_names=labels, random_state=RANDOM_STATE)
    _gen_cache = {}
    # Cached generation to speed up repeated calls
    def safe_generate(text, max_time_sec=5.0):
        if text in _gen_cache:
            return _gen_cache[text]
        try:
            out = generate_text(text, max_new_tokens=80, temperature=0.9, top_p=0.95)
        except Exception:
            out = ""
        _gen_cache[text] = out
        return out
    # Wrap the model: return a "probability of correctness" based on parsed value vs ground truth
    def gen_predict_proba_for_lime(base_question, gt_int=None, gt_float=None):
        def predict(texts):
            probs = []
            for t in texts:
                ans = safe_generate(t)
                if gt_int is not None:
                    pred = extract_first_int(ans)
                    correct = correctness_q1(pred, gt_int)
                else:
                    pred = extract_first_float(ans)
                    correct = correctness_q2(pred, gt_float)
                p_correct = 0.7 if correct == 1 else 0.3
                probs.append([1 - p_correct, p_correct])
            return np.array(probs)
        return predict
    # Smaller num_samples for speed
    predict_fn_q1 = gen_predict_proba_for_lime(q1, gt_int=gt1)
    exp_q1 = explainer_text.explain_instance(
        text_instance=q1,
        classifier_fn=predict_fn_q1,
        num_features=8,
        num_samples=100
    )
    print("\nLIME Text explanation for Generative Q1 (fast):")
    exp_q1.show_in_notebook(text=True)
    predict_fn_q2 = gen_predict_proba_for_lime(q2, gt_float=gt2)
    exp_q2 = explainer_text.explain_instance(
        text_instance=q2,
        classifier_fn=predict_fn_q2,
        num_features=8,
        num_samples=100
    )
    print("\nLIME Text explanation for Generative Q2 (fast):")
    exp_q2.show_in_notebook(text=True)
else:
    print("Skipping LIME Text (generative disabled).")
Output:


- Highlights question tokens; illustrates why phrasing can sway the model.
- Observation: LIME Text is heuristic here; useful for intuition, not for numeric guarantees.
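The token weights behind those visuals can also be dumped as plain text, which helps when the notebook widget is not rendered (a minimal sketch reusing exp_q1 and exp_q2; positive weights lean toward the heuristic “Correct” label):
if USE_GENERATIVE:
    print("Q1 token weights:", exp_q1.as_list())
    print("Q2 token weights:", exp_q2.as_list())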
Step 14 — Side‑by‑side comparison and wrap‑up metrics
Put the story together—questions, ground truths, reasoning answers, generative answers, parsed numerics, correctness; then print overall metrics.
def compare_show(question, gt_display, reasoning_value, gen_text, gen_num, correct_flag):
    print("\n" + "-"*80)
    print("Question:", question)
    print(f"Ground truth: {gt_display}")
    print("Reasoning answer:", reasoning_value)
    print("Generative answer:", gen_text)
    print("Parsed numeric:", gen_num)
    print("Generative correctness:", "Correct" if correct_flag else "Incorrect")
compare_show(q1, gt1, ra1, ga1, ga1_num, c1 if USE_GENERATIVE else 0)
compare_show(q2, f"{gt2:.3f}", f"{ra2:.3f}", ga2, ga2_num, c2 if USE_GENERATIVE else 0)
from sklearn.metrics import accuracy_score, roc_auc_score
summary = {
    "UC1_reasoning_accuracy": float(accuracy_score(y_test, clf_pipeline.predict(X_test))),
    "UC1_reasoning_auc": float(roc_auc_score(y_test, clf_pipeline.predict_proba(X_test)[:, 1])),
    "UC1_generative_behavior": "Textual justification; may be plausible but wrong (hallucination risk).",
    "UC2_reasoning": "Exact computations from data (counts/rates).",
    "UC2_generative": "Estimates may deviate from ground truth; sensitive to prompt phrasing.",
    "Explainability_LIME": {
        "Reasoning": "LIME Tabular highlights feature contributions in transformed space.",
        "Generative": "LIME Text provides heuristic insights into phrasing influence (approximate)."
    }
}
import json
print("\nSummary:")
print(json.dumps(summary, indent=2))
Output:


- UC1 metrics reaffirm strong performance of the reasoning model.
- UC2 shows the reasoning answers exactly match ground truth; the generative answers deviate.
- Observations:
- Generative models are guessing: UC2 “Evaluation of generative answers” where the parsed numbers differ from ground truth despite fluent text.
- Reasoning models are deciding: UC1 metrics plus LIME Tabular explanations (Steps 5–7) where feature‑based logic is explicit and tied to data.
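For reports, the UC2 comparison can also be collected into a small DataFrame instead of printed blocks (a minimal sketch built from the variables already defined in Steps 11 and 12):
uc2_table = pd.DataFrame({
    "question": [q1, q2],
    "ground_truth": [gt1, round(gt2, 3)],
    "reasoning_answer": [ra1, round(ra2, 3)],
    "generative_parsed": [ga1_num, ga2_num],
    "generative_correct": [bool(c1), bool(c2)],
})
print(uc2_table.to_string(index=False))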
Conclusion
- On structured analytics tasks, the reasoning/decision pipeline (preprocessing + calibrated classifier) delivers accurate, explainable predictions. LIME Tabular reveals which features drive each decision, making the model’s behavior auditable.
- Generative models excel at producing fluent narratives but will guess when asked to compute facts they were not given directly, leading to confident but wrong answers. Even with a dataset card, they can improvise numbers or misinterpret requests.
- For analytics that require correctness and interpretability, use decision models and use LIME for transparency. Use generative models for narrative framing, explanations, or brainstorming—not as a replacement for computation on data.