
About Dataset
LeadForge: Synthetic B2B Lead Scoring Dataset (leadforge-lead-scoring-v1)
A relational, reproducible, three-tier synthetic CRM dataset family for
teaching lead scoring at scale. Generated by
leadforge, an
open-source Python framework for synthetic CRM/funnel data. The
framework version is decoupled from the dataset version: the package
stays at 1.x; the dataset is published under the explicit …-v1
tag.
Why lead scoring matters in 2024–2026
Mid-market SaaS vendors entered 2024–2026 with growth slowing and customer-acquisition costs rising[^macro], so predicting which leads convert within a fixed window has moved from a marketing nicety to a survival skill. This dataset teaches that skill on a relational substrate, with the realistic confusions (snapshot-window discipline, leakage traps, channel signal weaker than vendor blogs imply) that students will hit when they finally get hands on real CRM data.
[^macro]: Macroeconomic framing summarised in
docs/external_review/summaries/gemini_v2_summary.md
(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio
rose materially in 2024).
What's inside
release/
├── intro/ intermediate/ advanced/ # student_public bundles, one per difficulty tier
│ ├── manifest.json # provenance + file hashes
│ ├── metrics.json # per-tier headline metrics (medians + spreads)
│ ├── dataset_card.md # auto-rendered per-bundle card
│ ├── feature_dictionary.csv # authoritative column spec
│ ├── lead_scoring.csv # flat convenience CSV (all splits)
│ ├── tables/*.parquet # 7 snapshot-safe relational tables
│ └── tasks/converted_within_90_days/{train,valid,test}.parquet
├── intermediate_instructor/ # research companion: full-horizon tables + metadata/
├── docs/ # vendored DGP / leakage / break-me docs (agent-readable)
├── notebooks/ # 01 baseline · 02 relational · 03 leakage · 04 calibration
├── metrics.json # top-level cross-tier metrics summary
├── claims_register.{md,json} # claims → backing-artifact map (agent-readable)
└── validation/ # validation_report.{json,md} + figures
student_public bundles ship the snapshot-safe relational view;
research_instructor companions ship the full-horizon view plus the
hidden causal structure (DAG, latent registry, mechanism summary)
under metadata/. The full layout is documented in each bundle's
manifest.json.
Agent-reviewable artifacts
The published bundle is self-contained for AI review and offline auditing — every numeric / structural claim on this page can be verified without following an external link:
- metrics.json (root) + <tier>/metrics.json: deterministic JSON view of the headline LR AUC / AP / P@100 / Brier / conversion rate / cohort-shift / cross-tier-ordering medians, with JSON-path back-references to validation/validation_report.json (the source of truth).
- claims_register.{md,json}: every numerical or structural claim on this page paired with the artifact and path that backs it. Rendered from claims_register_source.yaml by scripts/build_claims_register.py.
- docs/: vendored copies of generation_method.md, channel_signal_audit.md, break_me_guide.md, feature_dictionary.md, v1_acceptance_gates_bands.yaml, v2_decision_log.md, plus a hand-authored relational_table_schemas.csv documenting every column of every relational table. These match the GitHub-blob links cited below but ship inside the bundle so a reviewer never needs network access.
- <tier>/manifest.json: SHA-256 hash for every file plus the full redaction contract (structural_redactions.columns, omitted_tables, relational_snapshot_safe, snapshot_day).
- Kaggle / HuggingFace preview pages additionally inject a schema.org/Dataset JSON-LD block in their <head> for agent ingestion without HTML parsing.
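The per-file hashes can also be checked by hand. The following is a minimal sketch, not the project's validator: it assumes the manifest exposes a `files` mapping from relative path to SHA-256 hex digest. That key name and layout are hypothetical, so inspect your bundle's manifest.json and adjust `files_key` accordingly.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(bundle_dir: Path, files_key: str = "files") -> list[str]:
    """Return relative paths that are missing or whose hash differs.

    ASSUMPTION: manifest[files_key] maps relative path -> SHA-256 hex digest.
    The real manifest layout may differ.
    """
    manifest = json.loads((bundle_dir / "manifest.json").read_text())
    bad = []
    for rel_path, expected in manifest[files_key].items():
        target = bundle_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad


# Demo on a throwaway bundle so the sketch runs without the real files.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    data_file = bundle / "lead_scoring.csv"
    data_file.write_text("lead_id\n1\n")
    manifest = {"files": {"lead_scoring.csv": sha256_of(data_file)}}
    (bundle / "manifest.json").write_text(json.dumps(manifest))

    clean_result = find_mismatches(bundle)      # nothing altered yet
    data_file.write_text("lead_id\n2\n")        # simulate tampering
    tampered_result = find_mismatches(bundle)   # the edited file is reported
```

Streaming in chunks keeps memory flat on the larger Parquet files.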
Quick start
import pandas as pd

# Flat CSV
df = pd.read_csv("intermediate/lead_scoring.csv")
# Parquet task splits (recommended)
train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
test = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")
# Relational tables (feature-engineering example)
leads = pd.read_parquet("intermediate/tables/leads.parquet")
touches = pd.read_parquet("intermediate/tables/touches.parquet")
# Count touches per lead, then attach the count to the leads table
my_touch_count = (
    touches.groupby("lead_id").size().rename("my_touch_count").reset_index()
)
features = leads.merge(my_touch_count, on="lead_id", how="left")
# Reproduce from source
# pip install leadforge
# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
# --mode student_public --difficulty intermediate --out my_bundle
The label converted_within_90_days resolves over a 90-day window;
engagement features (touch_count, session_count, etc.) are
computed strictly over events on days [0, 30]. The deliberate
exception is total_touches_all, the leakage trap — flagged
leakage_risk=True in feature_dictionary.csv. Drop it from your
feature set unless you're demonstrating leakage detection.
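One way to apply that advice is to build the feature list from the dictionary instead of hard-coding column names. This is a minimal sketch: `leakage_risk` is the flag named above, while the `name` column holding the feature name is an assumption, so check the header of your feature_dictionary.csv. The small inline frame stands in for `pd.read_csv("intermediate/feature_dictionary.csv")`.

```python
import pandas as pd


def safe_feature_columns(feature_dict: pd.DataFrame, name_col: str = "name") -> list[str]:
    """Return feature names that are not flagged leakage_risk=True.

    ASSUMPTION: `name_col` is the column holding feature names; the real
    header in feature_dictionary.csv may use a different label.
    """
    # Normalise to strings so both boolean True and the text "True" match.
    flagged = feature_dict["leakage_risk"].astype(str).str.lower().eq("true")
    return feature_dict.loc[~flagged, name_col].tolist()


# Stand-in for the shipped dictionary (illustrative rows only).
feature_dict = pd.DataFrame(
    {
        "name": ["touch_count", "session_count", "total_touches_all"],
        "leakage_risk": [False, False, True],
    }
)

safe_cols = safe_feature_columns(feature_dict)
```

Driving the selection from the dictionary means any other column flagged as a leakage risk is excluded too.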
Dataset summary
| | Intro | Intermediate | Advanced |
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Accounts | 1,500 | 1,500 | 1,500 |
| Contacts | 4,200 | 4,200 | 4,200 |
| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
| Target | converted_within_90_days | converted_within_90_days | converted_within_90_days |
| Conversion rate (acceptance band, gate G7.*) | 24–61% | 12–31% | 4–12% |
| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |
* student_public / research_instructor. Difficulty is modulated
by the simulation engine — signal strength on latent-trait weights,
Gaussian noise on float features, MCAR missingness, outlier rate —
not post-hoc label flipping. The acceptance band is the recipe
gate's tolerance window (v1_acceptance_gates_bands.yaml G7.*),
not the achievable range — observed five-seed spreads sit
comfortably inside the band.
The scenario
Veridian Technologies is a fictional Series B startup (Austin, US)
selling Veridian Procure, a procurement / AP automation SaaS, to
mid-market firms (200–2,000 employees) in the US and UK. The funnel
runs through inbound marketing (45%), SDR outbound (35%), and
partner referrals (20%); four personas drive deals (VP Finance, AP
Manager, IT Director, Procurement Manager). Task: predict whether
a lead converts (closed_won) within 90 days. ACV bands are
$18k–$120k. See
docs/release/generation_method.md
for the full DGP, and the deeper "what's modelled / approximate / not
modelled" breakdown that this README only summarises.
Public vs instructor: what's redacted
Filtering happens during rendering, not during simulation. The
redaction contract is single-sourced in
leadforge/validation/leakage_probes.py;
the snapshot-safe writer and the validator import the same constants,
so they cannot drift apart.
| Source-of-truth constant | Public bundle treatment |
|---|---|
| BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp") | Dropped from tables/leads.parquet |
| BANNED_OPP_COLUMNS = ("close_outcome", "closed_at") | Dropped from tables/opportunities.parquet |
| BANNED_TABLES = ("customers", "subscriptions") | Omitted from public bundles |
| SNAPSHOT_FILTERED_TABLES (touches, sessions, sales_activities, opportunities) | Filtered per-lead by lead_created_at + snapshot_day |
| Snapshot redaction (current_stage, is_sql) | Stripped from tasks/ splits and tables/leads.parquet |
| total_touches_all (deliberate trap) | Retained in both modes; flagged leakage_risk=True |
Each bundle's manifest.json records relational_snapshot_safe,
redacted_columns, and snapshot_day, so the bundle is
self-describing.
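To make the per-lead snapshot filter concrete, here is a minimal pandas sketch of the idea. It is an illustration, not the code in leadforge/validation/leakage_probes.py: the event timestamp column `touch_timestamp` is a hypothetical name, and treating the cutoff as inclusive is an assumption based on the days [0, 30] window described earlier.

```python
import pandas as pd


def filter_to_snapshot(
    events: pd.DataFrame,
    leads: pd.DataFrame,
    ts_col: str,
    snapshot_day: int,
) -> pd.DataFrame:
    """Keep events that occur no later than lead_created_at + snapshot_day.

    ASSUMPTIONS: `ts_col` names the event timestamp column (hypothetical),
    and the cutoff is inclusive.
    """
    cutoff = leads[["lead_id", "lead_created_at"]].copy()
    cutoff["cutoff"] = cutoff["lead_created_at"] + pd.Timedelta(days=snapshot_day)
    merged = events.merge(cutoff[["lead_id", "cutoff"]], on="lead_id", how="inner")
    keep = merged[ts_col] <= merged["cutoff"]
    # Return only the original event columns, dropping the helper cutoff.
    return merged.loc[keep, events.columns].reset_index(drop=True)


# Illustrative data: two leads, each with one early and one late touch.
leads = pd.DataFrame(
    {
        "lead_id": [1, 2],
        "lead_created_at": pd.to_datetime(["2025-01-01", "2025-02-01"]),
    }
)
touches = pd.DataFrame(
    {
        "lead_id": [1, 1, 2, 2],
        "touch_timestamp": pd.to_datetime(
            ["2025-01-10", "2025-02-15", "2025-02-20", "2025-03-10"]
        ),
    }
)

kept = filter_to_snapshot(touches, leads, ts_col="touch_timestamp", snapshot_day=30)
```

The cutoff is relative to each lead's creation date, so a single global date filter would not reproduce it.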
Calibration
Every realism / calibration / difficulty claim in this README is
backed by
validation/validation_report.md,
regenerated by
scripts/validate_release_candidate.py
with bands declared in
docs/release/v1_acceptance_gates_bands.yaml.
Headline cross-seed medians (seeds 42–46):
| Tier | LR AUC | AP | P@100 | Brier |
|---|---|---|---|---|
| intro | 0.879 | 0.761 | 0.80 | 0.130 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 |
AP, P@100, conversion-rate, and lift orderings hold across the intended difficulty axis (intro > intermediate > advanced).
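For readers who want to recompute two of these metrics on their own predictions, here is a minimal NumPy sketch of P@K and the Brier score. It uses the standard definitions and is not the release validator; the toy labels and scores are illustrative. It also computes p(1 − p), the Brier score of a constant forecast at base rate p, which is a useful reference when comparing Brier across tiers with different conversion rates.

```python
import numpy as np


def precision_at_k(y_true, y_score, k: int = 100) -> float:
    """Fraction of positives among the k highest-scored rows."""
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")[:k]
    return float(np.asarray(y_true, dtype=float)[order].mean())


def brier_score(y_true, y_prob) -> float:
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def constant_forecast_brier(base_rate: float) -> float:
    """Brier score of always predicting the base rate: p * (1 - p)."""
    return base_rate * (1.0 - base_rate)


# Toy example (illustrative values only).
y_true = [1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.1]

p_at_2 = precision_at_k(y_true, y_prob, k=2)
brier = brier_score(y_true, y_prob)

# Reference points from the observed median conversion rates above.
baselines = {
    "intro": constant_forecast_brier(0.4267),
    "intermediate": constant_forecast_brier(0.2160),
    "advanced": constant_forecast_brier(0.0840),
}
```

The constant-forecast references come out at roughly 0.245, 0.169 and 0.077. A lower Brier on the advanced tier therefore reflects the lower base rate at least in part, so compare each tier's Brier with its own reference, not across tiers.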
Intended uses
- Teaching baseline lead-scoring on a flat snapshot.
- Teaching relational feature engineering against snapshot-safe tables.
- Teaching leakage detection (the total_touches_all trap is designed to be discoverable).
- Teaching calibration, lift, P@K, value-aware ranking (expected_acv × P(convert)), and cohort-shift evaluation.
- Comparing model families under a controlled DGP.
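The value-aware ranking mentioned in the list can be sketched in a few lines. The example assumes a scored frame with an `expected_acv` column and a model probability column; the name `p_convert` is hypothetical, and the three rows are illustrative values inside the $18k–$120k ACV band.

```python
import pandas as pd

# Illustrative scored leads; `p_convert` is a hypothetical column name
# standing in for your model's predicted conversion probability.
scored = pd.DataFrame(
    {
        "lead_id": [101, 102, 103],
        "expected_acv": [120_000, 18_000, 60_000],
        "p_convert": [0.10, 0.50, 0.30],
    }
)

# Expected value per lead = expected_acv × P(convert).
scored["expected_value"] = scored["expected_acv"] * scored["p_convert"]

by_value = scored.sort_values("expected_value", ascending=False)["lead_id"].tolist()
by_prob = scored.sort_values("p_convert", ascending=False)["lead_id"].tolist()
```

Here the two orderings disagree: lead 102 is most likely to convert, but lead 103 carries the highest expected value. Value capture is therefore worth reporting separately from P@K.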
Out-of-scope uses
- Production lead scoring. The company, product, and customers are fictional.
- Vendor benchmarking / paper baselines. Difficulty tiers are calibrated for pedagogy, not cross-paper comparability.
- Causal-inference research that requires recovery of the true DGP. The instructor companion exposes the hidden graph for teaching, not designed counterfactuals.
- Demographic / fairness research. v1 does not model protected attributes.
Known limitations
- Difficulty signal on raw AUC is flat. LR AUC is ~0.88 across every tier. Difficulty is visible in AP, P@K, Brier, and value capture. Treat AUC as a sanity check, not a difficulty signal.
- GBM does not consistently beat LR (gate G7.4.4). GBM−LR AUC delta is slightly negative in every tier (intro −0.0045, intermediate −0.0072, advanced −0.0133); v1's snapshot is dominated by linear features. v2 will inject non-linear interactions in the simulator.
- Channel signal is weak. Per docs/release/channel_signal_audit.md, out-of-sample univariate AUC of lead_source is ≈0.50–0.52 across all tiers and the per-channel rate spread is ≤0.05. The simulator does not encode channel-conditional probabilities; channel-conditional encoding is post-v1 work.
- Cohort-shift degradation is small. v1 has no time-of-year drift baked in; the cohort-shift gate (G6.4) is informational and will bite in v2.
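One plausible way to measure an out-of-sample univariate AUC for a categorical column such as lead_source is to score each test row with its channel's train-set conversion rate. The sketch below does that; it is not necessarily the procedure used in channel_signal_audit.md, and the channel values and labels are made up for illustration.

```python
from collections import defaultdict


def channel_rates(channels, labels):
    """Conversion rate per channel, estimated on the training split."""
    totals = defaultdict(lambda: [0, 0])  # channel -> [conversions, count]
    for ch, y in zip(channels, labels):
        totals[ch][0] += y
        totals[ch][1] += 1
    return {ch: conv / n for ch, (conv, n) in totals.items()}


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of rows, which is
    fine for a test split of a few hundred leads.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data: channel values are hypothetical labels.
train_channels = ["inbound"] * 4 + ["outbound"] * 4
train_labels = [1, 1, 1, 0, 0, 0, 0, 1]
test_channels = ["inbound", "inbound", "outbound", "outbound"]
test_labels = [1, 0, 0, 1]

rates = channel_rates(train_channels, train_labels)
global_rate = sum(train_labels) / len(train_labels)
# Unseen channels fall back to the overall training rate.
test_scores = [rates.get(ch, global_rate) for ch in test_channels]
auc = pairwise_auc(test_labels, test_scores)
```

In this toy case the train-set rates differ (0.75 versus 0.25) but do not carry over to the test rows, so the AUC lands at 0.5. That is the pattern the audit describes for the shipped tiers.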
Composition
- Entities. Accounts, contacts, leads, touches, sessions, sales_activities, opportunities (public); plus customers and subscriptions (instructor only). Per-table row counts for each bundle live in manifest.json.
- Features. 32 public columns grouped by analytical role in docs/release/feature_dictionary.md; the per-bundle feature_dictionary.csv is the authoritative machine-readable spec.
- Label. converted_within_90_days (boolean), event-derived from the simulator. Never sampled directly.
- Splits. 70/15/15 train/valid/test, deterministic given seed; recorded in tasks/converted_within_90_days/task_manifest.json. Group-leakage warning: the splitter is keyed on lead_id only, not on account_id or contact_id. On the as-shipped intermediate bundle, 518 of 557 test accounts (≈93%) also appear in train; the contact-level overlap is similar in magnitude. A flat baseline trained on the random split rides account-level signal across the split boundary. For a generalisation-faithful number, retrain with GroupKFold(account_id) (or contact_id) and report both; see break_me_guide.md §5 for the detection recipe.
- Provenance. Recipe b2b_saas_procurement_v1, seed 42, package version stamped in manifest.json.
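The account overlap described in the Splits item can be measured, and avoided, with a few lines of pandas. This is a minimal sketch on toy frames. It assumes `account_id` is available on the rows you split (join it from the leads table if the task splits do not carry it), and `group_holdout` is a hypothetical helper, not part of leadforge. scikit-learn's GroupKFold is the more complete option for cross-validation.

```python
import random

import pandas as pd


def group_overlap(train: pd.DataFrame, test: pd.DataFrame, key: str = "account_id"):
    """Return (test groups also present in train, total test groups)."""
    test_groups = set(test[key].unique())
    shared = test_groups & set(train[key].unique())
    return len(shared), len(test_groups)


def group_holdout(frame: pd.DataFrame, key: str = "account_id",
                  test_frac: float = 0.15, seed: int = 42):
    """Hold out whole groups so no `key` value spans both splits.

    Hypothetical helper for illustration only.
    """
    groups = sorted(frame[key].unique())
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(random.Random(seed).sample(groups, n_test))
    in_test = frame[key].isin(test_groups)
    return frame[~in_test], frame[in_test]


# Toy frames standing in for the shipped train/test splits.
train = pd.DataFrame({"lead_id": [1, 2, 3, 4], "account_id": [10, 10, 20, 30]})
test = pd.DataFrame({"lead_id": [5, 6, 7], "account_id": [20, 30, 40]})

shared, total = group_overlap(train, test)  # accounts 20 and 30 leak

# Re-split by account: the overlap disappears.
all_leads = pd.concat([train, test], ignore_index=True)
train_g, test_g = group_holdout(all_leads, test_frac=0.25)
shared_g, total_g = group_overlap(train_g, test_g)
```

Reporting the random-split and group-split numbers side by side shows how much of the baseline's score depends on accounts seen during training.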
Maintenance, adversarial framing, license
We want the dataset to be broken. The
break-me guide catalogues
nine adversarial patterns to look for (leakage, split
contamination, ranking inversions, calibration drift) with
worked-example pointers back into the notebooks. Issue
templates ship under .github/ISSUE_TEMPLATE/: a
breakage report
form for findings on the bundle itself, and a
realism feedback
form for distributional critiques. Accepted findings are
logged in
docs/release/v2_decision_log.md.
File issues at
leadforge-dev/leadforge;
PRs welcome.
| Field | Value |
|---|---|
| Generator | leadforge 1.0.0+ |
| Recipe | b2b_saas_procurement_v1 |
| Canonical seed | 42 (cross-seed sweep: 42–46) |
| Bundle schema version | 5 |
| Format | Parquet (canonical) + CSV (convenience) |
| License | MIT — see LICENSE |
Verify integrity with leadforge validate <bundle_dir>; every file
is hashed in manifest.json.
Objective
This deliberately fake release is large enough to exercise row previews, file downloads, metadata, and review copy before a real upload to Kaggle.