This is a ShmuggingFace review mock. It is not Hugging Face, Kaggle, or a real dataset release.

leadforge-dev · Updated 2026-05-24

LeadForge Lead Scoring v1 — Advanced

Advanced difficulty · 5,000 leads · ~8% conversion rate · LR AUC 0.886 (5-seed median)


About Dataset

LeadForge: Synthetic B2B Lead Scoring Dataset (leadforge-lead-scoring-v1)

A relational, reproducible, three-tier synthetic CRM dataset family for teaching lead scoring at scale. Generated by leadforge, an open-source Python framework for synthetic CRM/funnel data. The framework version is decoupled from the dataset version: the package stays at 1.x; the dataset is published under the explicit …-v1 tag.

Why lead scoring matters in 2024–2026

Mid-market SaaS vendors entered 2024–2026 with growth slowing and customer-acquisition costs rising[^macro], so predicting which leads convert within a fixed window has moved from a marketing nicety to a survival skill. This dataset teaches that skill on a relational substrate, with the realistic confusions (snapshot-window discipline, leakage traps, channel signal weaker than vendor blogs imply) that students will hit when they finally get hands on real CRM data.

[^macro]: Macroeconomic framing summarised in docs/external_review/summaries/gemini_v2_summary.md (median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio rose materially in 2024).

What's inside

release/
├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier
│   ├── manifest.json                 # provenance + file hashes
│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)
│   ├── dataset_card.md               # auto-rendered per-bundle card
│   ├── feature_dictionary.csv        # authoritative column spec
│   ├── lead_scoring.csv              # flat convenience CSV (all splits)
│   ├── tables/*.parquet              # 7 snapshot-safe relational tables
│   └── tasks/converted_within_90_days/{train,valid,test}.parquet
├── intermediate_instructor/          # research companion: full-horizon tables + metadata/
├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)
├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration
├── metrics.json                      # top-level cross-tier metrics summary
├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)
└── validation/                       # validation_report.{json,md} + figures

student_public bundles ship the snapshot-safe relational view; research_instructor companions ship the full-horizon view plus the hidden causal structure (DAG, latent registry, mechanism summary) under metadata/. The full layout is documented in each bundle's manifest.json.

Agent-reviewable artifacts

The published bundle is self-contained for AI review and offline auditing — every numeric / structural claim on this page can be verified without following an external link:

  • metrics.json (root) + <tier>/metrics.json — deterministic JSON view of the headline LR AUC / AP / P@100 / Brier / conversion rate / cohort-shift / cross-tier-ordering medians, with JSON-path back-references to validation/validation_report.json (the source of truth).
  • claims_register.{md,json} — every numerical or structural claim on this page paired with the artifact and path that backs it. Rendered from claims_register_source.yaml by scripts/build_claims_register.py.
  • docs/ — vendored copies of generation_method.md, channel_signal_audit.md, break_me_guide.md, feature_dictionary.md, v1_acceptance_gates_bands.yaml, v2_decision_log.md, plus a hand-authored relational_table_schemas.csv documenting every column of every relational table. These match the GitHub-blob links cited below but ship inside the bundle so a reviewer never needs network access.
  • <tier>/manifest.json — SHA-256 hash for every file plus the full redaction contract (structural_redactions.columns, omitted_tables, relational_snapshot_safe, snapshot_day).
  • Kaggle / HuggingFace preview pages additionally inject a schema.org/Dataset JSON-LD block in their <head> for agent ingestion without HTML parsing.
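The per-file SHA-256 hashes in <tier>/manifest.json can also be checked offline with the standard library. This is a minimal sketch, assuming the manifest exposes a mapping from relative file path to hex digest under a key called `files`; that key name and layout are illustrative assumptions, not the documented schema, so adapt them to the real manifest (or run leadforge validate <bundle_dir> instead).

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_hash_mismatches(bundle_dir: str, files_key: str = "files") -> list[str]:
    """Return relative paths that are missing or whose hash differs.

    ASSUMPTION: manifest.json holds {files_key: {relative_path: sha256_hex}}.
    The real manifest layout may differ.
    """
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    mismatches = []
    for rel_path, expected in manifest[files_key].items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatches.append(rel_path)
    return sorted(mismatches)
```

An empty list from find_hash_mismatches("intermediate") would mean every listed file is present and matches its recorded digest, under the assumed layout.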

Quick start

import pandas as pd

# Flat CSV
df = pd.read_csv("intermediate/lead_scoring.csv")

# Parquet task splits (recommended)
train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
test  = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")

# Relational tables (feature engineering — example)
leads   = pd.read_parquet("intermediate/tables/leads.parquet")
touches = pd.read_parquet("intermediate/tables/touches.parquet")
my_touch_count = (
    touches.groupby("lead_id").size().rename("my_touch_count").reset_index()
)
features = leads.merge(my_touch_count, on="lead_id", how="left")

# Reproduce from source
# pip install leadforge
# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
#                    --mode student_public --difficulty intermediate --out my_bundle

The label converted_within_90_days resolves over a 90-day window; engagement features (touch_count, session_count, etc.) are computed strictly over events on days [0, 30]. The deliberate exception is total_touches_all, the leakage trap — flagged leakage_risk=True in feature_dictionary.csv. Drop it from your feature set unless you're demonstrating leakage detection.
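The leakage flag can be read programmatically rather than hard-coding column names. A minimal sketch using only the standard library; the header names `name` and `leakage_risk` are assumptions (only the leakage_risk=True flag is stated above), so check the real feature_dictionary.csv header before relying on it. The two sample rows are made up for illustration.

```python
import csv
import io


def leakage_columns(dictionary_file, name_field="name", flag_field="leakage_risk"):
    """Return feature names whose leakage flag is truthy.

    ASSUMPTION: `name_field` and `flag_field` are illustrative header names;
    pass the actual ones used by feature_dictionary.csv.
    """
    truthy = {"true", "1", "yes"}
    reader = csv.DictReader(dictionary_file)
    return [
        row[name_field]
        for row in reader
        if str(row.get(flag_field, "")).strip().lower() in truthy
    ]


# Tiny in-memory stand-in for feature_dictionary.csv (made-up rows).
sample = io.StringIO(
    "name,leakage_risk\n"
    "touch_count,False\n"
    "total_touches_all,True\n"
)
to_drop = leakage_columns(sample)
print(to_drop)  # ['total_touches_all']
```

With the real file, open it with open(path, newline="") and pass the resulting list to train.drop(columns=to_drop, errors="ignore"); remember to separate the label converted_within_90_days from the feature matrix as well.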

Dataset summary

Property | Intro | Intermediate | Advanced
Leads | 5,000 | 5,000 | 5,000
Accounts | 1,500 | 1,500 | 1,500
Contacts | 4,200 | 4,200 | 4,200
Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34*
Target | converted_within_90_days | converted_within_90_days | converted_within_90_days
Conversion rate (acceptance band, gate G7.*) | 24–61% | 12–31% | 4–12%
Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40%
Signal strength | 0.90 | 0.70 | 0.50
Noise scale | 0.10 | 0.30 | 0.55
Missing rate | 2% | 8% | 18%

* student_public / research_instructor. Difficulty is modulated by the simulation engine — signal strength on latent-trait weights, Gaussian noise on float features, MCAR missingness, outlier rate — not post-hoc label flipping. The acceptance band is the recipe gate's tolerance window (v1_acceptance_gates_bands.yaml G7.*), not the achievable range — observed five-seed spreads sit comfortably inside the band.

The scenario

Veridian Technologies is a fictional Series B startup (Austin, US) selling Veridian Procure, a procurement / AP automation SaaS, to mid-market firms (200–2,000 employees) in the US and UK. The funnel runs through inbound marketing (45%), SDR outbound (35%), and partner referrals (20%); four personas drive deals (VP Finance, AP Manager, IT Director, Procurement Manager). Task: predict whether a lead converts (closed_won) within 90 days. ACV bands are $18k–$120k. See docs/release/generation_method.md for the full DGP, and the deeper "what's modelled / approximate / not modelled" breakdown that this README only summarises.

Public vs instructor: what's redacted

Filtering happens during rendering, not during simulation. The redaction contract is single-sourced in leadforge/validation/leakage_probes.py; the snapshot-safe writer and the validator import the same constants, so they cannot drift apart.

Source-of-truth constant | Public bundle treatment
BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp") | Dropped from tables/leads.parquet
BANNED_OPP_COLUMNS = ("close_outcome", "closed_at") | Dropped from tables/opportunities.parquet
BANNED_TABLES = ("customers", "subscriptions") | Omitted from public bundles
SNAPSHOT_FILTERED_TABLES (touches, sessions, sales_activities, opportunities) | Filtered per-lead by lead_created_at + snapshot_day
Snapshot redaction (current_stage, is_sql) | Stripped from tasks/ splits and tables/leads.parquet
total_touches_all (deliberate trap) | Retained in both modes; flagged leakage_risk=True

Each bundle's manifest.json records relational_snapshot_safe, redacted_columns, and snapshot_day, so the bundle is self-describing.
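The per-lead snapshot filter (lead_created_at + snapshot_day) can be illustrated as follows. This is a sketch of the idea, not the leadforge writer: the event timestamp field `event_at` is a hypothetical name, and treating the cutoff as inclusive is an assumption based on the [0, 30] day window stated earlier.

```python
from datetime import datetime, timedelta


def filter_to_snapshot(events, lead_created_at, snapshot_day=30, ts_field="event_at"):
    """Keep events that fall within each lead's own snapshot window.

    events: iterable of dicts with "lead_id" and a timestamp under `ts_field`
            (`event_at` is a hypothetical field name).
    lead_created_at: dict mapping lead_id -> creation datetime.
    ASSUMPTION: the window is closed on both ends, i.e. days [0, snapshot_day].
    """
    kept = []
    for event in events:
        created = lead_created_at[event["lead_id"]]
        cutoff = created + timedelta(days=snapshot_day)
        if created <= event[ts_field] <= cutoff:
            kept.append(event)
    return kept


created = {"lead_004250": datetime(2024, 1, 8)}
touches = [
    {"lead_id": "lead_004250", "event_at": datetime(2024, 1, 10)},  # day 2
    {"lead_id": "lead_004250", "event_at": datetime(2024, 2, 7)},   # day 30
    {"lead_id": "lead_004250", "event_at": datetime(2024, 3, 1)},   # day 53
]
visible = filter_to_snapshot(touches, created)
print(len(visible), "of", len(touches))  # 2 of 3
```

Counting the unfiltered list instead of the filtered one is the kind of window violation that a full-horizon count such as total_touches_all lets students discover.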

Calibration

Every realism / calibration / difficulty claim in this README is backed by validation/validation_report.md, regenerated by scripts/validate_release_candidate.py with bands declared in docs/release/v1_acceptance_gates_bands.yaml. Headline cross-seed medians (seeds 42–46):

Tier | LR AUC | AP | P@100 | Brier
intro | 0.879 | 0.761 | 0.80 | 0.130
intermediate | 0.886 | 0.575 | 0.59 | 0.110
advanced | 0.886 | 0.351 | 0.34 | 0.061

AP, P@100, conversion-rate, and lift orderings hold across the intended difficulty axis (intro > intermediate > advanced).
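For readers who want to recompute two of the headline metrics by hand, the definitions of Brier score and precision at K are short enough to write out. This is a plain-Python sketch on made-up scores, not the code in scripts/validate_release_candidate.py; tie handling in the ranking is left to the sort order.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def precision_at_k(y_true, y_prob, k):
    """Fraction of positives among the k highest-scored rows."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Made-up labels and scores for illustration only.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(round(brier_score(labels, scores), 3))  # 0.158
print(precision_at_k(labels, scores, 2))      # 0.5
```

Because a constant forecast at base rate p scores p(1 − p), Brier values are easiest to compare across tiers relative to that baseline rather than in absolute terms.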

Intended uses

  • Teaching baseline lead-scoring on a flat snapshot.
  • Teaching relational feature engineering against snapshot-safe tables.
  • Teaching leakage detection (the total_touches_all trap is designed to be discoverable).
  • Teaching calibration, lift, P@K, value-aware ranking (expected_acv × P(convert)), and cohort-shift evaluation.
  • Comparing model families under a controlled DGP.
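Value-aware ranking, expected_acv × P(convert), can be sketched in a few lines. The field `p_convert` is a hypothetical name for a model's predicted probability, and the three leads are made-up numbers chosen so that the value ranking differs from the probability ranking.

```python
def rank_by_expected_value(leads, top_k=None):
    """Sort leads by expected_acv * p_convert, highest first.

    `p_convert` is a hypothetical field holding a model's predicted
    conversion probability; expected_acv comes from the snapshot.
    """
    ranked = sorted(
        leads,
        key=lambda lead: lead["expected_acv"] * lead["p_convert"],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]


# Made-up leads for illustration only.
leads = [
    {"lead_id": "A", "expected_acv": 120_000, "p_convert": 0.08},
    {"lead_id": "B", "expected_acv": 20_000, "p_convert": 0.40},
    {"lead_id": "C", "expected_acv": 60_000, "p_convert": 0.10},
]
by_value = [lead["lead_id"] for lead in rank_by_expected_value(leads)]
by_prob = [
    lead["lead_id"]
    for lead in sorted(leads, key=lambda lead: lead["p_convert"], reverse=True)
]
print(by_value)  # ['A', 'B', 'C']
print(by_prob)   # ['B', 'C', 'A']
```

Lead A has the lowest conversion probability but the highest expected value, which is the kind of inversion the value-aware exercise is meant to surface.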

Out-of-scope uses

  • Production lead scoring. The company, product, and customers are fictional.
  • Vendor benchmarking / paper baselines. Difficulty tiers are calibrated for pedagogy, not cross-paper comparability.
  • Causal-inference research that requires recovery of the true DGP. The instructor companion exposes the hidden graph for teaching; it does not ship designed counterfactuals.
  • Demographic / fairness research. v1 does not model protected attributes.

Known limitations

  • Difficulty signal on raw AUC is flat. LR AUC is ~0.88 across every tier. Difficulty is visible in AP, P@K, Brier, and value capture. Treat AUC as a sanity check, not a difficulty signal.
  • GBM does not consistently beat LR (gate G7.4.4). GBM−LR AUC delta is slightly negative in every tier (intro −0.0045, intermediate −0.0072, advanced −0.0133); v1's snapshot is dominated by linear features. v2 will inject non-linear interactions in the simulator.
  • Channel signal is weak. Per docs/release/channel_signal_audit.md, out-of-sample univariate AUC of lead_source is ≈0.50–0.52 across all tiers and the per-channel rate spread is ≤0.05. The simulator does not encode channel-conditional probabilities; channel-conditional encoding is post-v1 work.
  • Cohort-shift degradation is small. v1 has no time-of-year drift baked in; the cohort-shift gate (G6.4) is informational and will bite in v2.

Composition

  • Entities. Accounts, contacts, leads, touches, sessions, sales_activities, opportunities (public); plus customers and subscriptions (instructor only). Per-row counts per bundle live in manifest.json.
  • Features. 32 public columns grouped by analytical role in docs/release/feature_dictionary.md; the per-bundle feature_dictionary.csv is the authoritative machine-readable spec.
  • Label. converted_within_90_days (boolean), event-derived from the simulator. Never sampled directly.
  • Splits. 70/15/15 train/valid/test, deterministic given seed; recorded in tasks/converted_within_90_days/task_manifest.json. Group-leakage warning: the splitter is keyed on lead_id only, not on account_id or contact_id. On the as-shipped intermediate bundle, 518 of 557 test accounts (≈93 %) also appear in train; the contact-level overlap is similar in magnitude. A flat baseline trained on the random split rides account-level signal across the split boundary. For a generalisation-faithful number, retrain with GroupKFold(account_id) (or contact_id) and report both — see break_me_guide.md §5 for the detection recipe.
  • Provenance. Recipe b2b_saas_procurement_v1, seed 42, package version stamped in manifest.json.
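The group-leakage warning under Splits can be checked, and worked around, with a few lines of standard-library Python. This sketch measures how many test groups also occur in train, then shows a hash-based group-level assignment in which every row of an account lands in the same split. The hash bucketing is a stand-in chosen for illustration, not the shipped splitter and not scikit-learn's GroupKFold; the 15% fractions mirror the 70/15/15 ratio only approximately.

```python
import hashlib


def group_overlap(train_groups, test_groups):
    """Return (test groups also seen in train, distinct test groups)."""
    train_set, test_set = set(train_groups), set(test_groups)
    return len(train_set & test_set), len(test_set)


def assign_split_by_group(group_id, seed=42, valid_frac=0.15, test_frac=0.15):
    """Deterministically map a whole group (e.g. account_id) to one split.

    Illustrative stand-in for a group-aware splitter: the group id is hashed
    to a number in [0, 1) and bucketed, so the fractions hold in expectation.
    """
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + valid_frac:
        return "valid"
    return "train"


# Overlap check on made-up account ids: acct_B appears on both sides.
shared, total = group_overlap(["acct_A", "acct_B", "acct_B"], ["acct_B", "acct_C"])
print(shared, "of", total)  # 1 of 2

# Group-level assignment: all rows of an account share one split.
accounts = [f"acct_{i:06d}" for i in range(300)]
split_of = {acct: assign_split_by_group(acct) for acct in accounts}
train_accts = [a for a in accounts if split_of[a] == "train"]
test_accts = [a for a in accounts if split_of[a] == "test"]
print(group_overlap(train_accts, test_accts)[0])  # 0
```

Applied to the account_id columns of the shipped train and test splits, group_overlap should reproduce the 518 of 557 figure quoted above. Reporting metrics under both the random split and a group-aware split makes the size of the account-level leak visible.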

Maintenance, adversarial framing, license

We want the dataset to be broken. The break-me guide catalogues nine adversarial patterns to look for (leakage, split contamination, ranking inversions, calibration drift) with worked-example pointers back into the notebooks. Issue templates ship under .github/ISSUE_TEMPLATE/: a breakage report form for findings on the bundle itself, and a realism feedback form for distributional critiques. Accepted findings are logged in docs/release/v2_decision_log.md. File issues at leadforge-dev/leadforge; PRs welcome.

Field | Value
Generator | leadforge 1.0.0+
Recipe | b2b_saas_procurement_v1
Canonical seed | 42 (cross-seed sweep: 42–46)
Bundle schema version | 5
Format | Parquet (canonical) + CSV (convenience)
License | MIT (see LICENSE)

Verify integrity with leadforge validate <bundle_dir>; every file is hashed in manifest.json.

Objective

This deliberately fake release is large enough to exercise row previews, file downloads, metadata, and review copy before a real upload to Kaggle.


LeadForge Lead Scoring v1 — Advanced

lead_scoring.csv (1353 KB)

About this file

Flat ML-ready snapshot CSV: 5,000 leads × 32 features, snapshot day 30. Includes a 'split' column (train / valid / test) for conventional ML workflows.

Column | Label
account_id | Link to Organization
industry | Industry
region | Region
employee_band | Employee Band
estimated_revenue_band | Estimated Revenue Band
process_maturity_band | Process Maturity Band
contact_id | Contact Id
role_function | Role Function
seniority | Seniority
buyer_role | Buyer Role
lead_id | Lead Id
lead_created_at | Lead Created At
lead_source | Lead Source
first_touch_channel | First Touch Channel
touch_count | Touch Count
inbound_touch_count | Inbound Touch Count
outbound_touch_count | Outbound Touch Count
session_count | Session Count
pricing_page_views | Pricing Page Views
demo_page_views | Demo Page Views
total_session_duration_seconds | Total Session Duration Seconds
touches_week_1 | Touches Week 1
touches_last_7_days | Touches Last 7 Days
days_since_first_touch | Days Since First Touch
activity_count | Activity Count
days_since_last_touch | Days Since Last Touch
opportunity_created | Opportunity Created
has_open_opportunity | Has Open Opportunity
opportunity_estimated_acv | Opportunity Estimated ACV
expected_acv | Expected ACV
total_touches_all | Total Touches All
converted_within_90_days | Converted Within 90 Days
