This is a ShmuggingFace review mock. It is not Hugging Face, Kaggle, or a real dataset release.

Datasets: leadforge-dev/leadforge-lead-scoring-v1-advanced

Modalities: Tabular
Formats: CSV
Languages: English
Libraries: Datasets
License: MIT
Tags: crm, b2b, pandas

▦ Dataset Viewer

acct_000773 · logistics · UK · 200-499 · $50M-$200M · low · cnt_001124 · procurement_manager · vp · end_user · lead_004250 · 2024-01-08 · inbound_marketing · inbound_marketing · 8.0 · 8.0 · 0.0 · 1.0 · 1.0 · 0.0 · 484.0 · 1.0 · · 24.08619850861116 · 6.0 · -3.5747144539187232 · True · True · 37710.42551803246 · -10645.128882835355 · · False
acct_000043 · logistics · UK · 500-999 · $10M-$50M · high · cnt_003354 · it_director · c_suite · technical_evaluator · lead_001565 · 2024-01-01 · inbound_marketing · inbound_marketing · 9.0 · 9.0 · 0.0 · 3.0 · 4.0 · · 900.0 · 4.0 · · 30.62883305317309 · · 1.8952519188233432 · True · True · 52804.112720196485 · 80317.28475907192 · · False
acct_000319 · logistics · US · 200-499 · $1M-$10M · medium · cnt_000537 · ap_manager · director · champion · lead_002296 · 2024-01-05 · partner_referral · partner_referral · 9.0 · 0.0 · 9.0 · 4.0 · 0.0 · 0.0 · 1065.0 · 4.0 · 2.0 · 27.053826214825193 · · 3.2322809610775023 · False · False · · 15778.446526640288 · 17.0 · False
acct_000476 · healthcare_non_clinical · US · 200-499 · $10M-$50M · medium · cnt_001478 · ap_manager · director · champion · lead_003320 · 2024-01-29 · inbound_marketing · inbound_marketing · 13.0 · 13.0 · 0.0 · 4.0 · · 0.0 · 1381.0 · 1.0 · 1.0 · 22.61679442790949 · 4.0 · -4.398843829294206 · True · True · -1312.8841795232147 · -11408.728282716547 · 14.0 · False
acct_000243 · manufacturing · US · 1000-1999 · $10M-$50M · low · cnt_000276 · vp_finance · individual_contributor · economic_buyer · lead_001192 · 2024-01-01 · sdr_outbound · sdr_outbound · 8.0 · 0.0 · 8.0 · 2.0 · 1.0 · 0.0 · 1128.0 · · · 32.29311067265875 · 1.0 · 3.5385907968992516 · True · True · 47876.648993836 · 35049.85592930211 · · False
acct_000353 · manufacturing · UK · 2000+ · $1M-$10M · medium · cnt_002665 · procurement_manager · vp · end_user · lead_000123 · 2024-01-27 · sdr_outbound · sdr_outbound · 6.0 · 0.0 · 6.0 · 2.0 · 0.0 · 0.0 · 150.0 · 4.0 · 2.0 · 26.215262450130034 · 3.0 · 3.1313429432706332 · False · False · · 23551.864433039216 · · False
acct_000029 · healthcare_non_clinical · US · 500-999 · $10M-$50M · low · cnt_001377 · procurement_manager · individual_contributor · end_user · lead_001076 · 2024-01-18 · sdr_outbound · sdr_outbound · 11.0 · 0.0 · 11.0 · 0.0 · 0.0 · 0.0 · 0.0 · 5.0 · 2.0 · · · · False · False · · 75080.37802229391 · 14.0 · False
...

Showing preview page 1 for leadforge-lead-scoring-v1-advanced.

Dataset Card for "LeadForge Lead Scoring v1 — Advanced"

LeadForge: Synthetic B2B Lead Scoring Dataset (leadforge-lead-scoring-v1)

A relational, reproducible, three-tier synthetic CRM dataset family for teaching lead scoring at scale. Generated by leadforge, an open-source Python framework for synthetic CRM/funnel data. The framework version is decoupled from the dataset version: the package stays at 1.x; the dataset is published under the explicit …-v1 tag.

Why lead scoring matters in 2024–2026

Mid-market SaaS vendors entered 2024–2026 with growth slowing and customer-acquisition costs rising[^macro], so predicting which leads convert within a fixed window has moved from a marketing nicety to a survival skill. This dataset teaches that skill on a relational substrate, with the realistic confusions (snapshot-window discipline, leakage traps, channel signal weaker than vendor blogs imply) that students will hit when they finally get hands on real CRM data.

[^macro]: Macroeconomic framing summarised in docs/external_review/summaries/gemini_v2_summary.md (median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio rose materially in 2024).

What's inside

release/
├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier
│   ├── manifest.json                 # provenance + file hashes
│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)
│   ├── dataset_card.md               # auto-rendered per-bundle card
│   ├── feature_dictionary.csv        # authoritative column spec
│   ├── lead_scoring.csv              # flat convenience CSV (all splits)
│   ├── tables/*.parquet              # 7 snapshot-safe relational tables
│   └── tasks/converted_within_90_days/{train,valid,test}.parquet
├── intermediate_instructor/          # research companion: full-horizon tables + metadata/
├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)
├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration
├── metrics.json                      # top-level cross-tier metrics summary
├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)
└── validation/                       # validation_report.{json,md} + figures

student_public bundles ship the snapshot-safe relational view; research_instructor companions ship the full-horizon view plus the hidden causal structure (DAG, latent registry, mechanism summary) under metadata/. The full layout is documented in each bundle's manifest.json.

Agent-reviewable artifacts

The published bundle is self-contained for AI review and offline auditing — every numeric / structural claim on this page can be verified without following an external link:

  • metrics.json (root) + <tier>/metrics.json — deterministic JSON view of the headline LR AUC / AP / P@100 / Brier / conversion rate / cohort-shift / cross-tier-ordering medians, with JSON-path back-references to validation/validation_report.json (the source of truth).
  • claims_register.{md,json} — every numerical or structural claim on this page paired with the artifact and path that backs it. Rendered from claims_register_source.yaml by scripts/build_claims_register.py.
  • docs/ — vendored copies of generation_method.md, channel_signal_audit.md, break_me_guide.md, feature_dictionary.md, v1_acceptance_gates_bands.yaml, v2_decision_log.md, plus a hand-authored relational_table_schemas.csv documenting every column of every relational table. These match the GitHub-blob links cited below but ship inside the bundle so a reviewer never needs network access.
  • <tier>/manifest.json — SHA-256 hash for every file plus the full redaction contract (structural_redactions.columns, omitted_tables, relational_snapshot_safe, snapshot_day).
  • Kaggle / HuggingFace preview pages additionally inject a schema.org/Dataset JSON-LD block in their <head> for agent ingestion without HTML parsing.

Quick start

import pandas as pd

# Flat CSV
df = pd.read_csv("intermediate/lead_scoring.csv")

# Parquet task splits (recommended)
train = pd.read_parquet("intermediate/tasks/converted_within_90_days/train.parquet")
test  = pd.read_parquet("intermediate/tasks/converted_within_90_days/test.parquet")

# Relational tables (feature engineering — example)
leads   = pd.read_parquet("intermediate/tables/leads.parquet")
touches = pd.read_parquet("intermediate/tables/touches.parquet")
my_touch_count = (
    touches.groupby("lead_id").size().rename("my_touch_count").reset_index()
)
features = leads.merge(my_touch_count, on="lead_id", how="left")

# Reproduce from source
# pip install leadforge
# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
#                    --mode student_public --difficulty intermediate --out my_bundle

The label converted_within_90_days resolves over a 90-day window; engagement features (touch_count, session_count, etc.) are computed strictly over events on days [0, 30]. The deliberate exception is total_touches_all, the leakage trap — flagged leakage_risk=True in feature_dictionary.csv. Drop it from your feature set unless you're demonstrating leakage detection.
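The drop step can be driven by the feature dictionary instead of a hard-coded column name. The following is a minimal sketch, not part of leadforge. It assumes the dictionary has a column holding the feature name (called `name` here, a hypothetical header; only `leakage_risk` is confirmed by this card) and that `leakage_risk` reads as True/False.

```python
import pandas as pd


def leakage_columns(feature_dict: pd.DataFrame, name_col: str = "name") -> list[str]:
    """Return feature names whose leakage_risk flag is set."""
    # Compare as text so the check works whether pandas parsed the flag
    # as bool or left it as the strings "True"/"False".
    flagged = feature_dict["leakage_risk"].astype(str).str.strip().str.lower() == "true"
    return feature_dict.loc[flagged, name_col].tolist()


def drop_leakage(df: pd.DataFrame, feature_dict: pd.DataFrame, name_col: str = "name") -> pd.DataFrame:
    """Drop flagged columns that are present in df; ignore ones that are absent."""
    return df.drop(columns=leakage_columns(feature_dict, name_col), errors="ignore")


# Tiny in-memory stand-ins for feature_dictionary.csv and a task split.
# The `name` header is an assumption made for this example.
feature_dict = pd.DataFrame(
    {"name": ["touch_count", "total_touches_all"], "leakage_risk": [False, True]}
)
train = pd.DataFrame({"touch_count": [3, 5], "total_touches_all": [9, 12]})
safe_train = drop_leakage(train, feature_dict)
```

With the real bundle, `feature_dict` would come from `pd.read_csv("intermediate/feature_dictionary.csv")`, with `name_col` set to whatever header the file actually uses.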

Dataset summary

| | Intro | Intermediate | Advanced |
|---|---|---|---|
| Leads | 5,000 | 5,000 | 5,000 |
| Accounts | 1,500 | 1,500 | 1,500 |
| Contacts | 4,200 | 4,200 | 4,200 |
| Snapshot columns | 32 / 34* | 32 / 34* | 32 / 34* |
| Target | converted_within_90_days | converted_within_90_days | converted_within_90_days |
| Conversion rate (acceptance band, gate G7.*) | 24–61% | 12–31% | 4–12% |
| Conversion rate (observed median, seeds 42–46) | 42.67% | 21.60% | 8.40% |
| Signal strength | 0.90 | 0.70 | 0.50 |
| Noise scale | 0.10 | 0.30 | 0.55 |
| Missing rate | 2% | 8% | 18% |

* student_public / research_instructor. Difficulty is modulated by the simulation engine — signal strength on latent-trait weights, Gaussian noise on float features, MCAR missingness, outlier rate — not post-hoc label flipping. The acceptance band is the recipe gate's tolerance window (v1_acceptance_gates_bands.yaml G7.*), not the achievable range — observed five-seed spreads sit comfortably inside the band.

The scenario

Veridian Technologies is a fictional Series B startup (Austin, US) selling Veridian Procure, a procurement / AP automation SaaS, to mid-market firms (200–2,000 employees) in the US and UK. The funnel runs through inbound marketing (45%), SDR outbound (35%), and partner referrals (20%); four personas drive deals (VP Finance, AP Manager, IT Director, Procurement Manager). Task: predict whether a lead converts (closed_won) within 90 days. ACV bands are $18k–$120k. See docs/release/generation_method.md for the full DGP, and the deeper "what's modelled / approximate / not modelled" breakdown that this README only summarises.

Public vs instructor: what's redacted

Filtering happens during rendering, not during simulation. The redaction contract is single-sourced in leadforge/validation/leakage_probes.py; the snapshot-safe writer and the validator import the same constants, so they cannot drift apart.

| Source-of-truth constant | Public bundle treatment |
|---|---|
| BANNED_LEAD_COLUMNS = ("converted_within_90_days", "conversion_timestamp") | Dropped from tables/leads.parquet |
| BANNED_OPP_COLUMNS = ("close_outcome", "closed_at") | Dropped from tables/opportunities.parquet |
| BANNED_TABLES = ("customers", "subscriptions") | Omitted from public bundles |
| SNAPSHOT_FILTERED_TABLES (touches, sessions, sales_activities, opportunities) | Filtered per-lead by lead_created_at + snapshot_day |
| Snapshot redaction (current_stage, is_sql) | Stripped from tasks/ splits and tables/leads.parquet |
| total_touches_all (deliberate trap) | Retained in both modes; flagged leakage_risk=True |

Each bundle's manifest.json records relational_snapshot_safe, redacted_columns, and snapshot_day, so the bundle is self-describing.
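The per-lead filter on lead_created_at + snapshot_day can be pictured with a short pandas sketch. This is an illustration only, not the writer in leadforge. The event timestamp column name (`touch_timestamp`) is hypothetical, `snapshot_day=30` is taken from the [0, 30] window quoted in the Quick start, and treating both ends of the window as inclusive is an assumption.

```python
import pandas as pd


def snapshot_filter(events, leads, snapshot_day=30, ts_col="touch_timestamp"):
    """Keep events inside [lead_created_at, lead_created_at + snapshot_day]."""
    # Attach each lead's creation time to its events.
    merged = events.merge(leads[["lead_id", "lead_created_at"]], on="lead_id", how="inner")
    cutoff = merged["lead_created_at"] + pd.Timedelta(days=snapshot_day)
    keep = (merged[ts_col] >= merged["lead_created_at"]) & (merged[ts_col] <= cutoff)
    # Return only the original event columns.
    return merged.loc[keep, events.columns].reset_index(drop=True)


# Toy data: one lead created on 2024-01-01 with three touches.
leads = pd.DataFrame(
    {"lead_id": ["lead_000001"], "lead_created_at": pd.to_datetime(["2024-01-01"])}
)
touches = pd.DataFrame(
    {
        "lead_id": ["lead_000001"] * 3,
        "touch_timestamp": pd.to_datetime(["2024-01-10", "2024-01-31", "2024-02-15"]),
    }
)
visible = snapshot_filter(touches, leads)  # the February touch falls outside the window
```

A check like this is also a quick way to confirm that a public table contains no events past the snapshot cutoff.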

Calibration

Every realism / calibration / difficulty claim in this README is backed by validation/validation_report.md, regenerated by scripts/validate_release_candidate.py with bands declared in docs/release/v1_acceptance_gates_bands.yaml. Headline cross-seed medians (seeds 42–46):

| Tier | LR AUC | AP | P@100 | Brier |
|---|---|---|---|---|
| intro | 0.879 | 0.761 | 0.80 | 0.130 |
| intermediate | 0.886 | 0.575 | 0.59 | 0.110 |
| advanced | 0.886 | 0.351 | 0.34 | 0.061 |

AP, P@100, conversion-rate, and lift orderings hold across the intended difficulty axis (intro > intermediate > advanced).
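For readers new to two of the headline metrics, here is a minimal pure-Python sketch of P@K and the Brier score. It is not the code in scripts/validate_release_candidate.py, and details such as tie handling may differ there.

```python
def precision_at_k(y_true, scores, k=100):
    """Fraction of positives among the k highest-scored rows."""
    k = min(k, len(scores))
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(int(label) for _, label in ranked[:k]) / k


def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - int(y)) ** 2 for p, y in zip(probs, y_true)) / len(probs)


# Toy example: five leads with labels and predicted probabilities.
y_true = [1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.2, 0.1]
p_at_2 = precision_at_k(y_true, probs, k=2)  # the top two rows hold one positive
brier = brier_score(y_true, probs)
```

The Brier score of a constant base-rate prediction is p(1 − p), so raw Brier values shrink as the conversion rate falls. Comparing each tier's Brier against that reference is one way to read the column across tiers.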

Intended uses

  • Teaching baseline lead-scoring on a flat snapshot.
  • Teaching relational feature engineering against snapshot-safe tables.
  • Teaching leakage detection (the total_touches_all trap is designed to be discoverable).
  • Teaching calibration, lift, P@K, value-aware ranking (expected_acv × P(convert)), and cohort-shift evaluation.
  • Comparing model families under a controlled DGP.
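The value-aware ranking named above (expected_acv × P(convert)) can be sketched in a few lines. The `p_convert` field stands in for any model's predicted probability and is a hypothetical name, as are the lead ids.

```python
def rank_by_expected_value(leads):
    """Sort leads by expected_acv * p_convert, highest expected value first."""
    return sorted(
        leads,
        key=lambda lead: lead["expected_acv"] * lead["p_convert"],
        reverse=True,
    )


# Toy leads: the most likely converter is not the most valuable one.
leads = [
    {"lead_id": "lead_a", "expected_acv": 20_000.0, "p_convert": 0.50},   # EV 10,000
    {"lead_id": "lead_b", "expected_acv": 100_000.0, "p_convert": 0.15},  # EV 15,000
    {"lead_id": "lead_c", "expected_acv": 60_000.0, "p_convert": 0.05},   # EV 3,000
]
ranked_ids = [lead["lead_id"] for lead in rank_by_expected_value(leads)]
```

Ranking by probability alone would put lead_a first. Weighting by contract value moves lead_b to the top, which is the contrast the value-capture exercises are meant to surface.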

Out-of-scope uses

  • Production lead scoring. The company, product, and customers are fictional.
  • Vendor benchmarking / paper baselines. Difficulty tiers are calibrated for pedagogy, not cross-paper comparability.
  • Causal-inference research that requires recovery of the true DGP. The instructor companion exposes the hidden graph for teaching, not designed counterfactuals.
  • Demographic / fairness research. v1 does not model protected attributes.

Known limitations

  • Difficulty signal on raw AUC is flat. LR AUC is ~0.88 across every tier. Difficulty is visible in AP, P@K, Brier, and value capture. Treat AUC as a sanity check, not a difficulty signal.
  • GBM does not consistently beat LR (gate G7.4.4). GBM−LR AUC delta is slightly negative in every tier (intro −0.0045, intermediate −0.0072, advanced −0.0133); v1's snapshot is dominated by linear features. v2 will inject non-linear interactions in the simulator.
  • Channel signal is weak. Per docs/release/channel_signal_audit.md, out-of-sample univariate AUC of lead_source is ≈0.50–0.52 across all tiers and the per-channel rate spread is ≤0.05. The simulator does not encode channel-conditional probabilities; channel-conditional encoding is post-v1 work.
  • Cohort-shift degradation is small. v1 has no time-of-year drift baked in; the cohort-shift gate (G6.4) is informational and will bite in v2.

Composition

  • Entities. Accounts, contacts, leads, touches, sessions, sales_activities, opportunities (public); plus customers and subscriptions (instructor only). Per-row counts per bundle live in manifest.json.
  • Features. 32 public columns grouped by analytical role in docs/release/feature_dictionary.md; the per-bundle feature_dictionary.csv is the authoritative machine-readable spec.
  • Label. converted_within_90_days (boolean), event-derived from the simulator. Never sampled directly.
  • Splits. 70/15/15 train/valid/test, deterministic given seed; recorded in tasks/converted_within_90_days/task_manifest.json. Group-leakage warning: the splitter is keyed on lead_id only, not on account_id or contact_id. On the as-shipped intermediate bundle, 518 of 557 test accounts (≈93%) also appear in train; the contact-level overlap is similar in magnitude. A flat baseline trained on the random split can therefore exploit account-level signal that crosses the split boundary. For a generalisation-faithful number, retrain with GroupKFold grouped on account_id (or contact_id) and report both figures; see break_me_guide.md §5 for the detection recipe.
  • Provenance. Recipe b2b_saas_procurement_v1, seed 42, package version stamped in manifest.json.
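The account overlap quoted in the Splits bullet can be checked with a small helper. This is a sketch rather than the recipe from break_me_guide.md, and the toy ids below are made up.

```python
def group_overlap(train_groups, test_groups):
    """Return (shared, total_test, fraction) over distinct group ids."""
    train_set, test_set = set(train_groups), set(test_groups)
    shared = len(train_set & test_set)
    return shared, len(test_set), shared / len(test_set)


# Toy ids: two of the three distinct test accounts also occur in train.
train_accounts = ["acct_1", "acct_2", "acct_2", "acct_3"]
test_accounts = ["acct_2", "acct_3", "acct_4", "acct_4"]
shared, total, frac = group_overlap(train_accounts, test_accounts)
```

On the shipped splits the inputs would be `train["account_id"]` and `test["account_id"]`, assuming the task splits carry that column as the preview suggests. For the group-aware retrain, scikit-learn's `GroupKFold(n_splits=5).split(X, y, groups=df["account_id"])` keeps every account on one side of each fold.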

Maintenance, adversarial framing, license

We want the dataset to be broken. The break-me guide catalogues nine adversarial patterns to look for (leakage, split contamination, ranking inversions, calibration drift) with worked-example pointers back into the notebooks. Issue templates ship under .github/ISSUE_TEMPLATE/: a breakage report form for findings on the bundle itself, and a realism feedback form for distributional critiques. Accepted findings are logged in docs/release/v2_decision_log.md. File issues at leadforge-dev/leadforge; PRs welcome.

| Field | Value |
|---|---|
| Generator | leadforge 1.0.0+ |
| Recipe | b2b_saas_procurement_v1 |
| Canonical seed | 42 (cross-seed sweep: 42–46) |
| Bundle schema version | 5 |
| Format | Parquet (canonical) + CSV (convenience) |
| License | MIT (see LICENSE) |

Verify integrity with leadforge validate <bundle_dir>; every file is hashed in manifest.json.
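Where the leadforge CLI is not installed, the hashes can also be checked by hand. The sketch below is not the logic behind leadforge validate. It assumes a hypothetical manifest layout in which a top-level `files` object maps relative paths to SHA-256 hex digests; the real manifest.json may be organised differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(bundle_dir, files_key="files"):
    """Return relative paths that are missing or whose hash differs from the manifest."""
    bundle = Path(bundle_dir)
    manifest = json.loads((bundle / "manifest.json").read_text())
    return [
        rel
        for rel, expected in manifest[files_key].items()  # assumed layout
        if not (bundle / rel).is_file() or sha256_of(bundle / rel) != expected
    ]


# Self-contained demo on a throwaway bundle directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "lead_scoring.csv"
    data.write_text("lead_id,converted_within_90_days\n")
    manifest = {"files": {"lead_scoring.csv": sha256_of(data)}}
    (Path(tmp) / "manifest.json").write_text(json.dumps(manifest))
    clean = mismatched_files(tmp)      # nothing changed yet
    data.write_text("tampered")
    tampered = mismatched_files(tmp)   # the edited file is reported
```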

Mock release notes

Advanced difficulty · 5,000 leads · ~8% conversion rate · LR AUC 0.886 (5-seed median)