{
  "site": {
    "title": "LeadForge Lead Scoring v1 — Pre-Publication Review",
    "owner": "leadforge-dev",
    "visibility": "Pre-publication review mock — not yet live on Kaggle or Hugging Face",
    "reviewerHint": "Review the dataset card copy, metadata accuracy, file listings, column preview, and download behaviour across all three difficulty tiers. The Shmaggle tab mirrors the Kaggle page; the ShmuggingFace tab mirrors the Hugging Face page.  Flag anything that looks wrong before the real publish.",
    "primarySlug": "leadforge-lead-scoring-v1-intro"
  },
  "datasets": [
    {
      "slug": "leadforge-lead-scoring-v1-intro",
      "title": "LeadForge Lead Scoring v1 — Intro",
      "owner": "leadforge-dev",
      "subtitle": "Intro difficulty · 5,000 leads · ~43% conversion rate · LR AUC 0.879 (5-seed median)",
      "license": "MIT",
      "task": "tabular-classification",
      "language": "English",
      "updated": "2026-05-24",
      "downloads": "0",
      "likes": "0",
      "contactName": "leadforge-dev",
      "contactEmail": "",
      "rowCount": 5000,
      "kaggleUsability": "9.4",
      "kaggleMedals": "Gold",
      "description": "",
      "descriptionHtml": "<h1>LeadForge: Synthetic B2B Lead Scoring Dataset (<code>leadforge-lead-scoring-v1</code>)</h1>\n<p>A relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n<a href=\"https://github.com/leadforge-dev/leadforge\">leadforge</a>, an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at <code>1.x</code>; the dataset is published under the explicit <code>…-v1</code>\ntag.</p>\n<h2>Why lead scoring matters in 2024–2026</h2>\n<p>Mid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting <em>which</em> leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.</p>\n<p>[^macro]: Macroeconomic framing summarised in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md\"><code>docs/external_review/summaries/gemini_v2_summary.md</code></a>\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).</p>\n<h2>What's inside</h2>\n<pre><code>release/\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational 
tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── intermediate_instructor/          # research companion: full-horizon tables + metadata/\n├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)\n├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration\n├── metrics.json                      # top-level cross-tier metrics summary\n├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)\n└── validation/                       # validation_report.{json,md} + figures\n</code></pre>\n<p><code>student_public</code> bundles ship the snapshot-safe relational view;\n<code>research_instructor</code> companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder <code>metadata/</code>. The full layout is documented in each bundle's\n<code>manifest.json</code>.</p>\n<h3>Agent-reviewable artifacts</h3>\n<p>The published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:</p>\n<ul>\n<li><strong><code>metrics.json</code> (root) + <code>&lt;tier&gt;/metrics.json</code></strong> — deterministic\nJSON view of the headline LR AUC / AP / P@100 / Brier / conversion\nrate / cohort-shift / cross-tier-ordering medians, with JSON-path\nback-references to <code>validation/validation_report.json</code> (the\nsource of truth).</li>\n<li><strong><code>claims_register.{md,json}</code></strong> — every numerical or structural\nclaim on this page paired with the artifact and path that backs it.\nRendered from <code>claims_register_source.yaml</code> by\n<code>scripts/build_claims_register.py</code>.</li>\n<li><strong><code>docs/</code></strong> — vendored copies of <code>generation_method.md</code>,\n<code>channel_signal_audit.md</code>, 
<code>break_me_guide.md</code>,\n<code>feature_dictionary.md</code>, <code>v1_acceptance_gates_bands.yaml</code>,\n<code>v2_decision_log.md</code>, plus a hand-authored\n<code>relational_table_schemas.csv</code> documenting every column of every\nrelational table. These match the GitHub-blob links cited below but\nship inside the bundle so a reviewer never needs network access.</li>\n<li><strong><code>&lt;tier&gt;/manifest.json</code></strong> — SHA-256 hash for every file plus the\nfull redaction contract (<code>structural_redactions.columns</code>,\n<code>omitted_tables</code>, <code>relational_snapshot_safe</code>, <code>snapshot_day</code>).</li>\n<li>Kaggle / Hugging Face preview pages additionally inject a\n<code>schema.org/Dataset</code> JSON-LD block in their <code>&lt;head&gt;</code> for agent\ningestion without HTML parsing.</li>\n</ul>\n<h2>Quick start</h2>\n<pre><code class=\"language-python\">import pandas as pd\n\n# Flat CSV\ndf = pd.read_csv(&quot;intermediate/lead_scoring.csv&quot;)\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/train.parquet&quot;)\ntest  = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/test.parquet&quot;)\n\n# Relational tables (feature engineering example)\nleads   = pd.read_parquet(&quot;intermediate/tables/leads.parquet&quot;)\ntouches = pd.read_parquet(&quot;intermediate/tables/touches.parquet&quot;)\nmy_touch_count = (\n    touches.groupby(&quot;lead_id&quot;).size().rename(&quot;my_touch_count&quot;).reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=&quot;lead_id&quot;, how=&quot;left&quot;)\n# Leads with no touches come out of the left merge as NaN; count them as zero\nfeatures[&quot;my_touch_count&quot;] = features[&quot;my_touch_count&quot;].fillna(0).astype(int)\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n</code></pre>\n<p>The label <code>converted_within_90_days</code> resolves over a 90-day window;\nengagement features (<code>touch_count</code>, 
<code>session_count</code>, etc.) are\ncomputed strictly over events on days <code>[0, 30]</code>. The deliberate\nexception is <code>total_touches_all</code>, the leakage trap — flagged\n<code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your\nfeature set unless you're demonstrating leakage detection.</p>\n<h2>Dataset summary</h2>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>Intro</th>\n<th>Intermediate</th>\n<th>Advanced</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Leads</td>\n<td>5,000</td>\n<td>5,000</td>\n<td>5,000</td>\n</tr>\n<tr>\n<td>Accounts</td>\n<td>1,500</td>\n<td>1,500</td>\n<td>1,500</td>\n</tr>\n<tr>\n<td>Contacts</td>\n<td>4,200</td>\n<td>4,200</td>\n<td>4,200</td>\n</tr>\n<tr>\n<td>Snapshot columns</td>\n<td>32 / 34*</td>\n<td>32 / 34*</td>\n<td>32 / 34*</td>\n</tr>\n<tr>\n<td>Target</td>\n<td><code>converted_within_90_days</code></td>\n<td><code>converted_within_90_days</code></td>\n<td><code>converted_within_90_days</code></td>\n</tr>\n<tr>\n<td>Conversion rate (acceptance band, gate G7.*)</td>\n<td>24–61%</td>\n<td>12–31%</td>\n<td>4–12%</td>\n</tr>\n<tr>\n<td>Conversion rate (observed median, seeds 42–46)</td>\n<td>42.67%</td>\n<td>21.60%</td>\n<td>8.40%</td>\n</tr>\n<tr>\n<td>Signal strength</td>\n<td>0.90</td>\n<td>0.70</td>\n<td>0.50</td>\n</tr>\n<tr>\n<td>Noise scale</td>\n<td>0.10</td>\n<td>0.30</td>\n<td>0.55</td>\n</tr>\n<tr>\n<td>Missing rate</td>\n<td>2%</td>\n<td>8%</td>\n<td>18%</td>\n</tr>\n</tbody>\n</table>\n<p>* <code>student_public</code> / <code>research_instructor</code>. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. 
The acceptance band is the recipe\ngate's tolerance window (<code>v1_acceptance_gates_bands.yaml</code> G7.*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.</p>\n<h2>The scenario</h2>\n<p><strong>Veridian Technologies</strong> is a fictional Series B startup (Austin, US)\nselling <strong>Veridian Procure</strong>, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). <strong>Task:</strong> predict whether\na lead converts (<code>closed_won</code>) within 90 days. ACV bands are\n$18k–$120k. See\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md\"><code>docs/release/generation_method.md</code></a>\nfor the full DGP, and the deeper &quot;what's modelled / approximate / not\nmodelled&quot; breakdown that this README only summarises.</p>\n<h2>Public vs instructor: what's redacted</h2>\n<p>Filtering happens <strong>during rendering</strong>, not during simulation. 
The\nredaction contract is single-sourced in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py\"><code>leadforge/validation/leakage_probes.py</code></a>;\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.</p>\n<table>\n<thead>\n<tr>\n<th>Source-of-truth constant</th>\n<th>Public bundle treatment</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code>BANNED_LEAD_COLUMNS = (&quot;converted_within_90_days&quot;, &quot;conversion_timestamp&quot;)</code></td>\n<td>Dropped from <code>tables/leads.parquet</code></td>\n</tr>\n<tr>\n<td><code>BANNED_OPP_COLUMNS = (&quot;close_outcome&quot;, &quot;closed_at&quot;)</code></td>\n<td>Dropped from <code>tables/opportunities.parquet</code></td>\n</tr>\n<tr>\n<td><code>BANNED_TABLES = (&quot;customers&quot;, &quot;subscriptions&quot;)</code></td>\n<td>Omitted from public bundles</td>\n</tr>\n<tr>\n<td><code>SNAPSHOT_FILTERED_TABLES</code> (touches, sessions, sales_activities, opportunities)</td>\n<td>Filtered per-lead by <code>lead_created_at + snapshot_day</code></td>\n</tr>\n<tr>\n<td>Snapshot redaction (<code>current_stage</code>, <code>is_sql</code>)</td>\n<td>Stripped from <code>tasks/</code> splits and <code>tables/leads.parquet</code></td>\n</tr>\n<tr>\n<td><code>total_touches_all</code> (deliberate trap)</td>\n<td><strong>Retained in both modes</strong>; flagged <code>leakage_risk=True</code></td>\n</tr>\n</tbody>\n</table>\n<p>Each bundle's <code>manifest.json</code> records <code>relational_snapshot_safe</code>,\n<code>redacted_columns</code>, and <code>snapshot_day</code>, so the bundle is\nself-describing.</p>\n<h2>Calibration</h2>\n<p>Every realism / calibration / difficulty claim in this README is\nbacked by\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md\"><code>validation/validation_report.md</code></a>,\nregenerated by\n<a 
href=\"https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py\"><code>scripts/validate_release_candidate.py</code></a>\nwith bands declared in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml\"><code>docs/release/v1_acceptance_gates_bands.yaml</code></a>.\nHeadline cross-seed medians (seeds 42–46):</p>\n<table>\n<thead>\n<tr>\n<th>Tier</th>\n<th>LR AUC</th>\n<th>AP</th>\n<th>P@100</th>\n<th>Brier</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>intro</td>\n<td>0.879</td>\n<td>0.761</td>\n<td>0.80</td>\n<td>0.130</td>\n</tr>\n<tr>\n<td>intermediate</td>\n<td>0.886</td>\n<td>0.575</td>\n<td>0.59</td>\n<td>0.110</td>\n</tr>\n<tr>\n<td>advanced</td>\n<td>0.886</td>\n<td>0.351</td>\n<td>0.34</td>\n<td>0.061</td>\n</tr>\n</tbody>\n</table>\n<p>AP, P@100, conversion-rate, and lift orderings hold across the\nintended difficulty axis (intro &gt; intermediate &gt; advanced).</p>\n<h2>Intended uses</h2>\n<ul>\n<li>Teaching baseline lead-scoring on a flat snapshot.</li>\n<li>Teaching relational feature engineering against snapshot-safe tables.</li>\n<li>Teaching leakage detection (the <code>total_touches_all</code> trap is\ndesigned to be discoverable).</li>\n<li>Teaching calibration, lift, P@K, value-aware ranking\n(<code>expected_acv × P(convert)</code>), and cohort-shift evaluation.</li>\n<li>Comparing model families under a controlled DGP.</li>\n</ul>\n<h2>Out-of-scope uses</h2>\n<ul>\n<li><strong>Production lead scoring.</strong> The company, product, and customers are\nfictional.</li>\n<li><strong>Vendor benchmarking / paper baselines.</strong> Difficulty tiers are\ncalibrated for pedagogy, not cross-paper comparability.</li>\n<li><strong>Causal-inference research that requires recovery of the true DGP.</strong>\nThe instructor companion exposes the hidden graph for teaching, not\ndesigned counterfactuals.</li>\n<li><strong>Demographic / fairness research.</strong> v1 does not model 
protected\nattributes.</li>\n</ul>\n<h2>Known limitations</h2>\n<ul>\n<li><strong>Difficulty signal on raw AUC is flat.</strong> LR AUC is ~0.88 across\nevery tier. Difficulty is visible in AP, P@K, Brier, and value\ncapture. Treat AUC as a sanity check, not a difficulty signal.</li>\n<li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta\nis slightly negative in every tier (intro −0.0045, intermediate\n−0.0072, advanced −0.0133); v1's snapshot is dominated by linear\nfeatures. v2 will inject non-linear interactions in the simulator.</li>\n<li><strong>Channel signal is weak.</strong> Per\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md\"><code>docs/release/channel_signal_audit.md</code></a>,\nout-of-sample univariate AUC of <code>lead_source</code> is ≈0.50–0.52 across\nall tiers and the per-channel rate spread is ≤0.05. The simulator\ndoes not encode channel-conditional probabilities; channel-conditional\nencoding is post-v1 work.</li>\n<li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift\nbaked in; the cohort-shift gate (G6.4) is informational and will\nbite in v2.</li>\n</ul>\n<h2>Composition</h2>\n<ul>\n<li><strong>Entities.</strong> Accounts, contacts, leads, touches, sessions,\nsales_activities, opportunities (public); plus customers and\nsubscriptions (instructor only). Per-table row counts for each bundle live in\n<code>manifest.json</code>.</li>\n<li><strong>Features.</strong> 32 public columns grouped by analytical role in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md\"><code>docs/release/feature_dictionary.md</code></a>;\nthe per-bundle <code>feature_dictionary.csv</code> is the authoritative\nmachine-readable spec.</li>\n<li><strong>Label.</strong> <code>converted_within_90_days</code> (boolean), event-derived from\nthe simulator. 
Never sampled directly.</li>\n<li><strong>Splits.</strong> 70/15/15 train/valid/test, deterministic given seed;\nrecorded in <code>tasks/converted_within_90_days/task_manifest.json</code>.\n<strong>Group-leakage warning:</strong> the splitter is keyed on <code>lead_id</code> only,\nnot on <code>account_id</code> or <code>contact_id</code>. On the as-shipped intermediate\nbundle, <strong>518 of 557 test accounts (≈93%) also appear in train</strong>;\nthe contact-level overlap is similar in magnitude. A flat baseline\ntrained on the random split rides account-level signal across the\nsplit boundary. For a generalisation-faithful number, re-evaluate with\nscikit-learn's <code>GroupKFold</code>, passing <code>account_id</code> (or <code>contact_id</code>) as the\ngroups, and report both the random-split and the grouped numbers; see\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md\"><code>break_me_guide.md</code></a> §5 for the\ndetection recipe.</li>\n<li><strong>Provenance.</strong> Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package\nversion stamped in <code>manifest.json</code>.</li>\n</ul>\n<h2>Maintenance, adversarial framing, license</h2>\n<p>We <em>want</em> you to break the dataset. The\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md\">break-me guide</a> catalogues\nnine adversarial patterns to look for (including leakage, split\ncontamination, ranking inversions, and calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under <code>.github/ISSUE_TEMPLATE/</code>: a\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml\">breakage report</a>\nform for findings on the bundle itself, and a\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml\">realism feedback</a>\nform for distributional critiques. 
Accepted findings are\nlogged in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md\"><code>docs/release/v2_decision_log.md</code></a>.\nFile issues at\n<a href=\"https://github.com/leadforge-dev/leadforge\">leadforge-dev/leadforge</a>;\nPRs welcome.</p>\n<table>\n<thead>\n<tr>\n<th>Field</th>\n<th>Value</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Generator</td>\n<td>leadforge <code>1.0.0+</code></td>\n</tr>\n<tr>\n<td>Recipe</td>\n<td><code>b2b_saas_procurement_v1</code></td>\n</tr>\n<tr>\n<td>Canonical seed</td>\n<td>42 (cross-seed sweep: 42–46)</td>\n</tr>\n<tr>\n<td>Bundle schema version</td>\n<td>5</td>\n</tr>\n<tr>\n<td>Format</td>\n<td>Parquet (canonical) + CSV (convenience)</td>\n</tr>\n<tr>\n<td>License</td>\n<td>MIT — see <a href=\"LICENSE\">LICENSE</a></td>\n</tr>\n</tbody>\n</table>\n<p>Verify integrity with <code>leadforge validate &lt;bundle_dir&gt;</code>; every file\nis hashed in <code>manifest.json</code>.</p>\n",
      "tags": [
        "tabular",
        "lead-scoring",
        "synthetic-data",
        "crm",
        "b2b",
        "datasets",
        "pandas",
        "intro"
      ],
      "coverImage": "../dataset-cover-image.png",
      "splits": [
        "train",
        "valid",
        "test"
      ],
      "subsets": [
        "leadforge-lead-scoring-v1-intro"
      ],
      "files": [
        {
          "path": "lead_scoring.csv",
          "size": "1409 KB",
          "kind": "CSV",
          "sourcePath": "../intro/lead_scoring.csv",
          "about": "Flat ML-ready snapshot CSV: 5,000 leads × 32 features, snapshot day 30.  Includes a 'split' column (train / valid / test) for conventional ML workflows."
        },
        {
          "path": "feature_dictionary.csv",
          "size": "3 KB",
          "kind": "CSV",
          "sourcePath": "../intro/feature_dictionary.csv",
          "about": "Per-column documentation: dtype, analytical category, leakage-risk flag, and plain-language description."
        },
        {
          "path": "tasks/converted_within_90_days/train.parquet",
          "size": "216 KB",
          "kind": "Parquet",
          "sourcePath": "../intro/tasks/converted_within_90_days/train.parquet",
          "about": "Training split — 3,500 leads, stratified by conversion rate.  Target column: `converted_within_90_days` (bool)."
        },
        {
          "path": "tasks/converted_within_90_days/valid.parquet",
          "size": "65 KB",
          "kind": "Parquet",
          "sourcePath": "../intro/tasks/converted_within_90_days/valid.parquet",
          "about": "Validation split — 750 leads."
        },
        {
          "path": "tasks/converted_within_90_days/test.parquet",
          "size": "65 KB",
          "kind": "Parquet",
          "sourcePath": "../intro/tasks/converted_within_90_days/test.parquet",
          "about": "Test split — 750 leads, held out for final evaluation only."
        },
        {
          "path": "dataset_card.md",
          "size": "3 KB",
          "kind": "Dataset card",
          "sourcePath": "../intro/dataset_card.md",
          "about": "Auto-generated tier-specific dataset card."
        }
      ],
      "columns": [
        "account_id",
        "industry",
        "region",
        "employee_band",
        "estimated_revenue_band",
        "process_maturity_band",
        "contact_id",
        "role_function",
        "seniority",
        "buyer_role",
        "lead_id",
        "lead_created_at",
        "lead_source",
        "first_touch_channel",
        "touch_count",
        "inbound_touch_count",
        "outbound_touch_count",
        "session_count",
        "pricing_page_views",
        "demo_page_views",
        "total_session_duration_seconds",
        "touches_week_1",
        "touches_last_7_days",
        "days_since_first_touch",
        "activity_count",
        "days_since_last_touch",
        "opportunity_created",
        "has_open_opportunity",
        "opportunity_estimated_acv",
        "expected_acv",
        "total_touches_all",
        "converted_within_90_days"
      ],
      "rows": [
        {
          "split": "train",
          "account_id": "acct_000773",
          "industry": "logistics",
          "region": "UK",
          "employee_band": "200-499",
          "estimated_revenue_band": "$50M-$200M",
          "process_maturity_band": "low",
          "contact_id": "cnt_001124",
          "role_function": "procurement_manager",
          "seniority": "vp",
          "buyer_role": "end_user",
          "lead_id": "lead_004250",
          "lead_created_at": "2024-01-08",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "9.0",
          "inbound_touch_count": "9.0",
          "outbound_touch_count": "0.0",
          "session_count": "3.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "796.0",
          "touches_week_1": "4.0",
          "touches_last_7_days": "2.0",
          "days_since_first_touch": "28.01572217905263",
          "activity_count": "6.0",
          "days_since_last_touch": "4.035906451086055",
          "opportunity_created": "True",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "79086.53052716002",
          "total_touches_all": "12.0",
          "converted_within_90_days": "True"
        },
        {
          "split": "train",
          "account_id": "acct_000043",
          "industry": "logistics",
          "region": "UK",
          "employee_band": "500-999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "high",
          "contact_id": "cnt_003354",
          "role_function": "it_director",
          "seniority": "c_suite",
          "buyer_role": "technical_evaluator",
          "lead_id": "lead_001565",
          "lead_created_at": "2024-01-01",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "7.0",
          "inbound_touch_count": "7.0",
          "outbound_touch_count": "0.0",
          "session_count": "1.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "536.0",
          "touches_week_1": "3.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "30.114696019867328",
          "activity_count": "4.0",
          "days_since_last_touch": "6.327765693401558",
          "opportunity_created": "True",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "56089.68722667384",
          "total_touches_all": "10.0",
          "converted_within_90_days": "True"
        },
        {
          "split": "train",
          "account_id": "acct_000319",
          "industry": "logistics",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$1M-$10M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_000537",
          "role_function": "ap_manager",
          "seniority": "director",
          "buyer_role": "champion",
          "lead_id": "lead_002296",
          "lead_created_at": "2024-01-05",
          "lead_source": "partner_referral",
          "first_touch_channel": "partner_referral",
          "touch_count": "13.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "13.0",
          "session_count": "5.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "1286.0",
          "touches_week_1": "5.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "28.64502758561542",
          "activity_count": "4.0",
          "days_since_last_touch": "2.5589920790762024",
          "opportunity_created": "True",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "23342.16516309185",
          "total_touches_all": "13.0",
          "converted_within_90_days": "True"
        },
        {
          "split": "train",
          "account_id": "acct_000476",
          "industry": "healthcare_non_clinical",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_001478",
          "role_function": "ap_manager",
          "seniority": "director",
          "buyer_role": "champion",
          "lead_id": "lead_003320",
          "lead_created_at": "2024-01-29",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "6.0",
          "inbound_touch_count": "6.0",
          "outbound_touch_count": "0.0",
          "session_count": "0.0",
          "pricing_page_views": "",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "0.0",
          "touches_week_1": "2.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "27.74771018638961",
          "activity_count": "4.0",
          "days_since_last_touch": "-0.760737970268833",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "14911.502746023458",
          "expected_acv": "13188.648533612146",
          "total_touches_all": "13.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000243",
          "industry": "manufacturing",
          "region": "US",
          "employee_band": "1000-1999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "low",
          "contact_id": "cnt_000276",
          "role_function": "vp_finance",
          "seniority": "individual_contributor",
          "buyer_role": "economic_buyer",
          "lead_id": "lead_001192",
          "lead_created_at": "2024-01-01",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "8.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "8.0",
          "session_count": "0.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "0.0",
          "touches_week_1": "4.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "30.418251976326776",
          "activity_count": "2.0",
          "days_since_last_touch": "4.43902499954997",
          "opportunity_created": "True",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "48571.82263865212",
          "total_touches_all": "10.0",
          "converted_within_90_days": "True"
        },
        {
          "split": "train",
          "account_id": "acct_000353",
          "industry": "manufacturing",
          "region": "UK",
          "employee_band": "2000+",
          "estimated_revenue_band": "$1M-$10M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_002665",
          "role_function": "procurement_manager",
          "seniority": "vp",
          "buyer_role": "end_user",
          "lead_id": "lead_000123",
          "lead_created_at": "2024-01-27",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "6.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "6.0",
          "session_count": "2.0",
          "pricing_page_views": "2.0",
          "demo_page_views": "1.0",
          "total_session_duration_seconds": "586.0",
          "touches_week_1": "2.0",
          "touches_last_7_days": "2.0",
          "days_since_first_touch": "24.856867775705567",
          "activity_count": "4.0",
          "days_since_last_touch": "1.1956549420122882",
          "opportunity_created": "True",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "24739.656707689333",
          "total_touches_all": "7.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000029",
          "industry": "healthcare_non_clinical",
          "region": "US",
          "employee_band": "500-999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "low",
          "contact_id": "cnt_001377",
          "role_function": "procurement_manager",
          "seniority": "individual_contributor",
          "buyer_role": "end_user",
          "lead_id": "lead_001076",
          "lead_created_at": "2024-01-18",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "9.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "9.0",
          "session_count": "4.0",
          "pricing_page_views": "2.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "1160.0",
          "touches_week_1": "4.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "30.03620847580526",
          "activity_count": "3.0",
          "days_since_last_touch": "5.959612069636063",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "71371.34592955658",
          "expected_acv": "79632.01541874865",
          "total_touches_all": "14.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_001411",
          "industry": "professional_services",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "high",
          "contact_id": "cnt_002913",
          "role_function": "vp_finance",
          "seniority": "c_suite",
          "buyer_role": "economic_buyer",
          "lead_id": "lead_001584",
          "lead_created_at": "2024-01-09",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "3.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "3.0",
          "session_count": "1.0",
          "pricing_page_views": "",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "234.0",
          "touches_week_1": "3.0",
          "touches_last_7_days": "0.0",
          "days_since_first_touch": "29.66950619411096",
          "activity_count": "0.0",
          "days_since_last_touch": "27.56128502749063",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "46931.84707461178",
          "total_touches_all": "3.0",
          "converted_within_90_days": "False"
        }
      ],
      "discussions": [
        "What is `snapshot_day = 30` and how does it affect which features are valid at inference time?",
        "Is `total_touches_all` a safe feature or a time-window leakage trap?",
        "LR and GBM AUCs are very close across tiers — does relational feature engineering help?",
        "How would you set a probability threshold for a team that can only work 50 leads per week?",
        "What happens to AUC when you evaluate on a chronological hold-out instead of a random split?"
      ]
    },
    {
      "slug": "leadforge-lead-scoring-v1-intermediate",
      "title": "LeadForge Lead Scoring v1 — Intermediate",
      "owner": "leadforge-dev",
      "subtitle": "Intermediate difficulty · 5,000 leads · ~22% conversion rate · LR AUC 0.886 (5-seed median)",
      "license": "MIT",
      "task": "tabular-classification",
      "language": "English",
      "updated": "2026-05-24",
      "downloads": "0",
      "likes": "0",
      "contactName": "leadforge-dev",
      "contactEmail": "",
      "rowCount": 5000,
      "kaggleUsability": "9.1",
      "kaggleMedals": "Silver",
      "description": "",
      "descriptionHtml": "<h1>LeadForge: Synthetic B2B Lead Scoring Dataset (<code>leadforge-lead-scoring-v1</code>)</h1>\n<p>A relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n<a href=\"https://github.com/leadforge-dev/leadforge\">leadforge</a>, an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at <code>1.x</code>; the dataset is published under the explicit <code>…-v1</code>\ntag.</p>\n<h2>Why lead scoring matters in 2024–2026</h2>\n<p>Mid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting <em>which</em> leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.</p>\n<p>[^macro]: Macroeconomic framing summarised in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md\"><code>docs/external_review/summaries/gemini_v2_summary.md</code></a>\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).</p>\n<h2>What's inside</h2>\n<pre><code>release/\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational 
tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── intermediate_instructor/          # research companion: full-horizon tables + metadata/\n├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)\n├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration\n├── metrics.json                      # top-level cross-tier metrics summary\n├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)\n└── validation/                       # validation_report.{json,md} + figures\n</code></pre>\n<p><code>student_public</code> bundles ship the snapshot-safe relational view;\n<code>research_instructor</code> companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder <code>metadata/</code>. The full layout is documented in each bundle's\n<code>manifest.json</code>.</p>\n<h3>Agent-reviewable artifacts</h3>\n<p>The published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:</p>\n<ul>\n<li><strong><code>metrics.json</code> (root) + <code>&lt;tier&gt;/metrics.json</code></strong> — deterministic\nJSON view of the headline LR AUC / AP / P@100 / Brier / conversion\nrate / cohort-shift / cross-tier-ordering medians, with JSON-path\nback-references to <code>validation/validation_report.json</code> (the\nsource of truth).</li>\n<li><strong><code>claims_register.{md,json}</code></strong> — every numerical or structural\nclaim on this page paired with the artifact and path that backs it.\nRendered from <code>claims_register_source.yaml</code> by\n<code>scripts/build_claims_register.py</code>.</li>\n<li><strong><code>docs/</code></strong> — vendored copies of <code>generation_method.md</code>,\n<code>channel_signal_audit.md</code>, 
<code>break_me_guide.md</code>,\n<code>feature_dictionary.md</code>, <code>v1_acceptance_gates_bands.yaml</code>,\n<code>v2_decision_log.md</code>, plus a hand-authored\n<code>relational_table_schemas.csv</code> documenting every column of every\nrelational table. These match the GitHub-blob links cited below but\nship inside the bundle so a reviewer never needs network access.</li>\n<li><strong><code>&lt;tier&gt;/manifest.json</code></strong> — SHA-256 hash for every file plus the\nfull redaction contract (<code>structural_redactions.columns</code>,\n<code>omitted_tables</code>, <code>relational_snapshot_safe</code>, <code>snapshot_day</code>).</li>\n<li>Kaggle / HuggingFace preview pages additionally inject a\n<code>schema.org/Dataset</code> JSON-LD block in their <code>&lt;head&gt;</code> for agent\ningestion without HTML parsing.</li>\n</ul>\n<h2>Quick start</h2>\n<pre><code class=\"language-python\">import pandas as pd\n\n# Flat CSV\ndf = pd.read_csv(&quot;intermediate/lead_scoring.csv&quot;)\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/train.parquet&quot;)\ntest  = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/test.parquet&quot;)\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(&quot;intermediate/tables/leads.parquet&quot;)\ntouches = pd.read_parquet(&quot;intermediate/tables/touches.parquet&quot;)\nmy_touch_count = (\n    touches.groupby(&quot;lead_id&quot;).size().rename(&quot;my_touch_count&quot;).reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=&quot;lead_id&quot;, how=&quot;left&quot;)\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n</code></pre>\n<p>The label <code>converted_within_90_days</code> resolves over a 90-day window;\nengagement features (<code>touch_count</code>, 
<code>session_count</code>, etc.) are\ncomputed strictly over events on days <code>[0, 30]</code>. The deliberate\nexception is <code>total_touches_all</code>, the leakage trap — flagged\n<code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your\nfeature set unless you're demonstrating leakage detection.</p>\n<h2>Dataset summary</h2>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>Intro</th>\n<th>Intermediate</th>\n<th>Advanced</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Leads</td>\n<td>5,000</td>\n<td>5,000</td>\n<td>5,000</td>\n</tr>\n<tr>\n<td>Accounts</td>\n<td>1,500</td>\n<td>1,500</td>\n<td>1,500</td>\n</tr>\n<tr>\n<td>Contacts</td>\n<td>4,200</td>\n<td>4,200</td>\n<td>4,200</td>\n</tr>\n<tr>\n<td>Snapshot columns</td>\n<td>32 / 34*</td>\n<td>32 / 34*</td>\n<td>32 / 34*</td>\n</tr>\n<tr>\n<td>Target</td>\n<td><code>converted_within_90_days</code></td>\n<td><code>converted_within_90_days</code></td>\n<td><code>converted_within_90_days</code></td>\n</tr>\n<tr>\n<td>Conversion rate (acceptance band, gate G7.*)</td>\n<td>24–61%</td>\n<td>12–31%</td>\n<td>4–12%</td>\n</tr>\n<tr>\n<td>Conversion rate (observed median, seeds 42–46)</td>\n<td>42.67%</td>\n<td>21.60%</td>\n<td>8.40%</td>\n</tr>\n<tr>\n<td>Signal strength</td>\n<td>0.90</td>\n<td>0.70</td>\n<td>0.50</td>\n</tr>\n<tr>\n<td>Noise scale</td>\n<td>0.10</td>\n<td>0.30</td>\n<td>0.55</td>\n</tr>\n<tr>\n<td>Missing rate</td>\n<td>2%</td>\n<td>8%</td>\n<td>18%</td>\n</tr>\n</tbody>\n</table>\n<p>* <code>student_public</code> / <code>research_instructor</code>. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. 
The acceptance band is the recipe\ngate's tolerance window (<code>v1_acceptance_gates_bands.yaml</code> G7.*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.</p>\n<h2>The scenario</h2>\n<p><strong>Veridian Technologies</strong> is a fictional Series B startup (Austin, US)\nselling <strong>Veridian Procure</strong>, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). <strong>Task:</strong> predict whether\na lead converts (<code>closed_won</code>) within 90 days. ACV bands are\n$18k–$120k. See\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md\"><code>docs/release/generation_method.md</code></a>\nfor the full DGP, and the deeper &quot;what's modelled / approximate / not\nmodelled&quot; breakdown that this README only summarises.</p>\n<h2>Public vs instructor: what's redacted</h2>\n<p>Filtering happens <strong>during rendering</strong>, not during simulation. 
The\nredaction contract is single-sourced in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py\"><code>leadforge/validation/leakage_probes.py</code></a>;\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.</p>\n<table>\n<thead>\n<tr>\n<th>Source-of-truth constant</th>\n<th>Public bundle treatment</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code>BANNED_LEAD_COLUMNS = (&quot;converted_within_90_days&quot;, &quot;conversion_timestamp&quot;)</code></td>\n<td>Dropped from <code>tables/leads.parquet</code></td>\n</tr>\n<tr>\n<td><code>BANNED_OPP_COLUMNS = (&quot;close_outcome&quot;, &quot;closed_at&quot;)</code></td>\n<td>Dropped from <code>tables/opportunities.parquet</code></td>\n</tr>\n<tr>\n<td><code>BANNED_TABLES = (&quot;customers&quot;, &quot;subscriptions&quot;)</code></td>\n<td>Omitted from public bundles</td>\n</tr>\n<tr>\n<td><code>SNAPSHOT_FILTERED_TABLES</code> (touches, sessions, sales_activities, opportunities)</td>\n<td>Filtered per-lead by <code>lead_created_at + snapshot_day</code></td>\n</tr>\n<tr>\n<td>Snapshot redaction (<code>current_stage</code>, <code>is_sql</code>)</td>\n<td>Stripped from <code>tasks/</code> splits and <code>tables/leads.parquet</code></td>\n</tr>\n<tr>\n<td><code>total_touches_all</code> (deliberate trap)</td>\n<td><strong>Retained in both modes</strong>; flagged <code>leakage_risk=True</code></td>\n</tr>\n</tbody>\n</table>\n<p>Each bundle's <code>manifest.json</code> records <code>relational_snapshot_safe</code>,\n<code>redacted_columns</code>, and <code>snapshot_day</code>, so the bundle is\nself-describing.</p>\n<h2>Calibration</h2>\n<p>Every realism / calibration / difficulty claim in this README is\nbacked by\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md\"><code>validation/validation_report.md</code></a>,\nregenerated by\n<a 
href=\"https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py\"><code>scripts/validate_release_candidate.py</code></a>\nwith bands declared in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml\"><code>docs/release/v1_acceptance_gates_bands.yaml</code></a>.\nHeadline cross-seed medians (seeds 42–46):</p>\n<table>\n<thead>\n<tr>\n<th>Tier</th>\n<th>LR AUC</th>\n<th>AP</th>\n<th>P@100</th>\n<th>Brier</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>intro</td>\n<td>0.879</td>\n<td>0.761</td>\n<td>0.80</td>\n<td>0.130</td>\n</tr>\n<tr>\n<td>intermediate</td>\n<td>0.886</td>\n<td>0.575</td>\n<td>0.59</td>\n<td>0.110</td>\n</tr>\n<tr>\n<td>advanced</td>\n<td>0.886</td>\n<td>0.351</td>\n<td>0.34</td>\n<td>0.061</td>\n</tr>\n</tbody>\n</table>\n<p>AP, P@100, conversion-rate, and lift orderings hold across the\nintended difficulty axis (intro &gt; intermediate &gt; advanced).</p>\n<h2>Intended uses</h2>\n<ul>\n<li>Teaching baseline lead-scoring on a flat snapshot.</li>\n<li>Teaching relational feature engineering against snapshot-safe tables.</li>\n<li>Teaching leakage detection (the <code>total_touches_all</code> trap is\ndesigned to be discoverable).</li>\n<li>Teaching calibration, lift, P@K, value-aware ranking\n(<code>expected_acv × P(convert)</code>), and cohort-shift evaluation.</li>\n<li>Comparing model families under a controlled DGP.</li>\n</ul>\n<h2>Out-of-scope uses</h2>\n<ul>\n<li><strong>Production lead scoring.</strong> The company, product, and customers are\nfictional.</li>\n<li><strong>Vendor benchmarking / paper baselines.</strong> Difficulty tiers are\ncalibrated for pedagogy, not cross-paper comparability.</li>\n<li><strong>Causal-inference research that requires recovery of the true DGP.</strong>\nThe instructor companion exposes the hidden graph for teaching, not\ndesigned counterfactuals.</li>\n<li><strong>Demographic / fairness research.</strong> v1 does not model 
protected\nattributes.</li>\n</ul>\n<h2>Known limitations</h2>\n<ul>\n<li><strong>Difficulty signal on raw AUC is flat.</strong> LR AUC is ~0.88 across\nevery tier. Difficulty is visible in AP, P@K, Brier, and value\ncapture. Treat AUC as a sanity check, not a difficulty signal.</li>\n<li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta\nis slightly negative in every tier (intro −0.0045, intermediate\n−0.0072, advanced −0.0133); v1's snapshot is dominated by linear\nfeatures. v2 will inject non-linear interactions in the simulator.</li>\n<li><strong>Channel signal is weak.</strong> Per\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md\"><code>docs/release/channel_signal_audit.md</code></a>,\nout-of-sample univariate AUC of <code>lead_source</code> is ≈0.50–0.52 across\nall tiers and the per-channel rate spread is ≤0.05. The simulator\ndoes not encode channel-conditional probabilities; channel-conditional\nencoding is post-v1 work.</li>\n<li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift\nbaked in; the cohort-shift gate (G6.4) is informational and will\nbite in v2.</li>\n</ul>\n<h2>Composition</h2>\n<ul>\n<li><strong>Entities.</strong> Accounts, contacts, leads, touches, sessions,\nsales_activities, opportunities (public); plus customers and\nsubscriptions (instructor only). Per-row counts per bundle live in\n<code>manifest.json</code>.</li>\n<li><strong>Features.</strong> 32 public columns grouped by analytical role in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md\"><code>docs/release/feature_dictionary.md</code></a>;\nthe per-bundle <code>feature_dictionary.csv</code> is the authoritative\nmachine-readable spec.</li>\n<li><strong>Label.</strong> <code>converted_within_90_days</code> (boolean), event-derived from\nthe simulator. 
Never sampled directly.</li>\n<li><strong>Splits.</strong> 70/15/15 train/valid/test, deterministic given seed;\nrecorded in <code>tasks/converted_within_90_days/task_manifest.json</code>.\n<strong>Group-leakage warning:</strong> the splitter is keyed on <code>lead_id</code> only,\nnot on <code>account_id</code> or <code>contact_id</code>. On the as-shipped intermediate\nbundle, <strong>518 of 557 test accounts (≈93 %) also appear in train</strong>;\nthe contact-level overlap is similar in magnitude. A flat baseline\ntrained on the random split rides account-level signal across the\nsplit boundary. For a generalisation-faithful number, retrain with\n<code>GroupKFold(account_id)</code> (or <code>contact_id</code>) and report both — see\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md\"><code>break_me_guide.md</code></a> §5 for the\ndetection recipe.</li>\n<li><strong>Provenance.</strong> Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package\nversion stamped in <code>manifest.json</code>.</li>\n</ul>\n<h2>Maintenance, adversarial framing, license</h2>\n<p>We <em>want</em> the dataset to be broken. The\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md\">break-me guide</a> catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under <code>.github/ISSUE_TEMPLATE/</code>: a\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml\">breakage report</a>\nform for findings on the bundle itself, and a\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml\">realism feedback</a>\nform for distributional critiques. 
Accepted findings are\nlogged in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md\"><code>docs/release/v2_decision_log.md</code></a>.\nFile issues at\n<a href=\"https://github.com/leadforge-dev/leadforge\">leadforge-dev/leadforge</a>;\nPRs welcome.</p>\n<table>\n<thead>\n<tr>\n<th>Field</th>\n<th>Value</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Generator</td>\n<td>leadforge <code>1.0.0+</code></td>\n</tr>\n<tr>\n<td>Recipe</td>\n<td><code>b2b_saas_procurement_v1</code></td>\n</tr>\n<tr>\n<td>Canonical seed</td>\n<td>42 (cross-seed sweep: 42–46)</td>\n</tr>\n<tr>\n<td>Bundle schema version</td>\n<td>5</td>\n</tr>\n<tr>\n<td>Format</td>\n<td>Parquet (canonical) + CSV (convenience)</td>\n</tr>\n<tr>\n<td>License</td>\n<td>MIT — see <a href=\"LICENSE\">LICENSE</a></td>\n</tr>\n</tbody>\n</table>\n<p>Verify integrity with <code>leadforge validate &lt;bundle_dir&gt;</code>; every file\nis hashed in <code>manifest.json</code>.</p>\n",
      "tags": [
        "tabular",
        "lead-scoring",
        "synthetic-data",
        "crm",
        "b2b",
        "datasets",
        "pandas",
        "intermediate"
      ],
      "coverImage": "../dataset-cover-image.png",
      "splits": [
        "train",
        "valid",
        "test"
      ],
      "subsets": [
        "leadforge-lead-scoring-v1-intermediate"
      ],
      "files": [
        {
          "path": "lead_scoring.csv",
          "size": "1393 KB",
          "kind": "CSV",
          "sourcePath": "../intermediate/lead_scoring.csv",
          "about": "Flat ML-ready snapshot CSV: 5,000 leads × 32 snapshot columns (target included), snapshot day 30. Also carries a 'split' column (train / valid / test) for conventional ML workflows."
        },
        {
          "path": "feature_dictionary.csv",
          "size": "3 KB",
          "kind": "CSV",
          "sourcePath": "../intermediate/feature_dictionary.csv",
          "about": "Per-column documentation: dtype, analytical category, leakage-risk flag, and plain-language description."
        },
        {
          "path": "tasks/converted_within_90_days/train.parquet",
          "size": "213 KB",
          "kind": "Parquet",
          "sourcePath": "../intermediate/tasks/converted_within_90_days/train.parquet",
          "about": "Training split — 3,500 leads, stratified on the target to preserve the conversion rate. Target column: `converted_within_90_days` (bool)."
        },
        {
          "path": "tasks/converted_within_90_days/valid.parquet",
          "size": "65 KB",
          "kind": "Parquet",
          "sourcePath": "../intermediate/tasks/converted_within_90_days/valid.parquet",
          "about": "Validation split — 750 leads."
        },
        {
          "path": "tasks/converted_within_90_days/test.parquet",
          "size": "64 KB",
          "kind": "Parquet",
          "sourcePath": "../intermediate/tasks/converted_within_90_days/test.parquet",
          "about": "Test split — 750 leads, held out for final evaluation only."
        },
        {
          "path": "dataset_card.md",
          "size": "3 KB",
          "kind": "Dataset card",
          "sourcePath": "../intermediate/dataset_card.md",
          "about": "Auto-generated tier-specific dataset card."
        }
      ],
      "columns": [
        "account_id",
        "industry",
        "region",
        "employee_band",
        "estimated_revenue_band",
        "process_maturity_band",
        "contact_id",
        "role_function",
        "seniority",
        "buyer_role",
        "lead_id",
        "lead_created_at",
        "lead_source",
        "first_touch_channel",
        "touch_count",
        "inbound_touch_count",
        "outbound_touch_count",
        "session_count",
        "pricing_page_views",
        "demo_page_views",
        "total_session_duration_seconds",
        "touches_week_1",
        "touches_last_7_days",
        "days_since_first_touch",
        "activity_count",
        "days_since_last_touch",
        "opportunity_created",
        "has_open_opportunity",
        "opportunity_estimated_acv",
        "expected_acv",
        "total_touches_all",
        "converted_within_90_days"
      ],
      "rows": [
        {
          "split": "train",
          "account_id": "acct_000773",
          "industry": "logistics",
          "region": "UK",
          "employee_band": "200-499",
          "estimated_revenue_band": "$50M-$200M",
          "process_maturity_band": "low",
          "contact_id": "cnt_001124",
          "role_function": "procurement_manager",
          "seniority": "vp",
          "buyer_role": "end_user",
          "lead_id": "lead_004250",
          "lead_created_at": "2024-01-08",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "0.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "0.0",
          "session_count": "0.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "0.0",
          "touches_week_1": "0.0",
          "touches_last_7_days": "",
          "days_since_first_touch": "",
          "activity_count": "0.0",
          "days_since_last_touch": "",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "67116.08121451367",
          "total_touches_all": "0.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000043",
          "industry": "logistics",
          "region": "UK",
          "employee_band": "500-999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "high",
          "contact_id": "cnt_003354",
          "role_function": "it_director",
          "seniority": "c_suite",
          "buyer_role": "technical_evaluator",
          "lead_id": "lead_001565",
          "lead_created_at": "2024-01-01",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "5.0",
          "inbound_touch_count": "5.0",
          "outbound_touch_count": "0.0",
          "session_count": "2.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "632.0",
          "touches_week_1": "1.0",
          "touches_last_7_days": "",
          "days_since_first_touch": "30.333835742092244",
          "activity_count": "",
          "days_since_last_touch": "11.003359485085898",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "58295.50663158452",
          "total_touches_all": "9.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000319",
          "industry": "logistics",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$1M-$10M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_000537",
          "role_function": "ap_manager",
          "seniority": "director",
          "buyer_role": "champion",
          "lead_id": "lead_002296",
          "lead_created_at": "2024-01-05",
          "lead_source": "partner_referral",
          "first_touch_channel": "partner_referral",
          "touch_count": "9.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "9.0",
          "session_count": "3.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "900.0",
          "touches_week_1": "3.0",
          "touches_last_7_days": "2.0",
          "days_since_first_touch": "27.96681262771429",
          "activity_count": "2.0",
          "days_since_last_touch": "5.711191915200993",
          "opportunity_created": "True",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "19986.26251151062",
          "total_touches_all": "17.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000476",
          "industry": "healthcare_non_clinical",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_001478",
          "role_function": "ap_manager",
          "seniority": "director",
          "buyer_role": "champion",
          "lead_id": "lead_003320",
          "lead_created_at": "2024-01-29",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "5.0",
          "inbound_touch_count": "5.0",
          "outbound_touch_count": "0.0",
          "session_count": "1.0",
          "pricing_page_views": "",
          "demo_page_views": "2.0",
          "total_session_duration_seconds": "435.0",
          "touches_week_1": "0.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "20.265681954383744",
          "activity_count": "2.0",
          "days_since_last_touch": "0.6712216272156128",
          "opportunity_created": "True",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "39357.420972624575",
          "total_touches_all": "13.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000243",
          "industry": "manufacturing",
          "region": "US",
          "employee_band": "1000-1999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "low",
          "contact_id": "cnt_000276",
          "role_function": "vp_finance",
          "seniority": "individual_contributor",
          "buyer_role": "economic_buyer",
          "lead_id": "lead_001192",
          "lead_created_at": "2024-01-01",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "5.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "5.0",
          "session_count": "1.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "171.0",
          "touches_week_1": "1.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "31.217369696525743",
          "activity_count": "5.0",
          "days_since_last_touch": "4.343947540442009",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "50371.89584021073",
          "expected_acv": "43452.46641864772",
          "total_touches_all": "10.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000353",
          "industry": "manufacturing",
          "region": "UK",
          "employee_band": "2000+",
          "estimated_revenue_band": "$1M-$10M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_002665",
          "role_function": "procurement_manager",
          "seniority": "vp",
          "buyer_role": "end_user",
          "lead_id": "lead_000123",
          "lead_created_at": "2024-01-27",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "1.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "1.0",
          "session_count": "1.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "359.0",
          "touches_week_1": "1.0",
          "touches_last_7_days": "0.0",
          "days_since_first_touch": "27.58339746775941",
          "activity_count": "0.0",
          "days_since_last_touch": "28.598940785518547",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "24212.652011240694",
          "total_touches_all": "1.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000029",
          "industry": "healthcare_non_clinical",
          "region": "US",
          "employee_band": "500-999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "low",
          "contact_id": "cnt_001377",
          "role_function": "procurement_manager",
          "seniority": "individual_contributor",
          "buyer_role": "end_user",
          "lead_id": "lead_001076",
          "lead_created_at": "2024-01-18",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "10.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "10.0",
          "session_count": "4.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "765.0",
          "touches_week_1": "6.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "",
          "activity_count": "",
          "days_since_last_touch": "",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "65917.65550828965",
          "total_touches_all": "17.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_001411",
          "industry": "professional_services",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "high",
          "contact_id": "cnt_002913",
          "role_function": "vp_finance",
          "seniority": "c_suite",
          "buyer_role": "economic_buyer",
          "lead_id": "lead_001584",
          "lead_created_at": "2024-01-09",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "12.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "12.0",
          "session_count": "2.0",
          "pricing_page_views": "",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "515.0",
          "touches_week_1": "5.0",
          "touches_last_7_days": "4.0",
          "days_since_first_touch": "29.038060387156584",
          "activity_count": "6.0",
          "days_since_last_touch": "3.718211110884809",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "43032.618878340254",
          "expected_acv": "22521.74016451115",
          "total_touches_all": "18.0",
          "converted_within_90_days": "False"
        }
      ],
      "discussions": [
        "What is `snapshot_day = 30` and how does it affect which features are valid at inference time?",
        "Is `total_touches_all` a safe feature or a time-window leakage trap?",
        "LR and GBM AUCs are very close across tiers — does relational feature engineering help?",
        "How would you set a probability threshold for a team that can only work 50 leads per week?",
        "What happens to AUC when you evaluate on a chronological hold-out instead of a random split?"
      ]
    },
    {
      "slug": "leadforge-lead-scoring-v1-advanced",
      "title": "LeadForge Lead Scoring v1 — Advanced",
      "owner": "leadforge-dev",
      "subtitle": "Advanced difficulty · 5,000 leads · ~8% conversion rate · LR AUC 0.886 (5-seed median)",
      "license": "MIT",
      "task": "tabular-classification",
      "language": "English",
      "updated": "2026-05-24",
      "downloads": "0",
      "likes": "0",
      "contactName": "leadforge-dev",
      "contactEmail": "",
      "rowCount": 5000,
      "kaggleUsability": "8.9",
      "kaggleMedals": "Bronze",
      "description": "",
      "descriptionHtml": "<h1>LeadForge: Synthetic B2B Lead Scoring Dataset (<code>leadforge-lead-scoring-v1</code>)</h1>\n<p>A relational, reproducible, three-tier synthetic CRM dataset family for\nteaching lead scoring at scale. Generated by\n<a href=\"https://github.com/leadforge-dev/leadforge\">leadforge</a>, an\nopen-source Python framework for synthetic CRM/funnel data. The\nframework version is decoupled from the dataset version: the package\nstays at <code>1.x</code>; the dataset is published under the explicit <code>…-v1</code>\ntag.</p>\n<h2>Why lead scoring matters in 2024–2026</h2>\n<p>Mid-market SaaS vendors entered 2024–2026 with growth slowing and\ncustomer-acquisition costs rising[^macro], so predicting <em>which</em> leads\nconvert within a fixed window has moved from a marketing nicety to a\nsurvival skill. This dataset teaches that skill on a relational\nsubstrate, with the realistic confusions (snapshot-window discipline,\nleakage traps, channel signal weaker than vendor blogs imply) that\nstudents will hit when they finally get hands on real CRM data.</p>\n<p>[^macro]: Macroeconomic framing summarised in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/external_review/summaries/gemini_v2_summary.md\"><code>docs/external_review/summaries/gemini_v2_summary.md</code></a>\n(median public-SaaS growth 30%→25% from 2023 to 2025; New CAC Ratio\nrose materially in 2024).</p>\n<h2>What's inside</h2>\n<pre><code>release/\n├── intro/ intermediate/ advanced/    # student_public bundles, one per difficulty tier\n│   ├── manifest.json                 # provenance + file hashes\n│   ├── metrics.json                  # per-tier headline metrics (medians + spreads)\n│   ├── dataset_card.md               # auto-rendered per-bundle card\n│   ├── feature_dictionary.csv        # authoritative column spec\n│   ├── lead_scoring.csv              # flat convenience CSV (all splits)\n│   ├── tables/*.parquet              # 7 snapshot-safe relational 
tables\n│   └── tasks/converted_within_90_days/{train,valid,test}.parquet\n├── intermediate_instructor/          # research companion: full-horizon tables + metadata/\n├── docs/                             # vendored DGP / leakage / break-me docs (agent-readable)\n├── notebooks/                        # 01 baseline · 02 relational · 03 leakage · 04 calibration\n├── metrics.json                      # top-level cross-tier metrics summary\n├── claims_register.{md,json}         # claims → backing-artifact map (agent-readable)\n└── validation/                       # validation_report.{json,md} + figures\n</code></pre>\n<p><code>student_public</code> bundles ship the snapshot-safe relational view;\n<code>research_instructor</code> companions ship the full-horizon view plus the\nhidden causal structure (DAG, latent registry, mechanism summary)\nunder <code>metadata/</code>. The full layout is documented in each bundle's\n<code>manifest.json</code>.</p>\n<h3>Agent-reviewable artifacts</h3>\n<p>The published bundle is self-contained for AI review and offline\nauditing — every numeric / structural claim on this page can be\nverified without following an external link:</p>\n<ul>\n<li><strong><code>metrics.json</code> (root) + <code>&lt;tier&gt;/metrics.json</code></strong> — deterministic\nJSON view of the headline LR AUC / AP / P@100 / Brier / conversion\nrate / cohort-shift / cross-tier-ordering medians, with JSON-path\nback-references to <code>validation/validation_report.json</code> (the\nsource of truth).</li>\n<li><strong><code>claims_register.{md,json}</code></strong> — every numerical or structural\nclaim on this page paired with the artifact and path that backs it.\nRendered from <code>claims_register_source.yaml</code> by\n<code>scripts/build_claims_register.py</code>.</li>\n<li><strong><code>docs/</code></strong> — vendored copies of <code>generation_method.md</code>,\n<code>channel_signal_audit.md</code>, 
<code>break_me_guide.md</code>,\n<code>feature_dictionary.md</code>, <code>v1_acceptance_gates_bands.yaml</code>,\n<code>v2_decision_log.md</code>, plus a hand-authored\n<code>relational_table_schemas.csv</code> documenting every column of every\nrelational table.  These match the GitHub-blob links cited below but\nship inside the bundle so a reviewer never needs network access.</li>\n<li><strong><code>&lt;tier&gt;/manifest.json</code></strong> — SHA-256 hash for every file plus the\nfull redaction contract (<code>structural_redactions.columns</code>,\n<code>omitted_tables</code>, <code>relational_snapshot_safe</code>, <code>snapshot_day</code>).</li>\n<li>Kaggle / HuggingFace preview pages additionally inject a\n<code>schema.org/Dataset</code> JSON-LD block in their <code>&lt;head&gt;</code> for agent\ningestion without HTML parsing.</li>\n</ul>\n<h2>Quick start</h2>\n<pre><code class=\"language-python\">import pandas as pd\n\n# Flat CSV\ndf = pd.read_csv(&quot;intermediate/lead_scoring.csv&quot;)\n\n# Parquet task splits (recommended)\ntrain = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/train.parquet&quot;)\ntest  = pd.read_parquet(&quot;intermediate/tasks/converted_within_90_days/test.parquet&quot;)\n\n# Relational tables (feature engineering — example)\nleads   = pd.read_parquet(&quot;intermediate/tables/leads.parquet&quot;)\ntouches = pd.read_parquet(&quot;intermediate/tables/touches.parquet&quot;)\nmy_touch_count = (\n    touches.groupby(&quot;lead_id&quot;).size().rename(&quot;my_touch_count&quot;).reset_index()\n)\nfeatures = leads.merge(my_touch_count, on=&quot;lead_id&quot;, how=&quot;left&quot;)\n\n# Reproduce from source\n# pip install leadforge\n# leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \\\n#                    --mode student_public --difficulty intermediate --out my_bundle\n</code></pre>\n<p>The label <code>converted_within_90_days</code> resolves over a 90-day window;\nengagement features (<code>touch_count</code>, 
<code>session_count</code>, etc.) are\ncomputed strictly over events on days <code>[0, 30]</code>. The deliberate\nexception is <code>total_touches_all</code>, the leakage trap — flagged\n<code>leakage_risk=True</code> in <code>feature_dictionary.csv</code>. Drop it from your\nfeature set unless you're demonstrating leakage detection.</p>\n<h2>Dataset summary</h2>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>Intro</th>\n<th>Intermediate</th>\n<th>Advanced</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Leads</td>\n<td>5,000</td>\n<td>5,000</td>\n<td>5,000</td>\n</tr>\n<tr>\n<td>Accounts</td>\n<td>1,500</td>\n<td>1,500</td>\n<td>1,500</td>\n</tr>\n<tr>\n<td>Contacts</td>\n<td>4,200</td>\n<td>4,200</td>\n<td>4,200</td>\n</tr>\n<tr>\n<td>Snapshot columns</td>\n<td>32 / 34*</td>\n<td>32 / 34*</td>\n<td>32 / 34*</td>\n</tr>\n<tr>\n<td>Target</td>\n<td><code>converted_within_90_days</code></td>\n<td><code>converted_within_90_days</code></td>\n<td><code>converted_within_90_days</code></td>\n</tr>\n<tr>\n<td>Conversion rate (acceptance band, gate G7.*)</td>\n<td>24–61%</td>\n<td>12–31%</td>\n<td>4–12%</td>\n</tr>\n<tr>\n<td>Conversion rate (observed median, seeds 42–46)</td>\n<td>42.67%</td>\n<td>21.60%</td>\n<td>8.40%</td>\n</tr>\n<tr>\n<td>Signal strength</td>\n<td>0.90</td>\n<td>0.70</td>\n<td>0.50</td>\n</tr>\n<tr>\n<td>Noise scale</td>\n<td>0.10</td>\n<td>0.30</td>\n<td>0.55</td>\n</tr>\n<tr>\n<td>Missing rate</td>\n<td>2%</td>\n<td>8%</td>\n<td>18%</td>\n</tr>\n</tbody>\n</table>\n<p>* <code>student_public</code> / <code>research_instructor</code>. Difficulty is modulated\nby the simulation engine — signal strength on latent-trait weights,\nGaussian noise on float features, MCAR missingness, outlier rate —\nnot post-hoc label flipping. 
The acceptance band is the recipe\ngate's tolerance window (<code>v1_acceptance_gates_bands.yaml</code> G7.*),\nnot the achievable range — observed five-seed spreads sit\ncomfortably inside the band.</p>\n<h2>The scenario</h2>\n<p><strong>Veridian Technologies</strong> is a fictional Series B startup (Austin, US)\nselling <strong>Veridian Procure</strong>, a procurement / AP automation SaaS, to\nmid-market firms (200–2,000 employees) in the US and UK. The funnel\nruns through inbound marketing (45%), SDR outbound (35%), and\npartner referrals (20%); four personas drive deals (VP Finance, AP\nManager, IT Director, Procurement Manager). <strong>Task:</strong> predict whether\na lead converts (<code>closed_won</code>) within 90 days. ACV bands are\n$18k–$120k. See\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/generation_method.md\"><code>docs/release/generation_method.md</code></a>\nfor the full DGP, and the deeper &quot;what's modelled / approximate / not\nmodelled&quot; breakdown that this README only summarises.</p>\n<h2>Public vs instructor: what's redacted</h2>\n<p>Filtering happens <strong>during rendering</strong>, not during simulation. 
The\nredaction contract is single-sourced in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/leadforge/validation/leakage_probes.py\"><code>leadforge/validation/leakage_probes.py</code></a>;\nthe snapshot-safe writer and the validator import the same constants,\nso they cannot drift apart.</p>\n<table>\n<thead>\n<tr>\n<th>Source-of-truth constant</th>\n<th>Public bundle treatment</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code>BANNED_LEAD_COLUMNS = (&quot;converted_within_90_days&quot;, &quot;conversion_timestamp&quot;)</code></td>\n<td>Dropped from <code>tables/leads.parquet</code></td>\n</tr>\n<tr>\n<td><code>BANNED_OPP_COLUMNS = (&quot;close_outcome&quot;, &quot;closed_at&quot;)</code></td>\n<td>Dropped from <code>tables/opportunities.parquet</code></td>\n</tr>\n<tr>\n<td><code>BANNED_TABLES = (&quot;customers&quot;, &quot;subscriptions&quot;)</code></td>\n<td>Omitted from public bundles</td>\n</tr>\n<tr>\n<td><code>SNAPSHOT_FILTERED_TABLES</code> (touches, sessions, sales_activities, opportunities)</td>\n<td>Filtered per-lead by <code>lead_created_at + snapshot_day</code></td>\n</tr>\n<tr>\n<td>Snapshot redaction (<code>current_stage</code>, <code>is_sql</code>)</td>\n<td>Stripped from <code>tasks/</code> splits and <code>tables/leads.parquet</code></td>\n</tr>\n<tr>\n<td><code>total_touches_all</code> (deliberate trap)</td>\n<td><strong>Retained in both modes</strong>; flagged <code>leakage_risk=True</code></td>\n</tr>\n</tbody>\n</table>\n<p>Each bundle's <code>manifest.json</code> records <code>relational_snapshot_safe</code>,\n<code>redacted_columns</code>, and <code>snapshot_day</code>, so the bundle is\nself-describing.</p>\n<h2>Calibration</h2>\n<p>Every realism / calibration / difficulty claim in this README is\nbacked by\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/release/validation/validation_report.md\"><code>validation/validation_report.md</code></a>,\nregenerated by\n<a 
href=\"https://github.com/leadforge-dev/leadforge/blob/main/scripts/validate_release_candidate.py\"><code>scripts/validate_release_candidate.py</code></a>\nwith bands declared in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v1_acceptance_gates_bands.yaml\"><code>docs/release/v1_acceptance_gates_bands.yaml</code></a>.\nHeadline cross-seed medians (seeds 42–46):</p>\n<table>\n<thead>\n<tr>\n<th>Tier</th>\n<th>LR AUC</th>\n<th>AP</th>\n<th>P@100</th>\n<th>Brier</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>intro</td>\n<td>0.879</td>\n<td>0.761</td>\n<td>0.80</td>\n<td>0.130</td>\n</tr>\n<tr>\n<td>intermediate</td>\n<td>0.886</td>\n<td>0.575</td>\n<td>0.59</td>\n<td>0.110</td>\n</tr>\n<tr>\n<td>advanced</td>\n<td>0.886</td>\n<td>0.351</td>\n<td>0.34</td>\n<td>0.061</td>\n</tr>\n</tbody>\n</table>\n<p>AP, P@100, conversion-rate, and lift orderings hold across the\nintended difficulty axis (intro &gt; intermediate &gt; advanced).</p>\n<h2>Intended uses</h2>\n<ul>\n<li>Teaching baseline lead-scoring on a flat snapshot.</li>\n<li>Teaching relational feature engineering against snapshot-safe tables.</li>\n<li>Teaching leakage detection (the <code>total_touches_all</code> trap is\ndesigned to be discoverable).</li>\n<li>Teaching calibration, lift, P@K, value-aware ranking\n(<code>expected_acv × P(convert)</code>), and cohort-shift evaluation.</li>\n<li>Comparing model families under a controlled DGP.</li>\n</ul>\n<h2>Out-of-scope uses</h2>\n<ul>\n<li><strong>Production lead scoring.</strong> The company, product, and customers are\nfictional.</li>\n<li><strong>Vendor benchmarking / paper baselines.</strong> Difficulty tiers are\ncalibrated for pedagogy, not cross-paper comparability.</li>\n<li><strong>Causal-inference research that requires recovery of the true DGP.</strong>\nThe instructor companion exposes the hidden graph for teaching, not\ndesigned counterfactuals.</li>\n<li><strong>Demographic / fairness research.</strong> v1 does not model 
protected\nattributes.</li>\n</ul>\n<h2>Known limitations</h2>\n<ul>\n<li><strong>Difficulty signal on raw AUC is flat.</strong> LR AUC is ~0.88 across\nevery tier. Difficulty is visible in AP, P@K, Brier, and value\ncapture. Treat AUC as a sanity check, not a difficulty signal.</li>\n<li><strong>GBM does not consistently beat LR (gate G7.4.4).</strong> GBM−LR AUC delta\nis slightly negative in every tier (intro −0.0045, intermediate\n−0.0072, advanced −0.0133); v1's snapshot is dominated by linear\nfeatures. v2 will inject non-linear interactions in the simulator.</li>\n<li><strong>Channel signal is weak.</strong> Per\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/channel_signal_audit.md\"><code>docs/release/channel_signal_audit.md</code></a>,\nout-of-sample univariate AUC of <code>lead_source</code> is ≈0.50–0.52 across\nall tiers and the per-channel rate spread is ≤0.05. The simulator\ndoes not encode channel-conditional probabilities; channel-conditional\nencoding is post-v1 work.</li>\n<li><strong>Cohort-shift degradation is small.</strong> v1 has no time-of-year drift\nbaked in; the cohort-shift gate (G6.4) is informational and will\nbite in v2.</li>\n</ul>\n<h2>Composition</h2>\n<ul>\n<li><strong>Entities.</strong> Accounts, contacts, leads, touches, sessions,\nsales_activities, opportunities (public); plus customers and\nsubscriptions (instructor only). Per-table row counts for each bundle live in\n<code>manifest.json</code>.</li>\n<li><strong>Features.</strong> 32 public columns grouped by analytical role in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/feature_dictionary.md\"><code>docs/release/feature_dictionary.md</code></a>;\nthe per-bundle <code>feature_dictionary.csv</code> is the authoritative\nmachine-readable spec.</li>\n<li><strong>Label.</strong> <code>converted_within_90_days</code> (boolean), event-derived from\nthe simulator. 
Never sampled directly.</li>\n<li><strong>Splits.</strong> 70/15/15 train/valid/test, deterministic given seed;\nrecorded in <code>tasks/converted_within_90_days/task_manifest.json</code>.\n<strong>Group-leakage warning:</strong> the splitter is keyed on <code>lead_id</code> only,\nnot on <code>account_id</code> or <code>contact_id</code>. On the as-shipped intermediate\nbundle, <strong>518 of 557 test accounts (≈93 %) also appear in train</strong>;\nthe contact-level overlap is similar in magnitude. A flat baseline\ntrained on the random split rides account-level signal across the\nsplit boundary. For a generalisation-faithful number, retrain with\n<code>GroupKFold(account_id)</code> (or <code>contact_id</code>) and report both — see\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md\"><code>break_me_guide.md</code></a> §5 for the\ndetection recipe.</li>\n<li><strong>Provenance.</strong> Recipe <code>b2b_saas_procurement_v1</code>, seed 42, package\nversion stamped in <code>manifest.json</code>.</li>\n</ul>\n<h2>Maintenance, adversarial framing, license</h2>\n<p>We <em>want</em> the dataset to be broken. The\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/break_me_guide.md\">break-me guide</a> catalogues\nnine adversarial patterns to look for (leakage, split\ncontamination, ranking inversions, calibration drift) with\nworked-example pointers back into the notebooks. Issue\ntemplates ship under <code>.github/ISSUE_TEMPLATE/</code>: a\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/dataset_breakage_report.yml\">breakage report</a>\nform for findings on the bundle itself, and a\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/.github/ISSUE_TEMPLATE/realism_feedback.yml\">realism feedback</a>\nform for distributional critiques. 
Accepted findings are\nlogged in\n<a href=\"https://github.com/leadforge-dev/leadforge/blob/main/docs/release/v2_decision_log.md\"><code>docs/release/v2_decision_log.md</code></a>.\nFile issues at\n<a href=\"https://github.com/leadforge-dev/leadforge\">leadforge-dev/leadforge</a>;\nPRs welcome.</p>\n<table>\n<thead>\n<tr>\n<th>Field</th>\n<th>Value</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Generator</td>\n<td>leadforge <code>1.0.0+</code></td>\n</tr>\n<tr>\n<td>Recipe</td>\n<td><code>b2b_saas_procurement_v1</code></td>\n</tr>\n<tr>\n<td>Canonical seed</td>\n<td>42 (cross-seed sweep: 42–46)</td>\n</tr>\n<tr>\n<td>Bundle schema version</td>\n<td>5</td>\n</tr>\n<tr>\n<td>Format</td>\n<td>Parquet (canonical) + CSV (convenience)</td>\n</tr>\n<tr>\n<td>License</td>\n<td>MIT — see <a href=\"LICENSE\">LICENSE</a></td>\n</tr>\n</tbody>\n</table>\n<p>Verify integrity with <code>leadforge validate &lt;bundle_dir&gt;</code>; every file\nis hashed in <code>manifest.json</code>.</p>\n",
      "tags": [
        "tabular",
        "lead-scoring",
        "synthetic-data",
        "crm",
        "b2b",
        "datasets",
        "pandas",
        "advanced"
      ],
      "coverImage": "../dataset-cover-image.png",
      "splits": [
        "train",
        "valid",
        "test"
      ],
      "subsets": [
        "leadforge-lead-scoring-v1-advanced"
      ],
      "files": [
        {
          "path": "lead_scoring.csv",
          "size": "1353 KB",
          "kind": "CSV",
          "sourcePath": "../advanced/lead_scoring.csv",
          "about": "Flat ML-ready snapshot CSV: 5,000 leads × 32 columns (including the target `converted_within_90_days`), snapshot day 30. Also includes a 'split' column (train / valid / test) for conventional ML workflows."
        },
        {
          "path": "feature_dictionary.csv",
          "size": "3 KB",
          "kind": "CSV",
          "sourcePath": "../advanced/feature_dictionary.csv",
          "about": "Per-column documentation: dtype, analytical category, leakage-risk flag, and plain-language description."
        },
        {
          "path": "tasks/converted_within_90_days/train.parquet",
          "size": "200 KB",
          "kind": "Parquet",
          "sourcePath": "../advanced/tasks/converted_within_90_days/train.parquet",
          "about": "Training split — 3,500 leads, stratified by conversion rate.  Target column: `converted_within_90_days` (bool)."
        },
        {
          "path": "tasks/converted_within_90_days/valid.parquet",
          "size": "61 KB",
          "kind": "Parquet",
          "sourcePath": "../advanced/tasks/converted_within_90_days/valid.parquet",
          "about": "Validation split — 750 leads."
        },
        {
          "path": "tasks/converted_within_90_days/test.parquet",
          "size": "62 KB",
          "kind": "Parquet",
          "sourcePath": "../advanced/tasks/converted_within_90_days/test.parquet",
          "about": "Test split — 750 leads, held out for final evaluation only."
        },
        {
          "path": "dataset_card.md",
          "size": "3 KB",
          "kind": "Dataset card",
          "sourcePath": "../advanced/dataset_card.md",
          "about": "Auto-generated tier-specific dataset card."
        }
      ],
      "columns": [
        "account_id",
        "industry",
        "region",
        "employee_band",
        "estimated_revenue_band",
        "process_maturity_band",
        "contact_id",
        "role_function",
        "seniority",
        "buyer_role",
        "lead_id",
        "lead_created_at",
        "lead_source",
        "first_touch_channel",
        "touch_count",
        "inbound_touch_count",
        "outbound_touch_count",
        "session_count",
        "pricing_page_views",
        "demo_page_views",
        "total_session_duration_seconds",
        "touches_week_1",
        "touches_last_7_days",
        "days_since_first_touch",
        "activity_count",
        "days_since_last_touch",
        "opportunity_created",
        "has_open_opportunity",
        "opportunity_estimated_acv",
        "expected_acv",
        "total_touches_all",
        "converted_within_90_days"
      ],
      "rows": [
        {
          "split": "train",
          "account_id": "acct_000773",
          "industry": "logistics",
          "region": "UK",
          "employee_band": "200-499",
          "estimated_revenue_band": "$50M-$200M",
          "process_maturity_band": "low",
          "contact_id": "cnt_001124",
          "role_function": "procurement_manager",
          "seniority": "vp",
          "buyer_role": "end_user",
          "lead_id": "lead_004250",
          "lead_created_at": "2024-01-08",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "8.0",
          "inbound_touch_count": "8.0",
          "outbound_touch_count": "0.0",
          "session_count": "1.0",
          "pricing_page_views": "1.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "484.0",
          "touches_week_1": "1.0",
          "touches_last_7_days": "",
          "days_since_first_touch": "24.08619850861116",
          "activity_count": "6.0",
          "days_since_last_touch": "-3.5747144539187232",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "37710.42551803246",
          "expected_acv": "-10645.128882835355",
          "total_touches_all": "",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000043",
          "industry": "logistics",
          "region": "UK",
          "employee_band": "500-999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "high",
          "contact_id": "cnt_003354",
          "role_function": "it_director",
          "seniority": "c_suite",
          "buyer_role": "technical_evaluator",
          "lead_id": "lead_001565",
          "lead_created_at": "2024-01-01",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "9.0",
          "inbound_touch_count": "9.0",
          "outbound_touch_count": "0.0",
          "session_count": "3.0",
          "pricing_page_views": "4.0",
          "demo_page_views": "",
          "total_session_duration_seconds": "900.0",
          "touches_week_1": "4.0",
          "touches_last_7_days": "",
          "days_since_first_touch": "30.62883305317309",
          "activity_count": "",
          "days_since_last_touch": "1.8952519188233432",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "52804.112720196485",
          "expected_acv": "80317.28475907192",
          "total_touches_all": "",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000319",
          "industry": "logistics",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$1M-$10M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_000537",
          "role_function": "ap_manager",
          "seniority": "director",
          "buyer_role": "champion",
          "lead_id": "lead_002296",
          "lead_created_at": "2024-01-05",
          "lead_source": "partner_referral",
          "first_touch_channel": "partner_referral",
          "touch_count": "9.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "9.0",
          "session_count": "4.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "1065.0",
          "touches_week_1": "4.0",
          "touches_last_7_days": "2.0",
          "days_since_first_touch": "27.053826214825193",
          "activity_count": "",
          "days_since_last_touch": "3.2322809610775023",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "15778.446526640288",
          "total_touches_all": "17.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000476",
          "industry": "healthcare_non_clinical",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_001478",
          "role_function": "ap_manager",
          "seniority": "director",
          "buyer_role": "champion",
          "lead_id": "lead_003320",
          "lead_created_at": "2024-01-29",
          "lead_source": "inbound_marketing",
          "first_touch_channel": "inbound_marketing",
          "touch_count": "13.0",
          "inbound_touch_count": "13.0",
          "outbound_touch_count": "0.0",
          "session_count": "4.0",
          "pricing_page_views": "",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "1381.0",
          "touches_week_1": "1.0",
          "touches_last_7_days": "1.0",
          "days_since_first_touch": "22.61679442790949",
          "activity_count": "4.0",
          "days_since_last_touch": "-4.398843829294206",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "-1312.8841795232147",
          "expected_acv": "-11408.728282716547",
          "total_touches_all": "14.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000243",
          "industry": "manufacturing",
          "region": "US",
          "employee_band": "1000-1999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "low",
          "contact_id": "cnt_000276",
          "role_function": "vp_finance",
          "seniority": "individual_contributor",
          "buyer_role": "economic_buyer",
          "lead_id": "lead_001192",
          "lead_created_at": "2024-01-01",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "8.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "8.0",
          "session_count": "2.0",
          "pricing_page_views": "1.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "1128.0",
          "touches_week_1": "",
          "touches_last_7_days": "",
          "days_since_first_touch": "32.29311067265875",
          "activity_count": "1.0",
          "days_since_last_touch": "3.5385907968992516",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "47876.648993836",
          "expected_acv": "35049.85592930211",
          "total_touches_all": "",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000353",
          "industry": "manufacturing",
          "region": "UK",
          "employee_band": "2000+",
          "estimated_revenue_band": "$1M-$10M",
          "process_maturity_band": "medium",
          "contact_id": "cnt_002665",
          "role_function": "procurement_manager",
          "seniority": "vp",
          "buyer_role": "end_user",
          "lead_id": "lead_000123",
          "lead_created_at": "2024-01-27",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "6.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "6.0",
          "session_count": "2.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "150.0",
          "touches_week_1": "4.0",
          "touches_last_7_days": "2.0",
          "days_since_first_touch": "26.215262450130034",
          "activity_count": "3.0",
          "days_since_last_touch": "3.1313429432706332",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "23551.864433039216",
          "total_touches_all": "",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_000029",
          "industry": "healthcare_non_clinical",
          "region": "US",
          "employee_band": "500-999",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "low",
          "contact_id": "cnt_001377",
          "role_function": "procurement_manager",
          "seniority": "individual_contributor",
          "buyer_role": "end_user",
          "lead_id": "lead_001076",
          "lead_created_at": "2024-01-18",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "11.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "11.0",
          "session_count": "0.0",
          "pricing_page_views": "0.0",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "0.0",
          "touches_week_1": "5.0",
          "touches_last_7_days": "2.0",
          "days_since_first_touch": "",
          "activity_count": "",
          "days_since_last_touch": "",
          "opportunity_created": "False",
          "has_open_opportunity": "False",
          "opportunity_estimated_acv": "",
          "expected_acv": "75080.37802229391",
          "total_touches_all": "14.0",
          "converted_within_90_days": "False"
        },
        {
          "split": "train",
          "account_id": "acct_001411",
          "industry": "professional_services",
          "region": "US",
          "employee_band": "200-499",
          "estimated_revenue_band": "$10M-$50M",
          "process_maturity_band": "high",
          "contact_id": "cnt_002913",
          "role_function": "vp_finance",
          "seniority": "c_suite",
          "buyer_role": "economic_buyer",
          "lead_id": "lead_001584",
          "lead_created_at": "2024-01-09",
          "lead_source": "sdr_outbound",
          "first_touch_channel": "sdr_outbound",
          "touch_count": "3.0",
          "inbound_touch_count": "0.0",
          "outbound_touch_count": "3.0",
          "session_count": "1.0",
          "pricing_page_views": "",
          "demo_page_views": "0.0",
          "total_session_duration_seconds": "234.0",
          "touches_week_1": "",
          "touches_last_7_days": "",
          "days_since_first_touch": "28.188032773479",
          "activity_count": "6.0",
          "days_since_last_touch": "3.2455395630901305",
          "opportunity_created": "True",
          "has_open_opportunity": "True",
          "opportunity_estimated_acv": "20430.54627729372",
          "expected_acv": "-17325.35698486596",
          "total_touches_all": "13.0",
          "converted_within_90_days": "False"
        }
      ],
      "discussions": [
        "What does `snapshot_day = 30` mean, and how does it affect which features are valid at inference time?",
        "Is `total_touches_all` a safe feature or a time-window leakage trap?",
        "LR and GBM AUCs are very close across tiers: does relational feature engineering actually help?",
        "How would you set a probability threshold for a team that can only work 50 leads per week?",
        "What happens to AUC when you evaluate on a chronological hold-out instead of a random split?"
      ]
    }
  ],
  "mockNotice": "This is a ShmuggingFace review mock. It is not Hugging Face, Kaggle, or a real dataset release."
}