About Dataset

B2B Lead Scoring Dataset — Advanced Tier

This is a synthetic dataset for practicing B2B lead scoring. It was generated by leadforge, an open-source Python framework for producing realistic CRM/funnel training data. No real company, customer, or transaction is represented.

What you are predicting: Each row is a sales lead at a fictional B2B SaaS company. The task is binary classification:

converted_within_90_days — did this lead close as a paid deal within 90 days?

Features capture the first 30 days of CRM activity per lead (email/call touches, product sessions, deal stage, account firmographics). The label is derived from simulated events — never directly sampled — so there is genuine causal structure behind the signal.

This tier: advanced

Property	Value
Conversion rate	~8%
Signal strength	0.50 / 1.0 (moderate)
Noise level	0.55 / 1.0 (high)
Missing values	~18%
LR AUC (test, 5-seed median)	0.624
GBM AUC (test, 5-seed median)	0.600
Average precision (LR)	0.122
Precision @100	0.11

The advanced tier is a calibration and rare-event exercise. Conversion rate is ~8% — a realistic low-prevalence regime for mid-market SaaS — and noise is heavy enough that count features show artifact zeros (Gaussian noise clamped to zero; treat zero clusters as data-cleaning material, not reliable signal). AUC barely moves across tiers by design; here you'll want average precision, P@K, and value-weighted ranking (expected_acv × P(convert)) to measure what matters. Calibration is harder in this tier: a miscalibrated model can rank correctly but still predict systematically wrong probabilities — the kind of mistake that breaks real-world decision thresholds.
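The metrics named above can be computed along these lines. This is a minimal sketch on toy arrays, assuming you already have predicted probabilities from some model; the helper names `precision_at_k` and `value_weighted_order` and all numeric values are illustrative and not part of leadforge.

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss


def precision_at_k(y_true, scores, k):
    """Fraction of true conversions among the k highest-scored leads."""
    top_k = np.argsort(-np.asarray(scores), kind="stable")[:k]
    return float(np.asarray(y_true)[top_k].mean())


def value_weighted_order(p_convert, expected_acv):
    """Lead indices sorted by expected value = expected_acv * P(convert)."""
    expected_value = np.asarray(expected_acv) * np.asarray(p_convert)
    return np.argsort(-expected_value, kind="stable")


# Toy inputs (hypothetical values, not taken from the dataset)
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])
p_hat = np.array([0.05, 0.10, 0.60, 0.20, 0.30, 0.02, 0.08, 0.15])
acv = np.array([40_000, 90_000, 20_000, 50_000, 120_000, 30_000, 50_000, 25_000])

ap = average_precision_score(y_true, p_hat)   # ranking quality under class imbalance
p_at_2 = precision_at_k(y_true, p_hat, k=2)   # hit rate in the top 2 leads
brier = brier_score_loss(y_true, p_hat)       # one simple calibration-sensitive score
order = value_weighted_order(p_hat, acv)      # who to call first by expected value
```

Note that the value-weighted order can differ from the pure probability order: here the lead with the second-highest probability comes first because its ACV is much larger. In the real data `expected_acv` contains missing values, so some imputation choice is needed before ranking.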

This dataset ships in three tiers — intro → intermediate → advanced — with decreasing signal, lower conversion rates, and heavier noise and missingness. All three tiers share the same schema and simulate the same fictional B2B world.

Table inventory

Table	Rows	Description
accounts	1,500	One row per company
contacts	4,200	One row per buyer-side individual (multiple per account)
leads	5,000	One row per lead — the prediction unit
touches	38,208	Marketing / SDR outreach events (first 30 days per lead)
sessions	9,942	Product demo or trial sessions (first 30 days per lead)
sales_activities	19,995	CRM activities: calls, emails, meetings (first 30 days per lead)
opportunities	4,004	Deal records opened before the 30-day snapshot

Snapshot-safe: event tables contain only rows with timestamps ≤ 30 days from lead creation. Outcome columns (converted_within_90_days, conversion_timestamp, close_outcome) are excluded from the public relational tables — they appear only in the task splits.
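If you want to confirm the snapshot guarantee yourself, a check along these lines should work. It is a sketch only: `lead_id` and `lead_created_at` appear in the leads table, but the event-table key and the timestamp column name (`event_ts` here) are assumptions, so pass the real column name via `ts_col`.

```python
import pandas as pd


def rows_outside_window(events, leads, ts_col, window_days=30):
    """Return event rows stamped before lead creation or after the window."""
    merged = events.merge(
        leads[["lead_id", "lead_created_at"]], on="lead_id", how="left"
    )
    age = pd.to_datetime(merged[ts_col]) - pd.to_datetime(merged["lead_created_at"])
    bad = (age < pd.Timedelta(0)) | (age > pd.Timedelta(days=window_days))
    return merged[bad]


# Toy stand-ins for tables/leads.parquet and an event table (hypothetical rows)
leads = pd.DataFrame({
    "lead_id": ["lead_a", "lead_b"],
    "lead_created_at": ["2024-01-01", "2024-01-05"],
})
touches = pd.DataFrame({
    "lead_id": ["lead_a", "lead_a", "lead_b"],
    "event_ts": ["2024-01-10", "2024-02-15", "2024-02-04"],  # 9, 45 and 30 days
})

violations = rows_outside_window(touches, leads, ts_col="event_ts")
```

On the published tables the result should be empty; in the toy data only the 45-day row is flagged.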

Features

Category	Columns	Examples
Account	6	`account_id`, `industry`, `region`
Contact	4	`contact_id`, `role_function`, `seniority`
Lead metadata	3	`lead_id`, `lead_created_at`, `lead_source`
Engagement	11	`touch_count`, `inbound_touch_count`, `outbound_touch_count`
Sales	6	`activity_count`, `days_since_last_touch`, `opportunity_created`
Target	1	`converted_within_90_days`

⚠ Intentional leakage trap: total_touches_all aggregates touches over the full 90-day window (not just the 30-day feature window) and is deliberately retained as a leakage-detection teaching exercise. It is flagged leakage_risk=True in feature_dictionary.csv. Drop it from your feature set unless you are studying leakage.

See feature_dictionary.csv for the full column-by-column specification.
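One way to drop flagged columns programmatically rather than by hand is sketched below. The `leakage_risk` flag is documented above, but the name of the column holding feature names (`name` here) is an assumption; check the header of feature_dictionary.csv before relying on it.

```python
import pandas as pd


def leakage_columns(feature_dict, name_col="name", flag_col="leakage_risk"):
    """Names of features whose leakage flag is truthy.

    `name_col` is an assumed column name; check the real header first.
    """
    flags = feature_dict[flag_col].astype(str).str.strip().str.lower() == "true"
    return feature_dict.loc[flags, name_col].tolist()


# Toy stand-in for pd.read_csv("feature_dictionary.csv")
feature_dict = pd.DataFrame({
    "name": ["touch_count", "session_count", "total_touches_all"],
    "leakage_risk": [False, False, True],
})

# Toy stand-in for one row of lead_scoring.csv
df = pd.DataFrame({
    "touch_count": [9.0],
    "session_count": [4.0],
    "total_touches_all": [17],
    "converted_within_90_days": [False],
})

to_drop = leakage_columns(feature_dict)
X = df.drop(columns=to_drop + ["converted_within_90_days"])
```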

The simulated world

The dataset simulates a fictional company — Veridian Technologies — a Series B startup (Austin, TX, founded 2017) selling Veridian Procure, a cloud procurement / AP automation SaaS. Everything below is invented:

Target customers: 200–2,000-employee firms in the US and UK (manufacturing, logistics, healthcare, professional services)
Deal range: $18,000–$120,000 ACV; average deal $42,000; average sales cycle 45 days
Go-to-market: 45% inbound marketing, 35% SDR outbound, 20% partner referrals
Buyer personas: VP Finance (economic buyer), AP Manager (champion), IT Director (technical evaluator), Procurement Manager (end user)

In this public version, the hidden causal graph, latent trait scores, and mechanism parameters are withheld. The instructor companion bundle includes them.

How to load

import pandas as pd

# Flat CSV — all leads, all splits combined (convenient for exploration)
df = pd.read_csv("lead_scoring.csv")
# Drop the target and the intentional leakage column (see Features)
X = df.drop(columns=["converted_within_90_days", "total_touches_all"])
y = df["converted_within_90_days"]

# Parquet task splits — recommended for model training
train = pd.read_parquet("tasks/converted_within_90_days/train.parquet")
valid = pd.read_parquet("tasks/converted_within_90_days/valid.parquet")
test  = pd.read_parquet("tasks/converted_within_90_days/test.parquet")

# Relational tables — for feature engineering
leads   = pd.read_parquet("tables/leads.parquet")
touches = pd.read_parquet("tables/touches.parquet")

Splits are 70 / 15 / 15 (train / valid / test), stratified on the target, deterministic given seed 42.
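A quick sanity check of split shares and per-split conversion rates might look like the sketch below. It assumes the flat CSV carries a `split` column, as the file preview suggests, and uses a toy frame in place of the real file.

```python
import pandas as pd

# Toy stand-in for pd.read_csv("lead_scoring.csv") (hypothetical rows)
df = pd.DataFrame({
    "split": ["train"] * 14 + ["valid"] * 3 + ["test"] * 3,
    "converted_within_90_days": (
        [True] + [False] * 13      # train
        + [False] * 3              # valid
        + [True] + [False] * 2     # test
    ),
})

# Row count and conversion rate per split, plus each split's share of all rows
summary = df.groupby("split")["converted_within_90_days"].agg(n="size", rate="mean")
summary["share"] = summary["n"] / len(df)
```

With stratified splits, the `rate` column should be close to the overall conversion rate in every split of the real data.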

Note on account overlap: ~93% of test-set accounts also appear in the training set (splits are keyed on lead_id). Headline AUC overstates generalisation to unseen accounts. For a faithful out-of-sample estimate, use scikit-learn's GroupKFold and pass groups=df["account_id"] to its split() method, so that no account appears in both the training and the test fold.
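Account-grouped cross-validation can be set up roughly as follows. This is a sketch on toy arrays with hypothetical account IDs; on the real data you would pass `df["account_id"]` as `groups`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in: 12 leads spread over 4 accounts (hypothetical IDs)
account_id = np.array(
    ["acct_a"] * 3 + ["acct_b"] * 3 + ["acct_c"] * 3 + ["acct_d"] * 3
)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 0, 1] * 4)

cv = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in cv.split(X, y, groups=account_id):
    # Accounts present on both sides of the fold; should always be empty
    shared = set(account_id[train_idx]) & set(account_id[test_idx])
    overlaps.append(len(shared))
```

Note that `groups` goes to `split()`, not to the `GroupKFold` constructor. GroupKFold does not stratify on the target, so with an ~8% conversion rate the per-fold positive counts can vary noticeably.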

Reproducibility

Generated with leadforge v1.0.0, recipe b2b_saas_procurement_v1, seed 42, difficulty advanced. To reproduce:

pip install leadforge
leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
                   --mode student_public --difficulty advanced --out my_bundle

Every file in this bundle is SHA-256 hashed in manifest.json. Run leadforge validate my_bundle to verify integrity.

Author: Shay Palachy Affek · Kaggle · GitHub

Caveats

Synthetic data only. No real company, customer, or market is represented.
AUC does not distinguish tiers. LR AUC is ~0.62 in this tier — similar to the other two tiers. The tiers differ in conversion rate (43% / 22% / 8%), noise, and missing values — not in rank discrimination. Use average precision, P@K, and calibration metrics to see the difficulty gradient.
Artifact zeros in count/duration features. Gaussian noise is applied before MCAR missingness, and any value pushed below zero is clamped to zero. Suspicious zero clusters in count and duration features (e.g. days_since_last_touch = 0) may therefore be noise artifacts rather than genuine zeros; treat them as intentional data-cleaning material.
~93% train/test account overlap. Splits are keyed on lead_id; most test accounts also appear in train. Headline metrics overstate generalisation to unseen accounts.
Snapshot window. Engagement features cover days 0–30 per lead; the label resolves at day 90. total_touches_all is the intentional exception — it aggregates over the full 90-day window and is a leakage trap.
Public version. The hidden causal graph, latent trait scores, and mechanism parameters are withheld. The instructor companion bundle includes them.
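The artifact-zero caveat above is easier to see with a small simulation. The sketch below is illustrative only: the noise scale (standard deviation 2.0) and the count range are made-up assumptions, not leadforge's actual mechanism parameters, which are withheld in the public version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "clean" counts that are never zero (values 1..5)
true_counts = rng.integers(1, 6, size=10_000)

# Add Gaussian noise, then clamp negatives to zero
noisy = true_counts + rng.normal(0.0, 2.0, size=true_counts.size)
observed = np.clip(noisy, 0.0, None)

zero_share_true = float((true_counts == 0).mean())
zero_share_observed = float((observed == 0).mean())
```

Although no clean value is zero, a visible share of observed values lands exactly on zero after clamping. In practice this argues for treating exact zeros in noisy count features with suspicion, for example by adding an is-zero indicator or comparing model performance with zeros recoded as missing.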

LeadForge Lead Scoring v1 — Advanced

split	account_id	industry	region	employee_band	estimated_revenue_band	process_maturity_band	contact_id	role_function	seniority	buyer_role	lead_id	lead_created_at	lead_source	touch_count	inbound_touch_count	outbound_touch_count	session_count	pricing_page_views	demo_page_views	total_session_duration_seconds	touches_days_0_7	touches_last_7_days	days_since_first_touch	activity_count	days_since_last_touch	opportunity_created	has_open_opportunity	opportunity_estimated_acv	expected_acv	total_touches_all	converted_within_90_days
train	acct_000773	logistics	UK	200-499	$50M-$200M	low	cnt_001124	procurement_manager	vp	end_user	lead_004250	2024-01-08	inbound_marketing		8.0	0.0	1.0	1.0	0.0	484.0	1.0	1.0		6.0		True	True	37904.684742284786	0.0	17	False
train	acct_000043	logistics	UK	500-999	$10M-$50M	high	cnt_003354	it_director	c_suite	technical_evaluator	lead_001565	2024-01-01	inbound_marketing	9.0	9.0	0.0		4.0	0.0	900.0	4.0	1.0	30.62883305317309	4.0		True	True	52534.606039186896	80533.53467639766	11	False
train	acct_000319	logistics	US	200-499	$1M-$10M	medium	cnt_000537	ap_manager	director	champion	lead_002296	2024-01-05	partner_referral	9.0	0.0	9.0	4.0		0.0	1065.0	4.0	2.0	27.053826214825193	2.0	3.2322809610775023	False	False		15449.44694066237	17	False
train	acct_000476	healthcare_non_clinical	US	200-499	$10M-$50M	medium	cnt_001478	ap_manager	director	champion	lead_003320	2024-01-29	inbound_marketing	13.0	13.0	0.0	4.0			1381.0	1.0	1.0	45.22346496603029	4.0	0.0	True	True			14	False
train	acct_000243	manufacturing	US	1000-1999	$10M-$50M	low	cnt_000276	vp_finance	individual_contributor	economic_buyer	lead_001192	2024-01-01	sdr_outbound			8.0	2.0		0.0	1128.0	4.0	2.0	45.22346496603029	1.0		True	True	47588.58048001376	33774.175358841734	11	False
train	acct_000353	manufacturing	UK	2000+	$1M-$10M	medium	cnt_002665	procurement_manager	vp	end_user	lead_000123	2024-01-27	sdr_outbound	6.0	0.0	6.0	2.0	0.0	0.0		4.0			3.0	3.1313429432706332	False	False		23500.1989513667	15	False
train	acct_000029	healthcare_non_clinical	US	500-999	$10M-$50M	low	cnt_001377	procurement_manager	individual_contributor	end_user	lead_001076	2024-01-18	sdr_outbound	11.0	0.0	11.0	0.0	0.0	0.0	0.0	5.0	2.0	30.198516796116404	2.0	3.7664633484817576	False	False		75796.79050905118	14	False
train	acct_001411	professional_services	US	200-499	$10M-$50M	high	cnt_002913	vp_finance	c_suite	economic_buyer	lead_001584	2024-01-09	sdr_outbound	3.0	0.0	3.0			0.0	234.0	2.0		28.188032773479	6.0	3.2455395630901305	True	True	258999.4780423338	0.0	13	False

lead_scoring.csv (1270 KB)
