About Dataset

B2B Lead Scoring Dataset — Intermediate Tier

This is a synthetic dataset for practicing B2B lead scoring. It was generated by leadforge, an open-source Python framework for producing realistic CRM/funnel training data. No real company, customer, or transaction is represented.

What you are predicting: Each row is a sales lead at a fictional B2B SaaS company. The task is binary classification:

converted_within_90_days — did this lead close as a paid deal within 90 days?

Features capture the first 30 days of CRM activity per lead (email/call touches, product sessions, deal stage, account firmographics). The label is derived from simulated events — never directly sampled — so there is genuine causal structure behind the signal.

This tier: intermediate

Property	Value
Conversion rate	~22%
Signal strength	0.70 / 1.0 (medium)
Noise level	0.30 / 1.0 (moderate)
Missing values	~8%
LR AUC (test, 5-seed median)	0.662
GBM AUC (test, 5-seed median)	0.634
Average precision (LR)	0.332
Precision @100	0.33

The intermediate tier is the default benchmark. Conversion rate is ~22% — more realistic for B2B SaaS than the intro tier — and noise is moderate enough that simple feature engineering starts to matter. GBM does not consistently beat logistic regression here (the snapshot is dominated by near-linear features); that gap is worth investigating. Calibration becomes important at this prevalence — a model that predicts the right rank order can still be badly miscalibrated.
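The gap between rank order and calibration can be illustrated on synthetic scores. The sketch below does not use the dataset: it simulates probabilities with a mean near 22% (a Beta(2, 7) draw, an assumption chosen only to match the stated prevalence) and then applies a monotone distortion. AUC should stay the same because the ordering is unchanged, while the Brier score gets worse.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for model output (assumption): true conversion
# probabilities averaging about 22%, with labels drawn from them.
p_true = rng.beta(2.0, 7.0, size=20_000)
y = rng.binomial(1, p_true)

# A strictly increasing distortion keeps the ranking but inflates every score.
p_distorted = np.sqrt(p_true)

auc_true = roc_auc_score(y, p_true)
auc_distorted = roc_auc_score(y, p_distorted)
brier_true = brier_score_loss(y, p_true)
brier_distorted = brier_score_loss(y, p_distorted)

print(f"AUC:   calibrated={auc_true:.3f}  distorted={auc_distorted:.3f}")
print(f"Brier: calibrated={brier_true:.3f}  distorted={brier_distorted:.3f}")
print(f"Mean score: calibrated={p_true.mean():.3f}  "
      f"distorted={p_distorted.mean():.3f}  base rate={y.mean():.3f}")
```

On real model output, the same comparison can be run with the test-split labels and predicted probabilities in place of the simulated arrays.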

This dataset ships in three tiers — intro → intermediate → advanced — with decreasing signal, lower conversion rates, and heavier noise and missingness. All three tiers share the same schema and simulate the same fictional B2B world.

Table inventory

Table	Rows	Description
accounts	1,500	One row per company
contacts	4,200	One row per buyer-side individual (multiple per account)
leads	5,000	One row per lead — the prediction unit
touches	38,724	Marketing / SDR outreach events (first 30 days per lead)
sessions	10,012	Product demo or trial sessions (first 30 days per lead)
sales_activities	20,679	CRM activities: calls, emails, meetings (first 30 days per lead)
opportunities	4,255	Deal records opened before the 30-day snapshot

Snapshot-safe: event tables contain only rows with timestamps ≤ 30 days from lead creation. Outcome columns (converted_within_90_days, conversion_timestamp, close_outcome) are excluded from the public relational tables — they appear only in the task splits.
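The snapshot rule can be checked directly once an event table is joined back to its lead. The sketch below runs on toy frames; the timestamp column name `touch_timestamp` is hypothetical (check the actual schema), and it assumes each event table carries `lead_id` and that the leads table carries `lead_created_at`.

```python
import pandas as pd

def snapshot_violations(leads, events, ts_col, window_days=30):
    """Return event rows timestamped more than `window_days` after lead creation."""
    merged = events.merge(
        leads[["lead_id", "lead_created_at"]], on="lead_id", how="left"
    )
    age = pd.to_datetime(merged[ts_col]) - pd.to_datetime(merged["lead_created_at"])
    return merged[age > pd.Timedelta(days=window_days)]

# Toy data standing in for tables/leads.parquet and tables/touches.parquet.
leads = pd.DataFrame({
    "lead_id": ["lead_a", "lead_b"],
    "lead_created_at": ["2024-01-01", "2024-01-05"],
})
touches = pd.DataFrame({
    "lead_id": ["lead_a", "lead_a", "lead_b"],
    # "touch_timestamp" is a hypothetical column name.
    "touch_timestamp": ["2024-01-10", "2024-02-15", "2024-02-04"],
})

violations = snapshot_violations(leads, touches, ts_col="touch_timestamp")
print(violations)
```

In the toy data only the 15 February touch for `lead_a` falls outside the window; an event exactly 30 days after creation is treated as inside it, matching the ≤ 30 days rule.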

Features

Category	Columns	Examples
Account	6	`account_id`, `industry`, `region`
Contact	4	`contact_id`, `role_function`, `seniority`
Lead metadata	3	`lead_id`, `lead_created_at`, `lead_source`
Engagement	11	`touch_count`, `inbound_touch_count`, `outbound_touch_count`
Sales	6	`activity_count`, `days_since_last_touch`, `opportunity_created`
Target	1	`converted_within_90_days`

⚠ Intentional leakage trap: total_touches_all aggregates touches over the full 90-day window (not just the 30-day feature window) and is deliberately retained as a leakage-detection teaching exercise. It is flagged leakage_risk=True in feature_dictionary.csv. Drop it from your feature set unless you are studying leakage.

See feature_dictionary.csv for the full column-by-column specification.
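Dropping flagged columns can be driven by the feature dictionary rather than hard-coded. The sketch below uses a toy frame in place of feature_dictionary.csv; only the `leakage_risk` flag is documented above, so the column holding feature names (called `name` here) is an assumption to adjust against the real file.

```python
import pandas as pd

def leakage_flagged(feature_dict, name_col="name", flag_col="leakage_risk"):
    """List feature names whose leakage flag is true (bool or 'True' string)."""
    flags = feature_dict[flag_col].astype(str).str.strip().str.lower() == "true"
    return feature_dict.loc[flags, name_col].tolist()

# Toy stand-in for pd.read_csv("feature_dictionary.csv").
# The "name" column is hypothetical; "leakage_risk" is documented.
feature_dict = pd.DataFrame({
    "name": ["touch_count", "session_count", "total_touches_all"],
    "leakage_risk": [False, False, True],
})

# Toy stand-in for the feature frame.
df = pd.DataFrame({
    "touch_count": [5.0, 9.0],
    "session_count": [2.0, 3.0],
    "total_touches_all": [9, 17],
})

to_drop = leakage_flagged(feature_dict)
X = df.drop(columns=to_drop)
print(to_drop, list(X.columns))
```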

The simulated world

The dataset simulates a fictional company — Veridian Technologies — a Series B startup (Austin, TX, founded 2017) selling Veridian Procure, a cloud procurement / AP automation SaaS. Everything below is invented:

Target customers: 200–2,000-employee firms in the US and UK (manufacturing, logistics, healthcare, professional services)
Deal range: $18,000–$120,000 ACV; average deal $42,000; average sales cycle 45 days
Go-to-market: 45% inbound marketing, 35% SDR outbound, 20% partner referrals
Buyer personas: VP Finance (economic buyer), AP Manager (champion), IT Director (technical evaluator), Procurement Manager (end user)

In this public version, the hidden causal graph, latent trait scores, and mechanism parameters are withheld. The instructor companion bundle includes them.

How to load

import pandas as pd

# Flat CSV — all leads, all splits combined (convenient for exploration)
df = pd.read_csv("lead_scoring.csv")
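# Drop the target, the split marker, and the flagged leakage column from the features
X = df.drop(columns=["converted_within_90_days", "split", "total_touches_all"])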
y = df["converted_within_90_days"]

# Parquet task splits — recommended for model training
train = pd.read_parquet("tasks/converted_within_90_days/train.parquet")
valid = pd.read_parquet("tasks/converted_within_90_days/valid.parquet")
test  = pd.read_parquet("tasks/converted_within_90_days/test.parquet")

# Relational tables — for feature engineering
leads   = pd.read_parquet("tables/leads.parquet")
touches = pd.read_parquet("tables/touches.parquet")
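As a rough illustration of feature engineering from the relational tables, event rows can be aggregated per lead and joined back onto the leads table. The sketch below uses toy frames; the `direction` column is hypothetical and stands in for whatever event attributes the real touches table provides.

```python
import pandas as pd

# Toy stand-ins for tables/leads.parquet and tables/touches.parquet.
leads = pd.DataFrame({"lead_id": ["lead_000001", "lead_000002", "lead_000003"]})
touches = pd.DataFrame({
    "lead_id": ["lead_000001", "lead_000001", "lead_000002"],
    "direction": ["inbound", "outbound", "outbound"],  # hypothetical column
})

# One row per lead: total touches and outbound touches.
per_lead = (
    touches.groupby("lead_id")
    .agg(
        n_touches=("direction", "count"),
        n_outbound=("direction", lambda s: int((s == "outbound").sum())),
    )
    .reset_index()
)

# Left join keeps leads without any events; their counts become 0.
features = leads.merge(per_lead, on="lead_id", how="left")
count_cols = ["n_touches", "n_outbound"]
features[count_cols] = features[count_cols].fillna(0).astype(int)
print(features)
```

Filling absent leads with 0 is a modelling choice: it treats "no recorded events" as zero activity, which is different from the missing values already present in the prepared feature columns.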

Splits are 70 / 15 / 15 (train / valid / test), stratified on the target, deterministic given seed 42.

Note on account overlap: ~93% of test-set accounts also appear in the training set (splits are keyed on lead_id). Headline AUC overstates generalisation to unseen accounts. For a faithful out-of-sample estimate, use scikit-learn's GroupKFold and pass groups=df["account_id"] to its split() method.
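A minimal sketch of account-grouped cross-validation follows, using a toy frame in place of lead_scoring.csv. It only demonstrates that no account appears on both sides of a fold; model fitting is left out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-in for lead_scoring.csv: 12 leads spread over 6 accounts.
df = pd.DataFrame({
    "account_id": [f"acct_{i // 2:06d}" for i in range(12)],
    "touch_count": np.arange(12, dtype=float),
    "converted_within_90_days": [False, True] * 6,
})

gkf = GroupKFold(n_splits=3)
folds = list(
    gkf.split(df, df["converted_within_90_days"], groups=df["account_id"])
)

for fold, (train_idx, test_idx) in enumerate(folds):
    train_accounts = set(df["account_id"].iloc[train_idx])
    test_accounts = set(df["account_id"].iloc[test_idx])
    # Every account lands entirely in train or entirely in test.
    print(fold, "shared accounts:", len(train_accounts & test_accounts))
```

GroupKFold does not stratify on the target. If fold-level conversion rates drift too far, scikit-learn's StratifiedGroupKFold accepts the same groups argument and is worth trying.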

Reproducibility

Generated with leadforge v1.0.0, recipe b2b_saas_procurement_v1, seed 42, difficulty intermediate. To reproduce:

pip install leadforge
leadforge generate --recipe b2b_saas_procurement_v1 --seed 42 \
                   --mode student_public --difficulty intermediate --out my_bundle

Every file in this bundle is SHA-256 hashed in manifest.json. Run leadforge validate my_bundle to verify integrity.
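For a manual spot check of a single file, a SHA-256 digest can be computed with the standard library. The layout of manifest.json is not described here, so this sketch only computes the digest; comparing it with the corresponding manifest entry is left to the reader, and `leadforge validate` remains the documented route.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file; in practice pass a path inside the bundle.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"lead_id\n")
demo_digest = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```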

Author: Shay Palachy Affek · Kaggle · GitHub

Caveats

Synthetic data only. No real company, customer, or market is represented.
AUC does not distinguish tiers. LR AUC is ~0.66 across all three tiers by design. The tiers differ in conversion rate (43% / 22% / 8%), noise, and missing values — not in rank discrimination. Use average precision, P@K, and calibration metrics to see the difficulty gradient.
~93% train/test account overlap. Splits are keyed on lead_id; most test accounts also appear in train. Headline metrics overstate generalisation to unseen accounts.
Snapshot window. Engagement features cover days 0–30 per lead; the label resolves at day 90. total_touches_all is the intentional exception — it aggregates over the full 90-day window and is a leakage trap.
Public version. The hidden causal graph, latent trait scores, and mechanism parameters are withheld. The instructor companion bundle includes them.
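The ranking metrics recommended above are easiest to read against the prevalence baseline, since a random ranker scores roughly the conversion rate on both average precision and P@K. The sketch below computes them on a tiny hand-made example (not dataset values); `precision_at_k` is a helper defined here, not a library function.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def precision_at_k(y_true, scores, k):
    """Share of positives among the k highest-scored rows."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return float(y_true[order[:k]].mean())

# Ten toy leads, already sorted by score; positives sit at ranks 1, 3 and 7.
y_true = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

prevalence = float(y_true.mean())
p_at_3 = precision_at_k(y_true, scores, k=3)
ap = average_precision_score(y_true, scores)

print(f"prevalence={prevalence:.3f}  P@3={p_at_3:.3f}  AP={ap:.3f}")
print(f"lift over random: P@3 x{p_at_3 / prevalence:.2f}, AP x{ap / prevalence:.2f}")
```

Reporting lift over prevalence alongside the raw value makes scores comparable across tiers whose conversion rates differ.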

LeadForge Lead Scoring v1 — Intermediate

split	account_id	industry	region	employee_band	estimated_revenue_band	process_maturity_band	contact_id	role_function	seniority	buyer_role	lead_id	lead_created_at	lead_source	touch_count	inbound_touch_count	outbound_touch_count	session_count	pricing_page_views	demo_page_views	total_session_duration_seconds	touches_days_0_7	touches_last_7_days	days_since_first_touch	activity_count	days_since_last_touch	opportunity_created	has_open_opportunity	opportunity_estimated_acv	expected_acv	total_touches_all	converted_within_90_days
train	acct_000773	logistics	UK	200-499	$50M-$200M	low	cnt_001124	procurement_manager	vp	end_user	lead_004250	2024-01-08	inbound_marketing	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0		0.0		False	False		66699.05653659195	0	False
train	acct_000043	logistics	UK	500-999	$10M-$50M	high	cnt_003354	it_director	c_suite	technical_evaluator	lead_001565	2024-01-01	inbound_marketing	5.0	5.0	0.0	2.0	0.0	0.0	632.0	1.0	0.0	30.333835742092244	0.0	11.003359485085898	False	False		58372.35262983076	9	False
train	acct_000319	logistics	US	200-499	$1M-$10M	medium	cnt_000537	ap_manager	director	champion	lead_002296	2024-01-05	partner_referral	9.0	0.0	9.0	3.0	0.0	0.0	900.0	3.0	2.0	27.96681262771429	2.0	5.711191915200993	True	True	17756.716765584522	15462.350089440228	17	False
train	acct_000476	healthcare_non_clinical	US	200-499	$10M-$50M	medium	cnt_001478	ap_manager	director	champion	lead_003320	2024-01-29	inbound_marketing	5.0	5.0	0.0	1.0		2.0	435.0	0.0	1.0	43.508329886600016	2.0	0.6712216272156128	True	True	36370.40103541685	30489.660790082788	13	False
train	acct_000243	manufacturing	US	1000-1999	$10M-$50M	low	cnt_000276	vp_finance	individual_contributor	economic_buyer	lead_001192	2024-01-01	sdr_outbound			5.0	1.0		0.0	171.0	1.0	1.0	31.217369696525743	5.0		True	True	50459.383816603025	42999.143947625904	10	False
train	acct_000353	manufacturing	UK	2000+	$1M-$10M	medium	cnt_002665	procurement_manager	vp	end_user	lead_000123	2024-01-27	sdr_outbound	1.0	0.0	1.0	1.0	0.0	0.0	359.0	1.0	0.0	27.58339746775941	0.0	28.598940785518547	False	False		24194.29230242279	1	False
train	acct_000029	healthcare_non_clinical	US	500-999	$10M-$50M	low	cnt_001377	procurement_manager	individual_contributor	end_user	lead_001076	2024-01-18	sdr_outbound	10.0	0.0	10.0	4.0	0.0	0.0	765.0	6.0	1.0	30.10538886531948	5.0	2.8763640801052426	False	False		66172.23795336873	17	False
train	acct_001411	professional_services	US	200-499	$10M-$50M	high	cnt_002913	vp_finance	c_suite	economic_buyer	lead_001584	2024-01-09	sdr_outbound	12.0	0.0	12.0	2.0	1.0	0.0	515.0	5.0	4.0	29.038060387156584	6.0	3.718211110884809	True	True	242469.1078369078	21952.764727910577	18	False

lead_scoring.csv (1323 KB)
