We shared prospective conversion prediction datasets of time-ordered user activity trails collected across many different sources on the Web.
The datasets are designed for prospective advertising, a form of targeting in which advertisers wish to reach only potential new users. In the updated version, retargeting signals are included in an additional dataset to enable appropriate A/B tests of new algorithms. The datasets contain millions of timestamp-annotated sequences of user activities preceding events of interest.
Below is the dataset description shared on Yahoo Webscope:
We share user activity trail datasets of timestamp-annotated sequences of activities collected from users online. The activities are derived from various sources, e.g., Yahoo Search, commercial email receipts, news reading and other content on publisher webpages associated with Verizon Media (such as the Yahoo and AOL homepages, Yahoo Finance, Sports and News, HuffPost, and TechCrunch), and advertising data from Yahoo Gemini and Verizon Media DSP, including ad activity and advertiser data (e.g., ad impressions, clicks, conversions, and site visits). These sequences precede events of particular interest to advertisers. Two types of events of interest are considered: conversions, in a retargeting and a prospecting setup, and retargeting events. These three setups yield three datasets for each of two major advertisers, from the retail and communications categories, running conversion campaigns with Verizon Media. The data were collected over 100 days ending in May 2019, for a total of 6 distinct datasets.
– The first dataset type is a conversion prediction dataset with a sequence of all user activities preceding the conversion with an advertiser.
– The second dataset type is a prospective conversion prediction dataset with a sequence of eligible (non-retargeting) user activities preceding the conversion with an advertiser.
– The third dataset type contains a sequence of all user activities preceding the first recorded retargeting event of a user.
For dataset labels, Advertiser 1 defined three unique conversions, while Advertiser 2 defined one. Both advertisers defined a single retargeting target rule.
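The three dataset types can be illustrated with a minimal Python sketch. It is not the released file format or the procedure used to build the datasets: the `Activity` record, its field names, and the `is_retargeting` / `is_conversion` flags are hypothetical, and reading "eligible (non-retargeting)" as "drop retargeting activities from the sequence" is one possible interpretation only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Activity:
    # All fields are hypothetical; the real schema is not described here.
    timestamp: int                # anonymized time, meaningful within one trail
    activity_id: str              # anonymized activity token
    is_retargeting: bool = False  # assumed marker for a retargeting event
    is_conversion: bool = False   # assumed marker for a conversion event


def _prefix_before(trail: List[Activity],
                   is_target: Callable[[Activity], bool]) -> Optional[List[Activity]]:
    """Return the time-ordered activities before the first target event."""
    ordered = sorted(trail, key=lambda a: a.timestamp)
    for i, activity in enumerate(ordered):
        if is_target(activity):
            return ordered[:i]
    return None  # the trail contains no event of interest


def conversion_sequence(trail: List[Activity]) -> Optional[List[Activity]]:
    """Type 1: all activities preceding the conversion."""
    return _prefix_before(trail, lambda a: a.is_conversion)


def prospecting_sequence(trail: List[Activity]) -> Optional[List[Activity]]:
    """Type 2: only non-retargeting activities preceding the conversion."""
    seq = conversion_sequence(trail)
    if seq is None:
        return None
    return [a for a in seq if not a.is_retargeting]


def retargeting_sequence(trail: List[Activity]) -> Optional[List[Activity]]:
    """Type 3: all activities preceding the first retargeting event."""
    return _prefix_before(trail, lambda a: a.is_retargeting)
```

For a trail with activities `a1`, a retargeting event `a2`, `a3`, and then a conversion, this sketch gives `a1, a2, a3` for the first type, `a1, a3` for the second, and `a1` for the third.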
These datasets contain no user information: activities are anonymized, and timestamps are misaligned between different user trails while all necessary information is preserved. Researchers may use the datasets to validate conversion prediction and other systems and algorithms that run on users’ historical activity data. The datasets may also serve as a testbed for modeling sequences of user activities and temporal information with deep learning or other machine learning and statistical techniques, for a variety of supervised and unsupervised tasks.
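Because timestamps are misaligned between trails, absolute times are presumably not comparable across users, so temporal features are more safely computed within a single trail. The sketch below assumes, purely for illustration, that the misalignment behaves like a constant offset per trail; the description does not state how it was actually done. The function names are hypothetical.

```python
from typing import List


def inter_event_gaps(timestamps: List[int]) -> List[int]:
    """Gaps between consecutive events of one trail, in time order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def shift_trail(timestamps: List[int], offset: int) -> List[int]:
    """Apply an assumed per-trail constant offset to every timestamp."""
    return [t + offset for t in timestamps]
```

Under that assumption, `inter_event_gaps` returns the same values before and after `shift_trail`, which is why within-trail gaps are a reasonable input for sequence models even when absolute timestamps are not.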