Jump to content

Data Platform/Fundraising

From Wikitech

The fundraising department uses the following statistics from the main wiki cluster to inform campaigns and update banners in realtime:

Banners are displayed on wikis using the CentralNotice extension, which targets pageviews based on country, project, language, device and logged-in status.

On any pageview that meets the criteria to be included in an active CentralNotice campaign, the system may send a request to beacon/impression. Such requests include many data points about whether or not a banner was actually displayed, and why. Data points are sent as URL parameters. For details, see ext.centralNotice.display.state.js.

By default, 1% of pageviews in a campaign are randomly selected to send a beacon/impression request. However, for pageviews in fundraising campaigns, 100% of pageviews cause a request; this is necessary to provide accurate data for A/B testing of fundraising banners.

Calls to beacon/impression are processed on WMF servers in several ways, explained below.

Kafka/Kafkatee/pgehres database

Requests to beacon/impression, sampled at a rate of 1:10. We store timestamp rounded to the minute, plus querystring variables 'banner', 'campaign', 'project', 'country', 'language', 'result', 'reason'. Data is available nearly in real time.

Hive

Like other HTTP requests, calls to beacon/impression are logged in the wmf.webrequest table in Hive. All data points provided by CentalNotice may be selected or filtered in Hive QL via the uri_query column.

Druid/Pivot

Aggregated logs of beacon/impression requests are stored in Druid, in the Banner activity dataset. This allows fast querying on many common criteria. Pivot provides quick visualizations of this data.

EN6C Banner History Documentation

Purpose:

The December en6C campaign represented around 50% of WMF’s annual revenue through banners and email in FY1617. Because the Big English campaign represents such a large part of the Foundation’s revenue, it is critical that we are able to identify trends in banner impressions and donations.

Banner history data are recorded for all donors and a 1% sample of impressions. This data set contains the banner name, timestamp and status (whether the banner was shown, closed, etc.) of up to 30 of the most recent banners seen by a user within the last 30 days. Eventlogging data are typically deleted after 90 days.

We would like to preserve the banner history data for the Big English campaign so that they can be used for continued analysis in the future and as additional fundraising questions arise. Saving banner history data for Big English will help us continue to improve en6C banners through retrospective analyses. This data set can help us answer questions such as: Which banner elements are most often associated with users closing a banner? Which banner elements are most associated with users donating on a 2nd banner or a 3rd banner? Are there any banner elements that are more likely to lead a donor to give through email the following year?

Data Format:

The full banner history data set contains an event blob of banner history log entries (banner name, timestamp, status) within an event capsule with more detailed user information. The saved data sets from 2016 and 2017 contain only the event blob. The only identifying information within the event blob is a randomly generated id. For donors, the id can be linked to the donation in civi. For sampled impressions, the id does not link to anything and the event is anonymous.

Example for a sampled impression: {"i":"ad6365d06de1bc1a","l":["B1718_1107_en6C_mob_p1_lg_dsn_cnt|C1718_en6C_mob_FR|1510075057|6","B1718_120117_en6C_mob_p2_sm_txt_cnt|C1718_en6C_mob_FR|1512342657|6","B1718_120117_en6C_mob_p2_sm_txt_cnt|C1718_en6C_mob_FR|1512345499|6","B1718_120117_en6C_mob_p2_sm_txt_cnt|C1718_en6C_mob_FR|1512345591|6","B1718_120117_en6C_mob_p2_sm_txt_cnt|C1718_en6C_mob_FR|1512345604|6","B1718_120117_en6C_mob_p2_sm_txt_cnt|C1718_en6C_mob_FR|1512345616|6"],"n":6,"r":0.01}

Procedure:

Data were saved by day using Hive code adapted from wikitech. See below for a sample hql file. Each hql file was submitted with a beeline command and saved to an analogous out file e.g. beeline -f banner_history_2017_12_04.hql > banner_history_2017_12_04.out

Current file format and location:

Daily hql files and resulting out files are temporarily saved in jsamra’s home directory on stat1007 until a suitable permanent location can be found. Data for 2018 are located at /home/jsamra/banner_history_big_english_2018, data for 2017 are located at /home/jsamra/banner_history_big_english_2017, and data for 2016 are located at /home/jsamra/banner_history_big_english_2016


Sample hql code for saving one day of data (12/04/2017):




-- save data from 12/04/2017 hours 00-23

ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

-- Make sure you don't create tables in the default Hive database.
USE cps;

-- Create a table with a single string field
CREATE EXTERNAL TABLE `CentralNoticeBannerHistory20171204` (
 `json_string` string
)
PARTITIONED BY (
 year int,
 month int,
 day int,
 hour int
)
STORED AS INPUTFORMAT
 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
 '/some/location/';

-- Add a partition for each hour

ALTER TABLE CentralNoticeBannerHistory20171204
ADD PARTITION (year=2017, month=12, day=04, hour=00)
LOCATION '/some/location/hourly/2017/12/04/00';

SELECT get_json_object(json_string, '$.event') as bh_event
FROM cps.CentralNoticeBannerHistory20171204
WHERE year=2017 AND month=12 AND day=04 AND hour=15;

Landing page impressions

Hits to selected URLs on donate.wikimedia.org and wikimediafoundation.org. Querystring variables stored include utm_*, project, language, and country

Unique email clicks

For landing page impressions with a contact_id, we insert utm_source, utm_campaign, contact_id, and link_id into a table with a unique constraint on those columns, using 'ON DUPLICATE KEY' to discard clicks after the initial one for a given donor and email