Obsolete:2015 data warehouse experiments/2014-12-02 verifications
Use Case: Newly Registered Users on September 2014
Newly register data (from user table) matches exactly for dewiki in the month period we have looked at except for 2 days (see below). For enwiki differences are more prevalent, in a month period more than 10 records differ but only in one or two counts per day. In both cases differences are due to same causes, a jump of one in one day that gets counted on the next day.
dewiki
Data for newly registered users in September for dewiki matches exactly for all days but two: 20140911 and 20140912 whose numbers among the warehouse and dewiki differ on 1 unit. The disparity on the 20140911 registrations comes from a registration that appears on the warehouse at 2014-09-11 23:59 but in dewiki-prod appears at 20140912000. Last registration for dewiki in prod in 2014-09-11 happens at 201409112323.
Below selects for dewiki.
mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 10), count(*) from user where wiki='dewiki' and user_registration between '2014-09-01 00:00:00' and '2014-09-30 23:59:59' and registration_type='create' group by 1; +-----------------------------+----------+ | left(user_registration, 10) | count(*) | +-----------------------------+----------+ | 2014-09-01 | 542 | | 2014-09-02 | 444 | | 2014-09-03 | 524 | | 2014-09-04 | 465 | | 2014-09-05 | 530 | | 2014-09-06 | 493 | | 2014-09-07 | 587 | | 2014-09-08 | 538 | | 2014-09-09 | 554 | | 2014-09-10 | 572 | | 2014-09-11 | 578 | <- | 2014-09-12 | 537 | <- | 2014-09-13 | 551 | | 2014-09-14 | 613 | | 2014-09-15 | 545 | | 2014-09-16 | 568 | | 2014-09-17 | 527 | | 2014-09-18 | 594 | | 2014-09-19 | 558 | | 2014-09-20 | 556 | | 2014-09-21 | 696 | | 2014-09-22 | 573 | | 2014-09-23 | 531 | | 2014-09-24 | 532 | | 2014-09-25 | 505 | | 2014-09-26 | 467 | | 2014-09-27 | 489 | | 2014-09-28 | 544 | | 2014-09-29 | 521 | | 2014-09-30 | 550 | +-----------------------------+----------+ 30 rows in set (0.25 sec)
Results from production:
select left(log_timestamp, 8), count(*) from logging where log_action='create' and log_type='newusers' and log_timestamp between '20140901000000' and '20140930235959' group by 1;
left(log_timestamp, 8) count(*) 20140901 542 20140902 444 20140903 524 20140904 465 20140905 530 20140906 493 20140907 587 20140908 538 20140909 554 20140910 572 20140911 577 <- 20140912 538 <- 20140913 551 20140914 613 20140915 545 20140916 568 20140917 527 20140918 594 20140919 558 20140920 556 20140921 696 20140922 573 20140923 531 20140924 532 20140925 505 20140926 467 20140927 489 20140928 544 20140929 521 20140930 550
20140911 detail for dewiki
The disparity on the 20140911 registrations comes from a registration that appears on the warehouse at 2014-09-11 23:59 but in dewiki prod appears at 20140912000. Last registration for dewiki in prod in 2014-09-11 happens at 201409112323.
On warehouse mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 13), count(*) from user where wiki='dewiki' and user_registration between '2014-09-11 00:00:00' and '2014-09-11 23:59:59' and registration_type='create' group by 1; +-----------------------------+----------+ | left(user_registration, 13) | count(*) | +-----------------------------+----------+ | 2014-09-11 00 | 7 | | 2014-09-11 01 | 1 | | 2014-09-11 02 | 2 | | 2014-09-11 03 | 3 | | 2014-09-11 04 | 6 | | 2014-09-11 05 | 10 | | 2014-09-11 06 | 16 | | 2014-09-11 07 | 32 | | 2014-09-11 08 | 36 | | 2014-09-11 09 | 28 | | 2014-09-11 10 | 27 | | 2014-09-11 11 | 45 | | 2014-09-11 12 | 42 | | 2014-09-11 13 | 37 | | 2014-09-11 14 | 42 | | 2014-09-11 15 | 33 | | 2014-09-11 16 | 29 | | 2014-09-11 17 | 29 | | 2014-09-11 18 | 38 | | 2014-09-11 19 | 39 | | 2014-09-11 20 | 35 | | 2014-09-11 21 | 24 | | 2014-09-11 22 | 8 | | 2014-09-11 23 | 9 | <- +-----------------------------+----------+ 24 rows in set (0.05 sec)
On prod
select left(log_timestamp, 10), count(*) from logging where log_action='create' and log_type='newusers' and log_timestamp between '20140911000000' and '20140911235959' group by 1;
left(log_timestamp, 10) count(*) 2014091100 7 2014091101 1 2014091102 2 2014091103 3 2014091104 6 2014091105 10 2014091106 16 2014091107 32 2014091108 36 2014091109 28 2014091110 27 2014091111 45 2014091112 42 2014091113 37 2014091114 42 2014091115 33 2014091116 29 2014091117 29 2014091118 38 2014091119 39 2014091120 35 2014091121 24 2014091122 8 2014091123 8 <-
enwiki
Data for newly register for enwiki for September differs among the warehouse and enwiki-prod more days than enwiki does but causes are the same.
Warehouse
mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 10), count(*) from user where wiki='enwiki' and user_registration between '2014-09-01 00:00:00' and '2014-09-15 23:59:59' and registration_type='create' group by 1; +-----------------------------+----------+ | left(user_registration, 10) | count(*) | +-----------------------------+----------+ | 2014-09-01 | 8274 | | 2014-09-02 | 7482 | | 2014-09-03 | 8038 | | 2014-09-04 | 8065 | | 2014-09-05 | 7327 | | 2014-09-06 | 7376 | <- | 2014-09-07 | 8109 | <- | 2014-09-08 | 8214 | | 2014-09-09 | 8686 | | 2014-09-10 | 8139 | | 2014-09-11 | 7985 | <- | 2014-09-12 | 7706 | <- | 2014-09-13 | 7553 | | 2014-09-14 | 7912 | <- | 2014-09-15 | 8137 | +-----------------------------+----------+
enwiki production
select left(log_timestamp, 8), count(*) from logging where log_action='create' and log_type='newusers' and log_timestamp between '20140901000000' and '20140915235959' group by 1;
20140901 8274 20140902 7482 20140903 8038 20140904 8065 20140905 7327 20140906 7375 20140907 8110 20140908 8214 20140909 8686 20140910 8139 20140911 7984 20140912 7707 20140913 7553 20140914 7911 20140915 8137
20140911 detail for enwiki
Detail for enwiki by the hour. Note that groupings by the hour differ slightly but overall count is only off by 1 on 20140911 among the enwiki-prod db and the warehouse db.
mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 13), count(*) from user where wiki='enwiki' and user_registration between '2014-09-11 00:00:00' and '2014-09-11 23:59:59' and registration_type='create' group by 1; +-----------------------------+----------+ | left(user_registration, 13) | count(*) | +-----------------------------+----------+ | 2014-09-11 00 | 280 | | 2014-09-11 01 | 317 | | 2014-09-11 02 | 331 | | 2014-09-11 03 | 281 | | 2014-09-11 04 | 266 | | 2014-09-11 05 | 286 | | 2014-09-11 06 | 292 | | 2014-09-11 07 | 287 | | 2014-09-11 08 | 307 | | 2014-09-11 09 | 318 | | 2014-09-11 10 | 319 | | 2014-09-11 11 | 314 | | 2014-09-11 12 | 338 | | 2014-09-11 13 | 379 | | 2014-09-11 14 | 412 | | 2014-09-11 15 | 479 | | 2014-09-11 16 | 438 | | 2014-09-11 17 | 427 | | 2014-09-11 18 | 384 | | 2014-09-11 19 | 386 | | 2014-09-11 20 | 328 | | 2014-09-11 21 | 291 | | 2014-09-11 22 | 249 | | 2014-09-11 23 | 276 | -> +-----------------------------+----------+
Registrations in enwiki production by the hour on 20140911
2014091100 280 2014091101 317 2014091102 330 2014091103 282 2014091104 265 2014091105 287 2014091106 292 2014091107 287 2014091108 307 2014091109 318 2014091110 319 2014091111 314 2014091112 338 2014091113 378 2014091114 412 2014091115 480 2014091116 438 2014091117 427 2014091118 384 2014091119 386 2014091120 327 2014091121 292 2014091122 249 2014091123 275 2014091200 1
Use Case: Rolling Recurrent Old Active Editors on September 1st
select edit.user_id from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki and edit.time between user.valid_from and user.valid_to where edit.time between '2014-07-01' and '2014-09-01' and user.user_registration < '2014-07-01' and user.in_bot_user_group = 0 and edit.wiki = 'enwiki' group by user_id having sum(if(time <= '2014-08-01', 1, 0)) >= 5 and sum(if(time > '2014-08-01', 1, 0)) >= 5 order by null;
On Dec 1st this select runs fast but returns no records, which is wrong.
select edit.user_id, count(edit.time) as e from edit where edit.time between '2014-07-01' and '2014-09-01' and edit.wiki = 'enwiki' group by user_id order by e ;
This returns 239308 records.
select edit.user_id, count(edit.time) as e from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki where user.user_registration < '2014-07-01' and edit.time between '2014-07-01' and '2014-09-01' and edit.wiki = 'enwiki' group by user_id order by e desc limit 1000;
This also returns plenty data (takes 12 minutes)
Now, making the join more specific (see bold) kind of makes the select take forever:
select edit.user_id, count(edit.time) as e from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki and edit.time between user.valid_from and user.valid_to where edit.time between '2014-07-01' and '2014-09-01' and user.user_registration < '2014-07-01' and edit.wiki = 'enwiki' group by edit.user_id order by e limit 100;
Leaving this aside for a bit and trying to run a simpler metric might be more effcient.
Use Case: New Active Editors on September 1st
select edit.user_id from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki where edit.time between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' and user.user_registration between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' and user.in_bot_user_group = 0 and edit.wiki = 'enwiki' group by user_id having count(*) > 5 order by null;
Runs in 1 minute and returns no results. See data for enwiki for September: https://metrics.wmflabs.org/static/public/datafiles/RollingNewActiveEditor/enwiki.json
For Sep 1st this number should be about ~10.000
Removing "user.in_bot_user_group = 0" returns 9223 records in 4 minutes. => BOT flag is incorrect
select edit.user_id from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki and edit.time between user.valid_from and user.valid_to where edit.time between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' and user.user_registration between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' and edit.wiki = 'enwiki' group by user_id having count(*) > 5 order by null; Empty set (3 min 30.76 sec)
The "user.valid_from" to "user.valid_to" fields have to be wrong
Reducing intervals to 1 day:
select edit.user_id, count(edit.time) from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki where edit.time between '2014-08-01 00:00:00' and '2014-08-02 00:00:00' and user.user_registration between '2014-08-01 00:00:00' and '2014-08-02 00:00:00' and edit.wiki = 'enwiki' group by user_id having count(*) > 5 order by null;
This returns 179 rows in set (4.30 sec)
Looking at data for these three users on detail we can see that bot flags and valid to are actually wrong.
| 22003106 | 7 | | 22003823 | 18 | | 22004357 | 9 |
Validating Use Case: Pages Created
select count(page_id) from page where wiki='enwiki' and page_is_redirect = 0 and valid_from between '2014-10-27 00:00:00' and '2014-10-27 23:59:59'; -- optional: and (@includeArchived = 1 or archived = 0) -- where @includeArchived would be passed in a parameterized query +----------------+ | count(page_id) | +----------------+ | 8787 | +----------------+
1 row in set (2 min 29.17 sec)
In prod, pages created (counting deleted pages)
SELECT count(*) FROM revision JOIN page ON rev_page = page_id WHERE rev_parent_id = 0 AND rev_timestamp between '20141027000000' AND '20141027235959' UNION SELECT count(*) FROM archive WHERE ar_parent_id = 0 AND ar_timestamp between '20141027000000' AND '20141027235959'; count(*)
count(*) 8604 787
Seems that data does not agree?
For 2014-11-04 numbers in warehouse are 11019 and in prod DB is 10847+724 = 11561 so also in this case warehouse seems to have less records.
Looking at just 1 hour:
select page_id,valid_from from page where wiki='enwiki' and valid_from between '2014-11-04 00:00:00' and '2014-11-04 01:00:00'; +----------+---------------------+ | page_id | valid_from | +----------+---------------------+ | 44292163 | 2014-11-04 00:00:15 | | 44292164 | 2014-11-04 00:00:22 |
.... | 44292627 | 2014-11-04 00:59:48 | | 44292628 | 2014-11-04 00:59:50 | | 44292629 | 2014-11-04 00:59:51 | +----------+---------------------+ 439 rows in set (2 min 23.86 sec)
Same select in prod returns 463 results
Warehouse is missing intermittent records for this hour, for example: 44292173, 44292204, 44292213, 44292293 so definitely is the loading.
Things to correct
User Table
- Doesn't look we need the field valid_currently
- Field valid_to is '0000-00-00 00:00:00'. Value -for accounts still valid- should be null to be able to do " between valid_from and coalesce(valid_to, date())"
- Looks like we need to expand indexes to valid_to to valid_from in order to join effectively, otherwise joins that include those fields take a long time.
- Looks like there are a bunch of users on user table with in_bot_user_group=1 when that should not be the case.
mysql:research@analytics-store.eqiad.wmnet [warehouse]> select count(in_bot_user_group), in_bot_user_group from user group by in_bot_user_group; +--------------------------+-------------------+ | count(in_bot_user_group) | in_bot_user_group | +--------------------------+-------------------+ | 1822 | 0 | | 210963212 | 1 | +--------------------------+-------------------+
Page Table
- Field valid_to is '0000-00-00 00:00:00'. Value -for accounts still valid- should be null
- archived fields is always set to "false" in ingestion
- numbers for pages created come out lower in warehouse than production, is this something to do with incremental login in: https://gerrit.wikimedia.org/r/#/c/167839/5/schemas/test.sql,cm