Obsolete:2015 data warehouse experiments/2014-12-02 verifications

From Wikitech

Use Case: Newly Registered Users on September 2014

Newly register data (from user table) matches exactly for dewiki in the month period we have looked at except for 2 days (see below). For enwiki differences are more prevalent, in a month period more than 10 records differ but only in one or two counts per day. In both cases differences are due to same causes, a jump of one in one day that gets counted on the next day.

dewiki

Data for newly registered users in September for dewiki matches exactly for all days but two: 20140911 and 20140912 whose numbers among the warehouse and dewiki differ on 1 unit. The disparity on the 20140911 registrations comes from a registration that appears on the warehouse at 2014-09-11 23:59 but in dewiki-prod appears at 20140912000. Last registration for dewiki in prod in 2014-09-11 happens at 201409112323.

Below selects for dewiki.


 mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 10), count(*) from user 
  where wiki='dewiki' and user_registration
  between '2014-09-01 00:00:00' and '2014-09-30 23:59:59' and registration_type='create'   group by 1;
+-----------------------------+----------+
| left(user_registration, 10) | count(*) |
+-----------------------------+----------+
| 2014-09-01                  |      542 |
| 2014-09-02                  |      444 |
| 2014-09-03                  |      524 |
| 2014-09-04                  |      465 |
| 2014-09-05                  |      530 |
| 2014-09-06                  |      493 |
| 2014-09-07                  |      587 |
| 2014-09-08                  |      538 |
| 2014-09-09                  |      554 |
| 2014-09-10                  |      572 |
| 2014-09-11                  |      578 | <-
| 2014-09-12                  |      537 | <-
| 2014-09-13                  |      551 |
| 2014-09-14                  |      613 |
| 2014-09-15                  |      545 |
| 2014-09-16                  |      568 |
| 2014-09-17                  |      527 |
| 2014-09-18                  |      594 |
| 2014-09-19                  |      558 |
| 2014-09-20                  |      556 |
| 2014-09-21                  |      696 |
| 2014-09-22                  |      573 |
| 2014-09-23                  |      531 |
| 2014-09-24                  |      532 |
| 2014-09-25                  |      505 |
| 2014-09-26                  |      467 |
| 2014-09-27                  |      489 |
| 2014-09-28                  |      544 |
| 2014-09-29                  |      521 |
| 2014-09-30                  |      550 |
+-----------------------------+----------+
30 rows in set (0.25 sec)


Results from production:

select left(log_timestamp, 8), count(*)
from logging where log_action='create' and log_type='newusers'
and log_timestamp between '20140901000000' and '20140930235959' group by 1;
left(log_timestamp, 8)	count(*)
20140901        542
20140902 	444
20140903 	524
20140904 	465
20140905	        530
20140906	        493
20140907	        587
20140908	        538
20140909 	554
20140910 	572
20140911 	577 <-
20140912	        538 <-
20140913 	551
20140914 	613
20140915 	545
20140916 	568
20140917	        527
20140918	        594
20140919	        558
20140920 	556
20140921 	696
20140922 	573
20140923 	531
20140924	        532
20140925	        505
20140926	        467
20140927 	489
20140928 	544
20140929 	521
20140930 	550

20140911 detail for dewiki

The disparity on the 20140911 registrations comes from a registration that appears on the warehouse at 2014-09-11 23:59 but in dewiki prod appears at 20140912000. Last registration for dewiki in prod in 2014-09-11 happens at 201409112323.

On warehouse
mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 13), count(*) 
from user where wiki='dewiki' and user_registration  
between '2014-09-11 00:00:00' and '2014-09-11 23:59:59' and registration_type='create'   group by 1;
+-----------------------------+----------+
| left(user_registration, 13) | count(*) |
+-----------------------------+----------+
| 2014-09-11 00               |        7 |
| 2014-09-11 01               |        1 |
| 2014-09-11 02               |        2 |
| 2014-09-11 03               |        3 |
| 2014-09-11 04               |        6 |
| 2014-09-11 05               |       10 |
| 2014-09-11 06               |       16 |
| 2014-09-11 07               |       32 |
| 2014-09-11 08               |       36 |
| 2014-09-11 09               |       28 |
| 2014-09-11 10               |       27 |
| 2014-09-11 11               |       45 |
| 2014-09-11 12               |       42 |
| 2014-09-11 13               |       37 |
| 2014-09-11 14               |       42 |
| 2014-09-11 15               |       33 |
| 2014-09-11 16               |       29 |
| 2014-09-11 17               |       29 |
| 2014-09-11 18               |       38 |
| 2014-09-11 19               |       39 |
| 2014-09-11 20               |       35 |
| 2014-09-11 21               |       24 |
| 2014-09-11 22               |        8 |
| 2014-09-11 23               |        9 | <-
+-----------------------------+----------+
24 rows in set (0.05 sec)

On prod

select left(log_timestamp, 10), count(*)
from logging where log_action='create' and log_type='newusers'
and log_timestamp between '20140911000000' and '20140911235959' group by 1;
left(log_timestamp, 10)	count(*)
2014091100	7
2014091101	1
2014091102	2
2014091103	3
2014091104	6
2014091105	10
2014091106	16
2014091107	32
2014091108	36
2014091109	28
2014091110	27
2014091111	45
2014091112	42
2014091113	37
2014091114	42
2014091115	33
2014091116	29
2014091117	29
2014091118	38
2014091119	39
2014091120	35
2014091121	24
2014091122	8
2014091123	8 <-

enwiki

Data for newly register for enwiki for September differs among the warehouse and enwiki-prod more days than enwiki does but causes are the same.

Warehouse

mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 10), count(*) 
 from user where wiki='enwiki' and user_registration  between '2014-09-01 00:00:00' and 
'2014-09-15 23:59:59' and registration_type='create'   group by 1;

+-----------------------------+----------+
| left(user_registration, 10) | count(*) |
+-----------------------------+----------+
| 2014-09-01                  |     8274 |
| 2014-09-02                  |     7482 |
| 2014-09-03                  |     8038 |
| 2014-09-04                  |     8065 |
| 2014-09-05                  |     7327 |
| 2014-09-06                  |     7376 | <-
| 2014-09-07                  |     8109 | <-
| 2014-09-08                  |     8214 |
| 2014-09-09                  |     8686 |
| 2014-09-10                  |     8139 |
| 2014-09-11                  |     7985 | <-
| 2014-09-12                  |     7706 | <-
| 2014-09-13                  |     7553 |
| 2014-09-14                  |     7912 | <-
| 2014-09-15                  |     8137 |
+-----------------------------+----------+

enwiki production

select left(log_timestamp, 8), count(*)
from logging where log_action='create' and log_type='newusers'
and log_timestamp between '20140901000000' and '20140915235959' group by 1;
20140901	        8274
20140902	        7482
20140903        8038
20140904 	8065
20140905 	7327
20140906 	7375
20140907 	8110
20140908 	8214
20140909 	8686
20140910	        8139
20140911 	7984
20140912 	7707
20140913	        7553
20140914	        7911
20140915	        8137


20140911 detail for enwiki

Detail for enwiki by the hour. Note that groupings by the hour differ slightly but overall count is only off by 1 on 20140911 among the enwiki-prod db and the warehouse db.

mysql:research@analytics-store.eqiad.wmnet [warehouse]> select left(user_registration, 13), count(*) 
from user where wiki='enwiki' and user_registration  between '2014-09-11 00:00:00' 
and '2014-09-11 23:59:59' and registration_type='create'   group by 1;
+-----------------------------+----------+
| left(user_registration, 13) | count(*) |
+-----------------------------+----------+
| 2014-09-11 00               |      280 |
| 2014-09-11 01               |      317 |
| 2014-09-11 02               |      331 |
| 2014-09-11 03               |      281 |
| 2014-09-11 04               |      266 |
| 2014-09-11 05               |      286 |
| 2014-09-11 06               |      292 |
| 2014-09-11 07               |      287 |
| 2014-09-11 08               |      307 |
| 2014-09-11 09               |      318 |
| 2014-09-11 10               |      319 |
| 2014-09-11 11               |      314 |
| 2014-09-11 12               |      338 |
| 2014-09-11 13               |      379 |
| 2014-09-11 14               |      412 |
| 2014-09-11 15               |      479 |
| 2014-09-11 16               |      438 |
| 2014-09-11 17               |      427 |
| 2014-09-11 18               |      384 |
| 2014-09-11 19               |      386 |
| 2014-09-11 20               |      328 |
| 2014-09-11 21               |      291 |
| 2014-09-11 22               |      249 |
| 2014-09-11 23               |      276 | ->
+-----------------------------+----------+


Registrations in enwiki production by the hour on 20140911

2014091100	280
2014091101	317
2014091102	330
2014091103	282
2014091104	265
2014091105	287
2014091106	292
2014091107	287
2014091108	307
2014091109	318
2014091110	319
2014091111	314
2014091112	338
2014091113	378
2014091114	412
2014091115	480
2014091116	438
2014091117	427
2014091118	384
2014091119	386
2014091120	327
2014091121	292
2014091122	249
2014091123	275
2014091200	1


Use Case: Rolling Recurrent Old Active Editors on September 1st

select edit.user_id
  from edit
           inner join
       user    on edit.user_id = user.user_id
              and edit.wiki = user.wiki
              and edit.time between user.valid_from and user.valid_to
 where edit.time between '2014-07-01' and '2014-09-01'
   and user.user_registration < '2014-07-01'
   and user.in_bot_user_group = 0
   and edit.wiki = 'enwiki'
 group by user_id
having sum(if(time <= '2014-08-01', 1, 0)) >= 5
   and sum(if(time >  '2014-08-01', 1, 0)) >= 5
 order by null;

On Dec 1st this select runs fast but returns no records, which is wrong.

select edit.user_id, count(edit.time) as e from edit where edit.time between '2014-07-01' 
 and '2014-09-01' and edit.wiki = 'enwiki'  group by user_id order by e ;
This returns 239308 records.


select edit.user_id, count(edit.time) as e from edit inner join user on edit.user_id = user.user_id 
and edit.wiki = user.wiki where user.user_registration < '2014-07-01' and  
edit.time between '2014-07-01' and '2014-09-01' and edit.wiki = 'enwiki'  group by user_id order by e desc limit 1000;
This also returns plenty data (takes 12 minutes)


Now, making the join more specific (see bold) kind of makes the select take forever:

select edit.user_id, count(edit.time) as e from edit inner join user    
on edit.user_id = user.user_id  and edit.wiki = user.wiki and edit.time between user.valid_from and user.valid_to 
where edit.time between '2014-07-01' and '2014-09-01' and  user.user_registration < '2014-07-01' and edit.wiki = 'enwiki' group by edit.user_id order by e limit 100;


Leaving this aside for a bit and trying to run a simpler metric might be more effcient.


Use Case: New Active Editors on September 1st

select edit.user_id from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki 
where edit.time between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' 
and  user.user_registration between '2014-08-01 00:00:00' and '2014-09-01 23:59:59' 
and user.in_bot_user_group = 0 and edit.wiki = 'enwiki'  group by user_id having count(*) > 5 order by null;                               

Runs in 1 minute and returns no results. See data for enwiki for September: https://metrics.wmflabs.org/static/public/datafiles/RollingNewActiveEditor/enwiki.json

For Sep 1st this number should be about ~10.000

Removing "user.in_bot_user_group = 0" returns 9223 records in 4 minutes. => BOT flag is incorrect

select edit.user_id from edit inner join user on edit.user_id = user.user_id 
and edit.wiki = user.wiki and edit.time between user.valid_from and user.valid_to where edit.time 
between '2014-08-01  00:00:00' and '2014-09-01 23:59:59' and user.user_registration
between '2014-08-01 00:00:00' and '2014-09-01 23:59:59'  and edit.wiki = 'enwiki'  group by user_id having count(*) > 5 order by null;
Empty set (3 min 30.76 sec)

The "user.valid_from" to "user.valid_to" fields have to be wrong

Reducing intervals to 1 day:

select edit.user_id, count(edit.time) from edit inner join user on edit.user_id = user.user_id and edit.wiki = user.wiki  
where edit.time between '2014-08-01 00:00:00' and '2014-08-02 00:00:00'  and user.user_registration 
between '2014-08-01 00:00:00' and '2014-08-02 00:00:00'  and edit.wiki = 'enwiki'  group by user_id having count(*) > 5 order by null;

This returns 179 rows in set (4.30 sec)

Looking at data for these three users on detail we can see that bot flags and valid to are actually wrong.

| 22003106 | 7 | | 22003823 | 18 | | 22004357 | 9 |

Validating Use Case: Pages Created

select count(page_id)
  from page
 where wiki='enwiki'
   and page_is_redirect = 0
   and valid_from between '2014-10-27 00:00:00' and '2014-10-27 23:59:59';
   -- optional: and (@includeArchived = 1 or archived = 0) -- where @includeArchived would be passed in a parameterized query
+----------------+
| count(page_id) |
+----------------+
|           8787 |
+----------------+

1 row in set (2 min 29.17 sec)

In prod, pages created (counting deleted pages)

SELECT count(*)
   FROM revision
   JOIN page
       ON rev_page = page_id
   WHERE rev_parent_id = 0 AND rev_timestamp between '20141027000000' AND '20141027235959'
UNION
SELECT count(*)
   FROM archive
   WHERE ar_parent_id = 0 AND ar_timestamp between '20141027000000' AND '20141027235959';
count(*)

count(*) 8604 787

Seems that data does not agree?

For 2014-11-04 numbers in warehouse are 11019 and in prod DB is 10847+724 = 11561 so also in this case warehouse seems to have less records.


Looking at just 1 hour:

select page_id,valid_from from page  where wiki='enwiki' and valid_from between '2014-11-04 00:00:00' and '2014-11-04 01:00:00';
+----------+---------------------+
| page_id  | valid_from          |
+----------+---------------------+
| 44292163 | 2014-11-04 00:00:15 |
| 44292164 | 2014-11-04 00:00:22 |
....
| 44292627 | 2014-11-04 00:59:48 |
| 44292628 | 2014-11-04 00:59:50 |
| 44292629 | 2014-11-04 00:59:51 |
+----------+---------------------+
439 rows in set (2 min 23.86 sec)

Same select in prod returns 463 results

Warehouse is missing intermittent records for this hour, for example: 44292173, 44292204, 44292213, 44292293 so definitely is the loading.

Things to correct

User Table

  • Doesn't look we need the field valid_currently
  • Field valid_to is '0000-00-00 00:00:00'. Value -for accounts still valid- should be null to be able to do " between valid_from and coalesce(valid_to, date())"
  • Looks like we need to expand indexes to valid_to to valid_from in order to join effectively, otherwise joins that include those fields take a long time.
  • Looks like there are a bunch of users on user table with in_bot_user_group=1 when that should not be the case.
mysql:research@analytics-store.eqiad.wmnet [warehouse]> select count(in_bot_user_group), 
in_bot_user_group from user group by in_bot_user_group;
+--------------------------+-------------------+
| count(in_bot_user_group) | in_bot_user_group |
+--------------------------+-------------------+
|                     1822 |                 0 |
|                210963212 |                 1 |
+--------------------------+-------------------+

Page Table

  • Field valid_to is '0000-00-00 00:00:00'. Value -for accounts still valid- should be null
  • archived fields is always set to "false" in ingestion
  • numbers for pages created come out lower in warehouse than production, is this something to do with incremental login in: https://gerrit.wikimedia.org/r/#/c/167839/5/schemas/test.sql,cm