Obsolete:2015 data warehouse experiments/2015-02-03 verifications

From Wikitech

This verification was meant to continue the previous one, available here. But when repeating the tests executed in the latter, the results were different. So, for some reason, the data in the warehouse changed since 2015-01-14, and thus the verification will continue in this new page to include the new results.

Edit Fact and Page Dimension

Using the verification script. Results follow for @from = 20140829000000 and @to = 20141001000100.

+--------+---------------------+--------------------+----------------------+
| wiki   | date_with_bad_match | value_in_warehouse | value_in_original_db |
+--------+---------------------+--------------------+----------------------+
| dewiki | 20140829            |              25278 |                25305 |
| dewiki | 20140830            |              24311 |                24321 |
| dewiki | 20140831            |              28807 |                28814 |
| dewiki | 20140901            |              30938 |                30949 |
| dewiki | 20140902            |              35705 |                35717 |
| dewiki | 20140903            |              40823 |                40836 |
| dewiki | 20140904            |              26212 |                26245 |
| dewiki | 20140905            |              24393 |                24408 |
| dewiki | 20140906            |              23405 |                23419 |
| dewiki | 20140907            |              26042 |                26058 |
| dewiki | 20140908            |              27503 |                27524 |
| dewiki | 20140909            |              30042 |                30058 |
| dewiki | 20140910            |              29123 |                29141 |
| dewiki | 20140911            |              28001 |                28022 |
| dewiki | 20140912            |              30820 |                30840 |
| dewiki | 20140913            |              29044 |                29055 |
| dewiki | 20140914            |              30595 |                30603 |
| dewiki | 20140915            |              28672 |                28686 |
| dewiki | 20140916            |              26216 |                26231 |
| dewiki | 20140917            |              27513 |                27538 |
| dewiki | 20140918            |              27519 |                27537 |
| dewiki | 20140919            |              27570 |                27589 |
| dewiki | 20140920            |              24764 |                24783 |
| dewiki | 20140921            |              29747 |                29763 |
| dewiki | 20140922            |              29125 |                29144 |
| dewiki | 20140923            |              26579 |                26591 |
| dewiki | 20140924            |              27716 |                27732 |
| dewiki | 20140925            |              27318 |                27343 |
| dewiki | 20140926            |              26074 |                26115 |
| dewiki | 20140927            |              25270 |                25279 |
| dewiki | 20140928            |              27093 |                27109 |
| dewiki | 20140929            |              28185 |                28198 |
| dewiki | 20140930            |              26761 |                26776 |
| dewiki | 20141001            |                 13 |                   14 |
| enwiki | 20140829            |             135559 |               135567 |
| enwiki | 20140903            |             153581 |               153583 |
| enwiki | 20140905            |             126101 |               126103 |
| enwiki | 20140907            |             124260 |               124263 |
| enwiki | 20140909            |             134882 |               134886 |
| enwiki | 20140913            |             146489 |               146490 |
| enwiki | 20140918            |             133930 |               133931 |
| enwiki | 20140924            |             129260 |               129264 |
| enwiki | 20140926            |             126064 |               126067 |
| enwiki | 20140927            |             121163 |               121165 |
+--------+---------------------+--------------------+----------------------+

One problem is the same from last verification: Page table has archive = 0 for all records, so it seems that archived pages were not imported correctly, thus affecting total counts.

However, the previous verification stated that there were no indexes for page_id in the page table. This time, there are two indexes for that column in the page table. It seems strange, because both indexes are identical except for the name.

+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Index_type |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------------+
| page  |          1 | i1       |            1 | wiki        | A         |        NULL |     NULL | NULL   | BTREE      |
| page  |          1 | i1       |            2 | page_id     | A         |        NULL |     NULL | NULL   | BTREE      |
| page  |          1 | i2       |            1 | wiki        | A         |        NULL |     NULL | NULL   | BTREE      |
| page  |          1 | i2       |            2 | namespace   | A         |        NULL |     NULL | NULL   | BTREE      |
| page  |          1 | i2       |            3 | page_id     | A         |        NULL |     NULL | NULL   | BTREE      |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------------+

In addition to that, in the edit table some registers have user_id = 0. And no user with that id exists in the user table, neither in the warehouse's user table nor in enwiki/dewiki.

select * from edit where time > '2015-01-01' and user_id = 0 limit 10;
+---------------------+--------+---------+-----------+----------+
| time                | wiki   | user_id | rev_id    | page_id  |
+---------------------+--------+---------+-----------+----------+
| 2015-01-01 00:00:10 | enwiki |       0 | 640447557 |  1435055 |
| 2015-01-01 00:00:17 | enwiki |       0 | 640447577 |    12529 |
| 2015-01-01 00:00:20 | enwiki |       0 | 640447580 |  1047631 |
| 2015-01-01 00:00:36 | enwiki |       0 | 640447601 |   273319 |
| 2015-01-01 00:00:50 | enwiki |       0 | 640447626 |   253102 |
| 2015-01-01 00:00:51 | enwiki |       0 | 640447628 | 44375173 |
| 2015-01-01 00:00:52 | enwiki |       0 | 640447632 | 12977178 |
| 2015-01-01 00:00:52 | enwiki |       0 | 640447630 | 15806863 |
| 2015-01-01 00:01:01 | enwiki |       0 | 640447648 | 11477230 |
| 2015-01-01 00:01:02 | enwiki |       0 | 640447651 |  3200022 |
+---------------------+--------+---------+-----------+----------+

User Dimension

Here again, the same problems as in the last verification apply. But with slightly different results. The first 8 queries in this script return 0 as expected, so no discrepancy here. But the last two queries in the same script returned the following. Note that in the previous verification, the second query (enwiki) crashed, and now we have a result.

+--------------+
| wrong_action |  <-- dewiki
+--------------+
|         7557 |
+--------------+

+--------------+
| wrong_action |  <-- enwiki
+--------------+
|      8473998 |
+--------------+

In here, it seems something is not matching: Warehouse's user.registration_type does not match enwiki/dewiki's logging.log_action. It's difficult to see a pattern, but the type 'create2' appears in most of the divergent cases (90%+), either as a field in the warehouse or in the origin db (specially paired up with 'byemail'). Other values for the fields are varied in different proportions.

Executing some other queries, complementary to the automatic SQL test scripts, we can see some differences in the number of registration logs by type:

dewiki
warehouse> select registration_type, count(*) from user where wiki = 'dewiki' and user_registration between '2014-09-01 00:00:00' and '2014-09-30 23:59:59' group by 1;

+-------------------+----------+
| registration_type | count(*) |
+-------------------+----------+
| autocreate        |     9101 |
| byemail           |        2 |
| create            |    16284 |
| create2           |       52 |
+-------------------+----------+

dewiki> select log_action, count(*) from logging where log_type = 'newusers' and log_timestamp between '20140901000000' and '20140930235959' group by 1;

+------------+----------+
| log_action | count(*) |
+------------+----------+
| autocreate |     9101 |
| byemail    |        5 |
| create     |    16284 |
| create2    |       61 |
+------------+----------+
enwiki
warehouse> select registration_type, count(*) from user where wiki = 'enwiki' and user_registration between '2014-09-01 00:00:00' and '2014-09-30 23:59:59' group by 1;

+-------------------+----------+
| registration_type | count(*) |
+-------------------+----------+
| autocreate        |    76916 |
| byemail           |      181 |
| create            |   236569 |
| create2           |     1009 |
+-------------------+----------+

dewiki> select log_action, count(*) from logging where log_type = 'newusers' and log_timestamp between '20140901000000' and '20140930235959' group by 1;

+------------+----------+
| log_action | count(*) |
+------------+----------+
| autocreate |    76916 |
| byemail    |      959 |
| create     |   236567 |
| create2    |     1087 |
+------------+----------+