Jump to content

Obsolete:Category list query

From Wikitech

Much of the Mediawiki code is being adjusted to deal with scaling issues, either because the site has grown or because features which were expected to have limited use have been more popular than expected.

One example is the way the list of categories is found. The reports show "first 500" in alphabetic order, "next 500" and so on. Each article is assigned one of more categories based on its subject, to provide a hierarchical topic index. Here's the query to return page 44, which has names starting from "Synthetic opioids":

SELECT DISTINCT 'Categories' as type,
 14 as namespace,
 cl_to as title,  /* the name of the category */
 1 as value
FROM categorylinks
ORDER BY value
LIMIT 22000,500

The constants 1 and 10 are there because this is fitting into a standard report generation class. Limit 22000,500 means "skip the first 22000 results and give me the following 500". Distinct means "skip this one if you've seen it before" and removes the duplicated cl_to values.

The categorylink table has 493,000 entries in the English language encyclopedia. There are 27,000 different cl_to values. To get the last 500 with the current query, the whole list of 493,000 entries must be scanned.

That's efficient enough when there are only 10,000 articles and 50 categories. It doesn't work so well for 400,000 articles and 27,000.:)

It's easy enough to fix this. Just generate the table of distinct category names and index by position and you can go directly to the title starting at position 22000. Regenerate the list of titles periodically and you've changed an inefficient query into an efficient one. The data isn't normalised any more but that's routine when you're involved in report generation - duplication can be of tremendous help when query efficiency is a factor.

As well as scaling, this is illustrating the "get it into production and see what is popular, then refine as necessary" approach. We don't ignore efficiency, of course, but simple designs are fast to implement and any which become a problem can be improved after they have demonstrated that they are worth the programming time investment. Categories became extremely popular.

For a little extra fun with this query, the order by clause is wrong. It should be ordered by cl_to. At the moment, only consistency in the indexes chosen by the query optimiser is causing the entries to be in their intended alphabetic order. That could change at any time, though it probably won't.