User:AKhatun/Intro to WMF Search Data

From Wikitech

Search Data

The Search Platform team at the Wikimedia Foundation temporarily stores data from searches made on the various Wikimedia projects. Analyzing this data helps us understand which improvements would benefit users and how to create a better search experience for them. To do that, we first need to understand how search works and what data is stored. This page is intended to help you get started with search and search data, with resources, links, and brief explanations. It is not an exhaustive list or a complete explanation of everything related to search.

Where can you search from?

How search works

As soon as you start typing in any of the search boxes mentioned above, the search process has already started. Every letter or group of letters typed fires a search event; pressing enter or clicking the magnifying glass icon fires an event; and clicking a result on the search results page fires another event. More about events later.

Here are some of the possibilities with searching:

  1. You start typing in the GO box or any other MediaWiki search bar. After each letter you type, you get a drop-down of title suggestions. These are called autocomplete searches. If you type multiple letters in quick succession, you may only get these suggestions when you pause.
    1. You can click one of the title suggestions and go to that page directly
    2. Or, you can press enter or select search for pages containing <your text>. This takes you to the search results page.
  2. On the search results page, you will see your search results, results from other language wikis (if applicable), results from sister projects, and advanced search options. This is also the search special page. You can continue to perform other searches from here or read your results.
    1. Sometimes the word or phrase you searched for has no results. If the system thinks you meant something else, it searches for that instead and shows those results. For example, search for azpw: the results are populated for the word aziz, and the page says Showing results for aziz. No results found for azpw.
    2. Sometimes the word or phrase you searched for has very few results. If the system thinks you meant something else, it recommends a different (possibly correct) search. For example, search for alsha: the page says Did you mean: alpha. It still shows the few results it found for alsha, but you can click alpha to view those results instead.
  3. On the side are results from sister projects
  4. At the bottom of the results are results from other language wikis, if applicable. Search for বন্য প্রানি (a non-English query) in the English Wikipedia, for example.
  5. Some wikis have results from Wikidata at the bottom as well.

Useful resources

A few blog posts. Find more at diff.wikimedia.org.

Data Sources

Sources of data related to Search
Table name | Database | Description | Docs | Code
mediawiki_cirrussearch_request | event | Also known as query logs. Contains all search events including the query, the various hits returned from one or more wiki projects, time taken, and other backend information | Schema | -
searchsatisfaction | event | Table of various search events such as searchResultPage, click, checkin etc along with the query, number of hits returned and other search specific details. | Schema | Source Code
query_clicks_hourly | discovery | A cross of mediawiki_cirrussearch_request and searchsatisfaction to list each search query with its list of hits returned and clicks by the user | Schema | Source Code
query_clicks_daily | discovery | Sessionized version of the discovery.query_clicks_hourly table. Only contains queries with click throughs | Schema | Source Code
search_satisfaction_daily | discovery | A sessionized daily version of the event.searchsatisfaction table. Each search session and most of its related information are aggregated in individual rows | - | Source Code
fulltext_head_queries | discovery | Aggregate of queries and their results after making some minor alterations to the query string (e.g please and PLEASE --> please) | - | Source Code

Table details

event.mediawiki_cirrussearch_request

This table is a bit loaded but actually relatively easy to understand. It has a bunch of metadata and client info. You can find info on most fields in the schema. Some additional information that might help:

  • The search_id set here corresponds to the searchToken in the searchsatisfaction table.
  • The more dense and important search related fields are elasticsearch_requests and hits.
  • A number of "indices" are stored: some for the contents of the various wiki projects in various languages, others for page titles, for example. A search works by matching the search query against these indices.
  • The hits field contains the final list of hits from CirrusSearch that are shown to the user. Each hit includes the page title, the page id, the score given to the result by ElasticSearch, and the index the result was taken from.
  • Search occurs in several steps. ElasticSearch performs the search and collects a list of results. It also generates a score for each result to show how relevant it is. CirrusSearch sits on top of ElasticSearch and modifies and enhances the search results that are ultimately shown to the user. A single search from the user triggers multiple ElasticSearch searches: one for the language wiki you searched in, others for other wiki projects (that's how we get results from other wiki projects on the side), and others for other language projects if relevant to the search. Detecting the language and deciding whether searching in other language wikis is required is also part of the job.
  • Since a single search can involve several ElasticSearch requests, each request and its relevant results are listed in the elasticsearch_requests field.
    • hits contains the list of results returned from each of the individual ElasticSearch requests.
    • indices contains the list of all indices the query was performed against. hits[].index, on the other hand, contains the index that particular result came from.
    • When users paginate through search results, i.e., view the "next" set of search results, the offset (number of results already shown) is given in hits_offset.
    • Every returned search result has a score. max_score is the maximum of those, typically the score of the first search result shown.

event.searchsatisfaction

The table's schema describes most fields. Here are some additional notes to aid understanding:

  • Every Wikimedia project has a search bar at the top, even on the Special:Search page, which is treated as just another wiki page. Wherever you start typing your search query, that page's id is stored in the articleId field. For the Special:Search page it is null.
  • Imagine a search session that you started from a random content page. As you start typing, the system suggests pages based on title matches. This fires a searchResultPage event (set in the action field) whose source field is autocomplete.
  • After typing, you select one of the suggested pages. This generates a click event (set in the action field). The position field contains the 0-indexed position of the search result you just clicked. Once you move on to another page, no further information is recorded.
  • Let's assume that instead of clicking one of the suggested pages, you press enter or click the magnifying glass icon. This takes you to the Special:Search page. You will get:
    • A click event. The position field in this event will be -1 since you did not click any of the autocomplete search results.
    • A visitPage event. visitPage means you just visited the Special:Search page.
    • A searchResultPage event. This will be a fulltext search, set in the source field.
    • As you browse through the search results, checkin events are fired at regular intervals for up to 7 minutes. Since you are browsing the search special page, the articleId is null.
  • Every event generated from one load of a given page has the same pageViewId. So the visitPage, searchResultPage, and checkin events generated from the Special:Search page share one pageViewId. The click event has a different pageViewId, though, since it was generated from the page you typed the query in. So the earlier autocomplete events and the click event share the same pageViewId.
  • Now let's choose one of the search results.
    • This generates a click and a visitPage event. Both will have position set to the 0-indexed position of the result you clicked.
    • visitPage event will have the page id of the page you visited in the articleId field.
    • Then let's assume you are reading the article you just clicked. As you spend more time on it, checkin events are fired at regular intervals. And the articleId would be the id of the page you are reading.
  • If you click the back button on the browser you will go back to the fulltext search page, the page from where you selected the search result. This will generate a new search with the same query, and so will have a searchResultPage event.
  • The search result page typically shows 20 results by default. Now suppose you click the pagination buttons, i.e., choose to see the next 20 results, or choose to see 50 or 100 results on the current page. These actions generate searchResultPage events. The extraParams field contains the offset value. So if you chose to see the next set of results, the offset is 20 (the first 20 results having already been shown). Or, if you viewed 50 results on the first page and then clicked to see the next set of results, the offset is 50.
  • Other params in extraParams: the iw key in extraParams holds a list of sister projects. One result from each of these wikis was shown on the search result page, typically on the side. The name of the wiki is in abbreviated form. See the abbreviations here. Along with the wiki's abbreviated name is the position of the wiki project within the sister projects' search result list. So {"source":"q", "position":3} means there was a result from wikiquote in the 3rd position among the results from the various wiki projects.
  • Clicking any of the sister project links generates an ssclick event. The extraParams of this event contain the link of the page you clicked, but no information about dwell time or anything else.
  • Normal searchResultPage events have an inputLocation of header (when you search from the content pages and the GO box in the header) or content (when you search from the search box in the content/body section of the Special:Search page).
  • If a search does not produce enough results and the search engine finds another word or phrase that closely matches what you typed, it shows a "did you mean" suggestion. When this happens, the didYouMeanVisible field is "yes".
    • If you click the suggestion, a new search with the suggested query takes place, where the inputLocation is "dym-suggest".
  • If the query you searched for has no results and the search engine finds a close word or phrase, it shows results for that instead. In this case the didYouMeanVisible field is "autorewrite".
    • Even though the original query has no results at all, you can still click "search for <original_query> instead". This creates another search event with inputLocation set to "dym-original".
  • hover on, hover off, and esclick are not in use as of Aug 2022.
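
The pageViewId behaviour described above can be checked on a toy event stream. The Python sketch below is only an illustration: the field names (action, source, position, articleId, pageViewId) come from this page, while the pageViewId strings, the article id 42 and the exact set of keys per event are invented. It models the case where you type on a content page, press enter, and land on the Special:Search page.

```python
from collections import defaultdict

# Hypothetical events, in the order they would fire. All values are invented.
events = [
    # Typing on a content page (page id 42) fires autocomplete searches.
    {"action": "searchResultPage", "source": "autocomplete",
     "articleId": 42, "pageViewId": "pv-content-page"},
    {"action": "searchResultPage", "source": "autocomplete",
     "articleId": 42, "pageViewId": "pv-content-page"},
    # Pressing enter: a click with position -1, still on the content page.
    {"action": "click", "position": -1,
     "articleId": 42, "pageViewId": "pv-content-page"},
    # Events generated by the Special:Search page load (articleId is null).
    {"action": "visitPage", "articleId": None, "pageViewId": "pv-search-page"},
    {"action": "searchResultPage", "source": "fulltext",
     "articleId": None, "pageViewId": "pv-search-page"},
    {"action": "checkin", "articleId": None, "pageViewId": "pv-search-page"},
]


def actions_by_page_view(events):
    """Group event actions by the page load (pageViewId) that produced them."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["pageViewId"]].append(event["action"])
    return dict(grouped)


for page_view_id, actions in actions_by_page_view(events).items():
    print(page_view_id, actions)
```

Grouping by pageViewId like this is a quick way to see which page a given event was fired from when you explore the real table.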

Note:

  • As of now, opening search results in multiple tabs by double-clicking is not recorded as an event. Double click is not considered a "click". So it does not store visitPage or checkin events either.
  • You can perform a search and visit a page from the result set with a single click, which loads the page in the same tab. Up to this step, searchResultPage, visitPage, and checkin events are stored. Clicking links on the content page you loaded from the search results and going deeper down the Wikipedia hole no longer stores events. Once you click the browser "back" button and return to the search page, the search is performed again and you can continue searching while events are fired.

discovery.query_clicks_hourly

The fields of this table are fairly clear from its schema definition. It contains the list of all the search results shown to the users with each full-text search and the list of all the pages the user clicked in each search. See Schema and Source Code for more details.
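
As a rough mental model of how the two event tables are crossed, the Python sketch below joins hypothetical request rows with hypothetical satisfaction events on search_id = searchToken (the correspondence noted earlier on this page) and keeps the clicked hits per query. The row layout, the helper name query_clicks and all values are assumptions for illustration; the actual job is the one linked under Source Code and may differ in many details.

```python
from collections import defaultdict

# Hypothetical, simplified rows. Real rows carry many more fields.
requests = [
    {"search_id": "t1", "query": "alpha",
     "hits": ["Alpha", "Alpha particle", "Alpha decay"]},
    {"search_id": "t2", "query": "beta", "hits": ["Beta"]},
]
satisfaction_events = [
    {"searchToken": "t1", "action": "searchResultPage", "position": None},
    {"searchToken": "t1", "action": "click", "position": 1},
    {"searchToken": "t2", "action": "searchResultPage", "position": None},
]


def query_clicks(requests, events):
    """List each query with the hits shown and the hits the user clicked."""
    clicked_positions = defaultdict(list)
    for event in events:
        position = event["position"]
        # position -1 means "pressed enter", not a click on a result.
        if event["action"] == "click" and position is not None and position >= 0:
            clicked_positions[event["searchToken"]].append(position)

    rows = []
    for request in requests:
        positions = clicked_positions.get(request["search_id"], [])
        clicks = [request["hits"][p] for p in positions if p < len(request["hits"])]
        rows.append({"query": request["query"],
                     "hits": request["hits"],
                     "clicks": clicks})
    return rows


for row in query_clicks(requests, satisfaction_events):
    print(row)
```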

discovery.query_clicks_daily

Sessionized version of the hourly table. This table contains only full-text search sessions with click-throughs. If you want all search sessions, with or without click-throughs, use the hourly table. This table simply adds a session_id to the queries.

discovery.search_satisfaction_daily

What is a search session?
"A search session identifies a single user performing searches within a limited timespan. If no search is performed within ten minutes of a previous search a new session id is generated." [1] So, whatever a user does after searching, like clicking around, viewing pages, viewing next set of results are all given the same sessionID. A new session starts when this session is idle for 10 minutes.

discovery.search_satisfaction_daily is a daily sessionized version of event.searchsatisfaction. The event table records each event separately, whereas the daily table records searches session-wise, with a separate row for each full-text search (not autocomplete searches, only searches the user performed by pressing enter or clicking the magnifying glass icon).
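
The ten-minute rule quoted above can be sketched in a few lines of Python. This is only an illustration of the definition, not the way session ids are actually generated: the function name assign_sessions is hypothetical, integer session numbers stand in for the real session ids, and treating a gap of exactly ten minutes as the same session is an assumption.

```python
from datetime import datetime, timedelta

# Idle gap after which a new session starts, per the quoted definition.
SESSION_GAP = timedelta(minutes=10)


def assign_sessions(search_times):
    """Number the sessions of one user's searches (timestamps sorted ascending)."""
    session_numbers = []
    current = 0
    previous = None
    for timestamp in search_times:
        # A search more than ten minutes after the previous one opens a new session.
        if previous is not None and timestamp - previous > SESSION_GAP:
            current += 1
        session_numbers.append(current)
        previous = timestamp
    return session_numbers


searches = [
    datetime(2022, 8, 1, 12, 0),
    datetime(2022, 8, 1, 12, 5),   # 5 minutes later: same session
    datetime(2022, 8, 1, 12, 16),  # 11 minutes later: new session
    datetime(2022, 8, 1, 12, 17),
]
print(assign_sessions(searches))
```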

Additional explanation of some fields follows. Make sure to run describe table_name; in Hive, Spark SQL, or whichever tool you use to access the table, to see the field comments.

  • dym_shown: Whether the search engine result page (SERP) showed a Did You Mean (dym) suggestion. If the number of results is too low, the search engine tries to identify nearby words or phrases to search with and shows that query as a suggestion to the user. If the number of results is 0, the engine performs the search with the suggested query and shows those results instead. When either situation occurs, dym_shown is True.
  • is_autorewrite_dym: Getting 0 results and therefore showing the results of the suggested query is called autorewrite.
  • is_dym: When the user clicks the dym suggestion, a search is performed with the suggested query. The new result page has is_dym set to True, because this is the search for the dym-suggested query. It is also True for autorewrite queries, since the page is showing results for the suggested query.
  • dym_clicked: True when a user clicks the suggested query shown at the top of the page, i.e, the Did You Mean query.

N.B.: In the case of an autorewrite, dym_shown, is_autorewrite_dym, and is_dym are all True. For more info about the logic, see the source code: Source Code#L128-L148
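
The relationship between these flags can be summarized in a small Python sketch. It is a reading of the descriptions on this page, not a copy of the linked source code: the function name dym_flags is hypothetical, and deriving the flags from the didYouMeanVisible and inputLocation values of a single searchResultPage event is an assumption. dym_clicked is left out because it depends on what the user does next.

```python
def dym_flags(did_you_mean_visible, input_location):
    """Derive the dym-related flags for one full-text result page.

    did_you_mean_visible: "yes", "autorewrite", or anything else (nothing shown).
    input_location: e.g. "header", "content", "dym-suggest", "dym-original".
    """
    is_autorewrite = did_you_mean_visible == "autorewrite"
    return {
        # A suggestion was shown, or results for a suggestion were shown.
        "dym_shown": did_you_mean_visible in ("yes", "autorewrite"),
        # Zero results, so the suggested query was run automatically.
        "is_autorewrite_dym": is_autorewrite,
        # The page shows results for a suggested query.
        "is_dym": is_autorewrite or input_location == "dym-suggest",
    }


print(dym_flags("autorewrite", "header"))
print(dym_flags("yes", "header"))
print(dym_flags("no", "dym-suggest"))
```

The first call reproduces the N.B. above (all three flags True); the second is a page that merely shows a suggestion; the third is the page reached by clicking a suggestion.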

discovery.fulltext_head_queries

This table is not much used at present. It contains:

  • norm_query : The normalized query. Only a few very basic normalizations are performed. See the Source Code docs for more info on which normalizations are done. Queries are normalized and then grouped together based on the normalized version.
  • num_sessions : The number of sessions spanned by the queries that are grouped together under norm_query.
  • queries : The original queries that were normalized to norm_query along with the number of sessions each query was part of.
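
The three fields above can be illustrated with a short Python sketch. The normalization used here (lowercasing and collapsing whitespace) is an assumption chosen to match the please / PLEASE example; the real normalizations are described in the Source Code docs. The function names and the (session_id, query) input layout are hypothetical.

```python
from collections import defaultdict


def normalize(query):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(query.lower().split())


def head_queries(session_queries):
    """Aggregate (session_id, query) pairs by normalized query."""
    sessions_per_norm = defaultdict(set)
    sessions_per_query = defaultdict(lambda: defaultdict(set))
    for session_id, query in session_queries:
        norm = normalize(query)
        sessions_per_norm[norm].add(session_id)
        sessions_per_query[norm][query].add(session_id)
    return {
        norm: {
            # Distinct sessions across all queries grouped under norm.
            "num_sessions": len(sessions),
            # Original query -> number of sessions that query was part of.
            "queries": {q: len(s) for q, s in sessions_per_query[norm].items()},
        }
        for norm, sessions in sessions_per_norm.items()
    }


sample = [("s1", "please"), ("s2", "PLEASE"), ("s2", "please"), ("s3", "thanks")]
for norm_query, aggregate in head_queries(sample).items():
    print(norm_query, aggregate)
```

Because a session can contain several spellings of the same normalized query, num_sessions counts distinct sessions and can be smaller than the sum of the per-query counts.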

References

  1. https://meta.wikimedia.org/wiki/Schema:SearchSatisfaction

Abbreviations

  • SERP: Search Engine Result Page
  • dym: Did You Mean (the alternate query suggestion that comes after a search that does not have enough results)