User:Legoktm/Jazz Q

This project is blocked until freely produced HTML dumps exist.

Background

Quarry is incredibly popular and well used by people who are not typically considered "programmers" because it's dead simple to use: you put some SQL into a web form, wait a few minutes, and presto, results! Most people learn by a cycle of copy-paste-modify, in which they take someone else's query, tweak it a bit, and run it. Over time people actually learn SQL to the point where they can come up with their own queries.

The main limitation of Quarry is that it only lets you discover what is already queryable in the database tables. You have no access to page text aside from what you can infer from the links tables. People have gotten pretty good at using regex searches to find text patterns, but it's rough and has limits.

Proposal

We now have twice-monthly Parsoid HTML dumps that make it pretty straightforward to extract rich information out of article text quickly. A trivial multi-threaded Rust program can scan enwiki's NS0 uncompressed dump in under 2 hours.

The parsoid crate has useful and fast accessors to read information out of the HTML, and over the past two years it has grown to handle most edge cases.
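
For instance, pulling the plain text of the lead paragraph out of a page is only a few lines. This is a minimal sketch, not part of the proposal; Wikicode::new and the prelude import are from memory of the crate's API and worth double-checking against its docs, and the selector mirrors the one used in the example below.

use parsoid::prelude::*;

/// Return the plain text of the first paragraph of the lead section, if any.
fn lead_paragraph(html: &str) -> Option<String> {
    let code = Wikicode::new(html);
    // Section 0 is the lead; take its first paragraph and flatten it to text.
    code.select_first("section[data-mw-section-id=\"0\"] > p")
        .map(|p| p.text_contents())
}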

We should have a web interface where people can write a Rust function that is then compiled and executed against a small-to-medium category (maybe a PetScan query result) or the full dump. Users will get live feedback on whether their program compiles, can test the output against a sample article, and then, when ready, submit it for full execution.

The basic structure will look like this:

pub fn process_page(code: Wikicode, title: &str) -> Result<Option<Row>, anyhow::Error> {
    ...
    Ok(Some(Row { ... }))
}
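
The proposal leaves Row abstract; presumably it is a plain struct whose fields the harness maps onto columns of the user's result table, something like the following (the field names are purely illustrative):

pub struct Row {
    // One column per field; the harness would create the result table to match.
    pub page_title: String,
    pub value: String,
}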

Some form of static analysis will prevent the use of any code outside of the parsoid crate (e.g. ::std::fs::read_to_string(...)) and of any unsafe code.
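
The proposal doesn't pick a mechanism for this. One rough sketch of the idea is an AST pass with the syn crate (with its full and visit features enabled) that rejects unsafe blocks and any multi-segment path whose root isn't on an allowlist; a real implementation would also have to handle use declarations, macros, and similar escape hatches.

use syn::visit::{self, Visit};

struct Checker {
    errors: Vec<String>,
}

impl<'ast> Visit<'ast> for Checker {
    fn visit_expr_unsafe(&mut self, node: &'ast syn::ExprUnsafe) {
        self.errors.push("unsafe blocks are not allowed".to_string());
        visit::visit_expr_unsafe(self, node);
    }

    fn visit_path(&mut self, node: &'ast syn::Path) {
        if let Some(first) = node.segments.first() {
            let root = first.ident.to_string();
            // Illustrative allowlist: multi-segment paths may only reach into
            // the parsoid and anyhow crates (e.g. parsoid::Wikicode).
            if node.segments.len() > 1 && !["parsoid", "anyhow"].contains(&root.as_str()) {
                self.errors.push(format!("disallowed path starting with `{root}`"));
            }
        }
        visit::visit_path(self, node);
    }
}

fn check(source: &str) -> Result<(), Vec<String>> {
    let file = syn::parse_file(source).map_err(|e| vec![e.to_string()])?;
    let mut checker = Checker { errors: vec![] };
    checker.visit_file(&file);
    if checker.errors.is_empty() {
        Ok(())
    } else {
        Err(checker.errors)
    }
}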

This will be compiled into a program that handles the multi-threading, etc., and will run in a locked-down container; resource usage (time, CPU, memory, etc.) will be accounted per user. A MySQL account will be created for each user and will only be allowed to write to that program's result table.
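
As a sketch of what the generated harness might look like (not the actual design), assume each record of the uncompressed dump is a line of JSON carrying the page name and its Parsoid HTML, roughly the shape of the Enterprise HTML dumps; the field names and file name below are illustrative, and process_page/Row are the user-submitted pieces from above.

use std::fs::File;
use std::io::{BufRead, BufReader};

use parsoid::prelude::*;
use rayon::prelude::*;
use serde::Deserialize;

#[derive(Deserialize)]
struct Record {
    name: String,
    article_body: ArticleBody,
}

#[derive(Deserialize)]
struct ArticleBody {
    html: String,
}

fn main() -> anyhow::Result<()> {
    let reader = BufReader::new(File::open("enwiki-NS0.ndjson")?);
    let rows: Vec<Row> = reader
        .lines()
        // par_bridge() fans the per-page work out across all cores.
        .par_bridge()
        .filter_map(|line| {
            let record: Record = serde_json::from_str(&line.ok()?).ok()?;
            let code = Wikicode::new(&record.article_body.html);
            // process_page() is the user-submitted function described above.
            process_page(code, &record.name).ok().flatten()
        })
        .collect();
    // The real harness would write the rows into the user's MySQL result
    // table instead of just counting them.
    println!("{} rows", rows.len());
    Ok(())
}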

Example

Here's an example program that finds articles with a bold quote mark in the lead, in violation of English Wikipedia's MOS:NICKNAME and MOS:NOBOLDQUOT. It's a silly example, but it demonstrates how trivial useful programs can be: just a complex selector and a little bit of string processing.

pub fn process_page(code: Wikicode, title: &str) -> Result<Option<String>, anyhow::Error> {
    let found = code
        .select_first("section[data-mw-section-id=\"0\"] > p > b")
        .or_else(|| {
            // If the bold part is also italicized, it'll be under an <i> tag
            code.select_first("section[data-mw-section-id=\"0\"] > p > i > b")
        });

    if let Some(b) = found {
        if b.text_contents().contains('"') {
            return Ok(Some(title.to_string()));
        }
    }
    Ok(None)
}

Rejected ideas

Lua is the obvious choice for letting people write arbitrary code, but it would require writing a Parsoid library in Lua or Rust/Lua bindings for the existing crate. It would likely be nowhere near as fast. I briefly looked into rlua, but the maintainer warns that the sandboxing isn't ready.

Inspiration

See this post. I was trying to imagine what a "jq, but for wiki pages" would look like, and eventually concluded that instead of inventing and designing some complex query language, we should just let people write code, like we do for SQL. Hence the name. Also 21334!