Data Platform/Systems/ua-parser

This page describes our setup for ua-parser, the library we use in Java and Python to parse user-agent strings into more meaningful values (for example browser, OS, and device families).

Setup

The ua-parser project uses a core repository named uap-core for the shared regular expressions and test data, and per-language repositories (for instance uap-java and uap-python) that reference the core repository as a git submodule.
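
For example, a fresh clone of one of the per-language repositories needs an extra step to populate the uap-core submodule (shown here against the upstream uap-java repository):

    git clone https://github.com/ua-parser/uap-java.git
    cd uap-java
    # Fetch the uap-core submodule holding the shared regexes and test data
    git submodule update --init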

As of 2019-09-13, the Analytics team maintains two forks: one of uap-java and one of uap-python.

It is an explicit choice not to maintain a fork of the uap-core repository: we aim to always use an upstream version of the regular-expression definitions, and submit pull requests upstream as needed.

Each of the two forks we maintain has a dedicated branch carrying the patches needed for WMF artifact releases:

  • the wmf branch of the java repository - Contains mostly updates to pom.xml and occasionally functional changes not yet merged upstream. Jar files are generated from this branch and uploaded to archiva.wikimedia.org.
  • the debian branch of the python repository - Contains patches that allow building Debian packages from the code; the packages are then uploaded to apt.wikimedia.org.
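
The update procedure below assumes each local clone also has a remote named github pointing at the original upstream repository. A minimal one-time setup could look like this (the fork URL below is a placeholder for our actual fork):

    # Clone our fork (placeholder URL) and add the upstream remote
    git clone <our-uap-java-fork-url> uap-java
    cd uap-java
    git remote add github https://github.com/ua-parser/uap-java.git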

How to update

Measure the change

Before updating the code in production, it is good practice to look at the impact the change will have on the data. To do so, we use the Hadoop cluster to generate a temporary table containing both the current and the new version of the parsed user-agent data, and compare the two. Below is a rough procedure.

  1. Update your version of the uap-java code (including pulling the latest master of the uap-core submodule), build and install a local uap-java jar, and use it to build an updated refinery-hive jar
    # In uap-java cloned repo, assuming you have setup a remote
    # to the original github repo named github
    git fetch --all
    git pull github master
    git submodule update --init
    # Building and installing the uap-java jar locally
    mvn clean install
    # Move to the refinery-source folder
    cd /my/refinery-source
    # Update the refinery-source/pom.xml to reference your locally installed new uap-java jar
    # (Here's an example because it took me embarrassingly long to figure it out last time):
          <dependency>
              <groupId>com.github.ua-parser</groupId>
              <artifactId>uap-java</artifactId>
              <version>1.5.2-SNAPSHOT</version>
          </dependency>
    # Build the refinery-hive jar (possibly skipping tests, since they may fail now that ua-parser has changed: -DskipTests)
    mvn -pl refinery-hive -am clean package -DskipTests
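    # (Optional sanity check, not part of the original procedure: confirm
    # the jar was produced; the exact version in its name may differ)
    ls refinery-hive/target/refinery-hive-*.jar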
    
  2. Generate the comparison Hive table using the refinery-hive jar built in the previous step (taking a 1/64 sample of one day of webrequest data)
    -- In hive
    use MY_DATABASE;
    -- Use the refinery-hive jar created at the last step of the previous section
    ADD JAR /PATH/TO/MY/JAR/refinery-hive-MYJARVERSION-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
    
    DROP TABLE IF EXISTS tmp_ua_check_YYYY_MM_DD;
    create table tmp_ua_check_YYYY_MM_DD stored as parquet as
    select
      user_agent,
      user_agent_map as user_agent_map_original,
      ua_parser(user_agent) AS user_agent_map_new,
      COUNT(1) as requests
    FROM wmf.webrequest TABLESAMPLE(BUCKET 1 OUT OF 64 ON hostname, sequence)
    WHERE year = YYYY and month = MM and day = DD
    GROUP BY user_agent, user_agent_map, ua_parser(user_agent);
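
    -- (Optional sanity check, not part of the original procedure: confirm
    -- the comparison table is populated before moving on to Spark)
    SELECT COUNT(1) AS row_count, SUM(requests) AS sampled_requests
    FROM tmp_ua_check_YYYY_MM_DD;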
    
  3. Use the comparison table to measure differences
    // In Spark2-shell
    /********************************* 
    NOTE: The queries provided here should not prevent you from keeping a critical eye on the data
    *********************************/
    
    spark.sql("use MY_DATABASE")
    spark.table("tmp_ua_check_YYYY_MM_DD").cache()
    
    /*********************************
      Global analyses
    **********************************/
    spark.sql("""
    SELECT
      count(distinct user_agent) as distinct_user_agent,
      count(distinct user_agent_map_original) as distinct_user_agent_map_original,
      count(distinct user_agent_map_new) as distinct_user_agent_map_new
    FROM tmp_ua_check_YYYY_MM_DD
    """).show()
    
    spark.sql("""
    SELECT
      -- Need to cast to string as map is not naturally sortable for group-by
      (CAST(user_agent_map_original AS string) = CAST(user_agent_map_new AS string)) as same_original_new,
      sum(requests) as requests
    FROM tmp_ua_check_YYYY_MM_DD
    GROUP BY (CAST(user_agent_map_original AS string) = CAST(user_agent_map_new AS string))
    """).show()
    
    
    /*********************************
      By value-type (map key) analyses
    **********************************/
    val mapValues = Set("browser_family", "os_major", "wmf_app_version", "browser_major", "os_minor", "os_family", "device_family")
    
    
    /*********************************
        Check differences by value-type
    **********************************/
    mapValues.foreach( v => {
      spark.sql(s"""
    SELECT
      (user_agent_map_original['$v'] = user_agent_map_new['$v']) as same_old_new_$v,
      sum(requests) as requests
    FROM tmp_ua_check_YYYY_MM_DD
    GROUP BY (user_agent_map_original['$v'] = user_agent_map_new['$v'])
    """).show()})
    
    /*********************************
        Check main different values
    **********************************/
    mapValues.foreach( v => {
      spark.sql(s"""
    SELECT
      user_agent_map_original['$v'] as v_original,
      user_agent_map_new['$v'] as v_new,
      sum(requests) as requests
    FROM tmp_ua_check_YYYY_MM_DD
    WHERE user_agent_map_original['$v'] != user_agent_map_new['$v']
    GROUP BY user_agent_map_original['$v'], user_agent_map_new['$v']
    ORDER BY requests DESC
    LIMIT 10
    """).show(10, false)})
    
    /*********************************
        Other queries based on your critical thinking :)
        ...
    **********************************/
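
    // For instance (an illustrative extra query, not part of the original
    // procedure): user agents that change to or from the 'Other' device
    // family between versions are often worth a manual look
    spark.sql("""
    SELECT
      user_agent,
      sum(requests) as requests
    FROM tmp_ua_check_YYYY_MM_DD
    WHERE (user_agent_map_original['device_family'] = 'Other')
       != (user_agent_map_new['device_family'] = 'Other')
    GROUP BY user_agent
    ORDER BY requests DESC
    LIMIT 10
    """).show(10, false)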
    
  4. Document the results. Summarize the findings in one or two sentences for the webrequest/pageview changelog update (as in this edit for instance), and create a new page with the detailed results (as in this page for instance).

Update the code for production

We have since decided to try to follow the upstream versions of the uap libraries, which makes deployment much easier.

  • The uap-java update just involves pointing refinery-source/pom.xml at the latest version available in Maven Central (see the snippet below)
  • The uap-python update is still under discussion
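
For instance, moving to an upstream release just means dropping the -SNAPSHOT suffix and pointing at a version published in Maven Central (the version number below is only an example):

          <dependency>
              <groupId>com.github.ua-parser</groupId>
              <artifactId>uap-java</artifactId>
              <version>1.5.2</version>
          </dependency>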

The original procedure, which we may have to return to depending on the Python discussion, was:

  1. Update the java and python repositories to the needed commit (usually either a release tag or current master), and update their submodule to the correct version of uap-core (usually current master)
  2. Rebase or clean up the wmf and debian branches on top of the updated master branches of the java and python repositories. How much work this takes depends on how many of the branches' changes have already been merged upstream.
  3. Push new patches to the wmf and debian branches, at least so that new versions of the jar and Debian packages are created.
  4. Build and release the new uap-java jar to archiva using the instructions at Archiva#Deploy_to_Archiva.
  5. Build the new Debian package and add it to apt.wikimedia.org (ask Andrew or Luca); a rough sketch follows below.
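
For step 5, a rough sketch of the package build (the exact flags and process may differ; the folks maintaining apt.wikimedia.org have the canonical procedure):

    # In the uap-python clone, with the packaging patches applied
    git checkout debian
    # Build an unsigned Debian package from the debian/ files on the branch
    dpkg-buildpackage -us -uc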