Data Platform/Systems/ua-parser
This page describes our setup for ua-parser, a library used in Java and Python to parse user-agent strings into more meaningful values.
Setup
The ua-parser project uses a core repository named uap-core for shared regular-expressions and test-data, and per programming-language repositories (uap-java, uap-python for instance) referencing the core repository as a git submodule.
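To illustrate what the shared regular expressions are for, here is a minimal Python sketch of the general approach: an ordered list of patterns is tried in turn and the first match supplies the parsed fields. The patterns and the `parse_browser` helper are hypothetical simplifications written for this page; they are not the actual uap-core definitions, nor the uap-python API.

```python
import re

# Hypothetical, heavily simplified patterns; the real ones live in uap-core.
# Order matters: the first pattern that matches wins.
BROWSER_PATTERNS = [
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
    ("Safari", re.compile(r"Version/(\d+).*Safari/")),
]


def parse_browser(user_agent):
    """Return a small map with the browser family and major version."""
    for family, pattern in BROWSER_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return {"browser_family": family, "browser_major": match.group(1)}
    # Nothing matched: fall back to a generic value.
    return {"browser_family": "Other", "browser_major": None}


ua = "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
print(parse_browser(ua))
```

Because every language port reads the same pattern definitions, a change in uap-core can alter the parsed values in both the java and python libraries at once, which is why the impact is measured before deploying.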
As of 2019-09-13, the Analytics team maintains 2 forks:
- https://gerrit.wikimedia.org/g/analytics/ua-parser/uap-java for java
- https://gerrit.wikimedia.org/g/operations/debs/python-ua-parser for python
Each of the two forks we maintain has a dedicated branch with patches used for WMF artifact releases:
- the wmf branch for the java repository. It contains mostly updates to pom.xml and sometimes functional changes not yet merged upstream. Jar files are generated from this branch and uploaded to archiva.wikimedia.org.
- the debian branch for the python repository. It contains patches that allow building debian packages out of the code; the packages are then uploaded to apt.wikimedia.org.
How to update
Measure the change
Before updating the code in production, it is good practice to have a look at the impact the change will have on the data. To do so we use the hadoop-cluster to generate a temporary table containing both current and new versions of parsed user-agent data, and compare. Below is a rough procedure.
- Update your version of the uap-java code (including pulling latest version of master in uap-core submodule), build and install a local uap-java jar and use it to build an updated refinery-hive jar
    # In uap-java cloned repo, assuming you have setup a remote
    # to the original github repo named github
    git fetch --all
    git pull github master
    git submodule update --init

    # Building and installing the uap-java jar locally
    mvn clean install

    # Move to the refinery-source folder
    cd /my/refinery-source

    # Update the refinery-source/pom.xml to reference your locally installed new uap-java jar
    # (Here's an example because it took me embarrassingly long to figure it out last time):
    <dependency>
      <groupId>com.github.ua-parser</groupId>
      <artifactId>uap-java</artifactId>
      <version>1.5.2-SNAPSHOT</version>
    </dependency>

    # Build the refinery-hive jar (possibly without tests since ua-parser has changed: -DskipTests)
    mvn -pl refinery-hive -am clean package -DskipTests
- Generate the comparison hive table using the generated refinery-hive jar (taking a 1/64 sample of one day of webrequest data)
    -- In hive
    use MY_DATABASE;

    -- Use the refinery-hive jar created at the last step of the previous section
    ADD JAR /PATH/TO/MY/JAR/refinery-hive-MYJARVERSION-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';

    DROP TABLE IF EXISTS tmp_ua_check_YYYY_MM_DD;
    create table tmp_ua_check_YYYY_MM_DD stored as parquet as
    select
      user_agent,
      user_agent_map as user_agent_map_original,
      ua_parser(user_agent) AS user_agent_map_new,
      COUNT(1) as requests
    FROM wmf.webrequest TABLESAMPLE(BUCKET 1 OUT OF 64 ON hostname, sequence)
    WHERE year = YYYY and month = MM and day = DD
    GROUP BY
      user_agent,
      user_agent_map,
      ua_parser(user_agent);
- Use the comparison table to measure differences
    // In Spark2-shell

    /*********************************
    NOTE: The queries provided here should not prevent you
          from keeping a critical eye on the data
    *********************************/

    spark.sql("use MY_DATABASE")
    spark.table("tmp_ua_check_YYYY_MM_DD").cache()

    /*********************************
    Global analyses
    **********************************/

    spark.sql("""
    SELECT
      count(distinct user_agent) as distinct_user_agent,
      count(distinct user_agent_map_original) as distinct_user_agent_map_original,
      count(distinct user_agent_map_new) as distinct_user_agent_map_new
    FROM tmp_ua_check_YYYY_MM_DD
    """).show()

    spark.sql("""
    SELECT
      -- Need to cast to string as map is not naturally sortable for group-by
      (CAST(user_agent_map_original AS string) = CAST(user_agent_map_new AS string)) as same_original_new,
      sum(requests) as requests
    FROM tmp_ua_check_YYYY_MM_DD
    GROUP BY
      (CAST(user_agent_map_original AS string) = CAST(user_agent_map_new AS string))
    """).show()

    /*********************************
    By value-type (map key) analyses
    **********************************/

    val mapValues = Set("browser_family", "os_major", "wmf_app_version", "browser_major", "os_minor", "os_family", "device_family")

    /*********************************
    Check differences by value-type
    **********************************/

    mapValues.foreach( v => {
      spark.sql(s"""
      SELECT
        (user_agent_map_original['$v'] = user_agent_map_new['$v']) as same_old_new_$v,
        sum(requests) as requests
      FROM tmp_ua_check_YYYY_MM_DD
      GROUP BY
        (user_agent_map_original['$v'] = user_agent_map_new['$v'])
      """).show()})

    /*********************************
    Check main different values
    **********************************/

    mapValues.foreach( v => {
      spark.sql(s"""
      SELECT
        user_agent_map_original['$v'] as v_original,
        user_agent_map_new['$v'] as v_new,
        sum(requests) as requests
      FROM tmp_ua_check_YYYY_MM_DD
      WHERE user_agent_map_original['$v'] != user_agent_map_new['$v']
      GROUP BY
        user_agent_map_original['$v'],
        user_agent_map_new['$v']
      ORDER BY requests DESC
      LIMIT 10
      """).show(10, false)})

    /*********************************
    Other queries based on your critical thinking :)
    ...
    **********************************/
- Document the results: summarize the findings in one or two sentences for the webrequest/pageview changelog update (as in this edit for instance), and create a new page with detailed results (as in this page for instance).
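When summarizing the results, a figure that is often useful is the share of requests whose parsed value changed for a given key. The following Python sketch shows that computation on a tiny hand-made sample. The rows and the `changed_share` helper are illustrative assumptions only; real numbers should come from the comparison table itself.

```python
# Each row mirrors the comparison table: (original map, new map, requests).
# The values below are made up for illustration only.
rows = [
    ({"browser_family": "Chrome", "os_family": "Android"},
     {"browser_family": "Chrome", "os_family": "Android"}, 900),
    ({"browser_family": "Other", "os_family": "Android"},
     {"browser_family": "Samsung Internet", "os_family": "Android"}, 100),
]


def changed_share(rows, key):
    """Fraction of requests whose value for `key` differs between the maps."""
    total = sum(requests for _, _, requests in rows)
    changed = sum(
        requests
        for original, new, requests in rows
        if original.get(key) != new.get(key)
    )
    return changed / total if total else 0.0


for key in ("browser_family", "os_family"):
    print(f"{key}: {changed_share(rows, key):.1%} of requests changed")
```

Weighting by requests rather than counting distinct user-agent strings keeps rare strings from dominating the summary.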
Update the code for production
We decided today to try and follow upstream versions of uap libraries. This makes deployment much easier.
- The uap-java update involves just pointing to the latest version available in maven-central, so just change refinery-source/pom.xml
- The uap-python update is being discussed right now
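Before and after editing refinery-source/pom.xml, it can help to check which uap-java version the file references. Below is a minimal Python sketch, assuming the dependency is declared with a literal version (not a Maven property). The sample POM is made up and `uap_java_version` is a hypothetical helper, not part of any existing tooling.

```python
import xml.etree.ElementTree as ET

# Standard Maven POM namespace.
POM_NS = "http://maven.apache.org/POM/4.0.0"
NS = {"m": POM_NS}

# Made-up sample; in practice read the text of refinery-source/pom.xml instead.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>com.github.ua-parser</groupId>
      <artifactId>uap-java</artifactId>
      <version>1.5.2-SNAPSHOT</version>
    </dependency>
  </dependencies>
</project>"""


def uap_java_version(pom_text):
    """Return the version declared for uap-java, or None if it is absent."""
    root = ET.fromstring(pom_text)
    # Look at every <dependency> element, wherever it sits in the POM.
    for dep in root.iter(f"{{{POM_NS}}}dependency"):
        if dep.findtext("m:artifactId", namespaces=NS) == "uap-java":
            return dep.findtext("m:version", namespaces=NS)
    return None


print(uap_java_version(SAMPLE_POM))
```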
The original way we did this, which we may have to return to depending on the outcome of the python discussions:
- Update the java and python repositories to the needed commit (usually either a released tag, or current master), and update their submodule to the correct version of uap-core (usually current master)
- Rebase or clean up the wmf and debian branches using the updated master branches in the java and python repositories. This depends on the changes the branches contain in comparison to what has been merged in the upstream repository.
- Push new patches to the wmf and debian branches, at least so that new versions of the jar and debian packages are created.
- Build and release the new uap-java jar to archiva using the instructions at Archiva#Deploy_to_Archiva.
- Build the new debian package and add to apt.wikimedia.org (ask Andrew or Luca).