Data Platform/Systems/Manual maintenance
Monthly
- Mediawiki history Druid data source switch
- Check for newly created wikis
- Add _REFINED flags for events that contribute to the wmf.wikidata_item_page_link dataset. (This is not documented anywhere outside of email; see the Refined flags script below.)
- We run the false positive checker for webrequest loss, probably once or more a month. This could be partially automated: if the script finds that all instances of loss are false positives, the job could be rerun automatically. If we automate this, we could also update the webrequest_sequence_stats table with the results, allowing trend tracking on top of that table. Currently, analyzing data loss over time surfaces a lot of noise, with high loss percentages caused by host restarts and the like; a query sketch for filtering that noise follows the Refined flags script below.
- We re-run sanitization. Updating the command is painful because you often have to change a property file nested in another property file nested in the command. See the docs for Backfilling sanitization; we should build a rerun command that just takes a list of schemas plus since and until parameters (a hypothetical wrapper is sketched at the end of this page).
Refined flags script
```python
from datetime import datetime, timedelta

from_dt = datetime.strptime('2021-07-19 01', '%Y-%m-%d %H')
to_dt = datetime.strptime('2021-08-09 00', '%Y-%m-%d %H')

def get_date_parts(dt):
    return dt.year, dt.month, dt.day, dt.hour

for n in range(int((to_dt - from_dt).total_seconds() / 60 / 60) + 1):
    year, month, day, hour = get_date_parts(from_dt + timedelta(hours=n))
    # make parent directories
    print(f'hdfs dfs -mkdir -p /wmf/data/event/mediawiki_page_move/datacenter=eqiad/year={year}/month={month}/day={day}/hour={hour}/')
    # copy flags
    time_partitions = f'year={year}/month={month}/day={day}/hour={hour}'
    from_path = f'/wmf/data/event/mediawiki_page_move/datacenter=codfw/{time_partitions}'
    to_path = f'/wmf/data/event/mediawiki_page_move/datacenter=eqiad/{time_partitions}'
    print(f'hdfs dfs -cp {from_path}/_REFINED {to_path}/_REFINED')
```
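Webrequest loss trend query (sketch)

To illustrate the trend tracking mentioned in the webrequest loss item above, here is a minimal PySpark sketch that aggregates per-day loss from wmf_raw.webrequest_sequence_stats and drops the tiny losses that usually turn out to be false positives from host restarts. The column names (count_actual, count_expected), the partition filter, and the 0.1% threshold are assumptions for illustration, not a verified production query.

```python
# Sketch: per-day webrequest loss trend from the sequence stats table.
# Column names and the threshold below are assumptions, not the verified schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('webrequest-loss-trend').getOrCreate()

stats = spark.table('wmf_raw.webrequest_sequence_stats')

daily_loss = (
    stats
    # Restrict to one month of partitions (example values).
    .where((F.col('year') == 2021) & (F.col('month') == 8))
    .groupBy('webrequest_source', 'year', 'month', 'day')
    .agg(
        F.sum('count_actual').alias('actual'),
        F.sum('count_expected').alias('expected'),
    )
    .withColumn(
        'percent_loss',
        100.0 * (F.col('expected') - F.col('actual')) / F.col('expected'),
    )
    # Ignore tiny per-day losses that are typically false positives caused by
    # host restarts; 0.1% is an arbitrary illustrative threshold.
    .where(F.col('percent_loss') > 0.1)
    .orderBy('webrequest_source', 'day')
)

daily_loss.show(100, truncate=False)
```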
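Sanitization rerun wrapper (sketch)

A hypothetical sketch of the rerun command described in the sanitization item above: it takes a list of schemas plus since/until timestamps and prints a single command, so nobody has to hand-edit nested property files. The refine-sanitize-rerun command name and its option names are placeholders, not the real job invocation; see the Backfilling sanitization docs for the actual command.

```python
# Hypothetical wrapper: generate a sanitization rerun command from a schema
# list and a since/until window. Command and option names are placeholders.
import argparse


def build_command(schemas, since, until):
    # Assume a single include regex can cover all requested schemas;
    # the real job's option names may differ.
    include_regex = '|'.join(schemas)
    return (
        'refine-sanitize-rerun '
        f"--table_include_regex='^({include_regex})$' "
        f"--since='{since}' "
        f"--until='{until}'"
    )


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate a sanitization rerun command')
    parser.add_argument('--schema', action='append', required=True,
                        help='Schema/table to re-sanitize; repeat for multiple')
    parser.add_argument('--since', required=True, help='Start hour, e.g. 2021-07-19T01:00:00')
    parser.add_argument('--until', required=True, help='End hour, e.g. 2021-08-09T00:00:00')
    args = parser.parse_args()
    print(build_command(args.schema, args.since, args.until))
```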