Data Platform/Systems/Manual maintenance

From Wikitech


  • Mediawiki history Druid data source switch
  • Check for newly created wikis
  • Add _REFINED flags for events that contribute to the wmf.wikidata_item_page_link dataset. This step is currently documented only in email.
  • We run the false positive checker for webrequest loss roughly once a month or more. This could be partially automated: if the script finds that all instances of loss are false positives, the job could be rerun automatically. An automated run could also write its results to the webrequest_sequence_stats table, which would allow trend tracking on top of that table. Currently, analyzing data loss over time turns up a lot of noise, with high loss percentages caused by host restarts and similar events.
  • We re-run sanitization. Updating the command is painful because you often have to change a property file nested in another property file nested in the command. See the docs for Backfilling sanitization. We should build a rerun command that takes just a list of schemas plus since and until parameters.
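
The rerun command proposed in the last item does not exist yet. As a rough illustration of the interface it could expose, here is a minimal sketch of the argument parsing only. The flag names (--schemas, --since, --until), the timestamp format, and the function names are all assumptions made for this example, and the sketch deliberately does not build or launch the real sanitization job.

```python
import argparse
from datetime import datetime

# Assumed timestamp format for this sketch: hourly resolution, e.g. 2021-07-19T01.
HOUR_FORMAT = '%Y-%m-%dT%H'


def schema_list(value):
    """Parse a comma-separated list of schema names."""
    schemas = [name.strip() for name in value.split(',') if name.strip()]
    if not schemas:
        raise argparse.ArgumentTypeError('at least one schema is required')
    return schemas


def hour_timestamp(value):
    """Parse an hourly timestamp such as 2021-07-19T01."""
    try:
        return datetime.strptime(value, HOUR_FORMAT)
    except ValueError:
        raise argparse.ArgumentTypeError(f'expected {HOUR_FORMAT}, got {value!r}')


def parse_rerun_args(argv):
    """Parse the hypothetical rerun arguments and check the time range."""
    parser = argparse.ArgumentParser(
        description='Hypothetical wrapper for re-running sanitization')
    parser.add_argument('--schemas', required=True, type=schema_list)
    parser.add_argument('--since', required=True, type=hour_timestamp)
    parser.add_argument('--until', required=True, type=hour_timestamp)
    args = parser.parse_args(argv)
    if args.since >= args.until:
        parser.error('--since must be earlier than --until')
    return args


args = parse_rerun_args(
    ['--schemas', 'SchemaA, SchemaB', '--since', '2021-07-19T01', '--until', '2021-08-09T00'])
print(args.schemas, args.since, args.until)
```

A wrapper like this would still need to translate the parsed values into the nested property files described above; that part is left out because it depends on the actual job configuration.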

Refined flags script

from datetime import datetime, timedelta

# Inclusive range of hourly partitions that need a _REFINED flag.
from_dt = datetime.strptime('2021-07-19 01', '%Y-%m-%d %H')
to_dt = datetime.strptime('2021-08-09 00', '%Y-%m-%d %H')

base_path = '/wmf/data/event/mediawiki_page_move'

def get_date_parts(dt):
    return dt.year, dt.month, dt.day, dt.hour

# Whole hours between the two endpoints; the + 1 below makes the range inclusive.
hours = int((to_dt - from_dt).total_seconds() // 3600)

# This script only prints the hdfs commands; review the output, then run it in a shell.
for n in range(hours + 1):
    year, month, day, hour = get_date_parts(from_dt + timedelta(hours=n))

    # partition values are not zero-padded (month=7, not month=07)
    time_partitions = f'year={year}/month={month}/day={day}/hour={hour}'
    from_path = f'{base_path}/datacenter=codfw/{time_partitions}'
    to_path = f'{base_path}/datacenter=eqiad/{time_partitions}'

    # make parent directories
    print(f'hdfs dfs -mkdir -p {to_path}/')

    # copy flags
    print(f'hdfs dfs -cp {from_path}/_REFINED {to_path}/_REFINED')
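
If the same fix is needed for another dataset or date range, the script above could be wrapped in a function. The following is a minimal sketch under that assumption; the function name refined_flag_commands and its parameters are made up for this example, and like the original it only generates the hdfs commands rather than running them.

```python
from datetime import datetime, timedelta


def refined_flag_commands(base_path, from_dc, to_dc, from_dt, to_dt):
    """Return hdfs commands that copy _REFINED flags from one datacenter
    partition to another for every hour in [from_dt, to_dt] (inclusive)."""
    commands = []
    hours = int((to_dt - from_dt).total_seconds() // 3600)
    for n in range(hours + 1):
        dt = from_dt + timedelta(hours=n)
        # partition values are not zero-padded, matching the script above
        parts = f'year={dt.year}/month={dt.month}/day={dt.day}/hour={dt.hour}'
        src = f'{base_path}/datacenter={from_dc}/{parts}'
        dst = f'{base_path}/datacenter={to_dc}/{parts}'
        commands.append(f'hdfs dfs -mkdir -p {dst}/')
        commands.append(f'hdfs dfs -cp {src}/_REFINED {dst}/_REFINED')
    return commands


# Two-hour example range, using the paths from the script above.
cmds = refined_flag_commands(
    '/wmf/data/event/mediawiki_page_move', 'codfw', 'eqiad',
    datetime(2021, 7, 19, 1), datetime(2021, 7, 19, 2))
print('\n'.join(cmds))
```

Returning a list instead of printing directly makes it easy to inspect the commands, or write them to a file, before anything touches HDFS.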