When a Refine job fails in a strange way, I sometimes find it easiest to launch a Spark Scala shell, manually inspect the input dataset, and run some Refine code to try to reproduce the error.

// 15:18:09 [@stat1004:/home/otto] $ spark3-shell --jars /srv/deployment/analytics/refinery/artifacts/refinery-job-shaded.jar


// Create a Spark schemaLoader. There are other ways to do this,
// e.g. for old eventlogging metawiki schemas, or explicit schemas.
// See Refine.scala getRefineTargetsFromFS.
val schemaLoader = EventSparkSchemaLoader(

 * RefineTarget object apply has a helper method to instantiate a single RefineTarget.
 * If you have the input and output paths of a failed hourly dataset, you can use these
 * to create a RefineTarget.
 * E.g. a Refine failure alert email might say:
 * The following 1 of 2 dataset partitions for output table `event`.`mediawiki_content_translation_event` failed refinement:
 *     Failed refinement of 
 *     hdfs://analytics-hadoop/wmf/data/raw/event/codfw.mediawiki.content_translation_event/year=2023/month=06/day=16/hour=22 -> 
 *     `event`.`mediawiki_content_translation_event` 
 *     /wmf/data/event/mediawiki_content_translation_event/datacenter=codfw/year=2023/month=6/day=16/hour=22. Original exception: org.wikimedia.eventutilities.core.json.JsonLoadingException: Failed reading JSON/YAML data from /analytics/mediawiki/content_translation_event/latest
 * So you've got the input and output paths in that email.

val refineTarget = RefineTarget(

// With this RefineTarget, you can try to load and inspect the input dataframe,
// which is usually where Refine errors happen, often because of corrupt records or bad schemas.
val df = refineTarget.inputDataFrame

// Once you have the DataFrame of the input dataset, you can examine it with
// the usual Spark DataFrame API.

Canary Events

The use of 'canary events' for Hive event ingestion is documented at Data_Engineering/Systems/Hadoop_Event_Ingestion_Lifecycle#Canary_Events

ProduceCanaryEvents email alerts usually mean that canary events could not be produced for one particular stream.

The output of the alert email is verbose (this probably should be fixed), but you can find out which stream(s) are causing the alert by looking for log lines with "ERROR ProduceCanaryEvents Some canary events failed to be produced" and viewing the response body JSON. The response body identifies the stream that failed, and will usually include the reason why it failed in context.message. E.g.

"context":{"message":"Failed loading schema at /mediawiki/revision/score/3.0.0"}}]}

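Because the alert output is so verbose, it can help to pull out just the context.message values. The following is a minimal sketch, not an existing tool: the function name canary_failure_messages is made up for illustration, and it assumes the alert text is available on stdin and that the messages contain no escaped double quotes.

```shell
# Hypothetical helper (not part of refinery): reads alert or log text on stdin
# and prints each distinct context.message value found in it, one per line.
# Assumes the messages contain no escaped double quotes.
canary_failure_messages() {
  grep -o '"context":{"message":"[^"]*"' \
    | sed -e 's/^"context":{"message":"//' -e 's/"$//' \
    | sort -u
}

# Usage with the response body fragment shown above:
printf '%s\n' '"context":{"message":"Failed loading schema at /mediawiki/revision/score/3.0.0"}}]}' \
  | canary_failure_messages
```

To scan a saved alert email, something like `canary_failure_messages < alert.txt` should work, where alert.txt is a hypothetical file name.
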
NOTE: Ideally one day soon, we will migrate this job to Airflow, and troubleshooting it will be much easier.

Some common reasons ProduceCanaryEvents might fail:


kubernetes / helm

Get k8s node metrics:
  akosiaris@prometheus1006:~$ curl dse-k8s-worker1005.eqiad.wmnet:10255/metrics/cadvisor
Get the network context for a container (-t takes the PID of a process running in the container):
  sudo nsenter -t 3548838 -n netstat -nlpt
Get container/application logs for all pods:
  kube_env eventgate-analytics eqiad
  kubectl logs -c eventgate-analytics -l app=eventgate --max-log-requests=50 --since 5m


Running phpunit tests with docker-compose

docker-compose up
# ...
docker-compose exec mediawiki composer phpunit:entrypoint extensions/EventBus/tests/phpunit/