User:Ottomata/Notes
Troubleshooting
Refine
When a strange failure happens in a Refine job, sometimes I find it is easiest to launch a Spark Scala shell to manually inspect the input dataset, and run some Refine code to see if I can reproduce the errors.
// 15:18:09 [@stat1004:/home/otto] $ spark3-shell --jars /srv/deployment/analytics/refinery/artifacts/refinery-job-shaded.jar
import org.wikimedia.analytics.refinery.job.refine._
// Create a Spark schemaLoader. There are other ways to do this,
// e.g. for old eventlogging metawiki schemas, or explicit schemas.
// See Refine.scala getRefineTargetsFromFS.
val schemaLoader = EventSparkSchemaLoader(
  Seq(
    "https://schema.discovery.wmnet/repositories/primary/jsonschema",
    "https://schema.discovery.wmnet/repositories/secondary/jsonschema"
  ),
  loadLatest = true,
  Some(Refine.Config.default.schema_field)
)
/*
* RefineTarget object apply has a helper method to instantiate a single RefineTarget.
* If you have the input and output paths of a failed hourly dataset, you can use these
* to create a RefineTarget.
*
* E.g. a Refine failure alert email might say:
*
* The following 1 of 2 dataset partitions for output table `event`.`mediawiki_content_translation_event` failed refinement:
* org.wikimedia.analytics.refinery.job.refine.RefineTargetException:
* Failed refinement of
* hdfs://analytics-hadoop/wmf/data/raw/event/codfw.mediawiki.content_translation_event/year=2023/month=06/day=16/hour=22 ->
* `event`.`mediawiki_content_translation_event`
* /wmf/data/event/mediawiki_content_translation_event/datacenter=codfw/year=2023/month=6/day=16/hour=22. Original exception: org.wikimedia.eventutilities.core.json.JsonLoadingException: Failed reading JSON/YAML data from /analytics/mediawiki/content_translation_event/latest
*
 * So you've got the input and output path in that email.
*/
val refineTarget = RefineTarget(
  spark,
  "hdfs://analytics-hadoop/wmf/data/raw/event/codfw.mediawiki.content_translation_event/year=2023/month=06/day=16/hour=22",
  "/wmf/data/event/mediawiki_content_translation_event/datacenter=codfw/year=2023/month=6/day=16/hour=22",
  schemaLoader
)
// With this RefineTarget, you can try to load and inspect the input DataFrame,
// which is usually where Refine errors happen, due to corrupt records or bad schemas.
val df = refineTarget.inputDataFrame
// Once you have the DataFrame of the input dataset, you can examine it with
// the usual Spark DataFrame API.
// https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html
Canary Events
The use of 'canary events' for Hive event ingestion is documented at Data_Engineering/Systems/Hadoop_Event_Ingestion_Lifecycle#Canary_Events
When ProduceCanaryEvents email alerts happen, they are usually because there is a particular stream that cannot have its canary events produced.
The output of the alert email is verbose (this should probably be fixed), but you can find out which stream(s) are causing the alert by looking for log lines with "ERROR ProduceCanaryEvents Some canary events failed to be produced" and viewing the response body JSON. The response body will have event.meta.stream set to the stream that failed, and will usually include the reason why it failed in context.message. E.g.
"context":{"message":"Failed loading schema at /mediawiki/revision/score/3.0.0"}}]}
NOTE: Ideally one day soon, we will migrate this job to Airflow, and troubleshooting it will be much easier.
Some common reasons ProduceCanaryEvents might fail:
- A stream has been declared in EventStreamConfig, but its schema (indicated by the schema_title setting) has not been merged/deployed to schema.wikimedia.org.
- A new schema or schema version for a stream has been merged, but the destination_event_service (eventgate cluster) does not use dynamic schema loading. In this case follow instructions at Event Platform/EventGate/Administration#eventgate-wikimedia schema repository change to resolve.
- A stream has been declared in EventStreamConfig, but the destination_event_service (eventgate cluster) only loads stream config on service startup. In this case the eventgate cluster will need a restart. See Event Platform/EventGate/Administration#EventStreamConfig change for more info, and Event Platform/EventGate/Administration#Roll restart all pods for instructions on how to bounce all pods in the cluster. See Event Platform/EventGate#EventGate clusters to determine if the eventgate cluster in question will need this.
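For the first case above, you can check by hand whether a stream's schema actually resolves. A sketch, where the schema_title value and the exact URL layout on schema.wikimedia.org (mirroring the /repositories/primary/jsonschema paths used in the Refine section) are assumptions:

```shell
# schema_title as it would appear in a stream's EventStreamConfig entry (hypothetical value).
schema_title='mediawiki/revision/score'

# Assumed URL layout, mirroring the repository paths used with EventSparkSchemaLoader above.
url="https://schema.wikimedia.org/repositories/primary/jsonschema/${schema_title}/latest"
echo "$url"

# Then fetch it manually, e.g.: curl -sf "$url"
# A 404 suggests the schema was never merged/deployed to schema.wikimedia.org.
```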
Misc
kubernetes / helm
get k8s node metrics
akosiaris@prometheus1006:~$ curl dse-k8s-worker1005.eqiad.wmnet:10255/metrics/cadvisor
get network context for container
sudo nsenter -t 3548838 -n netstat -nlpt
get container/application logs for all pods
kube_env eventgate-analytics eqiad
kubectl logs -c eventgate-analytics -l app=eventgate --max-log-requests=50 --since 5m
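To narrow those logs down to error lines, a plain grep filter works. The sample log lines below are hypothetical (real eventgate log field names may differ); in practice you would pipe the kubectl logs output instead:

```shell
# Hypothetical sample of JSON log lines; in practice, pipe `kubectl logs ...` here instead.
logs='{"level":"info","msg":"request ok"}
{"level":"error","msg":"schema load failed"}'

# Keep only error-level lines (the "level" field name is an assumption about the log format).
printf '%s\n' "$logs" | grep '"level":"error"'
```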
MediaWiki
Running phpunit tests with docker-compose
docker-compose up
# ...
docker-compose exec mediawiki composer phpunit:entrypoint extensions/EventBus/tests/phpunit/