Data Engineering/TOC
This team was previously known as the Analytics team, so much of the documentation on this wiki was created under the Analytics/ namespace.
Although the team name has changed to Data Engineering, we still work with Analytics data and systems in many ways, so it does not make sense to rename these pages in bulk. For this reason, the tables of contents below cover both the Analytics and Data_Engineering namespaces. Content pages will be migrated on a case-by-case basis.
Table of Contents - Data Engineering
- About
- Contact
- Currently working on
- Data Quality
- Documentation
- Evaluations
- Evaluations/2021 data catalog selection
- Evaluations/2021 data catalog selection/Rubric
- Evaluations/2021 data catalog selection/Rubric/Amundsen
- Evaluations/2021 data catalog selection/Rubric/Atlas
- Evaluations/2021 data catalog selection/Rubric/DataHub
- Evaluations/2021 data catalog selection/Rubric/OpenMetadata
- Evaluations/Dumps
- Evaluations/Event Platform/EventStreams
- Evaluations/Event Platform/Stream Processing/Framework Evaluation
- Evaluations/SQL Engine on Cloud
- Exporting from HDFS to Swift
- FAQ
- Learning Materials
- Manual maintenance
- Manual maintenance/Refined flags script
- Meeting norms
- OKRs
- Ops week
- Ownership
- Responding to Requests and Issues
- Show your work
- Systems
- Systems/AQS
- Systems/AQS/Scaling
- Systems/AQS/Scaling/2016/Hardware Refresh
- Systems/AQS/Scaling/2017/Cluster Expansion
- Systems/AQS/Scaling/2020/Cluster Expansion
- Systems/AQS/Scaling/LoadTesting
- Systems/Airflow
- Systems/Airflow/Airflow testing instance tutorial
- Systems/Airflow/Developer guide
- Systems/Airflow/Developer guide/Python Job Repos
- Systems/Airflow/Instances
- Systems/Airflow/Upgrading
- Systems/Analytics Meta
- Systems/Archiva
- Systems/Ceph
- Systems/Cluster/Geotagging
- Systems/Cluster/Hadoop/Load
- Systems/Conda
- Systems/DB Replica
- Systems/Dashiki
- Systems/Dashiki/Configuration
- Systems/DataHub
- Systems/DataHub/Data Catalog Documentation Guide
- Systems/DataHub/Upgrading
- Systems/Dealing with data loss alarms
- Systems/Druid
- Systems/Druid/Alerts
- Systems/Druid/Load test
- Systems/Event Data retention
- Systems/Event Data retention/AppInstallId
- Systems/Hadoop Event Ingestion Lifecycle
- Systems/Java
- Systems/Jupyter
- Systems/Jupyter/Administration
- Systems/Kerberos
- Systems/Kerberos/Administration
- Systems/Maintenance Schedule
- Systems/Managing systemd timers
- Systems/Matomo
- Systems/Reportupdater
- Systems/Varnishkafka
- Systems/Wikistats
- Systems/Wikistats/Traffic
- Systems/Wikistats 2
- TOC
- Team
- Team/Onboarding
- Team roles and expectations
- Tutorials
- Tutorials/Dashboards
Table of Contents - Analytics
- AQS/Devices Analytics
- AQS/Editors by country
- AQS/Legacy Pagecounts
- AQS/Media metrics
- AQS/Mediarequests
- AQS/Mediarequests/Limitations
- AQS/Pageviews
- AQS/Pageviews/Pageviews per project
- AQS/Wikistats 2
- AQS/Wikistats 2/DataQuality/VettingPerProjectFamilies
- AQS/Wikistats 2/DataQuality/Vetting of mediarequest metrics
- AQS/Wikistats 2/Data Quality/VettingPerProject
- AQS/Wikistats 2/Metrics Definition
- Archive
- Archive/AQS -RESTBase
- Archive/AQS - DataStore
- Archive/Alexa
- Archive/Dashboards - Limn
- Archive/Data/Mobile requests stream
- Archive/Data/Pagecounts-all-sites
- Archive/Data/Pagecounts-raw
- Archive/Data/Webrequests sampled
- Archive/Data/Zero webrequests
- Archive/Differential
- Archive/EventLogging pipeline
- Archive/Hadoop - Logstash
- Archive/Hadoop Logging - Solutions Overview
- Archive/Hadoop Logging - Solutions Recommendation
- Archive/Hadoop Streaming
- Archive/Hue
- Archive/Hue/Administration
- Archive/Kafka/Capacity
- Archive/Kraken/Meetings
- Archive/Kraken/Meetings/ArchitectureReview
- Archive/Limn
- Archive/Mingle
- Archive/Oozie
- Archive/Oozie/Administration
- Archive/Pageviews/Aggregation
- Archive/Pentaho
- Archive/Spark/Migration to Spark 3
- Archive/Webrequest partitions monitorin
- Archive/Webstatscollector
- Archive/Wikistats2.0/Design
- BrowserReports
- Cluster/Coordinator
- Cluster/Data Format Experiments
- Cluster/Data deletion and sanitization
- Cluster/Edit data loading
- Cluster/Edit history administration
- Cluster/Edit serving layer
- Cluster/Geolocation
- Cluster/Gobblin
- Cluster/Hadoop
- Cluster/Hadoop/Alerts
- Cluster/Hadoop/Test
- Cluster/Mediawiki History Snapshot Check
- Cluster/Mediawiki history reduced algorithm
- Cluster/Page and user history reconstruction
- Cluster/Page and user history reconstruction algorithm
- Cluster/Revision augmentation and denormalization
- Cluster/Spark/Administration
- Cluster/System Users
- Cluster/Systems/Hive
- Cluster/Systems/Hive/Alerts
- Cluster/Systems/Hive/Avro
- Cluster/Systems/Hive/Compression
- Cluster/Systems/Hive/Counting uniques
- Cluster/Systems/Hive/Queries
- Cluster/Systems/Hive/Queries/Wikidata
- Cluster/Workflow management tools study
- DataRequests
- Data Lake
- Data Lake/Content
- Data Lake/Content/Mediawiki wikitext current
- Data Lake/Content/Mediawiki wikitext history
- Data Lake/Content/Wikidata entity
- Data Lake/Content/Wikidata item page link
- Data Lake/Data Issues
- Data Lake/Data Issues/2021-02-09 Unique Devices By Family Overcount
- Data Lake/Data Issues/2021-06-04 Traffic Data Loss
- Data Lake/Data Issues/2023-01-08 Webrequest Data Loss
- Data Lake/Data Issues/2023-11 eventgate-analytics-external Data Loss
- Data Lake/Edits
- Data Lake/Edits/Edit hourly
- Data Lake/Edits/Geoeditors
- Data Lake/Edits/Geoeditors/Public
- Data Lake/Edits/MediaWiki history
- Data Lake/Edits/MediaWiki history/Revision identity reverts
- Data Lake/Edits/MediaWiki history dumps
- Data Lake/Edits/MediaWiki history dumps/FAQ
- Data Lake/Edits/MediaWiki history dumps/Python spark examples
- Data Lake/Edits/MediaWiki history dumps/Scala spark examples
- Data Lake/Edits/Mediawiki history dumps/Python Dask examples
- Data Lake/Edits/Mediawiki history dumps/Python Pandas examples
- Data Lake/Edits/Mediawiki history reduced
- Data Lake/Edits/Mediawiki page history
- Data Lake/Edits/Mediawiki project namespace map
- Data Lake/Edits/Mediawiki user history
- Data Lake/Edits/Metrics
- Data Lake/Edits/Public
- Data Lake/Edits/Structured data/Commons entity
- Data Lake/Events
- Data Lake/Traffic
- Data Lake/Traffic/Banner activity
- Data Lake/Traffic/BotDetection
- Data Lake/Traffic/Browser general
- Data Lake/Traffic/Caching
- Data Lake/Traffic/Interlanguage
- Data Lake/Traffic/Mediacounts
- Data Lake/Traffic/Pagecounts-ez
- Data Lake/Traffic/Pageview actor
- Data Lake/Traffic/Pageview hourly
- Data Lake/Traffic/Pageview hourly/Fingerprinting Over Time
- Data Lake/Traffic/Pageview hourly/Identity reconstruction analysis
- Data Lake/Traffic/Pageview hourly/K Anonymity Threshold Analysis
- Data Lake/Traffic/Pageview hourly/Sanitization
- Data Lake/Traffic/Pageview hourly/Sanitization algorithm proposal
- Data Lake/Traffic/Pageviews
- Data Lake/Traffic/Pageviews/Bots
- Data Lake/Traffic/Pageviews/Bots Research
- Data Lake/Traffic/Pageviews/Redirects
- Data Lake/Traffic/Projectview hourly
- Data Lake/Traffic/ReaderCounts
- Data Lake/Traffic/SessionLength
- Data Lake/Traffic/Unique Devices
- Data Lake/Traffic/Unique Devices/Automated traffic correction
- Data Lake/Traffic/Unique Devices/Last access solution
- Data Lake/Traffic/Unique Devices/Last access solution/Validation
- Data Lake/Traffic/UserRetention
- Data Lake/Traffic/Virtualpageview hourly
- Data Lake/Traffic/Webrequest
- Data Lake/Traffic/Webrequest/RawIPUsage
- Data Lake/Traffic/Webrequest/Tagging
- Data Lake/Traffic/mediawiki api request
- Data Lake/Traffic/mobile apps session metrics
- Data Lake/Traffic/mobile apps uniques
- Data Lake/Traffic/referrer daily
- Data Lake/Traffic/referrer daily/Dashboard
- Data access
- Data access guidelines
- Data quality/Entrophy alarms
- Data quality/User agent entropy
- Doc proposal
- Event Sanitization
- Fundraising
- Geoeditors
- Hive to Druid Ingestion Pipeline
- Mysql/Utility Datasets
- Pageviews
- Performance
- Projects/Data Lake/Edits History
- Projects/Public Data Lake
- Research/CitationDataset
- Sessions
- Systems
- Systems/Clients
- Systems/Cluster
- Systems/Cluster/Bigtop Packages
- Systems/Cluster/Hadoop/Administration
- Systems/Cluster/Hive/Querying using UDFs
- Systems/Cluster/Iceberg
- Systems/Cluster/Iceberg/Migration Dependencies
- Systems/Cluster/Spark
- Systems/Cluster/Spark History
- Systems/EventLogging
- Systems/EventLogging/Administration
- Systems/EventLogging/Architecture
- Systems/EventLogging/Backfilling
- Systems/EventLogging/Data representations
- Systems/EventLogging/EventCapsule
- Systems/EventLogging/Monitoring
- Systems/EventLogging/NotErrorLogging
- Systems/EventLogging/Outages
- Systems/EventLogging/Performance
- Systems/EventLogging/Publishing
- Systems/EventLogging/Sanitization vs Aggregation
- Systems/EventLogging/Schema Guidelines
- Systems/EventLogging/Sensitive Fields
- Systems/EventLogging/TestingOnBetaCluster
- Systems/EventLogging/User agent sanitization
- Systems/MariaDB
- Systems/Presto
- Systems/Presto/Administration
- Systems/Presto/Query Logger
- Systems/Refine
- Systems/Refine/Deploy Refinery
- Systems/Refine/Deploy Refinery-source
- Systems/Siege
- Systems/Superset
- Systems/Superset/Administration
- Systems/Superset/Date functions
- Systems/Tier2
- Systems/Turnilo
- Systems/Wikistats2/Metrics/FAQ
- Systems/ua-parser
- Systems/ua-parser/2019-09-18 Update
- Team/Conferences
- Team/Conferences/Apache Big Data Europe - November 2016
- Team/MailingList
- Team/Office Hours
- Team/Quarterly Reviews
- Web publication
- Wikistats/Deprecation of Wikistats 1
- Wikistats2.0/Map Component
- Wikistats 2/Smoke Testing
- analytics.wikimedia.org
Tables of contents - With Redirects
Data Engineering
- About
- Contact
- Currently working on
- Data Quality
- Documentation
- Evaluations
- Evaluations/2021 data catalog selection
- Evaluations/2021 data catalog selection/Rubric
- Evaluations/2021 data catalog selection/Rubric/Amundsen
- Evaluations/2021 data catalog selection/Rubric/Atlas
- Evaluations/2021 data catalog selection/Rubric/DataHub
- Evaluations/2021 data catalog selection/Rubric/OpenMetadata
- Evaluations/Dumps
- Evaluations/Event Platform/EventStreams
- Evaluations/Event Platform/Stream Processing/Framework Evaluation
- Evaluations/SQL Engine on Cloud
- Exporting from HDFS to Swift
- FAQ
- Learning Materials
- Manual maintenance
- Manual maintenance/Refined flags script
- Meeting norms
- OKRs
- Ops week
- Ownership
- Responding to Requests and Issues
- Show your work
- Systems
- Systems/AQS
- Systems/AQS/Scaling
- Systems/AQS/Scaling/2016/Hardware Refresh
- Systems/AQS/Scaling/2017/Cluster Expansion
- Systems/AQS/Scaling/2020/Cluster Expansion
- Systems/AQS/Scaling/LoadTesting
- Systems/Airflow
- Systems/Airflow/Airflow testing instance tutorial
- Systems/Airflow/Developer guide
- Systems/Airflow/Developer guide/Python Job Repos
- Systems/Airflow/Instances
- Systems/Airflow/Upgrading
- Systems/Analytics Meta
- Systems/Archiva
- Systems/Ceph
- Systems/Cluster/Geotagging
- Systems/Cluster/Hadoop/Load
- Systems/Conda
- Systems/DB Replica
- Systems/Dashiki
- Systems/Dashiki/Configuration
- Systems/DataHub
- Systems/DataHub/Data Catalog Documentation Guide
- Systems/DataHub/Upgrading
- Systems/Dealing with data loss alarms
- Systems/Druid
- Systems/Druid/Alerts
- Systems/Druid/Load test
- Systems/Event Data retention
- Systems/Event Data retention/AppInstallId
- Systems/Hadoop Event Ingestion Lifecycle
- Systems/Java
- Systems/Jupyter
- Systems/Jupyter/Administration
- Systems/Kerberos
- Systems/Kerberos/Administration
- Systems/Maintenance Schedule
- Systems/Managing systemd timers
- Systems/Matomo
- Systems/Reportupdater
- Systems/Varnishkafka
- Systems/Wikistats
- Systems/Wikistats/Traffic
- Systems/Wikistats 2
- TOC
- Team
- Team/Onboarding
- Team roles and expectations
- Tutorials
- Tutorials/Dashboards
Analytics
- AQS/Devices Analytics
- AQS/Editors by country
- AQS/Legacy Pagecounts
- AQS/Media metrics
- AQS/Mediarequests
- AQS/Mediarequests/Limitations
- AQS/Pageviews
- AQS/Pageviews/Pageviews per project
- AQS/Wikistats 2
- AQS/Wikistats 2/DataQuality/VettingPerProjectFamilies
- AQS/Wikistats 2/DataQuality/Vetting of mediarequest metrics
- AQS/Wikistats 2/Data Quality/VettingPerProject
- AQS/Wikistats 2/Metrics Definition
- Archive
- Archive/AQS -RESTBase
- Archive/AQS - DataStore
- Archive/Alexa
- Archive/Dashboards - Limn
- Archive/Data/Mobile requests stream
- Archive/Data/Pagecounts-all-sites
- Archive/Data/Pagecounts-raw
- Archive/Data/Webrequests sampled
- Archive/Data/Zero webrequests
- Archive/Differential
- Archive/EventLogging pipeline
- Archive/Hadoop - Logstash
- Archive/Hadoop Logging - Solutions Overview
- Archive/Hadoop Logging - Solutions Recommendation
- Archive/Hadoop Streaming
- Archive/Hue
- Archive/Hue/Administration
- Archive/Kafka/Capacity
- Archive/Kraken/Meetings
- Archive/Kraken/Meetings/ArchitectureReview
- Archive/Limn
- Archive/Mingle
- Archive/Oozie
- Archive/Oozie/Administration
- Archive/Pageviews/Aggregation
- Archive/Pentaho
- Archive/Spark/Migration to Spark 3
- Archive/Webrequest partitions monitorin
- Archive/Webstatscollector
- Archive/Wikistats2.0/Design
- BrowserReports
- Cluster/Coordinator
- Cluster/Data Format Experiments
- Cluster/Data deletion and sanitization
- Cluster/Edit data loading
- Cluster/Edit history administration
- Cluster/Edit serving layer
- Cluster/Geolocation
- Cluster/Gobblin
- Cluster/Hadoop
- Cluster/Hadoop/Alerts
- Cluster/Hadoop/Test
- Cluster/Mediawiki History Snapshot Check
- Cluster/Mediawiki history reduced algorithm
- Cluster/Page and user history reconstruction
- Cluster/Page and user history reconstruction algorithm
- Cluster/Revision augmentation and denormalization
- Cluster/Spark/Administration
- Cluster/System Users
- Cluster/Systems/Hive
- Cluster/Systems/Hive/Alerts
- Cluster/Systems/Hive/Avro
- Cluster/Systems/Hive/Compression
- Cluster/Systems/Hive/Counting uniques
- Cluster/Systems/Hive/Queries
- Cluster/Systems/Hive/Queries/Wikidata
- Cluster/Workflow management tools study
- DataRequests
- Data Lake
- Data Lake/Content
- Data Lake/Content/Mediawiki wikitext current
- Data Lake/Content/Mediawiki wikitext history
- Data Lake/Content/Wikidata entity
- Data Lake/Content/Wikidata item page link
- Data Lake/Data Issues
- Data Lake/Data Issues/2021-02-09 Unique Devices By Family Overcount
- Data Lake/Data Issues/2021-06-04 Traffic Data Loss
- Data Lake/Data Issues/2023-01-08 Webrequest Data Loss
- Data Lake/Data Issues/2023-11 eventgate-analytics-external Data Loss
- Data Lake/Edits
- Data Lake/Edits/Edit hourly
- Data Lake/Edits/Geoeditors
- Data Lake/Edits/Geoeditors/Public
- Data Lake/Edits/MediaWiki history
- Data Lake/Edits/MediaWiki history/Revision identity reverts
- Data Lake/Edits/MediaWiki history dumps
- Data Lake/Edits/MediaWiki history dumps/FAQ
- Data Lake/Edits/MediaWiki history dumps/Python spark examples
- Data Lake/Edits/MediaWiki history dumps/Scala spark examples
- Data Lake/Edits/Mediawiki history dumps/Python Dask examples
- Data Lake/Edits/Mediawiki history dumps/Python Pandas examples
- Data Lake/Edits/Mediawiki history reduced
- Data Lake/Edits/Mediawiki page history
- Data Lake/Edits/Mediawiki project namespace map
- Data Lake/Edits/Mediawiki user history
- Data Lake/Edits/Metrics
- Data Lake/Edits/Public
- Data Lake/Edits/Structured data/Commons entity
- Data Lake/Events
- Data Lake/Traffic
- Data Lake/Traffic/Banner activity
- Data Lake/Traffic/BotDetection
- Data Lake/Traffic/Browser general
- Data Lake/Traffic/Caching
- Data Lake/Traffic/Interlanguage
- Data Lake/Traffic/Mediacounts
- Data Lake/Traffic/Pagecounts-ez
- Data Lake/Traffic/Pageview actor
- Data Lake/Traffic/Pageview hourly
- Data Lake/Traffic/Pageview hourly/Fingerprinting Over Time
- Data Lake/Traffic/Pageview hourly/Identity reconstruction analysis
- Data Lake/Traffic/Pageview hourly/K Anonymity Threshold Analysis
- Data Lake/Traffic/Pageview hourly/Sanitization
- Data Lake/Traffic/Pageview hourly/Sanitization algorithm proposal
- Data Lake/Traffic/Pageviews
- Data Lake/Traffic/Pageviews/Bots
- Data Lake/Traffic/Pageviews/Bots Research
- Data Lake/Traffic/Pageviews/Redirects
- Data Lake/Traffic/Projectview hourly
- Data Lake/Traffic/ReaderCounts
- Data Lake/Traffic/SessionLength
- Data Lake/Traffic/Unique Devices
- Data Lake/Traffic/Unique Devices/Automated traffic correction
- Data Lake/Traffic/Unique Devices/Last access solution
- Data Lake/Traffic/Unique Devices/Last access solution/Validation
- Data Lake/Traffic/UserRetention
- Data Lake/Traffic/Virtualpageview hourly
- Data Lake/Traffic/Webrequest
- Data Lake/Traffic/Webrequest/RawIPUsage
- Data Lake/Traffic/Webrequest/Tagging
- Data Lake/Traffic/mediawiki api request
- Data Lake/Traffic/mobile apps session metrics
- Data Lake/Traffic/mobile apps uniques