User:Milimetric/Notebook/Data pipeline options for tracking reuse

Wikimedia content is spreading to more and more interfaces. From interfaces we control on our sites and apps, to Digital assistant answers, to internal research at other organizations and governments. This essay will try and give some quick representative examples and lay out options for how we might capture metadata about this in a way that works for us and our partners in open knowledge. It's basically analytics for Knowledge as a Service.

Examples by Type

online: w:Google_Knowledge_Graph, w:Siri#Features_and_options, Amazon X-ray
offline: [Kiwix https://www.kiwix.org/]
aggregated metadata: Lots of examples, Alexa uses Wikipedia pageview dumps
indirect: Governments use the Global Innovation Index, which relies on Analytics/Data_Lake/Edits/Geoeditors

Challenge

We are currently able to track pageviews on our own sites and apps and instrumentation from interfaces we control. How do we also build a way for other users and re-distributors of our content to report back on the usage? We can start by listing options that we think may be feasible and iterating with various partners to find the best first choice.

Options

Event_Platform/EventGate. We can simply build schemas that outline what data we're interested in and ask our clients to please HTTP post them to us. There are two distinct options here:
- Batched data sent to us periodically
- Real-time data
Periodic custom import. We could negotiate schemas with partners for periodic XML/JSON/TSV/... dumps and build AirFlow jobs to integrate these from a variety of sources.
High level report. Our partners may only be willing to give us a rough high level understanding of how Wikimedia content is used by their features.

Minimize our Work

In going with any of the options here, we would want to minimize our work by:

Using the smallest possible set of schemas to describe content reuse, each new schema means custom integration work
Speak to a wide variety of partners before making too many decisions, to prevent having to re-work agreements or having to implement multiple solutions