User talk:Triciaburmeister/Sandbox/Data platform/Discover data

From Wikitech

DataHub domains

In the Traffic section of this doc, I linked to datasets tagged with "traffic", even though there is a Traffic domain: https://datahub.wikimedia.org/domain/urn:li:domain:1228dd29-cce4-4be0-8ac2-745d82d4113b/Entities?is_lineage_mode=false. That's the only domain we've created so far in DataHub. It would be great if we could have domains in DataHub for Traffic, Content, Contributions/Edits, and maybe also Instrumentation and Essential Metrics, though I'm less sure about those last two. I think curating these as domains would be better than as tags, because it would free up tags for more specific sub-topics within each of those large domains. But that opinion is based on only a small amount of investigation and on reading https://datahubproject.io/docs/domains/; it should be decided as part of a larger data governance strategy. Triciaburmeister (talk) 17:57, 27 March 2024 (UTC)

I like this approach for domains and agree tags should be used for more specific things. I think we have to assign DataHub data assets to a domain manually, but that should be easy enough to do and a simple step for any new data assets. I think we could probably do an exercise with interested teams to classify and decide on required domains (and naming). Luke Bowmaker (talk) 14:27, 4 April 2024 (UTC)
This seems like a very good idea to me too. One point that you may like to know about @Triciaburmeister is that we are on the verge of being able to use nested domains.
See: https://github.com/datahub-project/datahub/releases/tag/v0.12.0 for information about the feature.
This use of a domain hierarchy might give further options to your coarse- and fine-grained categorization approach. Just a thought. We should have this feature available to use with version 0.12 in the next couple of weeks, all being well. Btullis (talk) 08:54, 9 April 2024 (UTC)

Which public data sources are duplicative?

To simplify this page, it would be nice to only highlight public data sources that don't have a private counterpart already listed. The full list of public data sources is covered at https://meta.wikimedia.org/wiki/Research:Data. Since the primary audience of this page is people with access to private data, it would be most effective to only list public data that isn't covered by or easily accessible through any of the private data sources that are already listed. Which public data sources are duplicative and can be removed? Or, do you think we should list all the things even if it means the page is longer and more cluttered? Triciaburmeister (talk) 17:00, 1 April 2024 (UTC)

I think listing everything is OK. It’s usually easier to access things in the internal private way than through the public APIs/files, so having both gives an option. Luke Bowmaker (talk) 14:36, 4 April 2024 (UTC)

Instrumentation datasets

Duplicating a question I asked in Slack here: which datasets should be listed in this section? @Vpoundstone replied that "Historically all the instrument schemas and data has been owned by the individual product teams." So: should we just remove this section for now? Or can anyone provide a reasonable list of datasets that would be useful to link to from this section? Triciaburmeister (talk) 17:02, 1 April 2024 (UTC)

I think we should just include datasets we own/created. There are datasets outside of instrumentation that have been created by analytics or feature teams that use our stack but we aren’t experts in the logic or how to use the data. Luke Bowmaker (talk) 23:38, 4 April 2024 (UTC)
That makes sense; which datasets are the ones that DPE owns/has created? Can you add links to them in that section on the page? Thanks! Triciaburmeister (talk) 16:17, 5 April 2024 (UTC)
I think that’s covered quite well in all of the other sections, things like webrequests, mwhistory, etc. Those are what we would call foundational datasets that other teams rely on to build new data products.
However, there are a lot of other data pipelines - would we want to add all of these?
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic Luke Bowmaker (talk) 12:41, 8 April 2024 (UTC)
I was trying to prevent the page from becoming overwhelming by only linking to the canonical or most prominent data sources, and then having a link to the page you mentioned. If there are a few other pipelines you think should be highlighted on the landing page, we can add those, but I think adding all of them would be too much. The goal of this landing page is to help people navigate to the more detailed data pipeline information by providing just enough context to make it clear which link to click to get to the next page you need. Triciaburmeister (talk) 16:02, 9 April 2024 (UTC)
Based on comments in Slack threads, I've changed this section to attempt to describe very generally what instrumentation data is, and then link to the Collect data landing page where we will describe the tools and process for creating instruments and managing related data. Triciaburmeister (talk) 15:59, 9 April 2024 (UTC)

Notes about existing page content that will be superseded by this page

All content in the "Data available" section of https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake#Data_available is covered on this new landing page draft.

All content on https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content is covered on this new landing page draft.

I propose that when this page is published, the content of that section and page should be deleted and redirected to the new landing page. Triciaburmeister (talk) 17:05, 1 April 2024 (UTC)

Sounds good to me! ODimitrijevic (talk) 18:31, 4 April 2024 (UTC)

Which datasets are missing? Which should be removed?

1. Are there datasets that you use regularly, or which you consider to be essential or canonical data sources, but which are not listed on the page? Feel free to make edits and add them to the appropriate section.

2. Are there datasets that you know are going to be deprecated, or which contain data that is better represented in some other source? Feel free to comment or just remove listings that you think shouldn't be there. Triciaburmeister (talk) 17:09, 1 April 2024 (UTC)

How do you feel about event-based datasets? We have internal ones that could be interesting, and some public ones:
https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.page-create
Possibly an event section or include these in the current sections. For example, this is a private stream that could go in Private Content Data:
https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment/SLO/Mediawiki_Page_Content_Change_Enrichment
Let me know what you think. Luke Bowmaker (talk) 14:42, 4 April 2024 (UTC)

[resolved] Essential metrics datasets

Are there existing datasets or data sources to link to in this section? If not, should it be removed? If all we have is project documentation for yet-to-be-implemented resources, we can temporarily link to the relevant project pages in the team's documentation on MediaWiki: https://www.mediawiki.org/wiki/Data_Platform_Engineering/Data_Products. However, we should NOT put project documentation on Wikitech. The "Essential metrics" section on this landing page should link to data sources that are implemented and have technical (not project) documentation on Wikitech. Triciaburmeister (talk) 17:51, 1 April 2024 (UTC)

Resolved in f2f discussion with @ODimitrijevic; for now we will remove this section because there are no published datasets ready to link to. Triciaburmeister (talk) 13:55, 4 April 2024 (UTC)

Collapsible sections for top-level data domains

I am wondering whether there might be value in using collapsible sections to make the initial view of the page more accessible.

For example, some of the collapsible div options outlined here: https://www.mediawiki.org/wiki/Manual:Collapsible_elements/Demo/Advanced

If the user is initially greeted with collapsed sections for:

  • Traffic Data
  • Content Data
  • Contributing and edits data
  • etc.

Maybe this is no better than the existing TOC, but I would suggest that there is so much information available from the landing page that it becomes a fairly lengthy page in itself.

Inviting the user to click through to expand those sections that most interest them might be a way to lead them on a journey somewhat. Btullis (talk) 10:03, 9 April 2024 (UTC)

This demo page is awesome, I don't know if I ever saw it before. I'll play around with this! I was hoping we could remove some of the public data sources from the page so it wouldn't be so overwhelming, but feedback has been to keep them, so now we do need to investigate ways to reduce the info overload via page formatting tricks. Triciaburmeister (talk) 16:12, 9 April 2024 (UTC)
Okay, I created some demos of how this could look with our actual content, but I don't really love any of them. What do you think?
The main thing I don't like about this is that it takes the page from too much info to not enough info, making it so you have to click the expand link to even be able to skim the content. I think it would be better to try to streamline the page contents more to make the amount of information skimmable. But, that is definitely also a challenge. So, let me hear your thoughts and maybe we can come up with something that works! Triciaburmeister (talk) 18:20, 9 April 2024 (UTC)
p.s. I think my demo at https://www.mediawiki.org/wiki/User:TBurmeister_(WMF)/Sandbox/Fancytables#Collapsible_section_example is the closest to what you were originally describing in your comment, and it might be the option I dislike the least. So maybe that's worth something! Triciaburmeister (talk) 18:21, 9 April 2024 (UTC)