User:CPettet (WMF)/analytics-tm/questions
- What is the rsync daemon for on stat1005?
- stat boxes are all allowed to rsync to each other in /srv
- rsync to thorium
- rsynced to something in dumps?
- datasets.wikimedia.org is more ad hoc. Aaron Halfaker has used it to publish data for papers and stuff.
- SWAP is internal via SSH tunneling. Is the thinking that this would be a consistent model?
- https://wikitech.wikimedia.org/wiki/Proxy_access_to_cluster
- druid too https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
- "The fastest way to figure it out is to try to establish a ssh tunnel like the following to one random Druid node of the target cluster"
- definitely not going to expose it; tricky because it's basically a full shell in the prod network in a web browser. Shells that get launched automatically set the http_proxy env var, and also pip for ad-hoc packages
- druid and hive and everything? fancy web shell within analytics cluster
- notebooks never purged (hive databases within hadoop too!)
- try to make purging part of offboarding for users
- stat box users usually download stuff in hadoop-specific formats that look like binary files on the stat boxes
- PI/PII issues!
- snappy format
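The Druid tunnel quoted above would look roughly like this (the node name and port here are assumptions; the linked wikitech page has the real ones):

```shell
# Hypothetical tunnel to one Druid node, forwarding the broker port locally
# (hostname and port are illustrative only)
ssh -N -L 8082:localhost:8082 druid1001.eqiad.wmnet
# then, from another terminal on your workstation:
# curl http://localhost:8082/status
```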
- should https://hue.wikimedia.org/accounts/login/?next=/ be behind basic auth?
- not meant to be atm
- runs an embedded, super old version of django
- is netflow data really stopped or is there new data?
- daemon runs on rhenium collecting all netflow data unencrypted with IPs; it's aggregated and pushed to the kafka jumbo broker and then into Hive
- routers to rhenium: unencrypted
- rhenium to kafka: unencrypted (but aggregated, so is it really a disclosure issue?)
- aggregated between ASes?
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging says you need to be on stat1006 to be able to access the eventlogging db but later says "There are many Kafka tools with which you can read the EventLogging data streams. kafkacat is one that is installed on stat1005." The stat1006 login banner also says "stat1006 is a Statistics general compute node (non private data) (statistics::cruncher)"
- stat1005 and stat1006 roles existed before we had hadoop
- before hadoop or kafka we kept web request data on stat1005
- stat1006 was more of a public place, with more public data and the mediawiki analytics slaves
- stat1006 could technically use a hadoop client to access it
- https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Stats_machines (statistics-users)
- Access to stat1006 for number crunching and accessing non private log files hosted there.
- Where do these private logs come from?
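Per the kafkacat note quoted above, reading an EventLogging stream from a stat box would look something like this sketch (broker address and topic name are assumptions):

```shell
# Consume the last 10 messages from a hypothetical EventLogging topic
# (broker host and topic name are illustrative, not from the notes)
kafkacat -C \
  -b kafka-jumbo1001.eqiad.wmnet:9092 \
  -t eventlogging_SomeSchema \
  -o -10 -e
```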
- when/where do things get pushed to dumps?
- stat1005 via a fuse HDFS mount, for rsync
- does analytics consume directly from labsdb*?
- yes
- sqooped from prod ad hoc -- by users who know how to use sqoop on their own
- sqoops cu_changes usually
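An ad-hoc sqoop import of `cu_changes` might look like the sketch below (the JDBC host, database, credentials path, and target dir are all assumptions):

```shell
# Hypothetical one-off import of the cu_changes table into HDFS
# (connection string and paths are illustrative only)
sqoop import \
  --connect jdbc:mysql://analytics-store.eqiad.wmnet/enwiki \
  --username research --password-file /user/me/.sqoop-password \
  --table cu_changes \
  --target-dir /user/me/cu_changes \
  --num-mappers 1
```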
- Is zookeeper shared between kafka deployments? is zookeeper colocated with etcd?
- yes
- yes
- what all uses zookeeper? kafka, druid (a different one on the druid nodes), hadoop, burrow?
- burrow runs on the kafka tools ganeti VMs ("kafkamon1001" and 2001), poking zookeeper and kafka and producing metrics about consumers to prometheus
- what is varnishkafka-statv?
- mostly used by the perf team
- simple eventlogging that only allows logging in statsd format
- kafka topic
- consumed by a python daemon somewhere that consumes and submits metrics to statsd
- On varnish, statsv is filtered from webrequest via ReqURL ~ "^/beacon/statsv\?"
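As a rough sketch of what that consumer does: the exact statsv wire format is an assumption here (query-string pairs whose values carry a statsd type suffix like `5ms` or `1c`), converted into statsd's `name:value|type` lines:

```shell
# Convert a hypothetical statsv query string into statsd wire format:
#   "foo.bar=5ms&baz=1c"  ->  "foo.bar:5|ms" and "baz:1|c"
statsv_to_statsd() {
  echo "$1" | tr '&' '\n' | \
    sed -E 's/^([^=]+)=([0-9]+)(ms|c|g)$/\1:\2|\3/'
}
statsv_to_statsd "foo.bar=5ms&baz=1c"   # prints foo.bar:5|ms then baz:1|c
```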
- Where is the DB hue uses?
- a mysql database on analytics1003
- What the heck does oozie do? :)
- a scheduler (a cron-like thing) that runs as a daemon on analytics1003 with a DB
- submit what are called workflows to it that run with regularity
- nice feature where it schedules jobs based on the existence of data (an inotify-for-hadoop kind of thing)
- rerun on failures and SLA stuff
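The data-availability scheduling mentioned above is expressed in oozie as a coordinator; a skeletal, hypothetical config sketch (every name, path, and date below is made up):

```xml
<!-- Hypothetical coordinator: run a workflow hourly, but only once the
     input dataset for that hour actually exists in HDFS -->
<coordinator-app name="example-coord" frequency="${coord:hours(1)}"
                 start="2018-01-01T00:00Z" end="2019-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="example-data" frequency="${coord:hours(1)}"
             initial-instance="2018-01-01T00:00Z" timezone="UTC">
      <uri-template>/wmf/data/raw/example/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <!-- the "data exists" check: wait for the _SUCCESS marker file -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="example-data">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>/user/me/example-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```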
- PI or PII backups or storage?
- seems no
- What databases are sqooped into Hadoop?