User:Jbond/debugging
USE method
http://www.brendangregg.com/USEmethod/use-linux.html
Logs
https://wikitech.wikimedia.org/wiki/Logs
Network
https://wikitech.wikimedia.org/wiki/Network_cheat_sheet#Juniper
Sampled-1000.json on centrallog1001
https://wikitech.wikimedia.org/wiki/Logs/Runbook#Webrequest_Sampled
Example of digging into the data (from cdanis)
$ tail -n300000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | .uri_host' | sort | uniq -c | sort -gr
45371 www.wikipedia.org
728 en.wikipedia.org
16 upload.wikimedia.org
6 de.wikipedia.org
5 pt.wikipedia.org
5 fr.wikipedia.org
5 es.wikipedia.org
4 query.wikidata.org
2 nl.wikipedia.org
2 ja.wikipedia.org
2 api.wikimedia.org
1 sv.wikipedia.org
$ tail -n300000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org")' | less
$ tail -n300000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | .uri_path' | sort | uniq -c | sort -gr
45371 /
$ tail -n300000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | .uri_query' | sort | uniq -c | sort -gr | head
1 ?q=ZZZWF8bdj6hw
1 ?q=zZzsfU01A8F4
1 ?q=ZzzLH6zEJRvD
1 ?q=ZzZiRz0QoPBK
1 ?q=zZWIuevTlAOu
1 ?q=ZzvAdulyFrRe
1 ?q=ZZv96mB4T6WK
1 ?q=zzUrTAWa2kA8
1 ?q=zzUPPhnOicQ4
1 ?q=ZZT8Y8D2gRnE
$ tail -n400000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | select(.uri_query|test("^\\?q=[^&]+$")) | .user_agent' | sort | uniq -c | sort -gr
7711 Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
7656 Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko
7567 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3599.0 Safari/537.36
7535 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.18247
7451 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3599.0 Safari/537.36
7451 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3599.0 Safari/537.36
$ tail -n400000 /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | select(.uri_query|test("^\\?q=[^&]+$")) | .tls' | sort | uniq -c | sort -gr
43774 vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new
1597 vers=TLSv1.2;keyx=UNKNOWN;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
mw server
List all IPs which have made more than 100 large requests
$ awk '$2>60000 {print $11}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | awk '$1>100 {print}'
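The field positions assumed by the command above ($2 for the response size, $11 for the client IP) depend on the configured Apache LogFormat. A minimal sketch of the same filter logic on synthetic log lines:

```shell
# Synthetic lines mimicking other_vhosts_access.log: field 2 is assumed to be
# the response size and field 11 the client IP, matching the command above.
printf '%s\n' \
  'en.wikipedia.org:443 70000 - - - - - - - - 192.0.2.10' \
  'en.wikipedia.org:443 500 - - - - - - - - 192.0.2.11' \
  'en.wikipedia.org:443 90000 - - - - - - - - 192.0.2.10' \
  | awk '$2>60000 {print $11}' | sort | uniq -c
# one output line: count 2 for 192.0.2.10 (only the two >60000-byte requests)
```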
MediaWiki Shell
$ ssh mwmaint1002
$ mwscript maintenance/shell.php --wiki=enwiki
Then
>>> var_dump($wgUpdateRowsPerQuery);
int(100)
=> null
>>>
One-off purge
On mwmaint1002, run:
$ echo 'https://example.org/foo?x=y' | mwscript purgeList.php
re: https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge
LVS Server
Sample 100k packets and list top talkers
$ sudo tcpdump -i enp4s0f0 -pn -c 100000 | sed -r 's/.* IP6? //;s/\.[^\.]+ .*//' | sort | uniq -c | sort -nr | head -20
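A sketch of the sed extraction on a synthetic tcpdump line (real output varies with tcpdump version and options): the first substitution strips everything up to the source address, the second strips the port and the rest of the line.

```shell
# Hypothetical tcpdump output line; addresses are documentation-range IPs.
line='12:00:00.000000 IP 192.0.2.1.54321 > 198.51.100.2.443: Flags [S], seq 0'
echo "$line" | sed -r 's/.* IP6? //;s/\.[^\.]+ .*//'
# → 192.0.2.1
```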
Testing a site against a specific LVS
$ curl --connect-to "::text-lb.${site}.wikimedia.org" https://en.wikipedia.org/wiki/Main_Page?x=$RANDOM
CP Server
Query for specific status code
$ sudo varnishncsa -n frontend -g request -q 'RespStatus eq 429'
Custom format with client IP address
$ sudo -i varnishncsa -n frontend -g request -q 'RespStatus eq 429' -F '%{X-Client-IP}i %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{X-Forwarded-Proto}i\"'
Or the much more verbose version
$ sudo varnishlog -n frontend -g request -q 'RespStatus eq 429'
Check the connection tuples for the varnish
$ sudo ss -tan 'sport = :3120' | awk '{print $(NF)" "$(NF-1)}' | sed 's/:[^ ]*//g' | sort | uniq -c
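A sketch of the tuple-counting pipeline on synthetic `ss -tan` output: the last two fields are the peer and local address:port, and stripping the ports leaves the (peer, local) IP pair to count.

```shell
# Two hypothetical established connections from the same client IP.
printf '%s\n' \
  'ESTAB 0 0 10.64.0.1:3120 10.64.32.5:41000' \
  'ESTAB 0 0 10.64.0.1:3120 10.64.32.5:41002' \
  | awk '{print $(NF)" "$(NF-1)}' | sed 's/:[^ ]*//g' | sort | uniq -c
# one output line: count 2 for the pair "10.64.32.5 10.64.0.1"
```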
The number of available local ports sets an upper bound on the number of connection tuples. If the count above approaches the number of ports in the range below, there may be an issue
$ cat /proc/sys/net/ipv4/ip_local_port_range
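The size of that range gives the number of usable ephemeral ports. A minimal sketch of the calculation; the range shown is a common Linux default, not necessarily this host's (on a real host, pipe the file above into the awk instead):

```shell
# ip_local_port_range holds two numbers: low and high port (inclusive).
range='32768	60999'   # typical default, tab-separated
echo "$range" | awk '{print $2 - $1 + 1}'   # → 28232 usable ports
```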
Checking sites from CP server
You can use curl from the cp servers to hit the frontend or backend cache directly and fetch a specific site with the following commands. Using $RANDOM below prevents us from hitting the cache
frontend
$ curl --connect-to "::$HOSTNAME" https://en.wikipedia.org/wiki/Main_Page?x=$RANDOM
backend
$ curl --connect-to "::$HOSTNAME:3128" -H "X-Forwarded-Proto: https" https://en.wikipedia.org/wiki/Main_Page?x=$RANDOM
Proxied web service
Show all request and response headers on loopback
$ sudo stdbuf -oL -eL /usr/sbin/tcpdump -Ai lo -s 10240 "tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)" | egrep -a --line-buffered ".+(GET |HTTP\/|POST )|^[A-Za-z0-9-]+: " | perl -nle 'BEGIN{$|=1} { s/.*?(GET |HTTP\/[0-9.]* |POST )/\n$1/g; print }'
re: https://serverfault.com/a/633452/464916
Show full body
$ sudo stdbuf -oL -eL /usr/sbin/tcpdump -Ai lo -s 10240 "tcp port 8001 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)"
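The BPF expression in both commands matches only packets with a non-empty TCP payload: `ip[2:2]` is the IP total length, `(ip[0]&0xf)<<2` the IP header length in bytes, and `(tcp[12]&0xf0)>>2` the TCP header length in bytes. A worked example with hypothetical header values:

```shell
# payload = ip_total_len - ip_header_len - tcp_header_len
total_len=60   # ip[2:2]: total packet length in bytes
ip_hdr=20      # (ip[0]&0xf)<<2: IHL of 5 words = 20 bytes
tcp_hdr=20     # (tcp[12]&0xf0)>>2: data offset of 5 words = 20 bytes
echo $(( total_len - ip_hdr - tcp_hdr ))   # → 20 bytes of payload, so it matches
```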
Pooling
Check the pooled state
Service
$ confctl select service=thumbor get
Host
$ confctl select dc=eqiad,cluster=cache_text,service=varnish-be,name=cp1052.eqiad.wmnet get
Depooling
https://wikitech.wikimedia.org/wiki/Depooling_servers
pybal
Check log files /var/log/pybal.log on lvs servers
Postgresql
Display locks
SELECT a.datname,
l.relation::regclass,
l.transactionid,
a.query,
age(now(), a.query_start) AS "age",
a.pid
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
ORDER BY a.query_start;
Show queries blocked waiting on a lock
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Get table sizes
SELECT nspname || '.' || relname AS "relation",
pg_size_pretty(pg_relation_size(C.oid)) AS "disk size",
pg_size_pretty( pg_total_relation_size(nspname || '.' || relname)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname IN ('public')
ORDER BY pg_relation_size(C.oid) DESC;
DHCPd
Use the following to capture DHCP traffic for a specific client MAC address. In this example the MAC address is aa:00:00:d9:81:8a; we use only the last 4 bytes (00:d9:81:8a) in the filter below
$ sudo tcpdump -i ens5 -vvv -s 1500 '((port 67 or port 68) and (udp[38:4] = 0x00d9818a))'
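The hex constant in the filter is just the last 4 bytes of the MAC with the colons removed. A sketch of deriving it from the example MAC above:

```shell
mac='aa:00:00:d9:81:8a'
# keep fields 3-6 (the last 4 bytes) and concatenate them
hex=$(echo "$mac" | awk -F: '{print $3 $4 $5 $6}')
echo "0x$hex"   # → 0x00d9818a, the value used in the udp[38:4] comparison
```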
iPXE cli
While booting, press Ctrl+B to drop into the iPXE shell. You may need to use the advanced console connection options
Gerrit
Received disconnect from 208.80.154.151 port 29418:12: Too many concurrent connections (4) - max. allowed: 4
First list connections
$ sudo ss -a "sport = :29418"
tcp ESTAB 0 0 [::ffff:208.80.154.151]:29418 [::ffff:145.224.124.73]:35544
tcp ESTAB 0 0 [::ffff:208.80.154.151]:29418 [::ffff:145.224.124.73]:20486
tcp ESTAB 0 0 [::ffff:208.80.154.151]:29418 [::ffff:145.224.124.73]:45194
Once you have found your IP address, kill all the connections to it
$ sudo ss -K "dst [::ffff:145.224.124.73]"
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 0 [::ffff:208.80.154.151]:29418 [::ffff:145.224.124.73]:35544
tcp ESTAB 0 0 [::ffff:208.80.154.151]:29418 [::ffff:145.224.124.73]:20486
tcp ESTAB 0 0 [::ffff:208.80.154.151]:29418 [::ffff:145.224.124.73]:45194