Tuesday 10 October 2017

Plotting Lustre MDS stats

At $dayjob we have several large filesystems - for example our /scratch system has 3.1 PB of space using over 1000 HDDs. Although each vendor offers their own dashboard for monitoring they're all a little bit crap and don't integrate with anything else.

Cue an afternoon setting up influxdb (trivial) and grafana (also trivial) on a spare VM and a simple python script run on the metadata servers:

[admin@snx11038n003 ~]$ cat push_mdt_stats.py
#!/usr/local/bin/python2.7
import urllib
import time

def grabbit(mds):
 post = ""
 with open(('/proc/fs/lustre/mdt/%s/md_stats' % mds), 'r') as f:
    for line in f:
        k,v,null = line.split(None,2)
        if k == "snapshot_time":
            ts=int(float(v)*1000000)
        else:
            post += 'metadata,fs={3} {0}={1} {2}\n'.format(k,v,ts,mds)
 with open(('/proc/fs/lustre/mdd/%s/changelog_users' % mds), 'r') as f:
    tmp = f.read().split()
    # we can cheat here as they have the same format - 3rd item in list is current changelog count, and then
    # from the 6th item on we get changelog id / position to pull into a dict
    head = int(tmp[2])
    clog = dict(zip(tmp[5:][0::2], tmp[5:][1::2]))
    post += 'changelog,fs={2} head={0} {1}\n'.format(head,ts,mds)
    for cl,count in clog.items():
        post += 'changelog,fs={3} {0}={1} {2}\n'.format(cl,count,ts,mds)

 post=post.encode('ascii')
 p = urllib.urlopen('http://influxbox:8086/write?db=lustre&precision=u',post)
 #print(p.getcode())

while True:
  try:
    grabbit('snx11038-MDT0000')
  except:
    sys.exit("Whoa, that went a bit Pete Tong!")
  time.sleep(10)

And a couple of clicks in Grafana can soon knock up a dashboard:


Feeling Pumped!

Having just had a day without power, and then going round the site to check everything came back online correctly (including services such a...