Tuesday, 10 October 2017

Plotting Lustre MDS stats

At $dayjob we have several large filesystems - for example our /scratch system has 3.1 PB of space using over 1000 HDDs. Although each vendor offers their own dashboard for monitoring they're all a little bit crap and don't integrate with anything else.

Cue an afternoon setting up influxdb (trivial) and grafana (also trivial) on a spare VM and a simple python script run on the metadata servers:

[admin@snx11038n003 ~]$ cat push_mdt_stats.py
#!/usr/local/bin/python2.7
import urllib
import time

def grabbit(mds):
 post = ""
 with open(('/proc/fs/lustre/mdt/%s/md_stats' % mds), 'r') as f:
    for line in f:
        k,v,null = line.split(None,2)
        if k == "snapshot_time":
            ts=int(float(v)*1000000)
        else:
            post += 'metadata,fs={3} {0}={1} {2}\n'.format(k,v,ts,mds)
 with open(('/proc/fs/lustre/mdd/%s/changelog_users' % mds), 'r') as f:
    tmp = f.read().split()
    # we can cheat here as they have the same format - 3rd item in list is current changelog count, and then
    # from the 6th item on we get changelog id / position to pull into a dict
    head = int(tmp[2])
    clog = dict(zip(tmp[5:][0::2], tmp[5:][1::2]))
    post += 'changelog,fs={2} head={0} {1}\n'.format(head,ts,mds)
    for cl,count in clog.items():
        post += 'changelog,fs={3} {0}={1} {2}\n'.format(cl,count,ts,mds)

 post=post.encode('ascii')
 p = urllib.urlopen('http://influxbox:8086/write?db=lustre&precision=u',post)
 #print(p.getcode())

while True:
  try:
    grabbit('snx11038-MDT0000')
  except:
    sys.exit("Whoa, that went a bit Pete Tong!")
  time.sleep(10)

And a couple of clicks in Grafana can soon knock up a dashboard:


No comments:

Word of mouth Skye History

Many years ago we lived in the Old Manse in Waternish, Skye. If you look on the maps, you'll spot that unlike nearly all the other house...