A quick disk space health check service in Python

Update Jan 3, 2023: the practices described in this post are outdated. It is recommended to use an off-the-shelf utility instead. This post is kept for archival purposes only.

The challenge with our cross-organization team is that alerts are not configured properly and the operations team suffers from alert fatigue. Often a disk usage alert triggers no action until the staff learn the hard way and develop sensitivity towards certain types of alerts. Worse, a full partition does not always bring down the entire application (or its health check port). It puts the application into a bad state in which transactions fail while the load balancer still thinks the application is up. This is both an operational pain and a troubleshooting obstacle.

We need a direct mechanism to report disk usage in a layer 7 health check response, independent of the application vendor. Python comes in handy here as a very flexible language that can even host an HTTP service. Here is a quick implementation of a simple HTTP service that reports the disk usage percentage.

#! /usr/bin/env python
# https://github.com/digihunch/pytk/
# Working with python 2.7.5. This SimpleHTTP returns disk usage of /home and / through web port

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
import os
import json
import socket  # needed for the socket.timeout handler below

class HCHTTPReqHandler(BaseHTTPRequestHandler):
    def _set_headers_normal(self):
        self.send_response(200)
        self.send_header('Content-type','application/json')
        self.end_headers()

    def _set_headers_bad(self):
        self.send_response(500)
        self.send_header('Content-type','application/json')
        self.end_headers()

    def do_GET(self):
        if self.path =="/status":
            try:
                df_res = os.popen("df / /home | tail -n+2 | awk '{print $4,$5,$6}'").read()
                [root_du,home_du] = df_res.splitlines()
                [rkb,rpc,rmp] = root_du.split()
                [hkb,hpc,hmp] = home_du.split()
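                # df prints sizes as text; convert to int, since a str-to-int comparison is always true in Python 2
                if int(rkb) > 2000000 and int(hkb) > 1000000: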
                    res_stat="okay"
                else:
                    res_stat="bad"
                res_json = {}
                res_json['status'] = res_stat
                res_json[hmp] = hpc
                res_json[rmp] = rpc
                res_out = json.dumps(res_json)+"\n"
                self._set_headers_normal()
                self.wfile.write(res_out)
            except socket.timeout as est:
                self.log_error("Request timed out:%r", est)
                self.close_connection = True
                return
            except Exception:
                res_json = {}
                res_json['status'] = 'no result'
                res_out = json.dumps(res_json)+"\n"
                self._set_headers_bad()
                self.wfile.write(res_out)
        else:
            self._set_headers_bad()
            self.wfile.write("Invalid Path. Please try /status.\n")

def run(server_class=HTTPServer, handler_class=HCHTTPReqHandler, port=8090):
    try:
        server_address = ('', port)
        httpd = server_class(server_address, handler_class)
        print 'Starting httpd... on port ' + str(port) + '. Press Ctrl+C to quit.'
        httpd.serve_forever()
    except KeyboardInterrupt:
        print 'Quit health check.'
        httpd.socket.close()

if __name__ == "__main__":
    from sys import argv
    if len(argv) == 2:
        run(port=int(argv[1]))
    else:
        run()

The latest version of the code is available on GitHub, with instructions for configuring a systemd service to start this web server automatically. It depends on Python 2.7. I tested it by issuing 100,000 calls back to back, and the process appears robust. The next step is to integrate it with the load balancer in production and assess how it works.

This web server, although simple, needs to handle some outlier situations, such as a mount point that is lost or unresponsive. One option is to implement a timeout mechanism in a separate thread. I did not do that; instead I asked IT support to configure the load balancer health check with a timeout.
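
For illustration, the thread-based timeout mentioned above could look roughly like the sketch below. It is not part of the original service. It assumes Python 3 (the listing above targets Python 2.7) and uses `shutil.disk_usage` instead of shelling out to `df`. The function name `disk_usage_with_timeout` and its `timeout_s` and `probe` parameters are hypothetical names chosen for this example.

```python
import shutil
import threading


def disk_usage_with_timeout(path, timeout_s=2.0, probe=shutil.disk_usage):
    """Return probe(path), or None if it fails or takes longer than timeout_s seconds."""
    result = {}

    def worker():
        # Runs in a separate thread so a hung mount point cannot block the caller.
        try:
            result["value"] = probe(path)
        except Exception as exc:
            result["error"] = exc

    # A daemon thread does not keep the process alive if the probe never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive() or "error" in result:
        return None
    return result["value"]


if __name__ == "__main__":
    usage = disk_usage_with_timeout("/")
    if usage is None:
        print("no result")
    else:
        print("free bytes:", usage.free)
```

A request handler could treat a `None` result the same way the listing treats its "no result" case. Note that a probe stuck on an unresponsive mount leaves its worker thread behind, so this only protects the response time and does not clean up the stuck call.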

This article is also a test of Syntax Highlighter code plugin for WordPress.

Follow-up. The quick web server immediately ran into two issues and suggested one improvement:

  1. Although it was tested with 100,000 curl calls back to back, it does not seem to perform well when the source of the call is remote. Some calls take 6~7 seconds.
  2. Load balancers issue far more calls than I anticipated, to the point that I am considering caching the result instead of allowing each external call to initiate a df command.
  3. It would be helpful to front the monitor with nginx.
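
The caching idea in item 2 could be sketched as follows. This is a minimal illustration, not what the original service does. It assumes Python 3 and `shutil.disk_usage`, and the class name `CachedDiskStatus` with its `ttl_s`, `probe`, and `clock` parameters is hypothetical.

```python
import shutil
import threading
import time


class CachedDiskStatus:
    """Cache the disk usage percentages and refresh them at most once per ttl_s seconds."""

    def __init__(self, paths, ttl_s=10.0, probe=shutil.disk_usage, clock=time.monotonic):
        self._paths = list(paths)
        self._ttl_s = ttl_s
        self._probe = probe
        self._clock = clock
        self._lock = threading.Lock()
        self._expires_at = None  # None means nothing has been cached yet
        self._value = None

    def get(self):
        # The lock keeps concurrent requests from probing the disk at the same time.
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._value = {p: self._percent_used(p) for p in self._paths}
                self._expires_at = now + self._ttl_s
            return self._value

    def _percent_used(self, path):
        usage = self._probe(path)
        return 100 * usage.used // usage.total


if __name__ == "__main__":
    cache = CachedDiskStatus(["/"], ttl_s=10.0)
    print(cache.get())
    print(cache.get())  # served from the cache, no second probe
```

With one shared instance, a handler would call `cache.get()` on each request, so a burst of load balancer probes costs one disk check per interval rather than one per call. The ten-second interval is an arbitrary choice for the example.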

So the recommendation is: do not reinvent the wheel. Use a metrics and monitoring tool such as Prometheus or Elasticsearch instead.