TensorFlow for Mesos
I developed the open-source project tensorflow-mesos to run distributed TensorFlow training on Apache Mesos clusters. My goal was to use existing cluster resources efficiently for machine learning workloads.
The project implements a custom Mesos framework in Python that dynamically distributes TensorFlow jobs across worker nodes. It uses TensorFlow 2.x’s parameter server architecture, with Mesos handling resource allocation for CPU, RAM, and GPU. Parameter servers and workers run as isolated processes on the cluster.
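To make the placement step concrete, here is a minimal sketch of how resource offers could be matched to parameter server and worker tasks and turned into the TF_CONFIG environment variable that TensorFlow reads to discover its cluster. It is not the project's actual code: the Offer and TaskSpec types, the place_tasks and tf_config helpers, the first-fit strategy, and the base port 2222 are all assumptions made for illustration, and a real parameter server cluster would usually also include a coordinator ("chief") entry, which is left out here.

```python
import json
from dataclasses import dataclass


@dataclass
class Offer:
    """Simplified stand-in for a Mesos resource offer (hypothetical type)."""
    host: str
    cpus: float
    mem_mb: int
    gpus: int = 0


@dataclass
class TaskSpec:
    """Resources one TensorFlow process needs (hypothetical type)."""
    job: str  # "ps" or "worker"
    cpus: float
    mem_mb: int
    gpus: int = 0


def place_tasks(offers, tasks, base_port=2222):
    """Greedy first-fit: give each task to the first offer that still fits it.

    Returns a cluster dict mapping job name to a list of "host:port" strings.
    Raises RuntimeError if some task fits on no remaining offer.
    """
    # Work on copies so the caller's offers are not mutated.
    remaining = [Offer(o.host, o.cpus, o.mem_mb, o.gpus) for o in offers]
    next_port = {o.host: base_port for o in offers}
    cluster = {}
    for task in tasks:
        for offer in remaining:
            fits = (offer.cpus >= task.cpus
                    and offer.mem_mb >= task.mem_mb
                    and offer.gpus >= task.gpus)
            if fits:
                # Deduct the task's share from the offer.
                offer.cpus -= task.cpus
                offer.mem_mb -= task.mem_mb
                offer.gpus -= task.gpus
                # Each process on a host gets its own port.
                port = next_port[offer.host]
                next_port[offer.host] = port + 1
                cluster.setdefault(task.job, []).append(f"{offer.host}:{port}")
                break
        else:
            raise RuntimeError(f"no offer fits {task.job} task")
    return cluster


def tf_config(cluster, job, index):
    """Build the TF_CONFIG JSON string for one process in the cluster."""
    return json.dumps({"cluster": cluster,
                       "task": {"type": job, "index": index}})
```

With one offer of 4 CPUs and 8 GB on a host named node-a and one offer of 8 CPUs, 16 GB, and 2 GPUs on node-b, a parameter server task lands on node-a and two GPU workers share node-b on consecutive ports. Each launched process would then receive its own tf_config(...) string in its environment, so the processes stay isolated but agree on the cluster layout.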
This setup suits training large deep learning models: spreading the work across nodes can shorten training times while making use of existing on-premises infrastructure. The project is aimed at teams that already operate ML infrastructure on Mesos and want to avoid the added complexity of a cloud deployment.
