Uber Engineering released its resource scheduling and management utility, Peloton, as open source today. According to company engineers, Peloton was developed to fill a gap in Uber’s internal software suite: managing compute clusters at web scale in order to improve resource utilization across its infrastructure.
Min Cai, Compute Cluster Platform senior staff engineer at Uber, and Mayank Bansal, staff engineer at Uber Big Data, outlined a number of use cases for Peloton in a blog post. These include preempting traffic spikes, economizing resource use by running complementary workloads together, and allocating dedicated disaster recovery resources for batch jobs until data center failover.
The two wrote that before Peloton, which was first introduced in November, Uber’s workloads were spread across a large number of clusters and that, to their knowledge, no existing tool could economize work across clusters at the scale Uber needed.
“Realizing that these scenarios would enable us to achieve greater operational efficiency, improve capacity planning, and optimize resource sharing, it was evident that we needed to co-locate different workloads together on one single, shared compute platform,” Cai and Bansal wrote. “A unified resource scheduler will enable us to manage all kinds of workloads to use our resources as efficiently as possible both in private data centers and the cloud.”
The specific functions of Peloton are:
- Elastic resource sharing, which allows multiple teams to access resources based on privileges and necessity
- Resource overcommit and task preemption, which “improve cluster utilization by scheduling workloads using slack resources and preempting best effort workloads”
- Optimization for Big Data workloads using features of Apache Spark
- Machine learning optimization, with support for GPU and gang scheduling for TensorFlow and Horovod at a scale of thousands of GPUs
- A Protobuf/gRPC-based API that supports “most of the language bindings such as Golang, Java, Python and Node.js”
- Co-scheduling of mixed workloads, which allows batch, stateful and stateless jobs to run on the same cluster
- The ability to scale to “millions of containers and tens of thousands of nodes”
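To make the overcommit-and-preemption item above more concrete, here is a minimal Python sketch of the general idea. It is not Peloton's code or API: the `Task` and `Node` classes, the single CPU dimension, and the "evict most recently placed first" policy are all assumptions chosen for illustration. Best-effort tasks fill whatever capacity is idle, and they are evicted when a regular task needs the room.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task description (not a Peloton type)."""
    name: str
    cpus: int
    best_effort: bool = False  # best-effort tasks may be preempted


@dataclass
class Node:
    """Hypothetical single node with one resource dimension (CPUs)."""
    capacity: int
    running: list = field(default_factory=list)

    def free(self) -> int:
        # Idle CPUs: the "slack" that best-effort tasks are allowed to use.
        return self.capacity - sum(t.cpus for t in self.running)

    def schedule(self, task: Task) -> list:
        """Place `task` and return the tasks preempted to make room.

        Raises RuntimeError if the task cannot fit even after preemption.
        """
        victims = []
        free = self.free()
        if not task.best_effort:
            # Assumed policy: evict the most recently placed
            # best-effort tasks first, until the new task fits.
            for candidate in reversed(self.running):
                if free >= task.cpus:
                    break
                if candidate.best_effort:
                    victims.append(candidate)
                    free += candidate.cpus
        if free < task.cpus:
            # Nothing has been changed yet, so there is nothing to roll back.
            raise RuntimeError(f"not enough capacity for {task.name}")
        for victim in victims:
            self.running.remove(victim)
        self.running.append(task)
        return victims


node = Node(capacity=8)
node.schedule(Task("web", 4))
node.schedule(Task("batch-a", 2, best_effort=True))  # runs on idle CPUs
node.schedule(Task("batch-b", 2, best_effort=True))  # node is now full
evicted = node.schedule(Task("api", 3))              # preempts both batch tasks
print([t.name for t in evicted], node.free())
```

A production scheduler would track several resource types, requeue preempted tasks, and weigh priorities across many nodes; the sketch only shows why idle capacity can be lent out safely when the borrower can be evicted.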
Cai and Bansal wrote that support for stateless services is planned for the near future.
“By allowing others in the cluster management community to leverage unified schedulers and workload co-location, Peloton will open the door for more efficient resource utilization and management across the community,” they wrote of the team’s decision to open-source the utility. “Moreover, open-sourcing Peloton will enable greater industry collaboration and open up the software to feedback and contributions from industry engineers, independent developers, and academics across the world.”