A machine learning life cycle management system ranks and tracks your experiments over time, and sometimes integrates with deployment and monitoring
For most professional software developers, using application life cycle management (ALM) is second nature. Data scientists, many of whom lack a software development background, often have not used life cycle management for their machine learning models. That problem is now much easier to fix than it was a few years ago, thanks to the advent of "MLOps" environments and frameworks that support machine learning life cycle management.
What is machine learning life cycle management?
The easy answer to this question would be to manage the machine learning life cycle just as you would with ALM, but that answer would be wrong. That's because the life cycle of a machine learning model differs from the software development life cycle (SDLC) in a number of ways.
To begin with, software developers more or less know what they are trying to build before they write the code. There may be a fixed overall specification (waterfall model) or not (agile development), but at any given moment a software developer is trying to build, test, and debug a feature that can be described. Software developers can also write tests to make sure that the feature works as designed.
In contrast, a data scientist builds models by performing experiments in which an optimization algorithm tries to find the best set of weights to fit a data set. There are many kinds of models, and currently the only way to determine which one is best is to try them all. There are criteria for the "goodness" of a model, but there is no real equivalent of software tests.
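A minimal sketch of the idea that the only way to know which model is best is to try them all (all names here are hypothetical, and the "models" are deliberately trivial): fit several candidate models to the same data and keep whichever scores best on a goodness criterion such as mean squared error.

```python
# Hypothetical illustration: try every candidate model and keep the best one.

def mse(y_true, y_pred):
    """Mean squared error -- one possible 'goodness' criterion for a model."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_constant(xs, ys):
    # Candidate 1: always predict the mean of y, ignoring x.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    # Candidate 2: least-squares line y = a*x + b.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return lambda x: a * x + b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.1, 5.9]  # roughly y = 2x

candidates = {"constant": fit_constant, "linear": fit_linear}
scores = {name: mse(ys, [fit(xs, ys)(x) for x in xs])
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # prints: linear
```

A real experiment loop works the same way, only with far more candidates (model families, hyperparameters, feature sets), which is exactly why tracking them becomes a problem.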
Unfortunately, some of the best models (e.g. deep neural networks) take a long time to train, which is why accelerators such as GPUs, TPUs, and FPGAs have become important in data science. In addition, considerable effort usually goes into cleaning the data and engineering the best feature set from the raw observations, to make the model perform as well as possible.
Tracking hundreds of experiments and dozens of feature sets isn't easy, even when you're working with a fixed data set. In real life it's even worse: data often changes over time, so the model needs to be retrained periodically.
There are several different models of the machine learning life cycle. Typically, they start with ideation, continue with data acquisition and exploratory data analysis, move from there to R&D (those hundreds of experiments) and validation, and end with deployment and monitoring. Monitoring can periodically send you back to step one to try different models and features or to update your training data set. In fact, any step in the life cycle can send you back to an earlier one.
A machine learning life cycle management system tries to rank and track all of your experiments over time. In the most useful implementations, the management system also integrates with deployment and monitoring.
Machine learning life cycle management products
We’ve identified several cloud platforms and frameworks for managing the life cycle of machine learning. These currently include Algorithmia, Amazon SageMaker, Azure Machine Learning, Domino Data Lab, Google Cloud AI Platform, HPE Ezmeral ML Ops, Metaflow, MLflow, Paperspace, and Seldon.
Algorithmia can connect, deploy, manage, and scale your machine learning portfolio. Depending on the plan you choose, Algorithmia can run in its own cloud, on your premises, on VMware, or in a public cloud. It can maintain models in its own Git repository or on GitHub. It handles model versioning automatically, can deploy pipelines, and can run and scale models on demand (serverless) using CPUs and GPUs. Algorithmia offers a searchable model library (see screenshot below) in addition to hosting your own models. It does not currently provide much support for model training.
Amazon SageMaker is Amazon's fully managed integrated environment for machine learning and deep learning. It includes a Studio environment that combines Jupyter notebooks with experiment management and tracking (see screenshot below), model debugging, "autopilot" for users without machine learning knowledge, batch transforms, model monitoring, and deployment with elastic inference.
Azure Machine Learning
Azure Machine Learning is a cloud-based environment that you can use to train, deploy, automate, manage, and track machine learning models. It can be used for any kind of machine learning, from classical machine learning to deep learning, and for both supervised and unsupervised learning.
Azure Machine Learning supports writing Python or R code, and also provides a drag-and-drop visual designer and an AutoML option. You can build, train, and track highly accurate machine learning and deep learning models in an Azure Machine Learning workspace, whether you train on your local machine or in the Azure cloud.
Azure Machine Learning interoperates with popular open source tools, such as PyTorch, TensorFlow, Scikit-learn, Git, and the MLflow platform, to manage the machine learning life cycle. It also has its own MLOps environment, shown in the screenshot below.
Domino Data Lab
The Domino Data Science Platform automates devops for data science, so you can spend more time doing research and test more ideas faster. Automatic tracking of work enables reproducibility, reusability, and collaboration. Domino lets you use your favorite tools on the infrastructure of your choice (by default, AWS), track experiments, reproduce and compare results (see screenshot below), and find, discuss, and reuse work in one place.
Google Cloud AI Platform
Google Cloud's AI Platform includes a number of functions that support machine learning life cycle management: an overall dashboard, the AI Hub (see screenshot below), data labeling, notebooks, jobs, pipelines (currently in beta), and models. Once you have a model you're satisfied with, you can deploy it to make predictions.
The notebooks are integrated with Google Colab, where you can run them for free. The AI Hub includes a number of public resources, including Kubeflow pipelines, notebooks, services, TensorFlow modules, VM images, trained models, and technical guides. Public data resources are available for image, text, audio, video, and other types of data.
HPE Ezmeral ML Ops
HPE Ezmeral ML Ops provides enterprise-scale machine learning using containers. It supports the machine learning life cycle from sandbox experimentation with machine learning and deep learning frameworks, through model training on distributed clusters of containers, to deploying and tracking models in production. You can run HPE Ezmeral ML Ops software on premises on any infrastructure, in multiple public clouds (including AWS, Azure, and GCP), or in a hybrid model.
Metaflow is a code-based, Python-friendly workflow system dedicated to managing machine learning life cycles. It dispenses with the graphical user interfaces you see in most of the other products listed here in favor of decorators such as @step. Metaflow helps you design your workflow as a directed acyclic graph (DAG), run it at scale, and deploy it to production. It versions and tracks all of your experiments and data automatically. Metaflow was recently open-sourced by Netflix and AWS. It can integrate with Amazon SageMaker, Python-based machine learning and deep learning libraries, and big data systems.
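To illustrate the decorator-driven style, here is a deliberately simplified, hypothetical sketch in plain Python. It is not the real Metaflow API (a real flow subclasses FlowSpec and chains steps with self.next()); it only shows the idea of marking methods as steps and running them as a linear DAG.

```python
# Hypothetical stand-in for a decorator-based workflow (NOT the real
# Metaflow API): @step tags each stage, and a runner executes the DAG.

def step(fn):
    """Mark a method as a workflow step (illustrative stand-in for @step)."""
    fn.is_step = True
    return fn

class TrainingFlow:
    @step
    def start(self):
        self.data = [1.0, 2.0, 3.0, 4.0]

    @step
    def train(self):
        # Toy "training": the model is just the mean of the data.
        self.model = sum(self.data) / len(self.data)

    @step
    def end(self):
        self.report = f"model parameter = {self.model}"

    def run(self):
        # Fixed linear DAG: start -> train -> end.
        for name in ("start", "train", "end"):
            getattr(self, name)()
        return self

flow = TrainingFlow().run()
print(flow.report)  # prints: model parameter = 2.5
```

In real Metaflow, each step's state (like self.data and self.model here) is automatically snapshotted and versioned, which is what makes the experiments reproducible.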
MLflow is an open source machine learning life cycle management platform from Databricks; at this writing it is still in alpha. There is also a hosted MLflow service. MLflow has three components: tracking, projects, and models.
MLflow Tracking lets you record (via API calls) and query experiments: code, data, configuration, and results. It has a web interface (shown in the screenshot below) for queries.
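The kind of record a tracking component keeps can be sketched in plain Python. This is an illustrative toy only, not the real MLflow API (every class and method name here is hypothetical): each run stores its parameters and metrics under an ID, so runs can be queried and compared later.

```python
# Hypothetical sketch of what an experiment tracker records per run
# (NOT the real MLflow API): parameters, metrics, and a run identifier.

import uuid

class ExperimentTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        """Record one experiment run and return its ID."""
        run = {"run_id": uuid.uuid4().hex,
               "params": params,
               "metrics": metrics}
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric, higher_is_better=True):
        """Query the logged runs for the best value of a metric."""
        key = lambda r: r["metrics"][metric]
        return max(self.runs, key=key) if higher_is_better else min(self.runs, key=key)

tracker = ExperimentTracker()
tracker.log_run({"model": "logreg", "C": 1.0}, {"accuracy": 0.91})
tracker.log_run({"model": "random_forest", "n_estimators": 200}, {"accuracy": 0.94})
best = tracker.best_run("accuracy")
print(best["params"]["model"])  # prints: random_forest
```

MLflow's real tracking API works along these lines but also captures the code version, artifacts, and environment for each run.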
MLflow Projects provide a format for packaging data science code in a reusable and reproducible way, based primarily on conventions. In addition, the Projects component includes an API and command-line tools for running projects, so you can chain projects together into workflows.
MLflow Models use a standard format for packaging machine learning models that can be used in a variety of downstream tools, for example real-time serving through a REST API or batch inference on Apache Spark. The format defines a convention that lets you save a model in different "flavors" that different downstream tools can understand.