Model Development Tools
Machine-learning development presents several challenges, given the complexity of the ecosystem and the variety of frameworks that can be leveraged. Managing that complexity and its dependencies is difficult if you want to move rapidly from the lab to production. In this article, I will make a case for using a few tools in an integrated, Docker-based development environment to address the following concerns:
- Experimentation: In the early stages of model development, we need an environment in which to prototype quickly and formulate our hypotheses. This stage generally involves large amounts of data wrangling and analysis.
- Model Development and Tracking: Once we have a concept and a general approach, we need an environment in which to experiment and track our runs.
- Model Deployment: Once we have a champion model, we need a way to publish it, which means model registration and tracking.
- Collaboration: AI is a team sport, and we want to share our thinking and results with peers. As good scientific practice, peer review is a must, and the ability to share your work is fundamental to the success of a project.
Several tools on the market can address these concerns; however, we want to leverage open-source tools and integrate the process through container-based development. Yan puts it very well: a container is “…like containers on a ship where the goal is to isolate the contents of one container from the others so they don’t get mixed up” (2020). Docker provides OS-level virtualization services that allow us to package our software components as containers. In our case, each layer of concern identified above will be a container, glued together with Docker tooling to provide a unified data-science development experience.
At the beginning of each machine-learning project, we need to define the analytical problem to be solved. Once the problem is defined, we want to explore the data and form hypotheses about how to solve it. For this layer, we leverage Spark and Jupyter, which allow us to conduct extensive exploratory analysis. We use PySpark, the Python API for Apache Spark, an open-source cluster-computing framework, which makes Spark easy to work with. Paired with Jupyter notebooks, an open-source web application for sharing documents that combine live code and visualizations, this allows for rapid experimentation.
Model Development and Tracking
For model development, any IDE can do the job; in our case, we prefer Microsoft's lightweight VS Code. VS Code is great for container-based development because it is designed to run anywhere and is open source. During model development, tracking experiment runs is important for data scientists and machine-learning engineers. Working in a team, the ability to reproduce steps is essential, and detailed tracking is critical to that. Several tools exist in the ecosystem; however, MLflow, introduced by Databricks, offers an easy-to-use, easy-to-deploy tracking solution that streamlines everything from data preparation to model training and includes model-management capabilities.
Model Deployment and Collaboration
MLflow does offer capable functionality to register models and serve them through REST endpoints. FastAPI is a fast, robust Python-based framework that helps with API development; its flexibility and production readiness can accelerate your development process. In data science, we encourage collaboration as part of the scientific process, and all team members are encouraged to document and share. In the experimentation phase, Jupyter notebooks are great for testing ideas in a collaborative environment. In the model-development phase, documentation is done with Markdown in the same source repository. Once the model is deployed, we leverage Streamlit to share it through a web interface.
All these layers need to co-exist within the same machine-learning project so that team members can focus on development. Docker provides a great way to manage them, standing up the full infrastructure so that it can run anywhere, instantly. Please contact us and we will be happy to share our example project with you. We can also provide a free consultation to jump-start your machine-learning project.
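As one way of gluing the layers together, here is an illustrative docker-compose sketch; the image names, ports, and service layout are assumptions, not a prescribed setup:

```yaml
# Illustrative docker-compose.yml: one container per layer of concern
services:
  notebook:      # Experimentation: Jupyter + PySpark
    image: jupyter/pyspark-notebook
    ports: ["8888:8888"]
  tracking:      # Model development: MLflow tracking server
    image: ghcr.io/mlflow/mlflow
    command: mlflow server --host 0.0.0.0
    ports: ["5000:5000"]
  api:           # Deployment: FastAPI model service
    build: ./api
    ports: ["8000:8000"]
  ui:            # Collaboration: Streamlit front end
    build: ./ui
    ports: ["8501:8501"]
```

With a layout like this, a single `docker compose up` brings the whole environment online on any machine with Docker installed.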