IBM has open-sourced CodeFlare, a framework that simplifies the integration and efficient scaling of big data and AI workflows on the hybrid cloud. CodeFlare is built on top of Ray, an emerging open-source distributed computing framework for machine learning applications, and extends Ray's capabilities with specific elements that make scaling workflows easier.
To create a machine learning model today, researchers and developers have to train and optimize the model first.
This might involve data cleaning, feature extraction, and model optimization. CodeFlare simplifies this process with a Python-based interface for what is called a pipeline, making it simpler to integrate, parallelize, and share data.
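To make the idea of a pipeline concrete, here is a minimal sketch of the three stages named above, written with only the Python standard library. It does not use CodeFlare's or Ray's API; the function names (`clean`, `extract_features`, `fit_and_score`, `run_pipeline`) and the toy "model" are hypothetical and chosen purely for illustration. The point it shows is that independent steps, such as scoring several candidate settings, can run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def clean(records):
    """Data cleaning: drop missing values."""
    return [r for r in records if r is not None]


def extract_features(records):
    """Feature extraction: pair each value with its square."""
    return [(x, x * x) for x in records]


def fit_and_score(features, scale):
    """Stand-in for model optimization (an assumption, not a real model):
    score one candidate `scale` by the mean absolute error of
    predicting x*x as scale*x."""
    error = mean(abs(sq - scale * x) for x, sq in features)
    return scale, error


def run_pipeline(records, candidate_scales):
    """Run cleaning and feature extraction once, then score candidates in parallel."""
    features = extract_features(clean(records))
    # Each candidate is independent of the others, so they can be scored concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: fit_and_score(features, s), candidate_scales))
    # Keep the candidate with the lowest error.
    return min(results, key=lambda r: r[1])


best_scale, best_error = run_pipeline([1, 2, None, 3, 4], [1.0, 2.0, 3.0, 4.0])
print(best_scale, best_error)
```

A framework such as CodeFlare targets the same pattern at a much larger scale, distributing the stages across a cluster rather than across threads on one machine.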
The goal of the new framework is to unify pipeline workflows across multiple platforms without requiring data scientists to learn a new workflow language.
CodeFlare pipelines run with ease on IBM's new serverless platform, IBM Cloud Code Engine, and on Red Hat OpenShift. Users can deploy them just about anywhere, extending the benefits of serverless to data scientists and AI researchers.
CodeFlare also makes it easier to integrate and bridge with other cloud-native ecosystems by providing adapters for event triggers (such as the arrival of a new file) and by loading and partitioning data from a wide range of sources, such as cloud object storage, data lakes, and distributed filesystems.
CodeFlare should also mean developers no longer have to duplicate their efforts or struggle to figure out what colleagues did in the past to get a certain pipeline to run. With CodeFlare, IBM aims to give data scientists richer tools and APIs that they can use with more consistency, allowing them to focus more on their actual research than on configuration and deployment complexity.
The framework saves developers significant time and effort in creating pipelines deployed to hybrid cloud.
For example, when one user applied the framework to analyze and optimize approximately 100,000 pipelines for training machine learning models, CodeFlare cut the time it took to execute each pipeline from 4 hours to 15 minutes.
With other users, CodeFlare has been seen to shave months off developer time and to let them tackle larger data problems than before.