Apache SystemDS
Apache SystemDS is an open-source ML system for the end-to-end data science lifecycle, from data integration, cleaning, and feature engineering to model training, debugging, and deployment. It provides a declarative machine learning language (DML), automatic optimization across execution backends (local and distributed on Spark), and a Python API (SystemDS Python). SystemDS is a top-level project of the Apache Software Foundation, designed for scalable ML workflows.
APIs
Apache SystemDS Python API
The SystemDS Python API (systemds) provides a Python interface for building end-to-end ML pipelines. It includes Matrix and Frame types for distributed data manipulation, built-...
Features
High-level R-like language for specifying ML algorithms with automatic optimization.
Query optimization, memory management, and execution plan selection for ML workloads.
Privacy-preserving federated ML across distributed data silos without data sharing.
50+ built-in ML algorithms including linear models, neural networks, clustering, and ensemble methods.
Pythonic API for ML pipeline development with lazy evaluation and distributed execution.
Automated data cleaning, imputation, encoding, and normalization pipelines.
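The lazy evaluation mentioned in the features above can be illustrated with a small toy in plain Python. This is only a sketch of the general idea and not the SystemDS implementation or its API: the `Node` class, the `matrix` helper, and the `compute` method are hypothetical names chosen for this example. Arithmetic operators build an expression graph, and nothing is calculated until `compute()` is called, which is the point at which a real system could optimize the whole plan before running it.

```python
class Node:
    """A node in a toy expression graph (hypothetical, not a SystemDS class)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add", or "mul"
        self.inputs = inputs  # child nodes this operation depends on
        self.value = value    # only set for "const" nodes

    def __add__(self, other):
        # Record the operation instead of performing it.
        return Node("add", (self, other))

    def __mul__(self, other):
        # Element-wise multiplication, also only recorded.
        return Node("mul", (self, other))

    def compute(self):
        """Walk the graph and evaluate it; this is the only place work happens."""
        if self.op == "const":
            return self.value
        left, right = (node.compute() for node in self.inputs)
        if self.op == "add":
            return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(left, right)]
        if self.op == "mul":
            return [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(left, right)]
        raise ValueError(f"unknown operation: {self.op}")


def matrix(rows):
    """Wrap a list of rows as a constant leaf node."""
    return Node("const", value=[list(row) for row in rows])


x = matrix([[1, 2], [3, 4]])
y = matrix([[10, 20], [30, 40]])

expr = (x + y) * x       # builds a two-node graph, no arithmetic yet
result = expr.compute()  # evaluation is triggered explicitly here
```

Deferring execution this way lets the whole expression be inspected before any data is touched, which is what makes plan-level optimization possible in systems that follow this pattern.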
Use Cases
Train large-scale ML models distributed across Apache Spark clusters.
Cross-silo federated learning for privacy-sensitive healthcare and finance data.
Integrated data preparation, feature engineering, training, and serving pipelines.
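The cross-silo federated learning use case above can be sketched as a weighted averaging loop in plain Python. This is a toy under stated assumptions and does not use SystemDS or reflect how it implements federated execution: the model is a single weight `w` fitted to `y ≈ w * x`, the three sites and their data are made up for illustration, and `local_update` and `federated_round` are hypothetical helper names. Each site computes an update on its own data, and only the updated weight and the sample count leave the site.

```python
def local_update(w, data, lr):
    """One gradient step of least squares for y ~ w * x on one site's private data."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def federated_round(global_w, sites, lr):
    """Collect one local update per site and average them, weighted by sample count."""
    updates = [(local_update(global_w, data, lr), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Made-up data for three silos; every point lies exactly on y = 2 * x.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(50):
    # Raw (x, y) pairs never leave their site; only weights and counts are shared.
    w = federated_round(w, sites, lr=0.05)

print(f"learned weight: {w:.6f}")
```

Weighting each site's update by its number of samples keeps a small silo from dominating the shared model, which is the usual motivation for this style of aggregation.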
Integrations
Native Spark backend for distributed matrix operations and ML training.
Python API with NumPy-compatible Matrix type for local and distributed computation.
Kubernetes deployment support for SystemDS runtime via Helm charts.