A Panorama of ML System Design: Inference, Training, Data, and Deployment

I’ve recently been working through Stanford’s CS 329S: Machine Learning Systems Design, a course dedicated to the overall design of ML systems, covering a fairly broad range of topics. Here are the parts of the course I found worth noting.

Several noteworthy keywords come up in the course: DevOps, CI/CD, A/B testing, Flink (stream processing), microservices and REST APIs, Kubernetes, and tinyML (a new book based on TensorFlow Lite). Overall, the course feels quite comprehensive when it comes to production deployment.

Inference, Computation, and Learning Paradigms

The first few slides systematically lay out the different modes of inference, computation, and learning.

On inference modes, the choice isn’t black-and-white; multiple approaches can be mixed:

On computation modes:

On learning paradigms:

One point here is worth calling out on its own: when there are multiple optimization objectives, the course recommends splitting them across multiple models, each focused on a single metric, which makes both training and tuning easier.
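As an illustration, suppose a feed-ranking system cares about both content quality and user engagement. The sketch below assumes two separately trained models whose scores are already available; the `Post` class, the scores, and the weights `alpha`/`beta` are all hypothetical. Each model serves a single objective, and the trade-off between them lives only in the ranking step, so shifting the balance means changing two numbers rather than retraining a combined model.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str
    quality_score: float     # output of a hypothetical quality model
    engagement_score: float  # output of a hypothetical engagement model


def rank(posts, alpha=0.5, beta=0.5):
    """Rank posts by a weighted sum of two single-objective model scores."""
    def combined(p):
        return alpha * p.quality_score + beta * p.engagement_score
    return sorted(posts, key=combined, reverse=True)


posts = [
    Post("a", quality_score=0.9, engagement_score=0.2),
    Post("b", quality_score=0.4, engagement_score=0.8),
]
# Favour engagement: only the weights change, neither model is retrained.
print([p.post_id for p in rank(posts, alpha=0.2, beta=0.8)])  # → ['b', 'a']
```

Swapping the weights to `alpha=0.8, beta=0.2` puts post `a` first instead.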

Data Storage and Feature Management

This part covers two main approaches to data storage.

Row-based storage, laid out like a default (row-major) NumPy array, suits workloads with frequent INSERTs, since each new record is written in one place:

Column-based storage, laid out like a pandas DataFrame (column-major), suits workloads with frequent SELECTs that read a few columns across many rows:
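A toy sketch in plain Python of why the two layouts favour different operations. It is not how NumPy or pandas are implemented; Python lists hold references, so it only illustrates the access pattern, and the field names and functions are hypothetical. Inserting a record is one append in the row layout but touches every column in the column layout, while reading one field is the reverse.

```python
from typing import Any

Record = dict[str, Any]

# Row-oriented: each record's fields are kept together.
row_store: list[Record] = []

# Column-oriented: each field's values are kept together.
column_store: dict[str, list[Any]] = {"user_id": [], "age": []}


def insert_row(store: list[Record], record: Record) -> None:
    # One append writes the whole record in a single place.
    store.append(record)


def insert_columnar(store: dict[str, list[Any]], record: Record) -> None:
    # A single record has to be scattered across every column list.
    for field, values in store.items():
        values.append(record[field])


def select_rows(store: list[Record], field: str) -> list[Any]:
    # Reading one field still visits every whole record.
    return [record[field] for record in store]


def select_columnar(store: dict[str, list[Any]], field: str) -> list[Any]:
    # The field is already one list; no other columns are touched.
    return store[field]


for rec in ({"user_id": 1, "age": 25}, {"user_id": 2, "age": 31}):
    insert_row(row_store, rec)
    insert_columnar(column_store, rec)

print(select_rows(row_store, "age"), select_columnar(column_store, "age"))  # → [25, 31] [25, 31]
```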

The two formats can be converted into each other as part of an ELT pipeline:
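ELT (extract, load, then transform) is a pipeline pattern rather than a format converter; the change of layout itself happens in the transform step. A minimal sketch of that pivot, assuming records are plain dicts that all share the same fields (the function names are hypothetical):

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into a column-oriented mapping."""
    columns: dict[str, list[Any]] = {}
    for i, row in enumerate(rows):
        # Every row after the first must have exactly the same fields.
        if i and row.keys() != columns.keys():
            raise ValueError(f"row {i} has a different set of fields")
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def columns_to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Pivot a column-oriented mapping back into row-oriented records."""
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("columns have different lengths")
    fields = list(columns)
    # zip(*columns.values()) walks the columns in lockstep, one row at a time.
    return [dict(zip(fields, values)) for values in zip(*columns.values())]
```

A round trip through both functions returns the original records, which makes a convenient sanity check for a transform step.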

The course also mentions combining static data with dynamic data for inference:
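A minimal sketch of that combination, assuming a per-user feature table precomputed offline (static) and a click count over a sliding time window (dynamic). All names and values are hypothetical. In production the dynamic part would typically come from a stream processor such as Flink rather than an in-process deque.

```python
from collections import deque
from dataclasses import dataclass, field

# Static features: precomputed offline (e.g. by a batch job) and looked up by key.
STATIC_FEATURES: dict[str, dict] = {
    "user_42": {"account_age_days": 730, "country": "US"},
}


@dataclass
class SlidingWindowCount:
    """Dynamic feature: number of events within the last `window_s` seconds."""
    window_s: float = 600.0
    events: deque = field(default_factory=deque)

    def add(self, ts: float) -> None:
        # Assumes events arrive in timestamp order.
        self.events.append(ts)

    def count(self, now: float) -> int:
        # Evict events older than the window, then count what remains.
        while self.events and self.events[0] < now - self.window_s:
            self.events.popleft()
        return len(self.events)


def build_features(user_id: str, clicks: SlidingWindowCount, now: float) -> dict:
    """Merge static (batch) and dynamic (streaming) features for one request."""
    features = dict(STATIC_FEATURES.get(user_id, {}))
    features["recent_clicks"] = clicks.count(now)
    return features
```

For example, after clicks at t = 0, 100, 550 and 700 seconds, a request at t = 700 sees the static fields plus `recent_clicks = 3`, because the click at t = 0 has left the 600-second window.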

The remainder of the third slide deck covers some transfer-learning material, which I’m skipping for now.

Sampling and Handling Class Imbalance

The fourth slide deck focuses mainly on sampling and class imbalance; the speaker appears to come from a traditional ML theory background. There are three main ways to address class imbalance: resampling, weight balancing, and ensembles.

Resampling comes in two flavors: downsampling and oversampling.

Downsampling can use the Tomek Links method (reference: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets):
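A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. Undersampling removes the majority-class member of each pair, which cleans up the region around the class boundary. Below is a brute-force sketch in plain Python, assuming small in-memory data and squared Euclidean distance. The function names are hypothetical; the imbalanced-learn library provides a production version as `TomekLinks`.

```python
def nearest_neighbor(i, X):
    """Index of the closest other point to X[i] (squared Euclidean, brute force)."""
    best, best_d = -1, float("inf")
    for j, x in enumerate(X):
        if j == i:
            continue
        d = sum((a - b) ** 2 for a, b in zip(X[i], x))
        if d < best_d:
            best, best_d = j, d
    return best


def tomek_links(X, y):
    """Pairs (i, j) of opposite-class points that are mutual nearest neighbours."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i and y[i] != y[j]]


def undersample_tomek(X, y, majority):
    """Drop the majority-class member of every Tomek link."""
    drop = {i if y[i] == majority else j for i, j in tomek_links(X, y)}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]
```

Only the majority-side point of each link is removed, so the minority class keeps every sample it had.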

Oversampling can use SMOTE:
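SMOTE creates synthetic minority samples by interpolating between a minority sample x and one of its k nearest minority-class neighbours x_nn: x_new = x + λ(x_nn − x), with λ drawn uniformly from [0, 1). A minimal sketch, assuming numeric feature tuples and brute-force neighbour search. The function names and the fixed `seed` are illustrative; imbalanced-learn ships a full implementation as `SMOTE`.

```python
import random


def k_nearest(i, X, k):
    """Indices of the k points closest to X[i], excluding X[i] itself."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
        for j in range(len(X))
        if j != i
    )
    return [j for _, j in dists[:k]]


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority-class samples only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(i, minority, k))
        lam = rng.random()  # position along the segment between the two points
        x, nb = minority[i], minority[j]
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because every new point lies on a segment between two real minority samples, the synthetic data stays inside the region the minority class already occupies, unlike plain duplication, which adds no new points at all.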

For weight balancing, the classic approach is Focal Loss:
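For reference, focal loss for a binary classifier is FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The (1 − p_t)^γ factor down-weights easy examples: with γ = 2, an example at p_t = 0.9 keeps only 1% of its cross-entropy loss, while one at p_t = 0.1 keeps 81%. A minimal per-example sketch in plain Python follows. The defaults γ = 2 and α = 0.25 follow the original RetinaNet paper. Letting `alpha=None` switch off class weighting is my own convenience, not part of the standard formulation.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class; y: true label (0 or 1).
    alpha=None disables class weighting; gamma=0 with alpha=None is plain cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    if alpha is None:
        alpha_t = 1.0
    else:
        alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of easy, well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def mean_focal_loss(probs, labels, **kwargs):
    """Average focal loss over a batch."""
    return sum(focal_loss(p, y, **kwargs) for p, y in zip(probs, labels)) / len(labels)
```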

Ensemble methods train multiple classifiers and then combine their predictions:
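The notes don't pin down a specific scheme here. One common variant for imbalanced data, in the spirit of EasyEnsemble, trains each member on all minority samples plus a different, equally sized random subset of the majority class, then takes a majority vote. The sketch below implements that variant. It uses a deliberately tiny nearest-centroid classifier as a stand-in for a real model, and all function names are hypothetical.

```python
import random
from collections import Counter


def train_centroid(X, y):
    """Tiny stand-in classifier: one mean vector per class."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, x):
    # Predict the class whose centroid is closest in squared Euclidean distance.
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])))


def train_balanced_ensemble(X, y, minority, n_models=5, seed=0):
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    models = []
    for _ in range(n_models):
        # Each member sees every minority sample plus an equally sized
        # random subset of the majority class.
        picked = rng.sample(maj_idx, min(len(maj_idx), len(min_idx))) + min_idx
        models.append(train_centroid([X[i] for i in picked], [y[i] for i in picked]))
    return models


def predict_ensemble(models, x):
    # Majority vote over the members' predictions.
    return Counter(predict_centroid(m, x) for m in models).most_common(1)[0][0]
```

Every member trains on balanced data, yet across the ensemble most of the majority class still gets used, which is the main advantage over a single round of random downsampling.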

On data augmentation, there’s a 2019 survey worth a look: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0

The latter part of deck 4 mainly covers feature engineering, data leakage, and model selection, still in a fairly traditional ML style. Deck 5 is all PyTorch, with nothing particularly remarkable.

Parallel Training and System Testing

Deck 6 covers parallelism, including data parallelism and model parallelism, but not in much detail. In addition, deck 6 and the first half of deck 7 both spend a fair amount of space on testing ML systems, especially data testing:
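To make "data testing" concrete, here is a minimal sketch of the kind of checks such tests usually run on an incoming batch: schema and type checks, a value-range check, and a per-field null-rate threshold. The schema, the age range, and the 1% threshold are hypothetical. Dedicated tools exist for this, but the sketch sticks to plain Python.

```python
from typing import Any

# Hypothetical expectations for one feature table.
SCHEMA: dict[str, type] = {"user_id": int, "age": int, "country": str}
AGE_RANGE = (0, 120)
MAX_NULL_RATE = 0.01  # at most 1% missing values per field


def check_batch(rows: list[dict[str, Any]]) -> list[str]:
    """Run simple data tests on a batch; an empty list means every check passed."""
    failures: list[str] = []
    for i, row in enumerate(rows):
        missing = SCHEMA.keys() - row.keys()
        if missing:
            failures.append(f"row {i}: missing fields {sorted(missing)}")
        for field, expected in SCHEMA.items():
            value = row.get(field)
            if value is not None and not isinstance(value, expected):
                failures.append(
                    f"row {i}: {field} is {type(value).__name__}, expected {expected.__name__}"
                )
        age = row.get("age")
        if isinstance(age, int) and not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            failures.append(f"row {i}: age {age} outside {AGE_RANGE}")
    for field in SCHEMA:
        # Missing keys and explicit None both count as nulls.
        nulls = sum(row.get(field) is None for row in rows)
        if rows and nulls / len(rows) > MAX_NULL_RATE:
            failures.append(f"{field}: null rate {nulls / len(rows):.0%} exceeds {MAX_NULL_RATE:.0%}")
    return failures
```

Returning a list of failures rather than raising on the first one makes it easy to log everything wrong with a batch before deciding whether to block it from training.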

Experiment-Management Tools

The second half of deck 7 highlights two tools.

The first is Weights & Biases (wandb), which offers rich visualization features, a bit like NNI.

The second is DVC, which feels a bit like git, since it manages training data the way git manages code:

Beyond data, DVC can also manage experiments (again somewhat like NNI), making it easy to compare results across runs, and it also supports CI/CD pipelines.
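The git analogy for data rests on content addressing. Large files live outside git in a cache keyed by a hash of their content, and git only tracks a small pointer file containing that hash; DVC's `.dvc` files record an MD5 hash in this way. The sketch below is a toy in-memory model of the idea, not DVC's implementation or API, and the class and method names are hypothetical.

```python
import hashlib
import json


class ToyDataStore:
    """Content-addressed cache: each unique piece of content is stored once."""

    def __init__(self) -> None:
        self.cache: dict[str, bytes] = {}

    def add(self, path: str, data: bytes) -> str:
        # Hash the content; identical data always maps to the same key.
        digest = hashlib.md5(data).hexdigest()
        self.cache.setdefault(digest, data)
        # Return the small pointer you would commit to git instead of the data.
        return json.dumps({"path": path, "md5": digest})

    def checkout(self, pointer: str) -> bytes:
        # Resolve a committed pointer back to the exact bytes it refers to.
        return self.cache[json.loads(pointer)["md5"]]
```

Committing each returned pointer alongside the code ties every code version to the exact data it was trained on. Identical files are stored only once, however many commits reference them.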

Model Deployment

Deck 8 focuses on deployment. The earlier portion touches on model compression and TensorRT, while the later portion introduces two deployment approaches.
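As one concrete example of model compression, here is a minimal sketch of symmetric post-training int8 quantization of a weight vector. I chose quantization as a common illustrative technique; the notes don't say which methods the deck covers. The scheme uses scale = max|w| / 127, q = round(w / scale) clipped to [−127, 127], and w ≈ q · scale. Storing int8 codes instead of float32 weights cuts memory roughly 4×, at the cost of a rounding error of at most scale / 2 per weight. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization; returns (int codes, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [q * scale for q in codes]
```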

Approach one: deploy directly on a cloud platform, for example using GCP (similar to Alibaba Cloud):

Approach two: deploy via Docker containers, which can likewise run on GCP; there are also VM-based and Kubernetes-based options. In practice, the difference between using Docker containers and using VMs is fairly small at the moment:
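Whether the target is a container on GCP, a VM, or a Kubernetes pod, the thing being deployed is typically a small prediction service. The sketch below builds one with only the standard library. It assumes a hypothetical linear model and a `/predict` JSON endpoint; the route, port, and weights are illustrative, not taken from the course. Binding to `0.0.0.0` matters inside a container, so the port is reachable from outside it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: dict) -> dict:
    # Hypothetical stand-in model: a fixed linear score over named features.
    weights = {"x1": 0.5, "x2": -0.25}
    score = sum(weights.get(k, 0.0) * float(v) for k, v in features.items())
    return {"score": score}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read exactly Content-Length bytes of JSON from the request body.
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        out = json.dumps(predict(body)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)


def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    """Blocking entry point, e.g. the command a container image would run."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server. A client then POSTs JSON such as `{"x1": 2, "x2": 4}` to `/predict` and receives `{"score": 0.0}`. Keeping `predict` separate from the HTTP handler lets the model logic be tested without starting a server.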

The slides after tinyML are mostly fairly vague ML industry practices, which didn’t feel worth going through; I’ll come back to them another time.