AI/Machine learning in fluid mechanics, engineering, physics

How to Formulate a Rigorous ML Benchmark (Part 1)

Set up the dataset/benchmark so that if your ML model passes, you can be confident in its generality (and some tools to evaluate generality quantitatively)

Justin Hodges, PhD
Aug 26, 2025

Building an ML model that generalizes well (rather than just interpolating between points in the training data) requires a careful strategy for preparing your dataset. Instead of a random or uniform split, you should strategically partition the data so that the test set challenges the model on scenarios different from those it has seen during training. Otherwise, a naive random split of your dataset can produce train and test sets with very similar characteristics, which leads to overly optimistic results (the model appears accurate simply because it is operating on very familiar distributions).
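As a minimal sketch of the difference (assuming scikit-learn and a hypothetical `groups` array tagging which design each sample comes from), a group-aware split keeps all samples from a given design on one side of the train/test boundary, whereas a random split scatters them across both:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

# Hypothetical dataset: each row is one simulation sample,
# 'groups' tags which car geometry the sample came from.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))           # input features
y = rng.normal(size=1000)                # target quantity (e.g., a drag coefficient)
groups = rng.integers(0, 50, size=1000)  # 50 distinct geometries

# Naive random split: samples from the same geometry leak into both sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Group-aware split: whole geometries are held out, so the test set
# contains designs the model has never seen during training.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```

The group-aware version is what turns "accuracy on the test set" into a statement about unseen designs rather than unseen samples of already-seen designs.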

Below are several methods and tools to consider when curating your benchmark project, along with ways to quantitatively evaluate generalization.

Reminder that I am a SciML guy - so when we talk about benchmarking, crafting a dataset, feature engineering, etc., we are talking about splitting everything up by the physical phenomena of interest in the data, so we can judge how well the ML model predicts all the scientific behavior and characteristics.


For example: what happens if I start significantly changing the car geometry compared to the designs my aerodynamics ML model was trained on, or if the shocks over my airfoil start to look more and more different than in training, or how would the structural behavior change if I start altering the material properties used?

In other fields like drug discovery, researchers avoid random splits for this reason – e.g. they group molecules by structural scaffold or cluster them so that test compounds are structurally distinct from training ones [reference].
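The same idea can be approximated outside chemistry by clustering on whatever descriptors characterize your samples and holding out entire clusters. A hedged sketch (assuming generic, hypothetical shape-descriptor features rather than any particular parameterization):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical shape descriptors for each design (e.g., leading PCA modes of the geometry).
rng = np.random.default_rng(1)
descriptors = rng.normal(size=(200, 5))

# Cluster the designs into families, then reserve a few whole families for testing,
# mimicking scaffold-based splits in drug discovery.
labels = KMeans(n_clusters=10, n_init=10, random_state=1).fit_predict(descriptors)
held_out_clusters = {7, 8, 9}

test_mask = np.isin(labels, list(held_out_clusters))
train_idx = np.where(~test_mask)[0]
test_idx = np.where(test_mask)[0]
```

The choice of descriptors and number of clusters is problem-specific; the point is that the test set ends up occupying regions of design space the model never saw.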

As an example, check out this visual [source] of a rotating boat propeller that is experiencing cavitation (when gas bubbles form and collapse within the liquid, shown as the little streaks). The camera is lock-stepping almost perfectly with the propeller, so while it plays a trick on the eyes, the propeller is actually spinning quite fast. So if we were collecting data on this experiment to train an ML model, we would really need to identify which significant attributes are present to understand what our ML model is actually capturing. Things like cavitation (none, mild, or significant amounts), rotation speed, the type of liquid, and propeller shape/design, among others.
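One hedged way to make that concrete (the column names and values below are purely hypothetical) is to keep a metadata table alongside the raw measurements and split on a physically meaningful attribute, for example holding out the highest rotation speeds to probe extrapolation:

```python
import pandas as pd

# Hypothetical metadata: one row per experimental run.
runs = pd.DataFrame({
    "run_id": range(8),
    "rpm": [800, 1000, 1200, 1400, 1600, 1800, 2000, 2200],
    "cavitation": ["none", "none", "mild", "mild", "mild",
                   "significant", "significant", "significant"],
    "liquid": ["water"] * 8,
})

# Hold out the fastest runs so the test set probes extrapolation in rotation speed.
test_runs = runs[runs["rpm"] >= 1800]
train_runs = runs[runs["rpm"] < 1800]
```

The same table lets you slice results later, e.g., reporting error separately for runs with mild versus significant cavitation.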

In the context of this write-up, we want to understand how to generate, analyze, and scrutinize ML models in this domain across such variations by properly formulating a benchmark project.

This propeller test demonstration was conducted in the Emerson Cavitation Tunnel at Newcastle University.

Part 1 (this post) will overview the cookbook and explain the pieces as a cohesive recipe. Part 2, coming next, will provide some code blocks, visuals, and more hands-on guidance to help execute this strategy piece by piece. Each individual topic is dense on its own, like (for example) how to represent the data as a convex hull and plot your points on it to determine whether they fall inside or outside of it, so it will take several subsequent posts in this series to cover them all.
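As a quick preview of the convex-hull idea (a minimal sketch on made-up, low-dimensional features; in higher dimensions you would typically do this check in a reduced space such as the leading PCA components), scipy's Delaunay triangulation gives a simple inside/outside test:

```python
import numpy as np
from scipy.spatial import Delaunay

# Training points in a 2D feature space (hypothetical example data).
rng = np.random.default_rng(2)
train_points = rng.uniform(0, 1, size=(300, 2))

# Query points: one interpolative (inside the training hull), one extrapolative.
queries = np.array([[0.5, 0.5], [1.5, 1.5]])

# Delaunay.find_simplex returns -1 for points outside the convex hull.
hull = Delaunay(train_points)
inside = hull.find_simplex(queries) >= 0
print(inside)  # expected: [ True False ]
```

Test points flagged as outside the hull are the ones where the model is genuinely extrapolating, which is exactly where generalization claims need the most scrutiny.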

Let’s go through this cookbook in the context of an automotive aerodynamics example. For context on what’s at play, check out this paper.

Image source: https://docs.nvidia.com/nim/physicsnemo/domino-automotive-aero/latest/overview.html
