AI/Machine learning in fluid mechanics, engineering, physics

Making any ML Prediction Explainable: Part 3

Code, tutorials, and techniques to better understand complicated/large fields (point clouds) as a single number

Justin Hodges, PhD
Sep 12, 2025

If you send me your dataset, I will run my code to post-process it for you and send it back. Alternatively, you can do it yourself (I provide all the code to do so here).

Problem statement (tl;dr version): it’s hard to benchmark ML models when every sample in your dataset is a full field of values (e.g. a point cloud). Here I give code and suggestions so that, instead of inspecting each field by hand (time consuming), you generate a single number per case that flags the complicated three-dimensional differences you should know about.
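As a minimal sketch of the idea (not the author's actual code), one simple per-case number is the RMS deviation of a case's scalar field from a reference field, such as the dataset-mean field. The function name `field_signature` and the toy data below are my own illustration:

```python
import numpy as np

def field_signature(field: np.ndarray, reference: np.ndarray) -> float:
    """Collapse a per-point scalar field into one number: the RMS deviation
    from a reference field (e.g. the dataset-mean field), normalized by the
    reference's dynamic range so values are comparable across scalars."""
    span = reference.max() - reference.min()
    if span == 0:
        return 0.0
    return float(np.sqrt(np.mean((field - reference) ** 2)) / span)

# Toy example: two "cases", each a scalar sampled at the same 1000 points.
rng = np.random.default_rng(0)
baseline = np.sin(np.linspace(0, 2 * np.pi, 1000))
cases = {
    "nominal":   baseline + 0.01 * rng.normal(size=1000),
    "perturbed": baseline + 0.30 * rng.normal(size=1000),
}
scores = {name: field_signature(f, baseline) for name, f in cases.items()}
```

Ranking cases by such a score immediately surfaces the outliers whose fields deviate most from the rest of the dataset, without plotting anything.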

Problem statement (full): when we run ML benchmarks, we know we should train a few models based on the characteristics of the data, to see where the model learns the patterns well (and where it doesn’t, so we avoid using it that way once trained). Rather than randomly splitting the data into train and test buckets, we should try things like: “what happens if I train on designs that have a low drag coefficient, and then apply that trained model to cases with a high drag coefficient?” This gives more insight into how we can safely use (or not use) the model on future cases. However, there are at least three major, interrelated issues: 1) some cases may have unique patterns in the flow field that are important but not well represented by a single performance metric (e.g. drag coefficient), 2) assuming that all 3D patterns in the scalar field must be similar just because the final performance metric values are similar is not robust, and leads to models that look good in training but do poorly at inference time, and 3) important physical phenomena (e.g. aerodynamic features, like turbulent structures) exist at different geometric scales and can be overlooked.
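The deliberate low-drag/high-drag split described above can be sketched in a few lines. This is an illustrative helper of my own (the function name `extrapolation_split` and the drag values are assumptions, not from the post); the key contrast is with a random train/test split:

```python
import numpy as np

def extrapolation_split(metric: np.ndarray, quantile: float = 0.8):
    """Split case indices so training sees only low-metric cases and the
    hold-out set contains the high-metric tail -- a deliberate extrapolation
    test, unlike a random split. `metric` holds one number per case,
    e.g. a drag coefficient."""
    threshold = np.quantile(metric, quantile)
    train_idx = np.where(metric <= threshold)[0]
    test_idx = np.where(metric > threshold)[0]
    return train_idx, test_idx

# Hypothetical drag coefficients for 8 designs in a dataset.
drag = np.array([0.21, 0.24, 0.26, 0.29, 0.33, 0.41, 0.55, 0.62])
train_idx, test_idx = extrapolation_split(drag, quantile=0.75)
```

If the model trained on `train_idx` degrades badly on `test_idx`, you have learned a concrete boundary for where the model can and cannot be trusted, which a random split would hide.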

To address this, I propose a framework for analyzing simulation field data, agnostic to the scalar or the case, that distills the important coherent spatial structures for each case/sample/design in your dataset into a small package of a handful of numbers. The user then doesn’t have to analyze scalar fields directly, yet can still know for certain which characteristics are present in their dataset during training and hold-out testing. This lowers the barrier to entry for non-experts, since it’s unrealistic to ask every engineer training an ML model to analyze every scalar field in their dataset and hand-label the patterns present in every case. It also leads to much more robust ML model training exercises.
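The post does not say which method it uses to build that small package of numbers, but one standard way to realize the idea is Proper Orthogonal Decomposition (POD, equivalent to PCA on mean-centered snapshots): project each case's field onto the few dominant spatial modes of the dataset and keep the resulting coefficients as the case's descriptor. A sketch, with a synthetic dataset built from two known structures:

```python
import numpy as np

def pod_descriptor(snapshots: np.ndarray, k: int = 3) -> np.ndarray:
    """Distill each case's field into k numbers via POD. `snapshots` is
    (n_cases, n_points), all cases sampled at the same points; returns an
    (n_cases, k) array of modal coefficients -- each row is the compact
    'package of numbers' describing that case's dominant spatial structures."""
    centered = snapshots - snapshots.mean(axis=0)
    # Rows of vt are the spatial POD modes; project each case onto the top k.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Toy dataset: 6 cases, 500-point fields mixing two coherent structures.
x = np.linspace(0, 2 * np.pi, 500)
modes = np.stack([np.sin(x), np.cos(2 * x)])
weights = np.array([[1.0, 0.0], [0.8, 0.2], [0.5, 0.5],
                    [0.2, 0.8], [0.0, 1.0], [0.9, 0.1]])
fields = weights @ modes          # (6 cases, 500 points)
coeffs = pod_descriptor(fields, k=2)
```

Comparing the rows of `coeffs` (e.g. clustering them, or checking whether the hold-out cases fall outside the convex hull of the training cases) answers "are the spatial structures in my test set represented in my training set?" without anyone eyeballing a single 3D field.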


This post is for paid subscribers
