This is ‘part 2’ in this ‘zero to hero’ series — you can find ‘part 1’ here.
While there are plenty of projects and tutorials out there for mainstream machine learning, there are of course orders of magnitude fewer for our niche little world of engineering simulation. I’d like to use this section to point you in the right direction with helpful engineering datasets, projects, libraries, and similar educational resources. In general, I think these resources will let you get started on some introductory projects, which should lead to things you can put on your resume (with a bit more work). I will let this bleed over into ‘part 3’ to continue sharing datasets, code, and projects (after getting feedback that bite-size posts are preferred to mini-novels).
Publishing your research work is obviously a great way to have things to show and talk about with recruiters and hiring managers. However, the cost associated with publishing, traveling, and attending conferences is often high enough to be prohibitive for individuals paying out of pocket. I hope these projects can help you gain experience (for free) while leading you to more advanced projects, and perhaps give you things to add to your portfolio website.
Datasets you can use, projects you can pick up, public codes available to play with, relevant AI/ML libraries, et cetera, et cetera. Let’s jump in.
Datasets
I went ahead and split them into two groups: datasets that are relevant to our niche world of computer-aided engineering and simulation (e.g. a dataset composed of simulation data) and more general machine learning datasets (e.g. a dataset of housing prices). In case you are new to this concept, I just want to point out that the latter group can also benefit your machine learning projects in CAE applications, so do keep them in mind.
Aerospace, CAE, and Mechanical engineering specific resources
AhmedML, WindsorML and DrivaerML
Let’s start with one of the best, biggest, and most relevant datasets to share (in terms of industry geometries). There are 3 individual datasets here, each available for free here and also on Hugging Face. The files are HUGE (one dataset is 31TB), so pay attention to your hard drive space, and decide up front whether you want to download a few individual samples or the full dataset.
These types of datasets are attractive, but if you are a pure beginner I think this one should not be your first or second go-to. You probably want to start with toy cases that are easily managed and allow rapid iteration: quickly tune a model, retrain, learn/analyze, repeat, and ‘tinker’. Full-scale datasets like this one are not suited to beginners or to such a rapid learning cycle.
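Since the download size matters so much here, it can help to check free disk space and pull only a few samples first. Below is a minimal sketch of that idea, not an official recipe. It assumes you use the `huggingface_hub` package for the download. The `repo_id` you pass in and the `run_{i}/*` folder pattern are hypothetical placeholders, so check the dataset card for the real repository name and directory layout.

```python
import shutil

TB = 1000**4  # one decimal terabyte, in bytes


def has_room(path, required_bytes, safety_factor=1.1):
    """Return True if the filesystem holding `path` has enough free space.

    `safety_factor` adds a margin on top of the raw dataset size.
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * safety_factor


def run_patterns(run_ids, template="run_{i}/*"):
    """Build glob patterns that select only a handful of samples.

    ASSUMPTION: the 'run_{i}/' layout is a hypothetical placeholder;
    the real folder names are listed on the dataset card.
    """
    return [template.format(i=i) for i in run_ids]


def download_subset(repo_id, run_ids, local_dir):
    """Download only the selected samples from a Hugging Face dataset repo.

    Requires `pip install huggingface_hub`. `repo_id` is supplied by you;
    no repository name is assumed here.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=run_patterns(run_ids),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print("Room for a full 31 TB dataset here?", has_room(".", 31 * TB))
    print("Patterns for three samples:", run_patterns([1, 2, 3]))
```

Starting with three or four samples is usually enough to inspect the file formats and build a data loader before committing to terabytes of storage.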
Dataset link: https://caemldatasets.org/
I do highly recommend reading the companion publications though (here). They are highly educational, formulate the problem well, and use the datasets to demonstrate modern techniques worth considering in similar non-toy problems. The pictures below are screenshots from the aforementioned publications.
NASA Workshop Cases
This section bears several fruits for the reader. We will call out the long-standing and wonderful 'turbulence modeling resource' by the NASA Langley Research Center, and furthermore a machine learning specific resource that was the outcome of their recent symposium (“Turbulence Modeling: Roadblocks, and the Potential for Machine Learning”). The symposium was a friendly challenge in which participants tried their hand at predicting a series of benchmark problems with whichever methods they wanted to employ, with a particular focus on applying data-driven and machine learning techniques. This is a great resource for us: they provide free datasets, validation data from the online turbulence modeling resource, and recordings of the presentations from each day of the symposium. For those trying to build up a machine learning portfolio or find an area of research to pursue, these are very enabling resources!
Let’s get into the different benchmarking cases from the NASA website [link]. Each case is listed by its configuration, followed on the next line(s) by the objective for the machine learning model.
2D Zero Pressure Gradient Flat Plate Validation Case
Show (1) Cf vs. x and (2) u+ vs. log(y+) at x=0.97; compare with theory
2D Fully-Developed Channel Flow at High Reynolds Number Validation Case
Show u+ vs. log(y+) at x=500; compare with theory
Axisymmetric Subsonic Jet
Show (1) u/Ujet vs. x/Djet, (2) u/Ujet vs. y/Djet at 5 specified stations, and (3) u'v'/(Ujet^2) vs. y/Djet at 5 specified stations; compare with experiment
2D NASA Wall-Mounted Hump Separated Flow Validation Case
Show (1) Cp vs. x/c, (2) Cf vs. x/c, (3) u/Uinf vs. y/c at 7 specified stations, and (4) u'v'/(Uinf^2) vs. y/c at 7 specified stations; compare with experiment
2D NACA 0012 Airfoil Validation Cases (4 separate cases)
Angles of attack = 10, 15, 17, and 18 deg.
Show (1) CL vs. alpha, (2) CD vs. CL, (3) Cp vs. x/c, and (4) Cf (upper surface) vs. x/c; compare with experiment (except for Cf)
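For the flat plate and channel cases above, "compare with theory" for u+ vs. log(y+) typically means comparing against the law of the wall. Here is a minimal sketch of that comparison, assuming the commonly used constants kappa = 0.41 and B = 5.0 (other values appear in the literature, so check which ones your reference uses). The "predicted" profile in the demo is a made-up placeholder, not real model output.

```python
import math

KAPPA = 0.41  # von Karman constant (commonly used value; an assumption here)
B = 5.0       # log-law intercept (commonly used value; an assumption here)


def u_plus_sublayer(y_plus):
    """Viscous sublayer relation u+ = y+ (reasonable for roughly y+ < 5)."""
    return y_plus


def u_plus_log_law(y_plus):
    """Log law u+ = ln(y+)/kappa + B (reasonable for roughly y+ > 30).

    Note the natural log here; plots usually put y+ on a log10 axis.
    """
    return math.log(y_plus) / KAPPA + B


def rmse(predicted, reference):
    """Root-mean-square error between two equally sampled profiles."""
    if len(predicted) != len(reference):
        raise ValueError("profiles must have the same number of points")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    y_plus = [30.0, 100.0, 300.0, 1000.0]
    theory = [u_plus_log_law(y) for y in y_plus]
    # Hypothetical model output: theory shifted by a constant offset,
    # used only to show how the error metric is applied.
    predicted = [u + 0.5 for u in theory]
    print("RMSE vs. log law:", rmse(predicted, theory))
```

The same `rmse` helper can be reused for the cases that compare with experiment, as long as the prediction is first interpolated onto the measurement locations.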
Have fun!
Machine Learning for Physical Simulation Challenge [link]
This competition aims at promoting the use of ML based surrogate models to solve physical problems, through a task addressing a CFD use case: Airfoil design
Stanford Engineering Center for Turbulence Research [link]
DNS statistics saved into data files in separate repositories for fully developed turbulent pipe flow, transitional flow in a pipe, and a zero-pressure-gradient flat-plate boundary layer configuration.
MegaFlow2D: A Parametric Dataset for Machine Learning Super-resolution in Computational Fluid Dynamics Simulations [link]
A dataset of over 2 million snapshots of parameterized 2D fluid dynamics simulations of 3000 different external flow and internal flow configurations
Vreman Research [link]
Databases of Direct Numerical Simulations of turbulent channel flow
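DNS statistics like these are often distributed as plain-text files with columns of numbers. The sketch below shows one way to read such a file using only the standard library. It assumes whitespace-separated columns and comment lines starting with `#` or `%`; the actual format varies between databases, so check each one's README. The sample values are synthetic placeholders, not taken from any database.

```python
def parse_profile(text, comment_chars="#%"):
    """Parse whitespace-separated numeric columns from a text blob.

    Blank lines and lines starting with a comment character are skipped.
    Returns a list of columns, each a list of floats.
    """
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in comment_chars:
            continue
        rows.append([float(token) for token in stripped.split()])
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("rows have inconsistent column counts")
    # Transpose rows into columns
    return [list(column) for column in zip(*rows)]


# Synthetic placeholder data for illustration only (assumed format).
SAMPLE = """\
# y+    U+
0.5   0.5
1.0   1.0
2.0   2.0
"""

if __name__ == "__main__":
    y_plus, u_plus = parse_profile(SAMPLE)
    print("y+ values:", y_plus)
    print("U+ values:", u_plus)
```

For a real file you would pass in `open(path).read()` instead of `SAMPLE`, and the column order would come from that file's header.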
Simulation of heat transfer phenomena with supercritical CO2 [link]
Direct numerical simulation database for supercritical carbon dioxide.
Johns Hopkins Turbulence Databases [link]
This website is a portal to an Open Numerical Turbulence Laboratory that enables access to multi-Terabyte turbulence databases
More General AI Datasets & Resources
Kaggle [link]
Kaggle datasets are a diverse collection of datasets made available on the Kaggle platform, encompassing a wide range of topics. Kaggle is extremely popular and hosts lots of discussion boards, coded projects, and datasets.
UCI Machine Learning Repo [link]
As of today, it includes 664 datasets as a service to the machine learning community. Here, you can donate and find datasets used by millions of people all around the world!
Google Dataset Search [link]
Dataset Search is a search engine for datasets. Using a simple keyword search, users can discover datasets hosted in thousands of repositories across the Web.
More on this soon in part 3. I am aiming to release that by Friday the 13th (👻👻👻). Thanks for reading!