Recently Microsoft announced ML.NET, a machine learning framework for .NET. This is exciting news. So my mind immediately goes to: how does this look with F#? The current post will take a look at using ML.NET’s regression module to predict concrete compressive strength based on its composite ingredients.
Before jumping in too far, a disclaimer: ML.NET is in its early stages. I found a couple of implementation and interface idiosyncrasies that I suspect will change over time. Just keep that in mind moving forward. The short version is, I've been pleased with what I've seen so far. There is room for improvement, especially around more F#-centric support for calling methods, and it will be an interesting journey as the framework matures.
With that out of the way, make sure you have .NET Core version 2.0 installed. If you don’t, head out to the .NET Core Downloads page. Select SDK for your platform. Tangential, but you can also get here by going to dot.net, then navigating to Downloads and .NET Core.
First, create the project and add the ML.NET package. This will be a console app in F# (obviously).
dotnet new console --language F# --name MLNet-Concrete
cd MLNet-Concrete
dotnet add package Microsoft.ML
Next, it is time to get the data. The source I used for this post is from UCI. The dataset is an Excel file (xls), and I need it as a csv. I used ssconvert (from apt install gnumeric) to convert from Excel to CSV, but feel free to use whatever works for you.
mkdir data && cd data
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls
ssconvert Concrete_Data.xls Concrete_Data.csv
Here is a sample of what the data looks like. There is a header row, which I've transposed to a vertical list for readability. The first 8 columns are features; the last is the concrete compressive strength.
# Header Row
Cement (component 1)(kg in a m^3 mixture)
Blast Furnace Slag (component 2)(kg in a m^3 mixture)
Fly Ash (component 3)(kg in a m^3 mixture)
Water (component 4)(kg in a m^3 mixture)
Superplasticizer (component 5)(kg in a m^3 mixture)
Coarse Aggregate (component 6)(kg in a m^3 mixture)
Fine Aggregate (component 7)(kg in a m^3 mixture)
Age (day)
Concrete compressive strength(MPa, megapascals)
# Data Rows
540,0,0,162,2.5,1040,676,28,79.98611076
540,0,0,162,2.5,1055,676,28,61.887365759999994
332.5,142.5,0,228,0,932,594,270,40.269535256000005
332.5,142.5,0,228,0,932,594,365,41.052779992
Now that the project is set up and the data is local, we can get to the code. Time to open the already created Program.fs. First, add the necessary namespaces.
open System
open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Trainers
open Microsoft.ML.Transforms
open Microsoft.ML.Models
The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; in F# we can use a type. Two types are required: ConcreteData is the input data, ConcretePrediction is the output prediction. ConcreteData is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data. Each attribute must be mutable and public, and it requires a [<Column("#")>] attribute to specify its column position as well as a [<DefaultValue>] attribute. For ConcretePrediction, a single attribute is required: the prediction value. For the input data, the label variable must be named Label. For the prediction type, the variable must be named Score. There are methods where you are supposed to be able to define a ColumnName attribute, or copy a label column into the pipeline, but frankly they didn't work for me. I'm unclear whether I was doing something wrong or whether it's a current early-state problem. Over time I expect this will be resolved, but for now I don't mind working within tighter constraints.
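A sketch of what those two types can look like, following the rules above (the column indices match the CSV layout, and the field names match the feature names used later in the pipeline):

```fsharp
// Input row: eight feature columns, plus the label at column index 8.
// Each field is mutable public, with [<Column>] and [<DefaultValue>] attributes.
type ConcreteData() =
    [<Column("0")>] [<DefaultValue>] val mutable public Cement : float32
    [<Column("1")>] [<DefaultValue>] val mutable public Slag : float32
    [<Column("2")>] [<DefaultValue>] val mutable public Ash : float32
    [<Column("3")>] [<DefaultValue>] val mutable public Water : float32
    [<Column("4")>] [<DefaultValue>] val mutable public Superplasticizer : float32
    [<Column("5")>] [<DefaultValue>] val mutable public CoarseAggregate : float32
    [<Column("6")>] [<DefaultValue>] val mutable public FineAggregate : float32
    [<Column("7")>] [<DefaultValue>] val mutable public Age : float32
    [<Column("8")>] [<DefaultValue>] val mutable public Label : float32

// Output: the regressor writes its prediction into Score.
type ConcretePrediction() =
    [<DefaultValue>] val mutable public Score : float32
```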
The structure of building a pipeline is pretty intuitive. First, create a pipeline. Then, add components to the pipeline in the order to be executed. So first, load the data with a TextLoader. This data is comma delimited and has a header row.
let pipeline = new LearningPipeline()
let dataPath = "./data/Concrete_Data.csv"
pipeline.Add(new TextLoader<ConcreteData>(dataPath, separator = ",", useHeader = true))
After the data is loaded, feature columns need to be added to the pipeline. I'm going to use all feature columns from the file, but I don't have to. The regressor model requires features to be numeric. In this example that is the case, so nothing special needs to be done. In cases where columns are strings, the CategoricalOneHotVectorizer() will convert string columns to numeric mappings. I've provided an example line below. Even though I don't need it, it's a handy reference to have. Note the order: since this is a pipeline, the string-to-numeric conversion needs to happen before the feature columns are added.
// Example of how to convert text columns to numeric
// pipeline.Add(new CategoricalOneHotVectorizer("CementBrandName", "SlagBrandName"))
pipeline.Add(new ColumnConcatenator("Features", "Cement", "Slag", "Ash", "Water",
                                    "Superplasticizer", "CoarseAggregate", "FineAggregate", "Age"))
Now that the features are defined, it is time to determine what training method to use. For this post, FastTreeRegressor is used. This is a boosted decision tree and generally offers pretty good results. Custom hyperparameters can also be defined. I found the defaults to be fine, but it's good to see the option to tweak those values.
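Adding the trainer to the pipeline looks like the following. The commented line sketches overriding hyperparameters; the specific property names and values are assumptions based on the FastTree options in the preview API, not tuned recommendations:

```fsharp
// Add the boosted decision tree regressor with default hyperparameters
pipeline.Add(new FastTreeRegressor())

// Or, with custom hyperparameters (illustrative values only):
// pipeline.Add(new FastTreeRegressor(NumTrees = 500, NumLeaves = 40, LearningRates = 0.1))
```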
For the dataset in question, the FastTreeRegressor worked the best, but there are alternatives. I've listed them below. Most had worse performance, with the FastTreeTweedieRegressor being similar. As with anything, it is good to investigate options.
// Similar performance
// pipeline.Add(new FastTreeTweedieRegressor())
The last part, train the model. Note the ConcreteData and ConcretePrediction types as part of the Train call.
let model = pipeline.Train<ConcreteData, ConcretePrediction>()
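With a trained model in hand, scoring a single sample is one call to Predict. A sketch, where the sample values are taken from the first data row shown earlier (so this particular snippet is illustrative, not the exact sample used for the output later in the post):

```fsharp
// Build a sample to score; values come from the first row of the dataset
let test1 = ConcreteData()
test1.Cement <- 540.0f
test1.Slag <- 0.0f
test1.Ash <- 0.0f
test1.Water <- 162.0f
test1.Superplasticizer <- 2.5f
test1.CoarseAggregate <- 1040.0f
test1.FineAggregate <- 676.0f
test1.Age <- 28.0f

let prediction = model.Predict(test1)
printfn "Predicted Strength: %f" prediction.Score
```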
Validation of any model is important. For a real use case, I would train on one dataset and validate against a previously unseen dataset. Since this is just an example, I validate against the training data; as a result, I expect the results to be very good, and they are. ML.NET offers an evaluator class, which makes getting those crucial high-level numbers pretty easy: it takes a trained model and a dataset, and produces the critical metrics. This is one of those components that is crucial to an ML framework, and I'm glad to see it here.
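A sketch of the evaluation step, reusing the training file per the caveat above (the RegressionEvaluator class and the metric property names here reflect my reading of the preview API and may shift as it matures):

```fsharp
// Evaluate the trained model against a dataset loaded by a TextLoader
let testData = new TextLoader<ConcreteData>(dataPath, separator = ",", useHeader = true)
let evaluator = new RegressionEvaluator()
let metrics = evaluator.Evaluate(model, testData)
printfn "RMS : %f" metrics.Rms
printfn "R^2 : %f" metrics.RSquared
```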
Backtracking to the hyperparameter example, here are those results. As you can tell, my randomly picked hyperparameter values were not better. It seems like a fun opportunity to pair some optimization searches with the pipeline to see how the methods can be improved. Of course, this is more meaningful when not validating against the training data; as it stands, there is a risk of overfitting that we're not seeing.
Once a model is trained, it can also be saved to a file and reloaded at a later time. This is supported by the WriteAsync and ReadAsync methods of the model.
// Save model to file
model.WriteAsync("test-model")
|> Async.AwaitTask
|> Async.RunSynchronously

// Load model from file and run a prediction
let modelReloaded =
    PredictionModel.ReadAsync<ConcreteData, ConcretePrediction>("test-model")
    |> Async.AwaitTask
    |> Async.RunSynchronously

let predictionReloaded = modelReloaded.Predict(test1)
printfn "Predicted Strength RL: %f" predictionReloaded.Score
printfn "Actual Strength : 38.074243671999994"
printfn ""
# Prediction Result (model reloaded):
Predicted Strength RL: 38.882920
Actual Strength : 38.074243671999994
Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Processed 1030 instances
Binning and forming Feature objects
Reserved memory for tree learner: 234780 bytes
Starting to train ...
Not training a calibrator because it is not needed.
Predicted Strength: 38.882920
Actual Strength : 38.074243671999994
Predicted Strength: 45.623180
Predicted Strength RL: 38.882920
Actual Strength : 38.074243671999994
There you have it. A brief look into training and using an ML.NET regressor model. Although there are a couple quirks, I’m excited to see this released. This will only get better over time and if F# can be a part of that, even better.