2018-06-09

F# and ML.NET Regression

Read Time: 13 minutes

Recently Microsoft announced ML.NET, a machine learning framework for .NET. This is exciting news, and my mind immediately went to: how does this look with F#? This post takes a look at using ML.NET’s regression module to predict concrete compressive strength based on its composite ingredients.

Update: This post is kept for posterity’s sake; a rework of this post using ML.NET version 1.3 is here.

Before jumping in too far, there is a disclaimer: ML.NET is in its early stages. I found a couple of implementation and interface idiosyncrasies that I suspect will change over time. Just keep that in mind moving forward. The short version is, I’ve been pleased with what I’ve seen so far. There is some room for improvement, especially in having more F#-centric support for calling methods. It will be an interesting journey as the framework matures.

Update: The post was written using Microsoft.ML v0.1.0, and v0.2.0 has since been released. I have noted interface changes below; for this example, only TextLoader is affected.

With that out of the way, make sure you have .NET Core version 2.0 installed. If you don’t, head out to the .NET Core Downloads page. Select SDK for your platform. Tangential, but you can also get here by going to dot.net, then navigating to Downloads and .NET Core.

First, create the project and add the ML.NET package. This will be a console app in F# (obviously).

dotnet new console --language F# --name MLNet-Concrete
cd MLNet-Concrete
dotnet add package Microsoft.ML

Next, it is time to get the data. The source I used for this post is from UCI. The dataset is an Excel file (xls), and I need it as a CSV. I used ssconvert (from apt install gnumeric) to convert from Excel to CSV, but feel free to use whatever works for you.

mkdir data && cd data
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls
ssconvert Concrete_Data.xls Concrete_Data.csv

Here is a sample of what the data looks like. There is a header row; I’ve transposed it to a vertical list for readability. The first 8 columns are features, the last is the concrete compressive strength.

# Header Row
Cement (component 1)(kg in a m^3 mixture)
Blast Furnace Slag (component 2)(kg in a m^3 mixture)
Fly Ash (component 3)(kg in a m^3 mixture)
Water  (component 4)(kg in a m^3 mixture)
Superplasticizer (component 5)(kg in a m^3 mixture)
Coarse Aggregate  (component 6)(kg in a m^3 mixture)
Fine Aggregate (component 7)(kg in a m^3 mixture)
Age (day)
Concrete compressive strength(MPa, megapascals)

# Data Rows
540,0,0,162,2.5,1040,676,28,79.98611076
540,0,0,162,2.5,1055,676,28,61.887365759999994
332.5,142.5,0,228,0,932,594,270,40.269535256000005
332.5,142.5,0,228,0,932,594,365,41.052779992
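Before handing the file to ML.NET, it can be worth a quick sanity check that the converted rows have the expected shape. The following is a minimal sketch in plain F#, not part of the original program: parseRow is a hypothetical helper, and the sample rows are copied from the listing above.

```fsharp
open System
open System.Globalization

// Hypothetical helper: split one CSV data row into float32 values.
// InvariantCulture keeps '.' as the decimal separator on any machine.
let parseRow (line: string) : float32[] =
  line.Split(',')
  |> Array.map (fun s -> Single.Parse(s.Trim(), CultureInfo.InvariantCulture))

// Sample rows copied from the data shown above.
let sampleRows =
  [ "540,0,0,162,2.5,1040,676,28,79.98611076"
    "332.5,142.5,0,228,0,932,594,270,40.269535256000005" ]

let parsed = sampleRows |> List.map parseRow

// Every row should have 8 feature columns plus 1 label column.
let allRowsHaveNineColumns = parsed |> List.forall (fun row -> row.Length = 9)
printfn "All rows have 9 columns: %b" allRowsHaveNineColumns
```

The same check could be run over every line of the real file (skipping the header) to catch a bad conversion before training.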

Now that the project is set up and the data is local, we can get to the code. Time to open up the already created Program.fs. First, add the necessary namespaces.

open System
open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Trainers
open Microsoft.ML.Transforms
open Microsoft.ML.Models

// New for v0.2.0
open Microsoft.ML.Data

The ML.NET pipeline expects the data in a specific format. In the C# world, this is a class; for F# we can use a type. Below are the required types: ConcreteData is the input data, ConcretePrediction is the output prediction. ConcreteData is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data. Each field must be mutable and public, and it requires both a [<Column("#")>] attribute to specify its column position and a [<DefaultValue>] attribute. For ConcretePrediction, a single field is required, the prediction value. For the input data, the label variable must be named Label. For the prediction type, the variable must be named Score. There are methods where you are supposed to be able to define a ColumnName attribute, or copy a label column into the pipeline, but frankly they didn’t work for me. I’m unclear if I was doing something wrong or if it’s a current early-state problem. Over time I expect this will be resolved, but for now I don’t mind working within tighter constraints.

type ConcreteData() =
  [<Column("0")>]
  [<DefaultValue>]
  val mutable public Cement:float32

  [<Column("1")>]
  [<DefaultValue>]
  val mutable public Slag:float32

  [<Column("2")>]
  [<DefaultValue>]
  val mutable public Ash:float32

  [<Column("3")>]
  [<DefaultValue>]
  val mutable public Water:float32

  [<Column("4")>]
  [<DefaultValue>]
  val mutable public Superplasticizer:float32

  [<Column("5")>]
  [<DefaultValue>]
  val mutable public CoarseAggregate:float32

  [<Column("6")>]
  [<DefaultValue>]
  val mutable public FineAggregate:float32

  [<Column("7")>]
  [<DefaultValue>]
  val mutable public Age:float32

  [<Column("8")>]
  [<DefaultValue>]
  val mutable public Label :float32

type ConcretePrediction() =
  [<DefaultValue>]
  val mutable public Score:float32

The structure of building a pipeline is pretty intuitive. First, create a pipeline. Then, add components to the pipeline in the order to be executed. So first, load the data with a TextLoader. This data is comma delimited and has a header row.

let pipeline = new LearningPipeline()
let dataPath = "./data/Concrete_Data.csv"
// v0.1.0 
// pipeline.Add(new TextLoader<ConcreteData>(dataPath, separator = ",", useHeader = true))
// New for v0.2.0
pipeline.Add((new TextLoader(dataPath)).CreateFrom<ConcreteData>(separator = ',', useHeader = true))

After the data is loaded, feature columns need to be added to the pipeline. I’m going to use all feature columns from the file, but I don’t have to. The regressor model requires features to be numeric. In this example, that is the case and nothing special needs to be done. In cases where columns are strings, the CategoricalOneHotVectorizer() will convert string columns to numeric mappings. I’ve provided an example line below. Even though I don’t need it, it’s a handy reference to have. Note the order: since it is a pipeline, the string to numeric column conversion needs to happen prior to adding the feature columns.

// Example how to convert text to numeric
// pipeline.Add(new CategoricalOneHotVectorizer("CementBrandName", "SlagBrandName"));
pipeline.Add(new ColumnConcatenator("Features", "Cement", "Slag", "Ash", "Water", "Superplasticizer", "CoarseAggregate", "FineAggregate", "Age"))
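For readers unfamiliar with one-hot encoding, here is a conceptual sketch of what converting a string column to numeric columns means. This is plain F# for illustration only, not ML.NET's implementation; the brand names are made-up values for a hypothetical CementBrandName column, and oneHot is a hypothetical helper.

```fsharp
// Conceptual sketch of one-hot encoding; this is not ML.NET's implementation.
// The brand names are made-up values for a hypothetical CementBrandName column.
let brands = [ "Acme"; "Bravo"; "Acme"; "Cobalt" ]

// Distinct categories in first-seen order: Acme, Bravo, Cobalt.
let categories = brands |> List.distinct

// Each value becomes a vector with 1.0f in its category's slot and 0.0f elsewhere.
let oneHot (value: string) : float32 list =
  categories |> List.map (fun c -> if c = value then 1.0f else 0.0f)

brands |> List.iter (fun b -> printfn "%-6s -> %A" b (oneHot b))
```

Each distinct string ends up as its own numeric column, which is why this step has to run before the feature columns are concatenated.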

Now that the features are defined, it is time to determine what training method to use. For this post FastTreeRegressor is used. This is a boosted decision tree and generally offers pretty good results. Custom hyperparameters can also be defined. I found the defaults to be fine, but it’s good to see the option to tweak those values.

pipeline.Add(new FastTreeRegressor())

// FastTreeRegressor with hyperparameters
// pipeline.Add(new FastTreeRegressor(NumTrees = 500, DropoutRate = 0.2))

For the dataset in question, the FastTreeRegressor worked the best, but there are alternatives. I’ve listed them below. Most had worse performance, with the FastTreeTweedieRegressor being similar. As with anything, it is good to investigate options.

// Similar performance
// pipeline.Add(new FastTreeTweedieRegressor())

// Worse performance
// pipeline.Add(new PoissonRegressor())
// pipeline.Add(new StochasticDualCoordinateAscentRegressor())
// pipeline.Add(new FastForestRegressor())
// pipeline.Add(new OnlineGradientDescentRegressor())

The last part is to train the model. Note the ConcreteData and ConcretePrediction types as part of the Train call.

let model = pipeline.Train<ConcreteData, ConcretePrediction>()

Validation of any model is important. For a real case, I would train on one dataset and validate against a previously unseen dataset. Since this is just an example, I validate against the training data. As a result, I expect the results to be very good, and they are. ML.NET offers an Evaluator class, which makes getting some of those crucial high-level numbers pretty easy. It takes a trained model and a dataset, and produces critical metrics. Again, this is one of those components that is crucial to an ML framework and I’m glad to see it here.

// Evaluate results
// v0.1.0 
// let testData = new TextLoader<ConcreteData>(dataPath, separator = ",", useHeader = true)
// New for v0.2.0
let testData = (new TextLoader(dataPath)).CreateFrom<ConcreteData>(separator = ',', useHeader = true)
let evaluator = new RegressionEvaluator()
let metrics = evaluator.Evaluate(model, testData) 
printfn ""
printfn "R-Squared: %f" <| metrics.RSquared
printfn "RMS      : %f" <| metrics.Rms
printfn "L1       : %f" <| metrics.L1
printfn "L2       : %f" <| metrics.L2
printfn ""

# Evaluator Results:
R-Squared: 0.988533
RMS      : 1.788017
L1       : 1.139818
L2       : 3.197006
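To make the four numbers less opaque, here is a small sketch of the usual definitions, assuming L1 is the mean absolute error, L2 is the mean squared error, RMS is the square root of L2, and R-Squared is one minus the ratio of residual to total sum of squares. That reading is an assumption on my part, but it is consistent with the reported values, since 1.788017 squared is approximately 3.197006. The actual and predicted arrays are made-up numbers, not values from the concrete dataset.

```fsharp
// Sketch of the usual regression metric definitions (an assumption about what
// the evaluator reports). The values below are made up for illustration.
let actual    = [| 3.0; 5.0; 7.0; 9.0 |]
let predicted = [| 2.5; 5.0; 8.0; 9.5 |]

let errors = Array.map2 (fun a p -> a - p) actual predicted

let l1  = errors |> Array.averageBy (fun e -> abs e)   // mean absolute error
let l2  = errors |> Array.averageBy (fun e -> e * e)   // mean squared error
let rms = sqrt l2                                      // root of the mean squared error

// R-squared: 1 - (residual sum of squares / total sum of squares)
let mean  = Array.average actual
let ssRes = errors |> Array.sumBy (fun e -> e * e)
let ssTot = actual |> Array.sumBy (fun a -> (a - mean) * (a - mean))
let rSquared = 1.0 - ssRes / ssTot

printfn "L1: %f  L2: %f  RMS: %f  R-Squared: %f" l1 l2 rms rSquared
// prints: L1: 0.500000  L2: 0.375000  RMS: 0.612372  R-Squared: 0.925000
```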

Backtracking to the hyperparameter example, here are those results. As you can tell, my randomly picked hyperparameter choices were not better. It certainly seems like a fun opportunity to pair some optimization searches with the pipeline to see how methods can be improved. Of course, this would be more meaningful if it were not validating against the training data; there is already a risk of overfitting that we’re not seeing.

# Evaluator Results (with hyperparameters):
R-Squared: 0.947057
RMS      : 3.841995
L1       : 2.846822
L2       : 14.760922
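One way to get a previously unseen dataset is to hold out part of the CSV before training. The following is a rough sketch in plain F#, not something the original program does: splitRows and writeSplit are hypothetical helpers, and the seed and the 80/20 ratio are arbitrary choices.

```fsharp
open System
open System.IO

// Hypothetical helper: shuffle rows with a fixed seed (Fisher-Yates), then
// split them into a training part and a held-out test part.
let splitRows (seed: int) (trainFraction: float) (rows: string[]) =
  let rng = Random(seed)
  let shuffled = Array.copy rows
  for i in shuffled.Length - 1 .. -1 .. 1 do
    let j = rng.Next(i + 1)
    let tmp = shuffled.[i]
    shuffled.[i] <- shuffled.[j]
    shuffled.[j] <- tmp
  let trainCount = int (float shuffled.Length * trainFraction)
  Array.take trainCount shuffled, Array.skip trainCount shuffled

// Hypothetical helper: write train/test CSV files, repeating the header in each.
let writeSplit (sourcePath: string) (trainPath: string) (testPath: string) =
  let lines = File.ReadAllLines sourcePath
  let header, rows = lines.[0], lines.[1..]
  let trainRows, testRows = splitRows 42 0.8 rows
  File.WriteAllLines(trainPath, Array.append [| header |] trainRows)
  File.WriteAllLines(testPath, Array.append [| header |] testRows)

// Small in-memory demonstration with placeholder rows.
let demoRows = Array.init 10 (fun i -> sprintf "row-%d" i)
let demoTrain, demoTest = splitRows 42 0.8 demoRows
printfn "train: %d rows, test: %d rows" demoTrain.Length demoTest.Length
```

With files split this way, the training path would go to the pipeline's TextLoader and the test path to the TextLoader handed to the evaluator, so the metrics reflect rows the model has not seen.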

Here is an example of how individual predictions can be made. Create a ConcreteData object and provide it to the Predict method. For this example, I pull one of those rows from the training data.

let test1 = ConcreteData()
test1.Cement <- 198.6f
test1.Slag <- 132.4f
test1.Ash <- 0.f
test1.Water <- 192.f
test1.Superplasticizer <- 0.f
test1.CoarseAggregate <- 978.4f
test1.FineAggregate <- 825.5f
test1.Age <- 90.f

let predictionTest1 = model.Predict(test1)
printfn "Predicted Strength: %f" predictionTest1.Score
printfn "Actual Strength   : 38.074243671999994"
printfn ""

# Prediction Result:
Predicted Strength: 38.882920
Actual Strength   : 38.074243671999994

On a lark, let’s see what happens if slag is increased and the water content is reduced. The model predicts a higher compressive strength.

let test2 = ConcreteData()
test2.Cement <- 198.6f
test2.Slag <- 150.0f
test2.Ash <- 0.f
test2.Water <- 172.f
test2.Superplasticizer <- 0.f
test2.CoarseAggregate <- 978.4f
test2.FineAggregate <- 825.5f
test2.Age <- 90.f

let predictionTest2 = model.Predict(test2)
printfn "Predicted Strength: %f" predictionTest2.Score
printfn ""

# Prediction Result:
Predicted Strength: 45.623180

Once a model is trained, it can also be saved to a file and reloaded at a later time. This is supported by the WriteAsync and ReadAsync methods of a model.

// Save model to file
// RunSynchronously waits for the write to finish before the file is read back
model.WriteAsync("test-model")
|> Async.AwaitTask
|> Async.RunSynchronously

// Load model from file and run a prediction
let modelReloaded =
  PredictionModel.ReadAsync<ConcreteData, ConcretePrediction>("test-model") 
  |> Async.AwaitTask 
  |> Async.RunSynchronously
let predictionReloaded = modelReloaded.Predict(test1)
printfn "Predicted Strength RL: %f" predictionReloaded.Score
printfn "Actual Strength      : 38.074243671999994"
printfn ""

# Prediction Result (model reloaded):
Predicted Strength RL: 38.882920
Actual Strength      : 38.074243671999994

Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.

Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Processed 1030 instances
Binning and forming Feature objects
Reserved memory for tree learner: 234780 bytes
Starting to train ...
Not training a calibrator because it is not needed.

R-Squared: 0.988533
RMS      : 1.788017
L1       : 1.139818
L2       : 3.197006

Predicted Strength: 38.882920
Actual Strength   : 38.074243671999994

Predicted Strength: 45.623180

Predicted Strength RL: 38.882920
Actual Strength      : 38.074243671999994

There you have it. A brief look into training and using an ML.NET regressor model. Although there are a couple of quirks, I’m excited to see this released. This will only get better over time, and if F# can be a part of that, even better.