F# and ML.NET Regression

Read Time: 10 minutes

Today I’ll look at using F# and ML.NET to do some model building, specifically predicting concrete compressive strength from its composite ingredients. If this sounds familiar, it is. This is a revisit of a previous post that explored a beta version of ML.NET. This time around, the version 1 interface will be used.

It has been on my list for a while to revisit the ML.NET beta posts to see how things have changed, as well as to get a good example of code in the new, now version 1, world. Here I’ll be looking at the Regression post from last year. The original plan was to perform a closer analysis of how things have changed over time, but the reality is so much has changed that a line-by-line comparison won’t provide much value. Instead, I’ve decided to just redo the post using ML.NET version 1.3 and show some interesting points along the way. Before getting into it, I’ll say the experience was definitely better this time around. The improvements are obvious, and I have been pleased with the progression the ML.NET team has made over time. Without further ado, make sure you have .NET Core version 2.2 installed. If you don’t, head to dot.net, then navigate to Downloads and the .NET Core SDK.

Create the project and add the necessary ML.NET packages. This will be a console app in F# (obviously).

dotnet new console -lang F# -n MLNetConcreteV2
cd MLNetConcreteV2
dotnet add package Microsoft.ML --version 1.3.1
dotnet add package Microsoft.ML.FastTree --version 1.3.1

Before getting into the code, I first need to get the data. The source used is from UCI. The dataset is an Excel file (xls) and needs to be converted to a CSV. I used ssconvert (from apt install gnumeric) to do the conversion, but feel free to use whatever works for you.

mkdir data && cd data
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls
ssconvert Concrete_Data.xls Concrete_Data.csv

Here is a sample of what the data looks like. There is a header row, which I’ve transposed to a vertical list for readability. The first 8 columns are features; the last is the concrete compressive strength.

# Header Row
Cement (component 1)(kg in a m^3 mixture)
Blast Furnace Slag (component 2)(kg in a m^3 mixture)
Fly Ash (component 3)(kg in a m^3 mixture)
Water (component 4)(kg in a m^3 mixture)
Superplasticizer (component 5)(kg in a m^3 mixture)
Coarse Aggregate (component 6)(kg in a m^3 mixture)
Fine Aggregate (component 7)(kg in a m^3 mixture)
Age (day)
Concrete compressive strength(MPa, megapascals)
# Data Rows
540,0,0,162,2.5,1040,676,28,79.98611076
540,0,0,162,2.5,1055,676,28,61.887365759999994
332.5,142.5,0,228,0,932,594,270,40.269535256000005
332.5,142.5,0,228,0,932,594,365,41.052779992

Now that the project is set up and the data is local, there is some foundational code that needs to be created. F# can use record types to specify the data format. The ConcreteData record is effectively a file definition; ML.NET requires the LoadColumn attribute for the column mappings used during the data load phase. The ConcretePrediction record is for prediction results. Once the data types are set up, an MLContext object must be created.

open System
open Microsoft.ML
open Microsoft.ML.Data

[<CLIMutable>]
type ConcreteData = {
    [<LoadColumn(0)>]
    Cement: float32

    [<LoadColumn(1)>]
    Slag: float32

    [<LoadColumn(2)>]
    Ash: float32

    [<LoadColumn(3)>]
    Water: float32

    [<LoadColumn(4)>]
    Superplasticizer: float32

    [<LoadColumn(5)>]
    CoarseAggregate: float32

    [<LoadColumn(6)>]
    FineAggregate: float32

    [<LoadColumn(7)>]
    Age: float32

    [<LoadColumn(8)>]
    Label: float32
}

[<CLIMutable>]
type ConcretePrediction = {
    Score: float32
}

let context = MLContext()

Data loading is a reasonably simple process. I will use all of the data, but if I wanted to filter it, there is a FilterRowsByColumn method to exclude rows, perhaps known bad data or outliers. Additionally, there is a need for training and test datasets. ML.NET provides a convenient TrainTestSplit method to generate those datasets. Here I will take 10% of the records for a test set.

let dataPath = "./data/Concrete_Data.csv"
let allData =
    context
        .Data
        .LoadFromTextFile<ConcreteData>(
            path = dataPath,
            hasHeader = true,
            separatorChar = ',')

// Illustrative only; the unfiltered allData is used for the split below
let filteredData = context.Data.FilterRowsByColumn(allData, "Slag", lowerBound = 50., upperBound = 100.)

let allDataSplit = context.Data.TrainTestSplit(allData, testFraction = 0.1)
let trainData = allDataSplit.TrainSet
let testData = allDataSplit.TestSet
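
Note that TrainTestSplit shuffles randomly, so the metrics later in this post will vary from run to run. If reproducibility matters, the MLContext can be seeded when it is created. A minimal sketch; the seed value itself is arbitrary:

// Seeding the context makes random operations such as TrainTestSplit repeatable
let seededContext = MLContext(seed = Nullable 1)
let repeatableSplit = seededContext.Data.TrainTestSplit(allData, testFraction = 0.1)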

Time to build the data pipeline. There are a couple of things to look at here. Multiple transformation actions will be applied to the data. Concatenate provides a mechanism to build an aggregate field, specifically Features. I also don’t want to use the data raw; I want to apply a normalization method to the columns to improve the prediction results. To keep it simple, I am applying NormalizeMeanVariance to all of the targeted fields. To do it right, I should be more rigorous about my transformation choices. Other possible normalization methods are NormalizeLogMeanVariance, NormalizeLpNorm, and NormalizeMinMax, to name a few. The data is already in numeric form, but if there were text fields, there are transformation methods for those as well. An example of this could be: Transforms.Categorical.OneHotEncoding("CementBrandName", "CementBrandId"). I’ve only touched the surface of the data transformation options, but there is generally something there for your specific needs.

let pipeline =
    EstimatorChain()
        .Append(context.Transforms.NormalizeMeanVariance("Cement", "Cement"))
        .Append(context.Transforms.NormalizeMeanVariance("Slag", "Slag"))
        .Append(context.Transforms.NormalizeMeanVariance("Ash", "Ash"))
        .Append(context.Transforms.NormalizeMeanVariance("Water", "Water"))
        .Append(context.Transforms.NormalizeMeanVariance("Superplasticizer", "Superplasticizer"))
        .Append(context.Transforms.NormalizeMeanVariance("CoarseAggregate", "CoarseAggregate"))
        .Append(context.Transforms.NormalizeMeanVariance("FineAggregate", "FineAggregate"))
        .Append(context.Transforms.NormalizeMeanVariance("Age", "Age"))
        .Append(context.Transforms.Concatenate("Features", [|"Cement"; "Slag"; "Ash"; "Water"; "Superplasticizer"; "CoarseAggregate"; "FineAggregate"; "Age"|]))
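
To make the text-field point concrete, here is a sketch of what the chain could look like if the dataset had a categorical text column. The CementBrandName and CementBrandEncoded columns are hypothetical; they do not exist in this dataset:

// Hypothetical sketch: one-hot encode a text column (not in this dataset) into
// a numeric vector, then fold it into the aggregate Features column
let pipelineWithCategorical =
    EstimatorChain()
        .Append(context.Transforms.Categorical.OneHotEncoding("CementBrandEncoded", "CementBrandName"))
        .Append(context.Transforms.NormalizeMeanVariance("Cement", "Cement"))
        .Append(context.Transforms.Concatenate("Features", [|"Cement"; "CementBrandEncoded"|]))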

Once the data pipeline is configured, it is time to build a model trainer. Again, ML.NET offers multiple options for training methods. For this example I use FastTreeTweedie with no parameters. Hyperparameter options are available for many of the trainers, and the code below shows some alternative trainers, with and without hyperparameters, left commented out. Once the trainer is defined, it is appended to the pipeline. Now is the time to create a trained model using Fit against the previously defined training data. There is one last piece to make this process useful: the prediction engine. This provides the mechanism to actually perform predictions. (Note that a PredictionEngine is not thread-safe; it is intended for one-at-a-time predictions.)

let trainer = context.Regression.Trainers.FastTreeTweedie()

// FastTreeTweedie with hyperparameters
// let trainer = context.Regression.Trainers.FastTreeTweedie(numberOfTrees = 500, minimumExampleCountPerLeaf = 5)

// OnlineGradientDescent
// let trainer = context.Regression.Trainers.OnlineGradientDescent(labelColumnName = "Label", featureColumnName = "Features")

let modelBuilder = pipeline.Append(trainer)

let model = modelBuilder.Fit(trainData)

let predictionEngine = context.Model.CreatePredictionEngine<ConcreteData,ConcretePrediction>(model)

The next step is to see how good a model has been built. The trained model is applied to the test data, and performance metrics are extracted. I cherry-picked a couple of the available metrics.

let predictions = model.Transform(testData)

let metrics = context.Regression.Evaluate(predictions)

printfn "R-Squared: %f" (metrics.RSquared)
printfn "RMS : %f" (metrics.RootMeanSquaredError)
printfn "Loss : %f" (metrics.LossFunction)
printfn "MAE : %f" (metrics.MeanAbsoluteError)
printfn "MSE : %f" (metrics.MeanSquaredError)

Here are the evaluation metrics for the test data run. Loss and MSE match here because the evaluator’s default loss function for regression is squared error.

R-Squared: 0.920959
RMS : 4.875440
Loss : 23.769913
MAE : 2.682631
MSE : 23.769913
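
A single train/test split is sensitive to which rows happen to land in the test set. For a steadier estimate, ML.NET also supports k-fold cross-validation through the same Regression catalog. A minimal sketch; the explicit match is needed because F# does not apply the covariant interface conversion implicitly:

// Cast the estimator chain to the interface CrossValidate expects
let pipelineEstimator =
    match box modelBuilder with
    | :? IEstimator<ITransformer> as estimator -> estimator
    | _ -> failwith "Expected an IEstimator<ITransformer>"

// 5-fold cross-validation: trains and evaluates the pipeline on each fold
let cvResults = context.Regression.CrossValidate(allData, pipelineEstimator, numberOfFolds = 5)
cvResults |> Seq.iteri (fun fold result -> printfn "Fold %d R-Squared: %f" fold result.Metrics.RSquared)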

The trained model is now something that can be used against data. I pulled one of the rows from the dataset just to show how this is put together. Again, I can use a record type to define the data. The Label field is part of the record type, but it is ignored at prediction time, so zero is fine.

let test1 = {
    Cement = 198.6f
    Slag = 132.4f
    Ash = 0.f
    Water = 192.f
    Superplasticizer = 0.f
    CoarseAggregate = 978.4f
    FineAggregate = 825.5f
    Age = 90.f
    Label = 0.f
}

let predictionTest1 = predictionEngine.Predict(test1)
printfn "Predicted Strength: %f" predictionTest1.Score
printfn "Actual Strength : 38.074243671999994"
printfn ""

Predicted Strength: 39.541660
Actual Strength : 38.074243671999994
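
The PredictionEngine above is meant for one-off calls. When scoring many rows at once, it is generally better to transform an IDataView and pull the results back out. A minimal sketch, reusing test1 as a stand-in input:

// Batch scoring: build an IDataView from any sequence of ConcreteData,
// run it through the trained model, then materialize the scores
let batchInput = context.Data.LoadFromEnumerable<ConcreteData>([ test1 ])
let batchScored = model.Transform(batchInput)
let batchPredictions = context.Data.CreateEnumerable<ConcretePrediction>(batchScored, reuseRowObject = false)
batchPredictions |> Seq.iter (fun p -> printfn "Predicted Strength: %f" p.Score)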

A trained model isn’t much use if it can’t be passed around and used elsewhere. That is where the ML.NET model Save and Load methods come into play. Both are straightforward to use, which is what I’ve come to expect. The same prediction as above is run, but this time on a model loaded from a file. The circle is complete.

// Save model to file
context.Model.Save(model, trainData.Schema, "model.zip")

// Load model from file
let (modelReloaded, schemaReloaded) = context.Model.Load("model.zip")
let predictionEngineReloaded = context.Model.CreatePredictionEngine<ConcreteData,ConcretePrediction>(modelReloaded)
let predictionReloaded = predictionEngineReloaded.Predict(test1)

printfn "Predicted Strength RL: %f" predictionReloaded.Score
printfn "Actual Strength : 38.074243671999994"

Predicted Strength RL: 39.541660
Actual Strength : 38.074243671999994

It has been nice to revisit an old dataset and method, but with the newest ML.NET. The ergonomics and functionality have certainly improved over time. This bodes well for the future as the team continues to release and improve functionality. I hope you have found this useful as well, and that it has perhaps enticed you to investigate ML.NET and F# for yourself. Until next time.