F# and ML.NET Classification

Read Time: 12 minutes

Expanding on my previous post, F# and ML.NET Regression, this post looks at performing classification using Microsoft’s new ML.NET framework. The task at hand is to use biomechanical attributes to classify patient vertebra conditions into normal (NO), disk hernia (DH), or spondylolisthesis (SL) categories.

As I mentioned in the previous post, there is a disclaimer: ML.NET is in its early stages. I found a couple of implementation and interface idiosyncrasies that I suspect will change over time. Just keep that in mind moving forward. Even so, I’ve been pleased with what I’ve seen so far.

Update: This post was written using Microsoft.ML v0.1.0, and v0.2.0 has since been released. I have noted interface changes below; for this example, only TextLoader is affected.

Make sure you have .NET Core version 2.1 installed. If you don’t, head out to the .NET Core Downloads page. Select SDK for your platform. Tangential, but you can also get here by going to dot.net, then navigating to Downloads and .NET Core.

First, create a console F# project, then add the ML.NET package.

dotnet new console --language F# --name MLNet-Vertebral
cd MLNet-Vertebral
dotnet add package Microsoft.ML

Next, it is time to get the data. The source I used for this post is from UCI. The datafile can be found here.

mkdir data && cd data
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00212/vertebral_column_data.zip

Here is a sample of what the data looks like. There is no header row. The columns represent 6 features and 1 classification column:

  • pelvic_incidence
  • pelvic_tilt
  • lumbar_lordosis_angle
  • sacral_slope
  • pelvic_radius
  • degree_spondylolisthesis
  • class

# Data Rows
63.03,22.55,39.61,40.48,98.67,-0.25,DH
39.06,10.06,25.02,29,114.41,4.56,DH
68.83,22.22,50.09,46.61,105.99,-3.53,DH
69.3,24.65,44.31,44.64,101.87,11.21,DH
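
To make the column layout concrete, here is a minimal sketch that parses one of the sample rows in plain F#, independent of ML.NET. The VertebralRow record and parseRow function are hypothetical names used only for illustration; they are not part of the pipeline built later in the post.

```fsharp
open System
open System.Globalization

// Hypothetical record mirroring the file layout: 6 numeric features, then the class.
type VertebralRow =
    { Features : float32[]
      Label : string }

// Split a comma-delimited line and parse the first six columns as float32.
let parseRow (line: string) =
    let parts = line.Split(',')
    if parts.Length <> 7 then
        failwithf "Expected 7 columns, got %d" parts.Length
    { Features =
        parts.[0..5]
        |> Array.map (fun s -> Single.Parse(s, CultureInfo.InvariantCulture))
      Label = parts.[6] }

let row = parseRow "63.03,22.55,39.61,40.48,98.67,-0.25,DH"
printfn "%A -> %s" row.Features row.Label
```

Parsing with the invariant culture keeps the decimal point handling independent of the machine's regional settings.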

Now that the project is set up and the data is local, we can get to the code. Time to open up the already created Program.fs. First, add the necessary namespaces.

open System
open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Trainers
open Microsoft.ML.Transforms
open Microsoft.ML.Models
open Microsoft.ML.Runtime.Learners

// New for v0.2.0
open Microsoft.ML.Data

The ML.NET pipeline expects the data in a specific format. In the C# world, this is a class; in F# we can use a type. Below are the required types: VertebralData is the input data, and VertebralPrediction is the output prediction. VertebralData is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data. Each field must be mutable and public, and each requires a [<Column("#")>] attribute to specify its column position as well as a [<DefaultValue>] attribute. For VertebralPrediction, a single field is required: the prediction value. For the input data, the label variable must be named Label. For the prediction type, the variable must be named PredictedLabel.

type VertebralData() =
    [<Column("0")>]
    [<DefaultValue>]
    val mutable public PelvicIncidence : float32

    [<DefaultValue>]
    [<Column("1")>]
    val mutable public PelvicTilt : float32

    [<DefaultValue>]
    [<Column("2")>]
    val mutable public LumbarLordosisAngle : float32

    [<DefaultValue>]
    [<Column("3")>]
    val mutable public SacralSlope : float32

    [<DefaultValue>]
    [<Column("4")>]
    val mutable public PelvicRadius : float32

    [<DefaultValue>]
    [<Column("5")>]
    val mutable public DegreeSpondylolisthesis : float32

    [<DefaultValue>]
    [<Column("6")>]
    val mutable public Label : string

type VertebralPrediction() =
    [<DefaultValue>]
    val mutable public PredictedLabel : string

Building the pipeline structure is reasonably intuitive. First, create a pipeline. Then, add components to the pipeline in the order they should execute. So first, load the data with a TextLoader. This data is comma delimited and has no header row.

let pipeline = new LearningPipeline()
let dataPath = "./data/column_3C.dat"
// v0.1.0
// pipeline.Add(new TextLoader<VertebralData>(dataPath, separator = ",", useHeader = false))
// New for v0.2.0
pipeline.Add((new TextLoader(dataPath)).CreateFrom<VertebralData>(separator = ',', useHeader = false))

After the data is loaded, feature columns need to be added to the pipeline. I’m going to use all feature columns from the file, but I don’t have to. The classification model requires features to be numeric. The features don’t need anything special done to them, but the class does need to be converted to numeric values. For this I use Dictionarizer.

pipeline.Add(new Dictionarizer("Label"))
pipeline.Add(new ColumnConcatenator("Features", "PelvicIncidence", "PelvicTilt", "LumbarLordosisAngle", "SacralSlope", "PelvicRadius", "DegreeSpondylolisthesis"))
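
As a rough illustration of what converting the class to numeric values means, the sketch below maps each distinct label to an integer key in plain F#. It only shows the idea; it is not ML.NET's implementation, and the keys Dictionarizer actually assigns may differ. The labels array is made-up sample input.

```fsharp
// Illustration only: assign each distinct label an integer key,
// numbered in order of first appearance.
let labels = [| "DH"; "DH"; "SL"; "NO"; "SL" |]

let keyOf =
    labels
    |> Array.distinct                        // keeps first-occurrence order
    |> Array.mapi (fun i label -> label, i)
    |> Map.ofArray

// Replace every label with its numeric key.
let encoded = labels |> Array.map (fun label -> keyOf.[label])

printfn "%A" encoded  // [|0; 0; 1; 2; 1|]
```

Keeping the map around is what later allows a numeric prediction to be turned back into the original label text.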

Now that the features are defined, it is time to determine what training method to use. For this post StochasticDualCoordinateAscentClassifier is used. Custom hyperparameters can also be defined. I have a commented-out example that changes the bias learning rate and convergence tolerance.

pipeline.Add(new StochasticDualCoordinateAscentClassifier())
// pipeline.Add(new StochasticDualCoordinateAscentClassifier(BiasLearningRate = 0.1f, ConvergenceTolerance = 0.3f))

For the dataset in question, the StochasticDualCoordinateAscentClassifier worked well, but I could’ve used a NaiveBayesClassifier as well. Since this is a multiclass problem, I had fewer options, but ML.NET seems to have a fair number of binary classifiers when that is the desired use case.

The last part is to train the model. Note the VertebralData and VertebralPrediction types as part of the Train call. I also need to define the prediction label column name. Unfortunately the function name is really long, but it is at least descriptive…

pipeline.Add(new PredictedLabelColumnOriginalValueConverter(PredictedLabelColumn = "PredictedLabel"))
let model = pipeline.Train<VertebralData, VertebralPrediction>()

Validation of any model is important. For a real case, I would train on one dataset and validate against a previously unseen dataset. Since this is just an example, I validate against the training data. As a result, I expect the results to be very good, and they are. ML.NET offers multiple Evaluator classes, based on specific needs. This makes getting some of those crucial high-level numbers pretty easy. An evaluator takes a trained model and a dataset, and produces critical metrics. One specific call-out I will make is to the TopKAccuracy. The evaluator allows an additional accuracy result based on whether the correct class was in the top k rankings for a prediction. Here I have chosen 2, which is a little silly in a 3-class problem, but in larger problems this is a valuable analysis tool. The confusion matrix takes a bit of coercing to print nicely, but at least the data is there.
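
For the hold-out approach mentioned above, a split could be prepared before training. The following is a minimal sketch in plain F#, not something the post's pipeline uses: splitRows is a hypothetical helper, and the seed and the 75/25 ratio are arbitrary choices for illustration. It shuffles first because the sample rows suggest the file may be grouped by class.

```fsharp
open System

// Hypothetical helper: shuffle rows with a fixed seed, then cut at trainFraction.
let splitRows (seed: int) (trainFraction: float) (rows: string[]) =
    let rng = Random(seed)
    // Sorting by a random key shuffles the rows.
    let shuffled = rows |> Array.sortBy (fun _ -> rng.Next())
    let cut = int (float shuffled.Length * trainFraction)
    shuffled.[.. cut - 1], shuffled.[cut ..]

// The four sample rows shown earlier stand in for the full file here.
let sampleRows =
    [| "63.03,22.55,39.61,40.48,98.67,-0.25,DH"
       "39.06,10.06,25.02,29,114.41,4.56,DH"
       "68.83,22.22,50.09,46.61,105.99,-3.53,DH"
       "69.3,24.65,44.31,44.64,101.87,11.21,DH" |]

let train, holdout = splitRows 42 0.75 sampleRows
printfn "train rows: %d, hold-out rows: %d" train.Length holdout.Length
```

With the real file, the rows could come from System.IO.File.ReadAllLines and each half could be written back out with System.IO.File.WriteAllLines, so that training and evaluation each load their own file.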

// Evaluate results
// v0.1.0
// let testData = new TextLoader<VertebralData>(dataPath, separator = ",", useHeader = false)
// New for v0.2.0
let testData = (new TextLoader(dataPath)).CreateFrom<VertebralData>(separator = ',', useHeader = false)
let evaluator = new ClassificationEvaluator()
evaluator.OutputTopKAcc <- Option.toNullable(Some(2))
let metrics = evaluator.Evaluate(model, testData)
printfn ""
printfn "Macro-Average Accuracy : %f" <| metrics.AccuracyMacro
printfn "Micro-Average Accuracy : %f" <| metrics.AccuracyMicro
printfn "Average Log-loss : %f" <| metrics.LogLoss
printfn "Relative Log-loss : %f" <| metrics.LogLossReduction
printfn "Classes : %A" <| metrics.ConfusionMatrix.ClassNames
printfn "Log-loss by class : %A" <| metrics.PerClassLogLoss
printfn "Top-k Accuracy : %f" <| metrics.TopKAccuracy

// Confusion Matrix Header (the five-space seed pads past the row-label column)
printfn "Confusion Matrix :"
[0..metrics.ConfusionMatrix.Order - 1]
|> List.map (fun a -> sprintf "%4s " (metrics.ConfusionMatrix.ClassNames.[a]))
|> List.fold (+) "     "
|> printfn "%s"

// Confusion Matrix Data
[0..metrics.ConfusionMatrix.Order - 1]
|> List.iter (fun a ->
    [0..metrics.ConfusionMatrix.Order - 1]
    |> List.map (fun b -> sprintf "%4d " (int (metrics.ConfusionMatrix.Item(a,b))))
    |> List.fold (+) (sprintf "%4s " (metrics.ConfusionMatrix.ClassNames.[a]))
    |> printfn "%s"
)
printfn ""

# Evaluator Results:
Macro-Average Accuracy : 0.817778
Micro-Average Accuracy : 0.858065
Average Log-loss : 0.313334
Relative Log-loss : 69.699239
Classes : [|"DH"; "SL"; "NO"|]
Log-loss by class : [|0.5853304274; 0.1233100547; 0.4351712598|]
Top-k Accuracy : 0.996774

Confusion Matrix :
       DH   SL   NO
  DH   40    1   19
  SL    3  142    5
  NO   13    3   84
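
As a sanity check, the two accuracy figures can be recomputed by hand from the confusion matrix. The sketch below is plain F# with the counts copied from the output above. It assumes rows are actual classes and columns are predicted classes, which fits the row totals of 60, 150, and 100.

```fsharp
// Confusion matrix copied from the evaluator output.
// Assumption: rows are actual classes, columns are predicted classes.
let classes = [| "DH"; "SL"; "NO" |]
let confusion =
    [| [| 40.0;   1.0; 19.0 |]
       [|  3.0; 142.0;  5.0 |]
       [| 13.0;   3.0; 84.0 |] |]

// Correct predictions sit on the diagonal.
let correct = classes |> Array.mapi (fun i _ -> confusion.[i].[i])
let rowTotals = confusion |> Array.map Array.sum

// Micro-average: pool every row, then divide correct by total.
let microAccuracy = Array.sum correct / Array.sum rowTotals

// Macro-average: accuracy per class first, then the unweighted mean.
let perClass = Array.map2 (/) correct rowTotals
let macroAccuracy = Array.average perClass

printfn "Micro-Average Accuracy : %f" microAccuracy  // 0.858065
printfn "Macro-Average Accuracy : %f" macroAccuracy  // 0.817778
```

The micro figure is 266 correct out of 310 rows. The macro figure is lower because the smallest class, DH, has the weakest per-class accuracy (40 of 60) and counts just as much as the larger classes.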

With the initial evaluation out of the way, here is an example of how individual predictions can be made. Create a VertebralData object and provide it to the Predict method. For this example, I pull one of those rows from the training data.

// First row of the data file: 63.03,22.55,39.61,40.48,98.67,-0.25,DH
let test1 = VertebralData()
test1.PelvicIncidence <- 63.03f
test1.PelvicTilt <- 22.55f
test1.LumbarLordosisAngle <- 39.61f
test1.SacralSlope <- 40.48f
test1.PelvicRadius <- 98.67f
test1.DegreeSpondylolisthesis <- -0.25f
let prediction = model.Predict(test1)

printfn "Predicted Category: %s" prediction.PredictedLabel
printfn "Actual Category : DH"
printfn ""

# Prediction Result:
Predicted Category: DH
Actual Category : DH

Once a model is trained, it can also be saved to a file and reloaded at a later time. This is supported by the WriteAsync and ReadAsync methods of a model.

// Save model to file; block until the write completes so the reload below reads a finished file
model.WriteAsync("test-model")
|> Async.AwaitTask
|> Async.RunSynchronously

// Load model from file and run a prediction
let modelReloaded =
    PredictionModel.ReadAsync<VertebralData, VertebralPrediction>("test-model")
    |> Async.AwaitTask
    |> Async.RunSynchronously
let predictionReloaded = modelReloaded.Predict(test1)
printfn "Predicted Category (Reloaded): %s" predictionReloaded.PredictedLabel
printfn "Actual Category (Reloaded) : DH"
printfn ""

# Prediction Result: (model reloaded):
Predicted Category (Reloaded): DH
Actual Category (Reloaded) : DH

Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.

Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Using 2 threads to train.
Automatically choosing a check frequency of 2.
Auto-tuning parameters: maxIterations = 4836.
Auto-tuning parameters: L2 = 2.668161E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 0.
Using best model from iteration 448.
Not training a calibrator because it is not needed.

Macro-Average Accuracy : 0.817778
Micro-Average Accuracy : 0.858065
Average Log-loss : 0.313334
Relative Log-loss : 69.699239
Classes : [|"DH"; "SL"; "NO"|]
Log-loss by class : [|0.5853304274; 0.1233100547; 0.4351712598|]
Top-k Accuracy : 0.996774
Confusion Matrix :
       DH   SL   NO
  DH   40    1   19
  SL    3  142    5
  NO   13    3   84

Predicted Category: DH
Actual Category : DH

Predicted Category (Reloaded): DH
Actual Category (Reloaded) : DH

This has been a brief look into training and using an ML.NET classification model. There were two interesting takeaways for me. The first is just how to interact with the framework for classification. The second, and more interesting, is how little differs between the regression and classification pipelines. I really appreciate a consistent framework where I can leverage a similar pipeline for most of my projects, and only need to swap out the appropriate logic bits for my specific current problem. ML.NET has some really good components, and it will be interesting to see how it grows, hopefully with more F#-centric support as well. Until next time.