2018-07-07

F# and ML.NET Clustering

Read Time: 13 minutes

The discovering ML.NET series continues. With the release of v0.3.0, it is time to look at performing K-means clustering using F# and Microsoft’s new ML.NET framework. The use case will be to use examination attributes to classify mammogram results.

NOTE: Due to ML.NET changes, this post is superseded by the post Clustering-V2.

As I mentioned in the previous posts, there is a disclaimer: ML.NET is in its early stages. I found a couple of interface idiosyncrasies that I suspect will change over time. Just keep that in mind. I am happy with what I have seen so far, and I’m excited to see it grow and mature.

Note: The post was written using Microsoft.ML v0.3.0.

Make sure you have .NET Core version 2.1 installed. If you don’t, head out to the .NET Core Downloads page and select the SDK for your platform. Tangential, but you can also get there by going to dot.net, then navigating to Downloads and .NET Core.

First, create a console F# project, then add the ML.NET package.

dotnet new console --language F# --name MLNet-Mammogram
cd MLNet-Mammogram
dotnet add package Microsoft.ML

Next, it is time to get the data. The source I used for this post is from UCI. The data file can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data

mkdir data && cd data
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data

Here is a sample of what the data looks like. There is no header row. The columns represent 5 features and 1 classification column:

BI-RADS assessment (1-5)
Age (Patient’s age)
Shape (mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal))
Margin (mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal))
Density (mass density: high=1 iso=2 low=3 fat-containing=4 (ordinal))
Severity (benign=0 or malignant=1)

# Data Rows
5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,57,1,5,3,1
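
Note the ? in the second row: that is how the file marks a missing value. As a minimal sketch in plain F# (no ML.NET involved), a row could be parsed with missing fields made explicit. The names parseRow and isComplete, and the choice of option values, are mine for illustration only.

```fsharp
open System
open System.Globalization

/// Parse one comma-delimited row; a "?" field becomes None.
let parseRow (line: string) : float32 option [] =
    line.Split(',')
    |> Array.map (fun field ->
        match field.Trim() with
        | "?" -> None
        | value -> Some (Single.Parse(value, CultureInfo.InvariantCulture)))

/// A row is complete when every field has a value.
let isComplete (row: float32 option []) =
    row |> Array.forall Option.isSome

// Two of the sample rows shown above: one complete, one with a missing Density.
let rows = [| "5,67,3,5,3,1"; "4,43,1,1,?,1" |] |> Array.map parseRow
rows |> Array.iter (fun row -> printfn "%A complete=%b" row (isComplete row))
```

Keeping the missing value as None, rather than substituting a number, leaves the decision of dropping or imputing the row to a later step.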

Now that the project is set up and the data is local, we can get to the code. Time to open up the already created Program.fs. First, add the necessary namespaces.

open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Trainers
open Microsoft.ML.Transforms
open Microsoft.ML.Models
open Microsoft.ML.Data

The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; in F# we can use a type. Below are the required types: MammogramData is the input data and MammogramPrediction is the output prediction. MammogramData is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data. Each member must be mutable and public, and it requires both the [<Column("#")>] attribute, to specify its column position, and the [<DefaultValue>] attribute. MammogramPrediction requires PredictedLabel for the cluster id and Score for the calculated distances to all clusters.

type MammogramData() =
    [<Column("0")>]
    [<DefaultValue>]
    val mutable public BiRads:float32

    [<Column("1")>]
    [<DefaultValue>]
    val mutable public Age:float32

    [<Column("2")>]
    [<DefaultValue>]
    val mutable public Shape:float32

    [<Column("3")>]
    [<DefaultValue>]
    val mutable public Margin:float32

    [<Column("4")>]
    [<DefaultValue>]
    val mutable public Density:float32

    [<Column("5")>]
    [<DefaultValue>]
    val mutable public Label:float32

type MammogramPrediction() =
    [<DefaultValue>]
    [<ColumnName("PredictedLabel")>]
    val mutable public SelectedClusterId:uint32

    [<DefaultValue>]
    [<ColumnName("Score")>]
    val mutable public Distance: float32[]
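
For readers less familiar with this style of F# type, here is a small sketch of how [<DefaultValue>] val mutable fields behave. The ML.NET attributes are left out so the snippet runs on its own, and the type name Sample is hypothetical. It shows the two initialization styles used later in this post.

```fsharp
/// Same shape as the input type above, minus the ML.NET attributes.
type Sample() =
    [<DefaultValue>]
    val mutable public Age: float32

    [<DefaultValue>]
    val mutable public Shape: float32

// Fields start at their zero value.
let empty = Sample()

// Style 1: assign after construction.
let a = Sample()
a.Age <- 67.f
a.Shape <- 3.f

// Style 2: set the fields in the constructor call.
let b = Sample(Age = 67.f, Shape = 3.f)

printfn "%f %f %f" empty.Age a.Age b.Shape
```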

As in the other examples, building the pipeline structure is intuitive. First, create a pipeline. Then, add components to the pipeline in the order they should be executed. So first, load the data with a TextLoader. This data is comma delimited and has no header row.

let pipeline = new LearningPipeline()
let dataPath = "./data/mammographic_masses.data"
pipeline.Add((new TextLoader(dataPath)).CreateFrom<MammogramData>(separator = ',', useHeader = false))

After the data is loaded, feature columns need to be added to the pipeline. I’m going to use all feature columns from the file, and exclude severity. The clustering model requires features to be numeric, which is fine here. As the other posts show, you can convert text to numeric mappings if necessary.

pipeline.Add(new ColumnConcatenator("Features", "BiRads", "Age", "Shape", "Margin", "Density"))
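
This file already stores its nominal attributes as numbers. If a source used the text names instead, a lookup table would be one simple way to produce the codes. The sketch below uses the Shape codes from the attribute list earlier in the post; shapeCodes and encodeShape are hypothetical helpers in plain F#, not an ML.NET transform.

```fsharp
/// Shape codes as listed for the dataset: round=1 oval=2 lobular=3 irregular=4.
let shapeCodes =
    Map.ofList [ "round", 1.f; "oval", 2.f; "lobular", 3.f; "irregular", 4.f ]

/// Look up the numeric code for a shape name; unknown names yield None.
let encodeShape (name: string) : float32 option =
    shapeCodes |> Map.tryFind (name.Trim().ToLowerInvariant())

printfn "%A" (encodeShape "Lobular")
printfn "%A" (encodeShape "unknown")
```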

Now that the features are defined, it is time to define the training method. This will be KMeansPlusPlusClusterer. Similar to the other trainers, custom parameters can be defined; I have decided to use K = 4. It also has other options such as MaxIterations, OptTol (convergence tolerance), and NormalizeFeatures.

pipeline.Add(new KMeansPlusPlusClusterer(K = 4))
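
For intuition about what K controls, here is a minimal sketch of one assign-and-update step of the classic k-means (Lloyd's) iteration in plain F#. It is not ML.NET's implementation, which also handles k-means++ initialization and feature normalization. The points, starting centroids, and function names are made up for illustration.

```fsharp
/// Squared Euclidean distance between two points of equal dimension.
let squaredDistance (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b

/// Index of the centroid closest to the point.
let nearest (centroids: float[][]) (point: float[]) =
    centroids
    |> Array.mapi (fun i c -> i, squaredDistance c point)
    |> Array.minBy snd
    |> fst

/// One step: assign each point to its nearest centroid, then move each
/// centroid to the mean of its points (an empty cluster keeps its centroid).
let step (points: float[][]) (centroids: float[][]) =
    let assigned = points |> Array.groupBy (nearest centroids) |> Map.ofArray
    centroids
    |> Array.mapi (fun i old ->
        match Map.tryFind i assigned with
        | None -> old
        | Some members ->
            Array.init old.Length (fun d ->
                members |> Array.averageBy (fun p -> p.[d])))

// Hypothetical 2-D points forming two obvious groups, with K = 2.
let points = [| [| 1.0; 1.0 |]; [| 1.0; 2.0 |]; [| 8.0; 8.0 |]; [| 9.0; 8.0 |] |]
let start = [| [| 0.0; 0.0 |]; [| 10.0; 10.0 |] |]

let updated = step points start
printfn "%A" updated
```

Repeating step until the centroids stop moving is the usual stopping rule, which is presumably what an iteration cap and a convergence tolerance bound in practice.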

The last step is to train the model. Note the MammogramData and MammogramPrediction types as part of the Train call.

let model = pipeline.Train<MammogramData, MammogramPrediction>()

Validation of any model is important. For a real case, I would train on one dataset and validate against a previously unseen dataset. Since this is just an example, I validate against the training data. As a result, I expect the metrics to look better than they would on unseen data. ML.NET offers multiple Evaluator classes, based on specific needs. For this, the obvious choice is ClusterEvaluator; it takes a trained model and a dataset, and produces the key metrics.

// Evaluate results
let testData = (new TextLoader(dataPath)).CreateFrom<MammogramData>(separator = ',', useHeader = false)
let evaluator = new ClusterEvaluator()
let metrics = evaluator.Evaluate(model, testData)
printfn ""
printfn "Avg Min Score: %f" <| metrics.AvgMinScore
// Davies-Bouldin Index
printfn "DBI          : %A" <| metrics.Dbi
// Normalized Mutual Information
printfn "NMI          : %A" <| metrics.Nmi
printfn ""

Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Initializing centroids
Centroids initialized, starting main trainer
Model trained successfully on 829 instances
Not training a calibrator because it is not needed.

Avg Min Score: 0.049841
DBI          : 0.0
NMI          : 0.3012495931
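
To validate against held-out data instead, the file could be split before training. This is a minimal sketch in plain F#, assuming a fixed random seed and an 80/20 split, both arbitrary choices of mine. splitLines is a hypothetical helper, and the two resulting arrays would still need to be written to separate files before a TextLoader could read them.

```fsharp
open System

/// Shuffle the lines with a seeded Random, then split them by the given fraction.
let splitLines (seed: int) (trainFraction: float) (lines: string[]) =
    let rng = Random(seed)
    let shuffled = lines |> Array.sortBy (fun _ -> rng.Next())
    let trainCount = int (float shuffled.Length * trainFraction)
    Array.take trainCount shuffled, Array.skip trainCount shuffled

// Stand-in lines; in this post they would come from reading the data file.
let lines = Array.init 10 (fun i -> sprintf "row-%d" i)
let train, test = splitLines 42 0.8 lines
printfn "train=%d test=%d" train.Length test.Length
```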

With the initial evaluation out of the way, it is time to move on to individual predictions. I want to create aggregate classification percentages for each cluster. To do this I take the predictive model and apply it against the training file. Using the predicted cluster and the training label, I create a mapping for detailed predictions. Each cluster gets its own raw benign/malignant count, which can be converted into a percentage likelihood for each classification. I have annotated the details in comments to make it easier to follow. Honestly, this is the most labor-intensive part of the process. I’d love to be able to pass a cluster-aggregate-score function in as part of the trainer to eliminate this work of reprocessing the data. Once I have these results as a Map, I can query them easily enough.

// Create classifications by cluster
let clusterClassification =
  // Read file
  System.IO.File.ReadAllLines(dataPath)
  // Filter incomplete rows
  |> Array.filter (fun line -> not (line.Contains("?")))
  // Run predictions
  |> Array.map (fun line ->
    // Convert line to float32 array
    let row = line.Split(',') |> Array.map float32
    // Predict the clusterId of the row
    let predictedCluster =
      model.Predict(
        MammogramData(
          BiRads = row.[0],
          Age = row.[1],
          Shape = row.[2],
          Margin = row.[3],
          Density = row.[4]))
    // Populate benign/malignant counter for cluster (0=benign, 1=malignant)
    if int row.[5] = 0
    then (predictedCluster.SelectedClusterId, [| 1; 0 |])
    else (predictedCluster.SelectedClusterId, [| 0; 1 |]))
  // Group by ClusterId
  |> Array.groupBy (fun (clusterId, _) -> clusterId)
  // Sum each cluster's classification counts
  |> Array.map (fun (clusterId, data) ->
    let countSums =
      data
      |> Array.map (fun (_, z) -> z)
      |> Array.fold (fun (a:int []) (x:int []) ->
          [| a.[0] + x.[0]; a.[1] + x.[1] |]) [| 0; 0 |]
    (clusterId, countSums))
  |> Map.ofArray

/// Provide a prediction based on cluster id
let clusterIdToPrediction (clusterClassification:Map<uint32, int[]>) (clusterId:uint32) =
  let classifications = clusterClassification.Item clusterId

  let total = classifications |> Array.sum |> float
  let benignPct = float classifications.[0] / total
  let malignantPct = float classifications.[1] / total

  sprintf "Benign: %0.2f Malignant: %0.2f (%d, %d)" 
    benignPct 
    malignantPct 
    classifications.[0] 
    classifications.[1]
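
To print a summary line for every cluster, the map can be iterated directly. The sketch below is self-contained, so it hard-codes the per-cluster (benign, malignant) counts that appear in this post's full program output and repeats the percentage arithmetic inline. The names counts and describe are mine.

```fsharp
// Counts per cluster id as [| benign; malignant |], copied from the run shown in this post.
let counts : Map<uint32, int[]> =
    Map.ofList [ 1u, [| 83; 236 |]; 2u, [| 41; 29 |]; 3u, [| 19; 99 |]; 4u, [| 284; 39 |] ]

/// Same arithmetic as clusterIdToPrediction: counts become fractions of the cluster total.
let describe (classifications: int[]) =
    let total = classifications |> Array.sum |> float
    sprintf "Benign: %0.2f Malignant: %0.2f (%d, %d)"
        (float classifications.[0] / total)
        (float classifications.[1] / total)
        classifications.[0]
        classifications.[1]

// One line per cluster, in key order.
counts
|> Map.iter (fun clusterId c -> printfn "ClusterId %A => %s" clusterId (describe c))
```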

Now that clusterIdToPrediction is defined, I can pair the ML.NET cluster prediction with the aggregated cluster classification percentages. First, create a MammogramData object and provide it to the Predict method. Second, use the predicted clusterId with the aggregated cluster classification percentages to get a classification result. For this example, I use one of the rows from the training data.

let test1 = MammogramData()
test1.BiRads <- 5.f
test1.Age <- 67.f
test1.Shape <- 3.f
test1.Margin <- 5.f
test1.Density <- 3.f
// Actual: 1

let predictionTest1 = model.Predict(test1)
printfn "Predicted ClusterId: %d" predictionTest1.SelectedClusterId
printfn "Predicted Distances: %A" predictionTest1.Distance
printfn "Predicted Result: %s" (clusterIdToPrediction clusterClassification predictionTest1.SelectedClusterId)
printfn "Actual Result   : 1 (Malignant)"
printfn ""

The results show the prediction falls into cluster 3, which has an 84% likelihood of being malignant. That matches the actual value.

# Prediction Result:
Predicted ClusterId: 3
Predicted Distances: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result   : 1 (Malignant)
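
In this output the selected cluster lines up with the smallest entry in the distance array, counting clusters from 1. Here is a minimal sketch of that relationship using the distances printed above. That ML.NET always numbers clusters this way is my reading of this one result, not something I have confirmed, and closestCluster is a hypothetical helper.

```fsharp
// Distances to the four centroids, copied from the prediction output.
let distances = [| 0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f |]

/// 1-based position of the smallest distance.
let closestCluster (scores: float32[]) =
    let index =
        scores
        |> Array.mapi (fun i d -> i, d)
        |> Array.minBy snd
        |> fst
    uint32 (index + 1)

printfn "Closest cluster: %d" (closestCluster distances)
```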

Like the other models before it, the cluster model can be saved to a file and reloaded later. This is supported by the WriteAsync and ReadAsync methods of a model.

// Save model to file
model.WriteAsync("test-model")
|> Async.AwaitTask
|> Async.RunSynchronously

// Load model from file and run a prediction
let modelReloaded =
  PredictionModel.ReadAsync<MammogramData, MammogramPrediction>("test-model") 
  |> Async.AwaitTask 
  |> Async.RunSynchronously
let predictionReloaded = modelReloaded.Predict(test1)
printfn "Predicted ClusterId RL: %d" predictionReloaded.SelectedClusterId
printfn "Predicted Distances RL: %A" predictionReloaded.Distance
printfn "Predicted Result RL: %s" (clusterIdToPrediction clusterClassification predictionReloaded.SelectedClusterId)
printfn "Actual Result RL   : 1 (Malignant)"
printfn ""

As expected, the prediction results are the same with the reloaded model.

# Prediction Result (model reloaded):
Predicted ClusterId RL: 3
Predicted Distances RL: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result RL: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result RL   : 1 (Malignant)
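
One point about the save step: a .NET Task is only guaranteed to have finished once it has been awaited, so it is worth waiting for a write to complete before reading the file back. The sketch below uses File.WriteAllTextAsync and File.ReadAllTextAsync as stand-ins for the model write and read. The temp file name and its contents are placeholders.

```fsharp
open System.IO

// Placeholder file standing in for the saved model.
let path = Path.Combine(Path.GetTempPath(), "await-sketch.txt")

// RunSynchronously blocks until the write task has finished.
File.WriteAllTextAsync(path, "saved")
|> Async.AwaitTask
|> Async.RunSynchronously

// Reading is safe now because the write has completed.
let contents =
    File.ReadAllTextAsync(path)
    |> Async.AwaitTask
    |> Async.RunSynchronously

printfn "%s" contents
```

Piping the awaited task into ignore instead would discard the async value without waiting on it.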

Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.

Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Initializing centroids
Centroids initialized, starting main trainer
Model trained successfully on 829 instances
Not training a calibrator because it is not needed.

Avg Min Score: 0.049841
DBI          : 0.0
NMI          : 0.3012495931

ClusterId 1u => Benign: 0.26 Malignant: 0.74 (83, 236)
ClusterId 2u => Benign: 0.59 Malignant: 0.41 (41, 29)
ClusterId 3u => Benign: 0.16 Malignant: 0.84 (19, 99)
ClusterId 4u => Benign: 0.88 Malignant: 0.12 (284, 39)

Predicted ClusterId: 3
Predicted Distances: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result   : 1 (Malignant)

Predicted ClusterId RL: 3
Predicted Distances RL: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result RL: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result RL   : 1 (Malignant)

This has been a brief look into training and using an ML.NET k-means cluster model. As seen with the other models, ML.NET is providing a nice consistent interface and has some good components. It is a framework that continues to grow in a positive direction. Kudos and thanks to all the people making this a reality. That’s all for now. Until next time.