With the release of v0.7.0, it is time to revisit K-means clustering using F# and Microsoft's new ML.NET framework. The API has changed enough to warrant a minor rework. This post is a re-examination of a previous post, F# and ML.NET Clustering. The use case will be to use examination attributes to classify mammogram results.
Note: ML.NET is still evolving, this post was written using Microsoft.ML v0.7.0.
Make sure you have .NET Core version 2.1 installed. If you don't, head over to the .NET Core Downloads page and select the SDK for your platform. Tangentially, you can also get there by going to dot.net and navigating to the downloads section.
First, create a console F# project, then add the ML.NET package.
dotnet new console --language F# --name MLNet-Mammogram
dotnet add package Microsoft.ML --version 0.7.0
Next, it is time to get the data. The source I used for this post is from UCI. The data file can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data).
mkdir data && cd data
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data
Here is a sample of what the data looks like. There is no header row. The columns represent 5 features and 1 classification column:
- BI-RADS assessment (1-5)
- Age (patient's age in years)
- Shape (mass shape: 1-4)
- Margin (mass margin: 1-5)
- Density (mass density: 1-4)
- Severity (benign=0, malignant=1)
# Data Rows
Now that the project is set up and the data is local, we can get to the code. Time to open up the already created
Program.fs. First, add the necessary namespaces.
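For reference, the opens might look something like this (a sketch assuming the v0.7.0 namespace layout, where the column attributes live under Microsoft.ML.Runtime.Api):

```fsharp
open System
open Microsoft.ML                  // MLContext and friends
open Microsoft.ML.Runtime.Api     // Column, ColumnName, DefaultValue attributes
open Microsoft.ML.Runtime.Data    // TextLoader
```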
The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; for F# we can use a type. Below are the required types:
MammogramData is the input data and
MammogramPrediction is the output prediction. For
MammogramData, this is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data: each attribute must be
mutable public, and it requires the
[<Column("#")>] attribute to specify its column position, as well as the
[<DefaultValue>] attribute. For
MammogramPrediction, a
PredictedLabel for the cluster id and a
Score for the calculated distances from all clusters are required.
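A minimal sketch of the input type, with the column positions taken from the data layout above (the exact attribute usage here is an assumption based on the v0.7.0 API):

```fsharp
type MammogramData() =
    [<Column("0")>] [<DefaultValue>]
    val mutable public BiRads : float32

    [<Column("1")>] [<DefaultValue>]
    val mutable public Age : float32

    [<Column("2")>] [<DefaultValue>]
    val mutable public Shape : float32

    [<Column("3")>] [<DefaultValue>]
    val mutable public Margin : float32

    [<Column("4")>] [<DefaultValue>]
    val mutable public Density : float32

    [<Column("5")>] [<DefaultValue>]
    val mutable public Severity : float32
```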
type MammogramPrediction() =
Here is one of the big changes from earlier versions: the pipeline object is gone, replaced with an
MLContext. Although different, it remains intuitive and gains additional functionality. First, create an MLContext; if desired, a
seed can be defined to ensure the same results between executions.
let mlContext = MLContext()
// or, with a fixed seed for reproducible runs:
// let mlContext = MLContext(seed = Nullable 1)
Time to load the data. This is another area that has been updated since earlier versions. First create a
TextReader with a file format definition, then use that object to read the data from the data file. The entire file can be used for training. Alternatively,
TrainTestSplit (another new function) can be used to easily divide a single dataset into train and test sets. This is especially handy during the development process.
let dataPath = "./data/mammographic_masses.data"
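Filled out, the load and split might be sketched as follows. The TextReader and TrainTestSplit shapes here are assumptions based on the v0.7.0 API; the column names match the types defined earlier:

```fsharp
// Define the file format: comma separated, no header, six float32 columns
let reader =
    mlContext.Data.TextReader(
        TextLoader.Arguments(
            Separator = ",",
            Column =
                [| TextLoader.Column("BiRads",   Nullable DataKind.R4, 0)
                   TextLoader.Column("Age",      Nullable DataKind.R4, 1)
                   TextLoader.Column("Shape",    Nullable DataKind.R4, 2)
                   TextLoader.Column("Margin",   Nullable DataKind.R4, 3)
                   TextLoader.Column("Density",  Nullable DataKind.R4, 4)
                   TextLoader.Column("Severity", Nullable DataKind.R4, 5) |]))

// Read the whole file, then split into train and test sets
let allData = reader.Read(dataPath)
let struct (trainData, testData) =
    mlContext.Clustering.TrainTestSplit(allData, testFraction = 0.2)
```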
After the data is loaded, feature columns need to be added to the transforms. I'm going to use all feature columns from the file and exclude severity. The clustering model requires features to be numeric, which is fine here. As the other posts show, you can convert text to numeric mappings if necessary.
let dataProcessPipeline = mlContext.Transforms.Concatenate("Features", "BiRads", "Age", "Shape", "Margin", "Density")
Now that the features are defined, it is time to create a model. This will be
KMeans. Similar to the other trainers, custom parameters can be defined; I have decided to use
K = 4. It also has other options such as
OptTol (convergence tolerance) and
NormalizeFeatures. The KMeans trainer/estimator must be combined with the training data to create a model. The last part is to create a prediction function from the model. Note the
MammogramData and MammogramPrediction types as part of the call.
let trainer = mlContext.Clustering.Trainers.KMeans(features = "Features", clustersCount = 4)
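Combining the trainer with the feature pipeline and training data, then building the typed prediction function, might look like this (MakePredictionFunction is assumed from the v0.7.0 API):

```fsharp
// Append the trainer to the feature pipeline, then fit against the training data
let model = dataProcessPipeline.Append(trainer).Fit(trainData)

// Strongly typed prediction function: MammogramData in, MammogramPrediction out
let predictor =
    model.MakePredictionFunction<MammogramData, MammogramPrediction>(mlContext)
```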
Validation of any model is important. With the data split into train and test sets, it is easy to get metrics against the training data and then validate against the previously unseen test data.
// Evaluate results (train)
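A sketch of evaluating both sets, assuming the v0.7.0 Clustering.Evaluate signature and its ClusteringMetrics result:

```fsharp
// Evaluate results against the training data
let trainMetrics =
    mlContext.Clustering.Evaluate(model.Transform(trainData), score = "Score", features = "Features")
printfn "Train avg min distance: %f" trainMetrics.AvgMinScore

// Evaluate against the previously unseen test data
let testMetrics =
    mlContext.Clustering.Evaluate(model.Transform(testData), score = "Score", features = "Features")
printfn "Test avg min distance: %f" testMetrics.AvgMinScore
```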
With the initial evaluation out of the way, it is time to move on to individual predictions. I want to create aggregate classification percentages for each cluster. To do this I take the predictive model and apply it against the training file. Using the predicted cluster and the training label, I create a mapping for detailed predictions. Each cluster gets its own raw benign/malignant count, which can be converted into a percentage likelihood for each classification. I have the details annotated in comments to make them easier to follow. Honestly, this is the most labor-intensive part of the process. I'd love to be able to pass a cluster-aggregate-score function in as part of the trainer to eliminate this work of reprocessing the data. Once I have these results as a
Map, I can query results easily enough.
// Create classifications by cluster
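The aggregation might be sketched like this. Here `trainRows` (a `seq<MammogramData>` of the training file) and the `PredictedClusterId` field name are assumptions carried over from the earlier type sketch:

```fsharp
// Create classifications by cluster: for each training row, pair the
// predicted cluster id with the known severity label, then aggregate
// benign/malignant counts per cluster into percentages.
let clusterIdToPrediction =
    trainRows                        // seq<MammogramData>, assumed loaded earlier
    |> Seq.map (fun row -> predictor.Predict(row).PredictedClusterId, row.Severity)
    |> Seq.groupBy fst
    |> Seq.map (fun (clusterId, pairs) ->
        let labels = pairs |> Seq.map snd |> Seq.toList
        let total = float labels.Length
        let malignant =
            labels |> List.filter (fun s -> s = 1.0f) |> List.length |> float
        let malignantPct = malignant / total
        clusterId, (1.0 - malignantPct, malignantPct))   // (benign %, malignant %)
    |> Map.ofSeq
```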
Now that the
clusterIdToPrediction map is defined, I can pair the ML.NET cluster prediction with the aggregated cluster classification percentages. First, create a
MammogramData object and provide it to the
Predict method. Second, use the predicted clusterId with the aggregated cluster classification percentages to get a classification result. For this example, I pull one of those rows from the training data.
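Putting that together, a single prediction might look like the sketch below (the field values are illustrative, and the field names again follow the earlier type sketch):

```fsharp
// One row pulled from the training data (values here are illustrative)
let sample = MammogramData()
sample.BiRads  <- 5.0f
sample.Age     <- 67.0f
sample.Shape   <- 3.0f
sample.Margin  <- 5.0f
sample.Density <- 3.0f

let prediction = predictor.Predict(sample)
let benignPct, malignantPct = clusterIdToPrediction.[prediction.PredictedClusterId]
printfn "Cluster %d: benign %.0f%%, malignant %.0f%%"
    prediction.PredictedClusterId (benignPct * 100.0) (malignantPct * 100.0)
```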
The results show the prediction falls into cluster 1, which has an 80% likelihood of being malignant; this matches the actual value.
# Prediction Result:
Once a model has been created, it is often useful to save it for later use. The save method has changed from previous versions. Once saved, the model can be loaded for future use.
// Save model to file
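A sketch of the round trip, assuming the v0.7.0 SaveTo/LoadFrom shape on the transformer chain (and `open System.IO`):

```fsharp
// Save model to file
let modelPath = "./model.zip"
do
    use fs = File.Create(modelPath)
    model.SaveTo(mlContext, fs)

// Reload the model and rebuild the prediction function
let reloadedModel =
    use fs = File.OpenRead(modelPath)
    TransformerChain.LoadFrom(mlContext, fs)
let reloadedPredictor =
    reloadedModel.MakePredictionFunction<MammogramData, MammogramPrediction>(mlContext)
```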
As expected, the prediction results are the same with the reloaded model.
# Prediction Result: (model reloaded):
Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.
This has been a brief look into training and using an ML.NET k-means cluster model. As seen with the other models, ML.NET is providing a nice consistent interface and has some good components. It is a framework that continues to grow in a positive direction. Kudos and thanks to all the people making this a reality. That’s all for now. Until next time.