Tackling Kaggle with F# and XGBoost

Read Time: 16 minutes

Today’s topic is tackling a Kaggle problem with XGBoost and F#. Comparing Quora question intent offers a perfect opportunity to work with XGBoost, a common tool in Kaggle competitions. Luckily, there is a .NET wrapper around the XGBoost library, XGBoost.Net.

Before going too far, let’s break down the data formats. First, Kaggle provides train.csv, which is used for training models. It contains question pairs and the ground truth of whether they are duplicates. Second, test.csv contains question pairs with no ground truth; it is used to generate the submission file for Kaggle. Third, submission.csv contains the results to submit to Kaggle for judging, where is_duplicate represents the predicted probability that the pair is a duplicate. Below are example rows from each dataset.

// train.csv
"id","qid1","qid2","question1","question2","is_duplicate"
"0","1","2","What is the step by step guide to invest in share market in india?","What is the step by step guide to inves
t in share market?","0"
"1","3","4","What is the story of Kohinoor (Koh-i-Noor) Diamond?","What would happen if the Indian government stole the K
ohinoor (Koh-i-Noor) diamond back?","0"

// test.csv
"test_id","question1","question2"
0,"How does the Surface Pro himself 4 compare with iPad Pro?","Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?"
1,"Should I have a hair transplant at age 24? How much would it cost?","How much cost does hair transplant require?"

// submission.csv
test_id,is_duplicate
0,0.425764
1,0.212075

Now that the data is out of the way, it’s time to get started. Using Paket, here is a sample paket.dependencies file.

source https://nuget.org/api/v2

nuget FSharp.Data
nuget PicNet.XGBoost
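With the dependencies declared, running `paket install` from the project root pulls the packages into the `packages` folder that the `#r` directives below reference (assuming Paket’s default packages directory).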

Here is the boilerplate and initial variables. Most of this is self-explanatory, although I want to call out a couple of things specifically. As expected, type providers will be used to load the CSV datasets. When I get to the model training section, there will be hyperparameters; these will be represented by ModelParameterType and ModelParameter. Feature extraction will use dataset-level metadata. Since this is meant to be a simple example, the only metadata will be the average number of words in a question. As shown above, the train and test files have slightly different formats. Whatever method I use, I want to be able to run the same code against both train and test. StandardRow enables this by standardizing the input row format for transformation.

System.IO.Directory.SetCurrentDirectory(__SOURCE_DIRECTORY__)
#r "../packages/FSharp.Data/lib/net40/FSharp.Data.dll"
#r "../packages/PicNet.XGBoost/lib/net40/XGBoost.dll"

open System
open System.IO
open FSharp.Data
open XGBoost

/// Percent of training dataset to use for training
/// Note: ValidationPct = 1. - TrainPct
[<Literal>]
let TrainPct = 0.8

/// Training filename
[<Literal>]
let TrainFilename = "../data/train.csv"

/// Kaggle test filename (used to generate submission)
[<Literal>]
let TestFilename = "../data/test.csv"

/// Kaggle submission filename
[<Literal>]
let SubmissionFilename = "../data/submission.csv"

/// Type of hyperparameter value
type ModelParameterType = | Int | Float32
/// Model hyperparameter
type ModelParameter = { Name: string; Type: ModelParameterType; Value: float }
/// Dataset Metadata (Used for feature calculation)
type Metadata = { AverageWordCount: float32 }
/// Standardized row
type StandardRow = { QuestionId: int; Label: float32; Features: float32[] }

/// Training dataset
type TrainData = CsvProvider<TrainFilename>
/// Test/Submission dataset
type TestData = CsvProvider<TestFilename>

To ensure proper model training, the provided train.csv will be broken into a train and a validation set. This method could be more advanced, but taking the first x% for training and the remaining (100-x)% for validation works well enough in this case. Since the train and test files differ, a conversion function is also needed.

/// Sample dataset into train and validation datasets
let sample (input:CsvProvider<TrainFilename>) trainPct =
    let trainRows = int (float (input.Rows |> Seq.length) * trainPct)
    let trainData = input.Rows |> Seq.take trainRows |> Seq.toArray
    let validationData = input.Rows |> Seq.skip trainRows |> Seq.toArray
    (trainData, validationData)

/// Convert the test data format to the train data format
/// Note: This is necessary because the train and test datasets differ slightly
let convertTestToTrainFormat (input:CsvProvider<TestFilename>.Row []) : CsvProvider<TrainFilename>.Row [] =
    input
    |> Array.map (fun x -> new CsvProvider<TrainFilename>.Row(x.Test_id, 0, 0, x.Question1, x.Question2, false))

Here are the feature-generating and supporting functions. For pedagogical reasons the feature set is going to be overly simplistic. This won’t produce a great prediction result, since proper feature creation can be quite involved. More advanced feature extraction will be addressed in a later post. For now, this will be enough to get some results without losing the primary goal in a forest of feature extraction code.

Some features may need aggregate information about the dataset. This is commonly used for scaling or for comparison against averages. It will be stored in a dataset metadata object that all rows have access to during row transformation and feature extraction. The row-specific features are the length and word count of the two questions being compared. In addition, the difference in word count between the questions is considered.

/// Number of words in sentence
let wordCount (s:string) = Array.length (s.Split([| ' ' |]))

/// Absolute value
let abs (x:int) = Math.Abs(x)

/// Calculate dataset metadata for feature calculation
let metadata (input:CsvProvider<TrainFilename>.Row []) =
    let averageWordCount =
        input
        |> Array.collect (fun row -> [|
            Array.length (row.Question1.Split([| ' ' |]));
            Array.length (row.Question2.Split([| ' ' |])) |])
        |> Array.sum
        |> (fun total -> float32 total / float32 (input.Length * 2))

    { Metadata.AverageWordCount = averageWordCount }

/// Calculate features for a row
let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
    [|
        float32 input.Question1.Length;
        float32 input.Question2.Length;
        (wordCount >> float32) input.Question1;
        (wordCount >> float32) input.Question2;
        (abs >> float32) (wordCount input.Question1 - wordCount input.Question2);
    |]

/// Transform csv row into label + features
let transform (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row []) =
    input
    |> Array.map (fun row ->
        {
            StandardRow.QuestionId = row.Id;
            Label = if row.Is_duplicate then float32 1. else float32 0.;
            Features = rowFeatures metadata row
        }
    )
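As a quick sanity check, here is what the feature vector looks like for a hand-built row. The questions and the metadata value are made up for illustration (rowFeatures does not use the metadata yet):

// Hypothetical row: q1 has 18 chars / 5 words, q2 has 33 chars / 8 words
let exampleRow =
    new CsvProvider<TrainFilename>.Row(0, 1, 2,
        "How do I learn F#?", "What is the best way to learn F#?", false)

rowFeatures { AverageWordCount = 11.f } exampleRow
// -> [| 18.f; 33.f; 5.f; 8.f; 3.f |]
//    q1 length; q2 length; q1 words; q2 words; word count difference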

Now it is time to look at the XGBoost functionality. Generating a model is as simple as creating a classifier, applying a hyperparameter set, and then running .Fit with the training data (features and labels). One small mention: as can be seen, the library uses float32[] for most of its numeric interactions.

Once the model is trained, it can be applied using PredictProba against an array of features (matching the structure of the training data). The result is an array of probabilities per class. Since this is binary classification, [0.34, 0.66] means there is a 34% chance the result is false and a 66% chance the result is true. For the final submission a probability is desired, but for training it is useful to know the binary true/false duplicate status.

/// Given training data and hyperparameters, create an xgboost classification model
let buildXgClassModel (trainInput:float32[][]) (trainOutput:float32[]) (parameters:ModelParameter list) =
    let model = XGBClassifier()

    // To handle xgboost types, I carry along the type with parameter values,
    // and cast accordingly when I set the values
    parameters
    |> List.iter (fun parameter ->
        match parameter.Type with
        | Int -> model.SetParameter(parameter.Name, (int parameter.Value))
        | Float32 -> model.SetParameter(parameter.Name, (float32 parameter.Value)))

    model.Fit(trainInput, trainOutput)
    model

let predictionProbabilities (model:XGBClassifier) (inputs:float32[][]) =
    // Note, provides prob for each class (ex: 0=0.67, 1=0.33)
    model.PredictProba(inputs)

let predictionValues (model:XGBClassifier) (inputs:float32[][]) =
    // Note, provides prob for each class (ex: 0=0.67, 1=0.33)
    // Higher probability is the class that "wins"
    predictionProbabilities model inputs
    |> Array.map (fun x -> if x.[0] > x.[1] then 0 else 1)
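As a quick illustration of how a probability pair maps to a class value (with made-up numbers):

// One row's hypothetical probability pair: 34% class 0, 66% class 1
let exampleProbabilities = [| [| 0.34f; 0.66f |] |]

exampleProbabilities
|> Array.map (fun x -> if x.[0] > x.[1] then 0 else 1)
// -> [| 1 |], i.e. the pair is predicted to be a duplicate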

To facilitate debugging and improvement, a confusion matrix is very useful. This, along with overall accuracy reporting, will assist in future development iterations.

/// Compares target vs. predicted values
let comparePredictions (target:float32[]) predicted =
    (target, predicted)
    ||> Array.zip
    |> Array.map (fun (t, p) -> ((int t) - p) = 0)


/// Create confusion matrix of results (represented as an array of arrays)
/// Rows are the predicted class, columns are the target class:
/// [
///   P=1,T=1  P=1,T=0
///   P=0,T=1  P=0,T=0
/// ]
let createConfusionMatrix (target:int[]) (predict:int[]) =
    let combined = (target, predict) ||> Array.zip

    // Count the (target=1, target=0) pairs among rows predicted as `filter`
    let aggregateRow combined filter =
        combined
        |> Array.filter (fun (_,p) -> p=filter)
        |> Array.map (fun (t,_) -> ((if t=1 then 1 else 0), (if t=0 then 1 else 0)))
        |> Array.fold (fun (a,b) (x,y) -> (a+x, b+y)) (0,0)

    let pTrue = aggregateRow combined 1
    let pFalse = aggregateRow combined 0

    [|
        [| fst pTrue; snd pTrue |];
        [| fst pFalse; snd pFalse |]
    |]


/// Print confusion matrix (rows = predicted, columns = target)
let printConfusionMatrix targetValues predictedValues =
    createConfusionMatrix targetValues predictedValues
    |> (fun m ->
        printfn "P\\T %6s %6s" "T" "F"
        printfn "T   %6d %6d" (m.[0].[0]) (m.[0].[1])
        printfn "F   %6d %6d" (m.[1].[0]) (m.[1].[1]))


/// Use a model to create predictions from input values,
/// then compare target output to predicted output
let evaluatePredictionResults model input targetOutput =
    let predictedValidationValues = predictionValues model input
    let predictedValidationMatches = comparePredictions targetOutput predictedValidationValues
    let pctValidationMatches =
        float (predictedValidationMatches |> Array.filter id |> Array.length)
        / float (predictedValidationMatches |> Array.length)

    printfn "Accuracy: %f" pctValidationMatches
    printConfusionMatrix (targetOutput |> Array.map int) predictedValidationValues
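To make the matrix orientation concrete, here is a tiny worked example with made-up target and prediction values:

// Five toy rows: three predicted duplicates, two predicted non-duplicates
let toyTargets = [| 1; 1; 0; 0; 1 |]
let toyPredictions = [| 1; 0; 0; 1; 1 |]

createConfusionMatrix toyTargets toyPredictions
// -> [| [| 2; 1 |];    // P=1: 2 were truly duplicates, 1 was not
//       [| 1; 1 |] |]  // P=0: 1 was truly a duplicate, 1 was not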

Since the submission file has specific criteria, there are a few functions to create it. This is primarily formatting the probabilities as Kaggle expects and then writing the dataset to a file.

/// Convert probabilities per classification to a single probability
/// Note: if class 0 "wins", invert its percent, since the final result expects low percents to map to class 0.
let convertPredictionToProbability (probabilities: float32[]) =
    if probabilities.[0] > probabilities.[1]
    then 1.f - probabilities.[0]
    else probabilities.[1]


/// Combine question ids with prediction results
let formatSubmissionData (rows:StandardRow[]) (predictions:float32[][]) =
    (rows, predictions)
    ||> Array.zip
    |> Array.map (fun (input, prediction) ->
        let questionId = input.QuestionId
        let probability = convertPredictionToProbability prediction
        (questionId, probability))

/// Write submission data to file
let writeSubmissionFile (submissionFilename:string) (submissionData: (int * float32)[]) =
    use fileStream = new StreamWriter(submissionFilename)
    fileStream.WriteLine("test_id,is_duplicate")
    submissionData
    |> Array.iter (fun (id, probability) ->
        let line = sprintf "%d,%f" id probability
        fileStream.WriteLine(line))
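A couple of quick examples of the probability conversion, with made-up probability pairs:

convertPredictionToProbability [| 0.67f; 0.33f |]  // -> 0.33f (class 0 wins, report 1.0 - 0.67)
convertPredictionToProbability [| 0.34f; 0.66f |]  // -> 0.66f (class 1 wins, report its probability)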

Now that all the hard work is done, it is time to put it all together. The first step is data preparation. First, load the training data and split it into train and validation sets. Second, build dataset-level metadata. Third, run transformations (feature creation) against the datasets. Fourth, structure the data for model training by generating the appropriate label and feature arrays.

/// Training data
let allData = TrainData.Load(TrainFilename)
let (trainData, validationData) = sample allData TrainPct

let trainMetadata = metadata trainData
let transformedTrainData = transform trainMetadata trainData
let transformedValidationData = transform trainMetadata validationData

let trainInput = transformedTrainData |> Array.map (fun row -> row.Features)
let trainOutput = transformedTrainData |> Array.map (fun row -> row.Label)

let validationInput = transformedValidationData |> Array.map (fun row -> row.Features)
let validationOutput = transformedValidationData |> Array.map (fun row -> row.Label)

Time to train the model. XGBoost supports many hyperparameters; the selection below is populated with some reasonable values for the dataset in question. Hyperparameter optimization should be leveraged here to find the best model, but that is out of scope for this post; in a later post I’ll discuss a simple method to approach this topic (a small taste appears after the results below).

Once trained, report on prediction capability against the original training set as well as the validation set (which the model hasn’t seen).

/// Model training parameters
let modelParameters = [
    { Name = "max_depth"; Type = ModelParameterType.Int; Value = 10. };
    { Name = "learning_rate"; Type = ModelParameterType.Float32; Value = 0.76 };
    { Name = "gamma"; Type = ModelParameterType.Float32; Value = 1.9 };
    { Name = "min_child_weight"; Type = ModelParameterType.Int; Value = 5. };
    { Name = "max_delta_step"; Type = ModelParameterType.Int; Value = 0. };
    { Name = "subsample"; Type = ModelParameterType.Float32; Value = 0.75 };
    { Name = "colsample"; Type = ModelParameterType.Float32; Value = 0.75 };
    { Name = "reg_lambda"; Type = ModelParameterType.Float32; Value = 4. };
    { Name = "reg_alpha"; Type = ModelParameterType.Float32; Value = 1. } ]

/// Trained model
let finalModel = buildXgClassModel trainInput trainOutput modelParameters

// Predict train values
evaluatePredictionResults finalModel trainInput trainOutput

// Validate Step
evaluatePredictionResults finalModel validationInput validationOutput

Here are the prediction results for the train and validation sets. The prediction capability isn’t great, but the validation set holds up comparatively well, so at least overfitting isn’t a concern (for now). This also shows there is plenty of room for improvement from more and better features.

> evaluatePredictionResults finalModel trainInput trainOutput
Accuracy: 0.680396
P\T      T      F
T    53352  36546
F    66824 166710

> evaluatePredictionResults finalModel validationInput validationOutput
Accuracy: 0.651030
P\T      T      F
T    11625  10755
F    17462  41016
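The hyperparameter values above were hand-picked. A proper search is a topic for that later post, but as a taste, here is a minimal sketch of a grid over two of the parameters, scored by validation accuracy. The candidate values are arbitrary, and a real search would cover much more of the space:

/// Accuracy of a model against a labeled dataset
let accuracy model input (targetOutput:float32[]) =
    let matches = comparePredictions targetOutput (predictionValues model input)
    float (matches |> Array.filter id |> Array.length) / float matches.Length

// Try a few (arbitrary) values for max_depth and learning_rate,
// keeping the remaining hyperparameters as defined above
let searchResults =
    [ for depth in [ 5.; 10.; 15. ] do
        for rate in [ 0.1; 0.3; 0.76 ] do
            let candidate =
                { Name = "max_depth"; Type = ModelParameterType.Int; Value = depth } ::
                { Name = "learning_rate"; Type = ModelParameterType.Float32; Value = rate } ::
                (modelParameters
                 |> List.filter (fun p -> p.Name <> "max_depth" && p.Name <> "learning_rate"))
            let model = buildXgClassModel trainInput trainOutput candidate
            yield (depth, rate, accuracy model validationInput validationOutput) ]
    |> List.sortByDescending (fun (_, _, acc) -> acc)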

Now it is time to create the final predictions and submission file for Kaggle. To do this, replicate the validation workflow, with a couple of caveats. First, the test dataset is formatted slightly differently; since this is data with no known classifications, there is no is_duplicate column in the file. So I need to load the test data, then run the conversion so the test data matches the format of the training data. Second, the submission file needs to be populated with the probability of the questions being duplicates (not with a straight classification). Lastly, write the id along with the result to the submission file.

let testData = TestData.Load(TestFilename).Rows |> Seq.toArray
let transformedTestData = transform trainMetadata (convertTestToTrainFormat testData)
let testInput = transformedTestData |> Array.map (fun row -> row.Features)
let testPredictions = predictionProbabilities finalModel testInput
let submissionData = formatSubmissionData transformedTestData testPredictions
writeSubmissionFile SubmissionFilename submissionData

All that is left to do is submit the file for judging. Spoiler alert: because this is an overly simplified model, it fared poorly. Like I mentioned in the beginning, the current feature set isn’t good. In addition, the hyperparameters could benefit from some search of their own. These are both topics I plan on discussing in future posts. F# and .NET still have a couple more tricks up their sleeves to make these results even better. Hopefully this has provided a bit of inspiration to try F# in your own projects, perhaps even a Kaggle competition. Until next time.