2018-04-24

F# and Word Stems


Today’s post discusses performing word stemming with F#. This will be an expansion on a previous post, Comparing Quora question intent. As a result, it will also address some feature engineering.

For those not familiar with word stems, in this context the term refers to the base of a word with its suffixes removed. Stems are helpful when comparing text, especially when the comparison is about content rather than exact wording. This aligns well with the Quora question comparisons. The Annytab.Stemmer library meets these needs well.

Before getting started, note that everything here builds on the existing code from the Kaggle Quora duplicate questions post.

First, add the Annytab.Stemmer package to the project by adding it to paket.dependencies. Then open the namespaces and create a stemmer object.

nuget Annytab.Stemmer

#r "../packages/Annytab.Stemmer/lib/netstandard1.4/Annytab.Stemmer.dll"

open Annytab
open Annytab.Stemmer

let stemmer = EnglishStemmer()

Now that the basic components are in place, I can provide a simple stem example.

let sentence1 = "When birds fly, they are soaring above the trees while people are watching and talking"
let sentence2 = "When birds are flying, they soar above the trees while people watch and talk"

let sentenceToWords (s:string) = s.Split([|' '|]) 
let sentence1Words = sentenceToWords sentence1
let sentence2Words = sentenceToWords sentence2

let matches = Set.intersect (set sentence1Words) (set sentence2Words)
printfn "Matches: %A" (Set.count matches)

let sentenceToStemWords (s:string) = 
  sentenceToWords s
  |> stemmer.GetSteamWords

let sentence1StemWords = sentenceToStemWords sentence1
let sentence2StemWords = sentenceToStemWords sentence2

let stemMatches = Set.intersect (set sentence1StemWords) (set sentence2StemWords)
printfn "Stem Matches: %A" (Set.count stemMatches)

printfn "sentence1: %A" sentence1StemWords
printfn "sentence2: %A" sentence2StemWords
printfn "Matches     : %d\nStem Matches: %d" (Set.count matches) (Set.count stemMatches)

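Here are the results. Notice that the stemmed word lists contain only the bases: birds becomes bird, watching becomes watch, and so on. This allows concepts to be matched better. Also notice that "fly," and "flying," keep their trailing commas, because the sentences are split on spaces only, so those two words still do not match.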

//Stem words:
sentence1: [|"when"; "bird"; "fly,"; "they"; "are"; "soar"; "abov"; "the"; "tree"; "while";
  "peopl"; "are"; "watch"; "and"; "talk"|]
sentence2: [|"when"; "bird"; "are"; "flying,"; "they"; "soar"; "abov"; "the"; "tree";
  "while"; "peopl"; "watch"; "and"; "talk"|]

> printfn "Matches     : %d\nStem Matches: %d" (Set.count matches) (Set.count stemMatches)
Matches     : 10
Stem Matches: 13
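
The "fly," and "flying," tokens in the output above keep their commas because the split is on spaces only. A minimal sketch of a tokenizer that also splits on punctuation follows. The separators list and the sentenceToCleanWords name are assumptions for illustration and are not part of the original code; the result could then be passed to the stemmer in place of sentenceToWords.

```fsharp
open System

// Hypothetical separator list (an assumption, not from the original post).
let separators = [| ' '; ','; '.'; '?'; '!'; ';'; ':' |]

// Lower-case the sentence, split on the separators, and drop empty tokens.
let sentenceToCleanWords (s: string) =
    s.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries)

let cleanWords1 = sentenceToCleanWords "When birds fly, they are soaring above the trees"
let cleanWords2 = sentenceToCleanWords "When birds are flying, they soar above the trees"

// The commas are gone, so "fly" and "flying" are now plain words.
printfn "%s" (String.concat " | " cleanWords1)
printfn "%s" (String.concat " | " cleanWords2)
```

With punctuation removed before stemming, words next to commas or question marks have a chance to reduce to a shared stem, which matters for question text.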

Time to update the feature generation. A valuable reminder is that feature generation is part art, part science. Often it is an iterative and experimental process. Don’t worry, intuition about what makes a good feature grows with time and experience. Using the newly defined sentenceToStemWords to extract words from the questions, the comparison can be done with Set.intersect. The resulting feature is twice the number of shared stems divided by the combined word count of both questions.

let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
  let question1Words = sentenceToStemWords input.Question1
  let question2Words = sentenceToStemWords input.Question2
  let wordShareCount = 
    Set.intersect (set question1Words) (set question2Words) 
    |> Set.count

  let wordShareFeature = ((float32 wordShareCount) * 2.f) / (float32 question1Words.Length + float32 question2Words.Length)

  [|
    float32 input.Question1.Length;
    float32 input.Question2.Length;
    (wordCount >> float32) input.Question1;
    (wordCount >> float32) input.Question2;
    (abs >> float32) (wordCount input.Question1 - wordCount input.Question2);
    wordShareFeature
  |]
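
To make the word share formula concrete, here is a small sketch that applies it to the two stemmed example sentences from earlier in the post. The wordShare helper is a hypothetical name introduced for illustration, and the stem arrays are copied by hand from the output shown above rather than produced by the stemmer.

```fsharp
// Hypothetical helper mirroring the wordShareFeature expression.
let wordShare (words1: string[]) (words2: string[]) =
    let shared = Set.intersect (set words1) (set words2) |> Set.count
    let total = words1.Length + words2.Length
    // Guard against an empty denominator (an addition for this sketch).
    if total = 0 then 0.f
    else (float32 shared * 2.f) / float32 total

// Stems copied from the example output earlier in the post.
let stems1 =
    [| "when"; "bird"; "fly,"; "they"; "are"; "soar"; "abov"; "the"; "tree"
       "while"; "peopl"; "are"; "watch"; "and"; "talk" |]
let stems2 =
    [| "when"; "bird"; "are"; "flying,"; "they"; "soar"; "abov"; "the"; "tree"
       "while"; "peopl"; "watch"; "and"; "talk" |]

let share = wordShare stems1 stems2
printfn "Word share: %.4f" share   // 2 * 13 / (15 + 14)
```

Because the numerator counts distinct shared stems while the denominator counts every token, a repeated word (such as "are" in the first sentence) pulls the score slightly below what a purely set-based ratio would give.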

Adding matching word stems between questions as a feature improved the accuracy by about 8%. That is a decent ROI for adding a feature.

> evaluatePredictionResults finalModel trainInput trainOutput
Accuracy: 0.755652
T\P       T      F
T     84299  43153
F     35877 160103

> evaluatePredictionResults finalModel validationInput validationOutput
Accuracy: 0.704828
T\P       T      F
T     18281  13061
F     10806  38710

There is one downside to this approach: common words like “a”, “and”, and “the” are included in the matching word feature. This can result in a deceptively high word match percentage. To get a more representative match, these “stop words” can be excluded. Time to make another feature change. I built a stop words list; here is a sample. The full file is here.

i
a
about
after
all
also
an

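Then wrap sentenceToStemWords in a new function, sentenceToFilteredStemWords, that excludes stop words, and use it in rowFeatures. Since a question can now end up with no words at all, the word share calculation also guards against dividing by zero. This will get me to where I want to be.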

open System.IO // needed for File.ReadAllLines

[<Literal>]
let StopWordsFilename = "../data/stopwords.txt"

let stopWords = 
  File.ReadAllLines StopWordsFilename
  |> Array.map (fun x -> (x, 1))
  |> Map.ofArray

let sentenceToFilteredStemWords s =
  sentenceToStemWords s
  |> Array.filter (fun w -> not (Map.containsKey w stopWords))

let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
  let question1Words = sentenceToFilteredStemWords input.Question1
  let question2Words = sentenceToFilteredStemWords input.Question2
  let wordShareCount = 
    Set.intersect (set question1Words) (set question2Words) 
    |> Set.count

  let wordShareFeature =
    if question1Words.Length + question2Words.Length = 0
    then 0.f
    else ((float32 wordShareCount) * 2.f) / (float32 question1Words.Length + float32 question2Words.Length)

  [|
    float32 input.Question1.Length;
    float32 input.Question2.Length;
    (wordCount >> float32) input.Question1;
    (wordCount >> float32) input.Question2;
    (abs >> float32) (wordCount input.Question1 - wordCount input.Question2);
    wordShareFeature;
  |]
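
As a rough illustration of what the filter does, the sketch below applies a stop word set to the stemmed example sentences from earlier. The sampleStopWords list and the removeStopWords name are assumptions made for this example (the real list is loaded from the stop words file), and a Set is used here instead of a Map simply because only membership is needed.

```fsharp
// Hypothetical in-memory stop word list (assumption: not the real file contents).
let sampleStopWords = set [ "when"; "they"; "are"; "the"; "while"; "and" ]

// Keep only the words that are not in the stop word set.
let removeStopWords (words: string[]) =
    words |> Array.filter (fun w -> not (Set.contains w sampleStopWords))

// Stems copied from the example output earlier in the post.
let exampleStems1 =
    [| "when"; "bird"; "fly,"; "they"; "are"; "soar"; "abov"; "the"; "tree"
       "while"; "peopl"; "are"; "watch"; "and"; "talk" |]
let exampleStems2 =
    [| "when"; "bird"; "are"; "flying,"; "they"; "soar"; "abov"; "the"; "tree"
       "while"; "peopl"; "watch"; "and"; "talk" |]

let filtered1 = removeStopWords exampleStems1
let filtered2 = removeStopWords exampleStems2
let sharedCount = Set.intersect (set filtered1) (set filtered2) |> Set.count

printfn "Kept %d and %d stems, %d shared" filtered1.Length filtered2.Length sharedCount
```

One thing to keep in mind with this ordering: the filter runs on stemmed words, so a stop word whose stem differs from its listed form would not be caught. Filtering before stemming, or stemming the stop word list as well, would avoid that.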

Filtering out stop words gained roughly another 2 to 3 points: validation accuracy rose from 0.705 to 0.731. Admittedly I expected a bit more, but it is still an improvement.

> evaluatePredictionResults finalModel trainInput trainOutput
Accuracy: 0.777598
T\P       T      F
T     88157  39913
F     32019 163343

> evaluatePredictionResults finalModel validationInput validationOutput
Accuracy: 0.730577
T\P       T      F
T     19373  12071
F      9714  39700
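
For anyone who wants to check the numbers, accuracy can be recomputed from the tables above as the diagonal (correct predictions) divided by the total. The accuracy helper below is a hypothetical function written for this check; it is not the evaluatePredictionResults function from the original code.

```fsharp
// Accuracy = correct predictions (the diagonal) over all predictions.
// Arguments are the four table cells, read row by row.
let accuracy (tt: int) (tf: int) (ft: int) (ff: int) =
    float (tt + ff) / float (tt + tf + ft + ff)

// Validation tables: stems only, then stems with stop words removed.
let stemsOnly = accuracy 18281 13061 10806 38710
let withStopWords = accuracy 19373 12071 9714 39700

printfn "Stems only     : %.6f" stemsOnly
printfn "With stop words: %.6f" withStopWords
printfn "Gain           : %.6f" (withStopWords - stemsOnly)
```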

As you can see, using word stems and stop words to extend the features can be a useful tactic. This also serves as a good reminder that F# has the tools for interesting analysis. I hope you found this post useful. Until next time.