Estimating Distinct Element Counts with F#

Read Time: 10 minutes

Estimating the number of distinct elements in large datasets is a common task. While there are straightforward approaches, they can be memory-intensive and slow for massive datasets. Today I'm going to take a look at the F0 Estimator introduced in the paper Distinct Elements in Streams: An Algorithm for the (Text) Book. As usual, this will be an implementation in F#.

First things first, what is the goal? In some cases the data being processed is so large that calculating exact counts of distinct values is impractical. An alternative is to perform an estimated count of the items. The key is to provide an estimate that is accurate enough while also being performant (across whatever axis matters). The goal here is not to go through the proofs, but to understand the implementation.

With that said, I need to implement the paper's algorithm in F#. One nice aspect of this paper is that the algorithm is pretty simple; that is also the point of the paper. I'll get into performance later, but it is enjoyable to see benefits from a solution that isn't hard to understand.

At a high level, the algorithm iterates through the data, keeping track of items it has seen (using a set). This is the first place where probabilities come into play: a value is only added to the set randomly, based on a variable sampling rate (p). In order to save space, the set also has a limited capacity (thresh), which means it will inevitably fill up. When that happens, space is freed for further processing by removing each item in the set with a 50% probability, and at the same time the sampling rate (p) is halved. The result is that as processing progresses, it becomes less and less likely that a new item will be added to the set. Once processing is complete, the estimate is calculated as: set size / p. Accuracy and space used are driven by the epsilon and delta parameters: lower values give higher accuracy but use more space; higher values reduce accuracy but save space. And that's pretty much it. Below is what the F# code looks like.

open System
open System.Collections.Generic

let random = Random()

let f0Estimator (A: string[]) (epsilon: float) (delta: float) =
    let m = A.Length
    let mutable p = 1.0
    // Maximum number of items the sampled set is allowed to hold
    let thresh = int (ceil (12.0 * epsilon ** -2.0 * log (8.0 * float m / delta)))
    let mutable X = HashSet<string>(thresh)
    let mutable noAnswer = false

    for i in 0 .. m - 1 do
        // Remove first, then re-add with the current sampling probability
        X.Remove(A.[i]) |> ignore
        if random.NextDouble() < p then
            X.Add(A.[i]) |> ignore

        if X.Count = thresh then
            // Set is full: keep each item with 50% probability and halve the sampling rate
            X <-
                X
                |> Seq.filter (fun _ -> random.NextDouble() >= 0.5)
                |> HashSet
            p <- p / 2.0
            // If thinning freed no space at all, the estimator gives up
            if X.Count = thresh then
                noAnswer <- true

    if noAnswer then
        None
    else
        Some (int (float X.Count / p))
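
For reference, a minimal usage sketch might look like this (the file name is the one I use for the stream example later; the epsilon/delta values are arbitrary):

// Hypothetical usage: load the words into an array, then estimate the distinct count.
let words =
    IO.File.ReadAllText("really_big_file.txt")
        .Split([| ' '; '\r'; '\n' |], StringSplitOptions.RemoveEmptyEntries)

match f0Estimator words 0.5 0.5 with
| Some estimate -> printfn "Estimated distinct words: %d" estimate
| None -> printfn "No estimate (the set could not be thinned)"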

So, what do the results look like? For the following analysis I used a 1GB text file with 27M lines, 184M words, and 132,876 unique words. There are two main things I care about: accuracy and performance. I'll look at estimation accuracy first. Below are the results for several different epsilon and delta values. I also include thresh, the maximum size of the set being used, which directly impacts memory usage. Depending on your tolerance level, the results are reasonable. Looking at epsilon and delta values from 1 down to 0.1, you can see the trade-off in memory. A lower thresh uses less memory, but its estimates deviate more from the actual count and vary more from run to run. Using more memory (a higher thresh) gives both higher accuracy and more consistent results.

Actual   : 132876

# epsilon: 1 delta: 1 thresh: 254
Estimator: 120832 Difference: 9.06%
Estimator: 141312 Difference: 6.35%
Estimator: 149504 Difference: 12.51%
Estimator: 129024 Difference: 2.90%
Estimator: 135168 Difference: 1.72%
Estimator: 133120 Difference: 0.18%
Estimator: 135168 Difference: 1.72%
Estimator: 138240 Difference: 4.04%

# epsilon: 0.5 delta: 0.5 thresh: 1047
Estimator: 127488 Difference: 4.05%
Estimator: 134400 Difference: 1.15%
Estimator: 132864 Difference: 0.01%
Estimator: 126976 Difference: 4.44%
Estimator: 135936 Difference: 2.30%
Estimator: 138496 Difference: 4.23%
Estimator: 131072 Difference: 1.36%
Estimator: 124160 Difference: 6.56%

# epsilon: 0.25 delta: 0.25 thresh: 4320
Estimator: 130048 Difference: 2.13%
Estimator: 132544 Difference: 0.25%
Estimator: 136448 Difference: 2.69%
Estimator: 137664 Difference: 3.60%
Estimator: 133440 Difference: 0.42%
Estimator: 131648 Difference: 0.92%
Estimator: 137024 Difference: 3.12%
Estimator: 130752 Difference: 1.60%

# epsilon: 0.1 delta: 0.1 thresh: 28100
Estimator: 133384 Difference: 0.38%
Estimator: 132832 Difference: 0.03%
Estimator: 134664 Difference: 1.35%
Estimator: 134344 Difference: 1.10%
Estimator: 133872 Difference: 0.75%
Estimator: 133240 Difference: 0.27%
Estimator: 132984 Difference: 0.08%
Estimator: 133184 Difference: 0.23%
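
As a quick sanity check, the thresh values reported above fall straight out of the formula in the code. The snippet below recomputes them using the rounded word count (184M); that rounding is why the 0.1 setting comes out at roughly 28,095 instead of the reported 28,100, while the others match.

// Recomputing thresh with the formula from f0Estimator, using the rounded word count (184M).
// Expect values very close to those reported above; small gaps come from rounding m.
let threshFor (m: int) (epsilon: float) (delta: float) =
    int (ceil (12.0 * epsilon ** -2.0 * log (8.0 * float m / delta)))

[ 1.0; 0.5; 0.25; 0.1 ]
|> List.iter (fun v -> printfn "epsilon: %g delta: %g thresh: %d" v v (threshFor 184_000_000 v v))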

Next, what does the performance look like? I used BenchmarkDotNet and again looked at several different settings. For comparison, I use a naive "countDistinct" and a more efficient "countDistinctWithHashSet" (see below). One thing that becomes obvious quickly is that memory allocations are drastically reduced when using estimated counting. This is expected, and one of the primary goals. The estimator allocates several orders of magnitude less than the naive version, and even at the highest-accuracy setting only about half of the HashSet version. This savings in allocations is a compelling reason to use estimated counting. As you might imagine, the amount allocated corresponds to the epsilon and delta settings: higher accuracy requires more allocations. The second takeaway is execution time; the estimator is approximately twice as fast as both of the control counts. These results are a nice, concrete demonstration of the trade-offs, and of why one would want to use estimated counting.

// Naive
let countDistinct (data: string[]) =
    data |> Array.distinct |> Array.length

// Naive with HashSet
let countDistinctWithHashSet (data: string[]) =
    let X = HashSet<string>()
    data
    |> Array.iter (fun word -> X.Add word |> ignore)
    X.Count
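
For context, the benchmark harness looked roughly like the sketch below. This is my reconstruction, not the exact code: the class name and file path are placeholders, and I'm assuming the _one/_half/_quarter/_tenth suffixes correspond to epsilon = delta values of 1, 0.5, 0.25, and 0.1, matching the accuracy runs above.

open System.IO
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]
type DistinctCountBenchmarks() =
    let mutable words : string[] = [||]

    [<GlobalSetup>]
    member _.Setup() =
        // Placeholder path for the 1GB test file described above
        words <- File.ReadAllText("really_big_file.txt").Split([| ' '; '\r'; '\n' |])

    [<Benchmark>]
    member _.CountDistinct() = countDistinct words

    [<Benchmark>]
    member _.CountDistinctWithHashSet() = countDistinctWithHashSet words

    [<Benchmark>]
    member _.F0Estimator_one() = f0Estimator words 1.0 1.0

    [<Benchmark>]
    member _.F0Estimator_half() = f0Estimator words 0.5 0.5

    [<Benchmark>]
    member _.F0Estimator_quarter() = f0Estimator words 0.25 0.25

    [<Benchmark>]
    member _.F0Estimator_tenth() = f0Estimator words 0.1 0.1

// In a console project, the run is kicked off with:
// BenchmarkRunner.Run<DistinctCountBenchmarks>() |> ignore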
# Performance:

| Method                   | Mean     | Error    | StdDev   | Allocated     |
|------------------------- |---------:|---------:|---------:|--------------:|
| CountDistinct            | 10.414 s | 0.0323 s | 0.0302 s | 1310550.29 KB |
| CountDistinctWithHashSet |  9.523 s | 0.0626 s | 0.0523 s |   12234.11 KB |
| F0Estimator_one          |  4.216 s | 0.0429 s | 0.0380 s |     183.16 KB |
| F0Estimator_half         |  4.286 s | 0.0840 s | 0.0934 s |     679.77 KB |
| F0Estimator_quarter      |  4.354 s | 0.0798 s | 0.0667 s |    2358.71 KB |
| F0Estimator_tenth        |  5.428 s | 0.1063 s | 0.0943 s |    6838.87 KB |

There is one additional aspect to tease out. For comparison purposes, everything so far has been based on an array. But considering the idea is to use a method like this on very large data, there is value in investigating its use when processing a stream. Below is an adaptation that performs an estimated distinct word count on a stream. The estimated counting logic is the same as above; the major difference is that, because the data size isn't known beforehand, thresh (the set size) needs to be provided explicitly. It can be calculated manually based on the use case, the anticipated data size, and the desired memory profile, as sketched next.
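
For example, reusing the threshFor helper from the sanity check above (the anticipated word count here is purely illustrative):

// Pick thresh up front from an anticipated item count and the desired epsilon/delta trade-off.
// The anticipated count is a made-up number; with these values thresh lands in the low thousands,
// the same ballpark as the 4000 passed to streamF0Estimator below.
let anticipatedWords = 200_000_000
let streamThresh = threshFor anticipatedWords 0.25 0.25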

let streamF0Estimator (thresh: int) (stream: IO.Stream) =
    let mutable p = 1.0
    let mutable X = HashSet<string>(thresh)
    let mutable noAnswer = false
    let chunkSize = 4096
    let buffer = Array.zeroCreate<byte> chunkSize
    let mutable bytesRead = 0
    // Carries a possibly incomplete word across chunk boundaries
    let mutable lastWord = ""

    while (bytesRead <- stream.Read(buffer, 0, chunkSize); bytesRead > 0) do
        let content = Text.Encoding.UTF8.GetString(buffer, 0, bytesRead)
        let words = (lastWord + content).Split([| ' '; '\r'; '\n' |])

        // Process every word except the last, which may be cut off mid-chunk
        for i in 0 .. words.Length - 2 do
            X.Remove(words.[i]) |> ignore
            if random.NextDouble() < p then
                X.Add(words.[i]) |> ignore

            if X.Count = thresh then
                // Set is full: keep each item with 50% probability and halve the sampling rate
                X <-
                    X
                    |> Seq.filter (fun _ -> random.NextDouble() >= 0.5)
                    |> HashSet
                p <- p / 2.0
                if X.Count = thresh then
                    noAnswer <- true

        lastWord <- words.[words.Length - 1]

    if noAnswer then
        None
    else
        Some (int (float X.Count / p))

// Usage
use stream = new IO.FileStream("really_big_file.txt", IO.FileMode.Open, IO.FileAccess.Read)
let distinctEstimate2 = streamF0Estimator 4000 stream

So what does all this mean? First, if you're willing to use estimates when counting distinct values, there are some good ways to gain performance in both processing time and memory. In reality, though, this isn't new. What is unique in this case is that the estimation algorithm is pretty easy to implement and understand. That can go a long way when you're concerned with long-term maintenance. If nothing else, there is value in investigating and understanding alternate approaches. That's all I have for now. Until next time.

References:
Distinct Elements in Streams: An Algorithm for the (Text) Book (https://arxiv.org/pdf/2301.10191)