Today I look at using F# with the NDtw package. This is so I can play with some dynamic time warping. In case you’re not familiar with DTW, the TL;DR version is that it is a method to compare time series that may differ in speed or timing. This allows for a more nuanced data comparison that can capture shifted, compressed, and extended patterns. It’s a fun little algorithm to use and worth reading more about.
On to the data. I’ve pulled an EEG dataset from Kaggle. I’ve also included a copy here for posterity’s sake. It contains EEG data of subjects watching short videos. The goal of the dataset is mental state classification. I won’t be doing that here, but I can see using DTW as a method to facilitate classification based on channel similarities.
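To make the idea concrete, here is a minimal sketch of the classic dynamic-programming recurrence behind DTW, written from scratch rather than with NDtw. The name `dtwCost` and the absolute-difference point cost are illustrative assumptions, not the library's implementation.

```fsharp
// Minimal DTW sketch: total cost of the cheapest alignment between two series.
// Illustrative only; this is not the NDtw implementation.
let dtwCost (x: float[]) (y: float[]) : float =
    let n, m = x.Length, y.Length
    // cost.[i, j] = cheapest cost of aligning the first i samples of x
    // with the first j samples of y
    let cost = Array2D.create (n + 1) (m + 1) System.Double.PositiveInfinity
    cost.[0, 0] <- 0.0
    for i in 1 .. n do
        for j in 1 .. m do
            let d = abs (x.[i - 1] - y.[j - 1])
            let diag = cost.[i - 1, j - 1]   // advance in both series
            let up = cost.[i - 1, j]         // advance in x only
            let left = cost.[i, j - 1]       // advance in y only
            cost.[i, j] <- d + min diag (min up left)
    cost.[n, m]

// The second series is the first with one sample stretched out
let a = [| 1.0; 2.0; 3.0; 2.0; 1.0 |]
let b = [| 1.0; 2.0; 2.0; 3.0; 2.0; 1.0 |]
printfn "%f" (dtwCost a b)   // prints 0.000000
```

Because the repeated sample in `b` can be aligned to the same point in `a`, the stretched series costs nothing, which is exactly the kind of distortion DTW is meant to forgive.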
Using Paket, here is a sample paket.dependencies file.
Here is my standard boilerplate code, along with the VideoId that I will be using for testing. As a note, all of the data columns are floats, including the subject and video ids. If I were doing something more serious I’d be inclined to convert these, but to do something quick I’ll deal with it.
I load the data using a type provider. Since the file doesn’t have headers, I’ll use Schema to define the column names. As a note, the [<Literal>] attribute on eegDataFile is so I can use the string in the CsvProvider.
I now create a function to extract subject- and video-specific rows from the dataset. I also reduce the dataset to a single value for the signal. I’ve decided to use the Theta channel. The choice is somewhat arbitrary, but it is primarily motivated by this quote from the dataset’s Kaggle page: “Past research has indicated that Theta signal is correlated with confusion level.” This leads me to believe it might be the most likely channel to find interesting comparisons. To make the resulting charts easier to read, I limit the dataset to the first 100 rows of data per subject/video combination.
// Get subject & video specific data, only pull the first 100 records of the dataset
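As a rough sketch of that extraction step, the function below filters to one subject/video combination, keeps only the Theta value, and caps the result at a fixed number of rows. The `EegRow` record is a hypothetical stand-in for the type provider's row type, and `thetaSignal` is an assumed name used only for illustration.

```fsharp
// Hypothetical stand-in for a typed CSV row; the real dataset has more columns.
type EegRow = { SubjectId: float; VideoId: float; Theta: float }

// Keep one subject/video combination, project to Theta,
// and cap the result at maxRows samples.
let thetaSignal (maxRows: int) (subjectId: float) (videoId: float) (rows: seq<EegRow>) : float[] =
    rows
    |> Seq.filter (fun r -> r.SubjectId = subjectId && r.VideoId = videoId)
    |> Seq.map (fun r -> r.Theta)
    |> Seq.truncate maxRows
    |> Seq.toArray

// Tiny made-up sample, only to show the shape of the call
let sampleRows =
    [ { SubjectId = 0.0; VideoId = 0.0; Theta = 10.0 }
      { SubjectId = 0.0; VideoId = 1.0; Theta = 20.0 }
      { SubjectId = 0.0; VideoId = 0.0; Theta = 30.0 }
      { SubjectId = 1.0; VideoId = 0.0; Theta = 40.0 } ]

let signal = thetaSignal 100 0.0 0.0 sampleRows
```

`Seq.truncate` is used instead of `Seq.take` so that a subject with fewer than 100 rows does not raise an exception.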
Here is the dynamic time warping distance calculation function. The call is straightforward. After all this setup it’s almost anticlimactic. It expects two parameters, each an array of values.
// Get the distance between two signals
The NDtw library allows for a more complicated DTW call if so desired. I’ve made an alternate distance function using the more complex version. As an example, it allows for different distance calculations: Euclidean, Manhattan, Maximum, SquaredEuclidean. It also allows limits on how far the warping path can diverge from the standard path.
// Get the distance between two signals (using more advanced dtw call)
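To show what those two knobs do, here is a from-scratch sketch of DTW with a selectable point distance and an optional band that limits how far the path may drift from the diagonal (in the style of a Sakoe-Chiba band). The names `PointDistance` and `dtwCostBanded` are illustrative assumptions and do not mirror NDtw's actual parameters.

```fsharp
// Point-wise distance choices for this sketch.
// For scalar samples, Euclidean, Manhattan, and Maximum all reduce to |a - b|.
type PointDistance =
    | Absolute
    | Squared

// DTW cost with an optional band: maxShift = Some k only allows
// alignments between positions i and j where |i - j| <= k.
let dtwCostBanded (dist: PointDistance) (maxShift: int option) (x: float[]) (y: float[]) : float =
    let n, m = x.Length, y.Length
    let pointCost a b =
        match dist with
        | Absolute -> abs (a - b)
        | Squared -> (a - b) * (a - b)
    let inBand i j =
        match maxShift with
        | Some k -> abs (i - j) <= k
        | None -> true
    let cost = Array2D.create (n + 1) (m + 1) System.Double.PositiveInfinity
    cost.[0, 0] <- 0.0
    for i in 1 .. n do
        for j in 1 .. m do
            // Cells outside the band stay at infinity, so no path can use them
            if inBand i j then
                let diag = cost.[i - 1, j - 1]
                let up = cost.[i - 1, j]
                let left = cost.[i, j - 1]
                cost.[i, j] <- pointCost x.[i - 1] y.[j - 1] + min diag (min up left)
    cost.[n, m]

// The same spike, shifted by two positions
let early = [| 0.0; 1.0; 0.0; 0.0; 0.0 |]
let late = [| 0.0; 0.0; 0.0; 1.0; 0.0 |]
let free = dtwCostBanded Absolute None early late        // 0.0
let banded = dtwCostBanded Absolute (Some 1) early late  // 2.0
```

With no band the two spikes are aligned for free. With a band of 1 they can no longer reach each other, so each spike has to be matched against a zero and the cost rises to 2.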
As a short aside, the library offers a couple of other bits of useful functionality that I am not using right now, but that are worth mentioning.
// Example comparison of 2 small datasets
Back to the task at hand, I build a function to compare each subject’s signal against each other subject’s signal. Then I get the best match I can find and store the results in subject1, subject2, and difference.
// Compare each subject against every other subject
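A minimal sketch of that pairwise comparison might look like the following. It takes any distance function, so a DTW call can be dropped in. The names `rankMatches` and `sumAbsDiff` are hypothetical, and `sumAbsDiff` is only a simple stand-in so the example runs without a DTW library.

```fsharp
// Compare every pair of subjects once and rank pairs by distance (closest first).
// Subject ids are floats here to match the dataset's columns.
let rankMatches (distance: float[] -> float[] -> float) (signals: (float * float[]) list) =
    [ for (s1, a) in signals do
        for (s2, b) in signals do
            // Skip self-comparisons and mirrored duplicates
            if s1 < s2 then
                yield (s1, s2, distance a b) ]
    |> List.sortBy (fun (_, _, d) -> d)

// Stand-in distance for the demo (requires equal-length arrays);
// in practice the DTW distance function would go here.
let sumAbsDiff (a: float[]) (b: float[]) =
    Array.map2 (fun p q -> abs (p - q)) a b |> Array.sum

// Made-up signals, only to show the shape of the result
let demoSignals =
    [ (1.0, [| 0.0; 0.0 |])
      (2.0, [| 5.0; 5.0 |])
      (3.0, [| 0.0; 1.0 |]) ]

let ranked = rankMatches sumAbsDiff demoSignals
let best = List.head ranked   // (1.0, 3.0, 1.0)
```

Comparing only pairs where `s1 < s2` halves the work, which assumes the distance is symmetric.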
For visualization purposes, I create a comparison chart for each subject against Subject 1. Then I save the charts to files.
Here are the top couple of matches, in the format (subject1, subject2, distance).
// Show distances for our winner
Here is the top signal match.
Here are the other subjects compared to subject 1.
There it is. The best match it can find is between subjects 1 and 7, although 1 and 4 are a close second. This has been a fun experiment. Until next time.