Ever wondered why so many people focus on breaking antibot systems instead of trying to build one? This question has been on my mind for quite some time. There are certainly psychological factors and financial incentives at play. But think about it: how often do you come across articles aimed at defending against attacks versus those teaching how to launch them?
And I’m not talking about academic circles, where research is strictly regulated by ethical standards, but rather the community. Despite being part of many groups, I’ve rarely seen discussions on how to defend against bot attacks.
I think one of the main reasons for this phenomenon is the lack of accessibility of antibot technologies: no one really talks about them, little information is available, and they’re not open source.
In this blog, I try to do something different than usual. Instead of targeting a particular antibot protection, I’m going to dissect a popular bot bypass method that’s been a headache for Akamai for years. I’ll reverse engineer it and then show you how to shut it down. Let’s start.
About five years ago a working mouse movement generator for Akamai v1.60 was leaked. The puzzling part? No one could figure out why it worked so effectively. The mact had a great success rate and remained operational for two full years before Akamai managed to patch the vulnerability. But the question remains: why did it take Akamai so long to address the issue, and what made this generator so effective?
The answer, surprisingly, might be simpler than we think. Akamai’s delay and the mact success primarily came down to a lack of a dedicated Threat Intelligence team. Had Akamai allocated resources to look into the motivations and techniques of those determined to bypass their systems—like teenagers eager to get the latest Jordan 4 Sail—they might have identified and mitigated the threat much sooner.
Note: on Akamai, mouse movements are referred to as MACT.
I’ve asked the community to lend me the mouse movement generator in question. Here’s what the generated mouse movement samples looked like:
Fig. 1 Fig. 2 Fig. 3
I have studied mouse movements for more than 3 years now, and I would say these are not bad!
Let’s imagine we are Akamai employees: how should we proceed? I usually approach this kind of problem by breaking it down into several steps: Visual Analysis, Technical Analysis, Brainstorming, Testing, Model Generation, and Testing again.
When analyzing mouse movements, two main aspects demand our attention: velocity and trajectory. Velocity refers to the speed of the mouse movement. By examining how clustered or spread out each point is, we can infer the velocity. Since mouse movements are sampled at a consistent rate, the distance between points gives us a good indication of speed. For instance, if it takes 100 ms to move the mouse across a trajectory and samples are taken every 10 ms, resulting in 10 samples, a fast hand movement across the screen from bottom left to top right would show points widely spaced apart, indicating high velocity.
Fig.4 In gray a fast movement, in orange a slow movement, in blue the trajectory.
In Fig. 4 the gray points are faster movements, while the orange points indicate slower speeds. The trajectory, or the path the mouse takes, is shown in blue.
Despite their apparent differences, the samples in Figures 1, 2, and 3 share some similarities. In a typical analysis, dozens or even hundreds of samples are necessary to identify common patterns reliably. However, even with just these samples, we can observe some characteristics.
One notable similarity across all samples is the presence of the same number of “curves” where the trajectory sharply bends and the velocity decreases. This pattern isn’t random; the creator designed each trajectory to have a specific number of curves, which can be customized. I know this may seem like a stretch, but antibot development sometimes means pushing imaginative theories as far as they will go and not fearing incorrect hypotheses.
To simplify, imagine identifying four critical points along the trajectory where the velocity drops (indicated by closer points). These points mark the beginning and end of each movement curve. Here’s what that might look like:
Fig. 5 Fig. 6
Between each “star” point we only have one curve, and if we analyzed multiple samples, we would not find more curves between the “star” points.
An interesting observation about velocity is that it tends to decrease where these “star” points are closer together, as seen in Fig. 6, where the velocity is lowest between the second and third “star” points before increasing again where the “star” points are farther apart.
At first glance, this method might seem too simplistic to effectively fingerprint an algorithm, especially since a similar process could be applied to authentic mouse movements. However, this allows us to have a clear direction for what to look for in the Technical Analysis.
Fun Fact: People used to call me crazy when I criticized their mact just by looking at a photo of it.
By technical analysis I mean using algorithms to extract and compute features from the mouse movements. In this analysis, I’ve chosen to focus exclusively on velocity, since trajectory is far more complex to analyze.
To analyze velocity, we begin by constructing a velocity profile. This involves calculating the velocity in both the x and y directions for each sample point, then using these to find the average velocity across the trajectory:
\[\Delta \text{V}_x = \frac{x_{i+1} - x_i}{\Delta T}, \Delta \text{V}_y = \frac{y_{i+1} - y_i}{\Delta T}\]To find the average velocity, we use the Pythagorean theorem:
\[\text{V}_{avg} = \sqrt{\Delta \text{V}_x^2 + \Delta \text{V}_y^2}\]We repeat this algorithm across all samples and get the velocity profile. I also filtered the velocity to make it look smoother, but this is not necessary.
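As a concrete sketch of this procedure (the 10 ms sample interval is an assumption for illustration, not a property of the mact format), the velocity profile can be computed with NumPy:

```python
import numpy as np

def velocity_profile(xs, ys, dt=0.01):
    """Velocity magnitude between consecutive samples.

    xs, ys: coordinates sampled at a constant interval dt (seconds).
    Returns an array one element shorter than the input.
    """
    vx = np.diff(xs) / dt  # delta Vx
    vy = np.diff(ys) / dt  # delta Vy
    return np.sqrt(vx ** 2 + vy ** 2)  # Pythagorean combination

# A constant-speed diagonal movement yields a flat profile:
t = np.arange(10)
profile = velocity_profile(t * 3.0, t * 4.0)
print(profile)  # nine entries, all 500.0 (a 3-4-5 step every 10 ms)
```

A real profile would of course vary from sample to sample; the filtering mentioned above is just an optional smoothing pass over this array.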
Fig. 7 BOT velocity profile.
And we can compare it with a real mouse movement:
Fig. 8 Real velocity profile.
At first glance, the profiles appear strikingly similar, almost indistinguishable. One might consider employing machine learning to discern patterns that differentiate synthetic from real movements. However, this approach risks increasing the false positive rate, where genuine humans are mistakenly flagged as BOTs. Moreover, it’s not an efficient path forward given that the features (the two velocity profiles) are very similar.
Here, it’s worth mentioning Cynthia Rudin’s paper, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead”, which argues against relying on black box machine learning models for critical decisions and advocates for interpretable models from the outset.
This philosophy aligns with my approach: rather than attempting to fit explainability into a complex model, it’s far more effective to build a model that’s understandable.
This is where Akamai’s strategy faltered, allowing the synthetic algorithm to operate undetected for years. While the realm of biometric research offers countless sophisticated solutions to this problem, our focus here is on building a solution that, although not too complex, remains effective.
I asked the community to borrow the mouse algorithm to study and find vectors of attack, and within a few hours I already had two solutions in mind.
Let’s look at the code, which by the way is available on my github under old-mact, here. A function immediately comes to my attention. It’s important to note that this function, generateLine, operates independently on the X and Y axes, later combining these linear paths to construct the overall trajectory.
func generateLine(size, cycles int) (result []float64) {
    const (
        min float64 = 0
        max float64 = 1000
    )
    result = make([]float64, size)
    for i := 0; i < size; i++ {
        result[i] = min
    }
    multiplier := 2
    for i := 0; i < cycles; i++ {
        // Random anchor points; their range shrinks as multiplier grows.
        randoms := make([]float64, 2+int(math.Ceil(float64((size-1)/(size/int(math.Pow(2, float64(cycles))))))))
        for j := 0; j < len(randoms); j++ {
            randoms[j] = min + rand.Float64()*(max/float64(multiplier))
        }
        segmentSize := math.Floor(float64(size / multiplier))
        // Linear interpolation between consecutive anchors, summed onto the result.
        for j := 0; j < size; j++ {
            currentSegment := math.Floor(float64(j) / segmentSize)
            ratio := float64(j)/segmentSize - math.Floor(float64(j)/segmentSize)
            result[j] += interpolate(randoms[int(currentSegment)], randoms[int(currentSegment)+1], ratio)
        }
        multiplier *= 2
    }
    return
}
It took me a while to understand the algorithm, but I’ll try to explain it in a simple way. The core of the algorithm revolves around ‘cycles,’ a concept that essentially dictates the algorithm’s complexity, or in other words, the number of curves integrated into the trajectory. This concept directly correlates with the ‘stars’ identified in our visual analysis.
randoms := make([]float64, 2+int(math.Ceil(float64((size-1)/(size/int(math.Pow(2, float64(cycles))))))))
for j := 0; j < len(randoms); j++ {
    randoms[j] = min + rand.Float64()*(max/float64(multiplier))
}
This generates the anchor points (star points) across the screen at random locations, with the quantity of points increasing in a pattern determined by the cycle count. In the first cycle there are 3 points, in the second 5, in the third 9 and so on.
segmentSize := math.Floor(float64(size / multiplier))
The algorithm then calculates the segment size, i.e. the number of points between each pair of anchor points. For instance, consider a total of 100 points (the Akamai mact requires 100 points). If the cycle is 1, there are 3 anchor points and therefore 2 segments, so the segment size is 100/2 = 50: 50 points in the first segment and 50 in the second. If the cycle is 2, there are 5 anchor points and 4 segments, so the segment size is 100/4 = 25: 25 points in each of the four segments.
for j := 0; j < size; j++ {
    currentSegment := math.Floor(float64(j) / segmentSize)
    ratio := float64(j)/segmentSize - math.Floor(float64(j)/segmentSize)
    result[j] += interpolate(randoms[int(currentSegment)], randoms[int(currentSegment)+1], ratio)
}
And this is a fancy way of creating a straight line between two points. The final result is something like this for the first cycle.
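To make this concrete, here is a hedged Python sketch of one cycle. This is not the author’s exact code: `interpolate` and the anchor count of multiplier + 1 are my reconstruction from the description above, and the sketch assumes size is (nearly) divisible by multiplier.

```python
import random

def interpolate(a, b, ratio):
    # A straight line between two anchor values.
    return a + (b - a) * ratio

def one_cycle(size, multiplier, max_value=1000.0):
    """Piecewise-linear signal through random anchors.

    multiplier is 2 for the first cycle and doubles each cycle,
    giving multiplier segments and multiplier + 1 anchor points.
    """
    anchors = [random.uniform(0, max_value / multiplier)
               for _ in range(multiplier + 1)]
    segment_size = size // multiplier
    signal = []
    for j in range(size):
        # Clamp so a slightly uneven last segment never overruns the anchors.
        segment = min(j // segment_size, multiplier - 1)
        ratio = j / segment_size - segment
        signal.append(interpolate(anchors[segment], anchors[segment + 1], ratio))
    return signal

# Like the Go code, later cycles are summed onto the earlier ones:
random.seed(42)
trajectory_x = [a + b for a, b in zip(one_cycle(100, 2), one_cycle(100, 4))]
```

Each call produces a jagged piecewise-linear signal; summing cycles layers finer and finer random detail on top of the first coarse shape.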
Fig. 9 The trajectory X signal for 1 cycle.
Now the second cycle is basically the same, we are just adding additional points in the middle of the segments. This is the result for the second cycle.
Note: Fig. 9 and Fig. 10 were generated by running the script at two different times.
Fig. 10 In blue the trajectory X signal for 1 cycle, in orange for 2 cycles.
You can see that in the first cycle there are only 3 points, while in the second there are 5: an additional point has been added in the middle of each segment. The number of segments doubles each cycle.
For the third cycle we add another point in the middle of each segment, and so on. At the end we will have a total of 2^cycles + 1 anchor points and 2^cycles segments. For 3 cycles, that is 2^3 + 1 = 9 points and 8 segments.
Fig. 11 The trajectory X signal for 3 cycles.
We are still missing the secret ingredient: smoothing.
The trick was in a basic smoothing function in the code, which blended the points together seamlessly. This function didn’t just make the path less rigid; it also unintentionally mimicked a human-like velocity profile. You heard me right, this happened by pure luck. Here’s the code that did the magic:
func smooth(arr []float64, smoothing float64) (result []float64) {
    result = make([]float64, len(arr))
    result[0] = arr[0]
    for i := 1; i < len(arr); i++ {
        result[i] = (1-smoothing)*arr[i] + smoothing*result[i-1]
    }
    return
}
This is what made the algorithm work: a banal smoothing function. Mind-blowing! This is what the final result looks like:
In blue the starting unfiltered trajectory X, in orange the smoothed trajectory X.
At first glance, it might seem impossible to distinguish these generated movements from real ones. But look closer at Fig. 11: each segment of the movement has evenly spaced points, meaning the velocity is constant within each segment.
What we are going to do is reverse the smoothing formula, and obtain a good approximation of the original signal (the blue line in the image below). Unfortunately, the smoothing operator is destructive, meaning we cannot restore the original velocity information once it is applied, but only an approximation.
To understand this process, think of the velocity as a simple signal. It jumps from one constant speed to another with each new segment. The smoothing function acts like a filter, gradually transitioning between these speeds, which makes the movement look more natural.
Fig. 12 Velocity profile for the X axis, in blue the starting velocity, in orange the applied EWMA.
This mathematical operation is expressed by the following formula:
\(\text {EWMA}_t = \lambda \text {EWMA}_{t-1} + (1 - \lambda) v_t\) Eq.1
The author of the script used a value of 0.955 for \(\lambda\). Essentially, we are attributing a small weight to every new velocity input that comes in, so the transition is smooth. You can interpret the EWMA as the actual value of the velocity obtained from the MACT.
If we started from a velocity of zero and an input of \(v_t\) = 10cm/s, the first values of the EWMA would be [0, 0.45, 0.9…]
By looking at the smoothed data and applying our understanding of EWMA, we can attempt to reconstruct what the original, unsmoothed velocities might have been. To obtain the values of the original input, we solve Eq.1 for \(v_t\).
\(v_t = \frac{\text {EWMA}_t - \lambda \text {EWMA}_{t-1}}{1 - \lambda}\) Eq. 2
Let’s do some testing to see if this works. In the previous example the values of the EWMA are \([0, 0.45, 0.9...]\), EWMA_0 = 0, EWMA_1 = 0.45, EWMA_2 = 0.9, therefore
\[v_1 = \frac{0.45 - 0.955 \cdot 0}{1 - 0.955} = 10\] \[v_2 = \frac{0.9 - 0.955 \cdot 0.45}{1 - 0.955} \approx 10.45\]Now, as I mentioned, the smoothing function is destructive, and the results obtained are not going to be precise. We also have to consider that an attacker might change the value of \(\lambda\), and that, in real browsers, coordinate values are truncated to the nearest integer.
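The worked example can be checked numerically. This is a small sketch of Eq. 1 applied forward and Eq. 2 applied backward, with \(\lambda = 0.955\) as in the script:

```python
lam = 0.955

def ewma(inputs, lam, start=0.0):
    # Eq. 1: EWMA_t = lam * EWMA_{t-1} + (1 - lam) * v_t
    out = [start]
    for v in inputs:
        out.append(lam * out[-1] + (1 - lam) * v)
    return out

def invert_ewma(smoothed, lam):
    # Eq. 2: v_t = (EWMA_t - lam * EWMA_{t-1}) / (1 - lam)
    return [(smoothed[t] - lam * smoothed[t - 1]) / (1 - lam)
            for t in range(1, len(smoothed))]

smoothed = ewma([10.0] * 5, lam)  # a constant 10 cm/s step input
print([round(s, 2) for s in smoothed[:3]])  # [0.0, 0.45, 0.88]
recovered = invert_ewma(smoothed, lam)      # ~10.0 at every step
```

On exact values the inversion recovers the step input perfectly; the imprecision discussed above comes from rounding (e.g. browser integer truncation), not from the algebra itself.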
Let’s apply Eq.2 to obtain the original velocity profile on a real mouse movement.
Remember: We are applying Eq.2 to the X and Y axis separately.
Fig. 13 Velocity profile for the X axis. In blue the starting velocity from the mouse movement, in orange the reconstructed original signal.
As you can see, the velocity is not exactly constant in each segment, because of the approximations, but it’s close. From this analysis, we’ve got two strategies to figure out if a mouse movement is genuine or not:
1) The first method involves taking the velocity profile we’ve analyzed (like the one shown in Fig. 13) and applying a smoothing process similar to what the mact algorithm uses. By comparing this smoothed signal with the actual mouse movements (real or synthetic), we can identify discrepancies.
2) Instead of reconstructing the smoothed signal, this method examines how much the velocity within each segment of the movement varies. Since synthetic movements tend to have segments of constant velocity, analyzing the variance within these segments can help us spot bot patterns.
Once we have extracted the metrics, we will use an ML algorithm to classify the samples as either BOTs or Humans.
In the first approach we start from the velocity input obtained in Fig. 13 (the orange line), and apply an EWMA exactly like the author of the script, obtaining a signal closely resembling the original. We then compare the predicted signal with the real mact values.
This is what a predicted signal looks like for the author’s script. Fig. 13.b In blue the BOT velocity profile, in orange the predicted BOT signal.
And this is what a predicted signal looks like for a real mouse movement. Fig. 14 In blue the real mouse velocity, in orange the predicted velocity.
As you can see the difference is very noticeable.
When comparing the two signals (predicted vs. actual), the goal is to accurately assess how similar they are to each other. Several statistical methods are available, each with its own advantages and disadvantages:
Fig. 15
Although we notice that the autocorrelation has values greater than \(1.96 / \sqrt{N}\), which indicates a pattern in the autocorrelation, the data is non-stationary, and as noted in “The Analysis of Time Series: An Introduction with R” by Chris Chatfield, the autocorrelation function is not useful in that case and should be discarded.
Another approach is cosine similarity, which measures the similarity in shape between two signals while ignoring their magnitude, a good fit for our case. Without going into too much detail, it works by computing the angle between two data vectors (vectors starting from the origin). A cosine similarity of 1 indicates that the vectors point in the same direction (high similarity), while a value of 0 suggests orthogonality (no similarity); in general the value ranges from -1 to 1, but for non-negative signals like velocity profiles it stays between 0 and 1. The higher the value, the higher the similarity.
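As a minimal sketch (not the exact implementation used later), cosine similarity between two velocity profiles reduces to a normalized dot product:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle between the two signal vectors: 1 = same shape, 0 = orthogonal.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

profile = [1.0, 4.0, 9.0, 4.0, 1.0]
scaled = [2.0 * v for v in profile]
print(cosine_similarity(profile, scaled))  # ~1.0: same shape, larger magnitude
print(cosine_similarity([1, 0], [0, 1]))   # 0.0: orthogonal signals
```

The first comparison illustrates the magnitude-invariance: doubling every velocity leaves the similarity at 1, which is exactly why this metric compares shapes rather than speeds.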
Wavelet and Gabor Coefficients: these are the best options. They offer advanced methods for analyzing non-stationary data, preserving the time information that Fourier coefficients lose, and they excel at handling data whose statistical properties change over time. However, they are more complex to implement and out of the scope of this blog.
The code for the first approach is available on my github under first_algorithm.py.
In the second approach, we do not attempt to recreate the signal; instead, we assess how closely the observed movements adhere to the characteristics of step signals. In the original script, the input velocities are exact step inputs with constant values, e.g. \(v=10cm/s, t=(0,0.5)\) and \(v=2cm/s, t=(0.5,1)\) and so on. Therefore, if the step inputs extracted using Eq.2 come from the author’s script, they should be very close to a step input. By examining the variance within these segments, we can measure the uniformity of the movements.
# Reverse the EWMA formula (Eq. 2)
x_approx = np.zeros_like(y)
for n in range(1, len(y)):
    x_approx[n] = (y[n] - a * y[n-1]) / (1 - a)

# We assume that the cluster size is 12
cluster_size = 12
variances = []
# Measure the variance of each cluster
for i in range(0, len(y), cluster_size):
    variances.append(np.var(x_approx[i:i+cluster_size]))
Note: When we’re analyzing mouse movements with our code, we’re working under the assumption that every little piece of the movement (which we’re calling a “segment”) is made up of 12 parts (or “samples”). This is just a starting point based on the original setup. However, not all mouse movement scripts work the same way—some attackers might break the movement into more or fewer parts.
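One way to soften that assumption (my own illustration, not part of the original detector) is to sweep a few candidate segment lengths and keep whichever makes the reconstructed step input look most constant:

```python
import numpy as np

def best_cluster_size(x_approx, candidates=(8, 10, 12, 16, 20)):
    """Return (size, mean_variance) for the candidate segment length
    that makes the reconstructed step input look most constant."""
    scores = {}
    for size in candidates:
        variances = [np.var(x_approx[i:i + size])
                     for i in range(0, len(x_approx), size)]
        scores[size] = float(np.mean(variances))
    best = min(scores, key=scores.get)
    return best, scores[best]

# A synthetic step signal with true segment length 12:
steps = np.repeat([10.0, 2.0, 7.0, 4.0], 12)
print(best_cluster_size(steps))  # (12, 0.0): perfectly constant segments
```

A real attacker could still pick an unusual length, but a sweep like this at least removes the hard-coded 12 from the detector.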
Now that we have extracted the features from the signals, we can build an ML model to classify the samples as real or fake mouse movements.
The dataset will be composed of the following: 10,000 samples of real mouse movements and 10,000 from the author’s mact. We generate the real mouse movements with MIMIC, which has been proven to synthesize mouse movements that are indistinguishable from real ones. Each sample in our dataset is labeled ‘0’ for authentic mouse movements or ‘1’ for synthetic ones. False positives are humans classified as BOTs, and false negatives are BOTs classified as humans.
For this task, we’ve chosen to implement a Gradient Boosting Decision Tree model. My experience suggests that Gradient Boosting tends to outperform Random Forest in scenarios involving extensive datasets and where the data’s noise level is manageable. One of its notable advantages is the ease with which the loss function can be modified, providing us the flexibility to adjust our model’s sensitivity towards false positives. In antibots it is paramount that we do not flag a legitimate user as a bot, as this would ruin the customer experience. Thus, we’ve calibrated the algorithm to prioritize the reduction of false positives, accepting the trade-off that might decrease the overall accuracy of the model.
To build the tree we use Algorithm 2, as it is the most effective.
A Decision Tree starts by examining the data—think of it as trying to find the best questions that split the data into the most informative categories. For our mouse movement data, we look at variances in speed across different segments of the movement in the X and Y velocities.
For instance, we might start with a set of variance values like [100, 120, 0, 2, 1]. The tree selects a first feature, say the first one, with a value of 100. It then uses an exhaustive search, a thorough way of testing every possible division, to figure out the best place to split the data based on this feature; e.g. if feature 1 > 90, the sample is a real mouse movement, otherwise it is a fake one.
After finding a good split, the tree doesn’t stop there. It continues to add more splits (or “nodes”) by picking new features and finding the best split criteria, repeating this process. The goal is to keep making these splits until adding more doesn’t significantly reduce the error in telling apart real and fake movements.
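The exhaustive-search split described above can be sketched in a few lines (a toy illustration on a single feature using squared error; real libraries do this far more efficiently):

```python
import numpy as np

def best_split(values, labels):
    """Exhaustive search for the threshold minimizing total squared error.

    values: one feature (e.g. variance of a velocity segment).
    labels: 0 for real movements, 1 for synthetic ones.
    """
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best = (None, np.inf)
    for i in range(1, len(v)):
        threshold = (v[i - 1] + v[i]) / 2
        left, right = y[:i], y[i:]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[1]:
            best = (threshold, err)
    return best

# High variances for real movements (label 0), near-zero for bots (label 1):
threshold, err = best_split([100, 120, 0, 2, 1], [0, 0, 1, 1, 1])
print(threshold, err)  # 51.0 0.0: a clean split between bots and humans
```

Any threshold between 2 and 100 separates the two classes here; the midpoint convention picks 51, consistent with the “feature 1 > 90” style of rule in the text.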
One of the ways to calculate the error is using cross-entropy, which measures how close the predicted label is to the actual label (bot or human). The cross-entropy is defined as:
\[\text{Cross-Entropy} = -\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]\]Note: this is not exactly precise; what we actually want is to maximize the split gain \(\text{Gain} = L_\text{before} - (L_\text{left} + L_\text{right})\), i.e. the reduction in loss from a split, but we won’t go into too much detail here. If you want to know more, I suggest you watch this.
Where \(y_i\) is the actual label and \(\hat{y}_i\) is the predicted label of the model. But for our task, we are going to use an easier loss function, the MSE
\[\text{MSE} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\]Now, in our approach to fine-tuning the machine learning model, we incorporate a strategy to significantly penalize the misclassification of genuine users as BOTs.
When our model incorrectly identifies a real mouse movement (labeled as \(y_i=0\)) as fake (predicts \(\hat{y}_i>0.5\)), we amplify the error associated with this mistake by a factor of 100. Essentially, it tells the model that misclassifying real movements as fake is a significant mistake, influencing it to opt for splits that minimize this type of error, even if those splits might increase the overall accuracy.
Another interesting approach, untested in this blog, is to directly modify the boosting algorithm. The concept of boosting in machine learning involves iteratively adjusting the focus of training onto samples that previous iterations have classified incorrectly. As in the previous case, we are giving more significance to the data we have misclassified.
This is the Adaboost algorithm that we can modify:
\[J_m = \sum_{n=1}^N w_n^{(m)} I(y_n\not=\hat{y}_n)\] \[\epsilon = \frac{J_m}{\sum_n w_n^{(m)}}\] \[\alpha = \log\left(\frac{1 - \epsilon}{\epsilon}\right)\] \[w_n^{(m+1)} = w_n^{(m)} e^{\alpha I(y_n\not=\hat{y}_n)}\]This is the AdaBoost algorithm, which iteratively recalculates the weights of the samples for each decision tree. Here \(J_m\) is the weighted error of the model, \(w_n^{(m)}\) is the weight of sample \(n\) in decision tree \(m\), and \(I(y_n\not=\hat{y}_n)\) is an indicator function equal to 0 if \(y_n=\hat{y}_n\) (no error); otherwise it is equal to 1 and the weight is increased by a factor of \(e^\alpha\). For our case, where we want to assign more relevance to misclassified real mouse movements, we could replace the \(I\) function with \(\Gamma\), where
\[\Gamma(y_i, \hat{y}_i) = \begin{cases} 10 & \text{if } y_i=0 \text{ and } \hat{y}_i>0.5 \\ 1 & \text{if } y_i \not = \hat{y}_i \\ 0 & \text{otherwise} \end{cases}\]Note: following the dataset convention above, label 0 is an authentic movement, so the first case penalizes real users flagged as bots. This idea is completely untested, and I came up with it while studying the AdaBoost algorithm. However, I thought it was interesting to share a different approach to the problem.
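To make the idea concrete, here is an equally untested sketch of one reweighting round. I keep the plain indicator in the error \(\epsilon\) so it stays in [0, 1], and apply \(\Gamma\) only in the weight update, penalizing real movements (label 0, per the dataset convention above) that were flagged as bots:

```python
import math

def update_weights(weights, y_true, y_scores, penalty=10.0):
    """One reweighting round of the modified AdaBoost idea.

    Labels: 0 = real movement, 1 = synthetic.
    Gamma replaces the indicator only in the weight update here, so the
    error rate eps stays well-defined; this is one possible reading of
    the (untested) idea in the text.
    """
    preds = [1 if s >= 0.5 else 0 for s in y_scores]
    wrong = [int(p != y) for p, y in zip(preds, y_true)]
    eps = sum(w * e for w, e in zip(weights, wrong)) / sum(weights)
    alpha = math.log((1 - eps) / eps)  # assumes 0 < eps < 1

    def gamma(y, p):
        if y == 0 and p == 1:
            return penalty  # real user flagged as bot: the costly mistake
        return float(p != y)

    return [w * math.exp(alpha * gamma(y, p))
            for w, y, p in zip(weights, y_true, preds)]

w = update_weights([1.0] * 10,
                   y_true=[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                   y_scores=[0.9, 0.1, 0.1, 0.1, 0.1, 0.8, 0.8, 0.8, 0.8, 0.2])
# The false positive (index 0) ends up with a far larger weight than the
# false negative (index 9); correctly classified samples keep weight 1.
```

With two mistakes out of ten, \(\alpha = \log 4\), so the false negative's weight becomes 4 while the false positive's weight becomes \(4^{10}\): subsequent trees would focus overwhelmingly on not flagging that real user again.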
There are many other ways to improve a gradient boosting tree by tuning its parameters, but we will not cover them here.
Luckily, although this might seem very complex, the python library LightGBM makes it easy to customize the loss function. You can find an interesting walkthrough on how to do this here. The full code is available on my github.
First we look at an implementation without the custom MSE function
# Baseline model with the default binary objective
gbm = lightgbm.LGBMClassifier()
gbm.set_params(**{"objective": "binary"})
gbm.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
)
y_pred = gbm.predict(X_valid)
accuracy = accuracy_score(y_valid, y_pred.round())
print("Accuracy: %.2f%%" % (accuracy * 100.0))
y_pred_binary = np.where(y_pred >= 0.5, 1, 0)
conf_matrix = confusion_matrix(y_valid, y_pred_binary)
print(conf_matrix)
Which results in the following output
Accuracy: 99.10%
[[2740 36]
[ 16 3005]]
Our model distinguished the fake mouse movements from the real ones with 99.10% accuracy. This is an excellent result; however, we got 36 false positives, meaning 36 real mouse movements were classified as fake. This is not acceptable: out of the 2776 real movements in the validation set, that is a 36/2776 ≈ 1.3% chance of incorrectly flagging a legitimate user as a bot, around 13 users every 1000. We can do much better than this. Let’s try the custom MSE function.
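For reference, the rate of flagged legitimate users can be read directly off the first row of the confusion matrix. This sketch computes the rate over real samples only, which is stricter than dividing by the whole validation set:

```python
conf = [[2740, 36], [16, 3005]]  # rows = true class (0 = real, 1 = fake)
tn, fp = conf[0]
fpr = fp / (tn + fp)  # fraction of real users flagged as bots
print(round(fpr, 4))  # 0.013
```

The same two lines applied to the matrix from the custom-loss run further below give the improved rate discussed there.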
def custom_asymmetric_train(y_true, y_pred):
    try:
        residual = (y_true - y_pred).astype("float")
    except AttributeError:
        # Workaround for a lightgbm quirk: y_pred may arrive wrapped,
        # with the raw values under .label
        residual = (y_true - y_pred.label).astype("float")
    # Gradient of the MSE (y_true - y_pred)^2;
    # https://www.geeksforgeeks.org/ml-gradient-boosting/ explains the negative sign.
    # Negative residuals (real movements pushed towards the "fake" class)
    # are penalized 100 times more.
    grad = np.where(residual < 0, -2 * 100.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 100.0, 2.0)
    return grad, hess

# Define the custom loss function
gbm.set_params(**{'objective': custom_asymmetric_train})
# Define the custom loss function
gbm.set_params(**{'objective': custom_asymmetric_train})
And this time the output is
Accuracy: 93.86%
[[2774 2]
[ 354 2667]]
The accuracy has decreased, but we have only 2 false positives, meaning a 2/2776 ≈ 0.07% chance of incorrectly flagging a legitimate user as a bot, around 7 users every 10,000. This is a huge improvement, and we can improve further by tuning the model’s parameters, increasing the weight of misclassified data, or enlarging the training set, as we have used only 20,000 samples.
Results with Random Forest are also shown below
Accuracy: 98.00%
[[1950 34]
[ 44 1837]]
Well, now the real race begins. Even if the mact now has a really low success rate, a small percentage of generated mouse data can still pass as legitimate.
In most antibot implementations, it is very easy to understand which particular request was flagged as fraudulent, and the attacker can log the session and store the mouse data that passes.
If an attacker is skilled, they can identify the features that made the algorithm successful, based on the difference between flagged and passed data. Alternatively, bot developers can leverage ML. They can train a model to distinguish between the synthetic mouse data that gets flagged as fraudulent and the data that passes as legitimate. Once this ML model is in place, it can be used to pre-screen all generated mouse movements. Only those predicted to be classified as legitimate by the antibot system would then be sent.
In wrapping up, we’ve seen how uncovering and analyzing the inner workings of antibot protection isn’t as daunting as it might seem. This blog highlights why it’s so important for any antibot team to have experts in Threat Intelligence. They are the key to digging up the kind of info that can make or break security measures.
Also, let’s talk about Reverse Engineering. It’s not just about having fun hacking systems. This skill is a game-changer for making systems stronger. People who are good at reverse engineering can often understand and navigate through complex code much faster than many traditional software developers.
So, to all the bot devs and antibot defenders out there, remember: the folks who test the limits of your systems aren’t just “fraudsters”. They could end up being your most valuable players. Keeping an open mind to the insights gained from attempting to bypass security can lead to more robust and secure systems for everyone.
In this blog we delve into the fascinating realm of Anti-Bot Biometric Protections and explore novel approaches to creating models that mimic human traits. Think of anti-bots as composed of multiple layers of security. Each layer—from basic measures to deter simple bots to advanced ‘fingerprinting’ techniques—adds its own unique set of challenges for attackers. Today’s blog will dive deep into the fingerprint layer, exploring how it distinguishes real users from bots by analyzing unique patterns and behaviors.
So what exactly is this ‘fingerprinting’? Fingerprinting doesn’t just involve your literal fingerprint. It’s a sophisticated technique that identifies unique behavioral patterns—like the rhythm of your keystrokes or the specific way you move your mouse. You heard me right, Anti-bots are able to identify your browser session based on how you move your mouse. This offers an additional layer of identity verification, making it exceptionally difficult for bots to mimic human behavior. If anti-bots detect a repeated pattern of mouse movements, they can block the session and prevent the bot from accessing the website.
Using algorithms based on Fourier Transforms, we’ve developed models that perfectly mimic human behaviour. We’re talking about heuristics and unsupervised machine learning that can replicate human-like keyboard events so convincingly, even the bots get confused.
The blog begins by discussing Fourier Transforms, which serve as the foundation of our analysis. We explore their applications to keyboard biometric modeling, where we aim to infer a biometric model from a small training set to mimic the patterns observed in human behavior during keyboard events. Our initial exploration leads to the realization of the KeyCAP equation, a heuristic model that accurately mimics keyboard data.
As we delve deeper into the analysis, we encounter challenges that necessitate alternative approaches. We introduce unsupervised machine learning techniques, such as Non-Negative Matrix Factorization (NMF) and Gaussian Mixture Models (GMM), to identify and reproduce significant features in the keyboard biometric model and synthesize new valid data.
Throughout our exploration, we also address current Anti-Bot limitations and propose ideas to improve their effectiveness.
For those unfamiliar with Fourier Transforms, what I am about to introduce could potentially revolutionize the way you look at things. This might well be one of the most important pieces of knowledge I have come across in all my years of engineering coursework. Brace yourself as we venture into another dimension—the frequency domain.
As humans, we experience the world in the time domain, every action we perform is happening in time, a disorganized, intricate, chaotic space that can seem perplexing. Chaos often instills fear within us, as we struggle to find order amidst the apparent randomness. If we were to analyze any dataset in the time domain, the trends and patterns may appear to be random and perplexing, making it challenging to derive meaningful insights or make sense of the data.
However, what if our perception were skewed, and chaos is merely a consequence of observing things from the wrong standpoint?
Enter the remarkable concept of the Fourier Transform. It presents a pathway to bring order to chaos, offering a new perspective that uncovers the hidden structure within. By examining data in the frequency domain, we gain the ability to discern patterns and structures that may remain elusive in the time domain.
Let us first consider a simple example to build an intuitive understanding of Fourier transforms. Please note that this is purely intended to offer intuition and does not accurately represent the real-world application of Fourier transforms.
Imagine receiving a letter from a stranger that has been damaged in transit. The letter was torn apart, and all you are left with is a bunch of randomly mixed letters that don’t make sense in that order.
Fig.1
Attempting to decipher this letter by trying different combinations and anagrams could take hours, if not days, without producing the desired outcome.
Now, let's apply the Fourier Transform to the sample data (each letter is treated as a single sample).
Fig.2
As a result, the previously unordered letters are now ordered and spell out the message ‘HI BOTS’. This example gives you a sense of the power of this tool. We started from a bunch of chaotic samples, ordered them, and obtained a clearer representation of the message.
Now let us examine a more concrete example.
Imagine you are a security engineer monitoring the turbulence experienced by passengers on a plane. You notice that the airplane is experiencing constant shaking and vibration, which is caused by the friction of the air against the plane’s surface. To better understand this turbulence, you create a graph with the turbulence amplitude on the y-axis and time on the x-axis.
Fig.3
Suddenly, the passengers experience a spike in turbulence, something is wrong (look at the new y-axis). An Attacker is trying to hijack the plane by using a power beam against it to destabilize it and cause a crash.
The new turbulence plot is shown below.
Fig.4
It’s your job to stop the hijackers and protect the passengers before the crash. You analyze the turbulence signal to identify any abnormal activity, but the random-looking signal makes it difficult to spot any specific patterns.
This is where the Fourier Transform comes to your aid.
Fig. 5
By applying the Fourier Transform to the turbulence signal, you obtain two relevant pieces of information: the Amplitude Spectrum and the Phase Spectrum. In our example, we are mainly interested in the Amplitude Spectrum, which provides information about the amount of energy expressed by the signal, while the phase provides information on when the energy is applied.
The Fourier Transform of the signal allows you to analyze the frequency content of the turbulence signal, helping you identify patterns in the hijacker attack. In the image above, the signal in the frequency domain is reduced to four energy spikes, with a clear structure.
If you are not familiar with the concept of frequency this might all look very confusing at first.
It's important to note that what we are applying to the signal is a Discrete Fourier Transform (DFT), or its fast implementation, the Fast Fourier Transform (FFT), since the turbulence values are not continuous in time (we have turbulence values at discrete intervals).
Frequency refers to how often a particular event or pattern occurs within a given time interval. The unit of measurement for frequency is hertz (Hz), which represents the number of occurrences per second.
To illustrate this concept, imagine pounding your hand on a desk at a regular pace. If you touch the desk every 1 second, the frequency of your hand hitting the desk would be 1 Hz. This means that the event (your hand hitting the desk) occurs once every second.
Now, let’s say you increase the pace at which you hit the desk. Your hand starts moving faster, and you are now touching the desk every 0.5 seconds. In this case, the frequency has doubled to 2 Hz. This means that the event (your hand hitting the desk) now occurs twice within a second.
In the hijacker example, the frequency refers to the rate at which the signal generated by the power beams repeats itself.
By decomposing the signal into its fundamental frequencies, we can detect which frequencies are most present and associate each frequency with a particular power ray. For instance, if we observe a significant component at \(10^{19}\) Hz, which corresponds to the frequency of gamma rays, we may deduce that the attackers are using gamma rays to hijack the plane and take countermeasures against them.
In particular, in Fig. 5, four spikes are shown, indicating that the hijackers are using four different beams working simultaneously at different frequencies. The background noise on all frequencies is attributed to air friction.
The great news is that once we identify the frequencies at which the hijackers are transmitting their signals, we can take measures to stop the attack by making our system (the plane) resistant to those specific frequencies.
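The detection step described above can be sketched in a few lines of Python using NumPy's FFT. The sampling rate, noise level, and beam frequencies (3, 7, 12, and 18 Hz) are made-up values for illustration:

```python
import numpy as np

fs = 100                        # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)    # 10 seconds of turbulence readings

# Background turbulence (air friction) plus four attack beams at
# made-up frequencies of 3, 7, 12 and 18 Hz
rng = np.random.default_rng(0)
noise = 0.3 * rng.standard_normal(t.size)
beams = sum(np.sin(2 * np.pi * f * t) for f in (3, 7, 12, 18))
signal = noise + beams

# Amplitude spectrum over the positive frequencies (the DFT of the signal)
spectrum = np.abs(np.fft.rfft(signal)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The four largest peaks reveal the beam frequencies
peak_freqs = sorted(float(f) for f in freqs[np.argsort(spectrum)[-4:]])
print(peak_freqs)  # → [3.0, 7.0, 12.0, 18.0]
```

Even though the beams are buried in broadband noise in the time domain, the frequency domain exposes them as four clean spikes.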
The moment we were all waiting for. If those examples seemed impractical or not useful, do not worry: we will now apply the knowledge we just learned to build a basic keyboard synthesizer.
In this section, we present a generative model for synthesizing human-like keyboard data. Our goal is to establish a correct timing pattern between each letter typed, referred to as the velocity profile of the keystrokes.
This model aims to capture the characteristics of different users, ensuring that the synthesized data cannot be linked to a single individual, therefore creating complications in Anti-Bot detection systems.
While we focus on creating a generative model, capable of synthesizing unique and correct data at every generation, Anti-bots, on the other side, are fighting with discriminative models. These neural networks aim to distinguish between real data supplied by legitimate users and spoofed data provided by bots.
Discriminative models face much more difficult challenges compared to generative models, as they not only require a huge amount of data to be trained on, but they also must consider and classify every nuanced and unusual user behavior. Making a mistake in classification can result in blocking legitimate user sessions, which is unacceptable for any effective anti-bot system.
If the training phase were to be executed properly, the discriminator learns the features of human traits and becomes able to distinguish “bad” features associated with bot activity from legitimate users.
This blog presents an alternative to the popular harvesting method used by bots.
Harvesting involves supplying real user-harvested data to anti-bot protections, instead of synthesizing new data from a model.
While this approach may be suitable for static challenges like canvas fingerprinting, it is inadequate for dynamic challenges that require proof of legitimacy.
For example, in a Mouse Biometry Challenge, it is becoming more popular for anti-bot systems to check mouse click positions on the HTML DOM. If a significant number of clicks are submitted on “null” elements, instead of existing elements such as a button, it may indicate that the movement was synthesized.
Harvesting becomes useless because the attackers would need to collect an impractical number of movements to click on every possible position on the screen. Moreover, rescaling or translating harvested data to click on specific elements would not make a difference, since the biometric profile remains the same (i.e., the velocity, which is the most distinguishing feature, does not change).
In our quest to address dynamic challenges, we explore the development of a dynamic keyboard synthesizer.
When creating a synthesizer, the initial step is to preprocess the data and identify the most important features that provide the most variance, in other words, the most distinguishing traits in a person’s keyboard writing.
1) What are the most discriminant features in keyboard writing?
In keyboard writing, the primary discriminant feature is the typing speed: our velocity profile. Unlike other biometric traits such as mouse movements, which involve multiple factors (trajectory, velocity, timing, timestamps, and so on), keyboard writing predominantly relies on the speed at which keys are pressed.
2) What is the variability of this feature?
To understand the variability of typing speed, we can visualize time-domain plots of captured keyboard sessions. We plot the velocity on the y-axis and assume a linear space on the x-axis from 0 to \(n\) (\(n\) representing the total number of samples).
In this case, velocity is calculated as the reciprocal of the time difference between consecutive key presses.
\[v_i = \frac{1}{time\_difference_i}\]
Fig.7
Fig.8
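As a minimal sketch, the velocity profile can be computed from a list of keydown timestamps like this (the timestamps are hypothetical):

```python
import numpy as np

# Hypothetical keydown timestamps (seconds) for a short typing session
timestamps = np.array([0.00, 0.12, 0.31, 0.40, 0.55, 0.83, 0.91])

# One velocity value per key-to-key transition: the reciprocal of the
# time difference between consecutive key presses
velocity = 1.0 / np.diff(timestamps)
```

A session of \(n+1\) key presses yields \(n\) velocity samples, which we then plot against their sample index.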
At first glance, the trend seems almost random, posing a challenge to devise an accurate equation for reproducing it.
We disregard unrealistic solutions that randomize typing speed or sample velocities from a Gaussian distribution. Instead, our goal is to generate a realistic and reliable biometric model that accurately mimics keyboard typing speed.
To address this, we need tools that highlight the trend characteristics and bring order and clarity to the data. Hence, we turn to the frequency domain by applying the Fourier Transform (FFT). For this purpose, we use PYNUFFT, as it provides a built-in interpolator that makes for easy visualization of the extracted frequency profile.
Fig. 9
Fig. 10
Isn’t this mesmerizing? The FFT has brought clarity to the data!
Let's first understand what the plots represent. The amplitude spectrum provides information on how the typing speed is distributed. Peaks in the spectrum indicate that certain frequencies are more present than others, with higher frequencies associated with higher typing speeds (i.e., shorter timing differences between key presses). It is important to note that the plot is normalized between 0 and 1 Hz, preventing us from inferring the actual frequency values.
For instance, in Fig.10, we observe a peak at 0.5 Hz, indicating a significant 0.5 Hz component in the user's writing style. You might have also noticed that the shape of the curve is a Gaussian bell, which aligns with our expectations: we expect some variance in the user's typing speed, since they would not maintain a constant velocity every second but rather oscillate around certain values. Keep this in mind, because it is a fundamental assumption we make about the data distribution.
Multiple peaks suggest different typing mechanisms for words, which means that some words are typed faster than others.
The phase component of the Fourier Transform provides insight into when we write faster and when slower, or rather, when the different amplitude peaks appear in time.
3) How do we mimic human biometry?
To replicate the trends observed in the amplitude and phase of typing speed spectrum, we have two main alternatives: a heuristic approach and an unsupervised machine learning model. In this section, we explore both approaches, highlighting their strengths and weaknesses.
A heuristic is a mathematical model that approximates the original biometric function. Developing heuristics is a preferred approach due to the level of control the derived model offers. With a heuristic approach, we can directly manipulate the equation and gain a deep understanding of the underlying human biometric traits involved. This enables easier improvement and refinement compared to using machine learning models.
However, it requires extensive research efforts to explore relevant literature and delve into topics like NeuroPhysiology. Additionally, a solid mathematical foundation is necessary to derive, optimize, and stabilize the model.
This section presents the mathematical derivation of the KeyCAP equation, a heuristic model that mimics human keyboard typing speed.
We start with an unknown biometric model, represented by the function \(f(t)\), which describes the typing speed behavior of the brain. Our objective is to infer the original \(f(t)\) model using the small training set we collected and plotted in the previous section.
Based on our analysis, we noticed that certain frequencies in the amplitude plot exhibit a Gaussian trend. As a result, we can assume that the amplitude plot \(|F(\omega)|\), where \(\omega\) represents the frequency, is a sum of multiple Gaussian curves, each shifted to a different frequency.
\[|F(\omega)| = \sum_{i} \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(\omega - \mu_i)^2}{2\sigma_i^2}}\]By expanding the summation term by term, we can write
\[|F(\omega)| = \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(\omega - \mu_i)^2}{2\sigma_i^2}}+\frac{1}{\sigma_{i+1} \sqrt{2\pi}} e^{-\frac{(\omega - \mu_{i+1})^2}{2\sigma^2_{i+1}}}+...\]To obtain the original signal \(f(t)\), we take the Inverse Fourier Transform
\[f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} |F(\omega)| e^{-j\phi(\omega)}e^{j\omega t} \, d\omega\]Therefore, we can take the Inverse Fourier Transform of each Gaussian term and sum the results to obtain \(f(t)\).
Consider a singular Gaussian curve, and assume its phase \(\phi(\omega)\) to be zero. Also assume \(\mu=0\), so that the curve is centered and not shifted, and call its Inverse Fourier Transform \(g(t)\).
\[g(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty}\frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{\omega^2}{2\sigma_i^2}}e^{-j\omega t} \, d\omega\]For this proof, we refer to Applied Partial Differential Equations with Fourier Series and Boundary Value Problems, Fourth Edition (ISBN 0-13-065243-1) by Richard Haberman, Chapter 10.3.3.
Disregarding the constant terms, which we will add later on, we can rewrite the equation as
\[\begin{array}{l}g(t)=\int_{-\infty}^{\infty} e^{-\alpha \omega^2} e^{-j \omega t}\, d \omega \\ g^{\prime}(t)=\int_{-\infty}^{\infty}-j \omega\, e^{-\alpha \omega^2} e^{-j \omega t}\, d \omega\end{array}\] \[=\frac{j}{2 \alpha} \int_{-\infty}^{\infty} \frac{d}{d \omega}\left(e^{-\alpha \omega^2}\right) e^{-j \omega t}\, d \omega\]By integrating by parts, with
\[u=e^{-\alpha \omega^2}, \quad v=e^{-j \omega t}, \quad d v=-j t\, e^{-j \omega t}\, d \omega\] \[g^{\prime}(t)=\frac{j}{2 \alpha}\left(\left.e^{-\alpha \omega^2} e^{-j \omega t}\right|_{-\infty}^{\infty}+j t \int_{-\infty}^{+\infty} e^{-\alpha \omega^2} e^{-j \omega t}\, d \omega\right)=\frac{j}{2 \alpha}\, j t\, g(t)\]which yields
\[g^{\prime}(t)=-\frac{t}{2 \alpha} g(t)\]Integrating both sides
\[\int \frac{g^{\prime}(t)}{g(t)}\,dt=-\int \frac{t}{2 \alpha}\,dt\] \[\ln(|g(t)|)=-\frac{t^2}{4 \alpha}+C\]By isolating \(g(t)\)
\[g(t)=Ce^{-\frac{t^2}{4 \alpha}}\]We calculate C using the initial condition
\[g(0)=\int_{-\infty}^{+\infty} e^{-\alpha \omega^2} d\omega\]and by applying integration by substitution
\(z =\sqrt{\alpha} \omega\), \(dz =\sqrt{\alpha}d\omega\)
yields
\[g(0)=\frac{1}{\sqrt{\alpha}}\int_{-\infty}^{+\infty} e^{-z^2}\, dz = \sqrt{\frac{\pi}{\alpha}}=C\]and finally
\[g(t)=\sqrt{\frac{\pi}{\alpha}} e^{-\frac{t^2}{4 \alpha}}\]by comparing \(g(t)\) to a Gaussian function we set
\[4\alpha = 2\sigma^2\]and by adding back the constant term \(\frac{1}{\sqrt{2\pi}}\) we finally obtain
\[g(t)=\frac{1}{2\alpha} e^{-\frac{t^2}{4 \alpha}}\]which is the Inverse Fourier Transform of the frequency-domain Gaussian: itself a Gaussian function in the time domain, centered at zero.
Because the Gaussian Function in the frequency domain was shifted, we apply the frequency shifting property of Fourier Transform and obtain
\[F(\omega-\omega_0) \leftrightarrow f(t)e^{j\omega_0t}\]obtaining our shifted Gaussian curve
\[f(t)=\frac{1}{2\alpha} e^{-\frac{t^2}{4 \alpha}}e^{j\omega_0t}\]However, upon closer examination of Fig.10, we observe that the Gaussian curves appear both in positive and negative frequencies. Therefore, we cannot ignore the negative frequency components. Thus, our revised model becomes:
\[f(t)=\frac{1}{2\alpha} e^{-\frac{t^2}{4 \alpha}}e^{j\omega_0t} + \frac{1}{2\alpha} e^{-\frac{t^2}{4 \alpha}}e^{-j\omega_0t}\]Because we are working in the time domain, we will disregard any imaginary components, resulting in the simplified form
\[f(t)=\frac{1}{2\alpha} e^{-\frac{t^2}{4 \alpha}}\cos(\omega_0t) + \frac{1}{2\alpha} e^{-\frac{t^2}{4 \alpha}}\cos(-\omega_0t)\] \[=\frac{1}{\alpha} e^{-\frac{t^2}{4 \alpha}}\cos(\omega_0t)\]Because we are uncertain about the peak value in the frequency domain, we introduce an additional degree of freedom to the model, the amplitude \(D\). The revised equation becomes:
\[f(t)=\frac{D}{\alpha} e^{-\frac{t^2}{4 \alpha}}cos(\omega_0t)\]By summing all the Gaussian contributions, we arrive at our heuristic equation, which we call the KeyCAP equation:
\[f(t) = \Sigma K(D_i,\omega_{0i},\alpha_i, t) =\sum_{i=0}^{N}{\frac{D_i}{\alpha_i} e^{-\frac{t^2}{4 \alpha_i}}\cos(\omega_{0i}t)}\]It is important to note that our approach is based on two key assumptions for the model to be effective:
1) the amplitude spectrum \(|F(\omega)|\) is a sum of Gaussian curves;
2) the phase spectrum \(\phi(\omega)\) is zero.
Although the second assumption is incorrect, the frequency-shifting terms we introduced in our heuristic model result in a phase that closely approximates that of the original unknown biometric model.
Fig. 11
Fig. 12
For this test case, we set \(\alpha=2\), \(n=22\), \(D\in[0,1]\), with \(\omega_{0i}=0.5i\). The model's behavior accurately matches the harvested samples, and its Fourier spectrum accurately mimics the unknown biometric model, validating the KeyCAP model.
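As a minimal sketch, the KeyCAP equation is straightforward to implement; the parameter values below mirror the test case above (\(\alpha=2\), \(n=22\), \(D\in[0,1]\), \(\omega_{0i}=0.5i\)), while the random seed and time grid are arbitrary choices:

```python
import numpy as np

def keycap(t, D, omega0, alpha):
    """The KeyCAP equation: a sum of Gaussian-windowed cosine terms."""
    t = np.asarray(t, dtype=float)
    total = np.zeros_like(t)
    for d, w, a in zip(D, omega0, alpha):
        total += (d / a) * np.exp(-t**2 / (4 * a)) * np.cos(w * t)
    return total

# Parameters mirroring the test case: alpha = 2, n = 22,
# D drawn from [0, 1], omega_0i = 0.5 * i
rng = np.random.default_rng(42)
n = 22
D = rng.uniform(0, 1, n)
omega0 = 0.5 * np.arange(n)
alpha = np.full(n, 2.0)

t = np.linspace(-10, 10, 500)
velocity = keycap(t, D, omega0, alpha)
```

Each call with freshly drawn \(D\) values produces a new, unique velocity profile with the same spectral structure.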
Tuning the parameters involves creating a Controller for our generated model, which can select the appropriate parameter values for \(\alpha, n, D, \omega_0\). This task can be challenging as certain combinations of parameter values may result in model instability.
For instance, certain parameter values might result in the model typing too fast or too slow compared to human behavior. To address this, we need to carefully analyze edge cases and establish constraints on the parameters.
Let’s consider a scenario where we want to set a minimum velocity for our model, expressed as the constraint:
\[\Sigma K(D_i,\omega_{0i},\alpha_i, t) \geq K\]This constraint ensures that the model doesn't type too slowly, with long pauses between key presses. Solving these constraint satisfaction problems is sometimes not feasible, and therefore we need to explore alternatives.
One of the easiest ways to solve these kinds of problems is through trial and error combined with optimization.
Initially, we randomly generate the parameters and evaluate the model at each time instant \(t_i\). If the constraint is not met, the controller adjusts the parameters by either generating new random values or employing optimization algorithms like Gradient Descent to find the best parameters quickly. The optimization process involves iteratively refining the parameters until the constraint is satisfied.
The process looks like this
Fig. 13
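The trial-and-error variant of this loop can be sketched as follows, restating the KeyCAP equation from the previous section; the constraint threshold \(K\) and the retry budget are made-up values for illustration:

```python
import numpy as np

def keycap(t, D, omega0, alpha):
    # The KeyCAP equation: a sum of Gaussian-windowed cosines
    return sum((d / a) * np.exp(-t**2 / (4 * a)) * np.cos(w * t)
               for d, w, a in zip(D, omega0, alpha))

def controller(t, K, n=22, max_tries=2000, seed=0):
    """Trial-and-error controller: redraw random parameters until the
    constraint  sum_i K(D_i, omega_0i, alpha_i, t) >= K  holds at every t."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        D = rng.uniform(0, 1, n)        # amplitudes in [0, 1]
        omega0 = 0.5 * np.arange(n)     # omega_0i = 0.5 * i
        alpha = np.full(n, 2.0)
        if (keycap(t, D, omega0, alpha) >= K).all():
            return D, omega0, alpha
    raise RuntimeError("no parameter set satisfied the constraint")

t = np.linspace(-2, 2, 200)
K = -1.0                                 # made-up minimum-velocity threshold
D, omega0, alpha = controller(t, K)
v = keycap(t, D, omega0, alpha)          # satisfies v >= K everywhere
```

A gradient-based optimizer could replace the blind redraw step, but rejection sampling is often good enough when the constraint is loose.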
Once we determine the appropriate parameters, our model is ready to be deployed. If properly configured, it poses a significant challenge to anti-bot systems, as the generated data appears valid to them.
Unfortunately for Anti-Bots, if the heuristic model accurately mimics the biometric profile of a human trait, it becomes extremely challenging, if not impossible, to differentiate between synthesized data and legitimate user data.
However, there might be a potential solution.
We start by assuming that attackers possess heuristic models capable of reproducing the neuromotor data generated by the human Central Nervous System (CNS).
Unlike a Controller (represented by the violet box in the image), which randomly generates the parameters for a heuristic model, the CNS generates these parameters in a pattern influenced by past experience. A person's CNS learns from past experience how to generate the parameters that the neuromotor equation follows. We can think of a person who has a favorite writing style: the CNS generates the input parameters for the neuromuscular system to follow that particular style.
By analyzing a substantial amount of data harvested from a single individual, it may be possible to unveil an underlying pattern among the CNS-generated parameters. The idea is to train a classifier to recognize this pattern and determine whether a given sample was produced by the same user or not, albeit with a certain margin of error (FAR). This is often referred to as a Behavioral model.
Therefore, if a BOT submits periodically varying biometric data during its session, it may be feasible to tell it apart from a real user, because no CNS pattern would be unveiled and all the data would look significantly different. If we combine different biometric traits (mouse, gyroscope, keyboard, and so on), our chances of detecting an attacker improve.
However, this hypothesis has limitations. Training a classifier to identify such differences requires a significant amount of data, especially in the case of mouse events, where approximately one minute of session data would be necessary to achieve an Equal Error Rate (EER) of 2% [On Mouse Dynamics as a Behavioral Biometric for Authentication, Zach Jorgensen and Ting Yu].
Unless more advanced classifier systems are developed, heuristic models will likely continue to hold an advantage over existing anti-bot measures.
In some cases, developing a heuristic model to synthesize human traits can be a tedious task.
For this reason, in this section, I present a different approach that employs unsupervised machine learning models to generate human-like biometric data. By utilizing machine learning techniques, the model can automatically identify and select the most important biometric features for us, without the need to come up with a heuristic ourselves.
The general idea is to extract the features of the phase and amplitude spectrum, and modify them to obtain different results while still keeping the signal characteristics.
However, there are certain limitations associated with this approach. Unlike heuristics, where we have direct control over parameter tuning, in the case of unsupervised machine learning, we do not have access to the exact parameters that define the biometric model as they are unknown.
If the model performs poorly, we might need to retrain the ML model entirely. Additionally, since we do not have a precise mathematical formula to guide the generation process, the generated data may exhibit variations and differences compared to the real biometric model.
In this section, we explore an approach that utilizes Non-Negative Matrix Factorization (NMF) to mimic the amplitude frequency spectrum and Gaussian Mixture Models (GMMs) for the phase. Additionally, we propose a theoretical approach using Linear Regression as a generative model. It's important to note that these are not the only generative models available; other techniques such as Clustering, Principal Component Analysis (PCA), Independent Component Analysis (ICA), Autoencoders, and Neural Networks can also be employed, although they will not be covered here.
To gain more knowledge on this topic I recommend:
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer (2006)
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 4th Edition, Pearson (2020)
Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
NMF (Non-Negative Matrix Factorization) is a technique that decomposes an input data matrix \(V\) into two matrices of \(N\) components, \(W\) and \(H\), where \(W\) represents the features and \(H\) represents the weights, i.e. the contribution of each feature to the input data. Mathematically, this can be expressed as \(V = WH\).
Consider a signal plot
Fig. 14
And consider the plot of its NMF decomposition with N = 3
Fig. 15
with \(H=[3.22231416, 0.73094897, 1.78781614]\)
The plots display the extracted features of the original signal. Although they may appear different, if we multiply the feature matrix \(W\) by \(H\), we obtain a reconstructed signal that closely resembles the original, within a margin of error.
Now, what happens if we modify the weight matrix \(H\)?
Theoretically, changing the weights \(H\) would result in a different signal, since it assigns a different importance to each feature in the matrix \(W\).
In the next test let’s set \(H[0] = 3\), and reconstruct the signal
Fig. 16
As shown in the corresponding plot, we obtain a different signal that still retains the characteristics of the original. The impact of weight changes becomes less significant as we increase the number of features \(N\) in the NMF model.
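The decompose-perturb-reconstruct cycle can be sketched with scikit-learn's NMF implementation. One naming caveat: in scikit-learn, `components_` holds the features and `fit_transform` returns the per-sample weights, which is the opposite of the \(W\)/\(H\) convention used in the text. The dataset below is synthetic:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Stand-in dataset: 10 synthetic amplitude spectra over 50 frequency bins,
# each a non-negative mixture of 3 hidden Gaussian bumps plus small noise
bins = np.linspace(0, 1, 50)
bumps = np.stack([np.exp(-(bins - c) ** 2 / 0.005) for c in (0.2, 0.5, 0.8)])
V = rng.uniform(0.5, 2.0, (10, 3)) @ bumps + 0.01 * rng.uniform(size=(10, 50))

# Decompose V ~= weights @ features (scikit-learn's W @ H)
model = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
weights = model.fit_transform(V)        # shape (10, 3)
features = model.components_            # shape (3, 50)

# Reconstruction closely resembles the original, within a margin of error
err = np.linalg.norm(V - weights @ features) / np.linalg.norm(V)

# Perturb the weights of one sample to synthesize a new, different but
# structurally similar amplitude spectrum
w_new = weights[0] * (1 + 0.3 * rng.standard_normal(3))
spectrum_new = np.clip(w_new, 0, None) @ features
```

The clipping step keeps the perturbed weights non-negative, so the synthesized spectrum stays a valid amplitude spectrum.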
We apply this same principle to the keyboard amplitude spectrum
Fig. 17
Orange shows the original amplitude spectrum of a captured session, and blue shows the signal reconstructed with an NMF with \(N=10\); by adding noise to each \(H\) component we obtain a very different but still valid spectrum.
We need to address the problem of training the NMF model on datasets.
Our purpose is to extract a large number of different features and then recombine them to obtain a large number of different outputs. If we trained an NMF with \(N=10\) on the entire dataset, we would obtain only 10 different features that can be recombined by changing their weight matrix \(H\). To avoid easy discrimination by anti-bot measures, we need more variance in our generative model.
Increasing the size of \(N\) might seem like a solution, but it can result in irrelevant or non-informative features, as the NMF will run out of features to extract. Instead, a better approach is to divide the dataset into multiple smaller datasets. For instance, if we have 100 training samples, we can create 10 different datasets, each containing 10 samples. By applying NMF to each of these datasets, we obtain 100 new features that can be recombined.
This approach allows us to obtain different results by tuning the weights of different datasets. Moreover, we can transfer certain features from one dataset to another by switching columns of the matrix \(W\). Using the previous example, with 10 features selected from 100, we have a total of \(\binom{100}{10} = 17{,}310{,}309{,}456{,}440\) possible combinations. This number increases further when considering the combinations of changing the weights.
We encounter a similar challenge when attempting to develop a generative model for the Phase Spectrum. However, unlike the amplitude spectrum, the phase dataset exhibits less variance.
We observe a consistent pattern in the data, and we can distinguish two main features: the first feature corresponds to a phase constant increase, and the second feature represents a sudden phase decrease (resembling an “N” shape pattern).
Fig. 18
It is important to note that this time we do not want to generate new features by linearly combining different features, but we just want the same features at different frequencies.
In other terms, our generative model should only select the same two features and place them at different locations. This is because if we combined different features together, we might obtain new features which do not appear in our dataset.
We need an algorithm that recognizes these two features in the dataset and separates them into two different classes. This type of algorithm is called a clustering model.
Fig. 19
By applying a clustering model like the Gaussian Mixture Model (GMM), we can assign different probabilities to each of the two class features. This allows us to determine the likelihood of a feature at a specific frequency.
After training the GMM with \(N=2\), we can generate synthetic data by randomly sampling from the two classes while incrementing the frequency. This process ensures that we obtain the same features at different frequency positions.
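A sketch of this idea with scikit-learn's GaussianMixture follows; the two feature classes are encoded as made-up (slope, drop-depth) pairs purely for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-in training set: each row summarizes one observed phase feature as
# a made-up (slope, drop-depth) pair.  Two clusters mimic the two feature
# classes described in the text.
steady = rng.normal([1.0, 0.2], 0.05, size=(40, 2))  # constant increase
drop = rng.normal([0.3, 2.5], 0.10, size=(40, 2))    # sudden "N"-shaped drop
X = np.vstack([steady, drop])

# Fit a 2-component GMM so each component captures one feature class
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Synthesize new phase features: sampling from the fitted mixture reuses
# the same two feature shapes, which are then placed at different
# frequency positions while walking the frequency axis
samples, labels = gmm.sample(5)
```

Each sampled row reproduces one of the two learned feature shapes with small Gaussian variation, rather than inventing a blend of both.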
And the result would look something like this
Fig. 20
which keeps both features 1 and 2 (at 40 Hz, 43 Hz, and 70 Hz, for example).
Note that in some cases it may be necessary to limit the number of samples drawn from a Gaussian distribution, especially when dealing with small time intervals or when excessive sampling leads to unsatisfactory results. In these cases, you can initially draw a smaller number of samples from the Gaussian distribution that provides a good output, and then perform resampling to obtain the desired number of samples. Resampling interpolates the points obtained from the initial samples to create a new curve; the interpolation can be linear or cubic, depending on the desired smoothness. Once the new curve is generated, you can draw a larger number of samples from it.
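The resampling step can be sketched with SciPy's interpolation routines; the coarse sample count and Gaussian parameters below are arbitrary:

```python
import numpy as np
from scipy.interpolate import interp1d

rng = np.random.default_rng(2)

# Step 1: draw a small number of samples from the Gaussian distribution
# that is known to give a good output
coarse_x = np.linspace(0, 1, 8)
coarse_y = rng.normal(loc=1.0, scale=0.2, size=8)

# Step 2: resample by interpolating the coarse points into a new curve
# (cubic here; kind="linear" also works, depending on desired smoothness)
curve = interp1d(coarse_x, coarse_y, kind="cubic")

# Step 3: draw as many samples as needed from the interpolated curve
fine_x = np.linspace(0, 1, 100)
fine_y = curve(fine_x)
```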
Alternatively, we can consider sampling from the joint distribution of the two classes but it would result in some new features that are a combination of both the first and second.
Now that we have both the amplitude and phase spectrum at each frequency, we can reconstruct the general Fourier Transform expression given by:
\[\text{FFT}[i] = \text{AMPLITUDE}[i]\cdot e^{\,j\,\text{PHASE}[i]}\]where \(i\) is the index of the frequency bin. This is a sample case of the test result we obtained by reconstructing with PYNUFFT.
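A sketch of this reconstruction step with plain NumPy, using stand-in amplitude and phase arrays in place of the NMF and GMM outputs:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in amplitude and phase arrays; in practice these come from the
# NMF (amplitude) and GMM (phase) generative models
n_bins = 65                     # rfft bins for a 128-sample signal
amplitude = rng.uniform(0, 1, n_bins)
phase = rng.uniform(-np.pi, np.pi, n_bins)

# Rebuild the complex spectrum: FFT[i] = AMPLITUDE[i] * exp(1j * PHASE[i])
spectrum = amplitude * np.exp(1j * phase)

# Back to the time domain; irfft enforces the conjugate symmetry that a
# real-valued velocity signal requires
velocity = np.fft.irfft(spectrum, n=128)
```

Using `irfft` on the half-spectrum guarantees a real-valued time-domain signal without having to mirror the negative frequencies by hand.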
Fig. 21
Lastly, I would like to mention an interesting idea shared by my colleague, Paolo Masciullo, regarding the usage of Linear Regression as a generative model.
Linear Regression is a supervised model that finds the best linear relationship in a dataset to represent the overall trend. In the case of the amplitude plot, it finds the best-fitting line that represents the given training data.
When new input data X is provided, the trained Linear Regression attempts to fit the input data using the relationship learned from the training phase to make predictions on X. This allows the model to generate new sample data that best fits the input data X.
Fig. 22
Typically, textbooks suggest that a model should neither underfit nor overfit the data: we should not train the model with too few samples, nor with a huge amount. In the former case, the model will not capture the meaningful relationships in the data; in the latter, when new input data X is given, it will not adapt to it and will generate samples that closely resemble the training set.
However, for our purposes, Paolo has found that underfitting the model can be advantageous.
Underfitting the model makes it more susceptible to noise, so that small variations in the input X lead to meaningful variations in the predicted trend. For example, adding random noise variations to the same input X may result in significantly different results.
Fig. 23
The problem with underfitting is that it may produce different features compared to the original biometric model.
To address this, we can use regression basis functions to impose the desired feature shape that the model should follow. In the case of the magnitude spectrum, which consists of a series of Gaussians, we can use a Gaussian as the basis function. By doing so, the model will fit the data using Gaussian curves and connect them together, as shown by the red dots resembling a sum of Gaussian curves in the image.
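A sketch of linear regression with Gaussian basis functions on a synthetic two-peak spectrum follows; the basis width, number of centers, and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def design_matrix(x, centers, width=0.05):
    # Gaussian basis functions: each column is a Gaussian bump at a center,
    # so the fitted model is constrained to be a sum of Gaussian curves
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

# Stand-in magnitude spectrum: two Gaussian peaks plus a little noise
x = np.linspace(0, 1, 200)
y = (np.exp(-(x - 0.3) ** 2 / 0.005)
     + 0.5 * np.exp(-(x - 0.7) ** 2 / 0.005)
     + 0.02 * rng.standard_normal(200))

# Deliberately few basis centers: a simple, underfit-prone model whose
# predictions still follow the Gaussian feature shape
centers = np.linspace(0, 1, 8)
Phi = design_matrix(x, centers)
weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_pred = Phi @ weights
```

Because every basis function is a Gaussian, even an underfit solution can only express sums of Gaussian bumps, which is exactly the feature shape the magnitude spectrum exhibits.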
It’s important to note that no extensive experiments have been conducted on this idea yet, and further research is needed to explore its effectiveness.
In conclusion, we have explored different approaches to generate valid biometric data, utilizing heuristics and machine learning models. We have also discussed the challenges faced by Anti-Bot protections in distinguishing between fake synthesized data and those generated by legitimate users.
Moving forward, I intend to delve into more advanced fingerprinting and biometric protection techniques, as well as explain state-of-the-art machine learning approaches to bypass such protections.
I also plan on covering Anti-Bots protection implementations and possible improvements that can be applied to the current solutions.