This repository aims to understand and reverse engineer exactly how Datadome's slide captcha works. The code should not be used in production; it is a proof of concept (PoC).
This repository is intended solely for educational purposes. Please use this information responsibly and ethically. Any misuse or unauthorized use of this knowledge is strongly discouraged.
Slide captchas are used by antibot companies such as Datadome and Geetest to prevent web scraping and automation by distinguishing real human users from automated bots.
A slide captcha is a method of detecting bots and web scrapers. It works by making the user slide a puzzle piece along a background image into the correct position, which is shown as a darker region of the image.
Datadome collect signals, which are device data, mouse events and screen sizes. I assume these are then run against an AI model that has been trained on real device data, to determine whether the user should be allowed to continue browsing the site.
The script is given to us in an obfuscated state. Obfuscation is a process many antibot companies use to try to protect their code from being understood. We can reverse it using a process called deobfuscation, which means analysing the script, determining which obfuscation methods are used and reversing them. We can do this in Node.js using Babel.
This process can be found here.
With our deobfuscated version of the script, we can begin to determine how it runs. At execution, the script begins with a custom module loader, which serves a similar function to require in Node.js. This is used throughout the script to access different modules. There are 8 modules inside the script:
- payload-js/exports/payload.js: The payload generator, which covers how the 'signals' are collected and encoded for submission.
- ./bean: This module seems to be in charge of all event recording (mouse, keyboard, touch).
- ./es5_code/obf: This module starts the collection of device data.
- ./hash: This module contains a function used to hash different values during runtime.
- ./helpers: This module just contains a function used for safe base64 encoding.
- initial: This is the first module to run. It contains the definition of the signals class and a checksum of different functions within the script.
- ./picasso: This module is in charge of canvas fingerprinting.
- ./slidercaptcha: This module loads all of the images used during the captcha into the DOM: the background image and the puzzle fragment.
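The custom module loader described above can be pictured with a small sketch. This is not the code from the deobfuscated script (which is JavaScript); it is a minimal Python illustration of the general pattern, assuming a registry of module factories and a cache. The module names `payload` and `./hash` are borrowed from the list above, but their bodies here are hypothetical placeholders.

```python
from typing import Any, Callable, Dict

Exports = Dict[str, Any]
# A factory receives the loader's require function and an empty exports dict to fill.
ModuleFactory = Callable[[Callable[[str], Exports], Exports], None]


def make_loader(definitions: Dict[str, ModuleFactory]) -> Callable[[str], Exports]:
    """Build a require-style function over a registry of module factories."""
    cache: Dict[str, Exports] = {}

    def require(name: str) -> Exports:
        if name in cache:
            return cache[name]  # each module body runs only once
        if name not in definitions:
            raise KeyError(f"unknown module: {name}")
        exports: Exports = {}
        cache[name] = exports  # cached before running so circular requires resolve
        definitions[name](require, exports)
        return exports

    return require


def hash_module(require: Callable[[str], Exports], exports: Exports) -> None:
    # Placeholder hash for illustration, NOT the function used by the real script.
    exports["hash"] = lambda text: sum(ord(ch) for ch in text) % 65536


def payload_module(require: Callable[[str], Exports], exports: Exports) -> None:
    hasher = require("./hash")["hash"]  # one module pulling in another
    exports["build"] = lambda value: {"value": value, "h": hasher(value)}


require = make_loader({"./hash": hash_module, "payload": payload_module})
print(require("payload")["build"]("abc"))  # {'value': 'abc', 'h': 294}
```

The cache is what makes this behave like require: asking for the same module twice returns the same exports object instead of running the module again.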
The core of the slide captcha is placing the puzzle piece in the correct position. This is the first benchmark in determining whether the user is a human or a bot. To find the position, we can use Python's cv2 (OpenCV) library to perform template matching. I found this approach to be 88-90% accurate, which isn't ideal, but it takes on average 0.02 seconds per detection, so I believe what we sacrifice in accuracy we gain in speed and efficiency.
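To make the idea of template matching concrete, here is a minimal pure-Python sketch that slides a small template over a grayscale grid and picks the offset with the lowest sum of squared differences. This is the same score that `cv2.matchTemplate` computes with the `cv2.TM_SQDIFF` method, only far slower; it is not the repository's detection code, and the toy image, the `Grid` type and the function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # grayscale image as rows of pixel intensities (0-255)


def squared_difference(image: Grid, template: Grid, top: int, left: int) -> int:
    """Sum of squared pixel differences with the template placed at (left, top)."""
    total = 0
    for dy, row in enumerate(template):
        for dx, value in enumerate(row):
            diff = image[top + dy][left + dx] - value
            total += diff * diff
    return total


def match_template(image: Grid, template: Grid) -> Tuple[int, int]:
    """Return (x, y) of the top-left corner where the template fits best."""
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best_score: Optional[int] = None
    best_pos = (0, 0)
    for top in range(img_h - tpl_h + 1):
        for left in range(img_w - tpl_w + 1):
            score = squared_difference(image, template, top, left)
            if best_score is None or score < best_score:
                best_score, best_pos = score, (left, top)
    return best_pos


# Toy 8x6 background with a darker 2x2 patch whose top-left corner is at x=5, y=3.
background = [[200] * 8 for _ in range(6)]
for y in (3, 4):
    for x in (5, 6):
        background[y][x] = 40
piece = [[40, 40], [40, 40]]

print(match_template(background, piece))  # (5, 3)
```

With OpenCV, the two loops collapse into a `cv2.matchTemplate` call followed by `cv2.minMaxLoc` to read off the best location. Real captcha images are noisier than this toy grid, which is consistent with the accuracy being below 100%.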
This process can be found here.
Using our understanding of the script, and our puzzle piece provided by the detection API, we can begin to build our own solutions to Datadome.
This process can be found here.
First, we can begin to add our own hardcoded values, which we have collected from our own browser. Values such as screen sizes, user agents, device and device memory can all be placed straight into our payloads. They are very generic and hard to fingerprint, since millions of devices will have similar values.
Events are a bit different from the other values we will use: we cannot hardcode them, since Datadome deem them an important piece of our fingerprint. I went with mouse events, as they seemed the easiest to replicate. Datadome collect the x and y pixel coordinates of the cursor, along with the timestamp at which each movement happens. I wrote a basic function to emulate these events, which focuses on the cursor moving from a starting x value to an x value determined by the location of the puzzle piece, as the mouse events stop recording upon a mouseup event. These coordinates and timestamps are then used to calculate many of the values included in the signals. Two of them are the standard deviation of the x and y values and the average speed along x and y.
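The two derived values mentioned above can be illustrated with a short sketch. It is not the repository's emulation function: the straight-line path, the jitter and delay ranges, the field names and the exact definition of average speed (distance travelled per millisecond on each axis) are all assumptions made for illustration.

```python
import random
import statistics
from typing import Dict, List, Tuple

Event = Tuple[int, int, int]  # (x, y, timestamp in milliseconds)


def emulate_slide(start_x: int, target_x: int, y: int, steps: int,
                  rng: random.Random) -> List[Event]:
    """Generate a naive left-to-right drag ending at target_x."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    events: List[Event] = []
    timestamp = 0
    for i in range(steps + 1):
        x = round(start_x + (target_x - start_x) * i / steps)
        events.append((x, y + rng.randint(-2, 2), timestamp))  # small vertical jitter
        timestamp += rng.randint(8, 24)  # assumed gap between events, in ms
    return events


def summarise(events: List[Event]) -> Dict[str, float]:
    """Compute per-axis standard deviation and average speed from the events."""
    xs = [event[0] for event in events]
    ys = [event[1] for event in events]
    elapsed = events[-1][2] - events[0][2]
    distance_x = sum(abs(b - a) for a, b in zip(xs, xs[1:]))
    distance_y = sum(abs(b - a) for a, b in zip(ys, ys[1:]))
    return {
        "std_x": statistics.pstdev(xs),
        "std_y": statistics.pstdev(ys),
        "avg_speed_x": distance_x / elapsed,  # pixels per millisecond
        "avg_speed_y": distance_y / elapsed,
    }


events = emulate_slide(start_x=20, target_x=200, y=150, steps=30,
                       rng=random.Random(1))
print(summarise(events))
```

Because every statistic is derived from the same list of events, the coordinates, timestamps and summary values stay consistent with each other.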
Canvas fingerprinting is a technique used by numerous antibot companies. It involves drawing shapes in the HTML canvas. Each browser renders the canvas slightly differently from other browsers, so they most likely use it as a benchmark against the other signals being submitted. I found that we can hardcode these values as long as the other signals match up with the device the canvas fingerprint was collected from.
Timestamps are used frequently during the script, as Datadome will try and fingerprint these sessions by comparing our fake timestamps with those that are real. I have found it sufficient to randomly generate these timestamps within ranges of those found in real browsers.
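Drawing timestamps from observed ranges can be sketched as follows. The field names and millisecond ranges below are hypothetical placeholders, not values taken from the script or from real browsers; the sketch only shows the pattern of sampling each gap from a range and accumulating the gaps so that the resulting timestamps stay in order.

```python
import random
from typing import Dict, Tuple

# Hypothetical field names and (low, high) gaps in milliseconds, for illustration only.
EXAMPLE_RANGES: Dict[str, Tuple[int, int]] = {
    "page_loaded": (300, 1200),
    "first_mouse_event": (200, 900),
    "slide_finished": (400, 1500),
}


def sample_timeline(start_ms: int, ranges: Dict[str, Tuple[int, int]],
                    rng: random.Random) -> Dict[str, int]:
    """Return increasing timestamps, each offset from the previous by a random gap."""
    timeline: Dict[str, int] = {}
    current = start_ms
    for name, (low, high) in ranges.items():
        current += rng.randint(low, high)  # gap drawn uniformly from the range
        timeline[name] = current
    return timeline


timeline = sample_timeline(1_700_000_000_000, EXAMPLE_RANGES, random.Random(7))
print(timeline)
```

Sampling the gaps rather than the absolute values keeps later timestamps from landing before earlier ones.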