diff --git a/_data/navigation.yml b/_data/navigation.yml
index 6600d2a4..90a0f195 100644
--- a/_data/navigation.yml
+++ b/_data/navigation.yml
@@ -139,6 +139,8 @@ wiki:
         url: /wiki/sensing/azure-block-detection/
       - title: DWM1001 UltraWideband Positioning System
         url: /wiki/sensing/ultrawideband-beacon-positioning.md
+      - title: Perception via Thermal Imaging
+        url: /wiki/sensing/thermal-perception/
   - title: Controls & Actuation
     url: /wiki/actuation/
     children:
diff --git a/assets/images/Moge_relative_thermal.png b/assets/images/Moge_relative_thermal.png
new file mode 100644
index 00000000..ddc2d30f
Binary files /dev/null and b/assets/images/Moge_relative_thermal.png differ
diff --git a/assets/images/foundation_stereo.png b/assets/images/foundation_stereo.png
new file mode 100644
index 00000000..1e3c90d7
Binary files /dev/null and b/assets/images/foundation_stereo.png differ
diff --git a/wiki/sensing/thermal-perception.md b/wiki/sensing/thermal-perception.md
new file mode 100644
index 00000000..a1f95890
--- /dev/null
+++ b/wiki/sensing/thermal-perception.md
@@ -0,0 +1,255 @@
+---
+# Jekyll 'Front Matter' goes here. Most are set by default, and should NOT be
+# overwritten except in special circumstances.
+# You should set the date the article was last updated like this:
+date: 2025-04-29 # YYYY-MM-DD
+# This will be displayed at the bottom of the article
+# You should set the article's title:
+title: Perception via Thermal Imaging
+# The 'title' is automatically displayed at the top of the page
+# and used in other parts of the site.
+---
+
+In this article, we discuss strategies to implement key steps in a robotic perception pipeline using thermal cameras.
+Specifically, we discuss the conditions under which a thermal camera provides more utility than an RGB camera, followed
+by implementation details for camera calibration, dense depth estimation, and odometry using thermal cameras.
+
+## Why Thermal Cameras?
+
+Thermal cameras are useful in key situations where normal RGB cameras fail - notably, under perceptual degradation such
+as smoke and darkness.
+Furthermore, unlike LiDAR and RADAR, thermal cameras do not emit any detectable radiation.
+If your robot is expected to operate in darkness and smoke-filled areas, thermal cameras are a means for your robot to
+perceive the environment in nearly the same way as visual cameras would in ideal conditions.
+
+## Why Depth is Hard in Thermal
+
+Depth perception — inferring the 3D structure of a scene — generally relies on texture-rich, high-contrast inputs.
+Thermal imagery tends to violate these assumptions:
+
+- **Low Texture**: Stereo matching algorithms depend on local patches with distinctive features. Thermal scenes often
+  lack these.
+- **High Noise**: Infrared sensors may introduce non-Gaussian noise, which confuses pixel-level correspondence.
+- **Limited Resolution**: Consumer-grade thermal cameras are often below 640×480, constraining disparity accuracy.
+- **Spectral Domain Shift**: Models trained on RGB datasets fail to generalize directly to the thermal domain.
+
+_________________________
+
+## Calibration
+
+Calibration is the process by which we can estimate the internal and external parameters of a camera. Usually, the
+camera intrinsics consist of the following numbers:
+
+- fx, fy - the focal length of the camera in the x and y directions **in the camera's frame**, expressed in pixels
+  (the metric focal length divided by the pixel size)
+- cx, cy OR px, py - the principal point, i.e. the optical center of the image, also in pixels
+- distortion coefficients (2-6 numbers depending on the distortion model used)
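+
+For reference, $f_x$, $f_y$, $c_x$ and $c_y$ are conventionally assembled into the 3x3 pinhole intrinsics matrix $K$
+that projection, rectification, and undistortion routines expect:
+
+$
+K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
+$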
+
+Additionally, we must also estimate the camera extrinsics, which is the pose of the camera relative to another sensor -
+typically the IMU (the body frame of a robot is often defined to coincide with it), or another camera in the case of a
+multi-camera system.
+
+- This will be in the form of a series of 12 numbers - 9 for the rotation matrix and 3 for the translation
+- *NOTE*: BE VERY CAREFUL OF COORDINATE FRAMES
+- If using more than one sensor, accurate time synchronization between them will make calibration much easier.
+
+- Calibrating thermal cameras is quite similar to calibrating any other RGB sensor. To accomplish this you must have a
+  checkerboard pattern, ArUco grid or some other calibration target.
+  - A square checkerboard is not ideal because it is symmetrical, so it is hard for the algorithm to tell whether the
+    orientation of the board has changed.
+  - An ArUco grid gives precise orientation and is the most reliable option, but it is not necessary.
+
+General tips:
+
+- For a thermal camera you will need to use something with distinct hot and cold edges, e.g. a thermal checkerboard
+- Ensure that the edges on the checkerboard are visible and are not fuzzy. If they are fuzzy, adjust the focus, wipe
+  the lens, and check whether any blurring is being applied
+- Ensure the hot parts of the checkerboard are the hottest things in the picture. This will make it easier to detect
+  the checkerboard
+- Thermal cameras by default give 16-bit output. You will need to convert this to an 8-bit grayscale image.
+- The fewer things visible in the image other than the checkerboard, the better your calibration will be
+- If possible, preprocess your images so that other distracting features are suppressed
+
+### Camera Intrinsics
+
+- Calibrating thermal camera intrinsics will give you fx, fy, cx, cy, and the respective distortion coefficients
+
+1. Heat up the checkerboard
+2. Record a rosbag with the necessary topics
+3. Preprocess your images
+4. Run them through OpenCV or Kalibr. There are plenty of good resources online.
+
+Example output from Kalibr:
+
+```text
+cam0:
+  cam_overlaps: []
+  camera_model: pinhole
+  distortion_coeffs: [-0.3418843277284295, 0.09554844659447544, 0.0006766728551819399, 0.00013250437150091342]
+  distortion_model: radtan
+  intrinsics: [404.9842534577856, 405.0992911907136, 313.1521147858522, 237.73982476898445]
+  resolution: [640, 512]
+  rostopic: /thermal_left/image
+```
+
+### Thermal Camera Peculiarities
+
+- Thermal cameras are extremely noisy. There are ways you can reduce this noise:
+- **Camera gain calibration:** The gain values on the camera are used to reduce or increase the intensity of the noise
+  in the image.
+  - The gain is typically increased when you are trying to estimate the static noise so that it can be removed from
+    the image (see FFC below)
+
+- **Flat Field Correction (FFC)**: FFC is used to remove lens effects such as vignetting and fixed thermal patterns
+  from the images
+  - FFC is carried out by placing a uniform object in front of the camera and taking a picture
+  - The noise patterns and vignetting effects are then estimated and removed from subsequent images
+  - FLIR thermal cameras periodically "click"; this is the camera placing a shutter in front of the sensor, taking a
+    picture, and correcting for any noise
+  - The FLIR documentation describes Supplemental FFC (SFFC), which is the user performing FFC manually. It is
+    recommended to do this once the cameras are in their operating conditions
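+
+To make the preprocessing steps above concrete, the sketch below estimates a flat field from frames of a uniform
+target and converts raw 16-bit frames to 8-bit for checkerboard detection. This is only a minimal illustration: the
+frame arrays, the stack of flat-field captures, and the percentile values are placeholders you would adapt to your own
+driver and sensor.
+
+```python
+import numpy as np
+
+
+def estimate_flat_field(flat_frames: np.ndarray) -> np.ndarray:
+    """Estimate fixed-pattern noise / vignetting from a stack of 16-bit frames
+    of a uniform target (the manual FFC procedure described above)."""
+    mean_frame = flat_frames.astype(np.float32).mean(axis=0)
+    return mean_frame - mean_frame.mean()  # keep only the spatial variation
+
+
+def thermal_to_8bit(raw16: np.ndarray, flat_field=None, lo_pct=1.0, hi_pct=99.0) -> np.ndarray:
+    """Convert a raw 16-bit thermal frame to an 8-bit grayscale image suitable
+    for checkerboard detection (percentile clipping values are illustrative)."""
+    img = raw16.astype(np.float32)
+    if flat_field is not None:
+        img -= flat_field  # remove static noise and vignetting
+    lo, hi = np.percentile(img, [lo_pct, hi_pct])
+    img = np.clip((img - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
+    return (img * 255.0).astype(np.uint8)
+
+
+# The 8-bit frames can then be fed to OpenCV's checkerboard detector or saved
+# out as images for Kalibr, e.g.:
+# ok, corners = cv2.findChessboardCorners(thermal_to_8bit(raw, flat), (7, 5))
+```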
+
+### Camera Extrinsics
+
+- The relative camera pose is necessary to perform depth estimation. Kalibr calls this a camchain.
+- Camera-IMU calibration is necessary to perform sensor fusion and integrate both sensors together. This can also be
+  estimated from CAD.
+- Time-sync is extremely important here because the sensor readings need to be taken at the exact same time for the
+  algorithm to effectively estimate poses.
+- While performing extrinsics calibration, ensure that all axes are excited (up-down, left-right, fwd-back, roll,
+  pitch, yaw) sufficiently. ENSURE that you move slowly enough that there is no motion blur on the calibration target,
+  but fast enough to excite the axes sufficiently.
+
+________
+
+## Our Depth Estimation Pipeline Evolution
+
+### 1. **Stereo Block Matching**
+
+We started with classical stereo techniques. Given left and right images $I_L, I_R$, stereo block matching computes
+disparity $d(x, y)$ using a sliding window that minimizes a similarity cost (e.g., sum of absolute differences):
+
+$d(x, y) = \arg\min_d \; \text{Cost}(x, y, d)$
+
+In broad strokes, this brute-force approach compares blocks from $I_L$ and $I_R$. For each block it computes a cost
+based on pixel-to-pixel similarity (generally a loss between feature descriptors). Once a block match is found, the
+disparity is obtained from how far the block has moved in the x direction.
+
+As you can imagine, this approach is simple and lightweight. However, it depends on many things, such as the noise in
+your images and the contrast separation, and it struggles to find accurate matches on textureless and colorless inputs
+(like a wall in a thermal image). The algorithm performed better than expected, but we chose not to go ahead with it.
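+
+For reference, the classical pipeline is only a few lines with OpenCV. This is a minimal sketch, not the exact
+configuration we used: `left8` and `right8` stand in for rectified 8-bit thermal images, and the parameter values are
+illustrative.
+
+```python
+import cv2
+import numpy as np
+
+# left8, right8: rectified 8-bit grayscale thermal images (e.g. produced by a
+# preprocessing step like the one sketched in the calibration section).
+stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
+
+# compute() returns fixed-point disparities scaled by 16; convert to float pixels.
+disparity = stereo.compute(left8, right8).astype(np.float32) / 16.0
+disparity[disparity <= 0] = np.nan  # mark invalid / unmatched pixels
+```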
+
+---
+
+### 2. **Monocular Relative Depth with MoGe**
+
+If you are using a single-camera setup, this is called a monocular approach. One issue is that this problem is
+ill-posed. For example, objects can be placed at twice the distance and be scaled to twice their size to yield the same
+image. This scale ambiguity exists in any monocular depth estimation method. Therefore, learning-based models are
+employed to "guess" the right depth (based on data-driven priors, such as the typical size of everyday objects like
+chairs). One such model is MoGe (Monocular Geometry), which estimates *relative* depth $z'$ from a single image. These
+estimates are affine-invariant, meaning we need to apply a scale and a shift to recover metric depth:
+
+$z = s \cdot z' + t$
+
+This means the estimates look visually coherent (see the image below on the right), but the ambiguity limits their use
+for metric 3D tasks (e.g. SLAM-based applications).
+
+![Relative Depth on Thermal Images](/assets/images/Moge_relative_thermal.png)
+
+---
+
+### 3. **MADPose Solver for Metric Recovery**
+
+To determine the global scale and shift, we incorporated a stereo system and inferred relative depth from both cameras.
+We then utilized the MADPose solver to find the scale and shift of both relative depth maps that make them align, i.e.
+both depth maps, once made metric, should describe the same 3D structure. This optimizer also estimates other
+properties such as the extrinsics between the cameras, solving for more unknowns than necessary. Additionally, no
+temporal constraint is imposed, even though you are looking at mostly the same scene between timesteps $T$ and $T+1$.
+This meant that the metric depth we recovered kept changing significantly across frames, resulting in pointclouds of
+different sizes and distances across timesteps. This method, while sound in theory, did not work out very well in
+practice.
+
+---
+
+### 4. **Monocular Metric Depth Predictors**
+
+We also tested monocular models trained to output metric depth directly. This is the most ill-posed formulation, as
+such models overfit to the camera setup and scene scale of their training data and fail to generalize to other setups.
+These models treat depth as a regression problem from a single input $I$:
+
+$z(x, y) = f(I(x, y))$
+
+Thermal's lack of depth cues and color made the problem even harder, and the models performed poorly.
+
+---
+
+### 5. **Stereo Networks Trained on RGB (e.g., MS2, KITTI)**
+
+Alternatively, when a dual-camera setup is used, we call it a stereo approach. This is inherently a much simpler
+problem to solve, as you have two rays that intersect at the observed 3D point. We encourage looking at the following
+set of videos to understand epipolar geometry and the formulation behind the stereo camera
+setup: [Link](https://www.youtube.com/watch?v=6kpBqfgSPRc).
+
+We evaluated multiple pretrained stereo disparity networks. However, there were large differences between the datasets
+used for pretraining and our data distribution. These models failed to generalize due to:
+
+- Domain mismatch (RGB → thermal)
+- Texture reliance
+- Exposure to only outdoor content
+- Reduced exposure
+
+---
+
+## Final Approach: FoundationStereo
+
+Our final and most successful solution was [FoundationStereo](https://github.com/NVlabs/FoundationStereo), a foundation
+model for depth estimation that generalizes to unseen domains without retraining. It is trained on large-scale
+synthetic stereo data and supports robust zero-shot inference.
+
+### Why It Works
+
+- **Zero-shot Generalization**: No need for thermal-specific fine-tuning.
+- **Strong Priors**: Learned over large datasets of scenes with varied geometry and lighting. These variations helped
+  overcome the RGB-to-thermal domain shift and the lack of texture cues.
+- **Robust Matching**: Confidence estimation allows the model to ignore uncertain matches rather than hallucinate them.
+- **Formulation**: Formulating the task as a dense matching problem in image space also served well. Constraining the
+  output to pixel space allows generalization to any baseline.
+
+Stereo-rectified thermal image pairs are given to FoundationStereo, which gives us clean disparity maps (in image
+space). We recover metric depth using the intrinsics of the camera and the baseline. Finally, we can reproject this
+into 3D space to get consistent point clouds:
+
+$
+z = \frac{f \cdot B}{d}
+$
+
+Where:
+
+- $f$ = focal length,
+- $B$ = baseline between the cameras,
+- $d$ = disparity at the pixel.
+
+An example output is given below (the preprocessed thermal image is on the top left, the disparity is on the middle
+left, and the metric pointcloud is on the right).
+
+![Metric Depth using Foundation Models](/assets/images/foundation_stereo.png)
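+
+As a concrete sketch of this back-projection (our own helper, not part of FoundationStereo's API; variable names are
+illustrative), assuming a disparity map in pixels and the rectified intrinsics:
+
+```python
+import numpy as np
+
+
+def disparity_to_pointcloud(disp, fx, fy, cx, cy, baseline):
+    """Back-project a disparity map (pixels) into a metric point cloud in the
+    camera frame, using z = f * B / d and the pinhole intrinsics."""
+    h, w = disp.shape
+    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
+    valid = disp > 0
+    z = np.zeros_like(disp, dtype=np.float32)
+    z[valid] = fx * baseline / disp[valid]
+    x = (u - cx) * z / fx
+    y = (v - cy) * z / fy
+    points = np.stack([x, y, z], axis=-1)  # H x W x 3
+    return points[valid], valid            # N x 3 metric points and the mask
+```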
+
+## Lessons Learned
+
+1. **Texture matters**: Thermal's low detail forces the use of models that exploit global context.
+2. **Don’t trust pretrained RGB models**: They often don’t generalize without retraining.
+3. **Stereo > Monocular for thermal**: Even noisy stereo is better than ill-posed monocular predictions.
+4. **Foundation models are promising**: Large-scale pretrained vision backbones like FoundationStereo are surprisingly
+   effective out-of-the-box.
+
+## Conclusion
+
+Recovering depth from thermal imagery is hard — but not impossible. While classical and RGB-trained methods struggled,
+modern foundation stereo models overcame the domain gap with minimal effort. Our experience suggests that for any team
+facing depth recovery in non-traditional modalities, foundation models are a compelling place to start.
+
+## See Also
+
+- The [Thermal Cameras wiki page](https://roboticsknowledgebase.com/wiki/sensing/thermal-cameras/) goes into more depth
+  about how thermal cameras function.