Sensor-Agnostic Cloud Detection from Satellite Imagery with GeoAI

Cloud contamination is one of the most common challenges in optical remote sensing. Before you can analyze vegetation, land cover, or change over time, you need to identify and mask out cloudy pixels. In this tutorial, I walk through a complete cloud detection workflow using the GeoAI package and OmniCloudMask, an open-source algorithm that works with any sensor as long as the imagery includes red, green, and near-infrared bands. That covers Landsat, Sentinel-2, NAIP, and most commercial satellite data.

Video tutorial: Cloud Detection from Satellite Imagery with GeoAI

Getting Started

Import the GeoAI library and download the sample Sentinel-2 imagery (approximately 300 MB covering Knoxville, Tennessee):

import geoai

geoai.download_sample_data()

You can visualize the imagery before running cloud detection. Here, file_path is the path to the downloaded image:

geoai.view_raster(file_path)

Predicting the Cloud Mask

The core step is a single function call. The key parameter is band_order, which tells the algorithm where to find the red, green, and near-infrared bands in your imagery:

geoai.predict_cloud_mask_from_raster(
    input_path="sentinel2.tif",
    output_path="cloud_mask.tif",
    band_order=[1, 2, 4],  # red, green, NIR
    batch_size=4,
    inference_dtype="bf16",
)

Getting band_order right matters more than any other setting. For this sample data, band 1 is red, band 2 is green, and band 4 is near-infrared (band 3, blue, is not used), so band numbers count from 1. Adjust the list to match your input data. The batch_size controls how many tiles are processed at once; decrease it if you run out of GPU memory.
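If you know the order in which bands are stored in your file, the band_order list can be derived rather than typed by hand. The helper below is a minimal sketch: band_order_for is a hypothetical name (not part of GeoAI), and it assumes 1-based band numbers, as in the sample data.

```python
def band_order_for(band_names, wanted=("red", "green", "nir")):
    """Return 1-based band indices for the wanted bands.

    band_names lists the bands in the order they are stored in the file.
    This is a hypothetical helper, not a GeoAI function.
    """
    lookup = {name.lower(): index for index, name in enumerate(band_names, start=1)}
    missing = [name for name in wanted if name not in lookup]
    if missing:
        raise ValueError(f"Imagery is missing required bands: {missing}")
    return [lookup[name] for name in wanted]


# Band layout of the sample data described above: red, green, blue, NIR.
sample_order = band_order_for(["red", "green", "blue", "nir"])
print(sample_order)  # [1, 2, 4]

# A hypothetical file that stores its bands as blue, green, red, NIR.
bgrn_order = band_order_for(["blue", "green", "red", "nir"])
print(bgrn_order)  # [3, 2, 4]
```

The resulting list can then be passed as band_order. Raising an error for a missing band catches imagery without a near-infrared band before any inference runs.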

The prediction raster contains four values:

0: Clear pixels
1: Thick cloud
2: Thin cloud
3: Cloud shadow

On the sample imagery, cloud detection takes only a few seconds.

Cloud Statistics

After prediction, you can compute cloud cover statistics:

import rasterio
import numpy as np

with rasterio.open("cloud_mask.tif") as src:
    data = src.read(1)

stats = geoai.calculate_cloud_statistics(data)

This reports the total pixel count and the percentage of clear, thick cloud, thin cloud, and shadow pixels. For the sample imagery, roughly 13% is covered by cloud and 8.7% by cloud shadow.
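To see what such percentages mean at the pixel level, here is a rough NumPy sketch that counts each class in a mask array. It is not the implementation of calculate_cloud_statistics; class_percentages and the toy array are made up for illustration, and the class values follow the list earlier in this tutorial.

```python
import numpy as np

# Class values as listed in this tutorial.
CLASS_NAMES = {0: "clear", 1: "thick_cloud", 2: "thin_cloud", 3: "cloud_shadow"}


def class_percentages(mask):
    """Return the percentage of pixels in each class of a cloud mask array."""
    counts = np.bincount(mask.ravel(), minlength=len(CLASS_NAMES))
    total = mask.size
    return {
        name: float(100.0 * counts[value] / total)
        for value, name in CLASS_NAMES.items()
    }


# Tiny synthetic mask (not the sample scene): 10 pixels in total.
toy_mask = np.array([[0, 0, 0, 0, 0], [0, 1, 1, 2, 3]], dtype=np.uint8)
toy_stats = class_percentages(toy_mask)
print(toy_stats)
```

For the toy mask this gives 60% clear, 20% thick cloud, 10% thin cloud, and 10% shadow. The same function works on the data array read with rasterio above.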

Post-Processing

The raw prediction often contains small holes within cloud regions and tiny isolated artifacts. The clean_raster function fills holes and removes small objects:

geoai.clean_raster(
    input_path="cloud_mask.tif",
    output_path="cloud_mask_clean.tif",
    min_island_size=100,
)

Any hole or island smaller than 100 pixels is filled or removed, producing a cleaner, more continuous cloud mask.
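The idea behind removing small islands can be sketched in plain Python with a flood fill. This is only an illustration of the concept, not how clean_raster is implemented: remove_small_islands is a hypothetical function, it treats every nonzero class as "cloud", and it assumes 4-connectivity.

```python
from collections import deque


def remove_small_islands(grid, min_island_size):
    """Set connected groups of nonzero cells smaller than min_island_size to 0.

    Uses 4-connectivity; returns a new grid and leaves the input untouched.
    """
    rows, cols = len(grid), len(grid[0])
    cleaned = [row[:] for row in grid]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects one connected island.
            island = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                island.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] != 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Islands below the size limit are erased in the output copy.
            if len(island) < min_island_size:
                for y, x in island:
                    cleaned[y][x] = 0
    return cleaned


toy = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
cleaned_toy = remove_small_islands(toy, min_island_size=2)
for row in cleaned_toy:
    print(row)
```

The four-pixel block survives while the single isolated pixel is removed. Filling holes is the same operation with the roles of zero and nonzero cells swapped.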

Converting to Vector and Smoothing Boundaries

To work with the cloud mask as polygons:

gdf = geoai.raster_to_vector("cloud_mask_clean.tif")
gdf_smooth = geoai.smooth_vector(gdf, smooth_iterations=3)

The smooth_vector function removes the pixelated staircase edges that come from raster-to-vector conversion. This is especially useful for natural boundaries like clouds and water bodies. If three iterations smooth too aggressively, decrease to one; for even smoother boundaries, increase the value.
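To build intuition for how the iteration count changes a boundary, here is a small sketch of Chaikin corner cutting, one common polygon-smoothing technique. Whether smooth_vector uses this particular algorithm is an assumption on my part; chaikin_smooth is a hypothetical function written only for illustration.

```python
def chaikin_smooth(ring, iterations=1):
    """Smooth a closed ring of (x, y) vertices by Chaikin corner cutting.

    The ring is given without a repeated closing vertex.
    """
    points = list(ring)
    for _ in range(iterations):
        smoothed = []
        for i, (x0, y0) in enumerate(points):
            x1, y1 = points[(i + 1) % len(points)]
            # Replace each edge with points at 1/4 and 3/4 of its length.
            smoothed.append((0.75 * x0 + 0.25 * x1, 0.75 * y0 + 0.25 * y1))
            smoothed.append((0.25 * x0 + 0.75 * x1, 0.25 * y0 + 0.75 * y1))
        points = smoothed
    return points


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
once = chaikin_smooth(square, iterations=1)
thrice = chaikin_smooth(square, iterations=3)
print(len(once), len(thrice))  # 8 32
```

Each pass doubles the vertex count and cuts every corner a little further, which is why more iterations give rounder outlines but also drift further from the original pixel edges.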

You can visualize the smoothed polygons overlaid on the original imagery:

geoai.view_vector_interactive(
    gdf_smooth,
    tiles=file_path,
    band_order=[4, 1, 2],
    max_value=3000,
)

Adding Geometry Properties

Optionally, you can compute geometric properties for each cloud polygon:

# Keep the returned GeoDataFrame, which carries the new property columns
gdf_props = geoai.add_geometry_properties(gdf_smooth)
gdf_props.describe()

This calculates area, perimeter, bounding box dimensions, orientation, and other shape metrics. For the sample scene, there are over 1,600 cloud objects with a wide range of sizes, which can be useful for studying cloud characteristics.
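For a sense of what the two simplest metrics involve, the sketch below computes area (shoelace formula) and perimeter for a single polygon ring in plain Python. It is not GeoAI's implementation; polygon_area_perimeter is a hypothetical helper, and it assumes projected coordinates so the results come out in metres and square metres.

```python
import math


def polygon_area_perimeter(ring):
    """Return (area, perimeter) of a simple polygon given as (x, y) vertices.

    The ring is given without a repeated closing vertex. Units follow the
    coordinates, so projected coordinates in metres give m^2 and m.
    """
    twice_area = 0.0
    perimeter = 0.0
    for i, (x0, y0) in enumerate(ring):
        x1, y1 = ring[(i + 1) % len(ring)]
        twice_area += x0 * y1 - x1 * y0  # shoelace term for this edge
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter


# A 30 m x 20 m rectangle standing in for a small cloud outline.
area, perimeter = polygon_area_perimeter([(0, 0), (30, 0), (30, 20), (0, 20)])
print(area, perimeter)  # 600.0 100.0
```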

Creating a Cloud-Free Mask

For downstream analysis, you often just need a binary mask separating cloud-free pixels from everything else (thick cloud, thin cloud, and shadow combined). You can generate this from the prediction raster and compare it side by side with the original imagery using the split map tool:

geoai.create_split_map(left_layer, right_layer)

The resulting mask lets you exclude cloudy areas before running any further analysis such as vegetation indices, land cover classification, or change detection.
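One way to build and apply such a mask with NumPy is sketched below. The function names and toy arrays are made up for illustration; the only thing taken from the tutorial is that class 0 means clear and every other class is cloud or shadow.

```python
import numpy as np


def clear_sky_mask(prediction):
    """Return 1 where the prediction is clear (class 0) and 0 elsewhere."""
    return (prediction == 0).astype(np.uint8)


def mask_cloudy_pixels(band, prediction, fill_value=np.nan):
    """Replace pixels flagged as cloud or shadow with fill_value."""
    return np.where(prediction == 0, band.astype("float32"), fill_value)


# Toy arrays standing in for a prediction raster and a single image band.
toy_prediction = np.array([[0, 1], [3, 0]], dtype=np.uint8)
toy_band = np.array([[100, 200], [300, 400]], dtype=np.uint16)

binary = clear_sky_mask(toy_prediction)
masked_band = mask_cloudy_pixels(toy_band, toy_prediction)
print(binary)
print(masked_band)
```

Using NaN as the fill value means later statistics can skip cloudy pixels with functions such as np.nanmean, at the cost of converting the band to floating point.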

Adjusting the Confidence Threshold

By default, the detection uses a confidence threshold of 0.5. If you see too many false positives (clear areas labeled as cloud), increase the threshold. If the model is missing clouds, decrease it. The right value depends on your imagery and study area, so it is worth experimenting.
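The effect of the threshold can be illustrated with a toy example. The confidence values below are invented, cloud_fraction is a hypothetical helper, and treating a pixel as cloud when its confidence is greater than or equal to the threshold is an assumption about how the comparison works.

```python
import numpy as np


def cloud_fraction(confidence, threshold=0.5):
    """Fraction of pixels whose cloud confidence meets the threshold."""
    return float(np.mean(confidence >= threshold))


# Made-up confidence values for eight pixels, for illustration only.
toy_confidence = np.array([0.05, 0.20, 0.45, 0.55, 0.60, 0.70, 0.90, 0.95])

for threshold in (0.3, 0.5, 0.7):
    print(threshold, cloud_fraction(toy_confidence, threshold))
```

Raising the threshold from 0.3 to 0.7 halves the share of pixels labeled as cloud in this toy case (from 0.75 to 0.375), which mirrors the trade-off between false positives and missed clouds described above.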

Summary

The full workflow requires just a handful of functions:

predict_cloud_mask_from_raster to detect clouds and shadows
clean_raster to remove small artifacts and fill holes
raster_to_vector and smooth_vector to produce clean vector polygons
add_geometry_properties for per-object statistics
A binary cloud-free mask for downstream analysis

Because OmniCloudMask is sensor-agnostic, the same workflow applies to any imagery with red, green, and NIR bands. To get started, check out the full notebook or run it directly in Google Colab.