WACV 2026

Power of Boundary & Reflection:
Semantic Transparent Object Segmentation
with Transparent Cues

Introducing TransCues — Pyramid Vision Transformer with BFE & RFE

Tuan-Anh Vu1 Hai Nguyen-Truong1 Ziqiang Zheng1 Binh-Son Hua2 Qing Guo3 Ivor W. Tsang4 Sai-Kit Yeung1

1HKUST    2Trinity College Dublin    3Nankai University    4CFAR, A*STAR

+4.2% mIoU · Trans10K-v2
+5.6% mIoU · MSD
+10.1% mIoU · RGBD-Mirror
+13.1% mIoU · TROSD
+8.3% mIoU · Stanford2D3D
Paper (WACV) · GitHub · arXiv

Two cues humans rely on.
One unified framework.

Glass is a prevalent material in everyday life, yet segmentation methods struggle to distinguish it from opaque materials due to its transparency and reflection. While human perception relies on boundary and reflective-object features to distinguish glass objects, the existing literature has not yet sufficiently captured both properties.

We propose TransCues, a pyramidal transformer encoder-decoder architecture to segment transparent objects. Our key modules — Boundary Feature Enhancement (BFE) and Reflection Feature Enhancement (RFE) — capture these two cues in a mutually beneficial way.

Extensive evaluations show state-of-the-art performance across glass segmentation, mirror segmentation, and generic segmentation benchmarks, demonstrating broad versatility beyond specialized transparent-object methods.

Keywords

Transparent Object Segmentation · Glass Detection · Mirror Segmentation · Pyramid Vision Transformer · Boundary Features · Reflection Cues

CORE INSIGHT — TWO VISUAL CUES

🔲

Boundary Cue (Geometric)

Glass objects exhibit high-contrast edges. The BFE module learns multi-scale boundary features via parallel convolution blocks (1×1 to 9×9 kernels), supervised by a Sobel-based boundary loss that measures gradient alignment with ground truth.
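The gradient-alignment idea behind this loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: the L1 comparison of Sobel magnitudes and all function names are assumptions.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal/vertical gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    """Valid-mode 2-D correlation with a 3x3 kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return out

def sobel_magnitude(mask):
    """Gradient magnitude of a soft or binary mask."""
    gx = conv2d(mask, SOBEL_X)
    gy = conv2d(mask, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def boundary_loss(pred, gt):
    """L1 distance between Sobel gradient magnitudes of prediction and GT."""
    return np.abs(sobel_magnitude(pred) - sobel_magnitude(gt)).mean()

# A mask compared with itself has zero boundary loss; a miss is penalised.
gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0
print(boundary_loss(gt, gt))                     # 0.0
print(boundary_loss(np.zeros((8, 8)), gt) > 0)   # True
```

The loss is zero only when the predicted mask's edges line up with the ground-truth edges, which is exactly the supervision signal boundary-sensitive modules need.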

🔮

Reflection Cue (Appearance)

Reflections on glass surfaces provide localised visual signals. The RFE module employs a convolution-deconvolution U-Net to capture and enhance reflection features, distinguishing glass from non-glass regions without requiring specialist sensors.

79.3%
mIoU · Trans10K-v2
91.0%
IoU · MSD

TransCues Architecture

An encoder-decoder pyramid transformer where BFE and RFE modules are placed at the end of the decoder — empirically the optimal placement — to capture boundary then reflection cues in sequence before final MLP prediction.

TransCues pipeline
FEM

Feature Extraction Module

PVT-based encoder that captures multi-scale long-range dependencies. Four stages output feature maps at 1/4, 1/8, 1/16, and 1/32 resolution.
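As a quick sanity check on those strides, the pyramid resolutions for a hypothetical 512×512 input work out as:

```python
# The four PVT stages downsample by 4, 8, 16, and 32 relative to the input.
strides = [4, 8, 16, 32]
input_size = 512  # hypothetical input resolution, for illustration only
shapes = [input_size // s for s in strides]
print(shapes)  # [128, 64, 32, 16]
```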

FPM

Feature Parsing Module

Lightweight decoder paired with the FEM encoder. Captures detailed-to-abstract representations of transparent objects across stages C1–C4 with minimal parameters.

BFE

Boundary Feature Enhancement

Parallel convolution blocks (1×1 to 9×9) fused via an ASPP-inspired fusion module. Supervised by a Sobel-based boundary loss on gradient alignment.
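A shape-level sketch of this multi-scale design, with unlearned box filters standing in for the learned convolutions and a plain mean standing in for the ASPP-inspired fusion (both substitutions, plus all names, are illustrative assumptions):

```python
import numpy as np

def box_filter(x, k):
    """Same-size k x k box filter (odd k), zero padding at the borders."""
    p = k // 2
    xp = np.pad(x, p)
    h, w = x.shape
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def bfe_sketch(feat):
    """Run parallel branches at kernel sizes 1x1..9x9, then fuse them."""
    branches = [box_filter(feat, k) for k in (1, 3, 5, 7, 9)]
    return np.mean(branches, axis=0)  # stand-in for learned fusion

feat = np.random.rand(16, 16)
out = bfe_sketch(feat)
print(out.shape)  # (16, 16)
```

The point of the parallel branches is that small kernels keep thin, sharp edges while large kernels aggregate context around wide glass borders; fusion lets the module weigh both.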

RFE

Reflection Feature Enhancement

Conv-deconv U-Net that detects localised reflections. Uses pseudo ground-truth for reflective categories (windows, cups, bottles) for supervision.
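The encode-downsample / decode-upsample data flow of such a conv-deconv U-Net can be mimicked without learned weights; this sketch only mirrors the structure and skip connections, not the actual RFE layers:

```python
import numpy as np

def down2(x):
    """2x2 average pooling (assumes even spatial dims)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    """Nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def rfe_sketch(feat):
    """U-Net-style flow: encode down, decode up, add skip connections."""
    e1 = down2(feat)      # encoder level 1
    e2 = down2(e1)        # bottleneck
    d1 = up2(e2) + e1     # decoder level 1 with skip connection
    d0 = up2(d1) + feat   # back to input resolution
    return d0

feat = np.random.rand(16, 16)
print(rfe_sketch(feat).shape)  # (16, 16)
```

The bottleneck forces the module to summarise where reflections occur globally, while the skip connections restore the fine local detail needed to localise them.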

Quantitative Results

TransCues outperforms the state-of-the-art across all three tasks — glass, mirror, and generic segmentation — using only RGB input, without depth or polarization data.

Method | GFLOPs ↓ | MParams ↓ | ACC ↑ | mIoU ↑
Trans4Trans-T | 10.45 | — | 93.23 | 68.63
Ours-T | 10.50 | 12.72 | 93.52 | 69.53
Trans4Trans-S | 19.92 | — | 94.57 | 74.15
Ours-S | 20.00 | 23.98 | 94.83 | 75.32
Ours-B1 | 21.29 | 14.87 | 95.37 | 77.05
Trans4Trans-M | 34.38 | — | 95.01 | 75.14
DenseASPP | 36.20 | 29.09 | 90.86 | 63.01
Ours-B2 | 37.03 | 27.59 | 95.92 | 79.29
Trans2Seg | 49.03 | 56.20 | 94.14 | 72.15
Ours-B5 | 154.37 | 106.19 | 96.93 | 81.37
Method | Backbone | MSD IoU ↑ | MSD Fβ ↑ | MSD MAE ↓ | RGBD-M IoU ↑ | PMD IoU ↑
SANet | ResNeXt101 | 79.85 | 0.879 | 0.054 | 74.99 | 66.84
VCNet | ResNeXt101 | 80.08 | 0.898 | 0.044 | 73.01 | 64.02
SATNet | Swin-S | 85.41 | 0.922 | 0.033 | 78.42 | 69.38
HetNet | — | 82.80 | 0.906 | 0.043 | — | 69.00
Ours-B3 | PVTv2-B3 | 91.04 | 0.953 | 0.028 | 88.52 | 69.61
Method | Input | TROSD IoU ↑ | TROSD mIoU ↑ | S2D3D mIoU ↑
TransLab | RGB | 42.57 | 50.72 | —
DANet | RGB | 42.76 | 54.39 | —
TROSNet | RGB | 48.75 | 48.56 | —
Trans4Trans-M | RGB | — | — | 45.73
Ours-B2/B3 | RGB | 67.25 | 67.23 | 53.98

↗ +18.5 IoU points over TROSNet on TROSD, without using depth input.

Method | Backbone | GFLOPs ↓ | RGB-P mIoU ↑ | GSD-S mIoU ↑
SegFormer | MiT-B5 | 70.2 | 78.4 | 54.7
GSD | ResNeXt-101 | 92.7 | 78.1 | 72.1
SETR | ViT-Large | 240.1 | 77.6 | 56.7
GDNet | ResNeXt-101 | 271.5 | 77.6 | 52.9
Ours-B4 | PVTv2-B4 | 79.3 | 82.1 | 74.1

Visual Comparisons

TransCues accurately identifies glass and mirror regions of diverse dimensions and morphologies, differentiating them from look-alike non-glass regions. Our boundary and reflection modules reduce both over-detection and under-detection errors common in prior work.

FIGURE 4  ·  Glass segmentation comparison on Trans10K-v2, RGB-P, GSD-S (Input / GT / Trans2Seg / Trans4Trans / Ours)

Glass Segmentation Comparison

FIGURE 5  ·  Mirror segmentation comparison on MSD, PMD, RGBD-Mirror (Input / GT / SANet / VCNet / SATNet / Ours)

Mirror Segmentation Comparison

FIGURE 6  ·  Failure cases on Trans10K-v2

Failure cases
LIMITATIONS
Our method may confuse objects with glass-like properties (e.g. door frames with reflections/distortion) or struggle when glass lacks sufficient reflective signal. Boundary shapes are generally preserved even under misclassification.

Module & Design Validation

EFFECTIVENESS OF MODULES · TRANS10K-V2

Backbone | BFE | RFE | S2D3D mIoU ↑ | Trans10K mIoU ↑
PVTv1-T | — | — | 45.19 | 69.44
PVTv2-B1 | — | — | 46.79 | 70.49
PVTv2-B1 | — | ✓ | 48.12 | 72.65 (+3.2)
PVTv2-B1 | ✓ | — | 50.22 | 74.89 (+5.5)
PVTv2-B1 | ✓ | ✓ | 51.55 | 77.05 (+7.6)

BFE alone yields greater gains than RFE alone, confirming boundary cues as the primary discriminative signal for generic scenes too.

PLACEMENT OF MODULES · TRANS10K-V2

Variant | MParams | mIoU ↑
Baseline (no modules) | 13.89 | 70.49
+ RFE→BFE in Encoder | 48.98 | 73.54
+ BFE→RFE in Encoder | 48.99 | 74.12
+ RFE→BFE in Decoder | 14.90 | 75.11
+ BFE ∥ RFE in Decoder | 14.91 | 75.44
TransCues (BFE→RFE in Decoder) | 14.87 | 77.05

Decoder placement is significantly more parameter-efficient than encoder placement (14.87M vs 48.99M) while achieving the best performance.

How to Cite

BIBTEX
@inproceedings{vu2026transcues,
  title     = {Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues},
  author    = {Vu, Tuan-Anh and Nguyen-Truong, Hai and Zheng, Ziqiang and Hua, Binh-Son and Guo, Qing and Tsang, Ivor W. and Yeung, Sai-Kit},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}

BACKBONE

PVTv1 / PVTv2 (Tiny to B5 variants)

TRAINING

AdamW · lr 1e-4 · Batch 8 · 1× RTX 3090

DATASETS

Trans10K-v2 · MSD · PMD · RGBD-Mirror · TROSD · Stanford2D3D · RGB-P · GSD-S

INPUT

RGB only — no depth, no polarization