2024 New
Toward Dynamic Non-Line-of-Sight Imaging with Mamba Enforced Temporal Consistency
Yue Li, Yi Sun, Shida Sun, Juntian Ye, Yueyi Zhang, Feihu Xu, Zhiwei Xiong
Advances in Neural Information Processing Systems (NeurIPS), 2024
Paper | Code | Abstract
Dynamic reconstruction in confocal non-line-of-sight imaging encounters great challenges since the dense raster-scanning manner limits the practical frame rate. A few pioneering works reconstruct high-resolution volumes from under-scanned transient measurements but overlook temporal consistency among transient frames. To fully exploit multi-frame information, we propose the first spatial-temporal Mamba (ST-Mamba) based method tailored for dynamic reconstruction of transient videos. Our method capitalizes on neighboring transient frames to aggregate the target 3D hidden volume. Specifically, the interleaved features extracted from the input transient frames are fed to the proposed ST-Mamba blocks, which leverage the time-resolving causality in transient measurement. The cross ST-Mamba blocks are then devised to integrate the adjacent transient features. The target high-resolution transient frame is subsequently recovered by the transient spreading module. After transient fusion and recovery, a physics-based network is employed to reconstruct the hidden volume. To tackle the substantial noise inherent in transient videos, we propose a wave-based loss function to impose constraints within the phasor field. Besides, we introduce a new dataset comprising synthetic videos for training and real-world videos for evaluation. Extensive experiments showcase the superior performance of our method on both synthetic data and real-world data captured by different imaging setups. The code and data are available at https://github.com/Depth2World/Dynamic_NLOS.
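The abstract does not spell out the wave-based loss in code. Below is a minimal PyTorch sketch of one plausible reading: the predicted and ground-truth transient volumes are compared in the phasor (frequency) domain via an FFT along the time axis. The function name, tensor layout, and number of retained frequencies are illustrative assumptions, not the authors' released implementation (see the repository linked above for the actual code).

```python
import torch

def phasor_loss(pred, target, num_freqs=32):
    """Illustrative wave-based loss (assumption, not the released code):
    compare transient volumes in the phasor/frequency domain.
    pred, target: (B, T, H, W) transient measurements with time axis T."""
    pred_f = torch.fft.rfft(pred, dim=1)[:, :num_freqs]
    target_f = torch.fft.rfft(target, dim=1)[:, :num_freqs]
    # L1 on real and imaginary parts constrains both amplitude and phase.
    return (pred_f.real - target_f.real).abs().mean() + \
           (pred_f.imag - target_f.imag).abs().mean()
```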
Recurrent Cross-Modality Fusion for Time-of-Flight Depth Denoising
Guanting Dong, Yueyi Zhang, Xiaoyan Sun, Zhiwei Xiong
IEEE Transactions on Computational Imaging, 2024
Paper | Code | Abstract
The widespread use of Time-of-Flight (ToF) depth cameras in academia and industry is limited by noise, such as Multi-Path Interference (MPI) and shot noise, which hampers their ability to produce high-quality depth images. Existing learning-based ToF denoising methods often struggle to deliver satisfactory performance in complex scenes. This is primarily because MPI arises from the superposition of multiple reflected signals, which makes it challenging to predict directly through spatially-varying convolutions. To address this limitation, we adopt a recurrent architecture that exploits the prior that MPI is decomposable into an additive combination of the geometric information of neighboring pixels. Our approach employs a Gated Recurrent Unit (GRU) based network to estimate a long-distance aggregation process, simplifying MPI removal and updating the depth correction over multiple steps. Additionally, we introduce a global restoration module and a local update module to fuse depth and amplitude features, which improves denoising performance and prevents structural distortions. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our approach over state-of-the-art methods.
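To make the multi-step correction concrete, here is a minimal PyTorch sketch of a convolutional GRU that iteratively refines a noisy ToF depth map from depth and amplitude inputs. It only illustrates the recurrent-update idea; class names, channel sizes, and the number of steps are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (illustrative)."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.conv_h(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new

class RecurrentDepthRefiner(nn.Module):
    """Predict depth residuals over several GRU steps from depth + amplitude."""
    def __init__(self, feat_ch=32, steps=4):
        super().__init__()
        self.feat_ch, self.steps = feat_ch, steps
        self.encode = nn.Conv2d(2, feat_ch, 3, padding=1)       # depth + amplitude
        self.cell = ConvGRUCell(feat_ch, feat_ch)
        self.to_residual = nn.Conv2d(feat_ch, 1, 3, padding=1)

    def forward(self, depth, amplitude):
        b, _, h, w = depth.shape
        hidden = depth.new_zeros(b, self.feat_ch, h, w)
        for _ in range(self.steps):
            feat = self.encode(torch.cat([depth, amplitude], dim=1))
            hidden = self.cell(feat, hidden)
            depth = depth + self.to_residual(hidden)             # multi-step correction
        return depth
```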
Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising
Guanting Dong, Yueyi Zhang, Xiaoyan Sun, Zhiwei Xiong
European Conference on Computer Vision (ECCV), 2024
Paper | Code | Abstract
Recent advancements have achieved impressive results in removing Multi-Path Interference (MPI) and shot noise. However, these methods utilize only a single frame of ToF data, neglecting the correlation between frames; multi-frame ToF denoising remains underexplored. In this paper, we propose the first learning-based framework for multi-frame ToF denoising. Different from previous frameworks, ours leverages inter-frame correlation to guide ToF noise removal with a confidence map. Specifically, we introduce a Dual-Correlation Estimation Module, which exploits both intra- and inter-correlation. The intra-correlation explicitly establishes the relevance between the spatial positions of geometric objects within the scene, aiding in depth residual initialization. The inter-correlation discerns variations in ToF noise distribution across different frames, thereby locating the areas with strong noise. To further leverage dual-correlation, we introduce a Confidence-guided Residual Regression Module to predict a confidence map, which guides the residual regression to prioritize regions with strong ToF noise. Experimental evaluations consistently show that our approach outperforms other ToF denoising methods, highlighting its superior performance in effectively reducing strong ToF noise.
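As a rough illustration of confidence-guided residual regression, the sketch below computes an inter-frame cosine correlation between reference and neighbor features, turns low correlation into a confidence map for strong-noise regions, and uses it to gate a predicted depth residual. All names and shapes are hypothetical and do not reproduce the paper's modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceGuidedResidual(nn.Module):
    """Illustrative sketch: inter-frame correlation -> confidence -> gated residual.
    feat_ref, feat_nbr: (B, ch, H, W); depth_ref: (B, 1, H, W)."""
    def __init__(self, ch=32):
        super().__init__()
        self.to_conf = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
        self.to_residual = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, feat_ref, feat_nbr, depth_ref):
        # Inter-correlation: cosine similarity between reference and neighbor features.
        corr = F.cosine_similarity(feat_ref, feat_nbr, dim=1, eps=1e-6).unsqueeze(1)
        conf = self.to_conf(1.0 - corr)           # low similarity -> likely strong noise
        residual = self.to_residual(torch.cat([feat_ref, feat_nbr], dim=1))
        return depth_ref + conf * residual        # confidence-guided correction
```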
CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based Understanding
Wenhao Xu, Wenming Weng, Yueyi Zhang, Zhiwei Xiong
European Conference on Computer Vision Workshops (ECCVW), 2024
Paper | Code | Abstract
We present CEIA, an effective framework for open-world event-based understanding. Training a large event-text model currently poses a huge challenge due to the shortage of paired event-text data. In response to this challenge, CEIA learns to align event and image data instead of directly aligning event and text data. Specifically, we leverage the rich event-image datasets to learn an event embedding space aligned with the image space of CLIP through contrastive learning. In this way, event and text data are naturally aligned by using image data as a bridge. In particular, CEIA offers two distinct advantages. First, it allows us to take full advantage of existing event-image datasets to make up for the shortage of large-scale event-text datasets. Second, leveraging more training data, it also exhibits the flexibility to boost performance, ensuring scalable capability. To highlight the versatility of our framework, we conduct extensive evaluations on a diverse range of event-based multi-modal applications, such as object recognition, event-image retrieval, event-text retrieval, and domain adaptation. The outcomes demonstrate CEIA's distinct zero-shot superiority over existing methods on these applications.
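The alignment objective is standard CLIP-style contrastive learning, so a compact sketch is possible. The snippet below shows a symmetric InfoNCE loss between event embeddings and embeddings from a frozen CLIP image encoder; the function name and temperature value are illustrative assumptions, not CEIA's exact training code.

```python
import torch
import torch.nn.functional as F

def event_image_contrastive_loss(event_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning event embeddings with (frozen) CLIP
    image embeddings. event_emb, image_emb: (B, D), paired row-wise."""
    event_emb = F.normalize(event_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = event_emb @ image_emb.t() / temperature
    targets = torch.arange(event_emb.size(0), device=event_emb.device)
    # Cross-entropy in both directions, as in CLIP-style contrastive training.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```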
Event-Adapted Video Super-Resolution
Zeyu Xiao, Dachun Kai, Yueyi Zhang, Zheng-Jun Zha, Xiaoyan Sun, Zhiwei Xiong
European Conference on Computer Vision (ECCV), 2024
Paper | Code | Abstract
Introducing event cameras into video super-resolution (VSR) shows great promise. In practice, however, integrating event data as a new modality necessitates a laborious model architecture design. This not only consumes substantial time and effort but also disregards valuable insights from successful existing VSR models. Furthermore, the resource-intensive process of retraining these newly designed structures exacerbates the challenge. In this paper, inspired by the recent success of parameter-efficient tuning in reducing the number of trainable parameters of a pre-trained model for downstream tasks, we introduce the Event AdapTER (EATER) for VSR. EATER efficiently utilizes pre-trained VSR model knowledge at the feature level through two lightweight and trainable components: the event-adapted alignment (EAA) unit and the event-adapted fusion (EAF) unit. The EAA unit aligns multiple frames based on the event stream in a coarse-to-fine manner, while the EAF unit efficiently fuses frames with the event stream through a multi-scale design. Thanks to both units, EATER outperforms the full fine-tuning paradigm. Comprehensive experiments demonstrate the effectiveness of EATER, achieving superior results with parameter efficiency.
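To illustrate the general adapter idea (not EATER's exact EAA/EAF design), the sketch below adds a lightweight bottleneck adapter that injects event features into frozen VSR backbone features and trains only the adapter parameters. Names and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class EventAdapter(nn.Module):
    """Illustrative bottleneck adapter: inject event features into frozen
    VSR backbone features via a small residual branch."""
    def __init__(self, frame_ch, event_ch, bottleneck=16):
        super().__init__()
        self.down = nn.Conv2d(frame_ch + event_ch, bottleneck, 1)
        self.up = nn.Conv2d(bottleneck, frame_ch, 1)
        self.act = nn.GELU()

    def forward(self, frame_feat, event_feat):
        fused = self.down(torch.cat([frame_feat, event_feat], dim=1))
        return frame_feat + self.up(self.act(fused))   # residual adaptation

def freeze_backbone_train_adapters(model):
    """Parameter-efficient tuning: only parameters whose names contain
    'adapter' stay trainable; the pre-trained backbone is frozen."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```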
Joint Flow Estimation from Point Clouds and Event Streams
Hanlin Li, Yueyi Zhang, Guanting Dong, Shida Sun, Zhiwei Xiong
IEEE International Conference on Multimedia and Expo (ICME), 2024
Paper | Code | Abstract
Understanding scene dynamics relies heavily on optical flow and scene flow. Most existing flow estimation methods use low-rate RGB images and point clouds, and match the frames geometrically. However, this approach faces challenges in real-world scenes with intricate motion, occlusion, and noise. To tackle this problem, we combine point clouds with events, which introduce dynamic inter-frame information. We propose a bi-stream neural network that jointly estimates optical flow and scene flow. The event branch extracts dynamic information and estimates optical flow, while the point branch captures scene structure and estimates scene flow. A Spatio-temporal Fusion Block is introduced to fuse the complementary information from points and events. Additionally, we adopt a result-level fusion strategy for direct refinement between the flow predictions of the two branches. We evaluate our model on the real-world datasets DSEC and MVSEC. The experimental results demonstrate superior performance compared to existing methods.
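A minimal sketch of one way to fuse the two branches is shown below: per-point features from the point branch are combined with event features (assumed to be already sampled at the points' projected pixel locations) through a learned gate. This illustrates feature-level fusion in general, not the paper's Spatio-temporal Fusion Block; the class and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusionBlock(nn.Module):
    """Illustrative gated fusion of per-point features from the point branch
    with event features aligned to the same points.
    point_feat, event_feat: (B, N, C)."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * ch, ch), nn.Sigmoid())
        self.proj = nn.Linear(2 * ch, ch)

    def forward(self, point_feat, event_feat):
        joint = torch.cat([point_feat, event_feat], dim=-1)
        g = self.gate(joint)                      # per-channel gate in [0, 1]
        return g * self.proj(joint) + (1 - g) * point_feat
```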
Depth From Asymmetric Frame-Event Stereo: A Divide-and-Conquer Approach
Xihao Chen, Wenming Weng, Yueyi Zhang, Zhiwei Xiong
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024
Paper | Code | Abstract
Event cameras asynchronously measure brightness changes in a scene without motion blur or saturation, while frame cameras capture images with dense intensity and fine details at a fixed rate. The exclusive advantages of the two modalities make depth estimation from Stereo Asymmetric Frame-Event (SAFE) systems appealing. However, due to the inevitable absence of information from one modality in certain challenging regions, existing stereo matching methods lose efficacy for asymmetric inputs from SAFE systems. In this paper, we propose a divide-and-conquer approach that decomposes depth estimation from SAFE systems into three sub-tasks, i.e., frame-event stereo matching, frame-based Structure-from-Motion (SfM), and event-based SfM. In this way, the above challenging regions are addressed by monocular SfM, which estimates robust depth with two views belonging to the same functioning modality. Moreover, we propose a dual sampling strategy to construct cost volumes with identical spatial locations and depth hypotheses for different sub-tasks, which enables sub-task fusion at the cost volume level. To tackle the occlusion issue introduced by the sampling strategy, we further introduce a temporal fusion scheme to utilize long-term sequential inputs with multi-view information. Experimental results validate the superior performance of our method over existing solutions.
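Since the three sub-tasks share identical spatial locations and depth hypotheses, their cost volumes can be fused directly. The sketch below shows one simple possibility: stack the three volumes and blend them with learned, softmax-normalized per-hypothesis weights. The module is an assumption for illustration, not the paper's fusion network.

```python
import torch
import torch.nn as nn

class CostVolumeFusion(nn.Module):
    """Illustrative fusion of cost volumes from the three sub-tasks
    (frame-event stereo, frame SfM, event SfM), assuming all share the same
    spatial locations and depth hypotheses: each volume is (B, D, H, W)."""
    def __init__(self):
        super().__init__()
        self.weight_net = nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, cv_stereo, cv_frame_sfm, cv_event_sfm):
        stacked = torch.stack([cv_stereo, cv_frame_sfm, cv_event_sfm], dim=1)  # (B, 3, D, H, W)
        weights = torch.softmax(self.weight_net(stacked), dim=1)               # per-source weights
        return (weights * stacked).sum(dim=1)                                  # fused (B, D, H, W)
```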