Research Topic: Manipulation
2024 New
Learning Complementary Maps for Light Field Salient Object Detection
Zeyu Xiao, Jiateng Shou, and Zhiwei Xiong
Paper | Abstract
Light field imaging presents a promising avenue for advancing salient object detection (SOD). However, existing light field SOD (LFSOD) methods grapple with challenges in effectively aggregating features from all-in-focus (AiF) images and focal slices. These methods often under-utilize the complementary nature of saliency and non-saliency maps, leading to inaccurate predictions, particularly at fine boundaries. To tackle these limitations, we introduce a novel method for LFSOD. Our method incorporates a Cross-Modality Aggregation (CMA) module at multiple levels, facilitating the efficient fusion of AiF image and focal slice features. This progressive aggregation capitalizes on global and local dependencies to harness implicit geometric information in an LF. Based on the observation that salient regions and their non-salient counterparts are complementary, so that a better estimate of one side leads to an improved estimate of the other and vice versa, we introduce the Complementary Saliency Map Generator (CSMG). The CSMG generates saliency and non-saliency maps interactively to leverage this inherent complementary relationship. Extensive experiments on benchmark datasets demonstrate that our proposed method achieves superior performance in LFSOD.
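As a rough illustration of the complementary-map idea (not the paper's exact CSMG), the sketch below predicts a saliency map and a non-saliency map from a shared fused feature, lets each branch condition on the other's initial estimate, and penalizes deviations of the two maps from summing to one. All module and loss names are placeholders.

```python
# Minimal sketch of the complementary-map idea: two interacting branches plus a
# complementarity constraint. Not the paper's architecture; names are illustrative.
import torch
import torch.nn as nn

class ComplementaryHead(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.init_sal = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.init_non = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.sal_branch = nn.Conv2d(channels + 1, 1, kernel_size=3, padding=1)
        self.non_branch = nn.Conv2d(channels + 1, 1, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor):
        # Initial estimates from the fused AiF/focal-slice feature.
        s0 = torch.sigmoid(self.init_sal(feat))
        n0 = torch.sigmoid(self.init_non(feat))
        # Each refined map is conditioned on the other branch's estimate.
        s = torch.sigmoid(self.sal_branch(torch.cat([feat, n0], dim=1)))
        n = torch.sigmoid(self.non_branch(torch.cat([feat, s0], dim=1)))
        return s, n

def complementary_loss(s, n):
    # Encourage saliency + non-saliency to cover every pixel exactly once.
    return ((s + n - 1.0) ** 2).mean()

if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    s, n = ComplementaryHead()(feat)
    print(s.shape, complementary_loss(s, n).item())
```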
MLP Embedded Inverse Tone Mapping
Panjun Liu, Jiacheng Li, Lizhi Wang, Zheng-Jun Zha, Zhiwei Xiong
Paper | Code | Abstract
The advent of High Dynamic Range/Wide Color Gamut (HDR/WCG) display technology has brought exceptional richness and vibrancy to the human visual experience. However, the widespread adoption of HDR/WCG images is hindered by their substantial storage requirements, which impose significant bandwidth challenges during distribution. Moreover, HDR/WCG images are often tone-mapped into Standard Dynamic Range (SDR) versions for compatibility, necessitating the use of inverse Tone Mapping (iTM) techniques to reconstruct their original representation. In this work, we propose a meta-transfer learning framework for practical HDR/WCG media transmission by embedding image-wise metadata into their SDR counterparts for later iTM reconstruction. Specifically, we devise a meta-learning strategy to pre-train a lightweight multilayer perceptron (MLP) model that maps SDR pixels to HDR/WCG ones on an external dataset, resulting in a domain-wise iTM model. Subsequently, for the transfer learning process of each HDR/WCG image, we present a spatial-aware online mining mechanism that selects challenging training pairs to adapt the meta-trained model into an image-wise iTM model. Finally, the adapted MLP, embedded as metadata, is transmitted alongside the SDR image, facilitating the reconstruction of the original image on HDR/WCG displays. We conduct extensive experiments and evaluate the proposed framework with diverse metrics. Compared with existing solutions, our framework achieves superior fidelity with minimal latency and negligible overhead. The codes are available at https://github.com/pjliu3/MLP_iTM.
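The core mechanism lends itself to a compact sketch: a tiny MLP maps SDR RGB values to HDR/WCG values, and a brief per-image fine-tuning step stands in for the transfer stage, with the adapted weights playing the role of the embedded metadata. Network width, optimizer settings, and the pair-sampling step below are assumptions, not the paper's configuration.

```python
# Hedged sketch of pixel-wise iTM with per-image adaptation; sizes and the
# sampling of training pairs are placeholders for the paper's actual choices.
import torch
import torch.nn as nn

class PixelMLP(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, sdr_rgb: torch.Tensor) -> torch.Tensor:
        return self.net(sdr_rgb)

def adapt_to_image(model: PixelMLP, sdr_pixels, hdr_pixels, steps: int = 200):
    """Image-wise transfer step: fine-tune the (meta-)pre-trained MLP on sampled
    SDR->HDR pixel pairs from a single image (a stand-in for the paper's
    spatial-aware online mining)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.l1_loss(model(sdr_pixels), hdr_pixels)
        loss.backward()
        opt.step()
    return model.state_dict()  # this compact state dict plays the metadata role

if __name__ == "__main__":
    sdr = torch.rand(4096, 3)
    hdr = sdr ** 2.2 * 4.0          # toy stand-in for a real SDR/HDR pair
    meta = adapt_to_image(PixelMLP(), sdr, hdr)
    print(sum(v.numel() for v in meta.values()), "parameters embedded as metadata")
```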
Diffusion-Promoted HDR Video Reconstruction
Yuanshen Guan, Ruikang Xu, Mingde Yao, Ruisheng Gao, Lizhi Wang, Zhiwei Xiong
Paper | Code | Abstract
High dynamic range (HDR) video reconstruction aims to generate HDR videos from low dynamic range (LDR) frames captured with alternating exposures. Most existing works rely solely on the regression-based paradigm, leading to adverse effects such as ghosting artifacts and missing details in saturated regions. In this paper, we propose a diffusion-promoted method for HDR video reconstruction, termed HDR-V-Diff, which incorporates a diffusion model to capture the HDR distribution. As such, HDR-V-Diff can reconstruct HDR videos with realistic details while alleviating ghosting artifacts. However, directly introducing video diffusion models would impose a massive computational burden. To alleviate this burden, we instead first propose an HDR Latent Diffusion Model (HDR-LDM) to learn the distribution prior of single HDR frames. Specifically, HDR-LDM incorporates a tone-mapping strategy to compress HDR frames into the latent space and a novel exposure embedding to aggregate the exposure information into the diffusion process. We then propose a Temporal-Consistent Alignment Module (TCAM) to learn temporal information as a complement to HDR-LDM, conducting coarse-to-fine feature alignment at different scales among video frames. Finally, we design a Zero-Init Cross-Attention (ZiCA) mechanism to effectively integrate the learned distribution prior and temporal information for generating HDR frames. Extensive experiments validate that HDR-V-Diff achieves state-of-the-art results on several representative datasets.
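The zero-initialized fusion step is the easiest piece to illustrate. The hedged sketch below shows a cross-attention block whose output is scaled by a gate parameter initialized to zero, so the prior is injected gradually during training; dimensions, names, and the token layout are assumed rather than taken from the paper.

```python
# Illustrative zero-initialized cross-attention in the spirit of ZiCA; not the
# exact HDR-V-Diff module. Query: temporal features; key/value: diffusion prior.
import torch
import torch.nn as nn

class ZeroInitCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, temporal_tokens, prior_tokens):
        out, _ = self.attn(temporal_tokens, prior_tokens, prior_tokens)
        # Residual fusion: the prior contributes nothing at initialization.
        return temporal_tokens + self.gate * out

if __name__ == "__main__":
    t = torch.randn(2, 256, 128)   # temporally aligned features as tokens
    p = torch.randn(2, 256, 128)   # distribution-prior features as tokens
    print(ZeroInitCrossAttention()(t, p).shape)
```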
Diffusion-based Light Field Synthesis
Ruisheng Gao, Yutong Liu, Zeyu Xiao, Zhiwei Xiong
Paper | Code | Abstract
Light fields (LFs), which record comprehensive scene radiance across angular dimensions, find wide applications in 3D reconstruction, virtual reality, and computational photography. However, LF acquisition is inevitably time-consuming and resource-intensive due to the mainstream acquisition strategy involving manual capture or laborious software synthesis. To address this challenge, we introduce LFdiff, a straightforward yet effective diffusion-based generative framework tailored for LF synthesis, which adopts only a single RGB image as input. LFdiff leverages disparity estimated by a monocular depth estimation network and incorporates two distinctive components: a novel condition scheme and a noise estimation network tailored for LF data. Specifically, we design a position-aware warping condition scheme, enhancing inter-view geometry learning via a robust conditional signal. We then propose DistgUnet, a disentanglement-based noise estimation network, to harness comprehensive LF information. Extensive experiments demonstrate that LFdiff excels in synthesizing visually pleasing and disparity-controllable light fields with enhanced generalization ability. Furthermore, comprehensive results affirm the broad applicability of the generated LF data, spanning applications like LF super-resolution and refocusing.
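A minimal sketch of the disparity-based warping that such a condition scheme builds on is given below: the center-view image is shifted toward a target sub-aperture view according to the estimated disparity and the view's angular offset. The exact conditioning used by LFdiff may differ; this is a generic warping routine.

```python
# Warp a center view to a neighboring sub-aperture view from disparity.
# A generic routine for illustration, not LFdiff's actual condition generator.
import torch
import torch.nn.functional as F

def warp_to_view(center: torch.Tensor, disparity: torch.Tensor, du: float, dv: float):
    """center: (B,3,H,W); disparity: (B,1,H,W) in pixels per unit view offset."""
    b, _, h, w = center.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Convert the pixel shift (disparity * view offset) to normalized coordinates.
    grid[..., 0] += 2.0 * du * disparity[:, 0] / max(w - 1, 1)
    grid[..., 1] += 2.0 * dv * disparity[:, 0] / max(h - 1, 1)
    return F.grid_sample(center, grid, align_corners=True)

if __name__ == "__main__":
    img = torch.rand(1, 3, 64, 64)
    disp = torch.full((1, 1, 64, 64), 2.0)
    print(warp_to_view(img, disp, du=1.0, dv=0.0).shape)  # one neighboring view
```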
Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting
Kaidong Zhang, Jialun Peng, Jingjing Fu, Dong Liu
Paper | Code | Abstract
Transformers have been widely used for video processing owing to the multi-head self-attention (MHSA) mechanism. However, the MHSA mechanism encounters an intrinsic difficulty for video inpainting, since the features associated with the corrupted regions are degraded and incur inaccurate self-attention. This problem, termed query degradation, may be mitigated by first completing optical flows and then using the flows to guide the self-attention, as verified in our previous work, the flow-guided transformer (FGT). We further exploit the flow guidance and propose FGT++ to pursue more effective and efficient video inpainting. First, we design a lightweight flow completion network using local aggregation and an edge loss. Second, to address query degradation, we propose a flow guidance feature integration module, which uses the motion discrepancy to enhance the features, together with a flow-guided feature propagation module that warps the features according to the flows. Third, we decouple the transformer along the temporal and spatial dimensions, where flows are used to select tokens through a temporally deformable MHSA mechanism, and global tokens are combined with the inner-window local tokens through a dual-perspective MHSA mechanism. Experiments show that FGT++ outperforms existing video inpainting networks both qualitatively and quantitatively.
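As one concrete ingredient, an edge-style loss for flow completion can be sketched as penalizing the difference between the spatial gradients of the completed and ground-truth flows, which emphasizes motion boundaries over smooth regions. The concrete loss used in FGT++ may be defined differently; the version below is an assumption.

```python
# Hedged sketch of an edge-style loss for flow completion (gradient matching).
import torch

def spatial_gradients(x: torch.Tensor):
    gx = x[..., :, 1:] - x[..., :, :-1]   # horizontal differences
    gy = x[..., 1:, :] - x[..., :-1, :]   # vertical differences
    return gx, gy

def edge_loss(pred_flow: torch.Tensor, gt_flow: torch.Tensor) -> torch.Tensor:
    pgx, pgy = spatial_gradients(pred_flow)
    ggx, ggy = spatial_gradients(gt_flow)
    return (pgx - ggx).abs().mean() + (pgy - ggy).abs().mean()

if __name__ == "__main__":
    pred = torch.randn(1, 2, 64, 64)   # completed flow
    gt = torch.randn(1, 2, 64, 64)     # ground-truth flow
    print(edge_loss(pred, gt).item())
```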
Towards Interactive Image Inpainting via Robust Sketch Refinement
Chang Liu, Shunxin Xu, Jialun Peng, Kaidong Zhang, Dong Liu
Paper | Code | Abstract
One tough problem of image inpainting is to restore complex structures in the corrupted regions. This motivates interactive image inpainting, which leverages additional hints, e.g., sketches, to assist the inpainting process. A sketch is simple and intuitive for end users to provide, but it has free forms with much randomness. Such randomness may confuse the inpainting models and incur severe artifacts in completed images. To better facilitate image inpainting with sketch guidance, we propose a two-stage image inpainting system, termed SketchRefiner. The first stage of our approach serves as a data provider that simulates real sketches and derives the capability of sketch calibration from the simulated data. In the second stage, our approach aligns the sketch guidance with the inpainting process so as to elevate image inpainting with sketches. We also propose a real-world test protocol to address the evaluation of inpainting methods in practical applications with user sketches. Experimental results on three prevailing benchmark datasets, i.e., CelebA-HQ, Places2, and ImageNet, and on the proposed test protocol demonstrate the state-of-the-art performance of our approach and its great potential for real-world applications. Further analyses illustrate that our approach effectively utilizes sketch information as guidance and eliminates the artifacts caused by free-form sketches.
Disparity-guided Multi-view Interaction Network for Light Field Reflection Removal
Yutong Liu, Wenming Weng, Ruisheng Gao, Zeyu Xiao, Yueyi Zhang, Zhiwei Xiong
Paper | Code | Abstract
Light field (LF) imaging presents a promising avenue for reflection removal, owing to its reliable depth perception and its use of complementary texture details from multiple sub-aperture images (SAIs). However, the domain shifts between real-world and synthetic scenes, as well as the challenge of embedding transmission information across SAIs, pose the main obstacles in this task. In this paper, we address these challenges from the perspectives of data and network, respectively. To mitigate domain shifts, we propose an efficient data synthesis strategy for simulating realistic reflection scenes and build the largest LF reflection dataset to date, containing 420 synthetic scenes and 70 real-world scenes. To enable transmission information embedding across SAIs, we propose a novel Disparity-guided Multi-view Interaction Network (DMINet) for LF reflection removal. DMINet mainly consists of a transmission disparity estimation (TDE) module and a center-side interaction (CSI) module. The TDE module predicts transmission disparity by filtering out reflection disturbances, while the CSI module is responsible for transmission integration, adopting the central view as the bridge for propagation between different SAIs. Compared with existing reflection removal methods for LF input, DMINet achieves a distinct performance boost with the merits of efficiency and robustness, especially for scenes with complex depth variations.
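A schematic sketch of the center-as-bridge interaction is given below: side-view features are first aggregated into the central view, and the updated central feature is then redistributed to each side view. The fusion operators and channel sizes are placeholders, not the actual DMINet design.

```python
# Placeholder center-side interaction: gather side views into the center,
# then scatter the updated center back to each side view.
import torch
import torch.nn as nn

class CenterSideInteraction(nn.Module):
    def __init__(self, channels: int = 32, num_views: int = 9):
        super().__init__()
        self.to_center = nn.Conv2d(channels * num_views, channels, kernel_size=1)
        self.to_side = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, center: torch.Tensor, sides: list):
        # Gather: aggregate all side-view features into the central view.
        center = center + self.to_center(torch.cat(sides + [center], dim=1))
        # Scatter: refine each side view with the updated central feature.
        sides = [self.to_side(torch.cat([s, center], dim=1)) for s in sides]
        return center, sides

if __name__ == "__main__":
    c = torch.randn(1, 32, 64, 64)
    s = [torch.randn(1, 32, 64, 64) for _ in range(8)]
    center, sides = CenterSideInteraction(num_views=9)(c, s)
    print(center.shape, len(sides))
```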
EdiTor: Edge-guided Transformer for Ghost-free High Dynamic Range Imaging
Yuanshen Guan, Ruikang Xu, Mingde Yao, Jie Huang, Zhiwei Xiong
Paper | Abstract
Synthesizing high dynamic range (HDR) images from multi-exposure images has recently been extensively studied with convolutional neural networks (CNNs). Despite the remarkable progress, existing CNN-based methods have the intrinsic limitation of a local receptive field, which hinders the model's capability to capture long-range correspondence and large motions across under/over-exposed images, resulting in ghosting artifacts in dynamic scenes. To address this challenge, we propose a novel Edge-guided Transformer framework (EdiTor) customized for ghost-free HDR reconstruction, where the long-range motions across different exposures are delicately modeled by incorporating the edge prior. Specifically, EdiTor calculates patch-wise correlation maps in both the image and edge domains, enabling the network to effectively model global movements and fine-grained shifts across multiple exposures. Based on this framework, we further propose an exposure-masked loss to adaptively compensate for severely distorted regions (e.g., highlights and shadows). Experiments demonstrate that EdiTor outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors.
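The exposure-masked loss admits a simple hedged sketch: regions of the reference LDR frame that are nearly black or nearly white receive a larger weight in the reconstruction loss. The thresholds and the exact weighting used by EdiTor may differ.

```python
# Hedged sketch of an exposure-masked reconstruction loss; thresholds and the
# boost factor are assumptions, not EdiTor's actual values.
import torch

def exposure_masked_loss(pred_hdr, gt_hdr, ref_ldr, low=0.05, high=0.95, boost=2.0):
    """pred_hdr, gt_hdr: (B,3,H,W); ref_ldr: (B,3,H,W) with values in [0,1]."""
    luminance = ref_ldr.mean(dim=1, keepdim=True)
    badly_exposed = (luminance < low) | (luminance > high)
    weight = torch.where(badly_exposed,
                         torch.full_like(luminance, boost),
                         torch.ones_like(luminance))
    return (weight * (pred_hdr - gt_hdr).abs()).mean()

if __name__ == "__main__":
    pred, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    ldr = torch.rand(1, 3, 64, 64)
    print(exposure_masked_loss(pred, gt, ldr).item())
```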
Decomposition Betters Tracking Everything Everywhere
Rui Li, Dong Liu
Paper | Code | Abstract
Recent studies on motion estimation have advocated an optimized motion representation that is globally consistent across the entire video, preferably for every pixel. This is challenging, as a uniform representation may not account for the complex and diverse motion and appearance of natural videos. We address this problem and propose a new test-time optimization method, named DecoMotion, for estimating per-pixel and long-range motion. DecoMotion explicitly decomposes video content into static scenes and dynamic objects, each of which is represented by a quasi-3D canonical volume. DecoMotion separately coordinates the transformations between local and canonical spaces, facilitating an affine transformation for the static scene that corresponds to camera motion. For the dynamic volume, DecoMotion leverages discriminative and temporally consistent features to rectify the non-rigid transformation. The two volumes are finally fused to fully represent motion and appearance. This divide-and-conquer strategy leads to more robust tracking through occlusions and deformations and meanwhile obtains decomposed appearances. We conduct evaluations on the TAP-Vid benchmark. The results demonstrate that our method boosts point-tracking accuracy by a large margin and performs on par with some state-of-the-art dedicated point-tracking solutions.
**High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs**
_Ruikang Xu, Mingde Yao, Yue Li, Yueyi Zhang, Zhiwei Xiong_
[Paper](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00368.pdf){:target="\_blank"} | [Code](https://github.com/XrKang/DL-GS){:target="\_blank"} | Abstract
Novel view synthesis has achieved remarkable quality and efficiency with the paradigm of 3D Gaussian Splatting (3D-GS), but it still faces two challenges: 1) significant performance degradation when trained with only few-shot samples due to a lack of geometry constraints, and 2) the incapability of rendering at a resolution beyond that of the training samples. In this paper, we propose Dual-Lens 3D-GS (DL-GS) to achieve high-resolution (HR) and few-shot view synthesis by leveraging the characteristics of the asymmetric dual-lens system commonly equipped on mobile devices. This kind of system captures the same scene with different focal lengths (i.e., wide-angle and telephoto) under an asymmetric stereo configuration, which naturally provides geometric hints for few-shot training and HR guidance for resolution improvement. Nevertheless, two major technical problems remain in achieving this goal. First, how to effectively exploit the geometry information from the asymmetric stereo configuration? To this end, we propose a consistency-aware training strategy, which integrates a dual-lens-consistent loss to regularize the 3D-GS optimization. Second, how to make the best use of the dual-lens training samples to effectively improve the resolution of newly synthesized views? To this end, we design a multi-reference-guided refinement module that selects proper telephoto and wide-angle guidance images from the training samples based on camera pose distances and then exploits their information for high-frequency detail enhancement. Extensive experiments on simulated and real-captured datasets validate the distinct superiority of our DL-GS over various competitors on the task of HR and few-shot view synthesis.
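The reference-selection criterion can be sketched as picking the training views whose camera poses are closest to the novel view. The distance metric below (translation difference plus a Frobenius rotation term) and the helper names are assumptions for illustration.

```python
# Illustrative reference selection by camera-pose distance; the metric and
# helper names are placeholders, not the DL-GS implementation.
import numpy as np

def pose_distance(pose_a: np.ndarray, pose_b: np.ndarray, rot_weight: float = 1.0):
    """pose_*: 4x4 camera-to-world matrices."""
    t = np.linalg.norm(pose_a[:3, 3] - pose_b[:3, 3])      # translation gap
    r = np.linalg.norm(pose_a[:3, :3] - pose_b[:3, :3])    # rotation gap (Frobenius)
    return t + rot_weight * r

def select_references(novel_pose, train_poses, k: int = 2):
    """Return indices of the k training views closest to the novel view."""
    dists = [pose_distance(novel_pose, p) for p in train_poses]
    return np.argsort(dists)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = [np.eye(4) + 0.01 * rng.standard_normal((4, 4)) for _ in range(10)]
    print(select_references(np.eye(4), train, k=2))
```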
**Mask-Based Modeling for Neural Radiance Fields**
_Ganlin Yang, Guoqiang Wei, Zhizheng Zhang, Yan Lu, Dong Liu_
[Paper](https://arxiv.org/abs/2304.04962){:target="\_blank"} | [Code](https://github.com/Ganlin-Yang/MRVM-NeRF){:target="\_blank"} | Abstract
Most Neural Radiance Fields (NeRFs) exhibit limited generalization capabilities, which restrict their applicability to representing multiple scenes with a single model. To address this problem, existing generalizable NeRF methods simply condition the model on image features. These methods still struggle to learn precise global representations over diverse scenes, since they lack an effective mechanism for interaction among different points and views. In this work, we unveil that 3D implicit representation learning can be significantly improved by mask-based modeling. Specifically, we propose masked ray and view modeling for generalizable NeRF (MRVM-NeRF), a self-supervised pretraining target that predicts complete scene representations from partially masked features along each ray. With this pretraining target, MRVM-NeRF makes better use of the correlations across different rays and views as geometry priors, which strengthens its capability to capture intricate details within scenes and boosts its generalization across different scenes. Extensive experiments demonstrate the effectiveness of our proposed MRVM-NeRF on both synthetic and real-world datasets, qualitatively and quantitatively. Besides, we also conduct experiments to show the compatibility of our proposed method with various backbones and its superiority under few-shot cases.
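A minimal sketch of the masked-ray pretraining idea follows: a random subset of the point features sampled along each ray is replaced by a mask token, and a small predictor is trained to recover the hidden features from the visible ones. The transformer predictor and mask ratio are illustrative choices, not the exact MRVM-NeRF setup.

```python
# Toy masked-ray pretraining objective; architecture and ratio are assumptions.
import torch
import torch.nn as nn

class MaskedRayPretrainer(nn.Module):
    def __init__(self, dim: int = 64, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ray_feats: torch.Tensor):
        """ray_feats: (num_rays, samples_per_ray, dim) point features along rays."""
        mask = torch.rand(ray_feats.shape[:2]) < self.mask_ratio        # (R, S)
        masked = torch.where(mask.unsqueeze(-1), self.mask_token, ray_feats)
        pred = self.encoder(masked)
        # Self-supervised target: reconstruct only the hidden features.
        return ((pred - ray_feats)[mask] ** 2).mean()

if __name__ == "__main__":
    feats = torch.randn(8, 32, 64)   # 8 rays, 32 samples per ray
    print(MaskedRayPretrainer()(feats).item())
```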
**TSA2: Temporal Segment Adaptation and Aggregation for Video Harmonization**
_Zeyu Xiao, Yurui Zhu, Xueyang Fu, Zhiwei Xiong_
[Paper](https://openaccess.thecvf.com/content/WACV2024/html/Xiao_TSA2_Temporal_Segment_Adaptation_and_Aggregation_for_Video_Harmonization_WACV_2024_paper.html){:target="\_blank"} | [Code](https://github.com/zeyuxiao1997/TSA) | Abstract
Video composition merges the foreground and background of different videos, presenting challenges due to variations in capture conditions (e.g., saturation, brightness, and contrast). Video harmonization is a vital process in achieving a realistic composite by seamlessly adjusting the foreground's appearance to match the background. In this paper, we propose TSA2, a novel method for video harmonization that incorporates temporal segment adaptation and aggregation. TSA2 divides the inharmonious input sequence into temporal segments, each corresponding to a different frame rate, allowing effective utilization of complementary information within each segment. The method includes the Temporal Segment Adaptation module, which learns and remaps the distribution difference between background and foreground regions, and the Temporal Segment Aggregation module, which emphasizes and aggregates cross-segment information through element-wise correlations. Experimental results demonstrate that TSA2 outperforms advanced image and video harmonization methods quantitatively and qualitatively.
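The multi-rate segment construction can be illustrated with a toy routine that subsamples the input sequence at several frame rates, giving the kind of complementary temporal views TSA2 aggregates over; the actual segment definition in the paper may differ.

```python
# Toy multi-rate temporal segmentation of a frame sequence (illustrative only).
import torch

def build_temporal_segments(frames: torch.Tensor, rates=(1, 2, 4)):
    """frames: (T, C, H, W). Returns one subsequence per sampling rate."""
    return {r: frames[::r] for r in rates}

if __name__ == "__main__":
    video = torch.randn(16, 3, 64, 64)
    segments = build_temporal_segments(video)
    print({r: seg.shape[0] for r, seg in segments.items()})  # {1: 16, 2: 8, 4: 4}
```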