Research Topic: Manipulation
2024 New
Learning Complementary Maps for Light Field Salient Object Detection
Zeyu Xiao, Jiateng Shou, and Zhiwei Xiong
Paper | Abstract
Light field imaging presents a promising avenue for advancing salient object detection (SOD). However, existing light field SOD (LFSOD) methods grapple with challenges in effectively aggregating features from all-in-focus (AiF) images and focal slices. These methods often under-utilize the complementary nature of saliency and non-saliency maps, leading to inaccurate predictions, particularly at fine boundaries. To tackle these limitations, we introduce a novel method for LFSOD. Our method incorporates a Cross-Modality Aggregation (CMA) module at multiple levels, facilitating the efficient fusion of AiF image and focal slice features. This progressive aggregation capitalizes on global and local dependencies to harness implicit geometric information in an LF. Based on the observation that salient regions and their non-salient counterparts are complementary, so that a better estimate of one side leads to an improved estimate of the other and vice versa, we introduce the Complementary Saliency Map Generator (CSMG). The CSMG generates saliency and non-saliency maps interactively to leverage this inherent complementary relationship. Extensive experiments on benchmark datasets demonstrate that our proposed method achieves superior performance in LFSOD.
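As a rough illustration of the complementary-map idea (not the paper's exact CSMG), the sketch below predicts a saliency map and a non-saliency map from a shared fused feature, lets each branch condition on the other's initial estimate, and penalizes deviations of the two maps from summing to one. All module and loss names are placeholders.

```python
# Minimal sketch of the complementary-map idea: two interacting branches plus a
# complementarity constraint. Not the paper's architecture; names are illustrative.
import torch
import torch.nn as nn

class ComplementaryHead(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.init_sal = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.init_non = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.sal_branch = nn.Conv2d(channels + 1, 1, kernel_size=3, padding=1)
        self.non_branch = nn.Conv2d(channels + 1, 1, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor):
        # Initial estimates from the fused AiF/focal-slice feature.
        s0 = torch.sigmoid(self.init_sal(feat))
        n0 = torch.sigmoid(self.init_non(feat))
        # Each refined map is conditioned on the other branch's estimate.
        s = torch.sigmoid(self.sal_branch(torch.cat([feat, n0], dim=1)))
        n = torch.sigmoid(self.non_branch(torch.cat([feat, s0], dim=1)))
        return s, n

def complementary_loss(s, n):
    # Encourage saliency + non-saliency to cover every pixel exactly once.
    return ((s + n - 1.0) ** 2).mean()

if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    s, n = ComplementaryHead()(feat)
    print(s.shape, complementary_loss(s, n).item())
```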
MLP Embedded Inverse Tone Mapping
Panjun Liu, Jiacheng Li, Lizhi Wang, Zheng-Jun Zha, Zhiwei Xiong
Paper | Code | Abstract
The advent of High Dynamic Range/Wide Color Gamut (HDR/WCG) display technology has brought exceptional richness and vibrancy to the human visual experience. However, the widespread adoption of HDR/WCG images is hindered by their substantial storage requirements, which impose significant bandwidth challenges during distribution. Moreover, HDR/WCG images are often tone-mapped into Standard Dynamic Range (SDR) versions for compatibility, necessitating the use of inverse Tone Mapping (iTM) techniques to reconstruct their original representation. In this work, we propose a meta-transfer learning framework for practical HDR/WCG media transmission by embedding image-wise metadata into their SDR counterparts for later iTM reconstruction. Specifically, we devise a meta-learning strategy to pre-train a lightweight multilayer perceptron (MLP) model that maps SDR pixels to HDR/WCG ones on an external dataset, resulting in a domain-wise iTM model. Subsequently, for the transfer learning process of each HDR/WCG image, we present a spatial-aware online mining mechanism that selects challenging training pairs to adapt the meta-trained model into an image-wise iTM model. Finally, the adapted MLP, embedded as metadata, is transmitted alongside the SDR image, facilitating the reconstruction of the original image on HDR/WCG displays. We conduct extensive experiments and evaluate the proposed framework with diverse metrics. Compared with existing solutions, our framework achieves superior fidelity with minimal latency and negligible overhead. The codes are available at https://github.com/pjliu3/MLP_iTM.
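The core mechanism lends itself to a compact sketch: a tiny MLP maps SDR RGB values to HDR/WCG values, and a brief per-image fine-tuning step stands in for the transfer stage, with the adapted weights playing the role of the embedded metadata. Network width, optimizer settings, and the pair-sampling step below are assumptions, not the paper's configuration.

```python
# Hedged sketch of pixel-wise iTM with per-image adaptation; sizes and the
# sampling of training pairs are placeholders for the paper's actual choices.
import torch
import torch.nn as nn

class PixelMLP(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, sdr_rgb: torch.Tensor) -> torch.Tensor:
        return self.net(sdr_rgb)

def adapt_to_image(model: PixelMLP, sdr_pixels, hdr_pixels, steps: int = 200):
    """Image-wise transfer step: fine-tune the (meta-)pre-trained MLP on sampled
    SDR->HDR pixel pairs from a single image (a stand-in for the paper's
    spatial-aware online mining)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.l1_loss(model(sdr_pixels), hdr_pixels)
        loss.backward()
        opt.step()
    return model.state_dict()  # this compact state dict plays the metadata role

if __name__ == "__main__":
    sdr = torch.rand(4096, 3)
    hdr = sdr ** 2.2 * 4.0          # toy stand-in for a real SDR/HDR pair
    meta = adapt_to_image(PixelMLP(), sdr, hdr)
    print(sum(v.numel() for v in meta.values()), "parameters embedded as metadata")
```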
Diffusion-Promoted HDR Video Reconstruction
Yuanshen Guan, Ruikang Xu, Mingde Yao, Ruisheng Gao, Lizhi Wang, Zhiwei Xiong
Paper | Code | Abstract
High dynamic range (HDR) video reconstruction aims to generate HDR videos from low dynamic range (LDR) frames captured with alternating exposures. Most existing works rely solely on the regression-based paradigm, leading to adverse effects such as ghosting artifacts and missing details in saturated regions. In this paper, we propose a diffusion-promoted method for HDR video reconstruction, termed HDR-V-Diff, which incorporates a diffusion model to capture the HDR distribution. As such, HDR-V-Diff can reconstruct HDR videos with realistic details while alleviating ghosting artifacts. However, directly introducing video diffusion models would impose a massive computational burden. To alleviate this burden, we instead first propose an HDR Latent Diffusion Model (HDR-LDM) to learn the distribution prior of single HDR frames. Specifically, HDR-LDM incorporates a tone-mapping strategy to compress HDR frames into the latent space and a novel exposure embedding to aggregate the exposure information into the diffusion process. We then propose a Temporal-Consistent Alignment Module (TCAM) to learn temporal information as a complement to HDR-LDM, conducting coarse-to-fine feature alignment at different scales among video frames. Finally, we design a Zero-Init Cross-Attention (ZiCA) mechanism to effectively integrate the learned distribution prior and temporal information for generating HDR frames. Extensive experiments validate that HDR-V-Diff achieves state-of-the-art results on several representative datasets.
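The zero-initialized fusion step is the easiest piece to illustrate. The hedged sketch below shows a cross-attention block whose output is scaled by a gate parameter initialized to zero, so the prior is injected gradually during training; dimensions, names, and the token layout are assumed rather than taken from the paper.

```python
# Illustrative zero-initialized cross-attention in the spirit of ZiCA; not the
# exact HDR-V-Diff module. Query: temporal features; key/value: diffusion prior.
import torch
import torch.nn as nn

class ZeroInitCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, temporal_tokens, prior_tokens):
        out, _ = self.attn(temporal_tokens, prior_tokens, prior_tokens)
        # Residual fusion: the prior contributes nothing at initialization.
        return temporal_tokens + self.gate * out

if __name__ == "__main__":
    t = torch.randn(2, 256, 128)   # temporally aligned features as tokens
    p = torch.randn(2, 256, 128)   # distribution-prior features as tokens
    print(ZeroInitCrossAttention()(t, p).shape)
```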
Diffusion-based Light Field Synthesis
Ruisheng Gao, Yutong Liu, Zeyu Xiao, Zhiwei Xiong
Paper | Code | Abstract
Light fields (LFs), which record comprehensive scene radiance across angular dimensions, find wide applications in 3D reconstruction, virtual reality, and computational photography. However, LF acquisition is inevitably time-consuming and resource-intensive due to the mainstream acquisition strategy involving manual capture or laborious software synthesis. To address this challenge, we introduce LFdiff, a straightforward yet effective diffusion-based generative framework tailored for LF synthesis, which adopts only a single RGB image as input. LFdiff leverages disparity estimated by a monocular depth estimation network and incorporates two distinctive components: a novel condition scheme and a noise estimation network tailored for LF data. Specifically, we design a position-aware warping condition scheme, enhancing inter-view geometry learning via a robust conditional signal. We then propose DistgUnet, a disentanglement-based noise estimation network, to harness comprehensive LF information. Extensive experiments demonstrate that LFdiff excels in synthesizing visually pleasing and disparity-controllable light fields with enhanced generalization ability. Furthermore, comprehensive results affirm the broad applicability of the generated LF data, spanning applications like LF super-resolution and refocusing.
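A minimal sketch of the disparity-based warping that such a condition scheme builds on is given below: the center-view image is shifted toward a target sub-aperture view according to the estimated disparity and the view's angular offset. The exact conditioning used by LFdiff may differ; this is a generic warping routine.

```python
# Warp a center view to a neighboring sub-aperture view from disparity.
# A generic routine for illustration, not LFdiff's actual condition generator.
import torch
import torch.nn.functional as F

def warp_to_view(center: torch.Tensor, disparity: torch.Tensor, du: float, dv: float):
    """center: (B,3,H,W); disparity: (B,1,H,W) in pixels per unit view offset."""
    b, _, h, w = center.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Convert the pixel shift (disparity * view offset) to normalized coordinates.
    grid[..., 0] += 2.0 * du * disparity[:, 0] / max(w - 1, 1)
    grid[..., 1] += 2.0 * dv * disparity[:, 0] / max(h - 1, 1)
    return F.grid_sample(center, grid, align_corners=True)

if __name__ == "__main__":
    img = torch.rand(1, 3, 64, 64)
    disp = torch.full((1, 1, 64, 64), 2.0)
    print(warp_to_view(img, disp, du=1.0, dv=0.0).shape)  # one neighboring view
```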
Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting
Kaidong Zhang, Jialun Peng, Jingjing Fu, Dong Liu
Paper | Code | Abstract
Transformers have been widely used for video processing owing to the multi-head self-attention (MHSA) mechanism. However, the MHSA mechanism encounters an intrinsic difficulty for video inpainting, since the features associated with the corrupted regions are degraded and incur inaccurate self-attention. This problem, termed query degradation, may be mitigated by first completing optical flows and then using the flows to guide the self-attention, as verified in our previous work, the flow-guided transformer (FGT). We further exploit the flow guidance and propose FGT++ to pursue more effective and efficient video inpainting. First, we design a lightweight flow completion network using local aggregation and an edge loss. Second, to address query degradation, we propose a flow guidance feature integration module, which uses the motion discrepancy to enhance the features, together with a flow-guided feature propagation module that warps the features according to the flows. Third, we decouple the transformer along the temporal and spatial dimensions, where flows are used to select tokens through a temporally deformable MHSA mechanism, and global tokens are combined with the inner-window local tokens through a dual-perspective MHSA mechanism. Experiments show that FGT++ outperforms existing video inpainting networks both qualitatively and quantitatively.
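As one concrete ingredient, an edge-style loss for flow completion can be sketched as penalizing the difference between the spatial gradients of the completed and ground-truth flows, which emphasizes motion boundaries over smooth regions. The concrete loss used in FGT++ may be defined differently; the version below is an assumption.

```python
# Hedged sketch of an edge-style loss for flow completion (gradient matching).
import torch

def spatial_gradients(x: torch.Tensor):
    gx = x[..., :, 1:] - x[..., :, :-1]   # horizontal differences
    gy = x[..., 1:, :] - x[..., :-1, :]   # vertical differences
    return gx, gy

def edge_loss(pred_flow: torch.Tensor, gt_flow: torch.Tensor) -> torch.Tensor:
    pgx, pgy = spatial_gradients(pred_flow)
    ggx, ggy = spatial_gradients(gt_flow)
    return (pgx - ggx).abs().mean() + (pgy - ggy).abs().mean()

if __name__ == "__main__":
    pred = torch.randn(1, 2, 64, 64)   # completed flow
    gt = torch.randn(1, 2, 64, 64)     # ground-truth flow
    print(edge_loss(pred, gt).item())
```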
Towards Interactive Image Inpainting via Robust Sketch Refinement
Chang Liu, Shunxin Xu, Jialun Peng, Kaidong Zhang, Dong Liu
Paper | Code | Abstract
One tough problem of image inpainting is to restore complex structures in the corrupted regions. This motivates interactive image inpainting, which leverages additional hints, e.g., sketches, to assist the inpainting process. A sketch is simple and intuitive for end users to provide, but it has free forms with much randomness. Such randomness may confuse the inpainting models and incur severe artifacts in completed images. To better facilitate image inpainting with sketch guidance, we propose a two-stage image inpainting system, termed SketchRefiner. The first stage of our approach serves as a data provider that simulates real sketches and derives the capability of sketch calibration from the simulated data. In the second stage, our approach aligns the sketch guidance with the inpainting process so as to elevate image inpainting with sketches. We also propose a real-world test protocol to address the evaluation of inpainting methods in practical applications with user sketches. Experimental results on three prevailing benchmark datasets, i.e., CelebA-HQ, Places2, and ImageNet, and on the proposed test protocol demonstrate the state-of-the-art performance of our approach and its great potential for real-world applications. Further analyses illustrate that our approach effectively utilizes sketch information as guidance and eliminates the artifacts caused by free-form sketches.
Disparity-guided Multi-view Interaction Network for Light Field Reflection Removal
Yutong Liu, Wenming Weng, Ruisheng Gao, Zeyu Xiao, Yueyi Zhang, Zhiwei Xiong
Paper | Code | Abstract
Light field (LF) imaging presents a promising avenue for reflection removal, owing to its reliable depth perception and its use of complementary texture details from multiple sub-aperture images (SAIs). However, the domain shifts between real-world and synthetic scenes, as well as the challenge of embedding transmission information across SAIs, pose the main obstacles in this task. In this paper, we address these challenges from the perspectives of data and network, respectively. To mitigate domain shifts, we propose an efficient data synthesis strategy for simulating realistic reflection scenes and build the largest LF reflection dataset to date, containing 420 synthetic scenes and 70 real-world scenes. To enable transmission information embedding across SAIs, we propose a novel Disparity-guided Multi-view Interaction Network (DMINet) for LF reflection removal. DMINet mainly consists of a transmission disparity estimation (TDE) module and a center-side interaction (CSI) module. The TDE module predicts transmission disparity by filtering out reflection disturbances, while the CSI module is responsible for transmission integration, adopting the central view as the bridge for propagation between different SAIs. Compared with existing reflection removal methods for LF input, DMINet achieves a distinct performance boost with the merits of efficiency and robustness, especially for scenes with complex depth variations.
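A schematic sketch of the center-as-bridge interaction is given below: side-view features are first aggregated into the central view, and the updated central feature is then redistributed to each side view. The fusion operators and channel sizes are placeholders, not the actual DMINet design.

```python
# Placeholder center-side interaction: gather side views into the center,
# then scatter the updated center back to each side view.
import torch
import torch.nn as nn

class CenterSideInteraction(nn.Module):
    def __init__(self, channels: int = 32, num_views: int = 9):
        super().__init__()
        self.to_center = nn.Conv2d(channels * num_views, channels, kernel_size=1)
        self.to_side = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, center: torch.Tensor, sides: list):
        # Gather: aggregate all side-view features into the central view.
        center = center + self.to_center(torch.cat(sides + [center], dim=1))
        # Scatter: refine each side view with the updated central feature.
        sides = [self.to_side(torch.cat([s, center], dim=1)) for s in sides]
        return center, sides

if __name__ == "__main__":
    c = torch.randn(1, 32, 64, 64)
    s = [torch.randn(1, 32, 64, 64) for _ in range(8)]
    center, sides = CenterSideInteraction(num_views=9)(c, s)
    print(center.shape, len(sides))
```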
EdiTor: Edge-guided Transformer for Ghost-free High Dynamic Range Imaging
Yuanshen Guan, Ruikang Xu, Mingde Yao, Jie Huang, Zhiwei Xiong
Paper | Abstract
Synthesizing high dynamic range (HDR) images from multi-exposure images has recently been extensively studied with convolutional neural networks (CNNs). Despite the remarkable progress, existing CNN-based methods have the intrinsic limitation of a local receptive field, which hinders the model's capability to capture long-range correspondence and large motions across under/over-exposed images, resulting in ghosting artifacts in dynamic scenes. To address this challenge, we propose a novel Edge-guided Transformer framework (EdiTor) customized for ghost-free HDR reconstruction, where the long-range motions across different exposures are delicately modeled by incorporating the edge prior. Specifically, EdiTor calculates patch-wise correlation maps in both the image and edge domains, enabling the network to effectively model global movements and fine-grained shifts across multiple exposures. Based on this framework, we further propose an exposure-masked loss to adaptively compensate for severely distorted regions (e.g., highlights and shadows). Experiments demonstrate that EdiTor outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors.
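The exposure-masked loss admits a simple hedged sketch: regions of the reference LDR frame that are nearly black or nearly white receive a larger weight in the reconstruction loss. The thresholds and the exact weighting used by EdiTor may differ.

```python
# Hedged sketch of an exposure-masked reconstruction loss; thresholds and the
# boost factor are assumptions, not EdiTor's actual values.
import torch

def exposure_masked_loss(pred_hdr, gt_hdr, ref_ldr, low=0.05, high=0.95, boost=2.0):
    """pred_hdr, gt_hdr: (B,3,H,W); ref_ldr: (B,3,H,W) with values in [0,1]."""
    luminance = ref_ldr.mean(dim=1, keepdim=True)
    badly_exposed = (luminance < low) | (luminance > high)
    weight = torch.where(badly_exposed,
                         torch.full_like(luminance, boost),
                         torch.ones_like(luminance))
    return (weight * (pred_hdr - gt_hdr).abs()).mean()

if __name__ == "__main__":
    pred, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    ldr = torch.rand(1, 3, 64, 64)
    print(exposure_masked_loss(pred, gt, ldr).item())
```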
Decomposition Betters Tracking Everything Everywhere
Rui Li, Dong Liu
Paper | Code | Abstract
Recent studies on motion estimation have advocated an optimized motion representation that is globally consistent across the entire video, preferably for every pixel. This is challenging, as a uniform representation may not account for the complex and diverse motion and appearance of natural videos. We address this problem and propose a new test-time optimization method, named DecoMotion, for estimating per-pixel and long-range motion. DecoMotion explicitly decomposes video content into static scenes and dynamic objects, each of which is represented by a quasi-3D canonical volume. DecoMotion separately coordinates the transformations between local and canonical spaces, facilitating an affine transformation for the static scene that corresponds to camera motion. For the dynamic volume, DecoMotion leverages discriminative and temporally consistent features to rectify the non-rigid transformation. The two volumes are finally fused to fully represent motion and appearance. This divide-and-conquer strategy leads to more robust tracking through occlusions and deformations and meanwhile obtains decomposed appearances. We conduct evaluations on the TAP-Vid benchmark. The results demonstrate that our method boosts point-tracking accuracy by a large margin and performs on par with some state-of-the-art dedicated point-tracking solutions.
**High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs**
_Ruikang Xu, Mingde Yao, Yue Li, Yueyi Zhang, Zhiwei Xiong_
[Paper](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00368.pdf){:target="\_blank"} | [Code](https://github.com/XrKang/DL-GS){:target="\_blank"} | Abstract
Novel view synthesis has achieved remarkable quality and efficiency with the paradigm of 3D Gaussian Splatting (3D-GS), but it still faces two challenges: 1) significant performance degradation when trained with only few-shot samples due to a lack of geometry constraints, and 2) the incapability of rendering at a resolution beyond that of the training samples. In this paper, we propose Dual-Lens 3D-GS (DL-GS) to achieve high-resolution (HR) and few-shot view synthesis by leveraging the characteristics of the asymmetric dual-lens system commonly equipped on mobile devices. This kind of system captures the same scene with different focal lengths (i.e., wide-angle and telephoto) under an asymmetric stereo configuration, which naturally provides geometric hints for few-shot training and HR guidance for resolution improvement. Nevertheless, two major technical problems remain in achieving this goal. First, how to effectively exploit the geometry information from the asymmetric stereo configuration? To this end, we propose a consistency-aware training strategy, which integrates a dual-lens-consistent loss to regularize the 3D-GS optimization. Second, how to make the best use of the dual-lens training samples to effectively improve the resolution of newly synthesized views? To this end, we design a multi-reference-guided refinement module that selects proper telephoto and wide-angle guidance images from the training samples based on camera pose distances and then exploits their information for high-frequency detail enhancement. Extensive experiments on simulated and real-captured datasets validate the distinct superiority of our DL-GS over various competitors on the task of HR and few-shot view synthesis.
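The reference-selection criterion can be sketched as picking the training views whose camera poses are closest to the novel view. The distance metric below (translation difference plus a Frobenius rotation term) and the helper names are assumptions for illustration.

```python
# Illustrative reference selection by camera-pose distance; the metric and
# helper names are placeholders, not the DL-GS implementation.
import numpy as np

def pose_distance(pose_a: np.ndarray, pose_b: np.ndarray, rot_weight: float = 1.0):
    """pose_*: 4x4 camera-to-world matrices."""
    t = np.linalg.norm(pose_a[:3, 3] - pose_b[:3, 3])      # translation gap
    r = np.linalg.norm(pose_a[:3, :3] - pose_b[:3, :3])    # rotation gap (Frobenius)
    return t + rot_weight * r

def select_references(novel_pose, train_poses, k: int = 2):
    """Return indices of the k training views closest to the novel view."""
    dists = [pose_distance(novel_pose, p) for p in train_poses]
    return np.argsort(dists)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = [np.eye(4) + 0.01 * rng.standard_normal((4, 4)) for _ in range(10)]
    print(select_references(np.eye(4), train, k=2))
```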
**Mask-Based Modeling for Neural Radiance Fields**
_Ganlin Yang, Guoqiang Wei, Zhizheng Zhang, Yan Lu, Dong Liu_
[Paper](https://arxiv.org/abs/2304.04962){:target="\_blank"} | [Code](https://github.com/Ganlin-Yang/MRVM-NeRF){:target="\_blank"} | Abstract
Most Neural Radiance Fields (NeRFs) exhibit limited generalization capabilities, which restrict their applicability to representing multiple scenes with a single model. To address this problem, existing generalizable NeRF methods simply condition the model on image features. These methods still struggle to learn precise global representations over diverse scenes, since they lack an effective mechanism for interaction among different points and views. In this work, we unveil that 3D implicit representation learning can be significantly improved by mask-based modeling. Specifically, we propose masked ray and view modeling for generalizable NeRF (MRVM-NeRF), a self-supervised pretraining target that predicts complete scene representations from partially masked features along each ray. With this pretraining target, MRVM-NeRF makes better use of the correlations across different rays and views as geometry priors, which strengthens its capability to capture intricate details within scenes and boosts its generalization across different scenes. Extensive experiments demonstrate the effectiveness of our proposed MRVM-NeRF on both synthetic and real-world datasets, qualitatively and quantitatively. Besides, we also conduct experiments to show the compatibility of our proposed method with various backbones and its superiority under few-shot cases.
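A minimal sketch of the masked-ray pretraining idea follows: a random subset of the point features sampled along each ray is replaced by a mask token, and a small predictor is trained to recover the hidden features from the visible ones. The transformer predictor and mask ratio are illustrative choices, not the exact MRVM-NeRF setup.

```python
# Toy masked-ray pretraining objective; architecture and ratio are assumptions.
import torch
import torch.nn as nn

class MaskedRayPretrainer(nn.Module):
    def __init__(self, dim: int = 64, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ray_feats: torch.Tensor):
        """ray_feats: (num_rays, samples_per_ray, dim) point features along rays."""
        mask = torch.rand(ray_feats.shape[:2]) < self.mask_ratio        # (R, S)
        masked = torch.where(mask.unsqueeze(-1), self.mask_token, ray_feats)
        pred = self.encoder(masked)
        # Self-supervised target: reconstruct only the hidden features.
        return ((pred - ray_feats)[mask] ** 2).mean()

if __name__ == "__main__":
    feats = torch.randn(8, 32, 64)   # 8 rays, 32 samples per ray
    print(MaskedRayPretrainer()(feats).item())
```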
**TSA2: Temporal Segment Adaptation and Aggregation for Video Harmonization**
_Zeyu Xiao, Yurui Zhu, Xueyang Fu, Zhiwei Xiong_
[Paper](https://openaccess.thecvf.com/content/WACV2024/html/Xiao_TSA2_Temporal_Segment_Adaptation_and_Aggregation_for_Video_Harmonization_WACV_2024_paper.html){:target="\_blank"} | [Code](https://github.com/zeyuxiao1997/TSA) | Abstract
Video composition merges the foreground and background of different videos, presenting challenges due to variations in capture conditions (e.g., saturation, brightness, and contrast). Video harmonization is a vital process in achieving a realistic composite by seamlessly adjusting the foreground's appearance to match the background. In this paper, we propose TSA2, a novel method for video harmonization that incorporates temporal segment adaptation and aggregation. TSA2 divides the inharmonious input sequence into temporal segments, each corresponding to a different frame rate, allowing effective utilization of complementary information within each segment. The method includes the Temporal Segment Adaptation module, which learns and remaps the distribution difference between background and foreground regions, and the Temporal Segment Aggregation module, which emphasizes and aggregates cross-segment information through element-wise correlations. Experimental results demonstrate that TSA2 outperforms advanced image and video harmonization methods quantitatively and qualitatively.
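The multi-rate segment construction can be illustrated with a toy routine that subsamples the input sequence at several frame rates, giving the kind of complementary temporal views TSA2 aggregates over; the actual segment definition in the paper may differ.

```python
# Toy multi-rate temporal segmentation of a frame sequence (illustrative only).
import torch

def build_temporal_segments(frames: torch.Tensor, rates=(1, 2, 4)):
    """frames: (T, C, H, W). Returns one subsequence per sampling rate."""
    return {r: frames[::r] for r in rates}

if __name__ == "__main__":
    video = torch.randn(16, 3, 64, 64)
    segments = build_temporal_segments(video)
    print({r: seg.shape[0] for r, seg in segments.items()})  # {1: 16, 2: 8, 4: 4}
```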