DeepNLP CVPR2024 Accepted Paper List AI Robotic and STEM Top Conference & Journal Papers
-
Hyperspectral images (HSIs) have extensive applications in fields such as medicine, agriculture, and industry. Nevertheless, acquiring HSIs with a high signal-to-noise ratio is challenging due to narrow-band spectral filtering. Consequently, HSI denoising is of substantial importance, especially for snapshot hyperspectral imaging technology. While most previous HSI denoising methods are supervised, creating supervised training datasets for the diverse scenes, hyperspectral cameras, and scan parameters is impractical. In this work, we present Diff-Unmix, a self-supervised denoising method for HSI using diffusion denoising generative models. Specifically, Diff-Unmix addresses the challenge of recovering noise-degraded HSI through a fusion of spectral unmixing and conditional abundance generation. First, it employs a learnable block-based spectral unmixing strategy complemented by a pure transformer-based backbone. Then, we introduce a self-supervised generative diffusion network to enhance abundance maps from the spectral unmixing block. This network reconstructs noise-free unmixing probability distributions, effectively mitigating noise-induced degradations within these components. Finally, the denoised HSI is reconstructed by blending the diffusion-adjusted abundance map with the spectral endmembers. Experimental results on both simulated and real-world noisy datasets show that Diff-Unmix achieves state-of-the-art performance.
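The final blending step described above corresponds to the standard linear mixing model, in which each pixel spectrum is an abundance-weighted sum of endmember spectra. The sketch below is a minimal illustration of that blending only, not the Diff-Unmix implementation; the function name `reconstruct_pixel` and the toy numbers are hypothetical.

```python
from typing import List


def reconstruct_pixel(abundances: List[float],
                      endmembers: List[List[float]]) -> List[float]:
    """Blend endmember spectra with per-pixel abundances (linear mixing model).

    abundances: K weights for one pixel, one per endmember.
    endmembers: K spectra, each holding B band values.
    """
    if len(abundances) != len(endmembers):
        raise ValueError("need exactly one abundance per endmember")
    num_bands = len(endmembers[0])
    return [
        sum(a * spectrum[band] for a, spectrum in zip(abundances, endmembers))
        for band in range(num_bands)
    ]


# Toy example (hypothetical values): two endmembers over three spectral bands.
endmembers = [[1.0, 0.0, 2.0],
              [0.0, 4.0, 2.0]]
pixel = reconstruct_pixel([0.25, 0.75], endmembers)
```

In this picture, denoising the abundances rather than the raw spectra means the reconstructed pixel always stays within the span of the endmembers.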
-
The reflective nature of the human eye is an under-appreciated source of information about what the world around us looks like. By imaging the eyes of a moving person, we capture multiple views of a scene outside the camera's direct line of sight through the reflections in the eyes. In this paper, we reconstruct a radiance field beyond the camera's line of sight using portrait images containing eye reflections. This task is challenging due to 1) the difficulty of accurately estimating eye poses and 2) the entangled appearance of the iris textures and the scene reflections. To address these, our method jointly optimizes the cornea poses, the radiance field depicting the scene, and the observer's eye iris texture. We further present a regularization prior on the iris texture to improve scene reconstruction quality. Through various experiments on synthetic and real-world captures featuring people with varied eye colors and lighting conditions, we demonstrate the feasibility of our approach to recover the radiance field using cornea reflections.
-
The recovery of occluded human meshes poses challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound knowledge about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent priors as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates judicious guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive quantitative and qualitative experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets, underscoring its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes. Code is available at https://github.com/EternalEvan/DPMesh.
-
The training of contemporary deep learning models heavily relies on publicly available data, posing a risk of unauthorized access to online data and raising concerns about data privacy. Current approaches to creating unlearnable data involve incorporating small, specially designed noises, but these methods strictly limit data usability, overlooking its potential usage in authorized scenarios. In this paper, we extend the concept of unlearnable data to conditional data learnability and introduce UnGeneralizable Examples (UGEs). UGEs exhibit learnability for authorized users while maintaining unlearnability for potential hackers. The protector defines the authorized network and optimizes UGEs to match the gradients of the original data and its ungeneralizable version, ensuring learnability. To prevent unauthorized learning, UGEs are trained by maximizing a designated distance loss in a common feature space. Additionally, to further safeguard the authorized side from potential attacks, we introduce additional undistillation optimization. Experimental results on multiple datasets and various networks demonstrate that the proposed UGEs framework preserves data usability while reducing training performance on hacker networks, even under different types of attacks.
-
Monocular 3D lane detection has become a fundamental problem in the context of autonomous driving, which comprises the tasks of finding the road surface and locating lane markings. One major challenge lies in a flexible but robust line representation capable of modeling complex lane structures while still avoiding unpredictable behavior. While previous methods rely on fully data-driven approaches, we instead introduce a novel approach, LaneCPP, that uses a continuous 3D lane detection model leveraging physical prior knowledge about the lane structure and road geometry. While our sophisticated lane model is capable of modeling complex road structures, it also shows robust behavior, since physical constraints are incorporated by means of a regularization scheme that can be analytically applied to our parametric representation. Moreover, we incorporate prior knowledge about the road geometry into the 3D feature space by modeling geometry-aware spatial features, guiding the network to learn an internal road surface representation. In our experiments, we show the benefits of our contributions and prove the meaningfulness of using priors to make 3D lane detection more robust. The results show that LaneCPP achieves state-of-the-art performance in terms of F-Score and geometric errors.
-
3D city generation is a desirable yet challenging task, since humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes, since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities. Our key insight is that 3D city generation should be a composition of different types of neural fields: 1) various building instances and 2) background stuff such as roads and green lands. Specifically, we adopt the bird's eye view scene representation and employ a volumetric render for both instance-oriented and stuff-oriented neural fields. The generative hash grid and periodic positional embedding are tailored as scene parameterization to suit the distinct characteristics of building instances and background stuff. Furthermore, we contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which comprises a vast amount of real-world city imagery to enhance the realism of the generated 3D cities, both in their layouts and appearances. CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities.
-
High-resolution wide-angle fisheye images are becoming more and more important for robotics applications such as autonomous driving. However, using ordinary convolutional neural networks or vision transformers on this data is problematic due to projection and distortion losses introduced when projecting to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model capable of training on high-resolution, distortion-free spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer, enabling the network to process spherical representations with minimal computational overhead. We demonstrate the superior performance of our model on both synthetic and real automotive datasets, as well as a selection of other image datasets, for semantic segmentation, depth regression, and classification tasks. Our code is publicly available.
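In HEALPix's NESTED ordering, the four children of pixel p at the next finer resolution carry the indices 4p through 4p+3, so contiguous index blocks form spatially compact regions. The sketch below illustrates how patching and windowing could then reduce to integer division on a flat pixel index. It is an assumption-level illustration rather than the HEAL-SWIN code, and `patch_index` and `window_index` are hypothetical names.

```python
def num_pixels(nside: int) -> int:
    """Total number of HEALPix pixels at resolution parameter nside."""
    return 12 * nside * nside


def patch_index(pixel: int, patch_size: int = 4) -> int:
    """Map a NESTED pixel index to the patch that contains it.

    patch_size should be a power of 4 so that every patch coincides with
    exactly one coarser-resolution HEALPix pixel.
    """
    return pixel // patch_size


def window_index(patch: int, window_size: int = 16) -> int:
    """Group consecutive patches into windows (window_size also a power of 4)."""
    return patch // window_size


nside = 8
patches = [patch_index(p) for p in range(num_pixels(nside))]
windows = [window_index(q) for q in sorted(set(patches))]
```

Because both operations are plain reshapes of a one-dimensional array, no neighbor lookup tables are needed for the non-shifted case.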
-
We present 3D Paintbrush, a technique for automatically texturing local semantic regions on meshes via text descriptions. Our method is designed to operate directly on meshes, producing texture maps which seamlessly integrate into standard graphics pipelines. We opt to simultaneously produce a localization map (to specify the edit region) and a texture map which conforms to it. This approach improves the quality of both the localization and the stylization. To enhance the details and resolution of the textured area, we leverage multiple stages of a cascaded diffusion model to supervise our local editing technique with generative priors learned from images at different resolutions. Our technique, referred to as Cascaded Score Distillation (CSD), simultaneously distills scores at multiple resolutions in a cascaded fashion, enabling control over both the granularity and global understanding of the supervision. We demonstrate the effectiveness of 3D Paintbrush to locally texture different semantic regions on a variety of shapes.
-
Out-of-Distribution (OOD) detection aims to address the overconfident predictions of neural networks by triggering an alert when the input sample deviates significantly from the training distribution (in-distribution), indicating that the output may not be reliable. Current OOD detection approaches explore all kinds of cues to identify OOD data, such as finding irregular patterns in the feature space, logit space, gradient space, or the raw image space. Surprisingly, we observe a linear trend between the OOD score produced by current OOD detection algorithms and the network features on several datasets. We conduct a thorough investigation, theoretically and empirically, to analyze and understand the meaning of such a linear trend in OOD detection. This paper proposes a Robust Test-time Linear method (RTL) to utilize such linear trends like a 'free lunch' when we have a batch of data to perform OOD detection. By using a simple linear regression as a test-time adaptation, we can make a more precise OOD prediction. We further propose an online variant of the proposed method, which achieves promising performance and is more practical for real applications. Theoretical analysis is given to prove the effectiveness of our methods. Extensive experiments on several OOD datasets show the efficacy of RTL for OOD detection tasks, significantly improving the results of base OOD detectors. Project will be available at https://github.com/kfan21/RTL.
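The idea of fitting a line between features and OOD scores on a test batch can be sketched as follows. This is one plausible reading of the abstract, not the RTL algorithm itself: it uses ordinary least squares where the paper describes a robust fit, it assumes each sample is summarized by a single scalar feature, and the names `fit_line` and `refine_scores` are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of ys ~= slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("feature values must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def refine_scores(features: List[float], raw_scores: List[float]) -> List[float]:
    """Replace each raw score with the value predicted by the batch-level line."""
    slope, intercept = fit_line(features, raw_scores)
    return [slope * f + intercept for f in features]


# Toy batch (hypothetical values): scalar features and base-detector scores.
features = [1.0, 2.0, 3.0, 4.0]
raw_scores = [2.1, 3.9, 6.2, 7.8]
slope, intercept = fit_line(features, raw_scores)
refined = refine_scores(features, raw_scores)
```

The refined scores lie exactly on the fitted line, so per-sample noise in the base detector's score is smoothed out using the rest of the batch.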
-
Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, the existence of complex backgrounds and multiple foreground objects makes this task challenging. To address this issue, we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground-background separation. The foreground and background slots, which are initialized with query guidance, are iteratively refined based on interactions with template information. Furthermore, to improve slot-template interaction and effectively fuse global and local features in the target and reference frames, K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.
-
Significant progress in image deblurring has been achieved by deep learning methods, especially the remarkable performance of supervised models on paired synthetic data. However, real-world quality degradation is more complex than that in synthetic datasets, and acquiring paired data in real-world scenarios poses significant challenges. To address these challenges, we propose a novel unsupervised image deblurring framework based on self-enhancement. The framework progressively generates improved pseudo-sharp and blurry image pairs without the need for real paired datasets, and the generated higher-quality image pairs can be used to enhance the performance of the reconstructor. To ensure the generated blurry images are closer to the real blurry images, we propose a novel re-degradation principal component consistency loss, which enforces the principal components of the generated low-quality images to be similar to those of re-degraded images from the original sharp ones. Furthermore, we introduce the self-enhancement strategy, which significantly improves deblurring performance without increasing the computational complexity of the network during inference. Through extensive experiments on multiple real-world blurry datasets, we demonstrate the superiority of our approach over other state-of-the-art unsupervised methods.
-
Action detection aims to localize the starting and ending points of action instances in untrimmed videos and predict the classes of those instances. In this paper, we make the observation that the outputs of the action detection task can be formulated as images. Thus, from a novel perspective, we tackle action detection via a three-image generation process to generate starting point, ending point, and action-class predictions as images via our proposed Action Detection Image Diffusion (ADI-Diff) framework. Furthermore, since our images differ from natural images and exhibit special properties, we further explore a Discrete Action-Detection Diffusion Process and a Row-Column Transformer design to better handle their processing. Our ADI-Diff framework achieves state-of-the-art results on two widely-used datasets.
-
Character animation in real-world scenarios necessitates a variety of constraints, such as trajectories, key-frames, interactions, etc. Existing methodologies typically treat a single constraint or a finite set of them as separate control tasks. These methods are often specialized, and the tasks they address are rarely extendable or customizable. We categorize these as solutions to the closed-set motion control problem. In response to the complexity of practical motion control, we propose and attempt to solve the open-set motion control problem. This problem is characterized by an open and fully customizable set of motion control tasks. To address this, we introduce a new paradigm, programmable motion generation. In this paradigm, any given motion control task is broken down into a combination of atomic constraints. These constraints are then programmed into an error function that quantifies the degree to which a motion sequence adheres to them. We utilize a pre-trained motion generation model and optimize its latent code to minimize the error function of the generated motion. Consequently, the generated motion not only inherits the prior of the generative model but also satisfies the requirements of the compounded constraints. Our experiments demonstrate that our approach can generate high-quality motions when addressing a wide range of unseen tasks. These tasks encompass motion control by motion dynamics, geometric constraints, physical laws, interactions with scenes, objects, or the character's own body parts, etc. All of these are achieved in a unified approach without the need for ad-hoc paired training data collection or specialized network designs. During the programming of novel tasks, we observed the emergence of new skills beyond those of the prior model. With the assistance of large language models, we also achieved automatic programming. We hope that this work will pave the way for the motion control of general AI agents.
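The paradigm described above (compose atomic constraints into an error function, then optimize the latent code of a frozen generator) can be sketched on a toy problem. Everything below is a hypothetical stand-in: `generate_motion` is a running sum rather than a pre-trained motion model, `reach` is a made-up atomic constraint, and finite-difference gradient descent replaces whatever optimizer the authors use.

```python
from typing import Callable, List

Constraint = Callable[[List[float]], float]


def generate_motion(latent: List[float]) -> List[float]:
    """Toy stand-in for a frozen generator: latent values act as per-frame
    velocities and the motion is their running sum (a 1D trajectory)."""
    positions, total = [], 0.0
    for velocity in latent:
        total += velocity
        positions.append(total)
    return positions


def reach(frame: int, target: float) -> Constraint:
    """Atomic constraint: squared distance to a target position at one frame."""
    return lambda motion: (motion[frame] - target) ** 2


def error_function(constraints: List[Constraint], latent: List[float]) -> float:
    """Sum of atomic constraint errors evaluated on the generated motion."""
    motion = generate_motion(latent)
    return sum(constraint(motion) for constraint in constraints)


def optimize_latent(constraints: List[Constraint], latent: List[float],
                    lr: float = 0.05, steps: int = 500,
                    eps: float = 1e-4) -> List[float]:
    """Gradient descent on the latent code using central finite differences."""
    latent = list(latent)
    for _ in range(steps):
        grad = []
        for i in range(len(latent)):
            up = latent[:i] + [latent[i] + eps] + latent[i + 1:]
            down = latent[:i] + [latent[i] - eps] + latent[i + 1:]
            grad.append((error_function(constraints, up)
                         - error_function(constraints, down)) / (2 * eps))
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent


# "Program" a task from two atomic constraints: pass through 1.0 at frame 1
# and end at 2.0 on the last frame.
constraints = [reach(1, 1.0), reach(3, 2.0)]
optimized = optimize_latent(constraints, [0.0, 0.0, 0.0, 0.0])
```

A new control task only requires a new list of constraints; the generator and the optimization loop stay unchanged, which is the sense in which the task is programmable.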
-
Self-supervised landmark estimation is a challenging task that demands the formation of locally distinct feature representations to identify sparse facial landmarks in the absence of annotated data. To tackle this task, existing state-of-the-art (SOTA) methods (1) extract coarse features from backbones that are trained with instance-level self-supervised learning (SSL) paradigms, which neglect the dense prediction nature of the task, (2) aggregate them into memory-intensive hypercolumn formations, and (3) supervise lightweight projector networks to naively establish full local correspondences among all pairs of spatial features. In this paper, we introduce SCE-MAE, a framework that (1) leverages the MAE, a region-level SSL method that naturally better suits the landmark prediction task, (2) operates on the vanilla feature map instead of on expensive hypercolumns, and (3) employs a Correspondence Approximation and Refinement Block (CARB) that utilizes a simple density peak clustering algorithm and our proposed Locality-Constrained Repellence Loss to directly hone only select local correspondences. We demonstrate through extensive experiments that SCE-MAE is highly effective and robust, outperforming existing SOTA methods by large margins of 20%-44% on the landmark matching and 9%-15% on the landmark detection tasks.
-
LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion
Camouflaged vision perception is an important vision task with numerous practical applications. Due to the expensive collection and labeling costs, this community struggles with a major bottleneck: its datasets cover only a small number of object species. However, the existing camouflaged generation methods require specifying the background manually, thus failing to extend the camouflaged sample diversity in a low-cost manner. In this paper, we propose a Latent Background Knowledge Retrieval-Augmented Diffusion (LAKE-RED) for camouflaged image generation. To our knowledge, our contributions mainly include: (1) For the first time, we propose a camouflaged generation paradigm that does not need to receive any background inputs. (2) Our LAKE-RED is the first knowledge retrieval-augmented method with interpretability for camouflaged generation, in which we propose that knowledge retrieval and reasoning enhancement be separated explicitly to alleviate the task-specific challenges. Moreover, our method is not restricted to specific foreground targets or backgrounds, offering a potential for extending camouflaged vision perception to more diverse domains. (3) Experimental results demonstrate that our method outperforms the existing approaches, generating more realistic camouflage images.
-
Recently, diffusion models have emerged as a new powerful generative method for 3D point cloud generation tasks. However, few works study the effect of the architecture of the diffusion model in the 3D point cloud, resorting to the typical UNet model developed for 2D images. Inspired by the wide adoption of Transformers, we study the complementary role of convolution (from UNet) and attention (from Transformers). We discover that their respective importance changes according to the timestep in the diffusion process. At the early stage, attention has an outsized influence because Transformers are found to generate the overall shape more quickly, and at later stages, when adding fine detail, convolution starts having a larger impact on the generated point cloud's local surface quality. In light of this observation, we propose a time-varying two-stream denoising model that combines convolution layers and transformer blocks. We generate an optimizable mask from each timestep to reweigh global and local features, obtaining time-varying fused features. Experimentally, we demonstrate that our proposed method quantitatively outperforms other state-of-the-art methods regarding visual quality and diversity. Code is available at github.com/Zhiyuan-R/Tiger-Time-varying-Diffusion-Model-for-Point-Cloud-Generation.
-
In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner. Some existing approaches achieve this by using generalizable pixel-aligned implicit fields to reconstruct a textured mesh of a human or by employing a 2D diffusion model as guidance with the Score Distillation Sampling (SDS) method to lift the 2D image into 3D space. However, a generalizable implicit field often results in an over-smooth texture field, while the SDS method tends to produce novel views whose texture is inconsistent with the input image. In this paper, we introduce a texture-consistent back view synthesis method that transfers the reference image content to the back view through depth-guided mutual self-attention. With this method, we achieve high-fidelity and texture-consistent human rendering from a single image. Moreover, to alleviate the color distortion that occurs in the side region, we propose a visibility-aware patch consistency regularization combined with the synthesized back view texture. Experiments conducted on both real and synthetic data demonstrate the effectiveness of our method and show that our approach outperforms previous baseline methods.
-
Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders the model from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity. Firstly, we construct a new dataset named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of the previous datasets. In addition to standard in-domain evaluation, we also propose a special evaluation paradigm more representative of real scenarios. It contains a new evaluation set with cross domains, cross textual granularity, and cross textual styles, named UFine3C, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient algorithm especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and hard negative match mechanism. With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at https://github.com/Zplusdragon/UFineBench.
-
Hyperparameter Optimization and Neural Architecture Search are powerful in attaining state-of-the-art machine learning models, with Bayesian Optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic in this field, but faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle the challenge, we propose a multi-fidelity BO method named FastBO, which excels in adaptively deciding the fidelity for each configuration and providing strong performance while ensuring efficient resource usage. These advantages are achieved through our proposed techniques based on the concepts of efficient point and saturation point for each configuration, which can be obtained from the empirical learning curve of the configuration, estimated from early observations. Extensive experiments demonstrate FastBO's superior anytime performance and efficiency in identifying high-quality configurations and architectures. We also show that our method provides a way to extend any single-fidelity method to the multi-fidelity setting, highlighting the wide applicability of our approach.
-
Real-time rendering of photorealistic and controllable human avatars stands as a cornerstone in Computer Vision and Graphics. While recent advances in neural implicit rendering have unlocked unprecedented photorealism for digital avatars, real-time performance has mostly been demonstrated for static scenes only. To address this, we propose ASH, an animatable Gaussian splatting approach for photorealistic rendering of dynamic humans in real time. We parameterize the clothed human as animatable 3D Gaussians, which can be efficiently splatted into image space to generate the final rendering. However, naively learning the Gaussian parameters in 3D space poses a severe challenge in terms of compute. Instead, we attach the Gaussians onto a deformable character model and learn their parameters in 2D texture space, which allows leveraging efficient 2D convolutional architectures that easily scale with the required number of Gaussians. We benchmark ASH with competing methods on pose-controllable avatars, demonstrating that our method outperforms existing real-time methods by a large margin and shows comparable or even better results than offline methods.
-
Adversarial training is often formulated as a min-max problem; however, concentrating only on the worst adversarial examples causes alternating repetitive confusion of the model, i.e., previously defended or correctly classified samples are not defensible or accurately classifiable in subsequent adversarial training. We characterize such non-ignorable samples as "hiders", which reveal the hidden high-risk regions within the secure area obtained through adversarial training and prevent the model from finding the real worst cases. We require the model to prevent hiders when defending against adversarial examples in order to improve accuracy and robustness simultaneously. By rethinking and redefining the min-max optimization problem for adversarial training, we propose a generalized adversarial training algorithm called Hider-Focused Adversarial Training (HFAT). HFAT introduces the iterative evolution optimization strategy to simplify the optimization problem and employs an auxiliary model to reveal hiders, effectively combining the optimization directions of standard adversarial training and hider prevention. Furthermore, we introduce an adaptive weighting mechanism that lets the model adaptively adjust its focus between adversarial examples and hiders during different training periods. We demonstrate the effectiveness of our method through extensive experiments and show that HFAT can provide higher robustness and accuracy. We will release the source code upon publication.
-
This work introduces ArtAdapter, a transformative text-to-image (T2I) style transfer framework that transcends traditional limitations of color, brushstrokes, and object shape, capturing high-level style elements such as composition and distinctive artistic expression. The integration of a multi-level style encoder with our proposed explicit adaptation mechanism enables ArtAdapter to achieve unprecedented fidelity in style transfer, ensuring close alignment with textual descriptions. Additionally, the incorporation of an Auxiliary Content Adapter (ACA) effectively separates content from style, alleviating the borrowing of content from style references. Moreover, our novel fast finetuning approach could further enhance zero-shot style representation while mitigating the risk of overfitting. Comprehensive evaluations confirm that ArtAdapter surpasses current state-of-the-art methods.
-
This paper tackles a novel yet challenging problem: how to transfer knowledge from the emerging Segment Anything Model (SAM) -- which reveals impressive zero-shot instance segmentation capacity -- to learn a compact panoramic semantic segmentation model, i.e., the student, without requiring any labeled data. This poses considerable challenges due to SAM's inability to provide semantic labels and the large capacity gap between SAM and the student. To this end, we propose a novel framework called GoodSAM that introduces a teacher assistant (TA) to provide semantic information, integrated with SAM to generate ensemble logits to achieve knowledge transfer. Specifically, we propose a Distortion-Aware Rectification (DAR) module that first addresses the distortion problem of panoramic images by imposing prediction-level consistency and boundary enhancement. This subtly enhances TA's prediction capacity on panoramic images. DAR then incorporates a cross-task complementary fusion block to adaptively merge the predictions of SAM and TA to obtain more reliable ensemble logits. Moreover, we introduce a Multi-level Knowledge Adaptation (MKA) module to efficiently transfer the multi-level feature knowledge from TA and ensemble logits to learn a compact student model. Extensive experiments on two benchmarks show that our GoodSAM achieves a remarkable +3.75% mIoU improvement over the state-of-the-art (SOTA) domain adaptation methods, e.g., [41]. Also, our most lightweight model achieves comparable performance to the SOTA methods with only 3.7M parameters.
-
In this paper, we focus on a challenging Online Task-Free Class Incremental Learning (OTFCIL) problem. Different from the existing methods that continuously learn the feature space from data streams, we propose a novel compute-and-align paradigm for the OTFCIL. It first computes an optimal geometry, i.e., the class prototype distribution, for classifying existing classes and updates it when new classes emerge, and then trains a DNN model by aligning its feature space to the optimal geometry. To this end, we develop a novel Dynamic Neural Collapse (DNC) algorithm to compute and update the optimal geometry. The DNC expands the geometry when new classes emerge without loss of the geometry optimality and guarantees the drift distance of old class prototypes with an explicit upper bound. Then, we propose a novel Dynamic feature space Self-Organization (DYSON) method containing three major components, including 1) a feature extractor, 2) a Dynamic Feature-Geometry Alignment (DFGA) module aligning the feature space to the optimal geometry computed by DNC, and 3) a training-free class-incremental classifier derived from the DNC geometry. Experimental comparison results on four benchmark datasets, including CIFAR10, CIFAR100, CUB200, and CoRe50, demonstrate the efficiency and superiority of the DYSON method. The source code is provided in the supplementary material.
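In the neural collapse literature, the optimal prototype geometry for K classes is usually taken to be a simplex equiangular tight frame (ETF): unit-norm prototypes whose pairwise cosine similarity is -1/(K-1). The sketch below builds such a frame under the assumption that this is the geometry meant above; it does not reproduce the DNC update rule or its drift bound, and `simplex_etf` is a hypothetical name.

```python
import math
from typing import List


def simplex_etf(num_classes: int) -> List[List[float]]:
    """Return K unit-norm prototypes in R^K forming a simplex ETF.

    Prototype i is sqrt(K / (K - 1)) * (e_i - (1 / K) * ones).
    """
    if num_classes < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(num_classes / (num_classes - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / num_classes)
         for j in range(num_classes)]
        for i in range(num_classes)
    ]


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


prototypes = simplex_etf(4)
```

Note that adding a class changes the target cosine from -1/(K-1) to -1/K, so every existing prototype must move slightly; bounding that movement is the part of the problem the abstract attributes to DNC.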
-
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos as the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. Our code is released at https://github.com/google-research/scenic.
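A fixed-size memory built by clustering incoming tokens can be sketched with an online k-means update, in which each new token is merged into its nearest centroid by a running mean. This is a simplified stand-in chosen for brevity, not the paper's memory module; the class name `ClusterMemory` and the toy tokens are hypothetical.

```python
from typing import List


def squared_distance(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance between two token vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


class ClusterMemory:
    """Keeps at most `size` centroids no matter how many tokens arrive."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def add(self, token: List[float]) -> None:
        # Fill empty slots first, then merge into the nearest centroid.
        if len(self.centroids) < self.size:
            self.centroids.append(list(token))
            self.counts.append(1)
            return
        nearest = min(range(self.size),
                      key=lambda k: squared_distance(self.centroids[k], token))
        self.counts[nearest] += 1
        weight = 1.0 / self.counts[nearest]
        # Running-mean update: the centroid stays the mean of its members.
        self.centroids[nearest] = [
            c + weight * (t - c)
            for c, t in zip(self.centroids[nearest], token)
        ]


memory = ClusterMemory(size=2)
for token in ([0.0, 0.0], [10.0, 10.0], [2.0, 2.0], [12.0, 12.0]):
    memory.add(token)
```

Because the number of centroids never exceeds `size`, the cost of attending to the memory stays constant regardless of video length.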
-
Despite the growing demand for accurate surface normal estimation models existing methods use general-purpose dense prediction models adopting the same inductive biases as other tasks. In this paper we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp - yet piecewise smooth - predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model our method shows a stronger generalization ability despite being trained on a dataset that is orders of magnitude smaller. The code is available at https://github.com/baegwangbin/DSINE.
-
Event sensors offer high temporal resolution visual sensing which makes them ideal for perceiving fast visual phenomena without suffering from motion blur. Certain applications in robotics and vision-based navigation require 3D perception of an object undergoing circular or spinning motion in front of a static camera such as recovering the angular velocity and shape of the object. The setting is equivalent to observing a static object with an orbiting camera. In this paper we propose event-based structure-from-orbit (eSfO) where the aim is to simultaneously reconstruct the 3D structure of a fast spinning object observed from a static event camera and recover the equivalent orbital motion of the camera. Our contributions are threefold: since state-of-the-art event feature trackers cannot handle periodic self-occlusion due to the spinning motion we develop a novel event feature tracker based on spatio-temporal clustering and data association that can better track the helical trajectories of valid features in the event data. The feature tracks are then fed to our novel factor graph-based structure-from-orbit back-end that calculates the orbital motion parameters (e.g. spin rate relative rotational axis) that minimize the reprojection error. For evaluation we produce a new event dataset of objects under spinning motion. Comparisons against ground truth indicate the efficacy of eSfO.
-
Event cameras have significant advantages in capturing dynamic scene information while being prone to noise interference, particularly in challenging conditions like low threshold and low illumination. However most existing research focuses on gentle situations hindering event camera applications in realistic complex scenarios. To tackle this limitation and advance the field we construct a new paired real-world event denoising dataset (LED) including 3K sequences with 18K seconds of high-resolution (1200*680) event streams and showing three notable distinctions compared to others: diverse noise levels and scenes, larger scale with high resolution, and high-quality GT. Specifically it contains stepped parameters and varying illumination with diverse scenarios. Moreover based on the property of noise events inconsistency and signal events consistency we propose a novel effective denoising framework (DED) using homogeneous dual events to generate the GT with better separation of noise from the raw data. Furthermore we design a bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with dynamic thresholds to realize accurate denoising. The experimental results demonstrate the remarkable performance of the proposed approach on different datasets. The dataset and code are at https://github.com/Yee-Sing/led.
-
Federated learning (FL) has emerged as a new paradigm for privacy-preserving collaborative training. Under domain skew the current FL approaches are biased and face two fairness problems. 1) Parameter Update Conflict: data disparity among clients leads to varying parameter importance and inconsistent update directions. These two disparities cause important parameters to potentially be overwhelmed by unimportant ones of dominant updates. This consequently results in significant performance decreases for lower-performing clients. 2) Model Aggregation Bias: existing FL approaches introduce unfair weight allocation and neglect domain diversity. This leads to a biased model convergence objective and distinct performance among domains. We discover a pronounced directional update consistency in Federated Learning and propose a novel framework to tackle the above issues. First leveraging the discovered characteristic we selectively discard unimportant parameter updates to prevent updates from clients with lower performance from being overwhelmed by unimportant parameters resulting in fairer generalization performance. Second we propose a fair aggregation objective to prevent global model bias towards some domains ensuring that the global model continuously aligns with an unbiased model. The proposed method is generic and can be combined with other existing FL methods to enhance fairness. Comprehensive experiments on Digits and Office-Caltech demonstrate the high fairness and performance of our method.
-
In this work we study a novel problem which focuses on person identification while performing daily activities. Learning biometric features from RGB videos is challenging due to spatio-temporal complexity and presence of appearance biases such as clothing color and background. We propose ABNet a novel framework which leverages disentanglement of biometric and non-biometric features to perform effective person identification from daily activities. ABNet relies on a bias-less teacher to learn biometric features from RGB videos and explicitly disentangle non-biometric features with the help of biometric distortion. In addition ABNet also exploits activity prior for biometrics which is enabled by joint biometric and activity learning. We perform comprehensive evaluation of the proposed approach across five different datasets which are derived from existing activity recognition benchmarks. Furthermore we extensively compare ABNet with existing works in person identification and demonstrate its effectiveness for activity-based biometrics across all five datasets. The code and dataset can be accessed at: https://github.com/sacrcv/Activity-Biometrics/
-
Despite the remarkable progress in image style transfer formulating style in the context of art is inherently subjective and challenging. In contrast to existing methods this study shows that vanilla diffusion models can directly extract style information and seamlessly integrate the generative prior into the content image without retraining. Specifically we adopt dual denoising paths to represent content/style references in latent space and then guide the content image denoising process with style latent codes. We further reveal that the cross-attention mechanism in latent diffusion models tends to blend the content and style images resulting in stylized outputs that deviate from the original content image. To overcome this limitation we introduce a cross-attention reweighting strategy. Through theoretical analysis and experiments we demonstrate the effectiveness and superiority of the diffusion-based zero-shot style transfer via attention reweighting Z-STAR.
-
Visual interactivity understanding within visual scenes presents a significant challenge in computer vision. Existing methods focus on complex interactivities while leveraging a simple relationship model. These methods however struggle with a diversity of appearance situation position interaction and relation in videos. This limitation hinders the ability to fully comprehend the interplay within the complex visual dynamics of subjects. In this paper we delve into interactivities understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects. To achieve this goal we first present a new dataset containing Appearance-Situation-Position-Interaction-Relation predicates named ASPIRe offering an extensive collection of videos marked by a wide range of interactivities. Then we propose a new approach named Hierarchical Interlacement Graph (HIG) which leverages a unified layer and graph within a hierarchical structure to provide deep insights into scene changes across five distinct tasks. Our approach demonstrates superior performance to other methods through extensive experiments conducted in various scenarios.
-
Trajectory prediction is fundamental in computer vision and autonomous driving particularly for understanding pedestrian behavior and enabling proactive decision-making. Existing approaches in this field often assume precise and complete observational data neglecting the challenges associated with out-of-view objects and the noise inherent in sensor data due to limited camera range physical obstructions and the absence of ground truth for denoised sensor data. Such oversights are critical safety concerns as they can result in missing essential non-visible objects. To bridge this gap we present a novel method for out-of-sight trajectory prediction that leverages a vision-positioning technique. Our approach denoises noisy sensor observations in an unsupervised manner and precisely maps sensor-based trajectories of out-of-sight objects into visual trajectories. This method has demonstrated state-of-the-art performance in out-of-sight noisy sensor trajectory denoising and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory prediction accuracy and addressing the challenges of out-of-sight objects our work significantly contributes to improving the safety and reliability of autonomous driving in complex environments. Our work represents the first initiative towards Out-Of-Sight Trajectory prediction (OOSTraj) setting a new benchmark for future research.
-
Learning fair representation in deep learning is essential to mitigate discriminatory outcomes and enhance trustworthiness. However previous research has been commonly established on inappropriate assumptions prone to unrealistic counterfactuals and performance degradation. Although some proposed alternative approaches such as employing correlation-aware causal graphs or proxies for mutual information these methods are less practical and not applicable in general. In this work we propose FAir DisEntanglement with Sensitive relevance (FADES) a novel approach that leverages conditional mutual information from the information theory perspective to address these challenges. We employ sensitive relevant code to direct correlated information between target labels and sensitive attributes by imposing conditional independence allowing better separation of the features of interest in the latent space. Utilizing an intuitive disentangling approach FADES consistently achieves superior performance and fairness both quantitatively and qualitatively with its straightforward structure. Specifically the proposed method outperforms existing works in downstream classification and counterfactual generations on various benchmarks.
-
Current controls over diffusion models (e.g. through text or ControlNet) for image generation fall short in recognizing abstract continuous attributes like illumination direction or non-rigid shape change. In this paper we present an approach for allowing users of text-to-image models to have fine-grained control of several attributes in an image. We do this by engineering special sets of input tokens that can be transformed in a continuous manner; we call them Continuous 3D Words. These attributes can for example be represented as sliders and applied jointly with text prompts for fine-grained control over image generation. Given only a single mesh and a rendering engine we show that our approach can be adopted to provide continuous user control over several 3D-aware attributes including time-of-day illumination bird wing orientation dolly zoom effect and object poses. Our method is capable of conditioning image creation with multiple Continuous 3D Words and text descriptions simultaneously while adding no overhead to the generative process.
-
Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt but also compatible with each other. In this work we propose a light-weight approach to achieving this compatibility between different regions of an image using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model MarkovGen uses this proposed MRF model to both speed up Muse by 1.5x and produce higher quality images by decreasing undesirable image artifacts.
-
The perception of motion behavior in a dynamic environment holds significant importance for autonomous driving systems wherein class-agnostic motion prediction methods directly predict the motion of the entire point cloud. While most existing methods rely on fully-supervised learning the manual labeling of point cloud data is laborious and time-consuming. Therefore several annotation-efficient methods have been proposed to address this challenge. Although effective these methods rely on weak annotations or additional multi-modal data like images and the potential benefits inherent in the point cloud sequence are still underexplored. To this end we explore the feasibility of self-supervised motion prediction with only unlabeled LiDAR point clouds. Initially we employ an optimal transport solver to establish coarse correspondences between current and future point clouds as the coarse pseudo motion labels. Training models directly using such coarse labels leads to noticeable spatial and temporal prediction inconsistencies. To mitigate these issues we introduce three simple spatial and temporal regularization losses which facilitate the self-supervised training process effectively. Experimental results demonstrate the significant superiority of our approach over the state-of-the-art self-supervised methods. Code will be available.
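
The coarse pseudo-label step can be illustrated with entropy-regularized optimal transport between two point sets. This is a minimal sketch under assumptions: the abstract names only "an optimal transport solver", so the Sinkhorn iteration, the uniform point masses, the regularization value, and all function names here are stand-ins rather than the authors' choices.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized transport plan between two uniformly weighted sets."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def pseudo_flow(current, future, reg=0.1):
    """Coarse per-point motion: displacement to the highest-mass future match."""
    cost = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in future] for p in current]
    plan = sinkhorn(cost, reg)
    matches = [max(range(len(future)), key=lambda j: row[j]) for row in plan]
    return [[b - a for a, b in zip(p, future[j])] for p, j in zip(current, matches)]

# Two points that each move by roughly (0.1, 0.1); the future set is shuffled.
current = [[0.0, 0.0], [2.0, 0.0]]
future = [[2.1, 0.1], [0.1, 0.1]]
flow = pseudo_flow(current, future)
```

Labels obtained this way are assigned per point with no smoothness constraint, which is consistent with the abstract's remark that training on them directly gives spatially and temporally inconsistent predictions.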
-
In this paper we address the problem of efficient point searching and sampling for volume neural rendering. Within this realm two typical approaches are employed: rasterization and ray tracing. The rasterization-based methods enable real-time rendering at the cost of increased memory and lower fidelity. In contrast the ray-tracing-based methods yield superior quality but demand longer rendering time. We solve this problem by our HashPoint method combining these two strategies leveraging rasterization for efficient point searching and sampling and ray marching for rendering. Our method optimizes point searching by rasterizing points within the camera's view organizing them in a hash table and facilitating rapid searches. Notably we accelerate the rendering process by adaptive sampling on the primary surface encountered by the ray. Our approach yields substantial speed-up for a range of state-of-the-art ray-tracing-based methods maintaining equivalent or superior accuracy across synthetic and real test datasets. The code will be available at https://jiahao-ma.github.io/hashpoint/
-
In recent interactive segmentation algorithms previous probability maps are used as network input to help predictions in the current segmentation round. However despite the utilization of previous masks useful information contained in the probability maps is not well propagated to the current predictions. In this paper to overcome this limitation we propose a novel and effective algorithm for click-based interactive image segmentation called MFP which attempts to make full use of probability maps. We first modulate previous probability maps to enhance their representations of user-specified objects. Then we feed the modulated probability maps as additional input to the segmentation network. We implement the proposed MFP algorithm based on the ResNet-34 HRNet-18 and ViT-B backbones and assess the performance extensively on various datasets. It is demonstrated that MFP meaningfully outperforms the existing algorithms using identical backbones. The source codes are available at https://github.com/cwlee00/MFP.
-
We describe a novel method StyLitGAN for relighting and resurfacing images in the absence of labeled data. StyLitGAN generates images with realistic lighting effects including cast shadows soft shadows inter-reflections and glossy effects without the need for paired or CGI data. StyLitGAN uses an intrinsic image method to decompose an image followed by a search of the latent space of a pretrained StyleGAN to identify a set of directions. By prompting the model to fix one component (e.g. albedo) and vary another (e.g. shading) we generate relighted images by adding the identified directions to the latent style codes. Quantitative metrics of change in albedo and lighting diversity allow us to choose effective directions using a forward selection process. Qualitative evaluation confirms the effectiveness of our method.
-
The laws of model size data volume computation and model performance have been extensively studied in the field of Natural Language Processing (NLP). However the scaling laws in Scene Text Recognition (STR) have not yet been investigated. To address this we conducted comprehensive studies that involved examining the correlations between performance and the scale of models data volume and computation in the field of text recognition. Conclusively the study demonstrates smooth power laws between performance and model size as well as training data volume when other influencing factors are held constant. Additionally we have constructed a large-scale dataset called REBU-Syn which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset we have successfully trained a scene text recognition model achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%. The models and dataset are publicly available at https://github.com/large-ocr-model/large-ocr-model.github.io.
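
A power law of the form error = a * size^(-alpha) becomes a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below illustrates that with synthetic numbers only; the coefficient 2.0, the exponent 0.25, and the model sizes are invented for the example and are not results from the paper.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - alpha * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic data that follows the law exactly (assumed values, not measurements).
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [2.0 * s ** -0.25 for s in sizes]
a, alpha = fit_power_law(sizes, errors)
```

With real measurements the points scatter around the line, and the fit is only meaningful when, as the abstract notes, the other influencing factors are held constant.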
-
We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network Text2Loc that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition followed by fine localization. In global place recognition relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM) whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover we propose a novel matching-free fine localization method to further refine the location predictions which completely removes the need for complicated text-instance matching and is lighter faster and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2x over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at: https://yan-xia.github.io/projects/text2loc/.
-
Tensor network (TN) representation is a powerful technique for computer vision and machine learning. TN structure search (TN-SS) aims to search for a customized structure to achieve a compact representation which is a challenging NP-hard problem. Recent "sampling-evaluation"-based methods require sampling an extensive collection of structures and evaluating them one by one resulting in prohibitively high computational costs. To address this issue we propose a novel TN paradigm named SVD-inspired TN decomposition (SVDinsTN) which allows us to efficiently solve the TN-SS problem from a regularized modeling perspective eliminating the repeated structure evaluations. To be specific by inserting a diagonal factor for each edge of the fully-connected TN SVDinsTN allows us to calculate TN cores and diagonal factors simultaneously with the factor sparsity revealing a compact TN structure. In theory we prove a convergence guarantee for the proposed method. Experimental results demonstrate that the proposed method achieves approximately 100 to 1000 times acceleration compared to the state-of-the-art TN-SS methods while maintaining a comparable level of representation ability.
-
Medical vision language pre-training (VLP) has emerged as a frontier of research enabling zero-shot pathological recognition by comparing the query image with the textual descriptions for each disease. Due to the complex semantics of biomedical texts current methods struggle to align medical images with key pathological findings in unstructured reports. This leads to the misalignment with the target disease's textual representation. In this paper we introduce a novel VLP framework designed to dissect disease descriptions into their fundamental aspects leveraging prior knowledge about the visual manifestations of pathologies. This is achieved by consulting a large language model and medical experts. Integrating a Transformer module our approach aligns an input image with the diverse elements of a disease generating aspect-centric image representations. By consolidating the matches from each aspect we improve the compatibility between an image and its associated disease. Additionally capitalizing on the aspect-oriented representations we present a dual-head Transformer tailored to process known and unknown diseases optimizing the comprehensive detection efficacy. Conducting experiments on seven downstream datasets ours improves the accuracy of recent methods by up to 8.56% and 17.26% for seen and unseen categories respectively. Our code is released at https://github.com/HieuPhan33/MAVL.
-
We introduce MoMask a novel masked modeling framework for text-driven 3D human motion generation. In MoMask a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer with a sequence of motion tokens obtained by vector quantization the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is followed by two distinct bidirectional transformers. For the base-layer motion tokens a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at the training stage. During the generation (i.e. inference) stage starting from an empty sequence our Masked Transformer iteratively fills in the missing tokens. Subsequently a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-the-art methods on the text-to-motion generation task with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset and 0.228 (vs 0.514) on KIT-ML respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning such as text-guided temporal inpainting.
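
The hierarchical quantization described above follows the general pattern of residual vector quantization: the base layer encodes the vector, and each later layer encodes what the previous layers left over. The sketch below shows that pattern on a single 2D vector. The codebooks are hand-picked toy values and the function names are hypothetical; it is not MoMask's tokenizer.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared L2 distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def residual_quantize(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    tokens, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def reconstruct(tokens, codebooks):
    """Sum the selected entry of every layer to approximate the input."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # base layer (toy values)
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # first residual layer (toy values)
]
tokens, residual = residual_quantize([1.2, 0.9], codebooks)
approx = reconstruct(tokens, codebooks)
```

Here the base token alone reconstructs (1.0, 1.0), and adding the residual token refines the first coordinate to 1.25, closer to the input value 1.2.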
-
Inverse rendering aims at recovering both geometry and materials of objects. It provides a more compatible reconstruction for conventional rendering engines compared with the neural radiance fields (NeRFs). On the other hand existing NeRF-based inverse rendering methods cannot handle glossy objects with local light interactions well as they typically oversimplify the illumination as a 2D environmental map which assumes infinite lights only. Observing the superiority of NeRFs in recovering radiance fields we propose a novel 5D Neural Plenoptic Function (NeP) based on NeRFs and ray tracing such that more accurate lighting-object interactions can be formulated via the rendering equation. We also design a material-aware cone sampling strategy to efficiently integrate lights inside the BRDF lobes with the help of pre-filtered radiance fields. Our method has two stages: the geometry of the target object and the pre-filtered environmental radiance fields are reconstructed in the first stage and materials of the target object are estimated in the second stage with the proposed NeP and material-aware cone sampling strategy. Extensive experiments on the proposed real-world and synthetic datasets demonstrate that our method can reconstruct high-fidelity geometry/materials of challenging glossy objects with complex lighting interactions from nearby objects. Project webpage: https://whyy.site/paper/nep
-
Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet most transfer approaches for VLMs focus on either the language or visual branches overlooking the nuanced interplay between both modalities. In this work we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS.
-
Affine subspaces of Euclidean spaces are also referred to as flats. A standard task in computer vision or more generally in engineering and applied sciences is fitting a flat to a set of points which is commonly solved using the PCA. We generalize this technique to enable fitting a flat to a set of other flats possibly of varying dimensions based on representing the flats as squared distance fields. Compared to previous approaches such as Riemannian centers of mass in the manifold of affine Grassmannians our approach is conceptually much simpler and computationally more efficient yet offers desirable properties such as respecting symmetries and being equivariant to rigid transformations leading to more intuitive and useful results in practice. We demonstrate these claims in a number of synthetic experiments and a multi-view reconstruction task of line-like objects.
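
For reference, the classical baseline that the abstract generalizes, fitting a flat to points with PCA, can be sketched in the simplest case of a line in 2D. The fitted flat passes through the centroid along the principal axis of the scatter matrix. This covers only the point-fitting baseline, not the flat-to-flats method, and the function name is an assumption.

```python
import math

def fit_line_pca(points):
    """Fit a 1-flat (line) to 2D points; return (centroid, unit direction)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Entries of the 2x2 scatter matrix of the centered points.
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Closed-form angle of the eigenvector with the largest eigenvalue.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))

# Points lying exactly on y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
centroid, direction = fit_line_pca(points)
slope = direction[1] / direction[0]
```

In higher dimensions the same idea uses the top eigenvectors of the covariance matrix, one per dimension of the flat.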
-
As wearable cameras become more popular an important question emerges: how to identify camera wearers within the perspective of conventional static cameras. The drastic difference between first-person (egocentric) and third-person (exocentric) camera views makes this a challenging task. We present PersonEnvironmentNet (PEN) a framework designed to integrate information from both the individuals in the two views and geometric cues inferred from the background environment. To facilitate research in this direction we also present TF2023 a novel dataset comprising synchronized first-person and third-person views along with masks of camera wearers and labels associating these masks with the respective first-person views. In addition we propose a novel quantitative metric designed to measure a model's ability to comprehend the relationship between the two views. Our experiments reveal that PEN outperforms existing methods. The code and dataset are available at https://github.com/ziweizhao1993/PEN.
-
Point cloud matching a crucial technique in computer vision medical and robotics fields is primarily concerned with finding correspondences between pairs of point clouds or voxels. In some practical scenarios emphasizing local differences is crucial for accurately identifying a correct match thereby enhancing the overall robustness and reliability of the matching process. Commonly used shape descriptors have several limitations and often fail to provide meaningful local insights about the paired geometries. In this work we propose a new technique based on graph Laplacian eigenmaps to match point clouds by taking into account fine local structures. To deal with the order and sign ambiguity of Laplacian eigenmaps we introduce a new operator called Coupled Laplacian that allows to easily generate aligned eigenspaces for multiple registered geometries. We show that the similarity between those aligned high-dimensional spaces provides a locally meaningful score to match shapes. We firstly evaluate the performance of the proposed technique in a point-wise manner focusing on the task of object anomaly localization on the MVTec 3D-AD dataset. Additionally we define a new medical task called automatic Bone Side Estimation (BSE) which we address through a global similarity score derived from coupled eigenspaces. In order to test it we propose a benchmark collecting bone surface structures from various public datasets. Our matching technique based on Coupled Laplacian outperforms other methods by reaching an impressive accuracy on both tasks.
-
Foundation models encompass an extensive knowledge base and offer remarkable transferability. However this knowledge becomes outdated or insufficient over time. The challenge lies in continuously updating foundation models to accommodate novel information while retaining their original capabilities. Leveraging the fact that foundation models have initial knowledge on various tasks and domains we propose a novel approach that instead of updating all parameters equally localizes the updates to a sparse set of parameters relevant to the task being learned. We strike a balance between efficiency and new task performance while maintaining the transferability and generalizability of foundation models. We extensively evaluate our method on foundational vision-language models with a diverse spectrum of continual learning tasks. Our method achieves improvements on the accuracy of the newly learned tasks up to 7% while preserving the pretraining knowledge with a negligible decrease of 0.9% on a representative control set accuracy.
-
Templates serve as a good starting point to implement a design (e.g. banner slide) but they take great effort for designers to create manually. In this paper we present Desigen an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. Different from natural images a background image should preserve enough non-salient space for the overlaying layout elements. To equip existing advanced diffusion-based models with stronger spatial control we propose two simple but effective techniques to constrain the saliency distribution and reduce the attention weight in desired regions during the background generation process. Then conditioned on the background we synthesize the layout with a Transformer-based autoregressive generator. To achieve a more harmonious composition we propose an iterative inference strategy to adjust the synthesized background and layout in multiple rounds. We constructed a design dataset with more than 40k advertisement banners to verify our approach. Extensive experiments demonstrate that the proposed pipeline generates high-quality templates comparable to human designers. Beyond single-page design we further show an application of presentation generation that outputs a set of theme-consistent slides. The data and code are available at https://whaohan.github.io/desigen.
-
When editing a video a piece of attractive background music is indispensable. However video background music generation tasks face several challenges for example the lack of suitable training datasets and the difficulties in flexibly controlling the music generation process and sequentially aligning the video and music. In this work we first propose a high-quality music-video dataset BGM909 with detailed annotation and shot detection to provide multi-modal information about the video and music. We then present evaluation metrics to assess music quality including music diversity and alignment between music and video with retrieval precision metrics. Finally we propose the Diff-BGM framework to automatically generate the background music for a given video which uses different signals to control different aspects of the music during the generation process i.e. uses dynamic video features to control music rhythm and semantic features to control the melody and atmosphere. We propose to align the video and music sequentially by introducing a segment-aware cross-attention layer. Experiments verify the effectiveness of our proposed method. The code and models are available at https://github.com/sizhelee/Diff-BGM.
-
Audiovisual representation learning typically relies on the correspondence between sight and sound. However there are often multiple audio tracks that can correspond with a visual scene. Consider for example different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks differing only in speech similarly to the same video. Our results from a comprehensive set of experiments investigating different training strategies show this general approach improves performance on a range of downstream auditory and audiovisual tasks without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
-
Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs recent works lessen the quadratic cost of the self-attention layer by pruning or fusing the redundant tokens. However these works faced the speed-accuracy trade-off caused by the loss of information. Here we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper we propose a Multi-criteria Token Fusion (MCTF) that gradually fuses the tokens based on multi-criteria (i.e. similarity informativeness and size of fused tokens). Further we utilize the one-step-ahead attention which is the improved approach to capture the informativeness of the tokens. By training the model equipped with MCTF using a token reduction consistency we achieve the best speed-accuracy trade-off in the image classification (ImageNet1K). Experimental results prove that MCTF consistently surpasses the previous reduction methods with and without training. Specifically DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving the performance (+0.5% and +0.3%) over the base model respectively. We also demonstrate the applicability of MCTF in various Vision Transformers (e.g. T2T-ViT LV-ViT) achieving at least 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
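As a rough illustration of fusing tokens under more than one criterion, the sketch below merges the single best-scoring pair of tokens. It is a toy under stated assumptions, not the MCTF implementation: the scoring rule (cosine similarity divided by the product of the token sizes) and the names `cosine` and `fuse_once` are hypothetical, and the informativeness criterion from the abstract is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def fuse_once(tokens, sizes):
    """Fuse the highest-scoring token pair and return the new (tokens, sizes).

    Hypothetical score: similarity, damped by the sizes of the two tokens so
    that tokens which already absorbed many others are less likely to grow.
    """
    best, pair = -math.inf, None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            score = cosine(tokens[i], tokens[j]) / (sizes[i] * sizes[j])
            if score > best:
                best, pair = score, (i, j)
    i, j = pair
    total = sizes[i] + sizes[j]
    # Size-weighted average keeps the fused token representative of its members.
    merged = [(sizes[i] * a + sizes[j] * b) / total
              for a, b in zip(tokens[i], tokens[j])]
    kept = [t for k, t in enumerate(tokens) if k not in pair]
    kept_sizes = [s for k, s in enumerate(sizes) if k not in pair]
    return kept + [merged], kept_sizes + [total]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sizes = [1, 1, 1]
new_tokens, new_sizes = fuse_once(tokens, sizes)
```

In this toy input the first two tokens are nearly parallel, so they are the pair that gets fused, leaving two tokens with sizes 1 and 2.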
-
Spiking cameras offer the benefits of high dynamic range (HDR), high temporal resolution, and low data redundancy. However, reconstructing HDR videos in high-speed conditions using single-bit spikings presents challenges due to the limited bit depth. Increasing the bit depth of the spikings is advantageous for boosting HDR performance, but it decreases readout efficiency, which is unfavorable for achieving high frame rate (HFR) video. To address these challenges, we propose a readout mechanism to obtain rolling-mixed-bit (RMB) spikings, which interleaves multi-bit spikings within the single-bit spikings in a rolling manner, thereby combining the characteristics of high bit depth and efficient readout. Furthermore, we introduce RMB-Net for reconstructing HDR and HFR videos. RMB-Net comprises a cross-bit attention block for fusing mixed-bit spikings and a cross-time attention block for achieving temporal fusion. Extensive experiments conducted on synthetic and real-synthetic data demonstrate the superiority of our method. For instance, pure 3-bit spikings result in 3 times the data volume, whereas our method achieves comparable performance with less than a 2% increase in data volume.
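The data-volume figures can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that data volume scales linearly with the number of bits read out and that a fixed fraction of readouts are multi-bit; the function name `relative_volume` and the 1% fraction are hypothetical and not taken from the paper.

```python
def relative_volume(multi_bit_fraction, bits=3):
    """Data volume relative to pure single-bit readout.

    Assumes each single-bit readout costs 1 unit and each multi-bit
    readout costs `bits` units (a simplifying assumption).
    """
    return (1 - multi_bit_fraction) * 1 + multi_bit_fraction * bits

pure = relative_volume(1.0)    # every readout is 3-bit
mixed = relative_volume(0.01)  # hypothetical: 1 in 100 readouts is 3-bit
```

Under this simplified model, pure 3-bit readout triples the volume, while keeping the increase below 2% requires fewer than about 1% of readouts to be 3-bit.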
-
Long-form video content constitutes a significant portion of internet traffic making automated video summarization an essential research problem. However existing video summarization datasets are notably limited in their size constraining the effectiveness of state-of-the-art methods for generalization. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals. Extensive experiments clearly indicate that our proposed approach sets a new state-of-the-art in video summarization across several benchmarks.
-
Most current arbitrary-scale image super-resolution (SR) methods have relied on simulated data generated by simple synthetic degradation models (e.g. bicubic downsampling) at various continuous scales, and thus fall short in capturing the complex degradation of real-world images. This limitation hinders the visual quality of these methods when applied to real-world images. To address this issue, we propose the Continuous Optical Zooming dataset (COZ), built with an automatic imaging system that collects images at fine-grained focal lengths within a specific range and provides strict image pair alignment. The COZ dataset serves as a benchmark that provides real-world data for training and testing arbitrary-scale SR models. To enhance the model's robustness against real-world image degradation, we propose a Local Mix Implicit network (LMI) based on the MLP-mixer architecture and meta-learning, which directly learns local texture information by simultaneously mixing features and coordinates of multiple independent points. Extensive experiments demonstrate the superior performance of arbitrary-scale SR models trained on the COZ dataset compared to models trained on simulated data. Our LMI model is also more effective than other models. This study is significant for developing more efficient algorithms and improving the performance of arbitrary-scale image SR methods in practical applications. Our dataset and code are available at https://github.com/pf0607/COZ.
-
Gaze is a powerful form of non-verbal communication that humans develop from an early age. As such modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular the gaze following task in computer vision is defined as the prediction of the 2D pixel coordinates where a person in the image is looking. Previous attempts in this area have primarily centered on CNN-based architectures but they have been constrained by the need to process one person at a time which proves to be highly inefficient. In this paper we introduce a novel and effective multi-person transformer-based architecture for gaze prediction. While there exist prior works using transformers for multi-person gaze prediction they use a fixed set of learnable embeddings to decode both the person and its gaze target which requires a matching step afterward to link the predictions with the annotations. Thus it is difficult to quantitatively evaluate these methods reliably with the available benchmarks or integrate them into a larger human behavior understanding system. Instead we are the first to propose a multi-person transformer-based architecture that maintains the original task formulation and ensures control over the people fed as input. Our main contribution lies in encoding the person-specific information into a single controlled token to be processed alongside image tokens and using its output for prediction based on a novel multiscale decoding mechanism. Our new architecture achieves state-of-the-art results on the GazeFollow VideoAttentionTarget and ChildPlay datasets and outperforms comparable multi-person architectures with a notable margin. Our code checkpoints and data extractions will be made publicly available soon.
-
Novel-view synthesis through diffusion models has demonstrated remarkable potential for generating diverse and high-quality images. Yet the independent process of image generation in these prevailing methods leads to challenges in maintaining multiple-view consistency. To address this we introduce ViewFusion a novel training-free algorithm that can be seamlessly integrated into existing pre-trained diffusion models. Our approach adopts an auto-regressive method that implicitly leverages previously generated views as context for the next view generation ensuring robust multi-view consistency during the novel-view generation process. Through a diffusion process that fuses known-view information via interpolated denoising our framework successfully extends single-view conditioned models to work in multiple-view conditional settings without any additional fine-tuning. Extensive experimental results demonstrate the effectiveness of ViewFusion in generating consistent and detailed novel views.
-
We propose SketchINR to advance the representation of vector sketches with implicit neural models. A variable length vector sketch is compressed into a latent space of fixed dimension that implicitly encodes the underlying shape as a function of time and strokes. The learned function predicts the xy point coordinates in a sketch at each time and stroke. Despite its simplicity SketchINR outperforms existing representations at multiple tasks: (i) Encoding an entire sketch dataset into a fixed size latent vector SketchINR gives 60x and 10x data compression over raster and vector sketches respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity representation than other learned vector sketch representations and is uniquely able to scale to complex vector sketches such as FS-COCO. (iii) SketchINR supports parallelisation that can decode/render 100x faster than other learned vector representations such as SketchRNN. (iv) SketchINR for the first time emulates the human ability to reproduce a sketch with varying abstraction in terms of number and complexity of strokes. As a first look at implicit sketches SketchINR's compact high-fidelity representation will support future work in modelling long and complex sketches.
-
This paper studies open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with generalized contextual prior of CLIP. As the core of open-vocabulary understanding alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge recent works propose to utilize CLIP as an additional classifier and aggregate model predictions with CLIP classification results. Despite their remarkable progress performance of OVS methods in relevant scenarios is still unsatisfactory compared with supervised counterparts. We attribute this to the in-vocabulary embedding and domain-biased CLIP prediction. To this end we present a Semantic-assisted CAlibration Network (SCAN). In SCAN we incorporate generalized semantic prior of CLIP into proposal embedding to avoid collapsing on known categories. Besides a contextual shift strategy is applied to mitigate the lack of global context and unnatural background noise. With above designs SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks. Furthermore we also focus on the problem of existing evaluation system that ignores semantic duplication across categories and propose a new metric called Semantic-Guided IoU (SG-IoU).
-
Recent learning methods for object pose estimation require resource-intensive training for each individual object instance or category hampering their scalability in real applications when confronted with previously unseen objects. In this paper we propose MatchU a Fuse-Describe-Match strategy for 6D pose estimation from RGB-D images. MatchU is a generic approach that fuses 2D texture and 3D geometric cues for 6D pose prediction of unseen objects. We rely on learning geometric 3D descriptors that are rotation-invariant by design. By encoding pose-agnostic geometry the learned descriptors naturally generalize to unseen objects and capture symmetries. To tackle ambiguous associations using 3D geometry only we fuse additional RGB information into our descriptor. This is achieved through a novel attention-based mechanism that fuses cross-modal information together with a matching loss that leverages the latent space learned from RGB data to guide the descriptor learning process. Extensive experiments reveal the generalizability of both the RGB-D fusion strategy as well as the descriptor efficacy. Benefiting from the novel designs MatchU surpasses all existing methods by a significant margin in terms of both accuracy and speed even without the requirement of expensive re-training or rendering.
-
Progress in lighting estimation is tracked by computing existing image quality assessment (IQA) metrics on images from standard datasets. While this may appear to be a reasonable approach we demonstrate that doing so does not correlate to human preference when the estimated lighting is used to relight a virtual scene into a real photograph. To study this we design a controlled psychophysical experiment where human observers must choose their preference amongst rendered scenes lit using a set of lighting estimation algorithms selected from the recent literature and use it to analyse how these algorithms perform according to human perception. Then we demonstrate that none of the most popular IQA metrics from the literature taken individually correctly represent human perception. Finally we show that by learning a combination of existing IQA metrics we can more accurately represent human preference. This provides a new perceptual framework to help evaluate future lighting estimation algorithms. To encourage future research all (anonymised) perceptual data and code are available at https://lvsn.github.io/PerceptionMetric/.
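The abstract does not say what form the learned combination takes. One simple possibility, shown here only as an assumption, is a linear least-squares fit of metric scores to human preference scores. The function name `fit_two_metric_weights` and all numbers below are made up for illustration.

```python
def fit_two_metric_weights(a, b, human):
    """Least-squares weights (wa, wb) so that wa*a + wb*b approximates human.

    Solves the 2x2 normal equations directly (no intercept term).
    """
    saa = sum(x * x for x in a)
    sbb = sum(y * y for y in b)
    sab = sum(x * y for x, y in zip(a, b))
    sah = sum(x * h for x, h in zip(a, human))
    sbh = sum(y * h for y, h in zip(b, human))
    det = saa * sbb - sab * sab
    wa = (sah * sbb - sab * sbh) / det
    wb = (saa * sbh - sab * sah) / det
    return wa, wb

# Toy scores from two hypothetical IQA metrics on four rendered images.
metric_a = [0.2, 0.5, 0.9, 0.4]
metric_b = [0.7, 0.1, 0.6, 0.3]
# Synthetic "human preference" built from known weights, to check the fit.
human = [0.3 * x + 0.7 * y for x, y in zip(metric_a, metric_b)]

wa, wb = fit_two_metric_weights(metric_a, metric_b, human)
```

Because the toy preference scores were constructed from weights 0.3 and 0.7, the fit recovers those weights; real perceptual data would be noisy and could call for a different model.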
-
The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming especially for authentic images. Training on synthetic data is expected to be beneficial but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work we make a key observation that introducing more distortion types in the synthetic dataset may not improve or even be harmful to generalizing authentic image quality assessment. To solve this challenge we propose distortion-guided unsupervised domain adaptation for BIQA (DGQA) a novel framework that leverages adaptive multi-domain selection via prior knowledge from distortion to match the data distribution between the source domains and the target domain thereby reducing negative transfer from the outlier source domains. Extensive experiments on two cross-domain settings (synthetic distortion to authentic distortion and synthetic distortion to algorithmic distortion) have demonstrated the effectiveness of our proposed DGQA. Besides DGQA is orthogonal to existing model-based BIQA methods and can be used in combination with such models to improve performance with less training data.
-
Data replay is a successful incremental learning technique for images. It prevents catastrophic forgetting by keeping a reservoir of previous data, original or synthesized, to ensure the model retains past knowledge while adapting to novel concepts. However, its application in the video domain is rudimentary, as it simply stores frame exemplars for action recognition. This paper presents the first exploration of video data replay techniques for incremental action segmentation, focusing on action temporal modeling. We propose a Temporally Coherent Action (TCA) model, which represents actions using a generative model instead of storing individual frames. The integration of a conditioning variable that captures temporal coherence allows our model to understand the evolution of action features over time. Therefore, action segments generated by TCA for replay are diverse and temporally coherent. In a 10-task incremental setup on the Breakfast dataset, our approach achieves significant increases in accuracy of up to 22% compared to the baselines.
-
We have recently seen tremendous progress in photo-real human modeling and rendering. Yet efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. In this paper we present HiFi4G an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. Our core intuition is to marry the 3D Gaussian representation with non-rigid tracking achieving a compact and compression-friendly representation. We first propose a dual-graph mechanism to obtain motion priors with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints. Then we utilize a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating. We also present a companion compression scheme with residual compensation for immersive experiences on various platforms. It achieves a substantial compression rate of approximately 25 times with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of our approach which significantly outperforms existing approaches in terms of optimization speed rendering quality and storage overhead.
-
This paper proposes a novel task named "3D part grouping". Suppose there is a mixed set containing scattered parts from various shapes. This task requires algorithms to find out every possible combination among all the parts. To address this challenge we propose the so called Gradient Field-based Auto-Regressive Sampling framework (G-FARS) tailored specifically for the 3D part grouping task. In our framework we design a gradient-field-based selection graph neural network (GNN) to learn the gradients of a log conditional probability density in terms of part selection where the condition is the given mixed part set. This innovative approach implemented through the gradient-field-based selection GNN effectively captures complex relationships among all the parts in the input. Upon completion of the training process our framework becomes capable of autonomously grouping 3D parts by iteratively selecting them from the mixed part set leveraging the knowledge acquired by the trained gradient-field-based selection GNN. Our code is available at: https://github.com/J-F-Cheng/G-FARS-3DPartGrouping.
-
We develop a novel vectorized image representation scheme accommodating both shape/geometry and texture in a decoupled way particularly tailored for reconstruction and editing tasks of artistic/design images such as Emojis and Cliparts. In the heart of this representation is a set of sparsely and unevenly located 2D control points. On one hand these points constitute a collection of parametric/vectorized geometric primitives (e.g. curves and closed shapes) describing the shape characteristics of the target image. On the other hand local texture codes in terms of implicit neural network parameters are spatially distributed into each control point yielding local coordinate-to-RGB mappings within the anchored region of each control point. In the meantime a zero-shot learning algorithm is developed to decompose an arbitrary raster image into the above representation for the sake of high-fidelity image vectorization with convenient editing ability. Extensive experiments on a series of image vectorization and editing tasks well demonstrate the high accuracy offered by our proposed method with a significantly higher image compression ratio over prior art.
-
Diffusion probabilistic models (DPMs) are a key component in modern generative models. DPM-solvers have achieved reduced latency and enhanced quality significantly but have posed challenges to find the exact inverse (i.e. finding the initial noise from the given image). Here we investigate the exact inversions for DPM-solvers and propose algorithms to perform them when samples are generated by the first-order as well as higher-order DPM-solvers. For each explicit denoising step in DPM-solvers we formulated the inversions using implicit methods such as gradient descent or forward step method to ensure the robustness to large classifier-free guidance unlike the prior approach using fixed-point iteration. Experimental results demonstrated that our proposed exact inversion methods significantly reduced the error of both image and noise reconstructions greatly enhanced the ability to distinguish invisible watermarks and well prevented unintended background changes consistently during image editing.
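To make the idea of inverting an explicit step by gradient descent concrete, here is a scalar toy. It is a sketch under stated assumptions, not the paper's algorithm: the coefficients `A` and `B` are hypothetical, and `eps` uses `tanh` as a stand-in for a learned noise predictor.

```python
import math

A, B = 0.9, 0.5  # hypothetical step coefficients

def eps(x):
    """Toy stand-in for a learned noise predictor."""
    return math.tanh(x)

def denoise_step(x_t):
    """Explicit step: x_prev = A * x_t + B * eps(x_t)."""
    return A * x_t + B * eps(x_t)

def invert_step(x_prev, lr=0.5, iters=200):
    """Recover x_t by gradient descent on 0.5 * (denoise_step(x) - x_prev)**2."""
    x = x_prev  # initial guess: start from the known output
    for _ in range(iters):
        residual = denoise_step(x) - x_prev
        slope = A + B * (1.0 - math.tanh(x) ** 2)  # derivative of denoise_step
        x -= lr * residual * slope
    return x

x_true = 1.2
x_rec = invert_step(denoise_step(x_true))
```

The step is explicit in the forward direction, but recovering its input means solving an implicit equation, which is why an iterative solver is needed. In this toy the step is monotone, so the descent converges to the original input.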
-
Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives its impressive zero-shot transfer performance and high versatility is a very large Transformer model trained on the extensive, high-quality SA-1B dataset. While beneficial, the huge computation cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and a mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably, with a significant gain (e.g. 4 AP on COCO/LVIS) over other fast SAM models. Our EfficientSAM code and models are available at https://github.com/yformer/EfficientSAM.
-
We present ChatScene a Large Language Model (LLM)-based agent that leverages the capabilities of LLMs to generate safety-critical scenarios for autonomous vehicles. Given unstructured language instructions the agent first generates textually described traffic scenarios using LLMs. These scenario descriptions are subsequently broken down into several sub-descriptions for specified details such as behaviors and locations of vehicles. The agent then distinctively transforms the textually described sub-scenarios into domain-specific languages which then generate actual code for prediction and control in simulators facilitating the creation of diverse and complex scenarios within the CARLA simulation environment. A key part of our agent is a comprehensive knowledge retrieval component which efficiently translates specific textual descriptions into corresponding domain-specific code snippets by training a knowledge database containing the scenario description and code pairs. Extensive experimental results underscore the efficacy of ChatScene in improving the safety of autonomous vehicles. For instance the scenarios generated by ChatScene show a 15% increase in collision rates compared to state-of-the-art baselines when tested against different reinforcement learning-based ego vehicles. Furthermore we show that by using our generated safety-critical scenarios to fine-tune different RL-based autonomous driving models they can achieve a 9% reduction in collision rates surpassing current SOTA methods. ChatScene effectively bridges the gap between textual descriptions of traffic scenarios and practical CARLA simulations providing a unified way to conveniently generate safety-critical scenarios for safety testing and improvement for AVs.
-
Text-driven video editing poses significant challenges in exhibiting flicker-free visual continuity while preserving the inherent motion patterns of original videos. Existing methods operate under a paradigm where motion and appearance are intricately intertwined. This coupling leads to the network either over-fitting appearance content -- failing to capture motion patterns -- or focusing on motion patterns at the expense of content generalization to diverse textual scenarios. Inspired by the pivotal role of wavelet transform in dissecting video sequences we propose CAusal Motion Enhancement tailored for Lifting text-driven video editing (CAMEL) a novel technique with two core designs. First we introduce motion prompts designed to summarize motion concepts from video templates through direct optimization. The optimized prompts are purposefully integrated into latent representations of diffusion models to enhance the motion fidelity of generated results. Second to enhance motion coherence and extend the generalization of appearance content to creative textual prompts we propose the causal motion-enhanced attention mechanism. This mechanism is implemented in tandem with a novel causal motion filter synergistically enhancing the motion coherence of disentangled high-frequency components and concurrently preserving the generalization of appearance content across various textual scenarios. Extensive experimental results show the superior performance of CAMEL.
-
Teeth localization segmentation and labeling in 2D images have great potential in modern dentistry to enhance dental diagnostics treatment planning and population-based studies on oral health. However general instance segmentation frameworks are incompetent due to 1) the subtle differences between some teeth' shapes (e.g. maxillary first premolar and second premolar) 2) the teeth's position and shape variation across subjects and 3) the presence of abnormalities in the dentition (e.g. caries and edentulism). To address these problems we propose a ViT-based framework named TeethSEG which consists of stacked Multi-Scale Aggregation (MSA) blocks and an Anthropic Prior Knowledge (APK) layer. Specifically to compose the two modules we design 1) a unique permutation-based upscaler to ensure high efficiency while establishing clear segmentation boundaries with 2) multi-head self/cross-gating layers to emphasize particular semantics meanwhile maintaining the divergence between token embeddings. Besides we collect 3) the first open-sourced intraoral image dataset IO150K which comprises over 150k intraoral photos and all photos are annotated by orthodontists using a human-machine hybrid algorithm. Experiments on IO150K demonstrate that our TeethSEG outperforms the state-of-the-art segmentation models on dental image segmentation.
-
The Segment Anything Model (SAM) marks a notable milestone in segmentation models, highlighted by its robust zero-shot capabilities and ability to handle diverse prompts. SAM follows a pipeline that separates interactive segmentation into image preprocessing through a large encoder and interactive inference via a lightweight decoder, ensuring efficient real-time performance. However, SAM faces stability issues on challenging samples under this pipeline. These issues arise from two main factors. Firstly, the image preprocessing prevents SAM from dynamically using image-level zoom-in strategies to refocus on the target object during interaction. Secondly, the lightweight decoder struggles to sufficiently integrate interactive information with image embeddings. To address these two limitations, we propose FocSAM, with a pipeline redesigned in two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. Dwin-MSA localizes attention computations around the target object, enhancing object-related embeddings with minimal computational overhead. Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from the few initial clicks that have significant impact on the overall segmentation results. Experimentally, FocSAM augments SAM's interactive segmentation performance to match the existing state-of-the-art method in segmentation quality while requiring only about 5.6% of that method's inference time on CPUs. Code is available at https://github.com/YouHuang67/focsam.
-
We explore visual reinforcement learning (RL) using two complementary visual modalities: frame-based RGB camera and event-based Dynamic Vision Sensor (DVS). Existing multi-modality visual RL methods often encounter challenges in effectively extracting task-relevant information from multiple modalities while suppressing the increased noise only using indirect reward signals instead of pixel-level supervision. To tackle this we propose a Decomposed Multi-Modality Representation (DMR) framework for visual RL. It explicitly decomposes the inputs into three distinct components: combined task-relevant features (co-features) RGB-specific noise and DVS-specific noise. The co-features represent the full information from both modalities that is relevant to the RL task; the two noise components each constrained by a data reconstruction loss to avoid information leak are contrasted with the co-features to maximize their difference. Extensive experiments demonstrate that by explicitly separating the different types of information our approach achieves substantially improved policy performance compared to state-of-the-art approaches.
-
Recently, a number of image-mixing-based augmentation techniques have been introduced to improve the generalization of deep neural networks. In these techniques, two or more randomly selected natural images are mixed together to generate an augmented image. Such methods may not only omit important portions of the input images but also introduce label ambiguities by mixing images across labels, resulting in misleading supervisory signals. To address these limitations, we propose DIFFUSEMIX, a novel data augmentation technique that leverages a diffusion model to reshape training images, supervised by our bespoke conditional prompts. First, a concatenation of a partial natural image and its generated counterpart is obtained, which helps avoid the generation of unrealistic images or label ambiguities. Then, to enhance resilience against adversarial attacks and improve safety measures, a randomly selected structural pattern from a set of fractal images is blended into the concatenated image to form the final augmented image for training. Our empirical results on seven different datasets reveal that DIFFUSEMIX achieves superior performance compared to existing state-of-the-art methods on tasks including general classification, fine-grained classification, fine-tuning, data scarcity, and adversarial robustness.
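The two augmentation stages described above, concatenation and fractal blending, can be sketched on tiny grayscale images stored as nested lists. This is an illustrative sketch only: the diffusion-generated image is replaced by a constant placeholder, and the left/right split, the blending weight `lam=0.2`, and the function names are assumptions rather than details from the abstract.

```python
def concat_halves(natural, generated):
    """Left half from the natural image, right half from its generated counterpart."""
    w = len(natural[0])
    return [n_row[: w // 2] + g_row[w // 2:]
            for n_row, g_row in zip(natural, generated)]

def blend(image, fractal, lam):
    """Pixel-wise blend: (1 - lam) * image + lam * fractal."""
    return [[(1 - lam) * p + lam * f for p, f in zip(i_row, f_row)]
            for i_row, f_row in zip(image, fractal)]

# 2x4 placeholder images; a real pipeline would use a diffusion model's output
# for `generated` and an actual fractal image for `fractal`.
natural = [[0.0] * 4 for _ in range(2)]
generated = [[1.0] * 4 for _ in range(2)]
fractal = [[0.5] * 4 for _ in range(2)]

hybrid = concat_halves(natural, generated)
augmented = blend(hybrid, fractal, lam=0.2)
```

Because both halves come from the same source image and its generated counterpart, the label of the original image still applies to the augmented result.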
-
Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However in the vision domain existing RL-based reward finetuning methods are limited by their instability in large-scale training rendering them incapable of generalizing to complex unseen prompts. In this paper we propose Proximal Reward Difference Prediction (PRDP) enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset PRDP achieves superior generation quality on a diverse set of complex unseen prompts whereas RL-based methods completely fail.
-
Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data, presenting practical benefits in contexts constrained by data privacy concerns. Current DFML methods primarily focus on data recovery from these pre-trained models. However, they suffer from slow recovery speed and overlook gaps inherent in heterogeneous pre-trained models. In response to these challenges, we introduce the Faster and Better Data-Free Meta-Learning (FREE) framework, which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks. Specifically, within the module Faster Inversion via Meta-Generator, each pre-trained model is perceived as a distinct task. The meta-generator can rapidly adapt to a specific task in just five steps, significantly accelerating data recovery. Furthermore, we propose Better Generalization via Meta-Learner and introduce an implicit gradient alignment algorithm to optimize the meta-learner. This works because aligned gradient directions alleviate potential conflicts among tasks from heterogeneous pre-trained models. Empirical experiments on multiple benchmarks affirm the superiority of our approach, marking a notable speed-up (20x) and performance enhancement (1.42% to 4.78%) in comparison to the state-of-the-art.
-
We present Bayesian Diffusion Models (BDM) a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We demonstrate the application of BDM on the 3D shape reconstruction task. Compared to standard deep learning data-driven approaches relying on supervised data our BDM can bring in rich prior information trained in an unsupervised manner to improve the bottom-up 3D reconstruction. As opposed to the traditional Bayesian frameworks where explicitly learned prior and data-driven distributions are required for gradient computation and combination BDM performs a seamless fusion of the two via coupled diffusion processes with learned gradient computation networks. The specialty of our Bayesian Diffusion Models (BDM) lies in its capability to engage the active and effective information exchange and fusion of the top-down and bottom-up processes where each itself is a diffusion process. We demonstrate state-of-the-art results on both synthetic and real-world benchmarks for 3D shape reconstruction. Project link: https://mlpc-ucsd.github.io/BDM
-
General image fusion aims at integrating important information from multi-source images. However due to the significant cross-task gap the respective fusion mechanism varies considerably in practice resulting in limited performance across subtasks. To handle this problem we propose a novel task-customized mixture of adapters (TC-MoA) for general image fusion adaptively prompting various fusion tasks in a unified model. We borrow the insight from the mixture of experts (MoE) taking the experts as efficient tuning adapters to prompt a pre-trained foundation model. These adapters are shared across different tasks and constrained by mutual information regularization ensuring compatibility with different tasks while complementarity for multi-source images. The task-specific routing networks customize these adapters to extract task-specific information from different sources with dynamic dominant intensity performing adaptive visual feature prompt fusion. Notably our TC-MoA controls the dominant intensity bias for different fusion tasks successfully unifying multiple fusion tasks in a single model. Extensive experiments show that TC-MoA outperforms the competing approaches in learning commonalities while retaining compatibility for general image fusion (multi-modal multi-exposure and multi-focus) and also demonstrating striking controllability on more generalization experiments. The code is available at https://github.com/YangSun22/TC-MoA.
-
Camera-based Semantic Scene Completion (SSC) is to infer the full geometry of objects and scenes from only 2D images. The task is particularly challenging for those invisible areas due to the inherent occlusions and lighting ambiguity. Existing works ignore the information missing or ambiguous in those shaded and occluded areas resulting in distorted geometric prediction. To address this issue we propose a novel method Bi-SSC bidirectional geometric semantic fusion for camera-based 3D semantic scene completion. The key insight is to use the neighboring structure of objects in the image and the spatial differences from different perspectives to compensate for the lack of information in occluded areas. Specifically we introduce a spatial sensory fusion module with multiple association attention to improve semantic correlation in geometric distributions. This module works within single view and across stereo views to achieve global spatial consistency. Experimental results demonstrate that Bi-SSC outperforms state-of-the-art camera-based methods on SemanticKITTI particularly excelling in those invisible and shaded areas.
-
Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation. In this paper we present a general and effective prediction mimicking distillation scheme called CrossKD which delivers the intermediate features of the student's detection head to the teacher's detection head. The resulting cross-head predictions are then forced to mimic the teacher's predictions. This manner relieves the student's head from receiving contradictory supervision signals from the annotations and the teacher's predictions greatly improving the student's detection performance. Moreover as mimicking the teacher's predictions is the target of KD CrossKD offers more task-oriented information in contrast with feature imitation. On MS COCO with only prediction mimicking losses applied our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7 outperforming all existing KD methods. In addition our method also works well when distilling detectors with heterogeneous backbones.
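The cross-head idea described above can be illustrated with a toy sketch: the student's intermediate head features are passed through the remaining layers of the teacher's head, and the result is compared with the teacher's own prediction. Everything here is an assumption made for illustration, not the CrossKD code: heads are lists of plain functions rather than convolution stacks, the `split` index is hypothetical, and mean squared error stands in for whatever mimicking loss the paper uses.

```python
# Illustrative sketch only. Each detection head is a list of layers, and each
# layer is a plain function on a list of floats.

def run_layers(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x

def cross_head_prediction(student_head, teacher_head, features, split):
    """Run the first `split` student layers, then the remaining teacher layers."""
    intermediate = run_layers(student_head[:split], features)
    return run_layers(teacher_head[split:], intermediate)

def mimic_loss(pred, target):
    """Mean squared error, used here as a stand-in for the distillation loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def scale(k):
    """Toy layer: multiply every feature by k."""
    return lambda x: [k * v for v in x]

student_head = [scale(1.0), scale(2.0), scale(0.5)]
teacher_head = [scale(1.5), scale(2.0), scale(1.0)]

student_feat = [1.0, 2.0]  # features from the student's backbone (toy values)
teacher_feat = [1.0, 2.0]  # features from the teacher's backbone (toy values)

teacher_pred = run_layers(teacher_head, teacher_feat)
cross_pred = cross_head_prediction(student_head, teacher_head, student_feat, split=1)
kd_loss = mimic_loss(cross_pred, teacher_pred)
```

In this arrangement the student's own predictions are supervised only by the annotations, while the mimicking loss acts on the cross-head predictions, which is how the abstract describes avoiding contradictory supervision on the student head.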
-
One-shot medical image segmentation (MIS) aims to cope with annotations that are expensive, time-consuming, and subject to inherent human bias. One prevalent method to address one-shot MIS is joint registration and segmentation (JRS) with a shared encoder, which mainly explores the voxel-wise correspondence between the labeled data and unlabeled data for better segmentation. However, this method omits underlying connections between task-specific decoders for segmentation and registration, leading to unstable training. In this paper, we propose a novel Bi-level Learning of Task-Specific Decoders for one-shot MIS, employing a pretrained fixed shared encoder that is proved to adapt more quickly to brand-new datasets than the existing JRS paradigm without a fixed shared encoder. To be more specific, we introduce a bi-level optimization training strategy that considers registration as the major objective and segmentation as a learnable constraint by leveraging inter-task coupling dependencies. Furthermore, we design an appearance conformity constraint strategy that learns the backward transformations generating the fake labeled data used to perform data augmentation instead of the labeled image, avoiding the performance degradation caused by inconsistent styles between unlabeled data and labeled data in previous methods. Extensive experiments on the brain MRI task across the ABIDE, ADNI, and PPMI datasets demonstrate that the proposed Bi-JROS outperforms state-of-the-art one-shot MIS methods for both segmentation and registration tasks. The code will be available at https://github.com/Coradlut/Bi-JROS.
-
As large-scale foundation models become publicly available for different domains efficiently adapting them to individual downstream applications and additional data modalities has turned into a central challenge. For example foundation models for geospatial and satellite remote sensing applications are commonly trained on large optical RGB or multi-spectral datasets although data from a wide variety of heterogeneous sensors are available in the remote sensing domain. This leads to significant discrepancies between pre-training and downstream target data distributions for many important applications. Fine-tuning large foundation models to bridge that gap incurs high computational cost and can be infeasible when target datasets are small. In this paper we address the question of how large pre-trained foundational transformer models can be efficiently adapted to downstream remote sensing tasks involving different data modalities or limited dataset size. We present a self-supervised adaptation method that boosts downstream linear evaluation accuracy of different foundation models by 4-6% (absolute) across 8 remote sensing datasets while outperforming full fine-tuning when training only 1-2% of the model parameters. Our method significantly improves label efficiency and increases few-shot accuracy by 6-10% on different datasets.
-
Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay
Deep neural networks have demonstrated susceptibility to adversarial attacks. Adversarial defense techniques often focus on a one-shot setting to maintain robustness against attack. However, new attacks can emerge in sequences in real-world deployment scenarios. As a result, it is crucial for a defense model to constantly adapt to new attacks, but the adaptation process can lead to catastrophic forgetting of attacks it previously defended against. In this paper, we discuss for the first time the concept of continual adversarial defense under a sequence of attacks and propose a lifelong defense baseline called Anisotropic & Isotropic Replay (AIR), which offers three advantages: (1) Isotropic replay ensures model consistency in the neighborhood distribution of new data, indirectly aligning the output preference between old and new tasks. (2) Anisotropic replay enables the model to learn a compromise data manifold with fresh mixed semantics for further replay constraints and potential future attacks. (3) A straightforward regularizer mitigates the 'plasticity-stability' trade-off by aligning model output between new and old tasks. Experimental results demonstrate that AIR can approximate or even exceed the empirical performance upper bounds achieved by Joint Training.
-
We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis: it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet.
-
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into two main types: training-free and text-only-training methods. While both types integrate pre-trained vision-language models such as CLIP for image-text similarity evaluation and a pre-trained language model (LM) for caption generation their distinction lies in the utilization of textual corpus for LM training. Despite achieving promising performance on certain metrics existing methods commonly suffer from drawbacks. Training-free methods often generate hallucinations whereas text-only-training methods may lack generalization capability. To address these challenges we propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap). This framework equipped with a textual memory incorporates a retrieve-then-filter module to extract key concepts highly relevant to the image. By leveraging our proposed memory-augmented visual-related fusion score within a keywords-to-sentence LM MeaCap generates concept-centered captions that exhibit high consistency with the image with reduced hallucinations and enriched world knowledge. MeaCap achieves state-of-the-art performance across various zero-shot IC settings. Our code is publicly available at https://github.com/joeyz0z/MeaCap.
-
An increasingly common approach for creating photo-realistic digital avatars is through the use of volumetric neural fields. The original neural radiance field (NeRF) allowed for impressive novel view synthesis of static heads when trained on a set of multi-view images and follow up methods showed that these neural representations can be extended to dynamic avatars. Recently new variants also surpassed the usual drawback of baked-in illumination in neural representations showing that static neural avatars can be relit in any environment. In this work we simultaneously tackle both the motion and illumination problem proposing a new method for relightable and animatable neural heads. Our method builds on a proven dynamic avatar approach based on a mixture of volumetric primitives combined with a recently-proposed lightweight hardware setup for relightable neural fields and includes a novel architecture that allows relighting dynamic neural avatars performing unseen expressions in any environment even with nearfield illumination and viewpoints.
-
360 depth estimation has recently received great attention for 3D reconstruction owing to its omnidirectional field of view (FoV). Recent approaches are predominantly focused on cross-projection fusion with geometry-based re-projection: they fuse 360 images with equirectangular projection (ERP) and another projection type, e.g., cubemap projection, to estimate depth in the ERP format. However, these methods suffer from 1) limited local receptive fields, making it hardly possible to capture large FoV scenes, and 2) prohibitive computational cost caused by the complex cross-projection fusion module design. In this paper, we propose Elite360D, a novel framework that takes as input the ERP image and an icosahedron projection (ICOSAP) point set, which is undistorted and spatially continuous. Elite360D is superior in its capacity to learn a representation from a local-with-global perspective. With a flexible ERP image encoder, it includes an ICOSAP point encoder and a Bi-projection Bi-attention Fusion (B2F) module (1M parameters in total). Specifically, the ERP image encoder can take various perspective image-trained backbones (e.g., ResNet, Transformer) to extract local features. The point encoder extracts the global features from the ICOSAP. Then, the B2F module captures the semantic- and distance-aware dependencies between each pixel of the ERP feature and the entire ICOSAP feature set. Without specific backbone design or an obvious increase in computational cost, Elite360D outperforms the prior arts on several benchmark datasets.
-
Deep-learning-based gaze estimation approaches often suffer from notable performance degradation in unseen target domains. One of the primary reasons is that the Fully Connected (FC) layer is highly prone to overfitting when mapping the high-dimensional image feature to 3D gaze. In this paper, we propose the Analytical Gaze Generalization framework (AGG) to improve the generalization ability of gaze estimation models without touching target domain data. The AGG consists of two modules: the Geodesic Projection Module (GPM) and the Sphere-Oriented Training (SOT). GPM is a generalizable replacement for the FC layer, which projects high-dimensional image features to 3D space analytically to extract the principal components of gaze. Then, we propose Sphere-Oriented Training (SOT) to incorporate the GPM into the training process and further improve cross-domain performance. Experimental results demonstrate that the AGG effectively alleviates the overfitting problem and consistently improves the cross-domain gaze estimation accuracy in 12 cross-domain settings without requiring any target domain data. The insight from the Analytical Gaze Generalization framework has the potential to benefit other regression tasks with physical meanings.
-
Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions yet relying on cost-intensive mask annotations. Weakly supervised RIS thus learns from image-text pairs to pixel-level semantics which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to the inevitable noise issues and challenges in excessive focus on object parts. In this paper we present an innovative framework Point PrompTing (PPT) incorporated with the proposed multi-source curriculum learning strategy to address these challenges. Specifically the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noisy and excessive focus issues inherently and effectively. In addition we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34% 14.14% and 6.97% across RefCOCO RefCOCO+ and G-Ref respectively.
-
In this paper, we make the first attempt at achieving cross-modal (i.e., image-to-events) adaptation for event-based object recognition without accessing any labeled source image data, owing to privacy and commercial issues. Tackling this novel problem is non-trivial due to the novelty of event cameras and the distinct modality gap between images and events. In particular, as only the source model is available, a hurdle is how to extract the knowledge from the source model using only the unlabeled target event data while achieving knowledge transfer. To this end, we propose a novel framework dubbed EventDance for this unsupervised source-free cross-modal adaptation problem. Importantly, inspired by event-to-video reconstruction methods, we propose a reconstruction-based modality bridging (RMB) module, which reconstructs intensity frames from events in a self-supervised manner. This makes it possible to build up the surrogate images to extract the knowledge (i.e., labels) from the source model. We then propose a multi-representation knowledge adaptation (MKA) module that transfers the knowledge to target models learning events with multiple representation types, to fully explore the spatiotemporal information of events. The two modules connecting the source and target models are mutually updated so as to achieve the best performance. Experiments on three benchmark datasets with two adaptation settings show that EventDance is on par with prior methods utilizing the source data.
-
In the realm of medical 3D data such as CT and MRI images prevalent anisotropic resolution is characterized by high intra-slice but diminished inter-slice resolution. The lowered resolution between adjacent slices poses challenges hindering optimal viewing experiences and impeding the development of robust downstream analysis algorithms. Various volumetric super-resolution algorithms aim to surmount these challenges enhancing inter-slice resolution and overall 3D medical imaging quality. However existing approaches confront inherent challenges: 1) often tailored to specific upsampling factors lacking flexibility for diverse clinical scenarios; 2) newly generated slices frequently suffer from over-smoothing degrading fine details and leading to inter-slice inconsistency. In response this study presents CycleINR a novel enhanced Implicit Neural Representation model for 3D medical data volumetric super-resolution. Leveraging the continuity of the learned implicit function the CycleINR model can achieve results with arbitrary up-sampling rates eliminating the need for separate training. Additionally we enhance the grid sampling in CycleINR with a local attention mechanism and mitigate over-smoothing by integrating cycle-consistent loss. We introduce a new metric Slice-wise Noise Level Inconsistency (SNLI) to quantitatively assess inter-slice noise level inconsistency. The effectiveness of our approach is demonstrated through image quality evaluations on an in-house dataset and a downstream task analysis on the Medical Segmentation Decathlon liver tumor dataset.
-
Pre-trained models with large-scale training data such as CLIP and Stable Diffusion have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE) and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM with its compact size (<1M parameters) effectively enhances restoration performance of various models across different tasks including low-light enhancement deraining deblurring and denoising.
-
Face Video Retouching is a complex task that often requires labor-intensive manual editing. Conventional image retouching methods perform less satisfactorily in terms of generalization performance and stability when applied to videos without exploiting the correlation among frames. To address this issue we propose a Video Retouching transformEr to remove facial imperfections in videos which is referred to as VRetouchEr. Specifically we estimate the apparent motion of imperfections between two consecutive frames and the resulting displacement vectors are used to refine the imperfection map which is synthesized from the current frame together with the corresponding encoder features. The flow-based imperfection refinement is critical for precise and stable retouching across frames. To leverage the temporal contextual information we inject the refined imperfection map into each transformer block for multi-frame masked attention computation such that we can capture the interdependence between the current frame and multiple reference frames. As a result the imperfection regions can be replaced with normal skin with high fidelity while at the same time keeping the other regions unchanged. Extensive experiments are performed to verify the superiority of VRetouchEr over state-of-the-art image retouching methods in terms of fidelity and stability.
-
Deep neural networks (DNNs) are vulnerable to highly transferable adversarial attacks. In particular, many studies have shown that sparse attacks pose a significant threat to DNNs on account of their exceptional imperceptibility. Current sparse attack methods mostly limit only the magnitude and number of perturbations while generally overlooking the location of the perturbations, resulting in decreased attack transferability. A subset of studies indicates that perturbations located in significant regions with rich classification-relevant features are more effective. Leveraging this insight, we introduce the structural sparsity constraint in the framework of generative models to limit the perturbation positions. To ensure that the perturbations are generated towards classification-relevant regions, we propose an exact group sparsity training method to learn pixel-level and group-level sparsity. To improve the effectiveness of sparse training, we further put forward a masked quantization network and a multi-stage optimization algorithm in the training process. Utilizing CNNs as surrogate models, extensive experiments demonstrate that our method has higher transferability in image classification attacks compared to state-of-the-art methods at approximately the same sparsity levels. In cross-model ViT, object detection, and semantic segmentation attack tasks, we also achieve a better attack success rate. Code is available at https://github.com/MisterRpeng/EGS-TSSA.
-
The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on limited tasks and often omits key multi-view and temporal information, which is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method to generate instruction-response pairs automatically, which is inspired by the driving logical progression of humans. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features, language-aligned for large language models. BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., a 9% improvement on various tasks. We release our NuInstruct at https://github.com/xmed-lab/NuInstruct.
-
Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods, however, generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts. Additionally, they offer neither enough diversity of output images nor image consistency at different scales. The most relevant work applied Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results. Since this model operates in the image space, the larger the resolution of the produced image, the more memory and inference time are required, and it also does not maintain scale-specific consistency. We propose a novel pipeline that can super-resolve an input image or generate a novel image from random noise at arbitrary scales. The method consists of a pretrained auto-encoder, a latent diffusion model, and an implicit neural decoder, together with their learning strategies. The proposed method adopts diffusion processes in a latent space, and is thus efficient, yet aligned with the output image space decoded by MLPs at arbitrary scales. More specifically, our arbitrary-scale decoder is composed of the symmetric decoder without up-scaling from the pretrained auto-encoder and a Local Implicit Image Function (LIIF) in series. The latent diffusion process is learnt by the denoising and the alignment losses jointly. Errors in output images are backpropagated via the fixed decoder, improving the quality of output images. In extensive experiments using multiple public benchmarks on the two tasks, i.e., image super-resolution and novel image generation at arbitrary scales, the proposed method outperforms relevant methods in metrics of image quality, diversity, and scale consistency. It is significantly better than the relevant prior art in inference speed and memory usage.
-
Implicit Neural Representations have gained prominence as a powerful framework for capturing complex data modalities encompassing a wide range from 3D shapes to images and audio. Within the realm of 3D shape representation Neural Signed Distance Functions (SDF) have demonstrated remarkable potential in faithfully encoding intricate shape geometry. However learning SDFs from 3D point clouds in the absence of ground truth supervision remains a very challenging task. In this paper we propose a method to infer occupancy fields instead of SDFs as they are easier to learn from sparse inputs. We leverage a margin-based uncertainty measure to differentiably sample from the decision boundary of the occupancy function and supervise the sampled boundary points using the input point cloud. We further stabilise the optimization process at the early stages of the training by biasing the occupancy function towards minimal entropy fields while maximizing its entropy at the input point cloud. Through extensive experiments and evaluations we illustrate the efficacy of our proposed method highlighting its capacity to improve implicit shape inference with respect to baselines and the state-of-the-art using synthetic and real data.
-
This paper introduces a novel approach to learning instance segmentation using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points of each object. These points are readily available in the modern bounding box annotation process while offering strong clues for precise segmentation, and thus allow performance to be improved at the same annotation cost as box-supervised methods. Our work considers extreme points as a part of the true instance mask and propagates them to identify potential foreground and background points, which are all together used for training a pseudo label generator. Then, pseudo labels given by the generator are in turn used for supervised learning of our final model. On three public benchmarks, our method significantly outperforms existing box-supervised methods, further narrowing the gap with its fully supervised counterpart. In particular, our model generates high-quality masks when a target object is separated into multiple parts, where previous box-supervised methods often fail.
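To make the notion of extreme points concrete, the short sketch below extracts them from a binary mask and shows that they determine the bounding box, which is why they come at the same annotation cost. It only illustrates the definition; the function names are hypothetical, and the point propagation and pseudo label generator from the abstract are not modelled.

```python
# Illustrative sketch only: extreme points of a binary instance mask.

def extreme_points(mask):
    """Return the (top, left, bottom, right) extreme points as (row, col) tuples."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask has no foreground pixels")
    top = min(pixels, key=lambda p: p[0])
    bottom = max(pixels, key=lambda p: p[0])
    left = min(pixels, key=lambda p: p[1])
    right = max(pixels, key=lambda p: p[1])
    return top, left, bottom, right

def box_from_extreme_points(points):
    """Tight bounding box (row_min, col_min, row_max, col_max) implied by the points."""
    rows = [p[0] for p in points]
    cols = [p[1] for p in points]
    return min(rows), min(cols), max(rows), max(cols)

mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
points = extreme_points(mask)
box = box_from_extreme_points(points)
```

Unlike the box corners, each extreme point lies on the object itself, so all four can be treated as known foreground pixels.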
-
We propose a novel method for 3D point cloud action recognition. Understanding human actions in RGB videos has been widely studied in recent years; however, its 3D point cloud counterpart remains under-explored despite the clear value that 3D information may bring. This is mostly due to the inherent limitations of the point cloud data modality (lack of structure, permutation invariance, and a varying number of points), which make it difficult to learn a spatio-temporal representation. To address this limitation, we propose the 3DinAction pipeline, which first estimates patches moving in time (t-patches) as a key building block, alongside a hierarchical architecture that learns an informative spatio-temporal representation. We show that our method achieves improved performance on existing datasets, including DFAUST and IKEA ASM. Code is publicly available at https://github.com/sitzikbs/3dincaction
-
Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce Block Caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation, and qualitative analysis that Block Caching allows generating images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
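One plausible reading of the caching idea above is sketched below: a block is recomputed only when its output is expected to have drifted enough since the last recomputation, and otherwise the cached output is reused. This is a hedged toy example, not the paper's algorithm: the accumulated-change rule, the `threshold` parameter, the function names, and the per-step change values are all assumptions chosen for illustration.

```python
# Illustrative sketch only: a caching schedule for one block of a denoising network.

def caching_schedule(changes, threshold):
    """Decide per step whether the block is recomputed (True) or served from cache (False).

    `changes[t]` is a measured relative change of the block's output between
    step t-1 and step t. The block is recomputed at step 0 and whenever the
    change accumulated since the last recomputation exceeds `threshold`.
    """
    schedule, accumulated = [], 0.0
    for t, change in enumerate(changes):
        accumulated += change
        if t == 0 or accumulated > threshold:
            schedule.append(True)
            accumulated = 0.0
        else:
            schedule.append(False)
    return schedule

def run_with_cache(block, inputs, schedule):
    """Apply `block` only on recompute steps; reuse the cached output otherwise."""
    cached, outputs, calls = None, [], 0
    for x, recompute in zip(inputs, schedule):
        if recompute or cached is None:
            cached = block(x)
            calls += 1
        outputs.append(cached)
    return outputs, calls

changes = [0.0, 0.04, 0.04, 0.3, 0.02, 0.02]  # hypothetical per-step changes
schedule = caching_schedule(changes, threshold=0.1)
outputs, calls = run_with_cache(lambda x: x * 10, [1, 2, 3, 4, 5, 6], schedule)
```

In this toy run the block is evaluated twice instead of six times. The abstract's observation that step-to-step changes are often very small is what would keep the reused outputs close to the recomputed ones.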
-
Medical generative models, acknowledged for their high-quality sample generation ability, have accelerated the fast growth of medical applications. However, recent works concentrate on separate medical generation models for distinct medical tasks and are restricted to inadequate medical multi-modal knowledge, constraining medical comprehensive diagnosis. In this paper, we propose MedM2G, a Medical Multi-Modal Generative framework, with the key innovation to align, extract, and generate medical multi-modal within a unified model. Extending beyond single or two medical modalities, we efficiently align medical multi-modal through the central alignment approach in the unified space. Significantly, our framework extracts valuable clinical knowledge by preserving the medical visual invariant of each imaging modal, thereby enhancing specific medical information for multi-modal generation. By conditioning the adaptive cross-guided parameters into the multi-flow diffusion framework, our model promotes flexible interactions among medical multi-modal for generation. MedM2G is the first medical generative model that unifies medical generation tasks of text-to-image, image-to-text, and unified generation of medical modalities (CT, MRI, X-ray). It performs 5 medical generation tasks across 10 datasets, consistently outperforming various state-of-the-art works.
-
Reconstructing dynamic objects from monocular videos is a severely underconstrained and challenging problem, and recent work has approached it in various directions. However, owing to the ill-posed nature of this problem, there has been no solution that can provide consistent, high-quality novel views from camera positions that are significantly different from the training views. In this work, we introduce Neural Parametric Gaussians (NPGs) to take on this challenge by imposing a two-stage approach: first, we fit a low-rank neural deformation model, which then is used as regularization for non-rigid reconstruction in the second stage. The first stage learns the object's deformations such that it preserves consistency in novel views. The second stage obtains high reconstruction quality by optimizing 3D Gaussians that are driven by the coarse model. To this end, we introduce a local 3D Gaussian representation, where temporally shared Gaussians are anchored in and deformed by local oriented volumes. The resulting combined model can be rendered as radiance fields, resulting in high-quality photo-realistic reconstructions of the non-rigidly deforming objects. We demonstrate that NPGs achieve superior results compared to previous works, especially in challenging scenarios with few multi-view cues.
-
Deep learning-based monocular depth estimation (MDE), extensively applied in autonomous driving, is known to be vulnerable to adversarial attacks. Previous physical attacks against MDE models rely on 2D adversarial patches, so they only affect a small, localized region in the MDE map and fail under various viewpoints. To address these limitations, we propose 3D Depth Fool (3D^2Fool), the first 3D texture-based adversarial attack against MDE models. 3D^2Fool is specifically optimized to generate 3D adversarial textures agnostic to model types of vehicles and to have improved robustness in bad weather conditions, such as rain and fog. Experimental results validate the superior performance of our 3D^2Fool across various scenarios, including vehicles, MDE models, weather conditions, and viewpoints. Real-world experiments with printed 3D textures on physical vehicle models further demonstrate that our 3D^2Fool can cause an MDE error of over 10 meters.
-
While fine-tuning is a de facto standard method for training deep neural networks, it still suffers from overfitting when using small target datasets. Previous methods improve fine-tuning performance by maintaining knowledge of the source datasets or introducing regularization terms such as contrastive loss. However, these methods require auxiliary source information (e.g., source labels or datasets) or heavy additional computations. In this paper, we propose a simple method called adaptive random feature regularization (AdaRand). AdaRand helps the feature extractors of training models to adaptively change the distribution of feature vectors for downstream classification tasks without auxiliary source information and with reasonable computation costs. To this end, AdaRand minimizes the gap between feature vectors and random reference vectors that are sampled from class-conditional Gaussian distributions. Furthermore, AdaRand dynamically updates the conditional distribution to follow the currently updated feature extractors and balance the distance between classes in feature spaces. Our experiments show that AdaRand outperforms the other fine-tuning regularization methods, which require auxiliary source information and heavy computation costs.
-
We present a novel semi-supervised framework for breast ultrasound (BUS) image segmentation, which is a very challenging task owing to (1) large scale and shape variations of breast lesions and (2) extremely ambiguous boundaries caused by massive speckle noise and artifacts in BUS images. While existing models have achieved certain progress in this task, we believe the main bottleneck for further improvement is that we still cannot deal with hard cases well. Our framework aims to break through this bottleneck and includes two innovative components: an adaptive patch augmentation scheme and a hard-patch contrastive learning module. We first identify hard patches by computing the average entropy of each patch, and then shield hard patches to prevent them from being cropped out while performing random patch cutmix. Such a scheme is able to prevent hard regions from being inadequately trained under strong augmentation. We further develop a new hard-patch contrastive learning algorithm to direct model attention to hard regions by applying extra contrast to pixels in hard patches, further improving segmentation performance on hard cases. We demonstrate the superiority of our framework to state-of-the-art approaches on two famous BUS datasets, achieving better performance under different labeling conditions. The code is available at https://github.com/jjjsyyy/PH-Net.
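The step "identify hard patches by computing the average entropy of each patch" can be sketched as follows. This is a minimal illustration under stated assumptions, not the PH-Net code: it assumes a binary segmentation probability map given as nested lists, non-overlapping square patches, and a top-k selection rule; the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def pixel_entropy(p: float) -> float:
    """Binary entropy (in nats) of a predicted foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def patch_entropies(prob_map: List[List[float]], patch: int) -> Dict[Tuple[int, int], float]:
    """Average pixel entropy for each non-overlapping patch x patch block."""
    height, width = len(prob_map), len(prob_map[0])
    scores = {}
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            values = [
                pixel_entropy(prob_map[r][c])
                for r in range(top, min(top + patch, height))
                for c in range(left, min(left + patch, width))
            ]
            scores[(top // patch, left // patch)] = sum(values) / len(values)
    return scores


def hardest_patches(prob_map: List[List[float]], patch: int, k: int) -> List[Tuple[int, int]]:
    """Return the (row, col) indices of the k patches with highest average entropy."""
    scores = patch_entropies(prob_map, patch)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Patches returned by `hardest_patches` are the ones an augmentation step could then exclude from being cut out; how the shielding interacts with cutmix is described only at a high level in the abstract and is not modeled here.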
-
Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability, and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across many tasks. After multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability, and fidelity.
-
Event cameras have recently been shown to be beneficial for practical vision tasks, such as action recognition, thanks to their high temporal resolution, power efficiency, and reduced privacy concerns. However, current research is hindered by 1) the difficulty in processing events because of their prolonged duration and dynamic actions with complex and ambiguous semantics, and 2) the redundant action depiction of the event frame representation with fixed stacks. We find that language naturally conveys abundant semantic information, rendering it stunningly superior in reducing semantic uncertainty. In light of this, we propose ExACT, a novel approach that, for the first time, tackles event-based action recognition from a cross-modal conceptualizing perspective. Our ExACT brings two technical contributions. Firstly, we propose an adaptive fine-grained event (AFE) representation to adaptively filter out the repeated events for the stationary objects while preserving dynamic ones. This subtly enhances the performance of ExACT without extra computational cost. Then, we propose a conceptual reasoning-based uncertainty estimation module, which simulates the recognition process to enrich the semantic representation. In particular, conceptual reasoning builds the temporal relation based on the action semantics, and uncertainty estimation tackles the semantic uncertainty of actions based on the distributional representation. Experiments show that our ExACT achieves superior recognition accuracy of 94.83% (+2.23%), 90.10% (+37.47%), and 67.24% on PAF, HARDVS, and our SeAct datasets, respectively.
-
Images captured under sub-optimal illumination conditions may contain both over- and under-exposures. We observe that over- and under-exposed regions display opposite color tone distribution shifts, which may not be easily normalized in joint modeling as they usually do not have "normal-exposed" regions/pixels as reference. In this paper, we propose a novel method to enhance images with both over- and under-exposures by learning to estimate and correct such color shifts. Specifically, we first derive the color feature maps of the brightened and darkened versions of the input image via a UNet-based network, followed by a pseudo-normal feature generator to produce pseudo-normal color feature maps. We then propose a novel COlor Shift Estimation (COSE) module to estimate the color shifts between the derived brightened (or darkened) color feature maps and the pseudo-normal color feature maps. The COSE module corrects the estimated color shifts of the over- and under-exposed regions separately. We further propose a novel COlor MOdulation (COMO) module to modulate the separately corrected colors in the over- and under-exposed regions to produce the enhanced image. Comprehensive experiments show that our method outperforms existing approaches.
-
Visual scenes are naturally organized in a hierarchy, where a coarse semantic is recursively comprised of several fine details. Exploring such a visual hierarchy is crucial to recognize the complex relations of visual elements, leading to a comprehensive scene understanding. In this paper, we propose a Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the structured understanding of pre-trained Deep Neural Networks (DNNs). Hi-Mapper investigates the hierarchical organization of the visual scene by 1) pre-defining a hierarchy tree through the encapsulation of probability densities; and 2) learning the hierarchical relations in hyperbolic space with a novel hierarchical contrastive loss. The pre-defined hierarchy tree recursively interacts with the visual features of the pre-trained DNNs through hierarchy decomposition and encoding procedures, thereby effectively identifying the visual hierarchy and enhancing the recognition of an entire scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances the representation capability of DNNs, leading to improved performance on various tasks, including image classification and dense prediction tasks.
-
Large-scale visual pretraining has significantly improved the performance of large vision models. However, we observe the low-FLOPs pitfall: existing low-FLOPs models cannot benefit from large-scale pretraining. In this paper, we introduce a novel design principle, termed ParameterNet, aimed at augmenting the number of parameters in large-scale visual pretraining models while minimizing the increase in FLOPs. We leverage dynamic convolutions to incorporate additional parameters into the networks with only a marginal rise in FLOPs. The ParameterNet approach allows low-FLOPs networks to take advantage of large-scale visual pretraining. Furthermore, we extend the ParameterNet concept to the language domain to enhance inference results while preserving inference speed. Experiments on the large-scale ImageNet-22K have shown the superiority of our ParameterNet scheme. For example, ParameterNet-600M can achieve higher accuracy than the widely used Swin Transformer (81.6% vs. 80.9%) and has much lower FLOPs (0.6G vs. 4.5G). The code will be released at https://parameternet.github.io/.
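A back-of-the-envelope count shows why dynamic convolution can add parameters with only a marginal rise in FLOPs: the expert kernels are mixed once per input, while the convolution itself runs at every spatial position. The sketch below is illustrative only. The function `conv_cost` is hypothetical, the router is assumed to be a single linear layer on globally pooled features, and pooling, bias, and softmax costs are ignored; none of these details come from the abstract.

```python
from typing import Tuple


def conv_cost(c_in: int, c_out: int, k: int, h: int, w: int, experts: int = 1) -> Tuple[int, int]:
    """Return (parameters, multiply-accumulates) for one k x k convolution.

    With experts > 1 the layer is treated as a dynamic convolution that
    mixes `experts` kernels with input-dependent coefficients before a
    single ordinary convolution is applied to the h x w output map.
    """
    kernel_weights = c_in * c_out * k * k
    params = experts * kernel_weights
    macs = h * w * kernel_weights  # the convolution itself
    if experts > 1:
        params += c_in * experts         # assumed linear router on pooled features
        macs += c_in * experts           # router forward pass
        macs += experts * kernel_weights  # mixing expert kernels, once per input
    return params, macs


static_params, static_macs = conv_cost(64, 64, 3, 56, 56, experts=1)
dynamic_params, dynamic_macs = conv_cost(64, 64, 3, 56, 56, experts=4)
param_ratio = dynamic_params / static_params  # roughly the number of experts
mac_ratio = dynamic_macs / static_macs        # stays close to 1
```

For this example layer, the parameter count grows by about the number of experts while the multiply-accumulate count grows by well under one percent, which matches the qualitative claim in the abstract.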
-
Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.
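"Affine-invariant" depth means the prediction is meaningful only up to an unknown scale and shift. A common way to compare such a prediction with metric ground truth is to fit the scale and shift by least squares first. The sketch below shows that fit in closed form; it illustrates the general idea only and is an assumption about how such predictions are typically aligned, not a description of Marigold's own evaluation code. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def align_scale_shift(pred: Sequence[float], target: Sequence[float]) -> Tuple[float, float]:
    """Least-squares scale s and shift t minimizing sum((s * pred + t - target)^2).

    Assumes `pred` is not constant (otherwise the scale is undefined).
    """
    n = len(pred)
    mean_pred = sum(pred) / n
    mean_target = sum(target) / n
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    cov = sum((p - mean_pred) * (t - mean_target) for p, t in zip(pred, target))
    scale = cov / var_pred
    shift = mean_target - scale * mean_pred
    return scale, shift


def apply_alignment(pred: Sequence[float], scale: float, shift: float):
    """Map affine-invariant predictions into the target's units."""
    return [scale * p + shift for p in pred]
```

After alignment, standard depth error metrics can be computed between `apply_alignment(pred, scale, shift)` and the ground-truth values.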
-
To better understand the behavior of image classifiers, it is useful to visualize the contribution of individual pixels to the model prediction. In this study, we propose a method, MoXI (Model eXplanation by Interactions), that efficiently and accurately identifies a group of pixels with high prediction confidence. The proposed method employs game-theoretic concepts, Shapley values and interactions, taking into account the effects of individual pixels and the cooperative influence of pixels on model confidence. Theoretical analysis and experiments demonstrate that our method better identifies the pixels that contribute highly to the model outputs than the widely used Grad-CAM, Attention rollout, and Shapley value. While prior studies have suffered from the exponential computational cost of computing Shapley values and interactions, we show that this can be reduced to quadratic cost for our task. The code is available at https://github.com/KosukeSumiyasu/MoXI.
-
Recently, 3D anomaly detection, a crucial problem involving fine-grained geometry discrimination, is getting more attention. However, the lack of abundant real 3D anomaly data limits the scalability of current models. To enable scalable anomaly data collection, we propose a 3D anomaly synthesis pipeline to adapt existing large-scale 3D models for 3D anomaly detection. Specifically, we construct a synthetic dataset, i.e., Anomaly-ShapeNet, based on ShapeNet. Anomaly-ShapeNet consists of 1600 point cloud samples under 40 categories, which provides a rich and varied collection of data, enabling efficient training and enhancing adaptability to industrial scenarios. Meanwhile, to enable scalable representation learning for 3D anomaly localization, we propose a self-supervised method, i.e., Iterative Mask Reconstruction Network (IMRNet). During training, we propose a geometry-aware sample module to preserve potentially anomalous local regions during point cloud down-sampling. Then, we randomly mask out point patches and send the visible patches to a transformer for reconstruction-based self-supervision. During testing, the point cloud repeatedly goes through the Mask Reconstruction Network, with each iteration's output becoming the next input. By merging and contrasting the final reconstructed point cloud with the initial input, our method successfully locates anomalies. Experiments show that IMRNet outperforms previous state-of-the-art methods, achieving 66.1% in I-AUC on our Anomaly-ShapeNet dataset and 72.5% in I-AUC on the Real3D-AD dataset. Our benchmark will be released at https://github.com/Chopper233/Anomaly-ShapeNet.
-
Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting, evaluating the surrounding scene changes in the near future. We build our benchmark based on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, which provides sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons, we introduce four baseline types from diverse camera-based perception and prediction implementations, including a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed novel end-to-end 4D occupancy forecasting network. Furthermore, a standardized evaluation protocol for preset multiple tasks is also provided to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark are released as open source at https://github.com/haomo-ai/Cam4DOcc.
-
Instance segmentation demands substantial labeling resources. This has prompted increased interest in exploring the object discovery task as an unsupervised alternative. In particular, promising results were achieved in localizing instances using motion supervision only. However, the motion signal introduces complexities due to its inherent noise and sparsity, which constrains the effectiveness of current methodologies. In the present paper, we propose DIOD (self DIstillation meets Object Discovery), the first method that places motion-guided object discovery within a framework of continuous improvement through knowledge distillation, providing solutions to existing limitations: (i) DIOD robustly eliminates the noise present in the exploited motion maps, providing accurate motion supervision; (ii) DIOD leverages the discovered objects within an iterative pseudo-labeling framework, enriching the initial motion supervision with static objects, which results in a cost-efficient increase in performance. Through experiments on synthetic and real-world datasets, we demonstrate the benefits of bridging the gap between object discovery and distillation by significantly improving the state-of-the-art. This enhancement is also sustained across other demanding metrics so far reserved for supervised tasks.
-
We introduce GoMAvatar, a novel approach for real-time, memory-efficient, high-quality animatable human modeling. GoMAvatar takes as input a single monocular video to create a digital avatar capable of re-articulation in new poses and real-time rendering from novel viewpoints, while seamlessly integrating with rasterization-based graphics pipelines. Central to our method is the Gaussians-on-Mesh (GoM) representation, a hybrid 3D model combining the rendering quality and speed of Gaussian splatting with the geometry modeling and compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap, PeopleSnapshot, and various YouTube videos. GoMAvatar matches or surpasses current monocular human modeling algorithms in rendering quality and significantly outperforms them in computational efficiency (43 FPS) while being memory-efficient (3.63 MB per subject).
-
Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD), but they cannot account for the capabilities of models from gradient-free methods nor the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs. To understand the inductive biases provided by architectures independently from GD, we examine untrained, random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But unlike common wisdom, NNs do not have an inherent simplicity bias. This property depends on components such as ReLUs, residual connections, and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks. We provide a fresh explanation for the success of deep learning independent from gradient-based training. It points at promising avenues for controlling the solutions implemented by trained models.
-
Realistic 3D human generation from text prompts is a desirable yet challenging task. Existing methods optimize 3D representations like mesh or neural fields via score distillation sampling (SDS), which suffers from inadequate fine details or excessive training time. In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance. Our key insight is that 3D Gaussian Splatting is an efficient renderer with periodic Gaussian shrinkage or growing, where such adaptive density control can be naturally guided by intrinsic human structures. Specifically, 1) we first propose a Structure-Aware SDS that simultaneously optimizes human appearance and geometry. The multi-modal score function from both RGB and depth space is leveraged to distill the Gaussian densification and pruning process. 2) Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS into a noisier generative score and a cleaner classifier score, which well addresses the over-saturation issue. The floating artifacts are further eliminated based on Gaussian size in a prune-only phase to enhance generation smoothness. Extensive experiments demonstrate the superior efficiency and competitive quality of our framework, rendering vivid 3D humans under diverse scenarios.
-
In image question answering, due to the abundant and sometimes redundant information, precisely matching and integrating the information from both text and images is a challenge. In this paper, we propose the Decomposition-Integration Enhancing Multimodal Insight (DIEM), which initially decomposes the given question and image into multiple subquestions and several sub-images, aiming to isolate specific elements for more focused analysis. We then integrate these sub-elements by matching each subquestion with its relevant sub-images, while also retaining the original image, to construct a comprehensive answer to the original question without losing sight of the overall context. This strategy mirrors the human cognitive process of simplifying complex problems into smaller components for individual analysis, followed by an integration of these insights. We implement DIEM on the LLaVA-v1.5 model and evaluate its performance on ScienceQA and MM-Vet. Experimental results indicate that our method boosts accuracy in most question classes of ScienceQA (+2.03% on average), especially in the image modality (+3.40%). On MM-Vet, our method improves the score from 31.1 to 32.4. These findings highlight DIEM's effectiveness in harmonizing the complexities of multimodal data, demonstrating its ability to enhance accuracy and depth in image question answering through its decomposition-integration process.
-
We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 million high-quality real-world human images at a mean resolution of 1488x1255, attached with precise text annotations deriving from 115 million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic: easy to integrate into downstream tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in an existing text-to-image diffusion model and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze. Project page: https://cosmicman-cvpr2024.github.io/.
-
Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) that are trained on extensive multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. In this paper, we regularize the sign videos to embody linguistic characteristics of spoken language, and propose a novel SignLLM framework to transform sign videos into a language-like representation for improved readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) the Vector-Quantized Visual Sign module converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment module converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens, enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely used SLT benchmarks.
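The core operation behind any vector-quantized module is a nearest-neighbor lookup: each continuous feature vector is replaced by the index of its closest codebook entry, producing a sequence of discrete tokens. The sketch below shows only that generic lookup with squared Euclidean distance. The function name, the toy codebook, and the choice of distance are assumptions for illustration and do not reproduce SignLLM's module.

```python
from typing import List, Sequence


def quantize_to_tokens(features: Sequence[Sequence[float]],
                       codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for feature in features:
        # Squared Euclidean distance to every codebook vector.
        distances = [
            sum((a - b) ** 2 for a, b in zip(feature, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy example: three 2-D codebook entries and three per-frame features.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_features = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8]]
toy_tokens = quantize_to_tokens(toy_features, toy_codebook)
```

In a trained system the codebook is learned jointly with the encoder; here it is fixed by hand so the lookup can be followed step by step.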
-
No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without an available reference, and has achieved tremendous improvements due to the utilization of deep neural networks. However, learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually perform suboptimally in terms of generalization. To solve the problem, we propose a novel contrastive pre-training framework tailored for PCQA (CoPA), which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space, we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Utilizing the generated anchors, we constrain the pre-training process via a quality-aware contrastive loss, following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore, in the model fine-tuning stage, we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms the state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models.
-
We propose a practical approach to JPEG image decoding, utilizing a local implicit neural representation with continuous cosine formulation. The JPEG algorithm significantly quantizes discrete cosine transform (DCT) spectra to achieve a high compression rate, inevitably resulting in quality degradation while encoding an image. To address this quality degradation, we have designed a continuous cosine spectrum estimator that restores the distorted spectrum. By leveraging local DCT formulations, our network is able to perform dequantization and upsampling simultaneously. Our proposed model enables decoding compressed images directly across different quality factors using a single pre-trained model, without relying on a conventional JPEG decoder. As a result, our proposed network achieves state-of-the-art performance in flexible color image JPEG artifact removal tasks. Our source code is available at https://github.com/WooKyoungHan/JDEC
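The degradation that the abstract refers to comes from rounding DCT coefficients to multiples of a quantization step. The sketch below demonstrates this on a single 8-sample row with an orthonormal 1-D DCT-II and one uniform step size. It is a simplified illustration: JPEG itself uses a 2-D 8x8 DCT with a per-frequency quantization table, and the sample values and step size here are arbitrary choices.

```python
import math
from typing import List


def _scale(k: int, n: int) -> float:
    """Normalization factor that makes the DCT-II orthonormal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x: List[float]) -> List[float]:
    """Orthonormal 1-D DCT-II."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct(c: List[float]) -> List[float]:
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]


def quantize(coeffs: List[float], step: float) -> List[int]:
    """Round each coefficient to the nearest integer multiple of `step`."""
    return [round(c / step) for c in coeffs]


def dequantize(levels: List[int], step: float) -> List[float]:
    """Recover approximate coefficients; the rounding error is lost for good."""
    return [level * step for level in levels]


row = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]  # arbitrary sample values
coeffs = dct(row)
restored_coeffs = dequantize(quantize(coeffs, step=20.0), step=20.0)
decoded_row = idct(restored_coeffs)
```

Each dequantized coefficient can differ from the original by up to half a step, and that error carries through to `decoded_row`; a learned decoder aims to estimate the spectrum more accurately than this plain dequantization does.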
-
Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a new target domain by actively selecting a limited number of target data to annotate. This setting neglects the more practical scenario where training data are collected from multiple sources. This motivates us to extend ADA from a single source domain to multiple source domains, termed Multi-source Active Domain Adaptation (MADA). Not surprisingly, we find that most traditional ADA methods cannot work directly in such a setting, mainly due to the excessive domain gap introduced by all the source domains. Considering this, we propose a Detective framework that comprehensively considers the domain shift between multi-source domains and target domains to detect the informative target samples. Specifically, the Detective leverages a dynamic Domain Adaptation (DA) model that learns how to adapt the model's parameters to fit the union of multi-source domains. This enables an approximate single-source domain modeling by the dynamic model. We then comprehensively measure both domain uncertainty and predictive uncertainty in the target domain to detect informative target samples using evidential deep learning, thereby mitigating uncertainty miscalibration. Experiments demonstrate that our solution outperforms existing methods by a considerable margin on three domain adaptation benchmarks.
-
Lifelong Person Re-identification (L-ReID) aims to learn from sequentially collected data to match a person across different scenes. Once an L-ReID model is updated using new data, all historical images in the gallery are required to be re-calculated to obtain new features for testing, known as "re-indexing". However, it is infeasible when raw images in the gallery are unavailable due to data privacy concerns, resulting in incompatible retrieval between the query and the gallery features calculated by different models, which causes significant performance degradation. In this paper, we focus on a new task called Re-indexing Free Lifelong Person Re-identification (RFL-ReID), which requires achieving effective L-ReID without re-indexing raw images in the gallery. To this end, we propose a Continual Compatible Representation (C2R) method, which facilitates the query feature calculated by the continuously updated model to effectively retrieve the gallery feature calculated by the old model in a compatible manner. Specifically, we design a Continual Compatible Transfer (CCT) network to continuously transfer and consolidate the old gallery feature into the new feature space. Besides, a Balanced Compatible Distillation module is introduced to achieve compatibility by aligning the transferred feature space with the new feature space. Finally, a Balanced Anti-forgetting Distillation module is proposed to eliminate the accumulated forgetting of old knowledge during the continual compatible transfer. Extensive experiments on several benchmark L-ReID datasets demonstrate the effectiveness of our method against state-of-the-art methods for both RFL-ReID and L-ReID tasks. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/C2R_CVPR2024.
-
Pan-sharpening is a super-resolution problem that essentially relies on spectra fusion of panchromatic (PAN) images and low-resolution multi-spectral (LRMS) images. Previous methods have validated the effectiveness of information fusion in the Fourier space of the whole image. However, they have not fully explored the Fourier relationships at different hierarchies between PAN and LRMS images. To this end, we propose a Hierarchical Frequency Integration Network (HFIN) to facilitate hierarchical Fourier information integration for pan-sharpening. Specifically, our network consists of two designs: information stratification and information integration. For information stratification, we hierarchically decompose PAN and LRMS information into spatial, global Fourier, and local Fourier information and fuse them independently. For information integration, the above hierarchically fused information is processed to further enhance their relationships and undergo comprehensive integration. Our method extends a new space for exploring the relationships of PAN and LRMS images, enhancing the integration of spatial-frequency information. Extensive experiments robustly validate the effectiveness of the proposed network, showcasing its superior performance compared to other state-of-the-art methods and its generalization to real-world scenes and to other fusion tasks as a general image fusion framework. Code is available at https://github.com/JosephTiTan/HFIN.
-
3D instance segmentation (3DIS) is a crucial task but point-level annotations are tedious in fully supervised settings. Thus using bounding boxes (bboxes) as annotations has shown great potential. The current mainstream approach is a two-step process involving the generation of pseudo-labels from box annotations and the training of a 3DIS network with the pseudo-labels. However due to the presence of intersections among bboxes not every point has a determined instance label especially in overlapping areas. To generate higher quality pseudo-labels and achieve more precise weakly supervised 3DIS results we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation (BSNet) which devises a novel pseudo-labeler called Simulation-assisted Transformer. The labeler consists of two main components. The first is Simulation-assisted Mean Teacher which introduces Mean Teacher for the first time in this task and constructs simulated samples to assist the labeler in acquiring prior knowledge about overlapping areas. To better model local-global structure we also propose Local-Global Aware Attention as the decoder for teacher and student labelers. Extensive experiments conducted on the ScanNetV2 and S3DIS datasets verify the superiority of our designs.
-
Object-centric learning (OCL) extracts the representation of objects with slots offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention which utilizes attention mechanisms to iteratively refine slot representations. However a major drawback of most object-centric models including slot attention is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also overlooks the inherent variability in the number of objects present in each instance. To overcome this fundamental limitation we present a novel complexity-aware object auto-encoder framework. Within this framework we introduce an adaptive slot attention (AdaSlot) mechanism that dynamically determines the optimal number of slots based on the content of the data. This is achieved by proposing a discrete slot sampling module that is responsible for selecting an appropriate number of slots from a candidate list. Furthermore we introduce a masked slot decoder that suppresses unselected slots during the decoding process. Our framework tested extensively on object discovery tasks with various datasets shows performance matching or exceeding top fixed-slot models. Moreover our analysis substantiates that our method exhibits the capability to dynamically adapt the slot number according to each instance's complexity offering the potential for further exploration in slot attention research. Project will be available at https://kfan21.github.io/AdaSlot/
-
Deep neural networks (DNNs) often display overconfidence when encountering out-of-distribution (OOD) samples posing significant challenges in real-world applications. Capitalizing on the observation that responses on convolutional kernels are generally more pronounced for in-distribution (ID) samples than for OOD ones this paper proposes the COnvolutional REsponse-based Score (CORES) to exploit these discrepancies for OOD detection. Initially CORES delves into the extremities of convolutional responses by considering both their magnitude and the frequency of significant values. Moreover through backtracking from the most prominent predictions CORES effectively pinpoints sample-relevant kernels across different layers. These kernels which exhibit a strong correlation to input samples are integral to CORES's OOD detection capability. Comprehensive experiments across various ID and OOD settings demonstrate CORES's effectiveness in OOD detection and its superiority to the state-of-the-art methods.
-
Deep Neural Networks (DNNs) are widely used for their ability to effectively approximate large classes of functions. This flexibility however makes the strict enforcement of constraints on DNNs a difficult problem. In contexts where it is critical to limit the function space to which certain network components belong such as wavelets employed in Multi-Resolution Analysis (MRA) naive constraints via additional terms in the loss function are inadequate. To address this we introduce a Convolutional Neural Network (CNN) wherein the convolutional filters are strictly constrained to be wavelets. This allows the filters to update to task-optimized wavelets during the training procedure. Our primary contribution lies in the rigorous formulation of these filters via a constrained empirical risk minimization framework thereby providing an exact mechanism to enforce these structural constraints. While our work is grounded in theory we investigate our approach empirically through applications in medical imaging particularly in the task of contour prediction around various organs achieving superior performance compared to baseline methods.
-
Humans naturally interact with both others and the surrounding multiple objects engaging in various social activities. However recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects due to fundamental data scarcity. In this paper we introduce HOI-M^3 a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M^3 dataset we introduce two novel data-driven tasks with companion strong baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research about multiple human-object interactions and behavior analysis. Our HOI-M^3 dataset corresponding codes and pre-trained models will be disseminated to the community for future research.
-
3D object generation has undergone significant advancements, yielding high-quality results. However, current approaches fall short in achieving precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. User-envisioning 3D object generation faces significant challenges in realizing its concepts using current generative models due to limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both of them limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that the proposed Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at https://interactive-3d.github.io/.
-
Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike a Convolutional Neural Network (CNN), ViT's simple architecture has no informative inductive bias (e.g., locality). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT on balanced datasets effectively. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distilling from a CNN via a distillation DIST token, by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, and the classifier CLS token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018. Project Page: https://rangwani-harsh.github.io/DeiT-LT.
-
Recent advancements in Spatial Transcriptomics (ST) technology have facilitated detailed gene expression analysis within tissue contexts. However the high costs and methodological limitations of ST necessitate a more robust predictive model. In response this paper introduces TRIPLEX a novel deep learning framework designed to predict spatial gene expression from Whole Slide Images (WSIs). TRIPLEX uniquely harnesses multi-resolution features capturing cellular morphology at individual spots the local context around these spots and the global tissue organization. By integrating these features through an effective fusion strategy TRIPLEX achieves accurate gene expression prediction. Our comprehensive benchmark study conducted on three public ST datasets and supplemented with Visium data from 10X Genomics demonstrates that TRIPLEX outperforms current state-of-the-art models in Mean Squared Error (MSE) Mean Absolute Error (MAE) and Pearson Correlation Coefficient (PCC). The model's predictions align closely with ground truth gene expression profiles and tumor annotations underscoring TRIPLEX's potential in advancing cancer diagnosis and treatment.
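The three evaluation metrics named in the abstract above can be written out in a few lines. This is a minimal sketch of the standard definitions only; the function names and the toy expression vectors are illustrative assumptions, not part of TRIPLEX or its benchmark code.

```python
import math

def mse(pred, true):
    """Mean Squared Error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def mae(pred, true):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def pcc(pred, true):
    """Pearson Correlation Coefficient (assumes neither input is constant)."""
    mean_p = sum(pred) / len(pred)
    mean_t = sum(true) / len(true)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    return cov / (std_p * std_t)

# Hypothetical predicted vs. measured expression of one gene at three spots.
predicted = [1.0, 2.0, 3.0]
measured = [2.0, 4.0, 6.0]
print(mse(predicted, measured), mae(predicted, measured), pcc(predicted, measured))
```

The toy case shows why the metrics are reported together: the prediction is perfectly correlated with the measurement (PCC close to 1) while still carrying a nonzero absolute error.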
-
Non-Exemplar Class Incremental Learning (NECIL) involves learning a classification model on a sequence of data without access to exemplars from previously encountered old classes. Such a stringent constraint always leads to catastrophic forgetting of the learned knowledge. Existing methods employ either knowledge distillation techniques or preserved class prototypes to sustain prior knowledge. However, two critical issues still persist. On the one hand, as the model is continually updated, the preserved prototypes of old classes will inevitably deviate from the suitable location in the feature space of the new model. On the other hand, due to the lack of exemplars, the features of new classes will take the place of similar old classes, which breaks the classification boundary. To address these challenges, we propose a Feature Calibration and Separation (FCS) method for NECIL. Our approach comprises a Feature Calibration Network (FCN) that adapts prototypes of old classes to the new model via optimal transport learning, approximating the drift of prototypes caused by model evolution. Additionally, we propose a Prototype-Involved Contrastive Loss (PIC) that enhances feature separation among different classes. Specifically, to mitigate the boundary distortion arising from the interplay of classes from different learning stages, prototypes are involved in pushing the features of new classes away from the old classes. Extensive experiments on three datasets with different settings have demonstrated the superiority of our FCS method against the state-of-the-art class incremental learning approaches. Code is available at https://github.com/zhoujiahuan1991/CVPR2024-FCS.
-
Modeling and visualizing relationships between tasks or datasets is an important step towards solving various meta-tasks such as dataset discovery, multi-tasking, and transfer learning. However, many relationships, such as containment and transferability, are naturally asymmetric, and current approaches for representation and visualization (e.g., t-SNE) do not readily support this. We propose Task2Box, an approach to represent tasks using box embeddings (axis-aligned hyperrectangles in low-dimensional spaces) that can capture asymmetric relationships between them through volumetric overlaps. We show that Task2Box accurately predicts unseen hierarchical relationships between nodes in the ImageNet and iNaturalist datasets, as well as transferability between tasks in the Taskonomy benchmark. We also show that box embeddings estimated from task representations (e.g., CLIP, Task2Vec, or attribute-based) can be used to predict relationships between unseen tasks more accurately than classifiers trained on the same representations, as well as handcrafted asymmetric distances (e.g., KL divergence). This suggests that low-dimensional box embeddings can effectively capture these task relationships and have the added advantage of being interpretable. We use the approach to visualize relationships among publicly available image classification datasets on Hugging Face, a popular dataset hosting platform.
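To illustrate why volumetric overlap between boxes is asymmetric, here is a minimal sketch using hard box volumes. The function names and the toy boxes are assumptions made for illustration; the paper's own volume computation and training objective may differ (for example by using a smoothed volume).

```python
def box_volume(lower, upper):
    """Volume of an axis-aligned box; empty boxes have volume 0."""
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= max(hi - lo, 0.0)
    return volume

def containment(lower_a, upper_a, lower_b, upper_b):
    """Fraction of box A that lies inside box B: vol(A ∩ B) / vol(A)."""
    inter_lower = [max(a, b) for a, b in zip(lower_a, lower_b)]
    inter_upper = [min(a, b) for a, b in zip(upper_a, upper_b)]
    vol_a = box_volume(lower_a, upper_a)
    if vol_a == 0.0:
        return 0.0
    return box_volume(inter_lower, inter_upper) / vol_a

# Hypothetical 2D boxes: a "fine-grained" task nested inside a "broad" one.
fine = ([0.25, 0.25], [0.75, 0.75])
broad = ([0.0, 0.0], [1.0, 1.0])
print(containment(*fine, *broad))  # fine is fully inside broad
print(containment(*broad, *fine))  # only part of broad is inside fine
```

Swapping the arguments changes the score, which is the property a symmetric distance (as used for t-SNE-style layouts) cannot express.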
-
In this paper, we present a novel indoor 3D reconstruction method with occluded surface completion, given a sequence of depth readings. Prior state-of-the-art (SOTA) methods only focus on the reconstruction of the visible areas in a scene, neglecting the invisible areas due to occlusions, e.g., the contact surface between furniture, occluded wall, and floor. Our method tackles the task of completing the occluded scene surfaces, resulting in a complete 3D scene mesh. The core idea of our method is learning a 3D geometry prior from various complete scenes to infer the occluded geometry of an unseen scene solely from depth measurements. We design a coarse-fine hierarchical octree representation coupled with a dual-decoder architecture, i.e., Geo-decoder and 3D Inpainter, which jointly reconstructs the complete 3D scene geometry. The Geo-decoder, with detailed representation at fine levels, is optimized online for each scene to reconstruct visible surfaces. The 3D Inpainter, with abstract representation at coarse levels, is trained offline using various scenes to complete occluded surfaces. As a result, while the Geo-decoder is specialized for an individual scene, the 3D Inpainter can be generally applied across different scenes. We evaluate the proposed method on the 3D Completed Room Scene (3D-CRS) and iTHOR datasets, significantly outperforming the SOTA methods by gains of 16.8% and 24.2% in terms of the completeness of 3D reconstruction. The 3D-CRS dataset, including a complete 3D mesh of each scene, is provided on the project webpage.
-
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably the proposed model outperforms state-of-the-art methods in closed-set settings on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions our model surpasses the recent best-performing models by 4.88 m_vIoU and 1.83 accuracy demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our codes will be publicly released.
-
Omnidirectional cameras are extensively used in various applications to provide a wide field of vision. However they face a challenge in synthesizing novel views due to the inevitable presence of dynamic objects including the photographer in their wide field of view. In this paper we introduce a new approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can render static-only scene views removing and inpainting dynamic objects simultaneously. Our approach combines the principles of local radiance fields with the bidirectional optimization of omnidirectional rays. Our input is an omnidirectional video and we evaluate the mutual observations of the entire angle between the previous and current frames. To reduce ghosting artifacts of dynamic objects and inpaint occlusions we devise a multi-resolution motion mask prediction module. Unlike existing methods that primarily separate dynamic components through the temporal domain our method uses multi-resolution neural feature planes for precise segmentation which is more suitable for long 360-degree videos. Our experiments validate that OmniLocalRF outperforms existing methods in both qualitative and quantitative metrics especially in scenarios with complex real-world scenes. In particular our approach eliminates the need for manual interaction such as drawing motion masks by hand and additional pose estimation making it a highly effective and efficient solution.
-
Estimating disparities in challenging areas is difficult and limits the performance of stereo matching models. In this paper we exploit local structure information (LSI) to enhance stereo matching. Specifically our LSI comprises a series of key elements including the slant plane (parameterised by disparity gradients) disparity offset details and neighbouring relations. This LSI empowers our method to effectively handle intricate structures including object boundaries and curved surfaces. We bootstrap the LSI from monocular depth and subsequently iteratively refine it to better capture the underlying scene geometry constraints. Building upon the LSI we introduce the Local Structure-Guided Propagation (LSGP) which enhances the disparity initialization optimization and refinement processes. By combining LSGP with a Gated Recurrent Unit (GRU) we present our novel stereo matching method referred to as Local Structure-guided stereo matching (LoS). Remarkably LoS achieves top-ranking results on four widely recognized public benchmark datasets (ETH3D Middlebury KITTI 15 & 12) demonstrating the superior capabilities of our proposed model.
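As a rough illustration of what "slant plane parameterised by disparity gradients" means, the sketch below extrapolates a pixel's disparity to a neighbouring pixel along a local plane. The function name and the numbers are hypothetical; this is not the learned LSGP module described in the abstract, only the first-order plane model it builds on.

```python
def propagate_disparity(disparity, grad_x, grad_y, offset_x, offset_y):
    """Extrapolate disparity to a neighbour at (offset_x, offset_y) pixels,
    assuming the surface is locally a plane with the given disparity gradients."""
    return disparity + grad_x * offset_x + grad_y * offset_y

# Hypothetical pixel: disparity 10.0 px, plane slanted in both image directions.
d_neighbour = propagate_disparity(10.0, grad_x=0.5, grad_y=-0.25,
                                  offset_x=2, offset_y=1)
print(d_neighbour)  # 10.0 + 0.5*2 - 0.25*1 = 10.75
```

With zero gradients this reduces to copying the disparity (a fronto-parallel assumption), which is the case that tends to break down on slanted and curved surfaces.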
-
The field of 3D detailed human mesh reconstruction has made significant progress in recent years. However current methods still face challenges when used in industrial applications due to unstable results low-quality meshes and a lack of UV unwrapping and skinning weights. In this paper we present SHERT a novel pipeline that can reconstruct semantic human meshes with textures and high-precision details. SHERT applies semantic- and normal-based sampling between the detailed surface (e.g. mesh and SDF) and the corresponding SMPL-X model to obtain a partially sampled semantic mesh and then generates the complete semantic mesh by our specifically designed self-supervised completion and refinement networks. Using the complete semantic mesh as a basis we employ a texture diffusion model to create human textures that are driven by both images and texts. Our reconstructed meshes have stable UV unwrapping high-quality triangle meshes and consistent semantic information. The given SMPL-X model provides semantic information and shape priors allowing SHERT to perform well even with incorrect and incomplete inputs. The semantic information also makes it easy to substitute and animate different body parts such as the face body and hands. Quantitative and qualitative experiments demonstrate that SHERT is capable of producing high-fidelity and robust semantic meshes that outperform state-of-the-art methods.
-
Federated learning facilitates the collaborative learning of a global model across multiple distributed medical institutions without centralizing data. Nevertheless the expensive cost of annotation on local clients remains an obstacle to effectively utilizing local data. To mitigate this issue federated active learning methods suggest leveraging local and global model predictions to select a relatively small amount of informative local data for annotation. However existing methods mainly focus on all local data sampled from the same domain making them unreliable in realistic medical scenarios with domain shifts among different clients. In this paper we make the first attempt to assess the informativeness of local data derived from diverse domains and propose a novel methodology termed Federated Evidential Active Learning (FEAL) to calibrate the data evaluation under domain shift. Specifically we introduce a Dirichlet prior distribution in both local and global models to treat the prediction as a distribution over the probability simplex and capture both aleatoric and epistemic uncertainties by using the Dirichlet-based evidential model. Then we employ the epistemic uncertainty to calibrate the aleatoric uncertainty. Afterward we design a diversity relaxation strategy to reduce data redundancy and maintain data diversity. Extensive experiments and analysis on five real multi-center medical image datasets demonstrate the superiority of FEAL over the state-of-the-art active learning methods in federated scenarios with domain shifts. The code will be available at https://github.com/JiayiChen815/FEAL.
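The idea of treating a prediction as a Dirichlet distribution over the probability simplex can be sketched as follows. This follows the common evidential deep learning convention (alpha = evidence + 1); the function name is hypothetical, and the two uncertainty scores here (vacuity and entropy of the expected prediction) are simple stand-ins, so FEAL's exact aleatoric and epistemic definitions and its calibration step may differ.

```python
import math

def dirichlet_summary(evidence):
    """Summarise a Dirichlet prediction built from non-negative class evidence.

    Returns (expected_probs, vacuity, entropy):
      expected_probs: mean of the Dirichlet, alpha_k / sum(alpha)
      vacuity: K / sum(alpha), high when little total evidence was collected
      entropy: entropy of the expected prediction, high when classes compete
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    expected_probs = [a / strength for a in alpha]
    vacuity = len(alpha) / strength
    entropy = -sum(p * math.log(p) for p in expected_probs if p > 0.0)
    return expected_probs, vacuity, entropy

# No evidence at all: uniform prediction and maximal vacuity.
print(dirichlet_summary([0.0, 0.0, 0.0]))
# Strong evidence for class 0: confident prediction and low vacuity.
print(dirichlet_summary([97.0, 0.0, 0.0]))
```

A sample with balanced but plentiful evidence would score high on entropy and low on vacuity, which is why the two quantities are useful to keep apart when ranking samples for annotation.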
-
Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, but their intermediate representations are also useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure. In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.
-
Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation empowering non-experts to generate stunning images with unique styles. While promising animating these personalized images with realistic motions poses significant challenges in preserving distinct styles high-fidelity details and achieving motion controllability by text. In this paper we present PIA a Personalized Image Animator that excels in aligning with condition images achieving motion controllability by text and the compatibility with various personalized T2I models without specific tuning. To achieve these goals PIA builds upon a base T2I model with well-trained temporal alignment layers allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module which takes as inputs the condition frame and inter-frame affinity. This module leverages the affinity hint to transfer appearance information from the condition frame to individual frames in the latent space. This design mitigates the challenges of appearance-related frame alignment within PIA and allows for a stronger focus on aligning with motion-related guidance. To address the lack of a benchmark for this field we introduce AnimateBench a comprehensive benchmark comprising diverse personalized T2I models curated images and motion-related prompts. We show extensive evaluations and applications on AnimateBench to verify the superiority of PIA.
-
Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes with a few objects. Nevertheless, recent advances in imaging technology have enabled the acquisition of gigapixel-level images, providing high-resolution details in large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks and make grounding more practically valuable, we introduce a novel dataset named GigaGrounding, designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks, demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding, gigapixel-level resolution, significant variations in object scales, and "multi-hop expressions". Furthermore, we introduce a simple yet effective grounding approach, which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset is available at www.gigavision.ai.
-
A Neural Radiance Field (NeRF) encodes the specific relation of 3D geometry and appearance of a scene. We here ask the question whether we can transfer the appearance from a source NeRF onto a target 3D geometry in a semantically meaningful way such that the resulting new NeRF retains the target geometry but has an appearance that is an analogy to the source NeRF. To this end we generalize classic image analogies from 2D images to NeRFs. We leverage correspondence transfer along semantic affinity that is driven by semantic features from large pre-trained 2D image models to achieve multi-view consistent appearance transfer. Our method allows exploring the mix-and-match product space of 3D geometry and appearance. We show that our method outperforms traditional stylization-based methods and that a large majority of users prefer our method over several typical baselines. Project page: https://mfischer-ucl.github.io/nerf_analogies
-
We introduce Mind Artist (MindArt), a novel and efficient neural decoding architecture to snap artistic photographs from our mind in a controllable manner. Recently, progress has been made in image reconstruction with non-invasive brain recordings, but it is still difficult to generate realistic images with high semantic fidelity due to the scarcity of data annotations. Unlike previous methods, this work casts neural decoding into optimal transport (OT) and representation decoupling problems. Specifically, under discrete OT theory, we design a graph matching-guided neural representation learning framework to seek the underlying correspondences between conceptual semantics and neural signals, which yields a natural and meaningful self-supervisory task. Moreover, the proposed MindArt, structured with multiple stand-alone modal branches, enables the seamless incorporation of semantic representation into any visual style information, giving it multi-modal reconstruction and training-free semantic editing capabilities. By doing so, the reconstructed images of MindArt have phenomenal realism in terms of both semantics and appearance. We compare our MindArt with leading alternatives and achieve SOTA performance in different decoding tasks. Importantly, our approach can directly generate a series of stylized "mind snapshots" without extra optimizations, which may open up more potential applications. Code is available at https://github.com/JxuanC/MindArt.
-
Recent breakthroughs in vision-language models (VLMs) have opened a new page in the vision community. VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models, thanks to training on large-scale Internet image-text pairs. However, despite the amazing achievements of VLMs, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure transformer has proven its effectiveness in text encoding, it remains questionable whether this is also the case for image encoding, especially considering that various types of networks have been proposed on the ImageNet benchmark which, unfortunately, are rarely studied in VLMs. Due to the small data/model scale, the original conclusions of model design on ImageNet can be limited and biased. In this paper, we aim at building an evaluation protocol of vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and scalability in both model and training data sizes. To this end, we introduce ViTamin, new vision models tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).
-
Recent advancements in machine learning have spotlighted the potential of hyperbolic spaces, as they effectively learn hierarchical feature representations. While there has been progress in leveraging hyperbolic spaces in single-modality contexts, their use in multimodal settings remains underexplored. Some recent efforts have sought to transpose Euclidean multimodal learning techniques to hyperbolic spaces by adopting geodesic-distance-based contrastive losses. However, we show both theoretically and empirically that such a spatial-proximity-based contrastive loss significantly disrupts hierarchies in the latent space. To remedy this, we advocate that cross-modal representations should accept the inherent modality gap between text and images, and we introduce a novel approach to measure cross-modal similarity that does not enforce spatial proximity. Our approach shows remarkable capabilities in preserving unimodal hierarchies while aligning the two modalities. Our experiments on a series of downstream tasks demonstrate that a better latent structure emerges with our objective function, while it is also superior in text-to-image and image-to-text retrieval tasks.
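For readers unfamiliar with the geodesic distance that such contrastive losses build on, here is a minimal sketch assuming the Poincaré ball model with curvature -1. The abstract does not say which hyperbolic model the cited methods use, so this choice and the function name are assumptions made purely for illustration.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit ball
    (Poincaré ball model, curvature -1)."""
    sq_norm_x = sum(c * c for c in x)
    sq_norm_y = sum(c * c for c in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    return math.acosh(arg)

origin = [0.0, 0.0]
# The same Euclidean step of 0.1 costs far more near the boundary
# than near the origin.
print(poincare_distance(origin, [0.1, 0.0]))
print(poincare_distance([0.8, 0.0], [0.9, 0.0]))
```

Distances grow rapidly towards the boundary, which is what lets a tree-like hierarchy be embedded with low distortion; a loss that pulls paired text and image points together in this metric therefore also moves them along that hierarchy.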
-
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files and 2) a model that can establish strong links between audio information and its corresponding visual object. However these requirements are only partially addressed by current methods with training sets containing biased audio-visual data and models that generalise poorly beyond this biased training set. In this work we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
-
Few-shot object detection (FSOD) aims to detect objects with only a few training examples. Visual feature extraction and query-support similarity learning are the two critical components. Existing works are usually developed based on ImageNet pre-trained vision backbones and design sophisticated metric-learning networks for few-shot learning but still have inferior accuracy. In this work we study few-shot object detection using modern foundation models. First the vision-only contrastive pre-trained DINOv2 model is used for the vision backbone which shows strong transferable performance without tuning the parameters. Second a Large Language Model (LLM) is employed for contextualized few-shot learning with the input of all classes and query image proposals. Language instructions are carefully designed to prompt the LLM to classify each proposal in context. The contextual information includes proposal-proposal relations proposal-class relations and class-class relations which can largely promote few-shot learning. We comprehensively evaluate the proposed model (FM-FSOD) in multiple FSOD benchmarks achieving state-of-the-art performance.
-
Federated learning (FL) promotes decentralized training while prioritizing data confidentiality. However its application on resource-constrained devices is challenging due to the high demand for computation and memory resources to train deep learning models. Neural network pruning techniques such as dynamic pruning could enhance model efficiency but directly adopting them in FL still poses substantial challenges including post-pruning performance degradation high activation memory usage etc. To address these challenges we propose FedMef a novel and memory-efficient federated dynamic pruning framework. FedMef comprises two key components. First we introduce the budget-aware extrusion that maintains pruning efficiency while preserving post-pruning performance by salvaging crucial information from parameters marked for pruning within a given budget. Second we propose scaled activation pruning to effectively reduce activation memory footprints which is particularly beneficial for deploying FL to memory-limited devices. Extensive experiments demonstrate the effectiveness of our proposed FedMef. In particular it achieves a significant reduction of 28.5% in memory footprint compared to state-of-the-art methods while obtaining superior accuracy.
-
Computer vision tasks typically involve describing what is visible in an image (e.g. classification detection segmentation and captioning). We study a visual common sense task that requires understanding 'what is not visible'. Specifically given an image (e.g. of a living room) and a name of an object ("cushion") a vision system is asked to predict semantically-meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans (e.g. on the sofa). We call this task: Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house) AR devices (automatically rendering an object in the user's space) and visually-grounded chatbots with common sense. Studying the invisible is hard. Datasets for image description are typically constructed by curating relevant images (e.g. via image search with object names) and asking humans to annotate the contents of the image; neither of those two steps is straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context (which is easy to find online) and remove that object from the image via inpainting. This automated pipeline converts unstructured web data into a paired with/without object dataset. With this proposed data generation pipeline we collect a novel dataset containing 1.3M images across 9 object categories. We then train an SP prediction model called CLIP-UNet on our dataset. The CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors with object detectors generalizes well to real-world and simulated images exhibits semantics-aware reasoning for object placement and enables downstream applications like tidying robots in indoor environments.
-
Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular as they are excellent at image synthesis tasks. However these approaches usually employ additional image encoders and rely on the cross-attention mechanism for texture transfer from the garment to the person image which affects the try-on's efficiency and fidelity. To address these issues we propose a Texture-Preserving Diffusion (TPD) model for virtual try-on which enhances the fidelity of the results and introduces no additional image encoders. Accordingly we make contributions from two aspects. First we propose to concatenate the masked person and reference garment images along the spatial dimension and utilize the resulting image as the input for the diffusion model's denoising UNet. This enables the original self-attention layers contained in the diffusion model to achieve efficient and accurate texture transfer. Second we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images further enhancing the reliability of the try-on results. In addition we integrate mask prediction and image synthesis into a single compact model. The experimental results show that our approach can be applied to various try-on tasks e.g. garment-to-person and person-to-person try-ons and significantly outperforms state-of-the-art methods on the popular VITON and VITON-HD databases. Code is available at https://github.com/Gal4way/TPD.
-
Domain Generalization (DG) aims to resolve distribution shifts between source and target domains and current DG methods default to the setting that data from source and target domains share identical categories. Nevertheless there exist unseen classes from target domains in practical scenarios. To address this issue Open Set Domain Generalization (OSDG) has emerged and several methods have been exclusively proposed. However most existing methods adopt complex architectures with slight improvement compared with DG methods. Recently vision-language models (VLMs) have been introduced in DG following the fine-tuning paradigm but consume huge training overhead with large vision models. Therefore in this paper we innovate to transfer knowledge from VLMs to lightweight vision models and improve the robustness by introducing Perturbation Distillation (PD) from three perspectives including Score Class and Instance (SCI) named SCI-PD. Moreover previous methods are oriented by the benchmarks with identical and fixed splits ignoring the divergence between source domains. These methods are revealed to suffer from sharp performance decay with our proposed new benchmark Hybrid Domain Generalization (HDG) and a novel metric H^2-CV which construct various splits to comprehensively assess the robustness of algorithms. Extensive experiments demonstrate that our method outperforms state-of-the-art algorithms on multiple datasets especially improving the robustness when confronting data scarcity.
-
We introduce SODA a self-supervised diffusion model designed for representation learning. The model incorporates an image encoder which distills a source view into a compact representation that in turn guides the generation of related novel views. We show that by imposing a tight bottleneck between the encoder and a denoising decoder and leveraging novel view synthesis as a self-supervised objective we can turn diffusion models into strong representation learners capable of capturing visual semantics in an unsupervised manner. To the best of our knowledge SODA is the first diffusion model to succeed at ImageNet linear-probe classification and at the same time it accomplishes reconstruction editing and synthesis tasks across a wide range of datasets. Further investigation reveals the disentangled nature of its emergent latent space that serves as an effective interface to control and manipulate the produced images. All in all we aim to shed light on the exciting and promising potential of diffusion models not only for image generation but also for learning rich and robust representations. See our website at soda-diffusion.github.io.
-
Event cameras have recently received much attention for low-light image enhancement (LIE) thanks to their distinct advantages such as high dynamic range. However current research is prohibitively restricted by the lack of large-scale real-world and spatial-temporally aligned event-image datasets. To this end we propose a real-world (indoor and outdoor) dataset comprising over 30K pairs of images and events under both low and normal illumination conditions. To achieve this we utilize a robotic arm that traces a consistent non-linear trajectory to curate the dataset with spatial alignment precision under 0.03mm. We then introduce a matching alignment strategy rendering 90% of our dataset with errors less than 0.01s. Based on the dataset we propose a novel event-guided LIE approach called EvLight towards robust performance in real-world low-light scenes. Specifically we first design the multi-scale holistic fusion branch to extract holistic structural and textural information from both events and images. To ensure robustness against variations in the regional illumination and noise we then introduce a Signal-to-Noise-Ratio (SNR)-guided regional feature selection to selectively fuse features of images from regions with high SNR and enhance those with low SNR by extracting regional structural information from events. Our EvLight significantly surpasses the frame-based methods e.g. Retinexformer by 1.14 dB and 2.62 dB respectively. Code and datasets are available at https://vlislab22.github.io/eg-lowlight/.
-
Understanding illumination and reducing the need for supervision pose a significant challenge in low-light enhancement. Current approaches are highly sensitive to data usage during training and illumination-specific hyper-parameters limiting their ability to handle unseen scenarios. In this paper we propose a new zero-reference low-light enhancement framework trainable solely with normal light images. To accomplish this we devise an illumination-invariant prior inspired by the theory of physical light transfer. This prior serves as the bridge between normal and low-light images. Then we develop a prior-to-image framework trained without low-light data. During testing this framework is able to restore our illumination-invariant prior back to images automatically achieving low-light enhancement. Within this framework we leverage a pretrained generative diffusion model for model ability introduce a bypass decoder to handle detail distortion as well as offer a lightweight version for practicality. Extensive experiments demonstrate our framework's superiority in various scenarios as well as good interpretability robustness and efficiency. Code is available on our project homepage: http://daooshee.github.io/QuadPrior-Website/
-
Existing methods to fine-tune LLMs like Adapter Prefix-tuning and LoRA which introduce extra modules or additional input sequences to inject new skills or knowledge may compromise the innate abilities of LLMs. In this paper we propose LLaMA-Excitor a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information. Specifically the LLaMA-Excitor does not directly change the intermediate hidden state during the self-attention calculation of the transformer structure. We designed the Excitor block as a bypass module for the similarity score computation in LLMs' self-attention to reconstruct keys and change the importance of values by learnable prompts. LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets. Furthermore we unify the modeling of multi-modal tuning and language-only tuning extending LLaMA-Excitor to a powerful visual instruction follower without the need for complex multi-modal alignment. Our proposed approach is evaluated in language-only and multi-modal tuning experimental scenarios. Notably LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement (+6%) on the MMLU benchmark. In the visual instruction tuning we achieve a new state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO and a comparable performance (88.39%) on ScienceQA to cutting-edge models with more parameters and extensive vision-language pretraining.
-
The emergence of Neural Radiance Fields (NeRF) has greatly impacted 3D scene modeling and novel-view synthesis. As a kind of visual media for 3D scene representation compression with high rate-distortion performance is an eternal target. Motivated by advances in neural compression and neural field representation we propose NeRFCodec an end-to-end NeRF compression framework that integrates non-linear transform quantization and entropy coding for memory-efficient scene representation. Since training a non-linear transform directly on a large scale of NeRF feature planes is impractical we discover that a pre-trained neural 2D image codec can be utilized for compressing the features when adding content-specific parameters. Specifically we reuse a neural 2D image codec but modify its encoder and decoder heads while keeping the other parts of the pre-trained decoder frozen. This allows us to train the full pipeline via supervision of rendering loss and entropy loss yielding the rate-distortion balance by updating the content-specific parameters. At test time the bitstreams containing latent code feature decoder head and other side information are transmitted for communication. Experimental results demonstrate our method outperforms existing NeRF compression methods enabling high-quality novel view synthesis with a memory budget of 0.5 MB.
-
We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration which promotes the multi-view subject registration problem to a new calibration-free stage. This greatly alleviates the limitation in many practical applications. However this is a very challenging problem since its only input is several RGB images from different first-person views (FPVs) without the BEV image and the calibration of the FPVs while the output is a unified plane aggregated from all views with the positions and orientations of both the subjects and cameras in a BEV. For this purpose we propose an end-to-end framework solving camera and subject registration together by taking advantage of their mutual dependence whose main idea is as follows: i) creating a subject view-transform module (VTM) to project each pedestrian from FPV to a virtual BEV ii) deriving a multi-view geometry-based spatial alignment module (SAM) to estimate the relative camera pose in a unified BEV iii) selecting and refining the subject and camera registration results within the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for training and evaluation. Additionally we collect a real dataset for cross-domain evaluation. The experimental results show the remarkable effectiveness of our method. The code and proposed datasets are available at https://github.com/zekunqian/BEVSee.
-
Image keypoint descriptions that are discriminative and matchable over large changes in viewpoint are vital for 3D reconstruction. However descriptions output by learned descriptors are typically not robust to camera rotation. While they can be made more robust by e.g. data augmentation this degrades performance on upright images. Another approach is test-time augmentation which incurs a significant increase in runtime. Instead we learn a linear transform in description space that encodes rotations of the input image. We call this linear transform a steerer since it allows us to transform the descriptions as if the image was rotated. From representation theory we know all possible steerers for the rotation group. Steerers can be optimized (A) given a fixed descriptor (B) jointly with a descriptor or (C) we can optimize a descriptor given a fixed steerer. We perform experiments in these three settings and obtain state-of-the-art results on the rotation invariant image matching benchmarks AIMS and Roto-360.
-
Dataset distillation reduces the storage and computational consumption of training a network by generating a small surrogate dataset that encapsulates rich information of the original large-scale one. However previous distillation methods heavily rely on the sample-wise iterative optimization scheme. As the images-per-class (IPC) setting or image resolution grows larger the necessary computation will demand overwhelming time and resources. In this work we intend to incorporate generative diffusion techniques for computing the surrogate dataset. Observing that key factors for constructing an effective surrogate dataset are representativeness and diversity we design additional minimax criteria in the generative training to enhance these facets for the generated images of diffusion models. We present a theoretical model of the process as hierarchical diffusion control demonstrating the flexibility of the diffusion process to target these criteria without jeopardizing the faithfulness of the sample to the desired distribution. The proposed method achieves state-of-the-art validation performance while demanding much less computational resources. Under the 100-IPC setting on ImageWoof our method requires less than one-twentieth the distillation time of previous methods yet yields even better performance. Source code and generated data are available at https://github.com/vimar-gu/MinimaxDiffusion.
-
We introduce Posterior Distillation Sampling (PDS) a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods which leverage the powerful 2D prior of diffusion models to handle various parametric images have mainly focused on generation. Unlike generation editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source's identity. We demonstrate that this optimization resembles running a generative process with the target attribute but aligning this process with the trajectory of the source's generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces.
-
Human hands are highly articulated and versatile at handling objects. Jointly estimating the 3D poses of a hand and the object it manipulates from a monocular camera is challenging due to frequent occlusions. Thus existing methods often rely on intermediate 3D shape representations to increase performance. These representations are typically explicit such as 3D point clouds or meshes and thus provide information in the direct surroundings of the intermediate hand pose estimate. To address this we introduce HOISDF a Signed Distance Field (SDF) guided hand-object pose estimation network which jointly exploits hand and object SDFs to provide a global implicit representation over the complete reconstruction volume. Specifically the role of the SDFs is threefold: equip the visual encoder with implicit shape information help to encode hand-object interactions and guide the hand and object pose regression via SDF-based sampling and by augmenting the feature representations. We show that HOISDF achieves state-of-the-art results on hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available at https://github.com/amathislab/HOISDF.
-
In video super-resolution it is common to use a frame-wise alignment to support the propagation of information over time. The role of alignment is well-studied for low-level enhancement in video but existing works overlook a critical step -- resampling. We show through extensive experiments that for alignment to be effective the resampling should preserve the reference frequency spectrum while minimizing spatial distortions. However most existing works simply use a default choice of bilinear interpolation for resampling even though bilinear interpolation has a smoothing effect and hinders super-resolution. From these observations we propose an implicit resampling-based alignment. The sampling positions are encoded by a sinusoidal positional encoding while the value is estimated with a coordinate network and a window-based cross-attention. We show that bilinear interpolation inherently attenuates high-frequency information while an MLP-based coordinate network can approximate more frequencies. Experiments on synthetic and real-world datasets show that alignment with our proposed implicit resampling enhances the performance of state-of-the-art frameworks with minimal impact on both compute and parameters.
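The claim that bilinear interpolation attenuates high frequencies can be illustrated with a small 1D sketch. This is not the paper's method or experiment: it assumes a simplified setting in which a sinusoid is linearly resampled halfway between neighbouring samples, where the amplitude gain is known to be cos(pi * f) for a frequency of f cycles per sample. All function names are illustrative.

```python
import math

def half_sample_linear_resample(signal):
    """Linearly interpolate a 1D signal halfway between neighbouring samples."""
    return [(a + b) / 2.0 for a, b in zip(signal, signal[1:])]

def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def measured_gain(freq, n=2000):
    """RMS amplitude after resampling divided by RMS before, for a sinusoid
    at `freq` cycles per sample."""
    signal = [math.sin(2.0 * math.pi * freq * i) for i in range(n)]
    return rms(half_sample_linear_resample(signal)) / rms(signal)

# Compare the measured gain with the analytic value cos(pi * f).
for freq in (0.05, 0.25, 0.45):
    print(freq, round(measured_gain(freq), 3), round(math.cos(math.pi * freq), 3))
```

Low frequencies pass almost unchanged while frequencies near the Nyquist limit are strongly suppressed, which is the smoothing effect the abstract refers to.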
-
We present DiffPortrait3D a conditional diffusion model that is capable of synthesizing 3D-consistent photo-realistic novel views from as few as a single in-the-wild portrait. Specifically given a single RGB input we aim to synthesize plausible but consistent facial details rendered from novel camera views while retaining both identity and facial expression. In lieu of time-consuming optimization and fine-tuning our zero-shot method generalizes well to arbitrary face portraits with unposed camera views extreme facial expressions and diverse artistic depictions. At its core we leverage the generative prior of 2D diffusion models pre-trained on large-scale image datasets as our rendering backbone while the denoising is guided with disentangled attentive control of appearance and camera pose. To achieve this we first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose by watching a condition image of a crossed subject from the same view. Furthermore we insert a trainable cross-view attention module to enhance view consistency which is further strengthened with a novel 3D-aware noise generation process during inference. We demonstrate state-of-the-art results both qualitatively and quantitatively on our challenging in-the-wild and multi-view benchmarks.
-
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amounts of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data. Different from standard natural image datasets remote sensing data is acquired from various sensor technologies and exhibits a diverse range of scale variations as well as modalities. Existing satellite image pre-training methods either ignore the scale information present in the remote sensing imagery or restrict themselves to use only a single type of data modality. In this paper we re-visit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities. Our proposed approach named SatMAE++ performs multi-scale pre-training and utilizes convolution based upsampling blocks to reconstruct the image at higher scales making it extensible to include more scales. Compared to existing works the proposed SatMAE++ with multi-scale pre-training is equally effective for both optical as well as multi-spectral imagery. Extensive experiments on six datasets reveal the merits of the proposed contributions leading to state-of-the-art performance on all datasets. SatMAE++ achieves a mean average precision (mAP) gain of 2.5% for the multi-label classification task on the BigEarthNet dataset.
-
Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions where fine-grained predicates in captions are undesirably converted into coarse-grained predicates resulting in a long-tailed predicate distribution and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest where many triplets are discarded and not used in training leading to insufficient supervision. To tackle the two issues we propose a new approach i.e. Large Language Model for weakly-supervised SGG (LLM4SGG) where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG we conduct extensive experiments on Visual Genome and GQA datasets showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient enabling effective model training with a small amount of training images.
-
Parameter-efficient fine-tuning (PEFT) is an effective methodology to unleash the potential of large foundation models in novel scenarios with limited training data. In the computer vision community PEFT has shown effectiveness in image classification but little research has studied its ability for image segmentation. Fine-tuning segmentation models usually requires a heavier adjustment of parameters to align the proper projection directions in the parameter space for new scenarios. This raises a challenge to existing PEFT algorithms as they often inject a limited number of individual parameters into each block which prevents substantial adjustment of the projection direction of the parameter space due to the limitation of Hidden Markov Chain along blocks. In this paper we equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios. We introduce a novel inter-block communication module which integrates a learnable relation matrix to facilitate communication among different coefficient sets of each PEFT block's parameter space. Moreover we propose an intra-block enhancement module which introduces a linear projection head whose weights are generated from a hyper-complex layer further enhancing the impact of the adjustment of projection directions on the entire parameter space. Extensive experiments on diverse benchmarks demonstrate that our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.
-
Novel-view synthesis of specular objects like shiny metals or glossy paints remains a significant challenge. Not only the glossy appearance but also global illumination effects including reflections of other objects in the environment are critical components to faithfully reproduce a scene. In this paper we present Neural Directional Encoding (NDE) a view-dependent appearance encoding of neural radiance fields (NeRF) for rendering specular objects. NDE transfers the concept of feature-grid-based spatial encoding to the angular domain significantly improving the ability to model high-frequency angular signals. In contrast to previous methods that use encoding functions with only angular input we additionally cone-trace spatial features to obtain a spatially varying directional encoding which addresses the challenging interreflection effects. Extensive experiments on both synthetic and real datasets show that a NeRF model with NDE (1) outperforms the state of the art on view synthesis of specular objects and (2) works with small networks to allow fast (real-time) inference. The source code is available at: https://github.com/lwwu2/nde
-
We introduce a novel approach to single image denoising based on the Blind Spot Denoising principle which we call MAsked and SHuffled Blind Spot Denoising (MASH). We focus on the case of correlated noise which often plagues real images. MASH is the result of a careful analysis to determine the relationships between the level of blindness (masking) of the input and the (unknown) noise correlation. Moreover we introduce a shuffling technique to weaken the local correlation of noise which in turn yields an additional denoising performance improvement. We evaluate MASH via extensive experiments on real-world noisy image datasets. We demonstrate state-of-the-art results compared to existing self-supervised denoising methods.
-
Vision-Language Models (VLMs) have demonstrated impressive performance on zero-shot classification i.e. classification when provided merely with a list of class names. In this paper we tackle the case of zero-shot classification in the presence of unlabeled data. We leverage the graph structure of the unlabeled data and introduce ZLaP a method based on label propagation (LP) that utilizes geodesic distances for classification. We tailor LP to graphs containing both text and image features and further propose an efficient method for performing inductive inference based on a dual solution and a sparsification step. We perform extensive experiments to evaluate the effectiveness of our method on 14 common datasets and show that ZLaP outperforms the latest related works. Code: https://github.com/vladan-stojnic/ZLaP
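As background on the label propagation idea mentioned above, here is a minimal sketch of the classic iterative formulation on a small weighted graph. It is not ZLaP itself: the geodesic distances, the dual solution for inductive inference, and the sparsification step described in the abstract are not reproduced, and the toy graph, the function names, and the values of `alpha` and `iterations` are all assumptions chosen for illustration.

```python
import math

def label_propagation(weights, seeds, alpha=0.8, iterations=200):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y on a symmetric weighted graph.

    weights: n x n symmetric adjacency matrix (every node needs at least one edge)
    seeds:   n x c matrix Y with a 1 where a node is a known class anchor
    """
    n = len(weights)
    num_classes = len(seeds[0])
    degree = [sum(row) for row in weights]
    # Symmetrically normalised adjacency S = D^{-1/2} W D^{-1/2}.
    norm = [[weights[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    scores = [row[:] for row in seeds]
    for _ in range(iterations):
        scores = [[alpha * sum(norm[i][k] * scores[k][c] for k in range(n))
                   + (1.0 - alpha) * seeds[i][c]
                   for c in range(num_classes)]
                  for i in range(n)]
    return scores

def build_toy_graph():
    """Hypothetical graph: nodes 0 and 3 stand in for class-name (text) features,
    nodes 1, 2, 4 and 5 for unlabeled image features."""
    n = 6
    weights = [[0.0] * n for _ in range(n)]
    edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
             (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
             (2, 4, 0.1)]  # weak link between the two clusters
    for i, j, w in edges:
        weights[i][j] = w
        weights[j][i] = w
    seeds = [[0.0, 0.0] for _ in range(n)]
    seeds[0][0] = 1.0  # text node for class 0
    seeds[3][1] = 1.0  # text node for class 1
    return weights, seeds

weights, seeds = build_toy_graph()
scores = label_propagation(weights, seeds)
predictions = [max(range(2), key=lambda c: row[c]) for row in scores]
print(predictions)
```

Only the two text nodes carry a label, yet each image node ends up assigned to the class of the cluster it sits in, which is the intuition behind using unlabeled data in the zero-shot setting.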
-
DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person offering intuitive control over both pose and expression. We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. For coarse guidance of the expression and head pose we render a neural parametric head model (NPHM) from the target viewpoint which acts as a proxy geometry of the person. Additionally to enhance the modeling of intricate facial expressions we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally to synthesize consistent surface details across different viewpoints and expressions we rig learnable spatial features to the head's surface via TriPlane lookup in NPHM's canonical space. We train DiffusionAvatars on RGB videos and corresponding fitted NPHM meshes of a person and test the obtained avatars in both self-reenactment and animation scenarios. Our experiments demonstrate that DiffusionAvatars generates temporally consistent and visually appealing videos for novel poses and expressions of a person outperforming existing approaches.
-
Quantization for model compression can efficiently reduce the network complexity and storage requirement but the original training data is necessary to remedy the performance loss caused by quantization. The Data-Free Quantization (DFQ) methods have been proposed to handle the absence of original training data with synthetic data. However there are differences between the synthetic and original training data which affects the performance of the quantized network but none of the existing methods considers the differences. In this paper we propose an efficient data-free quantization via pseudo-label filtering which is the first to evaluate the synthetic data before quantization. We design a new metric for evaluating synthetic data using self-entropy which indicates the reliability of synthetic data. The synthetic data can be categorized with the metric into high- and low-reliable datasets for the following training process. Besides the multiple pseudo-labels are designed to label the synthetic data with different reliability which can provide valuable supervision information and avoid misleading training by low-reliable samples. Extensive experiments are implemented on several datasets including CIFAR-10 CIFAR-100 and ImageNet with various models. The experimental results show that our method can perform excellently and outperform existing methods in accuracy.
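The self-entropy idea can be sketched as follows. This is only an illustration under assumptions not stated in the abstract: that the metric is the Shannon entropy of a model's predicted class distribution, that lower entropy means higher reliability, and that a single hypothetical `threshold` splits the samples. The paper's actual metric and pseudo-label design may differ.

```python
import math


def self_entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def split_by_reliability(predictions, threshold):
    """Partition sample indices into (high_reliability, low_reliability).

    Assumption for this sketch: a confident (low-entropy) prediction marks
    a reliable synthetic sample.
    """
    high, low = [], []
    for index, probabilities in enumerate(predictions):
        if self_entropy(probabilities) <= threshold:
            high.append(index)
        else:
            low.append(index)
    return high, low


# Hypothetical softmax outputs for two synthetic samples.
example_predictions = [[0.97, 0.02, 0.01], [0.4, 0.3, 0.3]]
high_idx, low_idx = split_by_reliability(example_predictions, threshold=0.5)
```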
-
Global translation estimation is a highly challenging step in the global structure from motion (SfM) algorithm. Many existing methods depend solely on relative translations leading to inaccuracies in low parallax scenes and degradation under collinear camera motion. While recent approaches aim to address these issues by incorporating feature tracks into objective functions they are often sensitive to outliers. In this paper we first revisit global translation estimation methods with feature tracks and categorize them into explicit and implicit methods. Then we highlight the superiority of the objective function based on the cross-product distance metric and propose a novel explicit global translation estimation framework that integrates both relative translations and feature tracks as input. To enhance the accuracy of input observations we re-estimate relative translations with the coplanarity constraint of the epipolar plane and propose a simple yet effective strategy to select reliable feature tracks. Finally the effectiveness of our approach is demonstrated through experiments on urban image sequences and unordered Internet images showcasing its superior accuracy and robustness compared to many state-of-the-art techniques.
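To make the cross-product distance concrete, the sketch below evaluates one common form of such a residual: the norm of the cross product between a unit relative-translation direction and the baseline between two camera centers. This form and all function names are assumptions for illustration; the abstract does not spell out the paper's exact objective, which also involves feature tracks.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def cross_product_distance(direction, center_i, center_j):
    """Norm of unit(direction) x (center_j - center_i).

    The value is zero when the baseline is parallel to the observed
    direction, and otherwise equals the perpendicular offset of the
    baseline from that direction.
    """
    baseline = tuple(cj - ci for ci, cj in zip(center_i, center_j))
    length = math.sqrt(sum(d * d for d in direction))
    unit = tuple(d / length for d in direction)
    return math.sqrt(sum(v * v for v in cross(unit, baseline)))


aligned = cross_product_distance((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
offset = cross_product_distance((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 3.0, 0.0))
```

Unlike an angular error, this residual is expressed in scene units of length.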
-
Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer the pixel-wise knowledge from the labeled source domain to the unlabeled target domain. However current UDA methods typically assume a shared label space between source and target limiting their applicability in real-world scenarios where novel categories may emerge in the target domain. In this paper we introduce Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) for the first time where the target domain includes unknown classes. We identify two major problems in the OSDA-SS scenario as follows: 1) the existing UDA methods struggle to predict the exact boundary of the unknown classes and 2) they fail to accurately predict the shape of the unknown classes. To address these issues we propose Boundary and Unknown Shape-Aware open-set domain adaptation coined BUS. Our BUS can accurately discern the boundaries between known and unknown classes in a contrastive manner using a novel dilation-erosion-based contrastive loss. In addition we propose OpenReMix a new domain mixing augmentation method that guides our model to effectively learn domain and size-invariant features for improving the shape detection of the known and unknown classes. Through extensive experiments we demonstrate that our proposed BUS effectively detects unknown classes in the challenging OSDA-SS scenario compared to the previous methods by a large margin.
-
We present a method that uses a text-to-image model to generate consistent content across multiple image scales enabling extreme semantic zooms into a scene e.g. ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting and show that our method is most effective at generating consistent multi-scale content.
-
This paper introduces a novel top-down representation approach for deformable image registration which estimates the deformation field by capturing various short- and long-range flow features at different scale levels. As a Hierarchical Vision Transformer (H-ViT) we propose a dual self-attention and cross-attention mechanism that uses high-level features in the deformation field to represent low-level ones enabling information streams in the deformation field across all voxel patch embeddings irrespective of their spatial proximity. Since high-level features contain abstract flow patterns such patterns are expected to effectively contribute to the representation of the deformation field in lower scales. When the self-attention module utilizes within-scale short-range patterns for representation the cross-attention modules dynamically look for the key tokens across different scales to further interact with the local query voxel patches. Our method shows superior accuracy and visual quality over the state-of-the-art registration methods in five publicly available datasets highlighting a substantial enhancement in the performance of medical imaging registration. The project link is available at https://mogvision.github.io/hvit.
-
Contrastive learning has emerged as a promising paradigm for 3D open-world understanding i.e. aligning point cloud representation to image and text embedding space individually. In this paper we introduce MixCon3D a simple yet effective method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training. In contrast to point cloud only we develop the 3D object-level representation from complementary perspectives e.g. multi-view rendered images with the point cloud. Then MixCon3D performs language-3D contrastive learning comprehensively depicting real-world 3D objects and bolstering text alignment. Additionally we pioneer the first thorough investigation of various training recipes for the 3D contrastive learning paradigm building a solid baseline with improved performance. Extensive experiments conducted on three representative benchmarks reveal that our method significantly improves over the baseline surpassing the previous state-of-the-art performance on the challenging 1156-category Objaverse-LVIS dataset by 5.7%. The versatility of MixCon3D is showcased in applications such as text-to-3D retrieval and point cloud captioning further evidencing its efficacy in diverse scenarios. The code is available at https://github.com/UCSC-VLAA/MixCon3D.
-
Infrared and visible image fusion aims to generate a fused image by integrating and distinguishing complementary information from multiple sources. While the cross-attention mechanism with global spatial interactions appears promising it only captures second-order spatial interactions neglecting higher-order interactions in both spatial and channel dimensions. This limitation hampers the exploitation of synergies between multi-modalities. To bridge this gap we introduce a Synergistic High-order Interaction Paradigm (SHIP) designed to systematically investigate spatial fine-grained and global statistics collaborations between infrared and visible images across two fundamental dimensions: 1) Spatial dimension: we construct spatial fine-grained interactions through element-wise multiplication mathematically equivalent to global interactions and then foster high-order formats by iteratively aggregating and evolving complementary information enhancing both efficiency and flexibility. 2) Channel dimension: expanding on channel interactions with first-order statistics (mean) we devise high-order channel interactions to facilitate the discernment of inter-dependencies between source images based on global statistics. Harnessing high-order interactions significantly enhances our model's ability to exploit multi-modal synergies leading to superior performance over state-of-the-art alternatives as shown through comprehensive experiments across various benchmarks.
-
Large Language Models (LLMs) have been enhanced with vision capabilities enabling them to comprehend images videos and interleaved vision-language content. However the learning methods of these large multimodal models (LMMs) typically treat videos as predetermined clips rendering them less effective and efficient at handling streaming video inputs. In this paper we propose a novel Learning-In-Video-Stream (LIVE) framework which enables temporally aligned long-context and real-time dialogue within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format and (3) an optimized inference pipeline to speed up interactive chat in real-world video streams. With our LIVE framework we develop a simplified model called VideoLLM-online and demonstrate its significant advantages in processing streaming videos. For instance our VideoLLM-online-7B model can operate at over 10 FPS on an A100 GPU for a 5-minute video clip from Ego4D narration. Moreover VideoLLM-online also showcases state-of-the-art performance on public offline video benchmarks such as recognition captioning and forecasting. The code model data and demo have been made available at showlab.github.io/videollm-online.
-
With the advent of generative models and vision language pretraining significant improvement has been made in text-driven face manipulation. The text embedding can be used as target supervision for expression control. However it is non-trivial to associate with its 3D attributes i.e. pose and illumination. To address these issues we propose a Text-conditional Attribute aLignment approach for 3D controllable face image synthesis and our model is referred to as TcALign. Specifically since the 3D rendered image can be precisely controlled with the 3D face representation we first propose a Text-conditional 3D Editor to produce the target face representation to realize text-driven manipulation in the 3D space. An attribute embedding space spanned by the target-related attribute embeddings is also introduced to infer the disentangled task-specific direction. Next we train a cross-modal latent mapping network conditioned on the derived difference of 3D representation to infer a correction vector in the latent space of StyleGAN. This correction vector learning design can accurately transfer the attribute manipulation on 3D images to 2D images. We show that the proposed method delivers more precise text-driven multi-attribute manipulation for 3D controllable face image synthesis. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method over the other competing methods.
-
In this paper we tackle the task of category-agnostic pose estimation (CAPE) which aims to predict poses for objects of any category with few annotated samples. Previous works either rely on local matching between features of support and query samples or require support keypoint identifiers. The former is prone to overfitting due to its sensitivity to sparse samples while the latter is impractical for the open-world nature of the task. To overcome these limitations we propose ESCAPE - a Bayesian framework that learns a prior over the features of keypoints. The prior can be expressed as a mixture of super-keypoints each being a high-level abstract keypoint that captures the statistics of semantically related keypoints from different categories. We estimate the super-keypoints from base categories and use them in adaptation to novel categories. The adaptation to an unseen category involves two steps: first we match each novel keypoint to a related super-keypoint; and second we transfer the knowledge encoded in the matched super-keypoints to the novel keypoints. For the first step we propose a learnable matching network to capture the relationship between the novel keypoints and the super-keypoints resulting in a more reliable matching. ESCAPE mitigates overfitting by directly transferring learned knowledge to novel categories while it does not use keypoint identifiers. We achieve state-of-the-art performance on the standard MP-100 benchmark.
-
Despite diffusion models' superior capabilities in modeling complex distributions there are still non-trivial distributional discrepancies between generated and ground-truth images which has resulted in several notable problems in image generation including missing object errors in text-to-image generation and low image quality. Existing methods that attempt to address these problems mostly do not tend to address the fundamental cause behind these problems which is the distributional discrepancies and hence achieve sub-optimal results. In this paper we propose a particle filtering framework that can effectively address both problems by explicitly reducing the distributional discrepancies. Specifically our method relies on a set of external guidance including a small set of real images and a pre-trained object detector to gauge the distribution gap and then design the resampling weight accordingly to correct the gap. Experiments show that our methods can effectively correct missing object errors and improve image quality in various image generation tasks. Notably our method outperforms the existing strongest baseline by 5% in object occurrence and 1.0 in FID on MS-COCO. Our code is available at https://github.com/UCSB-NLP-Chang/diffusion_resampling.git.
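The resampling step at the heart of a particle filter can be sketched generically as below. This is plain weighted resampling with replacement, not the paper's method: how the weights are derived from real images and an object detector is not shown, and the particle names, weights, and seed are hypothetical.

```python
import random


def resample(particles, weights, seed=0):
    """Draw len(particles) particles with replacement, proportional to weight.

    Particles with larger weights tend to be duplicated, and particles
    with zero weight are never selected.
    """
    rng = random.Random(seed)
    return rng.choices(particles, weights=weights, k=len(particles))


# Hypothetical case: only the second particle matches the target distribution.
survivors = resample(["sample_a", "sample_b", "sample_c"], [0.0, 1.0, 0.0])
```

A fixed seed is used here only to make the sketch reproducible.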
-
Vision-language (VL) models have achieved unprecedented success recently in which the connection module is the key to bridge the modality gap. Nevertheless the abundant visual clues are not sufficiently exploited in most existing methods. On the vision side most existing approaches only use the last feature of the vision tower without using the low-level features. On the language side most existing methods only introduce shallow vision-language interactions. In this paper we present a vision-inspired vision-language connection module dubbed as VIVL which efficiently exploits the vision cue for VL models. To take advantage of the lower-level information from the vision tower a feature pyramid extractor (FPE) is introduced to combine features from different intermediate layers which enriches the visual cue with negligible parameters and computation overhead. To enhance VL interactions we propose deep vision-conditioned prompts (DVCP) that allow deep interactions of vision and language features efficiently. Our VIVL exceeds the previous state-of-the-art method by 18.1 CIDEr when training from scratch on the COCO caption task which greatly improves the data efficiency. When used as a plug-in module VIVL consistently improves the performance for various backbones and VL frameworks delivering new state-of-the-art results on multiple benchmarks e.g. NoCaps and VQAv2.
-
Monocular 3D object detection poses a significant challenge in 3D scene understanding due to its inherently ill-posed nature in monocular depth estimation. Existing methods heavily rely on supervised learning using abundant 3D labels typically obtained through expensive and labor-intensive annotation on LiDAR point clouds. To tackle this problem we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) to train 3D object detectors without any 3D supervision but only weak 2D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering we decompose the SDF of each instance into the SDF of a cuboid and the residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset demonstrating that our method outperforms the existing weakly supervised 3D object detection methods. The code is available at https://github.com/skmhrk1209/VSRD.
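The decomposition into a cuboid SDF plus a residual can be illustrated with the standard closed-form signed distance to a box. The sketch assumes an axis-aligned cuboid centered at the origin and takes the residual as an arbitrary callable; the function names are hypothetical, and the paper's box parameterization, learned residual field, and rendering are not reproduced here.

```python
import math


def cuboid_sdf(point, half_extents):
    """Signed distance from a 3D point to an origin-centered, axis-aligned box.

    Negative inside the box, zero on its surface, positive outside.
    """
    q = [abs(p) - h for p, h in zip(point, half_extents)]
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    inside = min(max(q), 0.0)
    return outside + inside


def instance_sdf(point, half_extents, residual):
    """Cuboid distance plus a residual term supplied by the caller."""
    return cuboid_sdf(point, half_extents) + residual(point)


center_distance = cuboid_sdf((0.0, 0.0, 0.0), (1.0, 2.0, 3.0))
corner_distance = cuboid_sdf((4.0, 6.0, 0.0), (1.0, 2.0, 3.0))
shrunk = instance_sdf((2.0, 0.0, 0.0), (1.0, 2.0, 3.0), lambda p: 0.25)
```

Because the cuboid term is written directly in terms of the box dimensions, any loss on the resulting surface can be traced back to those dimensions.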
-
We leverage Large Language Models (LLMs) for zero-shot Semantic Audio Visual Navigation (SAVN). Existing methods utilize extensive training demonstrations for reinforcement learning yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals further poses additional obstacles to inferring the goal information. To address this challenge we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data we instruct an LLM-based planner to actively explore the environment. During the exploration our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally we introduce an auxiliary LLM-based assistant to enhance global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis we show that our method outperforms relevant baselines without training demonstrations from the environment and complementary semantic information.
-
The Segment Anything Model (SAM) a prompt-driven foundational model has demonstrated remarkable performance in natural image segmentation. However its application in video camouflaged object detection (VCOD) encounters challenges chiefly stemming from the overlooked temporal-spatial associations and the unreliability of user-provided prompts for camouflaged objects that are difficult to discern with the naked eye. To tackle the above issues we endow SAM with keen eyes and propose the Temporal-spatial Prompt SAM (TSP-SAM) a novel approach tailored for VCOD via an ingenious prompted learning scheme. Firstly motion-driven self-prompt learning is employed to capture the camouflaged object thereby bypassing the need for user-provided prompts. With the detected subtle motion cues across consecutive video frames the overall movement of the camouflaged object is captured for more precise spatial localization. Subsequently to eliminate the prompt bias resulting from inter-frame discontinuities the long-range consistency within the video sequences is taken into account to promote the robustness of the self-prompts. It is also injected into the encoder of SAM to enhance the representational capabilities. Extensive experimental results on two benchmarks demonstrate that the proposed TSP-SAM achieves a significant improvement over the state-of-the-art methods. With the mIoU metric increasing by 7.8% and 9.6% TSP-SAM emerges as a groundbreaking step forward in the field of VCOD.
-
Parkinson's disease (PD) is a devastating movement disorder accelerating in global prevalence but a lack of precision symptom measurement has made the development of effective therapies challenging. The Unified Parkinson's Disease Rating Scale (UPDRS) is the gold-standard for assessing motor symptom severity yet its manual scoring criteria are vague and subjective resulting in coarse and noisy clinical assessments. Machine learning approaches have the potential to modernize PD symptom assessments by making them more quantitative objective and scalable. However the lack of benchmark video datasets for PD motor exams hinders model development. Here we introduce the TULIP dataset to bridge this gap. TULIP emphasizes precision and comprehensiveness comprising multi-view video recordings (6 cameras) of all 25 UPDRS motor exam components together with ratings by 3 clinical experts in a cohort of Parkinson's patients and healthy controls. The multi-view recordings enable 3D reconstructions of body movement that better capture disease signatures than more conventional 2D methods. Using the dataset we establish a baseline model for predicting UPDRS scores from 3D poses illustrating how existing diagnostics could be automated. Looking ahead TULIP could aid the development of new precision diagnostics that transcend UPDRS scores providing a deeper understanding of PD and its potential treatments.
-
Neural radiance fields provide state-of-the-art view synthesis quality but tend to be slow to render. One reason is that they make use of volume rendering thus requiring many samples (and model queries) per ray at render time. Although this representation is flexible and easy to optimize most real-world objects can be modeled more efficiently with surfaces instead of volumes requiring far fewer samples per ray. This observation has spurred considerable progress in surface representations such as signed distance functions but these may struggle to model semi-opaque and thin structures. We propose a method HybridNeRF that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. We evaluate HybridNeRF against the challenging Eyeful Tower dataset along with other commonly used view synthesis datasets. When comparing to state-of-the-art baselines including recent rasterization-based approaches we improve error rates by 15-30% while achieving real-time framerates (at least 36 FPS) for virtual-reality resolutions (2K x 2K).
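The cost of volume rendering comes from compositing many samples along each ray. The sketch below shows the standard quadrature weights used in NeRF-style rendering, with a toy ray whose densities and step sizes are made up for illustration; it is not HybridNeRF's renderer.

```python
import math


def render_weights(densities, deltas):
    """Per-sample compositing weights along one ray.

    weight_i = T_i * (1 - exp(-sigma_i * delta_i)), where T_i is the
    transmittance accumulated before sample i.
    """
    weights = []
    transmittance = 1.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


# Toy ray: empty space, then an effectively opaque surface, then more matter.
surface_like = render_weights([0.0, 0.0, 1e9, 5.0], [0.1, 0.1, 0.1, 0.1])
```

When the density behaves like a hard surface, a single sample carries all of the weight, which is why surface-like regions need far fewer samples per ray than semi-transparent ones.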
-
Extracting planes from a 3D scene is useful for downstream tasks in robotics and augmented reality. In this paper we tackle the problem of estimating the planar surfaces in a scene from posed images. Our first finding is that a surprisingly competitive baseline results from combining popular clustering algorithms with recent improvements in 3D geometry estimation. However such purely geometric methods are understandably oblivious to plane semantics which are crucial to discerning distinct planes. To overcome this limitation we propose a method that predicts multi-view consistent plane embeddings that complement geometry when clustering points into planes. We show through extensive evaluation on the ScanNetV2 dataset that our new method outperforms existing approaches and our strong geometric baseline for the task of plane estimation.
-
In this paper we study the problem of generalizable synthetic image detection aiming to detect forgery images from diverse generative methods e.g. GANs and diffusion models. Cutting-edge solutions start to explore the benefits of pre-trained models and mainly follow the fixed paradigm of solely training an attached classifier e.g. combining frozen CLIP-ViT with a learnable linear layer in UniFD. However our analysis shows that such a fixed paradigm is prone to yield detectors with insufficient learning regarding forgery representations. We attribute the key challenge to the lack of forgery adaptation and present a novel forgery-aware adaptive transformer approach namely FatFormer. Based on the pre-trained vision-language spaces of CLIP FatFormer introduces two core designs for the adaption to build generalized forgery representations. First motivated by the fact that both image and frequency analysis are essential for synthetic image detection we develop a forgery-aware adapter to adapt image features to discern and integrate local forgery traces within image and frequency domains. Second we find that considering the contrastive objectives between adapted image features and text prompt embeddings a previously overlooked aspect results in a nontrivial generalization improvement. Accordingly we introduce language-guided alignment to supervise the forgery adaptation with image and text prompts in FatFormer. Experiments show that by coupling these two designs our approach tuned on 4-class ProGAN data attains a remarkable detection performance achieving an average of 98% accuracy to unseen GANs and surprisingly generalizes to unseen diffusion models with 95% accuracy.
-
Human Mesh Recovery (HMR) aims to estimate the 3D human body from 2D images which is a challenging task due to inherent ambiguities in translating 2D observations to 3D space. A novel approach called PostureHMR is proposed to leverage a multi-step diffusion-style process which converts this task into a posture transformation from an SMPL T-pose mesh to the target mesh. To inject the learning process of posture transformation with the physical structure of the human body model a kinematics-based forward process is proposed to interpolate the intermediate state with pose and shape decomposition. Moreover a mesh-to-posture (M2P) decoder is designed by combining the input of 3D and 2D mesh constraints estimated from the image to model the posture changes in the reverse process. It mitigates the difficulties of posture change learning directly from RGB pixels. To overcome the limitation of pixel-level misalignment of modeling results with the input image a new trimap-based rendering loss is designed to highlight the areas with poor recognition. Experiments conducted on three widely used datasets demonstrate that the proposed approach outperforms the state-of-the-art methods.
-
This paper presents an innovative framework designed to train an image deblurring algorithm tailored to a specific camera device. This algorithm works by transforming a blurry input image which is challenging to deblur into another blurry image that is more amenable to deblurring. The transformation process from one blurry state to another leverages unpaired data consisting of sharp and blurry images captured by the target camera device. Learning this blur-to-blur transformation is inherently simpler than direct blur-to-sharp conversion as it primarily involves modifying blur patterns rather than the intricate task of reconstructing fine image details. The efficacy of the proposed approach has been demonstrated through comprehensive experiments on various benchmarks where it significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Our code and data are available at https://github.com/VinAIResearch/Blur2Blur
-
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models. However existing methods for model adaptation usually update all model parameters i.e. full fine-tuning paradigm which is inefficient as it relies on high computational costs (e.g. training GPU memory) and massive storage space. In this paper we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency. To achieve this goal we freeze the parameters of the default pre-trained models and then propose the Dynamic Adapter which generates a dynamic scale for each token considering the token significance to the downstream task. We further seamlessly integrate Dynamic Adapter with Prompt Tuning (DAPT) by constructing Internal Prompts capturing the instance-specific features for interaction. Extensive experiments conducted on five challenging datasets demonstrate that the proposed DAPT achieves superior performance compared to the full fine-tuning counterparts while significantly reducing the trainable parameters and training GPU memory by 95% and 35% respectively. Code is available at https://github.com/LMD0311/DAPT.
-
To build a cross-modal latent space between 3D human motion and language acquiring large-scale and high-quality human motion data is crucial. However unlike the abundance of image data the scarcity of motion data has limited the performance of existing motion-language models. To counter this we introduce "motion patches" a new representation of motion sequences and propose using Vision Transformers (ViT) as motion encoders via transfer learning aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches created by dividing and sorting skeleton joints based on body parts in motion sequences are robust to varying skeleton structures and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches used jointly with ViT achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval and other novel challenging tasks such as cross-skeleton recognition zero-shot motion classification and human interaction recognition which are currently impeded by the lack of data.
-
Eliminating image blur produced by various kinds of motion has been a challenging problem. Dominant approaches rely heavily on model capacity to remove blurring by reconstructing residual from blurry observation in feature space. These practices not only prevent the capture of spatially variable motion in the real world but also ignore the tailored handling of various motions in image space. In this paper we propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative (MISC) Filter. In particular we use a motion estimation network to capture motion information from neighborhoods thereby adaptively estimating spatially-variant motion flow mask kernels weights and offsets to obtain the MISC Filter. The MISC Filter first aligns the motion-induced blurring patterns to the motion middle along the predicted flow direction and then collaboratively filters the aligned image through the predicted kernels weights and offsets to generate the output. This design can handle more generalized and complex motion in a spatially differentiated manner. Furthermore we analyze the relationships between the motion estimation network and the residual reconstruction network. Extensive experiments on four widely used benchmarks demonstrate that our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance. Code is available at https://github.com/ChengxuLiu/MISCFilter.
-
Simulation is an invaluable tool for radio-frequency system designers that enables rapid prototyping of various algorithms for imaging target detection classification and tracking. However simulating realistic radar scans is a challenging task that requires an accurate model of the scene radio frequency material properties and a corresponding radar synthesis function. Rather than specifying these models explicitly we propose DART - Doppler Aided Radar Tomography a Neural Radiance Field-inspired method which uses radar-specific physics to create a reflectance and transmittance-based rendering pipeline for range-Doppler images. We then evaluate DART by constructing a custom data collection platform and collecting a novel radar dataset together with accurate position and instantaneous velocity measurements from lidar-based localization. In comparison to state-of-the-art baselines DART synthesizes superior radar range-Doppler images from novel views across all datasets and additionally can be used to generate high quality tomographic images.
-
In this work, we introduce Wonder3D, a novel method for generating high-fidelity textured meshes from single-view images with remarkable efficiency. Recent methods based on the Score Distillation Sampling (SDS) loss have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure consistency, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations in only 2 to 3 minutes. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and remarkable efficiency compared to prior works.
-
Real-world vision tasks frequently suffer from the appearance of unexpected adverse weather conditions, including rain, haze, snow, and raindrops. In the last decade, convolutional neural networks and vision transformers have yielded outstanding results in single-weather video removal. However, due to the absence of appropriate adaptation, most of them fail to generalize to other weather conditions. Although ViWS-Net is proposed to remove adverse weather conditions in videos with a single set of pre-trained weights, it is seriously blinded by the weather seen at training time and degenerates when it encounters unseen weather at test time. In this work, we introduce test-time adaptation into adverse weather removal in videos and propose the first framework that integrates test-time adaptation into the iterative diffusion reverse process. Specifically, we devise a diffusion-based network with a novel temporal noise model to efficiently explore frame-correlated information in degraded video clips at the training stage. During the inference stage, we introduce a proxy task named Diffusion Tubelet Self-Calibration to learn the primer distribution of the test video stream and optimize the model by approximating the temporal noise model for online adaptation. Experimental results on benchmark datasets demonstrate that our Test-Time Adaptation method with Diffusion-based network (Diff-TTA) outperforms state-of-the-art methods in terms of restoring videos degraded by seen weather conditions. Its generalizable capability is validated with unseen weather conditions in synthesized and real-world videos.
-
With the growing size of pre-trained models full fine-tuning and storing all the parameters for various downstream tasks is costly and infeasible. In this paper we propose a new parameter-efficient fine-tuning method Gradient-based Parameter Selection (GPS) demonstrating that only tuning a few selected parameters from the pre-trained model while keeping the remainder of the model frozen can generate similar or better performance compared with the full model fine-tuning method. Different from the existing popular and state-of-the-art parameter-efficient fine-tuning approaches our method does not introduce any additional parameters and computational costs during both the training and inference stages. Another advantage is the model-agnostic and non-destructive property which eliminates the need for any other design specific to a particular model. Compared with the full fine-tuning GPS achieves 3.33% (91.78% vs. 88.45% FGVC) and 9.61% (73.1% vs. 65.57% VTAB) improvement of the accuracy with tuning only 0.36% parameters of the pre-trained model on average over 24 image classification tasks; it also demonstrates a significant improvement of 17% and 16.8% in mDice and mIoU respectively on medical image segmentation task. Moreover GPS achieves state-of-the-art performance compared with existing PEFT methods. The code will be available in https://github.com/FightingFighting/GPS.git.
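The selection idea in the abstract above can be illustrated with a minimal sketch. It assumes that parameters are ranked by gradient magnitude and that only the top `k` are updated while the rest stay frozen; the abstract does not spell out the exact criterion, and the names `select_top_k_mask` and `masked_sgd_step` are hypothetical, not taken from the GPS code base.

```python
from typing import List


def select_top_k_mask(grads: List[float], k: int) -> List[bool]:
    """Return a mask that is True for the k entries with the largest |gradient|.

    Assumption: ranking by absolute gradient value stands in for the
    paper's selection rule, which the abstract does not detail.
    """
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(grads))]


def masked_sgd_step(
    params: List[float], grads: List[float], mask: List[bool], lr: float = 0.1
) -> List[float]:
    """Apply a plain SGD update to selected parameters; frozen ones are unchanged."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


# Toy example: four "pre-trained" parameters, only one is tuned.
params = [0.5, -1.0, 2.0, 0.1]
grads = [0.01, -0.9, 0.3, 0.0]
mask = select_top_k_mask(grads, k=1)
updated = masked_sgd_step(params, grads, mask)
```

Because the mask only gates updates to existing weights, this sketch adds no new parameters, which matches the property the abstract emphasizes.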
-
Protein representation learning is a challenging task that aims to capture the structure and function of proteins from their amino acid sequences. Previous methods largely ignored the fact that not all amino acids are equally important for protein folding and activity. In this article we propose a neural clustering framework that can automatically discover the critical components of a protein by considering both its primary and tertiary structure information. Our framework treats a protein as a graph where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids. We then apply an iterative clustering strategy to group the nodes into clusters based on their 1D and 3D positions and assign scores to each cluster. We select the highest-scoring clusters and use their medoid nodes for the next iteration of clustering until we obtain a hierarchical and informative representation of the protein. We evaluate on four protein-related tasks: protein fold classification enzyme reaction classification gene ontology term prediction and enzyme commission number prediction. Experimental results demonstrate that our method achieves state-of-the-art performance.
-
This paper presents a simple but performant semi-supervised semantic segmentation approach called CorrMatch. Previous approaches mostly employ complicated training strategies to leverage unlabeled data but overlook the role of correlation maps in modeling the relationships between pairs of locations. We observe that the correlation maps not only enable clustering pixels of the same category easily but also contain good shape information which previous works have omitted. Motivated by these we aim to improve the use efficiency of unlabeled data by designing two novel label propagation strategies. First we propose to conduct pixel propagation by modeling the pairwise similarities of pixels to spread the high-confidence pixels and dig out more. Then we perform region propagation to enhance the pseudo labels with accurate class-agnostic masks extracted from the correlation maps. CorrMatch achieves great performance on popular segmentation benchmarks. Taking the DeepLabV3+ with ResNet-101 backbone as our segmentation model we receive a 76%+ mIoU score on the Pascal VOC 2012 dataset with only 92 annotated images. Code is available at https://github.com/BBBBchan/CorrMatch .
-
Estimating large extreme inter-image rotations is critical for numerous computer vision domains involving images related by limited or non-overlapping fields of view. In this work we propose an attention-based approach with a pipeline of novel algorithmic components. First as rotation estimation pertains to image pairs we introduce an inter-image distillation scheme using Decoders to improve embeddings. Second whereas contemporary methods compute a 4D correlation volume (4DCV) encoding inter-image relationships we propose an Encoder-based cross-attention approach between activation maps to compute an enhanced equivalent of the 4DCV. Finally we present a cascaded Decoder-based technique for alternately refining the cross-attention and the rotation query. Our approach outperforms current state-of-the-art methods on extreme rotation estimation. We make our code publicly available.
-
Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normal maps, leading to instability in optimization. In this paper, recognizing that normal and depth information effectively describe scene geometry and can be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with the generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is at https://aigc3d.github.io/richdreamer/.
-
Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration
Transformer-based approaches have achieved promising performance in image restoration tasks given their ability to model long-range dependencies which is crucial for recovering clear images. Though diverse efficient attention mechanism designs have addressed the intensive computations associated with using transformers they often involve redundant information and noisy interactions from irrelevant regions by considering all available tokens. In this work we propose an Adaptive Sparse Transformer (AST) to mitigate the noisy interactions of irrelevant areas and remove feature redundancy in both spatial and channel domains. AST comprises two core designs i.e. an Adaptive Sparse Self-Attention (ASSA) block and a Feature Refinement Feed-forward Network (FRFN). Specifically ASSA is adaptively computed using a two-branch paradigm where the sparse branch is introduced to filter out the negative impacts of low query-key matching scores for aggregating features while the dense one ensures sufficient information flow through the network for learning discriminative representations. Meanwhile FRFN employs an enhance-and-ease scheme to eliminate feature redundancy in channels enhancing the restoration of clear latent images. Experimental results on commonly used benchmarks have demonstrated the versatility and competitive performance of our method in several tasks including rain streak removal real haze removal and raindrop removal. The code and pre-trained models are available at https://github.com/joshyZhou/AST.
-
Rigging and skinning clothed human avatars is a challenging task and traditionally requires a lot of manual work and expertise. Recent methods addressing it either generalize across different characters or focus on capturing the dynamics of a single character observed under different pose configurations. However the former methods typically predict solely static skinning weights which perform poorly for highly articulated poses and the latter ones either require dense 3D character scans in different poses or cannot generate an explicit mesh with vertex correspondence over time. To address these challenges we propose a fully automated approach for creating a fully rigged character with pose-dependent skinning weights which can be solely learned from multi-view video. Therefore we first acquire a rigged template which is then statically skinned. Next a coordinate-based MLP learns a skinning weights field parameterized over the position in a canonical pose space and the respective pose. Moreover we introduce our pose- and view-dependent appearance field allowing us to differentiably render and supervise the posed mesh using multi-view imagery. We show that our approach outperforms state-of-the-art while not relying on dense 4D scans. More details can be found on our project page.
-
Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a curated collection of datasets containing abundant entity relationships. Experiments demonstrate that our approach improves visual grounding performance by up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.
-
Prompt learning has emerged as an effective and data-efficient technique in large Vision-Language Models (VLMs). However when adapting VLMs to specialized domains such as remote sensing and medical imaging domain prompt learning remains underexplored. While large-scale domain-specific foundation models can help tackle this challenge their concentration on a single vision level makes it challenging to prompt both vision and language modalities. To overcome this we propose to leverage domain-specific knowledge from domain-specific foundation models to transfer the robust recognition ability of VLMs from generalized to specialized domains using quaternion networks. Specifically the proposed method involves using domain-specific vision features from domain-specific foundation models to guide the transformation of generalized contextual embeddings from the language branch into a specialized space within the quaternion networks. Moreover we present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features. In this way quaternion networks can effectively mine the intermodal relationships in the specific domain facilitating domain-specific vision-language contrastive learning. Extensive experiments on domain-specific datasets show that our proposed method achieves new state-of-the-art results in prompt learning.
-
The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative particularly for embodied AI tasks they often fall short for computer vision tasks due to low asset and rendering quality limited diversity and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS) a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models based on the newly developed embodied AI benchmark BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g. lighting object placement) the object level (e.g. joint configuration attributes such as "filled" and "folded") and the camera level (e.g. field of view focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift evaluating scene understanding models on the same set of images and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/
-
Recent advancements in 3D reconstruction from single images have been driven by the evolution of generative models. Prominent among these are methods based on Score Distillation Sampling (SDS) and the adaptation of diffusion models in the 3D domain. Despite their progress these techniques often face limitations due to slow optimization or rendering processes leading to extensive training and optimization times. In this paper we introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference. Our method utilizes two transformer-based networks namely a point decoder and a triplane decoder to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation. This hybrid representation strikes a balance achieving a faster rendering speed compared to implicit representations while simultaneously delivering superior rendering quality than explicit representations. The point decoder is designed for generating point clouds from single images offering an explicit representation which is then utilized by the triplane decoder to query Gaussian features for each point. This design choice addresses the challenges associated with directly regressing explicit 3D Gaussian attributes characterized by their non-structural nature. Subsequently the 3D Gaussians are decoded by an MLP to enable rapid rendering through splatting. Both decoders are built upon a scalable transformer-based architecture and have been efficiently trained on large-scale 3D datasets. The evaluations conducted on both synthetic datasets and real-world images demonstrate that our method not only achieves higher quality but also ensures a faster runtime in comparison to previous state-of-the-art techniques. Please see our project page at https://zouzx.github.io/TriplaneGaussian/
-
The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains but protecting their copyrights has not yet been researched in depth. Recently NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However existing methods are designed to apply only to implicit or explicit NeRF representations. In this work we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore we adopt a deferred back-propagation technique and introduce a combination with the patch-wise loss to improve rendering quality and bit accuracy with minimum trade-offs. We evaluate our method in three different aspects: capacity invisibility and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed over the compared state-of-the-art methods. Project page: https://kuai-lab.github.io/cvpr2024waterf/
-
We introduce Gaussian-Flow, a novel point-based approach for fast dynamic scene reconstruction and real-time rendering from both multi-view and monocular videos. In contrast to the prevalent NeRF-based approaches hampered by slow training and rendering speeds, our approach harnesses recent advancements in point-based 3D Gaussian Splatting (3DGS). Specifically, a novel Dual-Domain Deformation Model (DDDM) is proposed to explicitly model attribute deformations of each Gaussian point, where the time-dependent residual of each attribute is captured by a polynomial fitting in the time domain and a Fourier series fitting in the frequency domain. The proposed DDDM is capable of modeling complex scene deformations across long video footage, eliminating the need for training separate 3DGS for each frame or introducing an additional implicit neural field to model 3D dynamics. Moreover, the explicit deformation modeling for discretized Gaussian points ensures ultra-fast training and rendering of a 4D scene, which is comparable to the original 3DGS designed for static 3D reconstruction. Our proposed approach showcases a substantial efficiency improvement, achieving a 5x faster training speed compared to the per-frame 3DGS modeling. In addition, quantitative results demonstrate that the proposed Gaussian-Flow significantly outperforms previous leading methods in novel view rendering quality.
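The dual-domain residual described above (a polynomial in time plus a Fourier series) can be sketched for a single scalar attribute as follows. This is a minimal illustration under stated assumptions: the polynomial has no constant term, the Fourier harmonics use frequencies 1, 2, ... over a normalized time `t` in [0, 1], and the function and parameter names are hypothetical rather than the authors' implementation.

```python
import math
from typing import Sequence


def attribute_residual(
    t: float,
    poly_coeffs: Sequence[float],
    sin_coeffs: Sequence[float],
    cos_coeffs: Sequence[float],
) -> float:
    """Time-dependent residual of one scalar attribute at normalized time t."""
    # Time-domain part: poly_coeffs[n] multiplies t**(n + 1), so it vanishes at t = 0.
    poly = sum(a * t ** (n + 1) for n, a in enumerate(poly_coeffs))
    # Frequency-domain part: harmonic k = 1, 2, ... with sine and cosine weights.
    fourier = sum(
        s * math.sin(2.0 * math.pi * k * t) + c * math.cos(2.0 * math.pi * k * t)
        for k, (s, c) in enumerate(zip(sin_coeffs, cos_coeffs), start=1)
    )
    return poly + fourier


def deformed_attribute(
    base: float,
    t: float,
    poly_coeffs: Sequence[float],
    sin_coeffs: Sequence[float],
    cos_coeffs: Sequence[float],
) -> float:
    """Static attribute value plus its time-dependent residual."""
    return base + attribute_residual(t, poly_coeffs, sin_coeffs, cos_coeffs)
```

In a full system each Gaussian attribute (for example position or rotation components) would carry its own learnable coefficients; the sketch only shows how one attribute value would be evaluated at a given time.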
-
Knowledge distillation methods have recently been shown to be a promising direction to speed up the synthesis of large-scale diffusion models by requiring only a few inference steps. While several powerful distillation methods were recently proposed, the overall quality of student samples is typically lower compared to the teacher ones, which hinders their practical usage. In this work, we investigate the relative quality of samples produced by the teacher text-to-image diffusion model and its distilled student version. As our main empirical finding, we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones, despite the approximate nature of the student. Based on this finding, we propose an adaptive collaboration between student and teacher diffusion models for effective text-to-image synthesis. Specifically, the distilled model produces an initial image sample, and then an oracle decides whether it needs further improvements with the teacher model. Extensive experiments demonstrate that the designed pipeline surpasses state-of-the-art text-to-image alternatives for various inference budgets in terms of human preference. Furthermore, the proposed approach can be naturally used in popular applications such as text-guided image editing and controllable generation.
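The control flow of the student-teacher collaboration can be sketched as below. All three callables (`student`, `teacher_refine`, `oracle_accepts`) are hypothetical placeholders supplied by the caller; the abstract does not say how the oracle is implemented, so this only illustrates the decision structure.

```python
from typing import Callable, TypeVar

Sample = TypeVar("Sample")


def adaptive_generate(
    prompt: str,
    student: Callable[[str], Sample],
    teacher_refine: Callable[[str, Sample], Sample],
    oracle_accepts: Callable[[str, Sample], bool],
) -> Sample:
    """Generate with the cheap student first; call the teacher only when needed."""
    draft = student(prompt)
    if oracle_accepts(prompt, draft):
        # The student sample is judged good enough, so the teacher cost is saved.
        return draft
    # Otherwise the teacher improves the student's draft.
    return teacher_refine(prompt, draft)


# Toy stand-ins (strings instead of images) to show the two branches.
def toy_student(prompt: str) -> str:
    return "draft:" + prompt


def toy_teacher(prompt: str, draft: str) -> str:
    return "refined:" + prompt


def toy_oracle(prompt: str, draft: str) -> bool:
    return len(prompt) < 5


kept = adaptive_generate("cat", toy_student, toy_teacher, toy_oracle)
refined = adaptive_generate("a long prompt", toy_student, toy_teacher, toy_oracle)
```

The average cost of such a pipeline depends on how often the oracle rejects the draft, which is what makes it adjustable to different inference budgets.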
-
Synchronization issues between audio and video are one of the most disturbing quality defects in film production and live broadcasting. Even a discrepancy as short as 45 milliseconds can degrade the viewer's experience enough to warrant manual quality checks over entire movies. In this paper, we study the automatic discovery of such issues. Specifically, we focus on the alignment of lip movements with spoken words, targeting realistic production scenarios, which can include background noise and music, intricate head poses, excessive makeup, or scenes with multiple individuals where the speaker is unknown. Our model's robustness also extends to various media specifications, including different video frame rates and audio sample rates. To address these challenges, we present a model fully based on transformers that encodes face crops or full video frames and raw audio using timestamp information, identifies the speaker, and provides highly accurate synchronization predictions much faster than previous methods.
-
Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using a larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by combining global and local information in parallel. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using a Mask-RCNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device, respectively.
-
Reconstructing High Dynamic Range (HDR) video from image sequences captured with alternating exposures is challenging especially in the presence of large camera or object motion. Existing methods typically align low dynamic range sequences using optical flow or attention mechanism for deghosting. However they often struggle to handle large complex motions and are computationally expensive. To address these challenges we propose a robust and efficient flow estimator tailored for real-time HDR video reconstruction named HDRFlow. HDRFlow has three novel designs: an HDR-domain alignment loss (HALoss) an efficient flow network with a multi-size large kernel (MLK) and a new HDR flow training scheme. The HALoss supervises our flow network to learn an HDR-oriented flow for accurate alignment in saturated and dark regions. The MLK can effectively model large motions at a negligible cost. In addition we incorporate synthetic data Sintel into our training dataset utilizing both its provided forward flow and backward flow generated by us to supervise our flow network enhancing our performance in large motion regions. Extensive experiments demonstrate that our HDRFlow outperforms previous methods on standard benchmarks. To the best of our knowledge HDRFlow is the first real-time HDR video reconstruction method for video sequences captured with alternating exposures capable of processing 720p resolution inputs at 25ms.
-
Can we capture shape and reflectance in stealth? Such capability would be valuable for many application domains in vision xR robotics and HCI. We introduce structured polarization for invisible depth and reflectance sensing (SPIDeRS) the first depth and reflectance sensing method using patterns of polarized light. The key idea is to modulate the angle of linear polarization (AoLP) of projected light at each pixel. The use of polarization makes it invisible and lets us recover not only depth but also directly surface normals and even reflectance. We implement SPIDeRS with a liquid crystal spatial light modulator (SLM) and a polarimetric camera. We derive a novel method for robustly extracting the projected structured polarization pattern from the polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by applying it to a number of real-world objects. The results show that our method successfully reconstructs object shapes of various materials and is robust to diffuse reflection and ambient light. We also demonstrate relighting using recovered surface normals and reflectance. We believe SPIDeRS opens a new avenue of polarization use in visual sensing.
-
We present SuperNormal, a fast, high-fidelity approach to multi-view 3D reconstruction using surface normal maps. Within a few minutes, SuperNormal produces detailed surfaces on par with 3D scanners. We harness volume rendering to optimize a neural signed distance function (SDF) powered by multi-resolution hash encoding. To accelerate training, we propose directional finite difference and patch-based ray marching to approximate the SDF gradients numerically. While not compromising reconstruction quality, this strategy is nearly twice as efficient as analytical gradients and about three times faster than axis-aligned finite difference. Experiments on the benchmark dataset demonstrate the superiority of SuperNormal in efficiency and accuracy compared to existing multi-view photometric stereo methods. On our captured objects, SuperNormal produces more fine-grained geometry than recent neural 3D reconstruction methods. Our code is available at https://github.com/CyberAgentAILab/SuperNormal.git.
-
A simple yet effective method for occlusion-robust 3D human mesh reconstruction from a single image is presented in this paper. Although many recent studies have shown remarkable improvement in human mesh reconstruction, it is still difficult to generate accurate meshes when person-to-person occlusion occurs, due to the ambiguity of who a body part belongs to. To address this problem, we propose an instance-aware contrastive learning scheme. Specifically, joint features belonging to the target human are trained to be proximate to the anchor feature (i.e., the feature extracted from the body center position). On the other hand, anchor features of different human instances are forced to be far apart so that the joint features of each person can be clearly distinguished from others. By interpreting joint possession based on this contrastive learning scheme, the proposed method easily understands the spatial occupancy of body parts for each person in a given image and can thus reconstruct reliable human meshes even in cases of severe overlap between multiple persons. Experimental results on benchmark datasets demonstrate the robustness of the proposed method compared to previous approaches under person-to-person occlusions. The code and model are publicly available at: https://github.com/DCVL-3D/InstanceHMR_release.
-
A significant challenge facing current optical flow methods is the difficulty in generalizing them well to the real world. This is mainly due to the lack of large-scale real-world datasets and existing self-supervised methods are limited by indirect loss and occlusions resulting in fuzzy outcomes. To address this challenge we introduce a novel optical flow training framework: automatic data factory (ADF). ADF only requires RGB images as input to effectively train the optical flow network on the target data domain. Specifically we use advanced NeRF technology to reconstruct scenes from photo groups collected by a monocular camera and then calculate optical flow labels between camera pose pairs based on the rendering results. To eliminate erroneous labels caused by defects in the scene reconstructed by NeRF we screened the generated labels from multiple aspects such as optical flow matching accuracy radiation field confidence and depth consistency. The filtered labels can be directly used for network supervision. Experimentally the generalization ability of ADF on KITTI surpasses existing self-supervised optical flow and monocular scene flow algorithms. In addition ADF achieves impressive results in real-world zero-point generalization evaluations and surpasses most supervised methods.
-
The surge in multi-modal data has propelled cross-modal matching to the forefront of research interest. However the challenge lies in the laborious and expensive process of curating a large and accurately matched multimodal dataset. Commonly sourced from the Internet these datasets often suffer from a significant presence of mismatched data impairing the performance of matching models. To address this problem we introduce a novel regularization approach named Equivariant Similarity Consistency (ESC) which can facilitate robust clean and noisy data separation and improve the training for cross-modal matching. Intuitively our method posits that the semantic variations caused by image changes should be proportional to those caused by text changes for any two matched samples. Accordingly we first calculate the ESC by comparing image and text semantic variations between a set of elaborated anchor points and other undivided training data. Then pairs with high ESC are filtered out as noisy correspondence pairs. We implement our method by combining the ESC with a traditional hinge-based triplet loss. Extensive experiments on three widely used datasets including Flickr30K MS-COCO and Conceptual Captions verify the effectiveness of our method.
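For readers unfamiliar with the hinge-based triplet loss mentioned above, a minimal sketch for one matched image-text pair follows. It shows the common bidirectional form used in cross-modal matching, with one negative text and one negative image; the ESC term itself is not implemented here, and the function name, argument names, and margin value are assumptions for illustration.

```python
def hinge_triplet_loss(
    sim_pos: float,
    sim_neg_text: float,
    sim_neg_image: float,
    margin: float = 0.2,
) -> float:
    """Bidirectional hinge-based triplet loss for one matched image-text pair.

    sim_pos:       similarity of the matched (image, text) pair
    sim_neg_text:  similarity of the image with a non-matching text
    sim_neg_image: similarity of the text with a non-matching image
    """
    # Image-to-text direction: the matched text should beat the negative text by the margin.
    image_to_text = max(0.0, margin - sim_pos + sim_neg_text)
    # Text-to-image direction: the matched image should beat the negative image by the margin.
    text_to_image = max(0.0, margin - sim_pos + sim_neg_image)
    return image_to_text + text_to_image


# Toy example: the negative image is more similar than the true match, so only
# the text-to-image term is active.
loss = hinge_triplet_loss(sim_pos=0.8, sim_neg_text=0.5, sim_neg_image=0.9)
```

With noisy correspondences, a mismatched pair treated as positive pushes this loss in the wrong direction, which is why separating clean from noisy pairs before applying it matters.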
-
We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-Commons-licensed (CC) images, which yields models that are competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train T2I models; (2) CC images are relatively scarce. To address these challenges, we use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with our assembled CC images. We then develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION data (i.e., roughly 70 million examples) needed to train existing SD2 models but obtains the same quality. These results indicate that we have a sufficient number of CC images (also roughly 70 million) for training high-quality models. Our recipe also implements a variety of optimizations that achieve 2.71x training speed-ups, enabling rapid model iteration. We leverage this recipe to train several high-quality T2I models, which we dub the CommonCanvas family. Our largest model achieves comparable performance to SD2 on human evaluation, even though we use a synthetically captioned CC-image dataset that is less than 3% the size of LAION for training. We release our models, data, and code on GitHub.
-
Referring image segmentation (RIS) aims to segment the target referent described by natural language. Recently, large-scale pre-trained models, e.g., CLIP and SAM, have been successfully applied in many downstream tasks, but they are not well adapted to the RIS task due to inter-task differences. In this paper, we propose a new prompt-driven framework named Prompt-RIS, which bridges CLIP and SAM end-to-end and transfers their rich knowledge and powerful capabilities to the RIS task through prompt learning. To adapt CLIP to the pixel-level task, we first propose a Cross-Modal Prompting method, which acquires more comprehensive vision-language interaction and fine-grained text-to-pixel alignment by performing bidirectional prompting. Then, the prompt-tuned CLIP generates masks, points, and text prompts for SAM to generate more accurate mask predictions. Moreover, we propose Instance Contrastive Learning to improve the model's discriminability across different instances and its robustness to diverse language describing the same instance. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art methods in both general and open-vocabulary settings.
-
We present Image Sculpting a new framework for editing 2D images by incorporating tools from 3D geometry and graphics. This approach differs markedly from existing methods which are confined to 2D spaces and typically rely on textual instructions leading to ambiguity and limited control. Image Sculpting converts 2D objects into 3D enabling direct interaction with their 3D geometry. Post-editing these objects are re-rendered into 2D merging into the original image to produce high-fidelity results through a coarse-to-fine enhancement process. The framework supports precise quantifiable and physically-plausible editing options such as pose editing rotation translation 3D composition carving and serial addition. It marks an initial step towards combining the creative freedom of generative models with the precision of graphics pipelines.
-
In this paper, we propose a novel method to understand complex semantic structures in long video inputs. Conventional methods for understanding videos have focused on short-term clips and are trained to obtain visual representations of those clips using convolutional neural networks or transformer architectures. However, most real-world videos are long, ranging from minutes to hours; dividing them into small clips and learning representations of the clips therefore limits understanding of the overall semantic structure of the video. We propose a new algorithm to learn the multi-granular semantic structures of videos by defining spatiotemporal high-order relationships among object-based representations as semantic units. The proposed method includes a new transformer architecture capable of learning spatiotemporal graphs and a compositional learning method to learn disentangled features for each semantic unit. Using the proposed method, we address the challenging video task of compositional generalization, that is, understanding unseen videos. In experiments, we demonstrate new state-of-the-art performance on two challenging video datasets.
-
Building accurate maps is a key building block to enable reliable localization planning and navigation of autonomous vehicles. We propose a novel approach for building accurate 3D maps of dynamic environments utilizing a sequence of LiDAR scans. To this end we propose encoding the 4D scene into a novel spatio-temporal implicit neural map representation by fitting a time-dependent truncated signed distance function to each point. Using our representation we can extract the static map by filtering the dynamic parts. Our neural representation is based on sparse feature grids a globally shared decoder and time-dependent basis functions which can be jointly optimized in an unsupervised fashion. To learn this representation from a sequence of LiDAR scans we design a simple yet efficient loss function to supervise the map optimization in a piecewise way. We evaluate our approach on various scenes containing moving objects in terms of the reconstruction quality of static maps and the segmentation of dynamic point clouds. The experimental results demonstrate that our method is capable of removing the dynamic part of the input point clouds while reconstructing accurate and complete large-scale 3D maps outperforming several state-of-the-art methods for static map generation and scene reconstruction.
-
Spatio-temporal grounding describes the task of localizing events in space and time e.g. in video data based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only without human annotation. To this end we combine local representation learning which focuses on leveraging fine-grained spatial information with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long untrimmed multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks showing that our method improves over current baselines in various settings including spatial temporal and untrimmed multi-action spatio-temporal grounding.
-
We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test-time to a novel object without finetuning, as long as its CAD model is given or a small number of reference images are captured. Thanks to the unified framework, the downstream pose estimation modules are the same in both setups, with a neural implicit representation used for efficient novel view synthesis when no CAD model is available. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves comparable results to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/
-
In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with the implicit NeRF representation, which, however, results in a large storage space requirement. In this paper, we introduce the Context-based NeRF Compression (CNC) framework, which leverages highly efficient context models to provide a storage-friendly NeRF representation. Specifically, we excavate both level-wise and dimension-wise context dependencies to enable probability prediction for information entropy reduction. Additionally, we exploit hash collision and occupancy grids as strong prior knowledge for better context modeling. To the best of our knowledge, we are the first to construct and exploit context models for NeRF compression. We achieve a size reduction of 100X and 70X with improved fidelity against the baseline Instant-NGP on the Synthetic-NeRF and Tanks and Temples datasets, respectively. Additionally, we attain 86.7% and 82.3% storage size reduction against the SOTA NeRF compression method BiRF. Our code is available here: https://github.com/YihangChen-ee/CNC.
-
Recent developments in face restoration have achieved remarkable results in producing high-quality and lifelike outputs. The stunning results however often fail to be faithful with respect to the identity of the person as the models lack necessary context. In this paper we explore the potential of personalized face restoration with diffusion models. In our approach a restoration model is personalized using a few images of the identity leading to tailored restoration with respect to the identity while retaining fine-grained details. By using independent trainable blocks for personalization the rich prior of a base restoration model can be exploited to its fullest. To avoid the model relying on parts of identity left in the conditioning low-quality images a generative regularizer is employed. With a learnable parameter the model learns to balance between the details generated based on the input image and the degree of personalization. Moreover we improve the training pipeline of face restoration models to enable an alignment-free approach. We showcase the robust capabilities of our approach in several real-world scenarios with multiple identities demonstrating our method's ability to generate fine-grained details with faithful restoration. In the user study we evaluate the perceptual quality and faithfulness of the generated details with our method being voted best 61% of the time compared to the second best with 25% of the votes.
-
We present TextureDreamer, a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a pivotal challenge in vision and graphics. Industrial companies hire experienced artists to manually craft textures for 3D assets. Classical methods require densely sampled views and accurately aligned geometry, while learning-based methods are confined to category-specific shapes within the dataset. In contrast, TextureDreamer can transfer highly detailed, intricate textures from real-world environments to arbitrary objects with only a few casually captured images, potentially significantly democratizing texture creation. Our core idea, personalized geometry-aware score distillation (PGSD), draws inspiration from recent advancements in diffusion models, including personalized modeling for texture information extraction, score distillation for detailed appearance synthesis, and explicit geometry guidance with ControlNet. Our integration and several essential modifications substantially improve the texture quality. Experiments on real images spanning different categories show that TextureDreamer can successfully transfer highly realistic, semantically meaningful texture to arbitrary objects, surpassing the visual quality of the previous state of the art. Project page: https://texturedreamer.github.io
-
Image Quality Assessment (IQA) constitutes a fundamental task within the field of computer vision, yet it remains an unresolved challenge owing to intricate distortion conditions, diverse image contents, and limited availability of data. Recently, the community has witnessed the emergence of numerous large-scale pretrained foundation models. However, it remains an open problem whether the scaling law in high-level tasks is also applicable to IQA tasks, which are closely related to low-level clues. In this paper, we demonstrate that with a proper injection of local distortion features, a larger pretrained vision transformer (ViT) foundation model performs better in IQA tasks. Specifically, to compensate for the lack of local distortion structure and inductive bias in the large-scale pretrained ViT, we use a separate pretrained convolutional neural network (CNN), which is well known for capturing local structure, to extract multi-scale image features. Further, we propose a local distortion extractor to obtain local distortion features from the pretrained CNN and a local distortion injector to inject the local distortion features into the ViT. By training only the extractor and injector, our method can benefit from the rich knowledge in the powerful foundation models and achieve state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models. Code is publicly available at: https://github.com/NeosXu/LoDa.
-
Anomaly detection is a challenging computer vision task in industrial scenarios. Advancements in deep learning constantly revolutionize vision-based anomaly detection methods, and considerable progress has been made in both supervised and self-supervised anomaly detection. The commonly used pipeline is to optimize the model by constraining the feature embeddings using a distance-based loss function. However, these methods work in Euclidean space, and they cannot fully exploit data that lies in non-Euclidean space. In this paper, we are the first to explore the anomaly detection task in hyperbolic space, a representative non-Euclidean space, and propose a hyperbolic anomaly detection (HypAD) method. Specifically, we first extract image features and then map them from Euclidean space to hyperbolic space, where the hyperbolic distance metric is employed to optimize the proposed HypAD. Extensive experiments on benchmark datasets including MVTec AD and VisA show that our HypAD approach obtains state-of-the-art performance, demonstrating the effectiveness of HypAD and the promise of investigating anomaly detection in hyperbolic space.
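To make the Euclidean-to-hyperbolic mapping concrete, here is a minimal sketch assuming the Poincaré ball model with curvature -1. The abstract does not say which hyperbolic model or curvature HypAD uses, so the model choice and both function names are assumptions for illustration only.

```python
import math

def exp_map_origin(v):
    """Map a Euclidean feature vector into the unit Poincare ball.

    Uses the exponential map at the origin for curvature -1:
    exp_0(v) = tanh(||v||) * v / ||v||.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]

def poincare_distance(x, y):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    # d(x, y) = arcosh(1 + 2 * ||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))
```

Under this convention a feature of Euclidean norm r lands at hyperbolic distance 2r from the origin, so distances grow rapidly near the boundary of the ball. Features with very large norms saturate `tanh` to 1 and would need rescaling or clipping before the distance is evaluated.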
-
Autonomous driving is a complex and challenging task that aims at safe motion planning through scene understanding and reasoning. While vision-only autonomous driving methods have recently achieved notable performance through enhanced scene understanding several key issues including lack of reasoning low generalization performance and long-tail scenarios still need to be addressed. In this paper we present VLP a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. VLP enhances autonomous driving systems by strengthening both the source memory foundation and the self-driving car's contextual understanding. VLP achieves state-of-the-art end-to-end planning performance on the challenging NuScenes dataset by achieving 35.9% and 60.5% reduction in terms of average L2 error and collision rates respectively compared to the previous best method. Moreover VLP shows improved performance in challenging long-tail scenarios and strong generalization capabilities when faced with new urban environments.
-
Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation 3D and video composition. Further personalized techniques enable appealing customized production of a novel concept given only several images as reference. However an intriguing problem persists: Is it possible to capture multiple novel concepts from one single reference image? In this paper we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then the classes are separated and strengthened following the activation of the cross-attention operation ensuring comprehensive and self-contained concepts. Additionally we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together our proposed method dubbed DisenDiff can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly our proposed techniques are compatible with LoRA and inpainting pipelines enabling more interactive experiences.
-
Generative AI (GenAI) is transforming creative workflows through the capability to synthesize and manipulate images via high-level prompts. Yet creatives are not well supported to receive recognition or reward for the use of their content in GenAI training. To this end, we propose ProMark, a causal attribution technique to attribute a synthetically generated image to its training data concepts, like objects, motifs, templates, artists, or styles. The concept information is proactively embedded into the input training images using imperceptible watermarks, and the diffusion models (unconditional or conditional) are trained to retain the corresponding watermarks in generated images. We show that we can embed as many as 2^16 unique watermarks into the training data, and each training image can contain more than one watermark. ProMark can maintain image quality whilst outperforming correlation-based attribution. Finally, several qualitative examples are presented, providing the confidence that the presence of the watermark conveys a causative relationship between training data and synthetic images.
-
While GAN-based models have been successful in image stylization tasks they often struggle with structure preservation while stylizing a wide range of input images. Recently diffusion models have been adopted for image stylization but still lack the capability to maintain the original quality of input images. Building on this we propose OSASIS: a novel one-shot stylization method that is robust in structure preservation. We show that OSASIS is able to effectively disentangle the semantics from the structure of an image allowing it to control the level of content and style implemented to a given input. We apply OSASIS to various experimental settings including stylization with out-of-domain reference images and stylization with text-driven manipulation. Results show that OSASIS outperforms other stylization methods especially for input images that were rarely encountered during training providing a promising solution to stylization via diffusion models.
-
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, an innovative, groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. GPT4Point, as a powerful 3D MLLM, can seamlessly execute a variety of point-text reference tasks, such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with advanced capabilities for controllable 3D generation: it can obtain high-quality results from a low-quality point-text feature while maintaining the geometric shapes and colors. To support the expansive needs of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database of over 1M objects with varied text granularity levels from the Objaverse-XL dataset, which is essential for training GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point has demonstrated superior performance in understanding and generation.
-
We present "SemCity" a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object synthetic indoor scenes or synthetic outdoor scenes while the generation of real-world outdoor scenes is rarely addressed. In this paper we concentrate on generating a real-outdoor scene through learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data real-outdoor datasets often contain more empty spaces due to sensor limitations causing challenges in learning real-outdoor distributions. To address this issue we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation such as scene inpainting scene outpainting and semantic scene completion refinements. In experimental results we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work in a real-outdoor dataset SemanticKITTI. We also show our triplane manipulation facilitates seamlessly adding removing or modifying objects within a scene. Further it also enables the expansion of scenes toward a city-level scale. Finally we evaluate our method on semantic scene completion refinements where our diffusion model enhances predictions of semantic scene completion networks by learning scene distribution. Our code is available at https://github.com/zoomin-lee/SemCity.
-
Recent self-supervised models produce visual features that are not only effective at encoding image-level but also pixel-level semantics. They have been reported to obtain impressive results for dense visual semantic correspondence estimation even outperforming fully-supervised methods. Nevertheless these models still fail in the presence of challenging image characteristics such as symmetries and repeated parts. To address these limitations we propose a new semantic correspondence estimation method that supplements state-of-the-art self-supervised features with 3D understanding via a weak geometric spherical prior. Compared to more involved 3D pipelines our model provides a simple and effective way of injecting informative geometric priors into the learned representation while requiring only weak viewpoint information. We also propose a new evaluation metric that better accounts for repeated part and symmetry-induced mistakes. We show that our method succeeds in distinguishing between symmetric views and repeated parts across many object categories in the challenging SPair-71k dataset and also in generalizing to previously unseen classes in the AwA dataset.
-
This research paper presents a novel class of restoration network architecture based on the Volterra series formulation. By incorporating non-linearity into the system response function through higher order convolutions instead of traditional activation functions we introduce a general framework for image/video restoration. Through extensive experimentation we demonstrate that our proposed architecture achieves state-of-the-art (SOTA) performance in the field of Image/Video Restoration. Moreover we establish that the recently introduced Non-Linear Activation Free Network (NAF-NET) can be considered a special case within the broader class of Volterra Neural Networks. These findings highlight the potential of Volterra Neural Networks as a versatile and powerful tool for addressing complex restoration tasks in computer vision.
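For readers unfamiliar with the Volterra series, the sketch below evaluates a one-dimensional series truncated at second order, where the quadratic kernel supplies the non-linearity that an activation function would otherwise provide. It is an illustrative assumption-laden example, not the paper's architecture: the function name and the 1D, zero-padded, second-order setting are chosen here for brevity.

```python
def volterra2(x, h1, h2, h0=0.0):
    """Second-order truncated Volterra filter on a 1D signal.

    y[n] = h0 + sum_i h1[i] * x[n-i] + sum_{i,j} h2[i][j] * x[n-i] * x[n-j]
    Samples before the start of the signal are treated as zero.
    """
    taps = len(h1)
    out = []
    for n in range(len(x)):
        # window holds x[n], x[n-1], ..., x[n-taps+1] with zero padding
        window = [x[n - i] if n - i >= 0 else 0.0 for i in range(taps)]
        linear = sum(h1[i] * window[i] for i in range(taps))
        quadratic = sum(
            h2[i][j] * window[i] * window[j]
            for i in range(taps)
            for j in range(taps)
        )
        out.append(h0 + linear + quadratic)
    return out
```

With `h2` set to all zeros the function reduces to an ordinary linear convolution, which is one way to see the higher-order term as a learnable replacement for a fixed activation.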
-
With the emergence of pre-trained vision-language models like CLIP how to adapt them to various downstream classification tasks has garnered significant attention in recent research. The adaptation strategies can be typically categorized into three paradigms: zero-shot adaptation few-shot adaptation and the recently-proposed training-free few-shot adaptation. Most existing approaches are tailored for a specific setting and can only cater to one or two of these paradigms. In this paper we introduce a versatile adaptation approach that can effectively work under all three settings. Specifically we propose the dual memory networks that comprise dynamic and static memory components. The static memory caches training data knowledge enabling training-free few-shot adaptation while the dynamic memory preserves historical test features online during the testing process allowing for the exploration of additional data insights beyond the training set. This novel capability enhances model performance in the few-shot setting and enables model usability in the absence of training data. The two memory networks employ the same flexible memory interactive strategy which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. Our approach is tested across 11 datasets under the three task settings. Remarkably in the zero-shot scenario it outperforms existing methods by over 3% and even shows superior results against methods utilizing external training data. Additionally our method exhibits robust performance against natural distribution shifts.
-
We introduce a framework for intrinsic latent diffusion models operating directly on the surfaces of 3D shapes, with the goal of synthesizing high-quality textures. Our approach is underpinned by two contributions: Field Latents, a latent representation encoding textures as discrete vector fields on the mesh vertices, and Field Latent Diffusion Models, which learn to denoise a diffusion process in the learned latent space on the surface. We consider a single-textured-mesh paradigm, where our models are trained to generate variations of a given texture on a mesh. We show the synthesized textures are of superior fidelity compared to those from existing single-textured-mesh generative models. Our models can also be adapted for user-controlled editing tasks such as inpainting and label-guided generation. The efficacy of our approach is due in part to the equivariance of our proposed framework under isometries, allowing our models to seamlessly reproduce details across locally similar regions and opening the door to a notion of generative texture transfer. Code and visualizations are available at https://single-mesh-diffusion.github.io/.
-
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs leading to insufficient extraction and reasoning of visual knowledge. To address this issue we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION) which empowers the MLLM by injecting visual knowledge in two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator cooperated with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme contributes to the mutual promotion between these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual evidence. We facilitate the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence caused by imperfect predicted tags we propose a soft prompting method by embedding a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g. improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP 5% accuracy on RefCOCOg over Kosmos-2).
-
Multiple camera view (multi-view) setups have proven useful in many computer vision applications. However, the high computational cost associated with multiple views creates a significant challenge for end devices with limited computational resources. In modern CPUs, pipelining breaks a longer job into steps and enables parallelism over sequential steps from multiple jobs. Inspired by this, we study selective view pipelining for efficient multi-view understanding, which breaks the computation of multiple views into steps and computes only the most helpful views/steps in a parallel manner for the best efficiency. To this end, we use reinforcement learning to learn a very light view selection module that analyzes the target object or scenario from initial views and selects the next-best-view for recognition or detection for pipeline computation. Experimental results on multi-view classification and detection tasks show that our approach achieves promising performance while using only 2 or 3 out of N available views, significantly reducing computational costs while maintaining parallelism over GPU through selective view pipelining.
-
The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g., GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that, given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.
-
Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning yet these methods struggle to distinguish between objects of the same categories and rely on specific prompts that are not universally applicable. In this paper we introduce SAI3D a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism which largely improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on ScanNet Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++. Our project page is at https://yd-yin.github.io/SAI3D/.
-
Recent advancements in video modeling extensively rely on optical flow to represent the relationships across frames but this approach often lacks efficiency and fails to model the probability of the intrinsic motion of objects. In addition conventional encoder-decoder frameworks in video processing focus on modeling the correlation in the encoder leading to limited generative capabilities and redundant intermediate representations. To address these challenges this paper proposes a novel Implicit Motion Function (IMF) method. Our approach utilizes a low-dimensional latent token as the implicit representation along with the use of cross-attention to implicitly model the correlation between frames. This enables the implicit modeling of temporal correlations and understanding of object motions. Our method not only improves sparsity and efficiency in representation but also explores the generative capabilities of the decoder by integrating correlation modeling within it. The IMF framework facilitates video editing and other generative tasks by allowing the direct manipulation of latent tokens. We validate the effectiveness of IMF through extensive experiments on multiple video tasks demonstrating superior performance in terms of reconstructed video quality compression efficiency and generation ability.
-
Test-time adaptation (TTA) aims at adapting a model pre-trained on the labeled source domain to the unlabeled target domain. Existing methods usually focus on improving TTA performance under covariate shifts while neglecting semantic shifts. In this paper we delve into a realistic open-set TTA setting where the target domain may contain samples from unknown classes. Many state-of-the-art closed-set TTA methods perform poorly when applied to open-set scenarios which can be attributed to the inaccurate estimation of data distribution and model confidence. To address these issues we propose a simple but effective framework called unified entropy optimization (UniEnt) which is capable of simultaneously adapting to covariate-shifted in-distribution (csID) data and detecting covariate-shifted out-of-distribution (csOOD) data. Specifically UniEnt first mines pseudo-csID and pseudo-csOOD samples from test data followed by entropy minimization on the pseudo-csID data and entropy maximization on the pseudo-csOOD data. Furthermore we introduce UniEnt+ to alleviate the noise caused by hard data partition leveraging sample-level confidence. Extensive experiments on CIFAR benchmarks and Tiny-ImageNet-C show the superiority of our framework. The code is available at https://github.com/gaozhengqing/UniEnt.
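The entropy objective described above can be sketched as follows. This is an illustration, not the released UniEnt code: the per-sample `ood_scores`, the hard `threshold` used to split the batch, and the `weight` on the maximization term are all assumptions introduced here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def unified_entropy_loss(probs_batch, ood_scores, threshold=0.5, weight=1.0):
    """Minimize entropy on pseudo-csID samples, maximize it on pseudo-csOOD ones.

    probs_batch: list of softmax outputs, one per test sample.
    ood_scores:  hypothetical per-sample scores; >= threshold means pseudo-csOOD.
    """
    id_ent = [entropy(p) for p, s in zip(probs_batch, ood_scores) if s < threshold]
    ood_ent = [entropy(p) for p, s in zip(probs_batch, ood_scores) if s >= threshold]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    # Lower loss = confident predictions on csID, uncertain predictions on csOOD.
    return mean(id_ent) - weight * mean(ood_ent)
```

The hard split by `threshold` corresponds to the "hard data partition" mentioned in the abstract; a soft variant would weight each sample by its confidence instead.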
-
This paper focuses on synthesizing high-quality and complete textures directly on the surface of 3D models within 3D space. 2D diffusion-based methods face challenges in generating 2D texture maps due to the infinite possibilities of UV mapping for a given 3D mesh. Utilizing point clouds helps circumvent variations arising from diverse mesh topologies and UV mappings. Nevertheless achieving dense point clouds to accurately represent texture details poses a challenge due to limited computational resources. To address these challenges we propose an efficient octree-based diffusion pipeline called TexOct. Our method starts by sampling a point cloud from the surface of a given 3D model with each point containing texture noise values. We utilize an octree structure to efficiently represent this point cloud. Additionally we introduce an innovative octree-based diffusion model that leverages the denoising capabilities of the Denoising Diffusion Probabilistic Model (DDPM). This model gradually reduces the texture noise on the octree nodes resulting in the restoration of fine texture. Experimental results on ShapeNet demonstrate that TexOct effectively generates high-quality 3D textures in both unconditional and text / image-conditional scenarios.
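As a rough illustration of the DDPM noising that such a pipeline builds on, the sketch below applies the standard forward process to one node's texture values. The linear beta schedule, its endpoints, and the per-node RGB list are assumptions made for the example and are not taken from TexOct.

```python
import math


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars


def q_sample(x0, alpha_bar, noise):
    """Forward-noise clean values: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise."""
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [scale_signal * x + scale_noise * n for x, n in zip(x0, noise)]
```

A denoising model is then trained to undo this corruption step by step; in the setting above, the values being denoised would live on octree nodes rather than on image pixels.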
-
Coordinate-based implicit neural representations have gained rapid popularity in recent years, as they have been successfully used in image, geometry, and scene modeling tasks. In this work, we present a novel use case for such implicit representations in the context of learning anatomically constrained face models. Actor-specific anatomically constrained face models are the state of the art in both facial performance capture and performance retargeting. Despite their practical success, these anatomical models are slow to evaluate and often require extensive data capture to be built. We propose the anatomical implicit face model: an ensemble of implicit neural networks that jointly learn to model the facial anatomy and the skin surface with high fidelity and can readily be used as a drop-in replacement for conventional blendshape models. Given an arbitrary set of skin surface meshes of an actor and only a neutral shape with estimated skull and jaw bones, our method can recover a dense anatomical substructure that constrains every point on the facial surface. We demonstrate the usefulness of our approach in several tasks, including shape fitting, shape editing, and performance retargeting.
-
Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Despite the strong performance of Pre-Trained Models (PTMs) in CIL a critical issue persists: learning new classes often results in the overwriting of old ones. Excessive modification of the network causes forgetting while minimal adjustments lead to an inadequate fit for new classes. As a result it is desired to figure out a way of efficient model updating without harming former knowledge. In this paper we propose ExpAndable Subspace Ensemble (EASE) for PTM-based CIL. To enable model updating without conflict we train a distinct lightweight adapter module for each new task aiming to create task-specific subspaces. These adapters span a high-dimensional feature space enabling joint decision-making across multiple subspaces. As data evolves the expanding subspaces render the old class classifiers incompatible with new-stage spaces. Correspondingly we design a semantic-guided prototype complement strategy that synthesizes old classes' new features without using any old class instance. Extensive experiments on seven benchmark datasets verify EASE's state-of-the-art performance. Code is available at: https://github.com/sun-hailong/CVPR24-Ease
-
In this paper, we focus on capturing closely interacted two-person motions from monocular videos, an important yet understudied topic. Unlike less-interacted motions, closely interacted motions contain frequently occurring inter-human occlusions, which pose significant challenges to existing capturing algorithms. To address this problem, our key observation is that close physical interactions between two subjects typically happen under very specific situations (e.g., handshake, hug, etc.), and such situational contexts contain strong prior semantics to help infer the poses of occluded joints. In this spirit, we introduce reaction priors, which are invertible neural networks that bi-directionally model the pose probability distributions of one person given the pose of the other. The learned reaction priors are then incorporated into a query-based pose estimator, which is a decoder-only Transformer with self-attention on both intra-joint and inter-joint relationships. We demonstrate that our design achieves considerably higher performance than previous methods on multiple benchmarks. Moreover, as existing datasets lack sufficient cases of close human-human interactions, we also build a new dataset called Dual-Human to better evaluate different methods. Dual-Human contains around 2k sequences of closely interacted two-person motions, each with synthetic multi-view renderings, contact annotations, and text descriptions. We believe that this new public dataset can significantly promote further research in this area.
-
Segment Anything Model (SAM) has emerged as a transformative approach in image segmentation acclaimed for its robust zero-shot segmentation capabilities and flexible prompting system. Nonetheless its performance is challenged by images with degraded quality. Addressing this limitation we propose the Robust Segment Anything Model (RobustSAM) which enhances SAM's performance on low-quality images while preserving its promptability and zero-shot generalization. Our method leverages the pre-trained SAM model with only marginal parameter increments and computational requirements. The additional parameters of RobustSAM can be optimized within 30 hours on eight GPUs demonstrating its feasibility and practicality for typical research laboratories. We also introduce the Robust-Seg dataset a collection of 688K image-mask pairs with different degradations designed to train and evaluate our model optimally. Extensive experiments across various segmentation tasks and datasets confirm RobustSAM's superior performance especially under zero-shot conditions underscoring its potential for extensive real-world application. Additionally our method has been shown to effectively improve the performance of SAM-based downstream tasks such as single image dehazing and deblurring.
-
We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation, which are prone to drift and error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results, even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.
-
3D-aware GANs offer new capabilities for view synthesis while preserving the editing functionalities of their 2D counterparts. GAN inversion is a crucial step that seeks the latent code to reconstruct input images or videos subsequently enabling diverse editing tasks through manipulation of this latent code. However a model pre-trained on a particular dataset (e.g. FFHQ) often has difficulty reconstructing images with out-of-distribution (OOD) objects such as faces with heavy make-up or occluding objects. We address this issue by explicitly modeling OOD objects from the input in 3D-aware GANs. Our core idea is to represent the image using two individual neural radiance fields: one for the in-distribution content and the other for the out-of-distribution object. The final reconstruction is achieved by optimizing the composition of these two radiance fields with carefully designed regularization. We demonstrate that our explicit decomposition alleviates the inherent trade-off between reconstruction fidelity and editability. We evaluate reconstruction accuracy and editability of our method on challenging real face images and videos and showcase favorable results against other baselines.
-
Identifying the chemical structure from a graphical representation or image of a molecule is a challenging pattern recognition task that would greatly benefit drug development. Yet existing methods for chemical structure recognition do not typically generalize well and show diminished effectiveness when confronted with domains where data is sparse or costly to generate such as hand-drawn molecule images. To address this limitation we propose a new chemical structure recognition tool that delivers state-of-the-art performance and can adapt to new domains with a limited number of data samples and supervision. Unlike previous approaches our method provides atom-level localization and can therefore segment the image into the different atoms and bonds. Our model is the first model to perform OCSR with atom-level entity detection with only SMILES supervision. Through rigorous and extensive benchmarking we demonstrate the preeminence of our chemical structure recognition approach in terms of data efficiency accuracy and atom-level entity prediction.
-
3D visual-language multi-modal modeling plays an important role in practical human-computer interaction. However, the inaccessibility of large-scale 3D-language pairs restricts its applicability in real-world scenarios. In this paper, we aim to handle real-time multi-task 6-DoF pose tracking of unknown objects from a series of 3D point cloud video streams, leveraging a 3D-language pre-training scheme, while simultaneously performing 3D shape reconstruction in the current observation. To this end, we present a generic Language-to-4D modeling paradigm termed L4D-Track that tackles zero-shot 6-DoF tracking and shape reconstruction by learning pairwise implicit 3D representation and multi-level multi-modal alignment. Our method consists of two core parts: 1) Pairwise Implicit 3D Space Representation, which establishes spatial-temporal-to-language coherence descriptions across continuous 3D point cloud video; and 2) Language-to-4D Association and Contrastive Alignment, which enables multi-modality semantic connections between 3D point cloud video and language. Our method, trained exclusively on the public NOCS-REAL275 dataset, achieves promising results on two public benchmarks. This not only shows powerful generalization performance but also demonstrates its remarkable capability in zero-shot inference.
-
The pre-training architectures of large language models encompass various types including autoencoding models autoregressive models and encoder-decoder models. We posit that any modality can potentially benefit from a large language model as long as it undergoes vector quantization to become discrete tokens. Inspired by the General Language Model we propose a General Point Model (GPM) that seamlessly integrates autoencoding and autoregressive tasks in a point cloud transformer. This model is versatile allowing fine-tuning for downstream point cloud representation tasks as well as unconditional and conditional generation tasks. GPM enhances masked prediction in autoencoding through various forms of mask padding tasks leading to improved performance in point cloud understanding. Additionally GPM demonstrates highly competitive results in unconditional point cloud generation tasks even exhibiting the potential for conditional generation tasks by modifying the input's conditional information. Compared to models like Point-BERT MaskPoint and PointMAE our GPM achieves superior performance in point cloud understanding tasks. Furthermore the integration of autoregressive and autoencoding within the same transformer underscores its versatility across different downstream tasks.
-
Implicit neural representations (INRs) were recently proposed as a new video compression paradigm with existing approaches performing on par with HEVC. However such methods only perform well in limited settings e.g. specific model sizes fixed aspect ratios and low-motion videos. We address this issue by proposing T-NeRV a hybrid video INR that combines frame-specific embeddings with GOP-specific features providing a lever for content-specific fine-tuning. We employ entropy-constrained training to jointly optimize our model for rate and distortion and demonstrate that T-NeRV can thereby automatically adjust this lever during training effectively fine-tuning itself to the target content. We evaluate T-NeRV on the UVG dataset where it achieves state-of-the-art results on the video representation task outperforming previous works by up to 3dB PSNR on challenging high-motion sequences. Further our method improves on the compression performance of previous methods and is the first video INR to outperform HEVC on all UVG sequences.
-
Camera-based person re-identification (ReID) systems have been widely applied in the field of public security. However, cameras often lack the perception of human 3D morphological information and are susceptible to various limitations, such as inadequate illumination, complex backgrounds, and personal privacy concerns. In this paper, we propose a LiDAR-based ReID framework, ReID3D, that utilizes a pre-training strategy to retrieve features of 3D body shape and introduces a Graph-based Complementary Enhancement Encoder for extracting comprehensive features. Due to the lack of LiDAR datasets, we build LReID, the first LiDAR-based person ReID dataset, which is collected in several outdoor scenes with variations in natural conditions. Additionally, we introduce LReID-sync, a simulated pedestrian dataset designed for pre-training encoders with tasks of point cloud completion and shape parameter learning. Extensive experiments on LReID show that ReID3D achieves exceptional performance with a rank-1 accuracy of 94.0, highlighting the significant potential of LiDAR in addressing person ReID tasks. To the best of our knowledge, we are the first to propose a solution for LiDAR-based ReID. The code and dataset are available at https://github.com/GWxuan/ReID3D.
-
As an important pillar of underwater intelligence, Marine Animal Segmentation (MAS) involves segmenting animals within marine environments. Previous methods do not excel at extracting long-range contextual features and overlook the connectivity between discrete pixels. Recently, the Segment Anything Model (SAM) has offered a universal framework for general segmentation tasks. Unfortunately, because it is trained on natural images, SAM lacks prior knowledge of marine images. In addition, the single-position prompt of SAM is insufficient for prior guidance. To address these issues, we propose a novel feature learning framework named Dual-SAM for high-performance MAS. To this end, we first introduce a dual structure with SAM's paradigm to enhance feature learning of marine images. Then, we propose a Multi-level Coupled Prompt (MCP) strategy to instruct comprehensive underwater prior information and enhance the multi-level features of SAM's encoder with adapters. Subsequently, we design a Dilated Fusion Attention Module (DFAM) to progressively integrate multi-level features from SAM's encoder. Finally, instead of directly predicting the masks of marine animals, we propose a Criss-Cross Connectivity Prediction (C3P) paradigm to capture the inter-connectivity between discrete pixels. With dual decoders, it generates pseudo-labels and achieves mutual supervision for complementary feature representations, resulting in considerable improvements over previous techniques. Extensive experiments verify that our proposed method achieves state-of-the-art performance on five widely used MAS datasets. The code is available at https://github.com/Drchip61/Dual_SAM.
-
Video and audio content creation serves as the core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of the technique from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core with classifier guidance, which guides the diffusion denoising process at inference time. Through a carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/.
-
When adopting a deep learning model for embodied agents, the model structure must be optimized for specific tasks and operational conditions. Such optimization can be static, such as model compression, or dynamic, such as adaptive inference. Yet these techniques have not been fully investigated for embodied control systems subject to time constraints, which necessitate sequential decision-making for multiple tasks, each with distinct inference latency limitations. In this paper, we present MoDeC, a time constraint-aware embodied control framework using modular model adaptation. We formulate model adaptation to varying operational conditions on resource and time restrictions as dynamic routing on a modular network, incorporating these conditions as part of multi-task objectives. Our evaluation across several vision-based embodied environments demonstrates the robustness of MoDeC, showing that it outperforms other model adaptation methods in both performance and adherence to time constraints in robotic manipulation and autonomous driving applications.
-
We develop a theory for the representation of opaque solids as volumes. Starting from a stochastic representation of opaque solids as random indicator functions we prove the conditions under which such solids can be modeled using exponential volumetric transport. We also derive expressions for the volumetric attenuation coefficient as a functional of the probability distributions of the underlying indicator functions. We generalize our theory to account for isotropic and anisotropic scattering at different parts of the solid and for representations of opaque solids as stochastic implicit surfaces. We derive our volumetric representation from first principles which ensures that it satisfies physical constraints such as reciprocity and reversibility. We use our theory to explain compare and correct previous volumetric representations as well as propose meaningful extensions that lead to improved performance in 3D reconstruction tasks.
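For orientation, exponential volumetric transport in its standard textbook form expresses the transmittance along a ray as the exponential of the negated, integrated attenuation coefficient. The notation below (ray origin $\mathbf{o}$, direction $\boldsymbol{\omega}$, attenuation coefficient $\sigma$, transmittance $T$) is introduced here for illustration and is not taken from the paper's derivation:

```latex
T(t) \;=\; \exp\!\left( -\int_{0}^{t} \sigma\big(\mathbf{o} + s\,\boldsymbol{\omega}\big)\,\mathrm{d}s \right)
```

In this form, $T(t)$ is read as the probability that the ray travels a distance $t$ without hitting the solid, which is the quantity the stochastic indicator-function view is meant to justify.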
-
The pretraining-finetuning paradigm has gained popularity in various computer vision tasks. Within this paradigm, active finetuning has emerged in response to the abundance of large-scale data and costly annotation requirements. Active finetuning involves selecting a subset of data from an unlabeled pool for annotation, facilitating subsequent finetuning. However, the use of a limited number of training samples can lead to a biased distribution, potentially resulting in model overfitting. In this paper, we propose a new method called ActiveDC for active finetuning tasks. First, we select samples for annotation by optimizing the distribution similarity between the subset to be selected and the entire unlabeled pool in continuous space. Second, we calibrate the distribution of the selected samples by exploiting implicit category information in the unlabeled pool. Feature visualization provides an intuitive sense of the effectiveness of our approach to distribution calibration. We conducted extensive experiments on three image classification datasets with different sampling ratios. The results indicate that ActiveDC consistently outperforms the baselines in all image classification tasks. The improvement is particularly significant when the sampling ratio is low, with performance gains of up to 10%. Our code will be released.
-
Machine learning holds tremendous promise for transforming the fundamental practice of scientific discovery by virtue of its data-driven nature. With the ever-increasing stream of research data collection it would be appealing to autonomously explore patterns and insights from observational data for discovering novel classes of phenotypes and concepts. However in the biomedical domain there are several challenges inherently presented in the cumulated data which hamper the progress of novel class discovery. The non-i.i.d. data distribution accompanied by the severe imbalance among different groups of classes essentially leads to ambiguous and biased semantic representations. In this work we present a geometry-constrained probabilistic modeling treatment to resolve the identified issues. First we propose to parameterize the approximated posterior of instance embedding as a marginal von Mises-Fisher distribution to account for the interference of distributional latent bias. Then we incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space which in turn minimizes the uncontrollable risk for unknown class learning and structuring. Furthermore a spectral graph-theoretic method is devised to estimate the number of potential novel classes. It inherits two intriguing merits compared to existent approaches namely high computational efficiency and flexibility for taxonomy-adaptive estimation. Extensive experiments across various biomedical scenarios substantiate the effectiveness and general applicability of our method.
-
In this era the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However in the realm of 3D vision while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap we present MVHumanNet a dataset that comprises multi-view human action sequences of 4500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system which facilitates easily scalable data collection. Our dataset contains 9000 daily outfits 60000 motion sequences and 645 million frames with extensive annotations including human masks camera parameters 2D and 3D keypoints SMPL/SMPLX parameters and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks we conducted pilot studies on view-consistent action recognition human NeRF reconstruction text-driven view-unconstrained human image generation as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.
-
Federated learning often suffers from slow and unstable convergence due to the heterogeneous characteristics of participating client datasets. Such a tendency is aggravated when the client participation ratio is low since the information collected from the clients has large variations. To address this challenge we propose a simple but effective federated learning framework which improves the consistency across clients and facilitates the convergence of the server model. This is achieved by making the server broadcast a global model with a lookahead gradient. This strategy enables the proposed approach to convey the projected global update information to participants effectively without additional client memory and extra communication costs. We also regularize local updates by aligning each client with the overshot global model to reduce bias and improve the stability of our algorithm. We provide the theoretical convergence rate of our algorithm and demonstrate remarkable performance gains in terms of accuracy and communication efficiency compared to the state-of-the-art methods especially with low client participation rates. The source code is available at our project page.
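One way to picture the broadcast-and-align scheme is the sketch below. It is an assumption-laden illustration rather than the authors' algorithm: the momentum-style lookahead with coefficient `lam`, the quadratic alignment penalty with weight `beta`, and all function names are hypothetical.

```python
def broadcast_lookahead(global_w, momentum, lam=0.85):
    """Server side: send the global model overshot along the accumulated update."""
    return [w + lam * m for w, m in zip(global_w, momentum)]


def local_step(w, grad, anchor, lr=0.1, beta=0.01):
    """Client side: one gradient step on loss + (beta / 2) * ||w - anchor||^2.

    `anchor` is the overshot model received from the server, so the penalty
    keeps each client close to the projected global update.
    """
    return [wi - lr * (gi + beta * (wi - ai))
            for wi, gi, ai in zip(w, grad, anchor)]


def server_update(global_w, momentum, client_ws, anchor, lam=0.85):
    """Server side: average client displacements and refresh the momentum."""
    n = len(client_ws)
    avg_delta = [sum(cw[i] - anchor[i] for cw in client_ws) / n
                 for i in range(len(anchor))]
    new_momentum = [lam * m + d for m, d in zip(momentum, avg_delta)]
    new_global = [w + m for w, m in zip(global_w, new_momentum)]
    return new_global, new_momentum
```

In this sketch the clients store nothing between rounds and only the overshot weights are transmitted, which is one reading of the claim that no additional client memory or communication is required.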
-
Skeleton-based action recognition has attracted considerable research attention. Recently, a variety of works have been proposed to build accurate skeleton-based action recognizers. Among them, some works use large model architectures as backbones of their recognizers to boost the skeleton data representation capability, while other works pre-train their recognizers on external data to enrich the knowledge. In this work, we observe that large language models, which have been extensively used in various natural language processing tasks, generally hold both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process to project each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). Moreover, we incorporate several designs into our framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.
-
Generative models have been very popular in recent years for their image generation capabilities. GAN-based models are highly regarded for their disentangled latent space, which is a key feature contributing to their success in controlled image editing. On the other hand, diffusion models have emerged as powerful tools for generating high-quality images. However, the latent space of diffusion models is not as thoroughly explored or understood. Existing methods that aim to explore the latent space of diffusion models usually rely on text prompts to pinpoint specific semantics. This approach may be restrictive in areas such as art, fashion, or specialized fields like medicine, where suitable text prompts might not be available or easy to conceive, thus limiting the scope of existing work. In this paper, we propose an unsupervised method to discover latent semantics in text-to-image diffusion models without relying on text prompts. Our method takes a small set of unlabeled images from specific domains, such as faces or cats, and a pre-trained diffusion model, and discovers diverse semantics in an unsupervised fashion using a contrastive learning objective. Moreover, the learned directions can be applied simultaneously, either within the same domain (such as various types of facial edits) or across different domains (such as applying cat and face edits within the same image), without interfering with each other. Our extensive experiments show that our method achieves highly disentangled edits, outperforming existing approaches in both diffusion-based and GAN-based latent space editing methods.
-
Neural radiance fields have achieved remarkable performance in modeling the appearance of 3D scenes. However existing approaches still struggle with the view-dependent appearance of glossy surfaces especially under complex lighting of indoor environments. Unlike existing methods which typically assume distant lighting like an environment map we propose a learnable Gaussian directional encoding to better model the view-dependent effects under near-field lighting conditions. Importantly our new directional encoding captures the spatially-varying nature of near-field lighting and emulates the behavior of prefiltered environment maps. As a result it enables the efficient evaluation of preconvolved specular color at any 3D location with varying roughness coefficients. We further introduce a data-driven geometry prior that helps alleviate the shape radiance ambiguity in reflection modeling. We show that our Gaussian directional encoding and geometry prior significantly improve the modeling of challenging specular reflections in neural radiance fields which helps decompose appearance into more physically meaningful components.
-
In subject-driven text-to-image synthesis the synthesis process tends to be heavily influenced by the reference images provided by users often overlooking crucial attributes detailed in the text prompt. In this work we propose Subject-Agnostic Guidance (SAG) a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally we demonstrate its applicability in second-order customization methods where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications but leads to substantial quality improvements as evidenced by our evaluations and user studies.
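The abstract does not spell out the dual classifier-free guidance formula. As a rough illustration only, the sketch below shows one common way to compose two conditions in classifier-free guidance: first move from the unconditional prediction toward a subject-agnostic prediction, then from the subject-agnostic prediction toward the subject-conditioned one. The function name, the weights `w_text` and `w_subject`, and this particular composition are assumptions, not the authors' confirmed method.

```python
def dual_guidance(eps_uncond, eps_agnostic, eps_subject, w_text=7.5, w_subject=2.0):
    """Combine three noise predictions (lists of floats) into one guided prediction.

    eps_uncond   : prediction with no conditioning
    eps_agnostic : prediction with the text prompt but a subject-agnostic condition
    eps_subject  : prediction with both the text prompt and the subject condition
    All names and the weighting scheme are hypothetical.
    """
    guided = []
    for u, a, s in zip(eps_uncond, eps_agnostic, eps_subject):
        # First term pushes toward the text prompt, second toward the subject.
        guided.append(u + w_text * (a - u) + w_subject * (s - a))
    return guided
```

With both weights set to 1 the result equals the fully conditioned prediction, and with both set to 0 it equals the unconditional one, which is a quick sanity check on the composition.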
-
Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO) a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-pic dataset of 851K crowdsourced pairwise preferences we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences opening the door for scaling of diffusion model alignment methods.
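A minimal sketch of a per-pair, DPO-style loss for diffusion models, assuming the ELBO-based objective reduces to comparing squared noise-prediction errors of the fine-tuned model and a frozen reference model on the preferred and rejected images. The function name, the argument names, and folding any timestep weighting into `beta` are illustrative assumptions, not the authors' code.

```python
import math


def diffusion_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """Loss for one preference pair at one sampled timestep.

    err_w, err_l         : squared noise-prediction error of the trained model
                           on the preferred (w) and rejected (l) image
    err_w_ref, err_l_ref : the same errors for the frozen reference model
    """
    # How much better than the reference the trained model denoises each image.
    gain_w = err_w_ref - err_w
    gain_l = err_l_ref - err_l
    margin = beta * (gain_w - gain_l)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the trained model matches the reference, and it decreases as the model improves more on the preferred image than on the rejected one.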
-
Advanced life forms sustained by the synergistic interaction of neural cognitive mechanisms continually acquire and transfer knowledge throughout their lifespan. In contrast contemporary machine learning paradigms exhibit limitations in emulating the facets of continual learning (CL). Nonetheless the emergence of large language models (LLMs) presents promising avenues for realizing CL via interactions with these models. Drawing on Complementary Learning System theory this paper presents a novel Interactive Continual Learning (ICL) framework enabled by collaborative interactions among models of various sizes. Specifically we assign the ViT model as System1 and multimodal LLM as System2. To enable the memory module to deduce tasks from class information and enhance Set2Set retrieval we propose the Class-Knowledge-Task Multi-Head Attention (CKT-MHA). Additionally to improve memory retrieval in System1 through enhanced geometric representation we introduce the CL-vMF mechanism based on the von Mises-Fisher (vMF) distribution. Meanwhile we introduce the von Mises-Fisher Outlier Detection and Interaction (vMF-ODI) strategy to identify hard examples thus enhancing collaboration between System1 and System2 for complex reasoning realization. Comprehensive evaluation of our proposed ICL demonstrates significant resistance to forgetting and superior performance relative to existing methods. Code is available at github.com/ICL.
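To illustrate how a von Mises-Fisher (vMF) model can drive both classification and hard-example detection, the sketch below scores a feature against unit-norm class means with a shared concentration `kappa` and flags the sample as hard when the top posterior is low. The shared `kappa`, equal class priors, the confidence threshold, and all function names are assumptions made for illustration; the paper's CL-vMF and vMF-ODI details may differ.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vmf_posteriors(feature, class_means, kappa=10.0):
    """Class posteriors under vMF likelihoods with shared kappa and equal priors.

    With a shared kappa the vMF normalizing constant cancels, so the posterior
    is a softmax over kappa * cosine_similarity(feature, class_mean).
    """
    x = _normalize(feature)
    logits = [kappa * sum(a * b for a, b in zip(x, _normalize(mu)))
              for mu in class_means]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def is_hard_example(feature, class_means, kappa=10.0, threshold=0.6):
    """Flag a sample whose most likely class is still uncertain (hypothetical rule)."""
    return max(vmf_posteriors(feature, class_means, kappa)) < threshold
```

In a two-system setup like the one described, a sample flagged this way would be the kind handed from the fast model to the larger one.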
-
We introduce a 3D-aware diffusion model ZeroNVS for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically we train a generative prior on a mixture of data sources that capture object-centric indoor and outdoor scenes. To address issues from data mixture such as depth-scale ambiguity we propose a novel camera conditioning parameterization and normalization scheme. Further we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes and propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis and demonstrate strong performance in this setting. Code and models will be publicly available.
-
The inherent generative power of denoising diffusion models makes them well-suited for image restoration tasks, where the objective is to find the optimal high-quality image within the generative space that closely resembles the input image. We propose a method to adapt a pretrained diffusion model for image restoration by simply adding noise to the input image to be restored and then denoising it. Our method is based on the observation that the space of a generative model needs to be constrained. We impose this constraint by finetuning the generative model with a set of anchor images that capture the characteristics of the input image. With the constrained space, we can then leverage the sampling strategy used for generation to do image restoration. We evaluate against previous methods and show superior performance on multiple real-world restoration datasets in preserving identity and image quality. We also demonstrate an important and practical application on personalized restoration, where we use a personal album as the anchor images to constrain the generative space. This approach allows us to produce results that accurately preserve high-frequency details, which previous works are unable to do. Project webpage: https://gen2res.github.io.
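The "add noise, then denoise" step starts from the standard forward diffusion process. The sketch below shows only that noising step for a flat list of pixel values, using the usual closed form x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The function name and the choice of `alpha_bar` are illustrative assumptions; the denoising itself would be done by the finetuned diffusion model and is not shown.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Sample x_t from the forward diffusion process q(x_t | x_0).

    x0        : list of floats (e.g. normalized pixel values of the degraded input)
    alpha_bar : cumulative signal fraction in [0, 1]; smaller means more noise
    rng       : random.Random instance, passed in for reproducibility
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]
```

Choosing `alpha_bar` trades off fidelity against restoration strength: values near 1 keep the input almost unchanged, while values near 0 discard it and leave the result to the generative prior.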
-
Amplitude modulated continuous-wave time-of-flight (AMCW-ToF) cameras are finding applications as flash Lidars in autonomous navigation robotics and AR/VR applications. A conventional CW-ToF camera requires illuminating the scene with a temporally varying light source and demodulating a set of quadrature measurements to recover the scene's depth and intensity. Capturing the four measurements in sequence renders the system slow invariably causing inaccuracies in depth estimates due to motion in the scene or the camera. To mitigate this problem we propose a snapshot Lidar that captures amplitude and phase simultaneously as a single time-of-flight hologram. Uniquely our approach requires minimal changes to existing CW-ToF imaging hardware. To demonstrate the efficacy of the proposed system we design and build a lab prototype and evaluate it under varying scene geometries illumination conditions and compare the reconstructed depth measurements against conventional techniques. We rigorously evaluate the robustness of our system on diverse real-world scenes to show that our technique results in a significant reduction in data bandwidth with minimal loss in reconstruction accuracy. As high-resolution CW-ToF cameras are becoming ubiquitous increasing their temporal resolution by four times enables robust real-time capture of geometries of dynamic scenes.
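For context, the conventional four-measurement pipeline that the snapshot approach aims to replace can be sketched as follows. The code simulates four quadrature correlation samples for a single pixel and recovers depth and amplitude with the standard four-bucket formula. The sample convention `m_k = offset + A*cos(phase - k*pi/2)` and all names are assumptions chosen for illustration; real sensors differ in sign and ordering conventions.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def simulate_quadrature(depth_m, f_mod_hz, amplitude=1.0, offset=2.0):
    """Four correlation samples for one pixel at phase steps of 90 degrees."""
    phase = 4.0 * math.pi * f_mod_hz * depth_m / C  # round-trip phase delay
    return [offset + amplitude * math.cos(phase - k * math.pi / 2.0)
            for k in range(4)]


def demodulate(m, f_mod_hz):
    """Recover (depth in meters, amplitude) from four quadrature samples."""
    in_phase = m[0] - m[2]      # equals 2 * A * cos(phase)
    quadrature = m[1] - m[3]    # equals 2 * A * sin(phase)
    phase = math.atan2(quadrature, in_phase) % (2.0 * math.pi)
    amplitude = 0.5 * math.hypot(in_phase, quadrature)
    depth = C * phase / (4.0 * math.pi * f_mod_hz)
    return depth, amplitude
```

Depth is only unambiguous up to C / (2 * f_mod_hz), about 7.5 m at 20 MHz. Because the four samples are captured in sequence, any motion between them corrupts the phase estimate, which is the problem a single-shot capture avoids.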
-
Continual Learning (CL) enables machine learning models to learn from continuously shifting new training data in the absence of data from old tasks. Recently, pre-trained vision transformers combined with prompt tuning have shown promise for overcoming catastrophic forgetting in CL. These approaches rely on a pool of learnable prompts, which can be inefficient in sharing knowledge across tasks, leading to inferior performance. In addition, the lack of fine-grained layer-specific prompts does not allow these approaches to fully express the strength of prompts for CL. We address these limitations by proposing ConvPrompt, a novel convolutional prompt creation mechanism that maintains layer-wise shared embeddings, enabling both layer-specific learning and better concept transfer across tasks. The intelligent use of convolution enables us to maintain a low parameter overhead without compromising performance. We further leverage Large Language Models to generate fine-grained text descriptions of each category, which are used to compute task similarity and dynamically decide the number of prompts to be learned. Extensive experiments demonstrate the superiority of ConvPrompt, which improves SOTA by 3% with significantly less parameter overhead. We also perform strong ablation over various modules to disentangle the importance of different components.
-
Video deblurring relies on leveraging information from other frames in the video sequence to restore the blurred regions in the current frame. Mainstream approaches employ bidirectional feature propagation, spatio-temporal transformers, or a combination of both to extract information from the video sequence. However, limitations in memory and computational resources constrain the temporal window length of the spatio-temporal transformer, preventing the extraction of longer temporal contextual information from the video sequence. Additionally, bidirectional feature propagation is highly sensitive to inaccurate optical flow in blurry frames, leading to error accumulation during the propagation process. To address these issues, we propose BSSTNet, the Blur-aware Spatio-temporal Sparse Transformer Network. It introduces the blur map, which converts the originally dense attention into a sparse form, enabling a more extensive utilization of information throughout the entire video sequence. Specifically, BSSTNet (1) uses a longer temporal window in the transformer, leveraging information from more distant frames to restore the blurry pixels in the current frame, and (2) introduces bidirectional feature propagation guided by blur maps, which reduces error accumulation caused by blurry frames. The experimental results demonstrate that the proposed BSSTNet outperforms the state-of-the-art methods on the GoPro and DVD datasets.
-
Building a generalist agent that can interact with the world is an ultimate goal for humans, thus spurring research on embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained, previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently, LLMs have presented remarkable capabilities across various fields and provided a promising opportunity for embodied navigation. Drawing on this, we propose the first generalist model for embodied navigation, NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based instruction. The schema-based instruction flexibly casts various tasks into generation problems, thereby unifying a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into the training, equipping NaviLLM with a wide range of capabilities required by embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous state-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model also demonstrates strong generalizability and presents impressive results on unseen tasks, e.g. embodied question answering and 3D captioning.
-
Motion capture from a limited number of body-worn sensors such as inertial measurement units (IMUs) and pressure insoles has important applications in health human performance and entertainment. Recent work has focused on accurately reconstructing whole-body motion from a specific sensor configuration using six IMUs. While a common goal across applications is to use the minimal number of sensors to achieve required accuracy the optimal arrangement of the sensors might differ from application to application. We propose a single diffusion model DiffusionPoser which reconstructs human motion in real-time from an arbitrary combination of sensors including IMUs placed at specified locations and pressure insoles. Unlike existing methods our model grants users the flexibility to determine the number and arrangement of sensors tailored to the specific activity of interest without the need for retraining. A novel autoregressive inferencing scheme ensures real-time motion reconstruction that closely aligns with measured sensor signals. The generative nature of DiffusionPoser ensures realistic behavior even for degrees-of-freedom not directly measured. Qualitative results can be found on our project website.
-
Understanding how we grasp objects with our hands has important applications in areas like robotics and mixed reality. However, this challenging problem requires accurate modeling of the contact between hands and objects. To capture grasps, existing methods use skeletons, meshes, or parametric models that do not represent hand shape accurately, resulting in inaccurate contacts. We present MANUS, a method for Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians. We build a novel articulated 3D Gaussians representation that extends 3D Gaussian splatting for high-fidelity representation of articulating hands. Since our representation uses Gaussian primitives optimized from multi-view pixel-aligned losses, it enables us to efficiently and accurately estimate contacts between the hand and the object. For the most accurate results, our method requires tens of camera views that current datasets do not provide. We therefore build MANUS-Grasps, a new dataset that contains hand-object grasps viewed from 50+ cameras across 30+ scenes and 3 subjects, comprising over 7M frames. In addition to extensive qualitative results, we also show that our method outperforms others on a quantitative contact evaluation method that uses paint transfer from the object to the hand.
-
In image restoration (IR), leveraging semantic priors from segmentation models has been a common approach to improve performance. The recent segment anything model (SAM) has emerged as a powerful tool for extracting advanced semantic priors to enhance IR tasks. However, the computational cost of SAM is prohibitive for IR compared to existing smaller IR models. The incorporation of SAM for extracting semantic priors considerably hampers the model inference efficiency. To address this issue, we propose a general framework to distill SAM's semantic knowledge to boost existing IR models without interfering with their inference process. Specifically, our proposed framework consists of the semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD) scheme. SPF fuses two kinds of information, the restored image predicted by the original IR model and the semantic mask predicted by SAM, to produce the refined restored image. SPD uses self-distillation to transfer the fused semantic priors and boost the performance of the original IR models. Additionally, we design a semantic-guided relation (SGR) module for SPD, which ensures semantic feature representation space consistency to fully distill the priors. We demonstrate the effectiveness of our framework across multiple IR models and tasks, including deraining, deblurring, and denoising.
-
Geometric knowledge has been shown to be beneficial for the stereo matching task. However prior attempts to integrate geometric insights into stereo matching algorithms have largely focused on geometric knowledge from single images while crucial cross-view factors such as occlusion and matching uniqueness have been overlooked. To address this gap we propose a novel Intra-view and Cross-view Geometric knowledge learning Network (ICGNet) specifically crafted to assimilate both intra-view and cross-view geometric knowledge. ICGNet harnesses the power of interest points to serve as a channel for intra-view geometric understanding. Simultaneously it employs the correspondences among these points to capture cross-view geometric relationships. This dual incorporation empowers the proposed ICGNet to leverage both intra-view and cross-view geometric knowledge in its learning process substantially improving its ability to estimate disparities. Our extensive experiments demonstrate the superiority of the ICGNet over contemporary leading models. The code will be available at https://github.com/DFSDDDDD1199/ICGNet.
-
Domain generalization aims to solve the challenge of Out-of-Distribution (OOD) generalization by leveraging common knowledge learned from multiple training domains to generalize to unseen test domains. To accurately evaluate the OOD generalization ability, it is required that test data information is unavailable. However, the current domain generalization protocol may still have potential test data information leakage. This paper examines the risks of test data information leakage from two aspects of the current evaluation protocol: supervised pretraining on ImageNet and oracle model selection. We propose two modifications to the current protocol: employ self-supervised pretraining or train from scratch instead of using the current supervised pretraining, and use multiple test domains. These would result in a more precise evaluation of OOD generalization ability. We also rerun the algorithms with the modified protocol and introduce new leaderboards to encourage future research in domain generalization with a fairer comparison.
-
Black-Box Knowledge Distillation (B2KD) is a formulated problem for cloud-to-edge model compression with invisible data and models hosted on the server. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity of data distributions. In this paper we formalize a two-step workflow consisting of deprivatization and distillation and theoretically provide a new optimization direction from logits to cell boundary different from direct logits alignment. With its guidance we propose a new method Mapping-Emulation KD (MEKD) that distills a black-box cumbersome model into a lightweight one. Our method does not differentiate between treating soft or hard responses and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator and 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points. For different teacher-student pairs our method yields inspiring distillation performance on various benchmarks and outperforms the previous state-of-the-art approaches.
-
Large-scale 3D scene generation cannot simply reuse existing 3D object synthesis techniques, since 3D scenes usually hold complex spatial configurations and consist of a number of objects at varying scales. We thus propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, objects of synthesized 3D scenes can be easily manipulated by steering the corresponding BEV maps. Moreover, by adequately incorporating positional encoding and low-pass filters into the generator, the representation becomes equivariant to the given BEV map. Such equivariance allows us to produce large-scale, even infinite-scale, 3D scenes by synthesizing local scenes and then stitching them with smooth consistency. Extensive experiments on 3D scene datasets demonstrate the effectiveness of our approach. Our project website is at: https://zqh0253.github.io/BerfScene.
-
While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation, and a lack of expression diversity in the training images. For training, most methods employ differentiable rendering to compare a predicted face mesh with the input image, along with a plethora of additional loss functions. Not only does this differentiable rendering loss have to provide supervision to optimize for 3D face geometry, camera, albedo, and lighting, which is an ill-posed optimization problem, but the domain gap between rendering and input image further hinders the learning process. Instead, SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry and sparsely sampled pixels of the input image, generates a face image. As the neural rendering gets color information from sampled image pixels, supervising with a neural rendering-based reconstruction loss can focus solely on the geometry. Further, it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative, quantitative, and particularly our perceptual evaluations demonstrate that SMIRK achieves new state-of-the-art performance on accurate expression reconstruction. For our method's source code, demo video, and more, please visit our project webpage: https://georgeretsi.github.io/smirk/.
-
Vehicle-to-everything (V2X) has been a popular topic in the field of autonomous driving in recent years, and vehicle-infrastructure cooperation (VIC) has become one of its important research areas. The complexity of traffic conditions, such as blind spots and occlusion, greatly limits the perception capabilities of single-view roadside sensing systems. To further enhance the accuracy of roadside perception and provide better information to the vehicle side, in this paper we construct holographic intersections with various layouts to build a large-scale multi-sensor holographic vehicle-infrastructure cooperation dataset called HoloVIC. Our dataset includes 3 different types of sensors (Camera, Lidar, Fisheye) and employs 4 sensor layouts based on the different intersections. Each intersection is equipped with 6-18 sensors to capture synchronous data, while autonomous vehicles pass through these intersections to collect VIC data. HoloVIC contains 100k+ synchronous frames in total from different sensors. Additionally, we annotate 3D bounding boxes based on Camera, Fisheye, and Lidar. We also associate the IDs of the same objects across different devices and consecutive frames in sequence. Based on HoloVIC, we formulate four tasks to facilitate the development of related research. We also provide benchmarks for these tasks.
-
The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However its application in medical imaging presents challenges requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient fine-tuning of medical images via a two-stage hierarchical decoding procedure. In the initial stage H-SAM employs SAM's original decoder to generate a prior probabilistic mask guiding a more intricate decoding process in the second stage. Specifically we propose two key designs: 1) A class-balanced mask-guided self-attention mechanism addressing the unbalanced label distribution enhancing image embedding; 2) A learnable mask cross-attention mechanism spatially modulating the interplay among different image regions based on the prior mask. Moreover the inclusion of a hierarchical pixel decoder in H-SAM enhances its proficiency in capturing fine-grained and localized details. This approach enables SAM to effectively integrate learned medical priors facilitating enhanced adaptation for medical image segmentation with limited samples. Our H-SAM demonstrates a 4.78% improvement in average Dice compared to existing prompt-free SAM variants for multi-organ segmentation using only 10% of 2D slices. Notably without using any unlabeled data H-SAM even outperforms state-of-the-art semi-supervised models relying on extensive unlabeled training data across various medical datasets. Our code is available at https://github.com/Cccccczh404/H-SAM.
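The reported improvement is measured in the Dice coefficient. As a reminder of what that metric computes, here is a minimal sketch for one class on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are illustrative choices, not taken from the paper's evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    pred, target : equal-length sequences of 0/1 (or False/True) labels
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match (a convention)
    return 2.0 * intersection / total
```

For multi-organ segmentation, the average Dice is typically the mean of this value over organ classes and cases.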
-
Style transfer aims to render an image with the artistic features of a style image while maintaining the original structure. Various methods have been put forward for this task, but some challenges still exist. For instance, it is difficult for CNN-based methods to handle global information and long-range dependencies between input images, for which transformer-based methods have been proposed. Although transformers can better model the relationship between content and style images, they require high-cost hardware and time-consuming inference. To address these issues, we design a novel transformer model that includes only encoders, thus significantly reducing the computational cost. In addition, we find that existing style transfer methods may produce images that are under-stylized or missing content. To achieve better stylization, we design a content feature extractor and a style feature extractor. We can then feed pure content and style images into the transformer. Finally, we propose a network model termed Puff-Net, i.e. an efficient style transfer network with pure content and style feature fusion. Through qualitative and quantitative experiments, we demonstrate the advantages of our model compared to state-of-the-art ones in the literature. The code is available at https://github.com/ZszYmy9/Puff-Net.
-
Image warping a classic task in computer vision aims to use geometric transformations to change the appearance of images. Recent methods learn the resampling kernels for warping through neural networks to estimate missing values in irregular grids which however fail to capture local variations in deformed content and produce images with distortion and less high-frequency details. To address this issue this paper proposes an effective method namely MFR to learn Multi-Frequency Representations from input images for image warping. Specifically we propose a progressive filtering network to learn image representations from different frequency subbands and generate deformable images in a coarse-to-fine manner. Furthermore we employ learnable Gabor wavelet filters to improve the model's capability to learn local spatial-frequency representations. Comprehensive experiments including homography transformation equirectangular to perspective projection and asymmetric image super-resolution demonstrate that the proposed MFR significantly outperforms state-of-the-art image warping methods. Our method also showcases superior generalization to out-of-distribution domains where the generated images are equipped with rich details and less distortion thereby high visual quality. The source code is available at https://github.com/junxiao01/MFR.
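To make the Gabor wavelet filters concrete, the sketch below builds a single real-valued 2D Gabor kernel from the standard formula: a Gaussian envelope multiplied by a cosine carrier along a rotated axis. In a learnable variant, parameters such as `wavelength`, `theta`, and `sigma` would be trained rather than fixed. The function name, parameter names, and defaults are illustrative assumptions, not the paper's implementation.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Return a size x size real Gabor kernel as a list of rows (size must be odd).

    wavelength : period of the cosine carrier in pixels
    theta      : orientation of the carrier in radians
    sigma      : standard deviation of the Gaussian envelope
    gamma      : spatial aspect ratio of the envelope
    psi        : phase offset of the carrier
    """
    half = size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier runs along x_r.
            x_r = x * cos_t + y * sin_t
            y_r = -x * sin_t + y * cos_t
            envelope = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * x_r / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel
```

Each kernel responds to one frequency band and orientation while staying spatially localized, which is why a bank of such filters is a natural fit for learning local spatial-frequency representations.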
-
Adjusting camera exposure in arbitrary lighting conditions is the first step to ensure the functionality of computer vision applications. Poorly adjusted camera exposure often leads to critical failure and performance degradation. Traditional camera exposure control methods require multiple convergence steps and time-consuming processes making them unsuitable for dynamic lighting conditions. In this paper we propose a new camera exposure control framework that rapidly controls camera exposure while performing real-time processing by exploiting deep reinforcement learning. The proposed framework consists of four contributions: 1) a simplified training ground to simulate real-world's diverse and dynamic lighting changes 2) flickering and image attribute-aware reward design along with lightweight state design for real-time processing 3) a static-to-dynamic lighting curriculum to gradually improve the agent's exposure-adjusting capability and 4) domain randomization techniques to alleviate the limitation of the training ground and achieve seamless generalization in the wild. As a result our proposed method rapidly reaches a desired exposure level within five steps with real-time processing (1 ms). Also the acquired images are well-exposed and show superiority in various computer vision tasks such as feature extraction and object detection.
-
We introduce the Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction. Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images. We apply Gaussian Splatting to monocular reconstruction by learning a neural network that, at test time, performs reconstruction in a feed-forward manner at 38 FPS. Our main innovation is the surprisingly straightforward design of this network, which, using 2D operators, maps the input image to one 3D Gaussian per pixel. The resulting set of Gaussians thus has the form of an image, the Splatter Image. We further extend the method to take several images as input via cross-view attention. Owing to the speed of the renderer (588 FPS), we use a single GPU for training while generating entire images at each iteration to optimize perceptual metrics like LPIPS. On several synthetic, real, multi-category, and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works. Code, models, and more results are available at https://szymanowiczs.github.io/splatter-image.
-
From content moderation to wildlife conservation the number of applications that require models to recognize nuanced or subjective visual concepts is growing. Traditionally developing classifiers for such concepts requires substantial manual effort measured in hours days or even months to identify and annotate data needed for training. Even with recently proposed Agile Modeling techniques which enable rapid bootstrapping of image classifiers users are still required to spend 30 minutes or more of monotonous repetitive data labeling just to train a single classifier. Drawing on Fiske's Cognitive Miser theory we propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions reducing the total effort required to define a concept by an order of magnitude: from labeling 2000 images to only 100 plus some natural language interactions. Our framework leverages recent advances in foundation models both large language models and vision-language models to carve out the concept space through conversation and by automatically labeling training data points. Most importantly our framework eliminates the need for crowd-sourced annotations. Moreover our framework ultimately produces lightweight classification models that are deployable in cost-sensitive scenarios. Across 15 subjective concepts and across 2 public image classification datasets our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models like ALIGN CLIP CuPL and large visual question answering models like PaLI-X.
-
This paper introduces a versatile paradigm for integrating multi-view reflectance (optional) and normal maps acquired through photometric stereo. Our approach employs a pixel-wise joint re-parameterization of reflectance and normal considering them as a vector of radiances rendered under simulated varying illumination. This re-parameterization enables the seamless integration of reflectance and normal maps as input data in neural volume rendering-based 3D reconstruction while preserving a single optimization objective. In contrast recent multi-view photometric stereo (MVPS) methods depend on multiple potentially conflicting objectives. Despite its apparent simplicity our proposed approach outperforms state-of-the-art approaches in MVPS benchmarks across F-score Chamfer distance and mean angular error metrics. Notably it significantly improves the detailed 3D reconstruction of areas with high curvature or low visibility.
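The re-parameterization idea, turning a per-pixel reflectance and normal into a vector of radiances under simulated lights, can be sketched with a simple shading model. The example below assumes Lambertian shading with directional lights, so each radiance is albedo times the clamped cosine between the normal and the light direction. The Lambertian model, the light set, and all names are assumptions for illustration; the paper's actual rendering model may differ.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def radiance_vector(albedo, normal, light_dirs):
    """Map one pixel's (albedo, normal) to radiances under simulated lights.

    albedo     : scalar reflectance (use 1.0 when no reflectance map is given)
    normal     : 3-vector surface normal, not necessarily unit length
    light_dirs : list of 3-vector light directions (hypothetical simulated lights)
    """
    n = _normalize(normal)
    radiances = []
    for light in light_dirs:
        l = _normalize(light)
        cosine = sum(a * b for a, b in zip(n, l))
        radiances.append(albedo * max(0.0, cosine))  # clamp back-facing lights
    return radiances
```

Because the output is a set of ordinary radiance values, it can be compared against volume-rendered radiances with a single photometric loss, which matches the single-objective design the abstract describes.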
-
Backdoor attacks pose a significant security threat to deep learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To enhance both evasiveness and resilience, we introduce a novel backdoor attack, LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring that only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS achieves a high attack success rate across 4 datasets and 7 model structures while effectively evading 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS.
-
Object pose refinement is essential for robust object pose estimation. Previous work has made significant progress towards instance-level object pose refinement. Yet category-level pose refinement is a more challenging problem due to large shape variations within a category and the discrepancies between the target object and the shape prior. To address these challenges, we introduce a novel architecture for category-level object pose refinement. Our approach integrates an HS-layer and learnable affine transformations, which aim to enhance the extraction and alignment of geometric information. Additionally, we introduce a cross-cloud transformation mechanism that efficiently merges diverse data sources. Finally, we push the limits of our model by incorporating the shape prior information for translation and size error prediction. Through extensive quantitative experiments, we demonstrate the effectiveness of the proposed framework, which improves over the baseline method by a large margin across all metrics.
-
Removing noise from images, a.k.a. image denoising, can be a very challenging task, since the type and amount of noise can vary greatly from image to image due to many factors, including the camera model and the capturing environment. While there have been striking improvements in image denoising with the emergence of advanced deep learning architectures and real-world datasets, recent denoising networks struggle to maintain performance on images with noise that has not been seen during training. One typical approach to address this challenge would be to adapt a denoising network to the new noise distribution. Instead, in this work, we shift our attention to the input noise itself for adaptation, rather than adapting a network. Thus, we keep a pretrained network frozen and adapt the input noise to capture the fine-grained deviations. As such, we propose a new denoising algorithm, dubbed Learning-to-Adapt-Noise (LAN), where a learnable noise offset is directly added to a given noisy image to bring the input noise closer to the noise distribution that the denoising network is trained to handle. Consequently, the proposed framework exhibits performance improvement on images with unseen noise, displaying the potential of the proposed research direction.
-
Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction (HSI) modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics, focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.
-
Single-point annotation in visual tasks, with the goal of minimizing labeling costs, is becoming increasingly prominent in research. Recently, visual foundation models such as Segment Anything (SAM) have gained widespread usage due to their robust zero-shot capabilities and exceptional annotation performance. However, SAM's class-agnostic output and high confidence in local segmentation introduce semantic ambiguity, posing a challenge for precise category-specific segmentation. In this paper, we introduce a cost-effective category-specific segmenter using SAM. To tackle this challenge, we have devised a Semantic-Aware Instance Segmentation Network (SAPNet) that integrates Multiple Instance Learning (MIL) with matching capability and SAM with point prompts. SAPNet strategically selects the most representative mask proposals generated by SAM to supervise segmentation, with a specific focus on object category information. Moreover, we introduce the Point Distance Guidance and Box Mining Strategy to mitigate inherent challenges: group and local issues in weakly supervised segmentation. These strategies serve to further enhance the overall segmentation performance. The experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed SAPNet, emphasizing its semantic matching capabilities and its potential to advance point-prompted instance segmentation. The code is available at https://github.com/zhaoyangwei123/SAPNet.
-
This paper proposes Group Activity Feature (GAF) learning, in which features of multi-person activity are learned as a compact latent vector. Unlike prior work, in which manual annotation of group activities is required for supervised learning, our method learns the GAF through person attribute prediction without group activity annotations. By training the whole network end-to-end so that the GAF is required for predicting the attributes of the people in a group, the GAF is trained to represent the features of multi-person activity. As person attributes, we propose to use a person's action class and appearance features, because the former is easy to annotate due to its simplicity and the latter requires no manual annotation. In addition, we introduce location-guided attribute prediction to disentangle the complex GAF so that the features of each target person are extracted properly. Various experimental results validate that our method outperforms SOTA methods quantitatively and qualitatively on two public datasets. Visualization of our GAF also demonstrates that our method learns a GAF representing fine-grained group activity classes. Code: https://github.com/chihina/GAFL-CVPR2024.
-
Human-centric 3D scene understanding has recently drawn increasing attention, driven by its critical impact on robotics. However, human-centric real-life scenarios are extremely diverse and complicated, and humans have intricate motions and interactions. With limited labeled data, supervised methods struggle to generalize to general scenarios, hindering real-life applications. Mimicking human intelligence, we propose an unsupervised 3D detection method for human-centric scenarios by transferring knowledge from synthetic human instances to real scenes. To bridge the gap between the distinct data representations and feature distributions of synthetic models and real point clouds, we introduce novel modules for effective instance-to-scene representation transfer and synthetic-to-real feature alignment. Remarkably, our method exhibits superior performance compared to current state-of-the-art techniques, achieving an 87.8% improvement in mAP and closely approaching the performance of fully supervised methods (62.15 mAP vs. 69.02 mAP) on the HuCenLife dataset.
-
Various transfer attack methods have been proposed to evaluate the robustness of deep neural networks (DNNs). Although they manifest remarkable performance in generating untargeted adversarial perturbations, existing proposals still fail to achieve high targeted transferability. In this work, we discover that the adversarial perturbations' overfitting towards source models of mediocre generalization capability can hurt their targeted transferability. To address this issue, we focus on enhancing the source model's generalization capability to improve its ability to conduct transferable targeted adversarial attacks. In pursuit of this goal, we propose a novel model self-enhancement method that incorporates two major components: Sharpness-Aware Self-Distillation (SASD) and Weight Scaling (WS). Specifically, SASD distills a fine-tuned auxiliary model, which mirrors the source model's structure, into the source model while flattening the source model's loss landscape. WS obtains an approximate ensemble of numerous pruned models to perform model augmentation, which can be conveniently synergized with SASD to elevate the source model's generalization capability and thus improve the resultant targeted perturbations' transferability. Extensive experiments corroborate the effectiveness of the proposed method. Notably, under the black-box setting, our approach can outperform the state-of-the-art baselines by a significant margin of 12.2% on average in terms of the obtained targeted transferability. Code is available at https://github.com/g4alllf/SASD.
-
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics, e.g., for embodied agents or to train 3D generative models. However, so far, methods that estimate the category-level object pose require either large amounts of human annotations, CAD models, or input from RGB-D sensors. In contrast, we tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos without human supervision. We propose a two-step pipeline. First, we introduce a multi-view alignment procedure that determines canonical camera poses across videos with a novel and robust cyclic distance formulation for geometric and appearance matching using reconstructed coarse meshes and DINOv2 features. In a second step, the canonical poses and reconstructed meshes enable us to train a model for 3D pose estimation from a single image. In particular, our model learns to estimate dense correspondences between images and a prototypical 3D template by predicting, for each pixel in a 2D image, a feature vector of the corresponding vertex in the template mesh. We demonstrate that our method outperforms all baselines at the unsupervised alignment of object-centric videos by a large margin and provides faithful and robust predictions in the wild on the Pascal3D+ and ObjectNet3D datasets.
-
Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference times are slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and its trainable parameters amount to only 1% of those of the base model. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without the need for additional training: this "plug-and-play" functionality drastically reduces inference computation while maintaining the visual fidelity of generated images. Empirically, we show that our approach is able to produce visually appealing results and achieve an FID score comparable to the teacher's with as few as 8 to 16 steps.
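To make the cost of classifier-free guidance concrete, the sketch below shows the standard guidance combination, which evaluates the denoiser twice per sampling step (once unconditionally, once conditionally). This is the baseline being accelerated, not the guide model proposed in the abstract; `CountingModel` is a hypothetical toy stand-in for a denoiser.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: eps_u + w * (eps_c - eps_u)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

class CountingModel:
    """Toy stand-in for a denoiser; it only counts forward passes."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x, cond):
        self.calls += 1
        bias = 0.0 if cond is None else 1.0  # conditioning shifts the toy output
        return [v + bias for v in x]

def guided_step(model, x, cond, guidance_scale):
    """One guided noise prediction: two forward passes through the model."""
    eps_uncond = model(x, None)
    eps_cond = model(x, cond)
    return cfg_combine(eps_uncond, eps_cond, guidance_scale)

model = CountingModel()
eps = guided_step(model, [0.0, 1.0], "a prompt", 2.0)
print(eps, model.calls)
```

Because each guided step needs two passes, a method that produces a guided prediction in a single pass of the base model would save roughly half of the per-step computation, minus the cost of any added lightweight component.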
-
Brain decoding, a pivotal field in neuroscience, aims to reconstruct stimuli from acquired brain signals, primarily utilizing functional magnetic resonance imaging (fMRI). Currently, brain decoding is confined to a per-subject-per-model paradigm, limiting its applicability to the same individual for whom the decoding model is trained. This constraint stems from three key challenges: 1) the inherent variability in input dimensions across subjects due to differences in brain size; 2) the unique intrinsic neural patterns that influence how different individuals perceive and process sensory information; and 3) the limited data availability for new subjects in real-world scenarios, which hampers the performance of decoding models. In this paper, we present a novel approach, MindBridge, that achieves cross-subject brain decoding by employing only one model. Our proposed framework establishes a generic paradigm capable of addressing these challenges by introducing a biologically inspired aggregation function and a novel cyclic fMRI reconstruction mechanism for subject-invariant representation learning. Notably, through cycle reconstruction of fMRI, MindBridge can enable novel fMRI synthesis, which can also serve as pseudo data augmentation. Within the framework, we also devise a novel reset-tuning method for adapting a pretrained model to a new subject. Experimental results demonstrate MindBridge's ability to reconstruct images for multiple subjects, which is competitive with dedicated subject-specific models. Furthermore, with limited data for a new subject, we achieve a high level of decoding accuracy, surpassing that of subject-specific models. This advancement in cross-subject brain decoding suggests promising directions for wider applications in neuroscience and indicates potential for more efficient utilization of limited fMRI data in real-world scenarios. Project page: https://littlepure2333.github.io/MindBridge
-
Creating high-dynamic videos, such as motion-rich actions and sophisticated visual effects, poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motion despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance, trained with public data, exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.
-
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring accurate tracking and depiction of story-coherent and character-centric audio descriptions. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both the existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons about and scores AD generation performance in various extendable dimensions.
-
Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available.
-
In magnetic resonance imaging (MRI), slice-to-volume reconstruction (SVR) refers to the computational reconstruction of an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion. While promising, current SVR methods require multiple slice stacks for accurate 3D reconstruction, leading to long scans and limiting their use in time-sensitive applications such as fetal fMRI. Here, we propose an SVR method that overcomes the shortcomings of previous work and produces state-of-the-art reconstructions in the presence of extreme inter-slice motion. Inspired by the recent success of single-view depth estimation methods, we formulate SVR as a single-stack motion estimation task and train a fully convolutional network to predict a motion stack for a given slice stack, producing a 3D reconstruction as a byproduct of the predicted motion. Extensive experiments on the SVR of adult and fetal brains demonstrate that our fully convolutional method is twice as accurate as previous SVR methods. Our code is available at github.com/seannz/svr.
-
Text-to-image (T2I) generative models have recently emerged as a powerful tool, enabling the creation of photo-realistic images and giving rise to a multitude of applications. However, the effective integration of T2I models into fundamental image classification tasks remains an open question. A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by T2I models. In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques. Our analysis reveals that these methods struggle to produce images that are both faithful (in terms of foreground objects) and diverse (in terms of background contexts) for domain-specific concepts. To tackle this challenge, we introduce an innovative inter-class data augmentation method known as Diff-Mix (https://github.com/Zhicaiwww/Diff-Mix), which enriches the dataset by performing image translations between classes. Our empirical results demonstrate that Diff-Mix achieves a better balance between faithfulness and diversity, leading to a marked improvement in performance across diverse image classification scenarios, including few-shot, conventional, and long-tail classifications for domain-specific datasets.
-
Binary neural networks (BNNs) utilize 1-bit quantized weights and activations to reduce both the model's storage demands and computational burden. However, advanced binary architectures still incorporate millions of inefficient and non-hardware-friendly full-precision multiplication operations. A&B BNN is proposed to directly remove part of the multiplication operations in a traditional BNN and replace the rest with an equal number of bit operations, introducing the mask layer and the quantized RPReLU structure based on the normalizer-free network architecture. The mask layer can be removed during inference by leveraging the intrinsic characteristics of BNNs with straightforward mathematical transformations to avoid the associated multiplication operations. The quantized RPReLU structure enables more efficient bit operations by constraining its slope to be an integer power of 2. A&B BNN achieves 92.30%, 69.35%, and 66.89% on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively, which is competitive with the state of the art. Ablation studies have verified the efficacy of the quantized RPReLU structure, which leads to a 1.14% enhancement on ImageNet compared to using a fixed-slope RLeakyReLU. The proposed add&bit-operation-only BNN offers an innovative approach for hardware-friendly network architecture.
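The following sketch illustrates why constraining a slope to an integer power of 2 removes the multiplication: on integer values, scaling by 2^k becomes a bit shift. It assumes the commonly used RPReLU form, f(x) = (x - gamma) + zeta for x > gamma and beta * (x - gamma) + zeta otherwise, with beta = 2^k; the function names and the rounding of the slope in the log2 domain are illustrative assumptions, not the paper's exact quantizer.

```python
import math

def slope_to_shift(slope):
    """Return integer k so that 2**k is the power of two nearest to slope (in log2)."""
    return round(math.log2(slope))

def shift_scale(x, k):
    """Multiply integer x by 2**k using shifts only (right shifts floor the result)."""
    return x << k if k >= 0 else x >> -k

def rprelu_shift(x, k, gamma, zeta):
    """RPReLU on integers with negative-side slope 2**k (assumed form, see lead-in)."""
    centered = x - gamma
    if centered > 0:
        return centered + zeta
    return shift_scale(centered, k) + zeta

k = slope_to_shift(0.25)  # exponent for a slope of 0.25
outputs = [rprelu_shift(x, k, gamma=1, zeta=3) for x in (5, -7)]
print(k, outputs)
```

Note that Python's right shift on negative integers rounds toward negative infinity, so the shifted result can differ slightly from a true multiplication followed by rounding to nearest.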
-
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill these requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation. It has strong potential to serve as a versatile tool for image-related tasks.
-
We present a generative approach to forecast long-term future human behavior in 3D, requiring only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits, expensive setups) but easy to acquire in 2D (simple RGB cameras). Thus, we design our method to only require 2D RGB data at inference time while being able to generate 3D human motion sequences. We use a differentiable 2D projection scheme in an autoregressive manner for weak supervision, and an adversarial loss for 3D regularization. Our method predicts long and complex human behavior sequences (e.g., cooking, assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner, jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are coupled in nature, and joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and improves over alternative approaches to forecast actions and characteristic 3D poses.
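A minimal sketch of how 2D data can supervise 3D predictions through projection follows. It assumes a simple pinhole camera and a mean squared reprojection error; neither the camera model nor the loss is specified in the abstract, and the function names are hypothetical. In practice the same arithmetic would be written in an automatic differentiation framework so that gradients flow back to the 3D joints.

```python
def project_points(points_3d, focal_length, principal_point):
    """Pinhole projection of 3D points (camera coordinates) to 2D pixels."""
    cx, cy = principal_point
    projected = []
    for x, y, z in points_3d:
        if z <= 0:
            raise ValueError("point must lie in front of the camera (z > 0)")
        projected.append((focal_length * x / z + cx, focal_length * y / z + cy))
    return projected

def reprojection_loss(points_3d, targets_2d, focal_length, principal_point):
    """Mean squared distance between projected 3D joints and 2D annotations."""
    projected = project_points(points_3d, focal_length, principal_point)
    total = sum((u - tu) ** 2 + (v - tv) ** 2
                for (u, v), (tu, tv) in zip(projected, targets_2d))
    return total / len(targets_2d)

joints_3d = [(1.0, 2.0, 4.0)]
pixels = project_points(joints_3d, 2.0, (0.0, 0.0))
loss = reprojection_loss(joints_3d, [(0.5, 0.0)], 2.0, (0.0, 0.0))
print(pixels, loss)
```

Since projection discards depth, a loss of this kind constrains the 3D pose only up to depth ambiguity, which is consistent with the abstract pairing it with a separate 3D regularizer.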
-
Nighttime conditions pose a significant challenge to color constancy due to the diversity of lighting conditions and the presence of substantial low-light noise. Existing color constancy methods struggle with nighttime scenes, frequently leading to imprecise light color estimations. To tackle nighttime color constancy, we propose a novel unsupervised domain adaptation approach that utilizes labeled daytime data to facilitate learning on unlabeled nighttime images. To specifically address the unique lighting conditions of nighttime and ensure the robustness of pseudo labels, we propose adaptive channel masking and light uncertainty. By selectively masking channels that are less sensitive to lighting conditions, adaptive channel masking directs the model to progressively focus on features less affected by variations in light colors and noise. Additionally, our model leverages light uncertainty to provide a pixel-wise uncertainty estimation regarding light color prediction, which helps avoid learning from incorrect labels. Our model demonstrates a significant improvement in accuracy, achieving a 21.5% lower Mean Angular Error (MAE) compared to the state-of-the-art method on our nighttime dataset.
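For readers unfamiliar with the metric, the sketch below computes the angular error between a predicted and a ground-truth illuminant color as it is commonly defined in color constancy work: the angle between the two RGB vectors, in degrees. The exact evaluation protocol of the paper is not given in the abstract, so treat this as the conventional definition rather than the authors' code.

```python
import math

def angular_error_degrees(pred, target):
    """Angle in degrees between two illuminant color vectors (scale-invariant)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def mean_angular_error(preds, targets):
    """Average angular error over a set of images."""
    errors = [angular_error_degrees(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)

mae = mean_angular_error(
    [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    [[0.0, 1.0, 0.0], [2.0, 2.0, 2.0]],
)
print(mae)
```

The metric ignores overall brightness: a prediction that differs from the ground truth only by a scale factor has an error of approximately zero.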
-
Part-aware panoptic segmentation (PPS) requires (a) that each foreground object and background region in an image is segmented and classified, and (b) that all parts within foreground objects are segmented, classified, and linked to their parent object. Existing methods approach PPS by separately conducting object-level and part-level segmentation. However, their part-level predictions are not linked to individual parent objects. Therefore, their learning objective is not aligned with the PPS task objective, which harms PPS performance. To solve this and make more accurate PPS predictions, we propose Task-Aligned Part-aware Panoptic Segmentation (TAPPS). This method uses a set of shared queries to jointly predict (a) object-level segments and (b) the part-level segments within those same objects. As a result, TAPPS learns to predict part-level segments that are linked to individual parent objects, aligning the learning objective with the task objective and allowing TAPPS to leverage joint object-part representations. With experiments, we show that TAPPS considerably outperforms methods that predict objects and parts separately and achieves new state-of-the-art PPS results.
-
In the realm of computer vision, Neural Fields have gained prominence as a contemporary tool harnessing neural networks for signal representation. Despite the remarkable progress in adapting these networks to solve a variety of problems, the field still lacks a comprehensive theoretical framework. This article aims to address this gap by delving into the intricate interplay between initialization and activation, providing a foundational basis for the robust optimization of Neural Fields. Our theoretical insights reveal a deep-seated connection among network initialization, architectural choices, and the optimization process, emphasizing the need for a holistic approach when designing cutting-edge Neural Fields.
-
3D instance segmentation is fundamental to geometric understanding of the world around us. Existing methods for instance segmentation of 3D scenes rely on supervision from expensive manual 3D annotations. We propose UnScene3D, the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. UnScene3D first generates pseudo masks by leveraging self-supervised color and geometry features to find potential object regions. We operate on a basis of 3D segment primitives, enabling efficient representation and learning on high-resolution 3D data. The coarse proposals are then refined by self-training our model on its own predictions. Our approach improves over state-of-the-art unsupervised 3D instance segmentation methods by more than 300% in Average Precision score, demonstrating effective instance segmentation even in challenging, cluttered 3D scenes.
-
Model quantization is widely used to compress and accelerate deep neural networks. However, recent studies have revealed the feasibility of weaponizing model quantization via implanting quantization-conditioned backdoors (QCBs). These special backdoors stay dormant on released full-precision models but come into effect after standard quantization. Due to the peculiarity of QCBs, existing defenses have only minor effects on reducing their threats or are even infeasible. In this paper, we conduct the first in-depth analysis of QCBs. We reveal that the activation of existing QCBs primarily stems from the nearest rounding operation and is closely related to the norms of neuron-wise truncation errors (i.e., the difference between the continuous full-precision weights and their quantized versions). Motivated by these insights, we propose Error-guided Flipped Rounding with Activation Preservation (EFRAP), an effective and practical defense against QCBs. Specifically, EFRAP learns a non-nearest rounding strategy with neuron-wise error norm and layer-wise activation preservation guidance, flipping the rounding strategies of neurons that are crucial for backdoor effects but have minimal impact on clean accuracy. Extensive evaluations on benchmark datasets demonstrate that our EFRAP can defeat state-of-the-art QCB attacks under various settings. Code is available.
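The terms used above can be illustrated with a small sketch of uniform quantization: nearest rounding, the alternative ("flipped") rounding to the other neighboring grid point, and the norm of a neuron's truncation error. This is only a toy illustration of the vocabulary; EFRAP learns which weights to flip, and that learning procedure is not shown here. The function names and the step size are assumptions made for the example.

```python
import math

def nearest_round(w, scale):
    """Quantize w to the nearest multiple of scale."""
    return math.floor(w / scale + 0.5) * scale

def flipped_round(w, scale):
    """Quantize w to the other neighboring grid point (non-nearest rounding)."""
    lower = math.floor(w / scale)
    nearest = math.floor(w / scale + 0.5)
    other = lower + 1 if nearest == lower else lower
    return other * scale

def truncation_error_norm(weights, scale, rounding):
    """L2 norm of (full-precision weight - quantized weight) over one neuron."""
    errors = [w - rounding(w, scale) for w in weights]
    return math.sqrt(sum(e * e for e in errors))

neuron = [0.6, 0.9]
nearest_norm = truncation_error_norm(neuron, 0.5, nearest_round)
flipped_norm = truncation_error_norm(neuron, 0.5, flipped_round)
print(nearest_norm, flipped_norm)
```

Flipping always increases the per-weight error relative to nearest rounding, which is why a defense of this kind has to balance disrupting the backdoor against preserving clean accuracy.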
-
The realism of digital avatars is crucial in enabling telepresence applications with self-expression and customization. While physical simulations can produce realistic motions for clothed humans, they require high-quality garment assets with associated physical parameters for cloth simulations. However, manually creating these assets and calibrating their parameters is labor-intensive and requires specialized expertise. Current methods focus on reconstructing geometry but do not generate complete assets for physics-based applications. To address this gap, we propose DiffAvatar, a novel approach that performs body and garment co-optimization using differentiable simulation. By integrating physical simulation into the optimization loop and accounting for the complex nonlinear behavior of cloth and its intricate interaction with the body, our framework recovers body and garment geometry and extracts important material parameters in a physically plausible way. Our experiments demonstrate that our approach generates realistic clothing and body shape suitable for downstream applications. We provide additional insights and results on our webpage: people.csail.mit.edu/liyifei/publication/diffavatar.
-
Powered by massive curated training data, the Segment Anything Model (SAM) has demonstrated impressive generalization capabilities in open-world scenarios with the guidance of prompts. However, the vanilla SAM is class-agnostic and heavily relies on user-provided prompts to segment objects of interest. Adapting this method to diverse tasks is crucial for accurate target identification and to avoid suboptimal segmentation results. In this paper, we propose a novel framework, termed AlignSAM, designed for automatic prompting to align SAM to an open context through reinforcement learning. Anchored by an agent, AlignSAM enables the generality of the SAM model across diverse downstream tasks while keeping its parameters frozen. Specifically, AlignSAM initiates a prompting agent to iteratively refine segmentation predictions by interacting with the foundation model. It integrates a reinforcement learning policy network to provide informative prompts to the foundation model. Additionally, a semantic recalibration module is introduced to provide fine-grained labels of prompts, enhancing the model's proficiency in handling tasks encompassing explicit and implicit semantics. Experiments conducted on various challenging segmentation tasks among existing foundation models demonstrate the superiority of the proposed AlignSAM over state-of-the-art approaches. Project page: https://github.com/Duojun-Huang/AlignSAM-CVPR2024.
-
Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications. Existing generalization techniques either necessitate external images for augmentation and/or aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities along with the potential of binding different modalities. For instance the advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning (ii) language-driven local style augmentation and (iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks.
-
Diffusion models are at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos but also temporal consistency across video frames. In this paper, we propose a novel approach pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE and only optimizes two deliberately designed modules, spatial feature adaptation (SFA) and temporal feature alignment (TFA), in the decoder of the UNet and VAE. SFA modulates frame features by adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention and executes cross-attention between a tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
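The pixel-wise affine modulation attributed to SFA can be illustrated with a minimal sketch. The function name, the `(1 + scale)` form, and the zero-valued placeholder parameters are assumptions made for illustration; in the paper the affine parameters are estimated from low-resolution guidance by a learned module, which is not reproduced here.

```python
import numpy as np

def spatial_affine_modulation(features, scale, shift):
    """Apply a separate affine transform at every pixel.

    features, scale, shift: arrays of shape (C, H, W).
    The (1 + scale) parameterization is an assumption, chosen so that
    zero-valued parameters leave the features unchanged.
    """
    return features * (1.0 + scale) + shift

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))

# Hypothetical stand-ins for parameters that a network would predict
# from the low-resolution frame.
identity_out = spatial_affine_modulation(feat, np.zeros_like(feat), np.zeros_like(feat))
shifted_out = spatial_affine_modulation(feat, np.zeros_like(feat), np.ones_like(feat))
```

Because every pixel has its own scale and shift, the guidance can vary across the frame instead of being shared per channel.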
-
Large-scale datasets have fueled recent advancements in AI-based autonomous vehicle research. However these datasets are usually collected from a single vehicle's one-time pass of a certain location lacking multiagent interactions or repeated traversals of the same place. Such information could lead to transformative enhancements in autonomous vehicles' perception prediction and planning capabilities. To bridge this gap in collaboration with the self-driving company May Mobility we present the MARS dataset which unifies scenarios that enable MultiAgent multitraveRSal and multimodal autonomous vehicle research. More specifically MARS is collected with a fleet of autonomous vehicles driving within a certain geographical area. Each vehicle has its own route and different vehicles may appear at nearby locations. Each vehicle is equipped with a LiDAR and surround-view RGB cameras. We curate two subsets in MARS: one facilitates collaborative driving with multiple vehicles simultaneously present at the same location and the other enables memory retrospection through asynchronous traversals of the same location by multiple vehicles. We conduct experiments in place recognition and neural reconstruction. More importantly MARS introduces new research opportunities and challenges such as multitraversal 3D reconstruction multiagent perception and unsupervised object discovery. Our data and codes can be found at https://ai4ce.github.io/MARS/.
-
Various pose estimation and tracking problems in robotics can be decomposed into a correspondence estimation problem (often computed using a deep network) followed by a weighted least squares optimization problem to solve for the poses. Recent work has shown that coupling the two problems by iteratively refining one conditioned on the other's output yields SOTA results across domains. However training these models has proved challenging requiring a litany of tricks to stabilize and speed up training. In this work we take the visual odometry problem as an example and identify three plausible causes: (1) flow loss interference (2) linearization errors in the bundle adjustment (BA) layer and (3) dependence of weight gradients on the BA residual. We show how these issues result in noisy and higher variance gradients potentially leading to a slow down in training and instabilities. We then propose a simple solution to reduce the gradient variance by using the weights predicted by the network in the inner optimization loop to also weight the correspondence objective in the training problem. This helps the training objective 'focus' on the more important points thereby reducing the variance and mitigating the influence of outliers. We show that the resulting method leads to faster training and can be more flexibly trained in varying training setups without sacrificing performance. In particular we show 2-2.5x training speedups over a baseline visual odometry model we modify.
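The idea of reusing the network-predicted weights to weight the correspondence objective can be sketched as follows. This is a simplified illustration, not the paper's training loss: the function name, the normalization by the weight sum, and the toy data are assumptions.

```python
import numpy as np

def weighted_flow_loss(pred_flow, target_flow, weights, eps=1e-8):
    """Confidence-weighted correspondence loss.

    pred_flow, target_flow: (N, 2) arrays of 2D correspondences.
    weights: (N,) non-negative confidences, assumed here to be the same
    weights the inner least-squares solver would use.
    """
    residual = np.linalg.norm(pred_flow - target_flow, axis=1)
    return float(np.sum(weights * residual) / (np.sum(weights) + eps))

target = np.zeros((3, 2))
pred = np.array([[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])  # last point is an outlier
weights = np.array([1.0, 1.0, 0.0])                         # outlier gets zero confidence

weighted = weighted_flow_loss(pred, target, weights)
unweighted = weighted_flow_loss(pred, target, np.ones(3))
```

With the outlier down-weighted, the loss is dominated by the two reliable points, which is the variance-reduction effect the abstract describes.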
-
Point clouds frequently contain noise and outliers presenting obstacles for downstream applications. In this work we introduce a novel denoising method for point clouds. By leveraging the latent space we explicitly uncover noise components allowing for the extraction of a clean latent code. This in turn facilitates the restoration of clean points via inverse transformation. A key component in our network is a new multi-level graph convolution network for capturing rich geometric structural features at various scales from local to global. These features are then integrated into the invertible neural network which bijectively maps the latent space to guide the noise disentanglement process. Additionally we employ an invertible monotone operator to model the transformation process effectively enhancing the representation of integrated geometric features. This enhancement allows our network to precisely differentiate between noise factors and the intrinsic clean points in the latent code by projecting them onto separate channels. Both qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art methods at various noise levels. The source code is available at https://github.com/yanbiao1/PD-LTS.
-
Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention however entangles detection and tracking queries in one embedding for both the detection and tracking task which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm detecting objects using decoupled track and detection queries followed by a subsequent association. These methods however do not leverage synergies between the detection and association task. Combining the strengths of both paradigms we introduce ADA-Track a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention leveraging appearance and geometric features. Furthermore we integrate this association module into the decoder layer of a DETR-based 3D detector enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers queries are refined for the detection and association task alternately effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at https://github.com/dsx0511/ADA-Track.
-
Hyperspectral image (HSI) restoration aims at recovering clean images from degraded observations and plays a vital role in downstream tasks. Existing model-based methods have limitations in accurately modeling complex image characteristics with handcrafted priors, and deep learning-based methods suffer from poor generalization ability. To alleviate these issues, this paper proposes an unsupervised HSI restoration framework with a pre-trained diffusion model (HIR-Diff), which restores the clean HSIs from the product of two low-rank components, i.e., the reduced image and the coefficient matrix. Specifically, the reduced image, which has a low spectral dimension, lies in the image field and can be inferred from our improved diffusion model, where a new guidance function with a total variation (TV) prior is designed to ensure that the reduced image can be well sampled. The coefficient matrix can be effectively pre-estimated based on singular value decomposition (SVD) and rank-revealing QR (RRQR) factorization. Furthermore, a novel exponential noise schedule is proposed to accelerate the restoration process (about 5x acceleration for denoising) with little performance decrease. Extensive experimental results validate the superiority of our method in both performance and speed on a variety of HSI restoration tasks, including HSI denoising, noisy HSI super-resolution, and noisy HSI inpainting. The code is available at https://github.com/LiPang/HIRDiff.
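The "product of two low-rank components" can be illustrated with a plain truncated SVD. This sketch only shows the factorization form (reduced image times coefficient matrix); it does not reproduce the paper's RRQR step, the diffusion sampling of the reduced image, or the TV guidance. All names and the synthetic data are assumptions.

```python
import numpy as np

def low_rank_factorize(hsi, rank):
    """Split an (H, W, B) cube into a reduced image (H, W, K) and a
    coefficient matrix (K, B) using a truncated SVD."""
    h, w, bands = hsi.shape
    flat = hsi.reshape(h * w, bands)
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = (u[:, :rank] * s[:rank]).reshape(h, w, rank)
    coeff = vt[:rank]
    return reduced, coeff

def reconstruct(reduced, coeff):
    """Recover the full cube as the product of the two components."""
    return reduced @ coeff

# Synthetic cube that is exactly rank 3 along the spectral dimension.
rng = np.random.default_rng(0)
hsi = rng.random((16, 16, 3)) @ rng.random((3, 31))

reduced, coeff = low_rank_factorize(hsi, rank=3)
restored = reconstruct(reduced, coeff)
```

The reduced image has only K channels, which is what makes it plausible to sample it with an image diffusion model.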
-
Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with numerous applications. Recently LIDAR-supervised methods have achieved remarkable per-pixel depth accuracy in outdoor scenes. However significant errors are typically found in the proximity of depth discontinuities i.e. depth edges which often hinder the performance of depth-dependent applications that are sensitive to such inaccuracies e.g. novel view synthesis and augmented reality. Since direct supervision for the location of depth edges is typically unavailable in sparse LIDAR-based scenes encouraging the MDE model to produce correct depth edges is not straightforward. To the best of our knowledge this paper is the first attempt to address the depth edges issue for LIDAR-supervised scenes. In this work we propose to learn to detect the location of depth edges from densely-supervised synthetic data and use it to generate supervision for the depth edges in the MDE training. To quantitatively evaluate our approach and due to the lack of depth edges GT in LIDAR-based scenes we manually annotated subsets of the KITTI and the DDAD datasets with depth edges ground truth. We demonstrate significant gains in the accuracy of the depth edges with comparable per-pixel depth accuracy on several challenging datasets. Code and datasets are available at https://github.com/liortalker/MindTheEdge.
-
Diffusion models (DMs) have exhibited superior performance in generating high-quality and diverse images. However this exceptional performance comes at the cost of expensive generation process particularly due to the heavily used attention module in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens without the need for any retraining. Specifically for single-denoising-step pruning we develop a novel ranking algorithm Generalized Weighted Page Rank (G-WPR) to identify redundant tokens and a similarity-based recovery method to restore tokens for the convolution operation. In addition we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g. 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model.
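Ranking tokens from an attention map can be sketched with a plain PageRank-style power iteration. This is a simplified stand-in, not the paper's G-WPR algorithm: the update rule, the iteration count, and the function names are assumptions, and the similarity-based recovery and DSAP budget schedule are omitted.

```python
import numpy as np

def rank_tokens(attn, iters=20):
    """Score tokens by how much attention they receive, propagated
    iteratively. attn[i, j] is the attention token i pays to token j;
    each row is assumed to sum to 1."""
    n = attn.shape[0]
    score = np.full(n, 1.0 / n)
    for _ in range(iters):
        score = attn.T @ score       # tokens inherit importance from their attenders
        score = score / score.sum()  # keep scores normalized
    return score

def select_tokens(attn, keep):
    """Return the sorted indices of the `keep` highest-scoring tokens."""
    score = rank_tokens(attn)
    return np.sort(np.argsort(score)[-keep:])

# Toy attention map in which every token attends mostly to token 0.
attn = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (4, 1))
kept = select_tokens(attn, keep=1)
```

Tokens outside the returned index set would be the pruning candidates at that denoising step.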
-
Retrieval Augmented Generation (RAG) is emerging as a flexible and robust technique to adapt models to private user data without training, to handle credit attribution, and to allow efficient machine unlearning at scale. However, RAG techniques for image generation may lead to parts of the retrieved samples being copied in the model's output. To reduce the risk of leaking private information contained in the retrieved set, we introduce Copy-Protected generation with Retrieval (CPR), a new method for RAG with strong copyright protection guarantees in a mixed-private setting for diffusion models. CPR allows conditioning the output of diffusion models on a set of retrieved images while also guaranteeing that unique identifiable information about those examples is not exposed in the generated outputs. In particular, it does so by sampling from a mixture of a public (safe) distribution and a private (user) distribution by merging their diffusion scores at inference. We prove that CPR satisfies Near Access Freeness (NAF), which bounds the amount of information an attacker may be able to extract from the generated images. We provide two algorithms for copyright protection, CPR-KL and CPR-Choose. Unlike previously proposed rejection-sampling-based NAF methods, our methods enable efficient copyright-protected sampling with a single run of backward diffusion. We show that our method can be applied to any pre-trained conditional diffusion model, such as Stable Diffusion or unCLIP. In particular, we empirically show that applying CPR on top of unCLIP improves the quality and text-to-image alignment of the generated results (81.4 to 83.17 on the TIFA benchmark) while enabling credit attribution, copyright protection, and deterministic constant-time unlearning.
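Merging the diffusion scores of a public and a private model can be illustrated with a toy example. The fixed convex combination below is an assumption made for illustration only; the actual merging rules of CPR-KL and CPR-Choose are not given in the abstract and are not reproduced. Unit-variance Gaussians stand in for the two distributions because their scores have a closed form.

```python
import numpy as np

def gaussian_score(x, mean):
    """Score (gradient of the log-density) of a unit-variance Gaussian."""
    return -(x - mean)

def merge_scores(score_public, score_private, w_public):
    """Hypothetical merging rule: a convex combination of the two scores."""
    return w_public * score_public + (1.0 - w_public) * score_private

mean_public = np.array([0.0, 0.0])
mean_private = np.array([2.0, 4.0])
x = 0.5 * mean_public + 0.5 * mean_private

merged = merge_scores(gaussian_score(x, mean_public),
                      gaussian_score(x, mean_private),
                      w_public=0.5)
```

In this toy case the merged score vanishes at the midpoint of the two means, so a sampler following it settles between the public and private modes rather than on the private one.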
-
To serve the intricate and varied demands of image editing, precise and flexible manipulation of image content is indispensable. Recently, drag-based editing methods have achieved impressive performance. However, these methods predominantly center on point dragging, resulting in two noteworthy drawbacks: "miss tracking", where difficulties arise in accurately tracking the predetermined handle points, and "ambiguous tracking", where tracked points are potentially positioned in wrong regions that closely resemble the handle points. To address these issues, we propose FreeDrag, a feature dragging methodology designed to relieve the burden of point tracking. FreeDrag incorporates two key designs, i.e., template features via adaptive updating and line search with backtracking. The former improves stability against drastic content change by carefully controlling the feature updating scale after each drag, while the latter alleviates the misguidance from similar points by actively restricting the search area to a line. Together, these two techniques contribute to more stable semantic dragging with higher efficiency. Comprehensive experimental results substantiate that our approach significantly outperforms pre-existing methodologies, offering reliable point-based editing even in various complex scenarios.
-
This paper addresses text-supervised semantic segmentation aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: A text often consists of multiple semantic concepts whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue we propose a novel framework Image-Text Co-Decomposition (CoDe) where the paired image and text are jointly decomposed into a set of image regions and a set of word segments respectively and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.
-
To accommodate real-world dynamics artificial intelligence systems need to cope with sequentially arriving content in an online manner. Beyond regular Continual Learning (CL) attempting to address catastrophic forgetting with offline training of each task Online Continual Learning (OCL) is a more challenging yet realistic setting that performs CL in a one-pass data stream. Current OCL methods primarily rely on memory replay of old training samples. However a notable gap from CL to OCL stems from the additional overfitting-underfitting dilemma associated with the use of rehearsal buffers: the inadequate learning of new training samples (underfitting) and the repeated learning of a few old training samples (overfitting). To this end we introduce a novel approach Multi-level Online Sequential Experts (MOSE) which cultivates the model as stacked sub-experts integrating multi-level supervision and reverse self-distillation. Supervision signals across multiple stages facilitate appropriate convergence of the new task while gathering various strengths from experts by knowledge distillation mitigates the performance decline of old tasks. MOSE demonstrates remarkable efficacy in learning new samples and preserving past knowledge through multi-level experts thereby significantly advancing OCL performance over state-of-the-art baselines (e.g. up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet).
-
In the pursuit of robust and generalizable environment perception and language understanding the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT) a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision language and history we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally to capture global confounder features we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R REVERIE RxR and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.
-
In the realm of point cloud scene understanding particularly in indoor scenes objects are arranged following human habits resulting in objects of certain semantics being closely positioned and displaying notable inter-object correlations. This can create a tendency for neural networks to exploit these strong dependencies bypassing the individual object patterns. To address this challenge we introduce a novel self-supervised learning (SSL) strategy. Our approach leverages both object patterns and contextual cues to produce robust features. It begins with the formulation of an object-exchanging strategy where pairs of objects with comparable sizes are exchanged across different scenes effectively disentangling the strong contextual dependencies. Subsequently we introduce a context-aware feature learning strategy which encodes object patterns without relying on their specific context by aggregating object features across various scenes. Our extensive experiments demonstrate the superiority of our method over existing SSL techniques further showing its better robustness to environmental changes. Moreover we showcase the applicability of our approach by transferring pre-trained models to diverse point cloud datasets.
-
Addressing pose ambiguity in 6D object pose estimation from single RGB images presents a significant challenge particularly due to object symmetries or occlusions. In response we introduce a novel score-based diffusion method applied to the SE(3) group marking the first application of diffusion models to SE(3) within the image domain specifically tailored for pose estimation tasks. Extensive evaluations demonstrate the method's efficacy in handling pose ambiguity mitigating perspective-induced ambiguity and showcasing the robustness of our surrogate Stein score formulation on SE(3). This formulation not only improves the convergence of denoising process but also enhances computational efficiency. Thus we pioneer a promising strategy for 6D object pose estimation.
-
We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image, and then combine these noise estimates together and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram: an image that changes appearance under some rearrangement of pixels. This includes rotations and flips, but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/
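The noise-combination step can be sketched for permutation views as follows. The averaging rule, the helper names, and the placeholder estimators are assumptions for illustration; a real pipeline would call a text-conditioned diffusion model once per view and prompt.

```python
import numpy as np

def make_permutation_view(perm):
    """Build a view that rearranges pixels by `perm`, plus its inverse."""
    inverse = np.argsort(perm)

    def apply(x):
        return x.reshape(-1)[perm].reshape(x.shape)

    def invert(x):
        return x.reshape(-1)[inverse].reshape(x.shape)

    return apply, invert

def combined_noise_estimate(noisy, views, estimators):
    """Estimate noise in each view, map it back, and average the results."""
    estimates = [invert(estimate(apply(noisy)))
                 for (apply, invert), estimate in zip(views, estimators)]
    return np.mean(estimates, axis=0)

rng = np.random.default_rng(0)
noisy = rng.normal(size=(8, 8))
n = noisy.size

views = [make_permutation_view(np.arange(n)),        # identity view
         make_permutation_view(np.arange(n)[::-1])]  # 180-degree rotation

# Placeholder estimators that return their input; stand-ins for a diffusion model.
estimators = [lambda x: x, lambda x: x]
combined = combined_noise_estimate(noisy, views, estimators)
```

A permutation only reorders pixels, so it preserves the norm and the statistics of Gaussian noise, which matches the abstract's claim that orthogonal views are the ones for which the method works.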
-
Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and methods for classic RES task heavily rely on the prior assumption that one expression must refer to object-level targets. In this paper we take a step further to finer-grained part-level RES task. To promote the object-level RES task towards finer-grained vision-language understanding we put forward a new multi-granularity referring expression segmentation (MRES) task and construct an evaluation benchmark called RefCOCOm by manual annotations. By employing our automatic model-assisted data engine we build the largest visual grounding dataset namely MRES-32M which comprises over 32.2M high-quality masks and captions on the provided 1M images. Besides a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task. Extensive experiments on our RefCOCOm for MRES and three datasets (i.e. RefCOCO(+/g)) for classic RES task demonstrate the superiority of our method over previous state-of-the-art methods. To foster future research into fine-grained visual grounding our benchmark RefCOCOm the MRES-32M dataset and model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES.
-
We present DiffInDScene, a novel framework for tackling the problem of high-quality 3D indoor scene generation, which is challenging due to the complexity and diversity of indoor scene geometry. Although diffusion-based generative models have previously demonstrated impressive performance in image generation and object-level 3D generation, they have not yet been applied to room-level 3D generation due to their computationally intensive costs. In DiffInDScene, we propose a cascaded 3D diffusion pipeline that is efficient and possesses strong generative performance for the Truncated Signed Distance Function (TSDF). The whole pipeline is designed to run on a sparse occupancy space in a coarse-to-fine fashion. Inspired by KinectFusion's incremental alignment and fusion of local TSDF volumes, we propose a diffusion-based SDF fusion approach that iteratively diffuses and fuses local TSDF volumes, facilitating the generation of an entire room environment. The generated results demonstrate that our work is capable of achieving high-quality room generation directly in three-dimensional space, starting from scratch. In addition to scene generation, the final part of DiffInDScene can be used as a post-processing module to refine the 3D reconstruction results from multi-view stereo. According to the user study, the mesh quality generated by DiffInDScene can even outperform the ground truth mesh provided by ScanNet.
-
Robust segmentation is critical for deriving quantitative measures from large-scale multi-center and longitudinal medical scans. Manually annotating medical scans however is expensive and labor-intensive and may not always be available in every domain. Unsupervised domain adaptation (UDA) is a well-studied technique that alleviates this label-scarcity problem by leveraging available labels from another domain. In this study we introduce Masked Autoencoding and Pseudo-Labeling Segmentation (MAPSeg) a unified UDA framework with great versatility and superior performance for heterogeneous and volumetric medical image segmentation. To the best of our knowledge this is the first study that systematically reviews and develops a framework to tackle four different domain shifts in medical image segmentation. More importantly MAPSeg is the first framework that can be applied to centralized federated and test-time UDA while maintaining comparable performance. We compare MAPSeg with previous state-of-the-art methods on a private infant brain MRI dataset and a public cardiac CT-MRI dataset and MAPSeg outperforms others by a large margin (10.5 Dice improvement on the private MRI dataset and 5.7 on the public CT-MRI dataset). MAPSeg poses great practical value and can be applied to real-world problems. GitHub: https://github.com/XuzheZ/MAPSeg/.
-
Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets in visual scenes. Given the prevalence of large visual variations of subject-object pairs even in the same predicate it can be quite challenging to model and refine predicate representations directly across such pairs which is however a common strategy adopted by most existing SGG methods. We observe that visual variations within the identical triplet are relatively small and certain relation cues are shared in the same type of triplet which can potentially facilitate the relation learning in SGG. Moreover for the long-tail problem widely studied in SGG task it is also crucial to deal with the limited types and quantity of triplets in tail predicates. Accordingly in this paper we propose a Dual-granularity Relation Modeling (DRM) network to leverage fine-grained triplet cues besides the coarse-grained predicate ones. DRM utilizes contexts and semantics of predicate and triplet with Dual-granularity Constraints generating compact and balanced representations from two perspectives to facilitate relation recognition. Furthermore a Dual-granularity Knowledge Transfer (DKT) strategy is introduced to transfer variation from head predicates/triplets to tail ones aiming to enrich the pattern diversity of tail classes to alleviate the long-tail problem. Extensive experiments demonstrate the effectiveness of our method which establishes new state-of-the-art performance on Visual Genome Open Image and GQA datasets. Our code is available at https://github.com/jkli1998/DRM.
-
Addressing the intricate challenge of modeling and re-rendering dynamic scenes most recent approaches have sought to simplify these complexities using plane-based explicit representations overcoming the slow training time issues associated with methods like Neural Radiance Fields (NeRF) and implicit representations. However the straightforward decomposition of 4D dynamic scenes into multiple 2D plane-based representations proves insufficient for re-rendering high-fidelity scenes with complex motions. In response we present a novel direction-aware representation (DaRe) approach that captures scene dynamics from six different directions. This learned representation undergoes an inverse dual-tree complex wavelet transformation (DTCWT) to recover plane-based information. DaReNeRF computes features for each space-time point by fusing vectors from these recovered planes. Combining DaReNeRF with a tiny MLP for color regression and leveraging volume rendering in training yield state-of-the-art performance in novel view synthesis for complex dynamic scenes. Notably to address redundancy introduced by the six real and six imaginary direction-aware wavelet coefficients we introduce a trainable masking approach mitigating storage issues without significant performance decline. Moreover DaReNeRF maintains a 2x reduction in training time compared to prior art while delivering superior performance.
-
This paper introduces SfmCAD a novel unsupervised network that reconstructs 3D shapes by learning the Sketch-based Feature Modeling operations commonly used in modern CAD workflows. Given a 3D shape represented as voxels SfmCAD learns a neural-typed sketch+path parameterized representation including 2D sketches of feature primitives and their 3D sweeping paths without supervision for inferring feature-based CAD programs. SfmCAD employs 2D sketches for local detail representation and 3D paths to capture the overall structure achieving a clear separation between shape details and structure. This conversion into parametric forms enables users to seamlessly adjust the shape's geometric and structural features thus enhancing interpretability and user control. We demonstrate the effectiveness of our method by applying SfmCAD to many different types of objects such as CAD parts ShapeNet objects and tree shapes. Extensive comparisons show that SfmCAD produces compact and faithful 3D reconstructions with superior quality compared to alternatives. The code is released at https://github.com/BunnySoCrazy/SfmCAD.
-
We present CoDi-2 a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and generation CoDi-2 empowers Large Language Models (LLMs) to understand modality-interleaved instructions and in-context examples and autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2 we build a large-scale generation dataset encompassing in-context multimodal instructions across text vision and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks like editing exemplar learning composition reasoning etc. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation vision transformation and audio editing and showcases a significant advancement for integrating diverse multimodal tasks with sequential generation.
-
Existing fine-tuning methods for computer vision tasks primarily focus on re-weighting the knowledge learned from the source domain during pre-training. They aim to retain beneficial knowledge for the target domain while suppressing unfavorable knowledge. During the pre-training and fine-tuning stages there is a notable disparity in the data scale. Consequently it is theoretically necessary to employ a model with reduced complexity to mitigate the potential structural risk. However our empirical investigation in this paper reveals that models fine-tuned using existing methods still manifest a high level of model complexity inherited from the pre-training stage leading to a suboptimal stability and generalization ability. This phenomenon indicates an issue that has been overlooked in fine-tuning: Structural Risk Minimization. To address this issue caused by data scale disparity during the fine-tuning stage we propose a simple yet effective approach called Tuning Stable Rank Shrinkage (TSRS). TSRS mitigates the structural risk during the fine-tuning stage by constraining the noise sensitivity of the target model based on stable rank theories. Through extensive experiments we demonstrate that incorporating TSRS into fine-tuning methods leads to improved generalization ability on various tasks regardless of whether the neural networks are based on convolution or transformer architectures. Additionally empirical analysis reveals that TSRS enhances the robustness convexity and smoothness of the loss landscapes in fine-tuned models. Code is available at https://github.com/WitGotFlg/TSRS.
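The stable rank that TSRS builds on has a standard definition: the squared Frobenius norm of a weight matrix divided by its squared spectral norm. The sketch below only computes this quantity; how TSRS shrinks it during fine-tuning is not described in the abstract and is not reproduced. The function name and test matrices are illustrative.

```python
import numpy as np

def stable_rank(weight):
    """Stable rank of a 2D matrix: ||W||_F^2 / ||W||_2^2."""
    frobenius_sq = np.sum(weight ** 2)
    spectral = np.linalg.norm(weight, ord=2)  # largest singular value
    return float(frobenius_sq / spectral ** 2)

identity_rank = stable_rank(np.eye(4))
rank_one = stable_rank(np.outer([1.0, 2.0, 3.0], [4.0, 5.0]))
```

The stable rank is at most the ordinary rank and changes smoothly with the singular values, which makes it a convenient complexity measure to constrain.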
-
Photometric stereo leverages variations in illumination conditions to reconstruct surface normals. Display photometric stereo which employs a conventional monitor as an illumination source has the potential to overcome limitations often encountered in bulky and difficult-to-use conventional setups. In this paper we present differentiable display photometric stereo (DDPS) addressing an often overlooked challenge in display photometric stereo: the design of display patterns. Departing from using heuristic display patterns DDPS learns the display patterns that yield accurate normal reconstruction for a target system in an end-to-end manner. To this end we propose a differentiable framework that couples basis-illumination image formation with analytic photometric-stereo reconstruction. The differentiable framework facilitates the effective learning of display patterns via auto-differentiation. Also for training supervision we propose to use 3D printing for creating a real-world training dataset enabling accurate reconstruction on the target real-world setup. Finally we exploit that conventional LCD monitors emit polarized light which allows for the optical separation of diffuse and specular reflections when combined with a polarization camera leading to accurate normal reconstruction. Extensive evaluation of DDPS shows improved normal-reconstruction accuracy compared to heuristic patterns and demonstrates compelling properties such as robustness to pattern initialization calibration errors and simplifications in image formation and reconstruction.
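For readers unfamiliar with the analytic reconstruction step, the following sketch shows classical Lambertian photometric stereo for a single pixel, solved by least squares. It is not the DDPS pipeline: the directional lights, the shadow-free Lambertian model, and the name `solve_normal` are all assumptions made for illustration, and the learned display patterns are not modeled.

```python
import numpy as np


def solve_normal(light_dirs, intensities):
    """Least-squares Lambertian photometric stereo for one pixel.

    light_dirs: (k, 3) unit light directions, intensities: (k,) observations.
    Solves light_dirs @ g = intensities, where g = albedo * normal.
    """
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    albedo = float(np.linalg.norm(g))
    return g / albedo, albedo


# Four hypothetical light directions, normalized to unit length.
lights = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [-1.0, 0.0, 1.0],
])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)

# Synthesize observations from a known normal and albedo (no shadows).
true_normal = np.array([0.0, 0.6, 0.8])
true_albedo = 0.5
observed = lights @ (true_albedo * true_normal)

normal, albedo = solve_normal(lights, observed)
print(normal, albedo)
```

Because every step here is a differentiable linear-algebra operation, gradients of a normal-reconstruction loss can flow back to the illumination, which is the property the abstract relies on for learning patterns.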
-
To alleviate the utility degradation of deep learning image classification with differential privacy (DP), employing extra public data or pre-trained models has been widely explored. Recently, the use of in-distribution public data has been investigated, where tiny subsets of datasets are released publicly. In this paper, we investigate a framework that leverages recent diffusion models to amplify the information of public data. Subsequently, we identify data diversity and the generalization gap between public and private data as critical factors in addressing the limited public data. While assuming 4% of training data as public, our method achieves 85.48% on CIFAR-10 with a privacy budget of ε=2, without employing extra public data for training.
-
Blind face restoration focuses on restoring high-fidelity details from images subjected to complex and unknown degradations while preserving identity information. In this paper we present a Prior-based Latent Transformation approach (PLTrans) which is specifically designed to learn a degradation-unaware representation thereby allowing the restoration network to effectively generalize to real-world degradation. Toward this end PLTrans learns a degradation-unaware query via a latent diffusion-based regularization module. Furthermore conditioned on the features of a degraded face image a latent dictionary that captures the priors of HQ face images is leveraged to refine the features by mapping the top-d nearest elements. The refined version will be used to build key and value for the cross-attention computation which is tailored to each degraded image and exhibits reduced sensitivity to different degradation factors. Conditioned on the resulting representation we train a decoding network that synthesizes face images with authentic details and identity preservation. Through extensive experiments we verify the effectiveness of the design elements and demonstrate the generalization ability of our proposed approach for both synthetic and unknown degradations. We finally demonstrate the applicability of PLTrans in other vision tasks.
-
Autonomous systems need to process large-scale sparse and irregular point clouds with limited compute resources. Consequently it is essential to develop LiDAR perception methods that are both efficient and effective. Although naively enlarging 3D kernel size can enhance performance it will also lead to a cubically-increasing overhead. Therefore it is crucial to develop streamlined 3D large kernel designs that eliminate redundant weights and work effectively with larger kernels. In this paper we propose an efficient and effective Large Sparse Kernel 3D Neural Network (LSK3DNet) that leverages dynamic pruning to amplify the 3D kernel size. Our method comprises two core components: Spatial-wise Dynamic Sparsity (SDS) and Channel-wise Weight Selection (CWS). SDS dynamically prunes and regrows volumetric weights from the beginning to learn a large sparse 3D kernel. It not only boosts performance but also significantly reduces model size and computational cost. Moreover CWS selects the most important channels for 3D convolution during training and subsequently prunes the redundant channels to accelerate inference for 3D vision tasks. We demonstrate the effectiveness of LSK3DNet on three benchmark datasets and five tracks compared with classical models and large kernel designs. Notably LSK3DNet achieves the state-of-the-art performance on SemanticKITTI (i.e. 75.6% on single-scan and 63.4% on multi-scan) with roughly 40% model size reduction and 60% computing operations reduction compared to the naive large 3D kernel model.
-
The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues we introduce a motion sampler based on conditional flow matching which is capable of high-quality motion code generation in an efficient way. Moreover we introduce a novel conditioning method for the TTS system which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.
-
Annotation ambiguity due to inherent data uncertainties such as blurred boundaries in medical scans and different observer expertise and preferences has become a major obstacle for training deep-learning based medical image segmentation models. To address it the common practice is to gather multiple annotations from different experts leading to the setting of multi-rater medical image segmentation. Existing works aim to either merge different annotations into the "groundtruth" that is often unattainable in numerous medical contexts or generate diverse results or produce personalized results corresponding to individual expert raters. Here we bring up a more ambitious goal for multi-rater medical image segmentation i.e. obtaining both diversified and personalized results. Specifically we propose a two-stage framework named D-Persona (first Diversification and then Personalization). In Stage I we exploit multiple given annotations to train a Probabilistic U-Net model with a bound-constrained loss to improve the prediction diversity. In this way a common latent space is constructed in Stage I where different latent codes denote diversified expert opinions. Then in Stage II we design multiple attention-based projection heads to adaptively query the corresponding expert prompts from the shared latent space and then perform the personalized medical image segmentation. We evaluated the proposed model on our in-house Nasopharyngeal Carcinoma dataset and the public lung nodule dataset (i.e. LIDC-IDRI). Extensive experiments demonstrated our D-Persona can provide diversified and personalized results at the same time achieving new SOTA performance for multi-rater medical image segmentation. Our code will be released at https://github.com/ycwu1997/D-Persona.
-
We conduct a comprehensive study on a new task named power battery detection (PBD), which aims to localize the dense cathode and anode plate endpoints from X-ray images to evaluate the quality of power batteries. Manufacturers usually rely on human eye observation to complete PBD, which makes it difficult to balance the accuracy and efficiency of detection. To address this issue and draw more attention to this meaningful task, we first carefully collect a dataset called X-ray PBD, which has 1500 diverse X-ray images selected from thousands of power batteries of 5 manufacturers, with 7 different kinds of visual interference. Then we propose a novel segmentation-based solution for PBD, termed multi-dimensional collaborative network (MDCNet). With the help of line and counting predictors, the representation of the point segmentation branch can be improved at both semantic and detail aspects. Besides, we design an effective distance-adaptive mask generation strategy, which can alleviate the visual challenge caused by the inconsistent distribution density of plates to provide MDCNet with stable supervision. Without any bells and whistles, our segmentation-based MDCNet consistently outperforms various other corner detection, crowd counting, and general/tiny object detection-based solutions, making it a strong baseline that can help facilitate future research in PBD. Finally, we share some potential difficulties and directions for future research. The source code and datasets will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD.
-
With the rapid growth in deepfake video content we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely the latter predominantly focuses on discerning audio-visual cues within the training corpus thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF) a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations we use contrastive learning and autoencoding objectives and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset outperforming the current audio-visual state-of-the-art by 14.9% and 9.9% respectively.
-
Machine learning models can perform well on in-distribution data but often fail on biased subgroups that are underrepresented in the training data hindering the robustness of models for reliable applications. Such subgroups are typically unknown due to the absence of subgroup labels. Discovering biased subgroups is the key to understanding models' failure modes and further improving models' robustness. Most previous works of subgroup discovery make an implicit assumption that models only underperform on a single biased subgroup which does not hold on in-the-wild data where multiple biased subgroups exist. In this work we propose Decomposition Interpretation and Mitigation (DIM) a novel method to address a more challenging but also more practical problem of discovering multiple biased subgroups in image classifiers. Our approach decomposes the image features into multiple components that represent multiple subgroups. This decomposition is achieved via a bilinear dimension reduction method Partial Least Square (PLS) guided by useful supervision from the image classifier. We further interpret the semantic meaning of each subgroup component by generating natural language descriptions using vision-language foundation models. Finally DIM mitigates multiple biased subgroups simultaneously via two strategies including the data- and model-centric strategies. Extensive experiments on CIFAR-100 and Breeds datasets demonstrate the effectiveness of DIM in discovering and mitigating multiple biased subgroups. Furthermore DIM uncovers the failure modes of the classifier on Hard ImageNet showcasing its broader applicability to understanding model bias in image classifiers.
-
This paper presents the DiffusionRegPose a novel approach to multi-person pose estimation that converts a one-stage end-to-end keypoint regression model into a diffusion-based sampling process. Existing one-stage deterministic regression methods though efficient are often prone to missed or false detections in crowded or occluded scenes due to their inability to reason pose ambiguity. To address these challenges we handle ambiguous poses in a generative fashion i.e. sampling from the image-conditioned pose distributions characterized by a diffusion probabilistic model. Specifically with initial pose tokens extracted from the image noisy pose candidates are progressively refined by interacting with the initial tokens via attention layers. Extensive evaluations on the COCO and CrowdPose datasets show that DiffusionRegPose clearly improves the pose accuracy in crowded scenarios as evidenced by a notable 4.0 AP increase in the AP_H metric on the CrowdPose dataset. This demonstrates the model's potential for robust and precise human pose estimation in real-world applications.
-
Deep functional maps have emerged in recent years as a prominent learning-based framework for non-rigid shape matching problems. While early methods in this domain focused only on learning in the functional domain, the latest techniques have demonstrated that promoting consistency between functional and pointwise maps leads to significant improvements in accuracy. Unfortunately, existing approaches rely heavily on the computation of large dense matrices arising from soft pointwise maps, which compromises their efficiency and scalability. To address this limitation, we introduce a novel memory-scalable and efficient functional map learning pipeline. By leveraging the specific structure of functional maps, we offer the possibility to achieve identical results without ever storing the pointwise map in memory. Furthermore, based on the same approach, we present a differentiable map refinement layer adapted from an existing axiomatic refinement algorithm. Unlike many functional map learning methods, which use this algorithm as a post-processing step, ours can be easily used at train time, making it possible to enforce consistency between the refined and initial versions of the map. Our resulting approach is simpler, more efficient, and more numerically stable, since it avoids differentiation through a linear system, while achieving close to state-of-the-art results in challenging scenarios.
-
Lately there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However the adaptation of these models to egocentric videos has been largely unexplored. To address this gap we propose a simple yet effective cross-modal adaptation framework which we call X-MIC. Using a video adapter our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves generalization of the pre-trained VLMs by disentangling learnable temporal modeling and frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens Ego4D and EGTEA datasets for fine-grained cross-dataset action generalization demonstrating the effectiveness of our method.
-
ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations
Group robustness strategies aim to mitigate learned biases in deep learning models that arise from spurious correlations present in their training datasets. However, most existing methods rely on access to the label distribution of the groups, which is time-consuming and expensive to obtain. As a result, unsupervised group robustness strategies are sought. Based on the insight that a trained model's classification strategies can be inferred accurately from explainability heatmaps, we introduce ExMap, an unsupervised two-stage mechanism designed to enhance group robustness in traditional classifiers. ExMap utilizes a clustering module to infer pseudo-labels based on a model's explainability heatmaps, which are then used during training in lieu of actual labels. Our empirical studies validate the efficacy of ExMap: we demonstrate that it bridges the performance gap with its supervised counterparts and outperforms existing partially supervised and unsupervised methods. Additionally, ExMap can be seamlessly integrated with existing group robustness learning strategies. Finally, we demonstrate its potential in tackling the emerging issue of multiple shortcut mitigation.
-
Creating high-fidelity 3D head avatars has always been a research hotspot but there remains a great challenge under lightweight sparse view setups. In this paper we propose Gaussian Head Avatar represented by controllable 3D Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions. The two parts benefit each other thereby our method can model fine-grained dynamic details while ensuring expression accuracy. Furthermore we devise a well-designed geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for the stability and convergence of the training procedure. Experiments show our approach outperforms other state-of-the-art sparse-view methods achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions. Project page: https://yuelangx.github.io/gaussianheadavatar.
-
Estimating 3D full-body avatars from AR/VR devices is essential for creating immersive experiences in AR/VR applications. This task is challenging due to the limited input from Head Mounted Devices which capture only sparse observations from the head and hands. Predicting the full-body avatars particularly the lower body from these sparse observations presents significant difficulties. In this paper we are inspired by the inherent property of the kinematic tree defined in the Skinned Multi-Person Linear (SMPL) model where the upper body and lower body share only one common ancestor node bringing the potential of decoupled reconstruction. We propose a stratified approach to decouple the conventional full-body avatar reconstruction pipeline into two stages with the reconstruction of the upper body first and a subsequent reconstruction of the lower body conditioned on the previous stage. To implement this straightforward idea we leverage the latent diffusion model as a powerful probabilistic generator and train it to follow the latent distribution of decoupled motions explored by a VQ-VAE encoder-decoder model. Extensive experiments on AMASS mocap dataset demonstrate our state-of-the-art performance in the reconstruction of full-body motions.
-
Egocentric videos provide a first-person perspective of the wearer's activities involving simultaneous interactions with multiple objects. In this work we propose the task of weakly-supervised Narration-based Video Object Segmentation (NVOS). Given an egocentric video clip and a narration of the wearer's activities our aim is to segment object instances mentioned in the narration without using any spatial annotations during training. Existing weakly-supervised video object grounding methods typically yield bounding boxes for referred objects. In contrast we propose ROSA a weakly-supervised pixel-level grounding framework learning alignments between referred objects and segmentation mask proposals. Our model harnesses vision-language models pre-trained on image-text pairs to embed region masks and object phrases. During training we combine (a) a video-narration contrastive loss that implicitly supervises the alignment between regions and phrases and (b) a region-phrase contrastive loss based on inferred latent alignments. To address the lack of annotated NVOS datasets in egocentric videos we create a new evaluation benchmark VISOR-NVOS leveraging existing annotations of segmentation masks from VISOR alongside 14.6k newly-collected object-based video clip narrations. Our approach achieves state-of-the-art zero-shot pixel-level grounding performance compared to strong baselines under similar supervision. Additionally we demonstrate generalization capabilities for zero-shot video object grounding on YouCook2 a third-person instructional video dataset.
-
Recent studies have drawn attention to the untapped potential of the "star operation" (element-wise multiplication) in network design. While intuitive explanations abound the foundational rationale behind its application remains largely unexplored. Our study attempts to reveal the star operation's ability of mapping inputs into high-dimensional non-linear feature spaces--akin to kernel tricks--without widening the network. We further introduce StarNet a simple yet powerful prototype demonstrating impressive performance and low latency under compact network structure and efficient budget. Like stars in the sky the star operation appears unremarkable but holds a vast universe of potential. Our work encourages further exploration across tasks with codes available at https://github.com/ma-xu/Rewrite-the-Stars.
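The claim that element-wise multiplication yields higher-order features without widening the network can be checked on a toy case. The sketch below, which assumes NumPy and uses hypothetical names (`star`, `explicit_quadratic`), multiplies two linear projections of the same input and verifies that the result equals a weighted sum over all pairwise products of input entries. It illustrates the algebra only and is not the StarNet architecture.

```python
import numpy as np


def star(x, w1, w2):
    """Star operation on two scalar projections: (w1 . x) * (w2 . x)."""
    return float((w1 @ x) * (w2 @ x))


def explicit_quadratic(x, w1, w2):
    """The same value written as sum_{i,j} w1[i] * w2[j] * x[i] * x[j]."""
    return float(np.sum(np.outer(w1, w2) * np.outer(x, x)))


x = np.array([1.0, 2.0, 3.0])
w1 = np.array([0.5, -1.0, 2.0])
w2 = np.array([1.0, 0.0, -0.5])

# A d-dimensional input implicitly produces d * d pairwise terms x[i] * x[j],
# yet the star operation only ever computes two d-dimensional dot products.
print(star(x, w1, w2))
print(explicit_quadratic(x, w1, w2))
```

Stacking several such layers compounds the polynomial degree, which is the sense in which the abstract compares the operation to kernel tricks.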
-
Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero-/few-shot anomaly detection within natural image domains. However the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level pixel-wise visual-language feature alignment loss functions which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models with an average AUC improvement of 6.24% and 7.33% for anomaly classification 2.03% and 2.37% for anomaly segmentation under the zero-shot and few-shot settings respectively. Source code is available at: https://github.com/MediaBrain-SJTU/MVFA-AD
-
Accurate estimation of Room Impulse Response (RIR) which captures an environment's acoustic properties is important for speech processing and AR/VR applications. We propose AV-RIR a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and materials properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features that augment material information into visual cues and CRIP that improves late reverberation components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches by achieving 36% - 63% improvement across various acoustic metrics in RIR estimation. Additionally it also achieves higher preference scores in human evaluation. As an auxiliary benefit dereverbed speech from AV-RIR shows competitive performance with the state-of-the-art in various spoken language processing tasks and outperforms reverberation time error score in the real-world AVSpeech dataset. Qualitative examples of both synthesized reverberant speech and enhanced speech are available online https://www.youtube.com/watch?v=tTsKhviukAE.
-
Zero-shot Video Object Segmentation (ZSVOS) aims at segmenting the primary moving object without any human annotations. Mainstream solutions mainly focus on learning a single model on large-scale video datasets which struggle to generalize to unseen videos. In this work we introduce a test-time training (TTT) strategy to address the problem. Our key insight is to enforce the model to predict consistent depth during the TTT process. In detail we first train a single network to perform both segmentation and depth prediction tasks. This can be effectively learned with our specifically designed depth modulation layer. Then for the TTT process the model is updated by predicting consistent depth maps for the same frame under different data augmentations. In addition we explore different TTT weight update strategies. Our empirical results suggest that the momentum-based weight initialization and looping-based training scheme lead to more stable improvements. Experiments show that the proposed method achieves clear improvements on ZSVOS. Our proposed video TTT strategy provides significant superiority over state-of-the-art TTT methods. Our code is available at: https://nifangbaage.github.io/DATTT/.
-
Non-exemplar class incremental learning (NECIL) aims to continuously assimilate new knowledge without forgetting previously acquired ones when historical data are unavailable. One of the generative NECIL methods is to invert the images of old classes for joint training. However these synthetic images suffer significant domain shifts compared with real data hampering the recognition of old classes. In this paper we present a novel method termed Dual-Consistency Model Inversion (DCMI) to generate better synthetic samples of old classes through two pivotal consistency alignments: (1) the semantic consistency between the synthetic images and the corresponding prototypes and (2) domain consistency between synthetic and real images of new classes. Besides we introduce Prototypical Routing (PR) to provide task-prior information and generate unbiased and accurate predictions. Our comprehensive experiments across diverse datasets consistently showcase the superiority of our method over previous state-of-the-art approaches.
-
With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed "memory deciphering" study offers a pivotal insight underpinning such a strategy: expanding memory banks, while seemingly beneficial, actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally, restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked "temporal positional embedding." Finally, our insights are embodied in "RMem" ("R" for restricted), a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes a new state of the art for object state changes (VOST dataset) and long videos (the Long Videos dataset). Our code is available at https://github.com/Restricted-Memory/RMem and our demo can be watched at https://youtu.be/V3tCFQsJrrM.
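One way to picture a bounded memory bank that trades off importance against freshness is the toy eviction policy below. Everything in it is hypothetical: the scoring rule (importance minus an age penalty), the `freshness_weight` parameter, and the class names are assumptions for illustration, and the abstract does not specify how RMem actually scores or selects frames.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryFrame:
    index: int         # time step at which the frame was stored
    importance: float  # hypothetical relevance score for the frame


@dataclass
class RestrictedMemoryBank:
    capacity: int
    freshness_weight: float = 0.1  # hypothetical penalty per step of age
    frames: list = field(default_factory=list)

    def _score(self, frame, now):
        # Older frames are penalized, so a frame must be important to stay.
        return frame.importance - self.freshness_weight * (now - frame.index)

    def add(self, index, importance):
        self.frames.append(MemoryFrame(index, importance))
        if len(self.frames) > self.capacity:
            # Evict the frame with the lowest combined score.
            worst = min(self.frames, key=lambda f: self._score(f, index))
            self.frames.remove(worst)


bank = RestrictedMemoryBank(capacity=3)
for step, importance in enumerate([0.9, 0.2, 0.5, 0.6, 0.8]):
    bank.add(step, importance)

kept = [frame.index for frame in bank.frames]
print(kept)
```

With these toy scores the bank keeps the highly important first frame alongside the two most recent ones, and its size never exceeds the fixed capacity regardless of video length.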
-
Given the power of vision transformers, a new learning paradigm, pre-training and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat towards such a paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., converting a benign model into a backdoored one. Once under the backdoor mode, a specific trigger can force the model to predict a target class. It poses a severe risk to the users of cloud APIs, since the malicious behavior cannot be activated and detected under the benign mode, thus making the attack very stealthy. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token. They are optimized with the clean loss, which encourages the model to always behave normally even when the trigger is present, and the backdoor loss, which ensures the backdoor can be activated by the trigger when the switch is on. Besides, we utilize cross-mode feature distillation to reduce the effect of the switch token on clean samples. The experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, i.e., achieving a 95%+ attack success rate while also being hard to detect and remove. Our code is available at https://github.com/20000yshust/SWARM.
-
Image and video analysis requires not only accurate object detection but also an understanding of the relationships among detected objects. Common solutions to relation modeling typically resort to stand-alone object detectors followed by non-differentiable post-processing techniques. Recently introduced detection transformers (DETR) perform end-to-end object detection based on a bipartite matching loss. Such methods, however, lack the ability to jointly detect objects and resolve object associations. In this paper, we build on the DETR approach and extend it to the joint detection of objects and their relationships by introducing an approximated bipartite matching. While our method can generalize to an arbitrary number of objects, we here focus on the modeling of object pairs and their relations. In particular, we apply our method, PairDETR, to the problem of detecting human bodies and faces and associating them for the same person. Our approach not only eliminates the need for hand-designed post-processing but also achieves excellent results for body-face associations. We evaluate PairDETR on the challenging CrowdHuman and CityPersons datasets and demonstrate a large improvement over the state of the art. Our training code and pre-trained models are available online.
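The bipartite matching that DETR-style losses build on can be shown on a tiny cost matrix. The sketch below finds the exact minimum-cost one-to-one assignment by brute force, which is only practical for a handful of items; it is not the approximated matching the abstract introduces, whose details are not given here, and the function name and cost values are made up for illustration.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Return the assignment p minimizing sum(cost[i][p[i]]) for a square matrix.

    cost[i][j] is the cost of matching prediction i to target j.
    Exhaustive search: fine for toy sizes, factorial in general.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical matching costs between 3 predictions (rows) and 3 targets (columns).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
]

assignment = min_cost_matching(cost)
print(assignment)  # prediction i is matched to target assignment[i]
```

In practice the same problem is solved in polynomial time with the Hungarian algorithm; the exhaustive version is used here only to keep the example dependency-free.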
-
Recent advancements in personalized image generation using diffusion models have been noteworthy. However existing methods suffer from inefficiencies due to the requirement for subject-specific fine-tuning. This computationally intensive process hinders efficient deployment limiting practical usability. Moreover these methods often grapple with identity distortion and limited expression diversity. In light of these challenges we propose PortraitBooth an innovative approach designed for high efficiency robust identity preservation and expression-editable text-to-image generation without the need for fine-tuning. PortraitBooth leverages subject embeddings from a face recognition model for personalized image generation without fine-tuning. It eliminates computational overhead and mitigates identity distortion. The introduced dynamic identity preservation strategy further ensures close resemblance to the original image identity. Moreover PortraitBooth incorporates emotion-aware cross-attention control for diverse facial expressions in generated images supporting text-driven expression editing. Its scalability enables efficient and high-quality image creation including multi-subject generation. Extensive results demonstrate superior performance over other state-of-the-art methods in both single and multiple image generation scenarios.
-
In recent years anchor-based methods have achieved promising progress in multi-view clustering. The performances of these methods are significantly affected by the quality of the anchors. However the anchors generated by previous works solely rely on single-view information ignoring the correlation among different views. In particular we observe that similar patterns are more likely to exist between similar views so such correlation information can be leveraged to enhance the quality of the anchors which is also omitted. To this end we propose a novel plug-and-play anchor enhancement strategy through view correlation for multi-view clustering. Specifically we construct a view graph based on aligned initial anchor graphs to explore inter-view correlations. By learning from view correlation we enhance the anchors of the current view using the relationships between anchors and samples on neighboring views thereby narrowing the spatial distribution of anchors on similar views. Experimental results on seven datasets demonstrate the superiority of our proposed method over other existing methods. Furthermore extensive comparative experiments validate the effectiveness of the proposed anchor enhancement module when applied to various anchor-based methods.
-
Few-shot semantic segmentation (FSS) endeavors to segment unseen classes with only a few labeled samples. Current FSS methods are commonly built on the assumption that their training and application scenarios share similar domains, and their performance degrades significantly when applied to a distinct domain. To this end, we propose to leverage the cutting-edge foundation model, the Segment Anything Model (SAM), for generalization enhancement. SAM, however, performs unsatisfactorily on domains that are distinct from its training data, which primarily comprise natural scene images, and it does not support automatic segmentation of specific semantics due to its interactive prompting mechanism. In our work, we introduce APSeg, a novel auto-prompt network for cross-domain few-shot semantic segmentation (CD-FSS), which is designed to be auto-prompted for guiding cross-domain segmentation. Specifically, we propose a Dual Prototype Anchor Transformation (DPAT) module that fuses pseudo query prototypes extracted based on cycle-consistency with support prototypes, allowing features to be transformed into a more stable domain-agnostic space. Additionally, a Meta Prompt Generator (MPG) module is introduced to automatically generate prompt embeddings, eliminating the need for manual visual prompts. We build an efficient model which can be applied directly to target domains without fine-tuning. Extensive experiments on four cross-domain datasets show that our model outperforms the state-of-the-art CD-FSS method by 5.24% and 3.10% in average accuracy on 1-shot and 5-shot settings, respectively.
-
This paper introduces the first text-guided work for generating the sequence of hand-object interaction in 3D. The main challenge arises from the lack of labeled data, where existing ground-truth datasets are nowhere near generalizable in interaction type and object category, which inhibits the modeling of diverse 3D hand-object interaction with the correct physical implication (e.g., contacts and semantics) from text prompts. To address this challenge, we propose to decompose the interaction generation task into two subtasks: hand-object contact generation and hand-object motion generation. For contact generation, a VAE-based network takes as input a text and an object mesh and generates the probability of contacts between the surfaces of hands and the object during the interaction. The network learns a variety of local geometric structures of diverse objects that are independent of the objects' category, and thus it is applicable to general objects. For motion generation, a Transformer-based diffusion model utilizes this 3D contact map as a strong prior for generating physically plausible hand-object motion as a function of text prompts by learning from the augmented labeled dataset, where we annotate text labels for many existing 3D hand and object motion data. Finally, we further introduce a hand refiner module that minimizes the distance between the object surface and hand joints to improve the temporal stability of the object-hand contacts and to suppress the penetration artifacts. In the experiments, we demonstrate that our method can generate more realistic and diverse interactions compared to other baseline methods. We also show that our method is applicable to unseen objects. We will release our model and newly labeled data as a strong foundation for future research. Code and data are available at https://github.com/JunukCha/Text2HOI.
-
Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However most token pruning methods require computationally expensive fine-tuning which is undesirable in many edge deployment cases. In this work we propose Zero-TPrune the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. Due to the elimination of the fine-tuning overhead Zero-TPrune can prune large models at negligible computational cost switch between different pruning configurations at no computational cost and perform hyperparameter tuning efficiently. We evaluate the performance of Zero-TPrune on vision tasks by applying it to various vision Transformer backbones and testing them on ImageNet. Without any fine-tuning Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and improves its throughput by 45.3% with only 0.4% accuracy loss. Compared with state-of-the-art pruning methods that require fine-tuning Zero-TPrune not only eliminates the need for fine-tuning after pruning but also does so with only 0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods Zero-TPrune reduces accuracy loss by up to 49% with the same or higher throughput.
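To make the graph-based importance idea concrete, the following is a minimal sketch that treats a row-normalized attention matrix as a weighted graph and propagates scores along it in a PageRank-like iteration. It is a simplified stand-in, not the WPR algorithm from the paper: the update rule, the iteration count, and the function names are assumptions, and the similarity-based pruning stage is omitted.

```python
def token_importance(attn, iters=50):
    """PageRank-style scores on an attention graph.

    attn[i][j] is the attention token i pays to token j (rows sum to 1).
    A token becomes important when important tokens attend to it.
    """
    n = len(attn)
    scores = [1.0 / n] * n
    for _ in range(iters):
        # Each token collects score from every token that attends to it.
        scores = [sum(attn[i][j] * scores[i] for i in range(n)) for j in range(n)]
        total = sum(scores)
        scores = [s / total for s in scores]  # keep the scores a distribution
    return scores

def keep_top_k(attn, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    scores = token_importance(attn)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy attention map in which every token attends mostly to token 1.
attn = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
]
scores = token_importance(attn)
kept = keep_top_k(attn, 1)
```

Since the scores come from attention maps the pre-trained model already produces, nothing here requires gradient updates, which is consistent with the zero-shot setting the abstract describes.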
-
Continual learning (CL) aims to empower models to learn new tasks without forgetting previously acquired knowledge. Most prior works concentrate on the techniques of architectures replay data regularization etc. However the category name of each class is largely neglected. Existing methods commonly utilize the one-hot labels and randomly initialize the classifier head. We argue that the scarce semantic information conveyed by the one-hot labels hampers the effective knowledge transfer across tasks. In this paper we revisit the role of the classifier head within the CL paradigm and replace the classifier with semantic knowledge from pretrained language models (PLMs). Specifically we use PLMs to generate semantic targets for each class which are frozen and serve as supervision signals during training. Such targets fully consider the semantic correlation between all classes across tasks. Empirical studies show that our approach mitigates forgetting by alleviating representation drifting and facilitating knowledge transfer across tasks. The proposed method is simple to implement and can seamlessly be plugged into existing methods with negligible adjustments. Extensive experiments based on eleven mainstream baselines demonstrate the effectiveness and generalizability of our approach to various protocols. For example under the class-incremental learning setting on ImageNet-100 our method significantly improves the Top-1 accuracy by 3.2% to 6.1% while reducing the forgetting rate by 2.6% to 13.1%.
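One plausible way to read the idea of frozen semantic targets is sketched below: each class name maps to a fixed vector, and a feature is assigned to the class whose target it is most similar to. The three-dimensional vectors are hand-written placeholders standing in for language-model embeddings, and the cosine-similarity decision rule is an assumption rather than the authors' stated procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(feature, targets):
    """Return the class name whose frozen target is closest to the feature."""
    return max(targets, key=lambda name: cosine(feature, targets[name]))

# Placeholder targets; in practice these would come from a pretrained
# language model applied to the class names and would stay frozen.
targets = {
    "cat": [1.0, 0.2, 0.0],
    "dog": [0.9, 0.4, 0.1],
    "truck": [0.0, 0.1, 1.0],
}
label = predict([0.1, 0.0, 0.9], targets)

# A class from a later task only adds an entry; earlier targets are untouched.
targets["bus"] = [0.1, 0.0, 0.8]
```

Unlike a randomly initialized one-hot head, related classes such as "cat" and "dog" start out with nearby targets, which is the kind of semantic correlation the abstract argues helps knowledge transfer across tasks.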
-
The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper we introduce MACE a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning collectively eliminating the information of undesirable concepts. Furthermore MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure celebrity erasure explicit content erasure and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at https://github.com/Shilin-LU/MACE.
-
We present Dive Into the Boundaries (DIBS) a novel pretraining framework for dense video captioning (DVC) that elaborates on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs) we generate rich DVC-oriented caption candidates and optimize the corresponding pseudo boundaries under several meticulously designed objectives considering diversity event-centricity temporal ordering and coherence. Moreover we further introduce a novel online boundary refinement strategy that iteratively improves the quality of pseudo boundaries during training. Comprehensive experiments have been conducted to examine the effectiveness of the proposed technique components. By leveraging a substantial amount of unlabeled video data such as HowTo100M we achieve a remarkable advancement on standard DVC datasets like YouCook2 and ActivityNet. We outperform the previous state-of-the-art Vid2Seq across a majority of metrics achieving this with just 0.4% of the unlabeled video data used for pre-training by Vid2Seq.
-
Recently, some large kernel convnets strike back with appealing performance and efficiency. However, given the square complexity of convolution, scaling up kernels can bring about an enormous number of parameters, and the proliferated parameters can induce severe optimization problems. Due to these issues, current CNNs compromise to scale up to 51x51 in the form of stripe convolution (i.e., 51x5+5x51) and start to saturate as the kernel size continues growing. In this paper, we delve into addressing these vital issues and explore whether we can continue scaling up kernels for more performance gains. Inspired by human vision, we propose a human-like peripheral convolution that efficiently reduces over 90% of the parameter count of dense grid convolution through parameter sharing and manages to scale up the kernel size to extremely large. Our peripheral convolution behaves much like human vision, reducing the complexity of convolution from O(K^2) to O(log K) without backfiring on performance. Built on this, we propose the Parameter-efficient Large Kernel Network (PeLK). Our PeLK outperforms modern vision Transformers and ConvNet architectures like Swin, ConvNeXt, RepLKNet, and SLaK on various vision tasks, including ImageNet classification, semantic segmentation on ADE20K, and object detection on MS COCO. For the first time, we successfully scale up the kernel size of CNNs to an unprecedented 101x101 and demonstrate consistent improvements.
-
Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves human body, hand, and expression estimation. Most existing methods have tackled this task in a two-stage manner, first detecting the human body part with an off-the-shelf detection model and then inferring the different human body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introducing distractions, and 3) lacking inter-association among different persons and body parts, inevitably causing performance degradation, especially for crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats the multi-person whole-body mesh recovery task as a progressive set prediction problem with various sequential detection. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer block. Then we introduce a joint-related token to probe the human joint in the image and encode a fine-grained local feature, which collaborates with the global feature to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods by a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.
-
Deep learning models, particularly those based on transformers, often employ numerous stacked structures which possess identical architectures and perform similar functions. While effective, this stacking paradigm leads to a substantial increase in the number of parameters, posing challenges for practical applications. In today's landscape of increasingly large models, stacking depth can even reach dozens, further exacerbating this issue. To mitigate this problem, we introduce LORS (LOw-rank Residual Structure). LORS allows stacked modules to share the majority of parameters, requiring a much smaller number of unique ones per module to match or even surpass the performance of using entirely distinct ones, thereby significantly reducing parameter usage. We validate our method by applying it to the stacked decoders of a query-based object detector and conduct extensive experiments on the widely used MS COCO dataset. Experimental results demonstrate the effectiveness of our method, as even with a 70% reduction in the parameters of the decoder, our method still enables the model to achieve comparable or even better performance than its original.
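The parameter saving from sharing most weights across stacked modules can be seen with a small arithmetic sketch. It assumes each layer's weight is one shared matrix plus a layer-specific low-rank product A_i B_i; this generic shared-plus-low-rank form, the layer sizes, and the function names are illustrative assumptions and may differ from the exact LORS formulation.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lors_weight(shared, a_i, b_i):
    """Effective weight of one layer: shared matrix plus its low-rank residual."""
    low_rank = matmul(a_i, b_i)
    return [[s + r for s, r in zip(s_row, r_row)]
            for s_row, r_row in zip(shared, low_rank)]

def param_counts(num_layers, d_out, d_in, rank):
    """Parameters with fully independent layers vs. shared + low-rank residuals."""
    independent = num_layers * d_out * d_in
    shared = d_out * d_in + num_layers * rank * (d_out + d_in)
    return independent, shared

# Rank-1 residual on a 2x2 shared identity matrix.
w = lors_weight([[1, 0], [0, 1]], [[1], [2]], [[3, 4]])

# Six stacked 256x256 layers with rank-8 residuals (hypothetical sizes).
independent, shared = param_counts(6, 256, 256, 8)
```

With these hypothetical sizes the shared variant needs 90,112 parameters instead of 393,216, and the gap widens as depth grows because each extra layer adds only the low-rank factors.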
-
In recent years there has been a significant shift in the field of digital avatar research towards modeling animating and reconstructing clothed human representations as a key step towards creating realistic avatars. However current 3D cloth generation methods are garment specific or trained completely on synthetic data hence lacking fine details and realism. In this work we make a step towards automatic realistic garment design and propose Design2Cloth a high fidelity 3D generative model trained on a real world dataset from more than 2000 subject scans. To provide vital contribution to the fashion industry we developed a user-friendly adversarial model capable of generating diverse and detailed clothes simply by drawing a 2D cloth mask. Under a series of both qualitative and quantitative experiments we showcase that Design2Cloth outperforms current state-of-the-art cloth generative models by a large margin. In addition to the generative properties of our network we showcase that the proposed method can be used to achieve high quality reconstructions from single in-the-wild images and 3D scans. Dataset code and pre-trained model will become publicly available.
-
Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations font diversity shape deformations etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner termed "In-Context Learning" (ICL). Nevertheless applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover our pilot experiments on LLMs show that ICL fails in STR mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end we introduce E2STR a STR model trained with context-rich scene text sequences where the sequences are generated via our proposed in-context training strategy. E2STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E2STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks. The code is released at https://github.com/bytedance/E2STR.
-
Our brain can effortlessly recognize objects even when partially hidden from view. Seeing the visible of the hidden is called amodal completion; however this task remains a challenge for generative AI despite rapid progress. We propose to sidestep many of the difficulties of existing approaches which typically involve a two-step process of predicting amodal masks and then generating pixels. Our method involves thinking outside the box literally! We go outside the object bounding box to use its context to guide a pre-trained diffusion inpainting model and then progressively grow the occluded object and trim the extra background. We overcome two technical challenges: 1) how to be free of unwanted co-occurrence bias which tends to regenerate similar occluders and 2) how to judge if an amodal completion has succeeded. Our amodal completion method exhibits improved photorealistic completion results compared to existing approaches in numerous successful completion cases. And the best part? It doesn't require any special training or fine-tuning of models. Project page and code: https://k8xu.github.io/amodal/
-
Diffusion models have demonstrated unprecedented capabilities in image generation. Yet they incorporate and amplify the data bias (e.g. gender age) from the original training set limiting the diversity of generated images. In this paper we propose a diversity-oriented fine-tuning method using reinforcement learning (RL) for diffusion models under the guidance of an image-set-based reward function. Specifically the proposed reward function denoted as Diversity Reward utilizes a set of generated images to evaluate the coverage of the current generative distribution w.r.t. the reference distribution represented by a set of unbiased images. Built on top of the probabilistic method of distribution discrepancy estimation Diversity Reward can measure the relative distribution gap with a small set of images efficiently. We further formulate the diffusion process as a multi-step decision-making problem (MDP) and apply policy gradient methods to fine-tune diffusion models by maximizing the Diversity Reward. The proposed rewards are validated on a post-sampling selection task where a subset of the most diverse images are selected based on Diversity Reward values. We also show the effectiveness of our RL fine-tuning framework on enhancing the diversity of image generation with different types of diffusion models including class-conditional models and text-conditional models e.g. StableDiffusion.
-
Microscopic traffic simulation plays a crucial role in transportation engineering by providing insights into individual vehicle behavior and overall traffic flow. However creating a realistic simulator that accurately replicates human driving behaviors in various traffic conditions presents significant challenges. Traditional simulators relying on heuristic models often fail to deliver accurate simulations due to the complexity of real-world traffic environments. Due to the covariate shift issue existing imitation learning-based simulators often fail to generate stable long-term simulations. In this paper we propose a novel approach called learner-aware supervised imitation learning to address the covariate shift problem in multi-agent imitation learning. By leveraging a variational autoencoder simultaneously modeling the expert and learner state distribution our approach augments expert states such that the augmented state is aware of learner state distribution. Our method applied to urban traffic simulation demonstrates significant improvements over existing state-of-the-art baselines in both short-term microscopic and long-term macroscopic realism when evaluated on the real-world dataset pNEUMA.
-
Federated Learning (FL) facilitates clients to collaborate on training a shared machine learning model without exposing individual private data. Nonetheless FL remains susceptible to utility and privacy attacks notably evasion data poisoning and model inversion attacks compromising the system's efficiency and data privacy. Existing FL defenses are often specialized to a particular single attack lacking generality and a comprehensive defender's perspective. To address these challenges we introduce Federated Cryptography Defense (FCD) a unified single framework aligning with the defender's perspective. FCD employs row-wise transposition cipher based data encryption with a secret key to counter both evasion black-box data poisoning and model inversion attacks. The crux of FCD lies in transferring the entire learning process into an encrypted data space and using a novel distillation loss guided by the Kullback-Leibler (KL) divergence. This measure compares the probability distributions of the local pretrained teacher model's predictions on normal data and the local student model's predictions on the same data in FCD's encrypted form. By working within this encrypted space FCD eliminates the need for decryption at the server resulting in reduced computational complexity. We demonstrate the practical feasibility of FCD and apply it to defend against evasion utility attack on benchmark datasets (GTSRB KBTS CIFAR10 and EMNIST). We further extend FCD for defending against model inversion attack in split FL on the CIFAR100 dataset. Our experiments across the diverse attack and FL settings demonstrate practical feasibility and robustness against utility evasion (impact >30) and privacy attacks (MSE >73) compared to the second best method.
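The two ingredients named in the abstract, a keyed row-wise transposition and a KL-guided distillation signal, can be sketched as follows. This is a toy illustration under stated assumptions: the cipher is modeled as one key-derived permutation applied within every row, the KL direction is teacher-to-student, and the function names and example probabilities are made up rather than taken from the paper.

```python
import math
import random

def encrypt_rows(image, key):
    """Permute the entries of every row with one permutation derived from key."""
    width = len(image[0])
    perm = list(range(width))
    random.Random(key).shuffle(perm)  # same key always yields the same permutation
    return [[row[p] for p in perm] for row in image]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

image = [[1, 2, 3, 4], [5, 6, 7, 8]]
encrypted = encrypt_rows(image, key=42)

# Teacher predicts on the plain input, student on the encrypted input
# (probabilities here are invented for the example).
teacher_probs = [0.9, 0.1]
student_probs = [0.5, 0.5]
distill_loss = kl_divergence(teacher_probs, student_probs)
```

Minimizing such a loss pushes the student, which only ever sees encrypted inputs, toward the teacher's predictions on the corresponding plain inputs, so no decryption step is needed downstream.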
-
Deep learning-based methods have achieved significant successes in solving the blind super-resolution (BSR) problem. However, most of them require supervised pre-training on labelled datasets. This paper proposes an unsupervised kernel estimation model, named dynamic kernel prior (DKP), to realize an unsupervised and pre-training-free learning-based algorithm for solving the BSR problem. DKP can adaptively learn dynamic kernel priors to realize real-time kernel estimation and thereby enables superior HR image restoration performance. This is achieved by a Markov chain Monte Carlo sampling process on random kernel distributions. The learned kernel prior is then assigned to optimize a blur kernel estimation network, which entails a network-based Langevin dynamic optimization strategy. These two techniques ensure the accuracy of the kernel estimation. DKP can be easily used to replace the kernel estimation models in existing methods, such as Double-DIP and FKP-DIP, or be added to an off-the-shelf image restoration model, such as a diffusion model. In this paper, we incorporate our DKP model with DIP and a diffusion model, referred to as DIP-DKP and Diff-DKP, for validation. Extensive simulations on Gaussian and motion kernel scenarios demonstrate that the proposed DKP model can significantly improve the kernel estimation with comparable runtime and memory usage, leading to state-of-the-art BSR results. The code is available at https://github.com/XYLGroup/DKP.
-
In the evolving landscape of digital media and video production the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections neglecting 3D statuses. To address these issues we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. The incorporation of 3D engine workflow enables superior rendering and control abilities which also achieves a higher rating in the user study.
-
SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects
Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However their performance drops on larger objects leading to fatal accidents. Some attribute the failures to training data scarcity or the receptive field requirements of large objects. In this paper we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to noise of larger objects. To bridge this gap we comprehensively investigate regression and dice losses examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights we propose SeaBird (Segmentation in Bird's View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard particularly for large objects.
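For reference, the standard soft dice loss between a predicted foreground map and a binary ground-truth mask can be written in a few lines. This is the textbook form, 1 - 2|P∩G| / (|P| + |G|), with a small smoothing constant; the flattening of the BEV map into a list and the value of eps are assumptions for the sketch, not details from the paper.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss over flattened maps.

    pred holds foreground probabilities in [0, 1]; target holds 0/1 labels.
    The loss depends on the overlap ratio rather than on per-pixel error size.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

perfect = dice_loss([1, 1, 0, 0], [1, 1, 0, 0])   # full overlap
partial = dice_loss([1, 0, 0, 0], [1, 1, 0, 0])   # half of the object found
disjoint = dice_loss([0, 0, 1, 1], [1, 1, 0, 0])  # no overlap
```

Because the loss is a ratio of overlap to total area, it is bounded between 0 and 1 whatever the object size, which gives some intuition for why an overlap-based objective may be less sensitive to noise on large objects than an unbounded regression loss.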
-
Language has emerged as a natural interface for image editing. In this paper we introduce a method for region-based image editing driven by textual prompts without the need for user-provided masks or sketches. Specifically our approach leverages an existing pre-trained text-to-image model and introduces a bounding box generator to identify the editing regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models and is able to handle complex prompts featuring multiple objects complex sentences or lengthy paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. The experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions. Our project webpage can be found at: https://yuanzelin.me/LearnableRegions_page.
-
Despite their exceptional generative abilities large T2I diffusion models much like skilled but careless artists often struggle with accurately depicting visual relationships between objects. This issue as we uncover through careful analysis arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. To resolve this we introduce a novel task termed Relation Rectification aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder ensuring accurate reflection of the textual relation in the embedding space. Crucially our method retains the parameters of the text encoder and diffusion model preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: https://wuyinwei-hah.github.io/rrnet.github.io/ .
-
The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects. To address this limitation we propose an approach that takes a single image of a new object as input and predicts the relative pose of this object in new images without prior knowledge of the object's 3D model and without requiring training time for new objects and categories. We achieve this by training a model to directly predict discriminative embeddings for viewpoints surrounding the object. This prediction is done using a simple U-Net architecture with attention and conditioned on the desired pose which yields extremely fast inference. We compare our approach to state-of-the-art methods and show it outperforms them both in terms of accuracy and robustness.
-
We present a lightweight and affordable motion capture method based on two smartwatches and a head-mounted camera. In contrast to the existing approaches that use six or more expert-level IMU devices our approach is much more cost-effective and convenient. Our method can make wearable motion capture accessible to everyone everywhere enabling 3D full-body motion capture in diverse environments. As a key idea to overcome the extreme sparsity and ambiguities of sensor inputs with different modalities we integrate 6D head poses obtained from the head-mounted cameras for motion estimation. To enable capture in expansive indoor and outdoor scenes we propose an algorithm to track and update floor level changes to define head poses coupled with a multi-stage Transformer-based regression module. We also introduce novel strategies leveraging visual cues of egocentric images to further enhance the motion capture quality while reducing ambiguities. We demonstrate the performance of our method on various challenging scenarios including complex outdoor environments and everyday motions including object interactions and social interactions among multiple individuals.
-
Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input which define the contents and action spaces (i.e. actionable elements and operations) of webpages. Nevertheless HTML documents may not provide a clear task-related context for each element making it hard to select the right (sequence of) actions. In this paper we propose to contextualize HTML elements through their "dual views" in webpage screenshots: each HTML element has its corresponding bounding box and visual content in the screenshot. We build upon the insight---web developers tend to arrange task-related elements nearby on webpages to enhance user experiences---and propose to contextualize each element with its neighbor elements using both textual and visual features. The resulting representations of HTML elements are more informative for the agent to take action. We validate our method on the recently released Mind2Web dataset which features diverse navigation domains and tasks on real-world websites. Our method consistently outperforms the baseline in all the scenarios including cross-task cross-website and cross-domain ones.
-
Grasp detection is a persistent and intricate challenge with various industrial applications. Recently many methods and datasets have been proposed to tackle the grasp detection problem. However most of them do not consider using natural language as a condition to detect the grasp poses. In this paper we introduce Grasp-Anything++ a new language-driven grasp detection dataset featuring 1M samples over 3M objects and upwards of 10M grasping instructions. We utilize foundation models to create a large-scale scene corpus with corresponding images and grasp prompts. We approach the language-driven grasp detection task as a conditional generation problem. Drawing on the success of diffusion models in generative tasks and given that language plays a vital role in this task we propose a new language-driven grasp detection method based on diffusion models. Our key contribution is the contrastive training objective which explicitly contributes to the denoising process to detect the grasp pose given the language instructions. We illustrate that our approach is theoretically supportive. The intensive experiments show that our method outperforms state-of-the-art approaches and allows real-world robotic grasping. Finally we demonstrate our large-scale dataset enables zero-shot grasp detection and is a challenging benchmark for future work.
-
In recent years image manipulation localization has attracted increasing attention due to its pivotal role in ensuring social media security. However effectively identifying forged regions remains an open challenge. The high acquisition cost and the severe scarcity of high-quality data are major factors hindering the performance improvement of modern image manipulation localization systems. To address this issue we propose a novel paradigm termed as CAAA to automatically and accurately annotate the manually forged images from the web at the pixel-level. We further propose a novel metric termed as QES to assist in filtering out unreliable annotations. With CAAA and QES we construct a large-scale diverse and high-quality dataset comprising 123150 manually forged images with mask annotations. Furthermore we develop a new model termed as APSC-Net for accurate image manipulation localization. According to extensive experiments our method outperforms previous state-of-the-art methods and our dataset significantly improves the performance of various models on the widely-used benchmarks. The dataset and codes are publicly available at https://github.com/qcf-568/MIML.
-
Noisy correspondence that refers to mismatches in cross-modal data pairs is prevalent on human-annotated or web-crawled datasets. Prior approaches to leverage such data mainly consider the application of uni-modal noisy label learning without amending the impact on both cross-modal and intra-modal geometrical structures in multimodal learning. Actually we find that both structures are effective to discriminate noisy correspondence through structural differences when being well-established. Inspired by this observation we introduce a Geometrical Structure Consistency (GSC) method to infer the true correspondence. Specifically GSC ensures the preservation of geometrical structures within and between modalities allowing for the accurate discrimination of noisy samples based on structural differences. Utilizing these inferred true correspondence labels GSC refines the learning of geometrical structures by filtering out the noisy samples. Experiments across four cross-modal datasets confirm that GSC effectively identifies noisy samples and significantly outperforms the current leading methods. Source code is available at https://github.com/MediaBrain-SJTU/GSC.
-
This paper addresses the challenge of learning a local visual pattern of an object from one image and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves acquiring a visual concept (e.g. an ornament) from a source image and subsequently applying it to an object (e.g. a chair) in a target image. Our key idea is to perform in-context concept learning acquiring the local visual concept within the broader context of the objects they belong to. To localize the concept learning we employ soft masks that contain both the concept within the mask and the surrounding image area. We demonstrate our approach through object generation within an image showcasing plausible embedding of in-context learned concepts. We also introduce methods for directing acquired concepts to specific locations within target images employing cross-attention mechanisms and establishing correspondences between source and target objects. The effectiveness of our method is demonstrated through quantitative and qualitative experiments along with comparisons against baseline techniques.
-
Reverse engineering in the realm of Computer-Aided Design (CAD) has been a longstanding aspiration though not yet entirely realized. Its primary aim is to uncover the CAD process behind a physical object given its 3D scan. We propose CAD-SIGNet an end-to-end trainable and auto-regressive architecture to recover the design history of a CAD model represented as a sequence of sketch-and-extrusion from an input point cloud. Our model learns CAD visual-language representations by layer-wise cross-attention between point cloud and CAD language embedding. In particular a new Sketch instance Guided Attention (SGA) module is proposed in order to reconstruct the fine-grained details of the sketches. Thanks to its auto-regressive nature CAD-SIGNet not only reconstructs a unique full design history of the corresponding CAD model given an input point cloud but also provides multiple plausible design choices. This allows for an interactive reverse engineering scenario by providing designers with multiple next step choices along with the design process. Extensive experiments on publicly available CAD datasets showcase the effectiveness of our approach against existing baseline models in two settings namely full design history recovery and conditional auto-completion from point clouds.
-
We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression we customize a non-causal attention mask for the decoder incorporating two key features: modeling tokens from different labels to be independent and treating image tokens as a prefix. This masking mechanism inspires an efficient method -- one-shot sampling -- to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp.
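As a rough illustration of the masking idea described above, the following sketch builds a boolean attention mask in which image tokens form a prefix visible to every token and tokens of different labels cannot see each other. The function name `build_label_mask`, the bidirectional attention within the image prefix, and the causal attention within each label are assumptions made for illustration; the abstract does not confirm these details.

```python
def build_label_mask(num_image_tokens, label_lengths):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed layout: [image tokens][label 0 tokens][label 1 tokens]...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Image tokens act as a prefix: every token may attend to all of them.
    for i in range(total):
        for j in range(num_image_tokens):
            mask[i][j] = True

    # Each label is an independent causal segment placed after the prefix,
    # so a label token never attends to tokens of another label.
    start = num_image_tokens
    for length in label_lengths:
        for i in range(start, start + length):
            for j in range(start, i + 1):
                mask[i][j] = True
        start += length
    return mask


# Two image tokens, one label of two tokens, one label of one token.
mask = build_label_mask(num_image_tokens=2, label_lengths=[2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no label attends to any other label, the rows belonging to different labels can be evaluated in the same forward pass, which is the property that parallel sampling of several labels relies on.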
-
Face Image Quality Assessment (FIQA) is pivotal for guaranteeing the accuracy of face recognition in unconstrained environments. Recent progress in deep quality-fitting-based methods that train models to align with quality anchors has shown promise in FIQA. However these methods heavily depend on a recognition model to yield quality anchors and indiscriminately treat the confidence of inaccurate anchors as equivalent to that of accurate ones during the FIQA model training leading to a fitting bottleneck issue. This paper seeks a solution by putting forward the Confidence-Calibrated Face Image Quality Assessment (CLIB-FIQA) approach underpinned by the synergistic interplay between the quality anchors and objective quality factors such as blur pose expression occlusion and illumination. Specifically we devise a joint learning framework built upon the vision-language alignment model which leverages the joint distribution with multiple quality factors to facilitate the quality fitting of the FIQA model. Furthermore to alleviate the issue of the model placing excessive trust in inaccurate quality anchors we propose a confidence calibration method to correct the quality distribution by exploiting to the fullest extent of these objective quality factors characterized as the merged-factor distribution during training. Experimental results on eight datasets reveal the superior performance of the proposed method.
-
Determining the relative pose of an object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically approximate the continuous pose representation with a large number of discrete pose hypotheses which incurs a computationally expensive process of scoring each hypothesis at test time. By contrast we present a Deep Voxel Matching Network (DVMNet) that eliminates the need for pose hypotheses and computes the relative object pose in a single pass. To this end we map the two input RGB images reference and query to their respective voxelized 3D representations. We then pass the resulting voxels through a pose estimation module where the voxels are aligned and the pose is computed in an end-to-end fashion by solving a least-squares problem. To enhance robustness we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D LINEMOD and Objaverse datasets demonstrating that our method delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at: https://github.com/sailor-z/DVMNet.
-
Self-supervised learning (SSL) has been successful in building patch embeddings of small histology images (e.g. 224 x 224 pixels) but scaling these models to learn slide embeddings from the entirety of giga-pixel whole-slide images (WSIs) remains challenging. Here we leverage complementary information from gene expression profiles to guide slide representation learning using multi-modal pre-training. Expression profiles constitute highly detailed molecular descriptions of a tissue that we hypothesize offer a strong task-agnostic training signal for learning slide embeddings. Our slide and expression (S+E) pretraining strategy called TANGLE employs modality-specific encoders the outputs of which are aligned via contrastive learning. TANGLE was pre-trained on samples from three different organs: liver (n=6597 S+E pairs) breast (n=1020) and lung (n=1012) from two different species (Homo sapiens and Rattus norvegicus). Across three independent test datasets consisting of 1265 breast WSIs 1946 lung WSIs and 4584 liver WSIs TANGLE shows significantly better few-shot performance compared to supervised and SSL baselines. When assessed using prototype-based classification and slide retrieval TANGLE also shows a substantial performance improvement over all baselines. Code available at https://github.com/mahmoodlab/TANGLE.
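The alignment of modality-specific encoder outputs via contrastive learning can be sketched as below. This is a minimal example that assumes a symmetric InfoNCE objective over cosine similarities; the abstract does not state which contrastive loss is used, and the function names, temperature value, and toy embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(slide_emb, expr_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each list comes from the same sample."""
    s = [normalize(v) for v in slide_emb]
    e = [normalize(v) for v in expr_emb]
    n = len(s)
    logits = [
        [sum(a * b for a, b in zip(s[i], e[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the matching pair on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


# Toy 2-D embeddings (hypothetical): matched pairs point in similar directions.
slides = [[1.0, 0.0], [0.0, 1.0]]
expressions = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(slides, expressions))
print(info_nce(slides, list(reversed(expressions))))
```

The loss is lower when each slide embedding is closest to its own expression embedding than when the pairs are swapped.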
-
Diffusion models have achieved remarkable success in generating high-quality diverse and creative images. However in text-based image generation they often struggle to accurately capture the intended meaning of the text. For instance a specified object might not be generated or an adjective might incorrectly alter unintended objects. Moreover we found that relationships indicating possession between objects are frequently overlooked. Despite the diversity of users' intentions in text existing methods often focus on only some aspects of these intentions. In this paper we propose Predicated Diffusion a unified framework designed to more effectively express users' intentions. It represents the intended meaning as propositions using predicate logic and treats the pixels in attention maps as fuzzy predicates. This approach provides a differentiable loss function that offers guidance for the image generation process to better fulfill the propositions. Comparative evaluations with existing methods demonstrated that Predicated Diffusion excels in generating images faithful to various text prompts while maintaining high image quality as validated by human evaluators and pretrained image-text models.
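To make the idea of treating attention-map pixels as fuzzy predicates concrete, here is a minimal sketch that assumes product fuzzy logic, where the attention intensity at each pixel is read as a truth value in [0, 1]. The helper names `existence_loss` and `implication_loss`, the choice of logical operators, and the toy attention values are assumptions for illustration; the abstract does not specify them.

```python
import math

EPS = 1e-8  # guards against log(0)


def existence_loss(attn):
    """-log of the fuzzy truth of 'some pixel shows the object'.

    With product conjunction, 'exists x. P(x)' has truth 1 - prod(1 - P(x)).
    """
    none_present = math.prod(1.0 - a for a in attn)
    return -math.log(1.0 - none_present + EPS)


def implication_loss(attn_p, attn_q):
    """-log of the fuzzy truth of 'for all x. P(x) -> Q(x)'.

    P(x) -> Q(x) is scored as not(P and not Q) = 1 - P(x) * (1 - Q(x)).
    The universal quantifier multiplies per-pixel truths, so the logs add up.
    """
    return -sum(
        math.log(1.0 - p * (1.0 - q) + EPS) for p, q in zip(attn_p, attn_q)
    )


# Toy flattened attention maps (hypothetical values).
faint = [0.05, 0.02, 0.01, 0.03]
strong = [0.05, 0.90, 0.01, 0.03]
print(existence_loss(faint), existence_loss(strong))
```

Both losses are built from products, sums, and logarithms, so in an automatic-differentiation framework their gradients with respect to the attention values could be used as guidance during sampling.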
-
We present Multi-Baseline Radiance Fields (MuRF) a general feed-forward approach to solving sparse view synthesis under multiple different baseline settings (small and large baselines and different number of input views). To render a target novel view we discretize the 3D space into planes parallel to the target image plane and accordingly construct a target view frustum volume. Such a target volume representation is spatially aligned with the target view which effectively aggregates relevant information from the input views for high-quality rendering. It also facilitates subsequent radiance field regression with a convolutional network thanks to its axis-aligned nature. The 3D context modeled by the convolutional network enables our method to synthesize sharper scene structures than prior works. Our MuRF achieves state-of-the-art performance across multiple different baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K and LLFF). We also show promising zero-shot generalization abilities on the Mip-NeRF 360 dataset demonstrating the general applicability of MuRF.
-
Autonomous driving stands as a pivotal domain in computer vision shaping the future of transportation. Within this paradigm the backbone of the system plays a crucial role in interpreting the complex environment. However a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View elements. To address this limitation we introduce CLIP-BEVFormer a novel approach that leverages the power of contrastive learning techniques to enhance the multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically CLIP-BEVFormer achieves an impressive 8.5% and 9.2% enhancement in terms of NDS and mAP respectively over the previous best BEV model on the 3D object detection task.
-
Utilizing large language models (LLMs) to compose off-the-shelf visual tools represents a promising avenue of research for developing robust visual assistants capable of addressing diverse visual tasks. However these methods often overlook the potential for continual learning typically by freezing the utilized tools thus limiting their adaptation to environments requiring new knowledge. To tackle this challenge we propose CLOVA a Closed-Loop Visual Assistant which operates within a framework encompassing inference reflection and learning phases. During the inference phase LLMs generate programs and execute corresponding tools to complete assigned tasks. In the reflection phase a multimodal global-local reflection scheme analyzes human feedback to determine which tools require updating. Lastly the learning phase employs three flexible approaches to automatically gather training data and introduces a novel prompt tuning scheme to update the tools allowing CLOVA to efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning by 10% in knowledge tagging and by 20% in image editing. These results underscore the significance of the continual learning capability in general visual assistants.
-
Dense depth maps have been used as a key element of visual perception tasks. There have been tremendous efforts to enhance the depth quality ranging from optimization-based to learning-based methods. Despite the remarkable progress for a long time their applicability in the real world is limited due to systematic measurement biases such as density sensing pattern and scan range. It is well-known that the biases make it difficult for these methods to achieve their generalization. We observe that learning a joint representation for input modalities (e.g. images and depth) which most recent methods adopt is sensitive to the biases. In this work we disentangle those modalities to mitigate the biases with prompt engineering. For this we design a novel depth prompt module to allow the desirable feature representation according to new depth distributions from either sensor types or scene configurations. Our depth prompt can be embedded into foundation models for monocular depth estimation. Through this embedding process our method helps the pretrained model to be free from restraint of depth scan range and to provide absolute scale depth maps. We demonstrate the effectiveness of our method through extensive evaluations. Source code is publicly available at https://github.com/JinhwiPark/DepthPrompting.
-
We introduce a novel 3D generative method Generative 3D Reconstruction (G3DR) in ImageNet capable of generating diverse and high-quality 3D objects from single images addressing the limitations of existing methods. At the heart of our framework is a novel depth regularization technique that enables the generation of scenes with high-geometric fidelity. G3DR also leverages a pretrained language-vision model such as CLIP to enable reconstruction in novel views and improve the visual realism of generations. Additionally G3DR designs a simple but effective sampling procedure to further improve the quality of generations. G3DR offers diverse and efficient 3D asset generation based on class or text conditioning. Despite its simplicity G3DR is able to beat state-of-the-art methods improving over them by up to 22% in perceptual metrics and 90% in geometry scores while needing only half of the training time. Code is available at https://github.com/preddy5/G3DR.
-
In the academic field the research on human motion prediction tasks mainly focuses on exploiting the observed information to forecast human movements accurately in the near future horizon. However a significant gap appears when it comes to the application field as current models are all trained offline with fixed parameters that are inherently suboptimal to handle the complex yet ever-changing nature of human behaviors. To bridge this gap in this paper we introduce the task of online meta adaptation for human motion prediction based on the insight that finding "smart weights" capable of swift adjustments to suit different motion contexts along the time is a key to improving predictive accuracy. We propose MoML which ingeniously borrows the bilevel optimization spirit of model-agnostic meta-learning to transform previous predictive mistakes into strong inductive biases to guide online adaptation. This is achieved by our MoAdapter blocks that can learn error information by facilitating efficient adaptation via a few gradient steps which fine-tunes our meta-learned "smart" initialization produced by the generic predictor. Considering real-time requirements in practice we further propose Fast-MoML a more efficient variant of MoML that features a closed-form solution instead of conventional gradient update. Experimental results show that our approach can effectively bring many existing offline motion prediction models online and improves their predictive accuracy.
-
Generative Adversarial Networks (GANs) dominate the research field in image-based virtual try-on but have not resolved problems such as unnatural deformation of garments and the blurry generation quality. While the generative quality of diffusion models is impressive achieving controllability poses a significant challenge when applying it to virtual try-on and multiple denoising iterations limit its potential for real-time applications. In this paper we propose Controllable Accelerated virtual Try-on with Diffusion Model (CAT-DM). To enhance the controllability a basic diffusion-based virtual try-on network is designed which utilizes ControlNet to introduce additional control conditions and improves the feature extraction of garment images. In terms of acceleration CAT-DM initiates a reverse denoising process with an implicit distribution generated by a pre-trained GAN-based model. Compared with previous try-on methods based on diffusion models CAT-DM not only retains the pattern and texture details of the in-shop garment but also reduces the sampling steps without compromising generation quality. Extensive experiments demonstrate the superiority of CAT-DM against both GAN-based and diffusion-based methods in producing more realistic images and accurately reproducing garment patterns.
-
Aiming to enhance the utilization of metric space by the parametric softmax classifier recent studies suggest replacing it with a non-parametric alternative. Although a non-parametric classifier may provide better metric space utilization it introduces the challenge of capturing inter-class relationships. A shared characteristic among prior non-parametric classifiers is the static assignment of labels to prototypes during the training i.e. each prototype consistently represents a class throughout the training course. Orthogonal to previous works we present a simple yet effective method to optimize the category assigned to each prototype (label-to-prototype assignment) during the training. To this aim we formalize the problem as a two-step optimization objective over network parameters and label-to-prototype assignment mapping. We solve this optimization using a sequential combination of gradient descent and bipartite matching. We demonstrate the benefits of the proposed approach by conducting experiments on balanced and long-tail classification problems using different backbone network architectures. In particular our method outperforms its competitors by 1.22% accuracy on CIFAR-100 and 2.15% on ImageNet-200 using a metric space dimension half of the size of its competitors. Code is available at https://github.com/msed-Ebrahimi/DL2PA_CVPR24.
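The assignment step of the alternating scheme described above can be sketched as follows. This is a toy illustration only: the similarity score (dot product between per-class mean features and fixed prototypes), the function names, and the exhaustive search are assumptions. A practical implementation would use a bipartite matching solver such as the Hungarian algorithm instead of enumerating permutations.

```python
from itertools import permutations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def assign_labels_to_prototypes(class_means, prototypes):
    """Return assignment[c] = index of the prototype given to class c.

    Brute-force search for the one-to-one mapping with maximal total
    similarity; feasible only for a handful of classes.
    """
    n = len(class_means)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(dot(class_means[c], prototypes[perm[c]]) for c in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)


# Hypothetical fixed prototypes and current per-class mean features.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
class_means = [[-0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
assignment = assign_labels_to_prototypes(class_means, prototypes)
print(assignment)  # [2, 0, 1]
```

In a full training loop this step would alternate with gradient updates of the network parameters while the assignment is held fixed.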
-
Large language models (LLMs) have shown remarkable text understanding capabilities which have been extended as Video LLMs to handle video data for comprehending visual details. However existing Video LLMs can only provide a coarse description of the entire video failing to capture the precise start and end time boundary of specific events. In this paper we solve this issue by proposing VTimeLLM a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundary. Specifically our VTimeLLM adopts a boundary-aware three-stage training strategy which respectively utilizes image-text pairs for feature alignment multiple-event videos to increase temporal-boundary awareness and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate that in fine-grained time-related comprehension tasks for videos such as Temporal Video Grounding and Dense Video Captioning VTimeLLM significantly outperforms existing Video LLMs. Moreover the fine-grained temporal understanding of videos further enables VTimeLLM to beat existing Video LLMs on a video dialogue benchmark showing its superior cross-modal understanding and reasoning abilities.
-
Federated learning (FL) is a powerful technology that enables collaborative training of machine learning models without sharing private data among clients. The fundamental challenge in FL lies in learning over extremely heterogeneous data distributions device capacities and device state availabilities all of which adversely impact performance and communication efficiency. While data heterogeneity has been well-studied in the literature this paper introduces FLHetBench the first FL benchmark targeted toward understanding device and state heterogeneity. FLHetBench comprises two new sampling methods to generate real-world device and state databases with varying heterogeneity and new metrics for quantifying the success of FL methods under these real-world constraints. Using FLHetBench we conduct a comprehensive evaluation of existing methods and find that they struggle under these settings which inspires us to propose BiasPrompt+ a new method employing staleness-aware aggregation and fast weights to tackle these new heterogeneity challenges. Experiments on various FL tasks and datasets validate the effectiveness of our BiasPrompt+ method and highlight the value of FLHetBench in fostering the development of more efficient and robust FL solutions under real-world device and state constraints.
-
Hierarchy is a natural representation of semantic taxonomies including the ones routinely used in image segmentation. Indeed recent work on semantic segmentation reports improved accuracy from supervised training leveraging hierarchical label structures. Encouraged by these results we revisit the fundamental assumptions behind that work. We postulate and then empirically verify that the reasons for the observed improvement in segmentation accuracy may be entirely unrelated to the use of the semantic hierarchy. To demonstrate this we design a range of cross-domain experiments with a representative hierarchical approach. We find that on the new testing domains a flat (non-hierarchical) segmentation network in which the parents are inferred from the children has superior segmentation accuracy to the hierarchical approach across the board. Complementing these findings and inspired by the intrinsic properties of hyperbolic spaces we study a more principled approach to hierarchical segmentation using the Poincare ball model. The hyperbolic representation largely outperforms the previous (Euclidean) hierarchical approach as well and is on par with our flat Euclidean baseline in terms of segmentation accuracy. However it additionally exhibits surprisingly strong calibration quality of the parent nodes in the semantic hierarchy especially on the more challenging domains. Our combined analysis suggests that the established practice of hierarchical segmentation may be limited to in-domain settings whereas flat classifiers generalize substantially better especially if they are modeled in the hyperbolic space.
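For readers unfamiliar with the Poincare ball model mentioned above, the sketch below computes its standard geodesic distance for unit curvature, d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))). How this distance enters the segmentation network is not stated in the abstract, so the example only illustrates the geometry; the function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


origin = [0.0, 0.0]
near_center = [0.1, 0.0]
near_boundary = [0.9, 0.0]
print(poincare_distance(origin, near_center))
print(poincare_distance(origin, near_boundary))
```

Distances grow rapidly as points approach the boundary of the ball, which is the property that makes hyperbolic space a natural fit for embedding tree-like label hierarchies.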
-
The modern surge in camera usage alongside widespread computer vision technology applications poses significant privacy and security concerns. Current artificial intelligence (AI) technologies aid in recognizing relevant events and assisting in daily tasks in homes offices hospitals etc. The need to access or process personal information for these purposes raises privacy concerns. While software-level solutions like face de-identification provide a good privacy/utility trade-off they present vulnerabilities to sniffing attacks. In this paper we propose a hardware-level face de-identification method to solve this vulnerability. Specifically our approach first learns an optical encoder along with a regression model to obtain a face heatmap while hiding the face identity from the source image. We also propose an anonymization framework that generates a new face using the privacy-preserving image face heatmap and a reference face image from a public dataset as input. We validate our approach with extensive simulations and hardware experiments.
-
Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic human-robot-mixed environments. Context information such as road maps and surrounding agents' states provides crucial geometric and semantic information for motion behavior prediction. To this end recent works explore two-stage prediction frameworks where coarse trajectories are first proposed and then used to select critical context information for trajectory refinement. However they either incur a large amount of computation or bring limited improvement if not both. In this paper we introduce a novel scenario-adaptive refinement strategy named SmartRefine to refine prediction with minimal additional computation. Specifically SmartRefine can comprehensively adapt refinement configurations based on each scenario's properties and smartly chooses the number of refinement iterations by introducing a quality score to measure the prediction quality and remaining refinement potential of each scenario. SmartRefine is designed as a generic and flexible approach that can be seamlessly integrated into most state-of-the-art motion prediction models. Experiments on Argoverse (1 & 2) show that our method consistently improves the prediction accuracy of multiple state-of-the-art prediction models. Specifically by adding SmartRefine to QCNet we outperform all published ensemble-free works on the Argoverse 2 leaderboard (single agent track) at submission. Comprehensive studies are also conducted to ablate design choices and explore the mechanism behind multi-iteration refinement. Codes are available at https://github.com/opendilab/SmartRefine/.
-
With the rapid development of Multi-modal Large Language Models (MLLMs) a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However most benchmarks predominantly assess spatial understanding in the static image tasks while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue we introduce a comprehensive Multi-modal Video understanding Benchmark namely MVBench which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones we enable the systematic generation of video tasks that require a broad spectrum of temporal skills ranging from perception to cognition. Then guided by the task definition we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand such a distinct paradigm allows us to build MVBench efficiently without much manual intervention. On the other hand it guarantees evaluation fairness with ground-truth video annotations avoiding the biased scoring of LLMs. Moreover we further develop a robust video MLLM baseline i.e. VideoChat2 by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that the existing MLLMs are far from satisfactory in temporal understanding while our VideoChat2 largely surpasses these leading models by over 15% on MVBench.
-
Recent progress in video anomaly detection suggests that the features of appearance and motion play crucial roles in distinguishing abnormal patterns from normal ones. However we note that the effect of spatial scales of anomalies is ignored. Many abnormal events occur in limited localized regions and severe background noise interferes with the learning of anomalous changes. Meanwhile most existing methods are limited by coarse-grained modeling approaches which are inadequate for learning highly discriminative features to discriminate subtle differences between small-scale anomalies and normal patterns. To this end this paper addresses multi-scale video anomaly detection by multi-grained spatio-temporal representation learning. We utilize video continuity to design three proxy tasks to perform feature learning at both coarse-grained and fine-grained levels i.e. continuity judgment discontinuity localization and missing frame estimation. In particular we formulate missing frame estimation as a contrastive learning task in feature space instead of a reconstruction task in RGB space to learn highly discriminative features. Experiments show that our proposed method outperforms state-of-the-art methods on four datasets especially in scenes with small-scale anomalies.
-
The performance of Federated Learning (FL) hinges on the effectiveness of utilizing knowledge from distributed datasets. Traditional FL methods adopt an aggregate-then-adapt framework where clients update local models based on a global model aggregated by the server from the previous training round. This process can cause client drift especially with significant cross-client data heterogeneity impacting model performance and convergence of the FL algorithm. To address these challenges we introduce FedAF a novel aggregation-free FL algorithm. In this framework clients collaboratively learn condensed data by leveraging peer knowledge the server subsequently trains the global model using the condensed data and soft labels received from the clients. FedAF inherently avoids the issue of client drift enhances the quality of condensed data amid notable data heterogeneity and improves the global model performance. Extensive numerical studies on several popular benchmark datasets show FedAF surpasses various state-of-the-art FL algorithms in handling label-skew and feature-skew data heterogeneity leading to superior global model accuracy and faster convergence.
-
Humans can easily solve multimodal tasks in context with only a few demonstrations or simple instructions which current multimodal systems largely struggle to imitate. In this work we demonstrate that by effectively scaling up generative multimodal models their task-agnostic in-context learning capabilities can be significantly enhanced. We introduce Emu2 a generative multimodal model with 37 billion parameters which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in few-shot setting but can also be instruct-tuned to follow specific instructions such as visual question answering and object-grounded image generation. Emu2 even emerges to solve tasks that require on-the-fly reasoning such as visual prompting which existing models are unlikely to handle. We identify additional tasks where Emu2's in-context learning can further improve and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research.
-
Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet the two problems have largely been approached independently without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras and scenes. This work introduces Synergistic Camera and Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior addressing depth scale and dynamic ambiguities. Conditioning on the dense scene recovered we further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by incorporating spatiotemporal coherency and dynamic scene constraints. Together they lead to consistent reconstructions of camera trajectories human meshes and dense scene point clouds in a common world frame.
-
Recent methods for label-free 3D semantic segmentation aim to assist 3D model training by leveraging the open-world recognition ability of pre-trained vision language models. However these methods usually suffer from inconsistent and noisy pseudo-labels provided by the vision language models. To address this issue we present a hierarchical intra-modal correlation learning framework that captures visual and geometric correlations in 3D scenes at three levels: intra-set intra-scene and inter-scene to help learn more compact 3D representations. We refine pseudo-labels using intra-set correlations within each geometric consistency set and align features of visually and geometrically similar points using intra-scene and inter-scene correlation learning. We also introduce a feedback mechanism to distill the correlation learning capability into the 3D model. Experiments on both indoor and outdoor datasets show the superiority of our method. We achieve a state-of-the-art 36.6% mIoU on the ScanNet dataset and a 23.0% mIoU on the nuScenes dataset with improvements of 7.8% mIoU and 2.2% mIoU compared with previous SOTA. We also provide theoretical analysis and qualitative visualization results to discuss the mechanism and conduct thorough ablation studies to support the effectiveness of our framework.
-
Multiple instance learning (MIL) is the most widely used framework in computational pathology encompassing sub-typing diagnosis prognosis and more. However the existing MIL paradigm typically requires an offline instance feature extractor such as a pre-trained ResNet or a foundation model. This approach lacks the capability for feature fine-tuning within the specific downstream tasks limiting its adaptability and performance. To address this issue we propose a Re-embedded Regional Transformer (RRT) for re-embedding the instance features online which captures fine-grained local features and establishes connections across different regions. Unlike existing works that focus on pre-training powerful feature extractor or designing sophisticated instance aggregator RRT is tailored to re-embed instance features online. It serves as a portable module that can seamlessly integrate into mainstream MIL models. Extensive experimental results on common computational pathology tasks validate that: 1) feature re-embedding improves the performance of MIL models based on ResNet-50 features to the level of foundation model features and further enhances the performance of foundation model features; 2) the RRT can introduce more significant performance improvements to various MIL models; 3) RRT-MIL as an RRT-enhanced AB-MIL outperforms other latest methods by a large margin. The code is available at: https://github.com/DearCaat/RRT-MIL.
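For readers unfamiliar with the aggregator that RRT-MIL builds on, the following is a minimal sketch of generic attention-based MIL pooling (the AB-MIL style of aggregation), not of the RRT module itself. The function name, the weight shapes and the random inputs are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np

def attention_mil_pool(instances, V, w):
    """Pool N instance features into one bag feature with learned attention.

    instances: (N, D) instance features, V: (D, H) and w: (H,) are
    hypothetical attention parameters (random here, learned in practice).
    """
    scores = np.tanh(instances @ V) @ w           # one scalar score per instance
    scores = scores - scores.max()                # stabilise the softmax
    attn = np.exp(scores) / np.exp(scores).sum()  # attention weights sum to 1
    return attn @ instances, attn                 # weighted sum of instances

rng = np.random.default_rng(0)
instances = rng.normal(size=(5, 16))  # 5 patches, 16-dim features (made up)
V = rng.normal(size=(16, 8))
w = rng.normal(size=8)
bag, attn = attention_mil_pool(instances, V, w)
```

A re-embedding module in the sense of the abstract would transform `instances` before this pooling step, which is why it can be plugged into different MIL aggregators.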
-
Audio-visual saliency prediction can draw support from diverse modality complements but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies denoising diffusion models have shown more promise in unifying task frameworks owing to their inherent ability of generalization. Following this motivation a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatio-temporal audio-visual features an extra network Saliency-UNet is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal can achieve excellent performance across six challenging audio-visual benchmarks with an average relative improvement of 6.3% over the previous state-of-the-art results by six metrics.
-
This research focuses on the issue of single-image reflection removal (SIRR) in real-world conditions examining it from two angles: the collection pipeline of real reflection pairs and the perception of real reflection locations. We devise an advanced reflection collection pipeline that is highly adaptable to a wide range of real-world reflection scenarios and incurs reduced costs in collecting large-scale aligned reflection pairs. In the process we develop a large-scale high-quality reflection dataset named Reflection Removal in the Wild (RRW). RRW contains over 14950 high-resolution real-world reflection pairs a dataset forty-five times larger than its predecessors. Regarding perception of reflection locations we identify that numerous virtual reflection objects visible in reflection images are not present in the corresponding ground-truth images. This observation drawn from the aligned pairs leads us to conceive the Maximum Reflection Filter (MaxRF). The MaxRF could accurately and explicitly characterize reflection locations from pairs of images. Building upon this we design a reflection location-aware cascaded framework specifically tailored for SIRR. Powered by these innovative techniques our solution outperforms current leading methods across multiple real-world benchmarks. Codes and datasets are available at https://github.com/zhuyr97/Reflection_RemoVal_CVPR2024.
-
3D Morphable Models (3DMMs) provide promising 3D face reconstructions in various applications. However existing methods struggle to reconstruct faces with extreme expressions due to deficiencies in supervisory signals such as sparse or inaccurate landmarks. Segmentation information contains effective geometric contexts for face reconstruction. Certain attempts intuitively depend on differentiable renderers to compare the rendered silhouettes of reconstruction with segmentation which is prone to issues like local optima and gradient instability. In this paper we fully utilize the facial part segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). Specifically PRDL transforms facial part segmentation into 2D points and re-projects the reconstruction onto the image plane. Subsequently by introducing grid anchors and computing different statistical distances from these anchors to the point sets PRDL establishes geometry descriptors to optimize the distribution of the point sets for face reconstruction. PRDL exhibits a clear gradient compared to the renderer-based methods and presents state-of-the-art reconstruction performance in extensive quantitative and qualitative experiments. Our project is available at https://github.com/wang-zidu/3DDFA-V3.
-
In this paper we uncover the untapped potential of diffusion U-Net which serves as a "free lunch" that substantially improves the generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising whereas its skip connections mainly introduce high-frequency features into the decoder module causing the potential neglect of crucial functions intrinsic to the backbone network. Capitalizing on this discovery we propose a simple yet effective method termed "FreeU" which enhances generation quality without additional training or finetuning. Our key insight is to strategically re-weight the contributions sourced from the U-Net's skip connections and backbone feature maps to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that our FreeU can be readily integrated to existing diffusion models e.g. Stable Diffusion DreamBooth and ControlNet to improve the generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference.
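The re-weighting idea can be illustrated with a deliberately simplified sketch: before a decoder block concatenates backbone and skip features, each is multiplied by its own scaling factor. This only assumes what the abstract states (two scaling factors adjusted at inference). The function name, the default values of `b` and `s`, and the uniform scaling are illustrative assumptions, and the published method may apply the factors more selectively.

```python
import numpy as np

def reweight_and_concat(backbone, skip, b=1.2, s=0.9):
    """Scale backbone and skip feature maps, then concatenate on the channel axis.

    backbone, skip: arrays of shape (C, H, W).
    b, s: hypothetical scaling factors for backbone and skip features.
    """
    return np.concatenate([b * backbone, s * skip], axis=0)

backbone = np.ones((4, 8, 8))  # stand-in decoder (backbone) features
skip = np.ones((4, 8, 8))      # stand-in skip-connection features
fused = reweight_and_concat(backbone, skip)
```

Because the change touches only the inputs of existing decoder blocks, no weights are updated, which matches the claim that no training or finetuning is needed.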
-
Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak labels and then self-training a classifier is currently a promising solution. However existing methods use only the RGB visual modality and neglect category text information which limits the generation of more accurate pseudo-labels and affects the performance of self-training. Inspired by the manual labeling process based on the event description in this paper we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model for aligning the video event description text and corresponding video frames to generate pseudo-labels. Specifically we first fine-tune the CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further we propose a learnable text prompt mechanism with the assistance of a normality visual prompt to further improve the matching accuracy of video event description text and video frames. Then we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets UCF-Crime and XD-Violence demonstrating the effectiveness of our proposed method.
-
Vision-based perception for autonomous driving requires an explicit modeling of a 3D space where 2D latent representations are mapped and subsequent 3D operators are applied. However operating on dense latent spaces introduces a cubic time and space complexity which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections like Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient these projections result in information loss especially for tasks like semantic occupancy prediction. To address this we propose SparseOcc an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly a feature pyramid and sparse interpolation enhance scales with information from others. Finally the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction on FLOPs over the dense baseline. Interestingly it also improves accuracy from 12.8% to 14.1% mIOU which in part can be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
-
While super-resolution (SR) methods based on diffusion models exhibit promising results their practical application is hindered by the substantial number of required inference steps. Recent methods utilize the degraded images in the initial state thereby shortening the Markov chain. Nevertheless these solutions either rely on a precise formulation of the degradation process or still necessitate a relatively lengthy generation path (e.g. 15 iterations). To enhance inference speed we propose a simple yet effective method for achieving single-step SR generation named SinSR. Specifically we first derive a deterministic sampling process from the most recent state-of-the-art (SOTA) method for accelerating diffusion-based SR. This allows the mapping between the input random noise and the generated high-resolution image to be obtained in a reduced and acceptable number of inference steps during training. We show that this deterministic mapping can be distilled into a student model that performs SR within only one inference step. Additionally we propose a novel consistency-preserving loss to simultaneously leverage the ground-truth image during the distillation process ensuring that the performance of the student model is not solely bound by the feature manifold of the teacher model resulting in further performance improvement. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model in just one sampling step resulting in up to a 10x speedup for inference. Our code will be released at https://github.com/wyf0912/SinSR/.
-
Video Motion Magnification (VMM) aims to reveal subtle and imperceptible motion information of objects in the macroscopic world. Prior methods directly model the motion field from the Eulerian perspective by Representation Learning that separates shape and texture or Multi-domain Learning from phase fluctuations. Inspired by the frequency spectrum we observe that the low-frequency components with stable energy always possess spatial structure and less noise making them suitable for modeling the subtle motion field. To this end we present FD4MM a new paradigm of Frequency Decoupling for Motion Magnification with a Multi-level Isomorphic Architecture to capture multi-level high-frequency details and a stable low-frequency structure (motion field) in video space. Since high-frequency details and subtle motions are susceptible to information degradation due to their inherent subtlety and unavoidable external interference from noise we carefully design Sparse High/Low-pass Filters to enhance the integrity of details and motion structures and a Sparse Frequency Mixer to promote seamless recoupling. Besides we innovatively design a contrastive regularization for this task to strengthen the model's ability to discriminate irrelevant features reducing undesired motion magnification. Extensive experiments on both Real-world and Synthetic Datasets show that our FD4MM outperforms SOTA methods. Meanwhile FD4MM reduces FLOPs by 1.63x and boosts inference speed by 1.68x compared with the latest method. Our code is available at https://github.com/Jiafei127/FD4MM.
-
In typical medical image classification problems labeled data is scarce while unlabeled data is more available. Semi-supervised learning and self-supervised learning are two different research directions that can improve accuracy by learning from extra unlabeled data. Recent methods from both directions have reported significant gains on traditional benchmarks. Yet past benchmarks do not focus on medical tasks and rarely compare self- and semi- methods together on an equal footing. Furthermore past benchmarks often handle hyperparameter tuning suboptimally. First they may not tune hyperparameters at all leading to underfitting. Second when tuning does occur it often unrealistically uses a labeled validation set that is much larger than the training set. Therefore currently published rankings might not always correspond to their practical utility. This study contributes a systematic evaluation of self- and semi- methods with a unified experimental protocol intended to guide a practitioner with scarce overall labeled data and a limited compute budget. We answer two key questions: Can hyperparameter tuning be effective with realistic-sized validation sets? If so when all methods are tuned well which self- or semi-supervised methods achieve the best accuracy? Our study compares 13 representative semi- and self-supervised methods to strong labeled-set-only baselines on 4 medical datasets. From 20000+ GPU hours of computation we provide valuable best practices to resource-constrained practitioners: hyperparameter tuning is effective and the semi-supervised method known as MixMatch delivers the most reliable gains across 4 datasets.
-
3D asset generation is getting massive amounts of attention inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data which often results in non-photorealistic 3D objects without backgrounds. In this paper we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods the results generated by our method are consistent and have favorable visual quality (-30% FID -37% KID).
-
Open-world detection poses significant challenges as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manual annotated caption datasets for training which are extremely expensive to collect. Instead we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO LVIS Object Detection in the Wild RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods such as GLIP GLIPv2 and Grounding DINO when using the same backbone.
-
In recent advancements in high-fidelity image generation Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However their application at high resolutions presents significant computational challenges. Current methods such as patchifying expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this we introduce the Diffusion State Space Model (DiffuSSM) an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage.
-
Quantifying the degree of similarity between images is a key copyright issue for image-based machine learning. In legal doctrine however determining the degree of similarity between works requires subjective analysis and fact-finders (judges and juries) can demonstrate considerable variability in these subjective judgement calls. Images that are structurally similar can be deemed dissimilar whereas images of completely different scenes can be deemed similar enough to support a claim of copying. We seek to define and compute a notion of "conceptual similarity" among images that captures high-level relations even among images that do not share repeated elements or visually similar components. The idea is to use a base multi-modal model to generate "explanations" (captions) of visual data at increasing levels of complexity. Then similarity can be measured by the length of the caption needed to discriminate between the two images: Two highly dissimilar images can be discriminated early in their description whereas conceptually similar ones will need more detail to be distinguished. We operationalize this definition and show that it correlates with subjective (averaged human evaluation) assessment and beats existing baselines on both image-to-image and text-to-text similarity benchmarks. Beyond just providing a number our method also offers interpretability by pointing to the specific level of granularity of the description where the source data is differentiated.
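The measurement idea (how much description is needed before two images can be told apart) can be sketched as a toy example. It assumes each image already has a list of captions ordered from coarse to detailed, and it uses exact string inequality as a stand-in for the discrimination criterion. The captions and the function name are made up for illustration and are not from the paper.

```python
def discrimination_level(captions_a, captions_b):
    """Return the 1-based description level at which two images first differ.

    captions_a, captions_b: captions ordered from coarse to detailed.
    Returns None if the captions never differ at any shared level.
    """
    for level, (a, b) in enumerate(zip(captions_a, captions_b), start=1):
        if a != b:
            return level
    return None

# Hypothetical multi-level captions for three images.
cat_1 = ["an animal", "a cat", "a cat on a sofa"]
cat_2 = ["an animal", "a cat", "a cat on a windowsill"]
car = ["a vehicle", "a car", "a red car on a road"]

level_cat_vs_car = discrimination_level(cat_1, car)     # differs immediately
level_cat_vs_cat = discrimination_level(cat_1, cat_2)   # differs only in detail
```

A larger level means more detail was needed to separate the pair, so the two cat images score as more conceptually similar than the cat and the car. The level itself indicates the granularity at which they diverge.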
-
Existing methods for synthesizing 3D human gestures from speech have shown promising results but they do not explicitly model the impact of emotions on the generated gestures. Instead these methods directly output animations from speech without control over the expressed emotion. To address this limitation we present AMUSE an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e. gestures related to speech rhythm and word utterances) emotion and personal style are separable. To account for this AMUSE maps the driving audio to three disentangled latent vectors: one for content one for emotion and one for personal style. A latent diffusion model trained to generate gesture motion sequences is then conditioned on these latent vectors. Once trained AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative quantitative and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.
-
This paper presents the first 3D feature tracking method with the corresponding dataset. Our proposed method takes event streams from stereo event cameras as input to predict 3D trajectories of the target features with high-speed motion. To achieve this our method leverages a joint framework to predict the 2D feature motion offsets and the 3D feature spatial position simultaneously. A motion compensation module is leveraged to overcome the feature deformation. A patch matching module based on bi-polarity hypergraph modeling is proposed to robustly estimate the feature spatial position. Meanwhile we collect the first 3D feature tracking dataset with high-speed moving objects and ground truth 3D feature trajectories at 250 FPS named E-3DTrack which can be used as the first high-speed 3D feature tracking benchmark. Our code and dataset can be found at: https://github.com/lisiqi19971013/E-3DTrack.
-
Content-aware graphic layout generation aims to automatically arrange visual elements along with a given content such as an e-commerce product image. In this paper we argue that the current layout generation approaches suffer from the limited training data for the high-dimensional layout structure. We show that a simple retrieval augmentation can significantly improve the generation quality. Our model which is named Retrieval-Augmented Layout Transformer (RALF) retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator. Our model can apply retrieval augmentation to various controllable generation tasks and yield high-quality layouts within a unified architecture. Our extensive experiments show that RALF successfully generates content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.
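The retrieval step can be sketched as a nearest-neighbour lookup over image embeddings. The sketch below assumes cosine similarity over precomputed feature vectors. The embedding source, the similarity measure and all names are illustrative assumptions, since the abstract does not specify them.

```python
import numpy as np

def retrieve_top_k(query, database, k):
    """Return indices and cosine similarities of the k nearest database entries.

    query: (D,) embedding of the input image (hypothetical).
    database: (N, D) embeddings of images that have known layouts.
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity to every entry
    order = np.argsort(-sims)[:k]    # most similar first
    return order, sims[order]

database = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
idx, scores = retrieve_top_k(query, database, k=2)
```

The layouts attached to the retrieved indices would then be passed to the generator as additional conditioning alongside the input image.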
-
Public datasets such as KITTI nuScenes and Waymo have played a key role in the research and development of autonomous vehicles and advanced driver assistance systems. However many of these datasets fail to incorporate a full range of driving conditions; some datasets only contain clear-weather conditions underrepresenting or entirely missing colder weather conditions such as snow or autumn scenes with bright colorful foliage. In this paper we present the Michigan State University Four Seasons (MSU-4S) Dataset which contains real-world collections of autonomous vehicle data from varied types of driving scenarios. These scenarios were recorded throughout a full range of seasons and capture clear rainy snowy and fall weather conditions at varying times of day. MSU-4S contains more than 100000 two- and three-dimensional frames for camera lidar and radar data as well as Global Navigation Satellite System (GNSS) wheel speed and steering data all annotated with weather time-of-day and time-of-year. Our data includes cluttered scenes that have large numbers of vehicles and pedestrians; and it also captures industrial scenes busy traffic thoroughfare with traffic lights and numerous signs and scenes with dense foliage. While providing a diverse set of scenes our data incorporate an important feature: virtually every scene and its corresponding lidar camera and radar frames were captured in four different seasons enabling unparalleled object detection analysis and testing of the domain shift problem across weather conditions. In that context we present detailed analyses for 3D and 2D object detection showing a strong domain shift effect among MSU-4S data segments collected across different conditions. MSU-4S will also enable advanced multimodal fusion research including different combinations of camera-lidar-radar fusion which continues to be of strong interest for the computer vision autonomous driving and ADAS development communities. 
The MSU-4S dataset is available online at https://egr.msu.edu/waves/msu4s.
-
Online Continual Learning (CL) solves the problem of learning the ever-emerging new classification tasks from a continuous data stream. Unlike its offline counterpart in online CL the training data can only be seen once. Most existing online CL research regards catastrophic forgetting (i.e. model stability) as almost the only challenge. In this paper we argue that the model's capability to acquire new knowledge (i.e. model plasticity) is another challenge in online CL. While replay-based strategies have been shown to be effective in alleviating catastrophic forgetting there is a notable gap in research attention toward improving model plasticity. To this end we propose Collaborative Continual Learning (CCL) a collaborative learning based strategy to improve the model's capability in acquiring new concepts. Additionally we introduce Distillation Chain (DC) a collaborative learning scheme to boost the training of the models. We adapt CCL-DC to existing representative online CL works. Extensive experiments demonstrate that even if the learners are well-trained with state-of-the-art online CL methods our strategy can still improve model plasticity dramatically and thereby improve the overall performance by a large margin. The source code of our work is available at https://github.com/maorong-wang/CCL-DC.
-
Recent advances in personalized image generation have enabled pre-trained text-to-image models to learn new concepts from specific image sets. However these methods often necessitate extensive test-time finetuning for each new concept leading to inefficiencies in both time and scalability. To address this challenge we introduce InstantBooth an innovative approach leveraging existing text-to-image models for instantaneous text-guided image personalization eliminating the need for test-time finetuning. This efficiency is achieved through two primary innovations. Firstly we utilize an image encoder that transforms input images into a global embedding to grasp the general concept. Secondly we integrate new adapter layers into the pre-trained model enhancing its ability to capture intricate identity details while maintaining language coherence. Significantly our model is trained exclusively on text-image pairs without reliance on concept-specific paired images. When benchmarked against existing finetuning-based personalization techniques like DreamBooth and Textual-Inversion InstantBooth not only shows comparable proficiency in aligning language with image maintaining image quality and preserving identity but also boasts a 100-fold increase in processing speed.
-
N:M sparsity has received increasing attention due to its remarkable performance and latency trade-off compared with structured and unstructured sparsity. However existing N:M sparsity methods do not differentiate the relative importance of weights among blocks and leave important weights underappreciated. Besides they directly apply N:M sparsity to the whole network which will cause severe information loss. Thus they are still sub-optimal. In this paper we propose an efficient and effective Multi-Axis Query methodology dubbed as MaxQ to rectify these problems. During the training MaxQ employs a dynamic approach to generate soft N:M masks considering the weight importance across multiple axes. This method enhances the weights with more importance and ensures more effective updates. Meanwhile a sparsity strategy that gradually increases the percentage of N:M weight blocks is applied which allows the network to heal from the pruning-induced damage progressively. During the runtime the N:M soft masks can be precomputed as constants and folded into weights without causing any distortion to the sparse pattern and incurring additional computational overhead. Comprehensive experiments demonstrate that MaxQ achieves consistent improvements across diverse CNN architectures in various computer vision tasks including image classification object detection and instance segmentation. For ResNet50 with 1:16 sparse pattern MaxQ can achieve 74.6% top-1 accuracy on ImageNet and improve by over 2.8% over the state-of-the-art. Codes and checkpoints are available at https://github.com/JingyangXiang/MaxQ.
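As background for the N:M pattern itself, here is a minimal sketch of a hard, magnitude-based N:M mask: within every group of M consecutive weights, only the N largest-magnitude entries are kept. This is the plain baseline formulation, not MaxQ's soft multi-axis masks, and the function name and example weights are illustrative assumptions.

```python
import numpy as np

def nm_mask(weights, n, m):
    """Boolean mask keeping the n largest-magnitude weights in each group of m.

    Groups are formed from consecutive entries in row-major order, so the
    total number of weights must be divisible by m.
    """
    groups = weights.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]  # top-n indices per group
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(weights.shape)

w = np.array([[0.1, -0.9, 0.3, 0.05],
              [0.7, 0.2, -0.4, 0.6]])
mask = nm_mask(w, n=2, m=4)   # 2:4 sparsity
sparse_w = w * mask           # pruned weights, exactly 2 non-zeros per group
```

A soft-mask scheme as described in the abstract would replace the boolean mask with importance-dependent values during training and fold them into the weights afterwards.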
-
While remarkable progress has been made on supervised skeleton-based action recognition the challenge of zero-shot recognition remains relatively unexplored. In this paper we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets i.e. NTU-RGB+D 60 NTU-RGB+D 120 and a newly curated dataset Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS surpassing prior skeleton-based solutions and standard baselines from other domains. The source codes can be accessed at https://github.com/azzh1/PURLS.
-
Event cameras offer many advantages over traditional frame-based cameras, such as high dynamic range and low latency. Therefore, event cameras are widely applied in diverse computer vision applications, where event-based keypoint detection is a fundamental task. However, achieving robust event-based keypoint detection remains challenging because the ground truth of event keypoints is difficult to obtain, descriptors extracted by CNNs usually lack discriminative ability in the presence of intense noise, and fixed keypoint detectors are limited in detecting varied keypoint patterns. To address these challenges, a novel event-based keypoint detection method is proposed by learning dynamic detectors and contextual descriptors in a self-supervised manner (SD2Event), including a contextual feature descriptor learning (CFDL) module and a dynamic keypoint detector learning (DKDL) module. The proposed SD2Event enjoys several merits. First, the proposed CFDL module can model long-range contexts efficiently and effectively. Second, the DKDL module generates dynamic keypoint detectors, which can detect keypoints with diverse patterns across various event streams. Third, the proposed self-supervised signals can guide the model's adaptation to event data. Extensive experimental results on three challenging benchmarks show that our proposed method significantly outperforms state-of-the-art event-based keypoint detection methods.
-
We study the visual semantic embedding problem for image-text matching. Most existing work utilizes a tailored cross-attention mechanism to perform local alignment across the image and text modalities. This is computationally expensive, even though it is more powerful than the unimodal dual-encoder approach. This work introduces a dual-encoder image-text matching model that leverages a scene graph to represent captions, with nodes for objects and attributes interconnected by relational edges. Utilizing a graph attention network, our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system. Representing a caption as a scene graph makes it possible to exploit the strong relational inductive bias of graph neural networks to learn object-attribute and object-object relations effectively. To train the model, we propose losses that align the image and caption at both the holistic level (image-caption) and the local level (image-object entity), which we show is key to the success of the model. Our model is termed Composition model for Object Relations and Attributes (CORA). Experimental results on two prominent image-text retrieval benchmarks, Flickr30K and MS-COCO, demonstrate that CORA outperforms existing state-of-the-art, computationally expensive cross-attention methods in recall while retaining the fast computation speed of the dual encoder. Our code is available at https://github.com/vkhoi/cora_cvpr24
-
We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in the episode. We propose a hierarchical model TaleSumm that processes entire episodes by creating compact shot and dialog representations and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks.
-
With photo-realistic image generation, Neural Radiance Field (NeRF) is widely used for large-scale dynamic scene reconstruction as an autonomous driving simulator. However, large-scale scene reconstruction still suffers from extremely long training and rendering times. Low-resolution (LR) rendering combined with upsampling can alleviate this problem, but it degrades image quality. In this paper, we design a lightweight reference decoder that exploits prior information from known views to improve the image reconstruction quality of new views. In addition, to speed up the prior information search, we propose a search method based on optical flow and structural similarity. Results on the KITTI and VKITTI2 datasets show that our method significantly outperforms the baseline method in terms of training speed, rendering speed, and rendering quality.
-
Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However previous methods have primarily focused on enhancing multi-modal capabilities. In this work we introduce a versatile multi-modal large language model mPLUG-Owl2 which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design with the language decoder acting as a universal interface for managing different modalities. Specifically mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks while achieving state-of-the-art performances with a single generalized model. Notably mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios setting a pioneering path in the development of future multi-modal foundation models.
-
Image datasets are essential not only in validating existing methods in computer vision but also in developing new methods. Many image datasets exist consisting of trichromatic intensity images taken with RGB cameras which are designed to replicate human vision. However polarization and spectrum the wave properties of light that animals in harsh environments and with limited brain capacity often rely on remain underrepresented in existing datasets. Although there are previous spectro-polarimetric datasets they have insufficient object diversity limited illumination conditions linear-only polarization data and inadequate image count. Here we introduce two spectro-polarimetric datasets consisting of trichromatic Stokes images and hyperspectral Stokes images. These datasets encompass both linear and circular polarization; they introduce multiple spectral channels; and they feature a broad selection of real-world scenes. With our dataset in hand we analyze the spectro-polarimetric image statistics develop efficient representations of such high-dimensional data and evaluate spectral dependency of shape-from-polarization methods. As such the proposed dataset promises a foundation for data-driven spectro-polarimetric imaging and vision research.
-
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model-generated annotation, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples for the ICCC task from image-text datasets with low labeling and computation costs. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
-
Automating visual inspection in industrial production lines is essential for increasing product quality across various industries. Anomaly detection (AD) methods serve as robust tools for this purpose. However existing public datasets primarily consist of images without anomalies limiting the practical application of AD methods in production settings. To address this challenge we present (1) the Valeo Anomaly Dataset (VAD) a novel real-world industrial dataset comprising 5000 images including 2000 instances of challenging real defects across more than 20 subclasses. Acknowledging that traditional AD methods struggle with this dataset we introduce (2) Segmentation-based Anomaly Detector (SegAD). First SegAD leverages anomaly maps as well as segmentation maps to compute local statistics. Next SegAD uses these statistics and an optional supervised classifier score as input features for a Boosted Random Forest (BRF) classifier yielding the final anomaly score. Our SegAD achieves state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset (+0.4% AUROC). The code and the models are publicly available.
-
Current approaches for 3D scene graph prediction rely on labeled datasets to train models for a fixed set of known object classes and relationship categories. We present Open3DSG an alternative approach to learn 3D scene graph prediction in an open world without requiring labeled scene graph data. We co-embed the features from a 3D scene graph prediction backbone with the feature space of powerful open world 2D vision language foundation models. This enables us to predict 3D scene graphs from 3D point clouds in a zero-shot manner by querying object classes from an open vocabulary and predicting the inter-object relationships from a grounded LLM with scene graph features and queried object classes as context. Open3DSG is the first 3D point cloud method to predict not only explicit open-vocabulary object classes but also open-set relationships that are not limited to a predefined label set making it possible to express rare as well as specific objects and relationships in the predicted 3D scene graph. Our experiments show that Open3DSG is effective at predicting arbitrary object classes as well as their complex inter-object relationships describing spatial supportive semantic and comparative relationships.
-
In this paper we revisit techniques for uncertainty estimation within deep neural networks and consolidate a suite of techniques to enhance their reliability. Our investigation reveals that an integrated application of diverse techniques--spanning model regularization classifier and optimization--substantially improves the accuracy of uncertainty predictions in image classification tasks. The synergistic effect of these techniques culminates in our novel SURE approach. We rigorously evaluate SURE against the benchmark of failure prediction a critical testbed for uncertainty estimation efficacy. Our results showcase a consistently better performance than models that individually deploy each technique across various datasets and model architectures. When applied to real-world challenges such as data corruption label noise and long-tailed class distribution SURE exhibits remarkable robustness delivering results that are superior or on par with current state-of-the-art specialized methods. Particularly on Animal-10N and Food-101N for learning with noisy labels SURE achieves state-of-the-art performance without any task-specific adjustments. This work not only sets a new benchmark for robust uncertainty estimation but also paves the way for its application in diverse real-world scenarios where reliability is paramount. Our code is available at https://yutingli0606.github.io/SURE/.
-
In radio astronomy visibility data which are measurements of wave signals from radio telescopes are transformed into images for observation of distant celestial objects. However these resultant images usually contain both real sources and artifacts due to signal sparsity and other factors. One way to obtain cleaner images is to reconstruct samples into dense forms before imaging. Unfortunately existing reconstruction methods often miss some components of visibility in frequency domain so blurred object edges and persistent artifacts remain in the images. Furthermore the computation overhead is high on irregular visibility samples due to the data skew. To address these problems we propose PolarRec a transformer-encoder-conditioned reconstruction pipeline with visibility samples converted into the polar coordinate system. This coordinate system matches the way in which radio telescopes observe a celestial area as the Earth rotates. As a result visibility samples distribute in the polar system more uniformly than in the Cartesian space. Therefore we propose to use radial distance in the loss function to help reconstruct complete visibility effectively. Also we group visibility samples by their polar angles and propose a group-based encoding scheme to improve the efficiency. Our experiments demonstrate that PolarRec markedly improves imaging results by faithfully reconstructing all frequency components in the visibility domain while significantly reducing the computation cost in visibility data encoding. The code is available at https://github.com/RapidsAtHKUST/PolarRec.
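To make the coordinate change concrete, the sketch below converts sample positions from Cartesian to polar form and buckets them by polar angle. It only illustrates the general idea of polar conversion and angle-based grouping; the function names, the equal-width angle bins, and the toy sample positions are assumptions for this example and are not taken from the PolarRec code.

```python
import math


def to_polar(u, v):
    """Convert a Cartesian (u, v) position to (radius, angle in [0, 2*pi))."""
    radius = math.hypot(u, v)
    theta = math.atan2(v, u) % (2 * math.pi)
    return radius, theta


def group_by_angle(samples, num_groups):
    """Bucket (u, v) samples into equal-width polar-angle bins (an assumed scheme)."""
    width = 2 * math.pi / num_groups
    groups = [[] for _ in range(num_groups)]
    for u, v in samples:
        radius, theta = to_polar(u, v)
        # Clamp guards against theta / width rounding up to num_groups.
        idx = min(int(theta / width), num_groups - 1)
        groups[idx].append((radius, theta))
    return groups


# Toy positions, one per quadrant (hypothetical data).
samples = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
groups = group_by_angle(samples, num_groups=4)
for idx, members in enumerate(groups):
    print(idx, members)
```

The radius returned by `to_polar` is the kind of quantity a radial-distance term in a loss function could use, while the group index gives one simple way to batch samples that lie in similar directions.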
-
Convolutional neural networks benefit from translation equivariance achieving tremendous success. Equivariant networks further extend this property to other transformation groups. However most existing methods require discretization or sampling of groups leading to increased model sizes for larger groups such as the affine group. In this paper we build affine equivariant networks based on differential invariants from the viewpoint of symmetric PDEs without discretizing or sampling the group. To address the division-by-zero issue arising from fractional differential invariants of the affine group we construct a new kind of affine invariants by normalizing polynomial relative differential invariants to replace classical differential invariants. For further flexibility we design an equivariant layer which can be directly integrated into convolutional networks of various architectures. Moreover our framework for the affine group is also applicable to its continuous subgroups. We implement equivariant networks for the scale group the rotation-scale group and the affine group. Numerical experiments demonstrate the outstanding performance of our framework across classification tasks involving transformations of these groups. Remarkably under the out-of-distribution setting our model achieves a 3.37% improvement in accuracy over the main counterpart affConv on the affNIST dataset.
-
In text-to-image personalization a timely and crucial challenge is the tendency of generated images overfitting to the biases present in the reference images. We initiate our study with a comprehensive categorization of the biases into background nearby-object tied-object substance (in style re-contextualization) and pose biases. These biases manifest in the generated images due to their entanglement into the subject embedding. This undesired embedding entanglement not only results in the reflection of biases from the reference images into the generated images but also notably diminishes the alignment of the generated images with the given generation prompt. To address this challenge we propose SID (Selectively Informative Description) a text description strategy that deviates from the prevalent approach of only characterizing the subject's class identification. SID is generated utilizing multimodal GPT-4 and can be seamlessly integrated into optimization-based models. We present comprehensive experimental results along with analyses of cross-attention maps subject-alignment non-subject-disentanglement and text-alignment.
-
We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects coined "action context". We propose TransFusion a multimodal transformer-based architecture for short-term object interaction anticipation. Our method exploits the representational power of language by summarizing the action context textually after leveraging pre-trained vision-language foundation models to extract the action context from past video frames. The summarized action context and the last observed video frame are processed by the multimodal fusion module to forecast the next object interaction. Experiments on the Ego4D next active object interaction dataset show the effectiveness of our multimodal fusion model and highlight the benefits of using the power of foundation models and language-based context summaries in a task where vision may appear to suffice. Our novel approach outperforms all state-of-the-art methods on both versions of the Ego4D dataset.
-
Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties which are highly desirable for generalizable denoising. Leveraging these properties we devise an asymmetrical encoder-decoder denoising network which incorporates dense features including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP into a learnable image decoder to achieve generalizable denoising. The progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises including synthetic noise real-world sRGB noise and low-dose CT image noise demonstrate the superior generalization ability of our method.
-
Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce that the ratio between the variation of an arbitrary input latent and that of the output image is constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution, not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion.
-
3D visual grounding plays a crucial role in scene understanding, with extensive applications in AR/VR. Despite the significant progress made in recent methods, the requirement of dense textual descriptions for each individual object, which is time-consuming and costly, hinders their scalability. To mitigate reliance on text annotations during training, researchers have explored language-free training paradigms in the 2D field via explicit text generation or implicit feature substitution. Nevertheless, unlike 2D images, the complexity of spatial relations in 3D, coupled with the absence of robust 3D visual language pre-trained models, makes it challenging to directly transfer previous strategies. To tackle these issues, we introduce a language-free training framework for 3D visual grounding. By utilizing the visual-language joint embedding in a 2D large cross-modality model as a bridge, we can expediently produce pseudo-language features from the features of 2D images, which are equivalent to those of real textual descriptions. We further develop a relation injection scheme, with a Neighboring Relation-aware Modeling module and a Cross-modality Relation Consistency module, aiming to enhance and preserve the complex relationships between the 2D and 3D embedding spaces. Extensive experiments demonstrate that our proposed language-free 3D visual grounding approach obtains promising performance across three widely used datasets: ScanRefer, Nr3D, and Sr3D. Our codes are available at https://github.com/xibi777/3DLFVG
-
The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone in order to form a global descriptor for each image. In this context we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors) which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a dustbin cluster designed to selectively discard features deemed non-informative enhancing the overall descriptor quality. Additionally we leverage and fine-tune DINOv2 as a backbone which provides enhanced description power for the local features and dramatically reduces the required training time. As a result our single-stage method not only surpasses single-stage baselines in public VPR datasets but also surpasses two-stage methods that add a re-ranking with significantly higher cost.
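The following is a minimal sketch of Sinkhorn-style assignment with a dustbin column, intended only to show how features with uniformly low cluster scores can lose most of their mass to the dustbin. It is not the SALAD implementation: the function name, the fixed `dustbin_score`, the uniform column marginal, and the toy score matrix are all assumptions made for this illustration.

```python
import math


def sinkhorn_with_dustbin(scores, dustbin_score=0.0, iters=50):
    """Soft-assign n features to m clusters plus one dustbin column.

    scores: n x m list of feature-to-cluster scores.
    Returns an n x m matrix of assignment weights with the dustbin dropped.
    """
    n, m = len(scores), len(scores[0])
    # Append the dustbin column, then exponentiate (shifted by the max for stability).
    aug = [list(row) + [dustbin_score] for row in scores]
    top = max(max(row) for row in aug)
    P = [[math.exp(s - top) for s in row] for row in aug]
    col_target = n / (m + 1)  # assumed: every column receives equal total mass
    for _ in range(iters):
        # Column step: rescale each column to the assumed target mass.
        for j in range(m + 1):
            col_sum = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] *= col_target / col_sum
        # Row step: each feature distributes exactly one unit of mass.
        for i in range(n):
            row_sum = sum(P[i])
            P[i] = [p / row_sum for p in P[i]]
    return [row[:m] for row in P]


# Two features that clearly match one cluster each, and one that matches neither.
scores = [[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]]
assignment = sinkhorn_with_dustbin(scores)
retained = [sum(row) for row in assignment]
print(retained)
```

Alternating the row and column steps accounts for both feature-to-cluster and cluster-to-feature relations, and dropping the dustbin column at the end leaves the third feature with almost no weight in the aggregated descriptor.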
-
Image enhancement holds extensive applications in real-world scenarios due to complex environments and limitations of imaging devices. Conventional methods are often constrained by their tailored models resulting in diminished robustness when confronted with challenging degradation conditions. In response we propose FlowIE a simple yet highly effective flow-based image enhancement framework that estimates straight-line paths from an elementary distribution to high-quality images. Unlike previous diffusion-based methods that suffer from long-time inference FlowIE constructs a linear many-to-one transport mapping via conditioned rectified flow. The rectification straightens the trajectories of probability transfer accelerating inference by an order of magnitude. This design enables our FlowIE to fully exploit rich knowledge in the pre-trained diffusion model rendering it well-suited for various real-world applications. Moreover we devise a faster inference algorithm inspired by Lagrange's Mean Value Theorem harnessing midpoint tangent direction to optimize path estimation ultimately yielding visually superior results. Thanks to these designs our FlowIE adeptly manages a diverse range of enhancement tasks within a concise sequence of fewer than 5 steps. Our contributions are rigorously validated through comprehensive experiments on synthetic and real-world datasets unveiling the compelling efficacy and efficiency of our proposed FlowIE.
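For intuition on why a midpoint tangent direction can help when only a handful of integration steps are taken, the sketch below compares a plain Euler step with the generic explicit midpoint rule on a toy velocity field. This is the textbook midpoint method, not FlowIE's inference algorithm; the velocity field `v(x, t) = x` and all function names are assumptions picked so that the exact answer is known.

```python
import math


def euler_step(v, x, t, h):
    """Advance x by one step using the tangent at the start of the interval."""
    return x + h * v(x, t)


def midpoint_step(v, x, t, h):
    """Advance x by one step using the tangent evaluated at the interval midpoint."""
    x_mid = x + 0.5 * h * v(x, t)
    return x + h * v(x_mid, t + 0.5 * h)


def integrate(step, v, x0, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with a fixed number of steps."""
    h = 1.0 / steps
    x, t = x0, 0.0
    for _ in range(steps):
        x = step(v, x, t, h)
        t += h
    return x


def velocity(x, t):
    # Toy field with known solution x(1) = e * x(0).
    return x


exact = math.e
euler_result = integrate(euler_step, velocity, 1.0, steps=4)
midpoint_result = integrate(midpoint_step, velocity, 1.0, steps=4)
print(abs(euler_result - exact), abs(midpoint_result - exact))
```

With the same four steps, the midpoint rule lands much closer to the exact value than Euler does, at the cost of two velocity evaluations per step. When the path is perfectly straight (constant velocity), both rules are exact, which is why straightened trajectories tolerate very few steps.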
-
Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap between things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. Extensive experiments on over 160 datasets demonstrate that, with only one suite of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE.
-
Multimodal sentiment analysis (MSA) aims to understand human sentiment through multimodal data. Most MSA efforts are based on the assumption of modality completeness. However in real-world applications some practical factors cause uncertain modality missingness which drastically degrades the model's performance. To this end we propose a Correlation-decoupled Knowledge Distillation (CorrKD) framework for the MSA task under uncertain missing modalities. Specifically we present a sample-level contrastive distillation mechanism that transfers comprehensive knowledge containing cross-sample correlations to reconstruct missing semantics. Moreover a category-guided prototype distillation mechanism is introduced to capture cross-category correlations using category prototypes to align feature distributions and generate favorable joint representations. Eventually we design a response-disentangled consistency distillation strategy to optimize the sentiment decision boundaries of the student network through response disentanglement and mutual information maximization. Comprehensive experiments on three datasets indicate that our framework can achieve favorable improvements compared with several baselines.
-
The machine learning community has witnessed a drastic change in the training pipeline, pivoted by those "foundation models" with unprecedented scales. However, the field of adversarial training is lagging behind, predominantly centered around small model sizes like ResNet-50 and tiny, low-resolution datasets like CIFAR-10. To bridge this transformation gap, this paper provides a modern re-examination of adversarial training, investigating its potential benefits when applied at scale. Additionally, we introduce an efficient and effective training strategy to enable adversarial training with giant models and web-scale data at an affordable computing cost. We denote this newly introduced framework as AdvXL. Empirical results demonstrate that AdvXL establishes new state-of-the-art robust accuracy records under AutoAttack on ImageNet-1K. For example, by training on the DataComp-1B dataset, our AdvXL empowers a vanilla ViT-g model to substantially surpass the previous records of $\ell_\infty$-, $\ell_2$-, and $\ell_1$-robust accuracy by margins of 11.4%, 14.2%, and 12.9%, respectively. This achievement posits AdvXL as a pioneering approach, charting a new trajectory for the efficient training of robust visual representations at significantly larger scales. Our code is available at https://github.com/UCSC-VLAA/AdvXL.
-
Although adversarial training (AT) has proven effective in enhancing the model's robustness the recently revealed issue of fairness in robustness has not been well addressed i.e. the robust accuracy varies significantly among different categories. In this paper instead of uniformly evaluating the model's average class performance we delve into the issue of robust fairness by considering the worst-case distribution across various classes. We propose a novel learning paradigm named Fairness-Aware Adversarial Learning (FAAL). As a generalization of conventional AT we re-define the problem of adversarial training as a min-max-max framework to ensure both robustness and fairness of the trained model. Specifically by taking advantage of distributional robust optimization our method aims to find the worst distribution among different categories and the solution is guaranteed to obtain the upper bound performance with high probability. In particular FAAL can fine-tune an unfair robust model to be fair within only two epochs without compromising the overall clean and robust accuracies. Extensive experiments on various image datasets validate the superior performance and efficiency of the proposed FAAL compared to other state-of-the-art methods.
-
Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip. The text expression normally contains a sophisticated description of the instance's appearance, action, and relation with others. It is therefore rather difficult for an RVOS model to capture all these attributes correspondingly in the video; in fact, the model often favours the action- and relation-related visual attributes of the instance. This can end up with a partial or even incorrect mask prediction of the target instance. We tackle this problem by taking a subject-centric short text expression from the original long text expression. The short one retains only the appearance-related information of the target instance, so that we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both long and short text expressions, and insert a long-short cross-attention module to let the joint features interact and a long-short predictions intersection loss to regulate the joint predictions. Besides the improvement on the linguistic part, we also introduce a forward-backward visual consistency loss, which utilizes optical flows to warp visual features between the annotated frames and their temporal neighbors for consistency. We build our method on top of two state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences, and Refer-DAVIS17 show impressive improvements of our method. Code is available.
-
Leveraging 2D images and pre-trained models to guide 3D point cloud feature representation has shown remarkable potential to boost the performance of 3D fundamental models. While some works rely on additional data, such as 2D real-world images and their corresponding camera poses, recent studies aim to use point clouds exclusively by designing 3D-to-2D projections. However, in indoor scenes, existing 3D-to-2D projection strategies suffer from severe occlusions and incoherence, and fail to contain sufficient information for the fine-grained point cloud segmentation task. In this paper, we argue that the crux of the matter resides in the basic premise of existing projection strategies that the medium is homogeneous, so that projection rays propagate along straight lines and objects behind are occluded by those in front. Inspired by the phenomenon of mirage, where occluded objects are exposed by light rays distorted by a heterogeneous medium refraction rate, we propose MirageRoom, which designs a parametric mirage projection with a heterogeneous medium to obtain a series of projected images with various degrees of distortion. We further develop a masked reprojection module across 2D and 3D latent space to bridge the gap between the pre-trained 2D backbone and 3D point-wise features. Both quantitative and qualitative experimental results on S3DIS and ScanNet V2 demonstrate the effectiveness of our method.
-
Dual-camera compressive hyperspectral imaging (DCCHI) offers the capability to reconstruct 3D hyperspectral image (HSI) by fusing compressive and panchromatic (PAN) image which has shown great potential for snapshot hyperspectral imaging in practice. In this paper we introduce a novel DCCHI reconstruction network intra-inter similarity exploiting Transformer (In2SET). Our key insight is to make full use of the PAN image to assist the reconstruction. To this end we propose to use the intra-similarity within the PAN image as a proxy for approximating the intra-similarity in the original HSI thereby offering an enhanced content prior for more accurate HSI reconstruction. Furthermore we propose to use the inter-similarity to align the features between HSI and PAN images thereby maintaining semantic consistency between the two modalities during the reconstruction process. By integrating In2SET into a PAN-guided deep unrolling (PGDU) framework our method substantially enhances the spatial-spectral fidelity and detail of the reconstructed images providing a more comprehensive and accurate depiction of the scene. Experiments conducted on both real and simulated datasets demonstrate that our approach consistently outperforms existing state-of-the-art methods in terms of reconstruction quality and computational complexity. The code is available at https://github.com/2JONAS/In2SET.
-
Unsupervised video object segmentation (VOS) aims to detect and segment the most salient object in videos. The primary techniques used in unsupervised VOS are 1) the collaboration of appearance and motion information; and 2) temporal fusion between different frames. This paper proposes two novel prototype-based attention mechanisms inter-modality attention (IMA) and inter-frame attention (IFA) to incorporate these techniques via dense propagation across different modalities and frames. IMA densely integrates context information from different modalities based on a mutual refinement. IFA injects global context of a video to the query frame enabling a full utilization of useful properties from multiple frames. Experimental results on public benchmark datasets demonstrate that our proposed approach outperforms all existing methods by a substantial margin. The proposed two components are also thoroughly validated via ablative study.
-
Look-Up Table (LUT) has recently gained increasing attention for restoring High-Quality (HQ) images from Low-Quality (LQ) observations thanks to its high computational efficiency achieved through a "space for time" strategy of caching learned LQ-HQ pairs. However incorporating multiple LUTs for improved performance comes at the cost of a rapidly growing storage size which is ultimately restricted by the allocatable on-device cache size. In this work we propose a novel LUT compression framework to achieve a better trade-off between storage size and performance for LUT-based image restoration models. Based on the observation that most cached LQ image patches are distributed along the diagonal of a LUT we devise a Diagonal-First Compression (DFC) framework where diagonal LQ-HQ pairs are preserved and carefully re-indexed to maintain the representation capacity while non-diagonal pairs are aggressively subsampled to save storage. Extensive experiments on representative image restoration tasks demonstrate that our DFC framework significantly reduces the storage size of LUT-based models (including our new design) while maintaining their performance. For instance DFC saves up to 90% of storage at a negligible performance drop for x4 super-resolution. The source code is available on GitHub: https://github.com/leenas233/DFC.
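The diagonal-first idea can be illustrated on a toy 2D table. This is only a sketch under stated assumptions: the 17-level table, the `band` and `stride` parameters, the nearest-grid-point fallback, and the function names are all hypothetical, and the re-indexing scheme of the actual DFC framework is not reproduced here.

```python
def compress_lut(lut, band=2, stride=4):
    """Keep every entry near the diagonal; keep the rest only on a coarse grid."""
    n = len(lut)
    # The coarse grid must include the last index so every lookup can snap to it.
    assert (n - 1) % stride == 0
    diagonal = {
        (a, b): lut[a][b]
        for a in range(n)
        for b in range(n)
        if abs(a - b) <= band
    }
    coarse = {
        (a, b): lut[a][b]
        for a in range(0, n, stride)
        for b in range(0, n, stride)
    }
    return diagonal, coarse


def lookup(diagonal, coarse, a, b, stride=4):
    """Exact value near the diagonal, nearest coarse grid point elsewhere."""
    if (a, b) in diagonal:
        return diagonal[(a, b)]

    def snap(v):
        return (v + stride // 2) // stride * stride

    return coarse[(snap(a), snap(b))]


# Toy table over two quantized pixel values (17 levels each); values are made up.
n = 17
lut = [[(a + b) / 2 for b in range(n)] for a in range(n)]
diagonal, coarse = compress_lut(lut)
stored = len(diagonal) + len(coarse)  # entries kept, versus n * n in the full table
```

In this toy setting, near-diagonal lookups stay exact while off-diagonal lookups are approximated, which mirrors the trade-off the abstract describes between storage size and representation capacity.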
-
Acquiring large-scale well-annotated datasets is essential for training robust scene text detectors yet the process is often resource-intensive and time-consuming. While some efforts have been made to explore the synthesis of scene text images a notable gap remains between synthetic and authentic data. In this paper we introduce a novel method that utilizes Neural Radiance Fields (NeRF) to model real-world scenes and emulate the data collection process by rendering images from diverse camera perspectives enriching the variability and realism of the synthesized data. A semi-supervised learning framework is proposed to categorize semantic regions within 3D scenes ensuring consistent labeling of text regions across various viewpoints. Our method also models the pose and view-dependent appearance of text regions thereby offering precise control over camera poses and significantly improving the realism of text insertion and editing within scenes. Employing our technique on real-world scenes has led to the creation of a novel scene text image dataset. Compared to other existing benchmarks the proposed dataset is distinctive in providing not only standard annotations such as bounding boxes and transcriptions but also the information of 3D pose attributes for text regions enabling a more detailed evaluation of the robustness of text detection algorithms. Through extensive experiments we demonstrate the effectiveness of our proposed method in enhancing the performance of scene text detectors.
-
In the film and gaming industries achieving a realistic hair appearance typically involves the use of strands originating from the scalp. However reconstructing these strands from observed surface images of hair presents significant challenges. The difficulty in acquiring Ground Truth (GT) data has led state-of-the-art learning-based methods to rely on pre-training with manually prepared synthetic CG data. This process is not only labor-intensive and costly but also introduces complications due to the domain gap when compared to real-world data. In this study we propose an optimization-based approach that eliminates the need for pre-training. Our method represents hair strands as line segments growing from the scalp and optimizes them using a novel differentiable rendering algorithm. To robustly optimize a substantial number of slender explicit geometries we introduce 3D orientation estimation utilizing global optimization strand initialization based on Laplace's equation and reparameterization that leverages geometric connectivity and spatial proximity. Unlike existing optimization-based methods our method is capable of reconstructing internal hair flow in an absolute direction. Our method exhibits robust and accurate inverse rendering surpassing the quality of existing methods and significantly improving processing speed.
-
Diffusion models, emerging as powerful deep generative tools, excel in various applications. They operate through a two-step process: introducing noise into training samples and then employing a model to convert random noise into new samples (e.g., images). However, their remarkable generative performance is hindered by slow training and sampling. This is due to the necessity of tracking extensive forward and reverse diffusion trajectories and employing a large model with numerous parameters across multiple timesteps (i.e., noise levels). To tackle these challenges, we present a multi-stage framework inspired by our empirical findings. These observations indicate the advantages of employing distinct parameters tailored to each timestep while retaining universal parameters shared across all timesteps. Our approach involves segmenting the time interval into multiple stages, where we employ a custom multi-decoder U-Net architecture that blends time-dependent models with a universally shared encoder. Our framework enables the efficient distribution of computational resources and mitigates inter-stage interference, which substantially improves training efficiency. Extensive numerical experiments affirm the effectiveness of our framework, showcasing significant training and sampling efficiency enhancements on three state-of-the-art diffusion models, including large-scale latent diffusion models. Furthermore, our ablation studies illustrate the impact of two important components in our framework: (i) a novel timestep clustering algorithm for stage division, and (ii) an innovative multi-decoder U-Net architecture seamlessly integrating universal and customized hyperparameters.
-
We introduce in-context matting a novel task setting of image matting. Given a reference image of a certain foreground and guided priors such as points scribbles and masks in-context matting enables automatic alpha estimation on a batch of target images of the same foreground category without additional auxiliary input. This setting marries good performance in auxiliary input-based matting and ease of use in automatic matting which finds a good trade-off between customization and automation. To overcome the key challenge of accurate foreground matching we introduce IconMatting an in-context matting model built upon a pre-trained text-to-image diffusion model. Conditioned on inter- and intra-similarity matching IconMatting can make full use of reference context to generate accurate target alpha mattes. To benchmark the task we also introduce a novel testing dataset ICM-57 covering 57 groups of real-world images. Quantitative and qualitative results on the ICM-57 testing set show that IconMatting rivals the accuracy of trimap-based matting while retaining the automation level akin to automatic matting. Code is available at https://github.com/tiny-smart/in-context-matting.
-
Recent studies have noted an intriguing phenomenon termed Neural Collapse that is when the neural networks establish the right correlation between feature spaces and the training targets their last-layer features together with the classifier weights will collapse into a stable and symmetric structure. In this paper we extend the investigation of Neural Collapse to the biased datasets with imbalanced attributes. We observe that models will easily fall into the pitfall of shortcut learning and form a biased non-collapsed feature space at the early period of training which is hard to reverse and limits the generalization capability. To tackle the root cause of biased classification we follow the recent inspiration of prime training and propose an avoid-shortcut learning framework without additional training complexity. With well-designed shortcut primes based on Neural Collapse structure the models are encouraged to skip the pursuit of simple shortcuts and naturally capture the intrinsic correlations. Experimental results demonstrate that our method induces a better convergence property during training and achieves state-of-the-art generalization performance on both synthetic and real-world biased datasets.
-
Advances in neural fields are enabling high-fidelity capture of the shape and appearance of dynamic 3D scenes. However, their capabilities lag behind those offered by conventional representations such as 2D videos because of algorithmic challenges and the lack of large-scale multi-view real-world datasets. We address the dataset limitation with DiVa-360, a real-world 360° dynamic visual dataset that contains synchronized high-resolution and long-duration multi-view video sequences of table-scale scenes captured using a customized low-cost system with 53 cameras. It contains 21 object-centric sequences categorized by different motion types, 25 intricate hand-object interaction sequences, and 8 long-duration sequences, for a total of 17.4M image frames. In addition, we provide foreground-background segmentation masks, synchronized audio, and text descriptions. We benchmark the state-of-the-art dynamic neural field methods on DiVa-360 and provide insights about existing methods and future challenges on long-duration neural field capture.
-
We present the subspace-constrained Tyler's estimator (STE) designed for recovering a low-dimensional subspace within a dataset that may be highly corrupted with outliers. STE is a fusion of the Tyler's M-estimator (TME) and a variant of the fast median subspace. Our theoretical analysis suggests that under a common inlier-outlier model STE can effectively recover the underlying subspace even when it contains a smaller fraction of inliers relative to other methods in the field of robust subspace recovery. We apply STE in the context of Structure from Motion (SfM) in two ways: for robust estimation of the fundamental matrix and for the removal of outlying cameras enhancing the robustness of the SfM pipeline. Numerical experiments confirm the state-of-the-art performance of our method in these applications. This research makes significant contributions to the field of robust subspace recovery particularly in the context of computer vision and 3D reconstruction.
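For readers unfamiliar with Tyler's M-estimator, the standard fixed-point iteration behind it can be sketched in two dimensions. This shows plain TME only, not the subspace-constrained STE variant proposed in the abstract; the function names and the sample points are hypothetical, and the data are assumed to be centered at the origin.

```python
def tyler_m_estimator_2d(points, iterations=50):
    """Fixed-point iteration for Tyler's M-estimator of a 2x2 shape matrix.

    Each point is weighted by 1 / (x^T S^{-1} x), so only its direction matters
    and far-away outliers cannot dominate. The result is rescaled to trace 2.
    """
    s = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(iterations):
        det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
        inv = [[s[1][1] / det, -s[0][1] / det],
               [-s[1][0] / det, s[0][0] / det]]
        acc = [[0.0, 0.0], [0.0, 0.0]]
        for x, y in points:
            # Quadratic form x^T S^{-1} x for the current shape estimate.
            q = x * (inv[0][0] * x + inv[0][1] * y) + y * (inv[1][0] * x + inv[1][1] * y)
            acc[0][0] += x * x / q
            acc[0][1] += x * y / q
            acc[1][0] += x * y / q
            acc[1][1] += y * y / q
        trace = acc[0][0] + acc[1][1]
        s = [[2.0 * v / trace for v in row] for row in acc]
    return s


def sample_scatter(points):
    """Unweighted scatter matrix, shown for comparison."""
    return [[sum(x * x for x, _ in points), sum(x * y for x, y in points)],
            [sum(x * y for x, y in points), sum(y * y for _, y in points)]]


# Eight inliers close to the x-axis plus two large outliers along the y-axis.
points = [(1, 0.1), (2, -0.1), (3, 0.2), (-1, -0.1), (-2, 0.1), (-3, -0.2),
          (4, 0.1), (-4, 0.1), (0.5, 50), (-0.5, -40)]
shape = tyler_m_estimator_2d(points)
scatter = sample_scatter(points)
```

On this toy data the unweighted scatter matrix is dominated by the two outliers, whereas the TME shape matrix stays elongated along the inlier direction, which is the robustness property that STE builds on.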
-
While previous studies have demonstrated successful 3D object shape completion with a sufficient number of points, they often fail in scenarios where only a few points, e.g., tens of points, are observed. Surprisingly, via entropy analysis, we find that even a few points, e.g., 64 points, could retain substantial information to help recover the 3D shape of the object. To address the challenge of shape completion with very sparse point clouds, we then propose the Few-point Shape Completion (FSC) model, which contains a novel dual-branch feature extractor for handling extremely sparse inputs, coupling an extensive branch for maximal point utilization with a saliency branch for dynamic importance assignment. This model is further bolstered by a two-stage revision network that refines both the extracted features and the decoder output, enhancing the detail and authenticity of the completed point cloud. Our experiments demonstrate the feasibility of recovering 3D shapes from a few points. The proposed FSC model outperforms previous methods on both few-point inputs and many-point inputs, and shows good generalizability to different object categories.
-
The increased demand for 3D data in AR/VR robotics and gaming applications gave rise to powerful generative pipelines capable of synthesizing high-quality 3D objects. Most of these models rely on the Score Distillation Sampling (SDS) algorithm to optimize a 3D representation such that the rendered image maintains a high likelihood as evaluated by a pre-trained diffusion model. However this distillation process involves finding a correct mode in the high-dimensional and large-variance distribution produced by the diffusion model. This task is challenging and often leads to issues such as over-saturation over-smoothing and Janus-like artifacts in the 3D generation. In this paper we propose a novel learning paradigm for 3D synthesis that utilizes pre-trained diffusion models. Instead of focusing on mode-seeking our method directly models the distribution discrepancy between multi-view renderings and diffusion priors in an adversarial manner which unlocks the generation of high-fidelity and photorealistic 3D content conditioned on a single image and prompt. Moreover by harnessing the latent space of GANs and expressive diffusion model priors our method enables a wide variety of 3D applications including single-view reconstruction high diversity generation and continuous 3D interpolation in open domain. Our experiments demonstrate the superiority of our pipeline compared to previous works in terms of generation quality and diversity.
-
We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in using image-text pairs. In S4 we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localization to carefully design 10 pre-training tasks with large scale annotated data. These tasks resemble downstream tasks across different domains and the annotations are cheap to obtain. We demonstrate that compared to current screenshot pre-training objectives our innovative pre-training method significantly enhances performance of image-to-text model in nine varied and popular downstream tasks - up to 76.1% improvements on Table Detection and at least 1% on Widget Captioning.
-
Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation in this paper we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g. AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework dubbed T-VSL begins by predicting the class of sounding entities in mixtures. Subsequently the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC VGGSound and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://github.com/enyac-group/T-VSL/tree/main.
-
In this paper we democratise caricature generation empowering individuals to effortlessly craft personalised caricatures with just a photo and a conceptual sketch. Our objective is to strike a delicate balance between abstraction and identity while preserving the creativity and subjectivity inherent in a sketch. To achieve this we present Explicit Rank-1 Model Editing alongside single-image personalisation selectively applying nuanced edits to cross-attention layers for a seamless merge of identity and style. Additionally we propose Random Mask Reconstruction to enhance robustness directing the model to focus on distinctive identity and style features. Crucially our aim is not to replace artists but to eliminate accessibility barriers allowing enthusiasts to engage in the artistry.
-
We concentrate on a novel human-centric image synthesis task that is given only one reference facial photograph it is expected to generate specific individual images with diverse head positions poses facial expressions and illuminations in different contexts. To accomplish this goal we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently large pre-trained text-to-image diffusion models have shown remarkable results serving as a powerful generative foundation. As a basis we aim to unleash the above two capabilities of the pre-trained model. In this work we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved photo-realistic and high-fidelity portraits with content-rich representations and various head renditions superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.
-
Recently, transformer-based methods have achieved state-of-the-art prediction quality on human pose estimation (HPE). Nonetheless, most of these top-performing transformer-based models are too computation-consuming and storage-demanding to deploy on edge computing platforms. Those transformer-based models that require fewer resources are prone to under-fitting due to their smaller scale and thus perform notably worse than their larger counterparts. Given this conundrum, we introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models. To mitigate the problem of under-fitting, we design a transformer module named Multi-Cycled Transformer (MCT) based on multiple-cycled forwards to more fully exploit the potential of small model parameters. Further, in order to avoid the additional inference compute introduced by MCT, we introduce a self-distillation scheme, extracting the knowledge from the MCT module to a naive forward model. Specifically, on the MSCOCO validation dataset, SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs. Furthermore, SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset with 6.2M parameters and 4.7 GFLOPs, achieving a new state-of-the-art among predominant tiny neural network methods.
-
An authentic 3D hand avatar with all identifiable information, such as hand shapes and textures, is necessary for immersive experiences in AR/VR. In this paper, we present a universal hand model (UHM), which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan for the authentic hand avatar. For effective universal hand modeling, we perform tracking and modeling at the same time, while previous 3D hand models perform them separately. The conventional separate pipeline suffers from the accumulated errors from the tracking stage, which cannot be recovered in the modeling stage. In contrast, ours does not suffer from the accumulated errors while having a much more concise overall pipeline. We additionally introduce a novel image matching loss function to address skin sliding during tracking and modeling, which existing works have not focused on much. Finally, using learned priors from our UHM, we effectively adapt our UHM to each person's short phone scan for the authentic hand avatar.
-
Humans possess the remarkable skill of Visual Perception the ability to see and understand the seen helping them make sense of the visual world and in turn reason. Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However when prompted to identify or count (perceive) the entities in a given image existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps improving the MLLM's perception abilities. Secondly we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task. Thirdly we introduce metrics to assess the object perception abilities in MLLMs on our COST dataset. Lastly we provide extensive experimental evidence proving the VCoder's improved object-level perception skills over existing Multimodal LLMs including GPT-4V. We open-source our dataset code and models to promote research.
-
Visible and Infrared image Fusion (VIF) offers a comprehensive scene description by combining thermal infrared images with the rich textures from visible cameras. However conventional VIF systems may capture over/under exposure or blurry images in extreme lighting and high dynamic motion scenarios leading to degraded fusion results. To address these problems we propose a novel Event-based Visible and Infrared Fusion (EVIF) system that employs a visible event camera as an alternative to traditional frame-based cameras for the VIF task. With extremely low latency and high dynamic range event cameras can effectively address blurriness and are robust against diverse luminous ranges. To produce high-quality fused images we develop a multi-task collaborative framework that simultaneously performs event-based visible texture reconstruction event-guided infrared image deblurring and visible-infrared fusion. Rather than independently learning these tasks our framework capitalizes on their synergy leveraging cross-task event enhancement for efficient deblurring and bi-level min-max mutual information optimization to achieve higher fusion quality. Experiments on both synthetic and real data show that EVIF achieves remarkable performance in dealing with extreme lighting conditions and high-dynamic scenes ensuring high-quality fused images across a broad range of practical scenarios.
-
Interpreting camera data is key for autonomously acting systems such as autonomous vehicles. Vision systems that operate in real-world environments must be able to understand their surroundings and need the ability to deal with novel situations. This paper tackles open-world semantic segmentation i.e. the variant of interpreting image data in which objects occur that have not been seen during training. We propose a novel approach that performs accurate closed-world semantic segmentation and at the same time can identify new categories without requiring any additional training data. Our approach additionally provides a similarity measure for every newly discovered class in an image to a known category which can be useful information in downstream tasks such as planning or mapping. Through extensive experiments we show that our model achieves state-of-the-art results on classes known from training data as well as for anomaly segmentation and can distinguish between different unknown classes.
-
We propose a lightweight and scalable Regional Point-Language Contrastive learning framework namely RegionPLC for open-world 3D scene understanding aiming to identify and recognize open-set objects and categories. Specifically based on our empirical studies we introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models yielding high-quality dense region-level language descriptions without human 3D annotations. Subsequently we devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning from dense regional language supervision. We carry out extensive experiments on ScanNet ScanNet200 and nuScenes datasets and our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% for semantic and instance segmentation respectively while maintaining greater scalability and lower resource demands. Furthermore our method has the flexibility to be effortlessly integrated with language models to enable open-ended grounded 3D reasoning without extra task-specific training. Code will be released.
-
Visual-inertial odometry (VIO) has demonstrated remarkable success due to its low-cost and complementary sensors. However existing VIO methods lack the generalization ability to adjust to different environments and sensor attributes. In this paper we propose Adaptive VIO a new monocular visual-inertial odometry that combines online continual learning with traditional nonlinear optimization. Adaptive VIO comprises two networks to predict visual correspondence and IMU bias. Unlike end-to-end approaches that use networks to fuse the features from two modalities (camera and IMU) and predict poses directly we combine neural networks with visual-inertial bundle adjustment in our VIO system. The optimized estimates will be fed back to the visual and IMU bias networks refining the networks in a self-supervised manner. Such a learning-optimization-combined framework and feedback mechanism enable the system to perform online continual learning. Experiments demonstrate that our Adaptive VIO manifests adaptive capability on EuRoC and TUM-VI datasets. The overall performance exceeds the currently known learning-based VIO methods and is comparable to the state-of-the-art optimization-based methods.
-
Pretrained diffusion models and their outputs are widely accessible due to their exceptional capacity for synthesizing high-quality images and their open-source nature. The users, however, may face litigation risks owing to the models' tendency to memorize and regurgitate training data during inference. To address this, we introduce Anti-Memorization Guidance (AMG), a novel framework employing three targeted guidance strategies for the main causes of memorization: image and caption duplication, and highly specific user prompts. Consequently, AMG ensures memorization-free outputs while maintaining high image quality and text alignment, leveraging the synergy of its guidance methods, each indispensable in its own right. AMG also features an innovative automatic detection system for potential memorization during each step of the inference process, which allows selective application of guidance strategies, minimally interfering with the original sampling process to preserve output utility. We applied AMG to pretrained Denoising Diffusion Probabilistic Models (DDPM) and Stable Diffusion across various generation tasks. The results demonstrate that AMG is the first approach to successfully eradicate all instances of memorization with no or marginal impact on image quality and text alignment, as evidenced by FID and CLIP scores.
-
The lightweight "local-match-global" matching introduced by SRe2L successfully creates a distilled dataset with comprehensive information on the full 224x224 ImageNet-1k. However, this one-sided approach is limited to a particular backbone, layer, and statistics, which limits the improvement of the generalization of a distilled dataset. We suggest that sufficient and varied "local-match-global" matchings are more precise and effective than a single one and can create a distilled dataset with richer information and better generalization. We call this perspective "generalized matching" and propose Generalized Various Backbone and Statistical Matching (G-VBSM) in this work, which aims to create a synthetic dataset with densities, ensuring consistency with the complete dataset across various backbones, layers, and statistics. As experimentally demonstrated, G-VBSM is the first algorithm to obtain strong performance across both small-scale and large-scale datasets. Specifically, G-VBSM achieves a performance of 38.7% on CIFAR-100 with a 128-width ConvNet, 47.6% on Tiny-ImageNet with ResNet18, and 31.4% on the full 224x224 ImageNet-1k with ResNet18, under images per class (IPC) 10, 50, and 10, respectively. These results surpass all SOTA methods by margins of 3.9%, 6.5%, and 10.1%, respectively.
-
Self-supervised image backbones can be used to address complex 2D tasks (e.g. semantic segmentation object discovery) very efficiently and with little or no downstream supervision. Ideally 3D backbones for lidar should be able to inherit these properties after distillation of these powerful 2D features. The most recent methods for image-to-lidar distillation on autonomous driving data show promising results obtained thanks to distillation methods that keep improving. Yet we still notice a large performance gap when measuring by linear probing the quality of distilled vs fully supervised features. In this work instead of focusing only on the distillation method we study the effect of three pillars for distillation: the 3D backbone the pretrained 2D backbone and the pretraining 2D+3D dataset. In particular thanks to our scalable distillation method named ScaLR we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality. This allows us to significantly reduce the gap between the quality of distilled and fully-supervised 3D features and to improve the robustness of the pretrained backbones to domain gaps and perturbations.
-
How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean the most popular training set by identifying and removing class overlap with Revisited Oxford and Paris the most popular evaluation set. By comparing the original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art methods our findings are striking. Not only is there a dramatic drop in performance but it is inconsistent across methods changing the ranking. What does it take to focus on objects of interest and ignore background clutter when indexing? Do we need to analyze the evaluation set? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR) an end-to-end single-stage pipeline to detect objects of interest and extract a global image representation. We outperform previous state-of-the-art on both existing training sets and the new RGLDv2-clean.
-
Editable 3D-aware generation which supports user-interacted editing has witnessed rapid development recently. However existing editable 3D GANs either fail to achieve high-accuracy local editing or suffer from huge computational costs. We propose AttriHuman-3D an editable 3D human generation model which addresses the aforementioned problems with attribute decomposition and indexing. The core idea of the proposed model is to generate all attributes (e.g. human body hair clothes and so on) in an overall attribute space with six feature planes which are then decomposed and manipulated with different attribute indexes. To precisely extract features of different attributes from the generated feature planes we propose a novel attribute indexing method as well as an orthogonal projection regularization to enhance the disentanglement. We also introduce a hyper-latent training strategy and an attribute-specific sampling strategy to avoid style entanglement and misleading punishment from the discriminator. Our method allows users to interactively edit selected attributes in the generated 3D human avatars while keeping others fixed. Both qualitative and quantitative experiments demonstrate that our model provides a strong disentanglement between different attributes allows fine-grained image editing and generates high-quality 3D human avatars.
-
In this paper we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE which leverages key points (e.g. facial landmarks) to make ViT more resilient to scale translation and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE however can only inject the model with prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints the model can more effectively retain spatial relationships even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate the effectiveness in improving face recognition performance from low-quality images particularly where alignment is prone to failure. Code and pre-trained models are available.
-
Mesh denoising (MD) is a critical task in geometry processing as meshes from scanning or AIGC techniques are susceptible to noise contamination. The challenge of MD lies in the diverse nature of mesh facets in terms of geometric characteristics and noise distributions. Despite recent advancements in deep learning-based MD methods existing MD networks typically neglect the consideration of geometric characteristics and noise distributions. In this paper we propose Hyper-MD a hyper-network-based approach that addresses this limitation by dynamically customizing denoising parameters for each facet based on its noise intensity and geometric characteristics. Specifically Hyper-MD is composed of a hyper-network and an MD network. For each noisy facet the hyper-network takes two angles as input to customize parameters for the MD network. These two angles are specially defined to reveal the noise intensity and geometric characteristics of the current facet respectively. The MD network receives a facet patch as input and outputs the denoised normal using the customized parameters. Experimental results on synthetic and real-scanned meshes demonstrate that Hyper-MD outperforms state-of-the-art mesh denoising methods.
-
Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar to unknown objects current approaches are confined to a closed vocabulary. Addressing this gap we introduce a novel open-world formulation for the video OSC problem. The goal is to temporally localize the three stages of an OSC---the object's initial state its transitioning state and its end state---whether or not the object has been observed during training. Towards this end we develop VidOSC a holistic learning approach that: (1) leverages text and vision-language models for supervisory signals to obviate manually labeling OSC training data and (2) abstracts fine-grained shared state representations from objects to enhance generalization. Furthermore we present HowToChange the first open-world benchmark for video OSC localization which offers an order of magnitude increase in the label space and annotation volume compared to the best existing benchmark. Experimental results demonstrate the efficacy of our approach in both traditional closed-world and open-world scenarios.
-
Sampling from the posterior distribution in latent diffusion models for inverse problems is computationally challenging. Existing methods often rely on Tweedie's first-order moments that tend to induce biased results. Second-order approximations are computationally prohibitive making standard reverse diffusion processes intractable for posterior sampling. This paper presents Second-order Tweedie sampler from Surrogate Loss (STSL) a novel sampler offering efficiency comparable to first-order Tweedie while enabling tractable reverse processes using second-order approximation. Theoretical results reveal that our approach which uses the trace of the Hessian with only O(1) compute establishes a lower bound through a surrogate loss and enables a tractable reverse process. We show STSL outperforms SoTA solvers PSLD and P2L by reducing neural function evaluations by 4X and 8X respectively while enhancing sampling quality on FFHQ ImageNet and COCO benchmarks. Moreover STSL extends to text-guided image editing and mitigates residual distortions in corrupted images. To the best of our knowledge this is the first work to offer an efficient second-order approximation for solving inverse problems using latent diffusion and editing real-world images with corruptions.
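
As a rough illustration of how a Hessian trace can be obtained at roughly the cost of a single Hessian-vector product, the sketch below uses Hutchinson's stochastic trace estimator. This is a standard technique substituted here for illustration; the abstract does not say that STSL uses exactly this estimator. The names `matvec` and `hutchinson_trace` are hypothetical helpers, and a small explicit matrix stands in for the Hessian.

```python
import random


def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def hutchinson_trace(matvec_fn, dim, num_probes=1, rng=None):
    """Estimate trace(H) using only products H @ v (Hutchinson's estimator).

    Each probe v has independent +/-1 entries, so E[v^T H v] = trace(H).
    One probe costs one matrix-vector product, independent of how the
    trace would be computed from the full matrix.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Hv = matvec_fn(v)
        total += sum(vi * hvi for vi, hvi in zip(v, Hv))
    return total / num_probes


# Stand-in "Hessians" (assumptions for the example, not from the paper).
H_diag = [[2.0, 0.0], [0.0, 3.0]]
H_full = [[2.0, 1.0], [1.0, 3.0]]

exact_on_diagonal = hutchinson_trace(lambda v: matvec(H_diag, v), dim=2)
averaged_estimate = hutchinson_trace(
    lambda v: matvec(H_full, v), dim=2, num_probes=2000
)
```

For a diagonal matrix a single probe is exact because every squared probe entry equals 1; for a general symmetric matrix the estimate is unbiased and its variance shrinks as more probes are averaged.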
-
Vector-Quantized (VQ-based) generative models usually consist of two basic components i.e. VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects the generation ability of generative transformers. In this paper we find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. Instead learning to compress semantic features within VQ tokenizers significantly improves generative transformers' ability to capture textures and structures. We thus highlight two competing objectives of VQ tokenizers for image synthesis: semantic compression and details preservation. Different from previous work that prioritizes better details preservation we propose Semantic-Quantized GAN (SeQ-GAN) with two learning phases to balance the two objectives. In the first phase we propose a semantic-enhanced perceptual loss for better semantic compression. In the second phase we fix the encoder and codebook but finetune the decoder to achieve better details preservation. Our proposed SeQ-GAN significantly improves VQ-based generative models for both unconditional and conditional image generation. Specifically SeQ-GAN achieves a Frechet Inception Distance (FID) of 6.25 and Inception Score (IS) of 140.9 on 256x256 ImageNet generation a remarkable improvement over ViT-VQGAN which obtains 11.2 FID and 97.2 IS.
-
Feature matching is a crucial task in the field of computer vision which involves finding correspondences between images. Previous studies achieve remarkable performance using learning-based feature comparison. However the pervasive presence of matching redundancy between images gives rise to unnecessary and error-prone computations in these methods imposing limitations on their accuracy. To address this issue we propose MESA a novel approach to establish precise area (or region) matches for efficient matching redundancy reduction. MESA first leverages the advanced image understanding capability of SAM a state-of-the-art foundation model for image segmentation to obtain image areas with implicit semantics. Then a multi-relational graph is proposed to model the spatial structure of these areas and construct their scale hierarchy. Based on graphical models derived from the graph the area matching is reformulated as an energy minimization task and effectively resolved. Extensive experiments demonstrate that MESA yields substantial precision improvement for multiple point matchers in indoor and outdoor downstream tasks e.g. +13.61% for DKM in indoor pose estimation.
-
The image signal processing (ISP) pipeline which converts raw Bayer sensor data to RGB images plays a fundamental role in digital cameras. However ISP-generated images usually suffer from imperfections due to the compounded degradations that stem from sensor noises demosaicing noises compression artifacts and possibly adverse effects of erroneous ISP hyperparameter settings such as ISO and gamma values. In a general sense these ISP imperfections can be considered as degradations. The highly complex mechanisms of ISP degradations some of which are even unknown pose great challenges to the generalization capability of deep neural networks (DNN) for image restoration and to their adaptability to downstream tasks. To tackle these issues we propose a novel DNN approach to learn degradation-independent representations (DiR) through the refinement of a self-supervised learned baseline representation. The proposed DiR learning technique has remarkable domain generalization capability and consequently it outperforms state-of-the-art methods across various downstream tasks including blind image restoration object detection and instance segmentation as verified in our experiments.
-
Accurate representation in media is known to improve the well-being of the people who consume it. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with harmful stereotypes and misrepresentations of cultures. We improve inclusive representation in generated images by (1) engaging with communities to collect a culturally representative dataset that we call the Cross-Cultural Understanding Benchmark (CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT pronounced /soft/) method that leverages the model's known biases to self-improve. SCoFT is designed to prevent overfitting on small datasets encode only high-level information from the data and shift the generated distribution away from misrepresentations encoded in a pretrained model. Our user study conducted on 51 participants from 5 different countries based on their self-selected national cultural affiliation shows that fine-tuning on CCUB consistently generates images with higher cultural relevance and fewer stereotypes when compared to the Stable Diffusion baseline which is further improved with our SCoFT technique.
-
In this paper we showcase the effectiveness of optimizing monocular camera poses as a continuous function of time. The camera poses are represented using an implicit neural function which maps the given time to the corresponding camera pose. The mapped camera poses are then used for the downstream tasks where joint camera pose optimization is also required. While doing so the network parameters - that implicitly represent camera poses - are optimized. We exploit the proposed method in four diverse experimental settings namely (1) NeRF from noisy poses; (2) NeRF from asynchronous Events; (3) Visual Simultaneous Localization and Mapping (vSLAM); and (4) vSLAM with IMUs. In all four settings the proposed method performs significantly better than the compared baselines and the state-of-the-art methods. Additionally under the assumption of continuous motion we observe that changes in pose may actually live in a manifold that has fewer than 6 degrees of freedom (DOF). We call this low-DOF motion representation the intrinsic motion and use the approach in vSLAM settings showing impressive camera tracking performance.
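
The idea of representing camera poses as a continuous function of time can be sketched with a tiny network that maps a timestamp to a 6-vector. This is a minimal illustration under assumptions not stated in the abstract: the class name `PoseMLP`, the single tanh hidden layer, the random initialization, and the 6-vector layout (3 rotation parameters followed by 3 translation parameters) are all hypothetical. In the setting described above the weights would be optimized jointly with the downstream task; that step is omitted here.

```python
import math
import random


class PoseMLP:
    """Toy implicit function: time t -> 6-DOF pose parameters."""

    def __init__(self, hidden=16, seed=0):
        rng = random.Random(seed)
        # First layer: scalar time -> hidden units.
        self.w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
        self.b1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
        # Second layer: hidden units -> 6 pose parameters.
        self.w2 = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(6)
        ]
        self.b2 = [0.0] * 6

    def __call__(self, t):
        h = [math.tanh(w * t + b) for w, b in zip(self.w1, self.b1)]
        return [
            sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(self.w2, self.b2)
        ]


pose_fn = PoseMLP()
# Poses can be queried at arbitrary, even asynchronous, timestamps.
pose_a = pose_fn(0.25)
pose_b = pose_fn(0.25 + 1e-6)
```

Because every pose comes from the same set of weights, nearby timestamps yield nearby poses, which is what lets a single parameter set stand in for a whole trajectory.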
-
The image matching field has been witnessing a continuous emergence of novel learnable feature matching techniques with ever-improving performance on conventional benchmarks. However our investigation shows that despite these gains their potential for real-world applications is restricted by their limited generalization capabilities to novel image domains. In this paper we introduce OmniGlue the first learnable image matcher that is designed with generalization as a core principle. OmniGlue leverages broad knowledge from a vision foundation model to guide the feature matching process boosting generalization to domains not seen at training time. Additionally we propose a novel keypoint position-guided attention mechanism which disentangles spatial and appearance information leading to enhanced matching descriptors. We perform comprehensive experiments on a suite of 6 datasets with varied image domains including scene-level object-centric and aerial images. OmniGlue's novel components lead to relative gains on unseen domains of 20.9% with respect to a directly comparable reference model while also outperforming the recent LightGlue method by 9.5% relatively. Code and model can be found at https://hwjiang1510.github.io/OmniGlue.
-
Dataset distillation offers a lightweight synthetic dataset for fast network training with promising test accuracy. To imitate the performance of the original dataset most approaches employ bi-level optimization and the distillation space relies on the matching architecture. Nevertheless these approaches either suffer significant computational costs on large-scale datasets or experience performance decline on cross-architectures. We advocate for designing an economical dataset distillation framework that is independent of the matching architectures. With empirical observations we argue that constraining the consistency of the real and synthetic image spaces will enhance the cross-architecture generalization. Motivated by this we introduce Dataset Distillation via Disentangled Diffusion Model (D^4M) an efficient framework for dataset distillation. Compared to architecture-dependent methods D^4M employs a latent diffusion model to guarantee consistency and incorporates label information into category prototypes. The distilled datasets are versatile eliminating the need for repeated generation of distinct datasets for various architectures. Through comprehensive experiments D^4M demonstrates superior performance and robust generalization surpassing the SOTA methods across most aspects.
-
We present a method to reconstruct indoor and outdoor static scene geometry and appearance from an omnidirectional video captured in a small circular sweep. This setting is challenging because of the small baseline and large depth ranges making it difficult to find ray crossings. To better constrain the optimization we estimate geometry as a signed distance field within a spherical binoctree data structure and use a complementary efficient tree traversal strategy based on a breadth-first search for sampling. Unlike regular grids or trees the shape of this structure matches the camera setting well creating a better memory-quality trade-off. From an initial depth estimate the binoctree is adaptively subdivided throughout the optimization; previous methods use a fixed depth that leaves the scene undersampled. In comparison with three neural optimization methods and two non-neural methods ours shows decreased geometry error on average especially in a detailed scene while significantly reducing the required number of voxels to represent such details.
-
Recovering ghost-free High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit saturation and significant motion. Recent Diffusion Models (DMs) have been introduced in the HDR imaging field demonstrating promising performance particularly in achieving visually perceptible results compared to previous DNN-based methods. However DMs require extensive iterations with large models to estimate entire images resulting in inefficiency that hinders their practical application. To address this challenge we propose the Low-Frequency aware Diffusion (LF-Diff) model for ghost-free HDR imaging. The key idea of LF-Diff is implementing the DMs in a highly compacted latent space and integrating them into a regression-based model to enhance the details of reconstructed images. Specifically as low-frequency information is closely related to human visual perception we propose to utilize DMs to create compact low-frequency priors for the reconstruction process. In addition to take full advantage of the above low-frequency priors the Dynamic HDR Reconstruction Network (DHRNet) is carried out in a regression-based manner to obtain final HDR images. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that our LF-Diff performs favorably against several state-of-the-art methods and is 10x faster than previous DM-based methods.
-
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet despite the performance gains contributed by large vision and language pretraining recent investigations find that most--if not all--of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover prior work suggests that compositionality doesn't arise with scale: larger model sizes or training data don't help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission--the need to teach a new generation--as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent and operationalize cultural transmission by iteratively resetting one of the agents' weights during training. After every iteration this training paradigm induces representations that become "easier to learn" a property of compositional languages: e.g. our model trained on CC3M and CC12M improves standard CLIP by 4.7% and 4.0% respectively in the SugarCrepe benchmark.
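
The training schedule described above, in which one agent is periodically reset while the other carries on, can be sketched as a toy loop. Everything concrete here is an assumption made for illustration: the function name `iterated_training`, the choice to reset the language agent rather than the vision agent, the vector-of-floats "agents", and the placeholder update that simply pulls the two agents toward each other in place of a contrastive loss.

```python
import random


def iterated_training(num_generations, steps_per_generation, dim=4, seed=0):
    """Toy iterated-learning schedule with periodic resets of one agent."""
    rng = random.Random(seed)

    def fresh_weights():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    vision = fresh_weights()
    language = fresh_weights()
    reset_log = []

    for generation in range(num_generations):
        if generation > 0:
            # "Cultural transmission": a new language agent starts from
            # scratch and must relearn from the retained vision agent.
            language = fresh_weights()
            reset_log.append(generation)
        for _ in range(steps_per_generation):
            # Placeholder for a contrastive update (assumption): move both
            # agents toward agreement, shrinking their gap by 20% per step.
            for i in range(dim):
                diff = vision[i] - language[i]
                vision[i] -= 0.1 * diff
                language[i] += 0.1 * diff

    return vision, language, reset_log


vision, language, reset_log = iterated_training(
    num_generations=3, steps_per_generation=50
)
```

The point of the sketch is only the schedule: the retained agent accumulates what survives repeated relearning, while each fresh agent has to pick it up again from scratch.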
-
Tracking with bio-inspired event cameras has garnered increasing interest in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The former incurs higher inference costs while the latter may be susceptible to the impact of noisy events or sparse spatial resolution. In this paper we propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer enabling us to achieve high-speed and low-latency visual tracking during testing by using only event signals. Specifically a teacher Transformer-based multi-modal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then we design a new hierarchical knowledge distillation strategy which includes pairwise similarity feature representation and response maps-based knowledge distillation to guide the learning of the student Transformer network. In particular since existing event-based tracking datasets are all low-resolution (346 * 260) we propose the first large-scale high-resolution (1280 * 720) dataset named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians vehicles UAVs ping pong etc. Extensive experiments on both low-resolution (FE240hz VisEvent COESOT) and our newly proposed high-resolution EventVOT dataset fully validate the effectiveness of our proposed method. The dataset evaluation toolkit and source code will be released.
-
In this paper we present LiDAR-Net a new real-scanned indoor point cloud dataset containing nearly 3.6 billion precisely point-level annotated points covering an expansive area of 30000m^2. It encompasses three prevalent daily environments including learning scenes working scenes and living scenes. LiDAR-Net is characterized by its non-uniform point distribution e.g. scanning holes and scanning lines. Additionally it meticulously records and annotates scanning anomalies including reflection noise and ghost. These anomalies stem from specular reflections on glass or metal as well as distortions due to moving persons. LiDAR-Net's realistic representation of non-uniform distribution and anomalies significantly enhances the training of deep learning models leading to improved generalization in practical applications. We thoroughly evaluate the performance of state-of-the-art algorithms on LiDAR-Net and provide a detailed analysis of the results. Crucially our research identifies several fundamental challenges in understanding indoor point clouds contributing essential insights to future explorations in this field. Our dataset can be found online: http://lidar-net.njumeta.com
-
Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection several methods have adapted the query-based framework to the TAD task. However these approaches primarily followed DETR to predict actions at the instance level (i.e. identify each action by its center point) leading to sub-optimal boundary localization. To address this issue we propose a new Dual-level query-based TAD framework namely DualDETR to detect actions from both instance-level and boundary-level. Decoding at different levels requires semantics of different granularity so we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design we present a joint query initialization strategy to align queries from both levels. Specifically we leverage encoder proposals to match queries from each level in a one-to-one manner. Then the matched queries are initialized using position and content priors from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate the superior performance of DualDETR to the existing state-of-the-art methods achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP.
-
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However many generated images still suffer from issues such as artifacts/implausibility misalignment with text descriptions and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation for example by selecting high-quality training data to finetune and improve the generative models or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.
-
Panorama video has recently attracted more interest in both study and application courtesy of its immersive experience. Due to the high cost of capturing 360-degree panoramic videos generating desirable panorama videos by prompts is urgently required. Lately the emerging text-to-video (T2V) diffusion methods demonstrate notable effectiveness in standard video generation. However due to the significant gap in content and motion patterns between panoramic and standard videos these methods encounter challenges in yielding satisfactory 360-degree panoramic videos. In this paper we propose a pipeline named 360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic videos based on the given prompts and motion conditions. Specifically we introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques to transform pre-trained T2V models for panorama video generation. We further propose a new panorama dataset named WEB360 consisting of panoramic video-text pairs for training 360DVD addressing the absence of captioned panoramic video datasets. Extensive experiments demonstrate the superiority and effectiveness of 360DVD for panorama video generation.
-
Pose regression networks predict the camera pose of a query image relative to a known environment. Within this family of methods absolute pose regression (APR) has recently shown promising accuracy in the range of a few centimeters in position error. APR networks encode the scene geometry implicitly in their weights. To achieve high accuracy they require vast amounts of training data that realistically can only be created using novel view synthesis in a days-long process. This process has to be repeated for each new scene again and again. We present a new approach to pose regression map-relative pose regression (marepo) that satisfies the data hunger of the pose regression network in a scene-agnostic fashion. We condition the pose regressor on a scene-specific map representation such that its pose predictions are relative to the scene map. This allows us to train the pose regressor across hundreds of scenes to learn the generic relation between a scene-specific map representation and the camera pose. Our map-relative pose regressor can be applied to new map representations immediately or after mere minutes of fine-tuning for the highest accuracy. Our approach outperforms previous pose regression methods by far on two public datasets indoor and outdoor. Code is available: https://nianticlabs.github.io/marepo.
-
Implicit neural SLAM has achieved remarkable progress recently. Nevertheless existing methods face significant challenges in non-ideal scenarios such as motion blur or lighting variation which often lead to issues like convergence failures localization drifts and distorted mapping. To address these challenges we propose EN-SLAM the first event-RGBD implicit neural SLAM framework which effectively leverages the high rate and high dynamic range advantages of event data for tracking and mapping. Specifically EN-SLAM proposes a differentiable CRF (Camera Response Function) rendering technique to generate distinct RGB and event camera data via a shared radiance field which is optimized by learning a unified implicit representation with the captured event and RGBD supervision. Moreover based on the temporal difference property of events we propose a temporal aggregating optimization strategy for the event joint tracking and global bundle adjustment capitalizing on the consecutive difference constraints of events significantly enhancing tracking accuracy and robustness. Finally we construct the simulated dataset DEV-Indoors and real captured dataset DEV-Reals containing 6 scenes and 17 sequences with practical motion blur and lighting changes for evaluations. Experimental results show that our method outperforms the SOTA methods in both tracking ATE and mapping ACC with a real-time 17 FPS in various challenging environments. Project page: https://delinqu.github.io/EN-SLAM.
-
Virtual Immunohistochemistry Staining for Histological Images Assisted by Weakly-supervised Learning
Recently virtual staining technology has greatly promoted the advancement of histopathology. Despite the practical successes achieved the outstanding performance of most virtual staining methods relies on hard-to-obtain paired images in training. In this paper we propose a method for virtual immunohistochemistry (IHC) staining named confusion-GAN which does not require paired images and can achieve comparable performance to supervised algorithms. Specifically we propose a multi-branch discriminator which judges if the features of generated images can be embedded into the feature pool of target domain images to improve the visual quality of generated images. Meanwhile we also propose a novel patch-level pathology information extractor which is assisted by multiple instance learning to ensure pathological consistency during virtual staining. Extensive experiments were conducted on three types of IHC images including a high-resolution hepatocellular carcinoma immunohistochemical dataset proposed by us. The results demonstrated that our proposed confusion-GAN can generate highly realistic images that are capable of deceiving even experienced pathologists. Furthermore compared to using H&E images directly the downstream diagnosis achieved higher accuracy when using images generated by confusion-GAN. Our dataset and codes will be available at https://github.com/jiahanli2022/confusion-GAN.
-
In this paper we introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion without requiring iterative spatial propagations. Specifically we first enhance a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. This effectively improves the performance of this simple network and sets it on par with the latest complex transformer-based models. Leveraging the initial depths and features from this network we uplift the 2D features to form a 3D point cloud and construct a 3D point transformer to process it allowing the model to explicitly learn and exploit 3D geometric features. In addition we propose normalization techniques to process the point cloud which improves learning and leads to better accuracy than directly using point transformers off the shelf. Furthermore we incorporate global attention on downsampled point cloud features which enables long-range context while still being computationally feasible. We evaluate our method DeCoTR on established depth completion benchmarks including NYU Depth V2 and KITTI showcasing that it sets new state-of-the-art performance. We further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and demonstrate that DeCoTR has superior generalizability compared to existing approaches.
-
When building classification systems with demographic fairness considerations there are two objectives to satisfy: 1) maximizing utility for the specific task and 2) ensuring fairness w.r.t. a known demographic attribute. These objectives often compete so optimizing both can lead to a trade-off between utility and fairness. While existing works acknowledge the trade-offs and study their limits two questions remain unanswered: 1) What are the optimal tradeoffs between utility and fairness? and 2) How can we numerically quantify these trade-offs from data for a desired prediction task and demographic attribute of interest? This paper addresses these questions. We introduce two utility-fairness trade-offs: the Data-Space and Label-Space Trade-off. The trade-offs reveal three regions within the utility-fairness plane delineating what is fully and partially possible and impossible. We propose U-FaTE a method to numerically quantify the trade-offs for a given prediction task and group fairness definition from data samples. Based on the trade-offs we introduce a new scheme for evaluating representations. An extensive evaluation of fair representation learning methods and representations from over 1000 pre-trained models revealed that most current approaches are far from the estimated and achievable fairness-utility trade-offs across multiple datasets and prediction tasks.
-
Test-time adaptation (TTA) aims to adapt a pre-trained model to a new test domain without access to source data after deployment. Existing approaches typically rely on self-training with pseudo-labels since ground-truth cannot be obtained from test data. Although the quality of pseudo-labels is important for stable and accurate long-term adaptation it has not been previously addressed. In this work we propose DPLOT a simple yet effective TTA framework that consists of two components: (1) domain-specific block selection and (2) pseudo-label generation using paired-view images. Specifically we select blocks that involve domain-specific feature extraction and train these blocks by entropy minimization. After the blocks are adjusted for the current test domain we generate pseudo-labels by averaging given test images and corresponding flipped counterparts. By simply using flip augmentation we prevent a decrease in the quality of the pseudo-labels which can be caused by the domain gap resulting from strong augmentation. Our experimental results demonstrate that DPLOT outperforms previous TTA methods in CIFAR10-C CIFAR100-C and ImageNet-C benchmarks reducing error by up to 5.4% 9.1% and 2.9% respectively. Also we provide an extensive analysis to demonstrate the effectiveness of our framework. Code is available at https://github.com/gist-ailab/domain-specific-block-selection-and-paired-view-pseudo-labeling-for-online-TTA.
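The paired-view pseudo-labeling step can be illustrated with a minimal sketch. It reads "averaging" as averaging the class probabilities predicted for an image and for its horizontal flip, which is an assumption about the abstract rather than a detail it states. The classifier `toy_model` and all function names are hypothetical stand-ins, not the DPLOT code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

def toy_model(image):
    """Hypothetical classifier (assumption): class 0 = bright top, class 1 = bright bottom.
    A small left/right term makes it slightly sensitive to flipping."""
    half_h = len(image) // 2
    half_w = len(image[0]) // 2
    top = sum(sum(row) for row in image[:half_h])
    bottom = sum(sum(row) for row in image[half_h:])
    left = sum(sum(row[:half_w]) for row in image)
    right = sum(sum(row[half_w:]) for row in image)
    return [top + 0.2 * left, bottom + 0.2 * right]

def paired_view_pseudo_label(model, image):
    """Average the predicted probabilities of the image and its flipped view."""
    p_orig = softmax(model(image))
    p_flip = softmax(model(hflip(image)))
    return [(a + b) / 2.0 for a, b in zip(p_orig, p_flip)]

image = [[1.0, 1.0], [1.0, 0.0]]
pseudo = paired_view_pseudo_label(toy_model, image)
pseudo_flipped = paired_view_pseudo_label(toy_model, hflip(image))
```

Because the two views are averaged, the pseudo-label is the same whether the image or its flip is taken as the input, even though the toy classifier itself is not flip-invariant.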
-
We present a neural radiance field method for urban-scale semantic and building-level instance segmentation from aerial images by lifting noisy 2D labels to 3D. This is a challenging problem due to two primary reasons. Firstly objects in urban aerial images exhibit substantial variations in size including buildings cars and roads which pose a significant challenge for accurate 2D segmentation. Secondly the 2D labels generated by existing segmentation methods suffer from the multi-view inconsistency problem especially in the case of aerial images where each image captures only a small portion of the entire scene. To overcome these limitations we first introduce a scale-adaptive semantic label fusion strategy that enhances the segmentation of objects of varying sizes by combining labels predicted from different altitudes harnessing the novel-view synthesis capabilities of NeRF. We then introduce a novel cross-view instance label grouping strategy based on the 3D scene representation to mitigate the multi-view inconsistency problem in the 2D instance labels. Furthermore we exploit multi-view reconstructed depth priors to improve the geometric quality of the reconstructed radiance field resulting in enhanced segmentation results. Experiments on multiple real-world urban-scale datasets demonstrate that our approach outperforms existing methods highlighting its effectiveness. The source code is available at https://github.com/zyqz97/Aerial_lifting.
-
We introduce SAOR a novel approach for estimating the 3D shape texture and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time given a single-view image it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.
-
We present a novel theory that establishes the relationship between light transport in visible and thermal infrared and heat transport in solids. We show that heat generated due to light absorption can be estimated by modeling heat transport using a thermal camera. For situations where heat conduction is negligible we analytically solve the heat transport equation to derive a simple expression relating the change in thermal image intensity to the absorbed light intensity and heat capacity of the material. Next we prove that intrinsic image decomposition for Lambertian scenes becomes a well-posed problem if one has access to the absorbed light. Our theory generalizes to arbitrary shapes and unstructured illumination. Our theory is based on applying energy conservation principle at each pixel independently. We validate our theory using real-world experiments on diffuse objects made of different materials that exhibit both direct and global components (inter-reflections) of light transport under unknown complex lighting.
-
Referring multi-object tracking (RMOT) aims to track multiple objects based on input textual descriptions. Previous works realize it by simply integrating an extra textual module into the multi-object tracker. However they typically need to retrain the entire framework and have difficulties in optimization. In this work we propose an insertable Knowledge Unification Network termed iKUN to enable communication with off-the-shelf trackers in a plug-and-play manner. Concretely a knowledge unification module (KUM) is designed to adaptively extract visual features based on textual guidance. Meanwhile to improve the localization accuracy we present a neural version of Kalman filter (NKF) to dynamically adjust process noise and observation noise based on the current motion status. Moreover to address the problem of open-set long-tail distribution of textual descriptions a test-time similarity calibration method is proposed to refine the confidence score with pseudo frequency. Extensive experiments on Refer-KITTI dataset verify the effectiveness of our framework. Finally to speed up the development of RMOT we also contribute a more challenging dataset Refer-Dance by extending public DanceTrack dataset with motion and dressing descriptions. The codes and dataset are available at https://github.com/dyhBUPT/iKUN.
-
RankMatch: Exploring the Better Consistency Regularization for Semi-supervised Semantic Segmentation
The key to semi-supervised semantic segmentation is how to fully exploit substantial unlabeled data to improve the model's generalization performance by constructing effective supervision signals. Most methods tend to directly apply contrastive learning to seek additional supervision to complement independent regular pixel-wise consistency regularization. However these methods tend not to be preferred owing to their complicated designs heavy memory footprints and susceptibility to confirmation bias. In this paper we analyze the bottlenecks that exist in contrastive learning-based methods and offer a fresh perspective on inter-pixel correlations to construct safer and more effective supervision signals which is in line with the nature of semantic segmentation. To this end we develop a coherent RankMatch network including the construction of representative agents to model inter-pixel correlation beyond regular individual pixel-wise consistency and further unlock the potential of agents by modeling inter-agent relationships in pursuit of rank-aware correlation consistency. Extensive experimental results on multiple benchmarks including mitochondria segmentation demonstrate that RankMatch performs favorably against state-of-the-art methods. Particularly in the low-data regimes RankMatch achieves significant improvements.
-
The unprecedented capture and application of face images raise increasing concerns on anonymization to fight against privacy disclosure. Most existing methods may suffer from the problem of excessive change of the identity-independent information or insufficient identity protection. In this paper we present a new face anonymization approach by distracting the intrinsic and extrinsic identity attentions. On the one hand we anonymize the identity information in the feature space by distracting the intrinsic identity attention. On the other we anonymize the visual clues (i.e. appearance and geometry structure) by distracting the extrinsic identity attention. Our approach allows for flexible and intuitive manipulation of face appearance and geometry structure to produce diverse results and it can also be used to instruct users to perform personalized anonymization. We conduct extensive experiments on multiple datasets and demonstrate that our approach outperforms state-of-the-art methods.
-
Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However these methods heavily rely on the outputs of existing models leading to error accumulation in geometry and appearance that prevents the models from being used in various scenarios (e.g. outdoor and unreal scenarios). To address this limitation we generatively refine the newly generated local views by querying and aggregating global 3D information and then progressively generate the 3D scene. Specifically we employ a tri-plane feature-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency and propose a generative refinement network to synthesize new content with higher quality by exploiting the natural image prior from a 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that in comparison to previous methods our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
-
This paper introduces a versatile multi-view inverse rendering framework with near- and far-field light sources. Tackling the fundamental challenge of inherent ambiguity in inverse rendering our framework adopts a lightweight yet inclusive lighting model for different near- and far-field lights thus is able to make use of input images under varied lighting conditions available during capture. It leverages observations under each lighting to disentangle the intrinsic geometry and material from the external lighting using both neural radiance field rendering and physically-based surface rendering on the 3D implicit fields. After training the reconstructed scene is extracted to a textured triangle mesh for seamless integration into industrial rendering software for various applications. Quantitatively and qualitatively tested on synthetic and real-world scenes our method shows superiority to state-of-the-art multi-view inverse rendering methods in both speed and quality.
-
We propose RoHM an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. RoHM is a novel diffusion-based motion model that conditioned on noisy and occluded input data reconstructs complete plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models one for global trajectory and one for local motion. To capture the correlations between the two we then introduce a novel conditioning module combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html.
-
There has been significant attention to the research on dense video captioning which aims to automatically localize and caption all events within untrimmed video. Several studies introduce methods by designing dense video captioning as a multitasking problem of event localization and event captioning to consider inter-task relations. However addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study we address this by proposing a novel framework inspired by the cognitive information processing of humans. Our model utilizes external memory to incorporate prior knowledge. The memory retrieval method is proposed with cross-modal video-to-text matching. To effectively incorporate retrieved text features the versatile encoder and the decoder with visual and textual cross-attention modules are designed. Comparative experiments have been conducted to show the effectiveness of the proposed method on ActivityNet Captions and YouCook2 datasets. Experimental results show promising performance of our model without extensive pretraining from a large video dataset. Our code is available at https://github.com/ailab-kyunghee/CM2_DVC.
-
Recently One-stage Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has gained increasing interest due to simplification over its cumbersome multi-stage counterpart. Limited by the inherent ambiguity of Class Activation Map (CAM) we observe that one-stage pipelines often encounter confirmation bias caused by incorrect CAM pseudo-labels impairing their final segmentation performance. Although recent works discard many unreliable pseudo-labels to implicitly alleviate this issue they fail to exploit sufficient supervision for their models. To this end we propose a dual student framework with trustworthy progressive learning (DuPL). Specifically we propose a dual student network with a discrepancy loss to yield diverse CAMs for each sub-net. The two sub-nets generate supervision for each other mitigating the confirmation bias caused by learning their own incorrect pseudo-labels. In this process we progressively introduce more trustworthy pseudo-labels to be involved in the supervision through dynamic threshold adjustment with an adaptive noise filtering strategy. Moreover we believe that every pixel even discarded from supervision due to its unreliability is important for WSSS. Thus we develop consistency regularization on these discarded regions providing supervision of every pixel. Experiment results demonstrate the superiority of the proposed DuPL over the recent state-of-the-art alternatives on PASCAL VOC 2012 and MS COCO datasets. Code is available at https://github.com/Wu0409/DuPL.
-
Deep Neural Networks (DNNs) have demonstrated remarkable performance across diverse domains and tasks with large-scale datasets. To reduce labeling costs for large-scale datasets semi-automated and crowdsourcing labeling methods are developed but their labels are inevitably noisy. Learning with Noisy Labels (LNL) approaches aim to train DNNs despite the presence of noisy labels. These approaches utilize the memorization effect to select correct labels and refine noisy ones which are then used for subsequent training. However these methods encounter a significant decrease in the model's generalization performance due to the noisy labels that inevitably remain. To overcome this limitation we propose a new approach to enhance learning with noisy labels by incorporating additional distribution information--structural labels. In order to leverage additional distribution information for generalization we employ a reverse k-NN which helps the model achieve a better feature manifold and mitigate overfitting to noisy labels. The proposed method achieves superior performance on multiple benchmark datasets with IDN and on real-world noisy datasets.
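For readers unfamiliar with the term, the sketch below shows the standard definition of a reverse k-NN query: the reverse neighbours of a point q are the points that count q among their own k nearest neighbours. It illustrates the general concept only. How the paper derives structural labels from it is not stated in the abstract, and the function names and toy data are hypothetical.

```python
def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself.
    Uses squared Euclidean distance; ties are broken by the lower index."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
        for j in range(len(points))
        if j != i
    )
    return [j for _, j in dists[:k]]

def reverse_knn(points, q, k):
    """Indices j whose k-nearest-neighbour list contains q."""
    return [j for j in range(len(points)) if j != q and q in knn(points, j, k)]

# Toy 1-D features (assumption): three nearby points and one outlier.
points = [(0.0,), (1.0,), (2.5,), (10.0,)]
hub = reverse_knn(points, 1, k=1)      # points that pick index 1 as nearest
outlier = reverse_knn(points, 3, k=1)  # points that pick the outlier as nearest
```

Here index 1 is the nearest neighbour of both index 0 and index 2, while no point selects the outlier at index 3, so its reverse neighbour set is empty. A k-NN query always returns k items, whereas a reverse k-NN set varies in size with the local density of the data.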
-
Dynamic human rendering from video sequences has achieved remarkable progress by formulating the rendering as a mapping from static poses to human images. However existing methods focus on the human appearance reconstruction of every single frame while the temporal motion relations are not fully explored. In this paper we propose a new 4D motion modeling paradigm SurMo that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template which inherits body topology priors for generalizable novel view synthesis with sparse training observations. 2) Physical motion decoding that is designed to encourage physical motion learning by decoding the motion triplane features at timestep t to predict both spatial derivatives and temporal derivatives at the next timestep t+1 in the training stage. 3) 4D appearance decoding that renders the motion triplanes into images by an efficient volumetric surface-conditioned renderer that focuses on the rendering of body surfaces with motion learning conditioning. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo.
-
We present SPAD a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions and fine-tune it on a high quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVDream) leads to content copying between views. Therefore we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency we utilize Plücker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason over spatial proximity in 3D well. Compared to concurrent works that can only generate views at fixed azimuth and elevation (e.g. MVDream SyncDreamer) SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue.
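As background for the positional encoding mentioned above, the following sketch computes the Plücker coordinates of a camera ray, the 6-vector (d, o x d) built from the unit direction d and the moment o x d of the ray origin o. The ordering of the two halves and the function names are assumptions made for illustration. The abstract does not say how SPAD arranges or injects these values.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker(origin, direction):
    """Return (d, o x d) for the ray through `origin` along `direction`."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)  # unit direction
    m = cross(origin, d)                    # moment of the ray
    return d + m

ray_a = plucker((1.0, 0.0, 0.0), (0.0, 0.0, 2.0))
# Same line, with the origin slid along the ray direction.
ray_b = plucker((1.0, 0.0, 5.0), (0.0, 0.0, 1.0))
```

Sliding the origin along the ray leaves the coordinates unchanged, so the encoding describes the ray itself and not the particular point chosen on it.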
-
Class-Incremental Learning (CIL) trains a model to continually recognize new classes from non-stationary data while retaining learned knowledge. A major challenge of CIL arises when applying to real-world data characterized by non-uniform distribution which introduces a dual imbalance problem involving (i) disparities between stored exemplars of old tasks and new class data (inter-phase imbalance) and (ii) severe class imbalances within each individual task (intra-phase imbalance). We show that this dual imbalance issue causes skewed gradient updates with biased weights in FC layers thus inducing over/under-fitting and catastrophic forgetting in CIL. Our method addresses it by reweighting the gradients towards balanced optimization and unbiased classifier learning. Additionally we observe imbalanced forgetting where paradoxically the instance-rich classes suffer higher performance degradation during CIL due to a larger amount of training data becoming unavailable in subsequent learning phases. To tackle this we further introduce a distribution-aware knowledge distillation loss to mitigate forgetting by aligning output logits proportionally with the distribution of lost training data. We validate our method on CIFAR-100 ImageNetSubset and Food101 across various evaluation protocols and demonstrate consistent improvements compared to existing works showing great potential to apply CIL in real-world scenarios with enhanced robustness and effectiveness.
-
Despite diffusion models having shown powerful abilities to generate photorealistic images generating videos that are realistic and diverse remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics leading to a notably increased complexity of text-to-video generation (T2V). In this work we propose HiGen a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives i.e. structure level and content level. At the structure level we decompose the T2V task into two steps including spatial reasoning and temporal reasoning using a unified denoiser. Specifically we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level we extract two subtle cues from the content of the input video that can express motion and appearance changes respectively. These two cues then guide the model's training for generating videos enabling flexible content variations and enhancing temporal stability. Through the decoupled paradigm HiGen can effectively reduce the complexity of this task and generate realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over the state-of-the-art T2V methods. We have released our source code and models.
-
Recent advancements in large-scale pre-trained text-to-image models have led to remarkable progress in semantic image synthesis. Nevertheless synthesizing high-quality images with consistent semantics and layout remains a challenge. In this paper we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. Specifically we first employ the layout control map to faithfully represent layouts in the feature space. Subsequently we combine the layout and semantic features in a timestep-adaptive manner to synthesize images with realistic details. During fine-tuning we propose the Semantic Alignment (SA) loss to further enhance layout alignment. Additionally we introduce the Layout-Free Prior Preservation (LFP) loss which leverages unlabeled data to maintain the priors of pre-trained models thereby improving the visual quality and semantic consistency of synthesized images. Extensive experiments demonstrate that our approach performs favorably in terms of visual quality semantic consistency and layout alignment. The source code and model are available at https://github.com/cszy98/PLACE/tree/main.
-
Self-supervised denoising has attracted widespread attention due to its ability to train without clean images. However noise in real-world scenarios is often spatially correlated which causes many self-supervised algorithms that assume pixel-wise independent noise to perform poorly. Recent works have attempted to break noise correlation with downsampling or neighborhood masking. However denoising on downsampled subgraphs can lead to aliasing effects and loss of details due to a lower sampling rate. Furthermore the neighborhood masking methods either come with high computational complexity or do not consider local spatial preservation during inference. Through the analysis of existing methods we point out that the key to obtaining high-quality and texture-rich results in real-world self-supervised denoising tasks is to train at the original input resolution structure and use asymmetric operations during training and inference. Based on this we propose Asymmetric Tunable Blind-Spot Network (AT-BSN) where the blind-spot size can be freely adjusted thus better balancing noise correlation suppression and image local spatial destruction during training and inference. In addition we regard the pre-trained AT-BSN as a meta-teacher network capable of generating various teacher networks by sampling different blind-spots. We propose a blind-spot based multi-teacher distillation strategy to distill a lightweight network significantly improving performance. Experimental results on multiple datasets show that our method achieves state-of-the-art performance and is superior to other self-supervised algorithms in terms of computational overhead and visual effects.
-
We present the first application of 3D Gaussian Splatting in monocular SLAM the most fundamental but the hardest setup for Visual SLAM. Our method which runs live at 3fps utilises Gaussians as the only 3D representation unifying the required representation for accurate efficient tracking mapping and high-quality rendering. Designed for challenging monocular settings our approach is seamlessly extendable to RGB-D SLAM when an external depth sensor is available. Several innovations are required to continuously reconstruct 3D scenes with high fidelity from a live camera. First to move beyond the original 3DGS algorithm which requires accurate poses from an offline Structure from Motion (SfM) system we formulate camera tracking for 3DGS using direct optimisation against the 3D Gaussians and show that this enables fast and robust tracking with a wide basin of convergence. Second by utilising the explicit nature of the Gaussians we introduce geometric verification and regularisation to handle the ambiguities occurring in incremental 3D dense reconstruction. Finally we introduce a full SLAM system which not only achieves state-of-the-art results in novel view synthesis and trajectory estimation but also reconstruction of tiny and even transparent objects.
-
Consistency training (CT)-based semi-supervised learning (SSL) achieves state-of-the-art performance on SSL-based image classification. However the existing CT-based SSL methods do not highlight the non-Euclidean characteristics and class-wise varieties of embedding spaces in an SSL model and thus cannot fully utilize the effectiveness of CT. We therefore propose a metric tensor-based consistency regularization exploiting the class-variant geometrical structure of embeddings on the high-dimensional feature space. The proposed method not only minimizes the prediction discrepancy between different views of a given image but also estimates the intrinsic geometric curvature of embedding spaces by employing the global and local metric tensors. The global metric tensor is used to globally estimate the class-invariant embeddings from the whole data distribution while the local metric tensor is exploited to estimate the class-variant embeddings of each cluster. The two metric tensors are optimized by the consistency regularization based on the weak and strong augmentation strategy. The proposed method provides the highest classification accuracy on average compared to the existing state-of-the-art SSL methods on conventional datasets.
-
Understanding long real-world videos requires modeling of long-range visual dependencies. To this end we explore video-first architectures building on the common paradigm of transferring large-scale image--text models to video via shallow temporal fusion. However we expose two limitations to the approach: (1) decreased spatial capabilities likely due to poor video--language alignment in standard video datasets and (2) higher memory consumption bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention parameter-efficient image-to-video adaptation input masking and multi-resolution patchification. Surprisingly simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models which scales to 1B parameters does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2 EgoSchema).
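The input-masking idea can be sketched as random token dropping: keep a random 25% of the video patch tokens and feed only those to the encoder. This is a minimal illustration under stated assumptions. The function name, the seeded generator and the token layout are hypothetical, and the abstract does not specify the sampling scheme used in the paper.

```python
import random

def mask_tokens(tokens, mask_ratio, seed=0):
    """Randomly drop `mask_ratio` of the tokens and return the survivors.

    Returns (kept_tokens, kept_indices); indices are sorted so that the
    temporal and spatial order of the remaining tokens is preserved.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    keep = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep], keep

# Toy video (assumption): 16 frames x 4 patches, each token tagged (frame, patch).
tokens = [(f, p) for f in range(16) for p in range(4)]
kept, kept_idx = mask_tokens(tokens, mask_ratio=0.75)
```

With a 75% mask ratio the encoder processes 16 of the 64 tokens, which is where the memory saving comes from: self-attention cost grows with the square of the sequence length, so a sequence one quarter as long needs roughly one sixteenth of the attention memory.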
-
Two-view correspondence learning has recently focused on considering the coherence and smoothness of the motion field between an image pair. Dominant schemes include controlling the complexity of the field function with regularization or smoothing the field with local filters but the former suffers from heavy computational burden and the latter fails to accommodate discontinuities in the case of large scene disparities. In this paper inspired by Fourier expansion we propose a novel network called DeMatch which decomposes the motion field to retain its main "low-frequency" and smooth part. This achieves implicit regularization with lower computational cost and generates piecewise smoothness naturally. Specifically we first decompose the rough motion field that is contaminated by false matches into several different sub-fields which are highly smooth and contain the main energy of the original field. Then with these smooth sub-fields we recover a cleaner motion field from which correct motion vectors are subsequently derived. We also design a special masked decomposition strategy to further mitigate the negative influence of false matches. All the mentioned processes are finally implemented in a discrete and learnable manner avoiding the difficulty of calculating real dense fields. Extensive experiments reveal that DeMatch outperforms state-of-the-art methods in multiple tasks and shows promising low computational usage and piecewise smoothness property. The code and trained models are publicly available at https://github.com/SuhZhang/DeMatch.
-
This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical agent for multi-task robotic manipulation. HDP factorises a manipulation policy into a hierarchical structure: a high-level task-planning agent which predicts a distant next-best end-effector pose (NBP), and a low-level goal-conditioned diffusion policy which generates optimal motion trajectories. The factorised policy representation allows HDP to tackle long-horizon task planning while generating fine-grained low-level actions. To generate context-aware motion trajectories while satisfying robot kinematics constraints, we present a novel kinematics-aware goal-conditioned control agent, Robot Kinematics Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the end-effector pose and joint position trajectories, and distills the accurate but kinematics-unaware end-effector pose diffuser to the kinematics-aware but less accurate joint position diffuser via differentiable kinematics. Empirically, we show that HDP achieves a significantly higher success rate than the state-of-the-art methods in both simulation and the real world.
-
Coarse-to-fine schemes are widely used in traditional single-image motion deblurring; however, in the context of deep learning, existing multi-scale algorithms not only require the use of complex modules for feature fusion of low-scale RGB images and deep semantics, but also manually generate low-resolution pairs of images that do not have sufficient confidence. In this work, we propose a multi-scale network based on single-input and multiple-outputs (SIMO) for motion deblurring. This simplifies the complexity of algorithms based on a coarse-to-fine scheme. To alleviate restoration defects impacting detail information brought about by using a multi-scale architecture, we combine the characteristics of real-world blurring trajectories with a learnable wavelet transform module to focus on the directional continuity and frequency features of the step-by-step transitions from blurred images to sharp images. In conclusion, we propose a multi-scale network with a learnable discrete wavelet transform (MLWNet), which exhibits state-of-the-art performance on multiple real-world deblurred datasets in terms of both subjective and objective quality as well as computational efficiency.
-
Layout planning, spanning from architecture to interior design, is a slow, iterative exploration of ill-defined problems, adopting an "I'll know it when I see it" approach to potential solutions. Recent advances in generative models promise automating layout generation, yet often overlook the crucial role of user-guided iteration, cannot generate full solutions from incomplete design ideas, and do not learn the inter-dependency of layout attributes. To address these limitations, we propose MaskPLAN, a novel generative model based on Graph-structured Dynamic Masked Autoencoders (GDMAE), featuring five transformers generating a blend of graph-based and image-based layout attributes. MaskPLAN lets users generate and adjust layouts with partial attribute definitions, create alternatives for preferences, and practice new composition-driven or functionality-driven workflows. Through cross-attribute learning and the user input as a global conditional prior, we ensure that design synthesis is calibrated at every intermediate stage, maintaining its feasibility and practicality. Extensive evaluations show MaskPLAN's superior performance over existing methods across multiple metrics.
-
Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results their robustness has not been thoroughly studied. In practice we observe that temporal information in videos can be occasionally corrupted such as missing or blurred frames. Interestingly existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness we establish two temporal corruption robustness benchmarks namely THUMOS14-C and ActivityNet-v1.3-C. In this paper we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance TAD models tend to yield the largest performance drop. Besides building a benchmark we further develop a simple but effective robust training method to defend against temporal corruptions through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.
-
In this paper we develop MP-HOI a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions realizing HOI detection in the open world. Specifically it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textual descriptions face difficulties in generalization and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training we build a large-scale HOI dataset named Magic-HOI which gathers six existing datasets into a unified label space forming over 186K images with 2.4K objects 1.2K actions and 20K HOI interactions. Furthermore to tackle the long-tail issue within the Magic-HOI dataset we introduce an automated pipeline for generating realistically annotated HOI images and present SynHOI a high-quality synthetic HOI dataset containing 100K images. Leveraging these two datasets MP-HOI optimizes the HOI task as a similarity learning process between multi-modal prompts and objects/interactions via a unified contrastive loss to learn generalizable and transferable objects/interactions representations from large-scale data. MP-HOI could serve as a generalist HOI detector surpassing the HOI vocabulary of existing expert models by more than 30 times. Concurrently our results demonstrate that MP-HOI exhibits remarkable zero-shot capability in real-world scenarios and consistently achieves a new state-of-the-art performance across various benchmarks. Our project homepage is available at https://MP-HOI.github.io/.
-
It is especially challenging to achieve real-time human motion tracking on a standalone VR Head-Mounted Display (HMD) such as Meta Quest and PICO. In this paper, we propose HMD-Poser, the first unified approach to recover full-body motions using scalable sparse observations from HMD and body-worn IMUs. In particular, it can support a variety of input scenarios, such as HMD, HMD+2IMUs, HMD+3IMUs, etc. The scalability of inputs may accommodate users' choices for both high tracking accuracy and ease of wear. A lightweight temporal-spatial feature learning network is proposed in HMD-Poser to guarantee that the model runs in real-time on HMDs. Furthermore, HMD-Poser presents online body shape estimation to improve the position accuracy of body joints. Extensive experimental results on the challenging AMASS dataset show that HMD-Poser achieves new state-of-the-art results in both accuracy and real-time performance. We also build a new free-dancing motion dataset to evaluate HMD-Poser's on-device performance and investigate the performance gap between synthetic data and real-captured sensor data. Finally, we demonstrate our HMD-Poser with a real-time Avatar-driving application on a commercial HMD. Our code and free-dancing motion dataset are available at https://pico-ai-team.github.io/hmd-poser.
-
Realizing unified monocular 3D object detection, including both indoor and outdoor scenes, holds great importance in applications like robot navigation. However, involving various scenarios of data to train models poses challenges due to their significantly different characteristics, e.g., diverse geometry properties and heterogeneous domain distributions. To address these challenges, we build a detector based on the bird's-eye-view (BEV) detection paradigm, where the explicit feature projection is beneficial to addressing the geometry learning ambiguity when employing multiple scenarios of data to train detectors. Then, we split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by the aforementioned challenges. Moreover, we develop a sparse BEV feature projection strategy to reduce computational cost and a unified domain alignment method to handle heterogeneous domains. Combining these techniques, a unified detector UniMODE is derived, which surpasses the previous state-of-the-art on the challenging Omni3D dataset (a large-scale dataset including both indoor and outdoor scenes) by 4.9% AP_3D, revealing the first successful generalization of a BEV detector to unified 3D object detection.
-
Recently 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency their ability to generate high-quality and diverse 3D assets is hindered by the limited 3D data. In contrast 2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. However 2D lifting methods suffer from inherent view-agnostic ambiguity thereby leading to serious multi-face Janus issues where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper we propose Sherpa3D a new text-to-3D framework that achieves high-fidelity generalizability and geometric consistency simultaneously. Specifically we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of our Sherpa3D over the state-of-the-art text-to-3D methods in terms of quality and 3D consistency.
-
Periocular and face are complementary biometrics for identity management albeit with inherent limitations notably in scenarios involving occlusion due to sunglasses or masks. In response to these challenges we introduce Flexible Biometric Recognition (FBR) a novel framework designed to advance conventional face periocular and multimodal face-periocular biometrics across both intra- and cross-modality recognition tasks. FBR strategically utilizes the Multimodal Fusion Attention (MFA) and Multimodal Prompt Tuning (MPT) mechanisms within the Vision Transformer architecture. MFA facilitates the fusion of modalities ensuring cohesive alignment between facial and periocular embeddings while incorporating soft-biometrics to enhance the model's ability to discriminate between individuals. The fusion of three modalities is pivotal in exploring interrelationships between different modalities. Additionally MPT serves as a unifying bridge intertwining inputs and promoting cross-modality interactions while preserving their distinctive characteristics. The collaborative synergy of MFA and MPT enhances the shared features of the face and periocular with a specific emphasis on the ocular region yielding exceptional performance in both intra- and cross-modality recognition tasks. Rigorous experimentation across four benchmark datasets validates the noteworthy performance of the FBR model. The source code is available at https://github.com/MIS-DevWorks/FBR.
-
Collaborative perception allows for information sharing between multiple agents such as vehicles and infrastructure to obtain a comprehensive view of the environment through communication and fusion. Current research on multi-agent collaborative perception systems often assumes ideal communication and perception environments and neglects the effect of real-world noise such as pose noise motion blur and perception noise. To address this gap in this paper we propose a novel motion-aware robust communication network (MRCNet) that mitigates noise interference and achieves accurate and robust collaborative perception. MRCNet consists of two main components: multi-scale robust fusion (MRF) addresses pose noise by developing cross-semantic multi-scale enhanced aggregation to fuse features of different scales while motion enhanced mechanism (MEM) captures motion context to compensate for information blurring caused by moving objects. Experimental results on popular collaborative 3D object detection datasets demonstrate that MRCNet outperforms competing methods in noisy scenarios with improved perception performance using less bandwidth.
-
In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages.
-
Open-vocabulary object detection aims to detect novel categories that are independent from the base categories used during training. Most modern methods adhere to the paradigm of learning vision-language space from a large-scale multi-modal corpus and subsequently transferring the acquired knowledge to off-the-shelf detectors like Faster-RCNN. However, information attenuation or destruction may occur during the process of knowledge transfer due to the domain gap, hampering the generalization ability on novel categories. To mitigate this predicament, in this paper we present a novel framework named BIND, standing for Built-IN Detector, to eliminate the need for module replacement or knowledge transfer to off-the-shelf detectors. Specifically, we design a two-stage training framework with an Encoder-Decoder structure. In the first stage, an image-text dual encoder is trained to learn region-word alignment from a corpus of image-text pairs. In the second stage, a DETR-style decoder is trained to perform detection on annotated object detection datasets. In contrast to conventional manually designed non-adaptive anchors, which generate numerous redundant proposals, we develop an anchor proposal network that generates anchor proposals with high likelihood based on candidates adaptively, thereby substantially improving detection efficiency. Experimental results on two public benchmarks, COCO and LVIS, demonstrate that our method stands as a state-of-the-art approach for open-vocabulary object detection.
-
Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, the computation complexity, memory cost, and long-term temporal connection impose additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos and 14K manual annotations for validation of the effectiveness of our method. The code, models, and data can be found at https://rese1f.github.io/MovieChat.
-
In order to gain insights about the decision-making of different visual recognition backbones, we propose two methodologies, sub-explanation counting and cross-testing, that systematically apply deep explanation algorithms on a dataset-wide basis and compare the statistics generated from the amount and nature of the explanations. These methodologies reveal the difference among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions, whereas traditional CNNs and distilled transformers are less compositional and more disjunctive, which means that they use multiple diverse but smaller sets of parts to achieve a confident prediction. Through further experiments, we pinpoint the choice of normalization to be especially important in the compositionality of a model, in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally, we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.
-
Estimating full-body human motion via sparse tracking signals from head-mounted displays and hand controllers in 3D scenes is crucial to applications in AR/VR. One of the biggest challenges to this task is the one-to-many mapping from sparse observations to dense full-body motions, which introduces inherent ambiguities. To help resolve this ambiguous problem, we introduce a new framework to combine rich contextual information provided by scenes to benefit full-body motion tracking from sparse observations. To estimate plausible human motions given sparse tracking signals and 3D scenes, we develop S^2Fusion, a unified framework fusing Scene and sparse Signals with a conditional difFusion model. S^2Fusion first extracts the spatial-temporal relations residing in the sparse signals via a periodic autoencoder, and then produces time-alignment feature embedding as additional inputs. Subsequently, by drawing initial noisy motion from a pre-trained prior, S^2Fusion utilizes conditional diffusion to fuse scene geometry and sparse tracking signals to generate full-body scene-aware motions. The sampling procedure of S^2Fusion is further guided by a specially designed scene-penetration loss and phase-matching loss, which effectively regularizes the motion of the lower body even in the absence of any tracking signals, making the generated motion much more plausible and coherent. Extensive experimental results have demonstrated that our S^2Fusion outperforms the state-of-the-art in terms of estimation quality and smoothness.
-
Due to its promising results density map regression has been widely employed for image-based crowd counting. The approach however often suffers from severe performance degradation when tested on data from unseen scenarios the so-called "domain shift" problem. To address the problem we investigate in this work single domain generalization (SDG) for crowd counting. The existing SDG approaches are mainly for image classification and segmentation and can hardly be extended to our case due to its regression nature and label ambiguity (i.e. ambiguous pixel-level ground truths). We propose MPCount a novel effective SDG approach even for narrow source distribution. MPCount stores diverse density values for density map regression and reconstructs domain-invariant features by means of only one memory bank a content error mask and attention consistency loss. By partitioning the image into grids it employs patch-wise classification as an auxiliary task to mitigate label ambiguity. Through extensive experiments on different datasets MPCount is shown to significantly improve counting accuracy compared to the state of the art under diverse scenarios unobserved in the training data characterized by narrow source distribution. Code is available at https://github.com/Shimmer93/MPCount.
-
Monocular depth estimation has experienced significant progress on terrestrial images in recent years thanks to deep learning advancements. But it remains inadequate for underwater scenes, primarily due to data scarcity. Given the inherent challenges of light attenuation and backscatter in water, acquiring clear underwater images or precise depth is notably difficult and costly. To mitigate this issue, learning-based approaches often rely on synthetic data or turn to self- or unsupervised manners. Nonetheless, their performance is often hindered by domain gap and looser constraints. In this paper, we propose a novel pipeline for generating photorealistic underwater images using accurate terrestrial depth. This approach facilitates the supervised training of models for underwater depth estimation, effectively reducing the performance disparity between terrestrial and underwater environments. Contrary to previous synthetic datasets that merely apply style transfer to terrestrial images without scene content change, our approach uniquely creates vivid non-existent underwater scenes by leveraging terrestrial depth data through the innovative Stable Diffusion model. Specifically, we introduce a specialized Depth2Underwater ControlNet trained on prepared {Underwater, Depth, Text} data triplets for this generation task. Our newly developed dataset, Atlantis, enables terrestrial depth estimation models to achieve considerable improvements on unseen underwater scenes, surpassing their terrestrial pretrained counterparts both quantitatively and qualitatively. Moreover, we further show its practical utility by applying the improved depth in underwater image enhancement and its smaller domain gap from the LLVM perspective. Code and dataset are publicly available at https://github.com/zkawfanx/Atlantis.
-
The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially object tracking. Current methods predominantly rely on labeled domain-specific video datasets, which limits cross-domain generalization of learned similarity embeddings. We propose MASA, a novel method for robust instance association learning, capable of matching any objects within videos across diverse domains without tracking labels. Leveraging the rich object segmentation from the Segment Anything Model (SAM), MASA learns instance-level correspondence through exhaustive data transformations. We treat the SAM outputs as dense object region proposals and learn to match those regions from a vast image collection. We further design a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enable them to track any detected objects. Those combinations present strong zero-shot tracking ability in complex domains. Extensive tests on multiple challenging MOT and MOTS benchmarks indicate that the proposed method, using only unlabelled static images, achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences in zero-shot association. Our code is available at https://github.com/siyuanliii/masa.
-
Human facial action units (AUs) are mutually related in a hierarchical manner: not only are they associated with each other in both spatial and temporal domains, but AUs located in the same or close facial regions also show stronger relationships than those in different facial regions. While no existing approach thoroughly models such hierarchical inter-dependencies among AUs, this paper proposes to comprehensively model multi-scale AU-related dynamics and hierarchical spatio-temporal relationships among AUs for their occurrence recognition. Specifically, we first propose a novel multi-scale temporal differencing network with an adaptive weighting block to explicitly capture facial dynamics across frames at different spatial scales, which specifically considers the heterogeneity of range and magnitude in different AUs' activation. Then, a two-stage strategy is introduced to hierarchically model the relationship among AUs based on their spatial distribution (i.e., local and cross-region AU relationship modelling). Experimental results achieved on BP4D and DISFA show that our approach is the new state-of-the-art in the field of AU occurrence recognition. Our code is publicly available at https://github.com/CVI-SZU/MDHR.
-
We delve into pseudo-labeling for semi-supervised monocular 3D object detection (SSM3OD) and discover two primary issues: a misalignment between the prediction quality of 3D and 2D attributes and the tendency of depth supervision derived from pseudo-labels to be noisy leading to significant optimization conflicts with other reliable forms of supervision. To tackle these issues we introduce a novel decoupled pseudo-labeling (DPL) approach for SSM3OD. Our approach features a Decoupled Pseudo-label Generation (DPG) module designed to efficiently generate pseudo-labels by separately processing 2D and 3D attributes. This module incorporates a unique homography-based method for identifying dependable pseudo-labels in Bird's Eye View (BEV) space specifically for 3D attributes. Additionally we present a Depth Gradient Projection (DGP) module to mitigate optimization conflicts caused by noisy depth supervision of pseudo-labels effectively decoupling the depth gradient and removing conflicting gradients. This dual decoupling strategy--at both the pseudo-label generation and gradient levels--significantly improves the utilization of pseudo-labels in SSM3OD. Our comprehensive experiments on the KITTI benchmark demonstrate the superiority of our method over existing approaches.
-
We propose a novel approach to the action segmentation task for long untrimmed videos based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches our method does not require knowing the action order for a video to attain temporal consistency. Furthermore our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast 50-Salads YouTube Instructions and Desktop Assembly datasets yielding state-of-the-art results for the unsupervised video action segmentation task.
-
Existing prompt learning methods have shown certain capabilities in Out-of-Distribution (OOD) detection but the lack of OOD images in the target dataset in their training can lead to mismatches between OOD images and In-Distribution (ID) categories resulting in a high false positive rate. To address this issue we introduce a novel OOD detection method named 'NegPrompt' to learn a set of negative prompts each representing a negative connotation of a given class label for delineating the boundaries between ID and OOD images. It learns such negative prompts with ID data only without any reliance on external outlier data. Further current methods assume the availability of samples of all ID classes rendering them ineffective in open-vocabulary learning scenarios where the inference stage can contain novel ID classes not present during training. In contrast our learned negative prompts are transferable to novel class labels. Experiments on various ImageNet benchmarks show that NegPrompt surpasses state-of-the-art prompt-learning-based OOD detection methods and maintains a consistent lead in hard OOD detection in closed- and open-vocabulary classification scenarios. Code is available at https://github.com/mala-lab/negprompt.
-
Long-tail class incremental learning (LT-CIL) is designed to perpetually acquire novel knowledge from an imbalanced and perpetually evolving data stream while ensuring the retention of previously acquired knowledge. The existing method only re-balances data distribution and ignores exploring the potential relationship between different samples, causing non-robust representations and even severe forgetting in classes with few samples. In this paper, we construct two parallel spaces simultaneously: 1) Sub-prototype space and 2) Reminiscence space, to learn robust representations while alleviating forgetfulness. Concretely, we advance the concept of the sub-prototype space, which amalgamates insights from diverse classes. This integration facilitates the mutual complementarity of varied knowledge, thereby augmenting the attainment of more robust representations. Furthermore, we introduce the reminiscence space, which encapsulates each class distribution, aiming to constrain model optimization and mitigate the phenomenon of forgetting. The tandem utilization of the two parallel spaces effectively alleviates the adverse consequences associated with imbalanced data distribution, preventing forgetting without needing replay examples. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various benchmarks.
-
Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency
We propose a voxel-based optimization framework, ReVoRF, for few-shot radiance fields that strategically addresses the unreliability in pseudo novel view synthesis. Our method pivots on the insight that relative depth relationships within neighboring regions are more reliable than the absolute color values in disoccluded areas. Consequently, we devise a bilateral geometric consistency loss that carefully navigates the trade-off between color fidelity and geometric accuracy in the context of depth consistency for uncertain regions. Moreover, we present a reliability-guided learning strategy to discern and utilize the variable quality across synthesized views, complemented by a reliability-aware voxel smoothing algorithm that smoothens the transition between reliable and unreliable data patches. Our approach allows for a more nuanced use of all available data, promoting enhanced learning from regions previously considered unsuitable for high-quality reconstruction. Extensive experiments across diverse datasets reveal that our approach attains significant gains in efficiency and accuracy, delivering rendering speeds of 3 FPS, 7 mins to train a 360° scene, and a 5% improvement in PSNR over existing few-shot methods. Code is available at https://github.com/HKCLynn/ReVoRF.
-
Recent literature has demonstrated that vision transformers (VITs) exhibit superior performance compared to convolutional neural networks (CNNs). The majority of recent research on adversarial robustness, however, has predominantly focused on CNNs. In this work, we bridge this gap by analyzing the effectiveness of existing attacks on VITs. We demonstrate that due to the softmax computations in every attention block in VITs, they are inherently vulnerable to floating point underflow errors. This can lead to a gradient masking effect, resulting in suboptimal attack strength of well-known attacks like PGD, Carlini and Wagner (CW), GAMA, and Patch attacks. Motivated by this, we propose the Adaptive Attention Scaling (AAS) attack, which can automatically find the optimal scaling factors of pre-softmax outputs using gradient-based optimization. We show that the proposed simple strategy can be incorporated with any existing adversarial attacks as well as adversarial training methods and achieve improved performance. On VIT-B16, we demonstrate an improved attack strength of up to 2.2% on CIFAR10 and up to 2.9% on CIFAR100 by incorporating the proposed AAS attack with state-of-the-art single attack methods like the GAMA attack. Further, we utilise the proposed AAS attack every few epochs in existing adversarial training methods, which is termed Adaptive Attention Scaling Adversarial Training (AAS-AT). On incorporating AAS-AT with existing methods, we outperform them on VITs by 1.3-3.5% on CIFAR10. We observe improved performance on ImageNet-100 as well.
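The underflow effect described above can be reproduced in a few lines. The sketch below is only an illustration of the mechanism under assumed values: the logits and the fixed scale of 0.1 are hypothetical, and the scale is hand-picked here rather than found by gradient-based optimization as in the AAS attack. It uses Python's 64-bit floats; lower-precision formats underflow at much smaller logit gaps.

```python
import math

def softmax(logits, scale=1.0):
    """Numerically stable softmax over scale * logits."""
    scaled = [scale * x for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_grad(logits, scale=1.0):
    """Jacobian d p_i / d logit_j = scale * p_i * (delta_ij - p_j)."""
    p = softmax(logits, scale)
    n = len(p)
    return [[scale * p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

logits = [0.0, -800.0]  # hypothetical pre-softmax attention scores

plain = softmax_grad(logits)              # exp(-800) underflows to 0.0
scaled = softmax_grad(logits, scale=0.1)  # exp(-80) is still representable

print("unscaled gradient is all zeros:", all(g == 0.0 for row in plain for g in row))
print("scaled gradient has non-zero entries:", any(g != 0.0 for row in scaled for g in row))
```

When every entry of the Jacobian is zero, no gradient flows back through that attention row, which is the gradient masking that weakens gradient-based attacks; shrinking the pre-softmax values keeps the small probabilities representable.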
-
Monocular egocentric 3D human motion capture is a challenging and actively researched problem. Existing methods use synchronously operating visual sensors (e.g. RGB cameras) and often fail under low lighting and fast motions which can be restricting in many applications involving head-mounted devices. In response to the existing limitations this paper 1) introduces a new problem i.e. 3D human motion capture from an egocentric monocular event camera with a fisheye lens and 2) proposes the first approach to it called EventEgo3D (EE3D). Event streams have high temporal resolution and provide reliable cues for 3D human motion capture under high-speed human motions and rapidly changing illumination. The proposed EE3D framework is specifically tailored for learning with event streams in the LNES representation enabling high 3D reconstruction accuracy. We also design a prototype of a mobile head-mounted device with an event camera and record a real dataset with event observations and the ground-truth 3D human poses (in addition to the synthetic dataset). Our EE3D demonstrates robustness and superior 3D accuracy compared to existing solutions across various challenging experiments while supporting real-time 3D pose update rates of 140Hz.
-
For text-to-video retrieval (T2VR) which aims to retrieve unlabeled videos by ad-hoc textual queries CLIP-based methods currently lead the way. Compared to CLIP4Clip which is efficient and compact state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP enabling a CLIP4Clip based student network to learn from more advanced yet computationally intensive models. In order to create a learning channel to convey fine-grained cross-modal knowledge from a heavy model to the student we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block which by design adds no extra storage / computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip yet has near-SOTA effectiveness.
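A minimal sketch of attentional frame-feature aggregation with teacher soft labels, under stated assumptions: the linear scorer, the function names, and the cross-entropy form of the distillation loss are illustrative choices, not details confirmed by the abstract.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate_frames(frame_feats, scorer):
    """Score each frame with a (hypothetical) linear scorer, then return the
    attention weights and the weighted sum of frame features."""
    scores = [sum(w * f for w, f in zip(scorer, feat)) for feat in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    video_feat = [
        sum(weights[t] * frame_feats[t][d] for t in range(len(frame_feats)))
        for d in range(dim)
    ]
    return weights, video_feat

def soft_label_loss(student_weights, teacher_scores):
    """Cross-entropy between teacher frame-text relevance (as soft labels)
    and the student's attention weights. The loss form is an assumption."""
    targets = softmax(teacher_scores)
    return -sum(t * math.log(w) for t, w in zip(targets, student_weights))

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy frame features
weights, video_feat = aggregate_frames(frames, scorer=[0.0, 0.0])
loss = soft_label_loss(weights, teacher_scores=[2.0, 0.0, 0.0])
print(weights, video_feat, loss)
```

Because the aggregation yields a single video-level vector, retrieval can still use one vector per video, which is consistent with the claim that the block adds no storage or computation overhead at the retrieval stage.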
-
Comparing different age estimation methods poses a challenge due to the unreliability of published results stemming from inconsistencies in the benchmarking process. Previous studies have reported continuous performance improvements over the past decade using specialized methods; however our findings challenge these claims. This paper identifies two trivial yet persistent issues with the currently used evaluation protocol and describes how to resolve them. We offer an extensive comparative analysis for state-of-the-art facial age estimation methods. Surprisingly we find that the performance differences between the methods are negligible compared to the effect of other factors such as facial alignment facial coverage image resolution model architecture or the amount of data used for pretraining. We use the gained insights to propose using FaRL as the backbone model and demonstrate its effectiveness on all public datasets. We make the source code and exact data splits public on GitHub and in the supplementary material.
-
Co-salient object detection (CoSOD) aims to identify the common and salient (usually in the foreground) regions across a given group of images. Although achieving significant progress state-of-the-art CoSODs could be easily affected by some adversarial perturbations leading to substantial accuracy reduction. The adversarial perturbations can mislead CoSODs but do not change the high-level semantic information (e.g. concept) of the co-salient objects. In this paper we propose a novel robustness enhancement framework by first learning the concept of the co-salient objects based on the input group images and then leveraging this concept to purify adversarial perturbations which are subsequently fed to CoSODs for robustness enhancement. Specifically we propose CosalPure containing two modules i.e. group-image concept learning and concept-guided diffusion purification. For the first module we adopt a pre-trained text-to-image diffusion model to learn the concept of co-salient objects within group images where the learned concept is robust to adversarial examples. For the second module we map the adversarial image to the latent space and then perform diffusion generation by embedding the learned concept into the noise prediction function as an extra condition. Our method can effectively alleviate the influence of the SOTA adversarial attack containing different adversarial patterns including exposure and noise. The extensive results demonstrate that our method could enhance the robustness of CoSODs significantly.
-
Human action anticipation aims at predicting what people will do in the future based on past observations. In this paper we introduce Uncertainty-aware Action Decoupling Transformer (UADT) for action anticipation. Unlike existing methods that directly predict action in a verb-noun pair format we decouple the action anticipation task into verb and noun anticipations separately. The objective is to make the two decoupled tasks assist each other and eventually improve the action anticipation task. Specifically we propose a two-stream Transformer-based architecture which is composed of a verb-to-noun model and a noun-to-verb model. The verb-to-noun model leverages the verb information to improve the noun prediction and the other way around. We extend the model in a probabilistic manner and quantify the predictive uncertainty of each decoupled task to select features. In this way the noun prediction leverages the most informative and redundancy-free verb features and verb prediction works similarly. Finally the two streams are combined dynamically based on their uncertainties to make the joint action anticipation. We demonstrate the efficacy of our method by achieving state-of-the-art performance on action anticipation benchmarks including EPIC-KITCHENS EGTEA Gaze+ and 50-Salads.
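One way to picture the final uncertainty-based combination of the two streams is the sketch below. It is an assumption-laden illustration: predictive entropy is used as the uncertainty measure and inverse-entropy weighting as the fusion rule, neither of which is specified in the abstract, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fuse_by_uncertainty(probs_a, probs_b, eps=1e-8):
    """Weight each stream by the inverse of its entropy, so the more
    confident stream contributes more to the fused prediction."""
    w_a = 1.0 / (entropy(probs_a) + eps)
    w_b = 1.0 / (entropy(probs_b) + eps)
    total = w_a + w_b
    return [(w_a * a + w_b * b) / total for a, b in zip(probs_a, probs_b)]

confident_stream = [0.9, 0.05, 0.05]       # e.g. verb-to-noun stream, low uncertainty
uncertain_stream = [1 / 3, 1 / 3, 1 / 3]   # e.g. noun-to-verb stream, high uncertainty
fused = fuse_by_uncertainty(confident_stream, uncertain_stream)
print(fused)
```

The fused distribution stays normalized because it is a convex combination of the two inputs, and it leans toward the stream with lower entropy.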
-
Deep neural networks have shown exemplary performance on semantic scene understanding tasks on source domains but due to the absence of style diversity during training enhancing performance on unseen target domains using only single source domain data remains a challenging task. Generation of simulated data is a feasible alternative to retrieving large style-diverse real-world datasets as it is a cumbersome and budget-intensive process. However the large domain-specific inconsistencies between simulated and real-world data pose a significant generalization challenge in semantic segmentation. In this work to alleviate this problem we propose a novel Multi-Resolution Feature Perturbation (MRFP) technique to randomize domain-specific fine-grained features and perturb style of coarse features. Our experimental results on various urban-scene segmentation datasets clearly indicate that along with the perturbation of style-information perturbation of fine-feature components is paramount to learn domain invariant robust feature maps for semantic segmentation models. MRFP is a simple and computationally efficient transferable module with no additional learnable parameters or objective functions that helps state-of-the-art deep neural networks to learn robust domain invariant features for simulation-to-real semantic segmentation. Code is available at https://github.com/airl-iisc/MRFP.
-
Current 3D stylization methods often assume static scenes which violates the dynamic nature of our real world. To address this limitation we present S-DyRF a reference-based spatio-temporal stylization method for dynamic neural radiance fields. However stylizing dynamic 3D scenes is inherently challenging due to the limited availability of stylized reference images along the temporal axis. Our key insight lies in introducing additional temporal cues besides the provided reference. To this end we generate temporal pseudo-references from the given stylized reference. These pseudo-references facilitate the propagation of style information from the reference to the entire dynamic 3D scene. For coarse style transfer we enforce novel views and times to mimic the style details present in pseudo-references at the feature level. To preserve high-frequency details we create a collection of stylized temporal pseudo-rays from temporal pseudo-references. These pseudo-rays serve as detailed and explicit stylization guidance for achieving fine style transfer. Experiments on both synthetic and real-world datasets demonstrate that our method yields plausible stylized results of space-time view synthesis on dynamic 3D scenes.
-
Existing diffusion-based video editing models have made gorgeous advances for editing attributes of a source video over time but struggle to manipulate the motion information while preserving the original protagonist's appearance and background. To address this we propose MotionEditor the first diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. While ControlNet enables direct generation based on skeleton poses it encounters challenges when modifying the source motion in the inverted noise due to contradictory signals between the noise (source) and the condition (reference). Our adapter complements ControlNet by involving source content to transfer adapted control signals seamlessly. Further we build up a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism facilitating branch interaction. This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner making the editing branch retain the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address the discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor both qualitatively and quantitatively. To the best of our knowledge MotionEditor is the first to use diffusion models specifically for video motion editing considering the origin dynamic background and camera movement.
-
It is a well-known fact that the performance of deep learning models deteriorates when they encounter a distribution shift at test time. Test-time adaptation (TTA) algorithms have been proposed to adapt the model online while inferring test data. However, existing research predominantly focuses on classification tasks through the optimization of batch normalization layers or classification heads, which limits its applicability to model architectures such as Transformers and makes it challenging to apply to other tasks such as object detection. In this paper, we propose a novel online adaptation approach for object detection in continually changing test domains, considering which part of the model to update, how to update it, and when to perform the update. By introducing architecture-agnostic and lightweight adaptor modules and only updating these while leaving the pre-trained backbone unchanged, we can rapidly adapt to new test domains in an efficient way and prevent catastrophic forgetting. Furthermore, we present a practical and straightforward class-wise feature aligning method for object detection to resolve domain shifts. Additionally, we enhance efficiency by determining when the model is sufficiently adapted or when additional adaptation is needed due to changes in the test distribution. Our approach surpasses baselines on widely used benchmarks, achieving improvements of up to 4.9%p and 7.9%p in mAP for COCO → COCO-corrupted and SHIFT, respectively, while maintaining about 20 FPS or higher. The implementation code is available at https://github.com/natureyoo/ContinualTTA_ObjectDetection.
-
Large foundation models, known for their strong zero-shot generalization, have excelled in visual and language applications. However, applying them to medical image segmentation, a domain with diverse imaging types and target labels, remains an open challenge. Current approaches, such as adapting interactive segmentation models like the Segment Anything Model (SAM), require user prompts for each sample during inference. Alternatively, transfer learning methods like few/one-shot models demand labeled samples, leading to high costs. This paper introduces a new paradigm toward universal medical image segmentation, termed 'One-Prompt Segmentation.' One-Prompt Segmentation combines the strengths of one-shot and interactive methods. In the inference stage, with just one prompted sample, it can adeptly handle the unseen task in a single forward pass. We train the One-Prompt Model on 64 open-source medical datasets, accompanied by the collection of over 3000 clinician-labeled prompts. Tested on 14 previously unseen datasets, the One-Prompt Model showcases superior zero-shot segmentation capabilities, outperforming a wide range of related methods. The code and data are released at https://github.com/KidsWithTokens/one-prompt.
-
Low-shot image classification is a fundamental task in computer vision and the emergence of large-scale vision-language models such as CLIP has greatly advanced the forefront of research in this field. However most existing CLIP-based methods lack the flexibility to effectively incorporate other pre-trained models that encompass knowledge distinct from CLIP. To bridge the gap this work proposes a simple and effective probabilistic model ensemble framework based on Gaussian processes which have previously demonstrated remarkable efficacy in processing small data. We achieve the integration of prior knowledge by specifying the mean function with CLIP and the kernel function with an ensemble of deep kernels built upon various pre-trained models. By regressing the classification label directly our framework enables analytical inference straightforward uncertainty quantification and principled hyper-parameter tuning. Through extensive experiments on standard benchmarks we demonstrate that our method consistently outperforms competitive ensemble baselines regarding predictive performance. Additionally we assess the robustness of our method and the quality of the yielded uncertainty estimates on out-of-distribution datasets. We also illustrate that our method despite relying on label regression still enjoys superior model calibration compared to most deterministic baselines.
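The closed-form inference this abstract refers to can be sketched with the standard Gaussian process regression posterior mean, m(x*) + k*^T (K + noise I)^-1 (y - m(X)). The sketch below is illustrative only: the RBF kernels, the simple averaging of per-model kernels, the toy features, and the prior scores standing in for CLIP predictions are all assumptions, and a single class score is regressed for brevity.

```python
import math

def rbf(a, b, length_scale=1.0):
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)

def ensemble_kernel(views_a, views_b):
    """Average of RBF kernels, one per (hypothetical) pre-trained feature extractor."""
    return sum(rbf(fa, fb) for fa, fb in zip(views_a, views_b)) / len(views_a)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_posterior_mean(train_views, train_labels, train_prior,
                      test_views, test_prior, noise=1e-2):
    """Posterior mean of GP regression with a non-zero prior mean function."""
    n = len(train_views)
    K = [[ensemble_kernel(train_views[i], train_views[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    residual = [y - m for y, m in zip(train_labels, train_prior)]
    alpha = solve(K, residual)
    k_star = [ensemble_kernel(test_views, tv) for tv in train_views]
    return test_prior + sum(k * a for k, a in zip(k_star, alpha))

# Two training images, each described by features from two toy "models".
train_views = [[[0.0], [0.0]], [[3.0], [3.0]]]
train_labels = [1.0, 0.0]   # regression targets for one class
train_prior = [0.5, 0.5]    # stand-in for prior (e.g. CLIP-derived) scores

near = gp_posterior_mean(train_views, train_labels, train_prior, [[0.0], [0.0]], 0.5)
far = gp_posterior_mean(train_views, train_labels, train_prior, [[100.0], [100.0]], 0.3)
print(near, far)
```

Near a labeled example the prediction moves toward the observed label, while far from all training data it falls back to the prior mean, which is how a prior of this kind can be combined with a handful of labeled samples.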
-
Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens. This paradigm lacks pixel-level representations that are important for fine-grained visual understanding and diagnosis. In this work we introduce GROUNDHOG an MLLM developed by grounding Large Language Models to holistic segmentation. GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone which then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To train GROUNDHOG we carefully curated M3G2 a grounded visual instruction tuning dataset with Multi-Modal Multi-Grained Grounding by harvesting a collection of segmentation-grounded datasets with rich annotations. Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning and significantly reduces object hallucination. GROUNDHOG also demonstrates better grounding towards complex forms of visual input and provides easy-to-understand diagnosis in failure cases.
-
Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene and dense methods estimate all such correspondences. The aim is to learn a robust model i.e. a model able to match under challenging real-world changes. In this work we propose such a model leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch they are inherently coarse. We therefore combine them with specialized ConvNet fine features creating a precisely localizable feature pyramid. To further improve robustness we propose a tailored transformer match decoder that predicts anchor probabilities which enables it to express multimodality. Finally we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method RoMa achieves significant gains setting a new state-of-the-art. In particular we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at github.com/Parskatt/RoMa.
-
In this work, we present Omni-SMoLA, a multimodal architecture that mixes many multi-modal experts efficiently and achieves both high specialist and generalist performance. In contrast to previous models, for which we see performance degradation on average when training the models on a wide range of tasks, we show that the SMoLA low-rank experts are able to model different skills and tasks and overall improve the performance of a generalist model. This finding indicates that simple LMM fine-tuning is suboptimal for handling a wide range of tasks, and that pairing the act of fine-tuning with specifically designed architecture changes leads to better performing models.
-
We tackle semi-supervised object detection based on motion cues. Recent results suggest that heuristic-based clustering methods in conjunction with object trackers can be used to pseudo-label instances of moving objects and use these as supervisory signals to train 3D object detectors in Lidar data without manual supervision. We re-think this approach and suggest that both object detection as well as motion-inspired pseudo-labeling can be tackled in a data-driven manner. We leverage recent advances in scene flow estimation to obtain point trajectories from which we extract long-term class-agnostic motion patterns. Revisiting correlation clustering in the context of message passing networks we learn to group those motion patterns to cluster points to object instances. By estimating the full extent of the objects we obtain per-scan 3D bounding boxes that we use to supervise a Lidar object detection network. Our method not only outperforms prior heuristic-based approaches (57.5 AP +14 improvement over prior work) more importantly we show we can pseudo-label and train object detectors across datasets.
-
The boundless possibility of neural networks which can be used to solve a problem, each with different performance, leads to a situation where a Deep Learning expert is required to identify the best neural network. This goes against the hope of removing the need for experts. Neural Architecture Search (NAS) offers a solution to this by automatically identifying the best architecture. However, to date, NAS work has focused on a small set of datasets which we argue are not representative of real-world problems. We introduce eight new datasets created for a series of NAS Challenges: AddNIST, Language, MultNIST, CIFARTile, Gutenberg, Isabella, GeoClassing, and Chesseract. These datasets and challenges are developed to direct attention to issues in NAS development and to encourage authors to consider how their models will perform on datasets unknown to them at development time. We present experimentation using standard Deep Learning methods as well as the best results from challenge participants.
-
Few-shot learning (FSL) facilitates a variety of computer vision tasks yet remains vulnerable to adversarial attacks. Existing adversarially robust FSL methods rely on either visual similarity learning or class concept learning. Our analysis reveals that these two learning paradigms are complementary exhibiting distinct robustness due to their unique decision boundary types (concepts clustering by the visual similarity label vs. classification by the class labels). To bridge this gap we propose a novel framework unifying adversarially robust similarity learning and class concept learning. Specifically we distill parameters from both network branches into a "unified embedding model" during robust optimization and redistribute them to individual network branches periodically. To capture generalizable robustness across diverse branches we initialize adversaries in each episode with cross-branch class-wise "global adversarial perturbations" instead of less informative random initialization. We also propose a branch robustness harmonization to modulate the optimization of similarity and class concept learners via their relative adversarial robustness. Extensive experiments demonstrate the state-of-the-art performance of our method in diverse few-shot scenarios.
-
The spatio-temporal video grounding (STVG) task aims at locating a spatio-temporal tube for a specific instance given a text query. Despite advancements, current methods are easily affected by distractors or heavy object appearance variations in videos due to insufficient object information from the text, leading to degraded performance. To address this, we propose a novel framework, context-guided STVG (CG-STVG), which mines discriminative instance context for the object in videos and applies it as supplementary guidance for target localization. The key to CG-STVG lies in two specially designed modules: instance context generation (ICG), which focuses on discovering visual context information (in both appearance and motion) of the instance, and instance context refinement (ICR), which aims to improve the instance context from ICG by eliminating irrelevant or even harmful information from the context. During grounding, ICG and ICR are deployed together at each decoding stage of a Transformer architecture for instance context learning. In particular, the instance context learned from one decoding stage is fed to the next stage and leveraged as guidance containing rich and discriminative object features to enhance target-awareness in the decoding features, which in turn helps generate better instance context and ultimately improves localization. Compared to existing methods, CG-STVG benefits from both the object information in the text query and the guidance from mined instance visual context for more accurate target localization. In our experiments on three benchmarks, HCSTVG-v1/-v2 and VidSTG, CG-STVG sets a new state of the art in m_tIoU and m_vIoU on all of them, showing its efficacy. Code is released at https://github.com/HengLan/CGSTVG.
-
The many variations of Implicit Neural Representations (INRs) where a neural network is trained as a continuous representation of a signal have tremendous practical utility for downstream tasks including novel view synthesis video compression and image super-resolution. Unfortunately the inner workings of these networks are seriously understudied. Our work eXplaining the Implicit Neural Canvas (XINC) is a unified framework for explaining properties of INRs by examining the strength of each neuron's contribution to each output pixel. We call the aggregate of these contribution maps the Implicit Neural Canvas and we use this concept to demonstrate that the INRs we study learn to "see" the frames they represent in surprising ways. For example INRs tend to have highly distributed representations. While lacking high-level object semantics they have a significant bias for color and edges and are almost entirely space-agnostic. We arrive at our conclusions by examining how objects are represented across time in video INRs using clustering to visualize similar neurons across layers and architectures and show that this is dominated by motion. These insights demonstrate the general usefulness of our analysis framework.
-
While real-world anime super-resolution (SR) has gained increasing attention in the SR community, existing methods still adopt techniques from the photorealistic domain. In this paper, we analyze the anime production workflow and rethink how to use its characteristics for real-world anime SR. First, we argue that video networks and datasets are not necessary for anime SR due to the repeated use of hand-drawn frames. Instead, we propose an anime image collection pipeline that chooses the least compressed and most informative frames from the video sources. Based on this pipeline, we introduce the Anime Production-oriented Image (API) dataset. In addition, we identify two anime-specific challenges: distorted and faint hand-drawn lines, and unwanted color artifacts. We address the first issue by introducing a prediction-oriented compression module in the image degradation model and a pseudo-ground truth preparation with enhanced hand-drawn lines. We also introduce the balanced twin perceptual loss, combining both anime and photorealistic high-level features, to mitigate unwanted color artifacts and increase visual clarity. We evaluate our method through extensive experiments on the public benchmark, showing that our method outperforms state-of-the-art approaches trained on anime datasets.
-
Multi-view photometric stereo (MVPS) recovers a high-fidelity 3D shape of a scene by benefiting from both multi-view stereo and photometric stereo. While photometric stereo boosts detailed shape reconstruction it necessitates recording images under various light conditions for each viewpoint. In particular calibrating the light directions for each view significantly increases the cost of acquiring images. To make MVPS more accessible we introduce a practical and easy-to-implement setup multi-view constrained photometric stereo (MVCPS) where the light directions are unknown but constrained to move together with the camera. Unlike conventional multi-view uncalibrated photometric stereo our constrained setting reduces the ambiguities of surface normal estimates from per-view linear ambiguities to a single and global linear one thereby simplifying the disambiguation process. The proposed method integrates the ambiguous surface normal into neural surface reconstruction (NeuS) to simultaneously resolve the global ambiguity and estimate the detailed 3D shape. Experiments demonstrate that our method estimates accurate shapes under sparse viewpoints using only a few multi-view constrained light sources.
-
Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes, their 2D counterparts, and language descriptions. However, the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable, and the collected language descriptions are not diverse. To address this, we introduce ULIP-2, a simple yet effective tri-modal pretraining framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes. It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets. ULIP-2 is also equipped with scaled-up backbones for better multi-modal representation learning. We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training ULIP-2. Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with fine-tuning, and 3D captioning (3D-to-language generation). It achieves a new SOTA of 50.6% (top-1) on Objaverse-LVIS and 84.7% (top-1) on ModelNet40 in zero-shot classification. In the ScanObjectNN benchmark for standard fine-tuning, ULIP-2 reaches an overall accuracy of 91.5% with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a new paradigm for scalable multimodal 3D representation learning without human annotations and shows significant improvements over existing baselines. The code and datasets are released at https://github.com/salesforce/ULIP.
-
Normalizing flows have proven their efficacy for density estimation in Euclidean space but their application to rotational representations crucial in various domains such as robotics or human pose modeling remains underexplored. Probabilistic models of the human pose can benefit from approaches that rigorously consider the rotational nature of human joints. For this purpose we introduce HuProSO3 a normalizing flow model that operates on a high-dimensional product space of SO(3) manifolds modeling the joint distribution for human joints with three degrees of freedom. HuProSO3's advantage over state-of-the-art approaches is demonstrated through its superior modeling accuracy in three different applications and its capability to evaluate the exact likelihood. This work not only addresses the technical challenge of learning densities on SO(3) manifolds but it also has broader implications for domains where the probabilistic regression of correlated 3D rotations is of importance. Code will be available at https://github.com/odunkel/HuProSO.
-
Trajectory prediction plays an important role in various applications including autonomous driving robotics and scene understanding. Existing approaches mainly focus on developing compact neural networks to increase prediction precision on public datasets typically employing a standardized input duration. However a notable issue arises when these models are evaluated with varying observation lengths leading to a significant performance drop a phenomenon we term the Observation Length Shift. To address this issue we introduce a general and effective framework the FlexiLength Network (FLN) to enhance the robustness of existing trajectory prediction techniques against varying observation periods. Specifically FLN integrates trajectory data with diverse observation lengths incorporates FlexiLength Calibration (FLC) to acquire temporal invariant representations and employs FlexiLength Adaptation (FLA) to further refine these representations for more accurate future trajectory predictions. Comprehensive experiments on multiple datasets i.e. ETH/UCY nuScenes and Argoverse 1 demonstrate the effectiveness and flexibility of our proposed FLN framework.
-
Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities i.e. scale. Predicting a 3D scene from text description(s) is similarly ill-posed i.e. spatial arrangements of objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this we focus on monocular depth estimation the problem of predicting a dense depth map from a single image but with an additional text caption describing the scene. To this end we begin by encoding the text caption as a mean and standard deviation; using a variational framework we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder which is then decoded to the output depth map. Our approach is trained alternatingly between the text and image branches: in one optimization step we predict the mean and standard deviation from the text description and sample from a standard Gaussian and in the other we sample using a (image) conditional sampler. Once trained we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios where we show that language can consistently improve performance in both. Code: https://github.com/Adonis-galaxy/WorDepth.
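The alternating sampling scheme can be read as two ways of supplying the noise term in a reparameterized latent z = mu + sigma * eps. The sketch below follows that reading, which is an interpretation of the abstract rather than a confirmed detail; the function names and the tanh-based stand-in for the image-conditional sampler are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, eps):
    """Latent code z = mu + sigma * eps, computed element-wise."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def gaussian_eps(dim, rng):
    """Text-branch step: draw eps from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def image_conditional_eps(image_feat):
    """Image-branch step: a toy stand-in for a learned conditional sampler
    that maps image features to a specific eps (here simply tanh)."""
    return [math.tanh(f) for f in image_feat]

# Hypothetical text-encoder outputs for one caption.
mu, sigma = [1.0, 2.0], [0.5, 0.5]

rng = random.Random(0)
z_text = reparameterize(mu, sigma, gaussian_eps(len(mu), rng))       # one plausible scene
z_image = reparameterize(mu, sigma, image_conditional_eps([0.0, 0.0]))  # image selects a point
print(z_text, z_image)
```

In this reading, the text fixes the distribution of plausible latents and the image picks one member of it, which a decoder would then map to a depth map.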
-
Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation which induces measurement diversity during image acquisition. Despite its importance designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/.
-
Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper we comprehensively analyze the asymmetric dynamic synchronous and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with we propose to annotate the actor-reactor order of the interaction sequences for the NTU120 InterHuman and Chi3D datasets. Based on them a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines and can generalize to unseen actor motions and viewpoint changes.
-
Hand mesh reconstruction has attracted considerable attention in recent years with various approaches and techniques being proposed. Some of these methods incorporate complex components and designs which while effective may complicate the model and hinder efficiency. In this paper we decompose the mesh decoder into token generator and mesh regressor. Through extensive ablation experiments we found that the token generator should select discriminating and representative points while the mesh regressor needs to upsample sparse keypoints into dense meshes in multiple stages. Given these functionalities we can achieve high performance with minimal computational resources. Based on this observation we propose a simple yet effective baseline that outperforms state-of-the-art methods by a large margin while maintaining real-time efficiency. Our method outperforms existing solutions achieving state-of-the-art (SOTA) results across multiple datasets. On the FreiHAND dataset our approach produced a PA-MPJPE of 5.8mm and a PA-MPVPE of 6.1mm. Similarly on the DexYCB dataset we observed a PA-MPJPE of 5.5mm and a PA-MPVPE of 5.5mm. As for performance speed our method reached up to 33 frames per second (fps) when using HRNet and up to 70 fps when employing FastViT-MA36. Code will be made available.
-
In the realm of computer vision and graphics accurately establishing correspondences between geometric 3D shapes is pivotal for applications like object tracking registration texture transfer and statistical shape analysis. Moving beyond traditional hand-crafted and data-driven feature learning methods we incorporate spectral methods with deep learning focusing on functional maps (FMs) and optimal transport (OT). Traditional OT-based approaches often reliant on entropy regularization OT in learning-based framework face computational challenges due to their quadratic cost. Our key contribution is to employ the sliced Wasserstein distance (SWD) for OT which is a valid fast optimal transport metric in an unsupervised shape matching framework. This unsupervised framework integrates functional map regularizers with a novel OT-based loss derived from SWD enhancing feature alignment between shapes treated as discrete probability measures. We also introduce an adaptive refinement process utilizing entropy regularized OT further refining feature alignments for accurate point-to-point correspondences. Our method demonstrates superior performance in non-rigid shape matching including near-isometric and non-isometric scenarios and excels in downstream tasks like segmentation transfer. The empirical results on diverse datasets highlight our framework's effectiveness and generalization capabilities setting new standards in non-rigid shape matching with efficient OT metrics and an adaptive refinement module.
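As a rough illustration of why the sliced Wasserstein distance avoids the quadratic cost mentioned above, the sketch below estimates it by projecting both point sets onto random directions and matching sorted 1D projections. It assumes equally sized point sets with uniform weights; the function name `sliced_wasserstein` and its parameters are hypothetical and are not taken from the paper's code.

```python
import math
import random


def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    X and Y are equal-length lists of d-dimensional points
    (uniform weights are an assumption of this sketch).
    """
    if len(X) != len(Y):
        raise ValueError("this sketch assumes equally sized point sets")
    d = len(X[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        # Draw a random unit direction by normalizing a Gaussian vector.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        # Project both sets onto the direction and sort: in 1D the
        # optimal coupling simply matches order statistics.
        px = sorted(sum(a * b for a, b in zip(p, v)) for p in X)
        py = sorted(sum(a * b for a, b in zip(p, v)) for p in Y)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    # Average the squared 1D distances over directions, then take the root.
    return math.sqrt(total / n_proj)
```

Each projection costs one sort, so the estimate scales as O(n_proj * n log n) in the number of points n, rather than requiring a full pairwise cost matrix.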
-
Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency promising identity (ID) fidelity and flexible text controllability. In this work we introduce PhotoMaker an efficient personalized text-to-image generation method which mainly encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information. Such an embedding also empowers our method to be applied in many interesting scenarios such as when replacing the corresponding class word and when combining the characteristics of different identities. Besides to better drive the training of our PhotoMaker we propose an ID-oriented data creation pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline our PhotoMaker demonstrates comparable performance to test-time fine-tuning-based methods yet provides significant speed improvements strong generalization capabilities and a wide range of applications.
-
We present Score-Guided Human Mesh Recovery (ScoreHMR) an approach for solving inverse problems for 3D human pose and shape reconstruction. These inverse problems involve fitting a human body model to image observations traditionally solved through optimization techniques. ScoreHMR mimics model fitting approaches but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. The diffusion model is trained to capture the conditional distribution of the human model parameters given an input image. By guiding its denoising process with a task-specific score ScoreHMR effectively solves inverse problems for various applications without the need for retraining the task-agnostic diffusion model. We evaluate our approach on three settings/applications. These are: (i) single-frame model fitting; (ii) reconstruction from multiple uncalibrated views; (iii) reconstructing humans in video sequences. ScoreHMR consistently outperforms all optimization baselines on popular benchmarks across all settings. We make our code and models available on the project website: https://statho.github.io/ScoreHMR.
-
Diffusion models have recently achieved remarkable progress in generating realistic images. However challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions we present a training-free layout calibration system SimM that intervenes in the generative process on the fly during inference time. Specifically following a "check-locate-rectify" pipeline the system first analyses the prompt to generate the target layout and compares it with the intermediate outputs to automatically detect errors. Then by moving the located activations and making intra- and inter-map adjustments the rectification process can be performed with negligible computational overhead. To evaluate SimM over a range of layout requirements we present a benchmark SimMBench that compensates for the lack of superlative spatial relations in existing datasets. And both quantitative and qualitative results demonstrate the effectiveness of the proposed SimM in calibrating the layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM.
-
Unpaired image dehazing (UID) holds significant research importance due to the challenges in acquiring haze/clear image pairs with identical backgrounds. This paper proposes a novel method for UID named Orthogonal Decoupling Contrastive Regularization (ODCR). Our method is grounded in the assumption that an image consists of both haze-related features which influence the degree of haze and haze-unrelated features such as texture and semantic information. ODCR aims to ensure that the haze-related features of the dehazing result closely resemble those of the clear image while the haze-unrelated features align with the input hazy image. To accomplish the motivation Orthogonal MLPs optimized geometrically on the Stiefel manifold are proposed which can project image features into an orthogonal space thereby reducing the relevance between different features. Furthermore a task-driven Depth-wise Feature Classifier (DWFC) is proposed which assigns weights to the orthogonal features based on the contribution of each channel's feature in predicting whether the feature source is hazy or clear in a self-supervised fashion. Finally a Weighted PatchNCE (WPNCE) loss is introduced to achieve the pulling of haze-related features in the output image toward those of clear images while bringing haze-unrelated features close to those of the hazy input. Extensive experiments demonstrate the superior performance of our ODCR method on UID.
-
Predicting 3D point trajectory is a fundamental learning task which commonly should be equivariant under Euclidean transformation e.g. SE(3). The existing equivariant models are commonly based on the group equivariant convolution equivariant message passing vector neuron frame averaging etc. In this paper we propose a novel pose-transformed equivariant network in which the points are firstly uniquely normalized and then transformed by the learned pose transformations upon which the points after motion are predicted and aggregated. Under each transformed pose we design the point position predictor consisting of multiple Pose-Transformed Points Prediction blocks in which the global and local motions are estimated and aggregated. This framework can be proven to be equivariant to SE(3) transformation over 3D points. We evaluate the pose-transformed equivariant network on extensive datasets including human motion capture molecular dynamics modeling and dynamics simulation. Extensive experimental comparisons demonstrated our SOTA performance compared with the existing equivariant networks for 3D point trajectory prediction.
-
Towards holistic understanding of 3D scenes a general 3D segmentation method is needed that can segment diverse objects without restrictions on object quantity or categories while also reflecting the inherent hierarchical structure. To achieve this we propose OmniSeg3D an omniversal segmentation method that aims to segment anything in 3D all at once. The key insight is to lift multi-view inconsistent 2D segmentations into a consistent 3D feature field through a hierarchical contrastive learning framework which is accomplished in two steps. Firstly we design a novel hierarchical representation based on category-agnostic 2D segmentations to model the multi-level relationship among pixels. Secondly image features rendered from the 3D feature field are clustered at different levels which can be further drawn closer or pushed apart according to the hierarchical relationship between different levels. In tackling the challenges posed by inconsistent 2D segmentations this framework yields a globally consistent 3D feature field which further enables hierarchical segmentation multi-object selection and global discretization. Extensive experiments demonstrate the effectiveness of our method on high-quality 3D segmentation and accurate hierarchical structure understanding. A graphical user interface further facilitates flexible interaction for omniversal 3D segmentation.
-
Many problems in computer vision can be formulated as geometric estimation problems i.e. given a collection of measurements (e.g. point correspondences) we wish to fit a model (e.g. an essential matrix) that agrees with our observations. This necessitates some measure of how much an observation "agrees" with a given model. A natural choice is to consider the smallest perturbation that makes the observation exactly satisfy the constraints. However for many problems this metric is expensive or otherwise intractable to compute. The so-called Sampson error approximates this geometric error through a linearization scheme. For epipolar geometry the Sampson error is a popular choice and in practice known to yield very tight approximations of the corresponding geometric residual (the reprojection error). In this paper we revisit the Sampson approximation and provide new theoretical insights as to why and when this approximation works as well as provide explicit bounds on the tightness under some mild assumptions. Our theoretical results are validated in several experiments on real data and in the context of different geometric estimation tasks.
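For readers unfamiliar with the linearization described above, here is a minimal sketch of the standard first-order Sampson error for the epipolar constraint x2^T F x1 = 0. The function name, the nested-list matrix representation, and the example matrix in the note below are illustrative assumptions, not the paper's implementation.

```python
def sampson_error(F, x1, x2):
    """Squared Sampson error of a correspondence under fundamental matrix F.

    F is a 3x3 nested list; x1 and x2 are (u, v) image coordinates in the
    first and second image. Degenerate inputs with a zero gradient are
    not handled in this sketch.
    """
    p1 = (x1[0], x1[1], 1.0)
    p2 = (x2[0], x2[1], 1.0)
    # Epipolar line of p1 in image 2 (F p1) and of p2 in image 1 (F^T p2).
    Fp1 = [sum(F[i][j] * p1[j] for j in range(3)) for i in range(3)]
    Ftp2 = [sum(F[i][j] * p2[i] for i in range(3)) for j in range(3)]
    # Algebraic residual of the epipolar constraint.
    r = sum(p2[i] * Fp1[i] for i in range(3))
    # Squared norm of the constraint's gradient with respect to the four
    # measured coordinates (u1, v1, u2, v2).
    g = Fp1[0] ** 2 + Fp1[1] ** 2 + Ftp2[0] ** 2 + Ftp2[1] ** 2
    return r * r / g
```

As a sanity check, for a rectified stereo pair with F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]] the constraint reduces to v1 = v2, which is linear in the measurements, so the Sampson error (v1 - v2)^2 / 2 coincides with the exact geometric error in that special case.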
-
We introduce the Fixed Point Diffusion Model (FPDM) a novel approach to image generation that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling. Our approach embeds an implicit fixed point solving layer into the denoising network of a diffusion model transforming the diffusion process into a sequence of closely-related fixed point problems. Combined with a new stochastic training method this approach significantly reduces model size reduces memory usage and accelerates training. Moreover it enables the development of two new techniques to improve sampling efficiency: reallocating computation across timesteps and reusing fixed point solutions between timesteps. We conduct extensive experiments with state-of-the-art models on ImageNet FFHQ CelebA-HQ and LSUN-Church demonstrating substantial improvements in performance and efficiency. Compared to the state-of-the-art DiT model FPDM contains 87% fewer parameters consumes 60% less memory during training and improves image generation quality in situations where sampling computation or time is limited.
-
Learning from a limited amount of data namely Few-Shot Learning stands out as a challenging computer vision task. Several works exploit semantics and design complicated semantic fusion mechanisms to compensate for rare representative features within restricted data. However relying on naive semantics such as class names introduces biases due to their brevity while acquiring extensive semantics from external knowledge takes a huge time and effort. This limitation severely constrains the potential of semantics in Few-Shot Learning. In this paper we design an automatic way called Semantic Evolution to generate high-quality semantics. The incorporation of high-quality semantics alleviates the need for complex network structures and learning algorithms used in previous works. Hence we employ a simple two-layer network termed Semantic Alignment Network to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification. The experimental results show our framework outperforms all previous methods on six benchmarks demonstrating a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks. Code is available at https://github.com/zhangdoudou123/SemFew.
-
Defocus blur is a persistent problem in microscope imaging that poses harm to pathology interpretation and medical intervention in cell microscopy and microscope surgery. To address this problem a unified framework including the multi-pyramid transformer (MPT) and extended frequency contrastive regularization (EFCR) is proposed to tackle two outstanding challenges in microscopy deblur: longer attention span and data deficiency. The MPT employs an explicit pyramid structure at each network stage that integrates the cross-scale window attention (CSWA) the intra-scale channel attention (ISCA) and the feature-enhancing feed-forward network (FEFN) to capture long-range cross-scale spatial interaction and global channel context. The EFCR addresses the data deficiency problem by exploring latent deblur signals from different frequency bands. It also enables deblur knowledge transfer to learn cross-domain information from extra data improving deblur performance for labeled and unlabeled data. Extensive experiments and downstream task validation show the framework achieves state-of-the-art performance across multiple datasets. Project page: https://github.com/PieceZhang/MPT-CataBlur.
-
Training a linear classifier or lightweight model on top of pretrained vision model outputs so-called 'frozen features' leads to impressive performance on a number of downstream few-shot tasks. Currently frozen features are not modified during training. On the other hand when networks are trained directly on images data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space dubbed 'frozen feature augmentation (FroFA)' covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA such as brightness can improve few-shot performance consistently across three network architectures three large pretraining datasets and eight transfer datasets.
-
Diffusion models (DMs) have achieved remarkable generative performance particularly with the introduction of stochastic differential equations (SDEs). Nevertheless a gap emerges in the model sampling trajectory constructed by reverse-SDE due to the accumulation of score estimation and discretization errors. This gap results in a residual in the generated images adversely impacting the image quality. To remedy this we propose a novel residual learning framework built upon a correction function. The optimized function improves image quality by effectively rectifying the sampling trajectory. Importantly our framework exhibits transferable residual correction ability i.e. a correction function optimized for one pre-trained DM can also enhance the sampling trajectory constructed by other different DMs on the same dataset. Experimental results on four widely-used datasets demonstrate the effectiveness and transferable capability of our framework.
-
CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However without specific optimization for unimodal scenarios its performance in single-modality feature extraction might be suboptimal. Despite this some studies have directly used CLIP's image encoder for tasks like few-shot classification introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation adversely affecting CLIP's effectiveness in target tasks. In this paper we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically produce the required text in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experimental results across various datasets and models confirm CODER's effectiveness. Code is available at: https://github.com/YCaigogogo/CVPR24-CODER.
-
In this paper we delve into a novel aspect of learning novel diffusion conditions with datasets an order of magnitude smaller. The rationale behind our approach is the elimination of textual constraints during the few-shot learning process. To that end we implement two optimization strategies. The first prompt-free conditional learning utilizes a prompt-free encoder derived from a pre-trained Stable Diffusion model. This strategy is designed to adapt new conditions to the diffusion process by minimizing the textual-visual correlation thereby ensuring a more precise alignment between the generated content and the specified conditions. The second strategy entails condition-specific negative rectification which addresses the inconsistencies typically brought about by Classifier-free guidance in few-shot training contexts. Our extensive experiments across a variety of condition modalities demonstrate the effectiveness and efficiency of our framework yielding results comparable to those obtained with datasets a thousand times larger. Our codes are available at https://github.com/Yuyan9Yu/BeyondTextConstraint.
-
Existing object recognition models have been shown to lack robustness in diverse geographical scenarios due to domain shifts in design and context. Class representations need to be adapted to more accurately reflect an object concept under these shifts. In the absence of training data from target geographies we hypothesize that geographically diverse descriptive knowledge of categories can enhance robustness. For this purpose we explore the feasibility of probing a large language model for geography-based object knowledge and we examine the effects of integrating knowledge into zero-shot and learnable soft prompting with CLIP. Within this exploration we propose geography knowledge regularization to ensure that soft prompts trained on a source set of geographies generalize to an unseen target set. Accuracy gains over prompting baselines on DollarStreet while training only on Europe data are up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas and +4.6 overall on the hardest classes. Competitive performance is shown vs. few-shot target training and analysis is provided to direct future study of geographical robustness.
-
Deep neural networks are vulnerable to adversarial attacks leading to erroneous outputs. Adversarial training has been recognized as one of the most effective methods to counter such attacks. However existing adversarial training techniques have predominantly been evaluated on balanced datasets whereas real-world data often exhibit a long-tailed distribution casting doubt on the efficacy of these methods in practical scenarios. In this paper we delve into the performance of adversarial training under long-tailed distributions. Through an analysis of the prior method "RoBal" (Wu et al. CVPR'21) we discover that utilizing Balanced Softmax Loss (BSL) alone can obtain comparable performance to the complete RoBal approach while significantly reducing the training overhead. Then we reveal that adversarial training under long-tailed distributions also suffers from robust overfitting similar to uniform distributions. We explore utilizing data augmentation to mitigate this issue and unexpectedly discover that unlike results obtained with balanced data data augmentation not only effectively alleviates robust overfitting but also significantly improves robustness. We further identify that the improvement is attributed to the increased diversity of training data. Extensive experiments further corroborate that data augmentation alone can significantly improve robustness. Finally building on these findings we demonstrate that compared to RoBal the combination of BSL and data augmentation leads to a +6.66% improvement in model robustness under AutoAttack on CIFAR-10-LT. Our code is available at: https://github.com/NISPLab/AT-BSL.
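To make the Balanced Softmax Loss mentioned above concrete, the sketch below follows its commonly cited definition: each logit is shifted by the log of its class's training-sample count before a standard cross-entropy. The function name and arguments are hypothetical, and this single-sample, pure-Python version is only an illustration, not the code from the linked repository.

```python
import math


def balanced_softmax_loss(logits, label, class_counts):
    """Cross-entropy computed on logits shifted by log class frequencies.

    logits: per-class scores for one sample.
    label: index of the ground-truth class.
    class_counts: number of training samples per class (all positive).
    """
    # Adding log(n_j) compensates for the long-tailed label prior.
    shifted = [z + math.log(n) for z, n in zip(logits, class_counts)]
    # Numerically stable log-sum-exp over the shifted logits.
    m = max(shifted)
    log_norm = m + math.log(sum(math.exp(s - m) for s in shifted))
    return log_norm - shifted[label]
```

With uniform class counts the shift cancels and the loss reduces to ordinary softmax cross-entropy; with skewed counts, tail-class samples incur a larger loss for the same logits, which pushes the model to raise tail-class scores.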
-
This paper presents a new approach for the detection of fake videos based on the analysis of style latent vectors and their abnormal behavior in temporal changes in the generated videos. We discovered that the generated facial videos suffer from the temporal distinctiveness in the temporal changes of style latent vectors which are inevitable during the generation of temporally stable videos with various facial expressions and geometric transformations. Our framework utilizes the StyleGRU module trained by contrastive learning to represent the dynamic properties of style latent vectors. Additionally we introduce a style attention module that integrates StyleGRU-generated features with content-based features enabling the detection of visual and temporal artifacts. We demonstrate our approach across various benchmark scenarios in deepfake detection showing its superiority in cross-dataset and cross-manipulation scenarios. Through further analysis we also validate the importance of using temporal changes of style latent vectors to improve the generality of deepfake video detection.
-
Vision-Language Models (VLMs) such as Flamingo and GPT-4V have shown immense potential by integrating large language models with vision systems. Nevertheless these models face challenges in the fundamental computer vision task of object localisation due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom supervised training pipelines with bounding box annotations that integrate with VLMs these result in specialized and hard-to-scale models. In this paper we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end we introduce an input-agnostic Positional Insert (PIN) a learnable spatial prompt containing a minimal set of parameters that are slid inside the frozen VLM unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images including Pascal VOC COCO LVIS and diverse images like paintings or cartoons.
-
Garment manipulation (e.g. unfolding folding and hanging clothes) is essential for future robots to accomplish home-assistant tasks while highly challenging due to the diversity of garment configurations geometries and deformations. Although able to manipulate similar shaped garments in a certain task previous works mostly have to design different policies for different tasks could not generalize to garments with diverse geometries and often rely heavily on human-annotated data. In this paper we leverage the property that garments in a certain category have similar structures and then learn the topological dense (point-level) visual correspondence among garments in the category level with different deformations in the self-supervised manner. The topological correspondence can be easily adapted to the functional correspondence to guide the manipulation policies for various downstream tasks within only one or few-shot demonstrations. Experiments over garments in 3 different categories on 3 representative tasks in diverse scenarios using one or two arms taking one or more steps inputting flat or messy garments demonstrate the effectiveness of our proposed method. Project page: https://warshallrho.github.io/unigarmentmanip.
-
3D visual grounding aims to localize 3D objects described by free-form language sentences. Following the detection-then-matching paradigm existing methods mainly focus on embedding object attributes in unimodal feature extraction and multimodal feature fusion to enhance the discriminability of the proposal feature for accurate grounding. However most of them ignore the explicit interaction of multiple attributes causing a bias in unimodal representation and misalignment in multimodal fusion. In this paper we propose a multi-attribute aware Transformer for 3D visual grounding learning the multi-attribute interactions to refine the intra-modal and inter-modal grounding cues. Specifically we first develop an attribute causal analysis module to quantify the causal effect of different attributes for the final prediction which provides powerful supervision to correct the misleading attributes and adaptively capture other discriminative features. Then we design an exchanging-based multimodal fusion module which dynamically replaces tokens with low attribute attention between modalities before directly integrating low-dimensional global features. This ensures an attribute-level multimodal information fusion and helps align the language and vision details more efficiently for fine-grained multimodal features. Extensive experiments show that our method can achieve state-of-the-art performance on ScanRefer and Sr3D/Nr3D datasets.
-
Video-P2P is the first framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. We further prove that it is crucial for consistent video editing. For attention control we introduce a novel decoupled-guidance strategy which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications including word swap prompt refinement and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches.
-
Recent weakly supervised semantic segmentation (WSSS) methods strive to incorporate contextual knowledge to improve the completeness of class activation maps (CAM). In this work we argue that the knowledge bias between instances and contexts affects the capability of the prototype to sufficiently understand instance semantics. Inspired by prototype learning theory we propose leveraging prototype awareness to capture diverse and fine-grained feature attributes of instances. The hypothesis is that contextual prototypes might erroneously activate similar and frequently co-occurring object categories due to this knowledge bias. Therefore we propose to enhance the prototype representation ability by mitigating the bias to better capture spatial coverage in semantic object regions. With this goal we present a Context Prototype-Aware Learning (CPAL) strategy which leverages semantic context to enrich instance comprehension. The core of this method is to accurately capture intra-class variations in object features through context-aware prototypes facilitating the adaptation to the semantic attributes of various instances. We design feature distribution alignment to optimize prototype awareness aligning instance feature distributions with dense features. In addition a unified training framework is proposed to combine label-guided classification supervision and prototypes-guided self-supervision. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show that CPAL significantly improves off-the-shelf methods and achieves state-of-the-art performance. The project is available at https://github.com/Barrett-python/CPAL.
-
In this paper, we explore the potential of the Snapshot Compressive Imaging (SCI) technique for recovering the underlying 3D scene representation from a single temporal compressed image. SCI is a cost-effective method that enables the recording of high-dimensional data, such as hyperspectral or temporal information, into a single image using low-cost 2D imaging sensors. To achieve this, a series of specially designed 2D masks are usually employed, which not only reduces storage requirements but also offers potential privacy protection. Inspired by this, to take one step further, our approach builds upon the powerful 3D scene representation capabilities of neural radiance fields (NeRF). Specifically, we formulate the physical imaging process of SCI as part of the training of NeRF, allowing us to exploit its impressive performance in capturing complex scene structures. To assess the effectiveness of our method, we conduct extensive evaluations using both synthetic data and real data captured by our SCI system. Extensive experimental results demonstrate that our proposed approach surpasses the state-of-the-art methods in terms of image reconstruction and novel view image synthesis. Moreover, our method also exhibits the ability to restore high frame-rate multi-view consistent images by leveraging SCI and the rendering capabilities of NeRF. The code is available at https://github.com/WU-CVGL/SCINeRF.
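The abstract does not give the SCI imaging formula. A common way to write it, assumed here, is that the single measurement is the sum over time of each frame multiplied element-wise by its 2D mask. The minimal sketch below illustrates that forward model on a toy 2x2 example; the function name, frame values, and binary masks are hypothetical and are not taken from the SCINeRF code.

```python
# Toy SCI forward model: one 2D measurement from T masked frames.
# Assumption: measurement = sum over t of (mask_t * frame_t), element-wise.

def sci_measurement(frames, masks):
    """Sum the element-wise products of each frame with its 2D mask."""
    if len(frames) != len(masks) or not frames:
        raise ValueError("need the same non-zero number of frames and masks")
    height, width = len(frames[0]), len(frames[0][0])
    measurement = [[0.0] * width for _ in range(height)]
    for frame, mask in zip(frames, masks):
        for r in range(height):
            for c in range(width):
                measurement[r][c] += mask[r][c] * frame[r][c]
    return measurement


# Two 2x2 frames and two complementary binary masks (illustrative values).
frames = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
masks = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
compressed = sci_measurement(frames, masks)
```

Under this assumption, training a scene representation through the imaging process amounts to rendering the frames, applying the same masks, and comparing the summed result with the captured measurement.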
-
We show that physics-based simulations can be seamlessly integrated with NeRF to generate high-quality elastodynamics of real-world objects. Unlike existing methods we discretize nonlinear hyperelasticity in a meshless way obviating the necessity for intermediate auxiliary shape proxies like a tetrahedral mesh or voxel grid. A quadratic generalized moving least square is employed to capture nonlinear dynamics and large deformation on the implicit model. Such meshless integration enables versatile simulations of complex and codimensional shapes. We adaptively place the least-square kernels according to the NeRF density field to significantly reduce the complexity of the nonlinear simulation. As a result physically realistic animations can be conveniently synthesized using our method for a wide range of hyperelastic materials at an interactive rate. For more information please visit https://fytalon.github.io/pienerf.
-
Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization ("grounding") abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model, and SelfEQ, a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically, for an input textual phrase, we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g., GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline method and several prior works. In particular, compared to other methods that do not use any type of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10% and 55.49% on RefCOCO+ test sets A and B, respectively (an absolute improvement of 3.74% on average).
-
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding. To address these challenges, we introduce Monkey to enhance LMM capabilities. First, Monkey processes input images by dividing them into uniform patches, each matching the size (e.g., 448x448) used in the original training of the well-trained vision encoder. Equipped with an individual adapter for each patch, Monkey can handle higher resolutions of up to 1344x896 pixels, enabling the detailed capture of complex visual information. Second, it employs a multi-level description generation method, enriching the context for scene-object associations. This two-part strategy ensures more effective learning from generated data: the higher resolution allows for a more detailed capture of visuals, which in turn enhances the effectiveness of comprehensive descriptions. Extensive ablative results validate the effectiveness of our designs. Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks, such as Image Captioning and various Visual Question Answering formats. In particular, in qualitative tests focused on dense text question answering, Monkey has exhibited encouraging results compared with GPT4V. Code is available at https://github.com/Yuliang-Liu/Monkey.
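To make the patch arithmetic concrete, the sketch below tiles a 1344x896 image into 448x448 patches. It only illustrates the division step described in the abstract. The function name and the row-by-row ordering are assumptions, and the per-patch adapters and the vision encoder are not modelled.

```python
# Tile an image into uniform square patches (hypothetical helper, not Monkey's API).

def patch_boxes(width, height, patch=448):
    """Return (left, top, right, bottom) boxes tiling the image row by row."""
    if width % patch or height % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [
        (x, y, x + patch, y + patch)
        for y in range(0, height, patch)
        for x in range(0, width, patch)
    ]


boxes = patch_boxes(1344, 896)
print(len(boxes))  # 3 columns x 2 rows = 6 patches
```

At the maximum resolution quoted in the abstract, this tiling yields six encoder-sized patches.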
-
We propose FlashAvatar a novel and lightweight 3D animatable avatar representation that could reconstruct a digital avatar from a short monocular video sequence in minutes and render high-fidelity photo-realistic images at 300FPS on a consumer-grade GPU. To achieve this we maintain a uniform 3D Gaussian field embedded in the surface of a parametric face model and learn extra spatial offset to model non-surface regions and subtle facial details. While full use of geometric priors can capture high-frequency facial details and preserve exaggerated expressions proper initialization can help reduce the number of Gaussians thus enabling super-fast rendering speed. Extensive experimental results demonstrate that FlashAvatar outperforms existing works regarding visual quality and personalized details and is almost an order of magnitude faster in rendering speed. Project page: https://ustc3dv.github.io/FlashAvatar/
-
Scene flow estimation which aims to predict per-point 3D displacements of dynamic scenes is a fundamental task in the computer vision field. However previous works commonly suffer from unreliable correlation caused by locally constrained searching ranges and struggle with accumulated inaccuracy arising from the coarse-to-fine structure. To alleviate these problems we propose a novel uncertainty-aware scene flow estimation network (DifFlow3D) with the diffusion probabilistic model. Iterative diffusion-based refinement is designed to enhance the correlation robustness and resilience to challenging cases e.g. dynamics noisy inputs repetitive patterns etc. To restrain the generation diversity three key flow-related features are leveraged as conditions in our diffusion model. Furthermore we also develop an uncertainty estimation module within diffusion to evaluate the reliability of estimated scene flow. Our DifFlow3D achieves state-of-the-art performance with 24.0% and 29.1% EPE3D reduction respectively on FlyingThings3D and KITTI 2015 datasets. Notably our method achieves an unprecedented millimeter-level accuracy (0.0078m in EPE3D) on the KITTI dataset. Additionally our diffusion-based refinement paradigm can be readily integrated as a plug-and-play module into existing scene flow networks significantly increasing their estimation accuracy. Codes are released at https://github.com/IRMVLab/DifFlow3D.
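EPE3D, the metric quoted above, is commonly defined as the mean Euclidean distance between predicted and ground-truth 3D flow vectors. The sketch below implements that definition on plain tuples. The function name and sample values are illustrative, and any masking or filtering used by specific benchmarks is not modelled.

```python
import math


def epe3d(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("need two non-empty flow lists of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_flow, gt_flow))
    return total / len(pred_flow)


# Two points with per-point errors of 0.01 m and 0.05 m (illustrative values).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.01), (1.0, 0.03, 0.04)]
error = epe3d(pred, gt)
```

With flow expressed in meters, an EPE3D of 0.0078 corresponds to an average per-point error of 7.8 mm.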
-
While standard Empirical Risk Minimization (ERM) training is proven effective for image classification on in-distribution data it fails to perform well on out-of-distribution samples. One of the main sources of distribution shift for image classification is the compositional nature of images. Specifically in addition to the main object or component(s) determining the label some other image components usually exist which may lead to the shift of input distribution between train and test environments. More importantly these components may have spurious correlations with the label. To address this issue we propose Decompose-and-Compose (DaC) which improves robustness to correlation shift by a compositional approach based on combining elements of images. Based on our observations models trained with ERM usually highly attend to either the causal components or the components having a high spurious correlation with the label (especially in datapoints on which models have a high confidence). In fact according to the amount of spurious correlation and the easiness of classification based on the causal or non-causal components the model usually attends to one of these more (on samples with high confidence). Following this we first try to identify the causal components of images using class activation maps of models trained with ERM. Afterward we intervene on images by combining them and retraining the model on the augmented data including the counterfactual ones. This work proposes a group-balancing method by intervening on images without requiring group labels or information regarding the spurious features during training. The method has an overall better worst group accuracy compared to previous methods with the same amount of supervision on the group labels in correlation shift. Our code is available at https://github.com/fhn98/DaC.
-
In recent years, there has been significant progress in the development of text-to-image generative models. Evaluating the quality of the generative models is one essential step in the development process. Unfortunately, the evaluation process could consume a significant amount of computational resources, making the required periodic evaluation of model performance (e.g., monitoring training progress) impractical. Therefore, we seek to improve the evaluation efficiency by selecting a representative subset of the text-image dataset. We systematically investigate the design choices, including the selection criteria (textural features or image-based metrics) and the selection granularity (prompt-level or set-level). We find that the insights from prior work on subset selection for training data do not generalize to this problem, and we propose FlashEval, an iterative search algorithm tailored to evaluation data selection. We demonstrate the effectiveness of FlashEval on ranking diffusion models with various configurations, including architectures, quantization levels, and sampler schedules, on the COCO and DiffusionDB datasets. Our searched 50-item subset could achieve comparable evaluation quality to the randomly sampled 500-item subset for COCO annotations on unseen models, achieving a 10x evaluation speedup. We release the condensed subset of these commonly used datasets to help facilitate diffusion algorithm design and evaluation, and open-source FlashEval as a tool for condensing future datasets, accessible at https://github.com/thu-nics/FlashEval.
-
ZERO-IG: Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images
This paper presents a novel zero-shot method for jointly denoising and enhancing real-world low-light images. The proposed method is independent of training data and noise distribution. Guided by illumination, we integrate denoising and enhancing processes seamlessly, enabling end-to-end training. Pairs of downsampled images are extracted from a single original low-light image and processed to preliminarily reduce noise. Based on the smoothness of illumination, near-authentic illumination can be estimated from the denoised low-light image. Specifically, the illumination is constrained by the denoised image's brightness, uniformly amplifying pixels to raise overall brightness to normal-light level. We simultaneously restrict the illumination by scaling each pixel of the denoised image based on its intensity, controlling the enhancement amplitude for different pixels. Applying the illumination to the original low-light image yields an adaptively enhanced reflection. This prevents under-enhancement and localized overexposure. Notably, we concatenate the reflection with the illumination, preserving their computational relationship, to ultimately remove noise from the original low-light image in the form of reflection. This provides sufficient image information for the denoising procedure without changing the noise characteristics. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. The source code is available at https://github.com/Doyle59217/ZeroIG.
-
This paper presents a novel aerial-to-ground feature aggregation strategy tailored for the task of cross-view image-based geo-localization. Conventional vision-based methods heavily rely on matching ground-view image features with a pre-recorded image database often through establishing planar homography correspondences via a planar ground assumption. As such they tend to ignore features that are off-ground and not suited for handling visual occlusions leading to unreliable localization in challenging scenarios. We propose a Top-to-Ground Aggregation module that capitalizes aerial orthographic views to aggregate features down to the ground level leveraging reliable off-ground information to improve feature alignment. Furthermore we introduce a Cycle Domain Adaptation loss that ensures feature extraction robustness across domain changes. Additionally an Equidistant Re-projection loss is introduced to equalize the impact of all keypoints on orientation error leading to a more extended distribution of keypoints which benefits orientation estimation. On both KITTI and Ford Multi-AV datasets our method consistently achieves the lowest mean longitudinal and lateral translations across different settings and obtains the smallest orientation error when the initial pose is less accurate a more challenging setting. Further it can complete an entire route through continual vehicle pose estimation with initial vehicle pose given only at the starting point.
-
The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans missing out on valuable implicit supervision to guide the 3D HPE task. Moreover previous efforts often study this task from the perspective of the whole human body neglecting fine-grained guidance hidden in different body parts. To this end we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE named FinePOSE. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.
-
Data mixing methods play a crucial role in semi-supervised learning (SSL) but their application is unexplored in long-tailed semi-supervised learning (LTSSL). The primary reason is that the in-batch mixing manner fails to address class imbalance. Furthermore existing LTSSL methods mainly focus on re-balancing data quantity but ignore class-wise uncertainty which is also vital for class balance. For instance some classes with sufficient samples might still exhibit high uncertainty due to indistinguishable features. To this end this paper introduces the Balanced and Entropy-based Mix (BEM) a pioneering mixing approach to re-balance the class distribution of both data quantity and uncertainty. Specifically we first propose a class balanced mix bank to store data of each class for mixing. This bank samples data based on the estimated quantity distribution thus re-balancing data quantity. Then we present an entropy-based learning approach to re-balance class-wise uncertainty including entropy-based sampling strategy entropy-based selection module and entropy-based class balanced loss. Our BEM first leverages data mixing for improving LTSSL and it can also serve as a complement to the existing re-balancing methods. Experimental results show that BEM significantly enhances various LTSSL frameworks and achieves state-of-the-art performances across multiple benchmarks.
-
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis, parsing semantic labels, and tracking moving objects. Despite considerable progress, existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding. Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving object poses are regularized via physical constraints. Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy, and to reconstruct dynamic scenes even in scenarios where 3D bounding box detections are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 demonstrate the effectiveness of our approach. Our project page is at https://xdimlab.github.io/hugs_website.
-
Recent methods such as Score Distillation Sampling (SDS) and Variational Score Distillation (VSD) using 2D diffusion models for text-to-3D generation have demonstrated impressive generation quality. However the long generation time of such algorithms significantly degrades the user experience. To tackle this problem we propose DreamPropeller a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. Our framework generalizes Picard iterations a classical algorithm for parallel sampling an ODE path and can account for non-ODE paths such as momentum-based gradient updates and changes in dimensions during the optimization process as in many cases of 3D generation. We show that our algorithm trades parallel compute for wallclock time and empirically achieves up to 4.7x speedup with a negligible drop in generation quality for all tested frameworks.
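As background for the Picard iterations mentioned above, the sketch below applies the classical iteration to a forward-Euler discretization of a scalar ODE. It is not DreamPropeller's algorithm: the function names, the Euler discretization, and the test ODE dx/dt = -x are assumptions chosen for illustration. The point it shows is that every drift evaluation within a sweep depends only on the previous sweep, so those evaluations could run in parallel.

```python
# Classical Picard iteration on an Euler-discretized ODE (illustrative sketch).

def sequential_euler(f, x0, h, steps):
    """Baseline: each step waits for the previous one."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + h * f(xs[-1]))
    return xs


def picard_euler(f, x0, h, steps, tol=1e-12):
    """Refine the whole trajectory at once; return it with the sweep count."""
    xs = [x0] * (steps + 1)
    sweeps = 0
    for _ in range(steps):
        sweeps += 1
        # These drift evaluations are independent of one another,
        # which is where parallel compute would be spent.
        drifts = [f(x) for x in xs[:-1]]
        new_xs = [x0]
        running = 0.0
        for d in drifts:
            running += h * d
            new_xs.append(x0 + running)
        change = max(abs(a - b) for a, b in zip(new_xs, xs))
        xs = new_xs
        if change < tol:
            break
    return xs, sweeps


def decay(x):
    return -x


reference = sequential_euler(decay, 1.0, 0.1, 10)
trajectory, sweeps = picard_euler(decay, 1.0, 0.1, 10)
```

In this discretized setting the iteration matches the sequential result after at most as many sweeps as there are steps, since each sweep fixes at least one more trajectory point. Any wall-clock gain depends on the iteration converging in far fewer sweeps and on the drift evaluations actually running in parallel.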
-
Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However, due to a large gap between vision and text, they might not be able to sufficiently utilize the benefits of cross-modality information. In the field of human action recognition, the additional pose modality may bridge the gap between vision and text to improve the effectiveness of cross-modality learning. In this paper, we propose a novel framework, called the Pose-enhanced Vision-Language (PeVL) model, to adapt the VL model with pose modality to learn effective knowledge of fine-grained human actions. Our PeVL model includes two novel components: an Unsymmetrical Cross-Modality Refinement (UCMR) block and a Semantic-Guided Multi-level Contrastive (SGMC) module. The UCMR block includes Pose-guided Visual Refinement (P2V-R) and Visual-enriched Pose Refinement (V2P-R) for effective cross-modality learning. The SGMC module includes Multi-level Contrastive Associations of vision-text and pose-text at both action and sub-action levels, and a Semantic-Guided Loss, enabling effective contrastive learning with text. Built upon a pre-trained VL foundation model, our model integrates trainable adapters and can be trained end-to-end. Our novel PeVL design over the VL foundation model yields remarkable performance gains on four fine-grained human action recognition datasets, achieving a new SOTA with a significantly small number of FLOPs for low-cost re-training.
-
Diffusion models have recently gained unprecedented attention in the field of image synthesis due to their remarkable generative capabilities. Notwithstanding their prowess, these models often incur substantial computational costs, primarily attributed to the sequential denoising process and cumbersome model size. Traditional methods for compressing diffusion models typically involve extensive retraining, presenting cost and feasibility challenges. In this paper, we introduce DeepCache, a novel training-free paradigm that accelerates diffusion models from the perspective of model architecture. DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models, caching and retrieving features across adjacent denoising stages and thereby curtailing redundant computations. Utilizing the property of the U-Net, we reuse the high-level features while updating the low-level features in a very cheap way. This innovative strategy, in turn, enables a speedup factor of 2.3x for Stable Diffusion v1.5 with only a 0.05 decline in CLIP Score, and 4.1x for LDM-4-G with a slight decrease of 0.22 in FID on ImageNet. Our experiments also demonstrate DeepCache's superiority over existing pruning and distillation methods that necessitate retraining, and its compatibility with current sampling techniques. Furthermore, we find that under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
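The caching idea can be sketched with a toy loop: an expensive "deep" computation is refreshed only every few steps and its cached output is reused in between, while a cheap "shallow" update runs at every step. Everything in this sketch is a hypothetical stand-in (the function names, the fixed refresh interval, and the scalar arithmetic); it does not reproduce DeepCache's U-Net wiring or its actual caching schedule.

```python
# Toy feature caching across adjacent steps (stand-in functions, not a real U-Net).

def run_with_cache(x, steps, deep_fn, shallow_fn, interval=2):
    """Run `steps` updates, recomputing deep features every `interval` steps."""
    deep_calls = 0
    cached = None
    for step in range(steps):
        if step % interval == 0:
            cached = deep_fn(x)  # expensive branch, refreshed occasionally
            deep_calls += 1
        x = shallow_fn(x, cached)  # cheap branch, runs every step
    return x, deep_calls


def deep_features(x):
    return 0.5 * x  # stand-in for the costly high-level branch


def shallow_update(x, features):
    return x - 0.1 * features  # stand-in for the cheap per-step update


full_result, full_calls = run_with_cache(1.0, 10, deep_features, shallow_update, interval=1)
cached_result, cached_calls = run_with_cache(1.0, 10, deep_features, shallow_update, interval=5)
```

With `interval=5`, the expensive branch runs twice over ten steps instead of ten times. The output differs slightly from the uncached run, which mirrors the small quality decline the abstract reports alongside the speedup.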
-
Point clouds captured by different sensors, such as RGB-D cameras and LiDAR, possess non-negligible domain gaps. Most existing methods design different network architectures and train separately on point clouds from various sensors. Typically, point-based methods achieve outstanding performance on evenly distributed dense point clouds from RGB-D cameras, while voxel-based methods are more efficient for large-range sparse LiDAR point clouds. In this paper, we propose geometry-to-voxel auxiliary learning to enable voxel representations to access point-level geometric information, which supports better generalisation of the voxel-based backbone with additional interpretations of multi-sensor point clouds. Specifically, we construct hierarchical geometry pools generated by a voxel-guided dynamic point network, which efficiently provide auxiliary fine-grained geometric information adapted to different stages of voxel features. We conduct experiments on joint multi-sensor datasets to demonstrate the effectiveness of GeoAuxNet. Enjoying elaborate geometric information, our method outperforms other models collectively trained on multi-sensor datasets and achieves competitive results with state-of-the-art experts on each single dataset.
-
Humans possess a remarkable ability to integrate auditory and visual information enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues demonstrated through cognitive psychology and neuroscience research offers promising potential for developing multimodal perception models. However training early fusion architectures poses significant challenges as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper we address this challenge by leveraging the masked reconstruction framework previously successful in unimodal settings to train audio-visual encoders with early fusion. Additionally we propose an attention-based fusion module that captures interactions between local audio and visual representations enhancing the model's ability to capture fine-grained interactions. While effective this procedure can become computationally intractable as the number of local representations increases. Thus to address the computational complexity we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions. Extensive evaluations on a variety of datasets demonstrate the superiority of our approach in audio-event classification visual sound localization sound separation and audio-visual segmentation. These contributions enable the efficient training of deeply integrated audio-visual models and significantly advance the usefulness of early fusion architectures.
-
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos, such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
-
Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than past approaches. While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to intricate temporal dynamics modeling, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics for DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which includes (step-1) extracting from input text the key actions with proper time-order arrangement, (step-2) transforming the action schedules into the dynamic scene graph (DSG) representations, and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features, integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets suggest that our Dysen-VDM consistently outperforms prior art by significant margins, especially in scenarios with complex actions.
-
Understanding the anatomy of renal pathology is crucial for advancing disease diagnostics treatment evaluation and clinical research. The complex kidney system comprises various components across multiple levels including regions (cortex medulla) functional units (glomeruli tubules) and cells (podocytes mesangial cells in glomerulus). Prior studies have predominantly overlooked the intricate spatial interrelations among objects from clinical knowledge. In this research we introduce a novel universal proposition learning approach called panoramic renal pathology segmentation (PrPSeg) designed to segment comprehensively panoramic structures within kidney by integrating extensive knowledge of kidney anatomy. In this paper we propose (1) the design of a comprehensive universal proposition matrix for renal pathology facilitating the incorporation of classification and spatial relationships into the segmentation process; (2) a token-based dynamic head single network architecture with the improvement of the partial label image segmentation and capability for future data enlargement; and (3) an anatomy loss function quantifying the inter-object relationships across the kidney.
-
In this work, we present RepKPU, an efficient network for point cloud upsampling. We propose to promote upsampling performance by exploiting better shape representation and point generation strategy. Inspired by KPConv, we propose a novel representation called RepKPoints to effectively characterize the local geometry, whose advantages over prior representations are as follows: (1) density-sensitive; (2) large receptive fields; (3) position-adaptive, which makes RepKPoints a generalized form of previous representations. Moreover, we propose a novel paradigm, namely Kernel-to-Displacement generation, for point generation, where point cloud upsampling is reformulated as the deformation of kernel points. Specifically, we propose KP-Queries, a set of kernel points with predefined positions and learned features, to serve as the initial state of upsampling. Using cross-attention mechanisms, we achieve interactions between RepKPoints and KP-Queries; subsequently, KP-Queries are converted to displacement features, followed by an MLP to predict the new positions of KP-Queries, which serve as the generated points. Extensive experimental results demonstrate that RepKPU outperforms state-of-the-art methods on several widely-used benchmark datasets with high efficiency.
-
While recent Vision-Language (VL) models excel at open-vocabulary tasks it is unclear how to use them with specific or uncommon concepts. Personalized Text-to-Image Retrieval (TIR) or Generation (TIG) are recently introduced tasks that represent this challenge where the VL model has to learn a concept from few images and respectively discriminate or generate images of the target concept in arbitrary contexts. We identify the ability to learn new meanings and their compositionality with known ones as two key properties of a personalized system. We show that the available benchmarks offer a limited validation of personalized textual concept learning from images with respect to the above properties and introduce ConCon-Chi as a benchmark for both personalized TIR and TIG designed to fill this gap. We modelled the new-meaning concepts by crafting chimeric objects and formulating a large varied set of contexts where we photographed each object. To promote the compositionality assessment of the learned concepts with known contexts we combined different contexts with the same concept and vice-versa. We carry out a thorough evaluation of state-of-the-art methods on the resulting dataset. Our study suggests that future work on personalized TIR and TIG methods should focus on the above key properties and we propose principles and a dataset for their performance assessment. Dataset: https://doi.org/10.48557/QJ1166 and code: https://github.com/hsp-iit/concon-chi_benchmark.
-
In this paper we address the weakly-supervised Audio-Visual Video Parsing (AVVP) problem which aims at labeling events in a video as audible visible or both and temporally localizing and classifying them into known categories. This is challenging since we only have access to video-level (weak) event labels when training but need to predict event labels at the segment (frame) level at test time. Recent methods employ multiple-instance learning (MIL) techniques that tend to focus solely on the most discriminative segments resulting in frequent misclassifications. Our idea is to first construct several prototype features for each event class by clustering key segments identified for the event in the training data. We then assign pseudo labels to all training segments based on their feature similarities with these prototypes and re-train the model under weak and strong supervision. We facilitate this by structuring the feature space with contrastive learning using pseudo labels. Experiments show that we outperform existing methods for weakly-supervised AVVP. We also show that learning with weak and iteratively re-estimated pseudo labels can be interpreted as an expectation-maximization (EM) algorithm providing further insight for our training procedure.
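The pseudo-labeling step can be illustrated with a small sketch: each segment receives the classes whose prototype it resembles closely enough. The abstract only states that labels are assigned from feature similarities with the prototypes. The use of cosine similarity, the fixed threshold, the restriction to the video-level labels, and all names below are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pseudo_labels(segments, prototypes, video_labels, threshold=0.6):
    """Give each segment the video-level classes whose prototype is similar enough."""
    return [
        {c for c in video_labels if cosine(seg, prototypes[c]) >= threshold}
        for seg in segments
    ]


# Hypothetical 2D features: one prototype per event class, three segments.
prototypes = {"dog": [1.0, 0.0], "speech": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
labels = pseudo_labels(segments, prototypes, {"dog", "speech"})
```

Segment-level labels produced this way could then supply the stronger supervision used when re-training, with the prototypes re-estimated between rounds.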
-
Surgical decisions are informed by aligning rapid portable 2D intraoperative images (e.g. X-rays) to a high-fidelity 3D preoperative reference scan (e.g. CT). However 2D/3D registration can often fail in practice: conventional optimization methods are prohibitively slow and susceptible to local minima while neural networks trained on small datasets fail on new patients or require impractical landmark supervision. We present DiffPose a self-supervised approach that leverages patient-specific simulation and differentiable physics-based rendering to achieve accurate 2D/3D registration without relying on manually labeled data. Preoperatively a CNN is trained to regress the pose of a randomly oriented synthetic X-ray rendered from the preoperative CT. The CNN then initializes rapid intraoperative test-time optimization that uses the differentiable X-ray renderer to refine the solution. Our work further proposes several geometrically principled methods for sampling camera poses from SE(3) for sparse differentiable rendering and for driving registration in the tangent space se(3) with geodesic and multiscale locality-sensitive losses. DiffPose achieves sub-millimeter accuracy across surgical datasets at intraoperative speeds improving upon existing unsupervised methods by an order of magnitude and even outperforming supervised baselines. Our implementation is at https://github.com/eigenvivek/DiffPose.
-
Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names) recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task where given a caption with blanks the goal is to predict person id labels. However to predict captions with ids a two-stage approach is required: first predict captions with someone then fill in identities. In this work we present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks. Our model Movie-Identity Captioner (MICap) uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end we introduce iSPICE a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on Large-Scale Movie Description Challenge (LSMDC) where we show a 4.2% improvement in FITB accuracy and a 1-2% bump in classic captioning metrics.
-
3D object detection and pose estimation from a single-view image is challenging due to the high uncertainty caused by the absence of 3D perception. As a solution recent monocular 3D detection methods leverage additional modalities such as stereo image pairs and LiDAR point clouds to enhance image features at the expense of additional annotation costs. We propose using diffusion models to learn effective representations for monocular 3D detection without additional modalities or training data. We present MonoDiff a novel framework that employs the reverse diffusion process to estimate 3D bounding box and orientation. But considering the variability in bounding box sizes along different dimensions it is ineffective to sample noise from a standard Gaussian distribution. Hence we adopt a Gaussian mixture model to sample noise during the forward diffusion process and initialize the reverse diffusion process. Furthermore since the diffusion model generates the 3D parameters for a given object image we leverage 2D detection information to provide additional supervision by maintaining the correspondence between 3D/2D projection. Finally depending on the signal-to-noise ratio we incorporate a dynamic weighting scheme to account for the level of uncertainty in the supervision by projection at different timesteps. MonoDiff outperforms current state-of-the-art monocular 3D detection methods on the KITTI and Waymo benchmarks without additional depth priors. MonoDiff project is available at: https://dylran.github.io/monodiff.github.io.
-
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open-world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The models and code are released at https://github.com/FoundationVision/GLEE.
-
Heterogeneous Federated Learning (HtFL) enables collaborative learning on multiple clients with different model architectures while preserving privacy. Despite recent research progress knowledge sharing in HtFL is still difficult due to data and model heterogeneity. To tackle this issue we leverage the knowledge stored in pre-trained generators and propose a new upload-efficient knowledge transfer scheme called Federated Knowledge-Transfer Loop (FedKTL). Our FedKTL can produce client-task-related prototypical image-vector pairs via the generator's inference on the server. With these pairs each client can transfer pre-existing knowledge from the generator to its local model through an additional supervised local task. We conduct extensive experiments on four datasets under two types of data heterogeneity with 14 kinds of models including CNNs and ViTs. Results show that our upload-efficient FedKTL surpasses seven state-of-the-art methods by up to 7.31% in accuracy. Moreover our knowledge transfer scheme is applicable in scenarios with only one edge client. Code: https://github.com/TsingZ0/FedKTL
-
We introduce MeshGPT a new approach for generating triangle meshes that reflects the compactness typical of artist-created meshes in contrast to dense triangle meshes extracted by iso-surfacing methods from neural fields. Inspired by recent advances in powerful large language models we adopt a sequence-based approach to autoregressively generate triangle meshes as sequences of triangles. We first learn a vocabulary of latent quantized embeddings using graph convolutions which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder ensuring that they can effectively reconstruct the mesh. A transformer is then trained on this learned vocabulary to predict the index of the next embedding given previous embeddings. Once trained our model can be autoregressively sampled to generate new triangle meshes directly generating compact meshes with sharp edges more closely imitating the efficient triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable improvement over state of the art mesh generation methods with a 9% increase in shape coverage and a 30-point enhancement in FID scores across various categories.
-
Inlier estimation constitutes a pivotal step in partially overlapping point cloud registration. Existing methods broadly follow a coordinate-based scheme, where inlier confidence is scored by simply capturing coordinate differences in the context. However, this scheme readily misinterprets a large number of inliers, which consequently degrades registration performance. In this paper, we explore a new definition called inlier confidence calibration (ICC) to alleviate the above issues. Firstly, we provide fine initial correspondences for ICC in order to generate a high-quality reference point cloud copy corresponding to the source point cloud. In particular, we develop a soft assignment matrix optimization theorem that offers faster speed and greater precision compared to Sinkhorn. Benefiting from the high-quality reference copy, we argue that the neighborhood patch formed by an inlier and its neighborhood should be consistent between the source point cloud and its reference copy. Based on this insight, we construct transformation-invariant geometric constraints and capture geometric structure consistency to calibrate inlier confidence for estimated correspondences between the source point cloud and its reference copy. Finally, the transformation is calculated by the weighted SVD algorithm with the calibrated inlier confidence. Our model is trained in an unsupervised manner, and extensive experiments on synthetic and real-world datasets illustrate the effectiveness of the proposed method.
-
As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of the specific instance during navigation. In this work, we propose to imitate the human behaviour of "getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration, verification, and exploitation actions, thereby helping the agent make reasonable decisions under different situations. On the challenging Habitat-Matterport 3D semantic (HM3DSEM) dataset, our method surpasses previous state-of-the-art work with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success). Our code will be made publicly available at https://github.com/XiaohanLei/IEVE.
-
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
Recent advancements in open-world 3D object generation have been remarkable with image-to-3D methods offering superior fine-grained control over their text-to-3D counterparts. However most existing models fall short in simultaneously providing rapid generation speeds and high fidelity to input images - two features essential for practical applications. In this paper we present One-2-3-45++ an innovative method that transforms a single image into a detailed 3D textured mesh in approximately one minute. Our approach aims to fully harness the extensive knowledge embedded in 2D diffusion models and priors from valuable yet limited 3D data. This is achieved by initially finetuning a 2D diffusion model for consistent multi-view image generation followed by elevating these images to 3D with the aid of multi-view-conditioned 3D native diffusion models. Extensive experimental evaluations demonstrate that our method can produce high-quality diverse 3D assets that closely mirror the original input image.
-
Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step and can motivate their logical reasoning ability. While effective for logical tasks CoT is not conducive to creative problem-solving which often requires out-of-box thoughts and is crucial for innovation advancements. In this paper we explore the Leap-of-Thought (LoT) abilities within LLMs -- a non-sequential creative paradigm involving strong associations and knowledge leaps. To this end we study LLMs on the popular Oogiri game which needs participants to have good creativity and strong associative thinking for responding unexpectedly and humorously to the given image text or both and thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the Oogiri game we first build a multimodal and multilingual Oogiri-GO dataset which contains over 130000 samples from the Oogiri game and observe the insufficient LoT ability or failures of most existing LLMs on the Oogiri game. Accordingly we introduce a creative Leap-of-Thought (CLoT) paradigm to improve LLM's LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train pretrained LLM for achieving certain LoT humor generation and discrimination abilities. Then CLoT designs an explorative self-refinement that encourages the LLM to generate more creative LoT data via exploring parallels between seemingly unrelated concepts and selects high-quality data to train itself for self-refinement. CLoT not only excels in humor generation in the Oogiri game as shown in Fig. 1 but also boosts creative abilities in various tasks like "cloud guessing game" and "divergent association task". These findings advance our understanding and offer a pathway to improve LLMs' creative capacities for innovative applications across domains. The dataset code and models have been released online: https://zhongshsh.github.io/CLoT.
-
Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However identifying objects and their parts only constitutes an intermediate step towards a more fine-grained goal which is effectively interacting with the functional interactive elements (e.g. handles knobs buttons) in the scene to accomplish diverse tasks. To this end we introduce SceneFun3D a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information describing how to interact with these elements and a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset we introduce three novel tasks namely functionality segmentation task-driven affordance grounding and 3D motion estimation and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding methods.
-
We present Readout Guidance a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads lightweight networks trained to extract signals from the features of a pre-trained frozen diffusion model at every timestep. These readouts can encode single-image properties such as pose depth and edges; or higher-order properties that relate multiple images such as correspondence and appearance similarity. Furthermore by comparing the readout estimates to a user-defined target and back-propagating the gradient through the readout head these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation Readout Guidance requires significantly fewer added parameters and training samples and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation identity-consistent generation and spatially aligned control.
-
Large-scale diffusion generative models are greatly simplifying image video and 3D asset creation from user provided text prompts and images. However the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D which features a novel two-stage approach for text-to-4D synthesis leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study we demonstrate that our approach significantly advances image and motion quality 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images without the need to modify the motion learning stage. Thus our method offers for the first time a unified approach for text-to-4D image-to-4D and personalized 4D generation tasks.
-
We present GaussianAvatar an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. We start by introducing animatable 3D Gaussians to explicitly represent humans in various poses and clothing styles. Such an explicit and animatable representation can fuse 3D appearances more efficiently and consistently from 2D observations. Our representation is further augmented with dynamic properties to support pose-dependent appearance modeling where a dynamic appearance network along with an optimizable feature tensor is designed to learn the motion-to-appearance mapping. Moreover by leveraging the differentiable motion condition our method enables a joint optimization of motions and appearances during avatar modeling which helps to tackle the long-standing issue of inaccurate motion estimation in monocular settings. The efficacy of GaussianAvatar is validated on both the public dataset and our collected dataset demonstrating its superior performances in terms of appearance quality and rendering efficiency.
-
Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras. This task has practical applications in various fields such as visual surveillance crowd behavior analysis and anomaly detection. However due to the difficulty and cost of collecting and labeling data existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue we present MTMMC a real-world large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time weather and season conditions. This dataset provides a challenging test bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets benefiting independent fields such as person detection re-identification and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets models and test server will be made publicly available.
-
Extending large image-text pre-trained models (e.g. CLIP) for video understanding has made significant advancements. To enable the capability of CLIP to perceive dynamic information in videos existing works are dedicated to equipping the visual encoder with various temporal modules. However these methods exhibit "asymmetry" between the visual and textual sides with neither temporal descriptions in input texts nor temporal modules in text encoder. This limitation hinders the potential of language supervision emphasized in CLIP and restricts the learning of temporal features as the text encoder has demonstrated limited proficiency in motion understanding. To address this issue we propose leveraging "MoTion-Enhanced Descriptions" (MoTED) to facilitate the extraction of distinctive temporal features in videos. Specifically we first generate discriminative motion-related descriptions via querying GPT-4 to compare easy-confusing action categories. Then we incorporate both the visual and textual encoders with additional perception modules to process the video frames and generated descriptions respectively. Finally we adopt a contrastive loss to align the visual and textual motion features. Extensive experiments on five benchmarks show that MoTED surpasses state-of-the-art methods with convincing gaps laying a solid foundation for empowering CLIP with strong temporal modeling.
-
Patch-based adversarial attacks have been shown to compromise the robustness and reliability of computer vision systems. However, their conspicuous and easily detectable nature challenges their practicality in real-world settings. To address this, recent work has proposed using Generative Adversarial Networks (GANs) to generate naturalistic patches that may not attract human attention. However, such approaches suffer from a limited latent space, making it challenging to produce a patch that is efficient, stealthy, and robust to multiple real-world transformations. This paper introduces a novel approach that produces a Dynamic Adversarial Patch (DAP) designed to overcome these limitations. DAP maintains a naturalistic appearance while optimizing attack efficiency and robustness to real-world transformations. The approach involves redefining the optimization problem and introducing a novel objective function that incorporates a similarity metric to guide the patch's creation. Unlike GAN-based techniques, the DAP directly modifies pixel values within the patch, providing increased flexibility and adaptability to multiple transformations. Furthermore, most clothing-based physical attacks assume static objects and ignore the possible transformations caused by non-rigid deformation due to changes in a person's pose. To address this limitation, a "Creases Transformation" (CT) block is introduced, enhancing the patch's resilience to a variety of real-world distortions. Experimental results demonstrate that the proposed approach outperforms state-of-the-art attacks, achieving a success rate of up to 82.28% in the digital world when targeting the YOLOv7 detector and 65% in the physical world when targeting the YOLOv3tiny detector deployed in edge-based smart cameras.
-
Autoregressive Initial Bits (ArIB), a framework that combines subimage autoregression and latent variable models, has shown its advantages in lossless image compression. However, in current methods the image splitting causes the information of latent variables to be uniformly distributed across subimages, which leads to inadequate use of latent variables in addition to posterior collapse. To tackle these issues, we introduce Bit Plane Slicing (BPS), which splits images in the bit plane dimension in consideration of the differing importance of each plane for latent variables. BPS thus provides a more effective representation by arranging subimages in order of decreasing importance for latent variables. To solve the problem of the increased number of dimensions caused by BPS, we further propose a dimension-tailored autoregressive model that tailors autoregression methods for each dimension based on their characteristics, efficiently capturing the dependencies in the plane, space, and color dimensions. As shown in the extensive experimental results, our method demonstrates superior compression performance with comparable inference speed when compared to the state-of-the-art normalizing-flow-based methods. The code is at https://github.com/ZZ022/ArIB-BPS.
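As a small illustration of splitting along the bit plane dimension, the sketch below decomposes 8-bit pixel values into binary planes and reassembles them losslessly. It shows only the generic bit-plane decomposition; the most-significant-first ordering and the function names are assumptions of this sketch, and the grouping of planes into subimages used by the paper is not reproduced here.

```python
def to_bit_planes(pixels, bits=8):
    """Split integer pixel values into `bits` binary planes, most significant first."""
    return [[(p >> b) & 1 for p in pixels] for b in range(bits - 1, -1, -1)]


def from_bit_planes(planes):
    """Reassemble pixel values from planes ordered most significant first."""
    bits = len(planes)
    pixels = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        shift = bits - 1 - i  # plane 0 carries the highest-order bit
        for j, bit in enumerate(plane):
            pixels[j] |= bit << shift
    return pixels


if __name__ == "__main__":
    row = [0, 255, 128, 37]
    planes = to_bit_planes(row)
    print(planes[0])                 # most significant plane
    print(from_bit_planes(planes))   # round trip recovers the input
```

Because the decomposition is exactly invertible, a lossless codec can model the planes one after another without losing information.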
-
3D face reconstruction aims at generating high-fidelity 3D face shapes and textures from single-view or multi-view images. However current prevailing facial texture generation methods generally suffer from low-quality texture identity information loss and inadequate handling of occlusions. To solve these problems we introduce an Identity-Conditioned Latent Diffusion Model for face UV-texture generation (UV-IDM) to generate photo-realistic textures based on the Basel Face Model (BFM). UV-IDM leverages the powerful texture generation capacity of a latent diffusion model (LDM) to obtain detailed facial textures. To preserve the identity during the reconstruction procedure we design an identity-conditioned module that can utilize any in-the-wild image as a robust condition for the LDM to guide texture generation. UV-IDM can be easily adapted to different BFM-based methods as a high-fidelity texture generator. Furthermore in light of the limited accessibility of most existing UV-texture datasets we build a large-scale and publicly available UV-texture dataset based on BFM termed BFM-UV. Extensive experiments show that our UV-IDM can generate high-fidelity textures in 3D face reconstruction within seconds while maintaining image consistency bringing new state-of-the-art performance in facial texture generation.
-
Current diffusion or flow-based generative models for 3D shapes fall into two categories: those that distill pre-trained 2D image diffusion models and those that train directly on 3D shapes. When training a diffusion or flow model on 3D shapes, a crucial design choice is the shape representation. An effective shape representation needs to adhere to three design principles: it should allow an efficient conversion of large 3D datasets to the representation form; it should provide a good tradeoff of approximation power versus number of parameters; and it should have a simple tensorial form that is compatible with existing powerful neural architectures. While standard 3D shape representations such as volumetric grids and point clouds do not adhere to all these principles simultaneously, we advocate in this paper a new representation that does. We introduce Mosaic-SDF (M-SDF): a simple 3D shape representation that approximates the Signed Distance Function (SDF) of a given shape by using a set of local grids spread near the shape's boundary. The M-SDF representation is fast to compute for each shape individually, making it readily parallelizable; it is parameter efficient, as it only covers the space around the shape's boundary; and it has a simple matrix form compatible with Transformer-based architectures. We demonstrate the efficacy of the M-SDF representation by using it to train a 3D generative flow model, including class-conditioned generation with the ShapeNetCore-V2 (3D Warehouse) dataset and text-to-3D generation using a dataset of about 600k caption-shape pairs.
-
Diffusion Handles is a novel approach to enable 3D object edits on diffusion images, requiring only existing pre-trained diffusion models and depth estimation, without any fine-tuning or 3D object retrieval. The edited results remain plausible and photo-real, and preserve object identity. Diffusion Handles addresses a critically missing facet of generative image-based creative design. Our key insight is to lift diffusion activations for a selected object to 3D using a proxy depth, 3D-transform the depth and associated activations, and project them back to image space. The diffusion process guided by the manipulated activations produces plausible edited images showing complex 3D occlusion and lighting effects. We evaluate Diffusion Handles quantitatively on a large synthetic data benchmark and qualitatively by a user study, showing our output to be more plausible and better than prior art at both 3D editing and identity control.
-
Extensive advancements have been made in person ReID through the mining of semantic information. Nevertheless, existing methods that utilize semantic parts from a single image modality do not explicitly achieve this goal. Witnessing the impressive capabilities in multimodal understanding of the Vision Language Foundation Model CLIP, a recent two-stage CLIP-based method employs automated prompt engineering to obtain specific textual labels for classifying pedestrians. However, we note that the predefined soft prompts may be inadequate in expressing the entire visual context and struggle to generalize to unseen classes. This paper presents an end-to-end Prompt-driven Semantic Guidance (PromptSG) framework that harnesses the rich semantics inherent in CLIP. Specifically, we guide the model to attend to regions that are semantically faithful to the prompt. To provide personalized language descriptions for specific individuals, we propose learning pseudo tokens that represent specific visual contexts. This design not only facilitates learning fine-grained attribute information but also can inherently leverage language prompts during inference. Without requiring additional labeling efforts, our PromptSG surpasses the previous state of the art by over 10% on MSMT17 and nearly 5% on the Market-1501 benchmark.
-
Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness. Despite the practical success the mechanisms behind SAM's generalization enhancements remain elusive limiting its progress in deep learning optimization. In this work we investigate SAM's core components for generalization improvement and introduce "Friendly-SAM" (F-SAM) to further enhance SAM's generalization. Our investigation reveals the key role of batch-specific stochastic gradient noise within the adversarial perturbation i.e. the current minibatch gradient which significantly influences SAM's generalization performance. By decomposing the adversarial perturbation in SAM into full gradient and stochastic gradient noise components we discover that relying solely on the full gradient component degrades generalization while excluding it leads to improved performance. The possible reason lies in the full gradient component's increase in sharpness loss for the entire dataset creating inconsistencies with the subsequent sharpness minimization step solely on the current minibatch data. Inspired by these insights F-SAM aims to mitigate the negative effects of the full gradient component. It removes the full gradient estimated by an exponentially moving average (EMA) of historical stochastic gradients and then leverages stochastic gradient noise for improved generalization. Moreover we provide theoretical validation for the EMA approximation and prove the convergence of F-SAM on non-convex problems. Extensive experiments demonstrate the superior generalization performance and robustness of F-SAM over vanilla SAM. Code is available at https://github.com/nblt/F-SAM.
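The perturbation described above can be sketched in a few lines: an exponential moving average (EMA) of past stochastic gradients stands in for the full gradient, it is subtracted from the current minibatch gradient, and the remainder is rescaled to a fixed radius as in SAM. The parameter names `lam`, `sigma`, and `rho`, their default values, and the exact form of the subtraction are assumptions of this sketch and are not taken from the abstract.

```python
import math


def ema_update(m, g, lam=0.9):
    """EMA of stochastic gradients, used here as an estimate of the full gradient."""
    return [lam * mi + (1.0 - lam) * gi for mi, gi in zip(m, g)]


def friendly_perturbation(g, m, rho=0.05, sigma=1.0):
    """Perturbation along the minibatch gradient with the EMA estimate removed.

    g:     current minibatch gradient
    m:     EMA estimate of the full gradient
    rho:   perturbation radius (hypothetical default)
    sigma: how much of the estimate to subtract (hypothetical default)
    """
    noise = [gi - sigma * mi for gi, mi in zip(g, m)]
    norm = math.sqrt(sum(x * x for x in noise))
    if norm == 0.0:
        return [0.0] * len(noise)  # nothing left to perturb along
    return [rho * x / norm for x in noise]


if __name__ == "__main__":
    m = ema_update([0.0, 0.0], [1.0, 0.0], lam=0.5)
    eps = friendly_perturbation([1.0, 1.0], [1.0, 0.0], rho=0.1)
    print(m, eps)
```

The weights would then be evaluated at `w + eps` to obtain the gradient for the actual update, as in standard SAM; that outer step is omitted here.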
-
Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks such as controllable image generation and image editing while downstream video synthesis tasks are less explored for several reasons. First it requires huge memory and computation overhead to train a video generation foundation model. Even with video foundation models additional costly training is still required for downstream video synthesis tasks. Second although some works extend image diffusion models into videos in a training-free manner temporal consistency cannot be well preserved. Finally these adaption methods are specifically designed for one task and fail to generalize to different tasks. To mitigate these issues we propose a training-free general-purpose video synthesis framework coined as BIVDiff via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically we first use a specific image diffusion model (e.g. ControlNet and Instruct Pix2Pix) for frame-wise video generation then perform Mixed Inversion on the generated video and finally input the inverted latents into the video diffusion models (e.g. VidRD and ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff we perform a wide range of video synthesis tasks including controllable video generation video editing video inpainting and outpainting.
-
Despite their exceptional performance in vision tasks, deep learning models often struggle when faced with domain shifts during testing. Test-Time Training (TTT) methods have recently gained popularity for their ability to enhance the robustness of models through the addition of an auxiliary objective that is jointly optimized with the main task. Being strictly unsupervised, this auxiliary objective is used at test time to adapt the model without any access to labels. In this work, we propose Noise-Contrastive Test-Time Training (NC-TTT), a novel unsupervised TTT technique based on the discrimination of noisy feature maps. By learning to classify noisy views of projected feature maps and then adapting the model accordingly on new domains, classification performance can be recovered by a significant margin. Experiments on several popular test-time adaptation baselines demonstrate the advantages of our method compared to recent approaches for this task. The code can be found at: https://github.com/GustavoVargasHakim/NCTTT.git
-
The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT) often manifested as severe deformations fast motion and occlusions. Most methods that solely depend on coarse-grained object cues such as boxes and the overall appearance of the object are susceptible to degradation due to distorted internal relationships of dynamic objects. To address this problem this work proposes NetTrack an efficient generic and affordable tracking framework to introduce fine-grained learning that is robust to dynamicity. Specifically NetTrack constructs a dynamicity-aware association with a fine-grained Net leveraging point-level visual cues. Correspondingly a fine-grained sampler and matching method have been incorporated. Furthermore NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios a bird flock tracking (BFT) dataset is constructed which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity and thorough transfer experiments on challenging open-world benchmarks i.e. TAO TAO-OW AnimalTrack and GMOT-40 validate the strong generalization ability of NetTrack even without finetuning.
-
Existing approaches to video understanding mainly designed for short videos from a third-person perspective are limited in their applicability in certain fields such as robotics. In this paper we delve into open-ended question-answering (QA) in long egocentric videos which allows individuals or robots to inquire about their own past visual experiences. This task presents unique challenges including the complexity of temporally grounding queries within extensive video content the high resource demands for precise data annotation and the inherent difficulty of evaluating open-ended answers due to their ambiguous nature. Our proposed approach tackles these challenges by (i) integrating query grounding and answering within a unified model to reduce error propagation; (ii) employing large language models for efficient and scalable data synthesis; and (iii) introducing a close-ended QA task for evaluation to manage answer ambiguity. Extensive experiments demonstrate the effectiveness of our method which also achieves state-of-the-art performance on the QAEgo4D and Ego4D-NLQ benchmarks. Code data and models are open-sourced at https://github.com/Becomebright/GroundVQA.
-
Predicting the trajectories of road agents is essential for autonomous driving systems. The recent mainstream methods follow a static paradigm, which predicts the future trajectory by using a fixed duration of historical frames. These methods make the predictions independently even at adjacent time steps, which leads to potential instability and temporal inconsistency. As successive time steps have largely overlapping historical frames, their forecasts should be intrinsically correlated: for example, overlapping predicted trajectories should be consistent, or be different but share the same motion goal, depending on the road situation. Motivated by this, in this work we introduce HPNet, a novel dynamic trajectory forecasting method. Aiming for stable and accurate trajectory forecasting, our method leverages not only historical frames, including maps and agent states, but also historical predictions. Specifically, we design a Historical Prediction Attention module to automatically encode the dynamic relationship between successive predictions. It also extends the attention range beyond the currently visible window, benefiting from the use of historical predictions. The proposed Historical Prediction Attention, together with the Agent Attention and Mode Attention, is further formulated as the Triple Factorized Attention module, serving as the core design of HPNet. Experiments on the Argoverse and INTERACTION datasets show that HPNet achieves state-of-the-art performance and generates accurate and stable future trajectories. Our code is available at https://github.com/XiaolongTang23/HPNet.
-
While recent depth completion methods have achieved remarkable results filling in relatively dense depth maps (e.g. projected 64-line LiDAR on KITTI or 500 sampled points on NYUv2) with RGB guidance their performance on very sparse input (e.g. 4-line LiDAR or 32 depth point measurements) is unverified. These sparser regimes present new challenges as a 4-line LiDAR increases the distance between pixels without depth and their nearest depth point sixfold from 5 pixels to 30 pixels compared to 64 lines. Observing that existing methods struggle with sparse and variable distribution depth maps we propose an Affinity-Based Shift Correction (ASC) module that iteratively aligns depth predictions to input depth based on predicted affinities between image pixels and depth points. Our framework enables each depth point to adaptively influence and improve predictions across the image leading to largely improved results for fewer-line fewer-point and variable sparsity settings. Further we show improved performance in domain transfer from KITTI to nuScenes and from random sampling to irregular point distributions. Our correction module can easily be added to any depth completion or RGB-only depth estimation model notably allowing the latter to perform both completion and estimation with a single model.
-
Data-free knowledge distillation is able to utilize the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data avoiding privacy security and proprietary risks in real applications. In this line of research existing methods typically follow an inversion-and-distillation paradigm in which a generative adversarial network on-the-fly trained with the guidance of the pre-trained teacher network is used to synthesize a large-scale sample set for knowledge distillation. In this paper we reexamine this common data-free knowledge distillation paradigm showing that there is considerable room to improve the overall training efficiency through a lens of "small-scale inverted data for knowledge distillation". In light of three empirical observations indicating the importance of how to balance class distributions in terms of synthetic sample diversity and difficulty during both data inversion and distillation processes we propose Small Scale Data-free Knowledge Distillation (SSD-KD). In formulation SSD-KD introduces a modulating function to balance synthetic samples and a priority sampling function to select proper samples facilitated by a dynamic replay buffer and a reinforcement learning strategy. As a result SSD-KD can perform distillation training conditioned on an extremely small scale of synthetic samples (e.g. 10x less than the original training data scale) making the overall training efficiency one or two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance as demonstrated on popular image classification and semantic segmentation benchmarks. The code is available at https://github.com/OSVAI/SSD-KD.
-
Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images prequalified to fool simple signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels and look only at derived geometric features. The first classifier looks at the perspective field of the image the second looks at lines detected in the image and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images.
-
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains. Existing methods either rely on domain labels to align domain-invariant feature spaces or disentangle generalizable features from the whole sample which inevitably lead to the distortion of semantic feature structures and achieve limited generalization. In this work we make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features. Specifically we propose a novel Class Free Prompt Learning (CFPL) paradigm for DG FAS which utilizes two lightweight transformers namely Content Q-Former (CQF) and Style Q-Former (SQF) to learn the different semantic prompts conditioned on content and style features by using a set of learnable query vectors respectively. Thus the generalizable prompt can be learned by two improvements: (1) A Prompt-Text Matched (PTM) supervision is introduced to ensure CQF learns visual representation that is most informative of the content description. (2) A Diversified Style Prompt (DSP) technology is proposed to diversify the learning of style prompts by mixing feature statistics between instance-specific styles. Finally the learned text features modulate visual features to generalization through the designed Prompt Modulation (PM). Extensive experiments show that the CFPL is effective and outperforms the state-of-the-art methods on several cross-domain datasets.
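The abstract above mentions diversifying style prompts "by mixing feature statistics between instance-specific styles". The sketch below shows one common way such mixing is done, by interpolating the per-instance mean and standard deviation of two feature vectors and re-normalizing one of them. This is an assumption-laden illustration in the spirit of statistic-mixing augmentations; the function name `mix_style_statistics` and the weight `lam` are hypothetical, and the exact DSP formulation is not given in the abstract.

```python
import math

def _mean_std(x):
    """Mean and (population) standard deviation of a list of floats."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, math.sqrt(var)

def mix_style_statistics(x, y, lam=0.5, eps=1e-6):
    """Re-style feature vector x with statistics mixed between x and y.

    lam = 1.0 keeps the statistics of x; lam = 0.0 uses those of y.
    """
    mu_x, sd_x = _mean_std(x)
    mu_y, sd_y = _mean_std(y)
    # Interpolate the first- and second-order statistics of the two instances.
    mu_mix = lam * mu_x + (1.0 - lam) * mu_y
    sd_mix = lam * sd_x + (1.0 - lam) * sd_y
    # Normalize x, then apply the mixed statistics.
    return [sd_mix * (v - mu_x) / (sd_x + eps) + mu_mix for v in x]

if __name__ == "__main__":
    print(mix_style_statistics([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], lam=0.5))
```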
-
Introducing interpretability and reasoning into Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) analysis is challenging given the complexity of gigapixel slides. Traditionally MIL interpretability is limited to identifying salient regions deemed pertinent for downstream tasks offering little insight to the end-user (pathologist) regarding the rationale behind these selections. To address this we propose Self-Interpretable MIL (SI-MIL) a method intrinsically designed for interpretability from the very outset. SI-MIL employs a deep MIL framework to guide an interpretable branch grounded on handcrafted pathological features facilitating linear predictions. Beyond identifying salient regions SI-MIL uniquely provides feature-level interpretations rooted in pathological insights for WSIs. Notably SI-MIL with its linear prediction constraints challenges the prevalent myth of an inevitable trade-off between model interpretability and performance demonstrating competitive results compared to state-of-the-art methods on WSI-level prediction tasks across three cancer types. In addition we thoroughly benchmark the local- and global-interpretability of SI-MIL in terms of statistical analysis a domain expert study and desiderata of interpretability namely user-friendliness and faithfulness.
-
Generating realistic hand motion sequences in interaction with objects has gained increasing attention with the growing interest in digital humans. Prior work has illustrated the effectiveness of employing occupancy-based or distance-based virtual sensors to extract hand-object interaction features. Nonetheless these methods show limited generalizability across object categories shapes and sizes. We hypothesize that this is due to two reasons: 1) the limited expressiveness of employed virtual sensors and 2) scarcity of available training data. To tackle this challenge we introduce a novel joint-centered sensor designed to reason about local object geometry near potential interaction regions. The sensor queries for object surface points in the neighbourhood of each hand joint. As an important step towards mitigating the learning complexity we transform the points from global frame to hand template frame and use a shared module to process sensor features of each individual joint. This is followed by a spatio-temporal transformer network aimed at capturing correlation among the joints in different dimensions. Moreover we devise simple heuristic rules to augment the limited training sequences with vast static hand grasping samples. This leads to a broader spectrum of grasping types observed during training in turn enhancing our model's generalization capability. We evaluate on two public datasets GRAB and InterCap where our method shows superiority over baselines both quantitatively and perceptually.
-
We study the underexplored but fundamental vision problem of machine understanding of abstract freehand scene sketches. We introduce a sketch encoder that results in semantically-aware feature space which we evaluate by testing its performance on a semantic sketch segmentation task. To train our model we rely only on the availability of bitmap sketches with their brief captions and do not require any pixel-level annotations. To obtain generalization to a large set of sketches and categories we build on a vision transformer encoder pretrained with the CLIP model. We freeze the text encoder and perform visual-prompt tuning of the visual encoder branch while introducing a set of critical modifications. Firstly we augment the classical key-query (k-q) self-attention blocks with value-value (v-v) self-attention blocks. Central to our model is a two-level hierarchical network design that enables efficient semantic disentanglement: The first level ensures holistic scene sketch encoding and the second level focuses on individual categories. We then in the second level of the hierarchy introduce a cross-attention between textual and visual branches. Our method outperforms zero-shot CLIP pixel accuracy of segmentation results by 37 points reaching an accuracy of 85.5% on the FS-COCO sketch dataset. Finally we conduct a user study that allows us to identify further improvements needed over our method to reconcile machine and human understanding of scene sketches.
-
We present IntrinsicAvatar a novel approach to recovering the intrinsic properties of clothed human avatars including geometry albedo material and environment lighting from only monocular videos. Recent advancements in human-based neural rendering have enabled high-quality geometry and appearance reconstruction of clothed humans from just monocular videos. However these methods bake intrinsic properties such as albedo material and environment lighting into a single entangled neural representation. On the other hand only a handful of works tackle the problem of estimating geometry and disentangled appearance properties of clothed humans from monocular videos. They usually achieve limited quality and disentanglement due to approximations of secondary shading effects via learned MLPs. In this work we propose to model secondary shading effects explicitly via Monte-Carlo ray tracing. We model the rendering process of clothed humans as a volumetric scattering process and combine ray tracing with body articulation. Our approach can recover high-quality geometry albedo material and lighting properties of clothed humans from a single monocular video without requiring supervised pre-training using ground truth materials. Furthermore since we explicitly model the volumetric scattering process and ray tracing our model naturally generalizes to novel poses enabling animation of the reconstructed avatar in novel lighting conditions.
-
Group synchronization plays a crucial role in global pipelines for Structure from Motion (SfM). Its formulation is nonconvex and it is faced with highly corrupted measurements. Cycle consistency has been effective in addressing these challenges. However, computationally efficient solutions are needed for cycles longer than three, especially in practical scenarios where 3-cycles are unavailable. To overcome this computational bottleneck, we propose an algorithm for group synchronization that leverages information from cycles of lengths ranging from three to six with a complexity of O(n^3) (or O(n^2.373) when using a faster matrix multiplication algorithm). We establish non-trivial theory for this and related methods that achieves competitive sample complexity, assuming the uniform corruption model. To advocate the practical need for our method, we consider distributed group synchronization, which requires at least 4-cycles, and we illustrate state-of-the-art performance by our method in this context.
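To make the notion of cycle consistency concrete, the minimal sketch below uses planar rotations (angles in radians) as the group: composing clean relative measurements around any closed cycle should return the identity, so a nonzero residual flags a corrupted edge. The helper names and the toy data are assumptions for illustration only; the sketch checks a single cycle and is not the paper's O(n^3) algorithm.

```python
import math

def wrap(angle):
    """Map an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def cycle_inconsistency(rel, cycle):
    """Residual rotation after composing relative angles around a closed cycle.

    rel[(i, j)] is the measured rotation from node i to node j.
    cycle is a list of node indices; the last node connects back to the first.
    """
    total = 0.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        total += rel[(a, b)]
    return abs(wrap(total))

if __name__ == "__main__":
    theta = [0.0, 0.5, 1.2, 2.0]  # hypothetical ground-truth absolute angles
    rel = {(i, j): theta[j] - theta[i]
           for i in range(4) for j in range(4) if i != j}
    print(cycle_inconsistency(rel, [0, 1, 2, 3]))  # close to zero for clean data
    rel[(1, 2)] += 1.0                             # corrupt one measurement
    print(cycle_inconsistency(rel, [0, 1, 2, 3]))  # residual reveals the corruption
```

Here a 4-cycle is used, matching the abstract's point that longer cycles matter when 3-cycles are unavailable.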
-
Existing scene text detectors generally focus on accurately detecting single-level (i.e. word-level line-level or paragraph-level) text entities without exploring the relationships among different levels of text entities. To comprehensively understand scene texts detecting multi-level texts while exploring their contextual information is critical. To this end we propose a unified framework (dubbed LayoutFormer) for hierarchical text detection which simultaneously conducts multi-level text detection and predicts the geometric layouts for promoting scene text understanding. In LayoutFormer WordDecoder LineDecoder and ParaDecoder are proposed to be responsible for word-level text prediction line-level text prediction and paragraph-level text prediction respectively. Meanwhile WordDecoder and ParaDecoder adaptively learn word-line and line-paragraph relationships respectively. In addition we propose a Prior Location Sampler to be used on multi-scale features to adaptively select a few representative foreground features for updating text queries. It can improve hierarchical detection performance while significantly reducing the computational cost. Comprehensive experiments verify that our method achieves state-of-the-art performance on single-level and hierarchical text detection.
-
In this work, we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) from user descriptions. Unlike short videos lasting a few seconds, a vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, Vlogger leverages a Large Language Model (LLM) as Director and decomposes the long video generation task of a vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. With such a design mimicking human beings, Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, ShowMaker, which serves as a videographer in Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. In addition, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor.
-
Point-spread-function (PSF) engineering is a well-established computational imaging technique that uses phase masks and other optical elements to embed extra information (e.g., depth) into the images captured by conventional CMOS image sensors. To date, however, PSF engineering has not been applied to neuromorphic event cameras, a powerful new image sensing technology that responds to changes in the log-intensity of light. This paper establishes theoretical limits (Cramér-Rao bounds) on 3D point localization and tracking with PSF-engineered event cameras. Using these bounds, we first demonstrate that existing Fisher phase masks are already near-optimal for localizing static flashing point sources (e.g., blinking fluorescent molecules). We then demonstrate that existing designs are sub-optimal for tracking moving point sources and proceed to use our theory to design optimal phase masks and binary amplitude masks for this task. To overcome the non-convexity of the design problem, we leverage novel implicit neural representation based parameterizations of the phase and amplitude masks. We demonstrate the efficacy of our designs through extensive simulations. We also validate our method with a simple prototype.
-
Adversarial attacks aim to perturb images such that a predictor outputs incorrect results. Due to the limited research in structured attacks, imposing consistency checks on natural multi-object scenes is a practical defense against conventional adversarial attacks. More desirable attacks should be able to fool defenses with such consistency checks. Therefore, we present the first approach, GLOW, that copes with various attack requests by generating global layout-aware adversarial attacks, in which both categorical and geometric layout constraints are explicitly established. Specifically, we focus on object detection tasks: given a victim image, GLOW first localizes victim objects according to target labels. It then generates multiple attack plans, together with their context-consistency scores. On the one hand, GLOW is capable of handling various types of requests, including single or multiple victim objects, with or without specified victim objects. On the other hand, it produces a consistency score for each attack plan, reflecting the overall contextual consistency in which both semantic category and global scene layout are considered. We conduct our experiments on MS COCO and Pascal. Extensive experimental results demonstrate that we can achieve about 30% average relative improvement compared to state-of-the-art methods on conventional single-object attack requests. Moreover, such superiority also holds across more generic attack requests, under both white-box and zero-query black-box settings. Finally, we conduct a comprehensive human analysis, which not only further validates our claim but also provides strong evidence that our evaluation metrics reflect human reviews well.
-
Label noise, commonly found in real-world datasets, has a detrimental impact on a model's generalization. To effectively detect incorrectly labeled instances, previous works have mostly relied on distinguishable training signals, such as training loss, as indicators to differentiate between clean and noisy labels. However, they have limitations in that the training signals incompletely reveal the model's behavior and do not generalize effectively to various noise types, resulting in limited detection accuracy. In this paper, we propose the DynaCor framework, which distinguishes incorrectly labeled instances from correctly labeled ones based on the dynamics of the training signals. To cope with the absence of supervision for clean and noisy labels, DynaCor first introduces a label corruption strategy that augments the original dataset with intentionally corrupted labels, enabling indirect simulation of the model's behavior on noisy labels. Then, DynaCor learns to identify clean and noisy instances by inducing two clearly distinguishable clusters from the latent representations of training dynamics. Our comprehensive experiments show that DynaCor outperforms the state-of-the-art competitors and shows strong robustness to various noise types and noise rates.
-
We present Neural 3D Strokes a novel technique to generate stylized images of a 3D scene at arbitrary novel views from multi-view 2D images. Different from existing methods which apply stylization to trained neural radiance fields at the voxel level our approach draws inspiration from image-to-painting methods simulating the progressive painting process of human artwork with vector strokes. We develop a palette of stylized 3D strokes from basic primitives and splines and consider the 3D scene stylization task as a multi-view reconstruction process based on these 3D stroke primitives. Instead of directly searching for the parameters of these 3D strokes which would be too costly we introduce a differentiable renderer that allows optimizing stroke parameters using gradient descent and propose a training scheme to alleviate the vanishing gradient issue. The extensive evaluation demonstrates that our approach effectively synthesizes 3D scenes with significant geometric and aesthetic stylization while maintaining a consistent appearance across different views. Our method can be further integrated with style loss and image-text contrastive models to extend its applications including color transfer and text-driven 3D scene drawing. Results and code are available at http://buaavrcg.github.io/Neural3DStrokes.
-
Conventional radar feature extraction faces limitations due to low spatial resolution, noise, multipath reflection, the presence of ghost targets, and motion blur. Such limitations can be exacerbated by nonlinear object motion, particularly from an ego-centric viewpoint. To address these challenges, the key lies in exploiting temporal feature relations over an extended horizon and enforcing spatial motion consistency for effective association. To this end, this paper proposes SIRA (Scalable Inter-frame Relation and Association) with two designs. First, inspired by Swin Transformer, we introduce extended temporal relation, generalizing the existing temporal relation layer from two consecutive frames to multiple inter-frames with temporally regrouped window attention for scalability. Second, we propose motion consistency track with the concept of a pseudo-tracklet generated from observational data for better trajectory prediction and subsequent object association. Our approach achieves 58.11 mAP@0.5 for oriented object detection and 47.79 MOTA for multiple object tracking on the Radiate dataset, surpassing the previous state of the art by a margin of +4.11 mAP@0.5 and +9.94 MOTA, respectively.
-
We present a 3D-aware one-shot head reenactment method based on a fully volumetric neural disentanglement framework for source appearance and driver expressions. Our method is real-time and produces high-fidelity and view-consistent output suitable for 3D teleconferencing systems based on holographic displays. Existing cutting-edge 3D-aware reenactment methods often use neural radiance fields or 3D meshes to produce view-consistent appearance encoding but at the same time they rely on linear face models such as 3DMM to achieve its disentanglement with facial expressions. As a result their reenactment results often exhibit identity leakage from the driver or have unnatural expressions. To address these problems we propose a neural self-supervised disentanglement approach that lifts both the source image and driver video frame into a shared 3D volumetric representation based on tri-planes. This representation can then be freely manipulated with expression tri-planes extracted from the driving images and rendered from an arbitrary view using neural radiance fields. We achieve this disentanglement via self-supervised learning on a large in-the-wild video dataset. We further introduce a highly effective fine-tuning approach to improve the generalizability of the 3D lifting using the same real-world data. We demonstrate state-of-the-art performance on a wide range of datasets and also showcase high-quality 3D-aware head reenactment on highly challenging and diverse subjects including non-frontal head poses and complex expressions for both source and driver.
-
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact-check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption; 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
-
Collaborative perception empowers each agent to improve its perceptual ability through the exchange of perceptual messages with other agents. It inherently results in a fundamental trade-off between perception ability and communication cost. To address this bottleneck issue our core idea is to optimize the collaborative messages from two key aspects: representation and selection. The proposed codebook-based message representation enables the transmission of integer codes rather than high-dimensional feature maps. The proposed information-filling-driven message selection optimizes local messages to collectively fill each agent's information demand preventing information overflow among multiple agents. By integrating these two designs we propose CodeFilling a novel communication-efficient collaborative perception system which significantly advances the perception-communication trade-off and is inclusive to both homogeneous and heterogeneous collaboration settings. We evaluate CodeFilling in both a real-world dataset DAIR-V2X and a new simulation dataset OPV2VH+. Results show that CodeFilling outperforms previous SOTA Where2comm on DAIR-V2X/OPV2VH+ with 1333/1206x lower communication volume. Our code is available at https://github.com/PhyllisH/CodeFilling.
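The codebook-based message representation mentioned above can be pictured as nearest-codeword quantization: each feature vector is replaced by the integer index of its closest codebook entry, and the receiver looks the vectors up again. The sketch below is a minimal illustration under that assumption; the function names, the toy codebook, and the use of squared Euclidean distance are hypothetical choices, not details taken from the CodeFilling code.

```python
def encode(features, codebook):
    """Return, for each feature vector, the index of its nearest codeword."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in features]

def decode(codes, codebook):
    """Recover approximate feature vectors from the transmitted integer codes."""
    return [codebook[k] for k in codes]

if __name__ == "__main__":
    # Hypothetical shared codebook known to every agent.
    codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
    features = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]
    codes = encode(features, codebook)   # only these integers are transmitted
    print(codes)
    print(decode(codes, codebook))
```

Sending one small integer per feature vector instead of the full vector is what reduces the communication volume; the cost is the quantization error introduced by the lookup.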
-
DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However most existing FL methods assume that domain labels are provided during training and their evaluation imposes explicit constraints on the number of domains which must strictly match the number of clients. Because of the underutilization of numerous edge devices and additional cross-client domain annotations in the real world such restrictions may be impractical and involve potential privacy leaks. In this paper we propose an efficient and novel approach called Disentangled Prompt Tuning (DiPrompT) a method that tackles the above restrictions by learning adaptive prompts for domain generalization in a distributed manner. Specifically we first design two types of prompts i.e. global prompt to capture general knowledge across all clients and domain prompts to capture domain-specific knowledge. They eliminate the restriction on the one-to-one mapping between source domains and local clients. Furthermore a dynamic query metric is introduced to automatically search the suitable domain label for each sample which includes two-substep text-image alignments based on prompt tuning without labor-intensive annotation. Extensive experiments on multiple datasets demonstrate that our DiPrompT achieves superior domain generalization performance over state-of-the-art FL methods when domain labels are not provided and even outperforms many centralized learning methods using domain labels.
-
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models, these generations are not 3D-consistent and require a distillation process to generate a 3D output. We instead cast the task of 3D inference as directly generating mutually-consistent multiple views and build on the insight that additionally inferring depth can provide a mechanism for enforcing this consistency. Specifically, we train a denoising diffusion model to generate multi-view RGB-D images given a single RGB input image and leverage the (intermediate noisy) depth estimates to obtain reprojection-based conditioning to maintain multi-view consistency. We train our model using the large-scale synthetic dataset Objaverse as well as the real-world CO3D dataset comprising generic camera viewpoints. We demonstrate that our approach can yield more accurate synthesis compared to recent state-of-the-art methods, including distillation-based 3D inference and prior multi-view generation methods. We also evaluate the geometry induced by our multi-view depth prediction and find that it yields a more accurate representation than other direct 3D inference approaches.
-
Image-based mirror detection has recently seen rapid research progress due to its significance in applications such as robotic navigation, semantic segmentation and scene reconstruction. Recently, VMD-Net was proposed as the first video mirror detection technique by modeling dual correspondences between the inside and outside of the mirror both spatially and temporally. However, this approach is not reliable, as correspondences can occur completely inside or outside of the mirrors. In addition, the proposed dataset VMD-D contains many small mirrors, limiting its applicability to real-world scenarios. To address these problems, we developed a more challenging dataset that includes mirrors of various shapes and sizes at different locations of the frames, providing a better reflection of real-world scenarios. Next, we observed that the motions between the inside and outside of the mirror are often inconsistent. For instance, when moving in front of a mirror, the motion inside the mirror is often much smaller than the motion outside due to increased depth perception. With these observations, we propose modeling inconsistent motion cues to detect mirrors, and a new network with two novel modules. The Motion Attention Module (MAM) explicitly models inconsistent motions around mirrors via optical flow, and the Motion-Guided Edge Detection Module (MEDM) uses motions to guide mirror edge feature learning. Experimental results on our proposed dataset show that our method outperforms state-of-the-art methods. The code and dataset are available at https://github.com/AlexAnthonyWarren/MG-VMD.
-
Low-light scenes are prevalent in real-world applications (e.g. autonomous driving and surveillance at night). Recently, multi-object tracking in various practical use cases has received much attention, but multi-object tracking in dark scenes is rarely considered. In this paper, we focus on multi-object tracking in dark scenes. To address the lack of datasets, we first build a Low-light Multi-Object Tracking (LMOT) dataset. LMOT provides well-aligned low-light video pairs captured by our dual-camera system and high-quality multi-object tracking annotations for all videos. Then, we propose a low-light multi-object tracking method termed LTrack. We introduce the adaptive low-pass downsample module to enhance low-frequency components of images outside the sensor noises. The degradation suppression learning strategy enables the model to learn invariant information under noise disturbance and image quality degradation. These components improve the robustness of multi-object tracking in dark scenes. We conducted a comprehensive analysis of our LMOT dataset and the proposed LTrack. Experimental results demonstrate the superiority of the proposed method and its competitiveness in real night low-light scenes. Dataset and Code: https://github.com/ying-fu/LMOT
-
Human image editing includes tasks like changing a person's pose their clothing or editing the image according to a text prompt. However prior work often tackles these tasks separately overlooking the benefit of mutual reinforcement from learning them jointly. In this paper we propose UniHuman a unified model that addresses multiple facets of human image editing in real-world settings. To enhance the model's generation quality and generalization capacity we leverage guidance from human visual encoders and introduce a lightweight pose-warping module that can exploit different pose representations accommodating unseen textures and patterns. Furthermore to bridge the disparity between existing human editing benchmarks with real-world data we curated 400K high-quality human image-text pairs for training and collected 2K human images for out-of-domain testing both encompassing diverse clothing styles backgrounds and age groups. Experiments on both in-domain and out-of-domain test sets demonstrate that UniHuman outperforms task-specific models by a significant margin. In user studies UniHuman is preferred by the users in an average of 77% of cases. Our project is available at https://github.com/NannanLi999/UniHuman.
-
Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example the Civitai community a platform for T2I innovation currently hosts an impressive array of 74492 distinct models. However this diversity presents a formidable challenge in selecting the most appropriate model and parameters a process that typically requires numerous trials. Drawing inspiration from the tool usage research of large language models (LLMs) we introduce DiffAgent an LLM agent designed to screen the accurate selection in seconds via API calls. DiffAgent leverages a novel two-stage training framework SFTA enabling it to accurately align T2I API responses with user input in accordance with human preferences. To train and evaluate DiffAgent's capabilities we present DABench a comprehensive dataset encompassing an extensive range of T2I APIs from the community. Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework. Codes are available at https://github.com/OpenGVLab/DiffAgent.
-
Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed: generating neural fields requires overfitting a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affects the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.
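The data transformation studied here, a random permutation of pixel locations, is easy to sketch. The snippet below is only an illustration under assumed conventions: the function names `permute_pixels` and `unpermute_pixels` are hypothetical, and it shows the reversible shuffle itself, not the neural field training. A field fitted to the shuffled image would be decoded by applying the inverse permutation to its output.

```python
import numpy as np

def permute_pixels(image, seed=0):
    """Shuffle pixel locations of an (H, W, C) image with a fixed random permutation."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(h * w)
    flat = image.reshape(h * w, -1)
    # Pixel i of the shuffled image is pixel perm[i] of the original.
    return flat[perm].reshape(image.shape), perm

def unpermute_pixels(permuted, perm):
    """Invert permute_pixels so the original pixel layout is restored."""
    h, w = permuted.shape[:2]
    flat = permuted.reshape(h * w, -1)
    restored = np.empty_like(flat)
    restored[perm] = flat
    return restored.reshape(permuted.shape)

image = np.arange(4 * 5 * 3).reshape(4, 5, 3)
shuffled, perm = permute_pixels(image)
restored = unpermute_pixels(shuffled, perm)
```

Because the permutation is stored (or regenerated from the seed), the transformation loses no information.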
-
We present Zero-Painter a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. Our method utilizes object masks and individual descriptions coupled with a global text prompt to generate images with high fidelity. Zero-Painter employs a two-stage process involving our novel Prompt-Adjusted Cross-Attention (PACA) and Region-Grouped Cross-Attention (ReGCA) blocks ensuring precise alignment of generated objects with textual prompts and mask shapes. Our extensive experiments demonstrate that Zero-Painter surpasses current state-of-the-art methods in preserving textual details and adhering to mask shapes. We will make the codes and the models publicly available.
-
Absolute pose regression (APR) estimates global pose in an end-to-end manner, achieving impressive results in learning-based LiDAR localization. However, compared to the top-performing methods reliant on 3D-3D correspondence matching, APR's accuracy still has room for improvement. We recognize that APR's lack of robust feature learning and of an iterative denoising process leads to suboptimal results. In this paper, we propose DiffLoc, a novel framework that formulates LiDAR localization as a conditional generation of poses. First, we propose to utilize the foundation model and a static-object-aware pool to learn robust features. Second, we incorporate the iterative denoising process into APR via a diffusion model conditioned on the learned geometrically robust features. In addition, due to the unique nature of diffusion models, we propose to adapt our models to two additional applications: (1) using multiple inferences to evaluate pose uncertainty and (2) seamlessly introducing geometric constraints on denoising steps to improve prediction accuracy. Extensive experiments conducted on the Oxford Radar RobotCar and NCLT datasets demonstrate that DiffLoc outperforms the state-of-the-art methods. In particular, on the NCLT dataset we achieve 35% and 34.7% improvement in position and orientation accuracy, respectively. Our code is released at https://github.com/liw95/DiffLoc.
-
We present a method for reconstructing 3D shape of arbitrary Lambertian objects based on measurements by miniature energy-efficient low-cost single-photon cameras. These cameras operating as time resolved image sensors illuminate the scene with a very fast pulse of diffuse light and record the shape of that pulse as it returns back from the scene at a high temporal resolution. We propose to model this image formation process account for its non-idealities and adapt neural rendering to reconstruct 3D geometry from a set of spatially distributed sensors with known poses. We show that our approach can successfully recover complex 3D shapes from simulated data. We further demonstrate 3D object reconstruction from real-world captures utilizing measurements from a commodity proximity sensor. Our work draws a connection between image-based modeling and active range scanning and offers a step towards 3D vision with single-photon cameras.
-
We introduce WonderJourney a modular framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes and a large VLM to verify the generated scenes. We show compelling diverse visual results across various scene types and styles forming imaginary "wonderjourneys". Project website: https://kovenyu.com/WonderJourney.
-
We explore the boundaries of scaling up a multilingual vision and language model both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks including multiple image-based captioning and question-answering tasks image-based document understanding and few-shot (in-context) learning as well as object detection video question answering and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally we observe emerging capabilities such as complex counting and multilingual object detection tasks that are not explicitly in the training mix.
-
Previous advances in vehicle re-identification (ReID) are mostly reported under favorable lighting conditions while cross-day-and-night performance is neglected which greatly hinders the development of related traffic intelligence applications. This work instead develops a novel Day-Night Dual-domain Modulation (DNDM) vehicle re-identification framework for day-night cross-domain traffic scenarios. Specifically a unique night-domain glare suppression module is provided to attenuate the headlight glare from raw nighttime vehicle images. To enhance vehicle features under low-light environments we propose a dual-domain structure enhancement module in the feature extractor which enhances geometric structures between appearance features. To alleviate day-night domain discrepancies we develop a cross-domain class awareness module that facilitates the interaction between appearance and structure features in both domains. In this work we address the Day-Night cross-domain ReID (DN-ReID) problem and provide a new cross-domain dataset named DN-Wild including day and night images of 2286 identities giving in total 85945 daytime images and 54952 nighttime images. Furthermore we also take into account the matter of balance between day and night samples and provide a dataset called DN-348. Exhaustive experiments demonstrate the robustness of the proposed framework in the DN-ReID problem. The code and benchmark are released at https://github.com/chenjingong/DN-ReID.
-
Recent breakthroughs in text-to-4D generation rely on pre-trained text-to-image and text-to-video models to generate dynamic 3D scenes. However current text-to-4D methods face a three-way tradeoff between the quality of scene appearance 3D structure and motion. For example text-to-image models and their 3D-aware variants are trained on internet-scale image datasets and can be used to produce scenes with realistic appearance and 3D structure---but no motion. Text-to-video models are trained on relatively smaller video datasets and can produce scenes with motion but poorer appearance and 3D structure. While these models have complementary strengths they also have opposing weaknesses making it difficult to combine them in a way that alleviates this three-way tradeoff. Here we introduce hybrid score distillation sampling an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models and incorporates benefits of each for high-fidelity text-to-4D generation. Using hybrid SDS we demonstrate synthesis of 4D scenes with compelling appearance 3D structure and motion.
-
Adversarial distillation (AD) is a highly effective method for enhancing the robustness of small models. Contrary to expectations a high-performing teacher model does not always result in a more robust student model. This is due to two main reasons. First when there are significant differences in predictions between the teacher model and the student model exact matching of predicted values using KL divergence interferes with training leading to poor performance of existing methods. Second matching solely based on the output prevents the student model from fully understanding the behavior of the teacher model. To address these challenges this paper proposes a novel AD method named SmaraAD. During the training process we facilitate the student model in better understanding the teacher model's behavior by aligning the attribution region that the student model focuses on with that of the teacher model. Concurrently we relax the condition of exact matching in KL divergence and replace it with a more flexible matching criterion thereby enhancing the model's robustness. Extensive experiments substantiate the effectiveness of our method in improving the robustness of small models outperforming previous SOTA methods.
-
As a bio-inspired vision sensor with ultra-high speed spike cameras exhibit great potential in recording dynamic scenes with high-speed motion or drastic light changes. Different from traditional cameras each pixel in spike cameras records the arrival of photons continuously by firing binary spikes at an ultra-fine temporal granularity. In this process multiple factors impact the imaging including the photons' Poisson arrival thermal noises from circuits and quantization effects in spike readout. These factors introduce fluctuations to spikes making the recorded spike intervals unstable and unable to reflect accurate light intensities. In this paper we present an approach to deal with spike fluctuations and boost spike camera image reconstruction. We first analyze the quantization effects and reveal the unbiased estimation attribute of the reciprocal of differential of spike firing time (DSFT). Based on this we propose a spike representation module to use DSFT with multiple orders for fluctuation suppression where DSFT with higher orders indicates spike integration duration between multiple spikes. We also propose a module for inter-moment feature alignment at multiple granularities. The coarser alignment is based on patch-level cross-attention with a local search strategy and the finer alignment is based on deformable convolution at the pixel level. Experimental results demonstrate the effectiveness of our method on both synthetic and real-captured data. The source code and dataset are available at https://github.com/ruizhao26/BSF.
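To make the DSFT idea concrete, here is a minimal single-pixel sketch. It assumes a first-order definition in which each time step is assigned the interval between the two spikes that bracket it; the function name `dsft` and this exact convention are assumptions for illustration and may differ from the paper's formulation. The reciprocal of the interval then serves as a rough per-step intensity estimate.

```python
import numpy as np

def dsft(spikes):
    """First-order inter-spike interval for every time step of a 1D binary spike train.

    Steps from one spike up to (but excluding) the next spike get the interval
    between those two spikes; steps with no later spike are left as NaN.
    """
    spikes = np.asarray(spikes)
    fire_times = np.flatnonzero(spikes)
    out = np.full(spikes.shape[0], np.nan)
    for prev, nxt in zip(fire_times[:-1], fire_times[1:]):
        out[prev:nxt] = nxt - prev
    return out

train = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 1])
intervals = dsft(train)        # spikes at t = 0, 3, 5, 9
intensity = 1.0 / intervals    # shorter interval -> brighter estimate
```

Short intervals correspond to bright pixels and long intervals to dark ones, which is why fluctuations in spike timing translate directly into intensity errors.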
-
In this paper we introduce the problem of zero-shot text-guided exploration of the solutions to open-domain image super-resolution. Our goal is to allow users to explore diverse semantically accurate reconstructions that preserve data consistency with the low-resolution inputs for different large downsampling factors without explicitly training for these specific degradations. We propose two approaches for zero-shot text-guided super-resolution - i) modifying the generative process of text-to-image (T2I) diffusion models to promote consistency with low-resolution inputs and ii) incorporating language guidance into zero-shot diffusion-based restoration methods. We show that the proposed approaches result in diverse solutions that match the semantic meaning provided by the text prompt while preserving data consistency with the degraded inputs. We evaluate the proposed baselines for the task of extreme super-resolution and demonstrate advantages in terms of restoration quality diversity and explorability of solutions.
-
Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each spatial condition type, model architecture and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this work, we present FreeControl, a training-free approach for controllable T2I generation that supports multiple conditions, architectures and checkpoints simultaneously. FreeControl enforces structure guidance to facilitate the global alignment with a guidance image, and appearance guidance to collect visual details from images generated without control. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular, FreeControl enables convenient training-free control over many different architectures and checkpoints, allows the challenging input conditions on which most of the existing training-free methods fail, and achieves competitive synthesis quality compared to training-based approaches. Project page: https://genforce.github.io/freecontrol/.
-
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. Specifically, they encounter hurdles in (1) accurately reproducing motion from a target video and (2) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this, here we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive noisy latent frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our codes and data can be found at: https://video-motion-customization.github.io/
-
3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that fully automatically generates 3D environments to match a user-supplied prompt. Holodeck can generate diverse scenes, e.g. arcades, spas and museums, adjust the designs for styles, and capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a large language model (i.e. GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly, we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI: training agents to navigate in novel scenes like music rooms and daycares without human-constructed data, which is a significant step forward in developing general-purpose embodied agents.
-
The proliferation of large-scale AI models trained on extensive datasets has revolutionized machine learning. With these models taking on increasingly central roles in various applications the need to understand their behavior and enhance interpretability has become paramount. To investigate the impact of changes in training data on a pre-trained model a common approach is leave-one-out retraining. This entails systematically altering the training dataset by removing specific samples to observe resulting changes within the model. However retraining the model for each altered dataset presents a significant computational challenge given the need to perform this operation for every dataset variation. In this paper we introduce an efficient framework for assessing data impact comprising offline training and online evaluation stages. During the offline training phase we approximate the influence of training data on the target model through a distilled synset formulated as a reversed gradient matching problem. For online evaluation we expedite the leave-one-out process using the synset which is then utilized to compute the attribution matrix based on the evaluation objective. Experimental evaluations including training data attribution and assessments of data quality demonstrate that our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
-
Diffusion models have achieved great success in synthesizing high-quality images. However generating high-resolution images with diffusion models is still challenging due to the enormous computational costs resulting in a prohibitive latency for interactive applications. In this paper we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However naively implementing such an algorithm breaks the interaction between patches and loses fidelity while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma we observe the high similarity between the input from adjacent diffusion steps and propose Displaced Patch Parallelism which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore our method supports asynchronous communication which can be pipelined by computation. Extensive experiments show that our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1x speedup on eight NVIDIA A100s compared to one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser.
-
The success of large language models has inspired the computer vision community to explore image segmentation foundation models that are able to generalize in a zero/few-shot manner through prompt engineering. Segment-Anything (SAM), among others, is the state-of-the-art image segmentation foundation model, demonstrating strong zero/few-shot generalization. Despite this success, recent studies reveal the weakness of SAM under strong distribution shift. In particular, SAM performs poorly on corrupted natural images, camouflaged images, medical images, etc. Motivated by these observations, we aim to develop a self-training based strategy to adapt SAM to the target distribution. Given the unique challenges of a large source dataset, high computation cost and incorrect pseudo labels, we propose a weakly supervised self-training architecture with anchor regularization and low-rank finetuning to improve the robustness and computation efficiency of adaptation. We validate the effectiveness on 5 types of downstream segmentation tasks, including natural clean/corrupted images, medical images, camouflaged images and robotic images. Our proposed method is task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art domain adaptation methods on almost all downstream tasks with the same testing prompt inputs.
-
Recent self-training techniques have shown notable improvements in unsupervised domain adaptation for 3D object detection (3D UDA). These techniques typically select pseudo labels i.e. 3D boxes to supervise models for the target domain. However this selection process inevitably introduces unreliable 3D boxes in which 3D points cannot be definitively assigned as foreground or background. Previous techniques mitigate this by reweighting these boxes as pseudo labels but these boxes can still poison the training process. To resolve this problem in this paper we propose a novel pseudo label refinery framework. Specifically in the selection process to improve the reliability of pseudo boxes we propose a complementary augmentation strategy. This strategy involves either removing all points within an unreliable box or replacing it with a high-confidence box. Moreover the point numbers of instances in high-beam datasets are considerably higher than those in low-beam datasets also degrading the quality of pseudo labels during the training process. We alleviate this issue by generating additional proposals and aligning RoI features across different domains. Experimental results demonstrate that our method effectively enhances the quality of pseudo labels and consistently surpasses the state-of-the-art methods on six autonomous driving benchmarks. Code will be available at https://github.com/Zhanwei-Z/PERE.
-
We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data, we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model, we use a large scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings, we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations, HInt, we demonstrate significant improvements over existing baselines. We make our code, data and models available on the project website: https://geopavlakos.github.io/hamer/.
-
Training-free network architecture search (NAS) aims to discover high-performing networks with zero-cost proxies capturing network characteristics related to the final performance. However network rankings estimated by previous training-free NAS methods have shown weak correlations with the performance. To address this issue we propose AZ-NAS a novel approach that leverages the ensemble of various zero-cost proxies to enhance the correlation between a predicted ranking of networks and the ground truth substantially in terms of the performance. To achieve this we introduce four novel zero-cost proxies that are complementary to each other analyzing distinct traits of architectures in the views of expressivity progressivity trainability and complexity. The proxy scores can be obtained simultaneously within a single forward and backward pass making an overall NAS process highly efficient. In order to integrate the rankings predicted by our proxies effectively we introduce a non-linear ranking aggregation method that highlights the networks highly-ranked consistently across all the proxies. Experimental results conclusively demonstrate the efficacy and efficiency of AZ-NAS outperforming state-of-the-art methods on standard benchmarks all while maintaining a reasonable runtime cost.
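A small sketch can show why a non-linear aggregation favors networks that rank well on every proxy. The abstract does not state the aggregation formula, so the sum of log-normalized ranks used below is an assumption chosen for illustration, and the function name `aggregate_rankings` is hypothetical.

```python
import numpy as np

def aggregate_rankings(scores):
    """Combine proxy scores of shape (num_proxies, num_networks) into one score per network.

    Each proxy's scores are converted to ranks (1 = worst, m = best), and the
    log of the normalized rank is summed over proxies. The log penalizes a
    single very low rank much more than a linear sum of ranks would.
    """
    scores = np.asarray(scores, dtype=float)
    m = scores.shape[1]
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    return np.log(ranks / m).sum(axis=0)

# Three proxies scoring four candidate networks A, B, C, D (higher is better).
scores = [
    [4, 3, 2, 1],
    [4, 3, 1, 2],
    [1, 3, 4, 2],
]
aggregated = aggregate_rankings(scores)
best = int(np.argmax(aggregated))
```

In this toy example, networks A (ranks 4, 4, 1) and B (ranks 3, 3, 3) tie under a plain sum of ranks, but the log aggregation prefers B because it is ranked consistently high by all three proxies.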
-
This paper presents a novel non-rigid point set registration method that is inspired by unsupervised clustering analysis. Unlike previous approaches that treat the source and target point sets as separate entities, we develop a holistic framework where they are formulated as clustering centroids and clustering members, respectively. We then adopt Tikhonov regularization with an ℓ1-induced Laplacian kernel instead of the commonly used Gaussian kernel to ensure smooth and more robust displacement fields. Our formulation delivers closed-form solutions, theoretical guarantees, independence from dimensions, and the ability to handle large deformations. Subsequently, we introduce a clustering-improved Nyström method to effectively reduce the computational complexity and storage of the Gram matrix to linear, while providing a rigorous bound for the low-rank approximation. Our method achieves highly accurate results across various scenarios and surpasses competitors by a significant margin, particularly on shapes with substantial deformations. Additionally, we demonstrate the versatility of our method in challenging tasks such as shape transfer and medical registration.
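The two ingredients named in this abstract, a Laplacian kernel and a Nyström low-rank approximation of its Gram matrix, can be sketched generically. The code below is not the paper's clustering-improved variant: landmarks are simply the first few points (an assumption; the paper's title suggests cluster-based selection), and the names `laplacian_kernel` and `nystrom` and the bandwidth `sigma` are hypothetical.

```python
import numpy as np

def laplacian_kernel(x, y, sigma=1.0):
    """Laplacian kernel exp(-||x - y||_1 / sigma) between rows of x (n, d) and y (m, d)."""
    l1 = np.abs(x[:, None, :] - y[None, :, :]).sum(axis=-1)
    return np.exp(-l1 / sigma)

def nystrom(x, landmarks, sigma=1.0):
    """Return factors (C, W_pinv) such that the Gram matrix K is approximated by C @ W_pinv @ C.T."""
    c = laplacian_kernel(x, landmarks, sigma)          # (n, m) cross-kernel
    w = laplacian_kernel(landmarks, landmarks, sigma)  # (m, m) landmark kernel
    return c, np.linalg.pinv(w)

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))
landmarks = points[:10]            # assumption: first 10 points as landmarks

c, w_pinv = nystrom(points, landmarks)
approx = c @ w_pinv @ c.T          # formed here only to compare with the exact matrix
exact = laplacian_kernel(points, points)
```

In practice only the factors `c` (n x m) and `w_pinv` (m x m) are stored, so memory grows linearly in the number of points n for a fixed number of landmarks m.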
-
Geometry-agnostic system identification is a technique for identifying the geometry and physical properties of an object from video sequences without any geometric assumptions. Recently physics-augmented continuum neural radiance fields (PAC-NeRF) has demonstrated promising results for this technique by utilizing a hybrid Eulerian-Lagrangian representation in which the geometry is represented by the Eulerian grid representations of NeRF the physics is described by a material point method (MPM) and they are connected via Lagrangian particles. However a notable limitation of PAC-NeRF is that its performance is sensitive to the learning of the geometry from the first frames owing to its two-step optimization. First the grid representations are optimized with the first frames of video sequences and then the physical properties are optimized through video sequences utilizing the fixed first-frame grid representations. This limitation can be critical when learning of the geometric structure is difficult for example in a few-shot (sparse view) setting. To overcome this limitation we propose Lagrangian particle optimization (LPO) in which the positions and features of particles are optimized through video sequences in Lagrangian space. This method allows for the optimization of the geometric structure across the entire video sequence within the physical constraints imposed by the MPM. The experimental results demonstrate that the LPO is useful for geometric correction and physical identification in sparse-view settings.
-
Contrastive Vision-Language Pre-training known as CLIP has shown promising effectiveness in addressing downstream image recognition tasks. However recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks one victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model which makes them inapplicable to data-limited scenarios. In this work motivated by the recent success of learnable prompts we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method named BadCLIP is built on a novel and effective mechanism in backdoor attacks on CLIP i.e. influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator such that the trigger can change text features via trigger-aware prompts resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to those of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes and shows a strong generalization capability under cross-dataset and cross-domain settings. The code is available at https://github.com/jiawangbai/BadCLIP.
-
In real-world scenarios image recognition tasks such as semantic segmentation and object detection often pose greater challenges due to the lack of information available within low-resolution (LR) content. Image super-resolution (SR) is one of the promising solutions for addressing the challenges. However due to the ill-posed property of SR it is challenging for typical SR methods to restore task-relevant high-frequency contents which may dilute the advantage of utilizing the SR method. Therefore in this paper we propose Super-Resolution for Image Recognition (SR4IR) that effectively guides the generation of SR images beneficial to achieving satisfactory image recognition performance when processing LR images. The critical component of our SR4IR is the task-driven perceptual (TDP) loss that enables the SR network to acquire task-specific knowledge from a network tailored for a specific task. Moreover we propose a cross-quality patch mix and an alternate training framework that significantly enhances the efficacy of the TDP loss by addressing potential problems when employing the TDP loss. Through extensive experiments we demonstrate that our SR4IR achieves outstanding task performance by generating SR images useful for a specific image recognition task including semantic segmentation object detection and image classification. The implementation code is available at https://github.com/JaehaKim97/SR4IR.
-
Applying a pre-trained large model to downstream tasks is prohibitive under resource-constrained conditions. Recent dominant approaches for addressing efficiency issues involve adding a few learnable parameters to the fixed backbone model. This strategy however leads to more challenges in loading large models for downstream fine-tuning with limited resources. In this paper we propose a novel method for increasing the parameter efficiency of pre-trained models by introducing an intermediate pre-training stage. To this end we first employ low-rank approximation to compress the original large model and then devise a feature distillation module and a weight perturbation regularization module. These modules are specifically designed to enhance the low-rank model. In particular we update only the low-rank model while freezing the backbone parameters during pre-training. This allows for direct and efficient utilization of the low-rank model for downstream fine-tuning tasks. The proposed method achieves both efficiencies in terms of required parameters and computation time while maintaining comparable results with minimal modifications to the backbone architecture. Specifically when applied to three vision-only and one vision-language Transformer models our approach often demonstrates a merely 0.6-point decrease in performance while reducing the original parameter size by 1/3 to 2/3.
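As a rough illustration of the parameter arithmetic behind low-rank compression, the sketch below counts the parameters of a rank-r factorization of a single dense weight matrix and finds the largest rank that fits a given budget. The helper names and the 768 x 768 layer size are assumptions chosen for the example, not values taken from the paper.

```python
import math
from fractions import Fraction


def low_rank_params(m, n, r):
    """Parameters in a rank-r factorization W ~ A @ B, with A (m x r) and B (r x n)."""
    return r * (m + n)


def max_rank_for_budget(m, n, keep_fraction):
    """Largest rank whose factors hold at most keep_fraction of the m * n dense parameters."""
    budget = Fraction(keep_fraction) * m * n
    return math.floor(budget / (m + n))


# Hypothetical 768 x 768 layer: keeping one third of the parameters allows rank 128.
rank = max_rank_for_budget(768, 768, Fraction(1, 3))
kept = low_rank_params(768, 768, rank)
```

Exact fractions are used for the budget so that ratios such as 1/3 do not pick up floating-point rounding.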
-
We present XCube a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to 1024^3 in a feed-forward fashion without time-consuming test-time optimization. To achieve this we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m x 100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation we show that our model can be used to solve a variety of tasks such as user-guided editing scene completion from a single scan and text-to-3D.
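The scale figures in the abstract above can be checked with a few lines of Python, and the same sketch shows why sparse storage matters at this resolution. The helper names are hypothetical, and a plain Python set stands in for the VDB data structure, which is not reproduced here.

```python
import math


def cells_per_side(extent_m, voxel_size_m):
    """Number of voxels along one axis for a given extent and voxel size."""
    return round(extent_m / voxel_size_m)


def voxelize(points, voxel_size_m):
    """Sparse occupancy: the set of integer voxel indices touched by the points."""
    return {tuple(math.floor(c / voxel_size_m) for c in p) for p in points}


side = cells_per_side(100.0, 0.1)   # 100 m scene at 10 cm voxels
dense_cells = 1024 ** 3             # cell count of a dense grid at 1024^3
occupied = voxelize([(0.05, 0.05, 0.02), (0.09, 0.01, 0.03), (0.15, 0.05, 0.02)], 0.1)
```

A dense 1024^3 grid has over a billion cells, whereas a sparse representation only stores the occupied ones.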
-
PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors
Conventional image sensors digitize high-resolution images at fast frame rates, producing a large amount of data that needs to be transmitted off the sensor for further processing. This is challenging for perception systems operating on edge devices because communication is power inefficient and induces latency. Fueled by innovations in stacked image sensor fabrication, emerging sensor–processors offer programmability and processing capabilities directly on the sensor. We exploit these capabilities by developing an efficient recurrent neural network architecture, PixelRNN, that encodes spatio-temporal features on the sensor using purely binary operations. PixelRNN reduces the amount of data to be transmitted off the sensor by factors of up to 256 compared to the raw sensor data while offering competitive accuracy for hand gesture recognition and lip reading tasks. We experimentally validate PixelRNN using a prototype implementation on the SCAMP-5 sensor–processor platform.
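One common way to compute with purely binary operations is the XOR-and-popcount dot product used in binarized networks. The sketch below illustrates that general trick under assumed names and an assumed bit encoding; it is not confirmed to be the arithmetic PixelRNN runs on the sensor.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors with entries in {-1, +1}.

    Each vector is packed into an int (bit 1 means +1, bit 0 means -1).
    Matching bits contribute +1 and differing bits -1, so the result is
    n minus twice the number of differing bits.
    """
    mask = (1 << n) - 1
    differing = bin((a_bits ^ b_bits) & mask).count("1")
    return n - 2 * differing


def binary_neuron(a_bits, w_bits, n, threshold=0):
    """Binarized activation: 1 if the +/-1 dot product reaches the threshold, else 0."""
    return 1 if binary_dot(a_bits, w_bits, n) >= threshold else 0


value = binary_dot(0b1011, 0b1101, 4)  # two positions agree, two differ
```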
-
Scene-aware Adaptive Compressive Sensing (ACS) has constituted a persistent pursuit holding substantial promise for the enhancement of Compressive Sensing (CS) performance. Cascaded ACS furnishes a proficient multi-stage framework for adaptively allocating the CS sampling based on previous CS measurements. However reconstruction is commonly required for analyzing and steering the successive CS sampling which bottlenecks the ACS speed and impedes the practical application in time-sensitive scenarios. Addressing this challenge we propose a reconstruction-free cascaded ACS method which requires NO reconstruction during the adaptive sampling process. A lightweight Score Network (ScoreNet) is proposed to directly determine the ACS allocation with previous CS measurements and a differentiable adaptive sampling module is proposed for end-to-end training. For image reconstruction we propose a Multi-Grid Spatial-Attention Network (MGSANet) that could facilitate efficient multi-stage training and inferencing. By introducing the reconstruction-fidelity supervision outside the loop of the multi-stage sampling process ACS can be efficiently optimized and achieve high imaging fidelity. The effectiveness of the proposed method is demonstrated with extensive quantitative and qualitative experiments compared with the state-of-the-art CS algorithms.
-
Current techniques for deep neural network (DNN) pruning often involve intricate multi-step processes that require domain-specific expertise, making their widespread adoption challenging. To address this limitation, Only-Train-Once (OTO) and OTOv2 were proposed to eliminate the need for additional fine-tuning steps by directly training and compressing a general DNN from scratch. Nevertheless, the static design of optimizers (in OTO) can lead to convergence issues of local optima. In this paper, we propose Auto-Train-Once (ATO), an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs. During the model training phase, our approach not only trains the target model but also leverages a controller network as an architecture generator to guide the learning of target model weights. Furthermore, we develop a novel stochastic gradient algorithm that enhances the coordination between model training and controller network training, thereby improving pruning performance. We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures (including ResNet18, ResNet34, ResNet50, ResNet56, and MobileNetv2) on standard benchmark datasets (CIFAR-10, CIFAR-100, and ImageNet).
-
Both limited annotation and domain shift are prevalent challenges in medical image segmentation. Traditional semi-supervised segmentation and unsupervised domain adaptation methods address one of these issues separately. However, the coexistence of limited annotation and domain shift is quite common, which motivates us to introduce a novel and challenging scenario: Mixed Domain Semi-supervised medical image Segmentation (MiDSS). In this scenario, we handle data from multiple medical centers, with limited annotations available for a single domain and a large amount of unlabeled data from multiple domains. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data. To tackle this issue, we employ Unified Copy-Paste (UCP) between images to construct intermediate domains, facilitating the knowledge transfer from the domain of labeled data to the domains of unlabeled data. To fully utilize the information within the intermediate domain, we propose a symmetric Guidance training strategy (SymGD), which additionally offers direct guidance to unlabeled data by merging pseudo labels from intermediate samples. Subsequently, we introduce a Training Process aware Random Amplitude MixUp (TP-RAM) to progressively incorporate style-transition components into intermediate samples. Compared with existing state-of-the-art approaches, our method achieves a notable 13.57% improvement in Dice score on the Prostate dataset, as demonstrated on three public datasets. Our code is available at https://github.com/MQinghe/MiDSS.
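For readers unfamiliar with the two building blocks named in the abstract above, here is a minimal sketch of the Dice score on binary masks and of a generic copy-paste mix between two images. Images and masks are flattened to plain lists, the function names are hypothetical, and the mixing rule is the generic form of copy-paste rather than the paper's UCP procedure.

```python
def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for binary masks given as 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def copy_paste(image_a, image_b, mask):
    """Take pixels from image_a where mask is 1 and from image_b where mask is 0."""
    return [m * a + (1 - m) * b for a, b, m in zip(image_a, image_b, mask)]


mixed = copy_paste([1, 2, 3, 4], [9, 8, 7, 6], [1, 0, 0, 1])
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```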
-
Multi-view stereo reconstruction (MVS) in the wild requires first estimating the camera intrinsic and extrinsic parameters. These are usually tedious and cumbersome to obtain, yet they are mandatory to triangulate corresponding pixels in 3D space, which is at the core of all best-performing MVS algorithms. In this work, we take an opposite stance and introduce DUSt3R, a radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections, operating without prior information about camera calibration or viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. In the case where more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. We base our network architecture on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our formulation directly provides a 3D model of the scene as well as depth information, but interestingly, we can seamlessly recover from it pixel matches, focal lengths, and relative and absolute cameras. Extensive experiments on all these tasks showcase how DUSt3R effectively unifies various 3D vision tasks, setting new performance records on monocular and multi-view depth estimation as well as relative pose estimation. In summary, DUSt3R makes many geometric 3D vision tasks easy. Code and models are available at https://github.com/naver/dust3r.
-
Action understanding matters for intelligent agents and has attracted long-term attention. It can be formed as the mapping from the action physical space to the semantic space. Typically researchers built action datasets according to idiosyncratic choices to define classes and push the envelope of benchmarks respectively. Thus datasets are incompatible with each other like "Isolated Islands" due to semantic gaps and various class granularities e.g. do housework in dataset A and wash plate in dataset B. We argue that a more principled semantic space is an urgent need to concentrate the community efforts and enable us to use all datasets together to pursue generalizable action learning. To this end we design a structured action semantic space in view of verb taxonomy hierarchy and covering massive actions. By aligning the classes of previous datasets to our semantic space we gather (image/video/skeleton/MoCap) datasets into a unified database in a unified label system i.e. bridging "isolated islands" into a "Pangea". Accordingly we propose a novel model mapping from the physical space to semantic space to fully use Pangea. In extensive experiments our new system shows significant superiority especially in transfer learning. Our code and data will be made public at https://mvig-rhos.com/pangea.
-
The perception of autonomous vehicles using radars has attracted increased research interest due to its ability to operate in fog and bad weather. However, training radar models is hindered by the cost and difficulty of annotating large-scale radar data. To overcome this bottleneck, we propose a self-supervised learning framework to leverage the large amount of unlabeled radar data to pre-train radar-only embeddings for self-driving perception tasks. The proposed method combines radar-to-radar and radar-to-vision contrastive losses to learn a general representation from unlabeled radar heatmaps paired with their corresponding camera images. When used for downstream object detection, we demonstrate that the proposed self-supervision framework can improve the accuracy of state-of-the-art supervised baselines by 5.8% in mAP. Code is available at https://github.com/yiduohao/Radical.
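A contrastive loss of the kind mentioned above is often written in the InfoNCE form, where each anchor embedding should score highest against its own paired embedding. The sketch below is a plain-Python illustration under that assumption; the function names, the temperature value, and the way the two terms are weighted are hypothetical and not taken from the paper or its repository.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the match for anchors[i], all others are negatives."""
    anchors = [_normalize(a) for a in anchors]
    positives = [_normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [_dot(a, p) / temperature for p in positives]
        peak = max(logits)  # log-sum-exp shift for numerical stability
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def combined_loss(radar, radar_aug, vision, weight=1.0, temperature=0.1):
    """Radar-to-radar term plus a weighted radar-to-vision term (weighting is an assumption)."""
    return info_nce(radar, radar_aug, temperature) + weight * info_nce(radar, vision, temperature)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while mismatched pairs are penalized heavily.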
-
Adversarially robust knowledge distillation aims to compress large-scale models into lightweight models while preserving adversarial robustness and natural performance on a given dataset. Existing methods typically align probability distributions of natural and adversarial samples between teacher and student models, but they overlook intermediate adversarial samples along the "adversarial path" formed by the multi-step gradient ascent of a sample towards the decision boundary. Such paths capture rich information about the decision boundary. In this paper, we propose a novel adversarially robust knowledge distillation approach by incorporating such adversarial paths into the alignment process. Recognizing the diverse impacts of intermediate adversarial samples (ranging from benign to noisy), we propose an adaptive weighting strategy to selectively emphasize informative adversarial samples, thus ensuring efficient utilization of lightweight model capacity. Moreover, we propose a dual-branch mechanism exploiting the following two insights: (i) complementary dynamics of adversarial paths obtained by targeted and untargeted adversarial learning, and (ii) inherent differences between the gradient ascent path from class c_i towards the nearest class boundary and the gradient descent path from a specific class c_j towards the decision region of c_i (i ≠ j). Comprehensive experiments demonstrate the effectiveness of our method on lightweight models under various settings.
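The notion of an adversarial path can be illustrated on a toy model: run multi-step signed gradient ascent on the loss, project back into an ℓ∞ ball around the input, and keep every intermediate sample instead of only the last one. The sketch below does this for a linear logistic classifier with an analytic gradient. The model, step sizes, and function names are assumptions; the paper's adaptive weighting and dual-branch mechanism are not shown.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Binary cross-entropy of a linear logistic model; y is 0 or 1."""
    p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def adversarial_path(w, x, y, epsilon=0.3, step=0.1, n_steps=5):
    """Return [x_0, x_1, ..., x_n]: the clean input followed by each ascent step."""
    path = [list(x)]
    current = list(x)
    for _ in range(n_steps):
        p = _sigmoid(sum(wi * ci for wi, ci in zip(w, current)))
        grad = [(p - y) * wi for wi in w]  # d(loss)/dx for the logistic model
        current = [c + step * ((g > 0) - (g < 0)) for c, g in zip(current, grad)]
        # Project back into the l-infinity ball of radius epsilon around the clean input.
        current = [min(max(c, x0 - epsilon), x0 + epsilon) for c, x0 in zip(current, x)]
        path.append(list(current))
    return path


w, x, y = [1.0, -2.0], [0.5, 0.5], 1
path = adversarial_path(w, x, y)
losses = [loss(w, p, y) for p in path]
```

In this toy setting the loss rises along the path until the perturbation budget is exhausted, so the earlier points are milder perturbations than the final one.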
-
We propose functional diffusion a generative diffusion model focused on infinite-dimensional function data samples. In contrast to previous work functional diffusion works on samples that are represented by functions with a continuous domain. Functional diffusion can be seen as an extension of classical diffusion models to an infinite-dimensional domain. Functional diffusion is very versatile as images videos audio 3D shapes deformations etc. can be handled by the same framework with minimal changes. In addition functional diffusion is especially suited for irregular data or data defined in non-standard domains. In our work we derive the necessary foundations for functional diffusion and propose a first implementation based on the transformer architecture. We show generative results on complicated signed distance functions and deformation functions defined on 3D surfaces.
-
Adversarial training (AT) is currently one of the most effective ways to obtain the robustness of deep neural networks against adversarial attacks. However, most AT methods suffer from robust overfitting, i.e., a significant generalization gap in adversarial robustness between the training and testing curves. In this paper, we first identify a connection between robust overfitting and the excessive memorization of noisy labels in AT from the viewpoint of the gradient norm. As such label noise is mainly caused by a distribution mismatch and improper label assignments, we are motivated to propose a label refinement approach for AT. Specifically, our Self-Guided Label Refinement first self-refines a more accurate and informative label distribution from over-confident hard labels, and then calibrates the training by dynamically incorporating knowledge from self-distilled models into the current model, thus requiring no external teachers. Empirical results demonstrate that our method can simultaneously boost the standard accuracy and robust performance across multiple benchmark datasets, attack types, and architectures. In addition, we provide a set of analyses from the perspective of information theory to dive into our method and suggest the importance of soft labels for robust generalization.
-
Monocular 3D detection (M3D) aims for precise 3D object localization from a single-view image which usually involves labor-intensive annotation of 3D detection boxes. Weakly supervised M3D has recently been studied to obviate the 3D annotation process by leveraging many existing 2D annotations but it often requires extra training data such as LiDAR point clouds or multi-view images which greatly degrades its applicability and usability in various applications. We propose SKD-WM3D a weakly supervised monocular 3D detection framework that exploits depth information to achieve M3D with a single-view image exclusively without any 3D annotations or other training data. One key design in SKD-WM3D is a self-knowledge distillation framework which transforms image features into 3D-like representations by fusing depth information and effectively mitigates the inherent depth ambiguity in monocular scenarios with little computational overhead in inference. In addition we design an uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy which facilitate knowledge acquisition and knowledge transfer respectively. Extensive experiments show that SKD-WM3D surpasses the state-of-the-art clearly and is even on par with many fully supervised methods.
-
Unsupervised landmarks discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework, we explore the potential of a recent paradigm of self-supervised learning algorithms known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. Towards harnessing the potential of diffusion models for the ULD task, we make the following core contributions. First, we propose a ZeroShot ULD baseline based on simple clustering of random pixel locations with nearest neighbour matching. It delivers better results than the existing ULD methods. Second, motivated by the ZeroShot performance, we develop a ULD algorithm based on diffusion features using self-training and clustering, which also outperforms prior methods by notable margins. Third, we introduce a new proxy task based on generating latent pose codes and also propose a two-stage clustering mechanism to facilitate effective pseudo-labeling, resulting in a significant performance improvement. Overall, our approach consistently outperforms state-of-the-art methods on four challenging benchmarks (AFLW, MAFL, CatHeads, and LS3D) by significant margins.
-
The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address the limitation we introduce M3Act a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by Unity Engine M3Act features multiple semantic groups highly diverse and photorealistic images and a comprehensive set of annotations which facilitates the learning of human-centered tasks across single-person multi-person and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments. The results suggest our synthetic dataset can significantly improve the performance of several downstream methods and replace real-world datasets to reduce cost. Notably M3Act improves the state-of-the-art MOTRv2 on DanceTrack dataset leading to a hop on the leaderboard from 10th to 2nd place. Moreover M3Act opens new research for controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for the novel task. Our code and data are available at our project page: http://cjerry1243.github.io/M3Act.
-
A novel approach to blind image quality assessment called quality comparison network (QCN) is proposed in this paper which sorts the feature vectors of input images according to their quality scores in an embedding space. QCN employs comparison transformers (CTs) and score pivots which act as the centroids of feature vectors of similar-quality images. Each CT updates the score pivots and the feature vectors of input images based on their ordered correlation. To this end we adopt four loss functions. Then we estimate the quality score of a test image by searching the nearest score pivot to its feature vector in the embedding space. Extensive experiments show that the proposed QCN algorithm yields excellent image quality assessment performances on various datasets. Furthermore QCN achieves great performances in cross-dataset evaluation demonstrating its superb generalization capability. The source codes are available at https://github.com/nhshin-mcl/QCN.
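The final estimation step described above, looking up the score pivot nearest to a test image's feature vector, amounts to a nearest-neighbour search in the embedding space. A minimal sketch follows; the Euclidean distance, the function name, and the toy pivots are assumptions, and the comparison transformers that produce the features are not modelled.

```python
import math


def nearest_pivot_score(feature, pivots):
    """Return the quality score of the pivot closest to the feature vector.

    pivots is a list of (score, vector) pairs; Euclidean distance is assumed.
    """
    best_score, _ = min(pivots, key=lambda sv: math.dist(feature, sv[1]))
    return best_score


# Hypothetical pivots for three quality levels in a 2D embedding space.
pivots = [(1.0, [0.0, 0.0]), (3.0, [1.0, 1.0]), (5.0, [2.0, 2.0])]
estimate = nearest_pivot_score([0.9, 1.2], pivots)
```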
-
Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances as paragraphs, has not kept pace. Previous works either treat text detection and grouping with separate models or train a unified model from scratch. None of them has made full use of the already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that enables various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector right off the shelf or just fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance while simultaneously inheriting generalized text detection ability from pre-training. In the case of full parameter fine-tuning, we can further improve layout analysis performance.
-
Whole Slide Image (WSI) classification is often formulated as a Multiple Instance Learning (MIL) problem. Recently Vision-Language Models (VLMs) have demonstrated remarkable performance in WSI classification. However existing methods leverage coarse-grained pathogenetic descriptions for visual representation supervision which are insufficient to capture the complex visual appearance of pathogenetic images hindering the generalizability of models on diverse downstream tasks. Additionally processing high-resolution WSIs can be computationally expensive. In this paper we propose a novel "Fine-grained Visual-Semantic Interaction" (FiVE) framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics. Specifically with meticulously designed queries we start by utilizing a large language model to extract fine-grained pathological descriptions from various non-standardized raw reports. The output descriptions are then reconstructed into fine-grained labels used for training. By introducing a Task-specific Fine-grained Semantics (TFS) module we enable prompts to capture crucial visual information in WSIs which enhances representation learning and augments generalization capabilities significantly. Furthermore given that pathological visual patterns are redundantly distributed across tissue slices we sample a subset of visual instances during training. Our method demonstrates robust generalizability and strong transferability dominantly outperforming the counterparts on the TCGA Lung Cancer dataset with at least 9.19% higher accuracy in few-shot experiments. The code is available at: https://github.com/ls1rius/WSI_FiVE.
-
Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations responding to very specific question formats (typically a multiple-choice response regarding a particular object or attribute), which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that an improvement in existing metrics does not lead to a reduction in Type I hallucinations and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.
-
Creating multi-view wire art (MVWA) a static 3D sculpture with diverse interpretations from different viewpoints is a complex task even for skilled artists. In response we present DreamWire an AI system enabling everyone to craft MVWA easily. Users express their vision through text prompts or scribbles freeing them from intricate 3D wire organisation. Our approach synergises 3D Bezier curves Prim's algorithm and knowledge distillation from diffusion models or their variants (e.g. ControlNet). This blend enables the system to represent 3D wire art ensuring spatial continuity and overcoming data scarcity. Extensive evaluation and analysis are conducted to shed insight on the inner workings of the proposed system including the trade-off between connectivity and visual aesthetics.
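Two of the ingredients named above are standard and easy to sketch: evaluating a cubic Bezier curve in 3D, and running Prim's algorithm to link curve endpoints into one connected structure. The code below shows both in isolation under assumed function names and toy data; how DreamWire combines them with diffusion-based distillation is not reproduced here.

```python
import heapq
import math


def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1] (Bernstein form)."""
    s = 1.0 - t
    return tuple(
        s ** 3 * a + 3 * s ** 2 * t * b + 3 * s * t ** 2 * c + t ** 3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def prim_mst(points):
    """Minimum spanning tree over points, as a list of (parent, child) index pairs."""
    n = len(points)
    if n == 0:
        return []
    visited = [False] * n
    edges = []
    heap = [(0.0, 0, 0)]  # (edge length, from index, to index); start at point 0
    while heap:
        _, parent, node = heapq.heappop(heap)
        if visited[node]:
            continue
        visited[node] = True
        if parent != node:
            edges.append((parent, node))
        for other in range(n):
            if not visited[other]:
                heapq.heappush(heap, (math.dist(points[node], points[other]), node, other))
    return edges


midpoint = cubic_bezier((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 0.0), 0.5)
tree = prim_mst([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)])
```

A spanning tree uses the shortest set of links that keeps every point reachable, which is one simple way to obtain spatial continuity between wire segments.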
-
Lithic Use-Wear Analysis (LUWA) using microscopic images is an underexplored vision-for-science research area. It seeks to distinguish the worked material, which is critical for understanding archaeological artifacts, material interactions, tool functionalities, and dental records. However, this challenging task goes beyond the well-studied image classification problem for common objects. It is affected by many confounders owing to the complex wear mechanism and microscopic imaging, which makes it difficult even for human experts to identify the worked material successfully. In this paper, we investigate the following three questions on this unique vision task for the first time: (i) How well can state-of-the-art pre-trained models (like DINOv2) generalize to the rarely seen domain? (ii) How can few-shot learning be exploited for scarce microscopic images? (iii) How do the ambiguous magnification and sensing modality influence the classification accuracy? To study these, we collaborated with archaeologists and built the first open-source and the largest LUWA dataset, containing 23,130 microscopic images with different magnifications and sensing modalities. Extensive experiments show that existing pre-trained models notably outperform human experts but still leave a large gap for improvements. Most importantly, the LUWA dataset provides an underexplored opportunity for vision and learning communities and complements existing image classification problems on common objects.
-
We focus on the generalization ability of the 6-DoF grasp detection method in this paper. While learning-based grasp detection methods can predict grasp poses for unseen objects using the grasp distribution learned from the training set they often exhibit a significant performance drop when encountering objects with diverse shapes and structures. To enhance the grasp detection methods' generalization ability we incorporate domain prior knowledge of robotic grasping enabling better adaptation to objects with significant shape and structure differences. More specifically we employ the physical constraint regularization during the training phase to guide the model towards predicting grasps that comply with the physical rule on grasping. For the unstable grasp poses predicted on novel objects we design a contact-score joint optimization using the projection contact map to refine these poses in cluttered scenarios. Extensive experiments conducted on the GraspNet-1billion benchmark demonstrate a substantial performance gain on the novel object set and the real-world grasping experiments also demonstrate the effectiveness of our generalizing 6-DoF grasp detection method.
-
In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. See our project page at https://vjwq.github.io/AV-CONV/.
-
Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping
Decentralized federated learning (DFL) facilitates collaborative model training across multiple connected clients without a central coordination server thereby avoiding the single point of failure in traditional centralized federated learning (CFL). However DFL exhibits heightened susceptibility to Byzantine attacks owing to the lack of a responsible central server. Furthermore a benign client in DFL may be dominated by Byzantine clients (more than half of its neighbors are malicious) posing significant challenges for robust model training. In this work we propose DFL-Dual a novel Byzantine-robust DFL method through dual-domain client clustering and trust bootstrapping. Specifically we first propose to leverage both data-domain and model-domain distance metrics to identify client discrepancies. Then we design a trust evaluation mechanism centered on benign clients which enables them to evaluate their neighbors. Building upon the dual-domain distance metric and trust evaluation mechanism we further develop a two-stage clustering and trust bootstrapping technique to exclude Byzantine clients from local model aggregation. We extensively evaluate the proposed DFL-Dual method through rigorous experimentation demonstrating its remarkable performance superiority over existing robust CFL and DFL schemes.
-
In Structure-from-Motion (SfM) the underlying viewgraphs of unordered image collections generally have a highly redundant set of edges that can be sparsified for efficiency without significant loss of reconstruction quality. Often there are also false edges due to incorrect image retrieval and repeated structures (symmetries) that give rise to ghosting and superimposed reconstruction artifacts. We present a unified method to simultaneously sparsify the viewgraph and remove false edges. We propose a scoring mechanism based on camera triplets that identifies edge redundancy as well as false edges. Our edge selection is formulated as an optimization problem which can be provably solved using a simple thresholding scheme. This results in a highly efficient algorithm which can be incorporated as a pre-processing step into any SfM pipeline making it practically usable. We demonstrate the utility of our method on generic and ambiguous datasets that cover the range of small medium and large-scale datasets all with different statistical properties. Sparsification of generic datasets using our method significantly reduces reconstruction time while maintaining the accuracy of the reconstructions as well as removing ghosting artifacts. For ambiguous datasets our method removes false edges thereby avoiding incorrect superimposed reconstructions.
-
The recent wave of AI-generated content has witnessed the great development and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations despite attracting increasing interest. Existing works either train from scratch or adapt a large T2I model to videos, both of which are expensive in computation and resources. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we adapt the T2I model to T2V by designing lightweight spatial and temporal adapters for transfer learning. In addition, we replace the original spatial attention with the proposed Latent-Shift Attention (LSA) for temporal consistency. With a similar model architecture, we further train a video super-resolution model to generate high-definition (1024 x 1024) videos. Beyond T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning. In this way, our method minimizes the training effort, requiring extremely few tunable parameters for model adaptation.
-
Dichotomous Image Segmentation (DIS) has recently emerged as a task aimed at high-precision object segmentation from high-resolution natural images. When designing an effective DIS model, the main challenge is how to balance the semantic dispersion of high-resolution targets in the small receptive field and the loss of high-precision details in the large receptive field. Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement. The human visual system captures regions of interest by observing them from multiple views. Inspired by this, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet), which unifies the feature fusion of the distant view and close-up view into a single stream with one encoder-decoder structure. With the help of the proposed multi-view complementary localization and refinement modules, our approach establishes long-range, profound visual interactions across multiple views, allowing the features of the detailed close-up view to focus on highly slender structures. Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed. The source code and datasets will be publicly available at https://github.com/qianyu-dlut/MVANet
-
Diffusion-based text-to-video generation has witnessed impressive progress in the past year, yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. By contrast, it is far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we propose a novel text-to-video generation framework, termed TF-T2V, which can learn directly from text-free videos. The rationale is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with shared weights. Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and observe a performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that our model enjoys a sustained performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally, we validate the effectiveness and generalizability of our approach on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available.
-
The great advancement of molecular machine learning depends on a considerable amount of labeled data. In many real-world scenarios, labeled molecules are limited in quantity or laborious to derive. Recent pseudo-labeling methods are usually designed around a single source of domain knowledge, thereby failing to capture comprehensive molecular configurations and limiting their ability to generalize across diverse biochemical contexts. To this end, we introduce an innovative paradigm for molecule pseudo-labeling named Molecular Data Programming (MDP). In particular, we adopt systematic supervision sources by crafting multiple graph labeling functions, which cover various molecular structural knowledge from graph kernels, molecular fingerprints, and topological features. Each of them creates uncertain and biased labels for the unlabeled molecules. To address the decision conflicts among the diverse pseudo-labels, we design a label synchronizer to differentiably model confidences and correlations between the labeling functions, which yields probabilistic molecular labels that adapt to specific applications. These probabilistic molecular labels are used to train a molecular classifier to improve its generalization capability. On eight benchmark datasets, we empirically demonstrate the effectiveness of MDP on weakly supervised molecule classification tasks.
-
Object detection in radar imagery with neural networks shows great potential for improving autonomous driving. However, obtaining annotated datasets from real radar images, crucial for training these networks, is challenging, especially in scenarios with long-range detection and adverse weather and lighting conditions, where radar performance excels. To address this challenge, we present RadSimReal, an innovative physical radar simulation capable of generating synthetic radar images with accompanying annotations for various radar types and environmental conditions, all without the need for real data collection. Remarkably, our findings demonstrate that training object detection models on RadSimReal data and subsequently evaluating them on real-world data produces performance levels comparable to models trained and tested on real data from the same dataset, and even achieves better performance when testing across different real datasets. RadSimReal offers advantages over other physical radar simulations in that it does not require knowledge of the radar design details, which are often not disclosed by radar suppliers, and has a faster run-time. This innovative tool has the potential to advance the development of computer vision algorithms for radar-based autonomous driving applications.
-
Inherent ambiguity in layout annotations poses significant challenges to developing accurate 360° room layout estimation models. To address this issue, we propose a novel Bi-Layout model capable of predicting two distinct layout types. One stops at ambiguous regions, while the other extends to encompass all visible areas. Our model employs two global context embeddings, where each embedding is designed to capture specific contextual information for each layout type. With our novel feature guidance module, the image feature retrieves relevant context from these embeddings, generating layout-aware features for precise bi-layout predictions. A unique property of our Bi-Layout model is its ability to inherently detect ambiguous regions by comparing the two predictions. To circumvent the need for manual correction of ambiguous annotations during testing, we also introduce a new metric for disambiguating ground truth layouts. Our method demonstrates superior performance on benchmark datasets, notably outperforming leading approaches. Specifically, on the MatterportLayout dataset, it improves 3DIoU from 81.70% to 82.57% across the full test set, and notably from 54.80% to 59.97% in subsets with significant ambiguity.
-
We propose residual denoising diffusion models (RDDM) a novel dual diffusion process that decouples the traditional single denoising diffusion process into residual diffusion and noise diffusion. This dual diffusion framework expands the denoising-based diffusion models initially uninterpretable for image restoration into a unified and interpretable model for both image generation and restoration by introducing residuals. Specifically our residual diffusion represents directional diffusion from the target image to the degraded input image and explicitly guides the reverse generation process for image restoration while noise diffusion represents random perturbations in the diffusion process. The residual prioritizes certainty while the noise emphasizes diversity enabling RDDM to effectively unify tasks with varying certainty or diversity requirements such as image generation and restoration. We demonstrate that our sampling process is consistent with that of DDPM and DDIM through coefficient transformation and propose a partially path-independent generation process to better understand the reverse process. Notably our RDDM enables a generic UNet trained with only an L1 loss and a batch size of 1 to compete with state-of-the-art image restoration methods. We provide code and pre-trained models to encourage further exploration application and development of our innovative framework (https://github.com/nachifur/RDDM).
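The dual forward process described above can be illustrated with a minimal sketch. The abstract does not give the update rule, so the form used here (target plus a scaled residual plus scaled Gaussian noise) and the names `forward_sample`, `alpha_bar`, and `beta_bar` are assumptions for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
import random

def forward_sample(target, degraded, alpha_bar, beta_bar, rng):
    """Draw one sample of an assumed dual forward process on flat pixel lists.

    alpha_bar scales the residual (directional) term and beta_bar scales the
    noise (random) term; both are hypothetical cumulative schedule values.
    """
    out = []
    for x0, x_in in zip(target, degraded):
        residual = x_in - x0            # direction: target -> degraded input
        noise = rng.gauss(0.0, 1.0)     # random perturbation
        out.append(x0 + alpha_bar * residual + beta_bar * noise)
    return out

rng = random.Random(0)
target = [0.2, 0.5, 0.9]
degraded = [0.1, 0.4, 0.6]

# With both coefficients at zero the sample is the clean target.
start = forward_sample(target, degraded, 0.0, 0.0, rng)
# With the residual fully applied and no noise the sample is the degraded input.
end_no_noise = forward_sample(target, degraded, 1.0, 0.0, rng)
# A noisy intermediate state mixes both terms.
middle = forward_sample(target, degraded, 0.5, 0.3, rng)
```

Under this assumed form, setting the residual to zero leaves a pure noise process (generation), while a nonzero residual steers the process toward the degraded input (restoration), which matches the certainty versus diversity split described in the abstract.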
-
To defend deep neural networks from adversarial attacks adversarial training has been drawing increasing attention for its effectiveness. However the accuracy and robustness resulting from the adversarial training are limited by the architecture because adversarial training improves accuracy and robustness by adjusting the weight connection affiliated to the architecture. In this work we propose ARNAS to search for accurate and robust architectures for adversarial training. First we design an accurate and robust search space in which the placement of the cells and the proportional relationship of the filter numbers are carefully determined. With the design the architectures can obtain both accuracy and robustness by deploying accurate and robust structures to their sensitive positions respectively. Then we propose a differentiable multi-objective search strategy performing gradient descent towards directions that are beneficial for both natural loss and adversarial loss thus the accuracy and robustness can be guaranteed at the same time. We conduct comprehensive experiments in terms of white-box attacks black-box attacks and transferability. Experimental results show that the searched architecture has the strongest robustness with the competitive accuracy and breaks the traditional idea that NAS-based architectures cannot transfer well to complex tasks in robustness scenarios. By analyzing outstanding architectures searched we also conclude that accurate and robust neural architectures tend to deploy different structures near the input and output which has great practical significance on both hand-crafting and automatically designing of accurate and robust architectures.
-
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration but overlook the modeling of close interactions. In this work we tackle the task of reconstructing closely interactive humans from a monocular video. The main challenge of this task comes from insufficient visual information caused by depth ambiguity and severe inter-person occlusion. In view of this we propose to leverage knowledge from proxemic behavior and physics to compensate the lack of visual information. This is based on the observation that human interaction has specific patterns following the social proxemics. Specifically we first design a latent representation based on Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A proxemics and physics guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model as dual branch with each branch representing one individual such that the interaction can be modeled via cross attention. With the learned priors of VQ-VAE and physical constraint as the additional information our proposed approach is capable of estimating accurate poses that are also proxemics and physics plausible. Experimental results on Hi4D 3DPW and CHI3D demonstrate that our method outperforms existing approaches. The code is available at https://github.com/boycehbz/HumanInteraction.
-
The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been a growing research interest in developing post-hoc OOD detection methods there has been comparably little discussion around how these methods perform when the underlying classifier is not trained on a clean carefully curated dataset. In this work we take a closer look at 20 state-of-the-art OOD detection methods in the (more realistic) scenario where the labels used to train the underlying classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive experiments across different datasets noise types & levels architectures and checkpointing strategies provide insights into the effect of class label noise on OOD detection and show that poor separation between incorrectly classified ID samples vs. OOD samples is an overlooked yet important limitation of existing methods. Code: https://github.com/glhr/ood-labelnoise
-
Recently, the advancement of self-supervised learning techniques like masked autoencoders (MAE) has greatly influenced visual representation learning for images and videos. Nevertheless, the predominant approaches in existing masked image / video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach, termed VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation, we utilize ConvNets implemented with sparse convolutional operators as encoders. Simultaneously, we present a simple yet effective masked video modeling (MVM) approach: a dual encoder architecture comprising an online encoder and an exponential moving average target encoder, aimed at facilitating inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, by empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% J&F), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).
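The exponential moving average target encoder mentioned above follows a standard update pattern, sketched below on plain lists of parameters. The function name `ema_update` and the momentum value are assumptions for illustration; the abstract does not state the momentum used.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Move each target parameter a small step toward the online parameter."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]

# Toy parameters: the online encoder is trained by gradient descent,
# the target encoder only ever changes through the EMA update.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

for _ in range(3):
    target = ema_update(target, online, momentum=0.5)
```

With a fixed online encoder, the target converges geometrically toward it; in practice the online weights change at every step, so the target acts as a smoothed copy that provides stable reconstruction targets.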
-
Generative models e.g. Stable Diffusion have enabled the creation of photorealistic images from text prompts. Yet the generation of 360-degree panorama images from text remains a challenge particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and thanks to its dual-branch structure can integrate additional constraints like room layout for customized panorama outputs.
-
Learning 3D scene flow from LiDAR point clouds presents significant difficulties including poor generalization from synthetic datasets to real scenes scarcity of real-world 3D labels and poor performance on real sparse LiDAR point clouds. We present a novel approach from the perspective of auto-labelling aiming to generate a large number of 3D scene flow pseudo labels for real-world LiDAR point clouds. Specifically we employ the assumption of rigid body motion to simulate potential object-level rigid movements in autonomous driving scenarios. By updating different motion attributes for multiple anchor boxes the rigid motion decomposition is obtained for the whole scene. Furthermore we developed a novel 3D scene flow data augmentation method for global and local motion. By perfectly synthesizing target point clouds based on augmented motion parameters we easily obtain lots of 3D scene flow labels in point clouds highly consistent with real scenarios. On multiple real-world datasets including LiDAR KITTI nuScenes and Argoverse our method outperforms all previous supervised and unsupervised methods without requiring manual labelling. Impressively our method achieves a tenfold reduction in EPE3D metric on the LiDAR KITTI dataset reducing it from 0.190m to a mere 0.008m error.
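The idea of synthesizing a target point cloud from assumed rigid motion, and reading the scene flow labels off directly, can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the axis-aligned box, the yaw-only rotation, and the names `in_box` and `rigid_flow` are all assumptions.

```python
import math

def in_box(p, center, half_size):
    """True if point p lies inside an axis-aligned box."""
    return all(abs(p[i] - center[i]) <= half_size[i] for i in range(3))

def rigid_flow(points, center, half_size, yaw, translation):
    """Apply one rigid motion to the points inside a box.

    Points inside the box are rotated about the box centre (z axis) and
    translated; points outside stay fixed. Returns the synthesized target
    points and the per-point flow, which serves as the pseudo label.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    targets, flows = [], []
    for p in points:
        if in_box(p, center, half_size):
            dx, dy = p[0] - center[0], p[1] - center[1]
            q = (center[0] + c * dx - s * dy + translation[0],
                 center[1] + s * dx + c * dy + translation[1],
                 p[2] + translation[2])
        else:
            q = p
        targets.append(q)
        flows.append(tuple(q[i] - p[i] for i in range(3)))
    return targets, flows

points = [(1.0, 0.0, 0.0), (10.0, 10.0, 0.0)]
targets, flows = rigid_flow(points, center=(0.0, 0.0, 0.0),
                            half_size=(2.0, 2.0, 2.0),
                            yaw=0.0, translation=(0.5, 0.0, 0.0))
```

Because the target cloud is generated from known motion parameters, the flow labels are exact by construction, which is what removes the need for manual labelling.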
-
Neural implicit representation of geometric shapes has witnessed considerable advancements in recent years. However common distance field based implicit representations specifically signed distance field (SDF) for watertight shapes or unsigned distance field (UDF) for arbitrary shapes routinely suffer from degradation of reconstruction accuracy when converting to explicit surface points and meshes. In this paper we introduce a novel neural implicit representation based on unsigned orthogonal distance fields (UODFs). In UODFs the minimal unsigned distance from any spatial point to the shape surface is defined solely in one orthogonal direction contrasting with the multi-directional determination made by SDF and UDF. Consequently every point in the 3D UODFs can directly access its closest surface points along three orthogonal directions. This distinctive feature leverages the accurate reconstruction of surface points without interpolation errors. We verify the effectiveness of UODFs through a range of reconstruction examples extending from simple watertight or non-watertight shapes to complex shapes that include hollows internal or assembling structures.
-
Blind video quality assessment (BVQA) plays a pivotal role in evaluating and improving the viewing experience of end-users across a wide range of video-based platforms and services. Contemporary deep learning-based models primarily analyze video content in its aggressively subsampled format while being blind to the impact of the actual spatial resolution and frame rate on video quality. In this paper we propose a modular BVQA model and a method of training it to improve its modularity. Our model comprises a base quality predictor a spatial rectifier and a temporal rectifier responding to the visual content and distortion spatial resolution and frame rate changes on video quality respectively. During training spatial and temporal rectifiers are dropped out with some probabilities to render the base quality predictor a standalone BVQA model which should work better with the rectifiers. Extensive experiments on both professionally-generated content and user-generated content video databases show that our quality model achieves superior or comparable performance to current methods. Additionally the modularity of our model offers an opportunity to analyze existing video quality databases in terms of their spatial and temporal complexity.
-
Vision-Language (VL) models have gained significant research focus enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder a Large Language Model (LLM) and a projection module that aligns visual features with the LLM's representation space. Despite their success a critical limitation persists: the vision encoding process remains decoupled from user queries often in the form of image-related questions. Consequently the resulting visual features may not be optimally attuned to the query-specific elements of the image. To address this we introduce QA-ViT a Question Aware Vision Transformer approach for multimodal reasoning which embeds question awareness directly within the vision encoder. This integration results in dynamic visual features focusing on relevant image aspects to the posed question. QA-ViT is model-agnostic and can be incorporated efficiently into any VL architecture. Extensive experiments demonstrate the effectiveness of applying our method to various multimodal architectures leading to consistent improvement across diverse tasks and showcasing its potential for enhancing visual and scene-text understanding.
-
Due to the resource-intensive nature of training vision-language models on expansive video data a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy for web-scaled descriptive narratives and concise action category names leading to less distinct semantic space and potential performance limitations. In this work we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover to assign the best descriptors with different video instances we propose Optimal Descriptor Solver forming the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot few-shot and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
-
We contribute the Habitat Synthetic Scene Dataset a dataset of 211 high-quality 3D scenes and use it to test navigation agent generalization to realistic 3D environments. Our dataset represents real interiors and contains a diverse set of 18656 models of real-world objects. We investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find and navigate to objects (ObjectGoal navigation). By comparing to synthetic 3D scene datasets from prior work we find that scale helps in generalization but the benefits quickly saturate making visual fidelity and correlation to real-world scenes more important. Our experiments show that agents trained on our smaller-scale dataset can match or outperform agents trained on much larger datasets. Surprisingly we observe that agents trained on just 122 scenes from our dataset outperform agents trained on 10000 scenes from the ProcTHOR-10K dataset in terms of zero-shot generalization in real-world scanned environments.
-
The boom of 3D recognition in the 2020s began with the introduction of point cloud transformers. They quickly overtook sparse CNNs and became state-of-the-art models, especially in 3D semantic segmentation. However, sparse CNNs are still valuable networks due to their efficiency and ease of application. In this work, we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. We discover that the key factor behind the performance difference is adaptivity. Specifically, we propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap. This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal computational cost. Without any self-attention modules, OA-CNNs surpass point transformers in accuracy in both indoor and outdoor scenes, with much lower latency and memory cost. Notably, they achieve 76.1%, 78.9%, and 70.6% mIoU on the ScanNet v2, nuScenes, and SemanticKITTI validation benchmarks respectively, while being up to 5x faster than transformer counterparts. This finding highlights the potential of pure sparse CNNs to outperform transformer-related networks. Our code is built upon Pointcept, which is available at https://github.com/Pointcept/Pointcept.
-
Comprehensive capturing of human motions requires both accurate capture of complex poses and precise localization of the human within scenes. Most human pose estimation (HPE) datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, solely using these modalities or a combination of them may not be adequate for HPE, particularly for complex and fast movements. For holistic human motion understanding, we present RELI11D, a high-quality multimodal human motion dataset that involves a LiDAR, an IMU system, an RGB camera, and an event camera. It records the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours of synchronized LiDAR point clouds, IMU measurement data, RGB videos, and event streams. Through extensive experiments, we demonstrate that RELI11D presents considerable challenges and opportunities, as it contains many rapid and complex motions that require precise localization. To address the challenge of integrating different modalities, we propose LEIR, a multimodal baseline that effectively utilizes LiDAR point clouds, event streams, and RGB through our cross-attention fusion strategy. We show that LEIR exhibits promising results for rapid motions and daily motions, and that utilizing the characteristics of multiple modalities can indeed improve HPE performance. Both the dataset and source code will be released publicly to the research community, fostering collaboration and enabling further exploration in this field.
-
We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural oscillatory dynamics of objects such as trees, flowers, candles, and clothes swaying in the wind. We model dense long-term motion in the Fourier domain as spectral volumes, which we find are well-suited to prediction with diffusion models. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, the predicted motion representation can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to interact with objects in real images, producing realistic simulated dynamics (by interpreting the spectral volumes as image-space modal bases). See our project page for more results: generative-dynamics.github.io
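Converting Fourier-domain motion into a per-frame trajectory can be sketched for a single pixel and a single axis. The representation below (a small dictionary of complex coefficients indexed by frequency) and the function name `trajectory` are assumptions for illustration; the paper's spectral volumes are dense, per-pixel, and predicted by a diffusion model.

```python
import cmath
import math

def trajectory(coeffs, num_frames):
    """Evaluate a displacement per frame from a few Fourier coefficients.

    coeffs maps a frequency index k to a complex amplitude for one pixel
    and one axis; the displacement is the real part of the Fourier sum.
    """
    out = []
    for t in range(num_frames):
        value = sum(
            (c * cmath.exp(2j * math.pi * k * t / num_frames)).real
            for k, c in coeffs.items()
        )
        out.append(value)
    return out

# One unit-amplitude component at the lowest nonzero frequency.
disp = trajectory({1: 1.0 + 0j}, 4)
```

With integer frequency indices, every component is periodic over the frame count, so the resulting motion repeats without a jump when the sequence wraps around.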
-
Many face anti-spoofing (FAS) methods have focused on learning discriminative features from both live and spoof training data to strengthen the security of face recognition systems. However, since not every possible attack type is available in the training stage, these FAS methods usually fail to detect unseen attacks in the inference stage. In comparison, one-class FAS, where the training data are from only live faces, aims to detect whether a test face image belongs to the live class or not. In this paper, we propose a novel One-Class Spoof Cue Map estimation Network (OC-SCMNet) to address the one-class FAS detection problem. Our first goal is to learn to extract latent spoof features from live images so that their estimated Spoof Cue Maps (SCMs) have zero responses. To avoid being trapped in a trivial solution, we devise a novel SCM-guided feature learning scheme that combines many SCMs as pseudo ground-truths to guide a conditional generator to generate latent spoof features for spoof data. Our second goal is to approximately simulate the potential out-of-distribution spoof attacks. To this end, we propose using a memory bank to dynamically preserve a set of sufficiently "independent" latent spoof features to encourage the generator to probe the latent spoof feature space. Extensive experiments conducted on eight FAS benchmark datasets demonstrate that the proposed OC-SCMNet not only outperforms previous one-class FAS methods but also achieves performance comparable to state-of-the-art two-class FAS methods. The code is available at https://github.com/Pei-KaiHuang/CVPR24_OC_SCMNet.
-
The development of large vision-language models notably CLIP has catalyzed research into effective adaptation techniques with a particular focus on soft prompt tuning. Conjointly test-time augmentation which utilizes multiple augmented views of a single image to enhance zero-shot generalization is emerging as a significant area of interest. This has predominantly directed research efforts towards test-time prompt tuning. In contrast we introduce a robust MeanShift for Test-time Augmentation (MTA) which surpasses prompt-based methods without requiring this intensive training procedure. This positions MTA as an ideal solution for both standalone and API-based applications. Additionally our method does not rely on ad hoc rules (e.g. confidence threshold) used in some previous test-time augmentation techniques to filter the augmented views. Instead MTA incorporates a quality assessment variable for each view directly into its optimization process termed as the inlierness score. This score is jointly optimized with a density mode seeking process leading to an efficient training- and hyperparameter-free approach. We extensively benchmark our method on 15 datasets and demonstrate MTA's superiority and computational efficiency. Deployed easily as plug-and-play module on top of zero-shot models and state-of-the-art few-shot methods MTA shows systematic and consistent improvements.
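Density mode seeking over augmented views can be illustrated with a plain Gaussian mean shift, where the kernel weight of each view plays the role of a soft inlier weight. This is a simplified stand-in, not the authors' method: MTA optimizes an explicit inlierness score jointly with the mode, and the name `mean_shift_mode`, the bandwidth, and the toy 2D embeddings are assumptions.

```python
import math

def mean_shift_mode(views, bandwidth=0.5, iters=50):
    """Find a density mode of the view embeddings by Gaussian mean shift.

    Returns the mode and the final kernel weight of each view, which is
    near 1 for views close to the mode and near 0 for outliers.
    """
    dim = len(views[0])
    mode = [sum(v[d] for v in views) / len(views) for d in range(dim)]
    weights = [1.0] * len(views)
    for _ in range(iters):
        weights = [
            math.exp(-sum((v[d] - mode[d]) ** 2 for d in range(dim))
                     / (2.0 * bandwidth ** 2))
            for v in views
        ]
        total = sum(weights)
        mode = [sum(w * v[d] for w, v in zip(weights, views)) / total
                for d in range(dim)]
    return mode, weights

# Three consistent augmented views and one outlier view.
views = [(1.0, 0.0), (0.9, 0.1), (1.1, -0.1), (-3.0, 4.0)]
mode, weights = mean_shift_mode(views)
```

The outlier view is down-weighted by the kernel rather than by a hand-set confidence threshold, which is the behaviour the abstract contrasts with rule-based view filtering.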
-
Large-scale text-to-image (T2I) diffusion models have showcased incredible capabilities in generating coherent images based on textual descriptions, enabling vast applications in content generation. While recent advancements have introduced control over factors such as object localization, posture, and image contours, a crucial gap remains in our ability to control the interactions between objects in the generated content. Controlling interactions well in generated images could yield meaningful applications, such as creating realistic scenes with interacting characters. In this work, we study the problem of conditioning T2I diffusion models with Human-Object Interaction (HOI) information, consisting of a triplet label (person, action, object) and corresponding bounding boxes. We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models so that they can be better conditioned on interactions. Specifically, we tokenize the HOI information and learn their relationships via interaction embeddings. A conditioning self-attention layer is trained to map HOI tokens to visual tokens, thereby conditioning the visual tokens better in existing T2I diffusion models. Our model attains the ability to control the interaction and location on existing T2I diffusion models, and it outperforms existing baselines by a large margin in HOI detection score, as well as in fidelity measured by FID and KID. Project page: https://jiuntian.github.io/interactdiffusion.
-
We propose NViST, a transformer-based model for efficient and generalizable novel-view synthesis from a single image for real-world scenes. In contrast to many methods that are trained on synthetic data, object-centred scenarios, or in a category-specific manner, NViST is trained on MVImgNet, a large-scale dataset of casually-captured real-world videos of hundreds of object categories with diverse backgrounds. NViST transforms image inputs directly into a radiance field, conditioned on camera parameters via adaptive layer normalisation. In practice, NViST exploits fine-tuned masked autoencoder (MAE) features and translates them to 3D output tokens via cross-attention, while addressing occlusions with self-attention. To move away from object-centred datasets and enable full scene synthesis, NViST adopts a 6-DOF camera pose model and only requires relative pose, dropping the need for canonicalization of the training data, which removes a substantial barrier to it being used on casually captured datasets. We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures. We conduct qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that our model represents a step forward towards enabling true in-the-wild generalizable novel-view synthesis from a single image. Project webpage: https://wbjang.github.io/nvist_webpage.
-
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion, crucially without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at https://github.com/zh460045050/V2L-Tokenizer.
-
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery, leading to suboptimal segmentation results. To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network. Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset comprising 17,402 image-caption-mask triplets, which is unparalleled in terms of scale and variety. This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. Experimental evaluations demonstrate the exceptional performance of RMSIN, surpassing existing state-of-the-art models by a significant margin. Datasets and code are available at https://github.com/Lsan2401/RMSIN.
-
Scene coordinate regression (SCR) methods are a family of visual localization methods that directly regress 2D-3D matches for camera pose estimation. They are effective in small-scale scenes but face significant challenges in large-scale scenes, challenges that are further amplified in the absence of ground truth 3D point clouds for supervision. Here, the model can only rely on reprojection constraints and needs to implicitly triangulate the points. The challenges stem from a fundamental dilemma: the network has to be invariant to observations of the same landmark under different viewpoints, lighting conditions, etc., but at the same time discriminate between unrelated but similar observations. The latter becomes more relevant and severe in larger scenes. In this work, we tackle this problem by introducing the concept of co-visibility to the network. We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network. Specifically, we propose a novel feature diffusion technique that implicitly groups the reprojection constraints with co-visibility and avoids overfitting to trivial solutions. Additionally, our position decoder parameterizes the output positions for large-scale scenes more effectively. Without using 3D models or depth maps for supervision, our method achieves state-of-the-art results on large-scale scenes with a low-map-size model. On Cambridge landmarks, with a single model, we achieve a 17% lower median position error than Poker, the ensemble variant of the state-of-the-art SCR method ACE. Code is available at: https://github.com/cvg/glace.
-
From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple yet extremely effective training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+29.4% mIoU on Pascal VOC, +13.2% mIoU on Pascal Context, +14.0% mIoU on MS COCO, +2.4% mIoU on COCO Stuff) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs. Our codebase is at https://github.com/letitiabanana/PnP-OVSS.
-
The task of online mapping is to predict a local map using current sensor observations, e.g. from lidar and camera, without relying on a pre-built map. State-of-the-art methods are based on supervised learning and are trained predominantly using two datasets: nuScenes and Argoverse 2. However, these datasets revisit the same geographic locations across training, validation, and test sets. Specifically, over 80% of nuScenes and 40% of Argoverse 2 validation and test samples are less than 5 m from a training sample. At test time, the methods are thus evaluated more on how well they localize within a memorized implicit map built from the training data than on extrapolating to unseen locations. Naturally, this data leakage causes inflated performance numbers, and we propose geographically disjoint data splits to reveal the true performance in unseen environments. Experimental results show that methods perform considerably worse, some dropping more than 45 mAP, when trained and evaluated on proper data splits. Additionally, a reassessment of prior design choices reveals diverging conclusions from those based on the original split. Notably, the impact of lifting methods and the support from auxiliary tasks (e.g., depth supervision) on performance appears less substantial or follows a different trajectory than previously perceived.
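The leakage statistic quoted above (the share of validation and test samples lying within 5 m of a training sample) can be illustrated with a minimal sketch. The function names, the flat 2D coordinates in metres, and the single-boundary split rule are assumptions made for illustration; they are not the authors' tooling or their actual split definition.

```python
import math


def leakage_fraction(train_xy, eval_xy, radius_m=5.0):
    """Fraction of evaluation samples closer than radius_m to any training sample.

    Coordinates are assumed to be (x, y) positions in metres in a shared
    map frame (a hypothetical input format).
    """
    if not eval_xy:
        return 0.0
    near = 0
    for ex, ey in eval_xy:
        # Brute-force nearest-neighbour check; fine for a small illustration.
        if any(math.hypot(ex - tx, ey - ty) < radius_m for tx, ty in train_xy):
            near += 1
    return near / len(eval_xy)


def split_by_boundary(samples_xy, boundary_x):
    """Hypothetical geographically disjoint split along one vertical line.

    Samples west of boundary_x go to training, the rest are held out.
    """
    train = [p for p in samples_xy if p[0] < boundary_x]
    held_out = [p for p in samples_xy if p[0] >= boundary_x]
    return train, held_out


train = [(0.0, 0.0), (100.0, 0.0)]
evaluation = [(3.0, 0.0), (50.0, 0.0)]
print(leakage_fraction(train, evaluation))  # one of two eval samples is within 5 m
```

A region-based split like `split_by_boundary` does not by itself guarantee a 5 m gap near the boundary, so in practice a buffer zone or a check with `leakage_fraction` would still be needed.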
-
We propose a method to control material attributes of objects, like roughness, metallic, albedo, and transparency, in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material-edited NeRFs.
-
Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos.
-
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including on six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
-
We present a new self-supervised approach, SelfPose3d, for estimating 3d poses of multiple persons from multiple camera views. Unlike current state-of-the-art fully-supervised methods, our approach does not require any 2d or 3d ground-truth poses and uses only the multi-view input images from a calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d human pose estimator. We propose two self-supervised learning objectives: self-supervised person localization in 3d space and self-supervised 3d pose estimation. We achieve self-supervised 3d person localization by training the model on synthetically generated 3d points, serving as 3d person root positions, and on the projected root-heatmaps in all the views. We then model the 3d poses of all the localized persons with a bottleneck representation, map them onto all views obtaining 2d joints, and render them using 2d Gaussian heatmaps in an end-to-end differentiable manner. Afterwards, we use the corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive supervision attention mechanism to guide the self-supervision. Our experiments and analysis on three public benchmark datasets, including Panoptic, Shelf, and Campus, show the effectiveness of our approach, which is comparable to fully-supervised methods. Code is available at https://github.com/CAMMA-public/SelfPose3D.
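As a rough illustration of the "render them using 2d Gaussian heatmaps" step, the sketch below turns one projected 2d joint into a heatmap. It is plain Python rather than a differentiable tensor implementation, and the function name, the pixel-grid convention, and the default `sigma` are assumptions, not details taken from the paper.

```python
import math


def gaussian_heatmap(joint_xy, width, height, sigma=2.0):
    """Render one 2d joint as an unnormalised Gaussian heatmap.

    Returns a height x width nested list indexed as heatmap[y][x];
    the value is 1.0 exactly at the joint when it sits on a pixel centre.
    """
    cx, cy = joint_xy
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / two_sigma_sq)
         for x in range(width)]
        for y in range(height)
    ]


heatmap = gaussian_heatmap((2, 3), width=5, height=6)
print(heatmap[3][2])  # peak value at the joint location
```

Because the heatmap is a smooth function of the joint coordinates, the same formula written with tensors stays differentiable with respect to `joint_xy`, which is the property the end-to-end training described above relies on.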
-
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification, but with less (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available here.
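The inference-time ensembling described above can be sketched as follows. The abstract does not specify how the correlation is turned into weights, so the dot-product similarity, the softmax with a temperature, and every name here (`ensemble_logits`, `task_embedding`, `cluster_centers`) are assumptions for illustration only.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ensemble_logits(task_embedding, cluster_centers, expert_logits, temperature=0.1):
    """Weight each expert by how similar the task is to that expert's cluster.

    task_embedding:  embedding of the task metadata (hypothetical input)
    cluster_centers: one centre per data expert, same dimension
    expert_logits:   one list of class logits per data expert
    """
    similarities = [dot(task_embedding, c) for c in cluster_centers]
    weights = softmax(similarities, temperature)
    n_classes = len(expert_logits[0])
    combined = [
        sum(w * logits[k] for w, logits in zip(weights, expert_logits))
        for k in range(n_classes)
    ]
    return weights, combined


weights, combined = ensemble_logits(
    task_embedding=[1.0, 0.0],
    cluster_centers=[[1.0, 0.0], [0.0, 1.0]],
    expert_logits=[[2.0, 0.0], [0.0, 2.0]],
)
print(weights, combined)  # the first expert dominates the ensemble
```

A lower temperature makes the ensemble behave more like routing to a single expert, while a higher one approaches a uniform average.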
-
3D human generation is increasingly significant in various applications. However, the direct use of 2D generative methods in 3D generation often results in losing local details, while methods that reconstruct geometry from generated images struggle with global view consistency. In this work, we introduce Joint2Human, a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly, ensuring both global structure and local details. To achieve this, we employ the Fourier occupancy field (FOF) representation, enabling the direct generation of 3D shapes as preliminary results with 2D generative models. With the proposed high-frequency enhancer and the multi-view recarving strategy, our method can seamlessly integrate the details from different views into a uniform global shape. To better utilize the 3D human prior and enhance control over the generated geometry, we introduce a compact spherical embedding of 3D joints. This allows for effective guidance of pose during the generation process. Additionally, our method can generate 3D humans guided by textual inputs. Our experimental results demonstrate the capability of our method to ensure global structure, local details, high resolution, and low computational cost simultaneously. More results and the code can be found on our project page at http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human.
-
Text-to-image (T2I) research has grown explosively in the past year, owing to large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: text prompt engineering, where searching for high-quality text prompts for customized results is more art than science. Moreover, as commonly argued, "an image is worth a thousand words": the attempt to describe a desired image with text often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder), substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models will be open-sourced.
-
Recent advancements in single-image-driven 3D content generation have been propelled by leveraging prior knowledge from pretrained 2D diffusion models. However, the 3D content generated by existing methods often exhibits distorted outline shapes and inadequate details. To solve this problem, we propose a novel framework called Mask-enhanced Progressive Outline-to-Detail optimization (a.k.a. MPOD123), which consists of two stages. Specifically, in the first stage, MPOD123 utilizes the pretrained view-conditioned diffusion model to guide the outline shape optimization of the 3D content. Given a certain viewpoint, we estimate outline shape priors in the form of a 2D mask from the 3D content by leveraging opacity calculation. In the second stage, MPOD123 incorporates Detail Appearance Inpainting (DAI) to guide the refinement of local geometry and texture with the shape priors. The essence of DAI lies in the Mask Rectified Cross-Attention (MRCA), which can be conveniently plugged into the Stable Diffusion model. The MRCA module utilizes the mask to rectify the attention map from each cross-attention layer. Accompanied by this new module, DAI is capable of guiding the detail refinement of the 3D content while better preserving the outline shape. To assess the applicability in practical scenarios, we contribute a new dataset modeled on real-world e-commerce environments. Extensive quantitative and qualitative experiments on this dataset and open benchmarks demonstrate the effectiveness of MPOD123 over state-of-the-art methods.
-
Human pose forecasting garners attention for its diverse applications. However, challenges in modeling the multi-modal nature of human motion and intricate interactions among agents persist, particularly with longer timescales and more agents. In this paper, we propose an interaction-aware, trajectory-conditioned, long-term multi-agent human pose forecasting model, utilizing a coarse-to-fine prediction approach: multi-modal global trajectories are initially forecasted, followed by respective local pose forecasts conditioned on each mode. In doing so, our Trajectory2Pose model introduces a graph-based agent-wise interaction module for a reciprocal forecast of local motion-conditioned global trajectory and trajectory-conditioned local pose. Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions, improving performance in complex environments. Furthermore, we address the lack of long-term (6s+) multi-agent (5+) datasets by constructing a new dataset from real-world images and 2D annotations, enabling a comprehensive evaluation of our proposed model. State-of-the-art prediction performance on both complex and simpler datasets confirms the generalized effectiveness of our method. The code is available at https://github.com/Jaewoo97/T2P.
-
Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) the autonomous agent grounds its self-location based on a language prompt, and (2) the agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into a sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering. Project page is available at https://yunzeman.github.io/situation3d.
-
Three-dimensional object detection is one of the key tasks in autonomous driving. To reduce costs in practice, low-cost multi-view cameras for 3D object detection are proposed to replace the expensive LiDAR sensors. However, it is difficult to achieve highly accurate and robust 3D object detection relying solely on cameras. An effective solution to this issue is combining multi-view cameras with the economical millimeter-wave radar sensor to achieve more reliable multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a radar-camera fusion 3D object detection method in the bird's eye view (BEV). Specifically, we first design RadarBEVNet for radar BEV feature extraction. RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section (RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based encoder and a transformer-based encoder are proposed to extract radar features, with an injection and extraction module to facilitate communication between the two encoders. The RCS-aware BEV encoder uses RCS as an object-size prior when scattering the point features in BEV. In addition, we present the Cross-Attention Multi-layer Fusion module to automatically align the multi-modal BEV features from radar and camera with the deformable attention mechanism, and then fuse the features with channel and spatial fusion layers. Experimental results show that RCBEVDet achieves new state-of-the-art radar-camera fusion results on the nuScenes and View-of-Delft (VoD) 3D object detection benchmarks. Furthermore, RCBEVDet achieves better 3D detection results than all real-time camera-only and radar-camera 3D object detectors, with a faster inference speed of 21 to 28 FPS. The source code will be released at https://github.com/VDIGPKU/RCBEVDet.
-
Even the best current algorithms for estimating 3D body shape and pose yield results that include body self-intersections. In this paper, we present CLOAF, which exploits the diffeomorphic nature of Ordinary Differential Equations to eliminate such self-intersections while still imposing body shape constraints. We show that, unlike earlier approaches to addressing this issue, ours completely eliminates the self-intersections without compromising the accuracy of the reconstructions. Being differentiable, CLOAF can be used to fine-tune pose and shape estimation baselines to improve their overall performance and eliminate self-intersections in their predictions. Furthermore, we demonstrate how our CLOAF strategy can be applied to practically any motion field induced by the user. CLOAF also makes it possible to edit motion to interact with the environment without worrying about potential collision or loss of body-shape prior.
-
Non-isometric shape correspondence remains a fundamental challenge in computer vision. Traditional methods using Laplace-Beltrami operator (LBO) eigenmodes face limitations in characterizing high-frequency extrinsic shape changes like bending and creases. We propose a novel approach of combining the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell Hessian with the intrinsic ones of the LBO, creating a hybrid spectral space in which we construct functional maps. To this end, we present a theoretical framework to effectively integrate non-orthogonal basis functions into descriptor- and learning-based functional map methods. Our approach can be incorporated easily into existing functional map pipelines across varying applications and is able to handle complex deformations beyond isometries. We show extensive evaluations across various supervised and unsupervised settings and demonstrate significant improvements. Notably, our approach achieves up to 15% better mean geodesic error for non-isometric correspondence settings and up to 45% improvement in scenarios with topological noise.
-
Densely annotating large-scale point clouds is laborious. To alleviate the annotation burden, contrastive learning has attracted increasing attention for tackling semi-supervised 3D semantic segmentation. However, existing point-to-point contrastive learning techniques in the literature are generally sensitive to outliers, resulting in insufficient modeling of the point-wise representations. To address this problem, we propose a method named DDSemi for semi-supervised 3D semantic segmentation, where a density-guided contrastive learning technique is explored. This technique calculates the contrastive loss in a point-to-anchor manner by estimating an anchor for each class from the memory bank, based on the finding that the cluster centers tend to be located in dense regions. In this technique, an inter-contrast loss is derived from the perturbed unlabeled point cloud pairs, while an intra-contrast loss is derived from a single unlabeled point cloud. The derived losses can enhance the discriminability of the features and implicitly constrain the semantic consistency between the perturbed unlabeled point cloud pairs. In addition, we propose a dual-space hardness sampling strategy to pay more attention to the hard samples located in sparse regions of both the geometric space and feature space by reweighting the point-wise intra-contrast loss. Experimental results on both indoor-scene and outdoor-scene datasets demonstrate that the proposed method outperforms the compared state-of-the-art semi-supervised methods.
-
Softassign is a pivotal method in graph matching and other learning tasks. Many softassign-based algorithms exhibit performance sensitivity to a parameter in the softassign. However, tuning the parameter is challenging and is almost always done empirically. This paper proposes an adaptive softassign method for graph matching by analyzing the relationship between the objective score and the parameter. This method can automatically tune the parameter based on a given error bound to guarantee accuracy. The Hadamard-Equipped Sinkhorn formulas introduced in this study significantly enhance the efficiency and stability of the adaptive softassign. Moreover, these formulas can also be used in optimal transport problems. The resulting adaptive softassign graph matching algorithm enjoys significantly higher accuracy than previous state-of-the-art large graph matching algorithms while maintaining comparable efficiency.
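For readers unfamiliar with the baseline, the sketch below shows the classical softassign step: exponentiate a score matrix scaled by a parameter `beta`, then alternate row and column normalisation (Sinkhorn iterations). It is not the adaptive method or the Hadamard-Equipped Sinkhorn formulas from the paper; the function name, the square-matrix assumption, and the fixed iteration count are illustrative choices.

```python
import math


def softassign(scores, beta, n_iters=50):
    """Classical softassign on a square score matrix (list of lists).

    beta is the sensitivity parameter mentioned in the abstract: larger
    values push the result towards a hard permutation matrix.
    """
    n = len(scores)
    # Subtract the global maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in scores)
    S = [[math.exp(beta * (v - peak)) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row normalisation: each row sums to 1.
        S = [[v / sum(row) for v in row] for row in S]
        # Column normalisation: each column sums to 1.
        col_sums = [sum(S[i][j] for i in range(n)) for j in range(n)]
        S = [[S[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return S


soft = softassign([[1.0, 0.0], [0.0, 1.0]], beta=10.0)
print(soft[0][0])  # close to 1: the diagonal assignment is strongly preferred
```

Running the same input with a small `beta` (for example 0.1) gives a nearly uniform matrix, which is the sensitivity to the parameter that motivates tuning it automatically.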
-
The unauthorized use of personal data for commercial purposes and the covert acquisition of private data for training machine learning models continue to raise concerns. To address these issues, researchers have proposed availability attacks that aim to render data unexploitable. However, many availability attack methods can be easily disrupted by adversarial training. Although some robust methods can resist adversarial training, their protective effects are limited. In this paper, we re-examine the existing availability attack methods and propose a novel two-stage min-max-min optimization paradigm to generate robust unlearnable noise. The inner min stage is utilized to generate unlearnable noise, while the outer min-max stage simulates the training process of the poisoned model. Additionally, we formulate the attack effect and use it to constrain the optimization objective. Comprehensive experiments reveal that the noise generated by our method can reduce the test accuracy of adversarially trained poisoned models by up to approximately 30% in comparison to SOTA methods.
-
Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project webpage: https://elasticdiffusion.github.io
-
We present the Locally Adaptive Morphable Model (LAMM), a highly flexible Auto-Encoder (AE) framework for learning to generate and manipulate 3D meshes. We train our architecture following a simple self-supervised training scheme in which input displacements over a set of sparse control vertices are used to overwrite the encoded geometry in order to transform one training sample into another. During inference, our model produces a dense output that adheres locally to the specified sparse geometry while maintaining the overall appearance of the encoded object. This approach results in state-of-the-art performance in both disentangling manipulated geometry and 3D mesh reconstruction. To the best of our knowledge, LAMM is the first end-to-end framework that enables direct local control of 3D vertex geometry in a single forward pass. A very efficient computational graph allows our network to train with only a fraction of the memory required by previous methods and run faster during inference, generating 12k-vertex meshes at >60 fps on a single CPU thread. We further leverage local geometry control as a primitive for higher-level editing operations and present a set of derivative capabilities such as swapping and sampling object parts. Code and pretrained models can be found at https://github.com/michaeltrs/LAMM.
-
Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However NeRF training requires accurate camera pose for each input view typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON) an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate initial guess for poses. Further ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF and high-confidence 3D structure (as encoded by NeRF) to learn poses. We show that ICON without prior pose initialization achieves superior performance in both CO3D and HO3D versus methods which use SfM pose.
-
Fine-grained action analysis in multi-person sports is complex due to athletes' quick movements and intense physical confrontations which result in severe visual obstructions in most scenes. In addition accessible multi-person sports video datasets lack fine-grained action annotations in both space and time adding to the difficulty in fine-grained action analysis. To this end we construct a new multi-person basketball sports video dataset named FineSports which contains fine-grained semantic and spatial-temporal annotations on 10000 NBA game videos covering 52 fine-grained action types 16000 action instances and 123000 spatial-temporal bounding boxes. We also propose a new prompt-driven spatial-temporal action location approach called PoSTAL composed of a prompt-driven target action encoder (PTA) and an action tube-specific detector (ATD) to directly generate target action tubes with fine-grained action types without any off-line proposal generation. Extensive experiments on the FineSports dataset demonstrate that PoSTAL outperforms state-of-the-art methods. Data and code are available at https://github.com/PKU-ICST-MIPL/FineSports_CVPR2024.
-
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task empowering users to freely define their class vocabularies of interest during inference. However our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities posing a concern for real-world deployment. To this end we introduce Semantic Hierarchy Nexus (SHiNe) a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities achieving up to +31.9% mAP50 with ground truth hierarchies while retaining improvements using hierarchies generated by large language models. Moreover when applied to open-vocabulary classification on ImageNet-1k SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector without incurring additional computational overhead during inference. The code is open source.
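The three offline steps listed in this abstract (retrieve related categories, build hierarchy-aware sentences, fuse sentence embeddings) can be illustrated with a small sketch. This is not the authors' implementation: `toy_text_embed` is a hypothetical stand-in for a real text encoder, the sentence template is invented for illustration, and mean fusion is an assumption.

```python
import hashlib
import numpy as np

def toy_text_embed(sentence, dim=32):
    """Hypothetical stand-in for a real text encoder: a deterministic unit vector per sentence."""
    seed = int.from_bytes(hashlib.sha256(sentence.encode("utf-8")).digest()[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)

def nexus_classifier(target, parents, children, embed=toy_text_embed):
    """Build one classifier vector for `target` from a class hierarchy."""
    # Step i: retrieve super- and sub-categories (None marks a missing level).
    supers = parents.get(target, []) or [None]
    subs = children.get(target, []) or [None]
    # Step ii: compose hierarchy-aware sentences, ordered from specific to general.
    sentences = []
    for sup in supers:
        for sub in subs:
            chain = [name for name in (sub, target, sup) if name is not None]
            sentences.append("a " + ", which is a ".join(chain))
    # Step iii: fuse the sentence embeddings (assumed here: mean, then L2-normalize).
    fused = np.stack([embed(s) for s in sentences]).mean(axis=0)
    return fused / np.linalg.norm(fused), sentences

parents = {"dog": ["mammal"]}
children = {"dog": ["beagle", "poodle"]}
vector, sentences = nexus_classifier("dog", parents, children)
```

Because the vector is computed once per class before inference, the detector only swaps its classifier weights, which is consistent with the abstract's claim of no extra inference cost.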
-
Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g. a woman's photo) and a text description (e.g. "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper we propose TI2V-Zero a zero-shot tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image enabling TI2V generation without any optimization fine-tuning or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input we propose a "repeat-and-slide" strategy that modulates the reverse denoising process allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
-
This paper focuses on open-ended video question answering which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task since a question may have multiple answers. However due to annotation costs the labels in existing benchmarks are always extremely insufficient typically one answer per question. As a result existing works tend to directly treat all the unlabeled answers as negative labels leading to limited ability for generalization. In this work we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers which contain rich knowledge about label priority as well as label-associated visual cues thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model we further present two robust and parameter-free ranking distillation approaches: a pairwise approach which introduces adaptive soft margins to dynamically refine the optimization constraints on various pairwise rankings and a listwise approach which adopts sampling-based partial listwise learning to resist the bias in teacher ranking. Extensive experiments on five popular benchmarks consistently show that both our pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.
-
Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene --- should the wheels of an excavator be considered separate or part of the whole? We propose Group Anything with Radiance Fields (GARField) an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects objects and various subparts. GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. Project site: https://www.garfield.studio/
-
Concealed Object Detection (COD) aims to identify objects visually embedded in their background. Existing COD datasets and methods predominantly focus on animals or humans ignoring the agricultural domain which often contains numerous small and concealed crops with severe occlusions. In this paper we introduce Concealed Crop Detection (CCD) which extends classic COD to agricultural domains. Experimental study shows that unimodal data provides insufficient information for CCD. To address this gap we first collect a large-scale RGB-D dataset ACOD-12K containing high-resolution crop images and depth maps. Then we propose a foundational framework named Recurrent Iterative Segmentation Network (RISNet). To tackle the challenge of dense objects we employ multi-scale receptive fields to capture objects of varying sizes thus enhancing the detection performance for dense objects. By fusing depth features our method can acquire spatial information about concealed objects to mitigate disturbances caused by intricate backgrounds and occlusions. Furthermore our model adopts a multi-stage iterative approach using predictions from each stage as gate attention to reinforce position information thereby improving the detection accuracy for small objects. Extensive experimental results demonstrate that our RISNet achieves new state-of-the-art performance on both newly proposed CCD and classic COD tasks. All resources will be available at https://github.com/Kki2Eve/RISNet.
-
Online continual learning suffers from an underfitted solution due to insufficient training for prompt model updates (e.g. single-epoch training). To address the challenge we propose an efficient online continual learning method using the neural collapse phenomenon. In particular we induce neural collapse to form a simplex equiangular tight frame (ETF) structure in the representation space so that the continuously learned model with a single epoch can better fit to the streamed data by proposing preparatory data training and residual correction in the representation space. With an extensive set of empirical validations using CIFAR-10/100 TinyImageNet ImageNet-200 and ImageNet-1K we show that our proposed method outperforms state-of-the-art methods by a noticeable margin in various online continual learning scenarios such as disjoint and Gaussian scheduled continuous (i.e. boundary-free) data setups.
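For readers unfamiliar with the simplex equiangular tight frame (ETF) mentioned here, the following sketch builds one with NumPy using the standard construction from the neural collapse literature. The function name and the requirement `dim >= num_classes` are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF."""
    if num_classes < 2 or dim < num_classes:
        raise ValueError("this sketch assumes 2 <= num_classes <= dim")
    rng = np.random.default_rng(seed)
    # Reduced QR gives a (dim, num_classes) matrix with orthonormal columns.
    basis, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * basis @ centering

etf = simplex_etf(num_classes=10, dim=64)
gram = etf.T @ etf  # ones on the diagonal, -1/(K-1) everywhere else
```

Every column has unit norm and every pair of distinct columns has cosine similarity -1/(K-1), the maximally separated configuration that class means approach under neural collapse.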
-
iToF is a prevalent, cost-effective technology for 3D perception, but its reliance on multiple measurements commonly leads to reduced performance in dynamic environments. Based on an analysis of the physical iToF imaging process, we propose the iToF flow, composed of cross-mode transformation and uni-mode photometric correction, to model the variation of measurements caused by different measurement modes and 3D motion, respectively. We propose a local linear transform (LLT) based cross-mode transfer module (LCTM) for mode-varying and pixel-shift compensation of the cross-mode flow, and a uni-mode photometric correction module (UPCM) for estimating the photometric residual of the uni-mode flow caused by depth-wise motion. We further propose an iToF flow-based depth extraction network that facilitates the estimation of the 4-phase measurements at each individual time, enabling high-framerate and accurate depth estimation. Extensive experiments, including both simulation and real-world experiments, are conducted to demonstrate the effectiveness of the proposed methods. Compared with the SOTA method, our approach reduces the computation time by 75% while improving the performance by 38%. The code and database are available at https://github.com/ComputationalPerceptionLab/iToF_flow.
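As background for the 4-phase measurements mentioned above, this sketch shows the textbook way depth is recovered from four correlation samples of a continuous-wave iToF sensor. It is not the paper's network. It assumes the measurement model I_theta = B + A*cos(phi - theta); sign conventions differ between sensors.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def four_phase_depth(i0, i90, i180, i270, mod_freq_hz):
    """Textbook 4-phase depth, assuming I_theta = B + A*cos(phi - theta)."""
    phase = np.arctan2(i90 - i270, i0 - i180)  # offset B and amplitude A cancel out
    phase = np.mod(phase, 2.0 * np.pi)         # wrap into [0, 2*pi)
    return SPEED_OF_LIGHT * phase / (4.0 * np.pi * mod_freq_hz)

# Simulate a static point at 2.5 m with a 20 MHz modulation frequency.
freq = 20e6
true_depth = 2.5
phi = 4.0 * np.pi * freq * true_depth / SPEED_OF_LIGHT
offsets = np.deg2rad([0.0, 90.0, 180.0, 270.0])
samples = 0.8 + 0.3 * np.cos(phi - offsets)
estimated_depth = four_phase_depth(*samples, mod_freq_hz=freq)
```

The formula presumes all four samples observe the same scene point. If the scene moves between captures, the samples no longer share one phase, which is the dynamic-scene degradation the abstract describes.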
-
Generalized Category Discovery (GCD) aims to identify a mix of known and novel categories within unlabeled data sets, providing a more realistic setting for image recognition. Essentially, GCD needs to remember existing patterns thoroughly to recognize novel categories. The recent state-of-the-art method SimGCD transfers the knowledge from known-class data to the learning of novel classes through debiased learning. However, some patterns are catastrophically forgotten during adaptation, which leads to poor performance in novel category classification. To address this issue, we propose a novel learning approach, LegoGCD, which is seamlessly integrated into previous methods to enhance the discrimination of novel classes while maintaining performance on previously encountered known classes. Specifically, we design two techniques, termed Local Entropy Regularization (LER) and Dual-views Kullback-Leibler divergence constraint (DKL). LER optimizes the distribution of potential known-class samples in unlabeled data, thus ensuring the preservation of knowledge related to known categories while learning novel classes. Meanwhile, DKL introduces a Kullback-Leibler divergence term to encourage the model to produce similar prediction distributions for two views of the same image. In this way, it avoids mismatched predictions and generates more reliable potential known-class samples. Extensive experiments validate that the proposed LegoGCD effectively addresses the known-category forgetting issue across all datasets, e.g., delivering 7.74% and 2.51% accuracy boosts on known and novel classes in CUB, respectively. Our code is available at: https://github.com/Cliffia123/LegoGCD.
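The idea of a Kullback-Leibler constraint between the predictions for two views of the same image can be sketched as below. This is a generic consistency term, not the LegoGCD code: the KL direction, the batch averaging and the function names are assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def dual_view_kl(logits_view_a, logits_view_b, eps=1e-12):
    """Mean KL(p_a || p_b) over a batch of two-view predictions (direction is an assumption)."""
    p_a = softmax(logits_view_a)
    p_b = softmax(logits_view_b)
    per_sample = np.sum(p_a * (np.log(p_a + eps) - np.log(p_b + eps)), axis=-1)
    return float(per_sample.mean())

view_a = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
view_b = np.array([[1.5, 0.7, -0.5], [3.0, 0.2, 0.1]])
agreement_loss = dual_view_kl(view_a, view_a)     # identical predictions
disagreement_loss = dual_view_kl(view_a, view_b)  # second sample disagrees strongly
```

The term is zero when both views yield the same distribution and grows as they diverge, so minimizing it pushes the model toward consistent predictions across augmentations.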
-
4D medical images which represent 3D images with temporal information are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However acquiring 4D medical images poses challenges due to factors such as radiation exposure and imaging duration necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Given these circumstances not only is data acquisition challenging but increasing the frame rate for each dataset also proves difficult. To address this challenge this paper proposes a simple yet effective Unsupervised Volumetric Interpolation framework UVI-Net. This framework facilitates temporal interpolation without the need for any intermediate frames distinguishing it from the majority of other existing unsupervised methods. Experiments on benchmark datasets demonstrate significant improvements across diverse evaluation metrics compared to unsupervised and supervised baselines. Remarkably our approach achieves this superior performance even when trained with a dataset as small as one highlighting its exceptional robustness and efficiency in scenarios with sparse supervision. This positions UVI-Net as a compelling alternative for 4D medical imaging particularly in settings where data availability is limited. The source code is available at https://github.com/jungeun122333/UVI-Net.
-
Multi-constraint offline reinforcement learning (RL) promises to learn policies that satisfy both cumulative and state-wise costs from offline datasets. This arrangement provides an effective approach for the widespread application of RL in high-risk scenarios where both cumulative and state-wise costs need to be considered simultaneously. However, previous constrained offline RL algorithms are primarily designed to handle single-constraint problems related to cumulative cost, and they face challenges when addressing multi-constraint tasks that involve both cumulative and state-wise costs. In this work, we propose a novel Primal policy Optimization with Conservative Estimation algorithm (POCE) to address the problem of multi-constraint offline RL. Concretely, we reframe the objective of multi-constraint offline RL by introducing the concept of Maximum Markov Decision Processes (MMDP). Subsequently, we present a primal policy optimization algorithm to confront the multi-constraint problems, which improves the stability and convergence speed of model training. Furthermore, we propose a conditional Bellman operator to estimate cumulative and state-wise Q-values, reducing the extrapolation error caused by out-of-distribution (OOD) actions. Finally, extensive experiments demonstrate that the POCE algorithm achieves competitive performance across multiple experimental tasks, particularly outperforming baseline algorithms in terms of safety. Our code is available at https://github.com/guanjiayi/poce.
-
Learning 3D models of all animals in nature requires massively scaling up existing solutions. With this ultimate goal in mind we develop 3D-Fauna an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly. One crucial bottleneck of modeling animals is the limited availability of training data which we overcome by learning our model from 2D Internet images. We show that prior approaches which are category-specific fail to generalize to rare species with limited training images. We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM) which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model we also contribute a new large-scale dataset of diverse animal species. At inference time given a single image of any quadruped animal our model reconstructs an articulated 3D mesh in a feed-forward manner in seconds.
-
The main function of depth completion is to compensate for an insufficient and unpredictable number of sparse depth measurements of hardware sensors. However, existing research on depth completion assumes that the sparsity --- the number of points or LiDAR lines --- is fixed for training and testing. Hence, the completion performance drops severely when the number of sparse depths changes significantly. To address this issue, we propose the sparsity-adaptive depth refinement (SDR) framework, which refines monocular depth estimates using sparse depth points. For SDR, we propose the masked spatial propagation network (MSPN) to perform SDR with a varying number of sparse depths effectively by gradually propagating sparse depth information throughout the entire depth map. Experimental results demonstrate that MSPN achieves state-of-the-art performance on both SDR and conventional depth completion scenarios.
-
Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task --- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving complex reasoning and world knowledge. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code, models, and data are available at github.com/dvlab-research/LISA.
-
Portrait harmonization aims to composite a subject into a new background adjusting its lighting and color to ensure harmony with the background scene. Existing harmonization techniques often only focus on adjusting the global color and brightness of the foreground and ignore crucial illumination cues from the background such as apparent lighting direction leading to unrealistic compositions. We introduce Relightful Harmonization a lighting-aware diffusion model designed to seamlessly harmonize sophisticated lighting effect for the foreground portrait using any background image. Our approach unfolds in three stages. First we introduce a lighting representation module that allows our diffusion model to encode lighting information from target image background. Second we introduce an alignment network that aligns lighting features learned from image background with lighting features learned from panorama environment maps which is a complete representation for scene illumination. Last to further boost the photorealism of the proposed method we introduce a novel data simulation pipeline that generates synthetic training pairs from a diverse range of natural images which are used to refine the model. Our method outperforms existing benchmarks in visual fidelity and lighting coherence showing superior generalization in real-world testing scenarios highlighting its versatility and practicality.
-
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architecture. However we observe that the emphasis of MR and HD differs with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialty of two tasks. To tackle the issue we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra and inter-modality across multi-granularity UVCOM achieves the comprehensive understanding in processing a video. Moreover we present multi-aspect contrastive learning to consolidate the local relation modeling and global knowledge accumulation via well aligned multi-modal space. Extensive experiments on QVHighlights Charades-STA TACoS YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which outperforms the state-of-the-art methods by a remarkable margin.
-
Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring the users' preferences. Their inability to interact with users for further refinements or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video, along with optional information including previously suggested music and the user's preference, as input and retrieves appropriate music matching the context. The reasoning module, equipped with the power of a Large Language Model (Vicuna-7B) and extended to multi-modal inputs, is able to provide a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build a large-scale dataset for conversational music recommendation for videos that simulates a two-turn interaction between a user and a recommender based on accurate music track information. Experimental results show that MuseChat achieves significant improvements over existing video-based music retrieval methods and offers strong interpretability and interactability. The dataset of this work is available at https://dongzhikang.github.io/musechat.
-
Neural Radiance Fields (NeRFs) have shown great potential in novel view synthesis. However they struggle to render sharp images when the data used for training is affected by motion blur. On the other hand event cameras excel in dynamic scenes as they measure brightness changes with microsecond resolution and are thus only marginally affected by blur. Recent methods attempt to enhance NeRF reconstructions under camera motion by fusing frames and events. However they face challenges in recovering accurate color content or constrain the NeRF to a set of predefined camera poses harming reconstruction quality in challenging conditions. This paper proposes a novel formulation addressing these issues by leveraging both model- and learning-based modules. We explicitly model the blur formation process exploiting the event double integral as an additional model-based prior. Additionally we model the event-pixel response using an end-to-end learnable response function allowing our method to adapt to non-idealities in the real event-camera sensor. We show on synthetic and real data that the proposed approach outperforms existing deblur NeRFs that use only frames as well as those that combine frames and events by +6.13dB and +2.48dB respectively.
-
We present Compound Conditioned ControlNet C3Net a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g. image text audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically C3Net first aligns the conditions from multi-modalities to the same semantic latent space using modality-specific encoders based on contrastive training. Then it generates multimodal outputs based on the aligned latent space whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. Correspondingly with this system design our model offers an improved solution for joint-modality generation through learning and explaining multimodal conditions involving more than just linear interpolation within the latent space. Meanwhile as we align conditions to a unified latent space C3Net only requires one trainable Control C3-UNet to work on multimodal semantic information. Furthermore our model employs unimodal pretraining on the condition alignment stage outperforming the non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with the first and contemporary state-of-the-art multimodal generation. Our codes and tri-modal dataset will be released.
-
Neural network pruning particularly channel pruning is a widely used technique for compressing deep learning models to enable their deployment on edge devices with limited resources. Typically redundant weights or structures are removed to achieve the target resource budget. Although data-driven pruning approaches have proven to be more effective they cannot be directly applied to federated learning (FL) which has emerged as a popular technique in edge computing applications because of distributed and confidential datasets. In response to this challenge we design a new network pruning method for FL. We propose device-wise sub-networks for each device assuming that the data distribution is similar within each device. These sub-networks are generated through sub-network embeddings and a hypernetwork. To further minimize memory usage and communication costs we permanently prune the full model to remove weights that are not useful for all devices. During the FL process we simultaneously train the device-wise sub-networks and the base sub-network to facilitate the pruning process. We then finetune the pruned model with device-wise sub-networks to regain performance. Moreover we provided the theoretical guarantee of convergence for our method. Our method achieves better performance and resource trade-off than other well-established network pruning baselines as demonstrated through extensive experiments on CIFAR-10 CIFAR-100 and TinyImageNet.
-
Few-shot segmentation performance declines substantially when facing images from a domain different than the training domain effectively limiting real-world use cases. To alleviate this recently cross-domain few-shot segmentation (CD-FSS) has emerged. Works that address this task mainly attempted to learn segmentation on a source domain in a manner that generalizes across domains. Surprisingly we can outperform these approaches while eliminating the training stage and removing their main segmentation network. We show test-time task-adaption is the key for successful CD-FSS instead. Task-adaption is achieved by appending small networks to the feature pyramid of a conventionally classification-pretrained backbone. To avoid overfitting to the few labeled samples in supervised fine-tuning consistency across augmented views of input images serves as guidance while learning the parameters of the attached layers. Despite our self-restriction not to use any images other than the few labeled samples at test time we achieve new state-of-the-art performance in CD-FSS evidencing the need to rethink approaches for the task. Code is available at https://github.com/Vision-Kek/ABCDFSS.
-
We address the problem of regressing 3D human pose and shape from a single image with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints leading to robust performance. With such methods however we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss "Threshold-Adaptive Loss Scaling" (TALS) that penalizes gross 2D and p-GT errors but not smaller ones. With such a loss there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses effectively improving robustness to occlusion. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de.
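The idea of penalizing gross errors while leaving small ones alone can be illustrated with a simple thresholded loss. This is a simplified hinge-style stand-in, not the TALS formulation from the paper: the function name, the squared penalty and the single fixed threshold are all assumptions of this sketch.

```python
import numpy as np

def thresholded_keypoint_loss(pred, target, threshold):
    """Zero loss for errors within `threshold`; squared excess beyond it (illustrative only)."""
    error = np.linalg.norm(pred - target, axis=-1)  # per-keypoint Euclidean error
    excess = np.maximum(error - threshold, 0.0)     # ignore errors inside the threshold
    return float(np.mean(excess ** 2))

target = np.zeros((2, 2))
pred = np.array([[0.5, 0.0],   # error 0.5: inside the threshold, contributes nothing
                 [3.0, 4.0]])  # error 5.0: gross error, contributes (5 - 1)^2
loss = thresholded_keypoint_loss(pred, target, threshold=1.0)
```

Inside the threshold the gradient is zero, so the optimizer is not driven to fit 2D keypoints or pseudo-ground-truth more tightly than the approximate camera model justifies. That leaves the remaining pose ambiguity to a prior, as the abstract goes on to discuss.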
-
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However through a simple and effective baseline we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus unlike traditional single-stage planning methods we propose a multi-stage system consisting of an event parser a grounding stage and a final reasoning stage in conjunction with an external memory. All stages are training-free and performed using few-shot prompting of large models creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity our method MoReVQA improves over prior work on standard videoQA benchmarks (NExT-QA iVQA EgoSchema and ActivityNet-QA) with state-of-the-art results and extensions to related tasks (grounded videoQA paragraph captioning).
-
Parameter-efficient fine-tuning for pre-trained Vision Transformers aims to adeptly tailor a model to downstream tasks by learning a minimal set of new adaptation parameters while preserving the frozen majority of pre-trained parameters. Striking a balance between retaining the generalizable representation capacity of the pre-trained model and acquiring task-specific features poses a key challenge. Currently there is a lack of focus on guiding this delicate trade-off. In this study we approach the problem from the perspective of Singular Value Decomposition (SVD) of pre-trained parameter matrices providing insights into the tuning dynamics of existing methods. Building upon this understanding we propose a Residual-based Low-Rank Rescaling (RLRR) fine-tuning strategy. This strategy not only enhances flexibility in parameter tuning but also ensures that new parameters do not deviate excessively from the pre-trained model through a residual design. Extensive experiments demonstrate that our method achieves competitive performance across various downstream image classification tasks all while maintaining comparable new parameters. We believe this work takes a step forward in offering a unified perspective for interpreting existing methods and serves as motivation for the development of new approaches that move closer to effectively considering the crucial trade-off mentioned above. Our code is available at https://github.com/zstarN70/RLRR.git.
-
We propose FaceCom a method for 3D facial shape completion which delivers high-fidelity results for incomplete facial inputs of arbitrary forms. Unlike end-to-end shape completion methods based on point clouds or voxels our approach relies on a mesh-based generative network that is easy to optimize enabling it to handle shape completion for irregular facial scans. We first train a shape generator on a mixed 3D facial dataset containing 2405 identities. Based on the incomplete facial input we fit complete faces using an optimization approach under image inpainting guidance. The completion results are refined through a post-processing step. FaceCom demonstrates the ability to effectively and naturally complete facial scan data with varying missing regions and degrees of missing areas. Our method can be used in medical prosthetic fabrication and the registration of deficient scanning data. Our experimental results demonstrate that FaceCom achieves exceptional performance in fitting and shape completion tasks.
-
Lifelong person re-identification (LReID) suffers from the catastrophic forgetting problem when learning from non-stationary data. Existing exemplar-based and knowledge distillation-based LReID methods encounter data privacy and limited acquisition capacity respectively. In this paper we instead introduce the prototype which is under-investigated in LReID to better balance knowledge forgetting and acquisition. Existing prototype-based works primarily focus on the classification task where the prototypes are set as discrete points or statistical distributions. However they either discard the distribution information or omit instance-level diversity which are crucial fine-grained clues for LReID. To address the above problems we propose Distribution-aware Knowledge Prototyping (DKP) where the instance-level diversity of each sample is modeled to transfer comprehensive fine-grained knowledge for prototyping and facilitating LReID learning. Specifically an Instance-level Distribution Modeling network is proposed to capture the local diversity of each instance. Then the Distribution-oriented Prototype Generation algorithm transforms the instance-level diversity into identity-level distributions as prototypes which is further explored by the designed Prototype-based Knowledge Transfer module to enhance the knowledge anti-forgetting and acquisition capacity of the LReID model. Extensive experiments verify that our method achieves superior plasticity and stability balancing and outperforms existing LReID methods by 8.1%/9.1% average mAP/R@1 improvement. The code is available at https://github.com/zhoujiahuan1991/CVPR2024-DKP
-
We present a lightweight solution for estimating spatially-coherent indoor lighting from a single RGB image. Previous methods for estimating illumination using volumetric representations have overlooked the sparse distribution of light sources in space necessitating substantial memory and computational resources for achieving high-quality results. We introduce a unified voxel octree-based illumination estimation framework to produce 3D spatially-coherent lighting. Additionally a differentiable voxel octree cone tracing rendering layer is proposed to eliminate regular volumetric representation throughout the entire process and ensure the retention of features across different frequency domains. This reduction significantly decreases spatial usage and required floating-point operations without substantially compromising precision. Experimental results demonstrate that our approach achieves high-quality coherent estimation with minimal cost compared to previous methods.
-
The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful but requires good positive and negative samples. However the free-form nature and the open vocabulary of object descriptions make the space of negatives extremely large. Prior works randomly sample negatives or use rule-based techniques to build them. In contrast we propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data. Specifically we use large-language-models to generate negative text descriptions and text-to-image diffusion models to also generate corresponding negative images. Our experimental analysis confirms the relevance of the generated negative data and its use in language-based detectors improves performance on two complex benchmarks. Code is available at https://github.com/xiaofeng94/Gen-Enhanced-Negs.
-
In precision agriculture the detection and recognition of insects play an essential role in the ability of crops to grow healthy and produce a high-quality yield. The current machine vision model requires a large volume of data to achieve high performance. However there are approximately 5.5 million different insect species in the world. None of the existing insect datasets can cover even a fraction of them due to varying geographic locations and acquisition costs. In this paper we introduce a novel "Insect-1M" dataset a game-changing resource poised to revolutionize insect-related foundation model training. Covering a vast spectrum of insect species our dataset including 1 million images with dense identification labels of taxonomy hierarchy and insect descriptions offers a panoramic view of entomology enabling foundation models to comprehend visual and semantic information about insects like never before. Then to efficiently establish an Insect Foundation Model we develop a micro-feature self-supervised learning method with a Patch-wise Relevant Attention mechanism capable of discerning the subtle differences among insect images. In addition we introduce Description Consistency loss to improve micro-feature modeling via insect descriptions. Through our experiments we illustrate the effectiveness of our proposed approach in insect modeling and achieve State-of-the-Art performance on standard benchmarks of insect-related tasks. Our Insect Foundation Model and Dataset promise to empower the next generation of insect-related vision models bringing them closer to the ultimate goal of precision agriculture.
-
The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment we achieve competitive performance - and in certain cases outperform state-of-the-art methods - in both image-text and audio-text retrieval with orders of magnitude less compute and data: for example we outperform CLIP on the Flickr30K text-to-image retrieval task with ~600x fewer GPU days and ~80x fewer image-text pairs. Additionally we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.
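A minimal sketch of augmentation in latent space, assuming a mixup-style interpolation that applies one shared coefficient to both modalities so that a mixed image latent stays paired with its mixed text latent. The names `mix` and `fusemix_pair`, the Beta parameter `alpha` and the toy vectors are illustrative assumptions and are not taken from the released code.

```python
import random


def mix(u, v, lam):
    """Convex combination lam*u + (1-lam)*v of two equal-length vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]


def fusemix_pair(img_i, txt_i, img_j, txt_j, alpha=1.0, rng=random):
    """Mix two (image latent, text latent) pairs with ONE shared coefficient.

    Assumption: lam ~ Beta(alpha, alpha), as in standard mixup.
    """
    lam = rng.betavariate(alpha, alpha)
    return mix(img_i, img_j, lam), mix(txt_i, txt_j, lam), lam


# Toy latents standing in for the outputs of frozen unimodal encoders.
img_a, txt_a = [1.0, 0.0], [0.0, 2.0, 0.0]
img_b, txt_b = [0.0, 1.0], [4.0, 0.0, 0.0]

mixed_img, mixed_txt, lam = fusemix_pair(
    img_a, txt_a, img_b, txt_b, rng=random.Random(0)
)
print(lam, mixed_img, mixed_txt)
```

Because the same coefficient is used on both sides the mixed pair can be fed to a contrastive alignment loss as if it were a real paired sample, and the encoders only need to be run once to cache the latents.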
-
Standard federated learning approaches suffer when client data distributions have sufficient heterogeneity. Recent methods addressed the client data heterogeneity issue via personalized federated learning (PFL) - a class of FL algorithms aiming to personalize learned global knowledge to better suit the clients' local data distributions. Existing PFL methods usually decouple global updates in deep neural networks by performing personalization on particular layers (i.e. classifier heads) and global aggregation for the rest of the network. However preselecting network layers for personalization may result in suboptimal storage of global knowledge. In this work we propose FedSelect a novel PFL algorithm inspired by the iterative subnetwork discovery procedure used for the Lottery Ticket Hypothesis. FedSelect incrementally expands subnetworks to personalize client parameters concurrently conducting global aggregations on the remaining parameters. This approach enables the personalization of both client parameters and subnetwork structure during the training process. Finally we show that FedSelect outperforms recent state-of-the-art PFL algorithms under challenging client data heterogeneity settings and demonstrates robustness to various real-world distributional shifts.
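A rough sketch of the split between personalized and globally aggregated parameters, assuming each client holds a binary mask over a flat parameter vector (1 = personalized and kept local, 0 = shared and averaged). The function names, the mask layout and the plain averaging rule are hypothetical simplifications. How FedSelect grows the masks during training is not shown here.

```python
def aggregate_global(params, masks):
    """Average each coordinate over the clients that treat it as shared."""
    n = len(params[0])
    avg = []
    for k in range(n):
        shared = [p[k] for p, m in zip(params, masks) if m[k] == 0]
        # None marks a coordinate that every client has personalized.
        avg.append(sum(shared) / len(shared) if shared else None)
    return avg


def apply_update(p, m, avg):
    """Overwrite shared coordinates with the average and keep personalized ones."""
    return [
        p[k] if m[k] == 1 or avg[k] is None else avg[k]
        for k in range(len(p))
    ]


# Two clients with three parameters each and different personalized subnetworks.
params = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
masks = [[0, 1, 0], [0, 0, 1]]

avg = aggregate_global(params, masks)
updated = [apply_update(p, m, avg) for p, m in zip(params, masks)]
print(avg, updated)
```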
-
3D facial landmark localization has proven to be of particular use for applications such as face tracking 3D face modeling and image-based 3D face reconstruction. In the supervised learning case such methods usually rely on 3D landmark datasets derived from 3DMM-based registration that often lack spatial definition alignment as compared with that chosen by hand-labeled human consensus e.g. how are eyebrow landmarks defined? This creates a gap between landmark datasets generated via high-quality 2D human labels and 3DMMs and it ultimately limits their effectiveness. To address this issue we introduce a novel semi-supervised learning approach that learns 3D landmarks by directly lifting (visible) hand-labeled 2D landmarks and ensures better definition alignment without the need for 3D landmark datasets. To lift 2D landmarks to 3D we leverage 3D-aware GANs for better multi-view consistency learning and in-the-wild multi-frame videos for robust cross-generalization. Empirical experiments demonstrate that our method not only achieves better definition alignment between 2D-3D landmarks but also outperforms other supervised learning 3D landmark localization methods on both 3DMM labeled and photogrammetric ground truth evaluation datasets. Project Page: https://davidcferman.github.io/FaceLift
-
Image-level Weakly Supervised Semantic Segmentation (WSSS) has received increasing attention due to its low annotation cost. Class Activation Mapping (CAM) generated through classifier weights in WSSS inevitably ignores certain useful cues while the CAM generated through class prototypes can alleviate that. However because of the different goals of image classification and semantic segmentation the class prototypes still focus on activating primary discriminative pixels learned from classification loss leading to incomplete CAM. In this paper we propose a plug-and-play Prototype-based Secondary Discriminative Pixels Mining (PSDPM) framework for enabling class prototypes to activate more secondary discriminative pixels thus generating a more complete CAM. Specifically we introduce a Foreground Pixel Estimation Module (FPEM) for estimating potential foreground pixels based on the correlations between primary and secondary discriminative pixels and the semantic segmentation results of baseline methods. Then we enable the WSSS model to learn discriminative features from secondary discriminative pixels through a consistency loss calculated between the FPEM result and the class-prototype CAM. Experimental results show that our PSDPM improves various baseline methods significantly and achieves new state-of-the-art performance on WSSS benchmarks. Code is available at https://github.com/xinqiaozhao/PSDPM.
-
How to effectively explore multi-scale representations of rain streaks is important for image deraining. In contrast to existing Transformer-based methods that depend mostly on single-scale rain appearance we develop an end-to-end multi-scale Transformer that leverages the potentially useful features in various scales to facilitate high-quality image reconstruction. To better explore the common degradation representations from spatially-varying rain streaks we incorporate intra-scale implicit neural representations based on pixel coordinates with the degraded inputs in a closed-loop design enabling the learned features to facilitate rain removal and improve the robustness of the model in complex scenarios. To ensure richer collaborative representation from different scales we embed a simple yet effective inter-scale bidirectional feedback operation into our multi-scale Transformer by performing coarse-to-fine and fine-to-coarse information communication. Extensive experiments demonstrate that our approach named as NeRD-Rain performs favorably against the state-of-the-art ones on both synthetic and real-world benchmark datasets. The source code and trained models are available at https://github.com/cschenxiang/NeRD-Rain.
-
Weakly supervised semantic segmentation has witnessed great achievements with image-level labels. Several recent approaches use the CLIP model to generate pseudo labels for training an individual segmentation model while there is no attempt to apply the CLIP model as the backbone to directly segment objects with image-level labels. In this paper we propose WeCLIP a CLIP-based single-stage pipeline for weakly supervised semantic segmentation. Specifically the frozen CLIP model is applied as the backbone for semantic feature extraction and a new decoder is designed to interpret extracted semantic features for final prediction. Meanwhile we utilize the above frozen backbone to generate pseudo labels for training the decoder. Such labels cannot be optimized during training. We then propose a refinement module (RFM) to rectify them dynamically. Our architecture enforces the proposed decoder and RFM to benefit from each other to boost the final performance. Extensive experiments show that our approach significantly outperforms other approaches with less training cost. Additionally our WeCLIP also obtains promising results for fully supervised settings. The code is available at https://github.com/zbf1991/WeCLIP.
-
Personalized Federated Learning (PFL) is primarily designed to provide customized models for each client to better fit the non-iid distributed client data which is an inherent challenge in Federated Learning. However current PFL methods suffer from inconsistencies at both the intra-client and inter-client levels: 1) The intra-client inconsistency stems from the asynchronous update strategy for personalized and shared parameters. In PFL clients update their shared parameters to communicate and learn from others while keeping personalized parts unchanged leading to poor coordination between these two components. 2) The inter-client inconsistency arises from "stragglers" - inactive clients that communicate and train with the server less frequently. This results in their under-trained personalized models and impedes the collaborative training stage for other clients. In this paper we present a novel PFL framework named FedAS which uses Federated Parameter-Alignment and Client-Synchronization to overcome the above challenges. Initially we enhance the localization of global parameters by infusing them with local insights. We make the shared parts learn from the previous model thereby increasing their local relevance and reducing the impact of parameter inconsistency. Furthermore we design a robust aggregation method to mitigate the impact of stragglers by preventing the incorporation of their under-trained knowledge into the aggregated model. Experimental results on Cifar10 and Cifar100 validate the effectiveness of our FedAS in achieving better performance and robustness against data heterogeneity.
-
In this work we focus on learning facial representations that can be adapted to train effective face recognition models particularly in the absence of labels. Firstly compared with existing labelled face datasets a vastly larger number of unlabeled faces exists in the real world. We explore a learning strategy for these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover motivated by the recent finding that the face saliency area is critical for face recognition we utilize patches localized by extracted facial landmarks rather than randomly cropped image blocks when constructing augmentations in pretraining. This enables our method - namely Landmark-based Facial Self-supervised learning (LAFS) - to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With learned landmark-based facial representations we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks especially on more challenging few-shot scenarios. The code is available at https://github.com/szlbiubiubiu/LAFS_CVPR2024
-
Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories. Most existing methods explore utilizing pre-trained vision-language models in which the key is to adopt the image-level model for the pixel-level segmentation task. In this paper we propose a simple encoder-decoder named SED for open-vocabulary semantic segmentation which comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs a hierarchical backbone instead of a plain transformer to predict the pixel-level image-text cost map. Compared to a plain transformer a hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine the cost map and the feature maps of different backbone levels for segmentation. To accelerate inference speed we introduce a category early rejection scheme in the decoder that rejects many non-existing categories at the early layers of the decoder resulting in at most 4.7 times acceleration without accuracy degradation. Experiments are performed on multiple open-vocabulary semantic segmentation datasets which demonstrate the efficacy of our SED method. When using ConvNeXt-B our SED method achieves an mIoU score of 31.6% on ADE20K with 150 categories at 82 milliseconds (ms) per image on a single A6000. Our source code is available at https://github.com/xb534/SED.
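A minimal sketch of the idea behind category early rejection, assuming a category is kept only if it ranks among the top `k` scores for at least one pixel of the cost map. The selection rule, the function name `keep_categories` and the toy scores are assumptions for illustration. The abstract does not specify the exact criterion.

```python
def keep_categories(cost_map, k):
    """Return the sorted category indices that survive rejection.

    cost_map: list of per-pixel score lists, one score per category.
    A category survives if it is in the top-k for at least one pixel.
    """
    kept = set()
    for scores in cost_map:
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        kept.update(ranked[:k])
    return sorted(kept)


# Three pixels scored against four candidate categories.
cost_map = [
    [0.90, 0.10, 0.00, 0.20],
    [0.80, 0.05, 0.10, 0.30],
    [0.10, 0.20, 0.00, 0.70],
]

survivors = keep_categories(cost_map, k=1)
print(survivors)  # categories 1 and 2 are never top-1 and are dropped
```

Later decoder layers then only process the surviving columns of the cost map, which is where the speedup would come from when the vocabulary is large and most categories are absent from the image.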
-
State-of-the-art man-made shape generative models usually adopt established generative models under a suitable implicit shape representation. A common theme is to perform distribution alignment which does not explicitly model important shape priors. As a result many synthetic shapes are not connected. Other synthetic shapes present problems of physical stability and geometric feasibility. This paper introduces a novel latent diffusion shape-generative model regularized by a quality checker that outputs a score of a latent code. The scoring function employs a learned function that provides a geometric feasibility score and a deterministic procedure to quantify a physical stability score. The key to our approach is a new diffusion procedure that combines the discrete empirical data distribution and a continuous distribution induced by the quality checker. We introduce a principled approach to determine the tradeoff parameters for learning the denoising network at different noise levels. Experimental results show that our approach outperforms state-of-the-art shape generations quantitatively and qualitatively on ShapeNet-v2.
-
Existing quality enhancement methods for compressed images focus on aligning the enhancement domain with the raw domain to yield realistic images. However these methods exhibit a pervasive enhancement bias towards the compression domain inadvertently regarding it as more realistic than the raw domain. This bias makes enhanced images closely resemble their compressed counterparts thus degrading their perceptual quality. In this paper we propose a simple yet effective method to mitigate this bias and enhance the quality of compressed images. Our method employs a conditional discriminator with the compressed image as a key condition and then incorporates a domain-divergence regularization to actively distance the enhancement domain from the compression domain. Through this dual strategy our method enables the discrimination against the compression domain and brings the enhancement domain closer to the raw domain. Comprehensive quality evaluations confirm the superiority of our method over other state-of-the-art methods without incurring inference overheads.
-
Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model LangSplat advances the field by utilizing a collection of 3D Gaussians each encoding language features distilled from CLIP to represent the language field. By employing a tile-based splatting technique for rendering language features we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings LangSplat first trains a scene-wise language autoencoder and then learns language features on the scene-specific latent space thereby alleviating substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM thereby eliminating the need for extensively querying the language field across various scales and the regularization of DINO features. Extensive experimental results show that LangSplat outperforms the previous state-of-the-art method LERF by a large margin. Notably LangSplat is extremely efficient achieving a 199x speedup compared to LERF at the resolution of 1440 x 1080. We strongly recommend that readers check out our video results at https://langsplat.github.io/.
-
Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories such as bounding boxes road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world which may render the motion prediction model vulnerable to perception errors (e.g. failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g. poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However this approach suffers from the lack of interpretability and requires significantly more training resources. In this work we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art.
-
Planet-scale image geolocalization remains a challenging problem due to the diversity of images originating from anywhere in the world. Although approaches based on vision transformers have made significant progress in geolocalization accuracy success in prior literature is constrained to narrow distributions of images of landmarks and performance has not generalized to unseen places. We present a new geolocalization system that combines semantic geocell creation multi-task contrastive pretraining and a novel loss function. Additionally our work is the first to perform retrieval over location clusters for guess refinements. We train two models for evaluations on street-level data and general-purpose image geolocalization; the first model PIGEON is trained on data from the game of GeoGuessr and is capable of placing over 40% of its guesses within 25 kilometers of the target location globally. We also develop a bot and deploy PIGEON in a blind experiment against humans ranking in the top 0.01% of players. We further challenge one of the world's foremost professional GeoGuessr players to a series of six matches with millions of viewers winning all six games. Our second model PIGEOTTO differs in that it is trained on a dataset of images from Flickr and Wikipedia achieving state-of-the-art results on a wide range of image geolocalization benchmarks outperforming the previous SOTA by up to 7.7 percentage points on the city accuracy level and up to 38.8 percentage points on the country level. Our findings suggest that PIGEOTTO is the first image geolocalization model that effectively generalizes to unseen places and that our approach can pave the way for highly accurate planet-scale image geolocalization systems. Our code is available on GitHub.
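The "within 25 kilometers" figure is a distance-threshold accuracy. A small sketch of how such a metric can be computed is given here, assuming great-circle (haversine) distance on a spherical Earth of radius 6371 km. The function names and the toy coordinates are illustrative and the paper's own evaluation code may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius under a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def fraction_within(guesses, targets, threshold_km=25.0):
    """Share of guesses whose distance to the matching target is at most the threshold."""
    hits = sum(
        haversine_km(g[0], g[1], t[0], t[1]) <= threshold_km
        for g, t in zip(guesses, targets)
    )
    return hits / len(guesses)


# One guess about 11 km off (0.1 degree of latitude) and one about 111 km off.
targets = [(48.0, 2.0), (48.0, 2.0)]
guesses = [(48.1, 2.0), (49.0, 2.0)]

accuracy = fraction_within(guesses, targets)
print(accuracy)
```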
-
How to effectively utilize the spectral and spatial characteristics of Hyperspectral Images (HSIs) is always a key problem in spectral snapshot reconstruction. Recently the spectra-wise transformer has shown great potential in capturing inter-spectra similarities of HSI but the classic design of the transformer i.e. multi-head division in the spectral (channel) dimension hinders the modeling of global spectral information and results in a mean effect. In addition previous methods adopt the normal spatial priors without taking imaging processes into account and fail to address the unique spatial degradation in snapshot spectral reconstruction. In this paper we analyze the influence of multi-head division and propose a novel Spectral-Spatial Rectification (SSR) method to enhance the utilization of spectral information and mitigate spatial degradation. Specifically SSR includes two core parts: Window-based Spectra-wise Self-Attention (WSSA) and spAtial Rectification Block (ARB). WSSA is proposed to capture global spectral information and account for local differences whereas ARB aims to mitigate the spatial degradation using a spatial alignment strategy. The experimental results on simulation and real scenes demonstrate the effectiveness of the proposed modules and we also provide models at multiple scales to demonstrate the superiority of our approach.
-
We address the challenge of content diversity and controllability in pedestrian simulation for driving scenarios. Recent pedestrian animation frameworks have a significant limitation wherein they primarily focus on either following trajectory or the content of the reference video consequently overlooking the potential diversity of human motion within such scenarios. This limitation restricts the ability to generate pedestrian behaviors that exhibit a wider range of variations and realistic motions and therefore restricts its usage to provide rich motion content for other components in the driving simulation system e.g. suddenly changed motion to which the autonomous vehicle should respond. In our approach we strive to surpass the limitation by showcasing diverse human motions obtained from various sources such as generated human motions in addition to following the given trajectory. The fundamental contribution of our framework lies in combining the motion tracking task with trajectory following which enables the tracking of specific motion parts (e.g. upper body) while simultaneously following the given trajectory by a single policy. This way we significantly enhance both the diversity of simulated human motion within the given scenario and the controllability of the content including language-based control. Our framework facilitates the generation of a wide range of human motions contributing to greater realism and adaptability in pedestrian simulations for driving scenarios.
-
Advancements in neural signed distance fields (SDFs) have enabled modeling 3D surface geometry from a set of 2D images of real-world scenes. Baking neural SDFs can extract an explicit mesh with appearance baked into texture maps as neural features. The baked meshes still have a large memory footprint and require a powerful GPU for real-time rendering. Neural optimization of such large meshes with differentiable rendering poses significant challenges. We propose a method to produce optimized meshes for large unbounded scenes with a low triangle budget and high fidelity of geometry and appearance. We achieve this by combining advancements in baking neural SDFs with classical mesh simplification techniques and proposing a joint appearance-geometry refinement step. The visual quality is comparable to or better than state-of-the-art neural meshing and baking methods with high geometric accuracy despite a significant reduction in triangle count making the produced meshes efficient for storage transmission and rendering on mobile hardware. We validate the effectiveness of the proposed method on large unbounded scenes from mip-NeRF 360 Tanks & Temples and Deep Blending datasets achieving at-par rendering quality with 73x fewer triangles and an 11x reduction in memory footprint.
-
Conditional diffusion models are powerful generative models that can leverage various types of conditional information such as class labels segmentation masks or text captions. However in many real-world scenarios conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper we propose the Coherence-Aware Diffusion (CAD) a novel method to integrate confidence in conditional information into diffusion models allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated confidence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the confidence score. In this way the model learns to ignore or discount the conditioning when the confidence is low. We show that our method is theoretically sound and empirically effective on various conditional generation tasks. Moreover we show that leveraging confidence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low confidence have been discarded.
-
Stereo rectification is widely considered "solved" due to the abundance of traditional approaches to perform rectification. However autonomous vehicles and robots in-the-wild require constant re-calibration due to exposure to various environmental factors including vibration and structural stress when cameras are arranged in a wide-baseline configuration. Conventional rectification methods fail in these challenging scenarios: especially for larger vehicles such as autonomous freight trucks and semi-trucks the resulting incorrect rectification severely affects the quality of downstream tasks that use stereo/multi-view data. To tackle these challenges we propose an online rectification approach that operates at real-time rates while achieving high accuracy. We propose a novel learning-based online calibration approach that utilizes stereo correlation volumes built from a feature representation obtained from cross-image attention. Our model is trained to minimize vertical optical flow as proxy rectification constraint and predicts the relative rotation between the stereo pair. The method is real-time and even outperforms conventional methods used for offline calibration and substantially improves downstream stereo depth post-rectification. We release two public datasets (https://light.princeton.edu/online-stereo-recification/) a synthetic and experimental wide baseline dataset to foster further research.
-
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
Radiance fields have demonstrated impressive performance in synthesizing novel views from sparse input views, yet prevailing methods suffer from high training costs and slow inference speed. This paper introduces DNGaussian, a depth-regularized framework based on 3D Gaussian radiance fields, offering real-time and high-quality few-shot novel view synthesis at low costs. Our motivation stems from the highly efficient representation and surprising quality of the recent 3D Gaussian Splatting, even though it suffers geometry degradation when the number of input views decreases. In the Gaussian radiance fields, we find that this degradation in scene geometry is primarily linked to the positioning of Gaussian primitives and can be mitigated by a depth constraint. Consequently, we propose a Hard and Soft Depth Regularization to restore accurate scene geometry under coarse monocular depth supervision while maintaining a fine-grained color appearance. To further refine detailed geometry reshaping, we introduce Global-Local Depth Normalization, enhancing the focus on small local depth changes. Extensive experiments on the LLFF, DTU, and Blender datasets demonstrate that DNGaussian outperforms state-of-the-art methods, achieving comparable or better results with significantly reduced memory cost, a 25x reduction in training time, and over 3000x faster rendering speed. Code is available at: https://github.com/Fictionarry/DNGaussian
-
Point cloud registration is still a challenging and open problem. For example, when the overlap between two point clouds is extremely low, geo-only features may not be sufficient. Therefore, it is important to further explore how to utilize color data in this task. Under such circumstances, we propose ColorPCR for color point cloud registration with multi-stage geometric-color fusion. We design a Hierarchical Color Enhanced Feature Extraction module to extract multi-level geometric-color features and a GeoColor Superpoint Matching Module to encode transformation-invariant geo-color global context for robust patch correspondences. In this way, both geometric and color data can be used, leading to robust performance even under extremely challenging scenarios such as low overlap between two point clouds. To evaluate the performance of our method, we colorize the 3DMatch/3DLoMatch datasets as Color3DMatch/Color3DLoMatch, and evaluations on these datasets demonstrate the effectiveness of our proposed method. Our method achieves state-of-the-art registration recall of 97.5%/88.9% on them.
-
The spatial non-uniformity and diverse patterns of shadow degradation conflict with the weight-sharing manner of dominant models, which may lead to an unsatisfactory compromise. To tackle this issue, we present a novel strategy from the view of shadow transformation in this paper: directly homogenizing the spatial distribution of shadow degradation. Our key design is the random shuffle operation and its corresponding inverse operation. Specifically, the random shuffle operation stochastically rearranges the pixels across spatial space, and the inverse operation recovers the original order. After random shuffling, the shadow diffuses across the whole image and the degradation appears in a homogenized way, which can be effectively processed by the local self-attention layer. Moreover, we further devise a new feed-forward network with position modeling to exploit image structural information. Based on these elements, we construct the final local-window-based transformer, named HomoFormer, for image shadow removal. Our HomoFormer can enjoy the linear complexity of local transformers while bypassing the challenges of non-uniformity and diversity of shadow. Extensive experiments are conducted to verify the superiority of our HomoFormer across public datasets.
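The shuffle and inverse pair described above can be illustrated with a minimal sketch. This is a toy on a flattened list of pixel values, not the HomoFormer implementation; the function names `random_shuffle` and `inverse_shuffle`, the seed argument, and the toy image are assumptions made for illustration.

```python
import random

def random_shuffle(pixels, seed):
    """Rearrange pixels with a random permutation; return both (hypothetical helper)."""
    rng = random.Random(seed)
    perm = list(range(len(pixels)))
    rng.shuffle(perm)
    # Position new_pos in the output takes the pixel from position perm[new_pos].
    return [pixels[i] for i in perm], perm

def inverse_shuffle(shuffled, perm):
    """Put every pixel back at its original position using the stored permutation."""
    restored = [None] * len(shuffled)
    for new_pos, old_pos in enumerate(perm):
        restored[old_pos] = shuffled[new_pos]
    return restored

# Toy 4x4 "image" flattened row by row: the left half is shadowed (dark values).
pixels = [0.2, 0.2, 0.9, 0.9] * 4
shuffled, perm = random_shuffle(pixels, seed=0)
restored = inverse_shuffle(shuffled, perm)
```

Shuffling scatters the dark pixels over the whole grid, so any local window sees a similar mix of shadowed and lit values, and the stored permutation makes the operation exactly invertible.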
-
Counterfactual reasoning a fundamental aspect of human cognition involves contemplating alternatives to established facts or past events significantly enhancing our abilities in planning and decision-making. In light of the advancements in current multi-modal large language models we explore their effectiveness in counterfactual reasoning. To facilitate this investigation we introduce a novel dataset C-VQA specifically designed to examine the counterfactual reasoning capabilities of modern multi-modal large language models. This dataset is constructed by infusing original questions with counterfactual presuppositions spanning various types such as numerical and boolean queries. It encompasses a mix of real and synthetic data representing a wide range of difficulty levels. Our thorough evaluations of contemporary vision-language models using this dataset have revealed substantial performance drops with some models showing up to a 40% decrease highlighting a significant gap between current models and human-like vision reasoning capabilities. We hope our dataset will serve as a vital benchmark for evaluating the counterfactual reasoning capabilities of models. Code and dataset are publicly available at https://bzhao.me/C-VQA/.
-
Driver's eye gaze holds a wealth of cognitive and intentional cues crucial for intelligent vehicles. Despite its significance research on in-vehicle gaze estimation remains limited due to the scarcity of comprehensive and well-annotated datasets in real driving scenarios. In this paper we present three novel elements to advance in-vehicle gaze research. Firstly we introduce IVGaze a pioneering dataset capturing in-vehicle gaze collected from 125 individuals and covering a large range of gaze and head within vehicles. Conventional gaze collection systems are inadequate for in-vehicle use. In this dataset we propose a new vision-based solution for in-vehicle gaze collection introducing a refined gaze target calibration method to tackle annotation challenges. Second our research focuses on in-vehicle gaze estimation leveraging the IVGaze. Images of in-vehicle faces often suffer from low resolution prompting our introduction of a gaze pyramid transformer that harnesses transformer-based multilevel features integration. Expanding upon this we introduce the dual-stream gaze pyramid transformer (GazeDPTR). Employing perspective transformation we rotate virtual cameras to normalize images utilizing camera pose to merge normalized and original images for accurate gaze estimation. GazeDPTR showcases state-of-the-art performance on the IVGaze dataset. Thirdly we explore a novel strategy for gaze zone classification by extending the GazeDPTR. A foundational tri-plane and project gaze onto these planes are newly defined. Leveraging both positional features from the projection points and visual attributes from images we achieve superior performance compared to relying solely on visual features thereby substantiating the advantage of gaze estimation. The project is available at https://yihua.zone/work/ivgaze
-
Adapting driving behavior to new environments customs and laws is a long-standing problem in autonomous driving precluding the widespread deployment of autonomous vehicles (AVs). In this paper we present LLaDA a simple yet powerful tool that enables human drivers and autonomous vehicles alike to drive everywhere by adapting their tasks and motion plans to traffic rules in new locations. LLaDA achieves this by leveraging the impressive zero-shot generalizability of large language models (LLMs) in interpreting the traffic rules in the local driver handbook. Through an extensive user study we show that LLaDA's instructions are useful in disambiguating in-the-wild unexpected situations. We also demonstrate LLaDA's ability to adapt AV motion planning policies in real-world datasets; LLaDA outperforms baseline planning approaches on all our metrics. Please check our website for more details: https://boyiliee.github.io/llada.
-
Generalizable neural implicit surface reconstruction aims to obtain an accurate underlying geometry given a limited number of multi-view images from unseen scenes. However existing methods select only informative and relevant views using predefined scores for training and testing phases. This constraint renders the model impractical in real-world scenarios where the availability of favorable combinations cannot always be ensured. We introduce and validate a view-combination score to indicate the effectiveness of the input view combination. We observe that previous methods output degenerate solutions under arbitrary and unfavorable sets. Building upon this finding we propose UFORecon a robust view-combination generalizable surface reconstruction framework. To achieve this we apply cross-view matching transformers to model interactions between source images and build correlation frustums to capture global correlations. Additionally we explicitly encode pairwise feature similarities as view-consistent priors. Our proposed framework significantly outperforms previous methods in terms of view-combination generalizability and also in the conventional generalizable protocol trained with favorable view-combinations. The code is available at https://github.com/Youngju-Na/UFORecon.
-
Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators showing state-of-the-art performance in 6DoF pose estimation on Matterport3D InteriorNet StreetLearn and Map-free Relocalization.
-
Event cameras with their high temporal and dynamic range and minimal memory usage have found applications in various fields. However their potential in static traffic monitoring remains largely unexplored. To facilitate this exploration we present eTraM - a first-of-its-kind fully event-based traffic monitoring dataset. eTraM offers 10 hr of data from different traffic scenarios in various lighting and weather conditions providing a comprehensive overview of real-world situations. Providing 2M bounding box annotations it covers eight distinct classes of traffic participants ranging from vehicles to pedestrians and micro-mobility. eTraM's utility has been assessed using state-of-the-art methods for traffic participant detection including RVT RED and YOLOv8. We quantitatively evaluate the ability of event-based models to generalize on nighttime and unseen scenes. Our findings substantiate the compelling potential of leveraging event cameras for traffic monitoring opening new avenues for research and application. eTraM is available at https://eventbasedvision.github.io/eTraM.
-
Learning-based stereo matching techniques have made significant progress. However, existing methods inevitably lose geometrical structure information during the feature channel generation process, resulting in edge detail mismatches. In this paper, the Motif Channel Attention Stereo Matching Network (MoCha-Stereo) is designed to address this problem. We provide the Motif Channel Correlation Volume (MCCV) to determine more accurate edge matching costs. MCCV is achieved by projecting motif channels, which capture common geometric structures in feature channels, onto feature maps and cost volumes. In addition, because edge variations in the reconstruction error map also affect detail matching, we propose the Reconstruction Error Motif Penalty (REMP) module to further refine the full-resolution disparity estimation. REMP integrates the frequency information of typical channel features from the reconstruction error. MoCha-Stereo ranks 1st on the KITTI-2015 and KITTI-2012 Reflective leaderboards. Our structure also shows excellent performance in Multi-View Stereo. Code is available at https://github.com/ZYangChen/MoCha-Stereo.
-
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However despite being trained on millions of short seconds-long videos vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation we propose a lightweight and self-supervised approach Key frame-conditioned long video-LLM (Koala) that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
-
Registration of point clouds collected from a pair of distant vehicles provides a comprehensive and accurate 3D view of the driving scenario which is vital for driving safety related applications yet existing literature suffers from the expensive pose label acquisition and the deficiency to generalize to new data distributions. In this paper we propose EYOC an unsupervised distant point cloud registration method that adapts to new point cloud distributions on the fly requiring no global pose labels. The core idea of EYOC is to train a feature extractor in a progressive fashion where in each round the feature extractor trained with near point cloud pairs can label slightly farther point cloud pairs enabling self-supervision on such far point cloud pairs. This process continues until the derived extractor can be used to register distant point clouds. Particularly to enable high-fidelity correspondence label generation we devise an effective spatial filtering scheme to select the most representative correspondences to register a point cloud pair and then utilize the aligned point clouds to discover more correct correspondences. Experiments show that EYOC can achieve comparable performance with state-of-the-art supervised methods at a lower training cost. Moreover it outwits supervised methods regarding generalization performance on new data distributions.
-
We introduce "HallusionBench", a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens the understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyilab/HallusionBench.
-
Out-of-distribution (OOD) detection methods often exploit auxiliary outliers to train a model to identify OOD samples, in particular by discovering challenging outliers in an auxiliary outlier dataset to improve OOD detection. However, they may still face limitations in effectively distinguishing the most challenging OOD samples, which closely resemble in-distribution (ID) data, i.e., ID-like samples. To this end, we propose a novel OOD detection framework that discovers ID-like outliers using CLIP from the vicinity space of the ID samples, thus helping to identify these most challenging OOD samples. Then a prompt learning framework is proposed that utilizes the identified ID-like outliers to further leverage the capabilities of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a small number of ID samples to learn the prompts of the model, without exposing other auxiliary outlier datasets. By focusing on the most challenging ID-like OOD samples and elegantly exploiting the capabilities of CLIP, our method achieves superior few-shot learning performance on various real-world image datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method reduces the average FPR95 by 12.16% and improves the average AUROC by 2.76% compared to state-of-the-art methods).
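For readers unfamiliar with the FPR95 metric quoted above, here is a minimal sketch of how it is commonly computed: the false positive rate on OOD samples at the threshold where 95% of ID samples are still accepted. The function name `fpr_at_95_tpr`, the convention that higher scores mean "more in-distribution", and the toy scores are assumptions for illustration, not taken from the paper.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when at least 95% of ID samples are accepted.

    Assumes a higher score means "more in-distribution" (hypothetical convention).
    """
    sorted_id = sorted(id_scores, reverse=True)
    # Smallest count k such that k / n >= 0.95, using integer arithmetic.
    k = -(-95 * len(sorted_id) // 100)
    threshold = sorted_id[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)

# Toy scores: 20 ID samples and 10 OOD samples.
id_scores = list(range(1, 21))
ood_scores = [0, 1, 2, 3, 0, 1, 0, 5, 1, 0]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

With these toy scores the threshold is 2, so three of the ten OOD samples are wrongly accepted. A lower FPR95 is better, which is why the abstract reports a reduction.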
-
A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process requiring extensive experience and professional design skills. In this work we present a method that automatically adds motion to a single-subject sketch (hence "breathing life into it") merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation which can be easily edited. Our method does not require extensive training but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.
-
Precise geospatial vegetation forecasting holds potential across diverse sectors including agriculture forestry humanitarian aid and carbon accounting. To leverage the vast availability of satellite imagery for this task various works have applied deep neural networks for predicting multispectral images in photorealistic quality. However the important area of vegetation dynamics has not been thoroughly explored. Our study introduces GreenEarthNet the first dataset specifically designed for high-resolution vegetation forecasting and Contextformer a novel deep learning approach for predicting vegetation greenness from Sentinel 2 satellite images with fine resolution across Europe. Our multi-modal transformer model Contextformer leverages spatial context through a vision backbone and predicts the temporal dynamics on local context patches incorporating meteorological time series in a parameter-efficient manner. The GreenEarthNet dataset features a learned cloud mask and an appropriate evaluation scheme for vegetation modeling. It also maintains compatibility with the existing satellite imagery forecasting dataset EarthNet2021 enabling cross-dataset model comparisons. Our extensive qualitative and quantitative analyses reveal that our methods outperform a broad range of baseline techniques. This includes surpassing previous state-of-the-art models on EarthNet2021 as well as adapted models from time series forecasting and video prediction. To the best of our knowledge this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle thereby paving the way for predicting vegetation health and behaviour in response to climate variability and extremes. We provide open source code and pre-trained weights to reproduce our experimental results under https://github.com/vitusbenson/greenearthnet.
-
Diffusion Models have shown remarkable performance in image generation tasks which are capable of generating diverse and realistic image content. When adopting diffusion models for image restoration the crucial challenge lies in how to preserve high-level image fidelity in the randomness diffusion process and generate accurate background structures and realistic texture details. In this paper we propose a general framework and develop a Diffusion Texture Prior Model (DTPM) for image restoration tasks. DTPM explicitly models high-quality texture details through the diffusion process rather than global contextual content. In phase one of the training stage we pre-train DTPM on approximately 55K high-quality image samples after which we freeze most of its parameters. In phase two we insert conditional guidance adapters into DTPM and equip it with an initial predictor thereby facilitating its rapid adaptation to downstream image restoration tasks. Our DTPM could mitigate the randomness of traditional diffusion models by utilizing encapsulated rich and diverse texture knowledge and background structural information provided by the initial predictor during the sampling process. Our comprehensive evaluations of five image restoration tasks demonstrate DTPM's superiority over existing regression and diffusion-based image restoration methods in perceptual quality and its exceptional generalization capabilities.
-
Single RGB or LiDAR is the mainstream sensor for the challenging scene flow which relies heavily on visual features to match motion features. Compared with single modality existing methods adopt a fusion strategy to directly fuse the cross-modal complementary knowledge in motion space. However these direct fusion methods may suffer the modality gap due to the visual intrinsic heterogeneous nature between RGB and LiDAR thus deteriorating motion features. We discover that event has the homogeneous nature with RGB and LiDAR in both visual and motion spaces. In this work we bring the event as a bridge between RGB and LiDAR and propose a novel hierarchical visual-motion fusion framework for scene flow which explores a homogeneous space to fuse the cross-modal complementary knowledge for physical interpretation. In visual fusion we discover that event has a complementarity (relative v.s. absolute) in luminance space with RGB for high dynamic imaging and has a complementarity (local boundary v.s. global shape) in scene structure space with LiDAR for structure integrity. In motion fusion we figure out that RGB event and LiDAR are complementary (spatial-dense temporal-dense v.s. spatiotemporal-sparse) to each other in correlation space which motivates us to fuse their motion correlations for motion continuity. The proposed hierarchical fusion can explicitly fuse the multimodal knowledge to progressively improve scene flow from visual space to motion space. Extensive experiments have been performed to verify the superiority of the proposed method.
-
Generalizable NeRF can directly synthesize novel views across new scenes, eliminating the need for scene-specific retraining in vanilla NeRF. A critical enabling factor in these approaches is the extraction of a generalizable 3D representation by aggregating source-view features. In this paper, we propose an Entangled View-Epipolar Information Aggregation method dubbed EVE-NeRF. Different from existing methods that consider cross-view and along-epipolar information independently, EVE-NeRF conducts the view-epipolar feature aggregation in an entangled manner by injecting the scene-invariant appearance continuity and geometry consistency priors into the aggregation process. Our approach effectively mitigates the potential lack of inherent geometric and appearance constraints resulting from one-dimensional interactions, thus further boosting the generalizability of the 3D representation. EVE-NeRF attains state-of-the-art performance across various evaluation scenarios. Extensive experiments demonstrate that, compared to prevailing single-dimensional aggregation, the entangled network excels in the accuracy of 3D scene geometry and appearance reconstruction. Our code is publicly available at https://github.com/tatakai1/EVENeRF.
-
The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across many downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/
-
Foot contact is an important cue for human motion capture understanding and generation. Existing datasets tend to annotate dense foot contact using visual matching with thresholding or incorporating pressure signals. However these approaches either suffer from low accuracy or are only designed for small-range and slow motion. There is still a lack of a vision-pressure multimodal dataset with large-range and fast human motion as well as accurate and dense foot-contact annotation. To fill this gap we propose a Multimodal MoCap Dataset with Vision and Pressure sensors named MMVP. MMVP provides accurate and dense plantar pressure signals synchronized with RGBD observations which is especially useful for both plausible shape estimation robust pose fitting without foot drifting and accurate global translation tracking. To validate the dataset we propose an RGBD-P SMPL fitting method and also a monocular-video-based baseline framework VP-MoCap for human motion capture. Experiments demonstrate that our RGBD-P SMPL Fitting results significantly outperform pure visual motion capture. Moreover VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate the research in this direction and also provide a good reference for MoCap applications in various domains. Project page: https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/.
-
Out-of-distribution (OOD) detection has attracted a large amount of attention from the machine learning research community in recent years due to its importance in deployed systems. Most of the previous studies focused on the detection of OOD samples in the multi-class classification task. However OOD detection in the multi-label classification task a more common real-world use case remains an underexplored domain. In this research we propose YolOOD - a method that utilizes concepts from the object detection domain to perform OOD detection in the multi-label classification task. Object detection models have an inherent ability to distinguish between objects of interest (in-distribution data) and irrelevant objects (OOD data) in images that contain multiple objects belonging to different class categories. These abilities allow us to convert a regular object detection model into an image classifier with inherent OOD detection capabilities with just minor changes. We compare our approach to state-of-the-art OOD detection methods and demonstrate YolOOD's ability to outperform these methods on a comprehensive suite of in-distribution and OOD benchmark datasets.
-
Accuracy and computational efficiency are the most important metrics for a Visual Inertial Navigation System (VINS). Existing VINS algorithms offer either high accuracy or low computational complexity, and so struggle to provide high-precision localization on resource-constrained devices. To this end, we propose a novel filter-based VINS framework named SchurVINS (SV), which guarantees both high accuracy, by building a complete residual model, and low computational complexity, by using the Schur complement. Technically, we first formulate the full residual model, where the Gradient, Hessian, and observation covariance are explicitly modeled. Then the Schur complement is employed to decompose the full model into an ego-motion residual model and a landmark residual model. Finally, the Extended Kalman Filter (EKF) update is implemented in these two models with high efficiency. Experiments on the EuRoC and TUM-VI datasets show that our method notably outperforms state-of-the-art (SOTA) methods in both accuracy and computational complexity. The experimental code of SchurVINS is available at https://github.com/bytedance/SchurVINS.
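The decomposition step rests on the Schur complement, which can be sketched on a small linear system. The block below is a generic illustration, not the SchurVINS EKF update: it solves a system whose unknowns split into a 2-dimensional "ego-motion" part x and a "landmark" part y with a diagonal block D, by first eliminating y. The function name `schur_solve` and all numeric values are assumptions for illustration.

```python
def schur_solve(A, B, D_diag, b1, b2):
    """Solve [[A, B], [B^T, D]] [x; y] = [b1; b2] with D diagonal, via the Schur complement.

    A is 2x2, B is 2xm, D_diag holds the m diagonal entries of D (hypothetical toy setup).
    """
    n, m = len(A), len(D_diag)
    assert n == 2, "this sketch solves the reduced system with a 2x2 formula"
    # B * D^{-1}: a diagonal D is inverted entry by entry, which is the cheap part.
    BDinv = [[B[i][j] / D_diag[j] for j in range(m)] for i in range(n)]
    # Reduced system: S = A - B D^{-1} B^T and r = b1 - B D^{-1} b2.
    S = [[A[i][k] - sum(BDinv[i][j] * B[k][j] for j in range(m)) for k in range(n)]
         for i in range(n)]
    r = [b1[i] - sum(BDinv[i][j] * b2[j] for j in range(m)) for i in range(n)]
    # Solve the small 2x2 system S x = r with Cramer's rule.
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    x = [(r[0] * S[1][1] - S[0][1] * r[1]) / det,
         (S[0][0] * r[1] - S[1][0] * r[0]) / det]
    # Back-substitute: y = D^{-1} (b2 - B^T x), one independent division per landmark.
    y = [(b2[j] - sum(B[i][j] * x[i] for i in range(n))) / D_diag[j] for j in range(m)]
    return x, y

A = [[4.0, 1.0], [1.0, 3.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
D_diag = [2.0, 5.0, 4.0]
b1 = [1.0, 2.0]
b2 = [3.0, 4.0, 5.0]
x, y = schur_solve(A, B, D_diag, b1, b2)
```

The only dense solve is on the small reduced system, while each landmark unknown is recovered independently afterwards. This is the general reason Schur-complement elimination lowers cost when landmarks far outnumber pose variables.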
-
Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically effectuate robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification and not content. In this work we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail CLOUDS is a framework that integrates Foundation Models of various kinds: (i) CLIP backbone for its robust feature representation (ii) Diffusion Model to diversify the content thereby covering various modes of the possible target distribution and (iii) Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that our CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions notably outperforming prior methods by 5.6% and 6.7% on averaged mIoU respectively. Our code is available at https://github.com/yasserben/CLOUDS
-
This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars focusing on two key aspects: variability and coordination. Variability allows the avatar to exhibit a wide range of motions even with similar speech content while coordination ensures a harmonious alignment among facial expressions hand gestures and body poses. We aim to achieve both with ProbTalk a unified probabilistic framework designed to jointly model facial hand and body movements in speech. ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs. First we introduce product quantization (PQ) to the VAE which enriches the representation of complex holistic motion. Second we devise a novel non-autoregressive model that embeds 2D positional encoding into the product-quantized representation thereby preserving essential structure information of the PQ codes. Last we employ a secondary stage to refine the preliminary prediction further sharpening the high-frequency details. Coupling these three designs enables ProbTalk to generate natural and diverse holistic co-speech motions outperforming several state-of-the-art methods in qualitative and quantitative evaluations particularly in terms of realism. Our code and model will be released for research purposes at https://feifeifeiliu.github.io/probtalk/.
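As background for the first design, product quantization splits a vector into sub-vectors and quantizes each one against its own small codebook. The sketch below shows generic PQ encoding and decoding; it is not the ProbTalk model, and the function names, the two hand-written codebooks, and the example vector are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def pq_encode(vector, codebooks):
    """Return one code index per sub-vector: the nearest codeword in that sub-space."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for s, codebook in enumerate(codebooks):
        sub = vector[s * sub_dim:(s + 1) * sub_dim]
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(sub, codebook[k]))
        codes.append(nearest)
    return codes

def pq_decode(codes, codebooks):
    """Concatenate the selected codewords to approximate the original vector."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out

# Two hypothetical codebooks, one per 2-dimensional sub-space.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]
vector = [0.9, 0.1, -0.8, 0.2]
codes = pq_encode(vector, codebooks)
decoded = pq_decode(codes, codebooks)
```

With 3 and 2 codewords per sub-space, the pair of codes can represent 3 x 2 = 6 distinct vectors while storing only 5 codewords, which is the sense in which PQ enriches a discrete representation.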
-
Leveraging few-shot datasets in prompt learning for Vision-Language Models eliminates the need for manual prompt engineering while highlighting the necessity of accurate annotations for the labels. However high-level or complex label noise challenges prompt learning for Vision-Language Models. Aiming at this issue we propose a new framework for improving its robustness. Specifically we introduce the Joint Adaptive Partitioning for Label Refurbishment (JoAPR) a structured framework encompassing two key steps. 1) Data Partitioning where we differentiate between clean and noisy data using joint adaptive thresholds. 2) Label Refurbishment where we correct the labels based on the partition outcomes before retraining the network. Our comprehensive experiments confirm that JoAPR substantially enhances the robustness of prompt learning for Vision-Language Models against label noise offering a promising direction for future research.
-
Semi-supervised semantic segmentation (SSSS) has been proposed to alleviate the burden of time-consuming pixel-level manual labeling by leveraging limited labeled data along with larger amounts of unlabeled data. Current state-of-the-art methods train on labeled data with ground truths and on unlabeled data with pseudo labels. However the two training flows are separate which allows labeled data to dominate the training process resulting in low-quality pseudo labels and consequently sub-optimal results. To alleviate this issue we present AllSpark which regenerates the labeled features from unlabeled ones with the channel-wise cross-attention mechanism. We further introduce a Semantic Memory along with a Channel Semantic Grouping strategy to ensure that unlabeled features adequately represent labeled features. AllSpark sheds new light on architecture-level designs for SSSS rather than framework-level ones which avoids increasingly complicated training pipeline designs. It can also be regarded as a flexible bottleneck module that can be seamlessly integrated into a general transformer-based segmentation model. The proposed AllSpark outperforms existing methods across all evaluation protocols on Pascal Cityscapes and COCO benchmarks without bells-and-whistles. Code and model weights are available at: https://github.com/xmed-lab/AllSpark.
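A rough sketch of what channel-wise cross-attention between labeled and unlabeled features could look like is given below. It is an illustration under stated assumptions, not the AllSpark module: there are no learned projections, the Semantic Memory and Channel Semantic Grouping are omitted, and the name `channel_cross_attention` is hypothetical. Each channel of the labeled feature map acts as a query, each channel of the unlabeled feature map acts as a key and value, so every output channel is a weighted mix of unlabeled channels.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape: channels x spatial positions


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_cross_attention(labeled: Matrix, unlabeled: Matrix) -> Matrix:
    """Rebuild each labeled channel as an attention-weighted sum of
    unlabeled channels (queries: labeled, keys and values: unlabeled)."""
    n = len(labeled[0])
    scale = math.sqrt(n)
    out: Matrix = []
    for q in labeled:
        # One similarity score per unlabeled channel.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in unlabeled]
        attn = softmax(scores)
        out.append([sum(w * v[p] for w, v in zip(attn, unlabeled)) for p in range(n)])
    return out


# Toy example (hypothetical values): 2 channels, 2 spatial positions.
labeled = [[1.0, 0.0], [0.0, 1.0]]
unlabeled = [[10.0, 0.0], [0.0, 10.0]]
reborn = channel_cross_attention(labeled, unlabeled)
```

In this toy case each labeled channel attends almost entirely to the unlabeled channel it correlates with, so the output carries the magnitudes of the unlabeled features rather than those of the labeled ones.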
-
In dynamic 3D environments the ability to recognize a diverse range of objects without the constraints of predefined categories is indispensable for real-world applications. In response to this need we introduce OV3D an innovative framework designed for open-vocabulary 3D semantic segmentation. OV3D leverages the broad open-world knowledge embedded in vision and language foundation models to establish a fine-grained correspondence between 3D points and textual entity descriptions. These entity descriptions are enriched with contextual information enabling a more open and comprehensive understanding. By seamlessly aligning 3D point features with entity text features OV3D empowers open-vocabulary recognition in the 3D domain achieving state-of-the-art open-vocabulary semantic segmentation performance across multiple datasets including ScanNet Matterport3D and nuScenes.
-
Advances in image diffusion models have recently led to notable improvements in the generation of high-quality images. In combination with Neural Radiance Fields (NeRFs) they enabled new opportunities in 3D generation. However most generative 3D approaches are object-centric and applying them to editing existing photorealistic scenes is not trivial. We propose SIGNeRF a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation. A new generative update strategy ensures 3D consistency across the edited images without requiring iterative optimization. We find that depth-conditioned diffusion models inherently possess the capability to generate 3D consistent views by requesting a grid of images instead of single views. Based on these insights we introduce a multi-view reference sheet of modified images. Our method updates an image collection consistently based on the reference sheet and refines the original NeRF with the newly generated image set in one go. By exploiting the depth conditioning mechanism of the image diffusion model we gain fine control over the spatial location of the edit and enforce shape guidance by a selected region or an external mesh.
-
While existing large vision-language multimodal models focus on whole image understanding there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge we introduce a novel multimodal model capable of decoding arbitrary (free-form) visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image eliminating the need for complex region encodings yet achieves state-of-the-art performance on region-understanding tasks like Visual7W PointQA and Visual Commonsense Reasoning benchmark. Furthermore we present ViP-Bench a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions enabling future research in this domain. Code data and model are publicly available.
-
Recent advances in Iterative Vision-and-Language Navigation (IVLN) introduce a more meaningful and practical paradigm of VLN by maintaining the agent's memory across tours of scenes. Although the long-term memory aligns better with the persistent nature of the VLN task it poses more challenges on how to utilize the highly unstructured navigation memory with extremely sparse supervision. Towards this end we propose OVER-NAV which aims to go over and beyond the current arts of IVLN techniques. In particular we propose to incorporate LLMs and open-vocabulary detectors to distill key information and establish correspondence between multi-modal signals. Such a mechanism introduces reliable cross-modal supervision and enables on-the-fly generalization to unseen scenes without the need for extra annotation and re-training. To fully exploit the interpreted navigation data we further introduce a structured representation coded Omnigraph to effectively integrate multi-modal information along the tour. Accompanied by a novel omnigraph fusion mechanism OVER-NAV is able to extract the most relevant knowledge from omnigraph for a more accurate navigating action. In addition OVER-NAV seamlessly supports both discrete and continuous environments under a unified framework. We demonstrate the superiority of OVER-NAV in extensive experiments.
-
The robustness of neural networks against input perturbations with bounded magnitude represents a serious concern in the deployment of deep learning models in safety-critical systems. Recently the scientific community has focused on enhancing certifiable robustness guarantees by crafting 1-Lipschitz neural networks that leverage Lipschitz bounded dense and convolutional layers. Different methods have been proposed in the literature to achieve this goal; however comparing the performance of such methods is not straightforward since different metrics can be relevant (e.g. training time memory usage accuracy certifiable robustness) for different applications. Therefore this work provides a thorough comparison between different methods covering theoretical aspects such as computational complexity and memory requirements as well as empirical measurements of time per epoch required memory accuracy and certifiable robust accuracy. The paper also provides some guidelines and recommendations to support the user in selecting the methods that work best depending on the available resources. We provide code at github.com/berndprach/1LipschitzLayersCompared
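To make the notion of a Lipschitz-bounded dense layer concrete, the following sketch estimates the spectral norm of a weight matrix by power iteration, rescales the matrix so the layer is 1-Lipschitz in the L2 norm, and computes the standard certified radius for a Lipschitz classifier from its logit margin. Plain spectral rescaling is used here only as the simplest construction and is not claimed to be one of the methods compared in the paper; the function names and toy values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def transpose(w: Matrix) -> Matrix:
    return [list(col) for col in zip(*w)]


def spectral_norm(w: Matrix, iters: int = 100) -> float:
    """Estimate the largest singular value of w by power iteration on w^T w.
    The all-ones start vector is an assumption; a random start is safer when
    it might be orthogonal to the top singular vector."""
    wt = transpose(w)
    v = [1.0] * len(w[0])
    for _ in range(iters):
        u = matvec(wt, matvec(w, v))
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in u]
    return math.sqrt(sum(c * c for c in matvec(w, v)))


def certified_radius(logits: List[float], lipschitz: float = 1.0) -> float:
    """L2 radius within which the predicted class cannot change, given that
    the network is `lipschitz`-Lipschitz in the L2 norm."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return (top - runner_up) / (math.sqrt(2.0) * lipschitz)


# Toy example (hypothetical values).
w = [[3.0, 0.0], [0.0, 4.0]]
sigma = spectral_norm(w)
w_normalized = [[c / sigma for c in row] for row in w]  # 1-Lipschitz dense layer
radius = certified_radius([2.0, 0.5, -1.0])
```

The certified radius grows with the logit margin and shrinks with the Lipschitz constant, which is why keeping every layer at Lipschitz constant one yields guarantees from a single forward pass.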
-
Data privacy is of great concern in cloud machine-learning service platforms when sensitive data are exposed to service providers. While private computing environments (e.g. secure enclaves) and cryptographic approaches (e.g. homomorphic encryption) provide strong privacy protection their computing performance still falls short compared to cloud GPUs. To achieve privacy protection with high computing performance we propose Delta a new private training and inference framework with comparable model performance as non-private centralized training. Delta features two asymmetric data flows: the main information-sensitive flow and the residual flow. The main part flows into a small model while the residuals are offloaded to a large model. Specifically Delta embeds the information-sensitive representations into a low-dimensional space while pushing the information-insensitive part into high-dimension residuals. To ensure privacy protection the low-dimensional information-sensitive part is secured and fed to a small model in a private environment. On the other hand the residual part is sent to fast cloud GPUs and processed by a large model. To further enhance privacy and reduce the communication cost Delta applies a random binary quantization technique along with a DP-based technique to the residuals before sharing them with the public platform. We theoretically show that Delta guarantees differential privacy in the public environment and greatly reduces the complexity in the private environment. We conduct empirical analyses on CIFAR-10 CIFAR-100 and ImageNet datasets and ResNet-18 and ResNet-34 showing that Delta achieves strong privacy protection fast training and inference without significantly compromising the model utility.
-
We introduce a new task of generating "Illustrated Instructions" i.e. visual instructions customized to a user's needs. We identify desiderata unique to this task and formalize it through a suite of automatic and human evaluation metrics designed to measure the validity consistency and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases users even prefer it to human-generated articles. Most notably it enables various new and exciting applications far beyond what static articles on the web can provide such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.
-
This paper tackles the domain adaptation problem in point cloud semantic segmentation which performs adaptation from a fully labeled domain (source domain) to an unlabeled target domain. Due to the unordered property of point clouds LiDAR scans typically show varying geometric structures across different regions in terms of density noise etc. hence leading to increased dynamics on context. However such characteristics are not consistent across domains due to the difference in sensors environments etc. thus hampering the effective scene comprehension across domains. To solve this we propose Cooperative Context Learning that performs context modeling and modulation from different aspects but in a cooperative manner. Specifically we first devise context embeddings to discover and model contextual relationships with close neighbors in a learnable manner. Then with the context embeddings from two domains we introduce a set of learnable prototypes to attend and associate them under the attention paradigm. As a result these prototypes naturally establish long-range dependency across regions and domains thereby encouraging the transfer of context knowledge and easing the adaptation. Moreover the attention in turn attunes and guides the local context modeling and urges it to focus on the domain-invariant context knowledge thus promoting the adaptation in a cooperative manner. Experiments on representative benchmarks verify that our method attains the new state-of-the-art.
-
Image denoising approaches based on deep neural networks often struggle with overfitting to specific noise distributions present in training data. This challenge persists in existing real-world denoising networks which are trained using a limited spectrum of real noise distributions and thus show poor robustness to out-of-distribution real noise types. To alleviate this issue we develop a novel training framework called Adversarial Frequency Mixup (AFM). AFM leverages mixup in the frequency domain to generate noisy images with distinctive and challenging noise characteristics all the while preserving the properties of authentic real-world noise. Subsequently incorporating these noisy images into the training pipeline enhances the denoising network's robustness to variations in noise distributions. Extensive experiments and analyses conducted on a wide range of real noise benchmarks demonstrate that denoising networks trained with our proposed framework exhibit significant improvements in robustness to unseen noise distributions. The code is available at https://github.com/dhryougit/AFM.
-
Reconstructing 3D hand mesh robustly from a single image is very challenging due to the lack of diversity in existing real-world datasets. While data synthesis helps relieve the issue the syn-to-real gap still hinders its usage. In this work we present HandBooster a new approach to uplift the data diversity and boost the 3D hand-mesh reconstruction performance by training a conditional generative space on hand-object interactions and purposely sampling the space to synthesize effective data samples. First we construct versatile content-aware conditions to guide a diffusion model to produce realistic images with diverse hand appearances poses views and backgrounds; favorably accurate 3D annotations are obtained for free. Then we design a novel condition creator based on our similarity-aware distribution sampling strategies to deliberately find novel and realistic interaction poses that are distinctive from the training set. Equipped with our method several baselines can be significantly improved beyond the SOTA on the HO3D and DexYCB benchmarks. Our code will be released on https://github.com/hxwork/HandBooster_Pytorch.
-
This work proposes the first online asymmetric semi-supervised framework namely A-Teacher for LiDAR-based 3D object detection. Our motivation stems from the observation that 1) existing symmetric teacher-student methods for semi-supervised 3D object detection are characteristically simple but impede the distillation performance between teacher and student because of the demand for an identical model structure and input data format. 2) The offline asymmetric methods with a complex teacher model constructed differently can generate more precise pseudo-labels but make it challenging to jointly optimize the teacher and student model. Consequently in this paper we devise a different path from the conventional paradigm which can harness the capacity of a strong teacher while preserving the advantages of online teacher model updates. The essence is the proposed attention-based refinement model that can be seamlessly integrated into a vanilla teacher. The refinement model works in a divide-and-conquer manner that respectively handles three challenging scenarios including 1) objects detected in the current timestamp but with suboptimal box quality 2) objects missed in the current timestamp but detected in past or future frames 3) objects neglected in all frames. It is worth noting that even while tackling these complex cases our model retains the efficiency of the online teacher-student semi-supervised framework. Experimental results on Waymo show that our method outperforms the previous state-of-the-art HSSDA by 4.7 mAP (L1) while consuming fewer training resources.
-
Matching cost aggregation plays a fundamental role in learning-based multi-view stereo networks. However directly aggregating adjacent costs can lead to suboptimal results due to local geometric inconsistency. Related methods either seek selective aggregation or improve aggregated depth in the 2D space both are unable to handle geometric inconsistency in the cost volume effectively. In this paper we propose GoMVS to aggregate geometrically consistent costs yielding better utilization of adjacent geometries. More specifically we correspond and propagate adjacent costs to the reference pixel by leveraging the local geometric smoothness in conjunction with surface normals. We achieve this by the geometric consistent propagation (GCP) module. It computes the correspondence from the adjacent depth hypothesis space to the reference depth space using surface normals then uses the correspondence to propagate adjacent costs to the reference geometry followed by a convolution for aggregation. Our method achieves new state-of-the-art performance on DTU Tanks & Temple and ETH3D datasets. Notably our method ranks 1st on the Tanks & Temple Advanced benchmark. Code is available at https://github.com/Wuuu3511/GoMVS.
-
Retrieval tasks play central roles in real-world machine learning systems such as search engines recommender systems and retrieval-augmented generation (RAG). Achieving decent performance in these tasks often requires fine-tuning various pre-trained models on specific datasets and selecting the best candidate a process that can be both time and resource-consuming. To tackle the problem we introduce a novel and efficient method called RetMMD that leverages Maximum Mean Discrepancy (MMD) and kernel methods to assess the transferability of pre-trained models in retrieval tasks. RetMMD is calculated on a pre-trained model and a target dataset without any fine-tuning involved. Specifically given some query we quantify the distribution discrepancy between relevant and irrelevant document embeddings by estimating the similarities within their mappings in the fine-tuned embedding space through a kernel method. This discrepancy is averaged over multiple queries taking into account the distribution characteristics of the target dataset. Experiments suggest that the proposed metric calculated on pre-trained models closely aligns with retrieval performance post-fine-tuning. The observation holds across a variety of datasets including image text and multi-modal domains indicating the potential of using MMD and kernel methods for transfer learning evaluation in retrieval scenarios. In addition we also design a way of evaluating dataset transferability for retrieval tasks with experimental results demonstrating the effectiveness of the proposed approach.
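The core quantity described here, a kernel-based discrepancy between relevant and irrelevant document embeddings averaged over queries, can be sketched as follows. This is a generic biased estimate of squared MMD with an RBF kernel, not the RetMMD formula itself: the kernel choice, the `gamma` value, how embeddings are obtained and the function names are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Embedding = Sequence[float]


def rbf(a: Embedding, b: Embedding, gamma: float = 1.0) -> float:
    """RBF kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_kernel(xs: List[Embedding], ys: List[Embedding], gamma: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(xs: List[Embedding], ys: List[Embedding], gamma: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


def averaged_mmd2(per_query: List[Tuple[List[Embedding], List[Embedding]]],
                  gamma: float = 1.0) -> float:
    """Average the discrepancy over queries; each item holds the relevant and
    the irrelevant document embeddings for one query."""
    return sum(mmd2(rel, irr, gamma) for rel, irr in per_query) / len(per_query)


# Toy example (hypothetical embeddings): relevant and irrelevant documents
# form two well separated clusters for a single query.
relevant = [[0.0, 0.0], [0.1, 0.0]]
irrelevant = [[2.0, 2.0], [2.1, 2.0]]
score = averaged_mmd2([(relevant, irrelevant)])
```

A larger score means the embedding space already separates relevant from irrelevant documents, which is the intuition behind using such a discrepancy as a transferability signal before any fine-tuning.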
-
Recent advancements in text-to-image technology have significantly advanced the field of image customization. Among various applications the task of customizing diverse scenes for user-specified composited elements holds great application value but has not been extensively explored. Addressing this gap we propose AnyScene a specialized framework designed to create varied scenes from composited foreground using textual prompts. AnyScene addresses the primary challenges inherent in existing methods particularly scene disharmony due to a lack of foreground semantic understanding and distortion of foreground elements. Specifically we develop a foreground injection module that guides a pre-trained diffusion model to generate cohesive scenes in visual harmony with the provided foreground. To enhance robust generation we implement a layout control strategy that prevents distortions of foreground elements. Furthermore an efficient image blending mechanism seamlessly reintegrates foreground details into the generated scenes producing outputs with overall visual harmony and precise foreground details. In addition we propose a new benchmark and a series of quantitative metrics to evaluate this proposed image customization task. Extensive experimental results demonstrate the effectiveness of AnyScene which confirms its potential in various applications.
-
Super-resolution (SR) is an ill-posed inverse problem where the size of the set of feasible solutions that are consistent with a given low-resolution image is very large. Many algorithms have been proposed to find a "good" solution among the feasible solutions that strike a balance between fidelity and perceptual quality. Unfortunately all known methods generate artifacts and hallucinations while trying to reconstruct high-frequency (HF) image details. A fundamental question is: Can a model learn to distinguish genuine image details from artifacts? Although some recent works focused on the differentiation of details and artifacts this is a very challenging problem and a satisfactory solution is yet to be found. This paper shows that the characterization of genuine HF details versus artifacts can be better learned by training GAN-based SR models using wavelet-domain loss functions compared to RGB-domain or Fourier-space losses. Although wavelet-domain losses have been used in the literature before they have not been used in the context of the SR task. More specifically we train the discriminator only on the HF wavelet sub-bands instead of on RGB images and the generator is trained by a fidelity loss over wavelet subbands to make it sensitive to the scale and orientation of structures. Extensive experimental results demonstrate that our model achieves better perception-distortion trade-off according to multiple objective measures and visual evaluations.
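As an illustration of a wavelet-domain loss restricted to high-frequency sub-bands, the sketch below applies a one-level 2D Haar transform and compares only the three detail sub-bands of two images. The Haar basis, the single decomposition level, the L1 distance and all names are assumptions chosen for brevity; the paper's actual wavelet family, loss weighting and discriminator are not reproduced.

```python
from typing import Dict, List

Image = List[List[float]]


def haar2d(img: Image) -> Dict[str, Image]:
    """One-level orthonormal 2D Haar transform of an image with even height
    and width. Returns the low-pass band and three detail bands."""
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands: Dict[str, Image] = {"low": [], "horiz": [], "vert": [], "diag": []}
    for r in range(0, h, 2):
        rows: Dict[str, List[float]] = {k: [] for k in bands}
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            e, d = img[r + 1][c], img[r + 1][c + 1]
            rows["low"].append((a + b + e + d) / 2.0)
            rows["horiz"].append((a - b + e - d) / 2.0)  # change across columns
            rows["vert"].append((a + b - e - d) / 2.0)   # change across rows
            rows["diag"].append((a - b - e + d) / 2.0)   # diagonal change
        for k in bands:
            bands[k].append(rows[k])
    return bands


def hf_wavelet_l1(pred: Image, target: Image) -> float:
    """Mean absolute difference over the three high-frequency sub-bands."""
    p, t = haar2d(pred), haar2d(target)
    total, count = 0.0, 0
    for k in ("horiz", "vert", "diag"):
        for prow, trow in zip(p[k], t[k]):
            for pv, tv in zip(prow, trow):
                total += abs(pv - tv)
                count += 1
    return total / count


# Toy example (hypothetical values): a sharp vertical edge versus a blurred
# patch that has the same mean intensity.
sharp = [[1.0, 0.0], [1.0, 0.0]]
blurred = [[0.5, 0.5], [0.5, 0.5]]
loss = hf_wavelet_l1(blurred, sharp)
```

Both patches share the same low-pass coefficient, so the whole penalty comes from the missing edge in one oriented detail band, which is the kind of scale- and orientation-specific signal a wavelet-domain loss exposes.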
-
In film gender studies the concept of "male gaze" refers to the way the characters are portrayed on-screen as objects of desire rather than subjects. In this article we introduce a novel video-interpretation task to detect character objectification in films. The purpose is to reveal and quantify the usage of complex temporal patterns operated in cinema to produce the cognitive perception of objectification. We introduce the ObyGaze12 dataset made of 1914 movie clips densely annotated by experts for objectification concepts identified in film studies and psychology. We evaluate recent vision models show the feasibility of the task and where the challenges remain with concept bottleneck models. Our new dataset and code are made available to the community.
-
In this work we address various segmentation tasks each traditionally tackled by distinct or partially unified models. We propose OMG-Seg One Model that is Good enough to efficiently and effectively handle all the segmentation tasks including image semantic instance and panoptic segmentation as well as their video counterparts open vocabulary settings prompt-driven interactive segmentation like SAM and video object segmentation. To our knowledge this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg a transformer-based encoder-decoder architecture with task-specific queries and outputs can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.
-
Creating personalized hand avatars is important to offer a realistic experience to users on AR / VR platforms. While most prior studies focused on reconstructing 3D hand shapes some recent work has tackled the reconstruction of hand textures on top of shapes. However these methods are often limited to capturing pixels on the visible side of a hand requiring diverse views of the hand in a video or multiple images as input. In this paper we propose a novel method BiTT (Bi-directional Texture reconstruction of Two hands) which is the first end-to-end trainable method for relightable pose-free texture reconstruction of two interacting hands taking only a single RGB image by three novel components: 1) bi-directional (left ↔ right) texture reconstruction using the texture symmetry of left / right hands 2) utilizing a texture parametric model for hand texture recovery and 3) the overall coarse-to-fine stage pipeline for reconstructing personalized texture of two interacting hands. BiTT first estimates the scene light condition and albedo image from an input image then reconstructs the texture of both hands through the texture parametric model and bi-directional texture reconstructor. In experiments using InterHand2.6M and RGB2Hands datasets our method significantly outperforms state-of-the-art hand texture reconstruction methods quantitatively and qualitatively. The code is available at https://github.com/yunminjin2/BiTT.
-
Existing open-vocabulary object detectors typically require a predefined set of categories from users significantly confining their application scenarios. In this paper we introduce DetCLIPv3 a high-performing detector that excels not only at open-vocabulary object detection but also at generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging a visual large language model to refine captions for large-scale image-text pairs providing rich multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs DetCLIPv3 demonstrates superior open-vocabulary detection performance e.g. our Swin-T backbone model achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark outperforming GLIPv2 GroundingDINO and DetCLIPv2 by 18.0/19.6/6.6 AP respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in the dense captioning task on the VG dataset showcasing its strong generative capability.
-
Learning-based underwater image enhancement (UIE) methods have made great progress. However the lack of large-scale and high-quality paired training samples has become the main bottleneck hindering the development of UIE. The inter-frame information in underwater videos can accelerate or optimize the UIE process. Thus we constructed the first large-scale high-resolution underwater video enhancement benchmark (UVEB) to promote the development of underwater vision. It contains 1308 pairs of video sequences and more than 453000 high-resolution frame pairs of which 38% are Ultra-High-Definition (UHD) 4K. UVEB comes from multiple countries containing various scenes and video degradation types to adapt to diverse and complex underwater environments. We also propose the first supervised underwater video enhancement method UVE-Net. UVE-Net converts the current frame information into convolutional kernels and passes them to adjacent frames for efficient inter-frame information exchange. By fully utilizing the redundant degraded information of underwater videos UVE-Net completes video enhancement better. Experiments show the effective network design and good performance of UVE-Net.
-
Integration of Large Language Models (LLMs) into visual domain tasks resulting in visual-LLMs (V-LLMs) has enabled exceptional performance in vision-language tasks particularly for visual question answering (VQA). However existing V-LLMs (e.g. BLIP-2 LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers these models fail at simple tasks like distinguishing a left vs right location. In this work we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations data-efficient instruction fine-tuning objectives and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally our resulting model improves VQA across image and video domains reduces undesired hallucination and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
-
Recent 3D face reconstruction methods have made remarkable advancements yet there remain huge challenges in monocular high-quality facial reflectance reconstruction. Existing methods rely on a large amount of light-stage captured data to learn facial reflectance models. However the lack of subject diversity poses challenges in achieving good generalization and widespread applicability. In this paper we learn the reflectance prior in image space rather than UV space and present a framework named ID2Reflectance. Our framework can directly estimate the reflectance maps of a single image while using limited reflectance data for training. Our key insight is that reflectance data shares facial structures with RGB faces which enables obtaining expressive facial prior from inexpensive RGB data thus reducing the dependency on reflectance data. We first learn a high-quality prior for facial reflectance. Specifically we pretrain multi-domain facial feature codebooks and design a codebook fusion method to align the reflectance and RGB domains. Then we propose an identity-conditioned swapping module that injects facial identity from the target image into the pre-trained auto-encoder to modify the identity of the source reflectance image. Finally we stitch multi-view swapped reflectance images to obtain renderable assets. Extensive experiments demonstrate that our method exhibits excellent generalization capability and achieves state-of-the-art facial reflectance reconstruction results for in-the-wild faces.
-
Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3 a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC [Ladune et al. 2023] and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. On the CLIC2020 image benchmark we match the RD performance of VTM the reference implementation of the H.266 codec with less than 3k MACs/pixel for decoding. On the UVG video benchmark we match the RD performance of the Video Compression Transformer [Mentzer et al. 2022] a well-established neural video codec with less than 5k MACs/pixel for decoding.
-
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First we introduce an approach to weight tokens based on motion gradients thus shifting the focus from the static background scene to the foreground objects. Second we integrate a teacher decoder and a student decoder into our architecture leveraging the discrepancy between the outputs given by the two decoders to improve anomaly detection. Third we generate synthetic abnormal events to augment the training videos and task the masked AE model to jointly reconstruct the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model as demonstrated by the extensive experiments carried out on four benchmarks: Avenue ShanghaiTech UBnormal and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy obtaining competitive AUC scores while processing 1655 FPS. Hence our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design. Our code is freely available at: https://github.com/ristea/aed-mae.
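The first design point, weighting tokens by motion so that foreground activity counts more than static background, might be approximated as in the sketch below. It uses the absolute difference between two consecutive grayscale frames as a stand-in for motion gradients and averages it per patch; this proxy, the normalization and the name `motion_token_weights` are assumptions rather than the paper's definition.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def motion_token_weights(prev: Frame, curr: Frame, patch: int) -> List[float]:
    """Return one weight per non-overlapping patch (row-major order),
    proportional to the mean absolute frame difference inside the patch.
    Weights sum to 1; a uniform weighting is used when nothing moves."""
    h, w = len(curr), len(curr[0])
    if h % patch or w % patch:
        raise ValueError("frame size must be divisible by the patch size")
    weights = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            total = sum(abs(curr[i][j] - prev[i][j])
                        for i in range(r, r + patch)
                        for j in range(c, c + patch))
            weights.append(total / (patch * patch))
    norm = sum(weights)
    if norm == 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [x / norm for x in weights]


# Toy example (hypothetical values): motion only in the top-left 2x2 patch.
prev = [[0.0] * 4 for _ in range(4)]
curr = [[0.0] * 4 for _ in range(4)]
curr[0][0] = 1.0
curr[1][1] = 1.0
weights = motion_token_weights(prev, curr, patch=2)
```

Such weights could then scale a per-token reconstruction loss so that errors on moving regions dominate the training signal.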
-
The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However a major drawback of these models is their inferior performance compared to diffusion models. In this paper we aim to re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies. Specifically we identify the complexities in properly configuring these strategies and indicate the possible sub-optimality in existing heuristic-driven designs. Recognizing this we propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework. The resulting method named AutoNAT advances the performance boundaries of NATs notably and is able to perform comparably with the latest diffusion models with a significantly reduced inference cost. The effectiveness of AutoNAT is comprehensively validated on four benchmark datasets i.e. ImageNet-256 & 512 MS-COCO and CC3M. Code and pre-trained models will be available at https://github.com/LeapLabTHU/ImprovedNAT.
-
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides our model generates detailed descriptions for previously unseen videos which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product we generate the largest video caption dataset to date.
-
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views and even from a single RGB image. However existing monocular approaches still struggle to recover fine geometric details such as face hands or cloth wrinkles. They are also prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and enable spatial relationships mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy which improves the accuracy of the signed distance field estimation of points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB surface normals point cloud or RGB-D data as input. In addition we introduce ANIM-Real a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera and our protocol to fine-tune ANIM enabling high-quality reconstruction from real-world human capture.
-
We present SimXR a method for controlling a simulated avatar from information (headset pose and cameras) obtained from AR / VR headsets. Due to the challenging viewpoint of head-mounted cameras the human body is often clipped out of view making traditional image-based egocentric pose estimation challenging. On the other hand headset poses provide valuable information about overall body motion but lack fine-grained details about the hands and feet. To synergize headset poses with cameras we control a humanoid to track headset movement while analyzing input images to decide body movement. When body parts are seen the movements of hands and feet will be guided by the images; when unseen the laws of physics guide the controller to generate plausible motion. We design an end-to-end method that does not rely on any intermediate representations and learns to directly map from images and headset poses to humanoid control signals. To train our method we also propose a large-scale synthetic dataset created using camera configurations compatible with a commercially available VR headset (Quest 2) and show promising results on real-world captures. To demonstrate the applicability of our framework we also test it on an AR headset with a forward-facing camera.
-
Recently Vision-Language Models (VLMs) have greatly advanced Human-Object Interaction (HOI) detection. Existing VLM-based HOI detectors typically adopt a hand-crafted template (e.g. a photo of a person [action] a/an [object]) to acquire text knowledge through the VLM text encoder. However such approaches which encode action-specific text prompts only at the vocabulary level may suffer from learning ambiguity because they do not explore fine-grained clues from the perspective of interaction context. In this paper we propose a novel method to discover Syntactic Interaction Clues for HOI detection (SICHOI) using a VLM. Specifically we first investigate the essential elements of an interaction context and then establish a syntactic interaction bank from three levels: spatial relationship action-oriented posture and situational condition. Further to align visual features with the syntactic interaction bank we adopt a multi-view extractor to jointly aggregate visual features from instance interaction and image levels accordingly. In addition we introduce a dual cross-attention decoder to perform context propagation between text knowledge and visual features thereby enhancing HOI detection. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on HICO-DET and V-COCO.
-
The analysis of the ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions and a lack of hand gestures and fine-grained textual descriptions. To better perceive and generate human-human interactions we propose Inter-X currently the largest human-human interaction dataset with accurate body movements and diverse interaction patterns together with detailed hand gestures. The dataset includes 11K interaction sequences and more than 8.1M frames. We also equip Inter-X with versatile annotations of more than 34K fine-grained human part-level textual descriptions semantic interaction categories interaction order and the relationship and personality of the subjects. Based on the elaborate annotations we propose a unified benchmark composed of 4 categories of downstream tasks from both the perceptual and generative directions. Extensive experiments and comprehensive analysis show that Inter-X serves as a testbed for promoting the development of versatile human-human interaction analysis. Our dataset and benchmark will be publicly available for research purposes.
-
In this paper we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models our model dubbed GenAD handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner surpassing general or driving-specific video prediction counterparts. Furthermore GenAD can be adapted into an action-conditioned prediction model or a motion planner holding great potential for real-world driving applications.
-
We study supervised action segmentation whose goal is to predict framewise action labels of a video. To capture temporal dependencies over long horizons prior works either improve framewise features with transformers or refine framewise predictions with learned action features. However they are computationally costly and ignore that frame and action features contain complementary information which can be leveraged to enhance both features and improve temporal modeling. Therefore we propose an efficient Frame-Action Cross-attention Temporal modeling (FACT) framework that performs temporal modeling with frame and action features in parallel and leverages this parallelism to achieve iterative bidirectional information transfer between the features and refine them. The FACT network contains (i) a frame branch to learn frame-level information with convolutions and frame features (ii) an action branch to learn action-level dependencies with transformers and action tokens and (iii) cross-attentions to allow communication between the two branches. We also propose a new matching loss to ensure each action token uniquely encodes an action segment and thus better captures its semantics. Thanks to our architecture we can also leverage textual transcripts of videos to help action segmentation. We evaluate FACT on four video datasets (two egocentric and two third-person) for action segmentation with and without transcripts showing that it significantly improves the state-of-the-art accuracy while enjoying lower computational cost (3 times faster) than existing transformer-based methods.
-
Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective training-based ZS-TAL approaches assume the availability of labeled data for supervised learning which can be impractical in some applications. Furthermore the training process naturally induces a domain bias into the learned model which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective relaxing the requirement for training data. To this aim we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs confirming the benefit of a test-time adaptation approach.
-
A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP DINOv2 SAM are trained with distinct objectives exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features such as zero-shot vision-language comprehension detailed pixel-level understanding and open vocabulary segmentation capabilities. Additionally in pursuit of the most hardware-efficient backbone we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 6x faster than the teacher models at matched resolution. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification semantic segmentation linear probing COCO object detection and integration into LLaVa-1.5.
-
Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment 3D instances without predefined categories. However progress in 3D lags behind its 2D counterpart due to limited annotated 3D data. To address this recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frames. In contrast to these local metrics we propose a novel metric view consensus rate to enhance the utilization of multi-view observations. The key insight is that two 2D masks should be deemed part of the same 3D instance if a significant number of other 2D masks from different views contain both these two masks. Using this metric as edge weight we construct a global mask graph where each mask is a node. Through iterative clustering of masks showing high view consensus we generate a series of clusters each representing a distinct 3D instance. Notably our model is training-free. Through extensive experiments on publicly available datasets including ScanNet++ ScanNet200 and MatterPort3D we demonstrate that our method achieves state-of-the-art performance in open-vocabulary 3D instance segmentation. Our project page is at https://pku-epic.github.io/MaskClustering/.
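The view-consensus idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: it assumes each 2D mask has already been back-projected to a set of 3D point indices, that "contains" means covering at least 80% of a mask's points, and that the rate is normalized by the observer masks that contain at least one of the two masks. The names `contains`, `view_consensus_rate` and the threshold are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

MaskId = Tuple[int, int]  # (view index, mask index within that view)


def contains(outer: FrozenSet[int], inner: FrozenSet[int], thresh: float = 0.8) -> bool:
    """True if `outer` covers at least `thresh` of the points in `inner` (assumed rule)."""
    if not inner:
        return False
    return len(outer & inner) / len(inner) >= thresh


def view_consensus_rate(a: MaskId, b: MaskId, masks: Dict[MaskId, FrozenSet[int]]) -> float:
    """Fraction of observer masks from other views that contain both a and b,
    among observer masks that contain at least one of them."""
    supporters = 0
    observers = 0
    for mask_id, points in masks.items():
        if mask_id[0] in (a[0], b[0]):
            continue  # skip masks taken from the views of a and b themselves
        has_a = contains(points, masks[a])
        has_b = contains(points, masks[b])
        if has_a or has_b:
            observers += 1
            if has_a and has_b:
                supporters += 1
    return supporters / observers if observers else 0.0


# Toy scene: masks are sets of 3D point indices, keyed by (view, mask).
masks = {
    (0, 0): frozenset({1, 2, 3}),
    (1, 0): frozenset({3, 4, 5}),
    (2, 0): frozenset({1, 2, 3, 4, 5}),     # sees both masks as one object
    (3, 0): frozenset({1, 2, 3, 4, 5, 6}),  # sees both masks as one object
    (4, 0): frozenset({1, 2, 3}),           # sees only the first mask
}
rate = view_consensus_rate((0, 0), (1, 0), masks)
print(rate)
```

In a full pipeline such a rate would serve as the edge weight between two mask nodes, and edges above a chosen threshold would be merged during clustering.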
-
Conditional human motion generation is an important topic with many applications in virtual reality gaming and robotics. While prior works have focused on generating motion guided by text music or scenes these typically result in isolated motions confined to short durations. Instead we address the generation of long continuous sequences guided by a series of varying textual descriptions. In this context we introduce FlowMDM the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this we introduce the Blended Positional Encodings a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically global motion coherence is recovered at the absolute stage whereas smooth and realistic transitions are built at the relative stage. As a result we achieve state-of-the-art results in terms of accuracy realism and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention which makes it robust against varying text descriptions at inference time. Finally to address the limitations of existing HMC metrics we propose two new metrics: the Peak Jerk and the Area Under the Jerk to detect abrupt transitions.
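As a rough illustration of jerk-based transition metrics, the sketch below computes jerk as the third finite difference of joint positions and derives a peak value and an area value from it. The abstract does not give the exact definitions of Peak Jerk and Area Under the Jerk, so the aggregation used here (mean over joints, maximum and sum over time) is an assumption, and the function names are hypothetical.

```python
import numpy as np


def jerk_magnitude(positions: np.ndarray, fps: float) -> np.ndarray:
    """positions: (T, J, 3) joint positions.
    Returns the per-frame jerk magnitude averaged over joints, shape (T - 3,)."""
    dt = 1.0 / fps
    jerk = np.diff(positions, n=3, axis=0) / dt**3  # third finite difference
    return np.linalg.norm(jerk, axis=-1).mean(axis=-1)


def peak_jerk(positions: np.ndarray, fps: float) -> float:
    """Largest jerk magnitude over the sequence (assumed aggregation)."""
    return float(jerk_magnitude(positions, fps).max())


def area_under_jerk(positions: np.ndarray, fps: float) -> float:
    """Rectangle-rule integral of the jerk magnitude over time (assumed aggregation)."""
    return float(jerk_magnitude(positions, fps).sum() / fps)


# Constant-velocity motion has zero jerk; an abrupt offset at frame 5 does not.
smooth = np.arange(10, dtype=float)[:, None, None] * np.ones((1, 2, 3))
jumpy = smooth.copy()
jumpy[5:] += 1.0
print(peak_jerk(smooth, fps=30.0), peak_jerk(jumpy, fps=30.0))
```

An abrupt transition between two composed motions shows up as a spike in the jerk curve, which is what a peak-style metric is meant to catch.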
-
Adversarial robustness of the neural network is a significant concern when it is applied to security-critical domains. In this situation adversarial distillation is a promising option which aims to distill the robustness of the teacher network to improve the robustness of a small student network. Previous works pretrain the teacher network to make it robust against the adversarial examples aimed at itself. However the adversarial examples are dependent on the parameters of the target network. The fixed teacher network inevitably degrades its robustness against the unseen transferred adversarial examples which target the parameters of the student network in the adversarial distillation process. We propose PeerAiD to make a peer network learn the adversarial examples of the student network instead of adversarial examples aimed at itself. PeerAiD is an adversarial distillation that trains the peer network and the student network simultaneously in order to specialize the peer network for defending the student network. We observe that such peer networks surpass the robustness of the pretrained robust teacher model against adversarial examples aimed at the student network. With this peer network and adversarial distillation PeerAiD achieves significantly higher robustness of the student network with AutoAttack (AA) accuracy by up to 1.66%p and improves the natural accuracy of the student network by up to 4.72%p with ResNet-18 on TinyImageNet dataset. Code is available at https://github.com/jaewonalive/PeerAiD.
-
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully selected subsets of massive web scrapes. For instance the LAION public dataset retained only about 10 percent of the total crawled data. In recent times data curation has gained prominence with several works developing strategies to retain high-quality subsets of raw scraped data. However these strategies are typically developed agnostic to the available compute for training. In this paper we demonstrate that making filtering decisions independent of training compute is often suboptimal: well-curated data rapidly loses its utility when repeated eventually decreasing below the utility of unseen but lower-quality data. While past research in neural scaling laws has considered web data to be homogeneous real data is not. Our work bridges this important gap in the literature by developing scaling laws that characterize the differing utility of various data subsets and accounting for how this diminishes for a data point at its nth repetition. Our key message is that data curation cannot be agnostic of the total compute a model will be trained for. Even without ever jointly training on multiple data buckets our scaling laws enable us to estimate model performance under this dynamic trade-off between quality and repetition. This allows us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets carving out a pareto-frontier for data curation.
-
3D correspondence i.e. a pair of 3D points is a fundamental concept in computer vision. A set of 3D correspondences when equipped with compatibility edges forms a correspondence graph. This graph is a critical component in several state-of-the-art 3D point cloud registration approaches e.g. the one based on maximal cliques (MAC). However its properties have not been well understood. So we present the first study that introduces graph signal processing into the domain of correspondence graphs. We exploit the generalized degree signal on the correspondence graph and pursue sampling strategies that preserve high-frequency components of this signal. To address time-consuming singular value decomposition in deterministic sampling we resort to a stochastic approximate sampling strategy. As such the core of our method is the stochastic spectral sampling of the correspondence graph. As an application we build a complete 3D registration algorithm termed FastMAC that reaches real-time speed while leading to little to no performance drop. Through extensive experiments we validate that FastMAC works for both indoor and outdoor benchmarks. For example FastMAC can accelerate MAC by 80 times while maintaining a high registration success rate on KITTI. Codes are publicly available at https://github.com/Forrest-110/FastMAC.
-
Federated learning is a promising framework to train neural networks with widely distributed data. However performance degrades heavily with heterogeneously distributed data. Recent work has shown this is due to the final layer of the network being most prone to local bias with some finding success freezing the final layer as an orthogonal classifier. We investigate the training dynamics of the classifier by applying SVD to the weights motivated by the observation that freezing weights results in constant singular values. We find that there are differences when training in IID and non-IID settings. Based on this finding we introduce two regularization terms for local training to continuously emulate IID settings: (1) variance in the dimension-wise probability distribution of the classifier and (2) hyperspherical uniformity of representations of the encoder. These regularizations encourage local models to act as if they were in an IID setting regardless of the local data distribution thus offsetting proneness to bias while being flexible to the data. In extensive experiments in both label-shift and feature-shift settings we verify that our method achieves the highest performance by a large margin especially in highly non-IID cases in addition to being scalable to larger models and datasets.
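The observation about singular values can be checked on a toy classifier. The sketch below assumes a hypothetical 10-class linear classifier over 64-dimensional features, builds a frozen orthogonal classifier via QR decomposition, and compares its singular values with those of a perturbed (stand-in for "trained") copy. It only illustrates the SVD analysis mentioned in the abstract, not the proposed regularizers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, feat_dim = 10, 64  # hypothetical sizes


def singular_values(weight: np.ndarray) -> np.ndarray:
    """Singular values of a classifier weight matrix, largest first."""
    return np.linalg.svd(weight, compute_uv=False)


# Orthogonal classifier: rows are orthonormal, so every singular value is 1.
q, _ = np.linalg.qr(rng.normal(size=(feat_dim, num_classes)))
w_frozen = q.T  # shape (num_classes, feat_dim)

# Stand-in for local training: a random perturbation of the weights.
w_trained = w_frozen + 0.1 * rng.normal(size=w_frozen.shape)

s_frozen = singular_values(w_frozen)
s_trained = singular_values(w_trained)
print(s_frozen.round(3))
print(s_trained.round(3))
```

Tracking how far `s_trained` drifts from the constant spectrum across rounds is one simple way to visualize the difference between IID and non-IID local updates.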
-
Federated Learning (FL) aggregates locally trained models from individual clients to construct a global model. While FL enables learning a model with data privacy it often suffers from significant performance degradation when clients have heterogeneous data distributions. This data heterogeneity causes the model to forget the global knowledge acquired from previously sampled clients after being trained on local datasets. Although the introduction of proximal objectives in local updates helps to preserve global knowledge it can also hinder local learning by interfering with local objectives. Inspired by Continual Learning (CL) we adopt an orthogonal learning strategy to balance these two conflicting objectives. However we observe that directly negating the proximal gradient in the local gradient significantly undermines local learning. To address the problem we propose a novel method Federated Stabilized Orthogonal Learning (FedSOL). FedSOL is designed to identify gradients of local objectives that are inherently orthogonal to directions affecting the proximal objective. Specifically FedSOL targets parameter regions where learning on the local objective is minimally influenced by proximal weight perturbations. Our experiments demonstrate that FedSOL consistently achieves state-of-the-art performance across various scenarios.
-
Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions addressing the limitations (e.g. efficiency and flexibility) imposed by mesh or NeRF-based representations. However a naive application of Gaussian splatting cannot generate high-quality animatable avatars and suffers from learning instability; it also cannot capture fine avatar geometries and often leads to degenerate body parts. To tackle these problems we first propose a primitive-based 3D Gaussian representation where Gaussians are defined inside pose-driven primitives to facilitate animations. Second to stabilize and amortize the learning of millions of Gaussians we propose to use implicit neural fields to predict the Gaussian attributes (e.g. colors). Finally to capture fine avatar geometries and extract detailed meshes we propose a novel SDF-based implicit mesh learning approach for 3D Gaussians that regularizes the underlying geometries and extracts highly detailed textured meshes. Our proposed method GAvatar enables the large-scale generation of diverse animatable avatars using only text prompts. GAvatar significantly surpasses existing methods in terms of both appearance and geometry quality and achieves extremely fast rendering (100 fps) at 1K resolution.
-
Understanding how attention varies across individuals has significant scientific and societal impacts. However existing visual scanpath models treat attention uniformly neglecting individual differences. To bridge this gap this paper focuses on individualized scanpath prediction (ISP) a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits (2) an observer-centric feature integration approach that holistically combines visual features task guidance and observer-specific characteristics and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets model architectures and visual tasks offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability.
-
This paper presents a novel category-agnostic model for the visual rearrangement task which can help an embodied agent physically restore a shuffled scene configuration to the goal configuration without any category concepts. Previous methods usually follow a similar architecture completing the rearrangement task by aligning the scene changes between the goal and shuffled configurations according to semantic scene graphs. However constructing scene graphs requires the inference of category labels which not only reduces the accuracy of the entire task but also limits its application in real-world scenarios. In this paper we delve into the essence of the visual rearrangement task and focus on the two most essential issues: scene change detection and scene change matching. We utilize the movement and the protrusion of the point cloud to accurately identify the scene changes and match these changes according to the similarity of category-agnostic appearance features. Moreover to help the agent explore the environment more efficiently and comprehensively we propose a closer-aligned-retrace exploration policy aiming to observe more details of the scene at a closer distance. We conduct extensive experiments on the AI2THOR Rearrangement Challenge based on the RoomR dataset and on a new multi-room multi-instance dataset MrMiR collected by us. The experimental results demonstrate the effectiveness of our proposed method.
-
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval classification or captioning. But so far those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result they need to be fine-tuned for this task. In this paper we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. GEM not only outperforms other training-free open-vocabulary localization methods but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark. Code is available at https://github.com/WalBouss/GEM
-
We focus on a very challenging task: imaging dynamic scenes at nighttime. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However they inevitably face a dilemma between the long exposure time required at nighttime and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microsecond) and higher dynamic range (120dB) offering an alternative solution. In this work we present a novel nighttime dynamic imaging method with an event camera. Specifically we discover that events at nighttime exhibit temporal trailing characteristics and a spatially non-stationary distribution. Consequently we propose a nighttime event reconstruction network (NER-Net) which mainly includes a learnable event timestamps calibration module (LETC) to align the temporal trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover we construct a paired real low-light event dataset (RLED) through a co-axial imaging system including 64200 spatially and temporally aligned image GTs and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project is available at: https://github.com/Liu-haoyue/NER-Net.
-
Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep learning excels in learning multi-level feature spaces but they often lack explicit coding of part-whole relations a prominent property of medical imaging. To overcome this limitation we introduce Adam-v2 a new self-supervised learning framework extending Adam [68] by explicitly incorporating part-whole hierarchies into its learning objectives through three key branches: (1) Localizability acquiring discriminative representations to distinguish different anatomical patterns; (2) Composability learning each anatomical structure in a parts-to-whole manner; and (3) Decomposability comprehending each anatomical structure in a whole-to-parts manner. Experimental results across 10 tasks compared to 11 baselines in zero-shot few-shot transfer and full fine-tuning settings showcase Adam-v2's superior performance over large-scale medical models and existing SSL methods across diverse downstream tasks. The higher generality and robustness of Adam-v2's representations originate from its explicit construction of hierarchies for distinct anatomical structures from unlabeled medical images. Adam-v2 preserves a semantic balance of anatomical diversity and harmony in its embedding yielding representations that are both generic and semantically meaningful yet overlooked in existing SSL methods. All code and pretrained models are available at GitHub.com/JLiangLab/Eden.
-
Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts during the test time. Though prior studies have achieved very promising performance they involve intensive computation which is at odds with the goals of test-time adaptation. We design TDA a training-free dynamic adapter that enables effective and efficient test-time adaptation with vision-language models. TDA works with a lightweight key-value cache that maintains a dynamic queue with few-shot pseudo labels as values and the corresponding test-sample features as keys. Leveraging the key-value cache TDA allows adapting to test data gradually via progressive pseudo label refinement which is highly efficient because it requires no backpropagation. In addition we introduce negative pseudo labeling that alleviates the adverse impact of pseudo label noise by assigning pseudo labels to certain negative classes when the model is uncertain about its pseudo label predictions. Extensive experiments over two benchmarks demonstrate TDA's superior effectiveness and efficiency as compared with the state-of-the-art. The code has been released at https://kdiaaa.github.io/tda/.
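A key-value cache of this kind can be sketched in a few lines. The version below is a simplified illustration, not the released TDA code: it assumes unit-normalized features, keeps the lowest-entropy entries per pseudo-labeled class, and scores a new feature with an exponential affinity to the cached keys. The class name `PseudoLabelCache`, the capacity and the sharpness `beta` are hypothetical, and negative pseudo labeling is omitted.

```python
import numpy as np


class PseudoLabelCache:
    """Per-class queue of (entropy, feature) pairs; no gradients are involved."""

    def __init__(self, num_classes: int, capacity: int = 3):
        self.num_classes = num_classes
        self.capacity = capacity
        self.items = {c: [] for c in range(num_classes)}

    def update(self, feature: np.ndarray, probs: np.ndarray) -> None:
        """Store the feature under its pseudo label, keeping the most confident entries."""
        label = int(np.argmax(probs))
        entropy = float(-(probs * np.log(probs + 1e-12)).sum())
        queue = self.items[label]
        queue.append((entropy, feature))
        queue.sort(key=lambda entry: entry[0])  # lowest entropy first
        del queue[self.capacity:]               # evict the least confident entries

    def logits(self, feature: np.ndarray, beta: float = 5.0) -> np.ndarray:
        """Cache-based class scores from affinities between the feature and cached keys."""
        out = np.zeros(self.num_classes)
        for label, queue in self.items.items():
            for _, key in queue:
                out[label] += np.exp(-beta * (1.0 - float(feature @ key)))
        return out


cache = PseudoLabelCache(num_classes=2, capacity=2)
f0 = np.array([1.0, 0.0])
f1 = np.array([0.0, 1.0])
cache.update(f0, np.array([0.9, 0.1]))
cache.update(f0, np.array([0.6, 0.4]))    # least confident, evicted below
cache.update(f0, np.array([0.99, 0.01]))
cache.update(f1, np.array([0.2, 0.8]))
adapted = cache.logits(f0)
print(adapted)
```

In practice the cache scores would be added to the zero-shot logits of the vision-language model, so that predictions shift gradually as more confident test samples arrive.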
-
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent MultiModal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.
-
Transformer models developed in NLP have had a great impact on computer vision, producing promising performance on various tasks. While multi-head attention, a characteristic mechanism of the transformer, attracts keen research interest, for example in reducing computation cost, we analyze the transformer model from the viewpoint of feature transformation based on the distribution of input feature tokens. The analysis inspires us to derive a novel transformation method from the mean-shift update, which is an effective gradient ascent for seeking a local mode of distinctive representation on the token distribution. We also present an efficient projection approach to reduce the parameter size of the linear projections constituting the proposed multi-head feature transformation. In experiments on the ImageNet-1K dataset, the proposed methods, embedded into various network models in place of the transformer module, exhibit favorable performance improvement.
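For readers unfamiliar with the mean-shift update mentioned above, the sketch below shows one textbook Gaussian-kernel mean-shift step applied to a set of tokens. It illustrates the classical update only; the transformation the paper derives from it may differ, and the function name `mean_shift_step` and the `bandwidth` parameter are assumptions made for this example.

```python
import math


def mean_shift_step(tokens, bandwidth=1.0):
    """Move every token to the kernel-weighted mean of all tokens.

    Each step is a gradient-ascent move on a kernel density estimate,
    so repeated steps pull tokens toward a local mode.
    """
    updated = []
    for x in tokens:
        weights = []
        for y in tokens:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
        total = sum(weights)
        updated.append([
            sum(w * y[d] for w, y in zip(weights, tokens)) / total
            for d in range(len(x))
        ])
    return updated


# Two well-separated clusters of one-dimensional tokens.
tokens = [[0.0], [0.2], [5.0], [5.2]]
shifted = mean_shift_step(tokens, bandwidth=1.0)
```

The kernel weights are normalised over all tokens, which is structurally similar to a softmax attention row; that resemblance is what makes the comparison with attention natural.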
-
Saliency object ranking (SOR) has attracted significant attention recently. Previous methods usually fail to explicitly explore the saliency degree-related relationships between objects. In this paper, we propose a novel Domain Separation Graph Neural Network (DSGNN), which starts by separately extracting the shape and texture cues from each object and builds a shape graph as well as a texture graph for all objects in the given image. Then, we propose a Shape-Texture Graph Domain Separation (STGDS) module to separate the task-relevant and irrelevant information of target objects by explicitly modelling the relationship between each pair of objects in terms of their shapes and textures, respectively. Furthermore, a Cross Image Graph Domain Separation (CIGDS) module is introduced to explore the saliency degree subspace that is robust to different scenes, aiming to create a unified representation for targets with the same saliency levels in different images. Importantly, our DSGNN automatically learns a multi-dimensional feature to represent each graph edge, allowing complex, diverse, and ranking-related relationships to be modelled. Experimental results show that our DSGNN achieves new state-of-the-art performance on both the ASSR and IRSR datasets, with large improvements of 5.2% and 4.1% SA-SOR, respectively. Our code is provided at https://github.com/Wu-ZJ/DSGNN.
-
Crack segmentation datasets make great efforts to obtain the ground truth crack or non-crack labels as clearly as possible. However it can be observed that ambiguities are still inevitable when considering the marginal non-crack region due to low contrast and heterogeneous texture. To solve this problem we propose a novel clustering-inspired representation learning framework which contains a two-phase strategy for automatic crack segmentation. In the first phase a pre-process is proposed to localize the marginal non-crack region. Then we propose an ambiguity-aware segmentation loss (Aseg Loss) that enables crack segmentation models to capture ambiguities in the above regions via learning segmentation variance which allows us to further localize ambiguous regions. In the second phase to learn the discriminative features of the above regions we propose a clustering-inspired loss (CI Loss) that alters the supervision learning of these regions into an unsupervised clustering manner. We demonstrate that the proposed method could surpass the existing crack segmentation models on various datasets and our constructed CrackSeg5k dataset.
-
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging thin filamentous and widely branching morphologies multiple neurons are tightly inter-weaved and partial volume effects uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing to date methods are typically benchmarked on synthetic datasets. To address this gap we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies and facilitate scientific discovery in basic neuroscience.
-
Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to the limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed region-specific captions. To address this, we introduce RegionGPT (RGPT for short), a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representation with simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both training and inference phases, while maintaining the model's versatility for general-purpose tasks. Additionally, we develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions. We demonstrate that a universal RGPT model can be effectively applied to, and significantly enhances performance across, a range of region-level tasks, including but not limited to complex region descriptions, reasoning, object classification, and referring expression comprehension.
-
Recent progress in Large Multimodal Models (LMM) has opened up great possibilities for various applications in the field of human-machine interactions. However developing LMMs that can comprehend reason and plan in complex and diverse 3D environments remains a challenging topic especially considering the demand for understanding permutation-invariant point cloud representations of the 3D scene. Existing works seek help from multi-view images by projecting 2D features to 3D space which inevitably leads to huge computational overhead and performance degradation. In this paper we present LL3DA a Large Language 3D Assistant that takes point cloud as the direct input and responds to both text instructions and visual interactions. The additional visual interaction enables LMMs to better comprehend human interactions with the 3D environment and further remove the ambiguities within plain texts. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
-
Representing and rendering dynamic scenes has been an important but challenging task. In particular, high efficiency is usually hard to guarantee when complex motions must be modeled accurately. To achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency, we propose 4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes, rather than applying 3D-GS to each individual frame. In 4D-GS, a novel explicit representation containing both 3D Gaussians and 4D neural voxels is proposed. A decomposed neural voxel encoding algorithm inspired by HexPlane is proposed to efficiently build Gaussian features from 4D neural voxels, and then a lightweight MLP is applied to predict Gaussian deformations at novel timestamps. Our 4D-GS method achieves real-time rendering at high resolutions, 82 FPS at an 800×800 resolution on an RTX 3090 GPU, while maintaining comparable or better quality than previous state-of-the-art methods. More demos and code are available at https://guanjunwu.github.io/4dgs.
-
This paper focuses on advancing the applicability of human avatar learning methods by proposing RAM-Avatar which learns a Real-time photo-realistic Avatar that supports full-body control from Monocular videos. To achieve this goal RAM-Avatar leverages two statistical templates responsible for modeling the facial expression and hand gesture variations while a sparsely computed dual attention module is introduced upon another body template to facilitate high-fidelity texture rendering for the torsos and limbs. Building on this foundation we deploy a lightweight yet powerful StyleUnet along with a temporal-aware discriminator to achieve real-time realistic rendering. To enable robust animation for out-of-distribution poses we propose a Motion Distribution Align module to compensate for the discrepancies between the training and testing motion distribution. Results and extensive experiments conducted in various experimental settings demonstrate the superiority of our proposed method and a real-time live system is proposed to further push research into applications. The training and testing code will be released for research purposes.
-
Stereo matching methods based on iterative optimization like RAFT-Stereo and IGEV-Stereo have evolved into a cornerstone in the field of stereo matching. However these methods struggle to simultaneously capture high-frequency information in edges and low-frequency information in smooth regions due to the fixed receptive field. As a result they tend to lose details blur edges and produce false matches in textureless areas. In this paper we propose Selective Recurrent Unit (SRU) a novel iterative update operator for stereo matching. The SRU module can adaptively fuse hidden disparity information at multiple frequencies for edge and smooth regions. To perform adaptive fusion we introduce a new Contextual Spatial Attention (CSA) module to generate attention maps as fusion weights. The SRU empowers the network to aggregate hidden disparity information across multiple frequencies mitigating the risk of vital hidden disparity information loss during iterative processes. To verify SRU's universality we apply it to representative iterative stereo matching methods collectively referred to as Selective-Stereo. Our Selective-Stereo ranks first on KITTI 2012 KITTI 2015 ETH3D and Middlebury leaderboards among all published methods. Code is available at https://github.com/Windsrain/Selective-Stereo.
-
Personalized Federated Learning (pFL) has emerged as a promising solution to tackle data heterogeneity across clients in FL. However existing pFL methods either (1) introduce high computation and communication costs or (2) overfit to local data which can be limited in scope and vulnerable to evolved test samples with natural distribution shifts. In this paper we propose PerAda a parameter-efficient pFL framework that reduces communication and computational costs and exhibits superior generalization performance especially under test-time distribution shifts. PerAda reduces the costs by leveraging the power of pretrained models and only updates and communicates a small number of additional parameters from adapters. PerAda achieves high generalization by regularizing each client's personalized adapter with a global adapter while the global adapter uses knowledge distillation to aggregate generalized information from all clients. Theoretically we provide generalization bounds of PerAda and we prove its convergence to stationary points under non-convex settings. Empirically PerAda demonstrates higher personalized performance (+4.85% on CheXpert) and enables better out-of-distribution generalization (+5.23% on CIFAR-10-C) on different datasets across natural and medical domains compared with baselines while only updating 12.6% of parameters per model. Our code is available at https://github.com/NVlabs/PerAda.
-
We consider a critical issue of false negatives in Vision-Language Pre-training (VLP), a challenge that arises from the inherent many-to-many correspondence of image-text pairs in large-scale web-crawled datasets. The presence of false negatives can impede achieving optimal performance and even lead to a significant performance drop. To address this challenge, we propose MAFA (MAnaging FAlse negatives), which consists of two pivotal components building upon the recently developed GRouped mIni-baTch sampling (GRIT) strategy: 1) an efficient connection mining process that identifies and converts false negatives into positives, and 2) label smoothing for the image-text contrastive (ITC) loss. Our comprehensive experiments verify the effectiveness of MAFA across multiple downstream tasks, emphasizing the crucial role of addressing false negatives in VLP, potentially even surpassing the importance of addressing false positives. In addition, the compatibility of MAFA with the recent BLIP-family model is also demonstrated. Code is available at https://github.com/jaeseokbyun/MAFA.
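To make the second component concrete, the sketch below applies label smoothing to a contrastive loss over a batch similarity matrix. It is an illustration under assumptions, not the MAFA code: only the image-to-text direction is shown, the smoothing mass is spread evenly over the off-diagonal entries, and the function name `smoothed_itc_loss` and the `smoothing` parameter are hypothetical. `sim[i][j]` is assumed to already be the similarity of image `i` and text `j` divided by the temperature, and the batch must contain at least two pairs.

```python
import math


def log_softmax(row):
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def smoothed_itc_loss(sim, smoothing=0.1):
    """Image-to-text contrastive loss with smoothed targets.

    The matched pair (i, i) gets target 1 - smoothing; the remaining
    mass is shared by the other texts, so an undetected false negative
    is penalised less harshly than with one-hot targets.
    """
    n = len(sim)
    off_target = smoothing / (n - 1)
    loss = 0.0
    for i, row in enumerate(sim):
        log_probs = log_softmax(row)
        for j in range(n):
            target = 1.0 - smoothing if i == j else off_target
            loss -= target * log_probs[j]
    return loss / n


sim = [[2.0, 0.0], [0.0, 2.0]]
hard = smoothed_itc_loss(sim, smoothing=0.0)  # reduces to plain cross-entropy
soft = smoothed_itc_loss(sim, smoothing=0.1)
```

With `smoothing=0.0` the function reduces to the usual one-hot contrastive cross-entropy, which makes it easy to compare the two settings on the same batch.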
-
Diffusion models have made significant strides in image generation mastering tasks such as unconditional image synthesis text-image translation and image-to-image conversions. However their capability falls short in the realm of video prediction mainly because they treat videos as a collection of independent images relying on external constraints such as temporal attention mechanisms to enforce temporal coherence. In our paper we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. Through extensive experimentation we establish state-of-the-art performance in video prediction validated on benchmark datasets including KTH BAIR Human3.6M and UCF101.
-
In this paper we propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on input human images. Unlike prior arts constrained by specific input types our method allows flexible specification of style (text or image) and texture (full garment cropped sections or texture patches) conditions. To address the entanglement challenge when using full garment images as conditions we develop a two-stage pipeline with explicit disentanglement of style and texture. In the first stage we generate a human parsing map reflecting the desired style conditioned on the input. In the second stage we composite textures onto the parsing map areas based on the texture input. To represent complex and non-stationary textures that have never been achieved in previous fashion editing works we first propose extracting hierarchical and balanced CLIP features and applying position encoding in VTON. Experiments demonstrate superior synthesis quality and personalization enabled by our method. The flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design.
-
Continual learning requires the model to learn multiple tasks sequentially. In continual learning the model should possess the ability to maintain its performance on old tasks (stability) and the ability to adapt to new tasks continuously (plasticity). Recently parameter-efficient fine-tuning (PEFT) which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks has gained increasing popularity in continual learning. Although existing continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT most of them do not consider how to eliminate the interference of the new task on the old tasks which inhibits the model from making a good trade-off between stability and plasticity. In this work we propose a new PEFT method called interference-free low-rank adaptation (InfLoRA) for continual learning. InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace. Furthermore InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks making a good trade-off between stability and plasticity. Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets.
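The subspace argument above can be illustrated with a small numerical sketch. It assumes the common low-rank form W + AB with B held fixed and only A trained; the matrices, the helper names `matvec` and `adapted_forward`, and the choice of B are made up for the example, and how InfLoRA actually designs the subspace is not reproduced here.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def adapted_forward(w, a, b, x):
    """Compute (W + A B) x as W x + A (B x)."""
    base = matvec(w, x)
    low_rank = matvec(a, matvec(b, x))
    return [p + q for p, q in zip(base, low_rank)]


# Frozen pre-trained weight (2 x 3).
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
# B (1 x 3) is fixed and spans the subspace in which W may change.
b = [[0.0, 0.0, 1.0]]
# A (2 x 1) holds the trainable parameters for the new task.
a = [[0.5],
     [-1.0]]

x_old = [1.0, 2.0, 0.0]  # old-task input, orthogonal to the row of B
x_new = [0.0, 0.0, 1.0]  # new-task input, inside the subspace

old_out = adapted_forward(w, a, b, x_old)  # same as matvec(w, x_old)
new_out = adapted_forward(w, a, b, x_new)  # differs from matvec(w, x_new)
```

Whatever values A takes, the update AB only acts on the component of the input that lies in the row space of B. If old-task inputs have no component there, their outputs are unchanged, which is the sense in which the new task does not interfere.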
-
3D pose transfer, which aims to transfer a desired pose to a target mesh, is one of the most challenging 3D generation tasks. Previous attempts rely on well-defined parametric human models or skeletal joints as driving pose sources. However, obtaining those clean pose sources requires cumbersome pre-processing pipelines, which hinders real-time applications. This work is driven by the intuition that the robustness of the model can be enhanced by introducing adversarial samples into training, leading to a model that is more robust to noisy inputs and that can even be extended to directly handle real-world data such as raw point clouds and scans without intermediate processing. Furthermore, we propose a novel 3D pose Masked Autoencoder (3D-PoseMAE), a customized MAE that effectively learns 3D extrinsic presentations (i.e., pose). 3D-PoseMAE facilitates learning from the aspect of extrinsic attributes by simultaneously generating adversarial samples that perturb the model and learning the arbitrary raw noisy poses via a multi-scale masking strategy. Both qualitative and quantitative studies show that the meshes transferred by our network are of much better quality. In addition, we demonstrate the strong generalizability of our method on various poses, different domains, and even raw scans. Experimental results also yield the insight that the intermediate adversarial samples generated during training can successfully attack existing pose transfer models.
-
We present a new egocentric procedural error dataset containing videos with various types of errors as well as normal videos and propose a new framework for procedural error detection using error-free training videos only. Our framework consists of an action segmentation model and a contrastive step prototype learning module to segment actions and learn useful features for error detection. Based on the observation that interactions between hands and objects often inform action and error understanding we propose to combine holistic frame features with relations features which we learn by building a graph using active object detection followed by a Graph Convolutional Network. To handle errors unseen during training we use our contrastive step prototype learning to learn multiple prototypes for each step capturing variations of error-free step executions. At inference time we use feature-prototype similarities for error detection. By experiments on three datasets we show that our proposed framework outperforms state-of-the-art video anomaly detection methods for error detection and provides smooth action and error predictions.
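The inference-time rule above, scoring a feature by its similarity to the prototypes of the current step, can be sketched as follows. This is a simplified illustration: the function names, the use of cosine similarity, and the fixed `threshold` are assumptions for the example rather than details taken from the paper.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm


def is_error(feature, step_prototypes, threshold=0.5):
    """Flag a segment as erroneous when it matches none of the step's prototypes.

    Several prototypes per step capture different error-free ways of
    executing it, so only the best match matters.
    """
    best = max(cosine(feature, p) for p in step_prototypes)
    return best < threshold


# Two hypothetical prototypes for one step, learned from error-free videos.
prototypes = [[1.0, 0.0], [0.8, 0.6]]
normal_segment = is_error([0.9, 0.1], prototypes)
odd_segment = is_error([-1.0, 0.0], prototypes)
```

Because the prototypes are learned from error-free executions only, any kind of deviation can lower the best similarity, including error types never seen in training.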
-
Semantic segmentation has innately relied on extensive pixel-level annotated data leading to the emergence of unsupervised methodologies. Among them leveraging self-supervised Vision Transformers for unsupervised semantic segmentation (USS) has been making steady progress with expressive deep features. Yet for semantically segmenting images with complex objects a predominant challenge remains: the lack of explicit object-level semantic encoding in patch-level features. This technical limitation often leads to inadequate segmentation of complex objects with diverse structures. To address this gap we present a novel approach EAGLE which emphasizes object-centric representation learning for unsupervised semantic segmentation. Specifically we introduce EiCue a spectral technique providing semantic and structural cues through an eigenbasis derived from the semantic similarity matrix of deep image features and color affinity from an image. Further by incorporating our object-centric contrastive loss with EiCue we guide our model to learn object-level representations with intra- and inter-image object-feature consistency thereby enhancing semantic accuracy. Extensive experiments on COCO-Stuff Cityscapes and Potsdam-3 datasets demonstrate the state-of-the-art USS results of EAGLE with accurate and consistent semantic segmentation across complex scenes.
-
Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into the video domain there have been fewer works regarding text-guided video inpainting. Given a video a masked region at its initial frame and an editing prompt it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: (i) temporal consistency of the edited video (ii) supporting different inpainting types at different structural fidelity levels and (iii) dealing with variable video length. To address these challenges we introduce Any-Length Video Inpainting with Diffusion Model dubbed as AVID. At its core our model is equipped with effective motion modules and adjustable structure guidance for fixed-length video inpainting. Building on top of that we propose a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration ranges with high quality.
-
Layout-aware text-to-image generation is a task to generate multi-object images that reflect layout conditions in addition to text conditions. The current layout-aware text-to-image diffusion models still have several issues including mismatches between the text and layout conditions and quality degradation of generated images. This paper proposes a novel layout-aware text-to-image diffusion model called NoiseCollage to tackle these issues. During the denoising process NoiseCollage independently estimates noises for individual objects and then crops and merges them into a single noise. This operation helps avoid condition mismatches; in other words it can put the right objects in the right places. Qualitative and quantitative evaluations show that NoiseCollage outperforms several state-of-the-art models. These successful results indicate that the crop-and-merge operation of noises is a reasonable strategy to control image generation. We also show that NoiseCollage can be integrated with ControlNet to use edges sketches and pose skeletons as additional conditions. Experimental results show that this integration boosts the layout accuracy of ControlNet. The code is available at https://github.com/univ-esuty/noisecollage.
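The crop-and-merge operation described above can be sketched on plain 2D grids. This is a toy illustration, not the released implementation: noise estimates are nested lists rather than latent tensors, overlapping regions are averaged (an assumption made for this example), and the function name `crop_and_merge` and the `background_noise` argument are hypothetical.

```python
def crop_and_merge(object_noises, masks, background_noise):
    """Merge per-object noise estimates into one noise map.

    object_noises[k] is the noise estimated under object k's condition,
    masks[k] marks object k's layout region with 1, and pixels covered
    by no object keep the background estimate.
    """
    height = len(background_noise)
    width = len(background_noise[0])
    merged = [row[:] for row in background_noise]
    for y in range(height):
        for x in range(width):
            covering = [
                noise[y][x]
                for noise, mask in zip(object_noises, masks)
                if mask[y][x]
            ]
            if covering:
                merged[y][x] = sum(covering) / len(covering)
    return merged


noise_a = [[1.0, 1.0, 1.0]]
noise_b = [[3.0, 3.0, 3.0]]
mask_a = [[1, 1, 0]]
mask_b = [[0, 1, 0]]
background = [[0.0, 0.0, 9.0]]
merged = crop_and_merge([noise_a, noise_b], [mask_a, mask_b], background)
```

Each object's noise comes only from the estimate made under that object's own condition, which is why the merge can keep one object's description from leaking into another object's region.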
-
We present a highly scalable self-training framework for incrementally adapting vision-based end-to-end autonomous driving policies in a semi-supervised manner i.e. over a continual stream of incoming video data. To facilitate large-scale model training (e.g. open web or unlabeled data) we do not assume access to ground-truth labels and instead estimate pseudo-label policy targets for each video. Our framework comprises three key components: knowledge distillation a sample purification module and an exploration and knowledge retention mechanism. First given sequential image frames we pseudo-label the data and estimate uncertainty using an ensemble of inverse dynamics models. The uncertainty is used to select the most informative samples to add to an experience replay buffer. We specifically select high-uncertainty pseudo-labels to facilitate the exploration and learning of new and diverse driving skills. However in contrast to prior work in continual learning that assumes ground-truth labeled samples the uncertain pseudo-labels can introduce significant noise. Thus we also pair the exploration with a label refinement module which makes use of consistency constraints to re-label the noisy exploratory samples and effectively learn from diverse data. Trained as a complete never-ending learning system we demonstrate state-of-the-art performance on training from domain-changing data as well as millions of images from the open web.
-
Due to the high potential for abuse of GenAI systems, the task of detecting synthetic images has recently become of great interest to the research community. Unfortunately, existing image-space detectors quickly become obsolete as new high-fidelity text-to-image models are developed at blinding speed. In this work, we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. We show that these inversion features enable our detector to generalize well to unseen generators of high visual fidelity (e.g., DALL·E 3), even when the detector is trained only on lower-fidelity fake images generated via Stable Diffusion. This detector achieves new state-of-the-art results across multiple training and evaluation setups. Moreover, we introduce a new challenging evaluation protocol that uses reverse image search to mitigate stylistic and thematic biases in the detector evaluation. We show that the resulting evaluation scores align well with detectors' in-the-wild performance, and we release these datasets as public benchmarks for future research.
-
Neural implicit scene representations have recently shown encouraging results in dense visual SLAM. However, existing methods produce low-quality scene reconstruction and low-accuracy localization when scaling up to large indoor scenes and long sequences. These limitations are mainly due to their single global radiance field with finite capacity, which does not adapt to large scenarios. Their end-to-end pose networks are also not robust enough as cumulative errors grow in large scenes. To this end, we introduce PLGSLAM, a neural visual SLAM system capable of high-fidelity surface reconstruction and robust camera tracking in real time. To handle large-scale indoor scenes, PLGSLAM proposes a progressive scene representation method that dynamically allocates new local scene representations trained with frames within a local sliding window. This allows us to scale up to larger indoor scenes and improves robustness (even under pose drifts). In the local scene representation, PLGSLAM utilizes tri-planes for local high-frequency features and multi-layer perceptron (MLP) networks for the low-frequency features, achieving smoothness and scene completion in unobserved areas. Moreover, we propose a local-to-global bundle adjustment method with a global keyframe database to address the increased pose drifts on long sequences. Experimental results demonstrate that PLGSLAM achieves state-of-the-art scene reconstruction results and tracking performance across various datasets and scenarios (in both small and large-scale indoor environments).
-
Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have achieved strong performance, but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a novel decoder-focused method for multi-task dense prediction called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing. Furthermore, to control the parameters and computational cost brought by the increase in the number of experts, we take inspiration from LoRA and propose to leverage the low-rank format of a vanilla convolution in the expert network. Since the low-rank experts have fewer parameters and can be dynamically parameterized into the generic convolution, the parameters and computational cost do not change much as the number of experts increases. Benefiting from this design, we increase the number of experts and their receptive field to enlarge the representation capacity, facilitating the learning of multiple dense tasks in a unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 benchmarks show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics. Our code is available at https://github.com/YuqiYang213/MLoRE.
-
The ability to associate touch with other modalities has huge implications for humans and computational systems. However multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch a unified tactile model for vision-based touch sensors connected to multiple modalities including vision language and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. We further propose learnable sensor-specific tokens allowing the model to learn from a set of heterogeneous tactile sensors all at the same time. UniTouch is capable of conducting various touch sensing tasks in the zero-shot setting from robot grasping prediction to touch image question answering. To the best of our knowledge UniTouch is the first to demonstrate such capabilities.
-
In various domains such as surveillance and smart retail pedestrian retrieval centering on person re-identification (Re-ID) plays a pivotal role. Existing Re-ID methodologies often overlook subtle internal attribute variations which are crucial for accurately identifying individuals with changing appearances. In response our paper introduces the Attribute-Guided Pedestrian Retrieval (AGPR) task focusing on integrating specified attributes with query images to refine retrieval results. Although there has been progress in attribute-driven image retrieval there remains a notable gap in effectively blending robust Re-ID models with intra-class attribute variations. To bridge this gap we present the Attribute-Guided Transformer-based Pedestrian Retrieval (ATPR) framework. ATPR adeptly merges global ID recognition with local attribute learning ensuring a cohesive linkage between the two. Furthermore to effectively handle the complexity of attribute interconnectivity ATPR organizes attributes into distinct groups and applies both inter-group correlation and intra-group decorrelation regularizations. Our extensive experiments on a newly established benchmark using the RAP dataset demonstrate the effectiveness of ATPR within the AGPR paradigm.
-
The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video relying on consistent embedding representations to compute similarity. However the text content in existing datasets is generally short and concise making it hard to fully describe the redundant semantics of a video. Correspondingly a single text embedding may be less expressive to capture the video embedding and empower the retrieval. In this study we propose a new stochastic text modeling method T-MASS i.e. text is modeled as a stochastic embedding to enrich text embedding with a flexible and resilient semantic range yielding a text mass. To be specific we introduce a similarity-aware radius module to adapt the scale of the text mass upon the given text-video pairs. Plus we design and develop a support text regularization to further control the text mass during the training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over baseline (3% to 6.3% by R@1). Also T-MASS achieves state-of-the-art performance on five benchmark datasets including MSRVTT LSMDC DiDeMo VATEX and Charades.
-
Recently non-transferable learning (NTL) was proposed to restrict models' generalization toward the target domain(s) which serves as a state-of-the-art solution for intellectual property (IP) protection. However the robustness of the established "transferability barrier" for degrading the target domain performance has not been well studied. In this paper we first show that the generalization performance of NTL models is widely impaired on third-party domains (i.e. the unseen domain in the NTL training stage). We explore the impairment patterns and find that: due to the dominant generalization of the non-transferable task NTL models tend to make target-domain-consistent predictions on third-party domains even when there is only a slight distribution shift from the third-party domain to the source domain. Motivated by these findings we uncover the potential risks of NTL by proposing a simple but effective method (dubbed TransNTL) to recover the target domain performance with few source domain data. Specifically by performing a group of different perturbations on the few source domain data we obtain diverse third-party domains that evoke the same impairment patterns as the unavailable target domain. Then we fine-tune the NTL model under an impairment-repair self-distillation framework where the source-domain predictions are used to teach the model itself how to predict on third-party domains thus repairing the impaired generalization. Empirically experiments on standard NTL benchmarks show that the proposed TransNTL reaches up to 72% target-domain improvements by using only 10% source domain data. Finally we also explore a feasible defense method and empirically demonstrate its effectiveness.
-
Computer animation's quest to bridge content and style has historically been a challenging venture with previous efforts often leaning toward one at the expense of the other. This paper tackles the inherent challenge of content-style duality ensuring a harmonious fusion where the core narrative of the content is both preserved and elevated through stylistic enhancements. We propose a novel Multi-condition Motion Latent Diffusion Model (MCM-LDM) for Arbitrary Motion Style Transfer (AMST). Our MCM-LDM significantly emphasizes preserving trajectories recognizing their fundamental role in defining the essence and fluidity of motion content. Our MCM-LDM's cornerstone lies in its ability first to disentangle and then intricately weave together motion's tripartite components: motion trajectory motion content and motion style. The critical insight of MCM-LDM is to embed multiple conditions with distinct priorities. The content channel serves as the primary flow guiding the overall structure and movement while the trajectory and style channels act as auxiliary components and synchronize with the primary one dynamically. This mechanism ensures that multi-conditions can seamlessly integrate into the main flow enhancing the overall animation without overshadowing the core content. Empirical evaluations underscore the model's proficiency in achieving fluid and authentic motion style transfers setting a new benchmark in the realm of computer animation. The source code and model are available at https://github.com/XingliangJin/MCM-LDM.git.
-
Recovering the 3D scene geometry from a single view is a fundamental yet ill-posed problem in computer vision. While classical depth estimation methods infer only a 2.5D scene representation limited to the image plane recent approaches based on radiance fields reconstruct a full 3D representation. However these methods still struggle with occluded regions since inferring geometry without visual observation requires (i) semantic knowledge of the surroundings and (ii) reasoning about spatial context. We propose KYN a novel method for single-view scene reconstruction that reasons about semantic and spatial context to predict each point's density. We introduce a vision-language modulation module to enrich point features with fine-grained semantic information. We aggregate point representations across the scene through a language-guided spatial attention mechanism to yield per-point density predictions aware of the 3D semantic context. We show that KYN improves 3D shape recovery compared to predicting density for each 3D point in isolation. We achieve state-of-the-art results in scene and object reconstruction on KITTI-360 and show improved zero-shot generalization compared to prior work. Project page: https://ruili3.github.io/kyn
-
Reliable hand mesh reconstruction (HMR) from commonly-used color and depth sensors is challenging especially under scenarios with varied illuminations and fast motions. Event camera is a highly promising alternative for its high dynamic range and dense temporal resolution properties but it lacks key texture appearance for hand mesh reconstruction. In this paper we propose EvRGBHand -- the first approach for 3D hand mesh reconstruction with an event camera and an RGB camera compensating for each other. By fusing two modalities of data across time space and information dimensions EvRGBHand can tackle overexposure and motion blur issues in RGB-based HMR and foreground scarcity and background overflow issues in event-based HMR. We further propose EvRGBDegrader which allows our model to generalize effectively in challenging scenes even when trained solely on standard scenes thus reducing data acquisition costs. Experiments on real-world data demonstrate that EvRGBHand can effectively solve the challenging issues when using either type of camera alone via retaining the merits of both and shows the potential of generalization to outdoor scenes and another type of event camera. Our code models and dataset will be made public after acceptance.
-
Image enhancement algorithms have made remarkable advancements in recent years but directly applying them to Ultra-high-definition (UHD) images presents intractable computational overheads. Therefore previous straightforward solutions employ resampling techniques to reduce the resolution by adopting a "Downsampling-Enhancement-Upsampling" processing paradigm. However this paradigm disentangles the resampling operators and inner enhancement algorithms which results in the loss of information that is favored by the model further leading to sub-optimal outcomes. In this paper we propose a novel method of Learning Model-Aware Resampling (LMAR) which learns to customize resampling by extracting model-aware information from the UHD input image under the guidance of model knowledge. Specifically our method consists of two core designs namely compensatory kernel estimation and steganographic resampling. At the first stage we dynamically predict compensatory kernels tailored to the specific input and resampling scales. At the second stage the image-wise compensatory information is derived with the compensatory kernels and embedded into the rescaled input images. This promotes the representation of the newly derived downscaled inputs to be more consistent with the full-resolution UHD inputs as perceived by the model. Our LMAR enables model-aware and model-favored resampling while maintaining compatibility with existing resampling operators. Extensive experiments on multiple UHD image enhancement datasets and different backbones have shown consistent performance gains after correlating resizer and enhancer e.g. up to 1.2dB PSNR gain for x1.8 resampling scale on UHD-LOL4K. The code is available at https://github.com/YPatrickW/LMAR.
-
Although Vision Transformer (ViT) has achieved significant success in computer vision it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems which introduce additional pre-training costs. Therefore we present a plain pre-training-free and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction named ViT-CoMer which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features which is beneficial for handling dense prediction tasks. (3) We evaluate the performance of ViT-CoMer across various dense prediction tasks different frameworks and multiple advanced pre-training. Notably our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data and 62.1% mIoU on ADE20K val both of which are comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks to facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer.
-
Diffusion-based generative models have exhibited remarkable capability in the production of high-fidelity visual content such as images and videos. However their performance is significantly contingent upon the quality of textual inputs commonly referred to as "prompts". The process of traditional prompt engineering while effective necessitates empirical expertise and poses challenges for inexperienced users. In this paper we introduce PromptCoT an innovative enhancer that autonomously refines prompts for users. PromptCoT is designed based on the observation that prompts which resemble the textual information of high-quality images in the training set often lead to superior generation performance. Therefore we fine-tune the pre-trained Large Language Models (LLM) using a curated text dataset that solely comprises descriptions of high-quality visual content. By doing so the LLM can capture the distribution of high-quality training texts enabling it to generate aligned continuations and revisions to boost the original texts. Nonetheless one drawback of pre-trained LLMs is their tendency to generate extraneous or irrelevant information. We employ the Chain-of-Thought (CoT) mechanism to improve the alignment between the original text prompts and their refined versions. CoT can extract and amalgamate crucial information from the aligned continuation and revision enabling reasonable inferences based on the contextual cues to produce a more comprehensive and nuanced final output. Considering computational efficiency instead of allocating a dedicated LLM for prompt enhancement to each individual model or dataset we integrate adapters that facilitate dataset-specific adaptation leveraging a shared pre-trained LLM as the foundation for this process. With independent fine-tuning of these adapters we can adapt PromptCoT to new datasets while minimally increasing training costs and memory usage. We evaluate the effectiveness of PromptCoT by assessing its performance on widely-used latent diffusion models for image and video generation. The results demonstrate significant improvements in key performance metrics.
-
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However MLLMs still face a fundamental limitation of hallucinations where they tend to generate erroneous or fabricated information. In this paper we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyzed the representation distribution of textual and visual tokens in MLLM revealing two important findings: 1) there is a significant gap between textual and visual representations indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled making it challenging to distinguish them. These two observations inspire us with a simple yet effective method to mitigate hallucinations. Specifically we introduce contrastive learning into MLLMs and use text with hallucination as hard negative examples naturally bringing representations of non-hallucinative text and visual samples closer while pushing away representations of non-hallucinative and hallucinative text. We evaluate our method quantitatively and qualitatively showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark our method obtains a 34.66%/29.5% improvement over the baseline MiniGPT-4/LLaVA. Our code is available on https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
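The idea of using hallucinated text as hard negatives can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `contrastive_loss`, the toy 2-D embeddings, and the temperature value are all hypothetical, and the real method operates on MLLM representations rather than hand-written vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(image, positive_text, negative_texts, temperature=0.1):
    """InfoNCE-style loss for one image anchor (illustrative assumption).

    positive_text: embedding of a faithful (non-hallucinated) caption.
    negative_texts: embeddings of hallucinated captions used as hard negatives.
    """
    anchor = normalize(image)
    pos = dot(anchor, normalize(positive_text)) / temperature
    negs = [dot(anchor, normalize(t)) / temperature for t in negative_texts]
    logits = [pos] + negs
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Cross-entropy with the positive pair as the correct "class".
    return log_sum - pos

# Toy embeddings: the faithful caption points the same way as the image.
image = [1.0, 0.0]
faithful = [0.9, 0.1]
hallucinated = [0.1, 0.9]
loss_aligned = contrastive_loss(image, faithful, [hallucinated])
loss_misaligned = contrastive_loss(image, hallucinated, [faithful])
```

Minimizing such a loss raises the image-to-faithful-text similarity relative to the image-to-hallucinated-text similarity, which is the pull-closer and push-away behaviour the abstract describes.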
-
Although effective deepfake detection models have been developed in recent years recent studies have revealed that these models can result in unfair performance disparities among demographic groups such as race and gender. This can lead to particular groups facing unfair targeting or exclusion from detection potentially allowing misclassified deepfakes to manipulate public opinion and undermine trust in the model. The existing method for addressing this problem is providing a fair loss function. It shows good fairness performance for intra-domain evaluation but does not maintain fairness for cross-domain testing. This highlights the significance of fairness generalization in the fight against deepfakes. In this work we propose the first method to address the fairness generalization problem in deepfake detection by simultaneously considering features loss and optimization aspects. Our method employs disentanglement learning to extract demographic and domain-agnostic forgery features fusing them to encourage fair learning across a flattened loss landscape. Extensive experiments on prominent deepfake datasets demonstrate our method's effectiveness surpassing state-of-the-art approaches in preserving fairness during cross-domain deepfake detection. The code is available at https://github.com/Purdue-M2/Fairness-Generalization.
-
With the advancement of generative models the assessment of generated images becomes increasingly more important. Previous methods measure distances between features of reference and generated images from trained vision models. In this paper we conduct an extensive investigation into the relationship between the representation space and input space around generated images. We first propose two measures related to the presence of unnatural elements within images: complexity which indicates how non-linear the representation space is and vulnerability which is related to how easily the extracted feature changes by adversarial input changes. Based on these we introduce a new metric for evaluating image-generative models called anomaly score (AS). Moreover we propose AS-i (anomaly score for individual images) that can effectively evaluate generated images individually. Experimental results demonstrate the validity of the proposed approach.
-
X-ray known for its ability to reveal internal structures of objects is expected to provide richer information for 3D reconstruction than visible light. Yet existing NeRF algorithms overlook this nature of X-ray leading to their limitations in capturing structural contents of imaged objects. In this paper we propose a framework Structure-Aware X-ray Neural Radiodensity Fields (SAX-NeRF) for sparse-view X-ray 3D reconstruction. Firstly we design a Line Segment-based Transformer (Lineformer) as the backbone of SAX-NeRF. Lineformer captures internal structures of objects in 3D space by modeling the dependencies within each line segment of an X-ray. Secondly we present a Masked Local-Global (MLG) ray sampling strategy to extract contextual and geometric information in 2D projection. Plus we collect a larger-scale dataset X3D covering wider X-ray applications. Experiments on X3D show that SAX-NeRF surpasses previous NeRF-based methods by 12.56 and 2.49 dB on novel view synthesis and CT reconstruction. https://github.com/caiyuanhao1998/SAX-NeRF
-
In this work we propose a novel discriminative framework for dexterous grasp generation named Dexterous Grasp TRansformer (DGTR) capable of predicting a diverse set of feasible grasp poses by processing the object point cloud with only one forward pass. We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model for it. However we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping and results in restricted performance. To address these issues we propose progressive strategies for both the training and testing phases. First the dynamic-static matching training (DSMT) strategy is presented to enhance the optimization stability during the training phase. Second we introduce the adversarial-balanced test-time adaptation (AB-TTA) with a pair of adversarial losses to improve grasping quality during the testing phase. Experimental results on the DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous grasp poses with both high quality and diversity. Notably while keeping high quality the diversity of grasp poses predicted by DGTR significantly outperforms previous works in multiple metrics without any data pre-processing. Codes are available at https://github.com/iSEE-Laboratory/DGTR.
-
Recently an audio-visual segmentation (AVS) task has been introduced aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene posing significant challenges. In this paper we propose an innovative audio-visual transformer framework termed COMBO an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time our framework explores three types of bilateral entanglements within AVS: pixel entanglement modality entanglement and temporal entanglement. Regarding pixel entanglement we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement we design a Bilateral-Fusion Module (BFM) enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement we introduce an innovative adaptive inter-frame consistency loss according to the inherent rules of temporal. Comprehensive experiments and ablation studies on AVSBench-object (84.7 mIoU on S4 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Project page is available at https://yannqi.github.io/AVS-COMBO.
-
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities with the majority focusing on the third-person perspective and only a few addressing specific tasks from the first-person perspective. However the capability of VLMs to "think" from a first-person perspective a crucial attribute for advancing autonomous agents and robotics remains largely unexplored. To bridge this research gap we introduce EgoThink a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs we evaluate twenty-one popular VLMs on EgoThink. Moreover given the open-ended format of the answers we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
-
Recent years have seen immense progress in 3D computer vision and computer graphics with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However alongside immersive visual experiences immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene a setup that is easily achievable by ordinary users. To this end we introduce DiffRIR a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method we collect a dataset of RIR recordings and music in four diverse real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene.
-
Single image depth estimation is a foundational task in computer vision and generative modeling. However prevailing depth estimation models grapple with accommodating the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise but they often face limitations ranging from error propagation to the loss of high-frequency details. We present PatchFusion a novel tile-based framework with three key components to improve the current state of the art: (1) A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer inconsistent tiled predictions via high-level feature guidance (2) A Global-to-Local (G2L) module that adds vital context to the fusion network discarding the need for patch selection heuristics and (3) A Consistency-Aware Training (CAT) and Inference (CAI) approach emphasizing patch overlap consistency and thereby eradicating the necessity for post-processing. Experiments on UnrealStereo4K MVS-Synth and Middlebury 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably our framework built on top of SOTA ZoeDepth brings improvements for a total of 17.3% and 29.4% in terms of the root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth respectively.
-
Recently we have witnessed the explosive growth of various volumetric representations in modeling animatable head avatars. However due to the diversity of frameworks there is no practical method to support high-level applications like 3D head avatar editing across different representations. In this paper we propose a generic avatar editing approach that can be universally applied to various 3DMM driving volumetric head avatars. To achieve this goal we design a novel expression-aware modification generative model which enables lifting 2D editing from a single image to a consistent 3D modification field. To ensure the effectiveness of the generative modification process we develop several techniques including an expression-dependent modification distillation scheme to draw knowledge from the large-scale head avatar model and 2D facial texture editing tools implicit latent space guidance to enhance model convergence and a segmentation-based loss reweight strategy for fine-grained texture inversion. Extensive experiments demonstrate that our method delivers high-quality and consistent results across multiple expressions and viewpoints. Project page: https://zju3dv.github.io/geneavatar/.
-
Test-time adaptation (TTA) is a technique to improve the performance of a pre-trained source model on a target distribution without using any labeled data. However existing self-trained TTA methods often face the challenges of unreliable pseudo-labels and unstable model optimization. In this paper we propose an Improved Self-Training (IST) approach which addresses these challenges by enhancing the pseudo-label quality and stabilizing the adaptation process. Specifically we use a simple augmentation strategy to generate multiple views of each test sample and construct a graph structure to correct the pseudo-labels based on the similarity of the latent features. Moreover we adopt a parameter moving average scheme to smooth the model updates and prevent catastrophic forgetting. Instead of using a model with fixed label space we explore the adaptability of the foundation model CLIP to various downstream tasks at test time. Extensive experiments on various benchmarks show that IST can achieve significant and consistent improvements over the existing TTA methods in classification detection and segmentation tasks.
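The parameter moving average mentioned above can be sketched as an exponential moving average (EMA) over model weights. This is an assumption for illustration only: the abstract does not say which averaging rule IST uses, and the function name `ema_update`, the flat-list parameter layout, and the decay value are hypothetical.

```python
def ema_update(avg_params, new_params, decay=0.99):
    """Blend freshly adapted parameters into a slowly moving average.

    A decay close to 1 keeps the averaged model near its previous state,
    which damps noisy test-time updates (assumed EMA form, not confirmed
    by the source).
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, new_params)]

# Toy example with two scalar "weights" and an aggressive decay of 0.5.
averaged = [1.0, 2.0]
adapted = [3.0, 0.0]
averaged = ema_update(averaged, adapted, decay=0.5)
```

In a test-time adaptation loop, the averaged copy would typically be the one used for prediction, while the raw copy receives the gradient steps.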
-
Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However we observe that when adopting CLIP to such a pixel-level understanding task unexpected bias occurs. Previous works don't explicitly model such bias which largely constrains the segmentation performance. In this paper we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation. Specifically we design a learnable "Reference" prompt to encode class-preference bias and project the positional embedding of vision transformer to represent space-preference bias. Via a simple element-wise subtraction we rectify the logits of CLIP classifier. Based on the rectified logits we generate a segmentation mask via a Gumbel-Softmax operation. Then a contrastive loss between masked visual feature and the text features of different classes is imposed to facilitate the effective bias modeling. To further improve the segmentation we distill the knowledge from the rectified CLIP to the advanced segmentation architecture via minimizing our designed mask-guided feature-guided and text-guided loss terms. Extensive experiments on standard benchmarks demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at https://github.com/dogehhh/ReCLIP.
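The two operations named above, element-wise logit rectification followed by a Gumbel-Softmax, can be sketched for a single pixel as follows. This is a rough illustration under assumptions: how the class-preference and space-preference biases are combined is not specified in the abstract, so `bias` here is simply a hypothetical per-class vector, and `rectify_logits` and `gumbel_softmax` are illustrative names rather than functions from the ReCLIP code.

```python
import math
import random

def rectify_logits(clip_logits, bias):
    """Subtract an estimated bias from the CLIP logits, element by element."""
    return [l - b for l, b in zip(clip_logits, bias)]

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample over classes.

    Gumbel noise g = -log(-log(u)), u ~ Uniform(0, 1), is added to the logits
    before a temperature-scaled softmax.
    """
    noisy = []
    for l in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append((l - math.log(-math.log(u))) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# One pixel, two classes: class 0 is favoured partly because of bias.
rectified = rectify_logits([2.0, 1.0], [0.5, 0.25])
mask_probs = gumbel_softmax(rectified, tau=0.5, rng=random.Random(0))
```

Lower values of `tau` push the sample closer to a hard one-hot assignment, which is why this relaxation is a common way to obtain a differentiable mask.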
-
Given a set of images our goal is to map each image to a point in a feature space such that not only point proximity indicates visual similarity but where it is located directly encodes how prototypical the image is according to the dataset. Our key insight is to perform unsupervised feature learning in hyperbolic instead of Euclidean space where the distance between points still reflects image similarity yet we gain additional capacity for representing prototypicality with the location of the point: The closer it is to the origin the more prototypical it is. The latter property is simply emergent from optimizing the metric learning objective: The image similar to many training instances is best placed at the center of corresponding points in Euclidean space but closer to the origin in hyperbolic space. We propose an unsupervised feature learning algorithm in Hyperbolic space with sphere pACKing. HACK first generates uniformly packed particles in the Poincaré ball of hyperbolic space and then assigns each image uniquely to a particle. With our feature mapper simply trained to spread out training instances in hyperbolic space we observe that images move closer to the origin with congealing - a warping process that aligns all the images and makes them appear more common and similar to each other validating our idea of unsupervised prototypicality discovery. We demonstrate that our data-driven prototypicality provides an easy and superior unsupervised instance selection to reduce sample complexity increase model generalization with atypical instances and robustness with typical ones.
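The role of the origin in the Poincaré ball can be made concrete with the standard distance formula for the unit ball with curvature -1, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The sketch below assumes this standard model; the curvature and any scaling used by HACK itself are not stated in the abstract, and the function name is illustrative.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm = lambda x: sum(c * c for c in x)
    diff = [a - b for a, b in zip(u, v)]
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * sq_norm(diff) / denom)

origin = [0.0, 0.0]
# The same Euclidean step of 0.1 is short near the origin and long near the boundary.
near_origin = poincare_distance(origin, [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

Because distances grow rapidly toward the boundary, a point that must stay close to many others is cheapest to place near the origin, which is consistent with the abstract's reading of the origin as the most prototypical location.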
-
Semantic image synthesis i.e. generating images from user-provided semantic label maps is an important conditional image generation task as it allows to control both the content as well as the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient as they only need a single feed-forward pass for generation but the image quality tends to suffer when modeling large and diverse datasets. In this work we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbones pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling and using cross-attention to inject noise into latent variables leading to more diverse generated images. Our model which we dub DP-SIMS achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K COCO-Stuff and Cityscapes surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
-
Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper (1) we develop EgoInstructor a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos (2) for training the cross-view retrieval module we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets (3) we train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (4) through extensive experiments our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
-
Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent trajectories from high-level instructions remains challenging, especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model to generate customized latent trajectories aligned with the skills. This allows generating diverse state trajectories that adhere to the learnable skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior following abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks such as Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser. More visualization results and information can be found at https://skilldiffuser.github.io/.
-
Generalized Zero-Shot Learning (GZSL) methods often assume that the unseen classes are similar to seen classes, and thus perform poorly when unseen classes are dissimilar to seen classes. Although some existing GZSL approaches can alleviate this issue by leveraging additional semantic information from test unseen classes, their generalization ability to dissimilar unseen classes is still unsatisfactory. This motivates us to study GZSL in the more practical setting where unseen classes can be either similar or dissimilar to seen classes. In this paper, we propose a simple yet effective GZSL framework by exploring diverse semantics from external class names (DSECN), which is simultaneously robust on the similar and dissimilar unseen classes. This is achieved by introducing diverse semantics from external class names and aligning the introduced semantics to visual space using the classification head of a pre-trained network. Furthermore, we show that the design idea of DSECN can easily be integrated into other advanced GZSL approaches, such as the generative-based ones, and enhance their robustness for dissimilar unseen classes. Extensive experiments in the practical setting, including both similar and dissimilar unseen classes, show that our method significantly outperforms the state-of-the-art approaches on all datasets and can be trained very efficiently.
-
Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly consist of a single object. Meanwhile, the local details of multiple objects may be susceptible to omission, because the existing supervision manner relies primarily on coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles under contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. In particular, a cross-modal graph is constructed to accurately align the object points and noun phrases decoupled from the 3D mesh and the textual description. Then we develop a Cross-Grained Contrast (CGC) supervision system, where a fine-grained loss between the words in the textual description and the randomly rendered images is constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
-
In this paper, we identify the normalized coordinate expression as a key factor behind the reliance on hand-crafted components in query-based detectors for temporal action detection (TAD). Despite significant advancements towards an end-to-end framework in object detection, query-based detectors have been limited in achieving full end-to-end modeling in TAD. To address this issue, we propose TE-TAD, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression. We reformulate coordinate expression utilizing actual timeline values, ensuring length-invariant representations across extremely diverse video durations. Furthermore, our proposed adaptive query selection dynamically adjusts the number of queries based on video length, providing a more suitable solution for varying video durations than a fixed query set. Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors. Our TE-TAD outperforms the previous query-based detectors and achieves competitive performance compared to state-of-the-art methods on popular benchmark datasets. Code is available at: https://github.com/Dotori-HJ/TE-TAD.
-
Neural Radiance Fields (NeRF), which utilize multi-view inputs to synthesize novel-view images, have emerged as a popular research topic in 3D vision. In this work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), which uniquely takes image semantics into the synthesis process so that both novel-view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual Rendering. The former observes multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information, the latter performs both image and semantic rendering with improved performance. Our experiments confirm that GSNeRF performs favorably against prior works on both novel-view image and semantic segmentation synthesis, and further verify the effectiveness of our sampling strategy for visual rendering.
-
Scale-ambiguity in 3D scene dimensions leads to magnitude-ambiguity of volumetric densities in neural radiance fields, i.e., the densities double when scene size is halved, and vice versa. We call this property alpha invariance. For NeRFs to better maintain alpha invariance, we recommend 1) parameterizing both distance and volume densities in log space, and 2) a discretization-agnostic initialization strategy to guarantee high ray transmittance. We revisit a few popular radiance field models and find that these systems use various heuristics to deal with issues arising from scene scaling. We test their behaviors and show our recipe to be more robust.
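The invariance described above can be checked numerically. The sketch below assumes the standard volume-rendering opacity of a ray segment, alpha = 1 - exp(-sigma * delta); the helper names `alpha` and `rescale` are illustrative and are not taken from the paper. It shows that halving the scene while doubling the density leaves alpha unchanged, and that in log space the rescaling is a pure shift that cancels in the sum.

```python
import math


def alpha(sigma: float, delta: float) -> float:
    """Opacity of one ray segment with density sigma and length delta."""
    return 1.0 - math.exp(-sigma * delta)


def rescale(sigma: float, delta: float, k: float) -> tuple[float, float]:
    """Scale the scene by k: lengths multiply by k, densities divide by k."""
    return sigma / k, delta * k


sigma, delta = 4.0, 0.05
sigma_half, delta_half = rescale(sigma, delta, 0.5)  # scene size halved

# Both opacities agree, because sigma * delta is unchanged by the rescaling.
print(alpha(sigma, delta), alpha(sigma_half, delta_half))

# In log space the rescaling shifts log(sigma) by -log(k) and log(delta)
# by +log(k), so their sum (the log of the optical depth) is unchanged.
log_depth = math.log(sigma) + math.log(delta)
log_depth_half = math.log(sigma_half) + math.log(delta_half)
```

This illustrates only why the product of density and distance is the scale-free quantity; the paper's initialization strategy is not reproduced here.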
-
We introduce TexTile, a novel differentiable metric to quantify the degree to which a texture image can be concatenated with itself without introducing repeating artifacts (i.e., the tileability). Existing methods for tileable texture synthesis focus on general texture quality but lack explicit analysis of the intrinsic repeatability properties of a texture. In contrast, our TexTile metric effectively evaluates the tileable properties of a texture, opening the door to more informed synthesis and analysis of tileable textures. Under the hood, TexTile is formulated as a binary classifier carefully built from a large dataset of textures of different styles, semantics, regularities, and human annotations. Key to our method is a set of architectural modifications to baseline pre-trained image classifiers to overcome their shortcomings at measuring tileability, along with a custom data augmentation and training regime aimed at increasing robustness and accuracy. We demonstrate that TexTile can be plugged into different state-of-the-art texture synthesis methods, including diffusion-based strategies, and generate tileable textures while keeping or even improving the overall texture quality. Furthermore, we show that TexTile can objectively evaluate any tileable texture synthesis method, whereas the current mix of existing metrics produces uncorrelated scores, which heavily hinders progress in the field.
-
Domain adaptation for object detection typically entails transferring knowledge from one visible domain to another visible domain. However, there are limited studies on adapting from the visible to the thermal domain, because the domain gap between the visible and thermal domains is much larger than expected and traditional domain adaptation cannot successfully facilitate learning in this situation. To overcome this challenge, we propose a Distinctive Dual-Domain Teacher (D3T) framework that employs distinct training paradigms for each domain. Specifically, we segregate the source and target training sets to build dual teachers and successively apply an exponential moving average of the student model to the individual teacher of each domain. The framework further incorporates a zigzag learning method between the dual teachers, facilitating a gradual transition from the visible to the thermal domain during training. We validate the superiority of our method through newly designed experimental protocols with well-known thermal datasets, i.e., FLIR and KAIST. Source code is available at https://github.com/EdwardDo69/D3T.
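As a rough illustration of the exponential moving average step mentioned above, the following sketch updates one of two teachers from a shared student. It is not the D3T code: the weights are plain dictionaries of floats, and the alternation rule in `active_domain` (switch domain every `period` steps) is a hypothetical stand-in for the zigzag schedule, whose details the abstract does not give.

```python
def ema_update(teacher: dict, student: dict, momentum: float = 0.9) -> None:
    """In-place EMA: teacher <- momentum * teacher + (1 - momentum) * student."""
    for name, value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * value


def active_domain(step: int, period: int = 2) -> str:
    """Hypothetical zigzag schedule: alternate domains every `period` steps."""
    return "visible" if (step // period) % 2 == 0 else "thermal"


# Toy weights: one scalar parameter per model.
student = {"w": 1.0}
teachers = {"visible": {"w": 0.0}, "thermal": {"w": 0.0}}

for step in range(4):
    # A real training loop would update the student here before the EMA step.
    ema_update(teachers[active_domain(step)], student)

print(teachers)
```

Keeping one teacher per domain means each teacher only averages student states seen while its own domain was active, which is the separation the abstract describes.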
-
Positive-Unlabeled (PU) learning aims to train a binary classifier using minimal positive data supplemented by a substantially larger pool of unlabeled data in the specific absence of explicitly annotated negatives. Despite its straightforward nature as a binary classification task the currently best-performing PU algorithms still largely lag behind the supervised counterpart. In this work we identify that the primary bottleneck lies in the difficulty of deriving discriminative representations under unreliable binary supervision with poor semantics which subsequently hinders the common label disambiguation procedures. To cope with this problem we propose a novel PU learning framework namely Latent Group-Aware Meta Disambiguation (LaGAM) which incorporates a hierarchical contrastive learning module to extract the underlying grouping semantics within PU data and produce compact representations. As a result LaGAM enables a more aggressive label disambiguation strategy where we enhance the robustness of training by iteratively distilling the true labels of unlabeled data directly through meta-learning. Extensive experiments show that LaGAM significantly outperforms the current state-of-the-art methods by an average of 6.8% accuracy on common benchmarks approaching the supervised baseline. We also provide comprehensive ablations as well as visualized analysis to verify the effectiveness of our LaGAM.
-
In this paper, we introduce a new perspective for improving image restoration by removing degradation in the textual representations of a given degraded image. Intuitively, restoration is much easier in the text modality than in the image modality. For example, it can be conducted simply by removing degradation-related words while keeping the content-aware words. Hence, we combine the advantages of images in describing detail with those of text in removing degradation to perform restoration. To realize this cross-modal assistance, we propose to map the degraded images into textual representations for removing the degradations, and then convert the restored textual representations into a guidance image for assisting image restoration. In particular, we embed an image-to-text mapper and a text restoration module into CLIP-equipped text-to-image models to generate the guidance. Then, we adopt a simple coarse-to-fine approach to dynamically inject multi-scale information from the guidance into image restoration networks. Extensive experiments are conducted on various image restoration tasks, including deblurring, dehazing, deraining, denoising, and all-in-one image restoration. The results show that our method outperforms state-of-the-art ones across all these tasks. The code and models are available at https://github.com/mrluin/TextualDegRemoval.
-
Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles: First, the text prompt needs to be carefully crafted to achieve good results, which is not intuitive or user-friendly. Second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model. We further develop an edge smoother based on FFT for seamless blending between the layer and the image. Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that our ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE.
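The abstract does not define the Region-IoU scheme, so the sketch below shows only the generic idea it appears to build on: given a coarse edit region and several candidate masks from a segmentation model, keep the candidate with the highest intersection-over-union. Masks are represented as sets of (row, column) pixels, and the function names `iou` and `best_segment` are illustrative assumptions rather than names from the ZONE code.

```python
Mask = set[tuple[int, int]]  # a binary mask stored as a set of (row, col) pixels


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_segment(region: Mask, segments: list[Mask]) -> tuple[int, float]:
    """Return the index and IoU of the segment that best overlaps the region."""
    scores = [iou(region, seg) for seg in segments]
    best = max(range(len(segments)), key=lambda i: scores[i])
    return best, scores[best]


# Coarse 2x2 edit region and three candidate segments.
region = {(0, 0), (0, 1), (1, 0), (1, 1)}
segments = [
    {(0, 0), (0, 1)},                          # covers half the region
    {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)},  # covers it plus one extra pixel
    {(5, 5)},                                  # unrelated
]
index, score = best_segment(region, segments)
print(index, score)
```

Selecting by IoU rather than by raw overlap penalizes segments that cover the region but also spill far outside it.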
-
Concept personalization methods enable large text-to-image models to learn specific subjects (e.g., objects/poses/3D models) and synthesize renditions in new contexts. Given that the image references are highly biased towards visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we propose a more challenging setting, namely fine-grained visual appearance personalization. Different from existing methods, we allow users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes. These augmented data allow for refining the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are conducted in semantic space through the learned target and non-target embeddings to further enhance the disentanglement of target attributes. Extensive experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts, thus improving the controllability and flexibility of personalization.
-
Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications offering a unified space for sensor data fusion and supporting various downstream tasks. However conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this we propose PointBeV a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training enabling focused computation on regions of interest. At inference time it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle pedestrian and lane segmentation showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We release our code with two new efficient modules used in the architecture: Sparse Feature Pulling designed for the effective extraction of features from images to BeV and Submanifold Attention which enables efficient temporal modeling. The code is available at https://github.com/valeoai/PointBeV.
-
Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that efficiently learns the semantic similarities between images and texts. Traditional fine-grained alignment methods heavily rely on pre-trained object detectors to extract region features for subsequent region-word alignment thereby incurring substantial computational costs for region detection and error propagation issues for two-stage training. In this paper we focus on the mainstream vision transformer incorporating patch features for patch-word alignment while addressing the resultant issue of visual patch redundancy and patch ambiguity for semantic alignment. We propose a novel Linguistic-Aware Patch Slimming (LAPS) framework for fine-grained alignment which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patch-word alignment. Extensive experiments on various evaluation benchmarks and model backbones show LAPS outperforms the state-of-the-art fine-grained alignment methods by 5%-15% rSum. Our code is available at https://github.com/CrossmodalGroup/LAPS
-
Recent years have witnessed a trend of deep integration of the generation and reconstruction paradigms. In this paper, we extend the ability of controllable generative models to a more comprehensive hand mesh recovery task: direct hand mesh generation, inpainting, reconstruction, and fitting in a single framework, which we name Holistic Hand Mesh Recovery (HHMR). Our key observation is that different kinds of hand mesh recovery tasks can be achieved by a single generative model with strong multimodal controllability, and in such a framework, realizing different tasks only requires giving different signals as conditions. To achieve this goal, we propose an all-in-one diffusion framework based on graph convolution and attention mechanisms for holistic hand mesh recovery. In order to achieve strong control generation capability while ensuring the decoupling of multimodal control signals, we map different modalities to a shared feature space and apply cross-scale random masking at both the modality and feature levels. In this way, the correlation between different modalities can be fully exploited during the learning of hand priors. Furthermore, we propose Condition-aligned Gradient Guidance to enhance the alignment of the generated model with the control signals, which significantly improves the accuracy of the hand mesh reconstruction and fitting. Experiments show that our novel framework can realize multiple hand mesh recovery tasks simultaneously and outperform the existing methods in different tasks, which provides more possibilities for subsequent downstream applications including gesture recognition, pose generation, mesh editing, and so on.
-
In recent years, large-scale video-language pre-training (VidLP) has received considerable attention for its effectiveness in relevant tasks. In this paper, we propose a novel action-centric VidLP framework that employs video tube features for temporal modeling and language features based on semantic role labeling (SRL). Our video encoder generates multiple tube features along object trajectories, identifying action-related regions within videos, to overcome the limitations of existing temporal attention mechanisms. Additionally, our text encoder incorporates high-level action-related language knowledge previously underutilized in current VidLP models. The SRL captures action verbs and related semantics among objects in sentences and enhances the ability to perform instance-level text matching, thus enriching the cross-modal (CM) alignment process. We also introduce two novel pre-training objectives and a self-supervision strategy to produce a more faithful CM representation. Experimental results demonstrate that our method outperforms existing VidLP frameworks in various downstream tasks and datasets, establishing our model as a baseline in the modern VidLP framework.
-
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue we introduce a novel inference method Prompt Highlighter which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by the classifier-free diffusion guidance we form regular and unconditional context pairs based on highlighted tokens demonstrating that the autoregressive generation in models can be guided in a classifier-free way. Notably we find that during inference guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5 our method secured 70.7 in the MMBench test and 1552.5 in MME-perception.
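For readers unfamiliar with classifier-free guidance, the sketch below shows the standard combination rule applied to next-token logits: the guided logits extrapolate from the unconditional prediction towards the conditional one. It is only an illustration of that general rule; how Prompt Highlighter builds its context pairs and adjusts attention weights is not reproduced, and the function names and toy logits are assumptions.

```python
import math


def guided_logits(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: uncond + scale * (cond - uncond), per token."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three tokens.
cond = [2.0, 1.0, 0.0]    # logits with the highlighted context
uncond = [1.0, 1.0, 1.0]  # logits with the highlighted context removed

logits = guided_logits(cond, uncond, scale=1.5)
probs = softmax(logits)
print(logits, probs)
```

With `scale=1.0` the rule returns the conditional logits unchanged; larger values push probability mass further towards tokens that the highlighted context favors.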
-
The problem of self-calibration of two cameras from a given fundamental matrix is one of the basic problems in geometric computer vision. Under the assumption of known principal points and square pixels the Bougnoux formula offers a means to compute the two unknown focal lengths. However in many practical situations the formula yields inaccurate results due to commonly occurring singularities. Moreover the estimates are sensitive to noise in the computed fundamental matrix and to the assumed positions of the principal points. In this paper we therefore propose an efficient and robust iterative method to estimate the focal lengths along with the principal points of the cameras given a fundamental matrix and priors for the estimated camera intrinsics. In addition we study a computationally efficient check of models generated within RANSAC that improves the accuracy of the estimated models while reducing the total computational time. Extensive experiments on real and synthetic data show that our iterative method brings significant improvements in terms of the accuracy of the estimated focal lengths over the Bougnoux formula and other state-of-the-art methods even when relying on inaccurate priors. The code for the methods and experiments is available at https://github.com/kocurvik/robust_self_calibration
-
Embodied AI, such as autonomous vehicles, suffers from insufficient long-tailed data because the data must be obtained from the physical world. In fact, data must be continuously obtained in a series of small batches, and the model must also be continuously trained to achieve generalizability and scalability by improving the biased data distribution. This paper addresses the training cost and catastrophic forgetting problems that arise when continuously updating models to adapt to incoming small batches from various environments for real-world motion prediction in autonomous driving. To this end, we propose a novel continual motion prediction (CMP) learning framework based on sparse meta-representation learning and an optimal memory buffer retention strategy. In meta-representation learning, a model explicitly learns a sparse representation of each driving environment, from road geometry to vehicle states, by training to reduce catastrophic forgetting based on an augmented modulation network with sparsity regularization. Also, in the adaptation phase, we develop an Optimal Memory Buffer Retention strategy that smartly preserves diverse samples by focusing on representation similarity. This approach handles the nuanced task distribution shifts characteristic of motion prediction datasets, ensuring our model stays responsive to evolving input variations without requiring extensive resources. The experimental results demonstrate that the proposed method shows superior adaptation performance to the conventional continual learning approach, which is developed using a synthetic dataset for the continual learning problem.
-
This paper proposes a cross-modal distillation framework, PartDistill, which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in invisible or undetected regions in the 2D projections, inconsistent 2D predictions by VLMs, and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from the 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. A bi-directional distillation, including forward and backward distillations, is carried out within the framework: the former distills the 2D predictions to the student network, and the latter improves the quality of the 2D predictions, which subsequently enhances the final 3D segmentation. Moreover, PartDistill can exploit generative models that facilitate effortless 3D shape creation for generating knowledge sources to be distilled. Through extensive experiments, PartDistill boosts the existing methods by substantial margins on the widely used ShapeNetPart and PartNetE datasets, with more than 15% and 12% higher mIoU scores, respectively. The code for this work is available at https://github.com/ardianumam/PartDistill.
-
In the domain of compressive sensing (CS) deep unfolding networks (DUNs) have garnered attention for their good performance and certain degree of interpretability rooted in CS domain achieved by marrying traditional optimization solvers with deep networks. However current DUNs are ill-suited for the intricate task of capturing fine-grained image details leading to perceptible distortions and blurriness in reconstructed images particularly at low CS ratios e.g. 0.10 and below. In this paper we propose CPP-Net a novel deep unfolding CS framework inspired by the primal-dual hybrid strategy of the Chambolle and Pock Proximal Point Algorithm (CP-PPA). First we derive three iteration submodules Xk Vk and Yk by incorporating customized deep learning modules to solve the sparse basis related proximal operator within CP-PPA. Second we design the Dual Path Fusion Block (DPFB) to adeptly extract and fuse multi-scale feature information enhancing sensitivity to feature information at different scales and improving detail reconstruction. Third we introduce the Iteration Fusion Strategy (IFS) to effectively weight the fusion of outputs from diverse reconstruction stages maximizing the utilization of feature information and mitigating the information loss during reconstruction stages. Extensive experiments demonstrate that CPP-Net effectively reduces distortion and blurriness while preserving richer image details outperforming current state-of-the-art methods. Codes are available at https://github.com/ICSResearch/CPP-Net.
-
In the era of AI-generated content (AIGC), malicious tampering poses imminent threats to copyright integrity and information security. Current deep image watermarking methods, while widely accepted for safeguarding visual content, can only protect copyright and ensure traceability. They fall short in localizing increasingly realistic image tampering, potentially leading to trust crises, privacy violations, and legal disputes. To solve this challenge, we propose an innovative proactive forensics framework, EditGuard, to unify copyright protection and tamper-agnostic localization, especially for AIGC-based editing methods. It offers a meticulous embedding of imperceptible watermarks and precise decoding of tampered areas and copyright information. Leveraging our observed fragility and locality of image-into-image steganography, the realization of EditGuard can be converted into a united image-bit steganography issue, thus completely decoupling the training process from the tampering types. Extensive experiments verify that our EditGuard balances tamper localization accuracy, copyright recovery precision, and generalizability to various AIGC-based tampering methods, especially for image forgery that is difficult for the naked eye to detect.
-
Constructing photo-realistic Free-Viewpoint Videos (FVVs) of dynamic scenes from multi-view videos remains a challenging endeavor. Despite the remarkable advancements achieved by current neural rendering techniques these methods generally require complete video sequences for offline training and are not capable of real-time rendering. To address these constraints we introduce 3DGStream a method designed for efficient FVV streaming of real-world dynamic scenes. Our method achieves fast on-the-fly per-frame reconstruction within 12 seconds and real-time rendering at 200 FPS. Specifically we utilize 3D Gaussians (3DGs) to represent the scene. Instead of the naive approach of directly optimizing 3DGs per-frame we employ a compact Neural Transformation Cache (NTC) to model the translations and rotations of 3DGs markedly reducing the training time and storage required for each FVV frame. Furthermore we propose an adaptive 3DG addition strategy to handle emerging objects in dynamic scenes. Experiments demonstrate that 3DGStream achieves competitive performance in terms of rendering speed image quality training time and model storage when compared with state-of-the-art methods.
-
Existing text-to-image generative models reflect or even amplify societal biases ingrained in their training data. This is especially concerning for human image generation where models are biased against certain demographic groups. Existing attempts to rectify this issue are hindered by the inherent limitations of the pre-trained models and fail to substantially improve demographic diversity. In this work we introduce Fair Retrieval Augmented Generation (FairRAG) a novel framework that conditions pre-trained generative models on reference images retrieved from an external image database to improve fairness in human generation. FairRAG enables conditioning through a lightweight linear module that projects reference images into the textual space. To enhance fairness FairRAG applies simple-yet-effective debiasing strategies providing images from diverse demographic groups during the generative process. Extensive experiments demonstrate that FairRAG outperforms existing methods in terms of demographic diversity image-text alignment and image fidelity while incurring minimal computational overhead during inference.
-
Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN, developed by Pan et al. (2023), is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, due to its reliance on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this work, we extend this editing framework to diffusion models and propose a novel approach, DragDiffusion. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. Unlike other diffusion-based editing methods that provide guidance on diffusion latents of multiple time steps, our approach achieves efficient yet accurate spatial control by optimizing the latent of only one time step. This novel design is motivated by our observation that UNet features at a specific time step provide sufficient semantic and geometric information to support drag-based editing. Moreover, we introduce two additional techniques, namely identity-preserving fine-tuning and reference-latent-control, to further preserve the identity of the original image. Lastly, we present a challenging benchmark dataset called DragBench, the first benchmark to evaluate the performance of interactive point-based image editing methods. Experiments across a wide range of challenging cases (e.g., images with multiple objects, diverse object categories, various styles, etc.) demonstrate the versatility and generality of DragDiffusion. Code and the DragBench dataset: https://github.com/Yujun-Shi/DragDiffusion.
-
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an input audio signal. To capture the expressive, detailed nature of human heads, including hair, ears, and finer-scale eye movements, we propose to couple the speech signal with the latent space of neural parametric head models (NPHM) to create high-fidelity, temporally coherent motion sequences. We propose a new latent diffusion model for this task, operating in the expression space of neural parametric head models, to synthesize audio-driven realistic head sequences. In the absence of a dataset with corresponding NPHM expressions to audio, we optimize for these correspondences to produce a dataset of temporally-optimized NPHM expressions fit to audio-video recordings of people talking. To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of volumetric human heads, representing a significant advancement in the field of audio-driven 3D animation. Notably, our approach stands out in its ability to generate plausible motion sequences that can produce high-fidelity head animation coupled with the NPHM shape space. Our experimental results substantiate the effectiveness of FaceTalk, consistently achieving superior and visually natural motion encompassing diverse facial expressions and styles, outperforming existing methods by 75% in perceptual user study evaluation.
-
Recently, 3D Gaussian Splatting has demonstrated impressive novel view synthesis results, reaching high fidelity and efficiency. However, strong artifacts can be observed when changing the sampling rate, e.g., by changing focal length or camera distance. We find that the source of this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem, we introduce a 3D smoothing filter that constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views. It eliminates high-frequency artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip filter, which simulates a 2D box filter, effectively mitigates aliasing and dilation issues. Our evaluation, including scenarios such as training on single-scale images and testing on multiple scales, validates the effectiveness of our approach.
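
As a rough illustration of how a sampling-rate-dependent 3D smoothing filter could bound primitive size, the sketch below widens each Gaussian's per-axis variance by a low-pass term. The abstract does not give the formula: the sampling rate `nu = focal_length / depth`, the additive term `s / nu**2`, the default `s = 0.2`, and both function names are assumptions made for illustration only.

```python
def max_sampling_rate(views):
    """Return the highest sampling rate over the views that see a primitive.

    views: iterable of (focal_length_px, depth) pairs (hypothetical input format).
    The rate focal_length / depth is an assumed pinhole-camera approximation.
    """
    return max(focal / depth for focal, depth in views)


def smooth_variances(variances, nu, s=0.2):
    """Add an assumed low-pass filter variance s / nu**2 to each axis variance.

    A primitive can then never be narrower than the filter, so it cannot
    represent frequencies above what the input views sampled.
    """
    return [v + s / nu**2 for v in variances]


# Usage: a degenerate (zero-width) axis is widened to the filter size.
nu = max_sampling_rate([(1000.0, 2.0), (500.0, 5.0)])
smoothed = smooth_variances([0.0, 1e-4], nu)
```

Because the term depends on the maximal rate across training views, primitives seen up close are allowed to stay small, while primitives only seen from afar are kept wider.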
-
The difficulty of acquiring high-resolution (HR) and low-resolution (LR) image pairs in real scenarios limits the performance of existing learning-based image super-resolution (SR) methods in the real world. To conduct training on real-world unpaired data, current methods focus on synthesizing pseudo LR images to associate unpaired images. However, the realness and diversity of pseudo LR images are vulnerable due to the large image space. In this paper, we circumvent the difficulty of image generation and propose an alternative to build the connection between unpaired images in a compact proxy space. Specifically, we first construct coupled HR and LR dictionaries and then encode HR and LR images into a common latent code space using these dictionaries. In addition, we develop an autoencoder-based framework to couple these dictionaries during optimization by reconstructing input HR and LR images. The coupled dictionaries enable our method to employ a shallow network architecture with only 18 layers to achieve efficient image SR. Extensive experiments show that our method (DictSR) can effectively model the LR-to-HR mapping in coupled dictionaries and produces state-of-the-art performance on benchmark datasets.
-
Reconstructing human-object interaction in 3D from a single RGB image is a challenging task, and existing data-driven methods do not generalize beyond the objects present in the carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both plausible interaction and diverse object variation. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting human and unseen object instances without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that require template meshes, and our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data are released.
-
Inverse tone mapping (ITM) aims to reconstruct high dynamic range (HDR) radiance from low dynamic range (LDR) content. Although many deep image ITM methods can generate impressive results, the field of video ITM is still to be explored. Processing video sequences with image ITM methods may cause temporal inconsistency. Moreover, such methods cannot exploit the potentially useful information in the temporal domain. In this paper, we analyze the process of video filming and then propose a Global Sample and Local Propagate strategy to better find and utilize temporal clues. To better realize the proposed strategy, we design a two-stage pipeline which includes modules named Incremental Clue Aggregation Module and Feature and Clue Propagation Module. They can align and fuse frames effectively under the condition of brightness changes and propagate features and temporal clues to all frames efficiently. Our temporal-clue-based video ITM method can recover realistic and temporally consistent results with high fidelity in over-exposed regions. Qualitative and quantitative experiments on public datasets show that the proposed method has significant advantages over existing methods.
-
NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation
Neural Radiance Field (NeRF) has been widely recognized for its excellence in novel view synthesis and 3D scene reconstruction. However, its effectiveness is inherently tied to the assumption of static scenes, rendering it susceptible to undesirable artifacts when confronted with transient distractors such as moving objects or shadows. In this work, we propose a novel paradigm, namely "Heuristics-Guided Segmentation" (HuGS), which significantly enhances the separation of static scenes from transient distractors by harmoniously combining the strengths of hand-crafted heuristics and state-of-the-art segmentation models, thus significantly transcending the limitations of previous solutions. Furthermore, we delve into the meticulous design of heuristics, introducing a seamless fusion of Structure-from-Motion (SfM)-based heuristics and color residual heuristics, catering to a diverse range of texture profiles. Extensive experiments demonstrate the superiority and robustness of our method in mitigating transient distractors for NeRFs trained in non-static scenes. Project page: https://cnhaox.github.io/NeRF-HuGS/
-
Existing few-shot segmentation methods usually extract foreground prototypes from support images to guide query image segmentation. However, different background contexts of support and query images can cause their foreground features to be misaligned. This phenomenon, known as background context bias, can hinder the effectiveness of support prototypes in guiding query image segmentation. In this work, we propose a novel framework with an iterative structure to address this problem. In each iteration of the framework, we first generate a query prediction based on a support foreground feature. Next, we extract background context from the query image to modulate the support foreground feature, thus eliminating the foreground feature misalignment caused by the different backgrounds. After that, we design a confidence-biased attention to eliminate noise and cleanse information. By integrating these components through an iterative structure, we create a novel network that can leverage the synergies between different modules to improve their performance in a mutually reinforcing manner. Through these carefully designed components and structures, our network can effectively eliminate background context bias in few-shot segmentation, thus achieving outstanding performance. We conduct extensive experiments on the PASCAL-5^i and COCO-20^i datasets and achieve state-of-the-art (SOTA) results, which demonstrate the effectiveness of our approach.
-
Current video anomaly detection (VAD) approaches with weak supervision are inherently limited to a closed-set setting and may struggle in open-world applications where there can be anomaly categories in the test data unseen during training. A few recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos. However, such a setting focuses on predicting frame anomaly scores, having no ability to recognize the specific categories of anomalies, despite the fact that this ability is essential for building more informed video surveillance systems. This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies. To this end, we propose a model that decouples OVVAD into two mutually complementary tasks, class-agnostic detection and class-specific classification, and jointly optimizes both tasks. Particularly, we devise a semantic knowledge injection module to introduce semantic knowledge from large language models for the detection task, and design a novel anomaly synthesis module to generate pseudo unseen anomaly videos with the help of large vision generation models for the classification task. The injected semantic knowledge and synthesized anomalies substantially extend our model's capability in detecting and categorizing a variety of seen and unseen anomalies. Extensive experiments on three widely-used benchmarks demonstrate that our model achieves state-of-the-art performance on the OVVAD task.
-
In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images poses a challenge, as it requires effective alignment between text and OCR-Text (referring to the text in images as OCR-Text to distinguish it from the text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. Additionally, we have designed a new labeling generation method specifically for ODM and combined it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at https://github.com/PriNing/ODM.
-
Despite many attempts to leverage pre-trained text-to-image models (T2I) like Stable Diffusion (SD) for controllable image editing, producing good, predictable results remains a challenge. Previous approaches have focused on either fine-tuning pre-trained T2I models on specific datasets to generate certain kinds of images (e.g., with a specific object or person) or on optimizing the weights, text prompts, and/or learning features for each input image in an attempt to coax the image generator to produce the desired result. However, these approaches all have shortcomings and fail to produce good results in a predictable and controllable manner. To address this problem, we present TiNO-Edit, an SD-based method that focuses on optimizing the noise patterns and diffusion timesteps during editing, something previously unexplored in the literature. With this simple change, we are able to generate results that both better align with the original images and reflect the desired result. Furthermore, we propose a set of new loss functions that operate in the latent domain of SD, greatly speeding up the optimization when compared to prior losses, which operate in the pixel domain. Our method can be easily applied to variations of SD, including Textual Inversion and DreamBooth, that encode new concepts and incorporate them into the edited results. We present a host of image-editing capabilities enabled by our approach. Our code is publicly available at https://github.com/SherryXTChen/TiNO-Edit.
-
Epistemic uncertainty quantification (UQ) identifies where models lack knowledge. Traditional UQ methods, often based on Bayesian neural networks, are not suitable for pre-trained non-Bayesian models. Our study addresses quantifying epistemic uncertainty for any pre-trained model, which does not need the original training data or model modifications and can ensure broad applicability regardless of network architectures or training techniques. Specifically, we propose a gradient-based approach to assess epistemic uncertainty, analyzing the gradients of outputs relative to model parameters and thereby indicating necessary model adjustments to accurately represent the inputs. We first explore theoretical guarantees of gradient-based methods for epistemic UQ, questioning the view that this uncertainty is only calculable through differences between multiple models. We further improve gradient-driven UQ by using class-specific weights for integrating gradients and emphasizing distinct contributions from neural network layers. Additionally, we enhance UQ accuracy by combining gradient and perturbation methods to refine the gradients. We evaluate our approach on out-of-distribution detection, uncertainty calibration, and active learning, demonstrating its superiority over current state-of-the-art UQ methods for pre-trained models.
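
To make the idea of scoring uncertainty from gradients of outputs with respect to parameters concrete, here is a minimal sketch for a linear softmax classifier. It is not the paper's method: the choice of weighting each class's gradient norm by its predicted probability, the restriction to a single linear layer, and the function names are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_uncertainty(weights, x):
    """Probability-weighted sum of gradient norms of log p_c w.r.t. the weights.

    For logits = W x, the gradient of log p_c with respect to W[k][j] is
    ([k == c] - p_k) * x_j, so its Frobenius norm factorizes into
    ||onehot_c - p|| * ||x||. Weighting by p_c is an assumed choice.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    score = 0.0
    for c, p_c in enumerate(probs):
        dlogits = [(1.0 if k == c else 0.0) - p_k for k, p_k in enumerate(probs)]
        grad_norm = math.sqrt(sum(g * g for g in dlogits)) * x_norm
        score += p_c * grad_norm
    return score


# Usage: an input the model separates cleanly scores lower than one
# that sits on the decision boundary.
W = [[5.0, 0.0], [0.0, 5.0]]
confident = gradient_uncertainty(W, [1.0, 0.0])
ambiguous = gradient_uncertainty(W, [0.7071, 0.7071])
```

A large score means the parameters would need a large update to represent the input well, which is the intuition the abstract describes; the sketch needs neither training data nor a second model.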
-
Diffusion models excel at modeling complex and multimodal trajectory distributions for decision-making and control. Reward-gradient guided denoising has been recently proposed to generate trajectories that maximize both a differentiable reward function and the likelihood under the data distribution captured by a diffusion model. Reward-gradient guided denoising requires a differentiable reward function fitted to both clean and noised samples, limiting its applicability as a general trajectory optimizer. In this paper, we propose Diffusion-ES, a method that combines gradient-free optimization with trajectory denoising to optimize black-box non-differentiable objectives while staying in the data manifold. Diffusion-ES samples trajectories during evolutionary search from a diffusion model and scores them using a black-box reward function. It mutates high-scoring trajectories using a truncated diffusion process that applies a small number of noising and denoising steps, allowing for much more efficient exploration of the solution space. We show that Diffusion-ES achieves state-of-the-art performance on nuPlan, an established closed-loop planning benchmark for autonomous driving. Diffusion-ES outperforms existing sampling-based planners, reactive deterministic or diffusion-based policies, and reward-gradient guidance. Additionally, we show that, unlike prior guidance methods, our method can optimize non-differentiable language-shaped reward functions generated by few-shot LLM prompting. When guided by a human teacher that issues instructions to follow, our method can generate novel, highly complex behaviors, such as aggressive lane weaving, which are not present in the training data. This allows us to solve the hardest nuPlan scenarios, which are beyond the capabilities of existing trajectory optimization methods and driving policies.
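
The search loop described above (sample, score with a black-box reward, keep high scorers, mutate them by a few noising and denoising steps) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the "data manifold" is the unit circle, `toy_denoise` stands in for a trained diffusion denoiser, and the step reward, population sizes, and noise level are hypothetical.

```python
import math
import random


def sample_prior(rng):
    """Toy stand-in for sampling from the diffusion model: a point on the unit circle."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (math.cos(angle), math.sin(angle))


def toy_denoise(x, strength=0.5):
    """Toy denoiser: move a 2-D point part of the way back toward the unit circle."""
    r = math.hypot(x[0], x[1]) or 1e-9
    target = (x[0] / r, x[1] / r)
    return tuple(xi + strength * (ti - xi) for xi, ti in zip(x, target))


def mutate(x, rng, noise_std=0.3, steps=2):
    """Truncated diffusion mutation: a few noising steps, then a few denoising steps."""
    for _ in range(steps):
        x = tuple(xi + rng.gauss(0.0, noise_std) for xi in x)
    for _ in range(steps):
        x = toy_denoise(x)
    return x


def reward(x):
    """Black-box, non-differentiable reward: a staircase in the first coordinate."""
    return math.floor(10.0 * x[0]) / 10.0


def diffusion_es(reward_fn, pop_size=16, n_elite=4, generations=10, seed=0):
    """Gradient-free search with elitism; returns the best sample and best-score history."""
    rng = random.Random(seed)
    population = [sample_prior(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=reward_fn, reverse=True)
        elites = ranked[:n_elite]
        history.append(reward_fn(elites[0]))
        children = [mutate(rng.choice(elites), rng) for _ in range(pop_size - n_elite)]
        population = elites + children
    best = max(population, key=reward_fn)
    history.append(reward_fn(best))
    return best, history


best, history = diffusion_es(reward)
```

Because elites are carried over unchanged, the best score never decreases between generations, and no gradient of `reward` is ever required.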
-
AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor
Nonlinearities are decisive in neural representation learning. Traditional Activation (Act) functions impose fixed inductive biases on neural networks with oriented biological intuitions. Recent methods leverage self-gated curves to compensate for the rigid traditional Act paradigms in fitting flexibility. However, substantial improvements are still impeded by the norm-induced mismatched feature re-calibrations (see Section 1), i.e., the actual importance of a feature can be inconsistent with its explicit intensity, which violates the basic intention of a direct self-gated feature re-weighting. To address this problem, we propose to learn discriminative neural feature Act with a novel prototype, namely AdaShift, which enhances typical self-gated Act by incorporating an adaptive shift factor into the re-weighting function of Act. AdaShift casts dynamic translations on the inputs of a re-weighting function by exploiting comprehensive feature-filter context cues of different ranges in a simple yet effective manner. We obtain the new intuitions of AdaShift by rethinking the feature-filter relationships from a common Softmax-based classification and by generalizing the new observations to a common learning layer that encodes features with updatable filters. Our practical AdaShifts, built upon the new Act prototype, demonstrate significant improvements to the popular/SOTA Act functions on different vision benchmarks. By simply replacing ReLU with AdaShifts, ResNets can match advanced Transformer counterparts (e.g., ResNet-50 vs. Swin-T) with lower cost and fewer parameters.
-
Image diffusion models have been utilized in various tasks, such as text-to-image generation and controllable image synthesis. Recent research has introduced tuning methods that make subtle adjustments to the original models, yielding promising results in specific adaptations of foundational generative diffusion models. Rather than modifying the main backbone of the diffusion model, we delve into the role of skip connection in U-Net and reveal that hierarchical features aggregating long-distance information across encoder and decoder make a significant impact on the content and quality of image generation. Based on the observation, we propose an efficient generative tuning framework, dubbed SCEdit, which integrates and edits Skip Connection using a lightweight tuning module named SC-Tuner. Furthermore, the proposed framework allows for straightforward extension to controllable image synthesis by injecting different conditions with Controllable SC-Tuner, simplifying and unifying the network design for multi-condition inputs. Our SCEdit substantially reduces training parameters, memory usage, and computational expense due to its lightweight tuners, with backward propagation only passing to the decoder blocks. Extensive experiments conducted on text-to-image generation and controllable image synthesis tasks demonstrate the superiority of our method in terms of efficiency and performance. Project page: https://scedit.github.io/.
-
We propose a single-shot approach to determining the 6-DoF pose of an object with an available 3D computer-aided design (CAD) model from a single RGB image. Our method, dubbed MRC-Net, comprises two stages. The first performs pose classification and renders the 3D object in the classified pose. The second stage performs regression to predict the fine-grained residual pose within the class. Connecting the two stages is a novel multi-scale residual correlation (MRC) layer that captures high- and low-level correspondences between the input image and the rendering from the first stage. MRC-Net employs a Siamese network with shared weights between both stages to learn embeddings for input and rendered images. To mitigate ambiguity when predicting discrete pose class labels on symmetric objects, we use soft probabilistic labels to define the pose class in the first stage. We demonstrate state-of-the-art accuracy, outperforming all competing RGB-based methods on four challenging BOP benchmark datasets: T-LESS, LM-O, YCB-V, and ITODD. Our method is non-iterative and requires no complex post-processing. Our code and pretrained models are available at https://github.com/amzn/mrc-net-6d-pose
-
Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods explore multiple local depth clues, such as object heights and keypoints, and then formulate the object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However, the errors of existing multiple depths tend to have the same sign, which hinders them from neutralizing each other and limits the overall accuracy of the combined depth. To alleviate this problem, we propose to increase the complementarity of depths with two novel designs. First, we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the correlation of depth predictions. Second, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs, our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition, complementary depth can also be a lightweight and plug-and-play module to boost multiple existing monocular 3D object detectors. Code is available at https://github.com/elvintanhust/MonoCD.
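
The point that same-sign errors cannot neutralize each other is easy to check with a small worked example. The numbers and the simple averaging rule below are illustrative assumptions, not values or the fusion scheme from the paper.

```python
def fused_error(true_depth, estimates):
    """Absolute error of the plain average of several depth estimates."""
    fused = sum(estimates) / len(estimates)
    return abs(fused - true_depth)


true_depth = 20.0  # metres (hypothetical)

# Both estimates overshoot: the average still overshoots by roughly 0.8 m.
same_sign = fused_error(true_depth, [21.0, 20.6])

# One overshoots and one undershoots: the errors partly cancel (about 0.2 m).
opposite_sign = fused_error(true_depth, [21.0, 19.4])
```

The individual errors have comparable magnitude in both cases; only their signs differ, which is why a branch whose errors tend to fall on the other side can help an ensemble.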
-
We establish rigorous benchmarks for visual perception robustness. Synthetic image datasets such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific types of evaluation over synthetic corruptions, backgrounds, and textures, yet those robustness benchmarks are restricted to specified variations and have low synthetic quality. In this work, we introduce generative models as a data source for synthesizing hard images that benchmark deep models' robustness. Leveraging diffusion models, we are able to generate images with more diversified backgrounds, textures, and materials than any prior work, and we term this benchmark ImageNet-D. Experimental results show that ImageNet-D results in a significant accuracy drop for a range of vision models, from the standard ResNet visual classifier to the latest foundation models like CLIP and MiniGPT-4, reducing their accuracy by up to 60%. Our work suggests that diffusion models can be an effective source to test vision models. The code and dataset are available at https://github.com/chenshuang-zhang/imagenet_d.
-
Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation but are still vulnerable to geometry collapse and poor textures. To solve this issue, we first deeply analyze SDS and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample, which then serves as guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample, which is not always less noisy and thus is not a consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the same desired target point as the SDE, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given an image rendered by a 3D model, we first estimate its desired 3D score function by a pre-trained 2D diffusion model and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide the more noisy one for distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The codes are available at https://github.com/sail-sg/Consistent3D.
-
Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation trained on a limited category within a simulator often struggles to achieve generalizability, especially when confronted with extensive categories. Therefore, we introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation. The fundamental insight lies in the introduced fine-tuning paradigm, encompassing object category understanding, affordance prior reasoning, and object-centric pose prediction, to stimulate the reasoning ability of the MLLM in manipulation. During inference, our approach utilizes an RGB image and text prompt to predict the end effector's pose in chain of thoughts. After the initial contact is established, an active impedance adaptation policy is introduced to plan the upcoming waypoints in a closed-loop manner. Moreover, in the real world, we design a test-time adaptation (TTA) strategy for manipulation to enable the model to better adapt to the current real-world scene configuration. Experiments in simulation and the real world show the promising performance of ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipllm.
-
In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, exhibits a performance degradation when faced with datasets with varying image sizes. Previous approaches tend to resize the image to a fixed size or adopt structure modifications, hindering the preservation of SAM's rich prior knowledge. Besides, such task-specific tuning necessitates a complete retraining of the model, which is costly and unacceptable for deployment in downstream tasks. In this paper, we reformulate this challenge as a length extrapolation problem, where token sequence length varies while maintaining a consistent patch size for images with different sizes. To this end, we propose a Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structure modifications. Firstly, we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot product values when the token sequence length changes. Secondly, we present a bias-mode attention mask that allows each token to prioritize neighboring information, mitigating the impact of untrained distant information. Our BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to significantly mitigate performance degradation in the zero-shot setting and achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously.
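
The two ingredients named above, a length-aware scale on the attention logits and an additive bias that favors nearby tokens, can be sketched for a single query as follows. The abstract does not state the formulas: the `log(n) / log(n_train)` scale and the linear distance penalty are assumptions borrowed from common length-extrapolation practice, and `attention_weights`, `n_train`, and `slope` are hypothetical names.

```python
import math


def attention_weights(scores, distances, n_train, d, slope=0.1):
    """Softmax attention weights for one query with length scaling and distance bias.

    scores:    raw query-key dot products, one per key token.
    distances: spatial distance from the query token to each key token.
    n_train:   sequence length seen in training (assumed > 1).
    d:         head dimension used for the usual 1 / sqrt(d) scaling.
    """
    n = len(scores)
    # Assumed length-aware factor: equals 1 when n == n_train.
    scale = math.log(n) / math.log(n_train) / math.sqrt(d)
    # Assumed bias-mode mask: penalize keys linearly in their distance.
    logits = [scale * s - slope * dist for s, dist in zip(scores, distances)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: with identical content scores, nearer tokens receive more weight.
weights = attention_weights([1.0, 1.0, 1.0, 1.0], [0, 1, 2, 3], n_train=4, d=16)
```

Neither change adds parameters or alters the layer structure, which matches the abstract's stated goal of avoiding structure modifications.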
-
Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal issue involving the dynamic addition of new classes in the context of federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing catastrophic forgetting and data privacy problems. However, prior approaches lack the crucial synergy between DFKT and the model training phases, causing DFKT to encounter difficulties in generating high-quality data from a non-anchored latent space of the old task model. In this paper, we introduce LANDER (Label Text Centered Data-Free Knowledge Transfer) to address this issue by utilizing label text embeddings (LTE) produced by pretrained language models. Specifically, during the model training phase, our approach treats LTE as anchor points and constrains the feature embeddings of corresponding training samples around them, enriching the surrounding area with more meaningful information. In the DFKT phase, by using these LTE anchors, LANDER can synthesize more meaningful samples, thereby effectively addressing the forgetting problem. Additionally, instead of tightly constraining embeddings toward the anchor, the Bounding Loss is introduced to encourage sample embeddings to remain flexible within a defined radius. This approach preserves the natural differences in sample embeddings and mitigates the embedding overlap caused by heterogeneous federated settings. Extensive experiments conducted on CIFAR100, Tiny-ImageNet, and ImageNet demonstrate that LANDER significantly outperforms previous methods and achieves state-of-the-art performance in FCIL. The code is available at https://github.com/tmtuan1307/lander.
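
One plausible reading of a loss that keeps embeddings "flexible within a defined radius" of an anchor is a hinge on the distance to that anchor. The sketch below implements that reading only; the exact form of the Bounding Loss is not given in the abstract, so the hinge, the Euclidean distance, and the name `bounding_loss` are assumptions.

```python
import math


def bounding_loss(embedding, anchor, radius):
    """Zero inside the ball of the given radius around the anchor, linear outside.

    embedding, anchor: equal-length sequences of floats.
    """
    dist = math.sqrt(sum((e - a) ** 2 for e, a in zip(embedding, anchor)))
    return max(0.0, dist - radius)


# Usage: an embedding within the radius is not penalized at all,
# so samples of the same class keep their natural spread.
inside = bounding_loss((0.5, 0.5), (0.0, 0.0), radius=2.0)
outside = bounding_loss((3.0, 4.0), (0.0, 0.0), radius=2.0)
```

Compared with a plain squared distance to the anchor, this form pulls in only the embeddings that have drifted beyond the radius.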
-
We present an approach for analyzing grouping information contained within a neural network's activations, permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work, our method conducts a holistic analysis of a network's activation state, leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering, we formulate this analysis in terms of an optimization objective involving a set of affinity matrices, each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis, including in the latter both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout, whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a 'where' pathway) while value vectors refine a semantic category representation (a 'what' pathway).
-
Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG), we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning, and vision-language conversations.
-
Concept Bottleneck Models (CBMs) map the black-box visual representations extracted by deep neural networks onto a set of interpretable concepts and use the concepts to make predictions, enhancing the transparency of the decision-making process. Multimodal pre-trained models can match visual representations with textual concept embeddings, allowing the interpretable concept bottleneck to be obtained without expert concept annotations. Recent research has focused on establishing the concept bank and selecting high-quality concepts. However, it is challenging to construct a comprehensive concept bank through humans or large language models, which severely limits the performance of CBMs. In this work, we propose the Incremental Residual Concept Bottleneck Model (Res-CBM) to address the challenge of concept completeness. Specifically, the residual concept bottleneck model employs a set of optimizable vectors to complete missing concepts; then the incremental concept discovery module converts the complemented vectors with unclear meanings into potential concepts in the candidate concept bank. Our approach can be applied to any user-defined concept bank as a post-hoc processing method to enhance the performance of any CBM. Furthermore, to measure the descriptive efficiency of CBMs, the Concept Utilization Efficiency (CUE) metric is proposed. Experiments show that Res-CBM outperforms the current state-of-the-art methods in terms of both accuracy and efficiency and achieves comparable performance to black-box models across multiple datasets.
-
Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely expensive. In this work, we show that imitating shortest-path planners in simulation produces agents that, given a language instruction, can proficiently navigate, explore, and manipulate objects in both simulation and the real world using only RGB sensors (no depth map or GPS coordinates). This surprising result is enabled by our end-to-end, transformer-based SPOC architecture, powerful visual encoders paired with extensive image augmentation, and the dramatic scale and diversity of our training data: millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets. Our models, data, training code, and newly proposed 10-task benchmarking suite CHORES are available at https://spoc-robot.github.io/.
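To make the idea of a shortest-path expert concrete, the following is a minimal sketch of how such a planner could produce an action sequence for imitation. It assumes a toy 2D occupancy grid and four discrete moves; the function `shortest_path_actions` and the grid are hypothetical and are not taken from the SPOC codebase or its simulator.

```python
from collections import deque

def shortest_path_actions(grid, start, goal):
    """Breadth-first search on a grid; returns the expert action list or None."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    queue = deque([start])
    # Maps each visited cell to (previous cell, action taken to reach it).
    came_from = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for name, (d_row, d_col) in moves.items():
            nxt = (cell[0] + d_row, cell[1] + d_col)
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if inside and grid[nxt[0]][nxt[1]] == 0 and nxt not in came_from:
                came_from[nxt] = (cell, name)
                queue.append(nxt)
    if goal not in came_from:
        return None
    # Walk back from the goal to the start, collecting actions in reverse.
    actions = []
    cell = goal
    while came_from[cell] is not None:
        cell, name = came_from[cell]
        actions.append(name)
    actions.reverse()
    return actions

# 0 = free cell, 1 = obstacle; the wall forces a detour around the right side.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
expert_actions = shortest_path_actions(grid, start=(0, 0), goal=(2, 0))
```

Each (observation, expert action) pair along such a path would then serve as a supervised training example for the imitation policy.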
-
Most of the previous exposure correction methods learn dense pixel-wise transformations to achieve promising results but consume huge computational resources. Recently Learnable 3D lookup tables (3D LUTs) have demonstrated impressive performance and efficiency for image enhancement. However these methods can only perform global transformations and fail to finely manipulate local regions. Moreover they uniformly downsample the input image which loses the rich color information and limits the learning of color transformation capabilities. In this paper we present a collaborative transformation framework (CoTF) for real-time exposure correction which integrates global transformation with pixel-wise transformations in an efficient manner. Specifically the global transformation adjusts the overall appearance using image-adaptive 3D LUTs to provide decent global contrast and sharp details while the pixel transformation compensates for local context. Then a relation-aware modulation module is designed to combine these two components effectively. In addition we propose an adaptive sampling strategy to preserve more color information by predicting the sampling intervals thus providing higher quality input data for the learning of 3D LUTs. Extensive experiments demonstrate that our method can process high-resolution images in real-time on GPUs while achieving comparable performance against current state-of-the-art methods. The code is available at https://github.com/HUST-IAL/CoTF.
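As background for the global transformation, a 3D LUT maps each input RGB color to an output color by interpolating between entries of a small color lattice. The sketch below applies a fixed LUT with trilinear interpolation; it is an illustration only and not the CoTF implementation. The helper names `identity_lut` and `apply_lut`, the lattice size, and the square-root "brightening" LUT are assumptions, and the image-adaptive prediction of the LUT is omitted.

```python
import numpy as np

def identity_lut(size):
    """LUT of shape (size, size, size, 3) that maps every color to itself."""
    axis = np.linspace(0.0, 1.0, size)
    r, g, b = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([r, g, b], axis=-1)

def apply_lut(image, lut):
    """Apply a 3D LUT to an (H, W, 3) image in [0, 1] with trilinear interpolation."""
    size = lut.shape[0]
    pos = np.clip(image, 0.0, 1.0) * (size - 1)
    # Lower lattice corner per channel, capped so that low + 1 stays in range.
    low = np.minimum(np.floor(pos).astype(int), size - 2)
    frac = pos - low
    out = np.zeros(image.shape, dtype=float)
    for dr in (0, 1):
        w_r = frac[..., 0] if dr else 1.0 - frac[..., 0]
        for dg in (0, 1):
            w_g = frac[..., 1] if dg else 1.0 - frac[..., 1]
            for db in (0, 1):
                w_b = frac[..., 2] if db else 1.0 - frac[..., 2]
                corner = lut[low[..., 0] + dr, low[..., 1] + dg, low[..., 2] + db]
                out += (w_r * w_g * w_b)[..., None] * corner
    return out

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))

unchanged = apply_lut(image, identity_lut(9))
# A toy "exposure lift": square root of every lattice entry brightens mid-tones.
brightened = apply_lut(image, np.sqrt(identity_lut(9)))
```

Because the same lattice is applied to every pixel, a LUT alone performs only a global transformation, which is the limitation the pixel-wise branch is meant to compensate for.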
-
We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse-to-fine diffusion architecture and propose characteristic dance primitives that possess significant expressiveness as intermediate representations between the two diffusion models. The first stage is global diffusion, which focuses on comprehending the coarse-level music-dance correlation and producing characteristic dance primitives. In contrast, the second stage is local diffusion, which generates detailed motion sequences in parallel under the guidance of the dance primitives and choreographic rules. In addition, we propose a Foot Refine Block to optimize the contact between the feet and the ground, enhancing the physical realism of the motion. Code is available at https://li-ronghui.github.io/lodge
-
Diffusion models have shown remarkable results for image generation, editing, and inpainting. Recent works explore diffusion models for 3D shape generation with neural implicit functions, i.e., signed distance functions and occupancy functions. However, they are limited to shapes with closed surfaces, which prevents them from generating diverse real-world 3D content containing open surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned distance fields (UDFs) which is capable of generating textured 3D shapes with open surfaces from text conditions or unconditionally. Our key idea is to generate UDFs in the spatial-frequency domain with an optimal wavelet transformation, which produces a compact representation space for UDF generation. Specifically, instead of selecting an appropriate wavelet transformation, which requires expensive manual effort and still leads to large information loss, we propose a data-driven approach to learn the optimal wavelet transformation for UDFs. We evaluate UDiFF and show its advantages through numerical and visual comparisons with the latest methods on widely used benchmarks. Page: https://weiqi-zhang.github.io/UDiFF.
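A small example may help show why unsigned distance fields suit open surfaces. A flat square patch has no inside or outside, so a signed distance is not well defined for it, while the unsigned distance is. The sketch below evaluates the UDF of such a patch analytically; the patch geometry and the function name `udf_square_patch` are illustrative assumptions and have nothing to do with UDiFF's learned representation.

```python
import numpy as np

def udf_square_patch(points):
    """Unsigned distance from 3D points to the open patch [0, 1] x [0, 1] at z = 0."""
    points = np.asarray(points, dtype=float)
    # In-plane offset beyond the patch border (zero if the point projects onto the patch).
    dx = np.maximum(np.maximum(-points[:, 0], points[:, 0] - 1.0), 0.0)
    dy = np.maximum(np.maximum(-points[:, 1], points[:, 1] - 1.0), 0.0)
    dz = points[:, 2]
    return np.sqrt(dx ** 2 + dy ** 2 + dz ** 2)

queries = np.array([
    [0.5, 0.5, 2.0],   # above the patch
    [0.5, 0.5, -2.0],  # below the patch: same distance, no sign flip
    [2.0, 0.5, 0.0],   # in the plane, beyond the border
    [0.5, 0.5, 0.0],   # on the surface
])
distances = udf_square_patch(queries)
```

The two points on opposite sides of the patch receive the same value, which is exactly the property that lets a UDF describe surfaces without a closed interior.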
-
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. Solving ASD involves using audio and visual information in two complementary contexts: long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. Motivated by these observations, we propose LoCoNet, a simple but effective Long-Short Context Network that leverages Long-term Intra-speaker Modeling (LIM) and Short-term Inter-speaker Modeling (SIM) in an interleaved manner. LIM employs self-attention for modeling long-range temporal dependencies and cross-attention for modeling audio-visual interactions. SIM incorporates convolutional blocks that capture local patterns for short-term inter-speaker context. Experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets, with 95.2% (+0.3%) mAP on AVA-ActiveSpeaker, 97.2% (+2.7%) mAP on Talkies, and 68.4% (+7.7%) mAP on Ego4D. Moreover, in challenging cases where multiple speakers are present, LoCoNet outperforms previous state-of-the-art methods by 3.0% mAP on AVA-ActiveSpeaker. The code is available at https://github.com/SJTUwxz/LoCoNet_ASD.
-
Existing methods for asymmetric image retrieval employ a rigid pairwise similarity constraint between the query network and the larger gallery network. However these one-to-one constraint approaches often fail to maintain retrieval order consistency especially when the query network has limited representational capacity. To overcome this problem we introduce the Decoupled Differential Distillation (D3still) framework. This framework shifts from absolute one-to-one supervision to optimizing the relational differences in pairwise similarities produced by the query and gallery networks thereby preserving a consistent retrieval order across both networks. Our method involves computing a pairwise similarity differential matrix within the gallery domain which is then decomposed into three components: feature representation knowledge inconsistent pairwise similarity differential knowledge and consistent pairwise similarity differential knowledge. This strategic decomposition aligns the retrieval ranking of the query network with the gallery network effectively. Extensive experiments on various benchmark datasets reveal that D3still surpasses state-of-the-art methods in asymmetric image retrieval. Code is available at https://github.com/SCY-X/D3still.
-
Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
Deepfake detection faces a critical generalization hurdle, with performance deteriorating when there is a mismatch between the distributions of training and testing data. A broadly received explanation is the tendency of these detectors to be overfitted to forgery-specific artifacts rather than learning features that are widely applicable across various forgeries. To address this issue, we propose a simple yet effective detector called LSDA (Latent Space Data Augmentation), which is based on a heuristic idea: representations with a wider variety of forgeries should be able to learn a more generalizable decision boundary, thereby mitigating the overfitting of method-specific features (see Fig. 1). Following this idea, we propose to enlarge the forgery space by constructing and simulating variations within and across forgery features in the latent space. This approach encompasses the acquisition of enriched domain-specific features and the facilitation of smoother transitions between different forgery types, effectively bridging domain gaps. Our approach culminates in refining a binary classifier that leverages the distilled knowledge from the enhanced features, striving for a generalizable deepfake detector. Comprehensive experiments show that our proposed method is surprisingly effective and transcends state-of-the-art detectors across several widely used benchmarks.
-
Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale as more synthetic data is added to the training set. In this paper, we study the scaling laws of synthetic images generated by state-of-the-art text-to-image models for the training of supervised models: image classifiers with label supervision and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
-
The rapid advancement of deep learning models is often attributed to their ability to leverage massive training data. In contrast, such privilege has not yet fully benefited 3D deep learning, mainly due to the limited availability of large-scale 3D datasets. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. However, due to the large domain gap between 3D point cloud datasets, such mixed supervision could adversely affect the model and lead to degraded performance (i.e., negative transfer) compared to single-dataset training. In view of this challenge, we introduce Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts, and Language-guided Categorical Alignment, which decently unifies the multiple-dataset label spaces by leveraging the relationship between label text. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably, it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover, when serving as a pre-training framework, it outperforms other pre-training approaches regarding representation quality and attains remarkable state-of-the-art performance across over ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios.
-
Convolutional neural networks are successful in pervasive vision tasks, including label distribution learning, which usually takes the form of learning an injection from the non-linear visual features to the well-defined labels. However, how the discrepancy between features is mapped to the label discrepancy is ambiguous, and its correctness is not guaranteed. To address these problems, we study the mathematical connection between a feature and its label, presenting a general and simple framework for label distribution learning. We propose a so-called Triangular Distribution Transform (TDT) to build an injective function between feature and label, guaranteeing that any symmetric feature discrepancy linearly reflects the difference between labels. The proposed TDT can be used as a plug-in in mainstream backbone networks to address different label distribution learning tasks. Experiments on Facial Age Recognition, Illumination Chromaticity Estimation, and Aesthetics Assessment show that TDT achieves on-par or better results than the prior art.
-
Today state-of-the-art deep neural networks that process event-camera data first convert a temporal window of events into dense grid-like input representations. As such they exhibit poor generalizability when deployed at higher inference frequencies (i.e. smaller temporal windows) than the ones they were trained on. We address this challenge by introducing state-space models (SSMs) with learnable timescale parameters to event-based vision. This design adapts to varying frequencies without the need to retrain the network at different frequencies. Additionally we investigate two strategies to counteract aliasing effects when deploying the model at higher frequencies. We comprehensively evaluate our approach against existing methods based on RNN and Transformer architectures across various benchmarks including Gen1 and 1 Mpx event camera datasets. Our results demonstrate that SSM-based models train 33% faster and also exhibit minimal performance degradation when tested at higher frequencies than the training input. Traditional RNN and Transformer models exhibit performance drops of more than 20 mAP with SSMs having a drop of 3.31 mAP highlighting the effectiveness of SSMs in event-based vision tasks.
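The role of a timescale parameter can be illustrated with a toy diagonal state-space model. The sketch assumes zero-order-hold discretization of a continuous system x'(t) = a x(t) + b u(t), which is a common choice in the SSM literature but is not confirmed by the abstract; the function names and all numeric values are hypothetical. It shows that rescaling the step size when the input rate doubles reproduces the same state, whereas reusing the original step does not.

```python
import numpy as np

def discretize(a, b, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM."""
    a_bar = np.exp(a * delta)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def run(a, b, inputs, delta):
    """Run the discretized recurrence x <- a_bar * x + b_bar * u over the inputs."""
    a_bar, b_bar = discretize(a, b, delta)
    state = np.zeros_like(a)
    for u in inputs:
        state = a_bar * state + b_bar * u
    return state

a = np.array([-1.0, -0.5])   # continuous-time decay rates (assumed values)
b = np.array([1.0, 2.0])

slow_inputs = np.array([1.0, 0.5, -0.2])   # one sample per 0.1 time units
fast_inputs = np.repeat(slow_inputs, 2)    # same signal sampled twice as often

state_slow = run(a, b, slow_inputs, delta=0.1)
state_fast_rescaled = run(a, b, fast_inputs, delta=0.05)  # timescale halved
state_fast_naive = run(a, b, fast_inputs, delta=0.1)      # timescale left unchanged
```

In this toy setting the rescaled run matches the original exactly because the input is piecewise constant; with real event data the match would only be approximate.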
-
In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However, traditional research focuses more on scene-level input and output setups from a global view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which partially align with LVIS, and dense semantic occupancy with 80 common categories. Building upon this database, we introduce a baseline framework named Embodied Perceptron. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities, both within the two series of benchmarks we set up, i.e., fundamental 3D perception tasks and language-grounded tasks, and in the wild.
-
We present SHINOBI, an end-to-end framework for the reconstruction of shape, material, and illumination from object images captured with varying lighting, pose, and background. Inverse rendering of an object based on unconstrained image collections is a long-standing challenge in computer vision and graphics and requires a joint optimization over shape, radiance, and pose. We show that an implicit shape representation based on a multi-resolution hash encoding enables faster and more robust shape reconstruction with joint camera alignment optimization that outperforms prior work. Further, to enable the editing of illumination and object reflectance (i.e., material), we jointly optimize BRDF and illumination together with the object's shape. Our method is class-agnostic and works on in-the-wild image collections of objects to produce relightable 3D assets for several use cases such as AR/VR, movies, and games.
-
We propose a novel strategy, ES3, for self-supervised learning of robust audio-visual speech representations from unlabeled talking face videos. While many recent approaches for this task primarily rely on guiding the learning process using the audio modality alone to capture information shared between audio and video, we reframe the problem as the acquisition of shared, unique (modality-specific), and synergistic speech information to address the inherent asymmetry between the modalities. Based on this formulation, we propose a novel "evolving" strategy that progressively builds joint audio-visual speech representations that are strong for both uni-modal (audio & visual) and bi-modal (audio-visual) speech. First, we leverage the more easily learnable audio modality to initialize audio and visual representations by capturing audio-unique and shared speech information. Next, we incorporate video-unique speech information and bootstrap the audio-visual representations on top of the previously acquired shared knowledge. Finally, we maximize the total audio-visual speech information, including synergistic information, to obtain robust and comprehensive representations. We implement ES3 as a simple Siamese framework, and experiments on both English benchmarks and a newly contributed large-scale Mandarin dataset show its effectiveness. In particular, on LRS2-BBC, our smallest model is on par with SoTA models with only 1/2 parameters and 1/8 unlabeled data (223h).
-
Neural Radiance Fields (NeRF) revolutionize the realm of visual media by providing photorealistic Free-Viewpoint Video (FVV) experiences, offering viewers unparalleled immersion and interactivity. However, the technology's significant storage requirements and the computational complexity involved in generation and rendering currently limit its broader application. To close this gap, this paper presents Temporal Tri-Plane Radiance Fields (TeTriRF), a novel technology that significantly reduces the storage size for FVV while maintaining low-cost generation and rendering. TeTriRF introduces a hybrid representation with tri-planes and voxel grids to support scaling up to long-duration sequences and scenes with complex motions or rapid changes. We propose a group training scheme tailored to achieving high training efficiency and yielding temporally consistent, low-entropy scene representations in the feature domain. Leveraging these properties of the representations, we introduce a compression pipeline with off-the-shelf video codecs, achieving an order of magnitude less storage size compared to the state of the art. Our experiments demonstrate that TeTriRF can achieve competitive quality with a higher compression rate.
-
We introduce Motion2VecSets a 4D diffusion model for dynamic surface reconstruction from point cloud sequences. While existing state-of-the-art methods have demonstrated success in reconstructing non-rigid objects using neural field representations conventional feed-forward networks encounter challenges with ambiguous observations from noisy partial or sparse point clouds. To address these challenges we introduce a diffusion model that explicitly learns the shape and motion distribution of non-rigid objects through an iterative denoising process of compressed latent representations. The diffusion-based priors enable more plausible and probabilistic reconstructions when handling ambiguous inputs. We parameterize 4D dynamics with latent sets instead of using global latent codes. This novel 4D representation allows us to learn local shape and deformation patterns leading to more accurate non-linear motion capture and significantly improving generalizability to unseen motions and identities. For more temporally-coherent object tracking we synchronously denoise deformation latent sets and exchange information across multiple frames. To avoid computational overhead we designed an interleaved space and time attention block to alternately aggregate deformation latents along spatial and temporal domains. Extensive comparisons against state-of-the-art methods demonstrate the superiority of our Motion2VecSets in 4D reconstruction from various imperfect observations.
-
Multimodal learning has advanced the performance of many vision-language tasks. However, most existing works in embodied dialog research focus on navigation and leave the localization task understudied. The few existing dialog-based localization approaches assume the availability of the entire dialog prior to localization, which is impractical for deployed dialog-based localization. In this paper, we propose DiaLoc, a new dialog-based localization framework which aligns with the behavior of a real human operator. Specifically, we produce an iterative refinement of location predictions, which can visualize current pose beliefs after each dialog turn. DiaLoc effectively utilizes the multimodal data for multi-shot localization, where a fusion encoder fuses vision and dialog information iteratively. We achieve state-of-the-art results on the embodied dialog-based localization task in single-shot (+7.08% in Acc5@valUnseen) and multi-shot settings (+10.85% in Acc5@valUnseen). DiaLoc narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.
-
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect but it is unclear how to accomplish this. No dataset of visual programs for training exists and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task treat the LLM as a policy and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection compositional visual question answering and image-text retrieval and show that in each case the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP
-
Deep Neural Networks (DNNs) have become pivotal in various fields especially in computer vision outperforming previous methodologies. A critical challenge in their deployment is the bias inherent in data across different domains such as image style and environmental conditions leading to domain gaps. This necessitates techniques for learning general representations from biased training data known as domain generalization. This paper presents Attend to eXpert Prompts (A2XP) a novel approach for domain generalization that preserves the privacy and integrity of the network architecture. A2XP consists of two phases: Expert Adaptation and Domain Generalization. In the first phase prompts for each source domain are optimized to guide the model towards the optimal direction. In the second phase two embedder networks are trained to effectively amalgamate these expert prompts aiming for an optimal output. Our extensive experiments demonstrate that A2XP achieves state-of-the-art results over existing non-private domain generalization methods. The experimental results validate that the proposed approach not only tackles the domain generalization challenge in DNNs but also offers a privacy-preserving efficient solution to the broader field of computer vision.
-
In the realm of video object segmentation (VOS) the challenge of operating under low-light conditions persists resulting in notably degraded image quality and compromised accuracy when comparing query and memory frames for similarity computation. Event cameras characterized by their high dynamic range and ability to capture motion information of objects offer promise in enhancing object visibility and aiding VOS methods under such low-light conditions. This paper introduces a pioneering framework tailored for low-light VOS leveraging event camera data to elevate segmentation accuracy. Our approach hinges on two pivotal components: the Adaptive Cross-Modal Fusion (ACMF) module aimed at extracting pertinent features while fusing image and event modalities to mitigate noise interference and the Event-Guided Memory Matching (EGMM) module designed to rectify the issue of inaccurate matching prevalent in low-light settings. Additionally we present the creation of a synthetic LLE-DAVIS dataset and the curation of a real-world LLE-VOS dataset encompassing frames and events. Experimental evaluations corroborate the efficacy of our method across both datasets affirming its effectiveness in low-light scenarios. The datasets are available at https://github.com/HebeiFast/EventLowLightVOS.
-
Domain adaptation adapts models to various scenes with different appearances. In this field active domain adaptation is crucial in effectively sampling a limited number of data in the target domain. We propose an active domain adaptation method for object detection focusing on quantifying the undetectability of objects. Existing methods for active sampling encounter challenges in considering undetected objects while estimating the uncertainty of model predictions. Our proposed active sampling strategy addresses this issue using an active learning approach that simultaneously accounts for uncertainty and undetectability. Our newly proposed False Negative Prediction Module evaluates the undetectability of images containing undetected objects enabling more informed active sampling. This approach considers previously overlooked undetected objects thereby reducing false negative errors. Moreover using unlabeled data our proposed method utilizes uncertainty-guided pseudo-labeling to enhance domain adaptation further. Extensive experiments demonstrate that the performance of our proposed method closely rivals that of fully supervised learning while requiring only a fraction of the labeling efforts needed for the latter.
-
The scarcity of annotated data has sparked significant interest in unsupervised pre-training methods that leverage medical reports as auxiliary signals for medical visual representation learning. However existing research overlooks the multi-granularity nature of medical visual representation and lacks suitable contrastive learning techniques to improve the models' generalizability across different granularities leading to the underutilization of image-text information. To address this we propose MLIP a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning. Our model includes global contrastive learning with our designed divergence encoder local token-knowledge-patch alignment contrastive learning and knowledge-guided category-level contrastive learning with expert knowledge. Experimental evaluations reveal the efficacy of our model in enhancing transfer performance for tasks such as image classification object detection and semantic segmentation. Notably MLIP surpasses state-of-the-art methods even with limited annotated data highlighting the potential of multimodal pre-training in advancing medical representation learning.
-
Generative 3D part assembly involves understanding part relationships and predicting their 6-DoF poses for assembling a realistic 3D shape. Prior work often focuses on the geometry of individual parts, neglecting part-whole hierarchies of objects. Leveraging two key observations: 1) super-part poses provide strong hints about part poses, and 2) predicting super-part poses is easier due to fewer super-parts, we propose a part-whole-hierarchy message passing network for efficient 3D part assembly. We first introduce super-parts by grouping geometrically similar parts without any semantic labels. Then we employ a part-whole hierarchical encoder, wherein a super-part encoder predicts latent super-part poses based on input parts. Subsequently, we transform the point cloud using the latent poses, feeding it to the part encoder for aggregating super-part information and reasoning about part relationships to predict all part poses. In training, only ground-truth part poses are required. During inference, the predicted latent poses of super-parts enhance interpretability. Experimental results on the PartNet dataset show that our method achieves state-of-the-art performance in part and connectivity accuracy and enables an interpretable hierarchical part assembly.
-
Diffusion models have made significant advances in generating high-quality images but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing rendering favorable results in temporal consistency over state-of-the-art methods.
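The general idea of merging temporally redundant tokens can be sketched as follows. This is not the paper's algorithm: it assumes that correspondence between two frames is found by nearest-neighbor cosine similarity with a fixed threshold, and the function `merge_tokens`, the threshold, and the toy tokens are all hypothetical. Matched tokens are averaged into one shared token, and an index map records which shared token stands in for each token of the second frame.

```python
import numpy as np

def merge_tokens(frame_a, frame_b, threshold=0.9):
    """Merge tokens of frame_b into their most similar frame_a token when similar enough."""
    unit_a = frame_a / np.linalg.norm(frame_a, axis=1, keepdims=True)
    unit_b = frame_b / np.linalg.norm(frame_b, axis=1, keepdims=True)
    similarity = unit_b @ unit_a.T               # shape (num_b, num_a)
    best = similarity.argmax(axis=1)
    matched = similarity.max(axis=1) >= threshold

    # Average every matched frame_b token into its frame_a partner.
    sums = frame_a.astype(float).copy()
    counts = np.ones(len(frame_a))
    for b_index in np.flatnonzero(matched):
        sums[best[b_index]] += frame_b[b_index]
        counts[best[b_index]] += 1
    merged_a = sums / counts[:, None]

    # Unmatched frame_b tokens are kept as they are, appended after the merged ones.
    tokens = np.vstack([merged_a, frame_b[~matched]])
    index_b = np.where(matched, best, 0)
    index_b[~matched] = len(frame_a) + np.arange(int((~matched).sum()))
    return tokens, index_b

basis = np.eye(4)
frame_a = basis[:3]              # three tokens in the first frame
frame_b = basis[[2, 0, 3]]       # two repeated tokens plus one new token

tokens, index_b = merge_tokens(frame_a, frame_b)
restored_b = tokens[index_b]     # "unmerge": look up the shared token for each frame_b token
```

In this toy case six tokens shrink to four, so attention over the merged set touches fewer tokens, and the two frames read the same values wherever their content coincides.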
-
Recently, subject-driven generation has garnered significant interest due to its ability to personalize text-to-image generation. Typical works focus on learning the new subject's private attributes. However, one important fact has been overlooked: a subject is not an isolated new concept but a specialization of a certain category in the pre-trained model. As a result, the subject fails to comprehensively inherit the attributes of its category, causing poor attribute-related generations. In this paper, motivated by object-oriented programming, we model the subject as a derived class whose base class is its semantic category. This modeling enables the subject to inherit public attributes from its category while learning its private attributes from the user-provided example. Specifically, we propose a plug-and-play method, Subject-Derived regularization (SuDe). It constructs the base-derived class modeling by constraining the subject-driven generated images to semantically belong to the subject's category. Extensive experiments under three baselines and two backbones on various subjects show that our SuDe enables imaginative attribute-related generations while maintaining subject fidelity. For the code, please refer to FaceChain: https://github.com/modelscope/facechain.
-
When deploying segmentation models in practice, it is critical to evaluate their behaviors in varied and complex scenes. Unlike previous evaluation paradigms, which consider only global attribute variations (e.g., adverse weather), we investigate both local and global attribute variations for robustness evaluation. To achieve this, we construct a mask-preserved attribute editing pipeline to edit visual attributes of real images with precise control of structural information. Therefore, the original segmentation labels can be reused for the edited images. Using our pipeline, we construct a benchmark covering both object and image attributes (e.g., color, material, pattern, style). We evaluate a broad variety of semantic segmentation models, spanning from conventional close-set models to recent open-vocabulary large models, on their robustness to different types of variations. We find that both local and global attribute variations affect segmentation performance, and the sensitivity of models diverges across different variation types. We argue that local attributes have the same importance as global attributes and should be considered in the robustness evaluation of segmentation models. Code: https://github.com/PRIS-CV/Pascal-EA.
-
Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training we redesign the network layers to preserve activation weight and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81 achieved using fast deterministic sampling. As an independent contribution we present a method for setting the exponential moving average (EMA) parameters post-hoc i.e. after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs and reveals its surprising interactions with network architecture training time and guidance.
-
We propose a hierarchical correlation clustering method that extends the well-known correlation clustering to produce hierarchical clusters applicable to both positive and negative pairwise dissimilarities. Then in the following we study unsupervised representation learning with such hierarchical correlation clustering. For this purpose we first investigate embedding the respective hierarchy to be used for tree preserving embedding and feature extraction. Thereafter we study the extension of minimax distance measures to correlation clustering as another representation learning paradigm. Finally we demonstrate the performance of our methods on several datasets.
-
Given a clothing image and a person image an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image. In this work we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task. The main challenge is to preserve the clothing details while effectively utilizing the robust generative capability of the pre-trained model. In order to tackle these issues we propose StableVITON learning the semantic correspondence between the clothing and the human body within the latent space of the pre-trained diffusion model in an end-to-end manner. Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process. Through our proposed novel attention total variation loss and applying augmentation we achieve the sharp attention map resulting in a more precise representation of clothing details. StableVITON outperforms the baselines in qualitative and quantitative evaluation showing promising quality in arbitrary person images. Our code is available at https://github.com/rlawjdghek/StableVITON.
-
Stable Diffusion has established itself as a foundation model in generative AI artistic applications receiving widespread research and application. Some recent fine-tuning methods have made it feasible for individuals to implant personalized concepts onto the basic Stable Diffusion model with minimal computational costs on small datasets. However these innovations have also given rise to issues like facial privacy forgery and artistic copyright infringement. In recent studies researchers have explored the addition of imperceptible adversarial perturbations to images to prevent potential unauthorized exploitation and infringements when personal data is used for fine-tuning Stable Diffusion. Although these studies have demonstrated the ability to protect images it is essential to consider that these methods may not be entirely applicable in real-world scenarios. In this paper we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore we introduce a purification method capable of removing protected perturbations while preserving the original image structure to the greatest extent possible. Experiments reveal that Stable Diffusion can effectively learn from purified images over all protective methods.
-
Despite the remarkable progress of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system necessitating only a one-minute video clip of an individual for training, subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrarily long temporal video, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: https://github.com/ICTMCG/Make-Your-Anchor.
-
Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models however passively absorb sensory data as inputs lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area we propose MultiPLY a multisensory embodied large language model that could incorporate multisensory interactive data including visual audio tactile and thermal information into large language models thereby establishing the correlation among words actions and percepts. To this end we first collect Multisensory Universe a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes certain actions within the environment as well as state tokens that represent the multisensory state observations of the agent at each time step. In the inference time MultiPLY could generate action tokens instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval tool use multisensory captioning and task decomposition.
-
The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper to overcome this limitation we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal we propose an iterative object identification (IOI) module which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL
-
Recent works in implicit representations such as Neural Radiance Fields (NeRF) have advanced the generation of realistic and animatable head avatars from video sequences. These implicit methods are still confronted by visual artifacts and jitters since the lack of explicit geometric constraints poses a fundamental challenge in accurately modeling complex facial deformations. In this paper we introduce Dynamic Tetrahedra (DynTet) a novel hybrid representation that encodes explicit dynamic meshes by neural networks to ensure geometric consistency across various motions and viewpoints. DynTet is parameterized by the coordinate-based networks which learn signed distance deformation and material texture anchoring the training data into a predefined tetrahedra grid. Leveraging Marching Tetrahedra DynTet efficiently decodes textured meshes with a consistent topology enabling fast rendering through a differentiable rasterizer and supervision via a pixel loss. To enhance training efficiency we incorporate classical 3D Morphable Models to facilitate geometry learning and define a canonical space for simplifying texture learning. These advantages are readily achievable owing to the effective geometric representation employed in DynTet. Compared with prior works DynTet demonstrates significant improvements in fidelity lip synchronization and real-time performance according to various metrics. Beyond producing stable and visually appealing synthesis videos our method also outputs the dynamic meshes which is promising to enable many emerging applications. Code is available at https://github.com/zhangzc21/DynTet.
-
Unsupervised (US) video anomaly detection (VAD) in surveillance applications is gaining more popularity lately due to its practical real-world applications. Due to the extremely challenging nature of this task, where learning is carried out without any annotations, privacy-critical collaborative learning of US-VAD systems has not been studied yet. As surveillance videos are privacy sensitive and the availability of large-scale video data may enable better US-VAD systems, collaborative learning can be highly rewarding in this setting. In this paper, we propose a new baseline for anomaly detection capable of localizing anomalous events in complex surveillance scenarios in a fully unsupervised fashion, without any labels, on a privacy-retaining participant-based distributed training configuration. Additionally, we propose three new evaluation protocols to extensively evaluate anomaly detection approaches on various scenarios of collaborations and data availability. Moreover, based on these protocols, we modify existing VAD datasets to extensively evaluate our approach as well as existing US SOTA methods on two large-scale datasets, UCF-Crime and XD-Violence. All proposed evaluation protocols, dataset splits, and code are available here: https://github.com/AnasEmad11/CLAP.
-
Crowd counting has achieved significant progress by training regressors to predict instance positions. In heavily crowded scenarios, however, regressors are challenged by uncontrollable annotation variance, which causes density map bias and context information inaccuracy. In this study, we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, solving bias and inaccuracy caused by annotation variance while distinguishing foreground from background. Specifically, mPrompt leverages point annotations to tune the segmenter and predict pseudo head masks by way of point prompt learning. It then uses the predicted segmentation masks, which serve as a spatial constraint, to rectify biased point annotations as context prompt learning. mPrompt defines a way of mutual information maximization from prompt learning, mitigating the impact of annotation variance while improving model accuracy. Experiments show that mPrompt significantly reduces the Mean Average Error (MAE), demonstrating the potential to be a general framework for down-stream vision tasks. Code is available at https://github.com/csguomy/mPrompt.
-
The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important, as they indicate the nuances in driving behavior that may be safety critical, such as behaviors near a stop sign or parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse Lidar point clouds, static objects might appear to be moving - the so-called swimming effect. This intertwines with the true object motion, thereby posing ambiguity in accurate estimation, especially for subtle motion. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue and mitigate the impact of swimming artifacts. The occupancy completion is learned in an end-to-end fashion together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method's specialized treatment of subtle motion.
-
Novel view synthesis is attractive for social media but it often contains unwanted details such as personal information that needs to be edited out for a better experience. Multiplane image (MPI) is desirable for social media because of its generality but it is complex and computationally expensive making object removal challenging. To address these challenges we propose CORE-MPI which employs embedding images to improve the consistency and accessibility of MPI object removal. CORE-MPI allows for real-time transmission and interaction with embedding images on social media facilitating object removal with a single mask. However recovering the geometric information hidden in the embedding images is a significant challenge. Therefore we propose a dual-network approach where one network focuses on color restoration and the other on inpainting the embedding image including geometric information. For the training of CORE-MPI we introduce a pseudo-reference loss aimed at proficient color recovery even in complex scenes or with large masks. Furthermore we present a disparity consistency loss to preserve the geometric consistency of the inpainted region. We demonstrate the effectiveness of CORE-MPI on RealEstate10K and UCSD datasets.
-
In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance fields (NeRF) based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting provides a new representation of the 3D scene, building upon which the 3D geometry could be exploited in learning the complex 3D deformation. Specifically, the scenes are represented as a collection of 3D Gaussians, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them in learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets prove the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/.
-
Wi-Fi signals, in contrast to cameras, offer privacy protection and occlusion resilience for some practical scenarios such as smart homes, elderly care, and virtual reality. Recent years have seen remarkable progress in the estimation of single-person 2D pose, single-person 3D pose, and multi-person 2D pose. This paper takes a step forward by introducing Person-in-WiFi 3D, a pioneering Wi-Fi system that accomplishes multi-person 3D pose estimation. Person-in-WiFi 3D has two main updates. Firstly, it has a greater number of Wi-Fi devices to enhance the capability for capturing spatial reflections from multiple individuals. Secondly, it leverages the Transformer for end-to-end estimation. Compared to its predecessor, Person-in-WiFi 3D is storage-efficient and fast. We deployed a proof-of-concept system in 4 m x 3.5 m areas and collected a dataset of over 97K frames with seven volunteers. Person-in-WiFi 3D attains 3D joint localization errors of 91.7 mm (1-person), 108.1 mm (2-person), and 125.3 mm (3-person), comparable to cameras and millimeter-wave radars.
-
Real-world systems often encounter new data over time which leads to experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods tend to apply computationally heavy and memory-intensive backpropagation-based approaches to handle this. Here we propose a novel method that uses a backpropagation-free approach for TTA for the specific case of 3D data. Our model uses a two-stream architecture to maintain knowledge about the source domain as well as complementary target-domain-specific information. The backpropagation-free property of our model helps address the well-known forgetting problem and mitigates the error accumulation issue. The proposed method also eliminates the need for the usually noisy process of pseudo-labeling and reliance on costly self-supervised training. Moreover our method leverages subspace learning effectively reducing the distribution variance between the two domains. Furthermore the source-domain-specific and the target-domain-specific streams are aligned using a novel entropy-based adaptive fusion strategy. Extensive experiments on popular benchmarks demonstrate the effectiveness of our method. The code will be available at https://github.com/abie-e/BFTT3D.
-
With the recent advances in vision transformers and large language models (LLMs) finetuning costly large models on downstream learning tasks poses significant challenges under limited computational resources. This paper presents a REsource and ComputAtion-efficient Pruning framework (RECAP) for the finetuning of transformer-based large models. RECAP by design bridges the gap between efficiency and performance through an iterative process cycling between pruning finetuning and updating stages to explore different chunks of the given large-scale model. At each iteration we first prune the model with Taylor-approximation-based importance estimation and then only update a subset of the pruned model weights based on the Fisher-information criterion. In this way RECAP achieves two synergistic and yet conflicting goals: reducing the GPU memory footprint while maintaining model performance unlike most existing pruning methods that require the model to be finetuned beforehand for better preservation of model performance. We perform extensive experiments with a wide range of large transformer-based architectures on various computer vision and natural language understanding tasks. Compared to recent pruning techniques we demonstrate that RECAP offers significant improvements in GPU memory efficiency capable of reducing the footprint by up to 65%.
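To make the pruning step more concrete, here is a minimal sketch of first-order Taylor importance scoring followed by keep-top-k pruning. The abstract does not give RECAP's exact formulation, so the score used here (the absolute value of the sum of weight times gradient within each group), the grouping of weights, and the helper names `taylor_importance` and `prune_mask` are assumptions for illustration only.

```python
def taylor_importance(weights, grads):
    """First-order Taylor score per weight group: |sum_k w_k * g_k|.

    `weights` and `grads` are lists of groups (e.g. one list per channel or head);
    the score estimates how much the loss changes if the whole group is zeroed.
    """
    return [abs(sum(w * g for w, g in zip(ws, gs))) for ws, gs in zip(weights, grads)]


def prune_mask(scores, keep_ratio):
    """Boolean mask keeping the highest-scoring fraction of groups (at least one)."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(scores))]


# Toy example: four weight groups with their gradients from one backward pass.
weights = [[1.0, 2.0], [0.1, 0.1], [3.0, -1.0], [0.5, 0.5]]
grads = [[0.5, 0.5], [1.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
scores = taylor_importance(weights, grads)
mask = prune_mask(scores, keep_ratio=0.5)
```

In an iterative scheme like the one described, a mask of this kind would be recomputed at each cycle, and only a subset of the surviving weights would then be updated.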
-
RAW images are rarely shared, mainly due to their excessive data size compared to their sRGB counterparts obtained by camera ISPs. Learning the forward and inverse processes of camera ISPs has been recently demonstrated, enabling physically-meaningful RAW-level image processing on input sRGB images. However, existing learning-based ISP methods fail to handle the large variations in the ISP processes with respect to camera parameters such as ISO and exposure time, and have limitations when used for various applications. In this paper, we propose ParamISP, a learning-based method for forward and inverse conversion between sRGB and RAW images that adopts a novel neural-network module, dubbed ParamNet, to utilize camera parameters. Given the camera parameters provided in the EXIF data, ParamNet converts them into a feature vector to control the ISP networks. Extensive experiments demonstrate that ParamISP achieves superior RAW and sRGB reconstruction results compared to previous methods, and it can be effectively used for a variety of applications such as deblurring dataset synthesis, raw deblurring, HDR reconstruction, and camera-to-camera transfer.
-
Diffusion models (DMs) usher in a new era of generative modeling and offer more opportunities for efficiently generating high-quality and realistic data samples. However, their widespread use has also brought forth new challenges in model security, which motivates the creation of more effective adversarial attackers on DMs to understand their vulnerability. We propose CAAT, a simple but generic and efficient approach that does not require costly training to effectively fool latent diffusion models (LDMs). The approach is based on the observation that cross-attention layers exhibit higher sensitivity to gradient change, allowing for leveraging subtle perturbations on published images to significantly corrupt the generated images. We show that a subtle perturbation on an image can significantly impact the cross-attention layers, thus changing the mapping between text and image during the fine-tuning of customized diffusion models. Extensive experiments demonstrate that CAAT is compatible with diverse diffusion models and outperforms baseline attack methods in a more effective (more noise) and efficient (twice as fast as Anti-DreamBooth and Mist) manner.
-
In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models, enhancing them for video editing applications. Our approach centers on the concept of anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames, ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses limitations of previous models, including memory and processing speed, but also improves temporal consistency through a unique data augmentation strategy. This strategy renders the model equivariant to affine transformations in both source and target images. Remarkably efficient, Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, outpacing prior works by at least 44x. A comprehensive user study involving 1000 generated samples confirms that our approach delivers superior quality, decisively outperforming established methods.
-
Current instruction-based image editing methods such as InstructPix2Pix often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this this paper introduces SmartEdit a novel approach of instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. However direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this we propose a Bidirectional Interaction Module (BIM) that enables comprehensive bidirectional information interactions between the input image and the MLLM output. During training we initially incorporate perception data to boost the perception and understanding capabilities of diffusion models. Subsequently we demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset Reason-Edit specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods paving the way for the practical application of complex instruction-based image editing.
-
The data bottleneck has emerged as a fundamental challenge in learning based image restoration methods. Researchers have attempted to generate synthesized training data using paired or unpaired samples to address this challenge. This study proposes SeNM-VAE a semi-supervised noise modeling method that leverages both paired and unpaired datasets to generate realistic degraded data. Our approach is based on modeling the conditional distribution of degraded and clean images with a specially designed graphical model. Under the variational inference framework we develop an objective function for handling both paired and unpaired data. We employ our method to generate paired training samples for real-world image denoising and super-resolution tasks. Our approach excels in the quality of synthetic degraded images compared to other unpaired and paired noise modeling methods. Furthermore our approach demonstrates remarkable performance in downstream image restoration tasks even with limited paired data. With more paired data our method achieves the best performance on the SIDD dataset.
-
Recent advancements have shown the potential of leveraging both point clouds and images to localize anomalies. Nevertheless their applicability in industrial manufacturing is often constrained by significant drawbacks such as the use of memory banks which leads to a substantial increase in terms of memory footprint and inference times. We propose a novel light and fast framework that learns to map features from one modality to the other on nominal samples and detect anomalies by pinpointing inconsistencies between observed and mapped features. Extensive experiments show that our approach achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset while achieving faster inference and occupying less memory than previous multimodal AD methods. Furthermore we propose a layer pruning technique to improve memory and time efficiency with a marginal sacrifice in performance.
-
Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we first study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. Then we devise effective solutions for addressing both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with sigmoid loss to address such a requirement. We show very large gains over the current state-of-the-art for both image recognition (+6% on average over 11 datasets) and image retrieval (+19% on Flickr30k and +15% on MSCOCO).
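The reason a sigmoid loss suits training with multiple true positive pairs is that it scores every image-caption pair independently, so one image may have several captions labelled positive at once. The sketch below is a generic pairwise sigmoid loss, not the authors' exact objective; the function name `sigmoid_pair_loss` and the `temperature` and `bias` defaults are assumptions.

```python
import math


def sigmoid_pair_loss(sim, positives, temperature=10.0, bias=0.0):
    """Mean pairwise sigmoid loss over an image-caption similarity matrix.

    sim[i][j] is the similarity of image i and caption j; positives[i][j] is True
    for a matching pair. A row may contain any number of positives.
    """
    total = 0.0
    count = 0
    for i, row in enumerate(sim):
        for j, s in enumerate(row):
            label = 1.0 if positives[i][j] else -1.0
            z = -label * (temperature * s + bias)
            # -log(sigmoid(-z)) = log(1 + exp(z)), evaluated in a numerically stable way.
            total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Image 0 matches both captions; a softmax loss would have to treat one as a negative.
sim = [[1.0, 0.9], [0.0, 1.0]]
multi_positive = [[True, True], [False, True]]
single_positive = [[True, False], [False, True]]
loss_multi = sigmoid_pair_loss(sim, multi_positive)
loss_single = sigmoid_pair_loss(sim, single_positive)
```

With these toy similarities, labelling the second caption of image 0 as a positive gives a lower loss than forcing it to be a negative, which is the behaviour the multi-positive setting relies on.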
-
Researchers in natural science need reliable methods for quantifying animal behavior. Recently, numerous computer vision methods emerged to automate the process. However, observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage. Event cameras offer unique advantages for battery-dependent remote monitoring due to their low power consumption and high dynamic range capabilities. We use this novel sensor to quantify a behavior in Chinstrap penguins called ecstatic display. We formulate the problem as a temporal action detection task, determining the start and end times of the behavior. For this purpose, we recorded a colony of breeding penguins in Antarctica for several weeks and labeled event data on 16 nests. The developed method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them. The experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection, reaching a mean average precision (mAP) of 58% (which increases to 63% in good weather conditions). The results also demonstrate the robustness against various lighting conditions contained in the challenging dataset. The low-power capabilities of the event camera allow it to record significantly longer than a conventional camera. This work pioneers the use of event cameras for remote wildlife observation, opening new interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/
-
Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can often be recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
-
We present DiSR-NeRF a diffusion-guided framework for view-consistent super-resolution (SR) NeRF. Unlike prior works we circumvent the requirement for high-resolution (HR) reference images by leveraging existing powerful 2D super-resolution models. Nonetheless independent SR 2D images are often inconsistent across different views. We thus propose Iterative 3D Synchronization (I3DS) to mitigate the inconsistency problem via the inherent multi-view consistency property of NeRF. Specifically our I3DS alternates between upscaling low-resolution (LR) rendered images with diffusion models and updating the underlying 3D representation with standard NeRF training. We further introduce Renoised Score Distillation (RSD) a novel score-distillation objective for 2D image resolution. Our RSD combines features from ancestral sampling and Score Distillation Sampling (SDS) to generate sharp images that are also LR-consistent. Qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our DiSR-NeRF can achieve better results on NeRF super-resolution compared with existing works. Code and video results available at the project website.
-
Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this paper, we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placing a sub-millimeter thick diffraction grating film in front of the projector. This configuration enables dispersing structured light based on light wavelength. To utilize the dispersed structured light, we devise a model for dispersive projection image formation and a per-pixel hyperspectral 3D reconstruction method. We validate DSL by instantiating a compact experimental prototype. DSL achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth error of 1mm, outperforming prior work on practical hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D imaging for diverse application domains, including computer vision and graphics, cultural heritage, geology, and biology.
-
Crowd counting is a fundamental problem in crowd analysis, which is typically accomplished by estimating a crowd density map and summing over the density values. However, this approach suffers from background noise accumulation and loss of density due to the use of broad Gaussian kernels to create the ground truth density maps. This issue can be overcome by narrowing the Gaussian kernel. However, existing approaches perform poorly when trained with ground truth density maps with broad kernels. To deal with this limitation, we propose using conditional diffusion models to predict density maps, as diffusion models show high fidelity to training data during generation. With that, we present CrowdDiff, which generates the crowd density map as a reverse diffusion process. Furthermore, as the intermediate time steps of the diffusion process are noisy, we incorporate a regression branch for direct crowd estimation only during training to improve the feature learning. In addition, owing to the stochastic nature of the diffusion model, we introduce producing multiple density maps to improve the counting performance, contrary to the existing crowd counting pipelines. We conduct extensive experiments on publicly available datasets to validate the effectiveness of our method. CrowdDiff outperforms existing state-of-the-art crowd counting methods on several public crowd analysis benchmarks with significant improvements. The CrowdDiff project is available at: https://dylran.github.io/crowddiff.github.io.
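The counting-by-density idea in the abstract above can be sketched in a few lines: each annotated head position contributes a Gaussian kernel that integrates to one, and the count is the sum over the map. The grid size, the sigma values, and the per-image kernel normalization are assumptions chosen for illustration, not details taken from the paper.

```python
import math

def density_map(points, height, width, sigma=1.0):
    """Build a density map from (row, col) head annotations.

    Each point adds a Gaussian kernel normalized to sum to 1 over the
    image, so the total mass of the map equals the number of points.
    Points are assumed to lie inside the image.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        kernel = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        norm = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / norm
    return dmap

def count_from_density(dmap):
    """Estimate the crowd count by summing all density values."""
    return sum(sum(row) for row in dmap)

points = [(4, 4), (10, 11)]          # hypothetical head annotations
broad = density_map(points, 16, 16, sigma=3.0)
narrow = density_map(points, 16, 16, sigma=0.5)
count = count_from_density(narrow)   # close to 2.0 by construction
```

A narrow kernel concentrates each person's unit mass in a sharp peak, while a broad kernel spreads it thinly over background pixels, which is the trade-off the abstract refers to.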
-
This paper unravels the potential of sketches for diffusion models addressing the deceptive promise of direct sketch control in generative AI. We importantly democratise the process enabling amateur sketches to generate precise images living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity revealing that deformities in existing models stem from spatial-conditioning. To rectify this we propose an abstraction-aware framework utilising a sketch adapter adaptive time-step sampling and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model working synergistically to reinforce fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple rough sketch akin to what you and I can create suffices! We welcome everyone to examine results presented in the paper and its supplementary. Contributions include democratising sketch control introducing an abstraction-aware framework and leveraging discriminative guidance validated through extensive experiments.
-
This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown success in transfer learning, task-specific sub-architectures are still required to be appended for different downstream tasks, which cannot enjoy the benefits of large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme, the pre-training pretext task and other downstream tasks are modeled as "query-to-answer" problems. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning, GLID maintains the pre-trained encoder-decoder and queries, only replacing the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation, outperforming or matching specialist models such as Mask2Former, DETR, ViTPose, and BinsFormer.
-
Reconstructing a clothed human from a single-view image has several challenging issues including flexibly representing various body shapes and poses estimating complete 3D geometry and consistent texture and achieving more fine-grained details. To address them we propose a new diffusion-based Fourier occupancy field method to improve the human representing ability and the geometry generating ability. First we estimate the back-view image from the given reference image by incorporating a style consistency constraint. Then we extract multi-scale features of the two images as conditional and design a diffusion model to generate the Fourier occupancy field in the wavelet domain. We refine the initial estimated Fourier occupancy field with image features as conditions to improve the geometric accuracy. Finally the reference and estimated back-view images are mapped onto the human model creating a textured clothed human model. Substantial experiments are conducted and the experimental results show that our method outperforms the state-of-the-art methods in geometry and texture reconstruction performance.
-
Text-to-image diffusion models have remarkably excelled in producing diverse, high-quality, and photo-realistic images. This advancement has spurred a growing interest in incorporating specific identities into generated content. Most current methods employ an inversion approach to embed a target visual concept into the text embedding space using a single reference image. However, the newly synthesized faces either closely resemble the reference image in terms of facial attributes, such as expression, or exhibit a reduced capacity for identity preservation. Text descriptions intended to guide the facial attributes of the synthesized face may fall short, owing to the intricate entanglement of identity information with identity-irrelevant facial attributes derived from the reference image. To address these issues, we present the novel use of the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve enhanced identity preservation and disentanglement for diffusion models. By aligning this semantically meaningful human face latent space with text-to-image diffusion models, we succeed in maintaining high fidelity in identity preservation, coupled with the capacity for semantic editing. Additionally, we propose new training objectives to balance the influences of both prompt and identity conditions, ensuring that the identity-irrelevant background remains negligibly affected during facial attribute modifications. Extensive experiments reveal that our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions in diverse settings. Our code and model are available at https://github.com/csxmli2016/w-plus-adapter.
-
Annotating lots of 3D medical images for training segmentation models is time-consuming. The goal of weakly supervised semantic segmentation is to train segmentation models without using any ground truth segmentation masks. Our work addresses the case where only image-level categorical labels indicating the presence or absence of a particular region of interest (such as tumours or lesions) are available. Most existing methods rely on class activation mapping (CAM). We propose a novel approach ToNNO which is based on the Tomographic reconstruction of a Neural Network's Output. Our technique extracts stacks of slices with different angles from the input 3D volume feeds these slices to a 2D encoder and applies the inverse Radon transform in order to reconstruct a 3D heatmap of the encoder's predictions. This generic method allows to perform dense prediction tasks on 3D volumes using any 2D image encoder. We apply it to weakly supervised medical image segmentation by training the 2D encoder to output high values for slices containing the regions of interest. We test it on four large scale medical image datasets and outperform 2D CAM methods. We then extend ToNNO by combining tomographic reconstruction with CAM methods proposing Averaged CAM and Tomographic CAM which obtain even better results.
-
In the context of autonomous navigation of terrestrial robots the creation of realistic models for agent dynamics and sensing is a widespread habit in the robotics literature and in commercial applications where they are used for model based control and/or for localization and mapping. The more recent Embodied AI literature on the other hand focuses on modular or end-to-end agents trained in simulators like Habitat or AI-Thor where the emphasis is put on photo-realistic rendering and scene diversity but high-fidelity robot motion is assigned a less privileged role. The resulting sim2real gap significantly impacts transfer of the trained models to real robotic platforms. In this work we explore end-to-end training of agents in simulation in settings which minimize the sim2real gap both in sensing and in actuation. Our agent directly predicts (discretized) velocity commands which are maintained through closed-loop control in the real robot. The behavior of the real robot (including the underlying low-level controller) is identified and simulated in a modified Habitat simulator. Noise models for odometry and localization further contribute in lowering the sim2real gap. We evaluate on real navigation scenarios explore different localization and point goal calculation methods and report significant gains in performance and robustness compared to prior work.
-
Recently convolutional neural networks (CNNs) with large size kernels have attracted much attention in the computer vision field following the success of the Vision Transformers. Large kernel CNNs have been reported to perform well in downstream vision tasks as well as in classification performance. The reason for the high-performance of large kernel CNNs in downstream tasks has been attributed to the large effective receptive field (ERF) produced by large size kernels but this view has not been fully tested. We therefore revisit the performance of large kernel CNNs in downstream task focusing on the weakly supervised object localization (WSOL) task. WSOL a difficult downstream task that is not fully supervised provides a new angle to explore the capabilities of the large kernel CNNs. Our study compares the modern large kernel CNNs ConvNeXt RepLKNet and SLaK to test the validity of the naive expectation that ERF size is important for improving downstream task performance. Our analysis of the factors contributing to high performance provides a different perspective in which the main factor is feature map improvement. Furthermore we find that modern CNNs are robust to the CAM problems of local regions of objects being activated which has long been discussed in WSOL. CAM is the most classic WSOL method but because of the above-mentioned problems it is often used as a baseline method for comparison. However experiments on the CUB-200-2011 dataset show that simply combining a large kernel CNN CAM and simple data augmentation methods can achieve performance (90.99% MaxBoxAcc) comparable to the latest WSOL method which is CNN-based and requires special training or complex post-processing.
-
Knowledge distillation is an effective method for training small and efficient deep learning models. However the efficacy of a single method can degenerate when transferring to other tasks modalities or even other architectures. To address this limitation we propose a novel constrained feature distillation method. This method is derived from a small set of core principles which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method we apply it to object detection and image generation whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available.
-
We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those, it interacts with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: hkchengrex.github.io/Cutie
-
While there has been significant progress in customizing text-to-image generation models generating images that combine multiple personalized concepts remains challenging. In this work we introduce Concept Weaver a method for composing customized text-to-image diffusion models at inference time. Specifically the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.
-
High-quality human reconstruction and photo-realistic rendering of a dynamic scene is a long-standing problem in computer vision and graphics. Despite considerable efforts invested in developing various capture systems and reconstruction algorithms, recent advancements still struggle with loose or oversized clothing and overly complex poses. In part, this is due to the challenges of acquiring high-quality human datasets. To facilitate the development of these fields, in this paper we present PKU-DyMVHumans, a versatile human-centric dataset for high-fidelity reconstruction and rendering of dynamic human scenarios from dense multi-view videos. It comprises 8.2 million frames captured by more than 56 synchronized cameras across diverse scenarios. These sequences comprise 32 human subjects across 45 different scenarios, each with a highly detailed appearance and realistic human motion. Inspired by recent advancements in neural radiance field (NeRF)-based scene representations, we carefully set up an off-the-shelf framework that makes it easy to run state-of-the-art NeRF-based implementations and benchmark them on the PKU-DyMVHumans dataset. It paves the way for various applications, like fine-grained foreground/background decomposition, high-quality human reconstruction, and photo-realistic novel view synthesis of a dynamic scene. Extensive studies are performed on the benchmark, demonstrating new observations and challenges that emerge from using such high-fidelity dynamic data. The project page and data are available at: https://pku-dymvhumans.github.io.
-
Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting novel categories from a distinct domain using only limited exemplars. In this paper we undertake a comprehensive study of CD-FSS and uncover two crucial insights: (i) the necessity of a fine-tuning stage to effectively transfer the learned meta-knowledge across domains and (ii) the overfitting risk during the naive fine-tuning due to the scarcity of novel category examples. With these insights we propose a novel cross-domain fine-tuning strategy that addresses the challenging CD-FSS tasks. We first design Bi-directional Few-shot Prediction (BFP) which establishes support-query correspondence in a bi-directional manner crafting augmented supervision to reduce the overfitting risk. Then we further extend BFP into Iterative Few-shot Adaptor (IFA) which is a recursive framework to capture the support-query correspondence iteratively targeting maximal exploitation of supervisory signals from the sparse novel category samples. Extensive empirical evaluations show that our method significantly outperforms the state-of-the-arts (+7.8%) which verifies that IFA tackles the cross-domain challenges and mitigates the overfitting simultaneously. The code is available at: https://github.com/niejiahao1998/IFA.
-
Deep neural networks have demonstrated remarkable performance in point cloud classification. However previous works show they are vulnerable to adversarial perturbations that can manipulate their predictions. Given the distinctive modality of point clouds various attack strategies have emerged posing challenges for existing defenses to achieve effective generalization. In this study we for the first time introduce causal modeling to enhance the robustness of point cloud classification models. Our insight is from the observation that adversarial examples closely resemble benign point clouds from the human perspective. In our causal modeling we incorporate two critical variables the structural information (standing for the key feature leading to the classification) and the hidden confounders (standing for the noise interfering with the classification). The resulting overall framework CausalPC consists of three sub-modules to identify the causal effect for robust classification. The framework is model-agnostic and adaptable for integration with various point cloud classifiers. Our approach significantly improves the adversarial robustness of three mainstream point cloud classification models on two benchmark datasets. For instance the classification accuracy for DGCNN on ModelNet40 increases from 29.2% to 72.0% with CausalPC whereas the best-performing baseline achieves only 42.4%.
-
Instance shape reconstruction from a 3D scene involves recovering the full geometries of multiple objects at the semantic instance level. Many methods leverage data-driven learning due to the intricacies of scene complexity and significant indoor occlusions. Training these methods often requires a large-scale, high-quality dataset with aligned and paired shape annotations with real-world scans. Existing datasets are either synthetic or misaligned, restricting the performance of data-driven methods on real data. To this end, we introduce LASA, a Large-scale Aligned Shape Annotation Dataset comprising 10,412 high-quality CAD annotations aligned with 920 real-world scene scans from ArkitScenes, created manually by professional artists. On top of this, we propose a novel Diffusion-based Cross-Modal Shape Reconstruction (DisCo) method. It is empowered by a hybrid feature aggregation design to fuse multi-modal inputs and recover high-fidelity object geometries. Besides, we present an Occupancy-Guided 3D Object Detection (OccGOD) method and demonstrate that our shape annotations provide scene occupancy clues that can further improve 3D object detection. Supported by LASA, extensive experiments show that our methods achieve state-of-the-art performance in both instance-level scene reconstruction and 3D object detection tasks.
-
The evolution of Diffusion Models has dramatically improved image generation quality making it increasingly difficult to differentiate between real and generated images. This development while impressive also raises significant privacy and security concerns. In response to this we propose a novel Latent REconstruction error guided feature REfinement method (LaRE^2) for detecting the diffusion-generated images. We come up with the Latent Reconstruction Error (LaRE) the first reconstruction-error based feature in the latent space for generated image detection. LaRE surpasses existing methods in terms of feature extraction efficiency while preserving crucial cues required to differentiate between the real and the fake. To exploit LaRE we propose an Error-Guided feature REfinement module (EGRE) which can refine the image feature guided by LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an align-then-refine mechanism which effectively refines the image feature for generated-image detection from both spatial and channel perspectives. Extensive experiments on the large-scale GenImage benchmark demonstrate the superiority of our LaRE^2 which surpasses the best SoTA method by up to 11.9%/12.1% average ACC/AP across 8 different image generators. LaRE also surpasses existing methods in terms of feature extraction cost delivering an impressive speed enhancement of 8 times.
-
This paper endeavors to advance the precision of snapshot compressive imaging (SCI) reconstruction for multispectral images (MSIs). To achieve this, we integrate the advantageous attributes of established SCI techniques and an image generative model, and propose a novel structured zero-shot diffusion model, dubbed DiffSCI. DiffSCI leverages the structural insights from the deep prior and optimization-based methodologies, complemented by the generative capabilities offered by the contemporary denoising diffusion model. Specifically, first, we employ a pre-trained diffusion model, which has been trained on a substantial corpus of RGB images, as the generative denoiser within the Plug-and-Play framework for the first time. This integration allows for the successful completion of SCI reconstruction, especially in cases that current methods struggle to address effectively. Second, we systematically account for spectral band correlations and introduce a robust methodology to mitigate wavelength mismatch, thus enabling seamless adaptation of the RGB diffusion model to MSIs. Third, an accelerated algorithm is implemented to expedite the resolution of the data subproblem. This augmentation not only accelerates the convergence rate but also elevates the quality of the reconstruction process. We present extensive testing to show that DiffSCI exhibits discernible performance enhancements over prevailing self-supervised and zero-shot approaches, surpassing even supervised transformer counterparts across both simulated and real datasets. Code is at https://github.com/PAN083/DiffSCI.
-
We propose DiffSHEG a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation. While previous works focused on co-speech gesture or expression generation individually the joint generation of synchronized expressions and gestures remains barely explored. To address this our diffusion-based co-speech motion generation Transformer enables uni-directional information flow from expression to gesture facilitating improved matching of joint expression-gesture distributions. Furthermore we introduce an outpainting-based sampling strategy for arbitrary long sequence generation in diffusion models offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally a user study confirms the superiority of our method over prior approaches. By enabling the real-time generation of expressive and synchronized motions our method showcases its potential for various applications in the development of digital humans and embodied agents.
-
Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script but also through visualizations we propose MeLFusion a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse" which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area we introduce a new dataset MeLBench and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music measured both objectively and subjectively with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic yet relatively under-explored research area.
-
Trajectory prediction is a challenging problem that requires considering interactions among multiple actors and the surrounding environment. While data-driven approaches have been used to address this complex problem they suffer from unreliable predictions under distribution shifts during test time. Accordingly several online learning methods have been proposed using regression loss from the ground truth of observed data leveraging the auto-labeling nature of trajectory prediction task. We mainly tackle the following two issues. First previous works underfit and overfit as they only optimize the last layer of motion decoder. To this end we employ the masked autoencoder (MAE) for representation learning to encourage complex interaction modeling in shifted test distribution for updating deeper layers. Second utilizing the sequential nature of driving data we propose an actor-specific token memory that enables the test-time learning of actor-wise motion characteristics. Our proposed method has been validated across various challenging cross-dataset distribution shift scenarios including nuScenes Lyft Waymo and Interaction. Our method surpasses the performance of existing state-of-the-art online learning methods in terms of both prediction accuracy and computational efficiency. The code is available at https://github.com/daeheepark/T4P.
-
Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, some image-text pairs are inevitably under-correlated or even falsely correlated, a.k.a. noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) that relaxes the conventional Triplet Ranking loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing model collapse under NC while still focusing on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.
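The relaxation described for TAL can be illustrated numerically: a temperature-scaled log-sum-exp over all negative similarities is never smaller than the hardest negative, so the relaxed loss upper-bounds the hardest-negative triplet loss. The sketch below covers one retrieval direction only, and the margin, the temperature `tau`, and the similarity values are assumed for illustration; it is not the authors' implementation.

```python
import math

def hardest_negative_triplet(s_pos, s_negs, margin=0.1):
    """Conventional triplet ranking loss using only the hardest negative."""
    return max(0.0, margin - s_pos + max(s_negs))

def log_exp_triplet(s_pos, s_negs, margin=0.1, tau=0.015):
    """Relaxed loss: replace max over negatives with tau * logsumexp(s / tau).

    Written in the shifted form so that exp() never overflows.
    """
    m = max(s_negs)
    lse = m + tau * math.log(sum(math.exp((s - m) / tau) for s in s_negs))
    return max(0.0, margin - s_pos + lse)

# Hypothetical similarities for one anchor image.
s_pos = 0.35
s_negs = [0.30, 0.20, 0.10]
hard = hardest_negative_triplet(s_pos, s_negs)
relaxed = log_exp_triplet(s_pos, s_negs)
```

With a small `tau` the relaxed value stays close to the hardest-negative value, while every negative still receives a nonzero gradient weight, which is one way to read the abstract's claim that the loss keeps its focus on hard negatives.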
-
In this paper, we present a novel paradigm to enhance the ability of object detectors, e.g., expanding categories or improving detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained generative diffusion model to augment it with the ability of localising instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that this enhanced version of the diffusion model, termed InstaGen, can serve as a data synthesizer to enhance object detectors by training on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios.
-
We introduce the Panoptic 3D Reconstruction task, a unified and holistic scene understanding task for a monocular video, and we present PanoRecon, a novel framework to address this new task, which realizes online geometry reconstruction along with dense semantic and instance labeling. Specifically, PanoRecon incrementally performs panoptic 3D reconstruction for each video fragment, consisting of multiple consecutive key frames, from a volumetric feature representation using feed-forward neural networks. We adopt a depth-guided back-projection strategy to sparsify and purify the volumetric feature representation. We further introduce a voxel clustering module to get object instances in each local fragment, and then design a tracking and fusion algorithm for the integration of instances from different fragments to ensure temporal coherence. This design enables PanoRecon to yield a coherent and accurate panoptic 3D reconstruction. Experiments on ScanNetV2 demonstrate very competitive geometry reconstruction results compared with state-of-the-art reconstruction methods, as well as promising 3D panoptic segmentation results with only RGB input, while running in real time. Code is available at: https://github.com/Riser6/PanoRecon.
-
We present the pioneering Large Visual Motion Model (LVMM) meticulously engineered to analyze the intrinsic dynamics encapsulated within real-world imagery. Our model fortified with a wealth of prior knowledge extracted from billions of image pairs demonstrates promising results in predicting a diverse spectrum of scene dynamics. As a result it can infuse any generic image with authentic dynamic effects enhancing its visual allure.
-
In contrast to extensive studies on general vision, pre-training for scalable visual autonomous driving remains seldom explored. Visual autonomous driving applications require features encompassing semantics, 3D geometry, and temporal information simultaneously for joint perception, prediction, and planning, posing dramatic challenges for pre-training. To resolve this, we introduce a new pre-training task termed visual point cloud forecasting: predicting future point clouds from historical visual input. The key merit of this task is that it captures the synergic learning of semantics, 3D structures, and temporal dynamics; hence it shows superiority in various downstream tasks. To cope with this new problem, we present ViDAR, a general model to pre-train downstream visual encoders. It first extracts historical embeddings with the encoder. These representations are then transformed to 3D geometric space via a novel Latent Rendering operator for future point cloud prediction. Experiments show significant gains in downstream tasks, e.g., 3.1% NDS on 3D detection, 10% error reduction on motion forecasting, and 15% lower collision rate on planning.
-
Compared with transferable untargeted attacks transferable targeted adversarial attacks could specify the misclassification categories of adversarial samples posing a greater threat to security-critical tasks. In the meanwhile 3D adversarial samples due to their potential of multi-view robustness can more comprehensively identify weaknesses in existing deep learning systems possessing great application value. However the field of transferable targeted 3D adversarial attacks remains vacant. The goal of this work is to develop a more effective technique that could generate transferable targeted 3D adversarial examples filling the gap in this field. To achieve this goal we design a novel framework named TT3D that could rapidly reconstruct from few multi-view images into Transferable Targeted 3D textured meshes. While existing mesh-based texture optimization methods compute gradients in the high-dimensional mesh space and easily fall into local optima leading to unsatisfactory transferability and distinct distortions TT3D innovatively performs dual optimization towards both feature grid and Multi-layer Perceptron (MLP) parameters in the grid-based NeRF space which significantly enhances black-box transferability while enjoying naturalness. Experimental results show that TT3D not only exhibits superior cross-model transferability but also maintains considerable adaptability across different renders and vision tasks. More importantly we produce 3D adversarial examples with 3D printing techniques in the real world and verify their robust performance under various scenarios.
-
We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework. Drawing on the Cook-Torrance reflectance model we have meticulously configured the architecture design to precisely simulate light-surface interactions. Furthermore to overcome the limitation of scarce high-quality lightstage data we have developed a self-supervised pre-training strategy. This novel combination of accurate physical modeling and expanded training dataset establishes a new benchmark in relighting realism.
-
We present DIRECT-3D a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data limiting them to single or few-class generation our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets mitigating the key challenge (i.e. data scarcity) in large-scale 3D generation. In particular DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically after an initial warm-up phase using a small set of clean data an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input our model generates high-quality high-resolution realistic and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion.
-
Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work we address the lack of reasoning ability by data augmentation. We leverage Large Language Models (LLMs) which have shown to have strong reasoning ability as an automatic data annotator that generates question-answer annotations for chart images. The key innovation in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data generator learns to decompose the complex question into step-by-step sub-questions (rationales) which are then used to derive the final answer using external tools i.e. Python. This step-wise generation procedure is trained on synthetic data generated using a template-based QA generation pipeline. Experimental results highlight the significance of the proposed step-by-step generation. By training with the LLM-augmented data (LAMENDA) we significantly enhance the chart VQA models achieving the state-of-the-art accuracy on the ChartQA and PlotQA datasets. In particular our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset which needs strong reasoning. We hope our work underscores the potential of synthetic data and encourages further exploration of data augmentation using LLMs for reasoning-heavy tasks.
-
Recently leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has been proven very promising. However previous works that employ LLMs/MLLMs for document understanding have not fully explored and utilized the document layout information which is vital for precise document understanding. In this paper we propose LayoutLLM an LLM/MLLM based method for document understanding. The core of LayoutLLM is a layout instruction tuning strategy which is specially designed to enhance the comprehension and utilization of document layouts. The proposed layout instruction tuning strategy consists of two components: Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture the characteristics of document layout in Layout-aware Pre-training three groups of pre-training tasks corresponding to document-level region-level and segment-level information are introduced. Furthermore a novel module called layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on regions relevant to the question and generate accurate answers. LayoutCoT is effective for boosting the performance of document understanding. Meanwhile it brings a certain degree of interpretability which could facilitate manual inspection and correction. Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding.
-
Visual-language foundation models like CLIP learn generalized representations that enable zero-shot open-set classification. Few-shot adaptation methods based on prompt tuning have been shown to further improve performance on downstream datasets. However these methods do not fare well in the taxonomic open set (TOS) setting where the classifier is asked to make prediction from label set across different levels of semantic granularity. Frequently they infer incorrect labels at coarser taxonomic class levels even when the inference at the leaf level (original class labels) is correct. To address this problem we propose a prompt tuning technique that calibrates the hierarchical consistency of model predictions. A set of metrics of hierarchical consistency the Hierarchical Consistent Accuracy (HCA) and the Mean Treecut Accuracy (MTA) are first proposed to evaluate TOS model performance. A new Prompt Tuning for Hierarchical Consistency (ProTeCt) technique is then proposed to calibrate classification across label set granularities. Results show that ProTeCt can be combined with existing prompt tuning methods to significantly improve TOS classification without degrading the leaf level classification performance.
-
Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However they have often been found to be outperformed by other adaptation mechanisms including low-rank adaptation. In this paper we provide an in-depth study of adapters their internal structure as well as various implementation choices. We uncover pitfalls for using adapters and suggest a concrete improved adapter architecture called Adapter+ that not only outperforms previous adapter implementations but surpasses a number of other more complex adaptation mechanisms in several challenging settings. Despite this our suggested adapter is highly robust and unlike previous work requires little to no manual intervention when addressing a novel scenario. Adapter+ reaches state-of-the-art average accuracy on the VTAB benchmark even without a per-task hyperparameter optimization.
-
Featurizing microscopy images for use in biological research remains a significant challenge especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks achieving as much as a 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond.
-
In this paper we delve into the creation of one-shot hand avatars attaining high-fidelity and drivable hand representations swiftly from a single image. With the burgeoning domains of the digital human the need for quick and personalized hand avatar creation has become increasingly critical. Existing techniques typically require extensive input data and may prove cumbersome or even impractical in certain scenarios. To enhance accessibility we present a novel method OHTA (One-shot Hand avaTAr) that enables the creation of detailed hand avatars from merely one image. OHTA tackles the inherent difficulties of this data-limited problem by learning and utilizing data-driven hand priors. Specifically we design a hand prior model initially employed for 1) learning various hand priors with available data and subsequently for 2) the inversion and fitting of the target identity with prior knowledge. OHTA demonstrates the capability to create high-fidelity hand avatars with consistent animatable quality solely relying on a single image. Furthermore we illustrate the versatility of OHTA through diverse applications encompassing text-to-avatar conversion hand editing and identity latent space manipulation.
-
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM generalizes strongly to segmenting anything but falls short in semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically in the order of tens of millions), it costs less computation, less memory usage, and less communication bandwidth, resulting in both fast and scalable training. To address the scarcity of regional caption data, we propose to first pre-train our model on object detection and segmentation tasks. We call this step weak supervision pretraining since the pretraining data only contains category names instead of full-sentence descriptions. The weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics.
-
We investigate a new task in human motion prediction: predicting motions under unexpected physical perturbation, potentially involving multiple people. Compared with existing research, this task involves predicting less controlled, unpremeditated, and purely reactive motions in response to external impact, and how such motions can propagate through people. It brings new challenges such as data scarcity and predicting complex interactions. To this end, we propose a new method capitalizing on differentiable physics and deep neural networks, leading to an explicit Latent Differentiable Physics (LDP) model. Through experiments, we demonstrate that LDP has high data efficiency, outstanding prediction accuracy, strong generalizability, and good explainability. Since there is no similar research, a comprehensive comparison with 11 adapted baselines from several relevant domains is conducted, showing that LDP outperforms existing research both quantitatively and qualitatively, improving prediction accuracy by as much as 70% and demonstrating significantly stronger generalization.
-
Most 3D generation research focuses on up-projecting 2D foundation models into the 3D space either by minimizing 2D Score Distillation Sampling (SDS) loss or fine-tuning on multi-view datasets. Without explicit 3D priors these methods often lead to geometric anomalies and multi-view inconsistency. Recently researchers have attempted to improve the genuineness of 3D objects by directly training on 3D datasets albeit at the cost of low-quality texture generation due to the limited texture diversity in 3D datasets. To harness the advantages of both approaches we propose Bidirectional Diffusion (BiDiff) a unified framework that incorporates both a 3D and a 2D diffusion process to preserve both 3D fidelity and 2D texture richness respectively. Moreover as a simple combination may yield inconsistent generation results we further bridge them with novel bidirectional guidance. In addition our method can be used as an initialization of optimization-based models to further improve the quality of 3D model and efficiency of optimization reducing the process from 3.4 hours to 20 minutes. Experimental results have shown that our model achieves high-quality diverse and scalable 3D generation. Project website https://bidiff.github.io/.
-
3D Scene Graph Generation (3DSGG) aims to classify objects and their predicates within 3D point cloud scenes. However, current 3DSGG methods struggle with two main challenges: 1) the dependency on labor-intensive ground-truth annotations, and 2) closed-set class training, which hampers the recognition of novel objects and predicates. Addressing these issues, our idea is to extract cross-modality features with CLIP from text and image data naturally related to 3D point clouds. Cross-modality features are used to train a robust 3D scene graph (3DSG) feature extractor. Specifically, we propose a novel Cross-Modality Contrastive Learning 3DSGG (CCL-3DSGG) method. Firstly, to align the text with 3DSG, the text is parsed into word-level units that are consistent with the 3DSG annotation. To enhance robustness during the alignment, adjectives are exchanged for different objects as negative samples. Then, to align the image with 3DSG, the camera view is treated as a positive sample and other views as negatives. Lastly, the recognition of novel object and predicate classes is achieved by calculating the cosine similarity between prompts and 3DSG features. Our rigorous experiments confirm the superior open-vocabulary capability and applicability of CCL-3DSGG in real-world contexts.
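The final recognition step, scoring class prompts against 3DSG features by cosine similarity, can be sketched as follows. The three-dimensional embeddings, the class names, and the helper names `cosine_similarity` and `classify` are made-up placeholders for illustration; real prompt embeddings would come from a text encoder such as CLIP and the node feature from the trained 3DSG extractor.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(feature, prompt_embeddings):
    """Return the label whose prompt embedding is most similar to the feature."""
    scores = {
        label: cosine_similarity(feature, emb)
        for label, emb in prompt_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical prompt embeddings; new classes can be added without retraining.
prompt_embeddings = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.1, 0.9, 0.1],
    "lamp": [0.0, 0.2, 0.9],
}
node_feature = [0.2, 0.8, 0.3]  # hypothetical feature of one scene-graph node

label, scores = classify(node_feature, prompt_embeddings)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Because the label set is just a dictionary of prompt embeddings, supporting a novel class in this sketch only requires adding one more entry.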
-
In autonomous driving behavior prediction is fundamental for safe motion planning hence the security and robustness of prediction models against adversarial attacks are of paramount importance. We propose a novel adversarial backdoor attack against trajectory prediction models as a means of studying their potential vulnerabilities. Our attack affects the victim at training time via naturalistic hence stealthy poisoned samples crafted using a novel two-step approach. First the triggers are crafted by perturbing the trajectory of attacking vehicle and then disguised by transforming the scene using a bi-level optimization technique. The proposed attack does not depend on a particular model architecture and operates in a black-box manner thus can be effective without any knowledge of the victim model. We conduct extensive empirical studies using state-of-the-art prediction models on two benchmark datasets using metrics customized for trajectory prediction. We show that the proposed attack is highly effective as it can significantly hinder the performance of prediction models unnoticeable by the victims and efficient as it forces the victim to generate malicious behavior even under constrained conditions. Via ablative studies we analyze the impact of different attack design choices followed by an evaluation of existing defence mechanisms against the proposed attack.
-
Creating and animating 3D biped cartoon characters is crucial and valuable in various applications. Compared with geometry, the diverse texture design plays an important role in making 3D biped cartoon characters vivid and charming. Therefore, we focus on automatic texture design for cartoon characters based on input instructions. This is challenging because of domain-specific requirements and a lack of high-quality data. To address this challenge, we propose Make-It-Vivid, the first attempt to enable high-quality texture generation from text in UV space. We prepare detailed text-texture paired data for 3D characters by using vision-question-answering agents. Then we customize a pretrained text-to-image model to generate texture maps with a template structure while preserving the natural 2D image knowledge. Furthermore, to enhance fine-grained details, we propose a novel adversarial learning scheme to shorten the domain gap between the original dataset and the realistic texture domain. Extensive experiments show that our approach outperforms current texture generation methods, resulting in efficient character texturing and faithful generation with prompts. Besides, we showcase various applications such as out-of-domain generation and texture stylization. We also provide an efficient generation system for automatic text-guided textured character generation and animation.
-
Point cloud filtering is a fundamental 3D vision task which aims to remove noise while recovering the underlying clean surfaces. State-of-the-art methods remove noise by moving noisy points along stochastic trajectories to the clean surfaces. These methods often require regularization within the training objective and/or during post-processing to ensure fidelity. In this paper we introduce StraightPCF a new deep learning based method for point cloud filtering. It works by moving noisy points along straight paths thus reducing discretization errors while ensuring faster convergence to the clean surfaces. We model noisy patches as intermediate states between high noise patch variants and their clean counterparts and design the VelocityModule to infer a constant flow velocity from the former to the latter. This constant flow leads to straight filtering trajectories. In addition we introduce a DistanceModule that scales the straight trajectory using an estimated distance scalar to attain convergence near the clean surface. Our network is lightweight and only has 530K parameters being 17% of IterativePFN (a most recent point cloud filtering network). Extensive experiments on both synthetic and real-world data show our method achieves state-of-the-art results. Our method also demonstrates nice distributions of filtered points without the need for regularization. The implementation code can be found at: https://github.com/ddsediri/StraightPCF.
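Why straight paths reduce discretization error can be seen in a toy sketch. It assumes forward Euler integration over a unit time interval and compares a constant velocity field with a time-varying one that has the same total displacement. The names `integrate`, `straight_flow`, and `time_varying_flow`, and the point coordinates, are hypothetical and do not reproduce the VelocityModule or DistanceModule.

```python
def integrate(point, velocity_fn, n_steps):
    """Forward Euler integration of dx/dt = velocity_fn(x, t) for t in [0, 1]."""
    x = list(point)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noisy = [1.0, 2.0, 0.5]   # hypothetical noisy point
clean = [0.8, 1.9, 0.0]   # hypothetical position on the clean surface
displacement = [c - n for c, n in zip(clean, noisy)]

def straight_flow(x, t):
    """Constant velocity: the path from noisy to clean is a straight line."""
    return displacement

def time_varying_flow(x, t):
    """Speed grows with t; the exact integral over [0, 1] is the same displacement."""
    return [2.0 * t * d for d in displacement]

def error(p):
    """Largest coordinate-wise distance to the clean target."""
    return max(abs(a - b) for a, b in zip(p, clean))

straight_1 = integrate(noisy, straight_flow, 1)
straight_10 = integrate(noisy, straight_flow, 10)
varying_1 = integrate(noisy, time_varying_flow, 1)
varying_10 = integrate(noisy, time_varying_flow, 10)

print("straight, 1 step  :", error(straight_1))
print("straight, 10 steps:", error(straight_10))
print("varying,  1 step  :", error(varying_1))
print("varying,  10 steps:", error(varying_10))
```

With a constant velocity the endpoint is reached up to floating-point rounding regardless of the step count, whereas the time-varying field needs many steps before the Euler estimate approaches the target.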
-
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g. video audio text). For example video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text which comes as a global context e.g. a title or a description. Furthermore video and audio inputs are of much larger volumes and grow as the video length increases which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling dividing it into separate autoregressive models processing the inputs according to the characteristics of the modalities. We propose a multimodal model consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs we further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end we propose a Combiner mechanism which models the audio-video information jointly producing compact but expressive representations. This allows us to scale to 512 input video frames without increase in model parameters. Our approach achieves the state-of-the-art on multiple well established multimodal benchmarks. It effectively addresses the high computational demand of media inputs by learning compact representations controlling the sequence length of the audio-video feature representations and modeling their dependencies in time.
-
Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities. Deep learning methods for SL recognition and translation have achieved promising results. However Sign Language Production (SLP) poses a challenge as the generated motions must be realistic and have precise semantic meaning. Most SLP methods rely on 2D data which hinders their realism. In this work a diffusion-based SLP model is trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through quantitative and qualitative experiments we show that the proposed method considerably outperforms previous methods of SLP. This work makes an important step towards realistic neural sign avatars bridging the communication gap between Deaf and hearing communities.
-
Contemporary machine learning which involves training large neural networks on massive datasets faces significant computational challenges. Dataset distillation as a recent emerging strategy aims to compress real-world datasets for efficient training. However this line of research currently struggles with large-scale and high-resolution datasets hindering its practicality and feasibility. Thus we re-examine existing methods and identify three properties essential for real-world applications: realism diversity and efficiency. As a remedy we propose RDED a novel computationally-efficient yet effective data distillation paradigm to enable both diversity and realism of the distilled data. Extensive empirical results over various model architectures and datasets demonstrate the advancement of RDED: we can distill a dataset to 10 images per class from full ImageNet-1K within 7 minutes achieving a notable 42% accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours). Code: https://github.com/LINs-lab/RDED.
-
Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However most of the previous works neglect the semantic information or rely on human-designed joint-level representations. Here we present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. Then the high-level motion semantics are incorporated into the motion retargeting process by feeding the vision-language model with the rendered images and aligning the extracted semantic embeddings. To ensure the preservation of fine-grained motion details and high-level semantics we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics. Project page can be found at https://sites.google.com/view/smtnet.
-
Class-incremental learning (CIL) aims to enable models to continuously learn new classes while overcoming catastrophic forgetting. The introduction of pre-trained models has brought new tuning paradigms to CIL. In this paper we revisit different parameter-efficient tuning (PET) methods within the context of continual learning. We observe that adapter tuning demonstrates superiority over prompt-based methods even without parameter expansion in each learning session. Motivated by this we propose incrementally tuning the shared adapter without imposing parameter update constraints enhancing the learning capacity of the backbone. Additionally we employ feature sampling from stored prototypes to retrain a unified classifier further improving its performance. We estimate the semantic shift of old prototypes without access to past samples and update stored prototypes session by session. Our proposed method eliminates model expansion and avoids retaining any image samples. It surpasses previous pre-trained model-based CIL methods and demonstrates remarkable continual learning capabilities. Experimental results on five CIL benchmarks validate the effectiveness of our approach achieving state-of-the-art (SOTA) performance.
-
This paper focuses on the high computational complexity in Large Language Models (LLMs), a significant challenge in both natural language processing (NLP) and multi-modal tasks. We propose Low-Rank Approximation for Sparse Attention (LoRA-Sparse), an innovative approach that strategically reduces this complexity. LoRA-Sparse introduces low-rank linear projection layers for sparse attention approximation. It utilizes an order-mimic training methodology, which is crucial for efficiently approximating the self-attention mechanism in LLMs. We empirically show that sparse attention not only reduces computational demands but also enhances model performance in both NLP and multi-modal tasks. This surprisingly shows that redundant attention in LLMs might be non-beneficial. We extensively validate LoRA-Sparse through rigorous empirical studies in both NLP and multi-modal tasks, demonstrating its effectiveness and general applicability. Based on LLaMA and LLaVA models, our methods can reduce more than half of the self-attention computation with even better performance than full-attention baselines.
-
Training deep models for LiDAR semantic segmentation is challenging due to the inherent sparsity of point clouds. Utilizing temporal data is a natural remedy against the sparsity problem, as it makes the input signal denser. However, previous multi-frame fusion algorithms fall short in utilizing sufficient temporal information due to memory constraints, and they also ignore the informative temporal images. To fully exploit the rich information hidden in long-term temporal point clouds and images, we present the Temporal Aggregation Network, termed TASeg. Specifically, we propose a Temporal LiDAR Aggregation and Distillation (TLAD) algorithm, which leverages historical priors to assign different aggregation steps for different classes. It can largely reduce memory and time overhead while achieving higher accuracy. In addition, TLAD trains a teacher injected with ground-truth priors to distill the model, further boosting performance. To make full use of temporal images, we design a Temporal Image Aggregation and Fusion (TIAF) module, which can greatly expand the camera FOV and enhance the present features. Temporal LiDAR points in the camera FOV are used as mediums to transform temporal image features to the present coordinate for temporal multi-modal fusion. Moreover, we develop a Static-Moving Switch Augmentation (SMSA) algorithm, which utilizes sufficient temporal information to enable objects to switch their motion states freely, thus greatly increasing static and moving training samples. Our TASeg ranks 1st on three challenging tracks, i.e., the SemanticKITTI single-scan track, multi-scan track, and nuScenes LiDAR segmentation track, strongly demonstrating the superiority of our method. Code is available at https://github.com/LittlePey/TASeg.
-
The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs greatly reducing computational costs while still achieving promising performance. However training SparseFormers from scratch is still expensive and scaling up the number of parameters can be challenging. In this paper we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In such a way we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g. IN-21K pre-trained AugRegs or CLIPs) using a rather smaller amount of training samples (e.g. IN-1K) and without labels or captions within just a few hours. As a result the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost without seeing any caption during the bootstrapping procedure. In addition CLIP-bootstrapped SparseFormers which align the output space with language without seeing a word can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer
-
Photometric stereo is a well-established technique to estimate the surface normal of an object. However the requirement of capturing multiple high dynamic range images under different illumination conditions limits the speed and real-time applications. This paper introduces EventPS a novel approach to real-time photometric stereo using an event camera. Capitalizing on the exceptional temporal resolution dynamic range and low bandwidth characteristics of event cameras EventPS estimates surface normal only from the radiance changes significantly enhancing data efficiency. EventPS seamlessly integrates with both optimization-based and deep-learning-based photometric stereo techniques to offer a robust solution for non-Lambertian surfaces. Extensive experiments validate the effectiveness and efficiency of EventPS compared to frame-based counterparts. Our algorithm runs at over 30 fps in real-world scenarios unleashing the potential of EventPS in time-sensitive and high-speed downstream applications.
-
Traditionally training neural networks to perform semantic segmentation requires expensive human-made annotations. But more recently advances in the field of unsupervised learning have made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation by spatially correlating the feature maps with the depth maps to induce knowledge about the structure of the scene and (2) exploiting farthest-point sampling to more effectively select relevant features by utilizing 3D sampling techniques on depth information of the scene. Finally we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.
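The farthest-point sampling step mentioned above can be illustrated with a minimal sketch. It assumes a pinhole camera model for lifting a depth map to 3D points; the function names, the intrinsics (fx, fy, cx, cy) and the random choice of the first point are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to 3D points with a pinhole model (assumed)."""
    pts = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            pts.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return pts


def farthest_point_sampling(points, k, seed=0):
    """Return k indices into points, chosen greedily so the samples are spread out."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    first = rng.randrange(len(points))  # arbitrary starting point
    chosen = [first]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        # Pick the point that is farthest from everything chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen
```

Compared with uniform random sampling, the greedy rule tends to cover distant or isolated parts of the scene early, which is the property the abstract relies on when selecting features in 3D.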
-
End-to-end motion planning models equipped with deep neural networks have shown great potential for enabling full autonomous driving. However the oversized neural networks render them impractical for deployment on resource-constrained systems which unavoidably requires more computational time and resources during inference. To handle this knowledge distillation offers a promising approach that compresses models by enabling a smaller student model to learn from a larger teacher model. Nevertheless how to apply knowledge distillation to compress motion planners has not been explored so far. In this paper we propose PlanKD the first knowledge distillation framework tailored for compressing end-to-end motion planners. First considering that driving scenes are inherently complex often containing planning-irrelevant or even noisy information transferring such information is not beneficial for the student planner. Thus we design an information bottleneck based strategy to only distill planning-relevant information rather than transfer all information indiscriminately. Second different waypoints in an output planned trajectory may hold varying degrees of importance for motion planning where a slight deviation in certain crucial waypoints might lead to a collision. Therefore we devise a safety-aware waypoint-attentive distillation module that assigns adaptive weights to different waypoints based on the importance to encourage the student to accurately mimic more crucial waypoints thereby improving overall safety. Experiments demonstrate that our PlanKD can boost the performance of smaller planners by a large margin and significantly reduce their inference time.
-
Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However video editing models have not yet reached the same level of visual quality and user control. To address this we introduce RAVE a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy leveraging spatio-temporal interactions between frames to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements allowing it to handle longer videos. RAVE is capable of a wide range of edits from local attribute modifications to shape transformations. In order to demonstrate the versatility of RAVE we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code dataset and videos can be found at https://rave-video-edit.github.io/.
-
Predictive learning models which aim to predict future frames based on past observations are crucial to constructing world models. These models need to maintain low-level consistency and capture high-level dynamics in unannotated spatiotemporal data. Transitioning from frame-wise to token-wise prediction presents a viable strategy for addressing these needs. How to improve token representation and optimize token decoding presents significant challenges. This paper introduces PredToken a novel predictive framework that addresses these issues by decoupling space-time tokens into distinct components for iterative cascaded decoding. Concretely we first design a "decomposition quantization and reconstruction" schema based on VQGAN to improve the token representation. This scheme disentangles low- and high-frequency representations and employs a dimension-aware quantization model allowing more low-level details to be preserved. Building on this we present a "coarse-to-fine iterative decoding" method. It leverages dynamic soft decoding to refine coarse tokens and static soft decoding for fine tokens enabling more high-level dynamics to be captured. These designs make PredToken produce high-quality predictions. Extensive experiments demonstrate the superiority of our method on various real-world spatiotemporal predictive benchmarks. Furthermore PredToken can also be extended to other visual generative tasks to yield realistic outcomes.
-
By leveraging temporal dependency in video sequences multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations such as occlusion motion blur and video defocus. These algorithms are predominantly based on heatmaps resulting in high computation and storage requirements per frame which limits their flexibility and real-time application in video scenarios particularly on edge devices. In this paper we develop an efficient and effective video-based human pose regression method which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose the temporal trajectory of each individual joint exhibits relative independence. In light of this we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint thereby avoiding the conflation of spatiotemporal dimensions. Concretely DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods DSTA significantly enhances performance achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page: https://github.com/zgspose/DSTA.
-
In the current era of generative AI breakthroughs generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g. multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360 degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works with >70% preference in human evaluations. Combined with conditional diffusion models L-MAGIC can accept various input modalities including but not limited to text depth maps sketches and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion.
-
When working with 3D facial data improving fidelity and avoiding the uncanny valley effect is critically dependent on accurate 3D facial performance capture. Because such methods are expensive and due to the widespread availability of 2D videos recent methods have focused on how to perform monocular 3D face tracking. However these methods often fall short in capturing precise facial movements due to limitations in their network architecture training and evaluation processes. Addressing these challenges we propose a novel face tracker FlowFace that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. Our 3D model fitting module jointly fits a 3D face model from one or many observations integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos which leads to performance gains on downstream tasks.
-
Multi-view diffusion models obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models have driven recent breakthroughs in text-to-3D research. However due to the limited size and quality of existing 3D datasets they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT) which allows models to learn from the data generated by themselves and improve beyond their dataset limitations during SFT. To this end we introduce Carve3D an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model which we denote as Carve3DM demonstrates superior multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models mirroring the standard Large Language Model (LLM) alignment pipeline. Our code training and testing data and video results are available at: https://desaixie.github.io/carve-3d.
-
Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) in the realm of computer vision showcasing tremendous potential. However recent research has unveiled a susceptibility of ViTs to adversarial attacks akin to their CNN counterparts. Adversarial training and randomization are two representative effective defenses for CNNs. Some researchers have attempted to apply adversarial training to ViTs and achieved comparable robustness to CNNs while it is not easy to directly apply randomization to ViTs because of the architecture difference between CNNs and ViTs. In this paper we delve into the structural intricacies of ViTs and propose a novel defense mechanism termed Random entangled image Transformer (ReiT) which seamlessly integrates adversarial training and randomization to bolster the adversarial robustness of ViTs. Recognizing the challenge posed by the structural disparities between ViTs and CNNs we introduce a novel module input-independent random entangled self-attention (II-ReSA). This module optimizes random entangled tokens that lead to "dissimilar" self-attention outputs by leveraging model parameters and the sampled random tokens thereby synthesizing the self-attention module outputs and random entangled tokens to diminish adversarial similarity. ReiT incorporates two distinct random entangled tokens and employs dual randomization offering an effective countermeasure against adversarial examples while ensuring comprehensive deduction guarantees. Through extensive experiments conducted on various ViT variants and benchmarks we substantiate the superiority of our proposed method in enhancing the adversarial robustness of Vision Transformers.
-
In the realm of image composition generating realistic shadows for the inserted foreground remains a formidable challenge. Previous works have developed image-to-image translation models which are trained on paired training data. However they struggle to generate shadows with accurate shapes and intensities hindered by data scarcity and inherent task complexity. In this paper we resort to a foundation model with rich prior knowledge of natural shadow images. Specifically we first adapt ControlNet to our task and then propose intensity modulation modules to improve the shadow intensity. Moreover we extend the small-scale DESOBA dataset to DESOBAv2 using a novel data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2 datasets as well as real composite images demonstrate the superior capability of our model for the shadow generation task. The dataset code and model are released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2.
-
Generative AI has made significant strides in computer vision particularly in text-driven image/video synthesis (T2I/T2V). Despite the notable advancements it remains challenging in human-centric content synthesis such as realistic dance generation. Current methodologies primarily tailored for human motion transfer encounter difficulties when confronted with real-world dance scenarios (e.g. social media dance) which require to generalize across a wide spectrum of poses and intricate human details. In this paper we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as unseen human subjects backgrounds and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects backgrounds and poses from different sources. To address these challenges we introduce DISCO which includes a novel model architecture with disentangled control to improve the compositionality of dance synthesis and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code is available at https://disco-dance.github.io/.
-
Deep neural networks have shown great success in representation learning. However when learning with noisy labels (LNL) they can easily overfit and fail to generalize to new data. This paper introduces a simple and effective method named Learning to Bootstrap (L2B) which enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels. It achieves this by dynamically adjusting the importance weight between real observed and generated labels as well as between different samples through meta-learning. Unlike existing instance reweighting methods the key to our method lies in a new versatile objective that enables implicit relabeling concurrently leading to significant improvements without incurring additional costs. L2B offers several benefits over the baseline methods. It yields more robust models that are less susceptible to the impact of noisy labels by guiding the bootstrapping procedure more effectively. It better exploits the valuable information contained in corrupted instances by adapting the weights of both instances and labels. Furthermore L2B is compatible with existing LNL methods and delivers competitive results spanning natural and medical imaging tasks including classification and segmentation under both synthetic and real-world noise. Extensive experiments demonstrate that our method effectively mitigates the challenges of noisy labels often necessitating few to no validation samples and is well generalized to other tasks such as image segmentation. This not only positions it as a robust complement to existing LNL techniques but also underscores its practical applicability. The code and models are available at https://github.com/yuyinzhou/l2b.
-
The advent of neural 3D Gaussians has recently brought about a revolution in the field of neural rendering facilitating the generation of high-quality renderings at real-time speeds. However the explicit and discrete representation encounters challenges when applied to scenes featuring reflective surfaces. In this paper we present GaussianShader a novel method that applies a simplified shading function on 3D Gaussians to enhance the neural rendering in scenes with reflective surfaces while preserving the training and rendering efficiency. The main challenge in applying the shading function lies in the accurate normal estimation on discrete 3D Gaussians. Specifically we propose a novel normal estimation framework based on the shortest axis directions of 3D Gaussians with a carefully designed loss to enforce consistency between the normals and the geometries of Gaussian spheres. Experiments show that GaussianShader strikes a commendable balance between efficiency and visual quality. Our method surpasses Gaussian Splatting in PSNR on specular object datasets exhibiting an improvement of 1.57dB. When compared to prior works handling reflective surfaces such as Ref-NeRF our optimization time is significantly reduced (from 23h to 0.58h).
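The idea of reading a normal off the shortest axis of a 3D Gaussian can be sketched as follows. This sketch assumes the common parameterization of a Gaussian by three per-axis scales and a rotation quaternion (w, x, y, z); the function names and the rule of flipping the normal to face the camera are assumptions for illustration and do not reproduce the paper's estimation framework or its loss.

```python
import math


def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (list of rows)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def shortest_axis_normal(scale, quat, view_dir=None):
    """Return the unit direction of the Gaussian's shortest principal axis.

    view_dir (optional, assumed convention): direction from the camera toward
    the Gaussian; the normal is flipped so that it points back at the camera.
    """
    rot = quat_to_rot(quat)
    k = min(range(3), key=lambda i: abs(scale[i]))  # index of the smallest scale
    normal = [rot[0][k], rot[1][k], rot[2][k]]      # k-th column of the rotation
    if view_dir is not None:
        if sum(n * v for n, v in zip(normal, view_dir)) > 0:
            normal = [-n for n in normal]
    return normal
```

The intuition is that a Gaussian fitted to a surface tends to flatten along the surface normal, so its thinnest axis is a reasonable first guess for that normal.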
-
We present a scene representation that brings vision and touch into a shared 3D space which we call a tactile-augmented radiance field. This representation capitalizes on two key insights: (i) ubiquitous vision-based touch sensors are built on perspective cameras and (ii) visually and structurally similar regions of a scene share the same tactile features. We use these insights to train a conditional diffusion model that provided with an RGB image and a depth map rendered from a neural radiance field generates its corresponding tactile "image". To train this diffusion model we collect the largest collection of spatially-aligned visual and tactile data. Through qualitative and quantitative experiments we demonstrate the accuracy of our cross-modal generative model and the utility of collected and rendered visual-tactile pairs across a range of downstream tasks. Project page: https://dou-yiming.github.io/TaRF
-
Spike cameras a novel neuromorphic visual sensor can capture full-time spatial information through spike stream offering ultra-high temporal resolution and an extensive dynamic range. Autofocus control (AC) plays a pivotal role in a camera to efficiently capture information in challenging real-world scenarios. Nevertheless due to disparities in data modality and information characteristics compared to frame stream and event stream the current lack of efficient AC methods has made it challenging for spike cameras to adapt to intricate real-world conditions. To address this challenge we introduce a spike-based autofocus framework that includes a spike-specific focus measure called spike dispersion (SD) which effectively mitigates the influence of variations in scene light intensity during the focusing process by leveraging the spike camera's ability to record full-time spatial light intensity. Additionally the framework integrates a fast search strategy called spike-based golden fast search (SGFS) allowing rapid focal positioning without the need for a complete focus range traversal. To validate the performance of our method we have collected a spike-based autofocus dataset (SAD) containing synthetic data and real-world data under varying scene brightness and motion scenarios. Experimental results on these datasets demonstrate that our method offers state-of-the-art accuracy and efficiency. Furthermore experiments with data captured under varying scene brightness levels illustrate the robustness of our method to changes in light intensity during the focusing process.
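The abstract does not spell out how SGFS works, so the following is only a sketch of plain golden-section search, the classical technique the name suggests. It assumes the focus measure is unimodal over the lens position range; the focus_measure callable and the position bounds are hypothetical stand-ins.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # about 0.618


def golden_section_max(focus_measure, lo, hi, tol=1e-3):
    """Approximately maximize a unimodal function on [lo, hi].

    Returns (position, evaluations). Each iteration shrinks the bracket by a
    factor of about 0.618 and needs only one new evaluation.
    """
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = focus_measure(c), focus_measure(d)
    evals = 2
    while b - a > tol:
        if fc > fd:
            # The peak lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = focus_measure(c)
        else:
            # The peak lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = focus_measure(d)
        evals += 1
    return (a + b) / 2, evals
```

For a range of width 10 and a tolerance of 1e-3, this takes a couple of dozen evaluations, whereas sweeping the full range at the same resolution would take on the order of ten thousand, which is the kind of saving the abstract describes as avoiding a complete focus range traversal.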
-
Fairness is a critical concern in deep learning especially in healthcare where these models influence diagnoses and treatment decisions. Although fairness has been investigated in the vision-only domain the fairness of medical vision-language (VL) models remains unexplored due to the scarcity of medical VL datasets for studying fairness. To bridge this research gap we introduce the first fair vision-language medical dataset (Harvard-FairVLMed) that provides detailed demographic attributes ground-truth labels and clinical notes to facilitate an in-depth examination of fairness within VL foundation models. Using Harvard-FairVLMed we conduct a comprehensive fairness analysis of two widely-used VL models (CLIP and BLIP2) pre-trained on both natural and medical domains across four different protected attributes. Our results highlight significant biases in all VL models with Asian Male Non-Hispanic and Spanish being the preferred subgroups across the protected attributes of race gender ethnicity and language respectively. In order to alleviate these biases we propose FairCLIP an optimal-transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distributions corresponding to each demographic group. As the first VL dataset of its kind Harvard-FairVLMed holds the potential to catalyze advancements in the development of machine learning models that are both ethically aware and clinically effective. Our dataset and code are available at https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k.
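To make the Sinkhorn distance mentioned above concrete, here is a minimal sketch of the standard Sinkhorn iterations for entropy-regularized optimal transport between two discrete histograms. The histogram representation, the cost matrix and the parameter names (eps, iters) are assumptions for illustration; this is not the FairCLIP implementation.

```python
import math


def sinkhorn_distance(a, b, cost, eps=0.1, iters=500):
    """Transport cost of the entropy-regularized plan between histograms a and b.

    a, b: lists of non-negative weights, each summing to 1.
    cost: cost[i][j] is the cost of moving mass from bin i of a to bin j of b.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The transport plan is diag(u) * kernel * diag(v).
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )
```

In a fairness setting of the kind the abstract describes, a would play the role of the overall sample distribution and b the distribution of one demographic group, and a small distance would indicate that the group is treated similarly to the population as a whole.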
-
Predicting the future occupancy states of the surrounding environment is a vital task for autonomous driving. However current best-performing single-modality methods or multi-modality fusion perception methods are only able to predict uniform snapshots of future occupancy states and require strictly synchronized sensory data for sensor fusion. We propose a novel framework StreamingFlow to lift these strong limitations. StreamingFlow is a novel BEV occupancy predictor that ingests asynchronous multi-sensor data streams for fusion and performs streaming forecasting of the future occupancy map at any future timestamps. By integrating neural ordinary differential equations (N-ODE) into recurrent neural networks StreamingFlow learns derivatives of BEV features over temporal horizons updates the implicit sensor's BEV features as part of the fusion process and propagates BEV states to the desired future time point. It shows good zero-shot generalization ability of prediction reflected in the interpolation of the observed prediction time horizon and the reasonable inference of the unseen farther future period. Extensive experiments on two large-scale datasets nuScenes and Lyft L5 demonstrate that StreamingFlow significantly outperforms previous vision-based LiDAR-based methods and shows superior performance compared to state-of-the-art fusion-based methods.
-
We introduce pix2gestalt a framework for zero-shot amodal segmentation which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases including examples that break natural and physical priors such as art. As training data we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.
-
Manual annotation of every point in a point cloud is a costly and labor-intensive process. While weakly supervised point cloud semantic segmentation (WSPCSS) with sparse annotation shows promise the limited information from initial sparse labels can place an upper bound on performance. As a new research direction for WSPCSS we propose a novel Region Exploration via Artificial Labeling (REAL) framework. It leverages a foundational image model as an artificial oracle within the active learning context eliminating the need for manual annotation by a human oracle. To integrate the 2D model into the 3D domain we first introduce a Projection-based Point-to-Segment (PP2S) module designed to enable prompt segmentation of 3D data without additional training. The REAL framework samples query points based on model predictions and requests annotations from PP2S dynamically refining labels and improving model training. Furthermore to overcome several challenges of employing an artificial model as an oracle we formulate effective query sampling and label updating strategies. Our comprehensive experiments and comparisons demonstrate that the REAL framework significantly outperforms existing methods across various benchmarks. The code is available at https://github.com/jihun1998/AO.
-
Although neural networks excel in video action recognition tasks their "black-box" nature makes it challenging to understand the rationale behind their decisions. Recent approaches used inherently interpretable models to analyze video actions in a manner akin to human reasoning. However it has been observed that these interpretable models tend to underperform when compared to their black-box counterparts. In this work we present a new framework called Language-guided Interpretable Action Recognition framework (LaIAR). This framework leverages knowledge from language models to enhance both the recognition capabilities and the interpretability of video models. In essence we reframe the challenge of understanding video model decisions as a task of aligning video and language models. Using the logical reasoning captured by the language model we steer the training of the video model. This integrated approach not only improves the video model's adaptability to different domains but also boosts its overall performance. Extensive experiments on Charades and CAD-120 datasets demonstrate the superior performance and interpretability of our proposed method. The code of LaIAR is available at https://github.com/NingWang2049/LaIAR.
-
In the context of computer vision and human-robot interaction forecasting 3D human poses is crucial for understanding human behavior and enhancing the predictive capabilities of intelligent systems. While existing methods have made significant progress they often focus on predicting major body joints overlooking fine-grained gestures and their interaction with objects. Human hand movements particularly during object interactions play a pivotal role and provide more precise expressions of human poses. This work fills this gap and introduces a novel paradigm: forecasting 3D whole-body human poses with a focus on grasping objects. This task involves predicting activities across all joints in the body and hands encompassing the complexities of internal heterogeneity and external interactivity. To tackle these challenges we also propose a novel approach C^3HOST (cross-context cross-modal consolidation for 3D whole-body pose forecasting) which effectively handles the complexities of internal heterogeneity and external interactivity. C^3HOST involves distinct steps including heterogeneous content encoding and alignment and cross-modal feature learning and interaction. These enable us to predict activities across all body and hand joints ensuring high-precision whole-body human pose prediction even during object grasping. Extensive experiments on two benchmarks demonstrate that our model significantly enhances the accuracy of whole-body human motion prediction. The project page is available at https://sites.google.com/view/c3host.
-
The autonomous driving community has shown significant interest in 3D occupancy prediction driven by its exceptional geometric perception and general object recognition capabilities. To achieve this current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation extending from the Bird-Eye-View perception. However compressed views like TPV representation lose 3D geometry information while raw and sparse OCC representation requires heavy but redundant computational costs. To address the above limitations we propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometrical OCC feature through efficient explicit-implicit view transformation. Then the occupancy decoder further enhances the semantic discriminability of the compact OCC representation by a coarse-to-fine semantic grouping strategy. Empirical experiments show that there are evident performance gains across multiple baselines e.g. COTR outperforms baselines with a relative improvement of 8%-15% demonstrating the superiority of our method.
-
Diffusion probabilistic models (DPMs) have shown remarkable performance in high-resolution image synthesis but their sampling efficiency is still to be desired due to the typically large number of sampling steps. Recent advancements in high-order numerical ODE solvers for DPMs have enabled the generation of high-quality images with much fewer sampling steps. While this is a significant development most sampling methods still employ uniform time steps which is not optimal when using a small number of steps. To address this issue we propose a general framework for designing an optimization problem that seeks more appropriate time steps for a specific numerical ODE solver for DPMs. This optimization problem aims to minimize the distance between the ground-truth solution to the ODE and an approximate solution corresponding to the numerical solver. It can be efficiently solved using the constrained trust region method taking less than 15 seconds. Our extensive experiments on both unconditional and conditional sampling using pixel- and latent-space DPMs demonstrate that when combined with the state-of-the-art sampling method UniPC our optimized time steps significantly improve image generation performance in terms of FID scores for datasets such as CIFAR-10 and ImageNet compared to using uniform time steps.
-
Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say") a form of catastrophic forgetting. In this work we propose a cascading and joint training approach for LMMs to solve this task avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image "say" by telling the user if they are not proposing alternative queries or correcting semantic errors in the query and finally "segment" by outputting the mask of the desired objects if they exist. Additionally we introduce a novel False Premise Correction benchmark dataset an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches but under false premise conditions produces relative cIOU improvements of more than 31% over baselines and produces natural language feedback judged helpful up to 67% of the time.
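The abstract reports cIOU without defining it. As a minimal sketch, assuming it denotes the cumulative IoU commonly used in referring segmentation (total intersection over total union across the evaluation set), it could be computed as below. The function name and the nested-list mask format are illustrative assumptions, not taken from the paper.

```python
def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: summed intersection over summed union across samples.

    Each mask is a 2D list of 0/1 values. This follows the usual cIoU
    definition in referring segmentation, which is an assumption here.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for p_row, g_row in zip(pred, gt):
            for p, g in zip(p_row, g_row):
                total_inter += int(bool(p) and bool(g))
                total_union += int(bool(p) or bool(g))
    # If nothing is predicted and nothing is annotated, treat it as a perfect match.
    return total_inter / total_union if total_union else 1.0


preds = [[[1, 1], [0, 0]], [[0, 1]]]
gts = [[[1, 0], [0, 0]], [[0, 1]]]
score = cumulative_iou(preds, gts)  # 2 intersecting pixels over 3 union pixels
```

Under a false premise the ground-truth mask is empty, so any predicted pixel adds to the union without adding to the intersection and lowers the score.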
-
End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset characterized by relatively simple driving scenarios leads to an under-utilization of perception information in end-to-end models incorporating ego status such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset we also note that current metrics do not comprehensively assess the planning quality leading to potentially biased conclusions drawn from existing benchmarks. To address this issue we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations on the benchmark and metrics we suggest the community reassess relevant prevailing research and be cautious about whether the continued pursuit of state-of-the-art would yield convincing and universal conclusions. Code and models are available at https://github.com/NVlabs/BEV-Planner.
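The abstract does not specify how road adherence is measured. One plausible reading, sketched here under that assumption, is the fraction of predicted waypoints that fall outside a drivable-area grid. The function `off_road_rate`, the grid layout, and the `cell_size` parameter are hypothetical and are not the paper's metric.

```python
def off_road_rate(trajectory, drivable, cell_size=1.0):
    """Fraction of (x, y) waypoints outside the drivable area.

    `drivable` is a 2D list indexed as drivable[row][col], with 1 for road
    cells. Waypoints that fall outside the grid count as off-road.
    """
    if not trajectory:
        return 0.0
    off = 0
    for x, y in trajectory:
        col = int(x // cell_size)
        row = int(y // cell_size)
        inside = 0 <= row < len(drivable) and 0 <= col < len(drivable[0])
        if not inside or not drivable[row][col]:
            off += 1
    return off / len(trajectory)


road = [[1, 1, 0],
        [1, 1, 0]]
plan = [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5), (3.5, 0.5)]
rate = off_road_rate(plan, road)  # last two waypoints leave the road
```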
-
Unsupervised point cloud shape correspondence aims to establish point-wise correspondences between source and target point clouds. Existing methods obtain correspondences directly by computing point-wise feature similarity between point clouds. However non-rigid objects possess strong deformability and unusual shapes making it a longstanding challenge to directly establish correspondences between point clouds with unconventional shapes. To address this challenge we propose an unsupervised Template-Assisted point cloud shape correspondence Network termed TANet including a template generation module and a template assistance module. The proposed TANet enjoys several merits. Firstly the template generation module establishes a set of learnable templates with explicit structures. Secondly we introduce a template assistance module that extensively leverages the generated templates to establish more accurate shape correspondences from multiple perspectives. Extensive experiments on four human and animal datasets demonstrate that TANet achieves favorable performance against state-of-the-art methods.
-
Diffusion Models (DMs) have evolved into advanced image generation tools especially for few-shot generation where a pre-trained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM) a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pre-trained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication surpassing alternative validation techniques. Code implementation is available at https://github.com/Nicholas0228/Revelio.
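To illustrate only the Projected Gradient Descent ingredient named above, the sketch below maximizes a toy objective under an L-infinity budget. The objective (squared distance from a reference point) stands in for the KL divergence between the two models, and the names `pgd_maximize`, `grad_fn`, `eps`, and `step` are assumptions for this example rather than the authors' implementation.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_maximize(x, grad_fn, eps=0.1, step=0.02, iters=50):
    """Signed gradient ascent on a perturbation, projected onto |delta| <= eps."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        z = [xi + di for xi, di in zip(x, delta)]
        g = grad_fn(z)
        # Ascent step followed by projection (clipping) onto the eps-ball.
        delta = [max(-eps, min(eps, di + step * sign(gi)))
                 for di, gi in zip(delta, g)]
    return [xi + di for xi, di in zip(x, delta)]


# Toy objective: f(z) = sum((z_i - c_i)^2), whose gradient is 2 * (z - c).
c = [0.0, 0.0]
grad = lambda z: [2.0 * (zi - ci) for zi, ci in zip(z, c)]
x0 = [0.5, -0.2]
x_adv = pgd_maximize(x0, grad)
```

In the method described above the gradient would come from the divergence between the pre-trained and fine-tuned models, estimated by Monte Carlo sampling, rather than from a closed-form toy function.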
-
Visual perception evolves over time. This is particularly the case for oracle bone scripts, where visual glyphs that seemed intuitive to people in the distant past prove difficult to understand through contemporary eyes. While the semantic correspondence of an oracle can be found via a dictionary lookup, this is not enough for public viewers to connect the dots, i.e., why does this oracle mean that? The common solution relies on a laborious curation process to collect a visual guide for each oracle (Fig. 1), which hinges on the case-by-case effort and taste of curators. This paper delves into one natural follow-up question: can AI take over? Beginning with a comprehensive human study, we show that participants could indeed make better sense of an oracle glyph when given a proper visual guide, and that its efficacy can be approximated via a novel metric termed TransOV (Transferable Oracle Visuals). We then define a new conditional visual generation task based on an oracle glyph and its semantic meaning and, importantly, approach it by circumventing any form of model training, given the fatal lack of oracle data. At its heart is leveraging a foundation model like GPT-4V to reason about the visual cues hidden inside an oracle, and taking advantage of an existing text-to-image model for final visual guide generation. Extensive empirical evidence shows that our AI-enabled visual guides achieve TransOV performance comparable to that of guides collected through manual effort. Finally, we demonstrate the versatility of our system under a more complex setting, where it is required to work alongside an AI image denoiser to cope with raw oracle scan image inputs (cf. processed clean oracle glyphs). Code is available at https://github.com/RQ-Lab/OBS-Visual.
-
The Laplace-Beltrami operator (LBO) emerges from studying manifolds equipped with a Riemannian metric. It is often called the swiss army knife of geometry processing as it allows to capture intrinsic shape information and gives rise to heat diffusion geodesic distances and a multitude of shape descriptors. It also plays a central role in geometric deep learning. In this work we explore Finsler manifolds as a generalization of Riemannian manifolds. We revisit the Finsler heat equation and derive a Finsler heat kernel and a Finsler-Laplace-Beltrami Operator (FLBO): a novel theoretically justified anisotropic Laplace-Beltrami operator (ALBO). In experimental evaluations we demonstrate that the proposed FLBO is a valuable alternative to the traditional Riemannian-based LBO and ALBOs for spatial filtering and shape correspondence estimation. We hope that the proposed Finsler heat kernel and the FLBO will inspire further exploration of Finsler geometry in the computer vision community.
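For readers unfamiliar with how a Laplacian gives rise to heat diffusion, the following sketch runs explicit Euler steps of du/dt = -Lu on a small path graph. The graph Laplacian is used here as a simple discrete, isotropic stand-in for the LBO. This is an assumption made for illustration and does not implement the Finsler operator proposed in the paper.

```python
def path_laplacian(n):
    """Graph Laplacian L = D - A of a path graph with n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L


def diffuse(u, L, dt=0.2, steps=10):
    """Explicit Euler integration of the heat equation du/dt = -L u."""
    n = len(u)
    for _ in range(steps):
        Lu = [sum(L[i][j] * u[j] for j in range(n)) for i in range(n)]
        u = [ui - dt * lui for ui, lui in zip(u, Lu)]
    return u


# A unit of heat placed at the middle node spreads towards both ends.
heat = diffuse([0.0, 0.0, 1.0, 0.0, 0.0], path_laplacian(5))
```

Total heat is conserved because every row of L sums to zero and L is symmetric. The step size dt=0.2 is chosen small enough for the explicit scheme to remain stable on this graph.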
-
We introduce a new family of minimal problems for reconstruction from multiple views. Our primary focus is a novel approach to autocalibration a long-standing problem in computer vision. Traditional approaches to this problem such as those based on Kruppa's equations or the modulus constraint rely explicitly on the knowledge of multiple fundamental matrices or a projective reconstruction. In contrast we consider a novel formulation involving constraints on image points the unknown depths of 3D points and a partially specified calibration matrix K. For 2 and 3 views we present a comprehensive taxonomy of minimal autocalibration problems obtained by relaxing some of these constraints. These problems are organized into classes according to the number of views and any assumed prior knowledge of K. Within each class we determine problems with the fewest---or a relatively small number of---solutions. From this zoo of problems we devise three practical solvers. Experiments with synthetic and real data and interfacing our solvers with COLMAP demonstrate that we achieve superior accuracy compared to state-of-the-art calibration methods. The code is available at https://github.com/andreadalcin/MinimalPerspectiveAutocalibration.
-
Previous works on single-view hand-held object reconstruction typically rely on supervision from 3D ground-truth models, which are hard to collect in the real world. In contrast, readily accessible hand-object videos offer a promising training data source, but they only give heavily occluded object observations. In this paper, we present a novel synthetic-to-real framework to exploit Multi-view Occlusion-aware supervision from hand-object videos for Hand-held Object reconstruction (MOHO) from a single image, tackling two predominant challenges in such a setting: hand-induced occlusion and the object's self-occlusion. First, in the synthetic pre-training stage, we render a large-scale synthetic dataset, SOMVideo, with hand-object images and multi-view occlusion-free supervision, adopted to address hand-induced occlusion in both 2D and 3D spaces. Second, in the real-world finetuning stage, MOHO leverages amodal-mask-weighted geometric supervision to mitigate the unfaithful guidance caused by the hand-occluded supervising views in the real world. Moreover, domain-consistent occlusion-aware features are amalgamated in MOHO to resist the object's self-occlusion when inferring the complete object shape. Extensive experiments on the HO3D and DexYCB datasets demonstrate that 2D-supervised MOHO outperforms 3D-supervised methods by a large margin.
-
Largely due to their implicit nature neural fields lack a direct mechanism for filtering as Fourier analysis from discrete signal processing is not directly applicable to these representations. Effective filtering of neural fields is critical to enable level-of-detail processing in downstream applications and support operations that involve sampling the field on regular grids (e.g. marching cubes). Existing methods that attempt to decompose neural fields in the frequency domain either resort to heuristics or require extensive modifications to the neural field architecture. We show that via a simple modification one can obtain neural fields that are low-pass filtered and in turn show how this can be exploited to obtain a frequency decomposition of the entire signal. We demonstrate the validity of our technique by investigating level-of-detail reconstruction and showing how coarser representations can be computed effectively.
-
As foundation models become more popular there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed they are designed to be efficient only in terms of how many parameters are trained. They however typically still require backpropagating gradients throughout the model meaning that their training-time and -memory cost does not reduce as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen pretrained backbone. As a result our method is efficient not only in terms of parameters but also in training-time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark and we further show how we outperform prior works with respect to training-time and -memory usage too. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification without any intricate model parallelism. Here we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone or fully-finetuning a smaller backbone with the same GPU and less training time.
-
Category-level object pose estimation aiming to predict the 6D pose and 3D size of objects from known categories typically struggles with large intra-class shape variation. Existing works utilizing mean shapes often fall short of capturing this variation. To address this issue we present SecondPose a novel approach integrating object-specific geometric features with semantic category priors from DINOv2. Leveraging the advantage of DINOv2 in providing SE(3)-consistent semantic features we hierarchically extract two types of SE(3)-invariant geometric features to further encapsulate local-to-global object-specific information. These geometric features are then point-aligned with DINOv2 features to establish a consistent object representation under SE(3) transformations facilitating the mapping from camera space to the pre-defined canonical space thus further enhancing pose estimation. Extensive experiments on NOCS-REAL275 demonstrate that SecondPose achieves a 12.4% leap forward over the state-of-the-art. Moreover on a more complex dataset HouseCat6D which provides photometrically challenging objects SecondPose still surpasses other competitors by a large margin. Code is released at https://github.com/NOrangeeroli/SecondPose.git.
-
Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate annotation-free and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks such as estimating the mass of common objects as well as other properties like friction and hardness.
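The kernel regression step can be pictured with a standard Nadaraya-Watson estimator: the property value at a query point is a similarity-weighted average of known values. The sketch below assumes a Gaussian kernel on Euclidean distance between feature vectors. The kernel choice, the `bandwidth` parameter, and the toy density values are assumptions for illustration, not details given in the abstract.

```python
import math


def kernel_regress(query, points, values, bandwidth=1.0):
    """Nadaraya-Watson estimate: Gaussian-weighted average of `values`."""
    weights = []
    for p in points:
        sq_dist = sum((q - pi) ** 2 for q, pi in zip(query, p))
        weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed; fall back to the plain mean.
        return sum(values) / len(values)
    return sum(w * v for w, v in zip(weights, values)) / total


# Two reference embeddings with hypothetical densities (made-up numbers).
refs = [[0.0, 0.0], [2.0, 0.0]]
densities = [1.0, 3.0]
midpoint_estimate = kernel_regress([1.0, 0.0], refs, densities)
near_first = kernel_regress([0.0, 0.0], refs, densities, bandwidth=0.3)
```

A query halfway between the two references receives the average of their values, while a query sitting on one reference with a narrow bandwidth essentially copies that reference's value.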
-
Understanding the world in first-person view is fundamental in Augmented Reality (AR). This immersive perspective brings dramatic visual changes and unique challenges compared to third-person views. Synthetic data has empowered third-person-view vision models but its application to embodied egocentric perception tasks remains largely unexplored. A critical challenge lies in simulating natural human movements and behaviors that effectively steer the embodied cameras to capture a faithful egocentric representation of the 3D world. To address this challenge we introduce EgoGen a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. Combined with collision-avoiding motion primitives and a two-stage reinforcement learning approach our motion synthesis model offers a closed-loop solution where the embodied perception and movement of the virtual human are seamlessly coupled. Compared to previous works our model eliminates the need for a pre-defined global path and is directly applicable to dynamic environments. Combined with our easy-to-use and scalable data generation pipeline we demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras egocentric camera tracking and human mesh recovery from egocentric views. EgoGen will be fully open-sourced offering a practical solution for creating realistic egocentric training data and aiming to serve as a useful tool for egocentric computer vision research.
-
Face Anti-Spoofing (FAS) is crucial for securing face recognition systems against presentation attacks. With advancements in sensor manufacture and multi-modal learning techniques, many multi-modal FAS approaches have emerged. However, they face challenges in generalizing to unseen attacks and deployment conditions. These challenges arise from (1) modality unreliability, where some modality sensors like depth and infrared undergo significant domain shifts in varying environments, leading to the spread of unreliable information during cross-modal feature fusion, and (2) modality imbalance, where training that relies too heavily on a dominant modality hinders the convergence of the others, reducing effectiveness against attack types that are indistinguishable when solely using the dominant modality. To address modality unreliability, we propose the Uncertainty-Guided Cross-Adapter (U-Adapter) to recognize unreliably detected regions within each modality and suppress the impact of unreliable regions on other modalities. For modality imbalance, we propose a Rebalanced Modality Gradient Modulation (ReGrad) strategy to rebalance the convergence speed of all modalities by adaptively adjusting their gradients. In addition, we provide the first large-scale benchmark for evaluating multi-modal FAS performance under domain generalization scenarios. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. Source code and protocols are released at https://github.com/OMGGGGG/mmdg.
-
The remarkable success of "pretrain-then-finetune" paradigm has led to a proliferation of available pre-trained models for vision tasks. This surge presents a significant challenge in efficiently choosing the most suitable pre-trained models for downstream tasks. The critical aspect of this challenge lies in effectively predicting the model transferability by considering the underlying fine-tuning dynamics. Existing methods often model fine-tuning dynamics in feature space with linear transformations which do not precisely align with the fine-tuning objective and fail to grasp the essential nonlinearity from optimization. To this end we present LEAD a finetuning-aligned approach based on the network output of logits. LEAD proposes a theoretical framework to model the optimization process and derives an ordinary differential equation (ODE) to depict the nonlinear evolution toward the final logit state. Additionally we design a class-aware decomposition method to consider the varying evolution dynamics across classes and further ensure practical applicability. Integrating the closely aligned optimization objective and nonlinear modeling capabilities derived from the differential equation our method offers a concise solution to effectively bridge the optimization gap in a single step bypassing the lengthy fine-tuning process. The comprehensive experiments on 24 supervised and self-supervised pre-trained models across 10 downstream datasets demonstrate impressive performances and showcase its broad adaptability even in low-data scenarios.
-
Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset by augmenting Ego4D with 8267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are publicly available at https://sites.google.com/view/vidrecap.
-
Diffusion models (DMs) excel in photo-realistic image synthesis, but their adaptation to LiDAR scene generation poses a substantial hurdle. This is primarily because DMs operating in the point space struggle to preserve the curve-like patterns and 3D geometry of LiDAR scenes, which consumes much of their representation power. In this paper, we propose LiDAR Diffusion Models (LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to capture the realism of LiDAR scenes by incorporating geometric priors into the learning pipeline. Our method targets three major desiderata: pattern realism, geometry realism, and object realism. Specifically, we introduce curve-wise compression to simulate real-world LiDAR patterns, point-wise coordinate supervision to learn scene geometry, and patch-wise encoding for a full 3D object context. With these three core designs, our method achieves competitive performance on unconditional LiDAR generation in the 64-beam scenario and state-of-the-art performance on conditional LiDAR generation, while maintaining high efficiency compared to point-based DMs (up to 107x faster). Furthermore, by compressing LiDAR scenes into a latent space, we enable the controllability of DMs with various conditions such as semantic maps, camera views, and text prompts. Our code and pretrained weights are available at https://github.com/hancyran/LiDAR-Diffusion.
-
Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance
Reflectance bounds the frequency spectrum of illumination in the object appearance. In this paper we introduce the first stochastic inverse rendering method which recovers the attenuated frequency spectrum of an illumination jointly with the reflectance of an object of known geometry from a single image. Our key idea is to solve this blind inverse problem in the reflectance map an appearance representation invariant to the underlying geometry by learning to reverse the image formation with a novel diffusion model which we refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed reflectance map converted and completed from the single input image DRMNet generates a reflectance map corresponding to a perfect mirror sphere while jointly estimating the reflectance. The forward process can be understood as gradually filtering a natural illumination with lower and lower frequency reflectance and additive Gaussian noise. DRMNet learns to invert this process with two subnetworks IllNet and RefNet which work in concert towards this joint estimation. The network is trained on an extensive synthetic dataset and is demonstrated to generalize to real images showing state-of-the-art accuracy on established datasets.
-
This paper aims to achieve universal segmentation at arbitrary semantic levels. Despite significant progress in recent years, specialist segmentation approaches are limited to specific tasks and data distributions. Retraining a new model to adapt to new scenarios or settings incurs expensive computation and time costs, which raises the demand for a versatile and universal segmentation model that can cater to various granularities. Although some attempts have been made to unify different segmentation tasks or to generalize to various scenarios, limitations in the definition of paradigms and input-output spaces make it difficult for them to achieve an accurate understanding of content at arbitrary granularity. To this end, we present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level with the guidance of language instructions. To train UniLSeg, we reorganize a group of tasks from their original diverse distributions into a unified data format, where images with texts describing segmentation targets are the input and the corresponding masks are the output. Combined with an automatic annotation engine for utilizing numerous unlabeled data, UniLSeg achieves excellent performance on various tasks and settings, surpassing both specialist and unified segmentation models.
-
We introduce GaussianAvatars a new method to create photorealistic head avatars that are fully controllable in terms of expression pose and viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This combination facilitates photorealistic rendering while allowing for precise animation control via the underlying parametric model e.g. through expression transfer from a driving sequence or by manually changing the morphable model parameters. We parameterize each splat by a local coordinate frame of a triangle and optimize for explicit displacement offset to obtain a more accurate geometric representation. During avatar reconstruction we jointly optimize for the morphable model parameters and Gaussian splat parameters in an end-to-end fashion. We demonstrate the animation capabilities of our photorealistic avatar in several challenging scenarios. For instance we show reenactments from a driving video where our method outperforms existing works by a significant margin.
-
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams quizzes and textbooks covering six core disciplines: Art & Design Business Science Health & Medicine Humanities & Social Science and Tech & Engineering. These questions span 30 subjects and 183 subfields comprising 30 highly heterogeneous image types such as charts diagrams maps tables music sheets and chemical structures. Unlike existing benchmarks MMMU focuses on advanced perception and reasoning with domain-specific knowledge challenging models to perform tasks akin to those faced by experts. The evaluation of 28 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.
-
While diffusion models have significantly advanced the quality of image generation their capability to accurately and coherently render text within these images remains a substantial challenge. Conventional diffusion-based methods for scene text generation are typically limited by their reliance on an intermediate layout output. This dependency often results in a constrained diversity of text styles and fonts an inherent limitation stemming from the deterministic nature of the layout generation phase. To address these challenges this paper introduces SceneTextGen a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage. By doing so SceneTextGen facilitates a more natural and varied representation of text. The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties coupled with a character-level instance segmentation model and a word-level spotting model to address the issues of unwanted text generation and minor character inaccuracies. We validate the performance of our method by demonstrating improved character recognition rates on generated images across different public visual text datasets in comparison to both standard diffusion based methods and text specific methods.
-
Astronaut photography, spanning six decades of human spaceflight, presents a unique Earth observations dataset with immense value for both scientific research and disaster response. Despite its significance, accurately localizing the geographical extent of these images, which is crucial for effective utilization, poses substantial challenges. Current manual localization efforts are time-consuming, motivating the need for automated solutions. We propose a novel approach, leveraging image retrieval, to address this challenge efficiently. We introduce innovative training techniques, including Year-Wise Data Augmentation and a Neutral-Aware Multi-Similarity Loss, which contribute to the development of a high-performance model, EarthLoc. We develop six evaluation datasets and perform a comprehensive benchmark comparing EarthLoc to existing methods, showcasing its superior efficiency and accuracy. Our approach marks a significant advancement in automating the localization of astronaut photography, which will help bridge a critical gap in Earth observations data. Code and datasets are available at https://github.com/gmberton/EarthLoc
-
The field of generative image inpainting and object insertion has made significant progress with the recent advent of latent diffusion models. Utilizing a precise object mask can greatly enhance these applications. However due to the challenges users encounter in creating high-fidelity masks there is a tendency for these methods to rely on more coarse masks (e.g. bounding box) for these applications. This results in limited control and compromised background content preservation. To overcome these limitations we introduce SmartMask which allows any novice user to create detailed masks for precise object insertion. Combined with a ControlNet-Inpaint model our experiments demonstrate that SmartMask achieves superior object insertion quality preserving the background content more effectively than previous methods. Notably unlike prior works the proposed approach can also be used even without user-mask guidance which allows it to perform mask-free object insertion at diverse positions and scales. Furthermore we find that when used iteratively with a novel instruction-tuning based planning model SmartMask can be used to design detailed layouts from scratch. As compared with user-scribble based layout design we observe that SmartMask allows for better quality outputs with layout-to-image generation methods.
-
Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps leading to better perceptual performance. Our approach improves upon the current state-of-the-art in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore our method generalizes to the cross-domain setting. We use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our cross-domain object detection model trained on Pascal VOC achieves SOTA results on Watercolor2K. Our cross-domain segmentation method trained on Cityscapes achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: vision.caltech.edu/TADP/. Code: github.com/damaggu/TADP
-
Customizing pre-trained text-to-image generation models has attracted massive research interest recently due to its huge potential in real-world applications. Although existing methods are able to generate creative content for a novel concept contained in a single user-input image, their capability is still far from perfect. Specifically, most existing methods require fine-tuning the generative model on testing images. Some existing methods do not require fine-tuning, but their performance is unsatisfactory. Furthermore, the interaction between users and models is still limited to directive and descriptive prompts such as instructions and captions. In this work, we build a customization assistant based on a pre-trained large language model and diffusion model, which can not only perform customized generation in a tuning-free manner but also enable more user-friendly interactions: users can chat with the assistant and input either ambiguous text or clear instructions. Specifically, we propose a new framework consisting of a new model design and a novel training strategy. The resulting assistant can perform customized generation in 2-5 seconds without any test-time fine-tuning. Extensive experiments are conducted, and competitive results are obtained across different domains, illustrating the effectiveness of the proposed method.
-
Recently impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However current diffusion models primarily generate images by predicting noise in the latent space and the editing is usually applied to the whole image which makes it challenging to perform delicate especially localized editing for 3D scenes. Inspired by recent 3D Gaussian splatting we propose a systematic framework named GaussianEditor to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians we design a series of techniques to achieve delicate editing. Specifically we first extract the region of interest (RoI) corresponding to the text instruction aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed i.e. within 20 minutes on a single V100 GPU more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours). The project page is at GaussianEditor.github.io.
-
Optical flow is a classical task that is important to the vision community. Classical optical flow estimation uses two frames as input whilst some recent methods consider multiple frames to explicitly model long-range information. The former ones limit their ability to fully leverage temporal coherence along the video sequence; and the latter ones incur heavy computational overhead typically not possible for real-time flow estimation. Some multi-frame-based approaches even necessitate unseen future frames for current estimation compromising real-time applicability in safety-critical scenarios. To this end we present MemFlow a real-time method for optical flow estimation and prediction with memory. Our method enables memory read-out and update modules for aggregating historical motion information in real-time. Furthermore we integrate resolution-adaptive re-scaling to accommodate diverse video resolutions. Besides our approach seamlessly extends to the future prediction of optical flow based on past observations. Leveraging effective historical motion aggregation our method outperforms VideoFlow with fewer parameters and faster inference speed on Sintel and KITTI-15 datasets in terms of generalization performance. At the time of submission MemFlow also leads in performance on the 1080p Spring dataset. Codes and models will be available at: https://dqiaole.github.io/MemFlow/.
-
Ultra-fine-grained visual categorization (Ultra-FGVC) aims at distinguishing highly similar sub-categories within fine-grained objects such as different soybean cultivars. Compared to traditional fine-grained visual categorization Ultra-FGVC encounters more hurdles due to the small inter-class and large intra-class variation. Given these challenges relying on human annotation for Ultra-FGVC is impractical. To this end our work introduces a novel task termed Ultra-Fine-Grained Novel Class Discovery (UFG-NCD) which leverages partially annotated data to identify new categories of unlabeled images for Ultra-FGVC. To tackle this problem we devise a Region-Aligned Proxy Learning (RAPL) framework which comprises a Channel-wise Region Alignment (CRA) module and a Semi-Supervised Proxy Learning (SemiPL) strategy. The CRA module is designed to extract and utilize discriminative features from local regions facilitating knowledge transfer from labeled to unlabeled classes. Furthermore SemiPL strengthens representation learning and knowledge transfer with proxy-guided supervised learning and proxy-guided contrastive learning. Such techniques leverage class distribution information in the embedding space improving the mining of subtle differences between labeled and unlabeled ultra-fine-grained classes. Extensive experiments demonstrate that RAPL significantly outperforms baselines across various datasets indicating its effectiveness in handling the challenges of UFG-NCD. Code is available at https://github.com/SSDUT-Caiyq/UFG-NCD.
-
We present Paint-it a text-driven high-fidelity texture map synthesis method for 3D meshes via neural re-parameterized texture optimization. Paint-it synthesizes texture maps from a text description by synthesis-through-optimization exploiting the Score-Distillation Sampling (SDS). We observe that directly applying SDS yields undesirable texture quality due to its noisy gradients. We reveal the importance of texture parameterization when using SDS. Specifically we propose Deep Convolutional Physically-Based Rendering (DC-PBR) parameterization which re-parameterizes the physically-based rendering (PBR) texture maps with randomly initialized convolution-based neural kernels instead of a standard pixel-based parameterization. We show that DC-PBR inherently schedules the optimization curriculum according to texture frequency and naturally filters out the noisy signals from SDS. In experiments Paint-it obtains remarkable quality PBR texture maps within 15 min. given only a text description. We demonstrate the generalizability and practicality of Paint-it by synthesizing high-quality texture maps for large-scale mesh datasets and showing test-time applications such as relighting and material control using a popular graphics engine.
-
Being able to understand visual scenes is a precursor for many downstream tasks including autonomous driving robotics and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however many existing approaches assume undisturbed vision i.e. the absence of real-world corruptions such as fog snow smoke as well as non-uniform perturbations like sun glare or water drops. In this work we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further we introduce a corresponding approach Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG) providing a strong baseline for scene graph generation under such challenging setting. At its core HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments we show that HiKER-SGG does not only demonstrate superior performance on corrupted images in a zero-shot manner but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at https://github.com/zhangce01/HiKER-SGG.
-
Text-guided domain adaptation and generation of 3D-aware portraits find many applications in various fields. However due to the lack of training data and the challenges in handling the high variety of geometry and appearance the existing methods for these tasks suffer from issues like inflexibility instability and low fidelity. In this paper we propose a novel framework DiffusionGAN3D which boosts text-guided 3D domain adaptation and generation by combining 3D GANs and diffusion priors. Specifically we integrate the pre-trained 3D generative models (e.g. EG3D) and text-to-image diffusion models. The former provides a strong foundation for stable and high-quality avatar generation from text. And the diffusion models in turn offer powerful priors and guide the 3D generator finetuning with informative direction to achieve flexible and efficient text-guided domain adaptation. To enhance the diversity in domain adaptation and the generation capability in text-to-avatar we introduce the relative distance loss and case-specific learnable triplane respectively. Besides we design a progressive texture refinement module to improve the texture quality for both tasks above. Extensive experiments demonstrate that the proposed framework achieves excellent results in both domain adaptation and text-to-avatar tasks outperforming existing methods in terms of generation quality and efficiency. The project homepage is at https://younglbw.github.io/DiffusionGAN3D-homepage/.
-
The credibility and practicality of a reconstructed hand-object interaction sequence depend largely on its physical plausibility. However due to high occlusions during hand-object interaction physical plausibility remains a challenging criterion for purely vision-based tracking methods. To address this issue and enhance the results of existing hand trackers this paper proposes a novel physically-aware hand motion de-noising method. Specifically we introduce two learned loss terms that explicitly capture two crucial aspects of physical plausibility: grasp credibility and manipulation feasibility. These terms are used to train a physically-aware de-noising network. Qualitative and quantitative experiments demonstrate that our approach significantly improves both fine-grained physical plausibility and overall pose accuracy surpassing current state-of-the-art de-noising methods.
-
Existing NeRF-based methods for large scene reconstruction often have limitations in visual quality and rendering speed. While the recent 3D Gaussian Splatting works well on small-scale and object-centric scenes scaling it up to large scenes poses challenges due to limited video memory long optimization time and noticeable appearance variations. To address these challenges we present VastGaussian the first method for high-quality reconstruction and real-time rendering on large scenes based on 3D Gaussian Splatting. We propose a progressive partitioning strategy to divide a large scene into multiple cells where the training cameras and point cloud are properly distributed with an airspace-aware visibility criterion. These cells are merged into a complete scene after parallel optimization. We also introduce decoupled appearance modeling into the optimization process to reduce appearance variations in the rendered images. Our approach outperforms existing NeRF-based methods and achieves state-of-the-art results on multiple large scene datasets enabling fast optimization and high-fidelity real-time rendering.
-
In recent years image editing has advanced remarkably. With increased human control it is now possible to edit an image in a plethora of ways; from specifying in text what we want to change to straight up dragging the contents of the image in an interactive point-based manner. However most of the focus has remained on editing single images at a time. Whether and how we can simultaneously edit large batches of images has remained understudied. With the goal of minimizing human supervision in the editing process this paper presents a novel method for interactive batch image editing using StyleGAN as the medium. Given an edit specified by users in an example image (e.g. make the face frontal) our method can automatically transfer that edit to other test images so that regardless of their initial state (pose) they all arrive at the same final state (e.g. all facing front). Extensive experiments demonstrate that edits performed using our method have similar visual quality to existing single-image-editing methods while having more visual consistency and saving significant time and human effort.
-
Oriented object detection has developed rapidly in the past few years, where rotation equivariance is crucial for detectors to predict rotated boxes. It is expected that the prediction maintains the corresponding rotation when objects rotate, but a severe mutation in angular prediction is sometimes observed when objects rotate near the boundary angle, which is the well-known boundary discontinuity problem. The problem has long been believed to be caused by the sharp loss increase at the angular boundary, and widely used joint-optim IoU-like methods deal with this problem by loss-smoothing. However, we experimentally find that even state-of-the-art IoU-like methods actually fail to solve the problem. On further analysis, we find that the key to the solution lies in the encoding mode of the smoothing function rather than in joint or independent optimization. In existing IoU-like methods, the model essentially attempts to fit the angular relationship between box and object, where the break point at the angular boundary makes the predictions highly unstable. To deal with this issue, we propose a dual-optimization paradigm for angles. We decouple reversibility and joint-optim from a single smoothing function into two distinct entities, which for the first time achieves the objectives of both correcting the angular boundary and blending angle with other parameters. Extensive experiments on multiple datasets show that the boundary discontinuity problem is well-addressed. Moreover, typical IoU-like methods are improved to the same level without an obvious performance gap. The code is available at https://github.com/hangxu-cv/cvpr24acm.
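The boundary discontinuity itself can be shown with a few lines of arithmetic. The sketch below only illustrates the problem and is not the paper's dual-optimization method. It assumes box orientation is periodic with a period of 180 degrees, and the function names are hypothetical.

```python
def naive_angle_error(pred_deg, target_deg):
    """Plain absolute difference, ignoring the periodicity of box angles."""
    return abs(pred_deg - target_deg)


def periodic_angle_error(pred_deg, target_deg, period=180.0):
    """Smallest angular distance when angles repeat every `period` degrees."""
    diff = abs(pred_deg - target_deg) % period
    return min(diff, period - diff)


# Two nearly identical boxes whose angles sit on either side of the boundary.
pred, target = 89.0, -89.0
print(naive_angle_error(pred, target))     # large error despite similar boxes
print(periodic_angle_error(pred, target))  # small error that respects periodicity
```

With the naive measure the two orientations appear 178 degrees apart, while they differ by only 2 degrees once periodicity is taken into account. A regression target encoded the naive way therefore jumps sharply at the boundary.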
-
This paper addresses the complex issue of one-shot face stylization focusing on the simultaneous consideration of appearance and structure where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer specifically DINO-ViT to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space and ii) a relative structural consistency constraint based on DINO token self-similarities ensuring diverse generation. Additionally style-mixing is employed to align the color generation with the reference minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate our superiority over state-of-the-art one-shot face stylization methods. Code is available at https://github.com/zichongc/DoesFS
-
Advances in camera-based physiological monitoring have enabled the robust, non-contact measurement of respiration and the cardiac pulse, which are known to be indicative of the sleep stage. This has led to research into camera-based sleep monitoring as a promising alternative to "gold-standard" polysomnography, which is cumbersome, expensive to administer, and hence unsuitable for longer-term clinical studies. In this paper, we introduce SleepVST, a transformer model which enables state-of-the-art performance in camera-based sleep stage classification (sleep staging). After pre-training on contact sensor data, SleepVST outperforms existing methods for cardio-respiratory sleep staging on the SHHS and MESA datasets, achieving total Cohen's kappa scores of 0.75 and 0.77 respectively. We then show that SleepVST can be successfully transferred to cardio-respiratory waveforms extracted from video, enabling fully contact-free sleep staging. Using a video dataset of 50 nights, we achieve a total accuracy of 78.8% and a Cohen's kappa of 0.71 in four-class video-based sleep staging, setting a new state-of-the-art in the domain.
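For readers unfamiliar with the metric, Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement derived from the label marginals. The sketch below is a minimal implementation on toy labels. The labels are made up for illustration and are not taken from SHHS, MESA, or the video dataset.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy sleep-stage labels (hypothetical).
reference = ["wake", "wake", "rem", "rem"]
predicted = ["wake", "wake", "rem", "wake"]
print(cohens_kappa(reference, predicted))
```

In this toy case the raw accuracy is 0.75 but kappa is 0.5, because half of the agreement would be expected by chance given the label frequencies. This is why sleep-staging results typically report kappa alongside accuracy.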
-
The diffusion model is a promising approach to image generation and has been employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive performance. While existing methods simply align the person appearance to the target pose, they are prone to overfitting due to the lack of a high-level semantic understanding of the source person image. In this paper, we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence of image-caption pairs and textual prompts, we develop a novel training paradigm purely based on images to control the generation process of a pre-trained text-to-image diffusion model. A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt. This allows for the decoupling of fine-grained appearance and pose information controls at different stages and thus circumvents the potential overfitting problem. To generate more realistic texture details, a hybrid-granularity attention module is proposed to encode multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt. Both quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the art for PGPIS. Code is available at https://github.com/YanzuoLu/CFLD.
-
Diffusion Models (DMs) have shown remarkable capabilities in various image-generation tasks. However there are growing concerns that DMs could be used to imitate unauthorized creations and thus raise copyright issues. To address this issue we propose a novel framework that embeds personal watermarks in the generation of adversarial examples. Such examples can force DMs to generate images with visible watermarks and prevent DMs from imitating unauthorized images. We construct a generator based on conditional adversarial networks and design three losses (adversarial loss GAN loss and perturbation loss) to generate adversarial examples that have subtle perturbation but can effectively attack DMs to prevent copyright violations. Training a generator for a personal watermark by our method only requires 5-10 samples within 2-3 minutes and once the generator is trained it can generate adversarial examples with that watermark significantly fast (0.2s per image). We conduct extensive experiments in various conditional image-generation scenarios. Compared to existing methods that generate images with chaotic textures our method adds visible watermarks on the generated images which is a more straightforward way to indicate copyright violations. We also observe that our adversarial examples exhibit good transferability across unknown generative models. Therefore this work provides a simple yet powerful way to protect copyright from DM-based imitation.
-
Prompt tuning represents a valuable technique for adapting pre-trained visual-language models (VLM) to various downstream tasks. Recent advancements in CoOp-based methods propose a set of learnable domain-shared or image-conditional textual tokens to facilitate the generation of task-specific textual classifiers. However, those textual tokens have a limited generalization ability regarding unseen domains, as they cannot dynamically adjust to the distribution of testing classes. To tackle this issue, we present a novel Textual-based Class-aware Prompt tuning (TCP) that explicitly incorporates prior knowledge about classes to enhance their discriminability. The critical concept of TCP involves leveraging Textual Knowledge Embedding (TKE) to map the high generalizability of class-level textual knowledge into class-aware textual tokens. By seamlessly integrating these class-aware prompts into the Text Encoder, a dynamic class-aware classifier is generated to enhance discriminability for unseen domains. During inference, TKE dynamically generates class-aware prompts related to the unseen classes. Comprehensive evaluations demonstrate that TKE serves as a plug-and-play module effortlessly combinable with existing methods. Furthermore, TCP consistently achieves superior performance while demanding less training time.
-
We have recently seen tremendous progress in realistic text-to-motion generation. Yet the existing methods often fail or produce implausible motions with unseen text inputs which limits the applications. In this paper we present OMG a novel framework which enables compelling motion generation from zero-shot open-vocabulary text prompts. Our key idea is to carefully tailor the pretrain-then-finetune paradigm into the text-to-motion generation. At the pre-training stage our model improves the generation ability by learning the rich out-of-domain inherent motion traits. To this end we scale up a large unconditional diffusion model up to 1B parameters so as to utilize the massive unlabeled motion data up to over 20M motion instances. At the subsequent fine-tuning stage we introduce motion ControlNet which incorporates text prompts as conditioning information through a trainable copy of the pre-trained model and the proposed novel Mixture-of-Controllers (MoC) block. MoC block adaptively recognizes various ranges of the sub-motions with a cross-attention mechanism and processes them separately with the text-token-specific experts. Such a design effectively aligns the CLIP token embeddings of text prompts to various ranges of compact and expressive motion features. Extensive experiments demonstrate that our OMG achieves significant improvements over the state-of-the-art methods on zero-shot text-to-motion generation. Project page: https://tr3e.github.io/omg-page.
-
This work proposes TimeChat a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally we construct an instruction-tuning dataset encompassing 6 tasks and a total of 125K instances to further enhance TimeChat's instruction-following performance. Experiment results across various video understanding tasks such as dense captioning temporal grounding and highlight detection demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2 +5.8 HIT@1 on QVHighlights and +27.5 R@1 (IoU=0.5) on Charades-STA compared to state-of-the-art video large language models holding the potential to serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements.
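The temporal grounding figure quoted above, R@1 (IoU=0.5), counts a query as correct when the top-ranked predicted segment overlaps the ground-truth segment with a temporal intersection-over-union at or above the threshold. The sketch below shows that computation on made-up segments. The function names and the use of ">=" at the threshold are assumptions, and benchmark toolkits may differ in such details.

```python
def temporal_iou(pred, gt):
    """IoU of two time segments given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Hypothetical top-1 predictions and ground-truth segments for two queries.
predictions = [(2.0, 9.0), (0.0, 2.0)]
ground_truths = [(4.0, 10.0), (5.0, 9.0)]
print(temporal_iou(predictions[0], ground_truths[0]))
print(recall_at_1(predictions, ground_truths))
```

Here the first prediction overlaps its target with an IoU of 0.625 and counts as a hit, while the second does not overlap at all, giving a recall of 0.5.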
-
Text-guided diffusion models have revolutionized image and video generation and have also been successfully used for optimization-based 3D object synthesis. Here we instead focus on the underexplored text-to-4D setting and synthesize dynamic animated 3D objects using score distillation methods with an additional temporal dimension. Compared to previous work we pursue a novel compositional generation-based approach and combine text-to-image text-to-video and 3D-aware multiview diffusion models to provide feedback during 4D object optimization thereby simultaneously enforcing temporal consistency high-quality visual appearance and realistic geometry. Our method called Align Your Gaussians (AYG) leverages dynamic 3D Gaussian Splatting with deformation fields as 4D representation. Crucial to AYG is a novel method to regularize the distribution of the moving 3D Gaussians and thereby stabilize the optimization and induce motion. We also propose a motion amplification mechanism as well as a new autoregressive synthesis scheme to generate and combine multiple 4D sequences for longer generation. These techniques allow us to synthesize vivid dynamic scenes outperform previous work qualitatively and quantitatively and achieve state-of-the-art text-to-4D performance. Due to the Gaussian 4D representation different 4D animations can be seamlessly combined as we demonstrate. AYG opens up promising avenues for animation simulation and digital content creation as well as synthetic data generation.
-
Existing point cloud semantic segmentation networks cannot identify unknown classes and update their knowledge due to a closed-set and static perspective of the real world which would induce the intelligent agent to make bad decisions. To address this problem we propose a Probability-Driven Framework (PDF) for open world semantic segmentation that includes (i) a lightweight U-decoder branch to identify unknown classes by estimating the uncertainties (ii) a flexible pseudo-labeling scheme to supply geometry features along with probability distribution features of unknown classes by generating pseudo labels and (iii) an incremental knowledge distillation strategy to incorporate novel classes into the existing knowledge base gradually. Our framework enables the model to behave like human beings which could recognize unknown objects and incrementally learn them with the corresponding knowledge. Experimental results on the S3DIS and ScanNetv2 datasets demonstrate that the proposed PDF outperforms other methods by a large margin in both important tasks of open world semantic segmentation.
-
Face Anti-Spoofing (FAS) is pivotal in safeguarding facial recognition systems against presentation attacks. While domain generalization (DG) methods have been developed to enhance FAS performance they predominantly focus on learning domain-invariant features during training which may not guarantee generalizability to unseen data that differs largely from the source distributions. Our insight is that testing data can serve as a valuable resource to enhance the generalizability beyond mere evaluation for DG FAS. In this paper we introduce a novel Test-Time Domain Generalization (TTDG) framework for FAS which leverages the testing data to boost the model's generalizability. Our method consisting of Test-Time Style Projection (TTSP) and Diverse Style Shifts Simulation (DSSS) effectively projects the unseen data to the seen domain space. In particular we first introduce the innovative TTSP to project the styles of the arbitrarily unseen samples of the testing distribution to the known source space of the training distributions. We then design the efficient DSSS to synthesize diverse style shifts via learnable style bases with two specifically designed losses in a hyperspherical feature space. Our method eliminates the need for model updates at the test time and can be seamlessly integrated into not only the CNN but also ViT backbones. Comprehensive experiments on widely used cross-domain FAS benchmarks demonstrate our method's state-of-the-art performance and effectiveness.
-
Recently, there has been an increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is only labeled for a subset of the tasks. The absence of task labels in training leads to low-quality and noisy predictions, as can be observed from state-of-the-art methods. To tackle this issue, we reformulate the partially-labeled multi-task dense prediction as a pixel-level denoising problem and propose a novel multi-task denoising diffusion framework coined DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks, leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps and outperform the state-of-the-art methods on three challenging multi-task benchmarks under two different partial-labeling evaluation settings. The code is available at https://prismformore.github.io/diffusionmtl/.
-
The traditional frame-based cameras that rely on exposure windows for imaging experience motion blur in high-speed scenarios. Frame-based deblurring methods lack reliable motion cues to restore sharp images under extreme blur conditions. The spike camera is a novel neuromorphic visual sensor that outputs spike streams with ultra-high temporal resolution. It can supplement the temporal information lost in traditional cameras and guide motion deblurring. However in real-world scenarios aligning discrete RGB images and continuous spike streams along both temporal and spatial axes is challenging due to the complexity of calibrating their coordinates device displacements in vibrations and time deviations. Misalignment of pixels leads to severe degradation of deblurring. We introduce the first framework for spike-guided motion deblurring without knowing the spatiotemporal alignment between spikes and images. To address the problem we first propose a novel three-stage network containing a basic deblurring net a carefully designed bi-directional deformable aligning module and a flow-based multi-scale fusion net. Experimental results demonstrate that our approach can effectively guide the image deblurring with unknown alignment surpassing the performance of other methods. Public project page: https://github.com/Leozhangjiyuan/UaSDN.
-
In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in the target image. Notably, the VRP encoder can support a variety of annotation formats for reference images, including point, box, scribble, and mask. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to perform segmentation of unseen objects and enabling cross-domain segmentation. The source code and models will be available at https://github.com/syp2ysy/VRP-SAM
-
Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world environments. Activation-based methods are a key approach in OOD detection, working to mitigate overconfident predictions on OOD data. These techniques rectify anomalous activations, enhancing the distinguishability between in-distribution (ID) data and OOD data. However, they assume by default that every channel is necessary for OOD detection and rectify anomalous activations in each channel. Empirical evidence has shown that channels differ significantly in their contribution to OOD detection, and that discarding some channels can greatly enhance OOD detection performance. Based on this insight, we propose Discriminability-Driven Channel Selection (DDCS), which leverages adaptive channel selection by estimating the discriminative score of each channel to boost OOD detection. The discriminative score takes inter-class similarity and inter-class variance of training data into account. However, the estimation of the discriminative score is itself susceptible to anomalous activations. To better estimate the score, we mildly pre-rectify anomalous activations for each channel. The experimental results show that DDCS achieves state-of-the-art performance on CIFAR and ImageNet-1K benchmarks. Moreover, DDCS can generalize to different backbones and OOD scores.
-
Recent works have shown that generative models leave traces of their underlying generative process on the generated samples, broadly referred to as fingerprints of a generative model, and have studied their utility in distinguishing synthetic images from real ones. However, the extent to which these fingerprints can distinguish between various types of synthetic images and help identify the underlying generative process remains under-explored. In particular, to our knowledge, the very definition of a fingerprint remains unclear. To that end, in this work we formalize the definition of artifact and fingerprint in generative models, propose an algorithm for computing them in practice, and finally study its effectiveness in distinguishing a large array of different generative models. We find that using our proposed definition can significantly improve the performance on the task of identifying the underlying generative process from samples (model attribution) compared to existing methods. Additionally, we study the structure of the fingerprints and observe that it is very predictive of the effect of different design choices on the generative process.
-
Synthesizing realistic videos of talking faces under custom lighting conditions and viewing angles benefits various downstream applications like video conferencing. However most existing relighting methods are either time-consuming or unable to adjust the viewpoints. In this paper we present the first real-time 3D-aware method for relighting in-the-wild videos of talking faces based on Neural Radiance Fields (NeRF). Given an input portrait video our method can synthesize talking faces under both novel views and novel lighting conditions with a photo-realistic and disentangled 3D representation. Specifically we infer an albedo tri-plane as well as a shading tri-plane based on a desired lighting condition for each video frame with fast dual-encoders. We also leverage a temporal consistency network to ensure smooth transitions and reduce flickering artifacts. Our method runs at 32.98 fps on consumer-level hardware and achieves state-of-the-art results in terms of reconstruction quality lighting error lighting instability temporal consistency and inference speed. We demonstrate the effectiveness and interactivity of our method on various portrait videos with diverse lighting and viewing conditions.
-
We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image synthesis but often require days of training and are extremely slow at inference time. Recently the community has explored fast grid structures for efficient training of clothed avatars. Albeit being extremely fast at training these methods can barely achieve an interactive rendering frame rate with around 15 FPS. In this paper we use 3D Gaussian Splatting and learn a non-rigid deformation network to reconstruct animatable clothed human avatars that can be trained within 30 minutes and rendered at real-time frame rates (50+ FPS). Given the explicit nature of our representation we further introduce as-isometric-as-possible regularizations on both the Gaussian mean vectors and the covariance matrices enhancing the generalization of our model on highly articulated unseen poses. Experimental results show that our method achieves comparable and even better performance compared to state-of-the-art approaches on animatable avatar creation from a monocular input while being 400x and 250x faster in training and inference respectively.
-
Diagnosis in histopathology requires a global analysis of whole slide images (WSIs), with pathologists compounding evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-modal models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches without a spatial grounding of the concepts within each patch and without a wider view of the WSI. To bridge this gap, we introduce QUILT-INSTRUCT, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provide spatial localization of narrations by automatically extracting the narrators' cursor positions. QUILT-INSTRUCT supports contextual reasoning by extracting diagnosis and supporting facts from the entire WSI. Using QUILT-INSTRUCT, we train QUILT-LLAVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate QUILT-LLAVA, we propose a comprehensive evaluation dataset created from 985 images and 1,283 human-generated question-answer pairs. We also thoroughly evaluate QUILT-LLAVA using public histopathology datasets, where QUILT-LLAVA significantly outperforms SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set VQA.
-
Traffic scene perception in computer vision is a critically important task to achieve intelligent cities. To date most existing datasets focus on autonomous driving scenes. We observe that the models trained on those driving datasets often yield unsatisfactory results on traffic monitoring scenes. However little effort has been put into improving the traffic monitoring scene understanding mainly due to the lack of specific datasets. To fill this gap we introduce a specialized traffic monitoring dataset termed TSP6K containing images from the traffic monitoring scenario with high-quality pixel-level and instance-level annotations. The TSP6K dataset captures more crowded traffic scenes with several times more traffic participants than the existing driving scenes. We perform a detailed analysis of the dataset and comprehensively evaluate previous popular scene parsing methods instance segmentation methods and unsupervised domain adaption methods. Furthermore considering the vast difference in instance sizes we propose a detail refining decoder for scene parsing which recovers the details of different semantic regions in traffic scenes owing to the proposed TSP6K dataset. Experiments show its effectiveness in parsing the traffic monitoring scenes. Code and dataset are available at https://github.com/PengtaoJiang/TSP6K.
-
Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields generating visually compelling outputs from textual prompts. However controlling these models to ensure consistent style remains challenging with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper we introduce StyleAligned a novel technique designed to establish style alignment among a series of generated images. By employing minimal `attention sharing' during the diffusion process our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity underscoring its efficacy in achieving consistent style across various inputs.
-
Geometry Problem Solving has drawn growing attention recently due to its application prospects in the intelligent education field. However, existing methods are still inadequate to meet the needs of practical application, suffering from the following limitations: 1) explainability, which is essential in real teaching scenarios, is not ensured; 2) the small scale and incomplete annotation of existing datasets make it hard for models to comprehend geometric knowledge. To tackle the above problems, we propose a novel method called Explainable Geometry Problem Solving (E-GPS). E-GPS first parses the geometric diagram and problem text into unified formal language representations. Then the answer and the explainable reasoning and solving steps are obtained by a Top-Down Problem Solver (TD-PS), which innovatively solves the problem from the target and focuses on what is needed. To alleviate the data issues, a Bottom-Up Problem Generator (BU-PG) is devised to augment the dataset with various well-annotated constructed geometry problems. It enables us to train an enhanced theorem predictor with a better grasp of theorem knowledge, which further improves the efficiency of TD-PS. Extensive experiments demonstrate that E-GPS maintains comparable solving performance with fewer steps and provides outstanding explainability.
-
With the immense growth of dataset sizes and computing resources in recent years so-called foundation models have become popular in NLP and vision tasks. In this work we propose to explore foundation models for the task of keypoint detection on 3D shapes. A unique characteristic of keypoint detection is that it requires semantic and geometric awareness while demanding high localization accuracy. To address this problem we propose first to back-project features from large pre-trained 2D vision models onto 3D shapes and employ them for this task. We show that we obtain robust 3D features that contain rich semantic information and analyze multiple candidate features stemming from different 2D foundation models. Second we employ a keypoint candidate optimization module which aims to match the average observed distribution of keypoints on the shape and is guided by the back-projected features. The resulting approach achieves a new state of the art for few-shot keypoint detection on the KeyPointNet dataset almost doubling the performance of the previous best methods.
-
Existing joint low-light enhancement and deblurring methods learn pixel-wise mappings from paired synthetic data which results in limited generalization in real-world scenes. While some studies explore the rich generative prior of pre-trained diffusion models they typically rely on the assumed degradation process and cannot handle unknown real-world degradations well. To address these problems we propose a novel zero-shot framework FourierDiff which embeds Fourier priors into a pre-trained diffusion model to harmoniously handle the joint degradation of luminance and structures. FourierDiff is appealing in its relaxed requirements on paired training data and degradation assumptions. The key zero-shot insight is motivated by image characteristics in the Fourier domain: most luminance information concentrates on amplitudes while structure and content information are closely related to phases. Based on this observation we decompose the sampled results of the reverse diffusion process in the Fourier domain and take advantage of the amplitude of the generative prior to align the enhanced brightness with the distribution of natural images. To yield a sharp and content-consistent enhanced result we further design a spatial-frequency alternating optimization strategy to progressively refine the phase of the input. Extensive experiments demonstrate the superior effectiveness of the proposed method especially in real-world scenes.
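The amplitude/phase observation above can be illustrated on a toy 1D signal. The following is a minimal sketch, not FourierDiff itself: it keeps the phase of a dark input and borrows the amplitude of a brighter reference. The helper names `dft`, `idft`, and `swap_amplitude` are hypothetical, and the naive O(n^2) transform is chosen only to keep the example dependency-free.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def swap_amplitude(content, reference):
    """Combine the phase of `content` with the amplitude of `reference`."""
    C, R = dft(content), dft(reference)
    mixed = [abs(r) * cmath.exp(1j * cmath.phase(c)) for c, r in zip(C, R)]
    return [v.real for v in idft(mixed)]

# Toy data (assumption): the reference has the same structure but is 4x brighter.
dark = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1]
bright = [4 * v for v in dark]
out = swap_amplitude(dark, bright)

print([round(v, 3) for v in out])
```

Because the two toy signals share their phase, the recombined signal matches the bright reference: the structure comes from the phase and the brightness from the amplitude.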
-
Stereo matching is a core task for many computer vision and robotics applications. Despite their dominance in traditional stereo methods, the hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy compared to end-to-end deep models. While deep learning representations have greatly improved the unary terms of the MRF models, the overall accuracy is still severely limited by the hand-crafted pairwise terms and message passing. To address these issues, we propose a neural MRF model where both potential functions and message passing are designed using data-driven neural networks. Our fully data-driven model is built on the foundation of variational inference theory to prevent convergence issues and retain stereo MRF's graph inductive bias. To make the inference tractable and scale well to high-resolution images, we also propose a Disparity Proposal Network (DPN) to adaptively prune the search space of disparity. The proposed approach ranks 1st on both KITTI 2012 and 2015 leaderboards among all published methods while running faster than 100 ms. This approach significantly outperforms prior global methods, e.g., lowering the D1 metric by more than 50% on KITTI 2015. In addition, our method exhibits strong cross-domain generalization and can recover sharp edges. The code is available at https://github.com/aeolusguan/NMRF.
-
In autonomous driving predicting future events in advance and evaluating the foreseeable risks empowers autonomous vehicles to plan their actions enhancing safety and efficiency on the road. To this end we propose Drive-WM the first driving world model compatible with existing end-to-end planning models. Through a joint spatial-temporal modeling facilitated by view factorization our model is the first to generate high-fidelity multiview videos. Building on its powerful generation ability we showcase the potential of applying the world model for safe driving planning for the first time. Our Drive-WM enables driving into multiple futures based on distinct driving maneuvers and determines the optimal trajectory according to the image-based rewards. Evaluation on real-world driving datasets verifies that our method could generate high-quality consistent and controllable multiview videos opening up possibilities for real-world simulations and safe planning.
-
Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. The difficulties in interpreting and annotating event data limit its scalability. While domain adaptation from images to event data can help to mitigate this issue there exist data representational differences that require additional effort to resolve. In this work for the first time we synergize information from image text and event-data domains and introduce OpenESS to enable scalable ESS in an open-world annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks showed our approach outperforms existing methods. Notably we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels.
-
Aligned text-image encoders such as CLIP have become the de-facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods: a Fast Quadratic Assignment Problem optimization and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this on several downstream tasks including cross-lingual, cross-domain caption matching and image classification. Code available at github.com/mayug/0-shot-llm-vision.
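For readers unfamiliar with CKA, a minimal sketch of the linear variant follows. It is an illustration under stated assumptions rather than the authors' code: the abstract does not say which kernel is used, and the function names and toy matrices are hypothetical. Rows are samples and columns are feature dimensions.

```python
import math

def center(M):
    """Subtract the column mean from every column of M (rows are samples)."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]

def cross_sq_norm(A, B):
    """Squared Frobenius norm of A^T B, computed column pair by column pair."""
    total = 0.0
    for a_col in zip(*A):
        for b_col in zip(*B):
            total += sum(a * b for a, b in zip(a_col, b_col)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    Xc, Yc = center(X), center(Y)
    num = cross_sq_norm(Xc, Yc)
    den = math.sqrt(cross_sq_norm(Xc, Xc) * cross_sq_norm(Yc, Yc))
    return num / den

# Toy representations (assumption): Y is X rotated by 90 degrees and scaled by 3.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
Y = [[-3.0 * b, 3.0 * a] for a, b in X]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

print(linear_cka(X, Y), linear_cka(X, Z))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so the rotated and scaled copy `Y` scores 1 against `X` (up to floating-point error), while the unrelated `Z` scores lower.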
-
Currently high-definition (HD) map construction leans towards a lightweight online generation tendency which aims to preserve timely and reliable road scene information. However map elements contain strong shape priors. Subtle and sparse annotations make current detection-based frameworks ambiguous in locating relevant feature scopes and cause the loss of detailed structures in prediction. To alleviate these problems we propose MGMap a mask-guided approach that effectively highlights the informative regions and achieves precise map element localization by introducing the learned masks. Specifically MGMap employs learned masks based on the enhanced multi-scale BEV features from two perspectives. At the instance level we propose the Mask-activated instance (MAI) decoder which incorporates global instance and structural information into instance queries by the activation of instance masks. At the point level a novel position-guided mask patch refinement (PG-MPR) module is designed to refine point locations from a finer-grained perspective enabling the extraction of point-specific patch information. Compared to the baselines our proposed MGMap achieves a notable improvement of around 10 mAP for different input modalities. Extensive experiments also demonstrate that our approach showcases strong robustness and generalization capabilities. Our code can be found at https://github.com/xiaolul2/MGMap.
-
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
We introduce SUPIR (Scaling-UP Image Restoration) a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR model scaling dramatically enhances its capabilities and demonstrates new potential for image restoration. We collect a dataset comprising 20 million high-resolution high-quality images for model training each enriched with descriptive text annotations. SUPIR provides the capability to restore images guided by textual prompts broadening its application scope and potential. Moreover we introduce negative-quality prompts to further improve perceptual quality. We also develop a restoration-guided sampling method to suppress the fidelity issue encountered in generative-based restoration. Experiments demonstrate SUPIR's exceptional restoration effects and its novel capacity to manipulate restoration through textual prompts.
-
Multi-modality large language models (MLLMs), as represented by GPT-4V, have introduced a paradigm shift for visual perception and understanding tasks, in which a variety of abilities can be achieved within one foundation model. While current MLLMs demonstrate primary low-level visual abilities, from the identification of low-level visual attributes (e.g., clarity, brightness) to the evaluation of image quality, there is still an imperative to further improve the accuracy of MLLMs to substantially alleviate human burdens. To address this, we collect the first dataset consisting of human natural language feedback on low-level vision. Each feedback offers a comprehensive description of an image's low-level visual attributes, culminating in an overall quality assessment. The constructed Q-Pathway dataset includes 58K detailed human feedbacks on 18,973 multi-sourced images with diverse low-level appearance. To ensure MLLMs can adeptly handle diverse queries, we further propose a GPT-participated transformation to convert these feedbacks into a rich set of 200K instruction-response pairs, termed Q-Instruct. Experimental results indicate that Q-Instruct consistently elevates various low-level visual capabilities across multiple base models. We anticipate that our datasets can pave the way for a future in which foundation models can assist humans on low-level visual tasks.
-
Camera-parameter-free multi-view pose estimation is an emerging technique for 3D human pose estimation (HPE). Such methods can infer the camera settings implicitly or explicitly to mitigate the impact of depth uncertainty, showcasing significant potential in real applications. However, due to the limited camera setting diversity in the available datasets, the inferred camera parameters are always simply hardcoded into the model during training and not adaptable to the input at inference, so the learned models cannot generalize well under unseen camera settings. A natural solution is to artificially synthesize some samples, i.e., 2D-3D pose pairs, under massive new camera settings. Unfortunately, to prevent over-fitting to the existing camera setting, the number of synthesized samples for each new camera setting should be comparable with that for the existing one, which multiplies the scale of training and even makes it computationally prohibitive. In this paper, we propose a novel HPE approach under the invariant risk minimization (IRM) paradigm. Precisely, we first synthesize 2D poses from myriad camera settings. We then train our model under the IRM paradigm, which aims to learn a common optimal model across all camera settings and thus forces the model to automatically learn the camera parameters based on the input data. This allows the model to accurately infer 3D poses on unseen data by training on only a handful of samples from each synthesized setting and thus avoids a prohibitive increase in training cost. Another appealing feature of our method is that, benefiting from the capability of IRM to identify invariant features, its performance on the seen camera settings is enhanced as well. Comprehensive experiments verify the superiority of our approach.
-
Tone mapping techniques aiming to convert high dynamic range (HDR) images to high-quality low dynamic range (LDR) images for display play a more crucial role in real-world vision systems with the increasing application of HDR images. However obtaining paired HDR and high-quality LDR images is difficult posing a challenge to deep learning based tone mapping methods. To overcome this challenge we propose a novel zero-shot tone mapping framework that utilizes shared structure knowledge allowing us to transfer a pre-trained mapping model from the LDR domain to HDR fields without paired training data. Our approach involves decomposing both the LDR and HDR images into two components: structural information and tonal information. To preserve the original image's structure we modify the reverse sampling process of a diffusion model and explicitly incorporate the structure information into the intermediate results. Additionally for improved image details we introduce a dual-control network architecture that enables different types of conditional inputs to control different scales of the output. Experimental results demonstrate the effectiveness of our approach surpassing previous state-of-the-art methods both qualitatively and quantitatively. Moreover our model exhibits versatility and can be applied to other low-level vision tasks without retraining. The code is available at https://github.com/ZSDM-HDR/Zero-Shot-Diffusion-HDR.
-
In this paper we propose VidLA an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture we are able to initialize our video-language model with pretrained image-text foundation models thereby boosting the final performance. Second existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore unlike existing video-text datasets which only contain short clips our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks especially on longer videos and performs competitively on classification benchmarks.
-
Self-Supervised Learning (SSL) has demonstrated promising results in 3D medical image analysis. However, the lack of high-level semantics in pre-training still heavily hinders the performance of downstream tasks. We observe that 3D medical images contain relatively consistent contextual position information, i.e., consistent geometric relations between different organs, which offers a potential way for us to learn consistent semantic representations in pre-training. In this paper, we propose a simple-yet-effective Volume Contrast (VoCo) framework to leverage the contextual position priors for pre-training. Specifically, we first generate a group of base crops from different regions while enforcing feature discrepancy among them, and we employ them as class assignments of different regions. Then we randomly crop sub-volumes and predict which class they belong to (i.e., which region they are located in) by contrasting their similarity to different base crops, which can be seen as predicting the contextual positions of different sub-volumes. Through this pretext task, VoCo implicitly encodes the contextual position priors into model representations without the guidance of annotations, enabling us to effectively improve the performance of downstream tasks that require high-level semantics. Extensive experimental results on six downstream tasks demonstrate the superior effectiveness of VoCo. Code will be available at https://github.com/Luffy03/VoCo.
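The position-prediction pretext task can be pictured as a nearest-prototype lookup. The sketch below is only an illustration under assumptions: it uses cosine similarity and a hard argmax, the feature vectors are made-up toy values rather than encoder outputs, and the names `cosine` and `predict_region` are hypothetical. The abstract does not specify the similarity measure or the training loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_region(sub_feat, base_feats):
    """Return the index of the most similar base crop and all similarities."""
    sims = [cosine(sub_feat, b) for b in base_feats]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims

# Toy features (assumption): three base crops with mutually distinct features,
# mimicking the enforced feature discrepancy among base crops.
base_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sub_feat = [0.2, 0.9, 0.1]  # feature of a randomly cropped sub-volume

region, sims = predict_region(sub_feat, base_feats)
print(region, [round(s, 3) for s in sims])
```

Here the sub-volume is assigned to the second base crop (index 1), the region whose feature it resembles most.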
-
In this paper we present CCEdit a versatile generative video editing framework based on diffusion models. Our approach employs a novel trident network structure that separates structure and appearance control ensuring precise and creative editing capabilities. Utilizing the foundational ControlNet architecture we maintain the structural integrity of the video during editing. The incorporation of an additional appearance branch enables users to exert fine-grained control over the edited key frame. These two side branches seamlessly integrate into the main branch which is constructed upon existing text-to-image (T2I) generation models through learnable temporal layers. The versatility of our framework is demonstrated through a diverse range of choices in both structure representations and personalized T2I models as well as the option to provide the edited key frame. To facilitate comprehensive evaluation we introduce the BalanceCC benchmark dataset comprising 100 videos and 4 target prompts for each video. Our extensive user studies compare CCEdit with eight state-of-the-art video editing methods. The outcomes demonstrate CCEdit's substantial superiority over all other methods.
-
Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising, allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhance the implicit representation's ability to delineate finer details. Besides, an additional self-conditioning mechanism is designed to use implicit predictions as the guidance of diffusion learning, leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD.
-
As for human avatar reconstruction contemporary techniques commonly necessitate the acquisition of costly data and struggle to achieve satisfactory results from a small number of casual images. In this paper we investigate this task from a few-shot unconstrained photo album. The reconstruction of human avatars from such data sources is challenging because of limited data amount and dynamic articulated poses. For handling dynamic data we integrate a skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable tetrahedral representation which drives arbitrary mesh topologies generated by the DMTet for the adaptation of unconstrained images. To effectively mine instructive information from few-shot data we devise a two-phase optimization method with few-shot reference and few-shot guidance. The former focuses on aligning avatar identity with reference images while the latter aims to generate plausible appearances for unseen regions. Overall our framework called HaveFun can undertake avatar reconstruction rendering and animation. Extensive experiments on our developed benchmarks demonstrate that HaveFun exhibits substantially superior performance in reconstructing the human body and hand.
-
Collaborative perception enhances perception performance by enabling autonomous vehicles to exchange complementary information. Despite its potential to revolutionize the mobile industry challenges in various environments such as communication bandwidth limitations localization errors and information aggregation inefficiencies hinder its implementation in practical applications. In this work we propose ERMVP a communication-Efficient and collaboration-Robust Multi-Vehicle Perception method in challenging environments. Specifically ERMVP has three distinct strengths: i) It utilizes the hierarchical feature sampling strategy to abstract a representative set of feature vectors using less communication overhead for efficient communication; ii) It employs the sparse consensus features to execute precise spatial location calibrations effectively mitigating the implications of vehicle localization errors; iii) A pioneering feature fusion and interaction paradigm is introduced to integrate holistic spatial semantics among different vehicles and data sources. To thoroughly validate our method we conduct extensive experiments on real-world and simulated datasets. The results demonstrate that the proposed ERMVP is significantly superior to the state-of-the-art collaborative perception methods.
-
Diffusion models have achieved remarkable image generation quality surpassing previous generative models. However a notable limitation of diffusion models in comparison to GANs is their difficulty in smoothly interpolating between two image samples due to their highly unstructured latent space. Such a smooth interpolation is intriguing as it naturally serves as a solution for the image morphing task with many applications. In this work we address this limitation via DiffMorpher an approach that enables smooth and natural image interpolation by harnessing the prior knowledge of a pre-trained diffusion model. Our key idea is to capture the semantics of the two images by fitting two LoRAs to them respectively and interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition where correspondence automatically emerges without the need for annotation. In addition we propose an attention interpolation and injection technique an adaptive normalization adjustment method and a new sampling schedule to further enhance the smoothness between consecutive images. Extensive experiments demonstrate that DiffMorpher achieves starkly better image morphing effects than previous methods across a variety of object categories bridging a critical functional gap that distinguished diffusion models from GANs.
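The interpolation idea in this abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it assumes the LoRA parameters and the latent noise are flat lists of floats, that LoRA parameters are blended linearly, and that latent noise is blended spherically (a common choice for Gaussian noise, but an assumption here). The names `lerp`, `slerp`, and `morph_step` are hypothetical.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def slerp(a, b, t):
    """Spherical interpolation; falls back to lerp for (anti)parallel inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    s = math.sin(omega)
    if abs(s) < 1e-6:
        return lerp(a, b, t)
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

def morph_step(lora_a, lora_b, noise_a, noise_b, t):
    """Return the blended (LoRA parameters, latent noise) for t in [0, 1]."""
    # Assumption: linear blend for parameters, spherical blend for noise.
    return lerp(lora_a, lora_b, t), slerp(noise_a, noise_b, t)
```

Sweeping `t` from 0 to 1 would yield one (parameters, noise) pair per intermediate frame; the attention interpolation, normalization adjustment, and sampling schedule described in the abstract are not covered by this sketch.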
-
As an important and practical way to obtain high dynamic range (HDR) video HDR video reconstruction from sequences with alternating exposures is still less explored mainly due to the lack of large-scale real-world datasets. Existing methods are mostly trained on synthetic datasets which perform poorly in real scenes. In this work to facilitate the development of real-world HDR video reconstruction we present Real-HDRV a large-scale real-world benchmark dataset for HDR video reconstruction featuring various scenes diverse motion patterns and high-quality labels. Specifically our dataset contains 500 LDRs-HDRs video pairs comprising about 28000 LDR frames and 4000 HDR labels covering daytime nighttime indoor and outdoor scenes. To our best knowledge our dataset is the largest real-world HDR video reconstruction dataset. Correspondingly we propose an end-to-end network for HDR video reconstruction where a novel two-stage strategy is designed to perform alignment sequentially. Specifically the first stage performs global alignment with the adaptively estimated global offsets reducing the difficulty of subsequent alignment. The second stage implicitly performs local alignment in a coarse-to-fine manner at the feature level using the adaptive separable convolution. Extensive experiments demonstrate that: (1) models trained on our dataset can achieve better performance on real scenes than those trained on synthetic datasets; (2) our method outperforms previous state-of-the-art methods. Our dataset is available at https://github.com/yungsyu99/Real-HDRV.
-
3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However the computational cost of these methods remains a significant barrier to their widespread adoption particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static scenes these methods cannot be simply employed to support realistic facial expressions such as in the case of a dynamic facial performance. To address these challenges we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash-tables are linearly merged with weights predicted via a CNN resulting in expression dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real-time while achieving comparable rendering quality to state-of-the-arts and decent results on challenging expressions.
-
Low-precision quantization is recognized for its efficacy in neural network optimization. Our analysis reveals that non-quantized elementwise operations which are prevalent in layers such as parameterized activation functions batch normalization and quantization scaling dominate the inference cost of low-precision models. These non-quantized elementwise operations are commonly overlooked in SOTA efficiency metrics such as Arithmetic Computation Effort (ACE). In this paper we propose ACEv2 - an extended version of ACE which offers a better alignment with the inference cost of quantized models and their energy consumption on ML hardware. Moreover we introduce PikeLPN a model that addresses these efficiency issues by applying quantization to both elementwise operations and multiply-accumulate operations. In particular we present a novel quantization technique for batch normalization layers named QuantNorm which allows for quantizing the batch normalization parameters without compromising the model performance. Additionally we propose applying Double Quantization where the quantization scaling parameters are quantized. Furthermore we recognize and resolve the issue of distribution mismatch in Separable Convolution layers by introducing Distribution-Heterogeneous Quantization which enables quantizing them to low-precision. PikeLPN achieves Pareto-optimality in efficiency-accuracy trade-off with up to 3X efficiency improvement compared to SOTA low-precision models.
-
Modern depth sensors such as LiDAR operate by sweeping laser-beams across the scene resulting in a point cloud with notable 1D curve-like structures. In this work we introduce a new point cloud processing scheme and backbone called CurveCloudNet which takes advantage of the curve-like structure inherent to these sensors. While existing backbones discard the rich 1D traversal patterns and rely on generic 3D operations CurveCloudNet parameterizes the point cloud as a collection of polylines (dubbed a "curve cloud") establishing a local surface-aware ordering on the points. By reasoning along curves CurveCloudNet captures lightweight curve-aware priors to efficiently and accurately reason in several diverse 3D environments. We evaluate CurveCloudNet on multiple synthetic and real datasets that exhibit distinct 3D size and structure. We demonstrate that CurveCloudNet outperforms both point-based and sparse-voxel backbones in various segmentation settings notably scaling to large scenes better than point-based alternatives while exhibiting improved single-object performance over sparse-voxel alternatives. In all CurveCloudNet is an efficient and accurate backbone that can handle a larger variety of 3D environments than past works.
-
We address the challenge of generating 3D articulated objects in a controllable fashion. Currently modeling articulated 3D objects is either achieved through laborious manual authoring or using methods from prior work that are hard to scale and control directly. We leverage the interplay between part shape connectivity and motion using a denoising diffusion-based method with attention modules designed to extract correlations between part attributes. Our method takes an object category label and a part connectivity graph as input and generates an object's geometry and motion parameters. The generated objects conform to user-specified constraints on the object category part shape and part articulation. Our experiments show that our method outperforms the state-of-the-art in articulated object generation producing more realistic objects while conforming better to user constraints.
-
To reduce the reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot segmentation methods first pre-train models on 'seen' classes and then evaluate their generalization performance on 'unseen' classes. However, the prior pre-training stage not only introduces excessive time overhead but also incurs a significant domain gap on 'unseen' classes. To tackle these issues, we propose a Non-parametric Network for few-shot 3D Segmentation, Seg-NN, and its Parametric variant, Seg-PN. Without training, Seg-NN extracts dense representations by hand-crafted filters and achieves comparable performance to existing parameterized models. Due to the elimination of pre-training, Seg-NN can alleviate the domain gap issue and save a substantial amount of time. Based on Seg-NN, Seg-PN only requires training a lightweight QUEry-Support Transferring (QUEST) module, which enhances the interaction between the support set and query set. Experiments suggest that Seg-PN outperforms the previous state-of-the-art method by +4.19% and +7.71% mIoU on the S3DIS and ScanNet datasets, respectively, while reducing training time by 90%, indicating its effectiveness and efficiency. Code is available at https://github.com/yangyangyang127/Seg-NN.
-
We introduce PhysGaussian a new method that seamlessly integrates physically grounded Newtonian dynamics within 3D Gaussians to achieve high-quality novel motion synthesis. Employing a customized Material Point Method (MPM) our approach enriches 3D Gaussian kernels with physically meaningful kinematic deformation and mechanical stress attributes all evolved in line with continuum mechanics principles. A defining characteristic of our method is the seamless integration between physical simulation and visual rendering: both components utilize the same 3D Gaussian kernels as their discrete representations. This negates the necessity for triangle/tetrahedron meshing marching cubes cage meshes or any other geometry embedding highlighting the principle of "what you see is what you simulate (WS^2)". Our method demonstrates exceptional versatility across a wide variety of materials--including elastic entities plastic metals non-Newtonian fluids and granular materials--showcasing its strong capabilities in creating diverse visual content with novel viewpoints and movements.
-
Recovering images distorted by atmospheric turbulence is a challenging inverse problem due to the stochastic nature of turbulence. Although numerous turbulence mitigation (TM) algorithms have been proposed their efficiency and generalization to real-world dynamic scenarios remain severely limited. Building upon the intuitions of classical TM algorithms we present the Deep Atmospheric TUrbulence Mitigation network (DATUM). DATUM aims to overcome major challenges when transitioning from classical to deep learning approaches. By carefully integrating the merits of classical multi-frame TM methods into a deep network structure we demonstrate that DATUM can efficiently perform long-range temporal aggregation using a recurrent fashion while deformable attention and temporal-channel attention seamlessly facilitate pixel registration and lucky imaging. With additional supervision tilt and blur degradation can be jointly mitigated. These inductive biases empower DATUM to significantly outperform existing methods while delivering a tenfold increase in processing speed. A large-scale training dataset ATSyn is presented as a co-invention to enable the generalization to real turbulence. Our code and datasets are available at https://xg416.github.io/DATUM/
-
In recent years automated Gallbladder Cancer (GBC) detection has gained the attention of researchers. Current state-of-the-art (SOTA) methodologies relying on ultrasound sonography (US) images exhibit limited generalization emphasizing the need for transformative approaches. We observe that individual US frames may lack sufficient information to capture disease manifestation. This study advocates for a paradigm shift towards video-based GBC detection leveraging the inherent advantages of spatiotemporal representations. Employing the Masked Autoencoder (MAE) for representation learning we address shortcomings in conventional image-based methods. We propose a novel design called FocusMAE to systematically bias the selection of masking tokens from high-information regions fostering a more refined representation of malignancy. Additionally we contribute the most extensive US video dataset for GBC detection. We also note that this is the first study on US video-based GBC detection. We validate the proposed methods on the curated dataset and report a new SOTA accuracy of 96.4% for the GBC detection problem against an accuracy of 84% by current Image-based SOTA - GBCNet and RadFormer and 94.7% by Video-based SOTA - AdaMAE. We further demonstrate the generality of the proposed FocusMAE on a public CT-based Covid detection dataset reporting an improvement in accuracy by 3.3% over current baselines. Project page with source code trained models and data is available at: https://gbc-iitd.github.io/focusmae.
-
Driven by scalable diffusion models trained on large-scale datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow text prompts involving multiple objects, attributes, or spatial compositions. In this paper, we reveal the potential causes in the diffusion model's cross-attention and self-attention layers. We propose two novel losses to refocus attention maps according to a given spatial layout during sampling. Creating the layouts manually requires additional effort and can be tedious. Therefore, we explore using large language models (LLMs) to produce these layouts for our method. We conduct extensive experiments on the DrawBench, HRS, and TIFA benchmarks to evaluate our proposed method. We show that our proposed attention refocusing effectively improves the controllability of existing approaches.
-
Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated code and models can be found at https://github.com/gastruc/osv5m.
-
Understanding what deep network models capture in their learned representations is a fundamental challenge in computer vision. We present a new methodology for understanding such vision models, the Visual Concept Connectome (VCC), which discovers human-interpretable concepts and their interlayer connections in a fully unsupervised manner. Our approach simultaneously reveals fine-grained concepts at a layer and connection weightings across all layers, and is amenable to global analysis of network structure (e.g., branching pattern of hierarchical concept assemblies). Previous work yielded ways to extract interpretable concepts from single layers and examine their impact on classification, but did not afford multilayer concept analysis across an entire network architecture. Quantitative and qualitative empirical results show the effectiveness of VCCs in the domain of image classification. Also, we leverage VCCs for the application of failure mode debugging to reveal where mistakes arise in deep networks.
-
Advances in NeRFs have allowed for 3D scene reconstructions and novel view synthesis. Yet efficiently editing these representations while retaining photorealism is an emerging challenge. Recent methods face three primary limitations: they're slow for interactive use, lack precision at object boundaries, and struggle to ensure multi-view consistency. We introduce IReNe to address these limitations, enabling swift, near real-time color editing in NeRF. Leveraging a pre-trained NeRF model and a single training image with user-applied color edits, IReNe swiftly adjusts network parameters in seconds. This adjustment allows the model to generate new scene views, accurately representing the color changes from the training image while also controlling object boundaries and view-specific effects. Object boundary control is achieved by integrating a trainable segmentation module into the model. The process gains efficiency by retraining only the weights of the last network layer. We observed that neurons in this layer can be classified into those responsible for view-dependent appearance and those contributing to diffuse appearance. We introduce an automated classification approach to identify these neuron types and exclusively fine-tune the weights of the diffuse neurons. This further accelerates training and ensures consistent color edits across different views. A thorough validation on a new dataset with edited object colors shows significant quantitative and qualitative advancements over competitors, accelerating speeds by 5x and 500x.
-
Weakly Supervised Semantic Segmentation (WSSS) relies on Class Activation Maps (CAMs) to extract spatial information from image-level labels. With the success of Vision Transformer (ViT) the migration of ViT is actively conducted in WSSS. This work proposes a novel WSSS framework with Class Token Infusion (CTI). By infusing the class tokens from images we guide class tokens to possess class-specific distinct characteristics and global-local consistency. For this we devise two kinds of token infusion: 1) Intra-image Class Token Infusion (I-CTI) and 2) Cross-Image Class Token Infusion (C-CTI). In I-CTI we infuse the class tokens from the same but differently augmented images and thus make CAMs consistent among various deformations (view color). In C-CTI by infusing the class tokens from the other images and imposing the resulting CAMs to be similar it learns class-specific distinct characteristics. Besides the CTI we bring the background (BG) concept into ViT with the BG token to reduce the false positive activation of CAMs. We demonstrate the effectiveness of our method on PASCAL VOC 2012 and MS COCO 2014 datasets achieving state-of-the-art results in weakly supervised semantic segmentation. The code is available at https://github.com/yoon307/CTI
-
Federated Learning (FL) enables joint training across distributed clients using their local data privately. Federated Multi-Task Learning (FMTL) builds on FL to handle multiple tasks assuming model congruity that identical model architecture is deployed in each client. To relax this assumption and thus extend real-world applicability we introduce a novel problem setting Hetero-Client Federated Multi-Task Learning (HC-FMTL) to accommodate diverse task setups. The main challenge of HC-FMTL is the model incongruity issue that invalidates conventional aggregation methods. It also escalates the difficulties in model aggregation to deal with data and task heterogeneity inherent in FMTL. To address these challenges we propose the FedHCA^2 framework which allows for federated training of personalized models by modeling relationships among heterogeneous clients. Drawing on our theoretical insights into the difference between multi-task and federated optimization we propose the Hyper Conflict-Averse Aggregation scheme to mitigate conflicts during encoder updates. Additionally inspired by task interaction in MTL the Hyper Cross Attention Aggregation scheme uses layer-wise cross attention to enhance decoder interactions while alleviating model incongruity. Moreover we employ learnable Hyper Aggregation Weights for each client to customize personalized parameter updates. Extensive experiments demonstrate the superior performance of FedHCA^2 in various HC-FMTL scenarios compared to representative methods. Code is available at https://github.com/innovator-zero/FedHCA2.
-
Image fusion aims to combine information from different source images to create a comprehensively representative image. Existing fusion methods are typically helpless in dealing with degradations in low-quality source images and non-interactive to multiple subjective and objective needs. To solve them we introduce a novel approach that leverages semantic text guidance image fusion model for degradation-aware and interactive image fusion task termed as Text-IF. It innovatively extends the classical image fusion to the text guided image fusion along with the ability to harmoniously address the degradation and interaction issues during fusion. Through the text semantic encoder and semantic interaction fusion decoder Text-IF is accessible to the all-in-one infrared and visible image degradation-aware processing and the interactive flexible fusion outcomes. In this way Text-IF achieves not only multi-modal image fusion but also multi-modal information fusion. Extensive experiments prove that our proposed text guided image fusion strategy has obvious advantages over SOTA methods in the image fusion performance and degradation treatment. The code is available at https://github.com/XunpengYi/Text-IF.
-
The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding and enhance it with document-level designated layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To force our model to utilize the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression-transformer (CFormer), reducing the encoded sequence length and thereby allowing a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.
-
DETR accomplishes end-to-end object detection through iteratively generating multiple object candidates based on image features and promoting one candidate for each ground-truth object. The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates. We aim at improving the DETR training efficiency by explicitly supervising the candidate generation procedure through mixing one-to-one supervision and one-to-many supervision. Our approach namely MS-DETR is simple and places one-to-many supervision to the object queries of the primary decoder that is used for inference. In comparison to existing DETR variants with one-to-many supervision such as Group DETR and Hybrid DETR our approach does not need additional decoder branches or object queries. The object queries of the primary decoder in our approach directly benefit from one-to-many supervision and thus are superior in object candidate prediction. Experimental results show that our approach outperforms related DETR variants such as DN-DETR Hybrid DETR and Group DETR and the combination with related DETR variants further improves the performance.
-
This study addresses the challenge of performing visual localization in demanding conditions such as night-time scenarios adverse weather and seasonal changes. While many prior studies have focused on improving image matching performance to facilitate reliable dense keypoint matching between images existing methods often heavily rely on predefined feature points on a reconstructed 3D model. Consequently they tend to overlook unobserved keypoints during the matching process. Therefore dense keypoint matches are not fully exploited leading to a notable reduction in accuracy particularly in noisy scenes. To tackle this issue we propose a novel localization method that extracts reliable semi-dense 2D-3D matching points based on dense keypoint matches. This approach involves regressing semi-dense 2D keypoints into 3D scene coordinates using a point inference network. The network utilizes both geometric and visual cues to effectively infer 3D coordinates for unobserved keypoints from the observed ones. The abundance of matching information significantly enhances the accuracy of camera pose estimation even in scenarios involving noisy or sparse 3D models. Comprehensive evaluations demonstrate that the proposed method outperforms other methods in challenging scenes and achieves competitive results in large-scale visual localization benchmarks. The code will be available at https://github.com/TruongKhang/DeViLoc
-
This paper studies amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. In previous work, the amodal segmentation ground truth on real images is usually predicted by manual annotation and thus is subjective. In contrast, we use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images. This pipeline is used to construct an amodal completion evaluation benchmark, MP3D-Amodal, consisting of a variety of object categories and labels. To better handle the amodal completion task in the wild, we explore two architecture variants: a two-stage model that first infers the occluder, followed by amodal mask completion; and a one-stage model that exploits the representation power of Stable Diffusion for amodal segmentation across many categories. Without bells and whistles, our method achieves a new state-of-the-art performance on amodal segmentation datasets that cover a large variety of objects, including COCOA and our new MP3D-Amodal dataset. The dataset, model, and code are available at https://www.robots.ox.ac.uk/~vgg/research/amodal/.
-
We introduce Motion Diversification Networks a novel framework for learning to generate realistic and diverse 3D human motion. Despite recent advances in deep generative motion modeling existing models often fail to produce samples that capture the full range of plausible and natural 3D human motion within a given context. The lack of diversity becomes even more apparent in applications where subtle and multi-modal 3D human forecasting is crucial for safety such as robotics and autonomous driving. Towards more realistic and functional 3D motion models we highlight limitations in existing generative modeling techniques particularly in overly simplistic latent code sampling strategies. We then introduce a transformer-based diversification mechanism that learns to effectively guide sampling in the latent space. Our proposed attention-based module queries multiple stochastic samples to flexibly predict a diverse set of latent codes which can be subsequently decoded into motion samples. The proposed framework achieves state-of-the-art diversity and accuracy prediction performance across a range of benchmarks and settings particularly when used to forecast intricate in-the-wild 3D human motion within complex urban environments. Our models datasets and code are available at https://mdncvpr.github.io/.
-
While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training and validating models. Our method achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state of the art by 5.5p and 11.0p absolute gains, respectively. Our code and datasets are publicly available at: https://telling-left-from-right.github.io.
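For readers unfamiliar with the PCK@0.10 metric quoted above, the following sketch shows one common way it is computed. It assumes the widely used convention that a predicted keypoint counts as correct when it lies within 0.10 times the longer side of the object bounding box from the ground truth; the exact evaluation protocol of the paper is not stated in the abstract, and the function name `pck` and its arguments are hypothetical.

```python
import math

def pck(pred, gt, bbox_size, alpha=0.10):
    """Fraction of predicted keypoints within alpha * max(bbox side) of ground truth.

    pred, gt: lists of (x, y) keypoints in matching order.
    bbox_size: (height, width) of the object bounding box (assumed normalizer).
    """
    threshold = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)

# Usage: three keypoints, bounding box 100 x 80, so the threshold is 10 pixels.
predicted = [(10, 10), (50, 50), (90, 20)]
ground_truth = [(12, 11), (70, 50), (90, 29)]
score = pck(predicted, ground_truth, bbox_size=(100, 80))  # two of three within threshold
```

Scores such as 65.4 in the abstract correspond to this fraction expressed as a percentage and averaged over the evaluation set.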
-
Human avatar has become a novel type of 3D asset with various applications. Ideally a human avatar should be fully customizable to accommodate different settings and environments. In this work we introduce NECA an approach capable of learning versatile human representation from monocular or sparse-view videos enabling granular customization across aspects such as pose shadow shape lighting and texture. The core of our approach is to represent humans in complementary dual spaces and predict disentangled neural fields of geometry albedo shadow as well as an external lighting from which we are able to derive realistic rendering with high-frequency details via volumetric rendering. Extensive experiments demonstrate the advantage of our method over the state-of-the-art methods in photorealistic rendering as well as various editing tasks such as novel pose synthesis and relighting. Our code is available at https://github.com/iSEE-Laboratory/NECA.
-
Vision-based roadside 3D object detection has attracted rising attention in the autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range. Previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, while ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves inference time comparable to the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP in vehicle, pedestrian, and cyclist.
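The spreading step described above can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's method: the abstract does not give the weight function, so the Gaussian-style weight, the depth-dependent decay rule, the weight normalization, and the scalar feature are all placeholders chosen for illustration, and `spread_point` is a hypothetical name.

```python
import math

def spread_point(bev, x, y, feature, depth, radius=1, base_decay=1.0):
    """Accumulate one frustum point's feature into nearby BEV cells in place.

    bev: 2D list of floats (rows indexed by y, columns by x).
    (x, y): continuous BEV position of the frustum point.
    """
    h, w = len(bev), len(bev[0])
    cx, cy = int(math.floor(x)), int(math.floor(y))
    # Assumed rule: deeper points decay more slowly, so they spread wider.
    decay = base_decay / (1.0 + depth)
    cells = []
    for gy in range(cy - radius, cy + radius + 1):
        for gx in range(cx - radius, cx + radius + 1):
            if 0 <= gx < w and 0 <= gy < h:
                # Squared distance from the point to the cell center.
                d2 = (gx + 0.5 - x) ** 2 + (gy + 0.5 - y) ** 2
                cells.append((gy, gx, math.exp(-decay * d2)))
    # Assumed normalization: weights sum to 1 so the feature mass is preserved.
    total = sum(weight for _, _, weight in cells)
    for gy, gx, weight in cells:
        bev[gy][gx] += feature * weight / total
```

By contrast, standard voxel pooling would add the whole feature to the single cell at `(cy, cx)`, which is where the position approximation error mentioned in the abstract comes from.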
-
Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However the recent development of IAD approach has encountered certain difficulties due to dataset limitations. On the one hand most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec and the differences of methods cannot be well distinguished leading to a significant gap between public datasets and actual application scenarios. On the other hand the research on various new practical anomaly detection settings is limited by the scale of the dataset posing a risk of overfitting in evaluation results. Therefore we propose a large-scale Real-world and multi-view Industrial Anomaly Detection dataset named Real-IAD which contains 150K high-resolution images of 30 different objects an order of magnitude larger than existing datasets. It has a larger range of defect area and ratio proportions making it more challenging than previous datasets. To make the dataset closer to real application scenarios we adopted a multi-view shooting method and proposed sample-level evaluation metrics. In addition beyond the general unsupervised anomaly detection setting we propose a new setting for Fully Unsupervised Industrial Anomaly Detection (FUIAD) based on the observation that the yield rate in industrial production is usually greater than 60% which has more practical application value. Finally we report the results of popular IAD methods on the Real-IAD dataset providing a highly challenging benchmark to promote the development of the IAD field.
-
Generative image editing has recently witnessed extremely fast-paced growth. Some works use high-level conditioning such as text while others use low-level conditioning. Nevertheless most of them lack fine-grained control over the properties of the different objects present in the image i.e. object-level image editing. In this work we tackle the task by perceiving the images as an amalgamation of various objects and aim to control the properties of each object in a fine-grained manner. Out of these properties we identify structure and appearance as the most intuitive to understand and useful for editing purposes. We propose PAIR Diffusion a generic framework that enables a diffusion model to control the structure and appearance properties of each object in the image. We show that having control over the properties of each object in an image leads to comprehensive editing capabilities. Our framework allows for various object-level editing operations on real images such as reference image-based appearance editing free-form shape editing adding objects and variations. Thanks to our design we do not require any inversion step. Additionally we propose multimodal classifier-free guidance which enables editing images using both reference images and text when using our approach with foundational diffusion models. We validate the above claims by extensively evaluating our framework on both unconditional and foundational diffusion models.
-
Adversarial examples mislead deep neural networks with imperceptible perturbations and have brought significant threats to deep learning. An important aspect is their transferability which refers to their ability to deceive other models thus enabling attacks in the black-box setting. Though various methods have been proposed to boost transferability the performance still falls short compared with white-box attacks. In this work we observe that existing input transformation based attacks one of the mainstream transfer-based attacks result in different attention heatmaps on various models which might limit the transferability. We also find that breaking the intrinsic relation of the image can disrupt the attention heatmap of the original image. Based on this finding we propose a novel input transformation based attack called block shuffle and rotation (BSR). Specifically BSR splits the input image into several blocks then randomly shuffles and rotates these blocks to construct a set of new images for gradient calculation. Empirical evaluations on the ImageNet dataset demonstrate that BSR could achieve significantly better transferability than the existing input transformation based methods under single-model and ensemble-model settings. Combining BSR with the current input transformation method can further improve the transferability which significantly outperforms the state-of-the-art methods. Code is available at https://github.com/Trustworthy-AI-Group/BSR.
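A rough sketch of the block shuffle and rotation transform described above follows. It makes two simplifying assumptions that are not taken from the abstract: blocks are equal-sized squares, and each block is rotated by a random multiple of 90 degrees. The function name `block_shuffle_rotate` is hypothetical; the released code at the URL above is the reference for the actual method.

```python
import numpy as np

def block_shuffle_rotate(image, n_blocks=2, rng=None):
    """Split a square image into n_blocks x n_blocks tiles, then
    randomly permute the tiles and rotate each by k * 90 degrees."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h != w or h % n_blocks != 0:
        raise ValueError("expects a square image divisible by n_blocks")
    s = h // n_blocks
    # Collect tiles in row-major order.
    blocks = [image[r * s:(r + 1) * s, c * s:(c + 1) * s]
              for r in range(n_blocks) for c in range(n_blocks)]
    order = rng.permutation(len(blocks))
    out = np.empty_like(image)
    for slot, src in enumerate(order):
        r, c = divmod(slot, n_blocks)
        k = int(rng.integers(0, 4))  # number of quarter turns
        out[r * s:(r + 1) * s, c * s:(c + 1) * s] = np.rot90(
            blocks[src], k=k, axes=(0, 1))
    return out
```

In an attack loop, several such transformed copies of the input would be generated per iteration and their gradients averaged, as the abstract describes.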
-
Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework dubbed DriveWorld which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically we propose a Memory State-Space Model for spatio-temporal modelling which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset DriveWorld achieves a 7.5% increase in mAP for 3D object detection a 3.0% increase in IoU for online mapping a 5.0% increase in AMOTA for multi-object tracking a 0.1m decrease in minADE for motion forecasting a 3.0% increase in IoU for occupancy prediction and a 0.34m reduction in average L2 error for planning.
-
Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-optimal performance seen in traditional two-step methodologies the two-step methods continue to be favored in many competitions and practical settings due to their superior modularity. In this paper we introduce Bridging Text Spotting a novel approach that resolves the error accumulation and suboptimal performance issues in two-step methods while retaining modularity. To achieve this we adopt a well-trained detector and recognizer that are developed and trained independently and then lock their parameters to preserve their already acquired capabilities. Subsequently we introduce a Bridge that connects the locked detector and recognizer through a zero-initialized neural network. This zero-initialized neural network initialized with weights set to zeros ensures seamless integration of the large receptive field features in detection into the locked recognizer. Furthermore since the fixed detector and recognizer cannot naturally acquire end-to-end optimization features we adopt the Adapter to facilitate their efficient learning of these features. We demonstrate the effectiveness of the proposed method through extensive experiments: Connecting the latest detector and recognizer through Bridging Text Spotting we achieved an accuracy of 83.3% on Total-Text 69.8% on CTW1500 and 89.5% on ICDAR 2015. The code is available at https://github.com/mxin262/Bridging-Text-Spotting.
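The effect of zero initialization can be shown with a small sketch. The class name `ZeroInitBridge`, the single linear layer, and the additive fusion are illustrative assumptions; the abstract only states that the bridge is a zero-initialized network connecting the locked detector and recognizer.

```python
import numpy as np

class ZeroInitBridge:
    """Linear bridge whose weights start at zero (hypothetical sketch)."""

    def __init__(self, det_dim, rec_dim):
        # Zero weights and bias: the bridge contributes nothing at first.
        self.weight = np.zeros((det_dim, rec_dim))
        self.bias = np.zeros(rec_dim)

    def __call__(self, rec_feats, det_feats):
        # Recognizer features plus a learned projection of detector features.
        return rec_feats + det_feats @ self.weight + self.bias
```

Because the bridge outputs zeros before any training step, the locked recognizer initially sees exactly its original features, and detector information is blended in only as the bridge weights move away from zero.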
-
We present TokenCompose a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only absent explicit constraint for the consistency between the text prompts and the image contents leading to unsatisfactory results for composing multiple object categories. Our proposed TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion with our approach the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images.
-
Learning generalizable visual representations from Internet data has yielded promising results for robotics. Yet prevailing approaches focus on pre-training 2D representations being sub-optimal to deal with occlusions and accurately localize objects in complex 3D scenes. Meanwhile 3D representation learning has been limited to single-object understanding. To address these limitations we introduce a novel 3D pre-training framework for robotics named SUGAR that captures semantic geometric and affordance properties of objects through 3D point clouds. We underscore the importance of cluttered scenes in 3D representation learning and automatically construct a multi-object dataset benefiting from cost-free supervision in simulation. SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks namely cross-modal knowledge distillation for semantic learning masked point modeling to understand geometry structures grasping pose synthesis for object affordance 3D instance segmentation and referring expression grounding to analyze cluttered scenes. We evaluate our learned representation on three robotic-related tasks namely zero-shot 3D object recognition referring expression grounding and language-driven robotic manipulation. Experimental results show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
-
Photorealistic simulation plays a crucial role in applications such as autonomous driving where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper we propose several insights that allow a better utilization of Lidar data to improve NeRF quality on street scenes. First our framework learns a geometric scene representation from Lidar which are fused with the implicit grid-based representation for radiance decoding thereby supplying stronger geometric information offered by explicit point cloud. Second we put forth a robust occlusion-aware depth supervision scheme which allows utilizing densified Lidar points by accumulation. Third we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.
-
Current vision-language pre-training (VLP) methodologies predominantly depend on paired image-text datasets a resource that is challenging to acquire in radiology due to privacy considerations and labelling complexities. Data augmentation provides a practical solution to overcome the issue of data scarcity however most augmentation methods exhibit a limited focus prioritising either image or text augmentation exclusively. Acknowledging this limitation our objective is to devise a framework capable of concurrently augmenting medical image and text data. We design a Pairwise Augmentation (PairAug) approach that contains an Inter-patient Augmentation (InterAug) branch and an Intra-patient Augmentation (IntraAug) branch. Specifically the InterAug branch of our approach generates radiology images using synthesised yet plausible reports derived from a Large Language Model (LLM). The generated pairs can be considered a collection of new patient cases since they are artificially created and may not exist in the original dataset. In contrast the IntraAug branch uses newly generated reports to manipulate images. This process allows us to create new paired data for each individual with diverse medical conditions. Our extensive experiments on various downstream tasks covering medical image classification zero-shot and fine-tuning analysis demonstrate that our PairAug concurrently expanding both image and text data substantially outperforms image-/text-only expansion baselines and advanced medical VLP baselines. Our code is released at https://github.com/YtongXie/PairAug.
-
Implicit Neural Representation (INR) which utilizes a neural network to map coordinate inputs to corresponding attributes is causing a revolution in the field of signal processing. However current INR techniques suffer from a restricted capability to tune their supported frequency set resulting in imperfect performance when representing complex signals with multiple frequencies. We have identified that this frequency-related problem can be greatly alleviated by introducing variable-periodic activation functions for which we propose FINER. By initializing the bias of the neural network within different ranges sub-functions with various frequencies in the variable-periodic function are selected for activation. Consequently the supported frequency set of FINER can be flexibly tuned leading to improved performance in signal representation. We demonstrate the capabilities of FINER in the contexts of 2D image fitting 3D signed distance field representation and 5D neural radiance fields optimization and we show that it outperforms existing INRs.
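To make "variable-periodic" concrete, here is a small sketch under an assumed form of the activation, sin(omega * (|x| + 1) * x). The abstract does not give the formula, so this form, the name `finer_activation`, and the parameter `omega` should be read as assumptions for illustration.

```python
import numpy as np

def finer_activation(x, omega=1.0):
    """Assumed variable-periodic activation: sin(omega * (|x| + 1) * x)."""
    x = np.asarray(x, dtype=float)
    return np.sin(omega * (np.abs(x) + 1.0) * x)

def local_frequency(x, omega=1.0):
    """Derivative of the phase omega * (|x| + 1) * x with respect to x."""
    x = np.asarray(x, dtype=float)
    return omega * (2.0 * np.abs(x) + 1.0)
```

Under this form the local frequency grows with |x|, so a neuron whose bias shifts its input away from zero operates in a higher-frequency part of the function. That is the mechanism the abstract refers to when it says bias initialization selects sub-functions with different frequencies.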
-
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision one-class supervision or in an unsupervised setting. Training-based methods are prone to be domain-specific thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD) a method tackling VAD in a novel training-free paradigm exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence) showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
-
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing various technologies. Yet amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning and can be combined to further improve generative quality.
-
FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment
Existing action quality assessment (AQA) methods mainly learn deep representations at the video level for scoring diverse actions. Due to the lack of a fine-grained understanding of actions in videos they harshly suffer from low credibility and interpretability thus insufficient for stringent applications such as Olympic diving events. We argue that a fine-grained understanding of actions requires the model to perceive and parse actions in both time and space which is also the key to the credibility and interpretability of the AQA technique. Based on this insight we propose a new fine-grained spatial-temporal action parser named FineParser. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in time and space to minimize the impact of invalid backgrounds during the assessment. In addition we construct fine-grained annotations of human-centric foreground action masks for the FineDiving dataset called FineDiving-HM. With refined annotations on diverse target action procedures FineDiving-HM can promote the development of real-world AQA systems. Through extensive experiments we demonstrate the effectiveness of FineParser which outperforms state-of-the-art methods while supporting more tasks of fine-grained action understanding. Data and code are available at https://github.com/PKU-ICST-MIPL/FineParser_CVPR2024.
-
The creation of new datasets often presents new challenges for video recognition and can inspire novel ideas while addressing these challenges. While existing datasets mainly comprise landscape mode videos our paper seeks to introduce portrait mode videos to the research community and highlight the unique challenges associated with this video format. With the growing popularity of smartphones and social media applications recognizing portrait mode videos is becoming increasingly important. To this end we have developed the first dataset dedicated to portrait mode video recognition namely PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a data-driven manner comprising 400 fine-grained categories and rigorous quality assurance was implemented to ensure the accuracy of human annotations. In addition to the new dataset we conducted a comprehensive analysis of the impact of video format (portrait mode versus landscape mode) on recognition accuracy and spatial bias due to the different formats. Furthermore we designed extensive experiments to explore key aspects of portrait mode video recognition including the choice of data augmentation evaluation procedure the importance of temporal information and the role of audio modality. Building on the insights from our experimental results and the introduction of PortraitMode-400 our paper aims to inspire further research efforts in this emerging research area.
-
Universal image restoration is a practical and potential computer vision task for real-world applications. The main challenge of this task is handling the different degradation distributions at once. Existing methods mainly utilize task-specific conditions (e.g. prompt) to guide the model to learn different distributions separately named multi-partite mapping. However it is not suitable for universal model learning as it ignores the shared information between different tasks. In this work we propose an advanced selective hourglass mapping strategy based on diffusion model termed DiffUIR. Two novel considerations make our DiffUIR non-trivial. Firstly we equip the model with strong condition guidance to obtain accurate generation direction of diffusion model (selective). More importantly DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm elegantly and naturally which gradually maps different distributions into a shared one. In the reverse process combined with SDT and strong condition guidance DiffUIR iteratively guides the shared distribution to the task-specific distribution with high image quality (hourglass). Without bells and whistles by only modifying the mapping strategy we achieve state-of-the-art performance on five image restoration tasks 22 benchmarks in the universal setting and zero-shot generalization setting. Surprisingly by only using a lightweight model (only 0.89M) we could achieve outstanding performance. The source code and pre-trained models are available at https://github.com/iSEE-Laboratory/DiffUIR
-
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However many VLMs rely on proprietary data and are not open-source which restricts the use of white-box approaches for fine-tuning. As such we aim to develop a black-box approach to optimize VLMs through natural language prompts thereby avoiding the need to access model parameters feature embeddings or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically we adopt an automatic "hill-climbing" procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. In addition we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly we demonstrate our framework on a state-of-the-art black-box VLM (DALL-E 3) for text-to-image optimization.
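The hill-climbing loop can be sketched as below. The callables `score_fn` (which would evaluate a prompt with the black-box VLM on a few labeled examples) and `propose_fn` (which would ask a chat LLM for a new prompt given good and bad ones) are hypothetical placeholders, as are the parameter names; no real API is assumed.

```python
def hill_climb_prompts(initial_prompts, score_fn, propose_fn,
                       n_rounds=5, k=2):
    """Iteratively refine text prompts using only black-box scores.

    score_fn(prompt) -> float, higher is better.
    propose_fn(best, worst) -> str, a new candidate prompt given the
    current top-k and bottom-k prompts as positive/negative feedback.
    """
    scored = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_rounds):
        ranked = sorted(scored, key=scored.get, reverse=True)
        best, worst = ranked[:k], ranked[-k:]
        candidate = propose_fn(best, worst)
        if candidate not in scored:  # avoid re-scoring duplicates
            scored[candidate] = score_fn(candidate)
    return max(scored, key=scored.get)
```

Passing both the best and the worst prompts to the proposer mirrors the abstract's point that positive and negative feedback together give the LLM an implicit search direction.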
-
Open world object detection aims to identify objects of unseen categories and incrementally recognize them once their annotations are provided. In distinction to the traditional paradigm that is limited to predefined categories this setting promises a continual and generalizable way of estimating objectness using class-agnostic information. However achieving such decorrelation between objectness and class information proves challenging. Without explicit consideration existing methods usually exhibit low recall on unknown objects and can misclassify them into known classes. To address this problem we exploit three levels of orthogonality in the detection process: First the objectness and classification heads are disentangled by operating on separate sets of features that are orthogonal to each other in a devised polar coordinate system. Secondly a prediction decorrelation loss is introduced to guide the detector towards more general and class-independent prediction. Furthermore we propose a calibration scheme that helps maintain orthogonality throughout the training process to mitigate catastrophic interference and facilitate incremental learning of previously unseen objects. Our method is comprehensively evaluated on open world and incremental object detection benchmarks demonstrating its effectiveness in detecting both known and unknown objects. Code and models are available at https://github.com/feifeiobama/OrthogonalDet.
-
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Large Vision-Language Models (LVLMs) have advanced considerably intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success LVLMs still suffer from the issue of object hallucinations where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue we introduce Visual Contrastive Decoding (VCD) a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs resulting in contextually accurate outputs. Our experiments show that VCD without either additional training or the usage of external tools significantly mitigates the object hallucination issue across different LVLM families. Beyond mitigating object hallucinations VCD also excels in general LVLM benchmarks highlighting its wide-ranging applicability.
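The contrast between the two output distributions can be sketched for a single decoding step. The scoring rule `(1 + alpha) * log p(original) - alpha * log p(distorted)` and the plausibility cutoff `beta` are assumptions for illustration, since the abstract only states that the distributions are contrasted; the function names are likewise hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1D array."""
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def contrastive_next_token(logits_orig, logits_dist, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting original vs. distorted-image logits."""
    logp_o = log_softmax(np.asarray(logits_orig, dtype=float))
    logp_d = log_softmax(np.asarray(logits_dist, dtype=float))
    # Tokens that stay likely even with a distorted image are penalized.
    scores = (1.0 + alpha) * logp_o - alpha * logp_d
    # Assumed cutoff: only keep tokens that are reasonably likely
    # under the original input (prob >= beta * max prob).
    keep = logp_o >= np.log(beta) + np.max(logp_o)
    scores = np.where(keep, scores, -np.inf)
    return int(np.argmax(scores))
```

A token driven mostly by language priors keeps a high probability when the image is distorted, so subtracting the distorted-input term lowers its score relative to tokens that depend on the actual visual content.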
-
Generative object compositing emerges as a promising new avenue for compositional image editing. However the requirement of object identity preservation poses a significant challenge limiting practical usage of most existing methods. In response this paper introduces IMPRINT a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic identity-preserving pretraining of the object encoder enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality.
-
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS we explicitly divide them into two categories based on their temporal characteristics i.e. neighboring frame (NF) and distant frame (DF). NFs temporally adjacent to the labeled frame often contain rich motion information that assists in the accurate localization of sounding objects. Contrary to NFs DFs have long temporal distances from the labeled frame which share semantic-similar objects with appearance variations. Considering their unique characteristics we propose a versatile framework that effectively leverages them to tackle AVS. Specifically for NFs we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method unleashing the power of the abundant unlabeled frames.
-
This paper presents DriveTrack a new benchmark and data generation framework for long-range keypoint tracking in real-world videos. DriveTrack is motivated by the observation that the accuracy of state-of-the-art trackers depends strongly on visual attributes around the selected keypoints such as texture and lighting. The problem is that these artifacts are especially pronounced in real-world videos but these trackers are unable to train on such scenes due to a dearth of annotations. DriveTrack bridges this gap by building a framework to automatically annotate point tracks on autonomous driving datasets. We release a dataset consisting of 1 billion point tracks across 24 hours of video which is seven orders of magnitude greater than prior real-world benchmarks and on par with the scale of synthetic benchmarks. DriveTrack unlocks new use cases for point tracking in real-world videos. First we show that fine-tuning keypoint trackers on DriveTrack improves accuracy on real-world scenes by up to 7%. Second we analyze the sensitivity of trackers to visual artifacts in real scenes and motivate the idea of running assistive keypoint selectors alongside trackers.
-
Infrared physical adversarial examples are of great significance for studying the security of infrared AI systems that are widely used in our lives, such as autonomous driving. Previous infrared physical attacks mainly focused on 2D infrared pedestrian detection, which may not fully manifest their destructiveness to AI systems. In this work, we propose a physical attack method against infrared detectors based on 3D modeling, which is applied to a real car. The goal is to design a set of infrared adversarial stickers to make cars invisible to infrared detectors at various viewing angles, distances, and scenes. We build a 3D infrared car model with real infrared characteristics and propose an infrared adversarial pattern generation method based on 3D mesh shadow. We propose a 3D control points-based mesh smoothing algorithm and use a set of smoothness loss functions to enhance the smoothness of adversarial meshes and facilitate the sticker implementation. In addition, we designed the aluminum stickers and conducted physical experiments on two real Mercedes-Benz A200L cars. Our adversarial stickers hid the cars from Faster RCNN, an object detector, at various viewing angles, distances, and scenes. The attack success rate (ASR) was 91.49% for real cars. In comparison, the ASRs of random stickers and no sticker were only 6.21% and 0.66%, respectively. In addition, the ASRs of the designed stickers against six unseen object detectors, such as YOLOv3 and Deformable DETR, were between 73.35% and 95.80%, showing good transferability of the attack performance across detectors.
-
Recent works on text-to-3d generation show that using only 2D diffusion supervision for 3D generation tends to produce results with inconsistent appearances (e.g. faces on the back view) and inaccurate shapes (e.g. animals with extra legs). Existing methods mainly address this issue by retraining diffusion models with images rendered from 3D data to ensure multi-view consistency while struggling to balance 2D generation quality with 3D consistency. In this paper we present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model. Specifically we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach. Moreover to ensure accurate appearances of different views we further modulate the output of the 2D diffusion model to the correct patterns of the template views without altering the generated object's style. These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model. Extensive experiments show our method can largely improve the multi-view consistency while retaining fidelity and diversity.
-
Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. 3D human pose estimation is a vital step in advancing fields like AIGC and human-robot interaction, serving as a crucial technique for understanding and interacting with human actions in real-world settings. However, the current datasets, often collected under single laboratory conditions using complex motion capture equipment and unvarying backgrounds, are insufficient. The absence of datasets on variable conditions is stalling the progress of this crucial task. To facilitate the development of 3D pose estimation, we present FreeMan, the first large-scale multi-view dataset collected under real-world conditions. FreeMan was captured by synchronizing 8 smartphones across diverse scenarios. It comprises 11M frames from 8000 sequences viewed from different perspectives. These sequences cover 40 subjects across 10 different scenarios, each with varying lighting conditions. We have also established a semi-automated pipeline containing error detection to reduce the workload of manual checking and ensure precise annotation. We provide comprehensive evaluation baselines for a range of tasks, underlining the significant challenges posed by FreeMan. Further evaluations of standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes. FreeMan is publicly available at https://wangjiongw.github.io/freeman.
-
Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance they perform a dense perception of images which incorporates redundant visual regions unrelated to linguistic queries leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks with limited exploration in vision-language fields. To address this we propose a coarse-to-fine iterative perception framework called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration irrelevant patches are discarded by our designed informativeness prediction. Furthermore we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets namely RefCOCO RefCOCO+ RefCOCOg and ReferItGame verify the effectiveness of our method which can strike a balance between accuracy and efficiency.
-
Model Inversion (MI) attacks aim to reconstruct private training data by abusing access to machine learning models. Contemporary MI attacks have achieved impressive attack performance posing serious threats to privacy. Meanwhile all existing MI defense methods rely on regularization that is in direct conflict with the training objective resulting in noticeable degradation in model utility. In this work we take a different perspective and propose a novel and simple Transfer Learning-based Defense against Model Inversion (TL-DMI) to render MI-robust models. Particularly by leveraging TL we limit the number of layers encoding sensitive information from private training dataset thereby degrading the performance of MI attack. We conduct an analysis using Fisher Information to justify our method. Our defense is remarkably simple to implement. Without bells and whistles we show in extensive experiments that TL-DMI achieves state-of-the-art (SOTA) MI robustness. Our code pre-trained models demo and inverted data are available at: https://hosytuyen.github.io/projects/TL-DMI
-
Existing one-shot 4D head synthesis methods usually learn from monocular videos with the aid of 3DMM reconstruction, yet the latter is equally challenging, which restricts them from reasonable 4D head synthesis. We present a method to learn one-shot 4D head synthesis via large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning, to synthesize multi-view images of diverse identities and full motions as training data; then leverage a transformer-based animatable triplane reconstructor to learn 4D head reconstruction using the synthetic data. A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment. Experiments demonstrate our superiority over the prior art.
-
Applying Neural Radiance Fields (NeRF) to downstream perception tasks for scene understanding and representation is becoming increasingly popular. Most existing methods treat semantic prediction as an additional rendering task i.e. the "label rendering" task to build semantic NeRFs. However by rendering semantic/instance labels per pixel without considering the contextual information of the rendered image these methods usually suffer from unclear boundary segmentation and abnormal segmentation of pixels within an object. To solve this problem we propose Generalized Perception NeRF (GP-NeRF) a novel pipeline that makes the widely used segmentation model and NeRF work compatibly under a unified framework for facilitating context-aware 3D scene perception. To accomplish this goal we introduce transformers to aggregate radiance as well as semantic embedding fields jointly for novel views and facilitate the joint volumetric rendering of both fields. In addition we propose two self-distillation mechanisms i.e. the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss to enhance the discrimination and quality of the semantic field and the maintenance of geometric consistency. In evaluation as shown in Fig. 1 we conduct experimental comparisons under two perception tasks (i.e. semantic and instance segmentation) using both synthetic and real-world datasets. Notably our method outperforms SOTA approaches by 6.94% 11.76% and 8.47% on generalized semantic segmentation finetuning semantic segmentation and instance segmentation respectively
-
Lidar has become a cornerstone sensing modality for 3D vision especially for large outdoor scenarios and autonomous driving. Conventional lidar sensors are capable of providing centimeter-accurate distance information by emitting laser pulses into a scene and measuring the time-of-flight (ToF) of the reflection. However the polarization of the received light that depends on the surface orientation and material properties is usually not considered. As such the polarization modality has the potential to improve scene reconstruction beyond distance measurements. In this work we introduce a novel long-range polarization wavefront lidar sensor (PolLidar) that modulates the polarization of the emitted and received light. Departing from conventional lidar sensors PolLidar allows access to the raw time-resolved polarimetric wavefronts. We leverage polarimetric wavefronts to estimate normals distance and material properties in outdoor scenarios with a novel learned reconstruction method. To train and evaluate the method we introduce a simulated and real-world long-range dataset with paired raw lidar data ground truth distance and normal maps. We find that the proposed method improves normal and distance reconstruction by 53% mean angular error and 41% mean absolute error compared to existing shape-from-polarization (SfP) and ToF methods. Code and data are open-sourced here.
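The time-of-flight ranging that conventional lidar relies on reduces to a single relation: the pulse travels to the surface and back, so the distance is half the round-trip time multiplied by the speed of light. The sketch below illustrates only that textbook relation. The function name is hypothetical and nothing here reflects how PolLidar processes its polarimetric wavefronts.

```python
# Speed of light in vacuum, in metres per second (exact by SI definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def tof_to_distance(round_trip_seconds):
    """Convert a round-trip time-of-flight into a one-way distance in metres.

    Hypothetical helper for illustration; assumes propagation at the vacuum
    speed of light and ignores atmospheric and detector effects.
    """
    if round_trip_seconds < 0:
        raise ValueError("time-of-flight must be non-negative")
    # Divide by two because the pulse covers the distance twice.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0
```

For instance, a round trip of one microsecond corresponds to a target roughly 150 m away.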
-
Machine learning models face generalization challenges when exposed to out-of-distribution (OOD) samples with unforeseen distribution shifts. Recent research reveals that, for vision tasks, test-time adaptation employing diffusion models can achieve state-of-the-art accuracy improvements on OOD samples by generating domain-aligned samples without altering the model's weights. Unfortunately, those studies have primarily focused on pixel-level corruptions, thereby lacking the generalization to adapt to a broader range of OOD types. We introduce Generalized Diffusion Adaptation (GDA), a novel diffusion-based test-time adaptation method robust against diverse OOD types. Specifically, GDA iteratively guides the diffusion by applying a marginal entropy loss derived from the model, in conjunction with style and content preservation losses, during the reverse sampling process. In other words, GDA considers the model's output behavior and the samples' semantic information as a whole, reducing ambiguity in downstream tasks. Evaluation across various model architectures and OOD benchmarks indicates that GDA consistently surpasses previous diffusion-based adaptation methods. Notably, it achieves the highest classification accuracy improvements, ranging from 4.4% to 5.02% on ImageNet-C and 2.5% to 7.4% on Rendition, Sketch, and Stylized benchmarks. This performance highlights GDA's generalization to a broader range of OOD benchmarks.
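The abstract does not spell out the marginal entropy loss, so the following is a minimal sketch under an assumption: that it means the entropy of the classifier's predictive distribution averaged over several views of the same sample, as is common in test-time adaptation. The function names are illustrative and are not taken from the GDA code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_entropy(logits_per_view):
    """Entropy of the class distribution averaged over views.

    Assumption: each inner list holds the classifier logits for one
    augmented view of the same input sample.
    """
    probs = [softmax(logits) for logits in logits_per_view]
    n_views = len(probs)
    n_classes = len(probs[0])
    # Average the per-view probabilities to obtain the marginal distribution.
    marginal = [sum(p[c] for p in probs) / n_views for c in range(n_classes)]
    # Shannon entropy in nats; zero-probability classes contribute nothing.
    return -sum(p * math.log(p) for p in marginal if p > 0.0)
```

The value is low when the views agree on a confident prediction and high when they disagree or are individually uncertain, which is what makes it usable as a guidance signal.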
-
Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures that align naturally to the audio signal, semantically coherent gestures require modeling the complex interactions between language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (e.g. audio vs. text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures. To further advance the research on multi-party interactive gestures, the DnD Group Gesture dataset is released, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate the effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video at https://vcai.mpi-inf.mpg.de/projects/ConvoFusion/.
-
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding reasoning and interaction. However existing MLLMs prevalently suffer from serious hallucination problems generating text that is not factually grounded in associated images. The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications. To address the challenge we present RLHF-V which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Specifically RLHF-V collects human preference in the form of segment-level corrections on hallucinations and performs dense direct preference optimization over the human feedback. Comprehensive experiments on five benchmarks in both automatic and human evaluation show that RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency. Remarkably using 1.4k annotated data samples RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8% outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs and shows better robustness than GPT-4V in preventing hallucinations aroused from over-generalization. All the data code and model weights will be released to facilitate future research.
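For orientation, the standard direct preference optimization (DPO) loss for a single preference pair can be sketched as below. This is the plain DPO objective, not the dense, segment-weighted variant that RLHF-V describes. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. This is NOT the dense variant from RLHF-V.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss equals log 2 when the policy and reference agree, and shrinks as the policy raises the likelihood of the preferred (here, human-corrected) response relative to the rejected one.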
-
We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets but these models are computationally expensive at train and inference time. In contrast the traditional approach to this problem is regression-based where deterministic models are trained to directly regress the object shape. Such regression methods possess much higher computational efficiency than generative methods. This raises a natural question: is generative modeling necessary for high performance or conversely are regression-based approaches still competitive? To answer this we design a strong regression-based model called ZeroShape based on the converging findings in this field and a novel insight. We also curate a large real-world evaluation benchmark with objects from three different real-world 3D datasets. This evaluation benchmark is more diverse and an order of magnitude larger than what prior works use to quantitatively evaluate their models aiming at reducing the evaluation variance in our field. We show that ZeroShape not only achieves superior performance over state-of-the-art methods but also demonstrates significantly higher computational and data efficiency.
-
Continual Test-Time Adaptation (CTTA) is proposed to migrate a source pre-trained model to continually changing target distributions addressing real-world dynamism. Existing CTTA methods mainly rely on entropy minimization or teacher-student pseudo-labeling schemes for knowledge extraction in unlabeled target domains. However dynamic data distributions cause miscalibrated predictions and noisy pseudo-labels in existing self-supervised learning methods hindering the effective mitigation of error accumulation and catastrophic forgetting problems during the continual adaptation process. To tackle these issues we propose a continual self-supervised method Adaptive Distribution Masked Autoencoders (ADMA) which enhances the extraction of target domain knowledge while mitigating the accumulation of distribution shifts. Specifically we propose a Distribution-aware Masking (DaM) mechanism to adaptively sample masked positions followed by establishing consistency constraints between the masked target samples and the original target samples. Additionally for masked tokens we utilize an efficient decoder to reconstruct a hand-crafted feature descriptor (e.g. Histograms of Oriented Gradients) leveraging its invariant properties to boost task-relevant representations. Through conducting extensive experiments on four widely recognized benchmarks our proposed method attains state-of-the-art performance in both classification and segmentation CTTA tasks.
-
Recognizing continuous changes offers valuable insights into past historical events supports current trend analysis and facilitates future planning. This knowledge is crucial for a variety of fields such as meteorology and agriculture environmental science urban planning and construction tourism and cultural preservation. Currently available datasets in the field of scene change understanding primarily concentrate on two main tasks: the detection of changed regions within a scene and the linguistic description of the change content. Existing datasets focus on recognizing discrete changes such as adding or deleting an object from two images and largely rely on artificially generated images. Consequently the existing change understanding methods primarily focus on identifying distinct object differences overlooking the importance of continuous gradual changes occurring over extended time intervals. To address the above issues we propose a novel benchmark dataset STVchrono targeting the localization and description of long-term continuous changes in real-world scenes. The dataset consists of 71900 photographs from Google Street View API taken over an 18-year span across 50 cities all over the world. Our STVchrono dataset is designed to support real-world continuous change recognition and description in both image pairs and extended image sequences while also enabling the segmentation of changed regions. We conduct experiments to evaluate state-of-the-art methods on continuous change description and segmentation as well as multimodal Large Language Models for describing changes. Our findings reveal that even the most advanced methods lag human performance emphasizing the need to adapt them to continuously changing real-world scenarios. We hope that our benchmark dataset will further facilitate the research of temporal change recognition in a dynamic world. The STVchrono dataset is available at STVchrono Dataset.
-
Analyzing and forecasting trajectories of agents like pedestrians and cars in complex scenes has become more and more significant in many intelligent systems and applications. The diversity and uncertainty in socially interactive behaviors among a rich variety of agents make this task more challenging than other deterministic computer vision tasks. Researchers have made a lot of efforts to quantify the effects of these interactions on future trajectories through different mathematical models and network structures, but this problem has not been well solved. Inspired by marine animals that localize the positions of their companions underwater through echoes, we build a new angle-based trainable social interaction representation, named SocialCircle, for continuously reflecting the context of social interactions at different angular orientations relative to the target agent. We validate the effect of the proposed SocialCircle by training it along with several newly released trajectory prediction models, and experiments show that the SocialCircle not only quantitatively improves the prediction performance but also qualitatively helps better simulate social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuitions.
-
Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing showing remarkable versatility across various video tasks. However existing methods often fail to fully leverage their representation capabilities primarily due to inadequate alignment of intermediate features during target frame decoding. This paper introduces a universal boosting framework for current implicit video representation approaches. Specifically we utilize a conditional decoder with a temporal-aware affine transform module which uses the frame index as a prior condition to effectively align intermediate features with target frames. Besides we introduce a sinusoidal NeRV-like block to generate diverse intermediate features and achieve a more balanced parameter distribution thereby enhancing the model's capacity. With a high-frequency information-preserving reconstruction loss our approach successfully boosts multiple baseline INRs in the reconstruction quality and convergence speed for video regression and exhibits superior inpainting and interpolation results. Further we integrate a consistent entropy minimization technique and develop video codecs based on these boosted INRs. Experiments on the UVG dataset confirm that our enhanced codecs significantly outperform baseline INRs and offer competitive rate-distortion performance compared to traditional and learning-based codecs. Code is available at https://github.com/Xinjie-Q/Boosting-NeRV.
-
Traditional online class incremental learning assumes class sets in different tasks are disjoint. However recent works have shifted towards a more realistic scenario where tasks have shared classes creating blurred task boundaries. Under this setting although existing approaches could be directly applied challenges like data imbalance and varying class-wise data volumes complicate the critical coreset selection used for replay. To tackle these challenges we introduce DECO (Dual-Enhanced Coreset Selection with Class-wise Collaboration) an approach that starts by establishing a class-wise balanced memory to address data imbalances followed by a tailored class-wise gradient-based similarity scoring system for refined coreset selection strategies with reasonable score guidance to all classes. DECO is distinguished by two main strategies: (1) Collaborative Diverse Score Guidance that mitigates biased knowledge in less-exposed classes through guidance from well-established classes simultaneously consolidating the knowledge in the established classes to enhance overall stability. (2) Adaptive Similarity Score Constraint that relaxes constraints between class types boosting learning plasticity for less-exposed classes and assisting well-established classes in defining clearer boundaries thereby improving overall plasticity. Overall DECO helps effectively identify critical coreset samples improving learning stability and plasticity across all classes. Extensive experiments are conducted on four benchmark datasets to demonstrate the effectiveness and superiority of DECO over other competitors under this online blurry class incremental learning setting.
-
We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio we output multiple possibilities of gestural motion for an individual including face body and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures outperforming both diffusion- and VQ-only methods. Furthermore our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available on project page.
-
In this work we explore a novel task of generating human grasps based on single-view scene point clouds which more accurately mirrors the typical real-world situation of observing objects from a single viewpoint. Due to the incompleteness of object point clouds and the presence of numerous scene points the generated hand is prone to penetrating into the invisible parts of the object and the model is easily affected by scene points. Thus we introduce S2HGrasp a framework composed of two key modules: the Global Perception module that globally perceives partial object point clouds and the DiffuGrasp module designed to generate high-quality human grasps based on complex inputs that include scene points. Additionally we introduce S2HGD dataset which comprises approximately 99000 single-object single-view scene point clouds of 1668 unique objects each annotated with one human grasp. Our extensive experiments demonstrate that S2HGrasp can not only generate natural human grasps regardless of scene points but also effectively prevent penetration between the hand and invisible parts of the object. Moreover our model showcases strong generalization capability when applied to unseen objects. Our code and dataset are available at https://github.com/iSEE-Laboratory/S2HGrasp.
-
Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. We enforce that the one-step image generator matches the diffusion model at the distribution level by minimizing an approximate KL divergence whose gradient can be expressed as the difference between two score functions, one of the target distribution and the other of the synthetic distribution being produced by our one-step generator. The score functions are parameterized as two diffusion models trained separately on each distribution. Combined with a simple regression loss matching the large-scale structure of the multi-step diffusion outputs, our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. Utilizing FP16 inference, our model can generate images at 20 FPS on modern hardware.
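The idea that the KL gradient is a difference of two score functions can be illustrated with a one-dimensional toy. In the sketch below, closed-form Gaussian scores stand in for the two diffusion models, and the "generator" is a single learnable shift. All names and the Gaussian setup are assumptions for illustration and are not the DMD implementation.

```python
def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1D Gaussian at x."""
    return -(x - mean) / (std ** 2)

def distill_shift(real_mean=3.0, steps=200, lr=0.1):
    """Fit a one-parameter generator x = z + theta to a target Gaussian.

    Toy assumption: z ~ N(0, 1), so the generated distribution is
    N(theta, 1) and the target is N(real_mean, 1); both scores are exact.
    """
    theta = 0.0
    latents = [-1.0, -0.5, 0.0, 0.5, 1.0]  # fixed latents keep the demo deterministic
    for _ in range(steps):
        grad = 0.0
        for z in latents:
            x = z + theta  # one-step "generation"
            s_fake = gaussian_score(x, theta, 1.0)      # score of the synthetic distribution
            s_real = gaussian_score(x, real_mean, 1.0)  # score of the target distribution
            # KL gradient contribution: (s_fake - s_real) * dx/dtheta, with dx/dtheta = 1.
            grad += s_fake - s_real
        theta -= lr * grad / len(latents)
    return theta
```

Here the score difference simplifies to `theta - real_mean` for every sample, so gradient descent pulls the generator's mean onto the target mean.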
-
Binaural audio is obtained by simulating the biological structure of human ears, which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio, thereby avoiding expensive binaural audio recording. However, most existing methods directly use the entire scene as a guide, ignoring the correspondence between sounds and sounding objects. In this paper, we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically, we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities, which provides object-aware guidance to improve binaural generation performance. Meanwhile, the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case, visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that on the FAIR-Play benchmark dataset, our method is significantly ahead of the existing baselines in multiple evaluation metrics (STFT↓: 0.787 vs. 0.851, ENV↓: 0.128 vs. 0.134, WAV↓: 5.244 vs. 5.684, SNR↑: 7.546 vs. 7.044).
-
Video scene detection aims to temporally link shots for obtaining semantically compact scenes. It is essential for this task to capture scene-distinguishable affinity among shots by similarity assessment. However, most methods rely on ordinary shot-to-shot similarities, which may inveigle similar shots into being linked even though they are from different scenes, and meanwhile hinder dissimilar shots from being blended into a complete scene. In this paper, we propose NeighborNet to inject shot contexts into shot-to-shot similarities through carefully exploring the relations between semantic/temporal neighbors of shots over a local time period. In this way, shot-to-shot similarities are remeasured as semantic/temporal neighbor-aware similarities, so that NeighborNet can learn context embedding into shot features using a graph convolutional network. As a result, not only do the learned shot features suppress the affinity among similar shots from different scenes, but they also promote the affinity among dissimilar shots in the same scene. Experimental results on public benchmark datasets show that our proposed NeighborNet yields substantial improvements in video scene detection, in particular outperforming released state-of-the-art methods by at least 6% in Average Precision (AP). The code is available at https://github.com/ExMorgan-Alter/NeighborNet.
-
Long-term and accurate forecasting is the long-standing pursuit of the human motion prediction task. Existing methods typically suffer from dramatic degradation in prediction accuracy as the prediction horizon increases. This comes down to two reasons: 1) Insufficient numerical stability: unforeseen high noise and complex feature relationships in the data. 2) Inadequate modeling stability: unreasonable step sizes and undesirable parameter updates in the prediction. In this paper, we design a novel symplectic integral-inspired framework named symplectic integral neural network (SINN), which engages symplectic trajectories to optimize the pose representation and employs a stable symplectic operator to alternately model the dynamic context. Specifically, we design a Symplectic Representation Encoder that operates on an enhanced human pose representation to obtain trajectories on the symplectic manifold, ensuring numerical stability based on Hamiltonian mechanics and a symplectic spatial splitting algorithm. We further present the Symplectic Temporal Aggregation module in the light of the symplectic temporal splitting algorithm, which splits the long-term prediction into multiple accurate short-term predictions generated by a symplectic operator to secure modeling stability. Moreover, our approach is model-agnostic and can be efficiently integrated with different physical dynamics models. The experimental results demonstrate that our method achieves the new state-of-the-art, outperforming existing methods by large margins: 20.1% on Human3.6M, 16.7% on CMU Mocap, and 10.2% on 3DPW.
-
This paper for the first time explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos. This proficiency is underpinned by their robust cross-modal capabilities and shape bias findings that are substantiated through our pilot studies. In order to harness pre-trained diffusion models effectively we introduce a straightforward yet powerful strategy focused on two key aspects: selecting optimal feature layers and utilising visual and textual prompts. For the former we identify which layers are most enriched with information and are best suited for the specific retrieval requirements (category-level or fine-grained). Then we employ visual and textual prompts to guide the model's feature extraction process enabling it to generate more discriminative and contextually relevant cross-modal representations. Extensive experiments on several benchmark datasets validate significant performance improvements.
-
Nuclear instance segmentation has played a critical role in pathology image analysis. The main challenges arise from the difficulty in accurately segmenting densely overlapping instances and the high cost of precise mask-level annotations. Existing fully-supervised nuclear instance segmentation methods, such as boundary-based methods, struggle to capture differences between overlapping instances and thus fail in densely distributed blurry regions. They also face challenges transitioning to point supervision, where annotations are simple and effective. Inspired by natural mudslides, we propose a universal method called Mudslide that uses simple representations to characterize differences between different instances and can easily be extended from fully-supervised to point-supervised. Concretely, we introduce a collapse field and leverage it to construct a force map and initial boundary, enabling a distinctive representation for each instance. Each pixel is assigned a collapse force, with distinct directions between adjacent instances. Starting from the initial boundary, Mudslide executes a pixel-by-pixel collapse along various force directions. Pixels that collapse into the same region are considered as one instance, concurrently accounting for both inter-instance distinctions and intra-instance coherence. Experiments on public datasets show superior performance in both fully-supervised and point-supervised tasks.
-
Recently numerous approaches have achieved notable success in compressed video quality enhancement (VQE). However these methods usually ignore the utilization of valuable coding priors inherently embedded in compressed videos such as motion vectors and residual frames which carry abundant temporal and spatial information. To remedy this problem we propose the Coding Priors-Guided Aggregation (CPGA) network to utilize temporal and spatial information from coding priors. The CPGA mainly consists of an inter-frame temporal aggregation (ITA) module and a multi-scale non-local aggregation (MNA) module. Specifically the ITA module aggregates temporal information from consecutive frames and coding priors while the MNA module globally captures spatial information guided by residual frames. In addition to facilitate research in VQE task we newly construct the Video Coding Priors (VCP) dataset comprising 300 videos with various coding priors extracted from corresponding bitstreams. It remedies the shortage of previous datasets on the lack of coding information. Experimental results demonstrate the superiority of our method compared to existing state-of-the-art methods. The code and dataset will be released at https://github.com/VQE-CPGA/CPGA.
-
We present MicroCinema a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly MicroCinema introduces a Divide-and-Conquer strategy which divides the text-to-video into a two-stage process: text-to-image generation and image&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of the recent advances in text-to-image models such as Stable Diffusion Midjourney and DALLE to generate photorealistic and highly detailed images. b) Leveraging the generated image the model can allocate less focus to fine-grained appearance details prioritizing the efficient learning of motion dynamics. To implement this strategy effectively we introduce two core designs. First we propose the Appearance Injection Network enhancing the preservation of the appearance of the given image. Second we introduce the Appearance Noise Prior a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT.
-
Multi-instance point cloud registration estimates the poses of multiple instances of a model point cloud in a scene point cloud. Extracting accurate point correspondences is at the center of the problem. Existing approaches usually treat the scene point cloud as a whole, overlooking the separation of instances. Therefore, point features could be easily polluted by other points from the background or different instances, leading to inaccurate correspondences oblivious to separate instances, especially in cluttered scenes. In this work, we propose MIRETR, Multi-Instance REgistration TRansformer, a coarse-to-fine approach to the extraction of instance-aware correspondences. At the coarse level, it jointly learns instance-aware superpoint features and predicts per-instance masks. With instance masks, the influence from outside of the instance being concerned is minimized, such that highly reliable superpoint correspondences can be extracted. The superpoint correspondences are then extended to instance candidates at the fine level according to the instance masks. Finally, an efficient candidate selection and refinement algorithm is devised to obtain the final registrations. Extensive experiments on three public benchmarks demonstrate the efficacy of our approach. In particular, MIRETR outperforms the state of the art by 16.6 points on F1 score on the challenging ROBI benchmark. Code and models are available at https://github.com/zhiyuanYU134/MIRETR
-
Denoising diffusion probabilistic models (DDPMs) for image inpainting add noise to the texture of the image during the forward process and recover the masked regions from the unmasked ones via the reverse denoising process. Although they generate meaningful semantics, existing methods suffer from a semantic discrepancy between the masked and unmasked regions: the semantically dense unmasked texture fails to be completely degraded, while the masked regions turn to pure noise in the diffusion process, leading to a large discrepancy between them. In this paper, we aim to answer how the unmasked semantics guide the texture denoising process, and how to tackle the semantic discrepancy to facilitate consistent and meaningful semantics generation. To this end, we propose a novel structure-guided diffusion model for image inpainting, named StrDiffusion, which reformulates the conventional texture denoising process under structure guidance to derive a simplified denoising objective for image inpainting, while revealing that: 1) the semantically sparse structure helps tackle the semantic discrepancy in the early stage, while the dense texture generates reasonable semantics in the late stage; 2) the semantics from the unmasked regions essentially offer time-dependent structure guidance for the texture denoising process, benefiting from the time-dependent sparsity of the structure semantics. For the denoising process, a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. In addition, we devise an adaptive resampling strategy as a formal criterion for whether the structure is competent to guide the texture denoising process, while regulating their semantic correlations. Extensive experiments validate the merits of StrDiffusion over the state of the art. Our code is available at https://github.com/htyjers/StrDiffusion.
-
Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently they are limited in modeling the intricate dynamics of multi-party interactions. In this paper we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification pronoun coreference resolution and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
-
In recent decades the vision community has witnessed remarkable progress in visual recognition partially owing to advancements in dataset benchmarks. Notably the established COCO benchmark has propelled the development of modern detection and segmentation systems. However the COCO segmentation benchmark has seen comparatively slow improvement over the last decade. Originally equipped with coarse polygon annotations for thing instances it gradually incorporated coarse superpixel annotations for stuff regions which were subsequently heuristically amalgamated to yield panoptic segmentation annotations. These annotations executed by different groups of raters have resulted not only in coarse segmentation masks but also in inconsistencies between segmentation types. In this study we undertake a comprehensive reevaluation of the COCO segmentation annotations. By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks we introduce COCONut the COCO Next Universal segmenTation dataset. COCONut harmonizes segmentation annotations across semantic instance and panoptic segmentation with meticulously crafted high-quality masks and establishes a robust benchmark for all segmentation tasks. To our knowledge COCONut stands as the inaugural large-scale universal segmentation dataset verified by human raters. We anticipate that the release of COCONut will significantly contribute to the community's ability to assess the progress of novel neural networks.
-
A novel algorithm called semantic line combination detector (SLCD) to find an optimal combination of semantic lines is proposed in this paper. It processes all lines in each line combination at once to assess the overall harmony of the lines. First we generate various line combinations from reliable lines. Second we estimate the score of each line combination and determine the best one. Experimental results demonstrate that the proposed SLCD outperforms existing semantic line detectors on various datasets. Moreover it is shown that SLCD can be applied effectively to three vision tasks of vanishing point detection symmetry axis detection and composition-based image retrieval. Our codes are available at https://github.com/Jinwon-Ko/SLCD.
-
Single-domain generalization aims to learn a model from single source domain data attaining generalized performance on other unseen target domains. Existing works primarily focus on improving the generalization ability of static networks. However static networks are unable to dynamically adapt to the diverse variations in different image scenes leading to limited generalization capability. Different scenes exhibit varying levels of complexity and the complexity of images further varies significantly in cross-domain scenarios. In this paper we propose a dynamic object-centric perception network based on prompt learning aiming to adapt to the variations in image complexity. Specifically we propose an object-centric gating module based on prompt learning to focus attention on the object-centric features guided by the various scene prompts. Then with the object-centric gating masks the dynamic selective module dynamically selects highly correlated feature regions in both spatial and channel dimensions enabling the model to adaptively perceive object-centric relevant features thereby enhancing the generalization capability. Extensive experiments were conducted on single-domain generalization tasks in image classification and object detection. The experimental results demonstrate that our approach outperforms state-of-the-art methods which validates the effectiveness and versatility of our proposed method.
-
In the context of pose-invariant object recognition and retrieval, we demonstrate that it is possible to achieve significant improvements in performance if both the category-based and the object-identity-based embeddings are learned simultaneously during training. In hindsight, that sounds intuitive because learning about the categories is more fundamental than learning about the individual objects that correspond to those categories. However, to the best of our knowledge, no prior work in pose-invariant learning has demonstrated this effect. This paper presents an attention-based dual-encoder architecture with specially designed loss functions that optimize the inter- and intra-class distances simultaneously in two different embedding spaces, one for the category embeddings and the other for the object-level embeddings. The proposed loss functions are pose-invariant ranking losses designed to minimize the intra-class distances and maximize the inter-class distances in the dual representation spaces. We demonstrate the power of our approach with three challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D. With our dual approach, for single-view object recognition, we outperform the previous best by 20.0% on ModelNet40, 2.0% on ObjectPI, and 46.5% on FG3D. For single-view object retrieval, we outperform the previous best by 33.7% on ModelNet40, 18.8% on ObjectPI, and 56.9% on FG3D.
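As a rough illustration of what a ranking loss over embedding distances looks like, the sketch below computes a margin-based, triplet-style hinge loss. It is not the paper's pose-invariant loss: the function names `sq_dist` and `margin_ranking_loss`, the margin of 1.0, and the toy 2D embeddings are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def margin_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that reaches zero once the negative is farther from the
    anchor than the positive by at least `margin` (hypothetical value)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Two views of the same object (anchor, positive) and one view of another object.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
negative = [2.0, 0.0]

loss_separated = margin_ranking_loss(anchor, positive, negative)  # well separated
loss_violating = margin_ranking_loss(anchor, negative, positive)  # roles swapped
```

Minimizing such a loss shrinks the anchor-positive distance and grows the anchor-negative distance; a dual-space variant would apply one loss of this kind per embedding space.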
-
Video Transformers have become the prevalent solution for various video downstream tasks, with superior expressive power and flexibility. However, these video Transformers suffer from heavy computational costs induced by the massive number of tokens across the entire video frames, which has been the major barrier to training the model. Further, patches irrelevant to the main contents, e.g., backgrounds, degrade the generalization performance of models. To tackle these issues, we propose training-free token merging for lightweight video Transformer (vid-TLDR), which aims to enhance the efficiency of video Transformers by merging the background tokens without additional training. For vid-TLDR, we introduce a novel approach to capture the salient regions in videos using only the attention map. Further, we introduce a saliency-aware token merging strategy that drops the background tokens and sharpens the object scores. Our experiments show that vid-TLDR significantly reduces the computational complexity of video Transformers while achieving competitive performance compared to the base model without vid-TLDR. Code is available at https://github.com/mlvlab/vid-TLDR.
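To make the idea of saliency-aware token reduction concrete, here is a deliberately simplified sketch: it keeps the most salient tokens and averages the remaining (background) tokens into a single token. This is not the vid-TLDR algorithm; the function name `merge_background_tokens`, the `keep` parameter, and the plain averaging rule are assumptions for illustration, and the saliency scores are given directly rather than derived from an attention map.

```python
def merge_background_tokens(tokens, saliency, keep):
    """Keep the `keep` most salient tokens; average the rest into one token."""
    # Token indices ordered from most to least salient.
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    kept_idx = sorted(order[:keep])   # preserve the original token order
    rest_idx = order[keep:]
    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept
    dim = len(tokens[0])
    # Plain mean of the low-saliency tokens (a hypothetical merging rule).
    merged = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx) for d in range(dim)]
    return kept + [merged]


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
saliency = [0.9, 0.1, 0.8, 0.2]
reduced = merge_background_tokens(tokens, saliency, keep=2)
```

Here four tokens are reduced to three, so any later attention layer operates on a shorter sequence without any retraining step.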
-
We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feedback, they are still prone to generating unhelpful, hallucinated, or harmful responses. Second, while the visual instruction tuning data is generally structured in a multi-turn dialogue format, the connections and dependencies among consecutive conversational turns are weak. This reduces the capacity for effective multi-turn interactions. To tackle these, we propose a novel categorization of the NLF into two key types: critique and refinement. The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences. The refinement NLF offers concrete suggestions for improvement and is adopted to improve the interaction ability of the LVLMs, that is, their ability to refine responses by incorporating feedback in multi-turn interactions. To address the non-differentiable nature of NLF, we generalize conditional reinforcement learning for training. Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses, and more effectively learn from feedback during multi-turn interactions compared to SOTA LVLMs.
-
In this work we introduce two types of makeup prior models to extend existing 3D face prior models: PCA-based and StyleGAN2-based priors. The PCA-based prior model is a linear model that is easy to construct and is computationally efficient. However it retains only low-frequency information. Conversely the StyleGAN2-based model can represent high-frequency information with relatively higher computational cost than the PCA-based model. Although there is a trade-off between the two models both are applicable to 3D facial makeup estimation and related applications. By leveraging makeup prior models and designing a makeup consistency module we effectively address the challenges that previous methods faced in robustly estimating makeup particularly in the context of handling self-occluded faces. In experiments we demonstrate that our approach reduces computational costs by several orders of magnitude achieving speeds up to 180 times faster. In addition by improving the accuracy of the estimated makeup we confirm that our methods are highly advantageous for various 3D facial makeup applications such as 3D makeup face reconstruction user-friendly makeup editing makeup transfer and interpolation.
-
DETR-like methods have significantly increased detection performance in an end-to-end manner. Their mainstream two-stage frameworks perform dense self-attention and select a fraction of queries for sparse cross-attention, which has proven effective for improving performance but also introduces a heavy computational burden and a high dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects in two-stage initialization. To address these issues, we propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries, we introduce elaborate query refinement modules for stable two-stage initialization. Based on the above improvements, the proposed Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, and +4.4% AP on three challenging task-specific detection datasets, as well as 49.2% AP on COCO 2017 with fewer FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR.
-
The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks, such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL frameworks cannot produce content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompts into a unified representational space, structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.
-
In this paper we propose an efficient data-driven solution to self-localization within a floorplan. Floorplan data is readily available long-term persistent and inherently robust to changes in the visual appearance. Our method does not require retraining per map and location or demand a large database of images of the area of interest. We propose a novel probabilistic model consisting of an observation and a novel temporal filtering module. Operating internally with an efficient ray-based representation the observation module consists of a single and a multiview module to predict horizontal depth from images and fuses their results to benefit from advantages offered by either methodology. Our method operates on conventional consumer hardware and overcomes a common limitation of competing methods that often demand upright images. Our full system meets real-time requirements while outperforming the state-of-the-art by a significant margin.
-
3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However recovering a high-quality NeRF typically requires tens to hundreds of input images resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis trained on synthetic and multiview datasets which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets including forward-facing and 360-degree scenes demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches. Please see our project page at reconfusion.github.io.
-
We are living in a world surrounded by diverse and "smart" devices with rich modalities of sensing ability. Conveniently capturing the interactions between us humans and these objects remains a distant goal. In this paper, we present I'm-HOI, a monocular scheme to faithfully capture the 3D motions of both the human and object in a novel setting: using a minimal set of sensors, namely an RGB camera and an object-mounted Inertial Measurement Unit (IMU). It combines general motion inference and category-aware refinement. For the former, we introduce a holistic human-object tracking method to fuse the IMU signals and the RGB stream and progressively recover the human motions and subsequently the companion object motions. For the latter, we tailor a category-aware motion diffusion model, which is conditioned on both the raw IMU observations and the results from the previous stage under an over-parameterized representation. It significantly refines the initial results and generates vivid body, hand, and object motions. Moreover, we contribute a large dataset with ground truth human and object motions, dense RGB inputs, and rich object-mounted IMU measurements. Extensive experiments demonstrate the effectiveness of I'm-HOI under a hybrid capture setting. Our dataset and code will be released to the community.
-
Multi-Instance Learning (MIL) has shown impressive performance for histopathology whole slide image (WSI) analysis using bags or pseudo-bags. It involves instance sampling, feature representation, and decision-making. However, existing MIL-based technologies suffer from at least one of the following problems: 1) requiring high storage and intensive pre-processing for numerous instances (sampling); 2) potential over-fitting with limited knowledge to predict bag labels (feature representation); 3) pseudo-bag counts and prior biases affecting model robustness and generalizability (decision-making). Inspired by clinical diagnostics, using past sampling instances can facilitate the final WSI analysis, but this is barely explored in prior technologies. To break free of these limitations, we integrate dynamic instance sampling and reinforcement learning into a unified framework to improve instance selection and feature aggregation, forming a novel Dynamic Policy Instance Selection (DPIS) scheme for better and more credible decision-making. Specifically, the measurement of feature distance and a reward function are employed to boost continuous instance sampling. To alleviate over-fitting, we explore the latent global relations among instances for more robust and discriminative feature representation, while establishing reward and punishment mechanisms to correct biases in pseudo-bags using contrastive learning. These strategies form the final Dynamic Policy-Driven Adaptive Multi-Instance Learning (PAMIL) method for WSI tasks. Extensive experiments reveal that our PAMIL method outperforms the state of the art by 3.8% on CAMELYON16 and 4.4% on TCGA lung cancer datasets.
-
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However the progress in vision and vision-language foundation models which are also critical elements of multi-modal AGI has not kept pace with LLMs. In this work we design a large-scale vision-language foundation model (InternVL) which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition vision-language tasks such as zero-shot image/video classification zero-shot image/video-text retrieval and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models.
-
We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress in query-based MV3D object detection, prior art often either fails to exploit high-resolution 2D features in dense attention-based lifting, due to high computational costs, or grounds 3D queries insufficiently densely to multi-scale 2D features in sparse attention-based lifting. Our proposed MvACon kills two birds with one stone, using a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to specific 2D-to-3D feature lifting approaches. In experiments, the proposed MvACon is thoroughly tested on the nuScenes benchmark using BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR, showing consistent detection performance improvement, especially in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer, with similar improvement. We qualitatively and quantitatively show that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforce the adage in computer vision: "(contextualized) feature matters".
-
We propose a novel echocardiography video segmentation model by adapting SAM to medical videos to address some long-standing challenges in ultrasound video segmentation, including (1) massive speckle noise and artifacts, (2) extremely ambiguous boundaries, and (3) large variations of target objects across frames. The core technique of our model is a temporal-aware and noise-resilient prompting scheme. Specifically, we employ a space-time memory that contains both spatial and temporal information to prompt the segmentation of the current frame, and thus we call the proposed model MemSAM. During prompting, the memory carrying temporal cues sequentially prompts the video segmentation frame by frame. Meanwhile, as the memory prompt propagates high-level features, it avoids the issue of misidentification caused by mask propagation and improves representation consistency. To address the challenge of speckle noise, we further propose a memory reinforcement mechanism, which leverages predicted masks to improve the quality of the memory before storing it. We extensively evaluate our method on two public datasets and demonstrate state-of-the-art performance compared to existing models. In particular, with limited annotations, our model achieves performance comparable to fully supervised approaches. Codes are available at https://github.com/dengxl0520/MemSAM.
-
Although neural radiance fields (NeRFs) have achieved triumphs in image novel view synthesis (NVS) LiDAR NVS remains largely unexplored. Previous LiDAR NVS methods employ a simple shift from image NVS methods while ignoring the dynamic nature and the large-scale reconstruction problem of LiDAR point clouds. In light of this we propose LiDAR4D a differentiable LiDAR-only framework for novel space-time LiDAR view synthesis. In consideration of the sparsity and large-scale characteristics we design a 4D hybrid representation combined with multi-planar and grid features to achieve effective reconstruction in a coarse-to-fine manner. Furthermore we introduce geometric constraints derived from point clouds to improve temporal consistency. For the realistic synthesis of LiDAR point clouds we incorporate the global optimization of ray-drop probability to preserve cross-region patterns. Extensive experiments on KITTI-360 and NuScenes datasets demonstrate the superiority of our method in accomplishing geometry-aware and time-consistent dynamic reconstruction. Codes are available at https://github.com/ispc-lab/LiDAR4D.
-
Contents generated by recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate due to the immitigable domain gap. We introduce DMP a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks. To address the misalignment between deterministic prediction tasks and stochastic T2I models we reformulate the diffusion process through a sequence of interpolations establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability we use low-rank adaptation to fine-tune pre-trained models. Extensive experiments across five tasks including 3D property estimation semantic segmentation and intrinsic image decomposition showcase the efficacy of the proposed method. Despite limited-domain training data the approach yields faithful estimations for arbitrary images surpassing existing state-of-the-art algorithms.
-
Diffusion models trained on large-scale text-image datasets have demonstrated a strong capability of controllable high-quality image generation from arbitrary text prompts. However the generation quality and generalization ability of 3D diffusion models is hindered by the scarcity of high-quality and large-scale 3D datasets. In this paper we present PI3D a framework that fully leverages the pre-trained text-to-image diffusion models' ability to generate high-quality 3D shapes from text prompts in minutes. The core idea is to connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB Images. We fine-tune an existing text-to-image diffusion model to produce such pseudo-images using a small number of text-3D pairs. Surprisingly we find that it can already generate meaningful and consistent 3D shapes given complex text descriptions. We further take the generated shapes as the starting point for a lightweight iterative refinement using score distillation sampling to achieve high-quality generation under a low budget. PI3D generates a single 3D shape from text in only 3 minutes and the quality is validated to outperform existing 3D generative models by a large margin.
-
Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited pre-defined set of them they fall short of achieving scalability where a single model can seamlessly render countless concepts. In this paper we address a new problem called Modular Customization with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem we introduce Orthogonal Adaptation a method designed to encourage the customized models which do not have access to each other during fine-tuning to have orthogonal residual weights. This ensures that during inference time the customized models can be summed with minimal interference. Our proposed method is both simple and versatile applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations our method consistently outperforms relevant baselines in terms of efficiency and identity preservation demonstrating a significant leap toward scalable customization of diffusion models.
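The claim that orthogonal residual weights can be summed with minimal interference can be checked on a toy case. The sketch below uses rank-1 residuals of the form `b a^T` with mutually orthogonal input directions `a1` and `a2`; this rank-1 form, the helper names, and the numbers are assumptions chosen for illustration, not the construction used in the paper.

```python
def outer(b, a):
    """Rank-1 matrix b a^T as a list of rows."""
    return [[bi * aj for aj in a] for bi in b]


def add(m1, m2):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def matvec(m, x):
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


# Hypothetical input directions for two independently customized concepts.
a1 = [1.0, 0.0, 0.0]
a2 = [0.0, 1.0, 0.0]   # orthogonal to a1
b1 = [0.5, -1.0]
b2 = [2.0, 3.0]

delta1 = outer(b1, a1)          # residual weights of concept 1
delta2 = outer(b2, a2)          # residual weights of concept 2
merged = add(delta1, delta2)    # merging by plain summation

x = a1                          # an input that only concept 1 responds to
out_alone = matvec(delta1, x)
out_merged = matvec(merged, x)
```

Because `a2` is orthogonal to `x`, the second residual contributes nothing, and the merged response equals that of the first residual alone. With non-orthogonal directions the cross term would not vanish.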
-
We introduce pixelSplat a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field. Additional materials can be found on the anonymous project website (pixelsplat.github.io).
-
Video generation has witnessed significant advancements yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end we present VBench a comprehensive benchmark suite that dissects "video generation quality" into specific hierarchical and disentangled dimensions each with tailored prompts and evaluation methods. VBench has three appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g. subject identity inconsistency motion smoothness temporal flickering and spatial relationship etc). The evaluation metrics with fine-grained levels reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception for each evaluation dimension respectively. 3) Valuable Insights: We look into current models' ability across various evaluation dimensions and various content types. We also investigate the gaps between video and image generation models. We will open-source VBench including all prompts evaluation methods generated videos and human preference annotations and also include more video generation models in VBench to drive forward the field of video generation.
-
We present a new open-vocabulary detection framework. Our framework uses both image-level labels and detailed detection annotations when available. Our framework proceeds in three steps. We first train a language-conditioned object detector on fully-supervised detection data. This detector gets to see the presence or absence of ground truth classes during training, and conditions prediction on the set of present classes. We use this detector to pseudo-label images with image-level labels. Thanks to its conditioning mechanism, our detector provides much more accurate pseudo-labels than prior approaches. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance on the open-vocabulary LVIS benchmark as well as on direct zero-shot transfer benchmarks on LVIS, COCO, Objects365, and OpenImages. DECOLA outperforms prior art by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results across various model sizes, architectures, and datasets by training only on open-sourced data and academic-scale computing. Code is available at https://github.com/janghyuncho/DECOLA.
-
We propose Diffusion Noise Optimization (DNO) a new method that effectively leverages existing motion diffusion models as motion priors for a wide range of motion-related tasks. Instead of training a task-specific diffusion model for each new task DNO operates by optimizing the diffusion latent noise of an existing pre-trained text-to-motion model. Given the corresponding latent noise of a human motion it propagates the gradient from the target criteria defined on the motion space through the whole denoising process to update the diffusion latent noise. As a result DNO supports any use cases where criteria can be defined as a function of motion. In particular we show that for motion editing and control DNO outperforms existing methods in both achieving the objective and preserving the motion content. DNO accommodates a diverse range of editing modes including changing trajectory pose joint locations or avoiding newly added obstacles. In addition DNO is effective in motion denoising and completion producing smooth and realistic motion from noisy and partial inputs. DNO achieves these results at inference time without the need for model retraining offering great versatility for any defined reward or loss function on the motion representation.
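The core loop described above (keep the pre-trained model frozen, backpropagate a criterion on the output through every denoising step, and update only the latent noise) can be sketched on a toy problem. This is a minimal illustration, not the DNO implementation: the scalar `denoise_step`, the five-step chain, the squared-error criterion, and all names are assumptions made for the example.

```python
import math

def denoise_step(x):
    # Stand-in for one frozen denoising step (purely illustrative).
    return 0.5 * x + 0.5 * math.tanh(x)

def denoise_step_grad(x):
    # Derivative of denoise_step with respect to its input.
    return 0.5 + 0.5 * (1.0 - math.tanh(x) ** 2)

def decode(z, steps=5):
    """Run the frozen chain on latent z; return (output, d output / d z)."""
    x, grad = z, 1.0
    for _ in range(steps):
        grad *= denoise_step_grad(x)  # chain rule through this step
        x = denoise_step(x)
    return x, grad

def optimize_noise(z_init, target_index, target_value, lr=0.5, iters=2000):
    """Gradient descent on the latent only; the denoiser is never retrained."""
    z = list(z_init)
    for _ in range(iters):
        out, grad = decode(z[target_index])
        # Criterion: squared error between one output coordinate and its target.
        z[target_index] -= lr * 2.0 * (out - target_value) * grad
    return z

z0 = [0.3, -0.2, 0.0, 0.5]
z_opt = optimize_noise(z0, target_index=2, target_value=0.8)
motion = [decode(z)[0] for z in z_opt]
print(round(motion[2], 3))
```

In this toy setting only the constrained coordinate of the latent changes, so the remaining coordinates decode exactly as before, which loosely mirrors the content-preservation goal in the abstract.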
-
Deep learning has achieved remarkable progress in various applications, heightening the importance of safeguarding the intellectual property (IP) of well-trained models. It entails not only authorizing usage but also ensuring the deployment of models in authorized data domains, i.e., making models exclusive to certain target domains. Previous methods necessitate concurrent access to source training data and target unauthorized data when performing IP protection, making them risky and inefficient for decentralized private data. In this paper, we target a practical setting where only a well-trained source model is available and investigate how we can realize IP protection. To achieve this, we propose a novel MAsk Pruning (MAP) framework. MAP stems from an intuitive hypothesis, i.e., there are target-related parameters in a well-trained model, and locating and pruning them is the key to IP protection. Technically, MAP freezes the source model and learns a target-specific binary mask to prevent unauthorized data usage while minimizing performance degradation on authorized data. Moreover, we introduce a new metric aimed at achieving a better balance between source and target performance degradation. To verify the effectiveness and versatility, we have evaluated MAP in a variety of scenarios, including vanilla source-available, practical source-free, and challenging data-free. Extensive experiments indicate that MAP yields new state-of-the-art performance.
-
In this work we tackle the problem of domain generalization for object detection specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly we demonstrate that by carefully selecting a set of augmentations a base detector can outperform existing methods for single domain generalization by a good margin. This highlights the importance of domain diversification in improving the performance of object detectors. Secondly we introduce a method to align detections from multiple views considering both classification and localization outputs. This alignment procedure leads to better generalized and well-calibrated object detector models which are crucial for accurate decision-making in safety-critical applications. Our approach is detector-agnostic and can be seamlessly applied to both single-stage and two-stage detectors. To validate the effectiveness of our proposed methods we conduct extensive experiments and ablations on challenging domain-shift scenarios. The results consistently demonstrate the superiority of our approach compared to existing methods.
-
In the realm of food computing, segmenting ingredients from images poses substantial challenges due to the large intra-class variance among the same ingredients, the emergence of new ingredients, and the high annotation costs associated with large food segmentation datasets. Existing approaches primarily utilize a closed-vocabulary and static text embeddings setting. These methods often fall short in effectively handling the ingredients, particularly new and diverse ones. In response to these limitations, we introduce OVFoodSeg, a framework that adopts an open-vocabulary setting and enhances text embeddings with visual context. By integrating vision-language models (VLMs), our approach enriches text embedding with image-specific information through two innovative modules, namely an image-to-text learner, FoodLearner, and an Image-Informed Text Encoder. The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation. The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food, while the second phase adapts both the FoodLearner and the Image-Informed Text Encoder for the segmentation task. By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset, setting a new milestone for food image segmentation.
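The mIoU metric quoted above is the per-class intersection over union averaged across classes. A minimal sketch of that metric follows, assuming flat lists of integer class labels (one per pixel) and skipping classes absent from both maps; the function name and toy labels are illustrative and not taken from the OVFoodSeg code.

```python
def mean_iou(pred, gt, num_classes):
    """Average per-class IoU over classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1]  # predicted label per pixel
gt = [0, 1, 1, 1]    # ground-truth label per pixel
score = mean_iou(pred, gt, num_classes=2)
print(round(score, 4))
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is about 0.5833.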
-
We introduce a lightweight and accurate architecture for resource-efficient visual correspondence. Our method dubbed XFeat (Accelerated Features) revisits fundamental design choices in convolutional neural networks for detecting extracting and matching local features. Our new model satisfies a critical need for fast and robust algorithms suitable to resource-limited devices. In particular accurate image matching requires sufficiently large image resolutions -- for this reason we keep the resolution as large as possible while limiting the number of channels in the network. Besides our model is designed to offer the choice of matching at the sparse or semi-dense levels each of which may be more suitable for different downstream applications such as visual navigation and augmented reality. Our model is the first to offer semi-dense matching efficiently leveraging a novel match refinement module that relies on coarse local descriptors. XFeat is versatile and hardware-independent surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy proven in pose estimation and visual localization. We showcase it running in real-time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24.
-
The emergence of attention-based transformer models has led to their extensive use in various tasks due to their superior generalization and transfer properties. Recent research has demonstrated that such models when prompted appropriately are excellent for few-shot inference. However such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally we introduce a unidirectional causal attention mechanism between the novel prompts learned with limited examples and the base prompts learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-20^i and Pascal-5^i without the need for test-time optimization (or transduction). Furthermore test-time optimization leveraging unlabelled test data can be used to improve the prompts which we refer to as transductive prompt tuning.
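One way to picture the unidirectional attention described above is as a boolean mask over prompts: novel prompts may read from base prompts, while base prompts never read from novel ones. The sketch below is an assumption-laden illustration, not the paper's implementation: the prompt ordering (base first, then novel), the choice to let base prompts attend to each other, and all function names are hypothetical.

```python
import math

def prompt_attention_mask(num_base, num_novel):
    """mask[i][j] is True when prompt i (query) may attend to prompt j (key).
    Prompts are ordered base first, then novel."""
    total = num_base + num_novel
    mask = []
    for i in range(total):
        query_is_novel = i >= num_base
        # Novel queries see everything; base queries see only base keys.
        mask.append([query_is_novel or j < num_base for j in range(total)])
    return mask

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = prompt_attention_mask(num_base=2, num_novel=1)
base_weights = masked_softmax([0.1, 0.2, 5.0], mask[0])   # base query
novel_weights = masked_softmax([0.1, 0.2, 5.0], mask[2])  # novel query
print(base_weights[2], novel_weights[2] > 0.0)
```

Because the base rows place zero weight on novel keys, adding novel prompts leaves the base prompts' attention outputs unchanged in this toy setup.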
-
We present ARTrackV2, which integrates two pivotal aspects of tracking: determining where to look (localization) and how to describe (appearance analysis) the target object across video frames. Building on the foundation of its predecessor, ARTrackV2 extends the concept by introducing a unified generative framework to "read out" the object's trajectory and "retell" its appearance in an autoregressive manner. This approach fosters a time-continuous methodology that models the joint evolution of motion and visual features, guided by previous estimates. Furthermore, ARTrackV2 stands out for its efficiency and simplicity, obviating the less efficient intra-frame autoregression and hand-tuned parameters for appearance updates. Despite its simplicity, ARTrackV2 achieves state-of-the-art performance on prevailing benchmark datasets while demonstrating a remarkable efficiency improvement. In particular, ARTrackV2 achieves an AO score of 79.5% on GOT-10k and an AUC of 86.1% on TrackingNet while being 3.6x faster than ARTrack.
-
What does learning to model relationships between strings teach Large Language Models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels we use code to represent images in our study. Although LLM-generated images do not look like natural images results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore experiments on self-supervised visual representation learning utilizing images generated with text models highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
-
In this paper, we propose a new framework for online 3D scene perception. Conventional 3D scene perception methods are offline, i.e., they take an already reconstructed 3D scene geometry as input, which is not applicable in robotic applications where the input data is streaming RGB-D videos rather than a complete 3D scene reconstructed from pre-collected RGB-D videos. To deal with online 3D scene perception tasks, where data collection and perception should be performed simultaneously, the model should be able to process 3D scenes frame by frame and make use of the temporal information. To this end, we propose an adapter-based plug-and-play module for the backbone of a 3D scene perception model, which constructs memory to cache and aggregate the extracted RGB-D features to empower offline models with temporal learning ability. Specifically, we propose a queued memory mechanism to cache the supporting point cloud and image features. Then we devise aggregation modules which operate directly on the memory and pass temporal information to the current frame. We further propose a 3D-to-2D adapter to enhance image features with strong global context. Our adapters can be easily inserted into mainstream offline architectures of different tasks and significantly boost their performance on online tasks. Extensive experiments on the ScanNet and SceneNN datasets demonstrate that our approach achieves leading performance on three 3D scene perception tasks compared with state-of-the-art online methods by simply finetuning existing offline models, without any model-specific or task-specific designs.
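The queued memory idea above can be illustrated with a fixed-length queue that drops the oldest frame once it is full and exposes an aggregate of what remains. This is only a sketch under stated assumptions: the class name, the per-frame feature format (a flat list of floats), and mean pooling as the aggregation are all hypothetical stand-ins for the paper's learned modules.

```python
from collections import deque

class QueuedFeatureMemory:
    """Caches features from the most recent frames of a stream."""

    def __init__(self, max_frames):
        # deque with maxlen evicts the oldest entry automatically.
        self.frames = deque(maxlen=max_frames)

    def push(self, feature):
        self.frames.append(list(feature))

    def aggregate(self):
        """Mean-pool cached features (a stand-in for a learned aggregator)."""
        n = len(self.frames)
        dim = len(self.frames[0])
        return [sum(f[d] for f in self.frames) / n for d in range(dim)]

memory = QueuedFeatureMemory(max_frames=3)
for t in range(1, 5):  # four incoming frames; the first gets evicted
    memory.push([float(t), 10.0 * t])
print(memory.aggregate())  # prints [3.0, 30.0]
```

The aggregate is computed over frames 2 to 4 only, since the queue holds at most three frames.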
-
Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in the fashion domain, datasets often exhibit a disparity between the information conveyed in image and text. This issue stems from datasets containing multiple images of a single fashion item all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. Addressing this problem, we propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false negative issues in Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, outperforming existing methods in three downstream tasks.
-
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames performing even worse than single-modality models. While applying the common dropout techniques to the video modality enhances robustness to missing frames it simultaneously results in a performance loss when dealing with complete data input. In this study we delve into this contrasting phenomenon through the lens of modality bias and uncover that an excessive modality bias towards the audio modality induced by dropout constitutes the fundamental cause. Next we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between the modality bias and the robustness against missing modality in multimodal systems. Building on these findings we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality maintaining performance and robustness simultaneously. Finally to address an entirely missing modality we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated through comprehensive experiments on the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAVSR.
-
Point cloud upsampling (PCU) enriches the representation of raw point clouds, significantly improving the performance in downstream tasks such as classification and reconstruction. Most of the existing point cloud upsampling methods focus on sparse point cloud feature extraction and upsampling module design. In contrast, we dive deeper into directly modelling the gradient of data distribution from dense point clouds. In this paper, we propose a conditional denoising diffusion probabilistic model (DDPM) for point cloud upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a condition and iteratively learns the transformation relationship between the dense point cloud and the noise. Simultaneously, PUDM aligns with a dual mapping paradigm to further improve the discernment of point features. In this context, PUDM enables learning complex geometry details in the ground truth through the dominant features while avoiding an additional upsampling module design. Furthermore, to generate high-quality arbitrary-scale point clouds during inference, PUDM exploits the prior knowledge of the scale between sparse point clouds and dense point clouds during training by parameterizing a rate factor. Moreover, PUDM exhibits strong noise robustness in experimental results. In the quantitative and qualitative evaluations on PU1K and PUGAN, PUDM significantly outperforms existing methods in terms of Chamfer Distance (CD) and Hausdorff Distance (HD), achieving state-of-the-art (SOTA) performance.
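For readers unfamiliar with the two metrics named above, the sketch below computes a Chamfer Distance and a Hausdorff Distance between two small point sets by brute force. Conventions differ between papers (squared versus unsquared distances, sum versus mean), so this assumes one common variant, unsquared Euclidean distance with per-set means for CD, and is not necessarily the exact form used in the PUDM evaluation.

```python
import math

def nearest_dist(p, cloud):
    """Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean nearest-neighbour distance both ways."""
    a_to_b = sum(nearest_dist(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a

def hausdorff_distance(a, b):
    """Symmetric Hausdorff Distance: worst-case nearest-neighbour distance."""
    a_to_b = max(nearest_dist(p, b) for p in a)
    b_to_a = max(nearest_dist(q, a) for q in b)
    return max(a_to_b, b_to_a)

sparse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
dense = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(sparse, dense)
hd = hausdorff_distance(sparse, dense)
print(round(cd, 4), hd)
```

CD averages the mismatch over all points, whereas HD reports only the single worst outlier, which is why the two are usually quoted together.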
-
Neural Radiance Fields (NeRFs) excel in photorealistically rendering static scenes. However rendering dynamic long-duration radiance fields on ubiquitous devices remains challenging due to data storage and computational constraints. In this paper we introduce VideoRF the first approach to enable real-time streaming and rendering of dynamic human-centric radiance fields on mobile platforms. At the core is a serialized 2D feature image stream representing the 4D radiance field all in one. We introduce a tailored training scheme directly applied to this 2D domain to impose the temporal and spatial redundancy of the feature image stream. By leveraging the redundancy we show that the feature image stream can be efficiently compressed by 2D video codecs which allows us to exploit video hardware accelerators to achieve real-time decoding. On the other hand based on the feature image stream we propose a novel rendering pipeline for VideoRF which has specialized space mappings to query radiance properties efficiently. Paired with a deferred shading model VideoRF has the capability of real-time rendering on mobile devices thanks to its efficiency. We have developed a real-time interactive player that enables online streaming and rendering of dynamic scenes offering a seamless and immersive free-viewpoint experience across a range of devices from desktops to mobile phones. Our project page is available at https://aoliao12138.github.io/VideoRF/.
-
We introduce Diffusion Parametric Head Models (DPHMs) a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models such as NPHMs can now excel in representing high-fidelity head geometries tracking and reconstructing heads from real-world single-view depth sequences remains very challenging as the fitting to partial and noisy observations is underconstrained. To tackle these challenges we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold which represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods and demonstrate improved head identity reconstruction as well as robust expression tracking.
-
Current perceptive models heavily depend on resource-intensive datasets prompting the need for innovative solutions. Leveraging recent advances in diffusion models synthetic data by constructing image inputs from various annotations proves beneficial for downstream tasks. While prior methods have separately addressed generative and perceptive models DetDiffusion for the first time harmonizes both tackling the challenges in generating effective data for perceptive models. To enhance image generation with perceptive models we introduce perception-aware loss (P.A. loss) through segmentation improving both quality and controllability. To boost the performance of specific perceptive models our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation. Experimental results from the object detection task highlight DetDiffusion's superior performance establishing a new state-of-the-art in layout-guided generation. Furthermore image syntheses from DetDiffusion can effectively augment training data significantly enhancing downstream detection performance.
-
Recent years have witnessed the remarkable progress of 3D multi-modality object detection methods based on the Bird's-Eye-View (BEV) perspective. However, most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work, we propose a novel multi-modality 3D object detection method, named GAFusion, with LiDAR-guided global interaction and adaptive fusion. Specifically, we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth information. Then, a LiDAR-guided adaptive fusion transformer (LGAFT) is developed to adaptively enhance the interaction of different modal BEV features from a global perspective. Meanwhile, additional downsampling with sparse height compression and a multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of different modal features. Finally, a temporal fusion module is introduced to aggregate features from previous frames. GAFusion achieves state-of-the-art 3D object detection results with 73.6% mAP and 74.9% NDS on the nuScenes test set.
-
Previous methods for Video Frame Interpolation (VFI) have encountered challenges notably the manifestation of blur and ghosting effects. These issues can be traced back to two pivotal factors: unavoidable motion errors and misalignment in supervision. In practice motion estimates often prove to be error-prone resulting in misaligned features. Furthermore the reconstruction loss tends to bring blurry results particularly in misaligned regions. To mitigate these challenges we propose a new paradigm called PerVFI (Perception-oriented Video Frame Interpolation). Our approach incorporates an Asymmetric Synergistic Blending module (ASB) that utilizes features from both sides to synergistically blend intermediate features. One reference frame emphasizes primary content while the other contributes complementary information. To impose a stringent constraint on the blending process we introduce a self-learned sparse quasi-binary mask which effectively mitigates ghosting and blur artifacts in the output. Additionally we employ a normalizing flow-based generator and utilize the negative log-likelihood loss to learn the conditional distribution of the output which further facilitates the generation of clear and fine details. Experimental results validate the superiority of PerVFI demonstrating significant improvements in perceptual quality compared to existing methods. Codes are available at https://github.com/mulns/PerVFI
-
State-of-the-art personalized text-to-image generation systems are usually trained on a few reference images to learn novel visual representations. However this is likely to incur infringement of copyright for the reference image owners when these images are personal and publicly available. Recent progress has been made in protecting these images from unauthorized use by adding protective noises. Yet current protection methods work under the assumption that these protected images are not changed which is in contradiction to the fact that most public platforms intend to modify user-uploaded content e.g. image compression. This paper introduces a robust watermarking method namely InMark to protect images from unauthorized learning. Inspired by influence functions the proposed method forges protective watermarks on more important pixels for these reference images from both heuristic and statistical perspectives. In this way the personal semantics of these images are under protection even if these images are modified to some extent. Extensive experiments demonstrate that the proposed InMark outperforms previous state-of-the-art methods in both protective performance and robustness.
-
In recent years there has been a growing interest in training Neural Networks to approximate Unsigned Distance Fields (UDFs) for representing open surfaces in the context of 3D reconstruction. However UDFs are non-differentiable at the zero level set which leads to significant errors in distances and gradients generally resulting in fragmented and discontinuous surfaces. In this paper we propose to learn a hyperbolic scaling of the unsigned distance field which defines a new Eikonal problem with distinct boundary conditions. This allows our formulation to integrate seamlessly with state-of-the-art continuously differentiable implicit neural representation networks largely applied in the literature to represent signed distance fields. Our approach not only addresses the challenge of open surface representation but also demonstrates significant improvement in reconstruction quality and training performance. Moreover the unlocked field's differentiability allows the accurate computation of essential topological properties such as normal directions and curvatures pervasive in downstream tasks such as rendering. Through extensive experiments we validate our approach across various data sets and against competitive baselines. The results demonstrate enhanced accuracy and up to an order of magnitude increase in speed compared to previous methods.
-
The vision-language model has brought great improvement to few-shot industrial anomaly detection, which usually requires designing hundreds of prompts through prompt engineering. For automated scenarios, we first use conventional prompt learning with the many-class paradigm as the baseline to automatically learn prompts but find that it cannot work well in one-class anomaly detection. To address the above problem, this paper proposes a one-class prompt learning method for few-shot anomaly detection, termed PromptAD. First, we propose semantic concatenation, which can transpose normal prompts into anomaly prompts by concatenating normal prompts with anomaly suffixes, thus constructing a large number of negative samples used to guide prompt learning in the one-class setting. Furthermore, to mitigate the training challenge caused by the absence of anomaly images, we introduce the concept of explicit anomaly margin, which is used to explicitly control the margin between normal prompt features and anomaly prompt features through a hyper-parameter. For image-level/pixel-level anomaly detection, PromptAD achieves first place in 11/12 few-shot settings on MVTec and VisA.
-
Graph Contrastive Learning (GCL) a Self-Supervised Learning (SSL) architecture tailored for graphs has shown notable potential for mitigating label scarcity. Its core idea is to amplify feature similarities between the positive sample pairs and reduce them between the negative sample pairs. Unfortunately most existing GCLs consistently present suboptimal performances on both homophilic and heterophilic graphs. This is primarily attributed to two limitations of positive sampling that is incomplete local sampling and blind sampling. To address these limitations this paper introduces a novel GCL framework with an adaptive positive sampling module named grapH contrastivE Adaptive posiTive Samples (HEATS). Motivated by the observation that the affinity matrix corresponding to optimal positive sample sets has a block-diagonal structure with equal weights within each block a self-expressive learning objective incorporating the block and idempotent constraint is presented. This learning objective and the contrastive learning objective are iteratively optimized to improve the adaptability and robustness of HEATS. Extensive experiments on graphs and images validate the effectiveness and generality of HEATS.
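The link between the block-diagonal structure and the idempotent constraint mentioned above can be checked numerically. The sketch below assumes the equal weight inside each block is 1/n_k for a block of n_k samples (the abstract does not state the normalization); under that assumption the affinity matrix A satisfies A·A = A. Function names and the toy labels are illustrative.

```python
def block_affinity(labels):
    """Affinity with weight 1/n_k between samples sharing block k, else 0."""
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return [[1.0 / counts[a] if a == b else 0.0 for b in labels] for a in labels]

def matmul(x, y):
    """Plain square matrix product for small lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

labels = [0, 0, 1, 1, 1]  # two blocks of sizes 2 and 3
A = block_affinity(labels)
AA = matmul(A, A)
n = len(labels)
max_gap = max(abs(AA[i][j] - A[i][j]) for i in range(n) for j in range(n))
print(max_gap < 1e-12)
```

Within a block of size n_k, each entry of A·A sums n_k products of (1/n_k)·(1/n_k), which gives back 1/n_k, so an idempotency penalty pushes a learned affinity toward this block form.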
-
Deep unfolding networks (DUNs) renowned for their interpretability and superior performance have invigorated the realm of compressive sensing (CS). Nonetheless existing DUNs frequently suffer from issues related to insufficient feature extraction and feature attrition during the iterative steps. In this paper we propose Unrolling Fixed-point Continuous Network (UFC-Net) a novel deep CS framework motivated by the traditional fixed-point continuous optimization algorithm. Specifically we introduce Convolution-guided Attention Module (CAM) to serve as a critical constituent within the reconstruction phase encompassing tailored components such as Multi-head Attention Residual Block (MARB) Auxiliary Iterative Reconstruction Block (AIRB) etc. MARB effectively integrates multi-head attention mechanisms with convolution to reinforce feature extraction transcending the confinement of localized attributes and facilitating the apprehension of long-range correlations. Meanwhile AIRB introduces auxiliary variables significantly bolstering the preservation of features within each iterative stage. Extensive experiments demonstrate that our proposed UFC-Net achieves remarkable performance both on image CS and CS-magnetic resonance imaging (CS-MRI) in contrast to state-of-the-art methods.
-
In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero-shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model pre-trained on a large dataset captures greater relevant information for SIDE than the usual route of generating pseudo image captions followed by CLIP-based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on the NYUv2 dataset, achieving an Abs Rel error of 0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD), and on the KITTI dataset, achieving a Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) over NeWCRFs on the (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The project page is available at https://ecodepth-iitd.github.io
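The error figures above use two standard depth metrics, and the quoted improvements are relative reductions in those errors. The sketch below shows the commonly used definitions of Abs Rel and Sq Rel and reproduces the percentage arithmetic from the reported numbers; the function names and the toy depth values are illustrative.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|d - d*| / d*)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def sq_rel(pred, gt):
    """Mean squared relative error: mean((d - d*)^2 / d*)."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)

def relative_improvement(baseline, ours):
    """Fractional error reduction with respect to the baseline."""
    return (baseline - ours) / baseline

# Toy depths in metres (illustrative only).
toy_abs = abs_rel([2.0, 4.0], [2.0, 5.0])
toy_sq = sq_rel([2.0, 4.0], [2.0, 5.0])

# Numbers reported in the abstract.
nyu_gain = relative_improvement(0.069, 0.059)    # Abs Rel on NYUv2
kitti_gain = relative_improvement(0.142, 0.139)  # Sq Rel on KITTI
print(round(100 * nyu_gain), round(100 * kitti_gain))  # prints 14 2
```

The computed reductions, about 14.5% and 2.1%, are consistent with the 14% and 2% figures stated in the abstract.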
-
We have witnessed significant progress in deep learning-based 3D vision ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However existing scene-level datasets for deep learning-based 3D vision limited to either synthetic environments or a narrow selection of real-world scenes are quite insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what could be explored in deep learning-based 3D analysis. To address this critical gap we present DL3DV-10K a large-scale scene dataset featuring 51.2 million frames from 10510 videos captured from 65 types of point-of-interest (POI) locations covering both bounded and unbounded scenes with different levels of reflection transparency and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K which revealed valuable insights for future research in NVS. In addition we have obtained encouraging results in a pilot study to learn generalizable NeRF from DL3DV-10K which manifests the necessity of a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representation. Our DL3DV-10K dataset benchmark results and models will be publicly accessible.
-
Recently building on the foundation of neural radiance field various techniques have emerged to learn unsigned distance fields (UDF) to reconstruct 3D non-watertight models from multi-view images. Yet a central challenge in UDF-based volume rendering is formulating a proper way to convert unsigned distance values into volume density ensuring that the resulting weight function remains unbiased and sensitive to occlusions. Falling short on these requirements often results in incorrect topology or large reconstruction errors in resulting models. This paper addresses this challenge by presenting a novel two-stage algorithm 2S-UDF for learning a high-quality UDF from multi-view images. Initially the method applies an easily trainable density function that while slightly biased and transparent aids in coarse reconstruction. The subsequent stage then refines the geometry and appearance of the object to achieve a high-quality reconstruction by directly adjusting the weight function used in volume rendering to ensure that it is unbiased and occlusion-aware. Decoupling density and weight in two stages makes our training stable and robust distinguishing our technique from existing UDF learning approaches. Evaluations on the DeepFashion3D DTU and BlendedMVS datasets validate the robustness and effectiveness of our proposed approach. In both quantitative metrics and visual quality the results indicate our superior performance over other UDF learning techniques in reconstructing 3D non-watertight models from multi-view images. Our code is available at https://bitbucket.org/jkdeng/2sudf/.
-
The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper we propose the Real-Time DEtection TRansformer (RT-DETR) the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed followed by maintaining speed while improving accuracy. Specifically we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder thereby improving accuracy. In addition RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365 RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR.
-
Despite the recent advances in unified image segmentation (IS) developing a unified video segmentation (VS) model remains a challenge. This is mainly because generic category-specified VS tasks need to detect all objects and track them across consecutive frames while prompt-guided VS tasks require re-identifying the target with visual/text prompts throughout the entire video making it hard to handle the different tasks with the same architecture. We make an attempt to address these issues and present a novel unified VS architecture namely UniVS by using prompts as queries. UniVS averages the prompt features of the target from previous frames as its initial query to explicitly decode masks and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate prompt features in the memory pool. By taking the predicted masks of entities from previous frames as their visual prompts UniVS converts different VS tasks into prompt-guided target segmentation eliminating the heuristic inter-frame matching process. Our framework not only unifies the different VS tasks but also naturally achieves universal training and testing ensuring robust performance across different scenarios. UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks covering video instance semantic panoptic object and referring segmentation tasks. Code can be found at https://github.com/MinghanLi/UniVS.
-
Human-Object Interaction (HOI) Detection constitutes an important aspect of human-centric scene understanding which requires precise object detection and interaction recognition. Despite increasing advancement in detection recognizing subtle and intricate interactions remains challenging. Recent methods have endeavored to leverage the rich semantic representation from pre-trained CLIP yet fail to efficiently capture finer-grained spatial features that are highly informative for interaction discrimination. In this work instead of solely using representations from CLIP we fill the gap by proposing a spatial adapter that efficiently utilizes the multi-scale spatial information in the pre-trained detector. This leads to a bilateral adaptation that mutually produces complementary features. To further improve interaction recognition under occlusion which is common in crowded scenarios we propose an Occluded Part Extrapolation module that guides the model to recover the spatial details from manually occluded feature maps. Moreover we design a Conditional Contextual Mining module that further mines informative contextual clues from the spatial features via a tailored cross-attention mechanism. Extensive experiments on V-COCO and HICO-DET benchmarks demonstrate that our method significantly outperforms prior art on both standard and zero-shot settings resulting in new state-of-the-art performance. Additional ablation studies further validate the effectiveness of each component in our method.
-
Unsupervised fine-grained image hashing aims to learn compact binary hash codes in unsupervised settings addressing challenges posed by large-scale datasets and dependence on supervision. In this paper we first identify a granularity gap between generic and fine-grained datasets for unsupervised hashing methods highlighting the inadequacy of conventional self-supervised learning for fine-grained visual objects. To bridge this gap we propose the Asymmetric Augmented Self-Supervised Learning (A^2-SSL) method comprising three modules. The asymmetric augmented SSL module employs suitable augmentation strategies for positive/negative views preventing fine-grained category confusion inherent in conventional SSL. Part-oriented dense contrastive learning utilizes the Fisher Vector framework to capture and model fine-grained object parts enhancing unsupervised representations through part-level dense contrastive learning. Self-consistent hash code learning introduces a reconstruction task aligned with the self-consistency principle guiding the model to emphasize comprehensive features particularly fine-grained patterns. Experimental results on five benchmark datasets demonstrate the superiority of A^2-SSL over existing methods affirming its efficacy in unsupervised fine-grained image hashing.
-
Domain shift is a formidable issue in Machine Learning that causes a model to suffer from performance degradation when tested on unseen domains. Federated Domain Generalization (FedDG) attempts to train a global model using collaborative clients in a privacy-preserving manner that can generalize well to unseen clients possibly with domain shift. However most existing FedDG methods either cause additional privacy risks of data leakage or induce significant costs in client communication and computation which are major concerns in the Federated Learning paradigm. To circumvent these challenges here we introduce a novel architectural method for FedDG namely gPerXAN which relies on a normalization scheme working with a guiding regularizer. In particular we carefully design Personalized eXplicitly Assembled Normalization to enforce client models selectively filtering domain-specific features that are biased towards local data while retaining discrimination of those features. Then we incorporate a simple yet effective regularizer to guide these models in directly capturing domain-invariant representations that the global model's classifier can leverage. Extensive experimental results on two benchmark datasets i.e. PACS and Office-Home and a real-world medical dataset Camelyon17 indicate that our proposed method outperforms other existing methods in addressing this particular problem.
-
Human-Object Interaction (HOI) detection plays a crucial role in visual scene comprehension. In recent advancements, two-stage detectors have taken a prominent position. However, they are encumbered by two primary challenges. First, the misalignment between feature representation and relation reasoning gives rise to a deficiency in discriminative features crucial for interaction detection. Second, due to sparse annotation, the second-stage interaction head generates numerous candidate pairs, with only a small fraction receiving supervision. Towards these issues, we propose a hybrid learning method based on pose-aware HOI feature refinement. Specifically, we devise pose-aware feature refinement that encodes spatial features by considering human body pose characteristics. It can direct attention towards key regions, ultimately offering a wealth of fine-grained features imperative for HOI detection. Further, we introduce a hybrid learning method that combines HOI triplets with probabilistic soft labels supervision, which is regenerated from decoupled verb-object pairs. This method explores the implicit connections between the interactions, enhancing model generalization without requiring additional data. Our method establishes state-of-the-art performance on HICO-DET benchmark and excels notably in detecting rare HOIs.
-
Recovering a clear image from a single hazy image is an open inverse problem. Although significant research progress has been made most existing methods ignore the effect that downstream tasks play in promoting upstream dehazing. From the perspective of the haze generation mechanism there is a potential relationship between the depth information of the scene and the hazy image. Based on this we propose a dual-task collaborative mutual promotion framework to achieve the dehazing of a single image. This framework integrates depth estimation and dehazing by a dual-task interaction mechanism and achieves mutual enhancement of their performance. To realize the joint optimization of the two tasks an alternative implementation mechanism with the difference perception is developed. On the one hand the difference perception between the depth maps of the dehazing result and the ideal image is proposed to promote the dehazing network to pay attention to the non-ideal areas of the dehazing. On the other hand by improving the depth estimation performance in the difficult-to-recover areas of the hazy image the dehazing network can explicitly use the depth information of the hazy image to assist the clear image recovery. To promote the depth estimation we propose to use the difference between the dehazed image and the ground truth to guide the depth estimation network to focus on the dehazed unideal areas. It allows dehazing and depth estimation to leverage their strengths in a mutually reinforcing manner. Experimental results show that the proposed method can achieve better performance than that of the state-of-the-art approaches. The source code is released at https://github.com/zhoushen1/DCMPNet.
-
Multi-agent trajectory prediction is essential in autonomous driving, risk avoidance, and traffic flow control. However, the heterogeneous traffic density on interactions, which is caused by physical laws, social norms, and so on, is often overlooked in existing methods. When the density varies, the number of agents involved in interactions and the corresponding interaction probability change dynamically. To tackle this issue, we propose a new method called Density-Adaptive Model based on Motif Matrix for Multi-Agent Trajectory Prediction (DAMM) to gain insights into multi-agent systems. Here we leverage the motif matrix to represent dynamic connectivity in a higher-order pattern and distill the interaction information from the perspectives of the spatial and the temporal dimensions. Specifically, in the spatial dimension, we utilize multi-scale feature fusion to adaptively select the optimal range of neighbors participating in interactions for each time slot. In the temporal dimension, we extract the temporal interaction features and adapt a pyramidal pooling layer to generate the interaction probability for each agent. Experimental results demonstrate that our approach surpasses state-of-the-art methods on autonomous driving dataset.
-
We propose a unified approach to simultaneously addressing the conventional setting of binary deepfake classification and a more challenging scenario of uncovering what facial components have been forged as well as the exact order of the manipulations. To solve the former task we consider multiple instance learning (MIL) that takes each image as a bag and its patches as instances. A positive bag corresponds to a forged image that includes at least one manipulated patch (i.e. a pixel in the feature map). The formulation allows us to estimate the probability of an input image being a fake one and establish the corresponding contrastive MIL loss. On the other hand tackling the component-wise deepfake problem can be reduced to solving multi-label prediction but the requirement to recover the manipulation order further complicates the learning task into a multi-label ranking problem. We resolve this difficulty by designing a tailor-made loss term to enforce that the rank order of the predicted multi-label probabilities respects the ground-truth order of the sequential modifications of a deepfake image. Through extensive experiments and comparisons with other relevant techniques we provide extensive results and ablation studies to demonstrate that the proposed method is an overall more comprehensive solution to deepfake detection.
-
The recent advent of pre-trained vision transformers has unveiled a promising property: their inherent capability to group semantically related visual concepts. In this paper, we explore harnessing this emergent feature to tackle few-shot semantic segmentation, a task focused on classifying pixels in a test image with a few example data. A critical hurdle in this endeavor is preventing overfitting to the limited classes seen during training of the few-shot segmentation model. As our main discovery, we find that the concept of "relationship descriptors", initially conceived for enhancing the CLIP model for zero-shot semantic segmentation, offers a potential solution. We adapt and refine this concept to craft a relationship descriptor construction tailored for few-shot semantic segmentation, extending its application across multiple layers to enhance performance. Building upon this adaptation, we propose a few-shot semantic segmentation framework that is not only easy to implement and train but also effectively scales with the number of support examples and categories. Through rigorous experimentation across various datasets, including PASCAL-5^i and COCO-20^i, we demonstrate a clear advantage of our method in diverse few-shot semantic segmentation scenarios and a range of pre-trained vision transformer models. The findings clearly show that our method significantly outperforms current state-of-the-art techniques, highlighting the effectiveness of harnessing the emerging capabilities of vision transformers for few-shot semantic segmentation. We release the code at https://github.com/ZiqinZhou66/FewSegwithRD.git.
-
Listening head generation aims to synthesize a non-verbal responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversation. The applications of listener agent generation in virtual interaction have promoted many works achieving diverse and fine-grained motion generation. However, they can only manipulate motions through simple emotional labels and cannot freely control the listener's motions. Since listener agents should have human-like attributes (e.g., identity, personality) which can be freely customized by users, this limits their realism. In this paper, we propose a user-friendly framework called CustomListener to realize the free-form text prior guided listener generation. To achieve speaker-listener coordination, we design a Static to Dynamic Portrait module (SDP), which interacts with speaker information to transform static text into dynamic portrait token with completion rhythm and amplitude information. To achieve coherence between segments, we design a Past Guided Generation module (PGG) to maintain the consistency of customized listener attributes through the motion prior, and utilize a diffusion-based structure conditioned on the portrait token and the motion prior to realize the controllable generation. To train and evaluate our model, we have constructed two text-annotated listening head datasets based on ViCo and RealTalk, which provide text-video paired labels. Extensive experiments have verified the effectiveness of our model.
-
Adding artificial patterns to objects like QR codes can ease tasks such as object tracking robot navigation and conveying information (e.g. a label or a website link). However these patterns require a physical application and they alter the object's appearance. Conversely projected patterns can temporarily change the object's appearance aiding tasks like 3D scanning and retrieving object textures and shading. However projected patterns impede dynamic tasks like object tracking because they do not `stick' to the object's surface. Or do they? This paper introduces a novel approach combining the advantages of projected and persistent physical patterns. Our system projects heat patterns using a laser beam (similar in spirit to a LIDAR) which a thermal camera observes and tracks. Such thermal patterns enable tracking poorly-textured objects whose tracking is highly challenging with standard cameras while not affecting the object's appearance or physical properties. To avail these thermal patterns in existing vision frameworks we train a network to reverse heat diffusion's effects and remove inconsistent pattern points between different thermal frames. We prototyped and tested this approach on dynamic vision tasks like structure from motion optical flow and object tracking of everyday textureless objects.
-
Scene graphs have been recently introduced into 3D spatial understanding as a comprehensive representation of the scene. The alignment between 3D scene graphs is the first step of many downstream tasks such as scene graph aided point cloud registration, mosaicking, overlap checking, and robot navigation. In this work, we treat 3D scene graph alignment as a partial graph-matching problem and propose to solve it with a graph neural network. We reuse the geometric features learned by a point cloud registration method and associate the clustered point-level geometric features with the node-level semantic feature via our designed feature fusion module. Partial matching is enabled by using a learnable method to select the top-k similar node pairs. Subsequent downstream tasks such as point cloud registration are achieved by running a pre-trained registration network within the matched regions. We further propose a point-matching rescoring method that uses the node-wise alignment of the 3D scene graph to reweight the matching candidates from a pre-trained point cloud registration method. It reduces the false point correspondences estimated, especially in low-overlapping cases. Experiments show that our method improves the alignment accuracy by 10-20% in low-overlap and random transformation scenarios and outperforms the existing work in multiple downstream tasks. Our code and models are available here (https://github.com/dfki-av/sg-pgm.git).
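The top-k selection of similar node pairs mentioned above can be illustrated with a small sketch. This is not the paper's learnable selector: it assumes a precomputed node-to-node similarity matrix and simply keeps the k highest-scoring pairs. The function name `top_k_pairs` and the example scores are hypothetical.

```python
def top_k_pairs(similarity, k):
    """Return the k (i, j) node pairs with the highest similarity score.

    `similarity[i][j]` is an assumed, precomputed score between node i of
    one scene graph and node j of the other. Nodes that never appear in
    the result stay unmatched, which is what makes the matching partial.
    """
    scored = [
        (score, i, j)
        for i, row in enumerate(similarity)
        for j, score in enumerate(row)
    ]
    # Sort by descending score; ties keep their row-major order.
    scored.sort(key=lambda item: -item[0])
    return [(i, j) for _, i, j in scored[:k]]


# Hypothetical 3x3 similarity matrix between two small scene graphs.
similarity = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.3, 0.2, 0.4],
]
matches = top_k_pairs(similarity, k=2)
```

With k smaller than the number of nodes, the third node of each graph is left unmatched, mirroring the case where the two scans only partly overlap.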
-
Principal component analysis (PCA) along with its extensions to manifolds and outlier contaminated data have been indispensable in computer vision and machine learning. In this work we present a unifying formalism for PCA and its variants and introduce a framework based on the flags of linear subspaces i.e. a hierarchy of nested linear subspaces of increasing dimension which not only allows for a common implementation but also yields novel variants not explored previously. We begin by generalizing traditional PCA methods that either maximize variance or minimize reconstruction error. We expand these interpretations to develop a wide array of new dimensionality reduction algorithms by accounting for outliers and the data manifold. To devise a common computational approach we recast robust and dual forms of PCA as optimization problems on flag manifolds. We then integrate tangent space approximations of principal geodesic analysis (tangent-PCA) into this flag-based framework creating novel robust and dual geodesic PCA variations. The remarkable flexibility offered by the `flagification' introduced here enables even more algorithmic variants identified by specific flag types. Last but not least we propose an effective convergent solver for these flag-formulations employing the Stiefel manifold. Our empirical results on both real-world and synthetic scenarios demonstrate the superiority of our novel algorithms especially in terms of robustness to outliers on manifolds.
-
This paper addresses the challenge of example-based non-stationary texture synthesis. We introduce a novel two-step approach wherein users first modify a reference texture using standard image editing tools yielding an initial rough target for the synthesis. Subsequently our proposed method termed "self-rectification" automatically refines this target into a coherent seamless texture while faithfully preserving the distinct visual characteristics of the reference exemplar. Our method leverages a pre-trained diffusion network and uses self-attention mechanisms to gradually align the synthesized texture with the reference ensuring the retention of the structures in the provided target. Through experimental validation our approach exhibits exceptional proficiency in handling non-stationary textures demonstrating significant advancements in texture synthesis when compared to existing state-of-the-art techniques. Code is available at https://github.com/xiaorongjun000/Self-Rectification
-
Despite the success of recent upsampling approaches generating high-resolution point sets with uniform distribution and meticulous structures is still challenging. Unlike existing methods that only take spatial information of the raw data into account we regard point cloud upsampling as generating dense point clouds from deformable topology. Motivated by this we present SPU-PMD a self-supervised topological mesh deformation network for 3D densification. As a cascaded framework our architecture is formulated by a series of coarse mesh interpolator and mesh deformers. At each stage the mesh interpolator first produces the initial dense point clouds via mesh interpolation which allows the model to perceive the primitive topology better. Meanwhile the deformer infers the morphing by estimating the movements of mesh nodes and reconstructs the descriptive topology structure. By associating mesh deformation with feature expansion this module progressively refines point clouds' surface uniformity and structural details. To demonstrate the effectiveness of the proposed method extensive quantitative and qualitative experiments are conducted on synthetic and real-scanned 3D data. Also we compare it with state-of-the-art techniques to further illustrate the superiority of our network. The project page is: https://github.com/lyz21/SPU-PMD
-
Saliency ranking detection (SRD) has emerged as a challenging task in computer vision aiming not only to identify salient objects within images but also to rank them based on their degree of saliency. Existing SRD datasets have been created primarily using mouse-trajectory data which inadequately captures the intricacies of human visual perception. Addressing this gap this paper introduces the first large-scale SRD dataset SIFR constructed using genuine human fixation data thereby aligning more closely with real visual perceptual processes. To establish a baseline for this dataset we propose QAGNet a novel model that leverages salient instance query features from a transformer detector within a tri-tiered nested graph. Through extensive experiments we demonstrate that our approach outperforms existing state-of-the-art methods across two widely used SRD datasets and our newly proposed dataset. Code and dataset are available at https://github.com/EricDengbowen/QAGNet.
-
Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is 4.5 times faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods.
-
Phase unwrapping (PU) is a technique to reconstruct original phase images from their noisy wrapped counterparts finding many applications in scientific imaging. Although supervised learning has shown promise in PU its utility is limited in ground-truth (GT) scarce scenarios. This paper presents an unsupervised learning approach that eliminates the need for GTs during end-to-end training. Our approach leverages the insight that both the gradients and wrapped gradients of wrapped phases serve as noisy labels for GT phase gradients along with sparse outliers induced by the wrapping operation. A recorruption-based self-reconstruction loss in the gradient domain is proposed to mitigate the adverse effects of label noise complemented with a self-distillation loss for improved generalization. Additionally by unfolding a variational model of PU that utilizes wrapped gradients of wrapped phases for its data-fitting term we develop a deep unrolling network that encodes physics of phase wrapping and incorporates special treatments on outliers. In the experiments on three types of phase data our approach outperforms existing GT-free methods and competes well against the supervised ones.
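The role of wrapped gradients can be seen in a noise-free 1D toy case. The sketch below is only an illustration of the wrapping operation, not the paper's network or loss: the signal, step size, and function names are assumptions. When neighbouring true phase values differ by less than pi, the wrapped difference of the wrapped phase recovers the true gradient, while the plain difference is off by 2*pi at every wrap.

```python
import math


def wrap(phi):
    """Wrap a phase value into the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi


def wrapped_gradients(wrapped_phase):
    """Wrapped finite differences of an already-wrapped 1D phase signal."""
    return [wrap(b - a) for a, b in zip(wrapped_phase, wrapped_phase[1:])]


# Hypothetical smooth phase ramp; each step of 0.5 rad stays below pi.
true_phase = [0.5 * k for k in range(20)]
observed = [wrap(p) for p in true_phase]

true_grad = [b - a for a, b in zip(true_phase, true_phase[1:])]
plain_grad = [b - a for a, b in zip(observed, observed[1:])]
wrapped_grad = wrapped_gradients(observed)
```

With noise or steep phase changes this equality no longer holds everywhere, which is consistent with the abstract treating both kinds of gradients as noisy labels with sparse outliers.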
-
Generalized category discovery (GCD) aims at grouping unlabeled samples from known and unknown classes given labeled data of known classes. To meet the recent decentralization trend in the community we introduce a practical yet challenging task Federated GCD (Fed-GCD) where the training data are distributed in local clients and cannot be shared among clients. Fed-GCD aims to train a generic GCD model by client collaboration under the privacy-protected constraint. The Fed-GCD leads to two challenges: 1) representation degradation caused by training each client model with fewer data than centralized GCD learning and 2) highly heterogeneous label spaces across different clients. To this end we propose a novel Associated Gaussian Contrastive Learning (AGCL) framework based on learnable GMMs which consists of a Client Semantics Association (CSA) and a global-local GMM Contrastive Learning (GCL). On the server CSA aggregates the heterogeneous categories of local-client GMMs to generate a global GMM containing more comprehensive category knowledge. On each client GCL builds class-level contrastive learning with both local and global GMMs. The local GCL learns robust representation with limited local data. The global GCL encourages the model to produce more discriminative representation with the comprehensive category relationships that may not exist in local data. We build a benchmark based on six visual datasets to facilitate the study of Fed-GCD. Extensive experiments show that our AGCL outperforms multiple baselines on all datasets.
-
Gradient sparsification and quantization offer a promising prospect to alleviate the communication overhead problem in distributed learning. However, direct combination of the two results in suboptimal solutions, due to the fact that sparsification and quantization haven't been learned together. In this paper, we propose Joint Sparsification-Quantization (JointSQ), inspired by the discovery that sparsification can be treated as 0-bit quantization regardless of architectures. Specifically, we mathematically formulate JointSQ as a mixed-precision quantization problem, expanding the solution space. It can be solved by the designed MCKP-Greedy algorithm. Theoretical analysis demonstrates the minimal compression noise of JointSQ, and extensive experiments on various network architectures, including CNN, RNN, and Transformer, also validate this point. Under the introduction of computation overhead consistent with or even lower than previous methods, JointSQ achieves a compression ratio of 1000x on different models while maintaining near-lossless accuracy and brings 1.4x to 2.9x speedup over existing methods.
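The idea that sparsification is the 0-bit case of quantization can be made concrete with a toy uniform quantizer. This is a minimal sketch under stated assumptions, not JointSQ or its MCKP-Greedy solver: the quantizer, the gradient groups, and the hand-picked bit-widths are all hypothetical.

```python
def quantize(values, bits):
    """Uniform symmetric quantization of a list of floats to `bits` bits.

    With bits == 0 only one level (zero) remains, so the whole group is
    dropped, which is exactly what sparsification does.
    """
    if bits == 0:
        return [0.0 for _ in values]
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return list(values)
    levels = 2 ** bits - 1          # number of steps across [-scale, scale]
    step = 2 * scale / levels
    return [round((v + scale) / step) * step - scale for v in values]


# Hypothetical gradient groups with a hand-picked bit-width per group.
groups = {"large": [0.9, -0.4, 0.1], "tiny": [0.01, -0.02, 0.005]}
bits = {"large": 4, "tiny": 0}
compressed = {name: quantize(vals, bits[name]) for name, vals in groups.items()}
```

Choosing the per-group bit-widths under a communication budget is the mixed-precision problem the abstract refers to; here they are simply fixed by hand.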
-
Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, further advancing downstream human-centric tasks and applications. Previous works usually focus on tackling one specific task and rely on huge amounts of labeled data, which results in poor generalization capability. Considering that humans have specific characteristics, including the structural semantics of the human body and the dynamics of human motions, we propose a unified framework to make full use of the prior knowledge and explore the inherent features in the data itself for generalized human-centric point cloud video understanding. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various human-related tasks, including action recognition and 3D pose estimation. All datasets and code will be released soon.
-
While recent 3D instance segmentation approaches show promising results based on transformer architectures they often fail to correctly identify instances with similar appearances. They also ambiguously determine edges leading to multiple misclassifications of adjacent edge points. In this work we introduce a novel framework called EASE to overcome these challenges and improve the perception of complex 3D instances. We first propose a semantic guidance network to leverage rich semantic knowledge from a language model as intelligent priors enhancing the functional understanding of real-world instances beyond relying solely on geometrical information. We explicitly instruct the basic instance queries using text embeddings of each instance to learn deep semantic details. Further we utilize the edge prediction module encouraging the segmentation network to be edge-aware. We extract voxel-wise edge maps from point features and use them as auxiliary information for learning edge cues. In our extensive experiments on large-scale benchmarks ScanNetV2 ScanNet200 S3DIS and STPLS3D our EASE outperforms existing state-of-the-art models demonstrating its superior performance.
-
Passive depth estimation based on stereo or defocus relies on the presence of the texture on an object to resolve its depth. Hence, recovering the depth of a textureless object (for example, a large white wall) is not just hard but perhaps even impossible. Or is it? We show that spatial coherence, a property of natural light sources, can be used to resolve the depth of a scene point even when it is textureless. Our approach relies on the idea that natural light scattered off a scene point is locally coherent with itself while incoherent with the light scattered from other surface points; we use this insight to design an optical setup that uses self-interference as a texture feature for estimating depth. Our lab prototype is capable of resolving the depths of textureless objects in sunlight as well as indoor lights.
-
Collaborative perception in automated vehicles leverages the exchange of information between agents aiming to elevate perception results. Previous camera-based collaborative 3D perception methods typically employ 3D bounding boxes or bird's eye views as representations of the environment. However these approaches fall short in offering a comprehensive 3D environmental prediction. To bridge this gap we introduce the first method for collaborative 3D semantic occupancy prediction. Particularly it improves local 3D semantic occupancy predictions by hybrid fusion of (i) semantic and occupancy task features and (ii) compressed orthogonal attention features shared between vehicles. Additionally due to the lack of a collaborative perception dataset designed for semantic occupancy prediction we augment a current collaborative perception dataset to include 3D collaborative semantic occupancy labels for a more robust evaluation. The experimental findings highlight that: (i) our collaborative semantic occupancy predictions excel above the results from single vehicles by over 30% and (ii) models anchored on semantic occupancy outpace state-of-the-art collaborative 3D detection techniques in subsequent perception applications showcasing enhanced accuracy and enriched semantic-awareness in road environments.
-
In class incremental learning (CIL) scenarios, the phenomenon of catastrophic forgetting caused by the classifier's bias towards the current task has long posed a significant challenge. It is mainly caused by the characteristics of discriminative models. With the growing popularity of generative multi-modal models, we explore replacing discriminative models with generative ones for CIL. However, transitioning from discriminative to generative models requires addressing two key challenges. The primary challenge lies in transferring the generated textual information into the classification of distinct categories. Additionally, it requires formulating the task of CIL within a generative framework. To this end, we propose a novel generative multi-modal model (GMM) framework for class incremental learning. Our approach directly generates labels for images using an adapted generative model. After obtaining the detailed text, we use a text encoder to extract text features and employ feature matching to determine the most similar label as the classification prediction. In the conventional CIL settings, we achieve significantly better results in long-sequence task scenarios. Under the few-shot CIL setting, we improve by at least 14% over the current state-of-the-art methods with significantly less forgetting.
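The feature-matching step described in this abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the matching score and uses hand-made toy vectors in place of real text-encoder outputs; the function name `match_label` and all data are hypothetical and are not taken from the paper.

```python
import numpy as np

def match_label(text_feature, label_features, label_names):
    """Return the label whose feature is most cosine-similar to text_feature."""
    query = text_feature / np.linalg.norm(text_feature)
    bank = label_features / np.linalg.norm(label_features, axis=1, keepdims=True)
    scores = bank @ query  # cosine similarity to every candidate label
    return label_names[int(np.argmax(scores))], scores

# Toy vectors standing in for text-encoder outputs (assumption, not real features).
label_names = ["cat", "dog", "car"]
label_features = np.array([[1.0, 0.1, 0.0],
                           [0.1, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
generated_text_feature = np.array([0.2, 0.9, 0.1])

predicted, scores = match_label(generated_text_feature, label_features, label_names)
print(predicted)
```

Because new classes only add rows to the label bank, a scheme of this kind needs no new classifier head when the label set grows.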
-
We introduce a new RGB-D object dataset captured in the wild called WildRGB-D. Unlike most existing real-world object-centric datasets which only come with RGB capturing the direct capture of the depth channel allows better 3D annotations and broader downstream applications. WildRGB-D comprises large-scale category-level RGB-D object videos which are taken using an iPhone to go around the objects in 360 degrees. It contains around 8500 recorded objects and nearly 20000 RGB-D videos across 46 common object categories. These videos are taken with diverse cluttered backgrounds with three setups to cover as many real-world scenarios as possible: (i) a single object in one video; (ii) multiple objects in one video; and (iii) an object with a static hand in one video. The dataset is annotated with object masks real-world scale camera poses and reconstructed aggregated point clouds from RGBD videos. We benchmark four tasks with WildRGB-D including novel view synthesis camera pose estimation object 6d pose estimation and object surface reconstruction. Our experiments show that the large-scale capture of RGB-D objects provides a large potential to advance 3D object learning. Our project page is https://wildrgbd.github.io/.
-
Conventional image outpainting methods usually treat unobserved areas as unknown and extend the scene only in terms of semantic consistency thus overlooking the hidden information in shadows cast by unobserved areas such as the invisible shapes and semantics. In this paper we propose to extract and utilize the hidden information of unobserved areas from their shadows to enhance image outpainting. To this end we propose an end-to-end deep approach that explicitly looks into the shadows within the image. Specifically we extract shadows from the input image and identify instance-level shadow regions cast by the unobserved areas. Then the instance-level shadow representations are concatenated to predict the scene layout of each unobserved instance and outpaint the unobserved areas. Finally two discriminators are implemented to enhance alignment between the extended semantics and their shadows. In the experiments we show that our proposed approach provides complementary cues for outpainting and achieves considerable improvement on all datasets by adopting our approach as a plug-in module.
-
Tumor synthesis enables the creation of artificial tumors in medical images, facilitating the training of AI models for tumor detection and segmentation. However, success in tumor synthesis hinges on creating visually realistic tumors that are generalizable across multiple organs and, furthermore, on the resulting AI models being capable of detecting real tumors in images sourced from different domains (e.g., hospitals). This paper makes a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors (< 2 cm) tend to have similar imaging characteristics in computed tomography (CT), whether they originate in the liver, pancreas, or kidneys. We have ascertained that generative AI models, e.g., Diffusion Models, can create realistic tumors generalized to a range of organs even when trained on a limited number of tumor examples from only one organ. Moreover, we have shown that AI models trained on these synthetic tumors can be generalized to detect and segment real tumors from CT volumes encompassing a broad spectrum of patient demographics, imaging protocols, and healthcare facilities.
-
For image super-resolution (SR) bridging the gap between the performance on synthetic datasets and real-world degradation scenarios remains a challenge. This work introduces a novel "Low-Res Leads the Way" (LWay) training framework merging Supervised Pre-training with Self-supervised Learning to enhance the adaptability of SR models to real-world images. Our approach utilizes a low-resolution (LR) reconstruction network to extract degradation embeddings from LR images merging them with super-resolved outputs for LR reconstruction. Leveraging unseen LR images for self-supervised learning guides the model to adapt its modeling space to the target domain facilitating fine-tuning of SR models without requiring paired high-resolution (HR) images. The integration of Discrete Wavelet Transform (DWT) further refines the focus on high-frequency details. Extensive evaluations show that our method significantly improves the generalization and detail restoration capabilities of SR models on unseen real-world datasets outperforming existing methods. Our training regime is universally compatible requiring no network architecture modifications making it a practical solution for real-world SR applications.
-
Recent text-to-motion advances have inspired numerous attempts at convenient and interactive human motion generation. Yet existing methods are largely limited to generating body motions only, without considering the rich two-hand motions, let alone handling various conditions like body dynamics or texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal dataset for two-hand motion generation. Our dataset includes accurate motion tracking for the human body and hands and provides pairwise finger-level hand annotations and body descriptions. We further provide a strong baseline method, BOTH2Hands, for the novel task: generating vivid two-hand motions from both implicit body dynamics and explicit text prompts. We first warm up two parallel body-to-hand and text-to-hand diffusion models and then utilize the cross-attention transformer for motion blending. Extensive experiments and cross-validations demonstrate the effectiveness of our approach and dataset for generating convincing two-hand motions from the hybrid body-and-textual conditions. Our dataset and code will be released to the community for future research at https://github.com/Godheritage/BOTH2Hands.
-
Generating multiview images from a single view facilitates the rapid generation of a 3D mesh conditioned on a single image. Recent methods that introduce 3D global representation into diffusion models have shown the potential to generate consistent multiviews but they have reduced generation speed and face challenges in maintaining generalizability and quality. To address this issue we propose EpiDiff a localized interactive multiview diffusion model. At the core of the proposed approach is to insert a lightweight epipolar attention block into the frozen diffusion model leveraging epipolar constraints to enable cross-view interaction among feature maps of neighboring views. The newly initialized 3D modeling module preserves the original feature distribution of the diffusion model exhibiting compatibility with a variety of base diffusion models. Experiments show that EpiDiff generates 16 multiview images in just 12 seconds and it surpasses previous methods in quality evaluation metrics including PSNR SSIM and LPIPS. Additionally EpiDiff can generate a more diverse distribution of views improving the reconstruction quality from generated multiviews. Please see the project page at https://huanngzh.github.io/EpiDiff/.
-
To interpret Vision Transformers post-hoc explanations assign salience scores to input pixels providing human-understandable heatmaps. However whether these interpretations reflect true rationales behind the model's output is still underexplored. To address this gap we study the faithfulness criterion of explanations: the assigned salience scores should represent the influence of the corresponding input pixels on the model's predictions. To evaluate faithfulness we introduce Salience-guided Faithfulness Coefficient (SaCo) a novel evaluation metric leveraging essential information of salience distribution. Specifically we conduct pair-wise comparisons among distinct pixel groups and then aggregate the differences in their salience scores resulting in a coefficient that indicates the explanation's degree of faithfulness. Our explorations reveal that current metrics struggle to differentiate between advanced explanation methods and Random Attribution thereby failing to capture the faithfulness property. In contrast our proposed SaCo offers a reliable faithfulness measurement establishing a robust metric for interpretations. Furthermore our SaCo demonstrates that the use of gradient and multi-layer aggregation can markedly enhance the faithfulness of attention-based explanation shedding light on potential paths for advancing Vision Transformer explainability.
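The pair-wise comparison and aggregation described in this abstract can be sketched as follows. This is a hedged reading of the abstract and not the authors' reference implementation: it assumes each pixel group comes with a total salience score and a measured prediction drop when that group is perturbed, and it rewards pairs whose ordering by salience agrees with their ordering by influence. The function and argument names are hypothetical.

```python
import numpy as np

def faithfulness_coefficient(group_salience, group_pred_drop):
    """Aggregate pair-wise salience gaps, signed by agreement with model influence.

    group_salience:  total salience assigned to each pixel group.
    group_pred_drop: drop in model confidence when that group is perturbed.
    Returns a value in [-1, 1]; 1 means salience order always matches influence order.
    """
    s = np.asarray(group_salience, dtype=float)
    d = np.asarray(group_pred_drop, dtype=float)
    order = np.argsort(-s)          # most salient group first
    s, d = s[order], d[order]

    total, norm = 0.0, 0.0
    for i in range(len(s) - 1):
        for j in range(i + 1, len(s)):
            gap = s[i] - s[j]       # non-negative after sorting
            total += gap if d[i] >= d[j] else -gap
            norm += abs(gap)
    return total / norm if norm > 0 else 0.0

consistent = faithfulness_coefficient([4, 3, 2, 1], [0.4, 0.3, 0.2, 0.1])
inverted = faithfulness_coefficient([4, 3, 2, 1], [0.1, 0.2, 0.3, 0.4])
print(consistent, inverted)
```

Weighting each pair by its salience gap means that a violation between two groups with nearly equal salience costs little, while a violation between a highly salient group and an unimportant one costs a lot.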
-
Establishing precise semantic correspondence across object instances in different images is a fundamental and challenging task in computer vision. In this task difficulty arises often due to three challenges: confusing regions with similar appearance inconsistent object scale and indistinguishable nearby pixels. Recognizing these challenges our paper proposes a novel semantic matching pipeline named LPMFlow toward extracting fine-grained semantics and geometry layouts for building pixel-level semantic correspondences. LPMFlow consists of three modules each addressing one of the aforementioned challenges. The layout-aware representation learning module uniformly encodes source and target tokens to distinguish pixels or regions with similar appearances but different geometry semantics. The progressive feature superresolution module outputs four sets of 4D correlation tensors to generate accurate semantic flow between objects in different scales. Finally the matching flow integration and refinement module is exploited to fuse matching flow in different scales to give the final flow predictions. The whole pipeline can be trained end-to-end with a balance of computational cost and correspondence details. Extensive experiments based on benchmarks such as SPair-71K PF-PASCAL and PF-WILLOW have proved that the proposed method can well tackle the three challenges and outperform the previous methods especially in more stringent settings. Code is available at https://github.com/YXSUNMADMAX/LPMFlow.
-
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
-
We present DreamAvatar a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses. While encouraging results have been reported by recent methods on text-guided 3D common object generation generating high-quality human avatars remains an open challenge due to the complexity of the human body's shape pose and appearance. We propose DreamAvatar to tackle this challenge which utilizes a trainable NeRF for predicting density and color for 3D points and pretrained text-to-image diffusion models for providing 2D self-supervision. Specifically we leverage the SMPL model to provide shape and pose guidance for the generation. We introduce a dual-observation-space design that involves the joint optimization of a canonical space and a posed space that are related by a learnable deformation field. This facilitates the generation of more complete textures and geometry faithful to the target pose. We also jointly optimize the losses computed from the full body and from the zoomed-in 3D head to alleviate the common multi-face "Janus" problem and improve facial details in the generated avatars. Extensive evaluations demonstrate that DreamAvatar significantly outperforms existing methods establishing a new state-of-the-art for text-and-shape guided 3D human avatar generation.
-
Histopathological whole slide image (WSI) classification has become a foundational task in medical microscopic image processing. Prevailing approaches involve learning WSIs as instance-bag representations, emphasizing significant instances but struggling to capture the interactions between instances. Additionally, conventional graph representation methods utilize explicit spatial positions to construct topological structures but restrict the flexible interaction capabilities between instances at arbitrary locations, particularly when spatially distant. In response, we propose a novel dynamic graph representation algorithm that conceptualizes WSIs as a form of the knowledge graph structure. Specifically, we dynamically construct neighbors and directed edge embeddings based on the head and tail relationships between instances. Then, we devise a knowledge-aware attention mechanism that can update the head node features by learning the joint attention score of each neighbor and edge. Finally, we obtain a graph-level embedding through the global pooling process of the updated head, serving as an implicit representation for the WSI classification. Our end-to-end graph representation learning approach has outperformed the state-of-the-art WSI analysis methods on three TCGA benchmark datasets and in-house test sets. Our code is available at https://github.com/WonderLandxD/WiKG.
-
We developed a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, thus exposing what is hidden inside them. Our innovation arises from a surprising usage of brain encoding: predicting brain fMRI measurements in response to images. We report two findings. First, explicit mapping between the brain and deep-network features across dimensions of space, layers, scales, and channels is crucial. This mapping method, FactorTopy, is plug-and-play for any deep network; with it, one can paint a picture of the network onto the brain (literally!). Second, our visualization shows how different training methods matter: they lead to remarkable differences in hierarchical organization and scaling behavior, growing with more data or network capacity. It also provides insight into fine-tuning: how pre-trained models change when adapting to small datasets. We found that brain-like, hierarchically organized networks suffer less from catastrophic forgetting after fine-tuning.
-
This paper addresses an interesting yet challenging problem: source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is nontrivial due to the semantic mismatches, style discrepancies, and inevitable distortion of panoramic images. To this end, we propose a novel method that utilizes Tangent Projection (TP), as it has less distortion, and meanwhile slits the equirectangular projection (ERP) with a fixed FoV to mimic the pinhole images. Both projections are shown to be effective in extracting knowledge from the source model. However, the distinct projection discrepancies between source and target domains impede the direct knowledge transfer; thus, we propose a panoramic prototype adaptation module (PPAM) to integrate panoramic prototypes from the extracted knowledge for adaptation. We then impose the loss constraints on both predictions and prototypes and propose a cross-dual attention module (CDAM) at the feature level to better align the spatial and channel characteristics across the domains and projections. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on the synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our method achieves significantly better performance than prior SFUDA methods for pinhole-to-panoramic adaptation.
-
Large-scale visual-language pre-trained models have achieved significant success in various video tasks. However most existing methods follow an "adapt then align" paradigm which adapts pre-trained image encoders to model video-level representations and utilizes one-hot or text embedding of the action labels for supervision. This paradigm overlooks the challenge of mapping from static images to complicated activity concepts. In this paper we propose a novel "Align before Adapt" (ALT) paradigm. Prior to adapting to video representation learning we exploit the entity-to-region alignments for each frame. The alignments are fulfilled by matching the region-aware image embeddings to an offline-constructed text corpus. With the aligned entities we feed their text embeddings to a transformer-based video adapter as the queries which can help extract the semantics of the most important entities from a video to a vector. This paradigm reuses the visual-language alignment of VLP during adaptation and tries to explain an action by the underlying entities. This helps understand actions by bridging the gap with complex activity semantics particularly when facing unfamiliar or unseen categories. ALT demonstrates competitive performance while maintaining remarkably low computational costs. In fully supervised experiments it achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs. Moreover ALT outperforms the previous state-of-the-art methods in both zero-shot and few-shot experiments emphasizing its superior generalizability across various learning scenarios.
-
Recent advancements in vision-language foundation models have significantly enhanced open-vocabulary 3D scene understanding. However, the generalizability of existing methods is constrained due to their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. We aggregate the geometry-aware features using a cost volume and propose a Multi-view Joint Fusion module to aggregate multi-view features through a cross-view attention mechanism, which effectively predicts view-specific blending weights for both colors and open-vocabulary features. Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation, eliminates the need for ground truth semantic labels or depth priors, and effectively generalizes across scenes and datasets without fine-tuning.
-
The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to extend image diffusion models to videos without necessitating model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient resulting in temporal inconsistency. In this paper we introduce FRESCO intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video significantly improving the visual coherence of the resulting translated videos. Extensive experiments demonstrate the effectiveness of our proposed framework in producing high-quality coherent videos marking a notable improvement over existing zero-shot methods.
-
Single-pixel imaging (SPI) is a promising computational imaging technique which produces an image by solving an ill-posed reconstruction problem from few measurements captured by a single-pixel detector. Deep learning has achieved impressive success on SPI reconstruction. However, the poor reconstruction performance and impractical imaging models of previous methods limit its real-world applications. In this paper, we propose a deep unfolding network with hybrid-attention Transformer on the Kronecker SPI model, dubbed HATNet, to improve the imaging quality of real SPI cameras. Specifically, we unfold the computation graph of the iterative shrinkage-thresholding algorithm (ISTA) into two alternating modules: efficient tensor gradient descent and hybrid-attention multi-scale denoising. By virtue of Kronecker SPI, the gradient descent module can avoid the high computational overheads rooted in previous gradient descent modules based on vectorized SPI. The denoising module is an encoder-decoder architecture powered by dual-scale spatial attention for high- and low-frequency aggregation and channel attention for global information recalibration. Moreover, we build an SPI prototype to verify the effectiveness of the proposed method. Extensive experiments on synthetic and real data demonstrate that our method achieves state-of-the-art performance. The source code and pre-trained models are available at https://github.com/Gang-Qu/HATNet-SPI.
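To make the alternation between gradient descent and denoising concrete, here is a minimal sketch of classic ISTA on a Kronecker-structured measurement model. It assumes the model takes the form Y = A X Bᵀ (the matrix names, sizes, and data are illustrative assumptions), and it substitutes plain soft-thresholding for the paper's learned hybrid-attention denoiser, so it shows the algorithm being unfolded, not HATNet itself.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (stand-in for a learned denoiser)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def objective(X, Y, A, B, tau):
    """0.5 * ||A X B^T - Y||_F^2 + tau * ||X||_1."""
    return 0.5 * np.sum((A @ X @ B.T - Y) ** 2) + tau * np.abs(X).sum()

def ista_kronecker(Y, A, B, step, tau, n_iters=200):
    """ISTA with the gradient step kept in matrix form (no vectorization)."""
    X = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iters):
        residual = A @ X @ B.T - Y
        X = X - step * (A.T @ residual @ B)   # gradient descent module
        X = soft_threshold(X, step * tau)     # denoising (proximal) module
    return X

# Synthetic example (assumed sizes): an 8x8 sparse image, 6x6 measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 8))
B = rng.standard_normal((6, 8))
X_true = np.zeros((8, 8))
X_true[2, 3], X_true[5, 1] = 1.0, -2.0
Y = A @ X_true @ B.T

# Step size 1/L, where L = ||A||_2^2 * ||B||_2^2 bounds the gradient's Lipschitz constant.
step = 1.0 / (np.linalg.norm(A, 2) ** 2 * np.linalg.norm(B, 2) ** 2)
tau = 0.01
X_hat = ista_kronecker(Y, A, B, step=step, tau=tau)
print(objective(np.zeros_like(X_hat), Y, A, B, tau), objective(X_hat, Y, A, B, tau))
```

The matrix form is equivalent to the vectorized model, since vec(A X Bᵀ) = (B ⊗ A) vec(X), but it never builds the large Kronecker matrix. That is the kind of saving the abstract attributes to the Kronecker SPI model.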
-
Detecting objects in 3D under various (normal and adverse) weather conditions is essential for safe autonomous driving systems. Recent approaches have focused on employing weather-insensitive 4D radar sensors and leveraging them with other modalities such as LiDAR. However, they fuse multi-modal information without considering the sensor characteristics and weather conditions and lose some height information which could be useful for localizing 3D objects. In this paper, we propose a novel framework for robust LiDAR and 4D radar-based 3D object detection. Specifically, we propose a 3D-LRF module that considers the distinct patterns the two sensors exhibit in 3D space (e.g., precise 3D mapping of LiDAR and wide-range, weather-insensitive measurement of 4D radar) and extracts fusion features based on their 3D spatial relationship. Then, our weather-conditional radar-flow gating network modulates the information flow of fusion features depending on weather conditions and obtains an enhanced feature that effectively incorporates the strengths of the two domains under various weather conditions. Extensive experiments demonstrate that our model achieves SoTA performance for 3D object detection under various weather conditions.
-
Leveraging multi-view diffusion models as priors for 3D optimization has alleviated the problem of 3D consistency, e.g., the Janus face problem or the content drift problem, in zero-shot text-to-3D models. However, the 3D geometric fidelity of the output remains an unresolved issue; although the rendered 2D views are realistic, the underlying geometry may contain errors such as unreasonable concavities. In this work, we propose CorrespondentDream, an effective method to leverage annotation-free cross-view correspondences yielded from the diffusion U-Net to provide an additional 3D prior to the NeRF optimization process. We find that these correspondences are strongly consistent with human perception, and by adopting them in our loss design, we are able to produce NeRF models with geometries that are more coherent with common sense, e.g., smoother object surfaces, yielding higher 3D fidelity. We demonstrate the efficacy of our approach through various comparative qualitative results and a solid user study.
-
Knowledge of lane topology is a core problem in autonomous driving. Aerial imagery can provide high-resolution, quickly updatable lane source data, but detecting lanes from such data has so far been an expensive manual process or, where automated solutions exist, has yielded lanes that are undrivable and require downstream processing. We propose a method for large-scale lane topology extraction from aerial imagery while ensuring that the resulting lanes are realistic and drivable by introducing a novel Bezier Graph shared parameterisation of Bezier curves. We develop a transformer-based model to predict these Bezier Graphs from input aerial images, demonstrating competitive results on the UrbanLaneGraph dataset. We demonstrate that our method generates realistic lane graphs which require both minimal input and minimal downstream processing. We make our code publicly available at https://github.com/driskai/BGFormer
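One way to picture a shared parameterisation of Bezier curves is sketched below. It rests on an assumption that the abstract does not spell out: that each graph node stores a position and a direction, and each edge stores two scalar lengths that place the inner control points along those directions. All function names are hypothetical.

```python
import numpy as np

def edge_control_points(p_start, d_start, p_end, d_end, len_start, len_end):
    """Build cubic Bezier control points from two nodes and two edge lengths.

    Each node contributes a position and a direction; because every edge that
    touches a node reuses the same direction, curves join with matching tangents.
    """
    d_start = np.asarray(d_start, float) / np.linalg.norm(d_start)
    d_end = np.asarray(d_end, float) / np.linalg.norm(d_end)
    p_start, p_end = np.asarray(p_start, float), np.asarray(p_end, float)
    return np.stack([p_start,
                     p_start + len_start * d_start,
                     p_end - len_end * d_end,
                     p_end])

def cubic_bezier(ctrl, t):
    """Evaluate a cubic Bezier curve at parameters t in [0, 1]."""
    t = np.asarray(t, float)[:, None]
    return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
            + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])

# A straight lane segment between two nodes that both point along +x.
ctrl = edge_control_points([0, 0], [1, 0], [3, 0], [1, 0], len_start=1.0, len_end=1.0)
points = cubic_bezier(ctrl, [0.0, 0.5, 1.0])
print(points)
```

Sharing the direction at a node is what keeps consecutive lane segments smooth at their junction, which is one plausible reading of "realistic and drivable".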
-
We present SplattingAvatar a hybrid 3D representation of photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh which renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We disentangle the motion and appearance of a virtual human with explicit mesh geometry and implicit appearance modeling with Gaussian Splatting. The Gaussians are defined by barycentric coordinates and displacement on a triangle mesh as Phong surfaces. We extend lifted optimization to simultaneously optimize the parameters of the Gaussians while walking on the triangle mesh. SplattingAvatar is a hybrid representation of virtual humans where the mesh represents low-frequency motion and surface deformation while the Gaussians take over the high-frequency geometry and detailed appearance. Unlike existing deformation methods that rely on an MLP-based linear blend skinning (LBS) field for motion we control the rotation and translation of the Gaussians directly by mesh which empowers its compatibility with various animation techniques e.g. skeletal animation blend shapes and mesh editing. Trainable from monocular videos for both full-body and head avatars SplattingAvatar shows state-of-the-art rendering quality across multiple datasets.
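The embedding of a Gaussian on the mesh can be illustrated with a small sketch. It assumes the Gaussian's center is the barycentric interpolation of a triangle's vertices, offset along the interpolated (Phong) normal by a scalar displacement. The function name and the toy triangle are hypothetical, and the sketch covers position only, not rotation or appearance.

```python
import numpy as np

def embed_on_triangle(vertices, vertex_normals, bary, displacement):
    """Place a point using barycentric coordinates plus an offset along the Phong normal.

    vertices:       (3, 3) array, one row per triangle vertex.
    vertex_normals: (3, 3) array of per-vertex normals.
    bary:           (3,) barycentric weights summing to 1.
    displacement:   signed distance along the interpolated normal.
    """
    bary = np.asarray(bary, dtype=float)
    surface_point = bary @ vertices             # interpolated position on the triangle
    normal = bary @ vertex_normals              # interpolated (Phong) normal
    normal = normal / np.linalg.norm(normal)
    return surface_point + displacement * normal

triangle = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
normals = np.tile([0.0, 0.0, 1.0], (3, 1))
center = embed_on_triangle(triangle, normals, [1 / 3, 1 / 3, 1 / 3], displacement=0.5)

# Moving the mesh moves the embedded point with it, with no extra parameters.
moved = embed_on_triangle(triangle + [0.0, 0.0, 2.0], normals, [1 / 3, 1 / 3, 1 / 3], 0.5)
print(center, moved)
```

Because the point is expressed relative to the triangle, any technique that deforms the mesh (skeletal animation, blend shapes, manual editing) carries the embedded points along with it.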
-
Reconstructing an avatar from a portrait image has many applications in multimedia but remains a challenging research problem. Extracting reflectance maps and geometry from one image is ill-posed: recovering geometry is a one-to-many mapping problem and reflectance and light are difficult to disentangle. Accurate geometry and reflectance can be captured under the controlled conditions of a light stage but it is costly to acquire large datasets in this fashion. Moreover training solely with this type of data leads to poor generalization with in-the-wild images. This motivates the introduction of MoSAR a method for 3D avatar generation from monocular images. We propose a semi-supervised training scheme that improves generalization by learning from both light stage and in-the-wild datasets. This is achieved using a novel differentiable shading formulation. We show that our approach effectively disentangles the intrinsic face parameters producing relightable avatars. As a result MoSAR estimates a richer set of skin reflectance maps and generates more realistic avatars than existing state-of-the-art methods. We also release a new dataset that provides intrinsic face attributes (diffuse specular ambient occlusion and translucency maps) for 10k subjects.
-
In the realm of geospatial analysis, the diversity of remote sensors, encompassing both optical and microwave technologies, offers a wealth of distinct observational capabilities. Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities. Code can be found at https://github.com/boranhan/Geospatial_Foundation_Models
-
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content versus spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA, an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a series of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are extremely weak in substantiating the answers despite their strong QA performance. This exposes the limitation of current VLMs in making reliable predictions. As a remedy, we further explore and propose a grounded-QA method via Gaussian mask optimization and cross-modal learning. Experiments with different backbones demonstrate that this grounding mechanism improves both grounding and QA. With these efforts, we aim to push towards trustworthy VLMs in VQA systems. Our dataset and code are available at https://github.com/doc-doc/NExT-GQA.
-
Detecting edges in images suffers from the problems of (P1) heavy imbalance between positive and negative classes as well as (P2) label uncertainty owing to disagreement between different annotators. Existing solutions address P1 using class-balanced cross-entropy loss and dice loss and P2 by only predicting edges agreed upon by most annotators. In this paper we propose RankED a unified ranking-based approach that addresses both the imbalance problem (P1) and the uncertainty problem (P2). RankED tackles these two problems with two components: One component which ranks positive pixels over negative pixels and the second which promotes high confidence edge pixels to have more label certainty. We show that RankED outperforms previous studies and sets a new state-of-the-art on NYUD-v2 BSDS500 and Multi-cue datasets. Code is available at https://ranked-cvpr24.github.io.
-
We present DiffHuman a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem most methods are deterministic and output a single solution often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions. In contrast DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image which allows us to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. During inference we may sample 3D avatars by iteratively denoising 2D renders of the predicted 3D representation. Furthermore we introduce a generator neural network that approximates rendering with considerably reduced runtime (55x speed up) resulting in a novel dual-branch diffusion framework. Our experiments show that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image while remaining competitive with the state-of-the-art when reconstructing visible surfaces.
-
Owing to their powerful generative priors, pre-trained text-to-image (T2I) diffusion models have become increasingly popular for solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of the reproduced high-resolution image may contain semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and better preserve the semantics. The source code of our method can be found at https://github.com/cswry/SeeSR
-
Revolutionizing the field of deep learning, Transformer-based models have achieved remarkable performance in many tasks. Recent research has recognized that these models are robust to shuffling but are limited to inter-token permutation in the forward propagation. In this work, we propose our definition of permutation equivariance, a broader concept covering both inter- and intra-token permutation in the forward and backward propagation of neural networks. We rigorously prove that this permutation equivariance property can be satisfied on most vanilla Transformer-based models with almost no adaptation. We examine the property over a range of state-of-the-art models, including ViT, BERT, GPT, and others, with experimental validations. Further, as a proof of concept, we explore how real-world applications, including privacy-enhancing split learning and model authorization, could exploit the permutation equivariance property, which implies wider intriguing application scenarios. The code is available at https://github.com/Doby-Xu/ST
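The narrower, previously known property mentioned above (inter-token permutation in the forward pass) can be checked numerically. The sketch below assumes a single-head self-attention layer without positional encoding, written in plain Python; it does not cover the intra-token or backward-pass cases the abstract proposes, and all names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention, no positional encoding."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    attn = [softmax([s / scale for s in row]) for row in scores]
    return matmul(attn, v)

def max_equivariance_gap(num_tokens=5, dim=4, seed=0):
    """Largest deviation between f(permuted x) and permuted f(x)."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    x = rand(num_tokens, dim)
    wq, wk, wv = rand(dim, dim), rand(dim, dim), rand(dim, dim)
    perm = list(range(num_tokens))
    rng.shuffle(perm)
    out = self_attention(x, wq, wk, wv)
    out_perm = self_attention([x[i] for i in perm], wq, wk, wv)
    # Row j of the permuted output should match row perm[j] of the original.
    return max(abs(a - b)
               for j, i in enumerate(perm)
               for a, b in zip(out_perm[j], out[i]))
```

Calling `max_equivariance_gap()` should return a value at floating-point rounding level, since shuffling the tokens only reorders the rows of the output.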
-
Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however they lack sufficient capabilities to handle hallucinations and generalize across diverse images and texts partially because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study we propose Polos a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos we introduce Multimodal Metric Learning from Human Feedback (M2LHF) a framework for developing metrics based on human feedback. We constructed the Polaris dataset which comprises 131K human judgments from 550 evaluators which is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite Flickr8K-Expert Flickr8K-CF PASCAL-50S FOIL and the Polaris dataset thereby demonstrating its effectiveness and robustness.
-
We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge we propose VidDetours a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos where a user can detour from their current recipe to find steps with alternate ingredients tools and techniques. Validating on a ground truth annotated dataset of 16K samples we show our model's significant improvements over best available methods for video retrieval and question answering with recall rates exceeding the state of the art by 35%.
-
Many surface reconstruction methods incorporate normal integration which is a process to obtain a depth map from surface gradients. In this process the input may represent a surface with discontinuities e.g. due to self-occlusion. To reconstruct an accurate depth map from the input normal map hidden surface gradients occurring from the jumps must be handled. To model these jumps correctly we design a novel discretization for the domain of normal integration. Our key idea is to introduce auxiliary edges which bridge between piecewise-smooth planes in the domain so that the magnitude of hidden jumps can be explicitly expressed on finite elements. Using the auxiliary edges we design a novel algorithm to optimize the discontinuity and the depth map from the input normal map. Our method optimizes discontinuities by using a combination of iterative re-weighted least squares and iterative filtering of the jump magnitudes on auxiliary edges to provide strong sparsity regularization. Compared to previous discontinuity-preserving normal integration methods which model the magnitude of jumps only implicitly our method reconstructs subtle discontinuities accurately thanks to our explicit representation allowing for strong sparsity regularization.
-
We present DrivingGaussian an efficient and effective framework for surrounding dynamic autonomous driving scenes. For complex scenes with moving objects we first sequentially and progressively model the static background of the entire scene with incremental static 3D Gaussians. We then leverage a composite dynamic Gaussian graph to handle multiple moving objects individually reconstructing each object and restoring their accurate positions and occlusion relationships within the scene. We further use a LiDAR prior for Gaussian Splatting to reconstruct scenes with greater details and maintain panoramic consistency. DrivingGaussian outperforms existing methods in dynamic driving scene reconstruction and enables photorealistic surround-view synthesis with high-fidelity and multi-camera consistency. Our project page is at: https://github.com/VDIGPKU/DrivingGaussian.
-
In this paper, we propose a novel concept of path consistency to learn robust object matching without using manual object identity supervision. Our key idea is that, to track an object through frames, we can obtain multiple different association results from a model by varying the frames it can observe, i.e. skipping frames in observation. As the differences in observations do not alter the identities of objects, the obtained association results should be consistent. Based on this rationale, we generate multiple observation paths, each specifying a different set of frames to be skipped, and formulate the Path Consistency Loss, which enforces that the association results are consistent across different observation paths. We use the proposed loss to train our object matching model with only self-supervision. Through extensive experiments on three tracking datasets (MOT17, PersonPath22, KITTI), we demonstrate that our method outperforms existing unsupervised methods with consistent margins on various evaluation metrics and even achieves performance close to supervised methods.
-
Unsupervised learning of keypoints and landmarks has seen significant progress with the help of modern neural network architectures but performance is yet to match the supervised counterpart making their practicability questionable. We leverage the emergent knowledge within text-to-image diffusion models towards more robust unsupervised keypoints. Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images (i.e. keypoints). To do so we simply optimize the text embedding such that the cross-attention maps within the denoising network are localized as Gaussians with small standard deviations. We validate our performance on multiple datasets: the CelebA CUB-200-2011 Tai-Chi-HD DeepFashion and Human3.6m datasets. We achieve significantly improved accuracy sometimes even outperforming supervised ones particularly for data that is non-aligned and less curated. Our code is publicly available at https://stablekeypoints.github.io/.
-
Single-photon Light Detection and Ranging (LiDAR) systems are often equipped with an array of detectors for improved spatial resolution and sensing speed. However given a fixed amount of flux produced by the laser transmitter across the scene the per-pixel Signal-to-Noise Ratio (SNR) will decrease when more pixels are packed in a unit space. This presents a fundamental trade-off between the spatial resolution of the sensor array and the SNR received at each pixel. Theoretical characterization of this fundamental limit is explored. By deriving the photon arrival statistics and introducing a series of new approximation techniques the Mean Squared Error (MSE) of the maximum-likelihood estimator of the time delay is derived. The theoretical predictions align well with simulations and real data.
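The trade-off can be illustrated with a toy calculation. The sketch below assumes a shot-noise-limited detector with no background light, so a pixel that collects a Poisson-distributed count with mean S has an SNR of sqrt(S). This is a simplification for intuition only and is not the MSE derivation described in the abstract; the function name is hypothetical.

```python
import math

def per_pixel_snr(total_photons, num_pixels):
    """Shot-noise-limited SNR when a fixed photon budget is split evenly.

    Assumes Poisson photon counts and no background or dark counts, so
    SNR = mean / std = sqrt(mean photons per pixel).
    """
    signal = total_photons / num_pixels
    return math.sqrt(signal)

# Quadrupling the pixel count halves the per-pixel SNR under this model.
for pixels in (1, 4, 16, 64):
    print(pixels, per_pixel_snr(10_000, pixels))
```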
-
Cross-domain few-shot learning (CDFSL) aims to acquire knowledge from limited training data in the target domain by leveraging prior knowledge transferred from source domains with abundant training samples. CDFSL faces challenges in transferring knowledge across dissimilar domains and fine-tuning models with limited training data. To address these challenges we initially extend the analysis of loss landscapes from the parameter space to the representation space which allows us to simultaneously interpret the transferring and fine-tuning difficulties of CDFSL models. We observe that sharp minima in the loss landscapes of the representation space result in representations that are hard to transfer and fine-tune. Moreover existing flatness-based methods have limited generalization ability due to their short-range flatness. To enhance the transferability and facilitate fine-tuning we introduce a simple yet effective approach to achieve long-range flattening of the minima in the loss landscape. This approach considers representations that are differently normalized as minima in the loss landscape and flattens the high-loss region in the middle by randomly sampling interpolated representations. We implement this method as a new normalization layer that replaces the original one in both CNNs and ViTs. This layer is simple and lightweight introducing only a minimal number of additional parameters. Experimental results on 8 datasets demonstrate that our approach outperforms state-of-the-art methods in terms of average accuracy. Moreover our method achieves performance improvements of up to 9% compared to the current best approaches on individual datasets. Our code will be released.
-
Improving the detection of distant 3D objects is an important yet challenging task. For camera-based 3D perception, the annotation of 3D bounding boxes relies heavily on LiDAR for accurate depth information. As such, the distance of annotation is often limited due to the sparsity of LiDAR points on distant objects, which hampers the capability of existing detectors in long-range scenarios. We address this challenge by considering only 2D box supervision for distant objects, since they are easy to annotate. We propose LR3D, a framework that learns to recover the missing depth of distant objects. LR3D adopts an implicit projection head to learn a mapping between 2D boxes and depth using the 3D supervision on close objects. This mapping allows the depth estimation of distant objects conditioned on their 2D boxes, making long-range 3D detection with 2D supervision feasible. Experiments show that, without distant 3D annotations, LR3D allows camera-based methods to detect distant objects (over 200m) with comparable accuracy to full 3D supervision. Our framework is general and could benefit a wide range of 3D detection methods.
-
This paper addresses the decomposition of holographic feature vectors in Hyperdimensional Computing (HDC) aka Vector Symbolic Architectures (VSA). HDC uses high-dimensional vectors with brain-like properties to represent symbolic information and leverages efficient operators to construct and manipulate complexly structured data in a cognitive fashion. Existing models face challenges in decomposing these structures a process crucial for understanding and interpreting a composite hypervector. We address this challenge by proposing the HDC Memorized-Factorization Problem that captures the common patterns of construction in HDC models. To solve this problem efficiently we introduce HDQMF a HyperDimensional Quantum Memorized-Factorization algorithm. HDQMF is unique in its approach utilizing quantum computing to offer efficient solutions. It modifies crucial steps in Grover's algorithm to achieve hypervector decomposition achieving quadratic speed-up.
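The quadratic speed-up comes from Grover's search, which needs roughly (pi/4) * sqrt(N) iterations to find one marked item among N, versus about N/2 classical lookups on average. The sketch below is a classical state-vector simulation of standard, unmodified Grover search with a single marked item; it does not implement the HDQMF modifications, and the function name is hypothetical.

```python
import math

def grover_success_probability(num_items, marked, iterations=None):
    """Simulate standard Grover search and return P(measuring `marked`)."""
    if iterations is None:
        # Near-optimal iteration count for a single marked item.
        iterations = int(math.floor(math.pi / 4 * math.sqrt(num_items)))
    # Start in the uniform superposition (real amplitudes suffice here).
    amps = [1 / math.sqrt(num_items)] * num_items
    for _ in range(iterations):
        amps[marked] = -amps[marked]          # oracle: flip the marked sign
        mean = sum(amps) / num_items
        amps = [2 * mean - a for a in amps]   # diffusion: reflect about mean
    return amps[marked] ** 2
```

For example, `grover_success_probability(64, marked=5)` uses only 6 iterations yet concentrates almost all of the probability on the marked index.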
-
Recovering degraded low-resolution text images is challenging, especially for Chinese text images with complex strokes and severe degradation in real-world scenarios. Ensuring both text fidelity and style realness is crucial for high-quality text image super-resolution. Recently, diffusion models have achieved great success in natural image synthesis and restoration due to their powerful data distribution modeling abilities and data generation capabilities. In this work, we propose an Image Diffusion Model (IDM) to restore text images with realistic styles. Diffusion models are not only suitable for modeling realistic image distributions but also appropriate for learning text distributions. Since, according to existing work, a text prior is important for guaranteeing the correctness of the restored text structure, we also propose a Text Diffusion Model (TDM) for text recognition, which can guide the IDM to generate text images with correct structures. We further propose a Mixture of Multi-modality module (MoM) to make these two diffusion models cooperate with each other in all the diffusion steps. Extensive experiments on synthetic and real-world datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution (DiffTSR) can restore text images with more accurate text structures as well as more realistic appearances simultaneously. Code is available at https://github.com/YuzheZhang-1999/DiffTSR.
-
Continual learning empowers models to adapt autonomously to the ever-changing environment or data streams without forgetting old knowledge. Prompt-based approaches are built on frozen pre-trained models to learn the task-specific prompts and classifiers efficiently. Existing prompt-based methods are inconsistent between training and testing, limiting their effectiveness. Two types of inconsistency are revealed. Test predictions are made from all classifiers, while training only focuses on the current task classifier without holistic alignment, leading to classifier inconsistency. Prompt inconsistency indicates that the prompt selected during testing may not correspond to the one associated with this task during training. In this paper, we propose a novel prompt-based method, Consistent Prompting (CPrompt), for more aligned training and testing. Specifically, all existing classifiers are exposed to prompt training, resulting in classifier consistency learning. In addition, prompt consistency learning is proposed to enhance prediction robustness and boost prompt selection accuracy. Our Consistent Prompting surpasses its prompt-based counterparts and achieves state-of-the-art performance on multiple continual learning benchmarks. Detailed analysis shows that improvements come from more consistent training and testing.
-
In the context of autonomous driving, the significance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have shown widespread success, most methods follow the ideas originally designed for 2D images. In this paper, we present UniPAD, a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method enables seamless integration into both 2D and 3D frameworks, enabling a more holistic comprehension of the scenes. We demonstrate the feasibility and effectiveness of UniPAD by conducting extensive experiments on various 3D perception tasks. Our method significantly improves lidar-, camera-, and lidar-camera-based baselines by 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic segmentation on the nuScenes validation set, achieving state-of-the-art results in comparison with previous methods.
-
Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular, one discriminator is utilized to enable the SR network to learn the distribution of real-world high-quality images in an adversarial training manner. However, the distribution learning is overly coarse-grained, which is susceptible to virtual textures and causes counter-intuitive generation results. To mitigate this, we propose the simple and effective Semantic-aware Discriminator (denoted as SeD), which encourages the SR network to learn the fine-grained distributions by introducing the semantics of images as a condition. Concretely, we aim to excavate the semantics of images from a well-trained semantic extractor. Under different semantics, the discriminator is able to distinguish real and fake images individually and adaptively, which guides the SR network to learn more fine-grained semantic-aware textures. To obtain accurate and abundant semantics, we take full advantage of recently popular pretrained vision models (PVMs) trained on extensive datasets and then incorporate their semantic features into the discriminator through a well-designed spatial cross-attention module. In this way, our proposed semantic-aware discriminator empowers the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks, i.e. SR and Real SR, have demonstrated the effectiveness of our proposed methods. The code will be available at https://github.com/lbc12345/SeD.
-
While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually while ignoring biases associated with intersections between social attributes. This could be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes. To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross-attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g. a given occupation) while differing only in their depiction of intersectional social attributes (e.g. race & gender). Through our over-generate-then-filter methodology, we produce SocialCounterfactuals, a high-quality dataset containing 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.
-
Efficiently representing and reconstructing the 3D geometry of biological trees remains a challenging problem in computer vision and graphics. We propose a novel approach for generating realistic tree models from single-view photographs. We cast the 3D information inference problem to a semantic voxel diffusion process which converts an input image of a tree to a novel Semantic Voxel Structure (SVS) in 3D space. The SVS encodes the geometric appearance and semantic structural information (e.g. classifying trunks branches and leaves) which retains the intricate internal tree features. Tailored to the SVS we present SVDTree a new hybrid tree modeling approach by combining structure-oriented branch reconstruction and self-organization-based foliage reconstruction. We validate SVDTree by using images from both synthetic and real trees. The comparison results show that our approach can better preserve tree details and achieve more realistic and accurate reconstruction results than previous methods.
-
As with many machine learning problems the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Frechet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models incorrect normality assumptions and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters it does not reflect gradual improvement of iterative text-to-image models it does not capture distortion levels and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric CMMD based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis we demonstrate that FID-based evaluations of text-to-image models may be unreliable and that CMMD offers a more robust and reliable assessment of image quality.
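The statistical core of the proposed metric, the maximum mean discrepancy with a Gaussian RBF kernel, can be sketched in a few lines. The example below computes the standard unbiased estimate of squared MMD between two sets of vectors in plain Python. It uses toy vectors rather than CLIP embeddings, and the bandwidth `sigma` and function names are illustrative assumptions, not the settings used by CMMD.

```python
import math

def rbf(a, b, sigma):
    """Gaussian RBF kernel exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def mmd2_unbiased(xs, ys, sigma=1.0):
    """Unbiased estimate of squared MMD between samples xs and ys.

    Within-set sums skip the i == j terms, which is what removes the bias.
    """
    m, n = len(xs), len(ys)
    kxx = sum(rbf(xs[i], xs[j], sigma)
              for i in range(m) for j in range(m) if i != j)
    kyy = sum(rbf(ys[i], ys[j], sigma)
              for i in range(n) for j in range(n) if i != j)
    kxy = sum(rbf(x, y, sigma) for x in xs for y in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2 * kxy / (m * n)
```

Because the estimator is unbiased, it can come out slightly negative when the two sample sets are drawn from the same distribution; values well above zero indicate that the distributions differ.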
-
The recent success in revealing scene details from sparse 3D point clouds obtained via structure-from-motion has raised significant privacy concerns in visual localization. One prominent approach for mitigating this issue is to lift 3D points to 3D lines thereby reducing the effectiveness of the scene inversion attacks but this comes at the cost of increased algorithmic complexity for camera localization due to weaker geometric constraints induced by line clouds. To overcome this limitation we propose a new lifting approach called "ray cloud" whereby each lifted 3D line intersects at one of two predefined locations depicting omnidirectional rays from two cameras. This yields two benefits i) camera localization can now be cast as relative pose estimation between the query image and the calibrated rig of two perspective cameras which can be efficiently solved using a variant of the 5-point algorithm and ii) the ray cloud introduces erroneous estimations for the density-based inversion attack degrading the quality of scene recovery. Moreover we explore possible modifications of the inversion attack to better recover scenes from the ray clouds and propose a ray sampling technique to reduce the effectiveness of the modified attack. Experimental results on two public datasets show real-time localization speed as well as enhanced privacy-preserving capability over the state-of-the-art without overly sacrificing the localization accuracy.
-
Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues. Such pixel-level approaches suffer from ambiguities or violations of multi-view consistency (e.g. caused by textureless or specular surfaces). We address this issue with a new image representation which we call a SuperPrimitive. SuperPrimitives are obtained by splitting images into semantically correlated local regions and enhancing them with estimated surface normal directions both of which are predicted by state-of-the-art single image neural networks. This provides a local geometry estimate per SuperPrimitive while their relative positions are adjusted based on multi-view observations. We demonstrate the versatility of our new representation by addressing three 3D reconstruction tasks: depth completion few-view structure from motion and monocular dense visual odometry. Project page: https://makezur.github.io/SuperPrimitive/
-
While recent model-free Reinforcement Learning (RL) methods have demonstrated human-level effectiveness in gaming environments their success in everyday tasks like visual navigation has been limited particularly under significant appearance variations. This limitation arises from (i) poor sample efficiency and (ii) over-fitting to training scenarios. To address these challenges we present a world model that learns invariant features using (i) contrastive unsupervised learning and (ii) an intervention-invariant regularizer. Learning an explicit representation of the world dynamics i.e. a world model improves sample efficiency while contrastive learning implicitly enforces learning of invariant features which improves generalization. However the naive integration of contrastive loss to world models is not good enough as world-model-based RL methods independently optimize representation learning and agent policy. To overcome this issue we propose an intervention-invariant regularizer in the form of an auxiliary task such as depth prediction image denoising image segmentation etc. that explicitly enforces invariance to style interventions. Our method outperforms current state-of-the-art model-based and model-free RL methods and significantly improves on out-of-distribution point navigation tasks evaluated on the iGibson benchmark. With only visual observations we further demonstrate that our approach outperforms recent language-guided foundation models for point navigation which is essential for deployment on robots with limited computation capabilities. Finally we demonstrate that our proposed model excels at the sim-to-real transfer of its perception module on the Gibson benchmark.
-
The diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step t to achieve satisfactory multi-round denoising. Usually, t from the finite set {1, ..., T} is encoded to a temporal feature by a few modules totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory as well as a low compression efficiency. To solve these problems, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block, which is related only to the time-step t and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can preserve most of the temporal information and ensure the end-to-end generation quality. Extensive experiments on various datasets and diffusion models demonstrate our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by 2.0x on LSUN-Bedrooms 256x256 compared to previous works. Our code is publicly available at https://github.com/ModelTC/TFMQ-DM.
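The observation that the temporal feature depends only on t, drawn from a finite set, means every possible temporal feature can be enumerated ahead of time. The sketch below illustrates just that point with a sinusoidal time-step embedding and a plain min-max uniform quantizer calibrated over all T features. It is not the TIAR or FSC procedure from the paper; the embedding layout, quantizer, and all names are assumptions for illustration.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of an integer time-step (cosines then sines)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def calibrate_minmax(features, bits=4):
    """Pick a uniform quantization grid covering every enumerated feature."""
    lo = min(min(f) for f in features)
    hi = max(max(f) for f in features)
    scale = (hi - lo) / (2 ** bits - 1)
    return lo, scale

def fake_quantize(vec, lo, scale, bits=4):
    """Round each value to the nearest grid point and map it back to float."""
    levels = 2 ** bits - 1
    out = []
    for v in vec:
        q = min(max(round((v - lo) / scale), 0), levels)
        out.append(lo + q * scale)
    return out

# Enumerate the whole finite set {1, ..., T}; no sampled images are needed.
T = 100
features = [timestep_embedding(t) for t in range(1, T + 1)]
lo, scale = calibrate_minmax(features, bits=4)
quantized = [fake_quantize(f, lo, scale, bits=4) for f in features]
```

Since the calibration range covers every feature the model will ever see, the rounding error of each value is bounded by half a quantization step.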
-
CNC manufacturing is a process that employs computer numerical control (CNC) machines to govern the movements of various industrial tools and machinery, encompassing equipment ranging from grinders and lathes to mills and CNC routers. However, the reliance on manual CNC programming has become a bottleneck, and the requirement for expert knowledge can result in significant costs. Therefore, we introduce a pioneering approach named CNC-Net, representing the use of deep neural networks (DNNs) to simulate CNC machines and grasp intricate operations when supplied with raw materials. CNC-Net constitutes a self-supervised framework that exclusively takes an input 3D model and subsequently generates the essential operation parameters required by the CNC machine to construct the object. Our method has the potential to transform automation in manufacturing by offering a cost-effective alternative to the high costs of manual CNC programming while maintaining exceptional precision in 3D object production. Our experiments underscore the effectiveness of our CNC-Net in constructing the desired 3D objects through the utilization of CNC operations. Notably, it excels in preserving finer local details, exhibiting a marked enhancement in precision compared to the state-of-the-art 3D CAD reconstruction approaches. The code is available at https://github.com/myavartanoo/CNC-Net_PyTorch.
-
Autonomous robot systems have attracted increasing research attention in recent years where environment understanding is a crucial step for robot navigation human-robot interaction and decision. Real-world robot systems usually collect visual data from multiple sensors and are required to recognize numerous objects and their movements in complex human-crowded settings. Traditional benchmarks with their reliance on single sensors and limited object classes and scenarios fail to provide the comprehensive environmental understanding robots need for accurate navigation interaction and decision-making. As an extension of JRDB dataset we unveil JRDB-PanoTrack a novel open-world panoptic segmentation and tracking benchmark towards more comprehensive environmental perception. JRDB-PanoTrack includes (1) various data involving indoor and outdoor crowded scenes as well as comprehensive 2D and 3D synchronized data modalities; (2) high-quality 2D spatial panoptic segmentation and temporal tracking annotations with additional 3D label projections for further spatial understanding; (3) diverse object classes for closed- and open-world recognition benchmarks with OSPA-based metrics for evaluation. Extensive evaluation of leading methods shows significant challenges posed by our dataset.
-
Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt, where the model might overlook or entirely fail to produce certain objects. While recent studies propose various solutions, they often require custom-tailored functions for each of these problems, leading to sub-optimal results, especially for complex prompts. Our work introduces a novel perspective by tackling this challenge in a contrastive context. Our approach intuitively promotes the segregation of objects in attention maps while also ensuring that pairs of related attributes are kept close to each other. We conducted extensive experiments across a wide variety of scenarios, each involving unique combinations of objects, attributes, and scenes. These experiments effectively showcase the versatility, efficiency, and flexibility of our method in working with both latent and pixel-based diffusion models, including Stable Diffusion and Imagen. Moreover, we publicly share our source code to facilitate further research.
-
Self-supervised pre-training has been proven effective in learning transferable representations that benefit various visual tasks. This paper asks the question: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole i.e. learning consistent facial representations at the image-level which overlooks the consistency of local facial representations (i.e. facial regions like eyes nose etc). In this work we make a first attempt to propose a novel self-supervised facial representation learning framework named Facial Region Awareness (FRA) that learns consistent global and local facial representations. Specifically we explicitly enforce the consistency of facial regions by matching the local facial representations across views which are extracted with learned heatmaps highlighting the facial regions. Inspired by the mask prediction in supervised semantic segmentation we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings which leverage the attention mechanism to globally look up the facial image for facial regions. To learn such heatmaps we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. The transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models and more importantly using ResNet as the unified backbone for various tasks our FRA achieves comparable or even better performance compared with SOTA methods in facial analysis tasks.
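The heatmap step described above (cosine similarity between per-pixel projected features and mask embeddings) can be illustrated with a minimal sketch. The function names, the toy feature map, and the two region embeddings are assumptions made for illustration; the abstract does not specify shapes, projection heads, or any post-processing, so this is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def region_heatmaps(pixel_features, mask_embeddings):
    """pixel_features: H x W grid of C-dim vectors; mask_embeddings: K C-dim vectors.
    Returns K heatmaps (each H x W), one per mask embedding."""
    return [
        [[cosine(pixel, emb) for pixel in row] for row in pixel_features]
        for emb in mask_embeddings
    ]

# Toy 2x2 feature map with 2 channels and two hypothetical region embeddings.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [-1.0, 0.0]]]
embeddings = [[1.0, 0.0], [0.0, 2.0]]
heatmaps = region_heatmaps(features, embeddings)
```

Each heatmap value lies in [-1, 1]; pixels whose features point in the same direction as an embedding score close to 1 regardless of feature magnitude.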
-
In recent times the generation of 3D assets from text prompts has shown impressive results. Both 2D and 3D diffusion models can help generate decent 3D objects based on prompts. 3D diffusion models have good 3D consistency but their quality and generalization are limited as trainable 3D data is expensive and hard to obtain. 2D diffusion models enjoy strong abilities of generalization and fine generation but 3D consistency is hard to guarantee. This paper attempts to bridge the power from the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation. A fast 3D object generation framework named as GaussianDreamer is proposed where the 3D diffusion model provides priors for initialization and the 2D diffusion model enriches the geometry and appearance. Operations of noisy point growing and color perturbation are introduced to enhance the initialized Gaussians. Our GaussianDreamer can generate a high-quality 3D instance or 3D avatar within 15 minutes on one GPU much faster than previous methods while the generated instances can be directly rendered in real time. Demos and code are available at https://taoranyi.com/gaussiandreamer/.
-
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
Diffusion models represent a new paradigm in text-to-image generation. Beyond generating high-quality images from text prompts models such as Stable Diffusion have been successfully extended to the joint generation of semantic segmentation pseudo-masks. However current extensions primarily rely on extracting attentions linked to prompt words used for image synthesis. This approach limits the generation of segmentation masks derived from word tokens not contained in the text prompt. In this work we introduce Open-Vocabulary Attention Maps (OVAM)--a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. In addition we propose a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation. We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions. The best-performing model improves its mIoU from 52.1 to 86.6 for the synthetic images' pseudo-masks demonstrating that our optimized tokens are an efficient way to improve the performance of existing methods without architectural changes or retraining.
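For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of mean intersection-over-union on flat per-pixel label lists. The function names and toy labels are illustrative assumptions; the abstract does not say how the paper averages over classes or images, so this shows only the generic metric.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    # Two empty masks agree perfectly, so their IoU is defined as 1.0 here.
    return intersection / union if union else 1.0

def mean_iou(pred_labels, true_labels, classes):
    """Mean IoU over the given classes, reported as a percentage."""
    per_class = [
        iou([p == c for p in pred_labels], [t == c for t in true_labels])
        for c in classes
    ]
    return 100.0 * sum(per_class) / len(per_class)

# Toy example: six pixels, two classes.
pred = [0, 0, 1, 1, 1, 0]
true = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, true, classes=[0, 1])
```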
-
Hallucination posed as a pervasive challenge of multi-modal large language models (MLLMs) has significantly impeded their real-world usage that demands precise judgment. Existing methods mitigate this issue either by training with specifically designed data or by performing inference with external knowledge from other sources incurring inevitable additional costs. In this paper we present OPERA a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy serving as a nearly free lunch to alleviate the hallucination issue without additional data knowledge or training. Our approach begins with an interesting observation that most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix i.e. MLLMs tend to generate new tokens by focusing on a few summary tokens but not all the previous tokens. Such a partial over-trust inclination leads the model to neglect image tokens and describe the image content with hallucinations. Based on the observation OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens and re-allocates the token selection if necessary. With extensive experiments OPERA shows significant hallucination-mitigating performance on different MLLMs and metrics proving its effectiveness and generality. Our code is available at: https://github.com/shikiw/OPERA.
-
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions. It is clear that the pivotal factor for successful navigation lies in the comprehensive scene understanding. Previous VLN agents employ monocular frameworks to extract 2D features of perspective views directly. Though straightforward they struggle to capture 3D geometry and semantics leading to a partial and incomplete environment representation. To achieve a comprehensive 3D representation with fine-grained details we introduce a Volumetric Environment Representation (VER) which voxelizes the physical world into structured 3D cells. For each cell VER aggregates multi-view 2D features into such a unified 3D space via 2D-3D sampling. Through coarse-to-fine feature extraction and multi-task learning for VER our agent predicts 3D occupancy 3D room layout and 3D bounding boxes jointly. Based on online collected VERs our agent performs volume state estimation and builds episodic memory for predicting the next step. Experimental results show our environment representations from multi-task learning lead to evident performance gains on VLN. Our model achieves state-of-the-art performance across VLN benchmarks (R2R REVERIE and R4R).
-
Utilizing pre-trained 2D large-scale generative models recent works are capable of generating high-quality novel views from a single in-the-wild image. However due to the lack of information from multiple views these works encounter difficulties in generating controllable novel views. In this paper we present DreamComposer a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis further enhancing them to generate high-fidelity novel view images with multi-view conditions ready for controllable 3D object reconstruction and various other applications.
-
Model calibration measuring the alignment between the prediction accuracy and model confidence is an important metric reflecting model trustworthiness. Existing dense binary classification methods without proper regularisation of model confidence are prone to being over-confident. To calibrate Deep Neural Networks (DNNs) we propose a Self-Calibrating Vicinal Risk Minimisation (SCVRM) that explores the vicinity space of labeled data where vicinal images that are farther away from labeled images adopt the groundtruth label with decreasing label confidence. We prove that in the logistic regression problem SCVRM can be seen as a Vicinal Risk Minimisation plus a regularisation term that penalises the over-confident predictions. In practical implementation SCVRM is approximated using Monte Carlo sampling that samples additional augmented training images and labels from the vicinal distributions. Experimental results demonstrate that SCVRM can significantly enhance model calibration for different dense classification tasks on both in-distribution and out-of-distribution data. Code is available at https://github.com/Carlisle-Liu/SCVRM.
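As background for the calibration notion used above (alignment between accuracy and confidence), here is a minimal sketch of the Expected Calibration Error, a commonly used calibration metric. Choosing ECE, the equal-width binning, and the default of 10 bins are assumptions for illustration; the abstract does not state which calibration metrics the paper reports.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.
    confidences: predicted confidence in [0, 1] per sample.
    correct: whether each prediction was right (bool per sample)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# An over-confident toy model: always 90% confident, right 3 times out of 4.
ece_overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A perfectly calibrated model has an ECE of 0; the over-confident toy model above has a gap of 0.9 minus 0.75 between its confidence and its accuracy.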
-
We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigid transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations of 3D anchor points that are defined on the surface of the scene. In order to identify anchor points we introduce a novel correspondence algorithm that first matches RGB-based pairs then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset contains 113 scenes leveraging 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods and we also explore different methods for filtering correspondences.
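The idea of a 3D flow defined as a weighted linear blend of rigid transformations attached to anchor points can be sketched as follows. The Gaussian distance weighting, the `sigma` parameter, and all function names are assumptions made for illustration; the abstract does not say how the blending weights are computed.

```python
import math

def apply_rigid(rotation, translation, point):
    """Apply y = R p + t for a 3x3 rotation (list of rows) and a 3-vector t."""
    return [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
            for r in range(3)]

def blended_flow(point, anchors, transforms, sigma=1.0):
    """Displacement of `point` as a weighted linear blend of per-anchor rigid transforms.
    Weights are Gaussian in the distance to each anchor (an assumption) and
    normalized to sum to 1."""
    weights = []
    for anchor in anchors:
        dist_sq = sum((p - a) ** 2 for p, a in zip(point, anchor))
        weights.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # The point is too far from every anchor to be influenced; leave it in place.
        return [0.0, 0.0, 0.0]
    moved = [0.0, 0.0, 0.0]
    for weight, (rotation, translation) in zip(weights, transforms):
        target = apply_rigid(rotation, translation, point)
        for k in range(3):
            moved[k] += (weight / total) * target[k]
    return [m - p for m, p in zip(moved, point)]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
anchors = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
transforms = [(IDENTITY, [0.0, 0.0, 0.0]),   # first anchor stays put
              (IDENTITY, [0.0, 1.0, 0.0])]   # second anchor shifts by 1 along y
flow_mid = blended_flow([1.0, 0.0, 0.0], anchors, transforms)
```

A point halfway between the two anchors receives equal weights, so it moves by half of the second anchor's shift.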
-
Human pose and shape (HPS) estimation with lensless imaging is not only beneficial to privacy protection but also can be used in covert surveillance scenarios due to the small size and simple structure of this device. However this task presents significant challenges due to the inherent ambiguity of the captured measurements and lacks effective methods for directly estimating human pose and shape from lensless data. In this paper we propose the first end-to-end framework to recover 3D human poses and shapes from lensless measurements to our knowledge. We specifically design a multi-scale lensless feature decoder to decode the lensless measurements through the optically encoded mask for efficient feature extraction. We also propose a double-head auxiliary supervision mechanism to improve the estimation accuracy of human limb ends. Besides we establish a lensless imaging system and verify the effectiveness of our method on various datasets acquired by our lensless imaging system. The code and dataset are available at https://cic.tju.edu.cn/faculty/likun/projects/LPSNet.
-
As a fundamental problem in multimodal learning multimodal fusion aims to compensate for the inherent limitations of a single modality. One challenge of multimodal fusion is that the unimodal data in their unique embedding space mostly contains potential noise which leads to corrupted cross-modal interactions. However in this paper we show that the potential noise in unimodal data could be well quantified and further employed to learn more stable unimodal embeddings via contrastive learning. Specifically we propose a novel generic and robust multimodal fusion strategy termed Embracing Aleatoric Uncertainty (EAU) which is simple and can be applied to various kinds of modalities. It consists of two key steps: (1) Stable Unimodal Feature Augmentation (SUFA) which learns a stable unimodal representation by incorporating the aleatoric uncertainty into self-supervised contrastive learning; and (2) Robust Multimodal Feature Integration (RMFI) which leverages an information-theoretic strategy to learn a robust compact joint representation. We evaluate our proposed EAU method on five multimodal datasets where the video RGB image text audio and depth image are involved. Extensive experiments demonstrate the EAU method is more noise-resistant than existing multimodal fusion strategies and establishes new state-of-the-art on several benchmarks.
-
This work delves into the task of pose-free novel view synthesis from stereo pairs a challenging and pioneering task in 3D vision. Our innovative framework unlike any before seamlessly integrates 2D correspondence matching camera pose estimation and NeRF rendering fostering a synergistic enhancement of these tasks. We achieve this through designing an architecture that utilizes a shared representation which serves as a foundation for enhanced 3D geometry understanding. Capitalizing on the inherent interplay between the tasks our unified framework is trained end-to-end with the proposed training strategy to improve overall model accuracy. Through extensive evaluations across diverse indoor and outdoor scenes from two real-world datasets we demonstrate that our approach achieves substantial improvement over previous methodologies especially in scenarios characterized by extreme viewpoint changes and the absence of accurate camera poses.
-
Reconstructing CAD construction sequences from raw 3D geometry serves as an interface between real-world objects and digital designs. In this paper we propose CAD-Diffuser a multimodal diffusion scheme aiming at integrating top-down design paradigm into generative reconstruction. In particular we unify CAD point clouds and CAD construction sequences at the token level guiding our proposed multimodal diffusion strategy to understand and link between the geometry and the design intent concentrated in construction sequences. Leveraging the strong decoding abilities of language models the forward process is modeled as a random walk between the original token and the [MASK] token while the reverse process naturally fits the masked token modeling scheme. A volume-based noise schedule is designed to encourage outline-first generation decomposing the top-down design methodology into a machine-understandable procedure. For tokenizing CAD data of multiple modalities we introduce a tokenizer with a self-supervised face segmentation task to compress local and global geometric information for CAD point clouds and the CAD construction sequence is transformed into a primitive token string. Experimental results show that our CAD-Diffuser can perceive geometric details and the results are more likely to be reused by human designers.
-
Existing Siamese or transformer trackers commonly pose visual object tracking as a one-shot detection problem i.e. locating the target object in a single forward evaluation scheme. Despite the demonstrated success these trackers may easily drift towards distractors with similar appearance due to the single forward evaluation scheme lacking self-correction. To address this issue we cast visual tracking as a point set based denoising diffusion process and propose a novel generative learning based tracker dubbed DiffusionTrack. Our DiffusionTrack possesses two appealing properties: 1) It follows a novel noise-to-target tracking paradigm that leverages multiple denoising diffusion steps to localize the target in a dynamic searching manner per frame. 2) It models the diffusion process using a point set representation which can better handle appearance variations for more precise localization. One side benefit is that DiffusionTrack greatly simplifies the post-processing e.g. removing window penalty scheme. Without bells and whistles our DiffusionTrack achieves leading performance over the state-of-the-art trackers and runs in real-time. The code is in https://github.com/VISION-SJTU/DiffusionTrack.
-
In human-centric content generation the pre-trained text-to-image models struggle to produce user-wanted portrait images which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end we propose a novel multi-modal face generation framework capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is so sophisticated that it can be specialized by the fine-grained emotional vocabulary. We devise a novel diffusion model that can undertake the task of simultaneous face swapping and reenactment. Due to the entanglement of identity and expression separately and precisely controlling them within one framework is a nontrivial task and thus has not been explored yet. To overcome this we propose several innovative designs in the conditional diffusion model including balancing identity and expression encoder improved midpoint sampling and explicit background conditioning. Extensive experiments have demonstrated the controllability and scalability of the proposed framework in comparison with state-of-the-art text-to-image face swapping and face reenactment methods.
-
Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However a significant limitation is their inability to offer interactive control to users a feature that promises to open up unprecedented applications and creativity. In this work we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo a novel masked attention module which seamlessly integrates with current video generation models offering control without the need for additional training or inference overhead. To facilitate future research we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models all while maintaining the same latency. Code and benchmark are available on the webpage.
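To make the notion of a masked attention module concrete, the following is a generic sketch of scaled dot-product attention with a boolean mask. It illustrates only the general mechanism; how Peekaboo constructs its masks and which attention layers it modifies are not described in the abstract, and the function name and toy inputs are assumptions.

```python
import math

def masked_attention(queries, keys, values, allowed):
    """Scaled dot-product attention where allowed[i][j] states whether
    query i may attend to key j. Disallowed pairs get a score of -inf,
    so they receive exactly zero weight after the softmax."""
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        scores = []
        for j, key in enumerate(keys):
            score = sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            scores.append(score if allowed[i][j] else float("-inf"))
        peak = max(scores)
        if peak == float("-inf"):
            raise ValueError("query %d has no allowed keys" % i)
        exps = [math.exp(s - peak) for s in scores]  # exp(-inf) is 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
only_first = masked_attention(queries, keys, values, [[True, False]])
unmasked = masked_attention(queries, keys, values, [[True, True]])
```

Because the mask only changes scores that are computed anyway, this kind of control adds no extra parameters, which is consistent with the training-free claim in the abstract.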
-
Computer vision techniques play a central role in the perception stack of autonomous vehicles. Such methods are employed to perceive the vehicle surroundings given sensor data. 3D LiDAR sensors are commonly used to collect sparse 3D point clouds from the scene. However compared to human perception such systems struggle to deduce the unseen parts of the scene given those sparse point clouds. In this matter the scene completion task aims at predicting the gaps in the LiDAR measurements to achieve a more complete scene representation. Given the promising results of recent diffusion models as generative models for images we propose extending them to achieve scene completion from a single 3D LiDAR scan. Previous works used diffusion models over range images extracted from LiDAR data directly applying image-based diffusion methods. Distinctly we propose to directly operate on the points reformulating the noising and denoising diffusion process such that it can efficiently work at scene scale. Together with our approach we propose a regularization loss to stabilize the noise predicted during the denoising process. Our experimental evaluation shows that our method can complete the scene given a single LiDAR scan as input producing a scene with more details compared to state-of-the-art scene completion methods. We believe that our proposed diffusion process formulation can support further research in diffusion models applied to scene-scale point cloud data.
-
Source-free domain adaptation (SFDA) assumes that model adaptation only accesses the well-learned source model and unlabeled target instances for knowledge transfer. However cross-domain distribution shift easily triggers invalid discriminative semantics from the source model when recognizing the target samples. Hence understanding the specific content of discriminative patterns and adjusting their representation in the target domain become the key to solving SFDA. To achieve such a vision this paper proposes a novel explanation paradigm the "Discriminative Pattern Calibration (DPC)" mechanism for solving the SFDA issue. Concretely DPC first utilizes a learning network to infer the discriminative regions on the target images and specifically emphasizes them in feature space to enhance their representation. Moreover DPC relies on the attention-reversed mixup mechanism to augment more samples and improve the robustness of the classifier. Considerable experimental results and studies demonstrate the effectiveness of our DPC in enhancing the performance of existing SFDA baselines.
-
In this paper we propose Image Downscaling Assessment by Rate-Distortion (IDA-RD) a novel measure to quantitatively evaluate image downscaling algorithms. In contrast to image-based methods that measure the quality of downscaled images ours is process-based that draws ideas from rate-distortion theory to measure the distortion incurred during downscaling. Our main idea is that downscaling and super-resolution (SR) can be viewed as the encoding and decoding processes in the rate-distortion model respectively and that a downscaling algorithm that preserves more details in the resulting low-resolution (LR) images should lead to less distorted high-resolution (HR) images in SR. In other words the distortion should increase as the downscaling algorithm deteriorates. However it is non-trivial to measure this distortion as it requires the SR algorithm to be blind and stochastic. Our key insight is that such requirements can be met by recent SR algorithms based on deep generative models that can find all matching HR images for a given LR image on their learned image manifolds. Extensive experimental results show the effectiveness of our IDA-RD measure.
-
Backdoor attacks have been well-studied in visible light object detection (VLOD) in recent years. However VLOD cannot work effectively in dark and temperature-sensitive scenarios. Instead thermal infrared object detection (TIOD) is the most accessible and practical in such environments. In this paper our team is the first to investigate the security vulnerabilities associated with TIOD in the context of backdoor attacks spanning both the digital and physical realms. We introduce two novel types of backdoor attacks on TIOD each offering unique capabilities: Object-affecting Attack and Range-affecting Attack. We conduct a comprehensive analysis of key factors influencing trigger design which include temperature size material and concealment. These factors especially temperature significantly impact the efficacy of backdoor attacks on TIOD. A thorough understanding of these factors will serve as a foundation for designing physical triggers and temperature controlling experiments. Our study includes extensive experiments conducted in both digital and physical environments. In the digital realm we evaluate our approach using benchmark datasets for TIOD achieving an Attack Success Rate (ASR) of up to 98.21%. In the physical realm we test our approach in two real-world settings: a traffic intersection and a parking lot using a thermal infrared camera. Here we attain an ASR of up to 98.38%.
-
Deep Neural Networks (DNNs) are powerful tools for various computer vision tasks yet they often struggle with reliable uncertainty quantification, a critical requirement for real-world applications. Bayesian Neural Networks (BNNs) are equipped for uncertainty estimation but cannot scale to large DNNs where they are highly unstable to train. To address this challenge we introduce the Adaptable Bayesian Neural Network (ABNN) a simple and scalable strategy to seamlessly transform DNNs into BNNs in a post-hoc manner with minimal computational and training overheads. ABNN preserves the main predictive properties of DNNs while enhancing their uncertainty quantification abilities through simple BNN adaptation layers (attached to normalization layers) and a few fine-tuning steps on pre-trained models. We conduct extensive experiments across multiple datasets for image classification and semantic segmentation tasks and our results demonstrate that ABNN achieves state-of-the-art performance without the computational budget typically associated with ensemble methods.
-
Composed image retrieval (CIR) task takes a composed query of image and text aiming to search relative images for both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image query text and target image which is very expensive to collect. Several recent works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework only using language for its training. Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then we let the new and original texts have the same latent embedding vector. With this simple strategy LinCIR is surprisingly efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performances on four different CIR benchmarks CIRCO GeneCIS FashionIQ and CIRR even outperforming supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
-
The existing facial datasets while having plentiful images at near frontal views lack images with extreme head poses leading to the downgraded performance of deep learning models when dealing with profile or pitched faces. This work aims to address this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ) which includes a maximum of 450k high-quality images of faces at extreme poses. To produce such a massive dataset we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets VFHQ and CelebV-HQ which contain many high-resolution face videos captured in various settings. Our dataset can complement existing datasets on various facial-related tasks such as facial synthesis with 2D/3D-aware GAN diffusion-based text-to-image face generation and face reenactment. Specifically training with EFHQ helps models generalize well across diverse poses significantly improving performance in scenarios involving extreme views confirmed by extensive experiments. Additionally we utilize EFHQ to define a challenging cross-view face verification benchmark in which the performance of SOTA face recognition models drops 5-37% compared to frontal-to-frontal scenarios aiming to stimulate studies on face recognition under severe pose conditions in the wild.
-
Point Cloud Registration is a critical and challenging task in computer vision. Recent advancements have predominantly embraced a coarse-to-fine matching mechanism with the key to matching the superpoints located in patches with inter-frame consistent structures. However previous methods still face challenges with ambiguous matching because the interference information aggregated from irrelevant regions may disturb the capture of inter-frame consistency relations leading to wrong matches. To address this issue we propose Dynamic Cues-Assisted Transformer (DCATr). Firstly the interference from irrelevant regions is greatly reduced by constraining attention to certain cues i.e. regions with highly correlated structures of potential corresponding superpoints. Secondly cues-assisted attention is designed to mine the inter-frame consistency relations while more attention is assigned to pairs with high consistent confidence in feature aggregation. Finally a dynamic updating fashion is proposed to facilitate mining richer consistency information further improving aggregated features' distinctiveness and relieving matching ambiguity. Extensive evaluations on indoor and outdoor standard benchmarks demonstrate that DCATr outperforms all state-of-the-art methods.
-
Diffusion MRI (dMRI) non-invasively maps brain white matter yet necessitates denoising due to low signal-to-noise ratios. Patch2Self (P2S) employing self-supervised techniques and regression on a Casorati matrix effectively denoises dMRI images and has become the new de-facto standard in this field. P2S however is resource intensive both in terms of running time and memory usage as it uses all voxels ($n$) from all-but-one held-in volumes ($d-1$) to learn a linear mapping $\Phi : \mathbb{R}^{n \times (d-1)} \mapsto \mathbb{R}^{n}$ for denoising the held-out volume. The increasing size and dimensionality of higher resolution dMRI acquisitions can make P2S infeasible for large-scale analyses. This work exploits the redundancy imposed by P2S to alleviate its performance issues and inspect regions that influence the noise disproportionately. Specifically this study makes a three-fold contribution: (1) We present Patch2Self2 (P2S2) a method that uses matrix sketching to perform self-supervised denoising. By solving a sub-problem on a smaller sub-space a so-called coreset we show how P2S2 can yield a significant speedup in training time while using less memory. (2) We present a theoretical analysis of P2S2 focusing on determining the optimal sketch size through rank estimation a key step in achieving a balance between denoising accuracy and computational efficiency. (3) We show how the so-called statistical leverage scores can be used to interpret the denoising of dMRI data a process that was traditionally treated as a black-box. Experimental results on both simulated and real data affirm that P2S2 maintains denoising quality while significantly enhancing speed and memory efficiency achieved by training on a reduced data subset.
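The statistical leverage scores mentioned in contribution (3) have a standard definition: for a tall matrix with linearly independent columns, the score of row i is the squared norm of row i of any orthonormal basis of the column space. The sketch below computes them with modified Gram-Schmidt on a toy matrix. The function name and the toy data are assumptions, and the code says nothing about how P2S2 itself builds its sketch or coreset.

```python
import math

def leverage_scores(matrix):
    """Row leverage scores of a tall matrix with linearly independent columns.
    Builds an orthonormal column basis Q by modified Gram-Schmidt; the score
    of row i is the squared norm of row i of Q."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    basis = []
    for column in columns:
        v = column[:]
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-12:
            raise ValueError("columns are (numerically) linearly dependent")
        basis.append([a / norm for a in v])
    return [sum(q[r] ** 2 for q in basis) for r in range(n_rows)]

# Toy design matrix: an intercept column and one feature; the last row is an outlier.
design = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 10.0]]
scores = leverage_scores(design)
```

The scores sum to the number of columns, and rows that influence the least-squares fit disproportionately (here the outlier row) receive the largest values. Sampling rows with probability proportional to these scores is one common way to form a sketch of a regression problem.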
-
Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., the Text-driven Diffusion Model (TDM) and the Subject-augmented Diffusion Model (SDM), for scene and person generation respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM respectively. The subject-scene fusion stage is the collaboration between the two models, achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). It is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner, all of which can be seamlessly integrated into the DDIM sampling process. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser in generating high-fidelity person images depicting multiple unseen persons in varying contexts. Code is available at https://github.com/CodeGoat24/Face-diffuser.
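The spatial blending of two predicted noises could be sketched as follows. This is an illustrative assumption, not the authors' implementation: the classifier-free guidance response is taken here as the per-pixel absolute difference between conditional and unconditional noise, and each pixel takes its noise from whichever model responds more strongly. All names and shapes are hypothetical.

```python
import numpy as np

def cfg_response(eps_cond, eps_uncond):
    """Per-pixel guidance response, summed over channels (assumed saliency proxy)."""
    return np.abs(eps_cond - eps_uncond).sum(axis=0)

def fuse_noise(eps_tdm, eps_sdm, resp_tdm, resp_sdm):
    """Take SDM noise where its response is larger, TDM noise elsewhere."""
    mask = resp_sdm > resp_tdm                    # (H, W) boolean map
    return np.where(mask[None], eps_sdm, eps_tdm), mask

# Toy 1-channel, 2x2 example with hand-picked responses.
eps_tdm = np.zeros((1, 2, 2))
eps_sdm = np.ones((1, 2, 2))
resp_tdm = np.array([[1.0, 0.0], [0.0, 1.0]])
resp_sdm = np.array([[0.0, 1.0], [1.0, 0.0]])
fused, mask = fuse_noise(eps_tdm, eps_sdm, resp_tdm, resp_sdm)
```

In a sampler, a fused noise of this kind would replace the single-model noise prediction at each step of the fusion stage.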
-
Recent advancements in large vision-language models have enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we probe state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute a benchmark suite of increasing difficulty that probes different properties such as color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD.
-
Weakly-supervised action segmentation is the task of learning to partition a long video into several action segments, where training videos are accompanied only by transcripts (ordered lists of actions). Most existing methods need to infer a pseudo segmentation for training by serial alignment between all frames and the transcript, which is time-consuming and hard to parallelize during training. In this work, we aim to escape from this inefficient alignment with massive but redundant frames and instead directly localize a few action transitions for pseudo segmentation generation, where a transition refers to the change from an action segment to the next adjacent one in the transcript. As the true transitions are submerged in noisy boundaries due to intra-segment visual variation, we propose a novel Action-Transition-Aware Boundary Alignment (ATBA) framework to efficiently and effectively filter out noisy boundaries and detect transitions. In addition, to boost semantic learning when noise is inevitably present in the pseudo segmentation, we also introduce video-level losses to utilize the trusted video-level supervision. Extensive experiments show the effectiveness of our approach in both performance and training speed.
-
The ability to learn from context with novel concepts and deliver appropriate responses is essential in human conversations. Although current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) are trained on mega-scale datasets, recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores training-free few-shot learning, where models are encouraged to "learn to learn" from limited tasks and generalize to unseen tasks. In this work, we propose link-context learning (LCL), which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal relationship between the support set and the query set. By providing demonstrations with causal links, LCL guides the model to discern not only the analogy but also the underlying causal associations between data points, which empowers MLLMs to recognize unseen images and understand novel concepts more effectively. To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset, consisting exclusively of unseen generated image-label pairs designed for link-context learning. Extensive experiments show that our LCL-MLLM exhibits stronger link-context learning capabilities on novel concepts than vanilla MLLMs.
-
Large language models have achieved great success in recent years, and so have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about the image. However, it is still unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model and thus performs dense word grounding. Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention. We show our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome.
-
Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve the personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower than that of their finetuning counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
-
This paper proposes ConsistDreamer - a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically we design surrounding views as context-rich input for the 2D diffusion model and generate 3D-consistent structured noise instead of image-independent noise. Moreover we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions particularly in complicated large-scale indoor scenes from ScanNet++ with significantly improved sharpness and fine-grained textures. Notably ConsistDreamer stands as the first work capable of successfully editing complex (e.g. plaid/checkered) patterns.
-
Extracting keypoint locations from input hand frames known as 3D hand pose estimation is a critical task in various human-computer interaction applications. Essentially the 3D hand pose estimation can be regarded as a 3D point subset generative problem conditioned on input frames. Thanks to the recent significant progress on diffusion-based generative models hand pose estimation can also benefit from the diffusion model to estimate keypoint locations with high quality. However directly deploying the existing diffusion models to solve hand pose estimation is non-trivial since they cannot achieve the complex permutation mapping and precise localization. Based on this motivation this paper proposes HandDiff a diffusion-based hand pose estimation model that iteratively denoises accurate hand pose conditioned on hand-shaped image-point clouds. In order to recover keypoint permutation and accurate location we further introduce joint-wise condition and local detail condition. Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets. Codes and pre-trained models are publicly available at https://github.com/cwc1260/HandDiff.
-
When only a few annotated samples are available, the performance of learning-based object detection drops sharply. Many few-shot object detection (FSOD) methods have been proposed to tackle this issue by adopting image-level augmentations in a linear manner. Nevertheless, those handcrafted enhancements often suffer from limited diversity and a lack of semantic awareness, resulting in unsatisfactory performance. To this end, we propose a Semantic-guided Non-linear Instance-level Data Augmentation method (SNIDA) for FSOD that decouples the foreground and background to increase the diversity of each. We design a semantic awareness enhancement strategy to separate objects from backgrounds. Concretely, masks of instances are extracted by an unsupervised semantic segmentation module. Then the diversity of samples is improved by fusing instances into different backgrounds. Because existing traditional data augmentation methods augment images only within a limited transformation space, we introduce an object reconstruction enhancement module. The aim of this module is to generate sufficiently diverse, non-linear training data at the instance level through a semantic-guided masked autoencoder. In this way, the potential of the data can be fully exploited in various object detection scenarios. Extensive experiments on PASCAL VOC and MS-COCO demonstrate that the proposed method outperforms baselines by a large margin and achieves new state-of-the-art results under different shot settings.
-
Recent advances in instruction tuning have led to the development of state-of-the-art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answering (VQA). We find that, in general, LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts, such as questions in a QA pair, helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task, with only an 8.10% drop in performance compared to their visual counterparts, which dropped 99.73%. We also propose a new approach to real-world image classification, which we term query decomposition. By incorporating existence queries into our input prompt, we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under-explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.
-
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio language and vision when all modality pairs agree while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
-
Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles and is attracting increasing attention from both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel in 3D space equally during training. As the hard voxels have not been paid enough attention, the performance in some challenging regions is limited. The dense 3D space typically contains a large number of empty voxels, which are easy to learn but require a large amount of computation because existing models handle all voxels uniformly. Furthermore, the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper, we propose the HASSC approach to train the semantic scene completion model with a hardness-aware design. The global hardness from the network optimization process is defined for dynamic hard voxel selection. Then the local hardness with geometric anisotropy is adopted for voxel-wise refinement. In addition, a self-distillation strategy is introduced to make the training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively improve the accuracy of the baseline model without incurring extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC.
-
The lifting of a 3D structure and camera from 2D landmarks is a cornerstone of the discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g., C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. However, all these techniques have been limited by the fundamental need to establish correspondences across the 3D training data, significantly limiting their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage varying numbers of points per 3D data instance, withstands occlusions, and generalizes to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM), the first of its kind.
-
Recent innovations in text-to-3D generation have featured Score Distillation Sampling (SDS), which enables the zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However, current SDS-based models still struggle with intricate text prompts and commonly produce distorted 3D models with unrealistic textures or cross-view inconsistency issues. In this work, we introduce a novel Visual Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the visual appearance knowledge in a 2D visual prompt to boost text-to-3D generation. Instead of supervising SDS solely with a text prompt, VP3D first capitalizes on a 2D diffusion model to generate a high-quality image from the input text, which subsequently acts as a visual prompt to strengthen SDS optimization with explicit visual appearance. Meanwhile, we couple the SDS optimization with an additional differentiable reward function that encourages rendered images of 3D models to better align visually with the 2D visual prompt and match semantically with the text prompt. Through extensive experiments, we show that the 2D visual prompt in our VP3D significantly eases the learning of the visual appearance of 3D models and thus leads to higher visual fidelity with more detailed textures. It is also appealing that, when the self-generated visual prompt is replaced with a given reference image, VP3D is able to trigger a new task of stylized text-to-3D generation. Our project page is available at https://vp3d-cvpr24.github.io.
-
Undoubtedly high-fidelity 3D hair is crucial for achieving realism artistic expression and immersion in computer graphics. While existing 3D hair modeling methods have achieved impressive performance the challenge of achieving high-quality hair reconstruction persists: they either require strict capture conditions making practical applications difficult or heavily rely on learned prior data obscuring fine-grained details in images. To address these challenges we propose MonoHair a generic framework to achieve high-fidelity hair reconstruction from a monocular video without specific requirements for environments. Our approach bifurcates the hair modeling process into two main stages: precise exterior reconstruction and interior structure inference. The exterior is meticulously crafted using our Patch-based Multi-View Optimization PMVO. This method strategically collects and integrates hair information from multiple views independent of prior data to produce a high-fidelity exterior 3D line map. This map not only captures intricate details but also facilitates the inference of the hair's inner structure. For the interior we employ a data-driven multi-view 3D hair reconstruction method. This method utilizes 2D structural renderings derived from the reconstructed exterior mirroring the synthetic 2D inputs used during training. This alignment effectively bridges the domain gap between our training data and real-world data thereby enhancing the accuracy and reliability of our interior structure inference. Lastly we generate a strand model and resolve the directional ambiguity by our hair growth algorithm. Our experiments demonstrate that our method exhibits robustness across diverse hairstyles and achieves state-of-the-art performance. For more results please refer to our project page https://keyuwu-cs.github.io/MonoHair/
-
The absence of real targets to guide the model training is one of the main problems with the makeup transfer task. Most existing methods tackle this problem by synthesizing pseudo ground truths (PGTs). However the generated PGTs are often sub-optimal and their imprecision will eventually lead to performance degradation. To alleviate this issue in this paper we propose a novel Content-Style Decoupled Makeup Transfer (CSD-MT) method which works in a purely unsupervised manner and thus eliminates the negative effects of generating PGTs. Specifically based on the frequency characteristics analysis we assume that the low-frequency (LF) component of a face image is more associated with its makeup style information while the high-frequency (HF) component is more related to its content details. This assumption allows CSD-MT to decouple the content and makeup style information in each face image through the frequency decomposition. After that CSD-MT realizes makeup transfer by maximizing the consistency of these two types of information between the transferred result and input images respectively. Two newly designed loss functions are also introduced to further improve the transfer performance. Extensive quantitative and qualitative analyses show the effectiveness of our CSD-MT method. Our code is available at https://github.com/Snowfallingplum/CSD-MT.
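The frequency decomposition that the method relies on can be illustrated with a small sketch. The abstract does not state which filter CSD-MT uses, so the ideal circular low-pass mask in the Fourier domain and the `radius` parameter below are assumptions chosen for simplicity.

```python
import numpy as np

def split_frequencies(img, radius):
    """Split a 2D grayscale image into low- and high-frequency components."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))   # zero frequency moved to the centre
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    lf = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    hf = img - lf                                   # the residual holds the fine detail
    return lf, hf

img = np.arange(64, dtype=float).reshape(8, 8)      # hypothetical 8x8 "face image"
lf, hf = split_frequencies(img, radius=2)
```

By construction the two components sum back to the input, so no image information is lost by the split.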
-
Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having remarkable generalization ability, are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the novel perspective of the text prompt instead of the extensively studied model weights (frozen in this work). We first show that the effectiveness of both adversarial attack and defense is sensitive to the text prompt used. Inspired by this, we propose a method to improve resilience to adversarial attacks by learning a robust text prompt for VLMs. The proposed method, named Adversarial Prompt Tuning (APT), is effective while being both computationally and data efficient. Extensive experiments are conducted across 15 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show APT's superiority over hand-engineered prompts and other state-of-the-art adaptation methods. APT demonstrates excellent in-distribution performance as well as generalization under input distribution shift and across datasets. Surprisingly, by simply adding one learned word to the prompts, APT can significantly boost the accuracy and robustness (epsilon=4/255) over the hand-engineered prompts by +13% and +8.5% on average respectively. The improvement further increases in our most effective setting to +26.4% for accuracy and +16.7% for robustness. Code is available at https://github.com/TreeLLi/APT.
-
Continual test-time domain adaptation (CTTA) aims to adapt the source pre-trained model to a continually changing target domain without additional data acquisition or labeling costs. This setting requires improving performance on the present domain without labels while avoiding an excessive bias toward the current domain. Such bias exacerbates catastrophic forgetting and diminishes the generalization ability to future domains. To tackle the problem, this paper designs a versatile framework to capture high-quality supervision signals from three aspects: 1) adaptive thresholds are employed to determine the reliability of pseudo-labels; 2) the knowledge from the source pre-trained model is utilized to adjust the unreliable ones; and 3) by evaluating past supervision signals, we calculate a diversity score to ensure subsequent generalization. In this way, we form a complete supervisory signal generation framework, which captures discriminative information for the current domain while preserving generalization to future domains. Finally, to avoid catastrophic forgetting, we design a weighted soft parameter alignment method to explore the knowledge from the source model. Extensive experimental results demonstrate that our method performs well on several benchmark datasets.
-
Safety and robustness are crucial factors in developing trustworthy autonomous vehicles. One essential aspect of addressing these factors is to equip vehicles with the capability to predict future trajectories for all moving objects in the surroundings and quantify prediction uncertainties. In this paper we propose the Sequential Neural Variational Agent (SeNeVA) a generative model that describes the distribution of future trajectories for a single moving object. Our approach can distinguish Out-of-Distribution data while quantifying uncertainty and achieving competitive performance compared to state-of-the-art methods on the Argoverse 2 and INTERACTION datasets. Specifically a 0.446 meters minimum Final Displacement Error a 0.203 meters minimum Average Displacement Error and a 5.35% Miss Rate are achieved on the INTERACTION test set. Extensive qualitative and quantitative analysis is also provided to evaluate the proposed model. Our open-source code is available at https://github.com/PurdueDigitalTwin/seneva.
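For readers unfamiliar with the reported metrics, the following is a minimal sketch of how minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate are commonly computed over a set of candidate trajectories. The 2.0 m miss threshold on the final displacement is an assumption; each benchmark defines its own miss criterion, and the INTERACTION definition may differ.

```python
import math

def displacement_errors(pred, gt):
    """Per-step Euclidean distance between a predicted and a ground-truth trajectory."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]

def min_ade_fde(candidates, gt):
    """Best average and best final displacement error over all candidates."""
    ades, fdes = [], []
    for pred in candidates:
        errs = displacement_errors(pred, gt)
        ades.append(sum(errs) / len(errs))
        fdes.append(errs[-1])
    return min(ades), min(fdes)

def miss_rate(cases, threshold=2.0):
    """Fraction of cases whose best final displacement exceeds the threshold."""
    misses = sum(1 for cands, gt in cases if min_ade_fde(cands, gt)[1] > threshold)
    return misses / len(cases)

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],   # drifts away: ADE 1.0, FDE 2.0
    [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)],   # constant offset: ADE 0.5, FDE 0.5
]
far_off = [[(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]]  # single candidate, FDE 3.0

best_ade, best_fde = min_ade_fde(candidates, gt)
rate = miss_rate([(candidates, gt), (far_off, gt)])
```

In this toy example `best_ade` and `best_fde` are both 0.5, and one of the two cases counts as a miss.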
-
The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules which perform dot product computations among patchified image tokens. While self-attention modules empower ViTs to capture long-range dependencies the computational complexity grows quadratically with the number of tokens which is a major hindrance to the practical application of ViTs. Moreover the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly we argue against the necessity of computing the attention scores in every layer and we propose the Less-Attention Vision Transformer (LaViT) which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that leverage the previously calculated attention scores. This novel approach can mitigate two primary issues plaguing traditional self-attention modules: the heavy computational burden and attention saturation. Our proposed architecture offers superior efficiency and ease of implementation merely requiring matrix multiplications that are highly optimized in contemporary deep learning frameworks. Moreover our architecture demonstrates exceptional performance across various vision tasks including classification detection and segmentation.
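The idea of skipping attention-score computation in some layers can be sketched in its simplest form: compute the token-by-token score matrix once and reuse it unchanged in a later layer. This is a deliberate simplification; the paper applies learned attention transformations to the stored scores, which are not reproduced here, and all names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention_layer(x, w_q, w_k, w_v):
    """Standard single-head self-attention; the score matrix is tokens x tokens."""
    q, k = x @ w_q, x @ w_k
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # quadratic in the token count
    return scores @ (x @ w_v), scores

def reuse_attention_layer(x, scores, w_v):
    """Later layer that skips the Q/K projections and reuses stored scores."""
    return scores @ (x @ w_v)

rng = np.random.default_rng(0)
tokens, dim = 6, 4
x = rng.normal(size=(tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
w_v2 = rng.normal(size=(dim, dim))

out1, scores = full_attention_layer(x, w_q, w_k, w_v)
out2 = reuse_attention_layer(out1, scores, w_v2)   # no new score matrix is computed
```

The second layer needs only the value projection and one matrix product with the stored scores.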
-
Vision-Language Models (VLMs) are pretrained on large diverse and noisy web-crawled datasets. This underscores the critical need for dataset pruning as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning. We argue that this approach suffers from multiple limitations including: false positives and negatives due to CLIP's pretraining on noisy labels. We propose a pruning signal Sieve that employs synthetic captions generated by image-captioning models pretrained on small diverse and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. To bridge the gap between the limited diversity of generated captions and the high diversity of alternative text (alt-text) we estimate the semantic textual similarity in the embedding space of a language model pretrained on unlabeled text corpus. Using DataComp a multimodal dataset filtering benchmark when evaluating on 38 downstream tasks our pruning approach surpasses CLIPScore by 2.6% and 1.7% on medium and large scale respectively. In addition on retrieval tasks Sieve leads to a significant improvement of 2.7% and 4.5% on medium and large scale respectively.
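The pruning signal described above compares a synthetic caption with the original alt-text in a sentence-embedding space. A minimal sketch, assuming the embeddings have already been produced by some language model (the vectors below are made up) and that pruning keeps a fixed top fraction by cosine similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prune(samples, keep_fraction):
    """samples: (sample_id, alt_text_embedding, synthetic_caption_embedding) tuples.
    Returns the ids of the best-aligned samples, most similar first."""
    ranked = sorted(samples, key=lambda s: cosine(s[1], s[2]), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return [sample_id for sample_id, _, _ in ranked[:keep]]

samples = [
    ("a", [1.0, 0.0], [1.0, 0.0]),    # identical direction, similarity 1
    ("b", [1.0, 0.0], [0.0, 1.0]),    # orthogonal, similarity 0
    ("c", [1.0, 0.0], [1.0, 1.0]),    # similarity about 0.71
    ("d", [1.0, 0.0], [-1.0, 0.0]),   # opposite, similarity -1
]
kept = prune(samples, keep_fraction=0.5)
```

With these toy vectors, `kept` contains samples "a" and "c".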
-
In this paper we propose the first generalizable view synthesis approach that specifically targets multi-view stereo-camera images. Since recent stereo matching has demonstrated accurate geometry prediction we introduce stereo matching into novel-view synthesis for high-quality geometry reconstruction. To this end this paper proposes a novel framework dubbed StereoNeRF which integrates stereo matching into a NeRF-based generalizable view synthesis approach. StereoNeRF is equipped with three key components to effectively exploit stereo matching in novel-view synthesis: a stereo feature extractor a depth-guided plane-sweeping and a stereo depth loss. Moreover we propose the StereoNVS dataset the first multi-view dataset of stereo-camera images encompassing a wide variety of both real and synthetic scenes. Our experimental results demonstrate that StereoNeRF surpasses previous approaches in generalizable view synthesis.
-
We introduce DyNFL a novel neural field-based approach for high-fidelity re-simulation of LiDAR scans in dynamic driving scenes. DyNFL processes LiDAR measurements from dynamic environments accompanied by bounding boxes of moving objects to construct an editable neural field. This field comprising separately reconstructed static background and dynamic objects allows users to modify viewpoints adjust object positions and seamlessly add or remove objects in the re-simulated scene. A key innovation of our method is the neural field composition technique which effectively integrates reconstructed neural assets from various scenes through a ray drop test accounting for occlusions and transparent surfaces. Our evaluation with both synthetic and real-world environments demonstrates that DyNFL substantially improves dynamic scene LiDAR simulation offering a combination of physical fidelity and flexible editing capabilities. Project page: https://shengyuh.github.io/dynfl
-
Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M LAION-2B and DataComp-1B showing that disability content is rarely mentioned. We then provide three examples that illustrate how the performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios which we discuss alongside a set of other possible mitigations.
-
Test-time adaptation (TTA) has emerged as a viable solution to adapt pre-trained models to domain shifts using unlabeled test data. However TTA faces challenges of adaptation failures due to its reliance on blind adaptation to unknown test samples in dynamic scenarios. Traditional methods for out-of-distribution performance estimation are limited by unrealistic assumptions in the TTA context such as requiring labeled data or re-training models. To address this issue we propose AETTA a label-free accuracy estimation algorithm for TTA. We propose the prediction disagreement as the accuracy estimate calculated by comparing the target model prediction with dropout inferences. We then improve the prediction disagreement to extend the applicability of AETTA under adaptation failures. Our extensive evaluation with four baselines and six TTA methods demonstrates that AETTA shows an average of 19.8%p more accurate estimation compared with the baselines. We further demonstrate the effectiveness of accuracy estimation with a model recovery case study showcasing the practicality of our model recovery based on accuracy estimation. The source code is available at https://github.com/taeckyung/AETTA.
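The basic prediction-disagreement idea can be sketched as a plain agreement rate between the model's predictions and predictions from several dropout passes. This is an illustrative simplification and not the paper's full estimator; in particular, the refinement for adaptation failures is not reproduced, and the function name and inputs are hypothetical.

```python
def estimate_accuracy(model_preds, dropout_preds):
    """Label-free accuracy proxy from prediction agreement.

    model_preds: predicted label per sample from the adapted model.
    dropout_preds: one list of predicted labels per dropout inference pass.
    Returns the fraction of (pass, sample) pairs that agree with the model.
    """
    agree = 0
    total = 0
    for pass_preds in dropout_preds:
        for base, sampled in zip(model_preds, pass_preds):
            agree += int(base == sampled)
            total += 1
    return agree / total

model_preds = [0, 1, 2, 1]
dropout_preds = [
    [0, 1, 2, 0],   # pass 1 disagrees on the last sample
    [0, 1, 1, 1],   # pass 2 disagrees on the third sample
]
estimated = estimate_accuracy(model_preds, dropout_preds)
```

Here 6 of the 8 comparisons agree, so `estimated` is 0.75; no ground-truth labels are involved.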
-
In this work we present Digital Life Project a framework utilizing language as the universal medium to build autonomous 3D characters who are capable of engaging in social interactions and expressing with articulated body motions thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models personalities with systematic few-shot exemplars incorporates a reflection process based on psychology principles and emulates autonomy by initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis paradigm for controlling the character's digital body. It integrates motion matching a proven industry technique to ensure motion quality with cutting-edge advancements in motion generation for diversity. Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain. Collectively they enable virtual characters to initiate and sustain dialogues autonomously while evolving their socio-psychological states. Concurrently these characters can perform contextually relevant bodily movements. Additionally an extension of DLP enables a virtual character to recognize and appropriately respond to human players' actions.
-
3D Object Detectors (3D-OD) are crucial for understanding the environment in many robotic tasks especially autonomous driving. Including 3D information via Lidar sensors improves accuracy greatly. However such detectors perform poorly on domains they were not trained on i.e. different locations sensors weather etc. limiting their reliability in safety-critical applications. There exist methods to adapt 3D-ODs to these domains; however these methods treat 3D-ODs as a black box neglecting underlying architectural decisions and source-domain training strategies. Instead we dive deep into the details of 3D-ODs focusing our efforts on fundamental factors that influence robustness prior to domain adaptation. We systematically investigate four design choices (and the interplay between them) often overlooked in 3D-OD robustness and domain adaptation: architecture voxel encoding data augmentations and anchor strategies. We assess their impact on the robustness of nine state-of-the-art 3D-ODs across six benchmarks encompassing three types of domain gaps - sensor type weather and location. Our main findings are: (1) transformer backbones with local point features are more robust than 3D CNNs (2) test-time anchor size adjustment is crucial for adaptation across geographical locations significantly boosting scores without retraining (3) source-domain augmentations allow the model to generalize to low-resolution sensors and (4) surprisingly robustness to bad weather is improved when training directly on more clean weather data than on training with bad weather data. We outline our main conclusions and findings to provide practical guidance on developing more robust 3D-ODs.
-
Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g. STEGO) or class-agnostic instance segmentation (e.g. CutLER) but not both (i.e. panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at performing various image segmentation tasks---instance semantic and panoptic---using a novel unified framework. U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models followed by clustering; each cluster represents different semantic and/or instance membership of pixels. We then self-train the model on these pseudo semantic labels yielding substantial performance gains over specialized methods tailored to each task: a +2.6 APbox boost (vs. CutLER) in unsupervised instance segmentation on COCO and a +7.0 PixelAcc increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff. Moreover our method sets up a new baseline for unsupervised panoptic segmentation which has not been previously explored. U2Seg is also a strong pretrained model for few-shot segmentation surpassing CutLER by +5.0 APmask when trained on a low-data regime e.g. only 1% COCO labels. We hope our simple yet effective method can inspire more research on unsupervised universal image segmentation.
-
Few-shot segmentation remains challenging due to the limited labeling information available for unseen classes. Most previous approaches rely on extracting high-level feature maps from a frozen visual encoder to compute pixel-wise similarity as a key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes, since these high-level feature maps have an obvious category bias. In this work, we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance model generalization. Specifically, we design two kinds of training-free prior information generation strategies that utilize the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. Besides, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and utilize it to refine the initial prior information. Experiments on both the PASCAL-5i and COCO-20i datasets show that our method obtains a substantial improvement and reaches new state-of-the-art performance. The code is available on the project website.
-
There are five types of trajectory prediction tasks: deterministic, stochastic, domain adaptation, momentary observation, and few-shot. These associated tasks are defined by various factors, such as the length of input paths, data split, and pre-processing methods. Interestingly, even though they commonly take sequential coordinates of observations as input and infer future paths in the same coordinates as output, designing specialized architectures for each task is still necessary. When such a model is applied to another task, generality issues can lead to sub-optimal performance. In this paper, we propose SingularTrajectory, a diffusion-based universal trajectory prediction framework to reduce the performance gap across the five tasks. The core of SingularTrajectory is to unify a variety of human dynamics representations on the associated tasks. To do this, we first build a Singular space to project all types of motion patterns from each task into one embedding space. We next propose an adaptive anchor working in the Singular space. Unlike traditional fixed anchor methods that sometimes yield unacceptable paths, our adaptive anchor corrects anchors that are placed in wrong locations, based on a traversability map. Finally, we adopt a diffusion-based predictor to further enhance the prototype paths using a cascaded denoising process. Our unified framework ensures generality across various benchmark settings such as input modality and trajectory lengths. Extensive experiments on five public benchmarks demonstrate that SingularTrajectory substantially outperforms existing models, highlighting its effectiveness in estimating general dynamics of human movements. Code is publicly available at https://github.com/inhwanbae/SingularTrajectory.
-
In this paper, we explore a novel and challenging generation task, i.e., Handwritten Mathematical Expression Generation (HMEG) from symbolic sequences. Since symbolic sequences are naturally graph-structured data, we formulate HMEG as a graph-to-image (G2I) generation problem. Unlike the generation of natural images, HMEG requires critical layout clarity for synthesizing correct and recognizable formulas, but has no real masks available to supervise the learning process. To alleviate this challenge, we propose a novel end-to-end G2I generation pipeline (i.e., graph - layout - mask - image), which requires no real masks or non-differentiable alignment between layouts and masks. Technically, to boost the capacity of predicting detailed relations among adjacent symbols, we propose a Less-is-More (LiM) learning strategy. In addition, we design a differentiable layout refinement module, which maps bounding boxes to pixel-level soft masks, so as to further alleviate ambiguous layout areas. Our whole model, including layout prediction, mask refinement, and image generation, can be jointly optimized in an end-to-end manner. Experimental results show that our model can generate high-quality HME images and outperforms previous generative methods. Besides, a series of ablation studies demonstrates the effectiveness of the proposed techniques. Finally, we validate that our generated images boost the performance of HME recognition models through data augmentation. Our code and results are available at: https://github.com/AiArt-HDU/HMEG.
-
Efficient transfer learning (ETL) is receiving increasing attention as a way to adapt large pre-trained language-vision models to downstream tasks with a few labeled samples. While significant progress has been made, we reveal that state-of-the-art ETL approaches exhibit strong performance only in narrowly defined experimental setups, and with a careful adjustment of hyperparameters based on a large corpus of labeled samples. In particular, we make two interesting and surprising empirical observations. First, to outperform a simple Linear Probing baseline, these methods require optimizing their hyperparameters on each target task. Second, they typically underperform (sometimes dramatically) standard zero-shot predictions in the presence of distributional drifts. Motivated by the unrealistic assumptions made in the existing literature, i.e., access to a large validation set and case-specific grid-search for optimal hyperparameters, we propose a novel approach that meets the requirements of real-world scenarios. More concretely, we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing term is optimized via an adaptation of the general Augmented Lagrangian method tailored to this context. We comprehensively evaluate CLAP on a broad span of datasets and scenarios, demonstrating that it consistently outperforms SoTA approaches while being a much more efficient alternative.
-
Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process, which can be automated by emerging text-to-video diffusion models. Despite great promise, video diffusion models are difficult to control, hindering rather than amplifying users' creativity. To address this challenge, we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. For this purpose, our approach takes an animated, low-fidelity rendered mesh as input and injects the ground truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality and temporally consistent frames. We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
-
The fidelity of relighting is bounded by both geometry and appearance representations. For geometry both mesh and volumetric approaches have difficulty modeling intricate structures like 3D hair geometry. For appearance existing relighting models are limited in fidelity and often too slow to render in real-time with high-resolution continuous environments. In this work we present Relightable Gaussian Codec Avatars a method to build high-fidelity relightable head avatars that can be animated to generate novel expressions. Our geometry model based on 3D Gaussians can capture 3D-consistent sub-millimeter details such as hair strands and pores on dynamic face sequences. To support diverse materials of human heads such as the eyes skin and hair in a unified manner we present a novel relightable appearance model based on learnable radiance transfer. Together with global illumination-aware spherical harmonics for the diffuse components we achieve real-time relighting with all-frequency reflections using spherical Gaussians. This appearance model can be efficiently relit under both point light and continuous illumination. We further improve the fidelity of eye reflections and enable explicit gaze control by introducing relightable explicit eye models. Our method outperforms existing approaches without compromising real-time performance. We also demonstrate real-time relighting of avatars on a tethered consumer VR headset showcasing the efficiency and fidelity of our avatars.
-
In this paper we explore the capability of an agent to construct a logical sequence of action steps thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets such as heavy intermediate visual observations procedural names or natural language step-by-step instructions for features or supervision signals. However the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked we propose to enhance the agent's capabilities by infusing it with procedural knowledge. This knowledge sourced from training procedure plans and structured as a directed weighted graph equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP a novel Knowledge-Enhanced Procedure Planning system which harnesses a probabilistic procedural knowledge graph extracted from training data effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior state-of-the-art results while requiring only minimal supervision. Code and trained model are available at https://github.com/Ravindu-Yasas-Nagasinghe/KEPP
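The "directed weighted graph" of procedural knowledge mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that each training plan is an ordered list of step names and that edge weights are transition frequencies normalised per source step; the function names and the toy cooking plans are hypothetical, and the actual graph construction in KEPP may differ (see the authors' repository).

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_procedure_graph(plans: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Count step-to-step transitions and normalise them per source step."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for plan in plans:
        # Each consecutive pair of steps contributes one directed edge.
        for src, dst in zip(plan, plan[1:]):
            counts[src][dst] += 1
    graph: Dict[str, Dict[str, float]] = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        graph[src] = {dst: n / total for dst, n in outgoing.items()}
    return graph


def most_likely_next(graph: Dict[str, Dict[str, float]], step: str) -> Optional[str]:
    """Return the highest-probability successor of `step`, or None if unseen."""
    outgoing = graph.get(step)
    if not outgoing:
        return None
    return max(outgoing, key=lambda dst: outgoing[dst])


# Hypothetical training plans.
plans = [
    ["crack egg", "whisk", "pour", "fry"],
    ["crack egg", "whisk", "season", "fry"],
    ["crack egg", "whisk", "pour", "bake"],
]
graph = build_procedure_graph(plans)
next_step = most_likely_next(graph, "whisk")
```

Here "whisk" is followed by "pour" in two of three plans, so the edge whisk -> pour has weight 2/3 and is returned as the most likely successor. A planner could query such a graph for feasible paths between a start step and a goal step.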
-
It is challenging for Neural Radiance Fields (NeRFs) in the few-shot setting to reconstruct high-quality novel views and depth maps in 360° outward-facing indoor scenes. The captured sparse views for these scenes usually contain large viewpoint variations. This greatly reduces the potential consistency between views, leading NeRFs to degrade a lot in these scenarios. Existing methods usually leverage pretrained depth prediction models to improve NeRFs. However, these methods cannot guarantee geometry consistency due to the inherent geometry ambiguity in the pretrained models, thus limiting NeRFs' performance. In this work, we present P²NeRF to capture global and hierarchical geometry consistency priors from pretrained models, thus facilitating few-shot NeRFs in 360° outward-facing indoor scenes. On the one hand, we propose a matching-based geometry warm-up strategy to provide global geometry consistency priors for NeRFs. This effectively avoids the overfitting of early training with sparse inputs. On the other hand, we propose a group depth ranking loss and ray weight mask regularization based on the monocular depth estimation model. This provides hierarchical geometry consistency priors for NeRFs. As a result, our approach can fully leverage the geometry consistency priors from pretrained models and help few-shot NeRFs achieve state-of-the-art performance on two challenging indoor datasets. Our code is released at https://github.com/XT5un/P2NeRF.
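To give a feel for what a depth ranking loss does, the sketch below implements a plain pairwise ranking hinge: whenever a monocular depth estimate says one ray is nearer than another, the rendered depths are penalised if they violate that ordering. This is an assumption-laden simplification; the grouping scheme of the paper's group depth ranking loss and its ray weight mask regularization are not reproduced, and the function name, the `margin` parameter, and the toy values are hypothetical.

```python
from typing import Sequence


def depth_ranking_loss(
    rendered: Sequence[float],
    mono: Sequence[float],
    margin: float = 1e-4,
) -> float:
    """Mean hinge penalty over ray pairs whose rendered order contradicts `mono`.

    rendered -- depths rendered by the NeRF for a set of rays
    mono     -- monocular depth estimates for the same rays (only their
                relative order is used, so scale ambiguity does not matter)
    """
    if len(rendered) != len(mono):
        raise ValueError("rendered and mono must have the same length")
    total = 0.0
    pairs = 0
    n = len(rendered)
    for i in range(n):
        for j in range(i + 1, n):
            if mono[i] == mono[j]:
                continue  # no ordering information for this pair
            near, far = (i, j) if mono[i] < mono[j] else (j, i)
            # Penalise when the "near" ray is rendered farther than the "far" ray.
            total += max(0.0, rendered[near] - rendered[far] + margin)
            pairs += 1
    return total / pairs if pairs else 0.0


consistent = depth_ranking_loss([1.0, 2.0, 3.0], [0.1, 0.2, 0.3])
reversed_order = depth_ranking_loss([3.0, 2.0, 1.0], [0.1, 0.2, 0.3], margin=0.0)
```

Because only the ordering of the monocular depths is used, the loss is insensitive to the unknown scale and shift of the pretrained depth model, which is the usual motivation for ranking-style supervision.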
-
Knowledge distillation (KD) has been applied to various tasks successfully, and mainstream methods typically boost the student model via spatial imitation losses. However, the consecutive downsampling induced in the spatial domain of the teacher model is a type of corruption, hindering the student from analyzing what specific information needs to be imitated, which results in accuracy degradation. To better understand the underlying pattern of corrupted feature maps, we shift our attention to the frequency domain. During frequency distillation, we encounter a new challenge: the low-frequency bands convey general but minimal context, while the high-frequency bands are more informative but also introduce noise. Not every pixel within the frequency bands contributes equally to performance. To address the above problems: (1) we propose the Frequency Prompt, plugged into the teacher model, absorbing the semantic frequency context during finetuning; (2) during the distillation period, a pixel-wise frequency mask is generated via the Frequency Prompt to localize the pixels of interest (PoIs) in various frequency bands. Additionally, we employ a position-aware relational frequency loss for dense prediction tasks, delivering a high-order spatial enhancement to the student model. We dub our Frequency Knowledge Distillation method FreeKD, which determines the optimal localization and extent for the frequency distillation. Extensive experiments demonstrate that FreeKD not only outperforms spatial-based distillation methods consistently on dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also conveys more robustness to the student. Notably, we also validate the generalization of our approach on large-scale vision models (e.g., DINO and SAM).
-
We introduce PlausiVL a large video-language model for anticipating action sequences that are plausible in the real-world. While significant efforts have been made towards anticipating future actions prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation we explore the generative capability of a large video-language model in our work and further develop the understanding of plausibility in an action sequence by introducing two objective functions a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with plausible action sequence learning loss. This loss helps the model to differentiate between plausible and not plausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization the model is able to generate diverse plausible action sequences. We evaluate our approach on two large-scale datasets Ego4D and EPIC-Kitchens-100 and show improvements on the task of action anticipation.
-
In Visual Place Recognition (VPR) the pose of a query image is estimated by comparing the image to a map of reference images with known reference poses. As is typical for image retrieval problems a feature extractor maps the query and reference images to a feature space where a nearest neighbor search is then performed. However till recently little attention has been given to quantifying the confidence that a retrieved reference image is a correct match. Highly certain but incorrect retrieval can lead to catastrophic failure of VPR-based localization pipelines. This work compares for the first time the main approaches for estimating the image-matching uncertainty including the traditional retrieval-based uncertainty estimation more recent data-driven aleatoric uncertainty estimation and the compute-intensive geometric verification. We further formulate a simple baseline method "SUE" which unlike the other methods considers the freely-available poses of the reference images in the map. Our experiments reveal that a simple L2-distance between the query and reference descriptors is already a better estimate of image-matching uncertainty than current data-driven approaches. SUE outperforms the other efficient uncertainty estimation methods and its uncertainty estimates complement the computationally expensive geometric verification approach. Future works for uncertainty estimation in VPR should consider the baselines discussed in this work.
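The retrieval-based baseline discussed above, where the L2 distance between query and reference descriptors doubles as an uncertainty estimate, can be sketched as follows. This is only an illustration of that simple baseline, not of SUE itself (which additionally uses the reference poses); the descriptors are assumed to be plain lists of floats produced by some feature extractor, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_with_uncertainty(
    query: Sequence[float],
    references: List[Sequence[float]],
) -> Tuple[int, float]:
    """Nearest-neighbour search; the best-match distance serves as uncertainty.

    Returns (index of the closest reference, its L2 distance to the query).
    A larger distance is read as a less trustworthy match.
    """
    if not references:
        raise ValueError("the reference map is empty")
    distances = [l2_distance(query, ref) for ref in references]
    best = min(range(len(distances)), key=lambda i: distances[i])
    return best, distances[best]


# Toy 2-D descriptors (made-up values).
reference_map = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
exact_idx, exact_unc = retrieve_with_uncertainty([0.0, 1.0], reference_map)
fuzzy_idx, fuzzy_unc = retrieve_with_uncertainty([0.5, 0.5], reference_map)
```

A downstream localization pipeline could, for example, reject matches whose distance exceeds a threshold chosen on validation data rather than trusting every top-1 retrieval.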
-
Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be directly used on prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+, and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
-
The pursuit of accurate 3D hand pose estimation stands as a keystone for understanding human activity in the realm of egocentric vision. The majority of existing estimation methods still rely on single-view images as input leading to potential limitations e.g. limited field-of-view and ambiguity in depth. To address these problems adding another camera to better capture the shape of hands is a practical direction. However existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) Requiring multi-view annotations for training which are expensive. 2) During testing the model becomes inapplicable if camera parameters/layout are not the same as those used in training. In this paper we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods 1) our adaptation process is unsupervised eliminating the need for multi-view annotation. 2) Moreover our method can handle arbitrary dual-view pairs with unknown camera parameters making the model applicable to diverse camera settings. Specifically S2DHand is built on certain stereo constraints including pair-wise cross-view consensus and invariance of transformation between both views. These two stereo constraints are used in a complementary manner to generate pseudo-labels allowing reliable adaptation. Evaluation results reveal that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings and outperforms existing adaptation methods with leading performance. Project page: https://github.com/ut-vision/S2DHand.
-
We propose a computational imaging method for time-efficient light-field acquisition that combines a coded aperture with an event-based camera. Different from the conventional coded-aperture imaging method our method applies a sequence of coding patterns during a single exposure for an image frame. The parallax information which is related to the differences in coding patterns is recorded as events. The image frame and events all of which are measured in a single exposure are jointly used to computationally reconstruct a light field. We also designed an algorithm pipeline for our method that is end-to-end trainable on the basis of deep optics and compatible with real camera hardware. We experimentally showed that our method can achieve more accurate reconstruction than several other imaging methods with a single exposure. We also developed a hardware prototype with the potential to complete the measurement on the camera within 22 msec and demonstrated that light fields from real 3-D scenes can be obtained with convincing visual quality. Our software and supplementary video are available from our project website.
-
Event-based Vision Sensors (EVS) are gaining popularity for enhancing CMOS Image Sensor (CIS) video capture. Nonidealities of EVS, such as pixel or readout latency, can significantly influence the quality of the enhanced images and warrant dedicated consideration in the design of fusion algorithms. We present a novel approach for jointly computing deblurred, rolling-shutter-artifact-corrected, high-speed videos with frame rates up to 10000 FPS, using inherently blurry rolling-shutter CIS frames of 120 FPS to 150 FPS in conjunction with EVS data from a hybrid CIS-EVS sensor. EVS pixel latency, readout latency, and the sensor's refractory period are explicitly incorporated into the measurement model. This inverse function problem is solved in a per-pixel manner using an optimization-based framework. The interpolated images are subsequently processed by a novel refinement network. The proposed method is evaluated using simulated and measured datasets under natural and controlled environments. Extensive experiments show a reduced shadowing effect, a 4 dB increase in PSNR, and a 12% improvement in LPIPS score compared to state-of-the-art methods.
-
Weakly-supervised Video Anomaly Detection (wVAD) aims to detect frame-level anomalies using only video-level labels in training. Due to the limitation of coarse-grained labels, Multi-Instance Learning (MIL) is prevailing in wVAD. However, MIL suffers from the insufficiency of binary supervision to model diverse abnormal patterns. Besides, the coupling between abnormality and its context hinders the learning of clear abnormal event boundaries. In this paper, we propose prompt-enhanced MIL to detect various abnormal events while ensuring clear event boundaries. Concretely, we design abnormal-aware prompts by using abnormal class annotations together with learnable prompts, which can incorporate semantic priors into video features dynamically. The detector can utilize the semantic-rich features to capture diverse abnormal patterns. In addition, a normal context prompt is introduced to amplify the distinction between abnormality and its context, facilitating the generation of clear boundaries. With the mutual enhancement of the abnormal-aware and normal context prompts, the model can construct discriminative representations to detect divergent anomalies without ambiguous event boundaries. Extensive experiments demonstrate that our method achieves SOTA performance on three public benchmarks. The code is available at https://github.com/Junxi-Chen/PE-MIL.
-
Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve the consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on image animation benchmarks, achieving state-of-the-art results.
-
Benefiting from large-scale pre-trained text-to-image (T2I) generative models, impressive progress has been achieved in customized image generation, which aims to generate user-specified concepts. Existing approaches have extensively focused on single-concept customization and still encounter challenges when it comes to complex scenarios that involve combining multiple concepts. These approaches often require retraining/fine-tuning using a few images, leading to time-consuming training processes and impeding their swift implementation. Furthermore, the reliance on multiple images to represent a singular concept increases the difficulty of customization. To this end, we propose FreeCustom, a novel tuning-free method to generate customized images of multi-concept composition based on reference concepts, using only one image per concept as input. Specifically, we introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy that enable the generated image to access and focus more on the reference concepts. In addition, MRSA leverages our key finding that input concepts are better preserved when providing images with context interactions. Experiments show that the images produced by our method are consistent with the given concepts and better aligned with the input text. Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization, but is simpler. Code can be found at https://github.com/aim-uofa/FreeCustom.
-
Sequence-to-sequence vision-language models are showing promise but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model trained with a Query-CTC loss that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens rather than restricting to conditional distribution as in an autoregressive model. The resulting model NARVL achieves performance on-par with its state-of-the-art autoregressive counterpart but is faster at inference time reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference.
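The phrase "marginalizes over multiple inference paths" can be made concrete with the standard CTC formulation, sketched below by brute force. This shows ordinary CTC under the assumption of independent per-position token distributions (as in parallel decoding); it is not the paper's Query-CTC loss, whose details are not given in the abstract. The blank index, function names, and toy probabilities are all assumptions for illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank token


def collapse(path: Sequence[int]) -> Tuple[int, ...]:
    """Standard CTC collapse: merge repeated tokens, then drop blanks."""
    out = []
    prev = None
    for tok in path:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return tuple(out)


def ctc_marginal(step_probs: List[Sequence[float]], target: Sequence[int]) -> float:
    """Probability of `target`, summed over every path that collapses to it.

    step_probs[t][v] is the probability of token v at output position t,
    with positions treated as independent. Exhaustive enumeration is
    exponential in the number of positions, so this is for tiny examples only;
    practical implementations use dynamic programming instead.
    """
    vocab = range(len(step_probs[0]))
    wanted = tuple(target)
    total = 0.0
    for path in product(vocab, repeat=len(step_probs)):
        if collapse(path) == wanted:
            prob = 1.0
            for t, tok in enumerate(path):
                prob *= step_probs[t][tok]
            total += prob
    return total


# Two output positions, vocabulary {blank, token 1} (made-up probabilities).
probs = [[0.4, 0.6], [0.7, 0.3]]
p_single = ctc_marginal(probs, [1])  # paths (1,0), (0,1), (1,1)
p_empty = ctc_marginal(probs, [])    # path (0,0)
```

All positions are scored in one parallel pass, and the loss credits every alignment that yields the target, which is why the decoder does not need to generate tokens one after another.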
-
Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging deployment in real applications. To address these issues, this paper breaks down the text-based video editing task into two stages. First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way. Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the edited keyframes, using the structural guidance from intermediate frames. Experimental results suggest that MaskINT achieves comparable performance with diffusion-based methodologies while significantly improving the inference time. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.
-
Pre-trained Vision Language Models (VLMs) have demonstrated notable progress in various zero-shot tasks, such as classification and retrieval. Despite this performance, adaptation remains essential, because improving performance on new tasks requires task-specific knowledge. While labels are needed for the adaptation, acquiring them is typically expensive. To overcome this challenge, active learning, a method of achieving high performance by obtaining labels for a small number of samples from experts, has been studied. Active learning primarily focuses on selecting unlabeled samples for labeling and leveraging them to train models. In this study, we pose the question, "how can the pre-trained VLMs be adapted under the active learning framework?" In response to this inquiry, we observe that (1) simply applying a conventional active learning framework to pre-trained VLMs may even degrade performance compared to random selection because of the class imbalance in labeling candidates, and (2) the knowledge of VLMs can provide hints for achieving the balance before labeling. Based on these observations, we devise a novel active learning framework for VLMs, denoted as PCB. To assess the effectiveness of our approach, we conduct experiments on seven different real-world datasets, and the results demonstrate that PCB surpasses conventional active learning and random sampling methods.
-
Current metrics for text-to-image models typically rely on statistical metrics which inadequately represent the real preference of humans. Although recent work attempts to learn these preferences via human-annotated images, it reduces the rich tapestry of human preference to a single overall score. However, the preference results vary when humans evaluate images with respect to different aspects. Therefore, to learn the multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces the preference condition module upon the CLIP model to learn these diverse preferences. It is trained based on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of the latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation. The model and dataset will be made publicly available to facilitate future research.
-
Generating novel views of an object from a single image is a challenging task. It requires an understanding of the underlying 3D structure of the object from an image and rendering high-quality spatially consistent new views. While recent methods for view synthesis based on diffusion have shown great progress achieving consistency among various view estimates and at the same time abiding by the desired camera pose remains a critical problem yet to be solved. In this work we demonstrate a strikingly simple method where we utilize a pre-trained video diffusion model to solve this problem. Our key idea is that synthesizing a novel view could be reformulated as synthesizing a video of a camera going around the object of interest---a scanning video---which then allows us to leverage the powerful priors that a video diffusion model would have learned. Thus to perform novel-view synthesis we create a smooth camera trajectory to the target view that we wish to render and denoise using both a view-conditioned diffusion model and a video diffusion model. By doing so we obtain a highly consistent novel view synthesis outperforming the state of the art.
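The "smooth camera trajectory to the target view" can be pictured as an interpolated orbit around the object. The following is a minimal sketch, assuming a circular orbit at fixed radius and elevation with linear azimuth interpolation; the function name and parameterization are illustrative assumptions, and the paper's actual trajectory construction may differ.

```python
import math

def orbit_trajectory(start_deg, target_deg, num_frames, radius=1.0, elevation_deg=0.0):
    """Return camera positions (x, y, z) on a circular orbit around the origin,
    moving from the start azimuth to the target azimuth along the shorter arc."""
    # Wrap the azimuth difference into [-180, 180) so the shorter arc is taken.
    delta = (target_deg - start_deg + 180.0) % 360.0 - 180.0
    elev = math.radians(elevation_deg)
    poses = []
    for k in range(num_frames):
        # Interpolation parameter in [0, 1]; a single frame jumps to the target.
        t = k / (num_frames - 1) if num_frames > 1 else 1.0
        az = math.radians(start_deg + t * delta)
        poses.append((radius * math.cos(elev) * math.cos(az),
                      radius * math.cos(elev) * math.sin(az),
                      radius * math.sin(elev)))
    return poses

# Three evenly spaced views from azimuth 0 to azimuth 90 degrees.
poses = orbit_trajectory(0.0, 90.0, num_frames=3)
```

Each intermediate pose would correspond to one frame of the "scanning video", with the last frame at the requested target view.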
-
Accurately detecting active objects undergoing state changes is essential for comprehending human interactions and facilitating decision-making. The existing methods for active object detection (AOD) primarily rely on visual appearance of the objects within input such as changes in size shape and relationship with hands. However these visual changes can be subtle posing challenges particularly in scenarios with multiple distracting no-change instances of the same category. We observe that the state changes are often the result of an interaction being performed upon the object thus propose to use informed priors about object related plausible interactions (including semantics and visual appearance) to provide more reliable cues for AOD. Specifically we propose a knowledge aggregation procedure to integrate the aforementioned informed priors into oracle queries within the teacher decoder offering more object affordance commonsense to locate the active object. To streamline the inference process and reduce extra knowledge inputs we propose a knowledge distillation approach that encourages the student decoder to mimic the detection capabilities of the teacher decoder using the oracle query by replicating its predictions and attention. Our proposed framework achieves state-of-the-art performance on four datasets namely Ego4D Epic-Kitchens MECCANO and 100DOH which demonstrates the effectiveness of our approach in improving AOD. The code and models are available at https://github.com/idejie/KAD.git.
-
Deep neural networks (DNNs) struggle to learn in dynamic settings because they mainly rely on static datasets. Continual learning (CL) aims to overcome this limitation by enabling DNNs to incrementally accumulate knowledge. A widely adopted scenario in CL is class-incremental learning (CIL) where DNNs are required to sequentially learn more classes. Among the various strategies in CL replay methods which revisit previous classes stand out as the only effective ones in CIL. Other strategies such as architectural modifications to segregate information across weights and protect them from change are ineffective in CIL. This is because they need additional information during testing to select the correct network parts to use. In this paper we propose NICE Neurogenesis Inspired Contextual Encoding a replay-free architectural method inspired by adult neurogenesis in the hippocampus. NICE groups neurons in the DNN based on different maturation stages and infers which neurons to use during testing without any additional signal. Through extensive experiments across 6 datasets and 3 architectures we show that NICE performs on par with or often outperforms replay methods. We also make the case that neurons exhibit highly distinctive activation patterns for the classes in which they specialize enabling us to determine when they should be used. The code is available at https://github.com/BurakGurbuz97/NICE.
-
Generating human motions from textual descriptions has gained growing research interest due to its wide range of applications. However only a few works consider human-scene interactions together with text conditions which is crucial for visual and physical realism. This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions. This task presents challenges due to the multimodality nature of text scene and motion as well as the need for spatial reasoning. To address these challenges we propose a new approach that decomposes the complex problem into two more manageable sub-problems: (1) language grounding of the target object and (2) object-centric motion generation. For language grounding of the target object we leverage the power of large language models. For motion generation we design an object-centric scene representation for the generative model to focus on the target object thereby reducing the scene complexity and facilitating the modeling of the relationship between human motions and the object. Experiments demonstrate the better motion quality of our approach compared to baselines and validate our design choices. Code will be available at https://zju3dv.github.io/text_scene_motion.
-
This paper addresses the critical challenges of sparsity and occlusion in LiDAR-based 3D object detection. Current methods often rely on supplementary modules or specific architectural designs potentially limiting their applicability to new and evolving architectures. To our knowledge we are the first to propose a versatile technique that seamlessly integrates into any existing framework for 3D Object Detection marking the first instance of Weak-to-Strong generalization in 3D computer vision. We introduce a novel framework X-Ray Distillation with Object-Complete Frames suitable for both supervised and semi-supervised settings that leverages the temporal aspect of point cloud sequences. This method extracts crucial information from both previous and subsequent LiDAR frames creating Object-Complete frames that represent objects from multiple viewpoints thus addressing occlusion and sparsity. Given the limitation of not being able to generate Object-Complete frames during online inference we utilize Knowledge Distillation within a Teacher-Student framework. This technique encourages the strong Student model to emulate the behavior of the weaker Teacher which processes simple and informative Object-Complete frames effectively offering a comprehensive view of objects as if seen through X-ray vision. Our proposed methods surpass state-of-the-art in semi-supervised learning by 1-1.5 mAP and enhance the performance of five established supervised models by 1-2 mAP on standard autonomous driving datasets even with default hyperparameters. Code for Object-Complete frames is available here: https://github.com/sakharok13/X-Ray-Teacher-Patching-Tools.
-
Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved, establishing robust correspondences between audio and visual contents poses unique challenges due to (1) complex entanglement across sound sources and (2) frequent changes in the occurrence of distinct sound events. Assuming sound events occur independently, the multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces. We are motivated to decompose the multi-source audio semantics into single-source semantics for more effective interactions with visual content. We propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several disentangled and noise-suppressed single-source semantics. Furthermore, we introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones, to handle frequent changes in audio semantics. Extensive experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance, e.g., +21.2% mIoU on the challenging AVS-Semantic benchmark with a ResNet50 backbone.
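Product quantization, which the method builds on, can be sketched in a few lines. This is a generic illustration of the technique, not the paper's model: the codebooks here are hand-picked toy values (in practice they are learned), and the function names are assumptions. With M sub-codebooks of K codewords each, the composite codes span a Cartesian product of K^M combinations.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec under squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def pq_encode(vector, codebooks):
    """Split vector into len(codebooks) equal chunks and quantize each chunk
    independently against its own sub-codebook."""
    m = len(codebooks)
    d = len(vector) // m  # dimensionality of each sub-vector
    return [nearest(codebooks[j], vector[j * d:(j + 1) * d]) for j in range(m)]

def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the selected codewords."""
    out = []
    for j, idx in enumerate(codes):
        out.extend(codebooks[j][idx])
    return out

# Two toy sub-codebooks with two 2-D codewords each (2 * 2 = 4 composite codes).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
approx = pq_decode(codes, codebooks)
```

Each entry of `codes` indexes one sub-space independently, which mirrors the idea of representing a multi-source signal by several single-source components.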
-
Active recognition which allows intelligent agents to explore observations for better recognition performance serves as a prerequisite for various embodied AI tasks such as grasping navigation and room arrangements. Given the evolving environment and the multitude of object classes it is impractical to include all possible classes during the training stage. In this paper we aim at advancing active open-vocabulary recognition empowering embodied agents to actively perceive and classify arbitrary objects. However directly adopting recent open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) poses its unique challenges. Specifically we observe that CLIP's performance is heavily affected by the viewpoint and occlusions compromising its reliability in unconstrained embodied perception scenarios. Further the sequential nature of observations in agent-environment interactions necessitates an effective method for integrating features that maintains discriminative strength for open-vocabulary classification. To address these issues we introduce a novel agent for active open-vocabulary recognition. The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features without relying on class-specific knowledge. Compared to baseline CLIP model with 29.6% accuracy on ShapeNet dataset the proposed agent could achieve 53.3% accuracy for open-vocabulary recognition without any fine-tuning to the equipped CLIP model. Additional experiments conducted with the Habitat simulator further affirm the efficacy of our method.
-
Deep neural networks have played a crucial part in many critical domains, such as autonomous driving, face recognition, and medical diagnosis. However, deep neural networks face security threats from backdoor attacks and can be manipulated into attacker-decided behaviors by the backdoor attacker. To defend against backdoors, prior research has focused on using clean data to remove backdoors before model deployment. In this paper, we investigate the possibility of defending against backdoor attacks by utilizing test-time partially poisoned data to remove the backdoor from the model. To address the problem, a two-stage method, TTBD, is proposed. In the first stage, we propose a backdoor sample detection method, DDP, to identify poisoned samples from a batch of mixed, partially poisoned samples. Once the poisoned samples are detected, we employ Shapley estimation to calculate the contribution of each neuron in the network, locate the poisoned neurons, and prune them to remove the backdoor from the model. Our experiments demonstrate that TTBD removes the backdoor successfully with only a batch of partially poisoned data, across different model architectures and datasets and against different types of backdoor attacks.
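The Shapley estimation step can be illustrated with the standard Monte Carlo permutation estimator. This is a minimal sketch, not TTBD's implementation: the `value_fn` below is a toy additive game standing in for a real measurement (for example, attack success with only a given set of neurons active), and the neuron names and weights are invented for illustration.

```python
import random

def shapley_estimates(players, value_fn, num_permutations=200, seed=0):
    """Estimate each player's Shapley value by averaging its marginal
    contribution over randomly sampled orderings of the players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: totals[p] / num_permutations for p in players}

# Toy additive game: hypothetical per-neuron contributions to the backdoor.
weights = {"n0": 0.5, "n1": 0.0, "n2": 0.3}

def toy_value(coalition):
    return sum(weights[p] for p in coalition)

estimates = shapley_estimates(list(weights), toy_value)
top_neuron = max(estimates, key=estimates.get)
```

Under this reading, neurons with the largest estimated contribution would be the pruning candidates; how TTBD defines the value function and the pruning threshold is not stated in the abstract.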
-
Domain shift is a challenge for supervised human pose estimation where the source data and target data come from different distributions. This is why pose estimation methods generally perform worse on the test set than on the training set. Recently test-time adaptation has proven to be an effective way to deal with domain shift in human pose estimation. Although the performance on the target domain has been improved existing methods require a large number of weight updates for convergence which is time-consuming and brings catastrophic forgetting. To solve these issues we propose a meta-auxiliary learning method to achieve fast adaptation for domain shift during inference. Specifically we take human pose estimation as the supervised primary task and propose body-specific image inpainting as a self-supervised auxiliary task. First we jointly train the primary and auxiliary tasks to get a pre-trained model on the source domain. Then meta-training correlates the performance of the two tasks to learn a good weight initialization. Finally meta-testing adapts the meta-learned model to the target data through self-supervised learning. Benefiting from the meta-learning paradigm the proposed method enables fast adaptation to the target domain while preserving the source domain knowledge. The carefully designed auxiliary task better pays attention to human-related semantics in a single image. Extensive experiments demonstrate the effectiveness of our test-time fast adaptation.
-
In this paper we explore the problem of event-based meshflow estimation a novel task that involves predicting a spatially smooth sparse motion field from event cameras. To start we generate a large-scale High-Resolution Event Meshflow (HREM) dataset which showcases its superiority by encompassing the merits of high resolution at 1280x720 handling dynamic objects and complex motion patterns and offering both optical flow and meshflow labels. These aspects have not been fully explored in previous works. Besides we propose Efficient Event-based MeshFlow (EEMFlow) network a lightweight model featuring a specially crafted encoder-decoder architecture to facilitate swift and accurate meshflow estimation. Furthermore we upgrade EEMFlow network to support dense event optical flow in which a Confidence-induced Detail Completion (CDC) module is proposed to preserve sharp motion boundaries. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (39x faster) of our EEMFlow model compared to recent state-of-the-art flow methods. Our code is available at https://github.com/boomluo02/EEMFlow.
-
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space recognizing instruments and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However generated programs are error-prone: they omit necessary steps include spurious ones and are unable to recover when the specialized models give incorrect outputs. Moreover they require loading multiple models incurring high latency and computation costs. We propose Visual Program Distillation (VPD) an instruction-tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs which are then executed and verified to identify the correct one. It translates each correct program into a language description of the reasoning steps which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count understand spatial relations and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs achieving state-of-the-art performance across complex vision tasks including MMBench OK-VQA A-OKVQA TallyQA POPE and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
-
Semantic instance and panoptic segmentation of 3D point clouds have been addressed using task-specific models of distinct design. Thereby the similarity of all segmentation tasks and the implicit relationship between them have not been utilized effectively. This paper presents a unified simple and effective model addressing all these tasks jointly. The model named OneFormer3D performs instance and semantic segmentation consistently using a group of learnable kernels where each kernel is responsible for generating a mask for either an instance or a semantic category. These kernels are trained with a transformer-based decoder with unified instance and semantic queries passed as an input. Such a design enables training a model end-to-end in a single run so that it achieves top performance on all three segmentation tasks simultaneously. Specifically our OneFormer3D ranks 1st and sets a new state-of-the-art (+2.1 mAP50) in the ScanNet test leaderboard. We also demonstrate the state-of-the-art results in semantic instance and panoptic segmentation of ScanNet (+21 PQ) ScanNet200 (+3.8 mAP50) and S3DIS (+0.8 mIoU) datasets.
-
Understanding human social behaviour is crucial in computer vision and robotics. Micro-level observations like individual actions fall short necessitating a comprehensive approach that considers individual behaviour intra-group dynamics and social group levels for a thorough understanding. To address dataset limitations this paper introduces JRDB-Social an extension of JRDB. Designed to fill gaps in human understanding across diverse indoor and outdoor social contexts JRDB-Social provides annotations at three levels: individual attributes intra-group interactions and social group context. This dataset aims to enhance our grasp of human social dynamics for robotic applications. Utilizing the recent cutting-edge multi-modal large language models we evaluated our benchmark to explore their capacity to decipher social human behaviour.
-
Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, everything all at once. We believe that, to effectively transfer such a holistic perception to intelligent machines, an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks, to synergistically exploit them when learning novel skills. To accomplish this, we look for a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead, to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, as a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods. Project webpage: https://sapeirone.github.io/EgoPack.
-
The rapid advancement of generative models, facilitating the creation of hyper-realistic images from textual descriptions, has concurrently escalated critical societal concerns such as misinformation. Although providing some mitigation, traditional fingerprinting mechanisms fall short in attributing responsibility for the malicious use of synthetic images. This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images, thereby serving as a potential countermeasure to model misuse. Our method modifies generative models based on each user's unique digital fingerprint, imprinting a unique identifier onto the resultant content that can be traced back to the user. This approach, incorporating fine-tuning into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates near-perfect attribution accuracy with a minimal impact on output quality. Through extensive evaluation, we show that our method outperforms baseline methods with an average improvement of 11% in handling image post-processes. Our method presents a promising and novel avenue for accountable model distribution and responsible use. Our code is available at https://github.com/kylemin/WOUAF.
-
In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper we introduce a universal visual in-context prompting framework for both tasks as shown in Fig.1. In particular we build on top of an encoder-decoder architecture and develop a versatile prompt encoder to support a variety of prompts like strokes boxes and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B DINOv achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv
-
We present HAAR a new strand-based generative model for 3D human hairstyles. Specifically based on textual inputs HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds meshes or volumetric functions. However by using the 2D priors they are intrinsically limited to only recovering the visual parts. Highly occluded hair structures can not be reconstructed with those methods and they only model the "outer shell" which is not ready to be used in physics-based rendering or simulation pipelines. In contrast we propose a first text-guided generative method that uses 3D hair strands as an underlying representation. Leveraging 2D visual question-answering (VQA) systems we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches. For results please refer to our project page https://haar.is.tue.mpg.de/.
-
Despite recent advances in text-to-3D generative methods, there is a notable absence of reliable evaluation metrics. Existing metrics usually focus on a single criterion each, such as how well the asset is aligned with the input text. These metrics lack the flexibility to generalize to different evaluation criteria and might not align well with human preferences. Conducting user preference studies is an alternative that offers both adaptability and human-aligned results. User studies, however, can be very expensive to scale. This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models. To this end, we first develop a prompt generator using GPT-4V to generate evaluating prompts, which serve as input to compare text-to-3D models. We further design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria. Finally, we use these pairwise comparison results to assign these models Elo ratings. Experimental results suggest our metric strongly aligns with human preference across different evaluation criteria.
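The final step, turning pairwise comparison results into Elo ratings, can be sketched with the standard online Elo update. This is an illustrative assumption: the abstract does not say whether the ratings are fitted sequentially or in a batch, and the K-factor, initial rating, and model names below are placeholders.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, comparisons, k=32.0, initial=1000.0):
    """Apply sequential Elo updates. Each comparison is (a, b, score_a), where
    score_a is 1.0 if a wins, 0.0 if b wins, and 0.5 for a tie."""
    ratings = dict(ratings)  # do not mutate the caller's dict
    for a, b, score_a in comparisons:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ea = expected_score(ra, rb)
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings

# One comparison in which hypothetical model "A" is preferred over "B".
ratings = update_elo({}, [("A", "B", 1.0)])
```

Because the two updates are equal and opposite, the total rating mass is conserved, so ratings are only meaningful relative to one another.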
-
Neural 3D reconstruction from multi-view images has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene while it is still under-explored how to reconstruct a target object indicated by users. Considering the Segment Anything Model (SAM) has shown effectiveness in segmenting any 2D images in this paper we propose NTO3D a novel high-quality Neural Target Object 3D (NTO3D) reconstruction method which leverages the benefits of both neural field and SAM. We first propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D occupancy field. The 3D occupancy field is then projected into 2D space and generates the new prompts for SAM. This process is iterative until convergence to separate the target object from the scene. After this we then lift the 2D features of the SAM encoder into a 3D feature field in order to improve the reconstruction quality of the target object. NTO3D lifts the 2D masks and features of SAM into the 3D neural field for high-quality neural target object 3D reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be available at: https://github.com/ucwxb/NTO3D.
-
Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits the applications in the real world. This paper strives to resolve this problem by proposing a new instruct-ReID task that requires the model to retrieve images according to the given image or language instructions. Our instruct-ReID is a more general ReID setting, where six existing ReID tasks can be viewed as special cases by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the proposed multi-purpose ReID model, trained on our OmniReID benchmark without finetuning, yields improvements of +0.5%, +0.6%, +7.7% mAP on Market1501, MSMT17, CUHK03 for traditional ReID; +6.4%, +7.1%, +11.2% mAP on PRCC, VC-Clothes, LTCC for clothes-changing ReID; +11.7% mAP on COCAS+ real2 for clothes template based clothes-changing ReID when using only RGB images; +24.9% mAP on COCAS+ real2 for our newly defined language-instructed ReID; +4.3% on LLCM for visible-infrared ReID; and +2.6% on CUHK-PEDES for text-to-image ReID. The datasets, the model, and code are available at https://github.com/hwz-zju/Instruct-ReID.
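For readers unfamiliar with the baseline's starting point, the standard (non-adaptive) triplet loss is sketched below. This is not the paper's adaptive triplet loss, whose form the abstract does not give; the fixed `margin` value and the toy 2-D embeddings are assumptions for illustration only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: zero once the negative is farther from the anchor
    than the positive by at least `margin`, positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings: the positive is at distance 1, the negative at distance 5.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Swapping the roles puts the "positive" far away, so the loss becomes large.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

An adaptive variant would presumably vary the margin or weighting per triplet, for example depending on the instruction, but that detail is not recoverable from the abstract.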
-
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which are essential in real-world medical applications. To solve this problem, in this paper we introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark is collected from 73 different medical datasets, including 12 different modalities and covering more than 20 distinct anatomical regions. Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs. Through our extensive experiments, we have found that existing LVLMs struggle to address these medical VQA problems effectively. Moreover, what surprises us is that medical-specialized LVLMs even exhibit inferior performance to general-domain models, calling for a more versatile and robust LVLM in the biomedical field. The evaluation results not only reveal the current limitations of LVLMs in understanding real medical images but also highlight our dataset's significance. Our code and dataset are available at https://github.com/OpenGVLab/Multi-Modality-Arena.
-
In-context learning provides a new perspective for multi-task modeling for vision and NLP. Under this setting the model can perceive tasks from prompts and accomplish them without any extra task-specific head predictions or model fine-tuning. However skeleton sequence modeling via in-context learning remains unexplored. Directly applying existing in-context models from other areas onto skeleton sequences fails due to the similarity between inter-frame and cross-task poses which makes it exceptionally hard to perceive the task correctly from a subtle context. To address this challenge we propose Skeleton-in-Context (SiC) an effective framework for in-context skeleton sequence modeling. Our SiC is able to handle multiple skeleton-based tasks simultaneously after a single training process and accomplish each task from context according to the given prompt. It can further generalize to new unseen tasks according to customized prompts. To facilitate context perception we additionally propose a task-unified prompt which adaptively learns tasks of different natures such as partial joint-level generation sequence-level prediction or 2D-to-3D motion prediction. We conduct extensive experiments to evaluate the effectiveness of our SiC on multiple tasks including motion prediction pose estimation joint completion and future pose estimation. We also evaluate its generalization capability on unseen tasks such as motion-in-between. These experiments show that our model achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks.
-
High-resolution image generation with Generative Artificial Intelligence (GenAI) has immense potential but due to the enormous capital investment required for training it is increasingly centralised to a few large corporations and hidden behind paywalls. This paper aims to democratise high-resolution GenAI by advancing the frontier of high-resolution generation while remaining accessible to a broad audience. We demonstrate that existing Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution image generation. Our novel DemoFusion framework seamlessly extends open-source GenAI models employing Progressive Upscaling Skip Residual and Dilated Sampling mechanisms to achieve higher-resolution image generation. The progressive nature of DemoFusion requires more passes but the intermediate results can serve as "previews" facilitating rapid prompt iteration.
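One plausible reading of "Dilated Sampling" is strided subsampling of a high-resolution grid into several low-resolution global views. The sketch below illustrates only that reading; it is an assumption made for illustration, since the abstract names the mechanism without defining it, and `dilated_views` is a hypothetical helper operating on a plain 2-D list rather than a latent tensor.

```python
def dilated_views(grid, stride):
    """Split a 2-D grid into stride * stride interleaved sub-grids. The view
    keyed (dy, dx) takes every stride-th row starting at dy and every
    stride-th column starting at dx, so each view spans the whole grid."""
    views = {}
    for dy in range(stride):
        for dx in range(stride):
            views[(dy, dx)] = [row[dx::stride] for row in grid[dy::stride]]
    return views

# A 4x4 grid holding the values 0..15, split into four 2x2 dilated views.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
views = dilated_views(grid, stride=2)
```

Each view has the lower resolution but covers the full spatial extent, which is what would let a base-resolution model contribute global context to a larger image.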
-
In this paper we address the challenging problem of visual SLAM with neural scene representations. Recently neural scene representations have shown promise for SLAM to produce dense 3D scene reconstruction with high quality. However existing methods require scene-specific optimization leading to time-consuming mapping processes for each individual scene. To overcome this limitation we propose IBD-SLAM an Image-Based Depth fusion framework for generalizable SLAM. In particular we adopt a Neural Radiance Field (NeRF) for scene representation. Inspired by multi-view image-based rendering instead of learning a fixed-grid scene representation we propose to learn an image-based depth fusion model that fuses depth maps of multiple reference views into a xyz-map representation. Once trained this model can be applied to new uncalibrated monocular RGBD videos of unseen scenes without the need for retraining and reconstructs full 3D scenes efficiently with a light-weight pose optimization procedure. We thoroughly evaluate IBD-SLAM on public visual SLAM benchmarks outperforming the previous state-of-the-art while being 10x faster in the mapping stage. Project page: https://visual-ai.github.io/ibd-slam.
-
This paper proposes Comprehensive Pathology Language Image Pre-training (CPLIP), a new unsupervised technique designed to enhance the alignment of images and text in histopathology for tasks such as classification and segmentation. This methodology enriches vision-language models by leveraging extensive data without needing ground-truth annotations. CPLIP involves constructing a pathology-specific dictionary, generating textual descriptions for images using language models, and retrieving relevant images for each text snippet via a pre-trained model. The model is then fine-tuned using a many-to-many contrastive learning method to align complex interrelated concepts across both modalities. Evaluated across multiple histopathology tasks, CPLIP shows notable improvements in zero-shot learning scenarios, outperforming existing methods in both interpretability and robustness and setting a higher benchmark for the application of vision-language models in the field. To encourage further research and replication, the code for CPLIP is available on GitHub at https://cplip.github.io/
-
We present a method to generate full-body selfies from photographs originally taken at arm's length. Because self-captured photos are typically taken close up, they have a limited field of view and exaggerated perspective that distorts facial shapes. We instead seek to generate the photo someone else would take of you from a few feet away. Our approach takes as input four selfies of your face and body and a background image, and generates a full-body selfie in a desired target pose. We introduce a novel diffusion-based approach to combine all of this information into high-quality, well-composed photos of you with the desired pose and background.
-
3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D.
-
In this paper we tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. In order to obtain a good enough initialization for BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging, or triangulation) which provide an initial solution that can then be refined using BA. In this work we replace these sub-problems by learning a model that takes as input the 2D keypoints detected across multiple views and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods and challenges COLMAP while having lower runtime. Our code is available at: https://github.com/lucasbrynte/gasfm/.
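The reprojection error that Bundle Adjustment minimizes can be illustrated with a minimal sketch. It assumes a simple pinhole camera with 3D points already expressed in camera coordinates; the function names, intrinsics, and sample values are hypothetical and are not taken from the paper or its code.

```python
import math

def project(point, focal, cx, cy):
    """Project a 3D point (camera coordinates) with a pinhole model."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)

def reprojection_error(points_cam, keypoints, focal, cx, cy):
    """Mean Euclidean distance between projected points and observed 2D keypoints."""
    total = 0.0
    for p, (u, v) in zip(points_cam, keypoints):
        pu, pv = project(p, focal, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(keypoints)

# Hypothetical data: two 3D points and their detected keypoints.
points = [(0.0, 0.0, 2.0), (1.0, -1.0, 4.0)]
observed = [(320.0, 240.0), (448.0, 115.0)]
err = reprojection_error(points, observed, focal=500.0, cx=320.0, cy=240.0)
```

Bundle Adjustment would adjust the camera poses and 3D points to drive a residual of this kind (usually its squared form) down; the sketch only evaluates it for fixed values.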
-
Shape and geometric patterns are essential in defining stylistic identity. However current 3D style transfer methods predominantly focus on transferring colors and textures often overlooking geometric aspects. In this paper we introduce Geometry Transfer a novel method that leverages geometric deformation for 3D style transfer. This technique employs depth maps to extract a style guide subsequently applied to stylize the geometry of radiance fields. Moreover we propose new techniques that utilize geometric cues from the 3D scene thereby enhancing aesthetic expressiveness and more accurately reflecting intended styles. Our extensive experiments show that Geometry Transfer enables a broader and more expressive range of stylizations thereby significantly expanding the scope of 3D style transfer.
-
We present the first approach to render highly realistic free-viewpoint videos of a human actor in general apparel from sparse multi-view recording to display in real-time at an unprecedented 4K resolution. At inference our method only requires four camera views of the moving actor and the respective 3D skeletal pose. It handles actors in wide clothing and reproduces even fine-scale dynamic detail e.g. clothing wrinkles face expressions and hand gestures. At training time our learning-based approach expects dense multi-view video and a rigged static surface scan of the actor. Our method comprises three main stages. Stage 1 is a skeleton-driven neural approach for high-quality capture of the detailed dynamic mesh geometry. Stage 2 is a novel solution to create a view-dependent texture using four test-time camera views as input. Finally stage 3 comprises a new image-based refinement network rendering the final 4K image given the output from the previous stages. Our approach establishes a new benchmark for real-time rendering resolution and quality using sparse input camera views unlocking possibilities for immersive telepresence.
-
We introduce SEAS using ShapE-Aligned Supervision to enhance appearance-based person re-identification. When recognizing an individual's identity existing methods primarily rely on appearance which can be influenced by the background environment due to a lack of body shape awareness. Although some methods attempt to incorporate other modalities such as gait or body shape they encode the additional modality separately resulting in extra computational costs and lacking an inherent connection with appearance. In this paper we explore the use of implicit 3-D body shape representations as pixel-level guidance to augment the extraction of identity features with body shape knowledge in addition to appearance. Using body shape as supervision rather than as input provides shape-aware enhancements without any increase in computational cost and delivers coherent integration with pixel-wise appearance features. Moreover for video-based person re-identification we align pixel-level features across frames with shape awareness to ensure temporal consistency. Our results demonstrate that incorporating body shape as pixel-level supervision reduces rank-1 errors by 1.4% for frame-based and by 2.5% for video-based re-identification tasks respectively and can also be generalized to other existing appearance-based person re-identification methods.
-
Distillation strategies are currently the primary approaches for mitigating forgetting in class incremental learning (CIL). Existing methods generally inherit previous knowledge from a single teacher. However teachers with different mechanisms are talented at different tasks and inheriting diverse knowledge from them can enhance compatibility with new knowledge. In this paper we propose the MTD method to find multiple diverse teachers for CIL. Specifically we adopt weight permutation feature perturbation and diversity regularization techniques to ensure diverse mechanisms in teachers. To reduce time and memory consumption each teacher is represented as a small branch in the model. We adapt existing CIL distillation strategies with MTD and extensive experiments on CIFAR-100 ImageNet-100 and ImageNet-1000 show significant performance improvement. Our code is available at https://github.com/HaitaoWen/CLearning.
-
Although deep learning based object detection is of great significance for various applications, it faces challenges when deployed on edge devices due to computation and energy limitations. Post-training quantization (PTQ) can improve inference efficiency through integer computing. However, existing PTQ methods suffer from severe performance degradation when performing full quantization because they overlook the unique characteristics of regression tasks in object detection. In this paper we are the first to explore regression-friendly quantization and conduct full quantization on various detectors. We reveal the intrinsic reason behind the difficulty of quantizing regressors with empirical and theoretical justifications, and introduce a novel Regression-specialized Post-Training Quantization (Reg-PTQ) scheme. It includes Filtered Global Loss Integration Calibration, which combines the global loss with a two-step filtering mechanism to mitigate the adverse impact of false positive bounding boxes, and a Learnable Logarithmic-Affine Quantizer tailored for the non-uniformly distributed parameters in regression structures. Extensive experiments on prevalent detectors showcase the effectiveness of Reg-PTQ. Notably, Reg-PTQ achieves 7.6 times and 5.4 times reductions in computation and storage consumption under INT4 with little performance degradation, which indicates the immense potential of fully quantized detectors in real-world object detection applications.
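To see why non-uniformly distributed values can favor a logarithmic grid over a uniform one at 4 bits, consider the following sketch. Both quantizers are generic textbook forms chosen for illustration; they are assumptions and not the Learnable Logarithmic-Affine Quantizer from the paper, whose exact form the abstract does not give.

```python
import math

def uniform_quantize(x, max_val, bits=4):
    """Unsigned uniform quantizer on [0, max_val]; returns the dequantized value."""
    levels = 2 ** bits - 1
    scale = max_val / levels
    q = min(max(round(x / scale), 0), levels)
    return q * scale

def log2_quantize(x, max_val, bits=4):
    """Power-of-two quantizer: stores an integer exponent instead of a linear level."""
    levels = 2 ** bits - 1
    if x <= 0:
        return 0.0
    q = min(max(round(-math.log2(x / max_val)), 0), levels)
    return max_val * 2.0 ** (-q)

# Hypothetical values clustered near zero, with one large outlier.
values = [0.01, 0.02, 0.05, 1.0]
uniform = [uniform_quantize(v, 1.0) for v in values]
logarithmic = [log2_quantize(v, 1.0) for v in values]
```

With these sample values the uniform grid collapses the two smallest entries to zero, while the power-of-two grid keeps them distinct, at the price of coarser resolution near the maximum.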
-
Recently pre-trained vision-language models (e.g. CLIP) have shown great potential in few-shot learning and attracted a lot of research interest. Although efforts have been made to improve few-shot ability of CLIP key factors on the effectiveness of existing methods have not been well studied limiting further exploration of CLIP's potential in few-shot learning. In this paper we first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias which encourages us to learn an effective logit bias for further improving performance of CLIP-based few-shot learning methods. To this end we disassemble three key components involved in computation of logit bias (i.e. logit features logit predictor and logit fusion) and empirically analyze the effect on performance of few-shot classification. Based on analysis of key components this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification. Specifically our AMU-Tuning predicts logit bias by exploiting the appropriate Auxiliary features which are fed into an efficient feature-initialized linear classifier with Multi-branch training. Finally an Uncertainty-based fusion is developed to incorporate logit bias into CLIP for few-shot classification. The experiments are conducted on several widely used benchmarks and the results show AMU-Tuning clearly outperforms its counterparts while achieving state-of-the-art performance of CLIP-based few-shot learning without bells and whistles.
-
The recently increased role of mobile photography has raised the standards of on-device photo processing tremendously. Despite the latest advancements in camera hardware, the mobile camera sensor area cannot be increased significantly due to physical constraints, leading to a pixel size of 0.6-2.0 μm, which results in strong image noise even in moderate lighting conditions. In the era of deep learning, one can train a CNN model to perform robust image denoising. However, there is still a lack of a substantially diverse dataset for this task. To address this problem, we introduce a novel Mobile Image Denoising Dataset (MIDD) comprising over 400,000 noisy / noise-free image pairs captured under various conditions by 20 different mobile camera sensors. Additionally, we propose a new DPreview test set consisting of data from 294 different cameras for precise model evaluation. Furthermore, we present the efficient baseline model SplitterNet for the considered mobile image denoising task, which achieves high numerical and visual results while being able to process 8MP photos directly on smartphone GPUs in under one second, thereby outperforming models with similar runtimes. This model is also compatible with recent mobile NPUs, demonstrating an even higher speed when deployed on them. The conducted experiments demonstrate high robustness of the proposed solution when applied to images from previously unseen sensors, showing its high generalizability. The datasets, code, and models can be found on the official project website.
-
In the field of computer vision Vision Transformers (ViTs) have emerged as a prominent deep learning architecture. Despite being inspired by Convolutional Neural Networks (CNNs) ViTs are susceptible to small spatial shifts in the input data - they lack shift-equivariance. To address this shortcoming we introduce novel data-adaptive designs for each of the ViT modules that break shift-equivariance such as tokenization self-attention patch merging and positional encoding. With our proposed modules we achieve perfect circular shift-equivariance across four prominent ViT architectures: Swin SwinV2 CvT and MViTv2. Additionally we leverage our design to further enhance consistency under standard shifts. We evaluate our adaptive ViT models on image classification and semantic segmentation tasks. Our models achieve competitive performance across three diverse datasets showcasing perfect (100%) circular shift consistency while improving standard shift consistency.
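The idea of a data-adaptive module that restores circular shift-equivariance can be illustrated on a 1D toy signal. The sketch below picks the patch offset by an order-independent score over the resulting patches, so a circularly shifted input yields the same set of tokens. The scoring rule and all names are assumptions made for illustration, not the paper's tokenization design, and ties between offsets are not handled.

```python
def circular_shift(x, k):
    """Shift a list to the right by k positions with wrap-around."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)

def tokens_at_offset(x, patch, offset):
    """Cut the circular signal into patches, starting at the given offset."""
    n = len(x)
    rolled = [x[(offset + i) % n] for i in range(n)]
    return [tuple(rolled[i:i + patch]) for i in range(0, n, patch)]

def score(tokens):
    """Order-independent score of a patching (sum of squared patch sums)."""
    return sum(sum(t) ** 2 for t in tokens)

def adaptive_tokenize(x, patch):
    """Choose the offset from the data itself instead of always starting at 0."""
    best = max(range(patch), key=lambda s: score(tokens_at_offset(x, patch, s)))
    return tokens_at_offset(x, patch, best)

signal = [3, 1, 4, 1, 5, 9, 2, 6]
reference = sorted(adaptive_tokenize(signal, 2))
fixed_reference = sorted(tokens_at_offset(signal, 2, 0))
```

A fixed offset of 0 produces different patches once the input is shifted by one sample, whereas the adaptive choice returns the same patches (up to their order) for every circular shift of this signal.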
-
Spike cameras leveraging spike-based integration sampling and high temporal resolution offer distinct advantages over standard cameras. However existing approaches reliant on spike cameras often assume optimal illumination a condition frequently unmet in real-world scenarios. To address this we introduce SpikeNeRF the first work that derives a NeRF-based volumetric scene representation from spike camera data. Our approach leverages NeRF's multi-view consistency to establish robust self-supervision effectively eliminating erroneous measurements and uncovering coherent structures within exceedingly noisy input amidst diverse real-world illumination scenarios. The framework comprises two core elements: a spike generation model incorporating an integrate-and-fire neuron layer and parameters accounting for non-idealities such as threshold variation and a spike rendering loss capable of generalizing across varying illumination conditions. We describe how to effectively optimize neural radiance fields to render photorealistic novel views from the novel continuous spike stream demonstrating advantages over other vision sensors in certain scenes. Empirical evaluations conducted on both real and novel realistically simulated sequences affirm the efficacy of our methodology. The dataset and source code are released at https://github.com/BIT-Vision/SpikeNeRF.
-
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset, adding manually labeled Egocentric Action Scene Graphs which offer a rich set of annotations for long-form egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, action anticipation and activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and code to replicate experiments and annotations.
-
Existing research based on deep learning has extensively explored the problem of daytime image dehazing. However, few studies have considered the characteristics of nighttime hazy scenes. There are two distinctions between nighttime and daytime haze. First, there may be multiple active colored light sources with lower illumination intensity in nighttime scenes, which may cause haze, glow, and noise with localized, coupled, and frequency-inconsistent characteristics. Second, due to the domain discrepancy between simulated and real-world data, unrealistic brightness may occur when applying a dehazing model trained on simulated data to real-world data. To address these two issues, we propose a semi-supervised model for real-world nighttime dehazing. First, spatial attention and frequency spectrum filtering are implemented as a spatial-frequency domain information interaction module to handle the first issue. Second, a pseudo-label-based retraining strategy and a local window-based brightness loss for the semi-supervised training process are designed to suppress haze and glow while achieving realistic brightness. Experiments on public benchmarks validate the effectiveness of the proposed method and its superiority over state-of-the-art methods. The source code and Supplementary Materials are available at https://github.com/Xiaofeng-life/SFSNiD.
-
Data-Free Knowledge Distillation (DFKD) is a promising task that trains high-performance small models for practical deployment without relying on the original training data. Existing methods commonly avoid relying on private data by utilizing synthetic or sampled data. However, a long-overlooked issue is the severe distribution shift between the substitute and original data, which manifests as large differences in image quality and class proportions. These harmful shifts essentially act as a confounder that causes significant performance bottlenecks. To tackle the issue, this paper proposes a novel perspective based on causal inference to disentangle the student models from the impact of such shifts. By designing a customized causal graph, we first reveal the causalities among the variables in the DFKD task. Subsequently, we propose a Knowledge Distillation Causal Intervention (KDCI) framework based on the backdoor adjustment to de-confound the confounder. KDCI can be flexibly combined with most existing state-of-the-art baselines. Experiments in combination with six representative DFKD methods demonstrate the effectiveness of KDCI, which clearly helps existing methods under almost all settings, e.g., improving the baseline by up to 15.54% accuracy on the CIFAR-100 dataset.
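For reference, the backdoor adjustment mentioned here has the following standard form. The symbols are assumptions chosen for illustration: X is the input to the student, Y its prediction, and Z a confounder (here, the distribution shift) that satisfies the backdoor criterion. How KDCI instantiates Z and estimates the terms is not stated in the abstract.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P(Z = z)
```

Unlike the observational conditional P(Y | X), which weights each stratum by P(Z = z | X), the adjusted quantity weights every stratum of Z by its marginal probability.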
-
In this paper we propose a novel concept factorization method that seeks factor matrices using a cross-order positive semi-definite neighbor graph which provides comprehensive and complementary neighbor information of the data. The factor matrices are learned with bipartite graph partitioning which exploits explicit cluster structure of the data and is more geared towards clustering application. We develop an effective and efficient optimization algorithm for our method and provide elegant theoretical results about the convergence. Extensive experimental results confirm the effectiveness of the proposed method.
-
Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning outperforming state-of-the-art methods trained with the same or stronger supervision.
-
Deep Neural Networks (DNNs) are known to be susceptible to adversarial attacks. Previous research mainly focuses on improving adversarial robustness in the fully supervised setting, leaving the challenging domain of zero-shot adversarial robustness an open question. In this work we investigate this domain by leveraging the recent advances in large vision-language models, such as CLIP, to introduce zero-shot adversarial robustness to DNNs. We propose LAAT, a Language-driven Anchor-based Adversarial Training strategy. LAAT utilizes the text encoder's features for each category as fixed anchors (normalized feature embeddings), which are then employed for adversarial training. By leveraging the semantic consistency of the text encoders, LAAT aims to enhance the adversarial robustness of the image model on novel categories. However, naively using text encoders leads to poor results. Through analysis, we identified the issue to be the high cosine similarity between text encoders. We then design an expansion algorithm and an alignment cross-entropy loss to alleviate the problem. Our experimental results demonstrate that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods. LAAT has the potential to enhance adversarial robustness by large-scale multimodal models, especially when labeled data is unavailable during training.
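The cosine-similarity issue can be made concrete with a toy computation, reading it (as an interpretation) as high similarity between the per-category text anchors. The sketch measures the mean pairwise cosine similarity of three hypothetical anchor vectors and shows that subtracting their mean spreads them apart. Mean-centering is only a stand-in chosen for illustration; it is not the expansion algorithm from the paper, which the abstract does not describe.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def center(vectors):
    """Subtract the mean vector (illustrative stand-in for an expansion step)."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]

# Hypothetical anchors that all point in nearly the same direction.
anchors = [[1.0, 0.9, 1.1], [0.9, 1.1, 1.0], [1.1, 1.0, 0.9]]
before = mean_pairwise_cosine(anchors)
after = mean_pairwise_cosine(center(anchors))
```

For these sample anchors the mean pairwise similarity is about 0.99 before centering and about -0.5 afterwards, which shows how tightly clustered anchors leave little angular margin between categories.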
-
Diffusion model-based image restoration (IR) aims to use diffusion models to recover high-quality (HQ) images from degraded images, achieving promising performance. Due to the inherent property of diffusion models, most existing methods need long serial sampling chains to restore HQ images step-by-step, resulting in expensive sampling time and high computation costs. Moreover, such long sampling chains hinder understanding the relationship between inputs and restoration results, since it is hard to compute the gradients through the whole chain. In this work we aim to rethink diffusion model-based IR models through a different perspective, i.e., a deep equilibrium (DEQ) fixed point system, called DeqIR. Specifically, we derive an analytical solution by modeling the entire sampling chain in these IR models as a joint multivariate fixed point system. Based on the analytical solution, we can conduct parallel sampling and restore HQ images without training. Furthermore, we compute fast gradients via DEQ inversion and find that initialization optimization can boost image quality and control the generation direction. Extensive experiments on benchmarks demonstrate the effectiveness of our method on typical IR tasks and real-world settings.
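The notion of treating a serial sampling chain as one joint fixed point system can be illustrated with a toy scalar chain. In the sketch below, `step` is a hypothetical stand-in for one deterministic sampling step, and the solver is a plain Jacobi-style iteration; the paper's actual system, noise handling, and solver are not described in the abstract and may differ.

```python
def step(x):
    """Stand-in for one deterministic sampling step x_{t-1} = f(x_t)."""
    return 0.5 * x + 1.0

def sequential_chain(x_start, num_steps):
    """Run the chain one step after another, as in ordinary serial sampling."""
    states = [x_start]
    for _ in range(num_steps):
        states.append(step(states[-1]))
    return states

def parallel_fixed_point(x_start, num_steps, num_iters):
    """Solve for all chain states jointly; the start state stays fixed."""
    states = [x_start] + [0.0] * num_steps
    for _ in range(num_iters):
        # Every state is recomputed from the previous iterate, so the updates
        # within one iteration do not depend on each other and could run in parallel.
        states = [x_start] + [step(states[i]) for i in range(num_steps)]
    return states

serial = sequential_chain(4.0, 5)
joint = parallel_fixed_point(4.0, 5, 5)
```

For a chain of this kind the joint iteration reproduces the serial result after at most as many iterations as there are steps; any speed-up in practice depends on the iteration converging to sufficient accuracy earlier than that.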
-
Object detection with event cameras benefits from the sensor's low latency and high dynamic range. However it is costly to fully label event streams for supervised training due to their high temporal resolution. To reduce this cost we present LEOD the first method for label-efficient event-based detection. Our approach unifies weakly- and semi-supervised object detection with a self-training mechanism. We first utilize a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events. Then the detector is re-trained with both real and generated labels. Leveraging the temporal consistency of events we run bi-directional inference and apply tracking-based post-processing to enhance the quality of pseudo labels. To stabilize training against label noise we further design a soft anchor assignment strategy. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example on Gen1 it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD maintains its effectiveness even when all labeled data are available reaching new state-of-the-art results. Finally we show that our method readily scales to improve larger detectors as well. Code is released at https://github.com/Wuziyi616/LEOD.
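The use of temporal consistency to clean pseudo labels can be sketched as follows. The filter keeps a detection only if it is confident and overlaps some detection in a neighbouring frame. This is a simplified stand-in for the tracking-based post-processing described above; the thresholds, data layout, and function names are assumptions rather than LEOD's implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_pseudo_labels(frames, score_thresh=0.5, iou_thresh=0.5):
    """Keep confident detections that overlap a detection in an adjacent frame."""
    kept = []
    for t, detections in enumerate(frames):
        neighbours = []
        if t > 0:
            neighbours += frames[t - 1]
        if t + 1 < len(frames):
            neighbours += frames[t + 1]
        kept.append([
            (box, score) for box, score in detections
            if score >= score_thresh
            and any(iou(box, other) >= iou_thresh for other, _ in neighbours)
        ])
    return kept

# Hypothetical detector output: one list of (box, score) pairs per frame.
frames = [
    [((10, 10, 50, 50), 0.9)],
    [((12, 11, 52, 51), 0.8), ((200, 200, 220, 220), 0.95)],
    [((14, 12, 54, 52), 0.3)],
]
labels = filter_pseudo_labels(frames)
```

In this example the isolated high-score box in the middle frame is discarded because nothing supports it in adjacent frames, and the low-score box in the last frame is discarded by the confidence threshold.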
-
Representation learning of pathology whole-slide images (WSIs) has primarily relied on weak supervision with Multiple Instance Learning (MIL). However, the slide representations resulting from this approach are highly tailored to specific clinical tasks, which limits their expressivity and generalization, particularly in scenarios with limited data. Instead, we hypothesize that morphological redundancy in tissue can be leveraged to build a task-agnostic slide representation in an unsupervised fashion. To this end, we introduce PANTHER, a prototype-based approach rooted in the Gaussian mixture model that summarizes the set of WSI patches into a much smaller set of morphological prototypes. Specifically, each patch is assumed to have been generated from a mixture distribution, where each mixture component represents a morphological exemplar. Utilizing the estimated mixture parameters, we then construct a compact slide representation that can be readily used for a wide range of downstream tasks. By performing an extensive evaluation of PANTHER on subtyping and survival tasks using 13 datasets, we show that 1) PANTHER outperforms or is on par with supervised MIL baselines and 2) the analysis of morphological prototypes brings new qualitative and quantitative insights into model interpretability. The code is available at https://github.com/mahmoodlab/Panther.
-
Polarization is a fundamental property of light that encodes abundant information regarding surface shape, material, illumination, and viewing geometry. The computer vision community has witnessed a blossoming of polarization-based vision applications, such as reflection removal, shape-from-polarization (SfP), transparent object segmentation, and color constancy, partially due to the emergence of single-chip mono/color polarization sensors that make polarization data acquisition easier than ever. However, is polarization-based vision vulnerable to adversarial attacks? If so, is it possible to realize these adversarial attacks in the physical world without being perceived by human eyes? In this paper we warn the community of the vulnerability of polarization-based vision, which can be more serious than that of RGB-based vision. By adapting a commercial LCD projector, we achieve locally controllable polarizing projection, which is successfully utilized to fool state-of-the-art polarization-based vision algorithms for glass segmentation and SfP. Compared with existing physical attacks on RGB-based vision, which always suffer from the trade-off between attack efficacy and visibility to the eye, the adversarial attacks based on polarizing projection are contact-free and visually imperceptible, since naked human eyes can rarely perceive the difference between maliciously manipulated polarizing light and ordinary illumination. This poses unprecedented risks to polarization-based vision, for which due attention should be paid and countermeasures considered.
-
Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple, and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with or better than the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot.
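The nearest-neighbor interpolation step that turns sparse tracks into a rough dense estimate can be sketched in a few lines. The track format, grid size, and function name below are hypothetical, and the brute-force search is chosen for clarity only; the learned refinement stage is not modeled.

```python
def densify_flow(tracks, height, width):
    """tracks: list of ((x, y), (dx, dy), visible) tuples.

    Returns dense flow and visibility grids indexed as grid[y][x].
    """
    flow = [[None] * width for _ in range(height)]
    visible = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Copy motion and visibility from the closest sparse track.
            _, motion, vis = min(
                tracks, key=lambda t: (t[0][0] - x) ** 2 + (t[0][1] - y) ** 2
            )
            flow[y][x] = motion
            visible[y][x] = vis
    return flow, visible

# Two hypothetical tracks: one visible and moving right, one occluded and moving up.
tracks = [((0, 0), (1.0, 0.0), True), ((3, 3), (0.0, -2.0), False)]
flow, visible = densify_flow(tracks, height=4, width=4)
```

Each pixel simply inherits the motion and visibility of its closest track, which gives a piecewise-constant initial estimate of the kind a refinement network could then improve.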
-
Split Learning (SL) is a distributed learning framework renowned for its privacy-preserving features and minimal computational requirements. Previous research consistently highlights the potential privacy breaches in SL systems by server adversaries reconstructing training data. However, these studies often rely on strong assumptions or compromise system utility to enhance attack performance. This paper introduces a new semi-honest Data Reconstruction Attack on SL, named Feature-Oriented Reconstruction Attack (FORA). In contrast to prior works, FORA relies on limited prior knowledge: specifically, the server utilizes auxiliary samples from the public without knowing any client's private information. This allows FORA to conduct the attack stealthily and achieve robust performance. The key vulnerability exploited by FORA is the revelation of the model representation preference in the smashed data output by the victim client. FORA constructs a substitute client through feature-level transfer learning, aiming to closely mimic the victim client's representation preference. Leveraging this substitute client, the server trains the attack model to effectively reconstruct private data. Extensive experiments showcase FORA's superior performance compared to state-of-the-art methods. Furthermore, the paper systematically evaluates the proposed method's applicability across diverse settings and advanced defense strategies.
-
With the rapid development of face recognition (FR) systems, the privacy of face images on social media is facing severe challenges due to the abuse of unauthorized FR systems. Some studies utilize adversarial attack techniques to defend against malicious FR systems by generating adversarial examples. However, the generated adversarial examples, i.e., the protected face images, tend to suffer from subpar visual quality and low transferability. In this paper we propose a novel face protection approach, dubbed DiffAM, which leverages the powerful generative ability of diffusion models to generate high-quality protected face images with adversarial makeup transferred from reference images. To be specific, we first introduce a makeup removal module to generate non-makeup images, utilizing a fine-tuned diffusion model with guidance of textual prompts in CLIP space. As the inverse process of makeup transfer, makeup removal can make it easier to establish the deterministic relationship between the makeup domain and the non-makeup domain regardless of elaborate text prompts. Then, with this relationship, a CLIP-based makeup loss along with an ensemble attack strategy is introduced to jointly guide the direction of the adversarial makeup domain, achieving the generation of protected face images with natural-looking makeup and high black-box transferability. Extensive experiments demonstrate that DiffAM achieves higher visual quality and attack success rates, with a gain of 12.98% under the black-box setting compared with the state of the art. The code will be available at https://github.com/HansSunY/DiffAM.
-
Recently there has been a lot of progress in reducing the computation of deep models at inference time. These methods can reduce both the computational needs and power usage of deep models. Some of these approaches adaptively scale the compute based on the input instance. We show that such models can be vulnerable to a universal adversarial patch attack where the attacker optimizes for a patch that when pasted on any image can increase the compute and power consumption of the model. We run experiments with three different efficient vision transformer methods showing that in some cases the attacker can increase the computation to the maximum possible level by simply pasting a patch that occupies only 8% of the image area. We also show that a standard adversarial training defense method can reduce some of the attack's success. We believe adaptive efficient methods will be necessary for the future to lower the power usage of expensive deep models so we hope our paper encourages the community to study the robustness of these methods and develop better defense methods for the proposed attack. Code is available at: https://github.com/UCDvision/SlowFormer.
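To give a sense of scale for the 8% figure, the short sketch below computes the side length of a square patch that covers at most a given fraction of an image. The 224 x 224 input resolution and the square patch shape are assumptions made for illustration; the abstract does not state either.

```python
import math


def square_patch_side(height: int, width: int, area_fraction: float) -> int:
    """Largest integer side of a square patch covering at most
    `area_fraction` of a height x width image."""
    if not 0.0 < area_fraction <= 1.0:
        raise ValueError("area_fraction must be in (0, 1]")
    max_pixels = int(area_fraction * height * width)
    # The patch also has to fit inside the image.
    return min(math.isqrt(max_pixels), height, width)


# Assumed 224 x 224 input (hypothetical, not taken from the abstract).
side = square_patch_side(224, 224, 0.08)
print(side, side * side / (224 * 224))
```

Under these assumptions the patch is only about 63 pixels on a side, which illustrates how small the attacked region is relative to the whole input.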
-
LiDAR Upsampling is a challenging task for the perception systems of robots and autonomous vehicles due to the sparse and irregular structure of large-scale scene contexts. Recent works propose to solve this problem by converting LiDAR data from 3D Euclidean space into an image super-resolution problem in 2D image space. Although their methods can generate high-resolution range images with fine-grained details the resulting 3D point clouds often blur out details and predict invalid points. In this paper we propose TULIP a new method to reconstruct high-resolution LiDAR point clouds from low-resolution LiDAR input. We also follow a range image-based approach but specifically modify the patch and window geometries of a Swin-Transformer-based network to better fit the characteristics of range images. We conducted several experiments on three public real-world and simulated datasets. TULIP outperforms state-of-the-art methods in all relevant metrics and generates robust and more realistic point clouds than prior works.
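The range-image view mentioned above can be sketched with a generic spherical projection, which maps each 3D point to a pixel indexed by elevation (row) and azimuth (column). This is a minimal illustration of the common projection, not TULIP's own code; the function name, image size, and field-of-view values are hypothetical.

```python
import math


def points_to_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points onto a height x width range image.

    Rows index elevation (top row = fov_up), columns index azimuth.
    Empty pixels hold 0.0; if several points hit one pixel, the closest wins.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue  # skip a point located at the sensor origin
        # Clamp guards against tiny floating-point overshoot.
        elevation = math.asin(max(-1.0, min(1.0, z / rng)))
        if not fov_down <= elevation <= fov_up:
            continue  # outside the vertical field of view
        azimuth = math.atan2(y, x)  # in [-pi, pi]
        col = int(0.5 * (1.0 - azimuth / math.pi) * width) % width
        row = min(int((fov_up - elevation) / fov * height), height - 1)
        if image[row][col] == 0.0 or rng < image[row][col]:
            image[row][col] = rng
    return image


# Toy example with an assumed 4 x 8 image and a +/-10 degree vertical FOV.
toy_points = [(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 0.0, 5.0)]
toy_image = points_to_range_image(toy_points, 4, 8, 10.0, -10.0)
```

Upsampling then becomes a 2D super-resolution problem on such images, and the high-resolution point cloud is recovered by inverting the projection.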
-
Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP, researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However, when implementing ICL using these LVLMs, researchers usually resort to the simplest approaches, such as random sampling, to configure the in-context sequence, thus leading to sub-optimal results. To enhance ICL performance, in this study we use Visual Question Answering (VQA) as a case study to explore diverse in-context configurations and find the powerful ones. Additionally, by observing how the LVLM outputs change when the in-context sequence is altered, we gain insights into the inner properties of LVLMs, improving our understanding of them. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through exhaustive experiments on three VQA datasets (VQAv2, VizWiz, and OK-VQA), we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance. Our code is provided at https://github.com/GaryJiajia/OFv2_ICL_VQA.
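One simple alternative to random sampling is to retrieve the demonstrations whose embeddings are most similar to the query. The sketch below shows that idea with cosine similarity over a toy pool; the embedding vectors, field names, and prompt template are hypothetical and are not taken from the paper's code.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_demonstrations(query_embedding, pool, k):
    """Return the k pool items most similar to the query, most similar first."""
    ranked = sorted(
        pool,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_in_context_sequence(demos, query_question):
    """Concatenate retrieved demonstrations, then the unanswered query."""
    lines = [f"Question: {d['question']} Answer: {d['answer']}" for d in demos]
    lines.append(f"Question: {query_question} Answer:")
    return "\n".join(lines)


# Toy pool with made-up 2D embeddings (real systems would use image/text features).
pool = [
    {"question": "What color is the bus?", "answer": "red", "embedding": [1.0, 0.0]},
    {"question": "How many dogs are there?", "answer": "two", "embedding": [0.0, 1.0]},
    {"question": "What color is the car?", "answer": "blue", "embedding": [0.9, 0.1]},
]
demos = retrieve_demonstrations([1.0, 0.0], pool, k=2)
prompt = build_in_context_sequence(demos, "What color is the truck?")
```

Swapping the similarity function or reordering `demos` before building the prompt is the kind of configuration change such a study would compare.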
-
Efficient generation of 3D digital humans is important in several industries, including virtual reality, social media, and cinematic production. 3D generative adversarial networks (GANs) have demonstrated state-of-the-art (SOTA) quality and diversity for generated assets. Current 3D GAN architectures, however, typically rely on volume representations, which are slow to render, thereby hampering the GAN training and requiring multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi-shell-based scaffold. In this setting, a CNN generates a 3D texture stack with features that are mapped to the shells. The latter represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the shells whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is important during GAN training and, at inference time, to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of 512 x 512 pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets, including SHHQ and DeepFashion.
-
The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the quality score of an input image without additional information. NR-IQA models play a crucial role in the media industry aiding in performance evaluation and optimization guidance. However these models are found to be vulnerable to adversarial attacks which introduce imperceptible perturbations to input images resulting in significant changes in predicted scores. In this paper we propose a defense method to mitigate the variability in predicted scores caused by small perturbations thus enhancing the adversarial robustness of NR-IQA models. To be specific we present theoretical evidence showing that the extent of score changes is related to the l_1 norm of the gradient of the predicted score with respect to the input image when adversarial perturbations are l_inf-bounded. Building on this theoretical foundation we propose a norm regularization training strategy aimed at reducing the l_1 norm of the gradient thereby boosting the adversarial robustness of NR-IQA models. Experiments conducted on four NR-IQA baseline models demonstrate the effectiveness of our strategy in reducing score changes in the presence of adversarial attacks. To the best of our knowledge this work marks the first attempt to defend against adversarial attacks on NR-IQA models. Our study offers valuable insights into the adversarial robustness of NR-IQA models and provides a foundation for future research in this area.
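The link between score changes and the l_1 norm of the gradient can be seen from a first-order argument: for a perturbation delta with every entry bounded by eps in absolute value, |grad . delta| is at most eps times the l_1 norm of grad, and the bound is reached when delta = eps * sign(grad). The sketch below checks this numerically on a made-up gradient vector; it illustrates the inequality only and is not the paper's training code.

```python
def l1_norm(vector):
    return sum(abs(v) for v in vector)


def worst_case_linf_perturbation(grad, eps):
    """Perturbation with |delta_i| <= eps that maximizes grad . delta."""
    return [eps * (1.0 if g > 0 else -1.0 if g < 0 else 0.0) for g in grad]


def first_order_score_change(grad, delta):
    """Linearized change in the predicted score: grad . delta."""
    return sum(g * d for g, d in zip(grad, delta))


# Hypothetical gradient of the predicted score with respect to four pixels.
grad = [0.5, -2.0, 0.0, 1.5]
eps = 0.25

delta = worst_case_linf_perturbation(grad, eps)
worst_change = first_order_score_change(grad, delta)
bound = eps * l1_norm(grad)
print(worst_change, bound)
```

Because the worst-case first-order change equals `eps * l1_norm(grad)`, shrinking the gradient's l_1 norm during training directly shrinks this bound, which is the intuition behind a norm regularization term.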
-
Humans commonly work with multiple objects in daily life and can intuitively transfer manipulation skills to novel objects by understanding object functional regularities. However existing technical approaches for analyzing and synthesizing hand-object manipulation are mostly limited to handling a single hand and object due to the lack of data support. To address this we construct TACO an extensive bimanual hand-object-interaction dataset spanning a large variety of tool-action-object compositions for daily human activities. TACO contains 2.5K motion sequences paired with third-person and egocentric views precise hand-object 3D meshes and action labels. To rapidly expand the data scale we present a fully automatic data acquisition pipeline combining multi-view sensing with an optical motion capture system. With the vast research fields provided by TACO we benchmark three generalizable hand-object-interaction tasks: compositional action recognition generalizable hand-object motion forecasting and cooperative grasp synthesis. Extensive experiments reveal new insights challenges and opportunities for advancing the studies of generalizable hand-object motion analysis and synthesis. Our data and code are available at https://taco2024.github.io.
-
While existing motion style transfer methods are effective between two motions with identical content their performance significantly diminishes when transferring style between motions with different contents. This challenge lies in the lack of clear separation between content and style of a motion. To tackle this challenge we propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with transferred style from a source motion. Our distinctive approach to achieving the goal of disentanglement is twofold: (1) a new architecture for motion style transformer with 'part-attentive style modulator across body parts' and 'Siamese encoders that encode style and content features separately'; (2) style disentanglement loss. Our method outperforms existing methods and demonstrates exceptionally high quality particularly in motion pairs with different contents without the need for heuristic post-processing. Codes are available at https://github.com/Boeun-Kim/MoST.
-
The quality of the prompts provided to text-to-image diffusion models determines how faithful the generated content is to the user's intent, often requiring 'prompt engineering'. To harness visual concepts from target images without prompt engineering, current approaches largely rely on embedding inversion by optimizing embeddings and then mapping them to pseudo-tokens. However, working with such high-dimensional vector representations is challenging because they lack semantics and interpretability, and only allow simple vector operations when using them. Instead, this work focuses on inverting the diffusion model to obtain interpretable language prompts directly. The challenge of doing this lies in the fact that the resulting optimization problem is fundamentally discrete and the space of prompts is exponentially large; this makes using standard optimization techniques, such as stochastic gradient descent, difficult. To this end, we utilize a delayed projection scheme to optimize for prompts representative of the vocabulary space in the model. Further, we leverage the findings that different timesteps of the diffusion process cater to different levels of detail in an image. The later, noisy timesteps of the forward diffusion process correspond to the semantic information, and therefore prompt inversion in this range provides tokens representative of the image semantics. We show that our approach can identify semantically interpretable and meaningful prompts for a target image, which can be used to synthesize diverse images with similar content. We further illustrate the application of the optimized prompts in evolutionary image generation and concept removal.
-
In the realm of AI, data serves as a pivotal resource. Real-world hyperspectral images (HSIs), bearing wide spectral characteristics, are particularly valuable. However, the acquisition of HSIs is always costly and time-intensive, resulting in a severe data shortage in HSI research and applications. Current solutions have not been able to generate a sufficient volume of diverse and reliable synthetic HSIs. To this end, our study formulates a novel generalized paradigm for HSI synthesis, i.e., unmixing before fusion, which begins with unmixing across multi-source data and is followed by fusion-based synthesis. By integrating unmixing, this work maps unpaired HSI and RGB data to a low-dimensional abundance space, greatly alleviating the difficulty of generating high-dimensional samples. Moreover, incorporating abundances inferred from unpaired RGB images into generative models allows for cost-effective supplementation of various realistic spatial distributions in abundance synthesis. Our proposed paradigm can be used with a series of deep generative models, filling a significant gap in the field and enabling the generation of vast high-quality HSI samples for large-scale downstream tasks. Extensive experiments on downstream tasks demonstrate the effectiveness of synthesized HSIs. The code is available at: HSI-Synthesis.github.io.
-
Neural implicit fields have been a de facto standard in novel view synthesis. Recently there exist some methods exploring fusing multiple modalities within a single field aiming to share implicit features from different modalities to enhance reconstruction performance. However these modalities often exhibit misaligned behaviors: optimizing for one modality such as LiDAR can adversely affect another like camera performance and vice versa. In this work we conduct comprehensive analyses on the multimodal implicit field of LiDAR-camera joint synthesis revealing the underlying issue lies in the misalignment of different sensors. Furthermore we introduce AlignMiF a geometrically aligned multimodal implicit field with two proposed modules: Geometry-Aware Alignment (GAA) and Shared Geometry Initialization (SGI). These modules effectively align the coarse geometry across different modalities significantly enhancing the fusion process between LiDAR and camera data. Through extensive experiments across various datasets and scenes we demonstrate the effectiveness of our approach in facilitating better interaction between LiDAR and camera modalities within a unified neural field. Specifically our proposed AlignMiF achieves remarkable improvement over recent implicit fusion methods (+2.01 and +3.11 image PSNR on the KITTI-360 and Waymo datasets) and consistently surpasses single modality performance (13.8% and 14.2% reduction in LiDAR Chamfer Distance on the respective datasets).
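The two metrics quoted above can be sketched as follows. Definitions of Chamfer Distance vary between papers (squared versus unsquared distances, sum versus mean), so the convention used here, the sum of the two mean nearest-neighbor distances, is an assumption and may differ from the one used in the paper. PSNR is shown for values assumed to lie in [0, 1].

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance from A to B
    plus the mean nearest-neighbor distance from B to A (brute force)."""
    def mean_nearest(source, target):
        return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


def psnr(image_a, image_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two flat lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy inputs, not data from the paper.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(cloud_a, cloud_b))
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Lower Chamfer Distance means the synthesized LiDAR points lie closer to the reference points, while higher PSNR means the rendered image is closer to the reference image.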
-
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement restoration editing and compositing. However their widespread adoption is hindered by the high computational cost which limits their real-time application. To address this challenge we introduce a novel method dubbed CoDi that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally a conditional consistency loss enforces consistent predictions across diffusion steps effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods achieving a new state-of-the-art in producing high-quality images with very few steps (e.g. 1-4) across multiple tasks including super-resolution text-guided image editing and depth-to-image generation.
-
Learning representations to capture the very fundamental understanding of the world is a key challenge in machine learning. The hierarchical structure of explanatory factors hidden in data is such a general representation and could be potentially achieved with a hierarchical VAE. However training a hierarchical VAE always suffers from the "posterior collapse" where the data information is hard to propagate to the higher-level latent variables hence resulting in a bad hierarchical representation. To address this issue we first analyze the shortcomings of existing methods for mitigating the "posterior collapse" from an information theory perspective then highlight the necessity of regularization for explicitly propagating data information to higher-level latent variables while maintaining the dependency between different levels. This naturally leads to formulating the inference of the hierarchical latent representation as a sequential decision process which could benefit from applying reinforcement learning (RL). Aligning RL's objective with the regularization we first introduce a "skip-generative path" to acquire a reward for evaluating the information content of an inferred latent representation and then the developed Q-value function based on it could have a consistent optimization direction of the regularization. Finally policy gradient one of the typical RL methods is employed to train a hierarchical VAE without introducing a gradient estimator. Experimental results firmly support our analysis and demonstrate that our proposed method effectively mitigates the "posterior collapse" issue learns an informative hierarchy acquires explainable latent representations and significantly outperforms other hierarchical VAE-based methods in downstream tasks.
-
Event-based semantic segmentation has gained popularity due to its capability to deal with scenarios under high-speed motion and extreme lighting conditions which cannot be addressed by conventional RGB cameras. Since it is hard to annotate event data previous approaches rely on event-to-image reconstruction to obtain pseudo labels for training. However this will inevitably introduce noise and learning from noisy pseudo labels especially when generated from a single source may reinforce the errors. This drawback is also called confirmation bias in pseudo-labeling. In this paper we propose a novel hybrid pseudo-labeling framework for unsupervised event-based semantic segmentation HPL-ESS to alleviate the influence of noisy pseudo labels. In particular we first employ a plain unsupervised domain adaptation framework as our baseline which can generate a set of pseudo labels through self-training. Then we incorporate offline event-to-image reconstruction into the framework and obtain another set of pseudo labels by predicting segmentation maps on the reconstructed images. A noisy label learning strategy is designed to mix the two sets of pseudo labels and enhance the quality. Moreover we propose a soft prototypical alignment module to further improve the consistency of target domain features. Extensive experiments show that our proposed method outperforms existing state-of-the-art methods by a large margin on the DSEC-Semantic dataset (+5.88% accuracy +10.32% mIoU) which even surpasses several supervised methods.
-
We introduce X-Adapter, a universal upgrader to enable pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments, and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model. Project page at: https://showlab.github.io/X-Adapter.
-
The robustness of convolutional neural networks (CNNs) is vital to modern AI-driven systems. It can be quantified by formal verification, which provides a certified lower bound within which any perturbation does not alter the original input's classification result. Verification is challenging due to nonlinear components, such as MaxPool. At present, many verification methods are sound but risk losing some precision to enhance efficiency and scalability, and thus a certified lower bound is a crucial criterion for evaluating the performance of verification tools. In this paper, we present MaxLin, a robustness verifier for MaxPool-based CNNs with tight Linear approximation. By tightening the linear approximation of the MaxPool function, we can certify larger lower bounds for CNNs. We evaluate MaxLin with open-sourced benchmarks, including LeNet and networks trained on the MNIST, CIFAR-10, and Tiny ImageNet datasets. The results show that MaxLin outperforms state-of-the-art tools with up to 110.60% improvement regarding the certified lower bound and a 5.13x speedup for the same neural networks. Our code is available at https://github.com/xiaoyuanpigo/maxlin.
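To make "linear approximation of the MaxPool function" concrete, the sketch below builds a deliberately coarse but sound pair of linear bounds for max(x_1, ..., x_n) when each input is confined to an interval. It is a generic baseline relaxation written for illustration, not MaxLin's tighter construction, and the function names are hypothetical.

```python
def maxpool_linear_bounds(lower, upper):
    """Return ((a_lo, b_lo), (a_up, b_up)) such that, for every x with
    lower[i] <= x[i] <= upper[i]:
        a_lo . x + b_lo <= max(x) <= a_up . x + b_up
    """
    n = len(lower)
    # max(x) >= x_j for any j; pick the input with the largest lower bound.
    j = max(range(n), key=lambda i: lower[i])
    a_lo = [0.0] * n
    a_lo[j] = 1.0
    # If x_j provably dominates all other inputs, max(x) == x_j exactly.
    if all(lower[j] >= upper[i] for i in range(n) if i != j):
        return (a_lo, 0.0), (list(a_lo), 0.0)
    # Otherwise fall back to the constant upper bound max(upper).
    return (a_lo, 0.0), ([0.0] * n, max(upper))


def affine(coeffs, bias, x):
    return sum(c * xi for c, xi in zip(coeffs, x)) + bias


# Toy pooling window with two inputs and assumed interval bounds.
(lo_a, lo_b), (up_a, up_b) = maxpool_linear_bounds([-1.0, 0.0], [1.0, 2.0])
print(lo_a, lo_b, up_a, up_b)
```

The gap between the lower and upper planes is the precision lost by the relaxation; a tighter approximation narrows that gap, which is what allows a verifier to certify larger robustness bounds.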
-
The recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of the LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, which is tuned while keeping the backbone frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP, enabling video conversations without the need for video instructions. Besides, we develop a unique asymmetric token masking strategy inside the branch with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands of fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art results of video chatting using video instruction tuning, outperforming previous SOTAs by a large margin. The code is available at https://github.com/farewellthree/BT-Adapter.
-
CAD programs are a popular way to compactly encode shapes as a sequence of operations that are easy to parametrically modify. However without sufficient semantic comments and structure such programs can be challenging to understand let alone modify. We introduce the problem of semantic commenting CAD programs wherein the goal is to segment the input program into code blocks corresponding to semantically meaningful shape parts and assign a semantic label to each block. We solve the problem by combining program parsing with visual-semantic analysis afforded by recent advances in foundational language and vision models. Specifically by executing the input programs we create shapes which we use to generate conditional photorealistic images to make use of semantic annotators for such images. We then distill the information across the images and link back to the original programs to semantically comment on them. Additionally we collected and annotated a benchmark dataset CADTalk consisting of 5288 machine-made programs and 45 human-made programs with ground truth semantic comments. We extensively evaluated our approach compared it to a GPT-based baseline and an open-set shape segmentation baseline and reported an 83.24% accuracy on the new CADTalk dataset. Code and data: https://enigma-li.github.io/CADTalk/.
-
Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval models. However, in real-world scenarios, massive multimodal data are harvested from the Internet, which inevitably contains Partially Mismatched Pairs (PMPs). Undoubtedly, such semantically irrelevant data will remarkably harm the cross-modal retrieval performance. Previous efforts tend to mitigate this problem by estimating a soft correspondence to down-weight the contribution of PMPs. In this paper, we aim to address this challenge from a new perspective: the potential semantic similarity among unpaired samples makes it possible to excavate useful knowledge from mismatched pairs. To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs. In detail, L2RM aims to generate refined alignments by seeking a minimal-cost transport plan across different modalities. To formalize the rematching idea in OT, first, we propose a self-supervised cost function that automatically learns from an explicit similarity-cost mapping relation. Second, we propose modeling a partial OT problem while restricting the transport among false positives to further boost refined alignments. Extensive experiments on three benchmarks demonstrate that our L2RM significantly improves the robustness against PMPs for existing models. The code is available at https://github.com/hhc1997/L2RM.
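The idea of finding a minimal-cost transport plan across modalities can be illustrated with plain entropic optimal transport solved by Sinkhorn iterations. This is a simplified stand-in: L2RM learns its cost function and solves a partial OT problem, whereas the sketch below uses a fixed, made-up cost matrix and balanced marginals. All names and values are hypothetical.

```python
import math


def sinkhorn(cost, row_mass, col_mass, reg=0.1, num_iters=100):
    """Entropic-regularized OT plan between two discrete distributions."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Rows are images, columns are captions; entry (i, j) is an assumed matching cost.
# The nominal pairing is the diagonal, but samples 1 and 2 have swapped captions.
cost = [
    [0.1, 0.9, 0.8],
    [0.9, 0.8, 0.1],
    [0.8, 0.1, 0.9],
]
uniform = [1.0 / 3.0] * 3
plan = sinkhorn(cost, uniform, uniform)
rematch = [max(range(3), key=lambda j: row[j]) for row in plan]
print(rematch)
```

Reading off the largest entry in each row of the plan gives a refined alignment, here pairing image 1 with caption 2 and image 2 with caption 1 instead of trusting the noisy diagonal.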
-
Robotics agents often struggle to understand and follow the multi-modal prompts in complex manipulation scenes, which are challenging to describe sufficiently and accurately by text alone. Moreover, for long-horizon manipulation tasks, the deviation from the general instruction tends to accumulate if intermediate guidance from high-level subgoals is lacking. We therefore ask: can we generate subgoal images before acting to enhance instruction following in long-horizon manipulation with multi-modal prompts? Inspired by the great success of diffusion models in image generation tasks, we propose a novel hierarchical framework named CoTDiffusion that incorporates a diffusion model as a high-level planner to convert the general and multi-modal prompts into coherent visual subgoal plans, which further guide the low-level policy model before action execution. We design a semantic alignment module that can anchor the progress of generated keyframes along a coherent generation chain, unlocking the chain-of-thought reasoning ability of the diffusion model. Additionally, we propose a bi-directional generation and frame concat mechanism to further enhance the fidelity of generated subgoal images and the accuracy of instruction following. The experiments cover various robotics manipulation scenarios, including visual reasoning, visual rearrange, and visual constraints. CoTDiffusion achieves an outstanding performance gain compared to the baselines without explicit subgoal generation, which proves that a subgoal image is worth a thousand words of instruction.
-
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, these large foundation models often result in high computational cost. This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is to devise an asymmetric masking strategy, where the teacher model is enabled to see more context information with a lower masking ratio, while the student model is still equipped with a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K using the ViT-B model. It also achieves 73.3% classification accuracy using the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvement over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
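The asymmetric masking idea can be sketched by drawing one random ordering of patch indices and keeping a longer visible prefix for the teacher than for the student. The specific ratios, the patch count, and the choice to make the student's visible patches a subset of the teacher's are assumptions made for this illustration; the abstract only states that the teacher uses a lower masking ratio than the student.

```python
import random


def asymmetric_masks(num_patches, teacher_mask_ratio, student_mask_ratio, seed=None):
    """Return (teacher_visible, student_visible) lists of patch indices.

    The teacher masks fewer patches than the student, so it sees more context.
    Both visible sets come from one shuffled order, which makes the student's
    visible patches a subset of the teacher's (an assumption of this sketch).
    """
    if not 0.0 <= teacher_mask_ratio <= student_mask_ratio < 1.0:
        raise ValueError("need 0 <= teacher_mask_ratio <= student_mask_ratio < 1")
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)
    num_teacher_visible = num_patches - round(teacher_mask_ratio * num_patches)
    num_student_visible = num_patches - round(student_mask_ratio * num_patches)
    teacher_visible = sorted(order[:num_teacher_visible])
    student_visible = sorted(order[:num_student_visible])
    return teacher_visible, student_visible


# Hypothetical setting: 14 x 14 = 196 patches, 75% vs. 90% masking.
teacher_visible, student_visible = asymmetric_masks(196, 0.75, 0.90, seed=0)
print(len(teacher_visible), len(student_visible))
```

With a shared visible subset, teacher and student features at the same patch positions can be compared directly, which is one way a feature alignment loss between the two encoders could be set up.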
-
Despite recent advances in inversion-based editing text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the struggle to balance consistency with accuracy; 3) the lack of compatibility with efficient consistency sampling methods used in consistency models. To address the above issues we start by asking ourselves if the inversion process can be eliminated for editing. We show that when the initial sample is known a special variance schedule reduces the denoising step to the same form as the multi-step consistency sampling. We name this Denoising Diffusion Consistent Model (DDCM) and note that it implies a virtual inversion strategy without explicit inversion in sampling. We further unify the attention control mechanisms in a tuning-free framework for text-guided editing. Combining them we present inversion-free editing (InfEdit) which allows for consistent and faithful editing for both rigid and non-rigid semantic changes catering to intricate modifications without compromising on the image's integrity and explicit inversion. Through extensive experiments InfEdit shows strong performance in various editing tasks and also maintains a seamless workflow (less than 3 seconds on one single A40) demonstrating the potential for real-time applications.
-
Understanding human motion from video is essential for a range of applications including pose estimation mesh recovery and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures these approaches have limitations in practical scenarios. Transformers are slower when sequentially predicting on a continuous stream of frames in real-time and do not generalize to new frame rates. In light of these constraints we propose a novel attention-free spatiotemporal model for human motion understanding building upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits like adaptability to different video frame rates and enhanced training speed when working with longer sequence of keypoints. Moreover the proposed model supports both offline and real-time applications. For real-time sequential prediction our model is both memory efficient and several times faster than transformer-based approaches while maintaining their high accuracy.
-
It is a long-lasting goal to design an embodied system that can solve long-horizon open-world tasks in human-like ways. However existing approaches usually struggle with compound difficulties caused by the logic-aware decomposition and context-aware execution of these tasks. To this end we introduce MP5 an open-ended multimodal embodied system built upon the challenging Minecraft simulator which can decompose feasible sub-objectives design sophisticated situation-aware plans and perform embodied action control with frequent communication with a goal-conditioned active perception scheme. Specifically MP5 is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and the system is modulated into functional modules that can be scheduled and collaborated to ultimately solve pre-defined context- and process-dependent tasks. Extensive experiments prove that MP5 can achieve a 22% success rate on difficult process-dependent tasks and a 91% success rate on tasks that heavily depend on the context. Moreover MP5 exhibits a remarkable ability to address many open-ended tasks that are entirely novel.
-
Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos thereby enabling various applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks primarily concentrate on anomaly detection and localization our focus is on more practicality prompting us to raise the following crucial questions: "what anomaly occurred?" "why did it happen?" and "how severe is this abnormal event?". In pursuit of these answers we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically each instance of the proposed benchmark involves three sets of human annotations to indicate the "what" "why" and "how" of an anomaly including 1) anomaly type start and end times and event descriptions 2) natural language explanations for the cause of an anomaly and 3) free text reflecting the effect of the abnormality. In addition we also introduce MMEval a novel evaluation metric designed to better align with human preferences for CUVA facilitating the measurement of existing LLMs in comprehending the underlying cause and corresponding effect of video anomalies. Finally we propose a novel prompt-based method that can serve as a baseline approach for the challenging CUVA. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach.
-
3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle in interpreting complex linguistic queries particularly with descriptions that involve multiple anchors or are view-dependent. In response we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore MiKASA improves the explainability of decision-making facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets particularly excelling by a large margin in categories that require viewpoint-dependent descriptions.
-
The long-tailed distribution problem in medical image analysis reflects a high prevalence of common conditions and a low prevalence of rare ones which poses a significant challenge in developing a unified model capable of identifying rare or novel tumor categories not encountered during training. In this paper we propose a new Zero-shot Pan-Tumor segmentation framework (ZePT) based on query-disentangling and self-prompting to segment unseen tumor categories beyond the training set. ZePT disentangles the object queries into two subsets and trains them in two stages. Initially it learns a set of fundamental queries for organ segmentation through an object-aware feature grouping strategy which gathers organ-level visual features. Subsequently it refines the other set of advanced queries that focus on the auto-generated visual prompts for unseen tumor segmentation. Moreover we introduce query-knowledge alignment at the feature level to enhance each query's discriminative representation and generalizability. Extensive experiments on various tumor segmentation tasks demonstrate the performance superiority of ZePT which surpasses the previous counterparts and evidences the promising ability for zero-shot tumor segmentation in real-world settings.
-
Video moment retrieval and highlight detection are two highly valuable tasks in video understanding but only recently have they been studied jointly. Although existing studies have made impressive advances they predominantly follow the data-driven bottom-up paradigm. Such a paradigm overlooks task-specific and inter-task effects resulting in poor model performance. In this paper we propose a novel task-driven top-down framework TaskWeave for joint moment retrieval and highlight detection. The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks we propose an inter-task feedback mechanism which transforms the results of one task into guiding masks to assist the other task. Different from existing methods we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on QVHighlights TVSum and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Code is available at https://github.com/EdenGabriel/TaskWeave.
-
Contrastive pre-training of image-text foundation models such as CLIP has demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work we introduce MobileCLIP - a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3x faster while more accurate compared to the previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on ViT-B/16 image backbone and achieving +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover we show that the proposed approach achieves 10x-1000x improved learning efficiency when compared with non-reinforced CLIP training. Code and models are available at https://github.com/apple/ml-mobileclip
-
Point-based interactive editing serves as an essential tool to complement the controllability of existing generative models. A concurrent work DragDiffusion updates the diffusion latent map in response to user inputs causing global latent map alterations. This results in imprecise preservation of the original content and unsuccessful editing due to gradient vanishing. In contrast we present DragNoise offering robust and accelerated editing without retracing the latent map. The core rationale of DragNoise lies in utilizing the predicted noise output of each U-Net as a semantic editor. This approach is grounded in two critical observations: firstly the bottleneck features of U-Net inherently possess semantically rich features ideal for interactive editing; secondly high-level semantics established early in the denoising process show minimal variation in subsequent stages. Leveraging these insights DragNoise edits diffusion semantics in a single denoising step and efficiently propagates these changes ensuring stability and efficiency in diffusion editing. Comparative experiments reveal that DragNoise achieves superior control and semantic retention reducing the optimization time by over 50% compared to DragDiffusion. Our codes are available at https://github.com/haofengl/DragNoise.
-
Pseudo-label-based semi-supervised learning (SSL) algorithms trained on a class-imbalanced set face two cascading challenges: 1) Classifiers tend to be biased towards majority classes and 2) Biased pseudo-labels are used for training. It is difficult to appropriately re-balance the classifiers in SSL because the class distribution of an unlabeled set is often unknown and could be mismatched with that of a labeled set. We propose a novel class-imbalanced SSL algorithm called class-distribution-mismatch-aware debiasing (CDMAD). For each iteration of training CDMAD first assesses the classifier's biased degree towards each class by calculating the logits on an image without any patterns (e.g. solid color image) which can be considered irrelevant to the training set. CDMAD then refines biased pseudo-labels of the base SSL algorithm by ensuring the classifier's neutrality. CDMAD uses these refined pseudo-labels during the training of the base SSL algorithm to improve the quality of the representations. In the test phase CDMAD similarly refines biased class predictions on test samples. CDMAD can be seen as an extension of post-hoc logit adjustment to address a challenge of incorporating the unknown class distribution of the unlabeled set for re-balancing the biased classifier under class distribution mismatch. CDMAD ensures Fisher consistency for the balanced error. Extensive experiments verify the effectiveness of CDMAD.
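The refinement step described above can be illustrated with a minimal sketch. It assumes a toy linear classifier whose bias term favors class 0; the names `toy_logits` and `debias`, the weights and the inputs are hypothetical and not taken from the paper. The logits on a pattern-free input are treated as the classifier's class bias and subtracted from the logits on the actual sample.

```python
# Hypothetical toy classifier: logits = W @ x + B, with B favoring class 0.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [2.0, 0.0]

def toy_logits(x):
    """Return class logits of the toy linear classifier for input x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]

def debias(logits_fn, x, blank):
    """Subtract the logits of a pattern-free input from the logits of x."""
    sample = logits_fn(x)
    bias = logits_fn(blank)
    return [s - b for s, b in zip(sample, bias)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

x = [0.5, 1.0]      # stand-in for an image with content
blank = [0.0, 0.0]  # stand-in for a solid-color image

raw_pred = argmax(toy_logits(x))                     # prediction from biased logits
refined_pred = argmax(debias(toy_logits, x, blank))  # prediction after refinement
```

In this toy setting the raw prediction follows the bias toward class 0 while the refined prediction follows the evidence in `x`. The same subtraction could be applied to pseudo-labels during training and to class predictions at test time.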
-
Despite being (pre)trained on a massive amount of data state-of-the-art video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments such as replacing entities or actions and flipping event order which alignment models should be robust against. To this end we introduce VideoCon a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. Then a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally our model sets new state-of-the-art zero-shot performance in temporally-extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover our model shows superior performance on novel videos and human-crafted captions and explanations.
-
Scaled relative pose estimation i.e. estimating relative rotation and scaled relative translation between two images has always been a major challenge in global Structure-from-Motion (SfM). This difficulty arises because the two-view relative translation computed by traditional geometric vision methods e.g. the five-point algorithm is scaleless. Many researchers have proposed diverse translation averaging methods to solve this problem. Instead of solving the problem in the motion averaging phase we focus on estimating scaled relative pose with the help of panoramic cameras and deep neural networks. In this paper a novel network namely PanoPose is proposed to estimate the relative motion in a fully self-supervised manner and a global SfM pipeline is built for panorama images. The proposed PanoPose comprises a depth-net and a pose-net with self-supervision achieved by reconstructing the reference image from its neighboring images based on the estimated depth and relative pose. To maintain precise pose estimation under large viewing angle differences we randomly rotate the panoramic images and pre-train the pose-net with images before and after the rotation. To enhance scale accuracy a fusion block is introduced to incorporate depth information into pose estimation. Extensive experiments on panoramic SfM datasets demonstrate the effectiveness of PanoPose compared with state-of-the-art methods.
-
Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of predefined part labels to individual strokes. This paper presents ContextSeg - a simple yet highly effective approach to tackling this problem with two stages. In the first stage to better encode the shape and positional information of strokes we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an autoregressive Transformer with the default attention mechanism. By group-based labeling our method can fully leverage the context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets and has been extensively evaluated demonstrating its superior performance. Additionally we offer insights into solving part imbalance in training data and the preliminary experiment on cross-category training which can inspire future research in this field.
-
How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets yet manually sifting through thousands of images is impractical. To aid in this discovery process we explore the task of automatically describing the differences between two sets of images which we term Set Difference Captioning. This task takes in image sets $\mathcal{D}_A$ and $\mathcal{D}_B$ and outputs a description that is more often true on $\mathcal{D}_A$ than $\mathcal{D}_B$. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff which first captions the images and prompts a language model to propose candidate descriptions then re-ranks these descriptions using CLIP. To evaluate VisDiff we collect VisDiffBench a dataset with 187 paired image sets with ground truth difference descriptions. We apply VisDiff to various domains such as comparing datasets (e.g. ImageNet vs. ImageNetV2) comparing classification models (e.g. zero-shot CLIP vs. supervised ResNet) characterizing differences between generative models (e.g. StableDiffusionV1 and V2) and discovering what makes images memorable. Using VisDiff we are able to find interesting and previously unknown differences in datasets and models demonstrating its utility in revealing nuanced insights.
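The re-ranking stage can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-image scores are made-up stand-ins for image-text similarities, and AUROC is assumed here as the measure of how well a description separates the two sets. The names `auroc` and `rerank` are hypothetical.

```python
def auroc(scores_a, scores_b):
    """Probability that a random score from set A exceeds one from set B (ties count half)."""
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))

def rerank(candidates):
    """Sort candidate descriptions by how well their scores separate set A from set B."""
    scored = [(text, auroc(a, b)) for text, (a, b) in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up similarity scores: (scores on set A, scores on set B) per candidate.
candidates = {
    "outdoor photo": ([0.5, 0.6, 0.4], [0.5, 0.4, 0.6]),
    "dogs in snow": ([0.9, 0.8, 0.7], [0.2, 0.3, 0.1]),
}

ranked = rerank(candidates)
```

A description that applies equally to both sets scores near 0.5 and falls to the bottom of the ranking, while one that holds mostly on the first set scores near 1.0.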
-
Addressing biases in computer vision models is crucial for real-world AI deployments. However mitigating visual biases is challenging due to their unexplainable nature often identified indirectly through visualization or sample statistics which necessitates additional human supervision for interpretation. To tackle this issue we propose the Bias-to-Text (B2T) framework which interprets visual biases as keywords. Specifically we extract common keywords from the captions of mispredicted images to identify potential biases in the model. We then validate these keywords by measuring their similarity to the mispredicted images using a vision-language scoring model. The keyword explanation form of visual bias offers several advantages such as a clear group naming for bias discovery and a natural extension for debiasing using these group names. Our experiments demonstrate that B2T can identify known biases such as gender bias in CelebA background bias in Waterbirds and distribution shifts in ImageNet-R/C. Additionally B2T uncovers novel biases in larger datasets such as Dollar Street and ImageNet. For example we discovered a contextual bias between "bee" and "flower" in ImageNet. We also highlight various applications of B2T keywords including debiased training CLIP prompting and model comparison.
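The keyword extraction step can be sketched in a few lines. This is only an illustration: plain word frequency stands in for whatever keyword extractor the authors use, the captions and the stopword list are made up, and the validation step with a vision-language scoring model is omitted. The name `common_keywords` is hypothetical.

```python
from collections import Counter

# Small hypothetical stopword list for the toy captions below.
STOPWORDS = {"a", "the", "on", "in", "of", "and", "with", "is"}

def common_keywords(captions, top_k=2):
    """Return the top_k words that appear in the most captions."""
    counts = Counter()
    for caption in captions:
        # Use a set so each word is counted at most once per caption.
        words = {w for w in caption.lower().split() if w not in STOPWORDS}
        counts.update(words)
    return [word for word, _ in counts.most_common(top_k)]

# Made-up captions of images that a classifier mispredicted.
captions = [
    "a bee on a flower",
    "yellow flower in the garden",
    "a flower with a bee",
    "red flower",
]

keywords = common_keywords(captions)
```

Keywords that recur across mispredicted images are candidates for a bias. In the framework described above they would still need to be scored against the images before being accepted.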
-
Context-aware emotion recognition (CAER) has recently boosted the practical applications of affective computing techniques in unconstrained environments. Mainstream CAER methods invariably extract ensemble representations from diverse contexts and subject-centred characteristics to perceive the target person's emotional state. Despite advancements the biggest challenge remains due to context bias interference. The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation causing severe performance bottlenecks and confounding valuable context priors. In this paper we propose a counterfactual emotion inference (CLEF) framework to address the above issue. Specifically we first formulate a generalized causal graph to decouple the causal relationships among the variables in CAER. Following the causal graph CLEF introduces a non-invasive context branch to capture the adverse direct effect caused by the context bias. During the inference we eliminate the direct context effect from the total causal effect by comparing factual and counterfactual outcomes resulting in bias mitigation and robust prediction. As a model-agnostic framework CLEF can be readily integrated into existing methods bringing consistent performance gains.
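The inference-time subtraction can be sketched as follows, assuming the simplest possible form: the logits of a context-only branch are subtracted from the logits of the full model. The function name, the emotion labels and the numbers are hypothetical, and the fusion used in the paper may differ.

```python
def counterfactual_debias(fused_logits, context_only_logits):
    """Remove the direct context effect from the total effect by subtraction."""
    return [total - direct for total, direct in zip(fused_logits, context_only_logits)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

labels = ["happy", "sad"]          # hypothetical emotion classes
fused = [2.0, 1.5]                 # factual outcome: subject and context together
context_only = [1.5, 0.2]          # counterfactual outcome: context branch alone

biased = labels[argmax(fused)]
debiased = labels[argmax(counterfactual_debias(fused, context_only))]
```

Here the context alone pushes strongly toward "happy". Once that direct effect is removed the remaining evidence favors "sad".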
-
We introduce a lightweight and accurate localization method that only utilizes the geometry of 2D-3D lines. Given a pre-captured 3D map our approach localizes a panorama image taking advantage of the holistic 360 degree view. The system mitigates potential privacy breaches or domain discrepancies by avoiding trained or hand-crafted visual descriptors. However as lines alone can be ambiguous we express distinctive yet compact spatial contexts from relationships between lines namely the dominant directions of parallel lines and the intersection between non-parallel lines. The resulting representations are efficient in processing time and memory compared to conventional visual descriptor-based methods. Given the groups of dominant line directions and their intersections we accelerate the search process to test thousands of pose candidates in less than a millisecond without sacrificing accuracy. We empirically show that the proposed 2D-3D matching can localize panoramas for challenging scenes with similar structures dramatic domain shifts or illumination changes. Our fully geometric approach does not involve extensive parameter tuning or neural network training making it a practical algorithm that can be readily deployed in the real world. Project page including the code is available through this link: https://82magnolia.github.io/fgpl/.
-
Deep Neural Networks (DNNs) are widely used for visual classification tasks but their complex computation process and black-box nature hinder decision transparency and interpretability. Class activation maps (CAMs) and recent variants provide ways to visually explain the DNN decision-making process by displaying 'attention' heatmaps of the DNNs. Nevertheless the CAM explanation only offers relative attention information that is on an attention heatmap we can interpret which image region is more or less important than the others. However these regions cannot be meaningfully compared across classes and the contribution of each region to the model's class prediction is not revealed. To address these challenges and ultimately improve DNN interpretation in this paper we propose CAPE a novel reformulation of CAM that provides a unified and probabilistically meaningful assessment of the contributions of image regions. We quantitatively and qualitatively compare CAPE with state-of-the-art CAM methods on CUB and ImageNet benchmark datasets to demonstrate enhanced interpretability. We also test on a cytology imaging dataset depicting a challenging Chronic Myelomonocytic Leukemia (CMML) diagnosis problem. Code is available at: https://github.com/AIML-MED/CAPE.
-
Neural Rendering representations have significantly contributed to the field of 3D computer vision. Given their potential considerable efforts have been invested to improve their performance. Nonetheless the essential question of selecting training views is yet to be thoroughly investigated. This key aspect plays a vital role in achieving high-quality results and aligns with the well-known tenet of deep learning: "garbage in garbage out". In this paper we first illustrate the importance of view selection by demonstrating how a simple rotation of the test views within the most pervasive NeRF dataset can lead to consequential shifts in the performance rankings of state-of-the-art techniques. To address this challenge we introduce a unified framework for view selection methods and devise a thorough benchmark to assess its impact. Significant improvements can be achieved without leveraging error or uncertainty estimation but focusing on uniform view coverage of the reconstructed object resulting in a training-free approach. Using this technique we show that high-quality renderings can be achieved faster by using fewer views. We conduct extensive experiments on both synthetic datasets and realistic data to demonstrate the effectiveness of our proposed method compared with random conventional error-based and uncertainty-guided view selection.
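One training-free way to obtain uniform view coverage is greedy farthest-point sampling over camera positions, sketched below. This is an assumed stand-in for the selection rule, not necessarily the exact procedure of the paper. The function name and the camera positions are hypothetical.

```python
import math

def farthest_view_sampling(positions, k, start=0):
    """Greedily pick k views, each one farthest from the views already picked."""
    if k > len(positions):
        raise ValueError("k cannot exceed the number of candidate views")
    selected = [start]
    # Distance from every candidate to its nearest selected view.
    min_dist = [math.dist(p, positions[start]) for p in positions]
    while len(selected) < k:
        nxt = max(range(len(positions)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(positions):
            min_dist[i] = min(min_dist[i], math.dist(p, positions[nxt]))
    return selected

# Hypothetical camera centers along a line, for easy checking by hand.
cameras = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
chosen = farthest_view_sampling(cameras, k=3)
```

The selection needs only camera poses, so no rendering error or uncertainty estimate has to be computed before choosing the next view.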
-
Despite extensive research on training generative adversarial networks (GANs) with limited training data learning to generate images from long-tailed training distributions remains fairly unexplored. In the presence of imbalanced multi-class training data GANs tend to favor classes with more samples leading to the generation of low quality and less diverse samples in tail classes. In this study we aim to improve the training of class-conditional GANs with long-tailed data. We propose a straightforward yet effective method for knowledge sharing allowing tail classes to borrow from the rich information from classes with more abundant training data. More concretely we propose modifications to existing class-conditional GAN architectures to ensure that the lower-resolution layers of the generator are trained entirely unconditionally while reserving class-conditional generation for the higher-resolution layers. Experiments on several long-tail benchmarks and GAN architectures demonstrate a significant improvement over existing methods in both the diversity and fidelity of the generated images. The code is available at https://github.com/khorrams/utlo.
-
Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change we explore customized video subject swapping in this work where we aim to replace the main subject in a source video with a target subject having a distinct identity and potentially different shape. In contrast to previous methods that rely on dense correspondences we introduce the VideoSwap framework that exploits semantic point correspondences inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape. We also introduce various user-point interactions (e.g. removing points and dragging points) to address various semantic point correspondence. Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
-
There has been a growing interest in the task of generating sound for silent videos primarily because of its practicality in streamlining video post-production. However existing methods for video-sound generation attempt to directly create sound from visual representations which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper we present SonicVisionLM a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video we use the capabilities of powerful VLMs. When provided with a silent video our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into more well-studied sub-problems of aligning image-to-text and text-to-audio through the popular diffusion models. To improve the quality of audio recommendations with LLMs we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio enhancing synchronization with the visuals and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
-
A unified and versatile LiDAR segmentation model with strong robustness and generalizability is desirable for safe autonomous driving perception. This work presents M3Net a one-of-a-kind framework for fulfilling multi-task multi-dataset multi-modality LiDAR segmentation in a universal manner using just a single set of parameters. To better exploit data volume and diversity we first combine large-scale driving datasets acquired by different types of sensors from diverse scenes and then conduct alignments in three spaces namely data feature and label spaces during the training. As a result M3Net is capable of taming heterogeneous data for training state-of-the-art LiDAR segmentation models. Extensive experiments on twelve LiDAR segmentation datasets verify our effectiveness. Notably using a shared set of parameters M3Net achieves 75.1% 83.1% and 72.4% mIoU scores respectively on the official benchmarks of SemanticKITTI nuScenes and Waymo Open.
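For reference, the mIoU metric reported above averages the per-class intersection-over-union between predicted and ground-truth labels. The sketch below is a minimal version over flat label lists; the function name and the toy labels are hypothetical, and skipping classes absent from both prediction and ground truth is an assumed convention that individual benchmarks may handle differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class intersection-over-union for two equal-length label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither list
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy per-point labels for six points and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.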
-
Superpixels play a crucial role in image processing by partitioning an image into clusters of pixels with similar visual attributes. This facilitates subsequent image processing tasks offering computational advantages over the manipulation of individual pixels. While numerous oversegmentation techniques have emerged in recent years many rely on predefined initialization and termination criteria. In this paper a novel top-down superpixel segmentation algorithm called Hierarchical Histogram Threshold Segmentation (HHTS) is introduced. It eliminates the need for initialization and implements auto-termination outperforming state-of-the-art methods with respect to boundary recall. This is achieved by iteratively partitioning individual pixel segments into foreground and background and applying intensity thresholding across multiple color channels. The underlying iterative process constructs a superpixel hierarchy that adapts to local detail distributions until color information exhaustion. Experimental results demonstrate the superiority of the proposed approach in terms of boundary adherence while maintaining competitive runtime performance on the BSDS500 and NYUV2 datasets. Furthermore an application of HHTS in refining machine learning-based semantic segmentation masks produced by the Segment Anything Foundation Model (SAM) is presented.
-
Recent Vision Transformer Compression (VTC) works mainly follow a two-stage scheme where the importance score of each model unit is first evaluated or preset in each submodule followed by the sparsity score evaluation according to the target sparsity constraint. Such a separate evaluation process induces the gap between importance and sparsity score distributions thus causing high search costs for VTC. In this work for the first time we investigate how to integrate the evaluations of importance and sparsity scores into a single stage searching the optimal subnets in an efficient manner. Specifically we present OFB a cost-efficient approach that simultaneously evaluates both importance and sparsity scores termed Once for Both (OFB) for VTC. First a bi-mask scheme is developed by entangling the importance score and the differentiable sparsity score to jointly determine the pruning potential (prunability) of each unit. Such a bi-mask search strategy is further used together with a proposed adaptive one-hot loss to realize the progressive-and-efficient search for the most important subnet. Finally Progressive Masked Image Modeling (PMIM) is proposed to regularize the feature space to be more representative during the search process which may be degraded by the dimension reduction. Extensive experiments demonstrate that OFB can achieve superior compression performance over state-of-the-art searching-based and pruning-based methods under various Vision Transformer architectures meanwhile promoting search efficiency significantly e.g. costing one GPU search day for the compression of DeiT-S on ImageNet-1K.
-
We present As-Plausible-as-Possible (APAP) a mesh deformation technique that leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations where mesh vertex coordinates are computed via a differentiable Poisson Solve. The deformed mesh is rendered and the resulting 2D image is used in the Score Distillation Sampling (SDS) process which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians and we use iterative gradient descent to compute the final deformation that balances between the user edit and the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over geometry-preservation or distortion-minimization priors used by previous techniques. Our project page is at: https://as-plausible-aspossible.github.io/
-
We propose the Multiscale Correlation searching homography estimation Network namely MCNet an iterative deep homography estimation architecture. Different from previous approaches that achieve iterative refinement by correlation searching within a single scale MCNet combines the multiscale strategy with correlation searching incurring nearly negligible computational overhead. Moreover MCNet adopts a Fine-Grained Optimization loss function named FGO loss to further boost the network training at the convergent stage which can improve the estimation accuracy without additional computational overhead. According to our experiments these two simple strategies yield a significant improvement in homography estimation accuracy with considerable efficiency. We show that MCNet achieves state-of-the-art performance on a variety of datasets including common scene MSCOCO cross-modal scene GoogleEarth and GoogleMap and dynamic scene SPID. Compared to the previous SOTA method 2-scale RHWF our MCNet reduces inference time FLOPs parameter cost and memory cost by 78.9% 73.5% 34.1% and 33.2% respectively while achieving 20.5% (MSCOCO) 43.4% (GoogleEarth) and 41.1% (GoogleMap) mean average corner error (MACE) reduction. Source code is available at https://github.com/zjuzhk/MCNet.
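The MACE metric mentioned above can be sketched as follows, assuming its usual definition: for each image pair the Euclidean errors of the four warped corners are averaged, and those averages are then averaged over the dataset. The function name and the corner coordinates are hypothetical.

```python
import math

def mace(pred_corners, gt_corners):
    """Mean average corner error over a list of samples with four (x, y) corners each."""
    per_sample = []
    for pred, gt in zip(pred_corners, gt_corners):
        errors = [math.dist(p, g) for p, g in zip(pred, gt)]
        per_sample.append(sum(errors) / len(errors))  # average corner error of one sample
    return sum(per_sample) / len(per_sample)

# Two toy samples on a 128x128 patch; only one predicted corner is off, by (3, 4).
gt = [
    [(0, 0), (128, 0), (128, 128), (0, 128)],
    [(0, 0), (128, 0), (128, 128), (0, 128)],
]
pred = [
    [(3, 4), (128, 0), (128, 128), (0, 128)],
    [(0, 0), (128, 0), (128, 128), (0, 128)],
]

error = mace(pred, gt)
```

The single displaced corner has an error of 5 pixels, which gives an average corner error of 1.25 for the first sample and a MACE of 0.625 over both.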
-
Panoptic segmentation combining semantic and instance segmentation stands as a cutting-edge computer vision task. Despite recent progress with deep learning models the dynamic nature of real-world applications necessitates continual learning where models adapt to new classes (plasticity) over time without forgetting old ones (catastrophic forgetting). Current continual segmentation methods often rely on distillation strategies like knowledge distillation and pseudo-labeling which are effective but result in increased training complexity and computational overhead. In this paper we introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning dubbed ECLIPSE. Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings addressing both catastrophic forgetting and plasticity and significantly reducing the trainable parameters. To mitigate inherent challenges such as error propagation and semantic drift in continual segmentation we propose logit manipulation to effectively leverage common knowledge across the classes. Experiments on ADE20K continual panoptic segmentation benchmark demonstrate the superiority of ECLIPSE notably its robustness against catastrophic forgetting and its reasonable plasticity achieving a new state-of-the-art. The code is available at https://github.com/clovaai/ECLIPSE.
-
Continual learning can empower vision-language models to continuously acquire new knowledge without the need for access to the entire historical dataset. However mitigating the performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves the dynamic expansion of a pre-trained CLIP model through the integration of Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE Adapter and the original CLIP respectively. Through extensive experiments across various settings our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%.
-
Human matting is a foundational task in image and video processing, where human foreground pixels are extracted from the input. Prior works either improve accuracy through additional guidance or improve the temporal consistency of a single instance across frames. We propose a new framework, MaGGIe (Masked Guided Gradual Human Instance Matting), which predicts alpha mattes progressively for each human instance while maintaining computational cost, precision, and consistency. Our method leverages modern architectures, including transformer attention and sparse convolution, to output all instance mattes simultaneously without exploding memory and latency. While keeping inference costs constant in the multiple-instance scenario, our framework achieves robust and versatile performance on our proposed synthesized benchmarks. Alongside higher-quality image and video matting benchmarks, we introduce a novel multi-instance synthesis approach built from publicly available sources to improve the generalization of models in real-world scenarios. Our code and datasets are available at https://maggie-matt.github.io
-
Optical flow estimation, a process of predicting pixel-wise displacement between consecutive frames, has commonly been approached as a regression task in the age of deep learning. Despite notable advancements, this de facto paradigm unfortunately falls short in generalization performance when trained on synthetic or constrained data. Pioneering a paradigm shift, we reformulate optical flow estimation as a conditional flow generation challenge, unveiling FlowDiffuser, a new family of optical flow models that could have stronger learning and generalization capabilities. FlowDiffuser estimates optical flow through a "noise-to-flow" strategy, progressively eliminating noise from randomly generated flows conditioned on the provided pairs. To optimize accuracy and efficiency, our FlowDiffuser incorporates a novel Conditional Recurrent Denoising Decoder (Conditional-RDD), streamlining the flow estimation process. It incorporates a unique Hidden State Denoising (HSD) paradigm, effectively leveraging the information from previous time steps. Moreover, FlowDiffuser can be easily integrated into existing flow networks, leading to significant improvements in performance metrics compared to conventional implementations. Experiments on challenging benchmarks, including Sintel and KITTI, demonstrate the effectiveness of our FlowDiffuser, with superior performance to existing state-of-the-art models. Code is available at https://github.com/LA30/FlowDiffuser.
-
Implicit neural representation (INR), in combination with geometric rendering, has recently been employed in real-time dense RGB-D SLAM. Despite active research, a unified protocol for fair evaluation is still lacking, which impedes progress in this area. In this work, we establish, to our knowledge, the first open-source benchmark framework to evaluate the performance of a wide spectrum of commonly used INRs and rendering functions for mapping and localization. The goal of our benchmark is to 1) gain an intuition of how different INRs and rendering functions impact mapping and localization and 2) establish a unified evaluation protocol w.r.t. the design choices that may impact mapping and localization. With the framework, we conduct a large suite of experiments, offering various insights for choosing the INRs and geometric rendering functions: for example, the dense feature grid outperforms other INRs (e.g. tri-plane and hash grid), even when geometric and color features are jointly encoded for memory efficiency. To extend the findings to practical scenarios, a hybrid encoding strategy is proposed to combine the accuracy and completion strengths of the grid-based and decomposition-based INRs. We further propose explicit hybrid encoding for high-fidelity dense grid mapping to suit RGB-D SLAM systems that place a premium on robustness and computational efficiency.
-
We introduce Free3D, a simple, accurate method for monocular open-set novel view synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization and fine-tune it for NVS. Compared to other works that took a similar approach, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming, and without training an additional network for 3D reconstruction. Our key contribution is to improve the way the target camera pose is encoded in the network, which we do by introducing a new ray conditioning normalization (RCN) layer. The latter injects pose information in the underlying 2D image generator by telling each pixel its viewing direction. We further improve multi-view consistency by using light-weight multi-view attention layers and by sharing generation noise between the different views. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to new categories in new datasets, including OmniObject3D and GSO. The project page is available at https://chuanxiaz.com/free3d/.
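To make "telling each pixel its viewing direction" concrete, the sketch below computes a per-pixel unit ray direction for a pinhole camera. It is not Free3D's RCN layer; it only illustrates the kind of per-pixel ray map such a layer could be conditioned on. The function name, the intrinsics `fx`, `fy`, `cx`, `cy`, and the convention that the camera looks along +z are assumptions made for illustration.

```python
import math

def pixel_ray_directions(width, height, fx, fy, cx, cy):
    """Return rays[v][u] = unit direction (x, y, z) in camera coordinates.

    Assumed convention (hypothetical, not from the paper): pinhole camera,
    +z forward, pixel centers at integer index + 0.5.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel center onto the z = 1 plane.
            x = (u + 0.5 - cx) / fx
            y = (v + 0.5 - cy) / fy
            norm = math.sqrt(x * x + y * y + 1.0)
            row.append((x / norm, y / norm, 1.0 / norm))
        rays.append(row)
    return rays

# Toy 4x4 image with the principal point at the image center.
rays = pixel_ray_directions(width=4, height=4, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Each pixel ends up with its own direction vector, so a network layer that receives this map knows where every pixel is looking for a given camera.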
-
SVG (Scalable Vector Graphics) is a widely used graphics format that possesses excellent scalability and editability. Image vectorization, which aims to convert raster images to SVGs, is an important yet challenging problem in computer vision and graphics. Existing image vectorization methods either suffer from low reconstruction accuracy for complex images or require long computation time. To address this issue, we propose SuperSVG, a superpixel-based vectorization model that achieves fast and high-precision image vectorization. Specifically, we decompose the input image into superpixels to help the model focus on areas with similar colors and textures. Then, we propose a two-stage self-training framework, where a coarse-stage model is employed to reconstruct the main structure and a refinement-stage model is used for enriching the details. Moreover, we propose a novel dynamic path warping loss to help the refinement-stage model inherit knowledge from the coarse-stage model. Extensive qualitative and quantitative experiments demonstrate the superior performance of our method in terms of reconstruction accuracy and inference time compared to state-of-the-art approaches. The code is available at https://github.com/sjtuplayer/SuperSVG.
-
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages: 1) We can hold lifelike conversations with individuals worldwide in a virtual meeting while using our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning before training the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, so that the speaker in the source audio-visual speech can be maintained in the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. Demo page is available on choijeongsoo.github.io/av2av.
-
Semi-supervised semantic segmentation allows a model to mine effective supervision from unlabeled data to complement label-guided training. Recent research has primarily focused on consistency regularization techniques, exploring perturbation-invariant training at both the image and feature levels. In this work, we propose a novel feature-level consistency learning framework named Density-Descending Feature Perturbation (DDFP). Inspired by the low-density separation assumption in semi-supervised learning, our key insight is that feature density can shed light on the most promising direction for the segmentation classifier to explore, namely the regions with lower density. We propose to shift features with confident predictions towards lower-density regions by perturbation injection. The perturbed features are then supervised by the predictions on the original features, thereby compelling the classifier to explore less dense regions to effectively regularize the decision boundary. Central to our method is the estimation of feature density. To this end, we introduce a lightweight density estimator based on normalizing flow, allowing for efficient capture of the feature density distribution in an online manner. By extracting gradients from the density estimator, we can determine the direction towards less dense regions for each feature. The proposed DDFP outperforms other designs on feature-level perturbations and shows state-of-the-art performance on both the Pascal VOC and Cityscapes datasets under various partition protocols. The project is available at https://github.com/Gavinwxy/DDFP.
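The idea of moving a feature along the direction of decreasing density can be sketched in a few lines. The abstract uses a normalizing-flow density estimator; the sketch below swaps in an isotropic Gaussian as a stand-in so that the gradient of the log-density has a closed form. All names (`log_density`, `density_descending_perturb`, `step`, `sigma`) are hypothetical and chosen for illustration only.

```python
import math

def log_density(x, mean, sigma=1.0):
    """Log of an isotropic Gaussian density (stand-in for a learned estimator)."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -sq_dist / (2 * sigma**2) - 0.5 * len(x) * math.log(2 * math.pi * sigma**2)

def log_density_grad(x, mean, sigma=1.0):
    """Gradient of log_density with respect to x."""
    return [-(xi - mi) / sigma**2 for xi, mi in zip(x, mean)]

def density_descending_perturb(x, mean, step=0.5, sigma=1.0):
    """Shift x by `step` along the normalized direction of decreasing density."""
    grad = log_density_grad(x, mean, sigma)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(x)  # At the mode the gradient vanishes; leave x unchanged.
    return [xi - step * g / norm for xi, g in zip(x, grad)]

feature = [1.0, 0.0]
class_mean = [0.0, 0.0]
perturbed = density_descending_perturb(feature, class_mean)
```

In this toy setting the perturbed feature lies farther from the mode than the original, so its density is lower; a consistency loss would then ask the classifier to predict the same label for both.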
-
Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art image with pseudo-groundtruth enables efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects like vehicles and people in urban scenes.
-
Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.
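Representing a keypoint with dual 1-D heatmaps means predicting one distribution over horizontal bins and one over vertical bins, then reading a coordinate from each. The sketch below decodes a keypoint by taking the most likely bin on each axis. It assumes fixed, uniformly spaced bins over the whole image, which is a simplification: the abstract describes a dynamic coordinate classifier, which this example does not reproduce. The function name and arguments are hypothetical.

```python
def decode_keypoint(x_probs, y_probs, image_width, image_height):
    """Map the argmax bin of each 1-D distribution to the bin-center pixel."""
    x_bin = max(range(len(x_probs)), key=x_probs.__getitem__)
    y_bin = max(range(len(y_probs)), key=y_probs.__getitem__)
    # Assumption: bins tile the image uniformly along each axis.
    x = (x_bin + 0.5) * image_width / len(x_probs)
    y = (y_bin + 0.5) * image_height / len(y_probs)
    return x, y

# Toy example: 4 bins per axis on an 8x4 image.
keypoint = decode_keypoint(
    x_probs=[0.1, 0.7, 0.2, 0.0],
    y_probs=[0.0, 0.1, 0.1, 0.8],
    image_width=8,
    image_height=4,
)
```

Two vectors of length W and H replace a full W x H heatmap per keypoint, which is one reason this representation is attractive for dense one-stage detectors.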
-
We address the problem of generalized category discovery (GCD), which aims to partition a partially labeled collection of images; only a small part of the collection is labeled and the total number of target classes is unknown. To address this generalized image clustering problem, we revisit the mean-shift algorithm, i.e., a classic, powerful technique for mode seeking, and incorporate it into a contrastive learning framework. The proposed method, dubbed Contrastive Mean-Shift (CMS) learning, trains an embedding network to produce representations with better clustering properties by an iterative process of mean shift and contrastive update. Experiments demonstrate that our method, both in settings with and without the total number of clusters being known, achieves state-of-the-art performance on six public GCD benchmarks without bells and whistles.
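For readers unfamiliar with mode seeking, the sketch below runs classic Gaussian-kernel mean shift on a handful of 2-D points. It illustrates only the mean-shift half of the procedure; the contrastive update and the embedding network described in the abstract are not modeled, and the bandwidth and data are made-up values.

```python
import math

def mean_shift_step(points, bandwidth=1.0):
    """Move every point to the Gaussian-kernel-weighted mean of all points."""
    shifted = []
    for p in points:
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * bandwidth**2))
            for q in points
        ]
        total = sum(weights)
        shifted.append(
            [sum(w * q[d] for w, q in zip(weights, points)) / total
             for d in range(len(p))]
        )
    return shifted

# Two well-separated toy clusters (illustrative data).
embeddings = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
for _ in range(5):
    embeddings = mean_shift_step(embeddings)
```

After a few iterations the points within each cluster collapse onto a shared mode, while the two clusters stay apart; this is the clustering-friendly behavior that a learned embedding can be trained to reinforce.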
-
We introduce a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5650 videos and 9091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We have made datasets, code, and models publicly available at https://github.com/jianzongwu/Language-Driven-Video-Inpainting.
-
Although diffusion models are rising as a powerful solution for blind face restoration (BFR), they are criticized for two problems: 1) slow training and inference speed, and 2) failure to preserve identity and recover fine-grained facial details. In this work, we propose WaveFace to solve these problems in the frequency domain, where low- and high-frequency components decomposed by wavelet transformation are considered individually to maximize authenticity as well as efficiency. The diffusion model is applied to recover the low-frequency component only, which presents the general information of the original image but is only 1/16 of its size. To preserve the original identity, the generation is conditioned on the low-frequency component of low-quality images at each denoising step. Meanwhile, high-frequency components at multiple decomposition levels are handled by a unified network, which recovers complex facial details in a single step. Evaluations on four benchmark datasets show that: 1) WaveFace outperforms state-of-the-art methods in authenticity, especially in terms of identity preservation, and 2) authentic images are restored 10x faster than with existing diffusion model-based BFR methods.
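The "1/16 in size" figure is consistent with applying a 2-D wavelet decomposition twice, since each level halves both height and width. The sketch below shows this with a Haar low-pass step implemented as 2x2 block averaging (the Haar LL band up to a constant scale factor). That WaveFace uses exactly two levels and a Haar wavelet is an assumption made here for illustration; the abstract does not say.

```python
def haar_lowpass(image):
    """One level of 2-D Haar low-pass filtering: average each 2x2 block.

    Assumes even height and width. The result equals the Haar LL band
    up to a constant scale factor.
    """
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]

# Toy 8x8 "image" with values 0..63.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
low1 = haar_lowpass(image)   # 4x4
low2 = haar_lowpass(low1)    # 2x2
size_ratio = (len(low2) * len(low2[0])) / (len(image) * len(image[0]))
```

Running a diffusion model on the 2x2 (in general, H/4 x W/4) low-frequency component instead of the full image is what makes the reported speed-up plausible.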
-
Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient, and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5% and 20.1% margins, respectively. Our code is released at https://github.com/winycg/CLIP-KD.
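Feature mimicry with a Mean Squared Error loss amounts to penalizing the squared difference between student and teacher embeddings. A minimal sketch follows; it assumes the two embeddings already have the same dimensionality (in practice a projection layer may be needed, which is not shown), and the function name is hypothetical.

```python
def feature_mimicry_loss(student_feats, teacher_feats):
    """Mean squared error over all samples and embedding dimensions."""
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        if len(s) != len(t):
            raise ValueError("student and teacher embeddings must share a dimension")
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
        count += len(s)
    return total / count

# Two samples with 2-D embeddings (illustrative values).
student = [[1.0, 2.0], [0.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 2.0]]
loss = feature_mimicry_loss(student, teacher)
```

The loss is zero exactly when the student reproduces the teacher's features, which matches the abstract's explanation that distillation succeeds by maximizing teacher-student feature similarity.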
-
Recent advances in 3D avatar generation have gained significant attention. These breakthroughs aim to produce more realistic, animatable avatars, narrowing the gap between virtual and real-world experiences. Most existing works employ a Score Distillation Sampling (SDS) loss, combined with a differentiable renderer and text condition, to guide a diffusion model in generating 3D avatars. However, SDS often generates over-smoothed results with few facial details, thereby lacking diversity compared with ancestral sampling. On the other hand, other works generate a 3D avatar from a single image, where the challenges of unwanted lighting effects, perspective views, and inferior image quality make it difficult to reliably reconstruct 3D face meshes with aligned, complete textures. In this paper, we propose a novel 3D avatar generation approach termed UltrAvatar, with enhanced fidelity of geometry and superior quality of physically based rendering (PBR) textures without unwanted lighting. To this end, the proposed approach presents a diffuse color extraction model and an authenticity guided texture diffusion model. The former removes the unwanted lighting effects to reveal true diffuse colors so that the generated avatars can be rendered under various lighting conditions. The latter follows two gradient-based guidances for generating PBR textures to render diverse face-identity features and details better aligned with the 3D mesh geometry. We demonstrate the effectiveness and robustness of the proposed method, outperforming the state-of-the-art methods by a large margin in the experiments.
-
Visual object tracking aims to localize the target object in each frame based on its initial appearance in the first frame. Depending on the input modality, tracking tasks can be divided into RGB tracking and RGB+X (e.g. RGB+N and RGB+D) tracking. Despite the different input modalities, the core aspect of tracking is temporal matching. Based on this common ground, we present a general framework to unify various tracking tasks, termed OneTracker. OneTracker first performs large-scale pre-training on an RGB tracker called Foundation Tracker. This pretraining phase equips the Foundation Tracker with a stable ability to estimate the location of the target object. Then we regard other modality information as a prompt and build Prompt Tracker upon Foundation Tracker. By freezing the Foundation Tracker and only adjusting some additional trainable parameters, Prompt Tracker inherits the strong localization ability of Foundation Tracker and achieves parameter-efficient finetuning on downstream RGB+X tracking tasks. To evaluate the effectiveness of our general framework OneTracker, which consists of Foundation Tracker and Prompt Tracker, we conduct extensive experiments on 6 popular tracking tasks across 11 benchmarks, and our OneTracker outperforms other models and achieves state-of-the-art performance.
-
Recent trends in Large Vision Language Models (LVLMs) research have been increasingly focusing on advancing beyond general image understanding towards more nuanced, object-level referential comprehension. In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process. This capability significantly mirrors the precision and reliability of fine-grained visual-language understanding. Our findings reveal that the self-consistency level of existing LVLMs falls short of expectations, posing limitations on their practical applicability and potential. To address this gap, we introduce a novel fine-tuning paradigm named Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a cyclic describer-locator system. This paradigm is not only data-efficient but also exhibits generalizability across multiple LVLMs. Through extensive experiments, we demonstrate that SC-Tune significantly elevates performance across a spectrum of object-level vision-language benchmarks and maintains competitive or improved performance on image-level vision-language benchmarks. Both our model and code will be publicly available at https://github.com/ivattyue/SC-Tune.
-
The encoder-decoder network (ED-Net) is a commonly employed choice for existing depth completion methods, but its working mechanism is ambiguous. In this paper, we visualize the internal feature maps to analyze how the network densifies the input sparse depth. We find that the encoder features of ED-Net focus on the areas surrounding the input depth points. To obtain a dense feature and thus estimate complete depth, the decoder feature tends to complement and enhance the encoder feature through skip-connections so that the fused encoder-decoder feature becomes dense, with the result that the decoder feature itself is also sparse. However, ED-Net obtains the sparse decoder feature from the dense fused feature at the previous stage, where the "dense to sparse" process destroys the completeness of features and loses information. To address this issue, we present a depth feature upsampling network (DFU) that explicitly utilizes these dense features to guide the upsampling of a low-resolution (LR) depth feature to a high-resolution (HR) one. The completeness of features is maintained throughout the upsampling process, thus avoiding information loss. Furthermore, we propose a confidence-aware guidance module (CGM) that performs guidance with adaptive receptive fields (GARF) to fully exploit the potential of these dense features as guidance. Experimental results show that our DFU, a plug-and-play module, can significantly improve the performance of existing ED-Net based methods with limited computational overhead, and new SOTA results are achieved. In addition, the generalization capability on sparser depth is also enhanced. Project page: https://npucvr.github.io/DFU.
-
We present NeRSP, a Neural 3D reconstruction technique for Reflective surfaces with Sparse Polarized images. Reflective surface reconstruction is extremely challenging, as specular reflections are view-dependent and thus violate the multiview consistency required by multiview stereo. On the other hand, sparse image inputs, as a practical capture setting, commonly cause incomplete or distorted results due to the lack of correspondence matching. This paper jointly handles the challenges from sparse inputs and reflective surfaces by leveraging polarized images. We derive photometric and geometric cues from the polarimetric image formation model and multiview azimuth consistency, which jointly optimize the surface geometry modeled via implicit neural representation. Based on the experiments on our synthetic and real datasets, we achieve state-of-the-art surface reconstruction results with only 6 views as input.
-
Embodied agents operating in complex and uncertain environments face considerable challenges. While some advanced agents handle complex manipulation tasks with proficiency, their success often hinges on extensive training data to develop their capabilities. In contrast, humans typically rely on recalling past experiences and analogous situations to solve new problems. Aiming to emulate this human approach in robotics, we introduce the Retrieval-Augmented Embodied Agent (RAEA). This innovative system equips robots with a form of shared memory, significantly enhancing their performance. Our approach integrates a policy retriever, allowing robots to access relevant strategies from an external policy memory bank based on multi-modal inputs. Additionally, a policy generator is employed to assimilate these strategies into the learning process, enabling robots to formulate effective responses to tasks. Extensive testing of RAEA in both simulated and real-world scenarios demonstrates its superior performance over traditional methods, representing a major leap forward in robotic technology.
-
LiDAR-based 3D object detection plays an essential role in autonomous driving. Existing high-performing 3D object detectors usually build dense feature maps in the backbone network and prediction head. However, the computational costs introduced by the dense feature maps grow quadratically as the perception range increases, making these models hard to scale up to long-range detection. Some recent works have attempted to construct fully sparse detectors to solve this issue; nevertheless, the resulting models either rely on a complex multi-stage pipeline or exhibit inferior performance. In this work, we propose a fully sparse adaptive feature diffusion network (SAFDNet) for LiDAR-based 3D object detection. In SAFDNet, an adaptive feature diffusion strategy is designed to address the center feature missing problem. We conducted extensive experiments on the Waymo Open, nuScenes, and Argoverse2 datasets. SAFDNet performed slightly better than the previous SOTA on the first two datasets but much better on the last dataset, which features long-range detection, verifying the efficacy of SAFDNet in scenarios where long-range detection is required. Notably, on Argoverse2, SAFDNet surpassed the previous best hybrid detector HEDNet by 2.6% mAP while being 2.1x faster, and yielded 2.1% mAP gains over the previous best sparse detector FSDv2 while being 1.3x faster. The code will be available at https://github.com/zhanggang001/HEDNet.
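The quadratic growth mentioned above follows directly from the geometry of a dense bird's-eye-view grid: a square region extending `max_range` metres in every direction has (2 * max_range / cell_size)^2 cells. The short sketch below works through that arithmetic. The ranges and cell size are illustrative values chosen here, not settings reported in the abstract.

```python
def dense_bev_cells(max_range_m, cell_size_m):
    """Number of cells in a dense square BEV grid covering [-R, R] x [-R, R]."""
    cells_per_side = int(round(2 * max_range_m / cell_size_m))
    return cells_per_side * cells_per_side

# Hypothetical settings: a short-range and a long-range perception setup.
short_range = dense_bev_cells(max_range_m=75.0, cell_size_m=0.5)
long_range = dense_bev_cells(max_range_m=200.0, cell_size_m=0.5)
growth = long_range / short_range
```

Under these assumed numbers, extending the range by a factor of about 2.7 multiplies the dense grid size by about 7.1, whereas a fully sparse detector only pays for cells that actually contain points.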
-
We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address the challenge, prior methods employ joint heatmaps (probabilistic 2D representations of the body pose), but heatmap-to-3D pose conversion still remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into an effective feature embedding using self-attention. Then, the Propagation Network estimates the 3D pose by utilizing skeletal information to better estimate the position of obscured joints. Our method significantly outperforms the previous state-of-the-art qualitatively and quantitatively, as demonstrated by a 23.9% reduction of error in the MPJPE metric. Our source code is available on GitHub.
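MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. A minimal sketch of the metric follows; the joint coordinates are made-up values, and any alignment step that an evaluation protocol may apply before measuring the error is omitted.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need the same non-zero number of joints in both poses")
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(predicted)

# Two joints with errors of 3 and 4 units respectively (illustrative values).
pred = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
gt = [[0.0, 0.0, 3.0], [1.0, 4.0, 0.0]]
error = mpjpe(pred, gt)
```

A 23.9% reduction in this metric means the new error is 0.761 times the previous one, regardless of the unit in which joint positions are measured.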
-
Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and benchmarked, it remains challenging to transfer this success to other skeleton structures with limited data. In this work, we design a model architecture that imitates the Generative Pretraining Transformer (GPT), transferring prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for both animal and human motions and, at the same time, optimize through the similarity scores among the human motion encoding, the animal motion encoding, and the text CLIP embedding. Presenting the first solution to this problem, we are able to generate animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data. Additionally, we introduce AnimalML3D, the first text-animal motion dataset, with 1240 animation sequences spanning 36 different animal identities. We hope this dataset will help alleviate the data scarcity problem in text-driven animal motion generation, providing a new playground for the research community.
-
We propose SNI-SLAM, a semantic SLAM system utilizing neural implicit representation that simultaneously performs accurate semantic mapping, high-quality surface reconstruction, and robust camera tracking. In this system, we introduce a hierarchical semantic representation to allow multi-level semantic comprehension for top-down structured semantic mapping of the scene. In addition, to fully utilize the correlation between multiple attributes of the environment, we integrate appearance, geometry, and semantic features through cross-attention for feature collaboration. This strategy enables a more multifaceted understanding of the environment, thereby allowing SNI-SLAM to remain robust even when a single attribute is defective. Then, we design an internal fusion-based decoder to obtain semantic, RGB, and Truncated Signed Distance Field (TSDF) values from multi-level features for accurate decoding. Furthermore, we propose a feature loss to update the scene representation at the feature level. Compared with low-level losses such as RGB loss and depth loss, our feature loss is capable of guiding the network optimization at a higher level. Our SNI-SLAM method demonstrates superior performance over all recent NeRF-based SLAM methods in terms of mapping and tracking accuracy on the Replica and ScanNet datasets, while also showing excellent capabilities in accurate semantic segmentation and real-time semantic mapping. Code will be available at https://github.com/IRMVLab/SNI-SLAM.
-
Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion, which adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations, such as simple single points, scribbles, bounding boxes, or intricate instance segmentation masks, and combinations thereof. We propose three major changes to text-to-image models that enable precise instance-level control. Our UniFusion block enables instance-level conditions for text-to-image models, the ScaleU block improves image fidelity, and our Multi-instance Sampler improves generations for multiple instances. InstanceDiffusion significantly surpasses specialized state-of-the-art models for each location condition. Notably, on the COCO dataset, we outperform the previous state-of-the-art by 20.4% box AP50 for box inputs and 25.4% IoU for mask inputs.
-
Most models of visual attention aim at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. In this paper, we propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and "taskless" free viewing, but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT.
-
Current sparsely-supervised object detection methods largely depend on high threshold settings to derive high-quality pseudo labels from detector predictions. However, hard instances within point clouds frequently display incomplete structures, causing decreased confidence scores in their assigned pseudo-labels. Previous methods inevitably result in inadequate positive supervision for these instances. To address this problem, we propose a novel Hard INsTance Enhanced Detector (HINTED) for sparsely-supervised 3D object detection. Firstly, we design a self-boosting teacher (SBT) model to generate more potential pseudo-labels, enhancing the effectiveness of information transfer. Then, we introduce a mixed-density student (MDS) model to concentrate on hard instances during the training phase, thereby improving detection accuracy. Our extensive experiments on the KITTI dataset validate our method's superior performance. Compared with leading sparsely-supervised methods, HINTED significantly improves the detection performance on hard instances, notably outperforming fully-supervised methods in detecting challenging categories like cyclists. HINTED also significantly outperforms the state-of-the-art semi-supervised method on challenging categories. The code is available at https://github.com/xmuqimingxia/HINTED.
-
Gradient-based saliency maps have been widely used to explain the decisions of deep neural network classifiers. However, standard gradient-based interpretation maps, including the simple gradient and integrated gradient algorithms, often lack desired structures such as sparsity and connectedness in their application to real-world computer vision models. A common approach to induce sparsity-based structures into gradient-based saliency maps is to modify the simple gradient scheme using sparsification or norm-based regularization. However, one drawback with such post-processing approaches is the potentially significant loss in fidelity to the original simple gradient map. In this work, we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. We demonstrate an existing duality between the regularized norms of the adversarial perturbations and gradient-based maps, whereby we design adversarial training schemes promoting sparsity and group-sparsity properties in simple gradient maps. We present comprehensive numerical results to show the influence of our proposed norm-based adversarial training methods on the standard gradient-based maps of standard neural network architectures on benchmark image datasets.
-
An effective pre-training framework with universal 3D representations is highly desirable for perceiving large-scale dynamic scenes. However, establishing such an ideal framework that is both task-generic and label-efficient poses a challenge in unifying the representation of the same primitive across diverse scenes. Current contrastive 3D pre-training methods typically follow frame-level consistency, which focuses on the 2D-3D relationships in each detached image. Such inconsiderate consistency greatly hampers the promising path toward a universal pre-training framework: (1) the cross-scene semantic self-conflict, i.e., the intense collision between primitive segments of the same semantics from different scenes; (2) the lack of a globally unified bond that pushes cross-scene semantic consistency into 3D representation learning. To address these challenges, we propose a CSC framework that puts scene-level semantic consistency at its heart, bridging the connection between similar semantic segments across various scenes. To achieve this goal, we combine the coherent semantic cues provided by the vision foundation model and the knowledge-rich cross-scene prototypes derived from complementary multi-modality information. These allow us to train a universal 3D pre-training model that facilitates various downstream tasks with less fine-tuning effort. Empirically, we achieve consistent improvements over SOTA pre-training approaches in semantic segmentation (+1.4% mIoU), object detection (+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D networks on nuScenes. Code is released at https://github.com/chenhaomingbob/CSC in the hope of inspiring future research.
-
Implicit neural representations for video (NeRV) have recently become a novel way to represent video at high quality. However, existing works employ a single network to represent the entire video, which implicitly confuses static and dynamic information. This leads to an inability to effectively compress the redundant static information and a lack of explicit modeling of global, temporally coherent dynamic details. To solve these problems, we propose DS-NeRV, which decomposes videos into sparse learnable static codes and dynamic codes without the need for explicit optical flow or residual supervision. By setting different sampling rates for the two codes and applying weighted sum and interpolation sampling methods, DS-NeRV efficiently utilizes redundant static information while maintaining high-frequency details. Additionally, we design a cross-channel attention-based (CCA) fusion module to efficiently fuse these two codes for frame decoding. Our approach achieves a high-quality reconstruction of 31.2 PSNR with only 0.35M parameters, thanks to the separate static and dynamic code representation, and outperforms existing NeRV methods in many downstream tasks. Our project website is at https://haoyan14.github.io/DS-NeRV.
-
3D facial editing, a longstanding task in computer vision with broad applications, is expected to manipulate any face quickly and intuitively from arbitrary viewpoints, following the user's will. Existing works have limitations in terms of intuitiveness, generalization, and efficiency. To overcome these challenges, we propose FaceEdit3D, which allows users to directly manipulate 3D points to edit a 3D face, achieving natural and rapid face editing. After one or several points are manipulated by users, we propose tri-plane warping to directly manipulate the view-independent 3D representation. To address the problem of distortion caused by tri-plane warping, we train a warp-aware encoder to project the warped face onto a standardized latent space. In this space, we further propose directional latent editing to mitigate the identity bias caused by the encoder and realize the disentangled editing of various attributes. Extensive experiments show that our method achieves superior results with rich facial details and good identity preservation. Our approach also supports general applications like multi-attribute continuous editing and cat/car editing. The project website is https://cyh-sj.github.io/FaceEdit3D/.
-
This paper introduces 3DFIRES, a novel system for scene-level 3D reconstruction from posed images. Designed to work with as few as one view, 3DFIRES reconstructs the complete geometry of unseen scenes, including hidden surfaces. With multiple view inputs, our method produces a full reconstruction within all camera frustums. A key feature of our approach is the fusion of multi-view information at the feature level, enabling the production of coherent and comprehensive 3D reconstructions. We train our system on non-watertight scans from a large-scale real-scene dataset. We show that it matches the efficacy of single-view reconstruction methods with only one input and surpasses existing techniques in both quantitative and qualitative measures for sparse-view 3D reconstruction.
-
Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, for the intricate task of semantic segmentation. By aggregating the cosine similarity scores, i.e., the cost volume between image and text embeddings, our method effectively adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders, addressing the challenges faced by existing methods in handling unseen classes. Building upon this, we explore methods to effectively aggregate the cost volume, considering its multi-modal nature of being established between image and text embeddings. Furthermore, we examine various methods for efficiently fine-tuning CLIP.
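The cost volume mentioned in this abstract can be illustrated with a small, standard-library-only sketch. It assumes per-pixel image embeddings and per-class text embeddings are already available as plain lists; the function names and the argmax labeling at the end are illustrative assumptions, not the paper's aggregation module, which the abstract describes only at a high level.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cost_volume(image_embeds, text_embeds):
    """Build an H x W x C volume of image-text cosine similarities.

    image_embeds: H x W grid of D-dimensional pixel (or patch) embeddings.
    text_embeds:  list of C D-dimensional class-text embeddings.
    """
    return [[[cosine(pix, txt) for txt in text_embeds] for pix in row]
            for row in image_embeds]


def argmax_labels(volume):
    """Naive baseline: label each location with its highest-similarity class."""
    return [[max(range(len(scores)), key=scores.__getitem__) for scores in row]
            for row in volume]


# Toy 2 x 2 "image" with 2-D embeddings and two class prompts (made-up values).
image_embeds = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.1, 0.9], [0.0, 1.0]],
]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]

volume = cost_volume(image_embeds, text_embeds)
labels = argmax_labels(volume)
print(labels)
```

Taking the argmax directly is only a baseline for reading the volume; the abstract's contribution lies in aggregating the volume before prediction.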
-
Recently, diffusion-based methods like InstructPix2Pix (IP2P) have achieved effective instruction-based image editing, requiring only natural language instructions from the user. However, these methods often inadvertently alter unintended areas and struggle with multi-instruction editing, resulting in compromised outcomes. To address these issues, we introduce Focus on Your Instruction (FoI), a method designed to ensure precise and harmonious editing across multiple instructions without extra training or test-time optimization. In FoI, we primarily emphasize two aspects: (1) precisely extracting regions of interest for each instruction and (2) guiding the denoising process to concentrate within these regions of interest. For the first objective, we identify the implicit grounding capability of IP2P from the cross-attention between instruction and image, then develop an effective mask extraction method.
-
Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at https://github.com/hoqolo/SDSTrack.
-
Recent advancements in post-hoc and inherently interpretable methods have markedly enhanced the explanations of black-box classifier models. These methods operate either through post-analysis or by integrating concept learning during model training. Although effective in bridging the semantic gap between a model's latent space and human interpretation, these explanation methods only partially reveal the model's decision-making process. The outcome is typically limited to high-level semantics derived from the last feature map. We argue that explanations lacking insights into the decision processes at low- and mid-level features are neither fully faithful nor useful. Addressing this gap, we introduce the Multi-Level Concept Prototypes Classifier (MCPNet), an inherently interpretable model. MCPNet autonomously learns meaningful concept prototypes across multiple feature map levels using a Centered Kernel Alignment (CKA) loss and an energy-based weighted PCA mechanism, and it does so without reliance on predefined concept labels. Further, we propose a novel classifier paradigm that learns and aligns multi-level concept prototype distributions for classification purposes via a Class-aware Concept Distribution (CCD) loss. Our experiments reveal that the proposed MCPNet, while being adaptable to various model architectures, offers comprehensive multi-level explanations while maintaining classification accuracy. Additionally, its concept distribution-based classification approach shows improved generalization capabilities in few-shot classification scenarios.
-
In recent years, there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions that lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time.
-
Large Language Models (LLMs) have shown remarkable emergent abilities in unifying almost all (if not every) NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. Inspired by InstructGPT and the generalist concept behind Gato, we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, and generation, as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the universal interface, constituting a closed loop within the framework. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vocabulary of the LLM. Then, an unsupervised pipeline is developed to generate natural language descriptions of human action sequences from in-the-wild videos. Finally, all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks and promising results on high-level tasks, demonstrating the effectiveness of our proposed All-in-One framework. Moreover, for the first time, AvatarGPT enables a principled approach to unlimited long-motion synthesis by iteratively traversing the tasks within the closed loop.
-
Recently, the proliferation of highly realistic synthetic images, facilitated through a variety of GANs and diffusion models, has significantly heightened the susceptibility to misuse. While the primary focus of deepfake detection has traditionally centered on the design of detection algorithms, an investigative inquiry into the generator architectures has remained conspicuously absent in recent years. This paper addresses this lacuna by rethinking the architectures of CNN-based generators, thereby establishing a generalized representation of synthetic artifacts. Our findings illuminate that the up-sampling operator can, beyond frequency-based artifacts, produce generalized forgery artifacts. In particular, the local interdependence among image pixels caused by up-sampling operators is significantly demonstrated in synthetic images generated by GANs or diffusion models. Building upon this observation, we introduce the concept of Neighboring Pixel Relationships (NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. A comprehensive analysis is conducted on an open-world dataset comprising samples generated by 28 distinct generative models. This analysis culminates in the establishment of a novel state-of-the-art performance, showcasing a remarkable 12.8% improvement over existing methods. The code is available at https://github.com/chuangchuangtan/NPR-DeepfakeDetection.
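The abstract does not define how Neighboring Pixel Relationships are computed, so the following is only a hypothetical illustration of the idea of local interdependence within up-sampling blocks. It assumes a grayscale image stored as a list of rows and subtracts, from every pixel, the top-left pixel of the block it belongs to. The function name and the block size of 2 are assumptions made for this sketch.

```python
def neighboring_pixel_residual(image, block=2):
    """Residual of each pixel relative to the top-left pixel of its block.

    image: list of rows of numbers (grayscale intensities).
    block: side length of the local block (2 mirrors a 2x up-sampling step).
    Regions produced by nearest-neighbor up-sampling yield all-zero residuals,
    whereas natural texture generally does not.
    """
    height, width = len(image), len(image[0])
    return [
        [image[r][c] - image[r - r % block][c - c % block] for c in range(width)]
        for r in range(height)
    ]


# Left 2x2 block: natural-looking variation; right 2x2 block: replicated pixels.
image = [
    [10, 12, 20, 20],
    [11, 13, 20, 20],
]
residual = neighboring_pixel_residual(image)
print(residual)
```

In a detector, a residual map of this kind would be fed to a classifier in place of the raw image; that pipeline is outside the scope of this sketch.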
-
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) a suitable motion feature is needed to describe complex human movements with crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should be temporally aligned, even for sequences of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then, a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech and to perform generation in the latent motion space, followed by an optimal motion selection module to produce long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details of certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
-
Existing Blind image Super-Resolution (BSR) methods focus on estimating either kernel or degradation information, but have long overlooked the essential content details. In this paper, we propose a novel BSR approach, Content-aware Degradation-driven Transformer (CDFormer), to capture both degradation and content representations. However, low-resolution images cannot provide enough content details, and thus we introduce a diffusion-based module CDFormer_diff to first learn a Content Degradation Prior (CDP) in both low- and high-resolution images, and then approximate the real distribution given only low-resolution information. Moreover, we apply an adaptive SR network CDFormer_SR that effectively utilizes CDP to refine features. Compared to previous diffusion-based SR methods, we treat the diffusion model as an estimator that can overcome the limitations of expensive sampling time and excessive diversity. Experiments show that CDFormer can outperform existing methods, establishing a new state-of-the-art performance on various benchmarks under blind settings. Codes and models will be available at https://github.com/I2-Multimedia-Lab/CDFormer.
-
Generating a 3D human model from a single reference image is challenging because it requires inferring textures and geometries in invisible views while maintaining consistency with the reference image. Previous methods utilizing 3D generative models are limited by the availability of 3D training data. Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in inconsistent appearances in different views. In this paper, we propose HumanRef, a 3D human generation framework from a single-view input. To ensure the generated 3D model is photorealistic and consistent with the input image, HumanRef introduces a novel method called reference-guided score distillation sampling (Ref-SDS), which effectively incorporates image guidance into the generation process. Furthermore, we introduce region-aware attention to Ref-SDS, ensuring accurate correspondence between different body regions. Experimental results demonstrate that HumanRef outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry, photorealistic textures, and view-consistent appearances. Code and model are available at https://eckertzhang.github.io/HumanRef.github.io/.
-
Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap, we introduce GlitchBench, a novel benchmark derived from video game quality assurance tasks, to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/
-
The goal of interactive image segmentation is to delineate specific regions within an image via visual or language prompts. Low-latency and high-quality interactive segmentation with diverse prompts remains challenging for existing specialist and generalist models. Specialist models, with their limited prompts and task-specific designs, experience high latency because the image must be recomputed every time the prompt is updated, due to the joint encoding of image and visual prompts. Generalist models, exemplified by the Segment Anything Model (SAM), have recently excelled in prompt diversity and efficiency, lifting image segmentation to the foundation model era. However, for high-quality segmentations, SAM still lags behind state-of-the-art specialist models despite being trained with 100x more segmentation masks. In this work, we delve deep into the architectural differences between the two types of models. We observe that dense representation and fusion of visual prompts are the key design choices contributing to the high segmentation quality of specialist models. In light of this, we reintroduce this dense design into the generalist models to facilitate the development of generalist models with high segmentation quality. To densely represent diverse visual prompts, we propose to use a dense map to capture five types: clicks, boxes, polygons, scribbles, and masks. Thus, we propose SegNext, a next-generation interactive segmentation approach offering low latency, high quality, and diverse prompt support. Our method outperforms current state-of-the-art methods on HQSeg-44K and DAVIS, both quantitatively and qualitatively.
-
This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) in the first network layer, it merges similar tokens within a small local window, and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in those situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only significantly improves the throughput by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at https://tue-mps.github.io/ALGM.
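The local merging stage described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released ALGM code: tokens are a flat 1-D sequence of plain lists, windows are non-overlapping, a window is merged by averaging only if every pair of tokens in it exceeds a cosine-similarity threshold, and the names `merge_local`, `window`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_local(tokens, window=2, threshold=0.9):
    """Merge each non-overlapping window of tokens if all its pairs are similar.

    A merged window is replaced by the element-wise mean of its tokens;
    a window containing any dissimilar pair is kept unchanged.
    """
    merged = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        all_similar = all(
            cosine(a, b) >= threshold
            for i, a in enumerate(group)
            for b in group[i + 1:]
        )
        if len(group) > 1 and all_similar:
            dim = len(group[0])
            merged.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
        else:
            merged.extend(group)
    return merged


tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reduced = merge_local(tokens)
print(len(tokens), "->", len(reduced))
```

Raising the threshold keeps more tokens (favoring accuracy) and lowering it merges more aggressively (favoring throughput), which is one way to read the adaptivity the abstract mentions.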
-
We propose a novel concept of dual and integrated latent topologies (DITTO for short) for implicit 3D reconstruction from noisy and sparse point clouds. Most existing methods predominantly focus on a single latent type, such as point or grid latents. In contrast, the proposed DITTO leverages both point and grid latents (i.e., dual latent) to combine their strengths: the stability of grid latents and the detail-rich capability of point latents. Concretely, DITTO consists of a dual latent encoder and an integrated implicit decoder. In the dual latent encoder, a dual latent layer, which is the key module block composing the encoder, refines both latents in parallel, maintaining their distinct shapes and enabling recursive interaction. Notably, a newly proposed dynamic sparse point transformer within the dual latent layer effectively refines point latents. Then, the integrated implicit decoder systematically combines these refined latents, achieving high-fidelity 3D reconstruction and surpassing previous state-of-the-art methods on object- and scene-level datasets, especially in thin and detailed structures.
-
In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement RGB trackers. In practice, most existing RGB trackers learn a single set of parameters and use them across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs (each with modality-specific representations), the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a Unified Tracker with a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture. Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific counterparts, validating our effectiveness and practicality. The source code is publicly available at https://github.com/Zongwei97/UnTrack.
-
In the domain of video tracking, existing methods often grapple with a trade-off between spatial density and temporal range. Current dense optical flow estimators excel in providing spatially dense tracking but are limited to short temporal spans. Conversely, recent advancements in long-range trackers offer extended temporal coverage, but at the cost of spatial sparsity. This paper introduces FlowTrack, a novel framework designed to bridge this gap. FlowTrack combines the strengths of both paradigms by 1) chaining confident flow predictions to maximize efficiency and 2) automatically switching to an error compensation module in instances of flow prediction inaccuracies. This dual strategy not only offers efficient dense tracking over extended temporal spans but also ensures robustness against error accumulation and occlusions, common pitfalls of naive flow chaining. Furthermore, we demonstrate that chained flow itself can serve as an effective guide for an error compensation module, even for occluded points. Our framework achieves state-of-the-art accuracy for long-range tracking on the DAVIS dataset and yields a 50% speed-up when performing dense tracking.
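Chaining confident flow predictions can be sketched as follows. The abstract does not say how confidence is measured, so this example assumes a forward-backward consistency check, a common heuristic; it also assumes integer flows stored in dictionaries keyed by pixel coordinates, whereas a real system would sample dense flow fields with interpolation. The names `chain_flow` and `tol` are hypothetical.

```python
import math


def chain_flow(point, forward_flows, backward_flows, tol=1.0):
    """Track one point across frames by chaining frame-to-frame flows.

    forward_flows[t]  maps a pixel (x, y) in frame t   to its flow (dx, dy) into frame t+1.
    backward_flows[t] maps a pixel (x, y) in frame t+1 to its flow (dx, dy) back into frame t.
    Chaining stops at the first step whose forward-backward round trip misses the
    starting pixel by more than `tol`; this is where a method like the one in the
    abstract would hand over to an error compensation module.
    Returns (last confident position, number of confidently chained steps).
    """
    x, y = point
    for step, (fwd, bwd) in enumerate(zip(forward_flows, backward_flows)):
        dx, dy = fwd[(x, y)]
        nx, ny = x + dx, y + dy
        bdx, bdy = bwd[(nx, ny)]
        round_trip_error = math.hypot(nx + bdx - x, ny + bdy - y)
        if round_trip_error > tol:
            return (x, y), step
        x, y = nx, ny
    return (x, y), len(forward_flows)


forward_flows = [{(0, 0): (1, 0)}, {(1, 0): (1, 1)}]
consistent_backward = [{(1, 0): (-1, 0)}, {(2, 1): (-1, -1)}]
inconsistent_backward = [{(1, 0): (-1, 0)}, {(2, 1): (3, 3)}]

full_track = chain_flow((0, 0), forward_flows, consistent_backward)
cut_track = chain_flow((0, 0), forward_flows, inconsistent_backward)
print(full_track, cut_track)
```

The second call stops after one step because the round trip in the second frame pair lands far from its starting pixel, the kind of failure that accumulates silently under naive chaining.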
-
The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. To observe a subject's anatomy, expensive medical devices (MRI or CT) are required, and the creation of the digital model is often time-consuming and involves manual effort. Instead, we leverage the fact that the shape of the body surface is correlated with the internal anatomy; e.g., from surface observations alone, one can predict body composition and skeletal structure. In this work, we go further and learn to infer the 3D location of three important anatomic tissues: subcutaneous adipose tissue (fat), lean tissue (muscles and organs), and long bones. To learn to infer these tissues, we tackle several key challenges. We first create a dataset of human tissues by segmenting full-body MRI scans and registering the SMPL body mesh to the body surface. With this dataset, we train HIT (Human Implicit Tissues), an implicit function that, given a point inside a body, predicts its tissue class. HIT leverages the SMPL body model shape and pose parameters to canonicalize the medical data. Unlike SMPL, which is trained from upright 3D scans, MRI scans are acquired with subjects lying on a table, resulting in significant soft-tissue deformation. Consequently, HIT uses a learned volumetric deformation field that undoes these deformations. Since HIT is parameterized by SMPL, we can repose bodies or change the shape of subjects, and the internal structures deform appropriately. We perform extensive experiments to validate HIT's ability to predict a plausible internal structure for novel subjects. The dataset and HIT model are available at https://hit.is.tue.mpg.de to foster future research in this direction.
-
Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved, challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the first time combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community, covering 4 music genres. With this dataset, we uncover that dance camera movement is multifaceted and human-centric and possesses multiple influencing factors, making dance camera synthesis a more challenging task compared to camera or dance synthesis alone. To overcome these difficulties, we propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation, we devise new metrics measuring camera movement quality, diversity, and dancer fidelity. Utilizing these metrics, we conduct extensive experiments on our DCM dataset, providing both quantitative and qualitative evidence showcasing the effectiveness of our DanceCamera3D model. Code and video demos are available at https://github.com/Carmenw1203/DanceCamera3D-Official.
-
Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.
-
3D synthetic-to-real unsupervised domain adaptive segmentation is crucial to annotating new domains. Self-training is a competitive approach for this task, but its performance is limited by different sensor sampling patterns (i.e., variations in point density) and incomplete training strategies. In this work, we propose a density-guided translator (DGT), which translates point density between domains, and integrate it into a two-stage self-training pipeline named DGT-ST. First, in contrast to existing works that simultaneously conduct data generation and feature/output alignment within unstable adversarial training, we employ the non-learnable DGT to bridge the domain gap at the input level. Second, to provide a well-initialized model for self-training, we propose a category-level adversarial network in stage one that utilizes the prototype to prevent negative transfer. Finally, by leveraging the designs above, a domain-mixed self-training method with source-aware consistency loss is proposed in stage two to narrow the domain gap further. Experiments on two synthetic-to-real segmentation tasks (SynLiDAR → semanticKITTI and SynLiDAR → semanticPOSS) demonstrate that DGT-ST outperforms state-of-the-art methods, achieving 9.4% and 4.3% mIoU improvements, respectively. Code is available at https://github.com/yuan-zm/DGT-ST.
-
Recently there has been a surge in face personalization techniques benefiting from the advanced capabilities of pretrained text-to-image diffusion models. Among these a notable method is Textual Inversion which generates personalized images by inverting given images into textual embeddings. However methods based on Textual Inversion still struggle with balancing the trade-off between reconstruction quality and editability. In this study we examine this issue through the lens of initialization. Upon closely examining traditional initialization methods we identified a significant disparity between the initial and learned embeddings in terms of both scale and orientation. The scale of the learned embedding can be up to 100 times greater than that of the initial embedding. Such a significant change in the embedding could increase the risk of overfitting thereby compromising the editability. Driven by this observation we introduce a novel initialization method termed Cross Initialization that significantly narrows the gap between the initial and learned embeddings. This method not only improves both reconstruction and editability but also reduces the optimization steps from 5000 to 320. Furthermore we apply a regularization term to keep the learned embedding close to the initial embedding. We show that when combined with Cross Initialization this regularization term can effectively improve editability. We provide comprehensive empirical evidence to demonstrate the superior performance of our method compared to the baseline methods. Notably in our experiments Cross Initialization is the only method that successfully edits an individual's facial expression. Additionally a fast version of our method allows for capturing an input image in roughly 26 seconds while surpassing the baseline methods in terms of both reconstruction and editability. Code is available at https://github.com/lyuPang/CrossInitialization.
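The two quantities this abstract reasons about, the scale gap between initial and learned embeddings and a regularizer that keeps the learned embedding near its initialization, can be illustrated with a small sketch. The squared-L2 form of the penalty, the function names, and the toy vectors are assumptions for illustration only; the abstract does not specify the exact regularization term.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a plain Python list."""
    return math.sqrt(sum(x * x for x in vec))

def scale_ratio(learned, initial):
    """How many times larger the learned embedding is than the initial one."""
    return l2_norm(learned) / l2_norm(initial)

def proximity_penalty(learned, initial, weight):
    """Assumed squared-L2 penalty pulling the learned embedding toward the initial one."""
    return weight * sum((a - b) ** 2 for a, b in zip(learned, initial))

# Toy 2-D embeddings (hypothetical values) with the same orientation but a 100x scale gap.
initial = [0.03, 0.04]
learned = [3.0, 4.0]

ratio = scale_ratio(learned, initial)
penalty = proximity_penalty(learned, initial, weight=0.1)
zero_penalty = proximity_penalty(initial, initial, weight=0.1)
```

In this toy setting the ratio is about 100, the upper end of the disparity the abstract reports, and the penalty vanishes only when the learned embedding equals the initial one.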
-
Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However existing image-to-image methods are often inefficient imprecise and of limited versatility. They either require time-consuming fine-tuning deviate unnecessarily strongly from the input image and/or lack support for multiple simultaneous edits. To address these issues we introduce LEDITS++ an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second our methodology supports multiple simultaneous edits and is architecture-agnostic. Third we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods.
-
We present VIDIM a generative model for video interpolation which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data VIDIM uses cascaded diffusion models to first generate the target video at low resolution and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation and demonstrate how such works fail in most settings where the underlying motion is complex nonlinear or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the superresolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated requires less than a billion parameters per diffusion model to produce compelling results and still enjoys scalability and improved quality at larger parameter counts. Please see our project page at vidiminterpolation.github.io.
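For readers unfamiliar with classifier-free guidance, the following minimal sketch shows one common convention for combining conditional and unconditional noise predictions. It is not VIDIM's implementation: the function name, the toy prediction vectors, and the choice of convention are assumptions, and the abstract does not state which formulation or guidance weight is used.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_weight):
    """Extrapolate from the unconditional prediction toward the conditional one.

    guidance_weight = 0 returns the unconditional prediction,
    guidance_weight = 1 returns the conditional prediction,
    guidance_weight > 1 pushes further in the conditional direction.
    """
    return [u + guidance_weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions (hypothetical values); in video interpolation the
# conditioning signal would be the start and end frames.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, 2.0)
```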
-
We introduce WildlifeMapper (WM), a flexible model designed to detect, locate, and identify multiple species in aerial imagery. It addresses the limitations of traditional labor-intensive wildlife population assessments that are central to advancing environmental conservation efforts worldwide. While a number of methods exist to automate this process, they are often limited in their ability to generalize to different species or landscapes due to the dominance of homogeneous backgrounds and/or poorly captured local image structures. WM introduces two novel modules that help to capture the local structure and context of objects of interest to accurately localize and identify them, achieving a state-of-the-art (SOTA) detection rate of 0.56 mAP. Further, we introduce a large aerial imagery dataset with more than 11k images and 28k annotations verified by trained experts. WM also achieves SOTA performance on 3 other publicly available aerial survey datasets collected across 4 different countries, improving mAP by 42%. Source code and trained models are available on GitHub.
-
Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation
Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples for the person, where two aligned frames exhibit the same speech content yet differ in emotional expression, limiting SPFEM applications in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel adaptive spatial coherent correlation learning (ASCCL) algorithm, which models the aforementioned correlation as an explicit metric and integrates the metric to supervise facial expression manipulation while better preserving the facial animation of spoken content. To this end, it first learns a spatial coherent correlation metric, ensuring that the visual disparities of adjacent local regions of the image belonging to one emotion are similar to those of the corresponding counterpart of the image belonging to another emotion. Recognizing that visual disparities are not uniform across all regions, we have also crafted a disparity-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training, we construct the adaptive spatial coherent correlation metric between corresponding local regions of the input and output images as an additional loss to supervise the generation process. We conduct extensive experiments on various datasets, and the results demonstrate the effectiveness of the proposed ASCCL algorithm. Code is publicly available at https://github.com/jianmanlincjx/ASCCL.
-
Visual prompting of large vision language models such as CLIP exhibits intriguing zero-shot capabilities. A manually drawn red circle commonly used for highlighting can guide CLIP's attention to the surrounding region to identify specific objects within an image. Without precise object proposals however it is insufficient for localization. Our novel simple yet effective approach i.e. Differentiable Visual Prompting enables CLIP to zero-shot localize: given an image and a text prompt describing an object we first pick a rendered ellipse from uniformly distributed anchor ellipses on the image grid via visual prompting then use three loss functions to tune the ellipse coefficients to encapsulate the target region gradually. This yields promising experimental results for referring expression comprehension without precisely specified object proposals. In addition we systematically present the limitations of visual prompting inherent in CLIP and discuss potential solutions.
-
Each photo in an image burst can be considered a sample of a complex 3D scene: the product of parallax diffuse and specular materials scene motion and illuminant variation. While decomposing all of these effects from a stack of misaligned images is a highly ill-conditioned task the conventional align-and-merge burst pipeline takes the other extreme: blending them into a single image. In this work we propose a versatile intermediate representation: a two-layer alpha-composited image plus flow model constructed with neural spline fields -- networks trained to map input coordinates to spline control points. Our method is able to during test-time optimization jointly fuse a burst image capture into one high-resolution reconstruction and decompose it into transmission and obstruction layers. Then by discarding the obstruction layer we can perform a range of tasks including seeing through occlusions reflection suppression and shadow removal. Tested on complex in-the-wild captures we find that with no post-processing steps or learned priors our generalizable model is able to outperform existing dedicated single-image and multi-view obstruction removal approaches.
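The two-layer alpha-composited image model mentioned here can be illustrated with standard alpha compositing, where an obstruction layer is blended over a transmission layer. This is only a sketch of the compositing step on toy 1-D "images"; the function name and values are assumptions, and the neural spline fields and flow model from the abstract are not modelled.

```python
def composite(transmission, obstruction, alpha):
    """Standard per-pixel alpha compositing: obstruction over transmission."""
    return [a * o + (1.0 - a) * t
            for t, o, a in zip(transmission, obstruction, alpha)]

# Toy 3-pixel layers (hypothetical values): a dark occluder covering
# the scene with increasing opacity from left to right.
transmission = [0.8, 0.6, 0.4]
obstruction = [0.0, 0.0, 0.0]
alpha = [0.0, 0.5, 1.0]

observed = composite(transmission, obstruction, alpha)

# Discarding the obstruction layer corresponds to setting its opacity to zero,
# which leaves only the transmission layer.
clean = composite(transmission, obstruction, [0.0, 0.0, 0.0])
```

Once the two layers and the opacity are separated, tasks such as occlusion or reflection removal reduce to rendering the transmission layer alone, which is the idea the abstract describes.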
-
The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First most methods estimate the human in camera coordinates. Second prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third the most accurate methods rely on computationally expensive optimization pipelines limiting their use to offline applications. Finally existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion) which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code is available for research purposes at http://wham.is.tue.mpg.de/.
-
Recently, the emergence of naturalistic adversarial patches (NAPs), which possess a deceptive appearance and various representations, underscores the necessity of developing robust detection strategies. However, existing approaches fail to differentiate the deep-seated natures of adversarial patches, i.e., aggressiveness and naturalness, leading to unsatisfactory precision and generalization against NAPs. To tackle this issue, we propose NAPGuard to provide strong detection capability against NAPs via an elaborated critical feature modulation framework. To improve precision, we propose aggressive feature aligned learning to enhance the model's capability in capturing accurate aggressive patterns. Considering the challenge of inaccurate model learning caused by deceptive appearance, we align the aggressive features using the proposed pattern alignment loss during training. Since the model can learn more accurate aggressive patterns, it is able to detect deceptive patches more precisely. To enhance generalization, we design natural feature suppressed inference to universally mitigate the disturbance from different NAPs. Since various representations arise in diverse disturbing forms to hinder generalization, we suppress the natural features in a unified approach via the feature shield module. Therefore, the models can recognize NAPs with less disturbance and activate the generalized detection ability. Extensive experiments show that our method surpasses state-of-the-art methods by large margins in detecting NAPs (improving AP@0.5 by 60.24% on average).
-
Existing diffusion models for pose-guided human video generation mostly suffer from temporal inconsistency in the generated appearance and poses due to the inherent randomization nature of the generation process. In this paper we propose a novel framework DiffPerformer to synthesize high-fidelity and temporally consistent human video. Without complex architecture modification or costly training DiffPerformer finetunes a pretrained diffusion model on a single video of the target character and introduces an implicit video representation as a proxy to learn temporally consistent guidance for the diffusion model. The guidance is encoded into VAE latent space and an iterative optimization loop is constructed between the implicit video representation and the diffusion model allowing to harness the smooth property of the implicit video representation and the generative capabilities of the diffusion model in a mutually beneficial way. Moreover we propose 3D-aware human flow as a temporal constraint during the optimization to explicitly model the correspondence between driving poses and human appearance. This alleviates the misalignment between guided poses and target performer and therefore maintains the appearance coherence under various motions. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods.
-
This paper introduces Unified Language-driven Zero-shot Domain Adaptation (ULDA) a novel task setting that enables a single model to adapt to diverse target domains without explicit domain-ID knowledge. We identify the constraints in the existing language-driven zero-shot domain adaptation task particularly the requirement for domain IDs and domain-specific models which may restrict flexibility and scalability. To overcome these issues we propose a new framework for ULDA consisting of Hierarchical Context Alignment (HCA) Domain Consistent Representation Learning (DCRL) and Text-Driven Rectifier (TDR). These components work synergistically to align simulated features with target text across multiple visual levels retain semantic correlations between different regional representations and rectify biases between simulated and real target visual features respectively. Our extensive empirical evaluations demonstrate that this framework achieves competitive performance in both settings surpassing even the model that requires domain-ID showcasing its superiority and generalization ability. The proposed method is not only effective but also maintains practicality and efficiency as it does not introduce additional computational costs during inference. The code is available on the project website.
-
Shape assembly composes complex shape geometries by arranging simple part geometries and has wide applications in autonomous robotic assembly and CAD modeling. Existing works focus on geometry reasoning and neglect the actual physical assembly process of matching and fitting joints, which are the contact surfaces connecting different parts. In this paper, we consider contacting joints for the task of multi-part assembly. A successful joint-optimized assembly needs to satisfy the bilateral objectives of shape structure and joint alignment. We propose a hierarchical graph learning approach composed of two levels of graph representation learning. The part graph takes part geometries as input to build the desired shape structure. The joint-level graph uses part joint information and focuses on matching and aligning joints. The two kinds of information are combined to achieve the bilateral objectives. Extensive experiments demonstrate that our method outperforms previous methods, achieving better shape structure and higher joint alignment accuracy.
-
Multi-modality image fusion is a technique that combines information from different sensors or modalities enabling the fused image to retain complementary features from each modality such as functional highlights and texture details. However effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently we introduce a novel training paradigm that encompasses a fusion module a pseudo-sensing module and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.
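The equivariance prior referred to here says that transforming the inputs and then fusing should give the same result as fusing and then transforming. The sketch below checks that property for a toy case. The element-wise maximum is a stand-in fusion rule and the horizontal flip is an example transformation; both are assumptions for illustration and are not EMMA's learned modules.

```python
def fuse(image_a, image_b):
    """Toy fusion rule (assumed): keep the stronger response at each pixel."""
    return [max(a, b) for a, b in zip(image_a, image_b)]

def flip(image):
    """Example transformation: reverse a 1-D image."""
    return image[::-1]

# Toy 1-D "infrared" and "visible" images (hypothetical values).
infrared = [0.1, 0.9, 0.3]
visible = [0.5, 0.2, 0.4]

transform_then_fuse = fuse(flip(infrared), flip(visible))
fuse_then_transform = flip(fuse(infrared, visible))

# Equivariance holds when both orders agree.
is_equivariant = transform_then_fuse == fuse_then_transform
```

A self-supervised objective can penalize the gap between the two orders, which gives a training signal without ground-truth fused images.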
-
We present NeLF-Pro, a novel representation to model and reconstruct light fields in diverse natural scenes that vary in extent and spatial granularity. In contrast to previous fast reconstruction methods that represent the 3D scene globally, we model the light field of a scene as a set of local light field feature probes, parameterized with position and multi-channel 2D feature maps. Our central idea is to bake the scene's light field into spatially varying learnable representations and to query point features by weighted blending of probes close to the camera - allowing for mipmap representation and rendering. We introduce a novel vector-matrix-matrix (VMM) factorization technique that effectively represents the light field feature probes as products of core factors (i.e., VM) shared among local feature probes, and a basis factor (i.e., M) - efficiently encoding internal relationships and patterns within the scene. Experimentally, we demonstrate that NeLF-Pro significantly boosts the performance of feature grid-based representations and achieves fast reconstruction with better rendering quality while maintaining compact modeling. Project page: sinoyou.github.io/nelf-pro
-
We introduce One-shot Open Affordance Learning (OOAL) where a model is trained with just one example per base object category but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes they often struggle to understand finer levels of granularity such as affordances. To handle this issue we conduct a comprehensive analysis of existing foundation models to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data and exhibits reasonable generalization capability on unseen objects and affordances. Project page: https://reagan1311.github.io/ooal.
-
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes. Our method learns latent priors discretized as tokens by only performing computations at the visible locations of the image. This is realized by a restrictive partial encoder that predicts the token label for each visible block a bidirectional transformer that infers the missing labels by only looking at these tokens and a dedicated synthesis network that couples the tokens with the partial image priors to generate coherent and pluralistic complete image even under extreme mask settings. Experiments on public benchmarks validate our design choices as the proposed method outperforms strong baselines in both visual quality and diversity metrics.
-
We present a novel semantic segmentation approach for incremental nuclei segmentation from histopathological images which is a very challenging task as we have to incrementally optimize existing models to make them perform well in both old and new classes without using training samples of old classes. Yet it is an indispensable component of computer-aided diagnosis systems. The proposed approach has two key techniques. First we propose a new future-class awareness mechanism by separating some potential regions for future classes from background based on their similarities to both old and new classes in the representation space. With this mechanism we can not only reserve more parameter space for future updates but also enhance the representation capability of learned features. We further propose an innovative compatibility-inspired distillation scheme to make our model take full advantage of the knowledge learned by the old model. We conducted extensive experiments on two famous histopathological datasets and the results demonstrate the proposed approach achieves much better performance than state-of-the-art approaches. The code is available at https://github.com/why19991/InSeg.
-
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years. Although owning diverse and high-quality generation capabilities translating these abilities to fine-grained image editing remains challenging. In this paper we propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing: (1) in complex scenarios editing results often lack editing accuracy and exhibit unexpected artifacts; (2) lack of flexibility to harmonize editing operations e.g. imagine new content. In our solution we introduce image prompts in fine-grained image editing cooperating with the text prompt to better describe the editing content. To increase the flexibility while maintaining content consistency we locally combine stochastic differential equation (SDE) into the ordinary differential equation (ODE) sampling. In addition we incorporate regional score-based gradient guidance and a time travel strategy into the diffusion sampling further improving the editing quality. Extensive experiments demonstrate that our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks including editing within a single image (e.g. object moving resizing and content dragging) and across images (e.g. appearance replacing and object pasting). Our source code is released at https://github.com/MC-E/DragonDiffusion.
-
Solving image and video jigsaw puzzles poses the challenging task of rearranging image fragments or video frames from unordered sequences to restore meaningful images and video sequences. Existing approaches often hinge on discriminative models tasked with predicting either the absolute positions of puzzle elements or the permutation actions applied to the original data. Unfortunately these methods face limitations in effectively solving puzzles with a large number of elements. In this paper we propose JPDVT an innovative approach that harnesses diffusion transformers to address this challenge. Specifically we generate positional information for image patches or video frames conditioned on their underlying visual content. This information is then employed to accurately assemble the puzzle pieces in their correct positions even in scenarios involving missing pieces. Our method achieves state-of-the-art performance on several datasets.
-
Diffusion models have emerged as the de facto paradigm for video generation. However their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video InstructVideo requires only partial inference of the DDIM sampling chain reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences we repurpose established image reward models e.g. HPSv2. To this end we propose Segmental Video Reward a mechanism to provide reward signals based on segmental sparse sampling and Temporally Attenuated Reward a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments both qualitative and quantitative validate the practicality and efficacy of using image reward models in InstructVideo significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models can be accessed through our project page https://instructvideo.github.io/.
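The idea of scoring a video with an image reward model via segmental sparse sampling can be sketched as follows: split the frames into equal segments, take one frame per segment, and average the per-frame rewards. Picking the middle frame of each segment, the plain average, and the `frame_reward` callable are all assumptions for illustration; the abstract does not give the sampling rule, and the Temporally Attenuated Reward weighting is not modelled here.

```python
def segment_indices(num_frames, num_segments):
    """Index of the middle frame of each equal-length segment."""
    segment_length = num_frames / num_segments
    return [int(segment_length * (k + 0.5)) for k in range(num_segments)]

def segmental_reward(frames, num_segments, frame_reward):
    """Average an image-level reward over one sampled frame per segment."""
    indices = segment_indices(len(frames), num_segments)
    return sum(frame_reward(frames[i]) for i in indices) / len(indices)

# Toy "video": each frame is just its index; the hypothetical reward
# model returns the frame value as a float.
frames = list(range(16))
indices = segment_indices(len(frames), 4)
reward = segmental_reward(frames, 4, frame_reward=float)
```

Only four of the sixteen frames are scored in this example, which is the cost saving that sparse sampling is meant to provide.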
-
Model stealing (MS) involves querying and observing the output of a machine learning model to steal its capabilities. The quality of queried data is crucial yet obtaining a large amount of real data for MS is often challenging. Recent works have reduced reliance on real data by using generative models. However when high-dimensional query data is required these methods are impractical due to the high costs of querying and the risk of model collapse. In this work we propose using sample gradients (SG) to enhance the utility of each real sample as SG provides crucial guidance on the decision boundaries of the victim model. However utilizing SG in the model stealing scenario faces two challenges: 1. Pixel-level gradient estimation requires extensive query volume and is susceptible to defenses. 2. The estimation of sample gradients has a significant variance. This paper proposes Superpixel Sample Gradient stealing (SPSG) for model stealing under the constraint of limited real samples. With the basic idea of imitating the victim model's low-variance patch-level gradients instead of pixel-level gradients SPSG achieves efficient sample gradient estimation through two steps. First we perform patch-wise perturbations on query images to estimate the average gradient in different regions of the image. Then we filter the gradients through a threshold strategy to reduce variance. Exhaustive experiments demonstrate that with the same number of real samples SPSG achieves accuracy agreements and adversarial success rate significantly surpassing the current state-of-the-art MS methods. Codes are available at https://github.com/zyl123456aB/SPSG_attack.
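The two steps described above, patch-wise perturbation to estimate an average gradient per region followed by threshold filtering, can be sketched with finite differences against a black-box query function. This is a simplified illustration, not the SPSG implementation: the linear toy victim, fixed contiguous patches instead of superpixels, the step size, and the magnitude threshold are all assumptions.

```python
def estimate_patch_gradients(query_fn, image, patches, eps=1e-3):
    """Finite-difference estimate of the average gradient inside each patch.

    Uses one query for the clean image plus one query per patch,
    instead of one query per pixel.
    """
    base = query_fn(image)
    gradients = []
    for patch in patches:
        perturbed = list(image)
        for index in patch:
            perturbed[index] += eps  # perturb every pixel in the patch together
        gradients.append((query_fn(perturbed) - base) / (eps * len(patch)))
    return gradients

def threshold_filter(gradients, tau):
    """Assumed filtering rule: zero out estimates with magnitude below tau."""
    return [g if abs(g) >= tau else 0.0 for g in gradients]

# Hypothetical black-box victim: a linear score over a flattened 6-pixel image.
victim_weights = [0.5, 0.5, 0.01, -0.01, -2.0, -2.0]

def toy_victim(image):
    return sum(w * x for w, x in zip(victim_weights, image))

image = [0.1] * 6
patches = [[0, 1], [2, 3], [4, 5]]

patch_gradients = estimate_patch_gradients(toy_victim, image, patches)
filtered = threshold_filter(patch_gradients, tau=0.1)
```

For this linear victim the per-patch estimates recover the mean weight in each patch (about 0.5, 0.0, and -2.0), and the filter drops the near-zero middle patch.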
-
Deep unfolding networks (DUN) have emerged as a popular iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction. However conventional DUN aims to reconstruct all the missing information within the entire space in each iteration. Thus it could be challenging when dealing with highly ill-posed degradation often resulting in subpar reconstruction. In this work we propose a Progressive Divide-And-Conquer (PDAC) strategy aiming to break down the subsampling process in the actual severe degradation and thus perform reconstruction sequentially. Starting from decomposing the original maximum-a-posteriori problem of accelerated MRI we present a rigorous derivation of the proposed PDAC framework which could be further unfolded into an end-to-end trainable network. Each PDAC iteration specifically targets a distinct segment of moderate degradation based on the decomposition. Furthermore as part of the PDAC iteration such decomposition is adaptively learned as an auxiliary task through a degradation predictor which provides an estimation of the decomposed sampling mask. Following this prediction the sampling mask is further integrated via a severity conditioning module to ensure awareness of the degradation severity at each stage. Extensive experiments demonstrate that our proposed method achieves superior performance on the publicly available fastMRI and Stanford2D FSE datasets in both multi-coil and single-coil settings.
-
In Multiple Object Tracking (MOT), objects often exhibit non-linear motion of acceleration and deceleration, with irregular direction changes. Tracking-by-detection (TBD) trackers with Kalman Filter motion prediction work well in pedestrian-dominant scenarios but fall short in complex situations when multiple objects perform non-linear and diverse motion simultaneously. To tackle the complex non-linear motion, we propose a real-time diffusion-based MOT approach named DiffMOT. Specifically, for the motion predictor component, we propose a novel Decoupled Diffusion-based Motion Predictor (D^2MP). It models the entire distribution of various motion presented by the data as a whole. It also predicts an individual object's motion conditioned on that object's historical motion information. Furthermore, it optimizes the diffusion process with far fewer sampling steps. As an MOT tracker, DiffMOT is real-time at 22.7 FPS and also outperforms the state-of-the-art on the DanceTrack and SportsMOT datasets with 62.3% and 76.2% in HOTA metrics, respectively. To the best of our knowledge, DiffMOT is the first to introduce a diffusion probabilistic model into MOT to tackle non-linear motion prediction.
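To see why linear motion prediction struggles with accelerating objects, consider the prediction step of a constant-velocity model, a common state model in Kalman-filter trackers. The sketch below is a simplification under that assumption: it omits the filter's covariance and update steps, and the trajectory is a made-up example rather than data from any benchmark.

```python
def constant_velocity_predict(previous, current):
    """Predict the next position by repeating the last observed displacement."""
    return [c + (c - p) for p, c in zip(previous, current)]

# Hypothetical object accelerating along x: x_t = t^2, y_t = 0.
track = [(float(t * t), 0.0) for t in range(4)]  # (0,0), (1,0), (4,0), (9,0)

predicted = constant_velocity_predict(track[1], track[2])
error = abs(track[3][0] - predicted[0])
```

The model predicts x = 7 while the object actually reaches x = 9, so the prediction lags by 2 units after a single step, and the gap grows as the acceleration continues.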
-
State-of-the-art video-text retrieval (VTR) methods typically involve fully fine-tuning a pre-trained model (e.g., CLIP) on specific datasets. However, this can result in significant storage costs in practical applications, as a separate model per task must be stored. To address this issue, we present our pioneering work that enables parameter-efficient VTR using a pre-trained model with only a small number of tunable parameters during training. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter utilizes bottleneck structures in both video and text branches, along with two novel components. The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts. We also train weight calibrations to adjust to dynamic variations across frames. The second is Cross Modality Tying, which generates weights for video/text branches through sharing cross-modality factors for better alignment between modalities. Thanks to the above innovations, MV-Adapter can achieve comparable or better performance than standard fine-tuning with negligible parameter overhead. Notably, MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks with large margins on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet). Codes will be released.
-
Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end we propose an innovative framework for multi-view representation learning which incorporates a technique we term 'distilled disentangling'. Our method introduces the concept of masked cross-view prediction enabling the extraction of compact high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations.
-
Video transformers have become the de facto standard for human action recognition yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL) where RGB alone is not sufficient to distinguish between visually similar actions or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL we hypothesize that the augmentation of RGB with human pose information known for its sensitivity to fine-grained motion and multiple viewpoints is essential. Consequently we introduce the first Pose Induced Video Transformer: PI-ViT (or π-ViT) a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of π-ViT are two plug-in modules 2D Skeleton Induction Module and 3D Skeleton Induction Module that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks a design choice that allows π-ViT to discard the modules during inference. Notably π-ViT achieves the state-of-the-art performance on three prominent ADL datasets encompassing both real-world and large-scale RGB-D datasets without requiring poses or additional computational overhead at inference.
-
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification
Multiple instance learning (MIL)-based frameworks have become the mainstream for processing the whole slide image (WSI) with giga-pixel size and hierarchical image context in digital pathology. However these methods heavily depend on a substantial number of bag-level labels and solely learn from the original slides which are easily affected by variations in data distribution. Recently vision language model (VLM)-based methods introduced the language prior by pre-training on large-scale pathological image-text pairs. However the previous text prompt lacks the consideration of pathological prior knowledge and therefore does not substantially boost the model's performance. Moreover the collection of such pairs and the pre-training process are very time-consuming and resource-intensive. To solve the above problems we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically we propose a dual-scale visual descriptive text prompt based on the frozen large language model (LLM) to boost the performance of VLM effectively. To transfer the VLM to process WSI efficiently for the image branch we propose a prototype-guided patch decoder to aggregate the patch features progressively by grouping similar patches into the same prototype; for the text branch we introduce a context-guided text decoder to enhance the text features by incorporating the multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
-
Open-world Semi-Supervised Learning aims to classify unlabeled samples utilizing information from labeled data while unlabeled samples are not only from the labeled known categories but also from novel categories previously unseen. Despite the promise current approaches solely rely on hazardous similarity-based clustering algorithms and give unlabeled samples free rein to spontaneously group into distinct novel class clusters. Nevertheless due to the absence of novel class supervision these methods typically suffer from the representation collapse dilemma---features of different novel categories can get closely intertwined and indistinguishable even collapsing into the same cluster and leading to degraded performance. To alleviate this we propose a novel framework TRAILER which targets to attain an optimal feature arrangement revealed by the recently uncovered neural collapse phenomenon. To fulfill this we adopt targeted prototypes that are pre-assigned uniformly with maximum separation and then progressively align the representations to them. To further tackle the potential downsides of such stringent alignment we encapsulate a sample-target allocation mechanism with coarse-to-fine refinery that is able to infer label assignments with high quality. Extensive experiments demonstrate that TRAILER outperforms current state-of-the-art methods on generic and fine-grained benchmarks. The code is available at https://github.com/Justherozen/TRAILER.
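
The idea of prototypes that are "pre-assigned uniformly with maximum separation" can be illustrated with a simplex equiangular tight frame, the geometry usually associated with neural collapse. This is a minimal sketch and not the TRAILER implementation: the abstract does not say how the targets are constructed, and the function names `simplex_etf` and `cosine` are hypothetical.

```python
import math


def simplex_etf(num_classes):
    """Return num_classes unit vectors with equal pairwise angles.

    Assumption: targets form a simplex equiangular tight frame,
    built as sqrt(K / (K - 1)) * (I - (1 / K) * ones), one row per class.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Four fixed class targets; every pair has cosine similarity -1 / (K - 1).
prototypes = simplex_etf(4)
pairwise = cosine(prototypes[0], prototypes[1])
```

With K classes, every pair of targets has cosine similarity -1/(K-1), the most negative value that K unit vectors can share pairwise, which is the sense in which such targets are maximally separated.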
-
We revisit certain problems of pose estimation based on 3D-2D correspondences between features which may be points or lines. Specifically we address the two previously-studied minimal problems of estimating camera extrinsics from $p \in \{1, 2\}$ point-point correspondences and $l = 3 - p$ line-line correspondences. To the best of our knowledge all of the previously-known practical solutions to these problems required computing the roots of degree $\ge 4$ (univariate) polynomials when $p = 2$ or degree $\ge 8$ polynomials when $p = 1$. We describe and implement two elementary solutions which reduce the degrees of the needed polynomials from 4 to 2 and from 8 to 4 respectively. We show experimentally that the resulting solvers are numerically stable and fast: when compared to the previous state of the art we may obtain nearly an order of magnitude speedup. The code is available at https://github.com/petrhruby97/efficient_absolute
-
Automatic text-to-3D generation that combines Score Distillation Sampling (SDS) with the optimization of volume rendering has achieved remarkable progress in synthesizing realistic 3D objects. Yet most existing text-to-3D methods by SDS and volume rendering suffer from inaccurate geometry e.g. the Janus issue since it is hard to explicitly integrate 3D priors into implicit 3D representations. Besides it is usually time-consuming for them to generate elaborate 3D models with rich colors. In response this paper proposes GSGEN a novel method that adopts Gaussian Splatting a recent state-of-the-art representation to text-to-3D generation. GSGEN aims at generating high-quality 3D objects and addressing existing shortcomings by exploiting the explicit nature of Gaussian Splatting that enables the incorporation of 3D prior. Specifically our method adopts a progressive optimization strategy which includes a geometry optimization stage and an appearance refinement stage. In geometry optimization a coarse representation is established under 3D point cloud diffusion prior along with the ordinary 2D SDS optimization ensuring a sensible and 3D-consistent rough shape. Subsequently the obtained Gaussians undergo an iterative appearance refinement to enrich texture details. In this stage we increase the number of Gaussians by compactness-based densification to enhance continuity and improve fidelity. With these designs our approach can generate 3D assets with delicate details and accurate geometry. Extensive evaluations demonstrate the effectiveness of our method especially for capturing high-frequency components.
-
Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions which have been largely obscured by their initial benchmark success. Upon closer examination we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data we propose CapsFusion an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g. 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps) sample efficiency (requiring 11-16 times less computation than baselines) world knowledge depth and scalability. These effectiveness efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training.
-
Frechet Video Distance (FVD) a prominent metric for evaluating video generation models is known to occasionally conflict with human perception. In this paper we aim to explore the extent of FVD's bias toward frame quality over temporal realism and identify its sources. We first quantify the FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that the FVD only increases slightly with larger temporal corruption. We then analyze the generated videos and show that via careful sampling from a large set of generated videos that do not contain motions one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias towards the quality of individual frames. We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality. Finally we revisit a few real-world examples to validate our hypothesis.
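
For context, FVD is a Frechet distance between Gaussians fitted to feature embeddings of real and generated videos: d^2 = ||mu_1 - mu_2||^2 + Tr(S_1 + S_2 - 2 (S_1 S_2)^{1/2}). The sketch below is a simplified illustration, assuming diagonal covariances so that the matrix square root reduces to an elementwise one. It is not the full metric, and the function names are hypothetical.

```python
import math


def feature_stats(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(dim)]
    var = [
        sum((f[j] - mean[j]) ** 2 for f in features) / (n - 1)
        for j in range(dim)
    ]
    return mean, var


def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Squared Frechet distance between two Gaussians with diagonal covariance.

    Assumption: covariances are diagonal, so the trace term becomes
    sum(var_a + var_b - 2 * sqrt(var_a * var_b)) over dimensions.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum(
        va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term
```

The distance depends only on the first two moments of whatever features are supplied, so any bias in the feature extractor toward per-frame appearance carries over directly to the score.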
-
The recent advance of deep learning technology brings the possibility of assisting the pathologist to predict the patients' survival from whole-slide pathological images (WSIs). However most of the prevalent methods only worked on the sampled patches in specifically or randomly selected tumor areas of WSIs which have very limited capability to capture the complex interactions between tumor and its surrounding micro-environment components. As a matter of fact tumor is supported and nurtured in the heterogeneous tumor micro-environment (TME) and the detailed analysis of TME and its correlation with tumors is important for in-depth analysis of the mechanism of cancer development. In this paper we considered the spatial interactions among tumor and its two major TME components (i.e. lymphocytes and stromal fibrosis) and presented a Tumor Micro-environment Interactions Guided Graph Learning (TMEGL) algorithm for the prognosis prediction of human cancers. Specifically we firstly selected different types of patches as nodes to build a graph for each WSI. Then a novel TME neighborhood organization guided graph embedding algorithm was proposed to learn node representations that can preserve their topological structure information. Finally a Gated Graph Attention Network is applied to capture the survival-associated intersections among tumor and different TME components for clinical outcome prediction. We tested TMEGL on three cancer cohorts derived from The Cancer Genome Atlas (TCGA) and the experimental results indicated that TMEGL not only outperforms the existing WSI-based survival analysis models but also has good explainability for survival prediction.
-
Multi-Object Tracking (MOT) encompasses various tracking scenarios each characterized by unique traits. Effective trackers should demonstrate a high degree of generalizability across diverse scenarios. However existing trackers struggle to accommodate all aspects or necessitate hypothesis and experimentation to customize the association information (motion and/or appearance) for a given scenario leading to narrowly tailored solutions with limited generalizability. In this paper we investigate the factors that influence trackers' generalization to different scenarios and concretize them into a set of tracking scenario attributes to guide the design of more generalizable trackers. Furthermore we propose a "point-wise to instance-wise relation" framework for MOT i.e. GeneralTrack which can generalize across diverse scenarios while eliminating the need to balance motion and appearance. Thanks to its superior generalizability our proposed GeneralTrack achieves state-of-the-art performance on multiple benchmarks and demonstrates the potential for domain generalization.
-
Generating dances that are both lifelike and well-aligned with music continues to be a challenging task in the cross-modal domain. This paper introduces PopDanceSet the first dataset tailored to the preferences of young audiences enabling the generation of aesthetically oriented dances. It surpasses the AIST++ dataset in music genre diversity and in the intricacy and depth of dance movements. Moreover the proposed POPDG model within the iDDPM framework enhances dance diversity and through the Space Augmentation Algorithm strengthens spatial physical connections between human body joints ensuring that increased diversity does not compromise generation quality. A streamlined Alignment Module is also designed to improve the temporal alignment between dance and music. Extensive experiments show that POPDG achieves SOTA results on two datasets. Furthermore the paper expands on current evaluation metrics. The dataset and code are available at https://github.com/Luke-Luo1/POPDG.
-
Diffusion models have shown an impressive ability to model complex data distributions with several key advantages over GANs such as stable training better coverage of the training distribution's modes and the ability to solve inverse problems without extra training. However most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields which can be rendered at any resolution and show its advantages over fixed-resolution models. To achieve this a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets outperform fixed-resolution diffusion models followed by super-resolution models and can solve inverse problems with conditions applied at different scales efficiently.
-
Despite advancements in text-to-image generation (T2I) prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets including both in-distribution and out-of-distribution scenarios demonstrate our method's superior generation performance. Meanwhile it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models. The code is available at https://dpt-t2i.github.io/.
-
We introduce multi-slice reasoning a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is a more direct and hence more advantageous means to reveal occluded structures than altering camera views. Specifically slicing can peel through any occluder without obstruction and in the limit (i.e. with infinitely many slices) it is guaranteed to unveil all hidden object parts. We realize our idea by developing Slice3D a novel method for single-view 3D reconstruction which first predicts multi-slice images from a single RGB input image and then integrates the slices into a 3D model using a coordinate-based transformer network to produce a signed distance function. The slice images can be regressed or generated both through a U-Net based network. For the former we inject a learnable slice indicator code to designate each decoded image into a spatial slice location while the slice generator is a denoising diffusion model operating on the entirety of slice images stacked on the input channels. We conduct extensive evaluation against state-of-the-art alternatives to demonstrate the superiority of our method especially in recovering complex and severely occluded shape structures amid ambiguities. All Slice3D results were produced by networks trained on a single Nvidia A40 GPU with an inference time of less than 20 seconds.
-
A diffusion model which is formulated to produce an image using thousands of denoising steps usually suffers from a slow inference speed. Existing acceleration algorithms simplify the sampling by skipping most steps yet exhibit considerable performance degradation. By viewing the generation of diffusion models as a discretized integral process we argue that the quality drop is partly caused by applying an inaccurate integral direction to a timestep interval. To rectify this issue we propose a timestep tuner that helps find a more accurate integral direction for a particular interval at the minimum cost. Specifically at each denoising step we replace the original parameterization by conditioning the network on a new timestep enforcing the sampling distribution towards the real one. Extensive experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods especially when there are few denoising steps. For example when using 10 denoising steps on the LSUN Bedroom dataset we improve the FID of DDIM from 9.65 to 6.07 simply by adopting our method for a more appropriate set of timesteps. Code is available at https://github.com/THU-LYJ-Lab/time-tuner.
-
Generalizable face anti-spoofing (FAS) approaches have drawn growing attention due to their robustness for diverse presentation attacks in unseen scenarios. Most previous methods utilize domain generalization (DG) frameworks via directly aligning diverse source samples into a common feature space. However these methods neglect the hierarchical relations in FAS samples which may hinder the generalization ability by direct alignment. To address these issues we propose a novel Hierarchical Prototype-guided Distribution Refinement (HPDR) framework to learn embedding in hyperbolic space which facilitates the hierarchical relation construction. We also collaborate with prototype learning for hierarchical distribution refinement in hyperbolic space. In detail we propose the Hierarchical Prototype Learning to simultaneously guide domain alignment and improve the discriminative ability via constraining the multi-level relations between prototypes and instances in hyperbolic space. Moreover we design a Prototype-oriented Classifier which further considers relations between the sample and prototypes to improve the robustness of the final decision. Extensive experiments and visualizations demonstrate the effectiveness of our method against previous competitors.
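
To make "embedding in hyperbolic space" concrete, the sketch below computes the geodesic distance in the Poincare ball, one common model of hyperbolic space. The abstract does not state which model or curvature HPDR uses, so the model choice (unit ball, curvature -1) and the function name `poincare_distance` are assumptions for illustration only.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumption: curvature -1. Both points must have Euclidean norm < 1.
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean gap of 0.09 is much longer near the boundary than at the origin.
near_origin = poincare_distance([0.0, 0.0], [0.09, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.99, 0.0])
```

Because distances grow rapidly toward the boundary, tree-like structures fit with low distortion: general concepts sit near the origin and specific ones near the edge. This is the usual motivation for using hyperbolic embeddings with hierarchical data.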
-
Deep learning-based image registration (DLIR) methods have achieved remarkable success in deformable image registration. We observe that iterative inference can exploit the well-trained registration network to the fullest extent. In this work we propose a novel Iterative Inference Residual Pyramid Network (IIRP-Net) to enhance registration performance without any additional training costs. In IIRP-Net we construct a streamlined pyramid registration network consisting of a feature extractor and residual flow estimators (RP-Net) to achieve generalized capabilities in feature extraction and registration. Then in the inference phase IIRP-Net employs an iterative inference strategy to enhance RP-Net by iteratively reutilizing residual flow estimators from coarse to fine. The number of iterations is adaptively determined by the proposed IterStop mechanism. We conduct extensive experiments on the FLARE and Mindboggle datasets and the results verify the effectiveness of the proposed method outperforming state-of-the-art deformable image registration methods. Our code is available at https://github.com/Torbjorn1997/IIRP-Net.
-
Large-scale high-resolution (HR) land-cover mapping is a vital task to survey the Earth's surface and resolve many challenges facing humanity. However it is still a non-trivial task hindered by complex ground details various landforms and the scarcity of accurate training labels over a wide-span geographic area. In this paper we propose an efficient weakly supervised framework (Paraformer) to guide large-scale HR land-cover mapping with easy-access historical land-cover data of low resolution (LR). Specifically existing land-cover mapping approaches reveal the dominance of CNNs in preserving local ground details but still suffer from insufficient global modeling in various landforms. Therefore we design a parallel CNN-Transformer feature extractor in Paraformer consisting of a downsampling-free CNN branch and a Transformer branch to jointly capture local and global contextual information. Besides facing the spatial mismatch of training data a pseudo-label-assisted training (PLAT) module is adopted to reasonably refine LR labels for weakly supervised semantic segmentation of HR images. Experiments on two large-scale datasets demonstrate the superiority of Paraformer over other state-of-the-art methods for automatically updating HR land-cover maps from LR historical labels.
-
Video Frame Interpolation (VFI) which aims at generating high-frame-rate videos from low-frame-rate inputs is a highly challenging task. The emergence of bio-inspired sensors known as event cameras which boast microsecond-level temporal resolution has ushered in a transformative era for VFI. Nonetheless the application of event-based VFI techniques in domains with distinct environments from the training data can be problematic. This is mainly because event camera data distribution can undergo substantial variations based on camera settings and scene conditions presenting challenges for effective adaptation. In this paper we propose a test-time adaptation method for event-based VFI to address the gap between the source and target domains. Our approach enables sequential learning in an online manner on the target domain which only provides low-frame-rate videos. We present an approach that leverages confident pixels as pseudo ground-truths enabling stable and accurate online learning from low-frame-rate videos. Furthermore to prevent overfitting during the continuous online process where the same scene is encountered repeatedly we propose a method of blending historical samples with current scenes. Extensive experiments validate the effectiveness of our method both in cross-domain and continuous domain shifting setups. The code is available at https://github.com/Chohoonhee/TTA-EVF.
-
Open-vocabulary semantic segmentation (OVS) aims to segment images of arbitrary categories specified by class labels or captions. However most previous best-performing methods whether pixel grouping methods or region recognition methods suffer from false matches between image features and category labels. We attribute this to the natural gap between the textual features and visual features. In this work we rethink how to mitigate false matches from the perspective of image-to-image matching and propose a novel relation-aware intra-modal matching (RIM) framework for OVS based on visual foundation models. RIM achieves robust region classification by firstly constructing diverse image-modal reference features and then matching them with region features based on relation-aware ranking distribution. The proposed RIM enjoys several merits. First the intra-modal reference features are better aligned circumventing potential ambiguities that may arise in cross-modal matching. Second the ranking-based matching process harnesses the structure information implicit in the inter-class relationships making it more robust than comparing individually. Extensive experiments on three benchmarks demonstrate that RIM outperforms previous state-of-the-art methods by large margins obtaining a lead of more than 10% in mIoU on the PASCAL VOC benchmark.
-
Gait recognition stands as one of the most pivotal remote identification technologies and progressively expands across research and industry communities. However existing gait recognition methods heavily rely on task-specific upstream driven by supervised learning to provide explicit gait representations like silhouette sequences which inevitably introduce expensive annotation costs and potential error accumulation. Escaping from this trend this work explores effective gait representations based on the all-purpose knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a simple yet efficient gait framework termed BigGait. Specifically the Gait Representation Extractor (GRE) within BigGait draws upon design principles from established gait representations effectively transforming all-purpose knowledge into implicit gait representations without requiring third-party supervision signals. Experiments on CCPG CASIA-B* and SUSTech1K indicate that BigGait significantly outperforms the previous methods in both within-domain and cross-domain tasks in most cases and provides a more practical paradigm for learning the next-generation gait representation. Finally we delve into prospective challenges and promising directions in LVMs-based gait recognition aiming to inspire future work in this emerging topic. The source code is available at https://github.com/ShiqiYu/OpenGait.
-
Recently the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of the existing dense BEV-based 3D object detectors by introducing our proposed enhanced components including a CRF-modulated depth estimation module enforcing object-level consistencies a long-term temporal aggregation module with extended receptive fields and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a "modernized" dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark BEVNeXt outperforms both BEV-based and query-based frameworks under various settings achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set.
-
Misinformation is a prevalent societal issue due to its potential high risks. Out-Of-Context (OOC) misinformation where authentic images are repurposed with false text is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments which are essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and innate capability for visual reasoning and explanation generation they still lack sophistication in understanding and discovering the subtle cross-modal differences. In this paper we introduce Sniffer a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. Sniffer employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities and the second stage leverages OOC-specific instruction data generated by language-only GPT-4 to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval Sniffer not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that Sniffer surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. Sniffer also provides accurate and persuasive explanations as validated by quantitative and human evaluations.
-
Learning from seen attribute-object pairs to generalize to unseen compositions has been studied extensively in Compositional Zero-Shot Learning (CZSL). However CZSL setup is still limited to seen attributes and objects and cannot generalize to unseen concepts and their compositions. To overcome this limitation we propose a new task Open Vocabulary-Compositional Zero-shot Learning (OV-CZSL) where unseen attributes objects and unseen compositions are evaluated. To show that OV-CZSL is a challenging yet solvable problem we propose three new benchmarks based on existing datasets MIT-States C-GQA and VAW-CZSL along with new baselines and evaluation setup. We use language embeddings and external vocabulary with our novel neighborhood expansion loss to allow any method to learn semantic correlations between seen and unseen primitives.
-
Semantic scene completion (SSC) aims to predict complete 3D voxel occupancy and semantics from a single-view RGB-D image and recent SSC methods commonly adopt multi-modal inputs. However our investigation reveals two limitations: ineffective feature learning from single modalities and overfitting to limited datasets. To address these issues this paper proposes a novel SSC framework - Adversarial Modality Modulation Network (AMMNet) - with a fresh perspective of optimizing gradient updates. The proposed AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities and a customized adversarial training scheme leveraging dynamic gradient competition. Specifically the cross-modal modulation adaptively re-calibrates the features to better excite representation potentials from each single modality. The adversarial training employs a minimax game of evolving gradients with customized guidance to strengthen the generator's perception of visual fidelity from both geometric completeness and semantic correctness. Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin providing a promising direction for improving the effectiveness and generalization of SSC methods.
-
We address the challenging task of identifying segmenting and tracking hand-held objects which is crucial for applications such as human action segmentation and performance evaluation. This task is particularly challenging due to heavy occlusion rapid motion and the transitory nature of objects being hand-held where an object may be held released and subsequently picked up again. To tackle these challenges we have developed a novel transformer-based architecture called HOIST-Former. HOIST-Former is adept at spatially and temporally segmenting hands and objects by iteratively pooling features from each other ensuring that the processes of identification segmentation and tracking of hand-held objects depend on the hands' positions and their contextual appearance. We further refine HOIST-Former with a contact loss that focuses on areas where hands are in contact with objects. Moreover we also contribute an in-the-wild video dataset called HOIST which comprises 4125 videos complete with bounding boxes segmentation masks and tracking IDs for hand-held objects. Through experiments on the HOIST dataset and two additional public datasets we demonstrate the efficacy of HOIST-Former in segmenting and tracking hand-held objects.
-
Despite great improvements in semantic segmentation challenges persist because of the lack of local/global contexts and the relationship between them. In this paper we propose Contextrast a contrastive learning-based semantic segmentation method that captures local/global contexts and comprehends their relationships. Our proposed method comprises two parts: a) contextual contrastive learning (CCL) and b) boundary-aware negative (BANE) sampling. Contextual contrastive learning obtains local/global context from multi-scale feature aggregation and inter/intra-relationship of features for better discrimination capabilities. Meanwhile BANE sampling selects embedding features along the boundaries of incorrectly predicted regions and employs them as harder negative samples in our contrastive learning resolving segmentation issues along the boundary region by exploiting fine-grained details. We demonstrate that our Contextrast substantially enhances the performance of semantic segmentation networks outperforming state-of-the-art contrastive learning approaches on diverse public datasets e.g. Cityscapes CamVid PASCAL-C COCO-Stuff and ADE20K without an increase in computational cost during inference.
-
Monocular 3D detection is a challenging task due to the lack of accurate 3D information. Existing approaches typically rely on geometry constraints and dense depth estimates to facilitate the learning but often fail to fully exploit the benefits of three-dimensional feature extraction in frustum and 3D space. In this paper we propose OccupancyM3D a method of learning occupancy for monocular 3D detection. It directly learns occupancy in frustum and 3D space leading to more discriminative and informative 3D features and representations. Specifically by using synchronized raw sparse LiDAR point clouds we define the space status and generate voxel-based occupancy labels. We formulate occupancy prediction as a simple classification problem and design associated occupancy losses. Resulting occupancy estimates are employed to enhance original frustum/3D features. As a result experiments on KITTI and Waymo open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin.
-
This paper introduces a novel approach for high-quality deepfake detection called Localized Artifact Attention Network (LAA-Net). Existing methods for high-quality deepfake detection are mainly based on a supervised binary classifier coupled with an implicit attention mechanism. As a result they do not generalize well to unseen manipulations. To handle this issue two main contributions are made. First an explicit attention mechanism within a multi-task learning framework is proposed. By combining heatmap-based and self-consistency attention strategies LAA-Net is forced to focus on a few small artifact-prone vulnerable regions. Second an Enhanced Feature Pyramid Network (E-FPN) is proposed as a simple and effective mechanism for spreading discriminative low-level features into the final feature output with the advantage of limiting redundancy. Experiments performed on several benchmarks show the superiority of our approach in terms of Area Under the Curve (AUC) and Average Precision (AP). The code is available at https://github.com/10Ring/LAA-Net.
-
Universal Domain Adaptation (UniDA) targets knowledge transfer in the presence of both covariate and label shifts. Recently Source-free Universal Domain Adaptation (SF-UniDA) has emerged to achieve UniDA without access to source data which tends to be more practical due to data protection policies. The main challenge lies in determining whether covariate-shifted samples belong to target-private unknown categories. Existing methods tackle this either through hand-crafted thresholding or by developing time-consuming iterative clustering strategies. In this paper we propose a new idea of LEArning Decomposition (LEAD) which decouples features into source-known and -unknown components to identify target-private data. Technically LEAD initially leverages the orthogonal decomposition analysis for feature decomposition. Then LEAD builds instance-level decision boundaries to adaptively identify target-private data. Extensive experiments across various UniDA scenarios have demonstrated the effectiveness and superiority of LEAD. Notably in the OPDA scenario on the VisDA dataset LEAD outperforms GLC by 3.5% overall H-score and reduces the time to derive pseudo-labeling decision boundaries by 75%. Besides LEAD is also appealing in that it is complementary to most existing methods. The code is available at https://github.com/ispc-lab/LEAD
-
Facial action unit (AU) intensity plays a pivotal role in quantifying fine-grained expression behaviors which is an effective condition for facial expression manipulation. However publicly available datasets containing intensity annotations for multiple AUs remain severely limited often featuring a restricted number of subjects. This limitation places challenges to the AU intensity manipulation in images due to disentanglement issues leading researchers to resort to other large datasets with pretrained AU intensity estimators for pseudo labels. In addressing this constraint and fully leveraging manual annotations of AU intensities for precise manipulation we introduce AUEditNet. Our proposed model achieves impressive intensity manipulation across 12 AUs trained effectively with only 18 subjects. Utilizing a dual-branch architecture our approach achieves comprehensive disentanglement of facial attributes and identity without necessitating additional loss functions or implementing with large batch sizes. This approach offers a potential solution to achieve desired facial attribute editing despite the dataset's limited subject count. Our experiments demonstrate AUEditNet's superior accuracy in editing AU intensities affirming its capability in disentangling facial attributes and identity within a limited subject pool. AUEditNet allows conditioning by either intensity values or target images eliminating the need for constructing AU combinations for specific facial expression synthesis. Moreover AU intensity estimation as a downstream task validates the consistency between real and edited images confirming the effectiveness of our proposed AU intensity manipulation method.
-
Accurately predicting the 3D human posture and the pressure exerted on the body for people resting in bed visualized as a body mesh (3D pose & shape) with a 3D pressure map holds significant promise for healthcare applications particularly in the prevention of pressure ulcers. Current methods focus on singular facets of the problem---predicting only 2D/3D poses generating 2D pressure images predicting pressure only for certain body regions instead of the full body or forming indirect approximations to the 3D pressure map. In contrast we introduce BodyMAP which jointly predicts the human body mesh and 3D applied pressure map across the entire human body. Our network leverages multiple visual modalities incorporating both a depth image of a person in bed and its corresponding 2D pressure image acquired from a pressure-sensing mattress. The 3D pressure map is represented as a pressure value at each mesh vertex and thus allows for precise localization of high-pressure regions on the body. Additionally we present BodyMAP-WS a new formulation of pressure prediction in which we implicitly learn pressure in 3D by aligning sensed 2D pressure images with a differentiable 2D projection of the predicted 3D pressure maps. In evaluations with real-world human data our method outperforms the current state-of-the-art technique by 25% on both body mesh and 3D applied pressure map prediction tasks for people in bed.
-
Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However existing works rely heavily on modality-specific encoders which usually differ in architecture and are limited to common modalities. In this paper we present OneLLM an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail we first train an image projection module to connect a vision encoder with LLM. Then we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions we also curated a comprehensive multimodal instruction dataset including 2M items from image audio video point cloud depth/normal map IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks encompassing tasks such as multimodal captioning question answering and reasoning where it delivers excellent performance. Code data model and online demo are available at https://github.com/csuhan/OneLLM
-
Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods which rely on attack data or prior knowledge struggle to effectively address a wide range of adversarial patches. In this paper we show two inherent characteristics of adversarial patches semantic independence and spatial heterogeneity independent of their appearance shape size quantity and location. Semantic independence indicates that adversarial patches operate autonomously within their semantic context while spatial heterogeneity manifests as distinct image quality of the patch area that differs from original clean image due to the independent generation process. Based on these observations we propose PAD a novel adversarial patch localization and removal method that does not require prior knowledge or additional training. PAD offers patch-agnostic defense against various adversarial patches compatible with any pre-trained object detectors. Our comprehensive digital and physical experiments involving diverse patch types such as localized noise printable and naturalistic patches exhibit notable improvements over state-of-the-art works. Our code is available at https://github.com/Lihua-Jing/PAD.
-
Text-to-image generation has achieved astonishing results yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering scene layout conditioning or image editing techniques which often require hand-drawn masks. Nonetheless pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multi-layer instance-wise RGBA decompositions and over 100K instance images. To build MuLAn we developed a training-free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising background and isolated instances. We achieve this through the use of pretrained general-purpose models and by developing three modules: image decomposition for instance discovery and extraction instance completion to reconstruct occluded areas and image re-assembly. We use our pipeline to create MuLAn-COCO and MuLAn-LAION datasets which contain a variety of image decompositions in terms of style composition and complexity. With MuLAn we provide the first photorealistic resource providing instance decomposition and occlusion information for high-quality images opening up new avenues for text-to-image generative AI research. With this we aim to encourage the development of novel generation and editing technology in particular layer-wise solutions. MuLAn data resources are available at https://MuLAn-dataset.github.io/
-
This paper addresses complex challenges in histopathological image analysis through three key contributions. Firstly it introduces a fast patch selection method FPS for whole-slide image (WSI) analysis significantly reducing computational cost while maintaining accuracy. Secondly it presents PathDino a lightweight histopathology feature extractor with a minimal configuration of five Transformer blocks and only ≈ 9 million parameters markedly fewer than alternatives. Thirdly it introduces a rotation-agnostic representation learning paradigm using self-supervised learning effectively mitigating overfitting. We also show that our compact model outperforms existing state-of-the-art histopathology-specific vision transformers on 12 diverse datasets including both internal datasets spanning four sites (breast liver skin and colorectal) and seven public datasets (PANDA CAMELYON16 BRACS DigestPath Kather PanNuke and WSSS4LUAD). Notably even with a training dataset of ≈ 6 million histopathology patches from The Cancer Genome Atlas (TCGA) our approach demonstrates an average 8.5% improvement in patch-level majority vote performance. These contributions provide a robust framework for enhancing image analysis in digital pathology rigorously validated through extensive evaluation.
-
Single-source domain generalization (SDG) for object detection is a challenging yet essential task as the distribution bias of the unseen domain degrades the algorithm performance significantly. However existing methods attempt to extract domain-invariant features neglecting that the biased data leads the network to learn biased features that are non-causal and poorly generalizable. To this end we propose an Unbiased Faster R-CNN (UFR) for generalizable feature learning. Specifically we formulate SDG in object detection from a causal perspective and construct a Structural Causal Model (SCM) to analyze the data bias and feature bias in the task which are caused by scene confounders and object attribute confounders. Based on the SCM we design a Global-Local Transformation module for data augmentation which effectively simulates domain diversity and mitigates the data bias. Additionally we introduce a Causal Attention Learning module that incorporates a designed attention invariance loss to learn image-level features that are robust to scene confounders. Moreover we develop a Causal Prototype Learning module with an explicit instance constraint and an implicit prototype constraint which further alleviates the negative impact of object attribute confounders. Experimental results on five scenes demonstrate the prominent generalization ability of our method with an improvement of 3.9% mAP on the Night-Clear scene.
-
Spike camera is a neuromorphic vision sensor that can capture highly dynamic scenes by generating a continuous stream of binary spikes to represent the arrival of photons at very high temporal resolution. Equipped with Bayer color filter array (CFA) color spike camera (CSC) has been invented to capture color information. Although spike camera has already demonstrated great potential for high-speed imaging its spatial resolution is limited compared with conventional digital cameras. This paper proposes a Color Spike Camera Super-Resolution (CSCSR) network to super-resolve higher-resolution color images from spike camera streams with Bayer CFA. To be specific we first propose a representation for Bayer-pattern spike streams exploring local temporal information with global perception to represent the binary data. Then we exploit the CFA layout and sub-pixel level motion to collect temporal pixels for the spatial super-resolution of each color channel. In particular a residual-based module for feature refinement is developed to reduce the impact of motion estimation errors. Considering color correlation we jointly utilize the multi-stage temporal-pixel features of color channels to reconstruct the high-resolution color image. Experimental results demonstrate that the proposed scheme can reconstruct satisfactory color images with both high temporal and spatial resolution from low-resolution Bayer-pattern spike streams. The source codes are available at https://github.com/csycdong/CSCSR.
-
This paper views the DETR's non-duplicate detection ability as a competition result among object queries. Around each object there are usually multiple queries within which only a single one can win the chance to become the final detection. Such a competition is hard: while some competing queries initially have very close prediction scores their leading query has to dramatically enlarge its score superiority after several decoder layers. To help the leading query stand out this paper proposes EASE-DETR which eases the competition by introducing bias that favours the leading one. EASE-DETR is very simple: in every intermediate decoder layer we identify the "leading / trailing" relationship between any two queries and encode this binary relationship into the following decoder layer to amplify the superiority of the leading one. More concretely the leading query is protected from mutual query suppression in the self-attention layer and encouraged to absorb more object features in the cross-attention layer thereby accelerating its win. Experimental results show that EASE-DETR brings consistent and remarkable improvement to various DETRs.
-
In the field of deep point cloud understanding KPConv is a unique architecture that uses kernel points to locate convolutional weights in space instead of relying on Multi-Layer Perceptron (MLP) encodings. While it initially achieved success it has since been surpassed by recent MLP networks that employ updated designs and training strategies. Building upon the kernel point principle we present two novel designs: KPConvD (depthwise KPConv) a lighter design that enables the use of deeper architectures and KPConvX an innovative design that scales the depthwise convolutional weights of KPConvD with kernel attention values. Using KPConvX with a modern architecture and training strategy we are able to outperform current state-of-the-art approaches on the ScanObjectNN Scannetv2 and S3DIS datasets. We validate our design choices through ablation studies and release our code and models.
-
This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step we identify that not all operations are equally relevant for the final output quality. In particular we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation we propose Clockwork Diffusion a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines and for both text-to-image generation and image editing we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change. We release code at https://github.com/Qualcomm-AI-research/clockwork-diffusion
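The periodic reuse described in this abstract can be pictured with a toy schedule. This is only a sketch under stated assumptions: the function name, the `clock` parameter and the string stand-ins for feature maps are hypothetical, and plain caching is used here in place of whatever approximation the paper actually applies to the low-res features.

```python
def run_denoising(num_steps, clock):
    """Toy schedule: recompute the low-res branch only every `clock` steps."""
    cached_low_res = None
    low_res_calls = 0
    log = []
    for step in range(num_steps):
        if cached_low_res is None or step % clock == 0:
            # Stand-in for running the expensive low-res UNet layers.
            cached_low_res = f"low_res@{step}"
            low_res_calls += 1
        # The high-res layers would run at every step and consume the
        # (possibly stale) cached low-res features.
        log.append((step, cached_low_res))
    return low_res_calls, log


calls, log = run_denoising(num_steps=8, clock=2)
print(calls)   # number of steps on which the low-res branch was recomputed
print(log[3])  # step 3 reuses the features cached at step 2
```

With `clock=2` the low-res branch runs on every second step only, which is where a saving in compute would come from in this simplified picture.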
-
Channel pruning approaches for convolutional neural networks (ConvNets) deactivate the channels statically or dynamically and require special implementation. In addition channel squeezing in representative ConvNets is carried out via 1 x 1 convolutions which dominates a large portion of computations and network parameters. Given these challenges we propose an effective multi-purpose module for dynamic channel sampling namely Pick-or-Mix (PiX) which does not require special implementation. PiX divides a set of channels into subsets and then picks from them where the picking decision is dynamically made per each pixel based on the input activations. We plug PiX into prominent ConvNet architectures and verify its multi-purpose utilities. After replacing 1 x 1 channel squeezing layers in ResNet with PiX the network becomes 25% faster without losing accuracy. We show that PiX allows ConvNets to learn better data representation than widely adopted approaches to enhance networks' representation power (e.g. SE CBAM AFF SKNet and DWP). We also show that PiX achieves state-of-the-art performance on network downscaling and dynamic channel pruning applications.
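A rough illustration of per-pixel channel picking follows. It is a sketch, not the PiX operator itself: the abstract does not specify how the pick is made, so taking the maximum activation within each contiguous subset is an assumption, and both function names are hypothetical.

```python
def pick_per_pixel(pixel_channels, num_subsets):
    """Split one pixel's channel vector into contiguous subsets, keep one value each."""
    if len(pixel_channels) % num_subsets != 0:
        raise ValueError("channel count must be divisible by num_subsets")
    size = len(pixel_channels) // num_subsets
    # Assumed picking rule: the largest activation in each subset.
    return [max(pixel_channels[i * size:(i + 1) * size]) for i in range(num_subsets)]


def squeeze_feature_map(feature_map, num_subsets):
    """feature_map[y][x] is a list of C channel activations for that pixel."""
    return [[pick_per_pixel(px, num_subsets) for px in row] for row in feature_map]


# One row, two pixels, four channels squeezed down to two.
fmap = [[[0.1, 0.9, 0.3, 0.2], [0.5, 0.4, 0.0, 0.7]]]
squeezed = squeeze_feature_map(fmap, num_subsets=2)
print(squeezed)
```

Because the decision depends on each pixel's own activations, two pixels can keep different channels from the same subset, and no learned 1 x 1 convolution weights are needed for the squeeze in this toy version.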
-
Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content such as biased or harmful images. However the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However existing approaches cannot discover directions for arbitrary concepts such as those related to inappropriate concepts. In this work we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach namely for fair generation safe generation and responsible text-enhancing generation. Project page: https://interpretdiffusion.github.io.
-
Reconstructing 3D clothed human involves creating a detailed geometry of individuals in clothing with applications ranging from virtual try-on movies to games. To enable practical and widespread applications recent advances propose to generate a clothed human from an RGB image. However they struggle to reconstruct detailed and robust avatars simultaneously. We empirically find that the high-frequency (HF) and low-frequency (LF) information from a parametric model has the potential to enhance geometry details and improve robustness to noise respectively. Based on this we propose HiLo namely clothed human reconstruction with high- and low-frequency information which contains two components. 1) To recover detailed geometry using HF information we propose a progressive HF Signed Distance Function to enhance the detailed 3D geometry of a clothed human. We analyze that our progressive learning manner alleviates large gradients that hinder model convergence. 2) To achieve robust reconstruction against inaccurate estimation of the parametric model by using LF information we propose a spatial interaction implicit function. This function effectively exploits the complementary spatial information from a low-resolution voxel grid of the parametric model. Experimental results demonstrate that HiLo outperforms the state-of-the-art methods by 10.43% and 9.54% in terms of Chamfer distance on the Thuman2.0 and CAPE datasets respectively. Additionally HiLo demonstrates robustness to noise from the parametric model challenging poses and various clothing styles.
-
Customizing robotic behaviors to be aligned with diverse human preferences is an underexplored challenge in the field of embodied AI. In this paper we present Promptable Behaviors a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences in complex environments. We use multi-objective reinforcement learning to train a single policy adaptable to a broad spectrum of preferences. We introduce three distinct methods to infer human preferences by leveraging different types of interactions: (1) human demonstrations (2) preference feedback on trajectory comparisons and (3) language instructions. We evaluate the proposed method in personalized object-goal navigation and flee navigation tasks in ProcTHOR and RoboTHOR demonstrating the ability to prompt agent behaviors to satisfy human preferences in various scenarios.
-
Learning compatible representations enables the interchangeable use of semantic features as models are updated over time. This is particularly relevant in search and retrieval systems where it is crucial to avoid reprocessing of the gallery images with the updated model. While recent research has shown promising empirical evidence there is still a lack of comprehensive theoretical understanding about learning compatible representations. In this paper we demonstrate that the stationary representations learned by the d-Simplex fixed classifier optimally approximate compatibility representation according to the two inequality constraints of its formal definition. This not only establishes a solid foundation for future works in this line of research but also presents implications that can be exploited in practical learning scenarios. An exemplary application is the now-standard practice of downloading and fine-tuning new pre-trained models. Specifically we show the strengths and critical issues of stationary representations in the case in which a model undergoing sequential fine-tuning is asynchronously replaced by downloading a better-performing model pre-trained elsewhere. Such a representation enables seamless delivery of retrieval service (i.e. no reprocessing of gallery images) and offers improved performance without operational disruptions during model replacement. Code available at: https://github.com/miccunifi/iamcl2r.
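For readers unfamiliar with simplex-shaped fixed classifiers, the geometry can be sketched as follows. This is a generic construction of regular-simplex class prototypes, not code from the paper: for simplicity the vertices are embedded in R^K rather than in K-1 dimensions, and the function names are hypothetical.

```python
import math


def simplex_prototypes(num_classes):
    """Unit-norm class prototypes at the vertices of a regular simplex (embedded in R^K)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Vertex i is the centred one-hot vector e_i - (1/K) * 1, rescaled to unit length.
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


protos = simplex_prototypes(4)
print(dot(protos[0], protos[0]))  # squared norm of one prototype
print(dot(protos[0], protos[1]))  # cosine between two different prototypes
```

Every pair of prototypes has the same cosine, -1/(K-1), so the classifier weights never need to be learned or changed when the feature extractor is updated, which is the property a fixed classifier relies on.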
-
The problem of calibrating deep neural networks (DNNs) for multi-label learning is considered. It is well-known that DNNs trained by cross-entropy for single-label or one-hot classification are poorly calibrated. Many calibration techniques have been proposed to address the problem. However little attention has been paid to the calibration of multi-label DNNs. In this literature the focus has been on improving labeling accuracy in the face of severe dataset unbalance. This is addressed by the introduction of asymmetric losses which have become very popular. However these losses do not induce well-calibrated classifiers. In this work we first provide a theoretical explanation for this poor calibration performance by showing that these losses lack the strictly proper property a necessary condition for accurate probability estimation. To overcome this problem we propose a new Strictly Proper Asymmetric (SPA) loss. This is complemented by a Label Pair Regularizer (LPR) that increases the number of calibration constraints introduced per training example. The effectiveness of both contributions is validated by extensive experiments on various multi-label datasets. The resulting training method is shown to significantly decrease the calibration error while maintaining state-of-the-art accuracy.
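The strictly proper property mentioned here can be checked numerically: a loss is strictly proper when the expected loss under a true positive probability p is minimized only by predicting q = p. The sketch below compares binary cross-entropy with a simplified focal-style asymmetric loss. The asymmetric form (negative term down-weighted by q squared) is an assumption chosen for illustration; it is not the paper's SPA loss, and the function names are hypothetical.

```python
import math


def bce(q, y):
    """Binary cross-entropy for predicted probability q and label y."""
    return -math.log(q) if y == 1 else -math.log(1.0 - q)


def asymmetric(q, y, gamma_neg=2.0):
    """Assumed focal-style asymmetric loss: easy negatives are down-weighted."""
    return -math.log(q) if y == 1 else -(q ** gamma_neg) * math.log(1.0 - q)


def best_prediction(loss, p, grid=999):
    """Grid search for the q that minimizes the expected loss when P(y=1) = p."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return min(candidates, key=lambda q: p * loss(q, 1) + (1.0 - p) * loss(q, 0))


p_true = 0.3
q_bce = best_prediction(bce, p_true)
q_asym = best_prediction(asymmetric, p_true)
print(q_bce, q_asym)
```

Cross-entropy recovers the true probability, while the asymmetric variant is minimized by an inflated prediction, which is one way to see why such losses can yield miscalibrated scores.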
-
We propose SceneTex a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distill diffusion latent features without accurate geometric and style cues SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in respective RGB renderings. To further secure the style consistency across views we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables various and accurate texture synthesis for 3D-FRONT scenes demonstrating significant improvements in visual quality and prompt fidelity over the prior texture generation methods.
-
Among the numerous efforts towards digitally recovering the physical world Neural Radiance Fields (NeRFs) have proved effective in most cases. However underwater scenes introduce unique challenges due to the absorbing water medium the local change in lighting and the dynamic contents in the scene. We aim at developing a neural underwater scene representation for these challenges modeling the complex process of attenuation unstable in-scattering and moving objects during light transport. The proposed method can reconstruct the scenes from both established datasets and in-the-wild videos with outstanding fidelity.
-
We address the problem of online action segmentation for egocentric procedural task videos. While previous studies have mostly focused on offline action segmentation where entire videos are available for both training and inference the transition to online action segmentation is crucial for practical applications like AR/VR task assistants. Notably applying an offline-trained model directly to online inference results in a significant performance drop due to the inconsistency between training and inference. We propose an online action segmentation framework by first modifying existing architectures to make them causal. Second we develop a novel action progress prediction module to dynamically estimate the progress of ongoing actions and use these estimates to refine the predictions of causal action segmentation. Third we propose to learn task graphs from training videos and leverage them to obtain smooth and procedure-consistent segmentations. With the combination of progress and task graph with causal action segmentation our framework effectively addresses prediction uncertainty and oversegmentation in online action segmentation and achieves significant improvement on three egocentric datasets.
-
Cooperative perception offers several benefits for enhancing the capabilities of autonomous vehicles and improving road safety. Using roadside sensors in addition to onboard sensors increases reliability and extends the sensor range. External sensors offer higher situational awareness for automated vehicles and prevent occlusions. We propose CoopDet3D a cooperative multi-modal fusion model and TUMTraf-V2X a perception dataset for the cooperative 3D object detection and tracking task. Our dataset contains 2000 labeled point clouds and 5000 labeled images from five roadside and four onboard sensors. It includes 30k 3D boxes with track IDs and precise GPS and IMU data. We labeled nine categories and covered occlusion scenarios with challenging driving maneuvers like traffic violations near-miss events overtaking and U-turns. Through multiple experiments we show that our CoopDet3D camera-LiDAR fusion model achieves an increase of +14.36 3D mAP compared to a vehicle camera-LiDAR fusion model. Finally we make our dataset model labeling tool and devkit publicly available on our website: https://tum-traffic-dataset.github.io/tumtraf-v2x.
-
This paper addresses the challenge of object-centric layout generation under spatial constraints seen in multiple domains including the floorplan design process. The design process typically involves specifying a set of spatial constraints that include object attributes like size and inter-object relations such as relative positioning. Existing works which typically represent objects as single nodes lack the granularity to accurately model complex interactions between objects. For instance often only certain parts of an object like a room's right wall interact with adjacent objects. To address this gap we introduce a factor graph based approach with four latent variable nodes for each room and a factor node for each constraint. The factor nodes represent dependencies among the variables to which they are connected effectively capturing constraints that are potentially of a higher order. We then develop message-passing on the bipartite graph forming a factor graph neural network that is trained to produce a floorplan that aligns with the desired requirements. Our approach is simple and generates layouts faithful to the user requirements demonstrated by a large improvement in IOU scores over existing methods. Additionally our approach being inferential and accurate is well-suited to the practical human-in-the-loop design process where specifications evolve iteratively offering a practical and powerful tool for AI-guided design.
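The bipartite structure described in this abstract (four variable nodes per room, one factor node per constraint) can be sketched as a small data structure. Everything concrete here is an assumption for illustration: the abstract does not say what the four latent variables are, so they are taken to be the four sides of a room, and the function name, room names and constraints are hypothetical.

```python
def build_factor_graph(rooms, constraints):
    """Return variable nodes and bipartite (factor_id, variable) edges."""
    # Assumed meaning of the four latent variables: one per side of the room.
    sides = ("left", "right", "top", "bottom")
    variables = [(room, side) for room in rooms for side in sides]
    known = set(variables)
    edges = []
    for factor_id, connected in enumerate(constraints):
        for var in connected:
            if var not in known:
                raise ValueError(f"unknown variable {var!r}")
            # A factor node touches only the variables its constraint involves.
            edges.append((factor_id, var))
    return variables, edges


rooms = ["kitchen", "living"]
constraints = [
    [("kitchen", "right"), ("living", "left")],   # adjacency between two rooms
    [("kitchen", "left"), ("kitchen", "right")],  # width of a single room
]
variables, edges = build_factor_graph(rooms, constraints)
print(len(variables), len(edges))
```

Message passing would then alternate between variable nodes and factor nodes along these edges; a constraint that involves more than two variables simply contributes more edges to its factor node.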
-
Local Interpretable Model-agnostic Explanations (LIME) is a widely used post-hoc, model-agnostic explainable AI (XAI) technique. It works by training a simple, transparent (surrogate) model using random samples drawn around the neighborhood of the instance (image) to be explained (IE). Explanations are then extracted for a black-box model and a given IE using the surrogate model. However, the explanations of LIME suffer from inconsistency across different runs for the same model and the same IE. We identify two main types of inconsistencies: variance in the sign and in the importance ranks of the segments (superpixels). These factors hinder LIME from obtaining consistent explanations. We analyze these inconsistencies and propose a new method, Stabilized LIME for Consistent Explanations (SLICE). The proposed method handles the stabilization problem in two aspects: using a novel feature selection technique to eliminate spurious superpixels, and an adaptive perturbation technique to generate perturbed images in the neighborhood of the IE. Our results demonstrate that the explanations from SLICE exhibit significantly better consistency and fidelity than LIME (and its variant BayLime).
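The two inconsistency types named in this abstract, sign variance and rank variance, can be made concrete with a small measurement. The sketch below is illustrative only and is not the SLICE algorithm: it assumes each LIME run yields one surrogate coefficient per superpixel, and the names `sign_flip_rate`, `importance_ranks`, and `rank_spread` are hypothetical.

```python
from statistics import pstdev


def sign_flip_rate(runs):
    """Fraction of superpixels whose coefficient sign differs across runs.

    `runs` is a list of equal-length lists: one coefficient per superpixel
    per run (an assumed representation, not LIME's actual output type).
    """
    n_segments = len(runs[0])
    flipped = 0
    for j in range(n_segments):
        signs = {(run[j] > 0) - (run[j] < 0) for run in runs}
        signs.discard(0)  # a zero coefficient is treated as "no vote"
        if len(signs) > 1:
            flipped += 1
    return flipped / n_segments


def importance_ranks(coeffs):
    """Rank superpixels by absolute coefficient; rank 0 is most important."""
    order = sorted(range(len(coeffs)), key=lambda j: -abs(coeffs[j]))
    ranks = [0] * len(coeffs)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def rank_spread(runs):
    """Mean over superpixels of the standard deviation of their rank."""
    all_ranks = [importance_ranks(run) for run in runs]
    n_segments = len(runs[0])
    spreads = [pstdev([r[j] for r in all_ranks]) for j in range(n_segments)]
    return sum(spreads) / n_segments


# Three hypothetical runs over four superpixels.
runs = [
    [0.9, -0.2, 0.1, 0.05],
    [0.8, 0.3, -0.1, 0.04],
    [0.85, -0.25, 0.12, 0.06],
]
print(sign_flip_rate(runs))  # superpixels 1 and 2 change sign across runs
print(rank_spread(runs))     # the importance order is identical in all runs
```

In this toy input the importance order is stable while half of the superpixels change sign, which shows that the two kinds of inconsistency can occur independently.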
-
Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly detection area - aims at utilizing a few samples of anomaly classes seen during training to detect unseen anomalies (i.e. samples from open-set anomaly classes) while effectively identifying the seen anomalies. Benefiting from the prior knowledge illustrated by the seen anomalies current OSAD methods can often largely reduce false positive errors. However these methods are trained in a closed-set setting and treat the anomaly examples as from a homogeneous distribution rendering them less effective in generalizing to unseen anomalies that can be drawn from any distribution. This paper proposes to learn heterogeneous anomaly distributions using the limited anomaly examples to address this issue. To this end we introduce a novel approach namely Anomaly Heterogeneity Learning (AHL) that simulates a diverse set of heterogeneous anomaly distributions and then utilizes them to learn a unified heterogeneous abnormality model in surrogate open-set environments. Further AHL is a generic framework that existing OSAD models can plug and play for enhancing their abnormality modeling. Extensive experiments on nine real-world anomaly detection datasets show that AHL can 1) substantially enhance different state-of-the-art OSAD models in detecting seen and unseen anomalies and 2) effectively generalize to unseen anomalies in new domains. Code is available at https://github.com/mala-lab/AHL.
-
Compressive spectral image reconstruction is a critical method for acquiring images with high spatial and spectral resolution. Current advanced methods which involve designing deeper networks or adding more self-attention modules are limited by the scope of attention modules and the irrelevance of attentions across different dimensions. This leads to difficulties in capturing non-local mutation features in the spatial-spectral domain and results in a significant parameter increase but only limited performance improvement. To address these issues we propose SPECAT a SPatial-spEctral Cumulative-Attention Transformer designed for high-resolution hyperspectral image reconstruction. SPECAT utilizes Cumulative-Attention Blocks (CABs) within an efficient hierarchical framework to extract features from non-local spatial-spectral details. Furthermore it employs a projection-object Dual-domain Loss Function (DLF) to integrate the optical path constraint a physical aspect often overlooked in current methodologies. Ultimately SPECAT not only significantly enhances the reconstruction quality of spectral details but also breaks through the bottleneck of mutual restriction between the cost and accuracy in existing algorithms. Our experimental results demonstrate the superiority of SPECAT achieving 40.3 dB in hyperspectral reconstruction benchmarks outperforming the state-of-the-art (SOTA) algorithms by 1.2 dB while using only 5% of the network parameters and 10% of the computational cost. The code is available at https://github.com/THU-luvision/SPECAT.
-
White balance (WB) algorithms in many commercial cameras assume single and uniform illumination leading to undesirable results when multiple lighting sources with different chromaticities exist in the scene. Prior research on multi-illuminant WB typically predicts illumination at the pixel level without fully grasping the scene's actual lighting conditions including the number and color of light sources. This often results in unnatural outcomes lacking in overall consistency. To handle this problem we present a deep white balancing model that leverages the slot attention where each slot is in charge of representing individual illuminants. This design enables the model to generate chromaticities and weight maps for individual illuminants which are then fused to compose the final illumination map. Furthermore we propose the centroid-matching loss which regulates the activation of each slot based on the color range thereby enhancing the model to separate illumination more effectively. Our method achieves the state-of-the-art performance on both single- and multi-illuminant WB benchmarks and also offers additional information such as the number of illuminants in the scene and their chromaticity. This capability allows for illumination editing an application not feasible with prior methods.
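The fusion step described in this abstract, where per-illuminant chromaticities and weight maps are combined into one illumination map, can be sketched as a per-pixel weighted sum. This is an assumption about the form of the fusion, which the abstract does not spell out; `fuse_illumination` and the nested-list data layout are hypothetical.

```python
def fuse_illumination(chromaticities, weight_maps):
    """Blend per-illuminant RGB chromaticities with per-pixel weight maps.

    chromaticities: list of K (r, g, b) tuples, one per illuminant (slot).
    weight_maps: list of K maps, each H x W, assumed to sum to 1 per pixel.
    Returns an H x W map of (r, g, b) tuples.
    """
    height = len(weight_maps[0])
    width = len(weight_maps[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            # Assumed fusion rule: weighted sum over illuminants per channel.
            pixel = tuple(
                sum(w[y][x] * c[ch] for w, c in zip(weight_maps, chromaticities))
                for ch in range(3)
            )
            row.append(pixel)
        fused.append(row)
    return fused


# Two hypothetical illuminants (pure red, pure blue) over a 1 x 2 image.
chroma = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
weights = [
    [[1.0, 0.25]],  # weight map of illuminant 0
    [[0.0, 0.75]],  # weight map of illuminant 1
]
illumination = fuse_illumination(chroma, weights)
print(illumination)
```

Under this form, editing the scene lighting amounts to changing one entry of `chroma` and fusing again, which is one way to read the illumination-editing application the abstract mentions.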
-
The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However, most fine-tuning methods are designed to meet a specific resource budget. Recently, considering diverse deployment scenarios with various resource budgets, SN-Net was introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising, SN-Net confronts new challenges when adapted to new target domains, including huge memory and storage requirements and a long and sub-optimal multistage adaptation process. In this work, we present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way, we largely reduce fine-tuning memory burdens and mitigate the interference among stitches that arises in task adaptation. Furthermore, we streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy with training-time gradient statistics. By assigning higher sampling probabilities to important stitches, we also get a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that our ESTA is capable of generating stitches with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore, we demonstrate the flexibility and scalability of our ESTA framework by stitching LLMs from the LLaMA family, obtaining chatbot stitches of assorted sizes.
-
Traditional referring expression comprehension (REC) aims to locate the target referent in an image guided by a text query. Several previous methods have studied the counterfactual problem in REC (C-REC), where the objects for a given query cannot be found in the image. However, these methods focus only on the overall image-text mismatch or on specific attribute mismatches. In this paper, we address the C-REC problem from the deeper perspective of fine-grained attributes. To this end, we first propose a fine-grained counterfactual sample generation method to construct C-REC datasets. Specifically, we leverage a pre-trained language model such as BERT to modify the attribute words in the queries, obtaining the corresponding counterfactual samples. Furthermore, we propose a C-REC framework. We first adopt three encoders to extract image, text, and attribute features. Then, our dual-branch attentive fusion module fuses these cross-modal features with two branches by an attention mechanism. Finally, two prediction heads generate a bounding box and a counterfactual label, respectively. In addition, we incorporate contrastive learning with the generated counterfactual samples as negatives to enhance the counterfactual perception. Extensive experiments show that our framework achieves promising performance on both the public REC datasets RefCOCO/+/g and our constructed C-REC datasets C-RefCOCO/+/g. The code and data are available at https://github.com/Glacier0012/CREC.
-
Recent advancements in dynamic neural radiance field methods have yielded remarkable outcomes. However these approaches rely on the assumption of sharp input images. When faced with motion blur existing dynamic NeRF methods often struggle to generate high-quality novel views. In this paper we propose DyBluRF a dynamic radiance field approach that synthesizes sharp novel views from a monocular video affected by motion blur. To account for motion blur in input images we simultaneously capture the camera trajectory and object Discrete Cosine Transform (DCT) trajectories within the scene. Additionally we employ a global cross-time rendering approach to ensure consistent temporal coherence across the entire scene. We curate a dataset comprising diverse dynamic scenes that are specifically tailored for our task. Experimental results on our dataset demonstrate that our method outperforms existing approaches in generating sharp novel views from motion-blurred inputs while maintaining spatial-temporal consistency of the scene.
-
Recently, high-fidelity scene reconstruction with an optimized 3D Gaussian splat representation has been introduced for novel view synthesis from sparse image sets. Making such representations suitable for applications like network streaming and rendering on low-power devices requires significantly reduced memory consumption as well as improved rendering efficiency. We propose a compressed 3D Gaussian splat representation that utilizes sensitivity-aware vector clustering with quantization-aware training to compress directional colors and Gaussian parameters. The learned codebooks have low bitrates and achieve a compression rate of up to 31x on real-world scenes with only minimal degradation of visual quality. We demonstrate that the compressed splat representation can be efficiently rendered with hardware rasterization on lightweight GPUs at up to 4x higher framerates than reported via an optimized GPU compute pipeline. Extensive experiments across multiple datasets demonstrate the robustness and rendering speed of the proposed approach.
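To illustrate why codebooks reduce storage, the sketch below shows plain vector quantization: each parameter vector is replaced by the index of its nearest codeword. It deliberately omits the sensitivity-aware clustering and quantization-aware training that the abstract describes; `nearest_code`, `quantize`, and `compression_ratio` are hypothetical names, and 32-bit values are an assumption.

```python
import math


def nearest_code(vector, codebook):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )


def quantize(vectors, codebook):
    """Replace every vector with the index of its nearest codeword."""
    return [nearest_code(v, codebook) for v in vectors]


def compression_ratio(n_vectors, dim, n_codes, bits_per_value=32):
    """Raw storage divided by (indices + codebook) storage, in bits."""
    raw_bits = n_vectors * dim * bits_per_value
    index_bits = max(1, math.ceil(math.log2(n_codes)))
    coded_bits = n_vectors * index_bits + n_codes * dim * bits_per_value
    return raw_bits / coded_bits


# A hypothetical two-entry codebook and three 2-D parameter vectors.
codebook = [[0.0, 0.0], [1.0, 1.0]]
vectors = [[0.1, 0.0], [0.9, 1.2], [0.4, 0.4]]
indices = quantize(vectors, codebook)
print(indices)
print(compression_ratio(n_vectors=1000, dim=2, n_codes=2))
```

The ratio grows with the number of vectors because the codebook cost is paid once while each vector costs only a few index bits.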
-
We present DenseAV a novel dual encoder grounding architecture that learns high-resolution semantically meaningful and audio-visual aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast many other systems that learn "global" audio and video representations cannot localize words and sound. Finally we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the current state-of-the-art ImageBind on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav
-
We address the challenge of semi-supervised domain generalization (SSDG). Specifically, our aim is to obtain a model that learns domain-generalizable features by leveraging a limited subset of labelled data alongside a substantially larger pool of unlabeled data. Existing domain generalization (DG) methods, which are unable to exploit unlabeled data, perform poorly compared to semi-supervised learning (SSL) methods under the SSDG setting. Nevertheless, SSL methods have considerable room for performance improvement when compared to fully-supervised DG training. To tackle this underexplored yet highly practical problem of SSDG, we make the following core contributions. First, we propose a feature-based conformity technique that matches the posterior distributions from the feature space with the pseudo-label from the model's output space. Second, we develop a semantics alignment loss to learn semantically-compatible representations by regularizing the semantic structure in the feature space. Our method is plug-and-play and can be readily integrated with different SSL-based SSDG baselines without introducing any additional parameters. Extensive experimental results across five challenging DG benchmarks with four strong SSL baselines suggest that our method provides consistent and notable gains in two different SSDG settings.
-
With the success of large language models (LLMs) integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However existing LLM-based large multimodal models (e.g. Video-LLaMA VideoChat) can only take in a limited number of frames for short video understanding. In this study we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks such as long-video understanding video question answering and video captioning and our model can achieve state-of-the-art performances across multiple datasets.
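The online processing idea in this abstract can be sketched as a fixed-capacity memory bank that receives one frame feature at a time. The abstract does not say how the bank stays within its budget; the policy below, averaging the two most similar neighboring entries when the bank overflows, is an assumption made for illustration, and `MemoryBank` and `cosine` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Keeps at most `capacity` feature vectors in temporal order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, feature):
        """Append a new frame feature, compressing if the bank overflows."""
        self.items.append(list(feature))
        if len(self.items) > self.capacity:
            self._merge_most_similar_neighbors()

    def _merge_most_similar_neighbors(self):
        # Assumed policy: average the most similar pair of adjacent entries,
        # so redundant consecutive frames are collapsed first.
        sims = [
            cosine(self.items[i], self.items[i + 1])
            for i in range(len(self.items) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(a + b) / 2 for a, b in zip(self.items[i], self.items[i + 1])]
        self.items[i:i + 2] = [merged]


# Stream three hypothetical frame features into a bank of capacity 2.
bank = MemoryBank(capacity=2)
for frame_feature in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]):
    bank.add(frame_feature)
print(bank.items)
```

Because the bank length never exceeds `capacity`, the number of tokens handed to the language model stays bounded regardless of video length, which is the property the abstract emphasizes.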
-
Interactive motion synthesis is essential in creating immersive experiences in entertainment applications such as video games and virtual reality. However generating animations that are both high-quality and contextually responsive remains a challenge. Traditional techniques in the game industry can produce high-fidelity animations but suffer from high computational costs and poor scalability. Trained neural network models alleviate the memory and speed issues yet fall short on generating diverse motions. Diffusion models offer diverse motion synthesis with low memory usage but require expensive reverse diffusion processes. This paper introduces the Accelerated Auto-regressive Motion Diffusion Model (AAMDM) a novel motion synthesis framework designed to achieve quality diversity and efficiency all together. AAMDM integrates Denoising Diffusion GANs as a fast Generation Module and an Auto-regressive Diffusion Model as a Polishing Module. Furthermore AAMDM operates in a lower-dimensional embedded space rather than the full-dimensional pose space which reduces the training complexity as well as further improves the performance. We show that AAMDM outperforms existing methods in motion quality diversity and runtime efficiency through comprehensive quantitative analyses and visual comparisons. We also demonstrate the effectiveness of each algorithmic component through ablation studies.
-
Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. However for domain-specific scenarios tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. This approach modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. Nevertheless little is known about the semantic meanings that these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information which can result in editing failures. In contrast self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention mechanisms in diffusion models. Furthermore based on our findings we propose a simplified yet more stable and efficient tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets.
-
Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning which is highly memory-intensive for tasks with high-resolution data e.g. video understanding small object detection and point cloud analysis. In this paper we propose Dynamic Reversible Dual-Residual Networks or Dr2Net a novel family of network architectures that acts as a surrogate network to finetune a pretrained model with substantially reduced memory consumption. Dr2Net contains two types of residual connections one maintaining the residual structure in the pretrained models and the other making the network reversible. Due to its reversibility intermediate activations which can be reconstructed from output are cleared from memory during training. We use two coefficients on either type of residual connections respectively and introduce a dynamic training strategy that seamlessly transitions the pretrained model to a reversible network with much higher numerical precision. We evaluate Dr2Net on various pretrained models and various tasks and show that it can reach comparable performance to conventional finetuning but with significantly less memory usage.
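The memory saving rests on reversibility: if a block's inputs can be recomputed from its outputs, intermediate activations need not be stored. The sketch below uses a generic additive coupling in the style of reversible residual networks, not Dr2Net's own dual-residual formulation with two coefficients, which the abstract does not specify; `f` and `g` are arbitrary stand-in functions.

```python
def f(x):
    """Stand-in sub-network; any deterministic function works here."""
    return [0.5 * v + 1.0 for v in x]


def g(x):
    """Second stand-in sub-network."""
    return [v * v for v in x]


def forward(x1, x2):
    """Additive coupling: each half is updated using only the other half."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Recover the inputs exactly by undoing the two updates in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [1.0, 2.0], [3.0, 4.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(y1, y2)
print(r1, r2)
```

Note that `f` and `g` themselves never need to be invertible; only the additive structure is inverted, which is why arbitrary sub-networks can be used.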
-
The primary focus of Neural Representation for Videos (NeRV) is to effectively model its spatiotemporal consistency. However current NeRV systems often face a significant issue of spatial inconsistency leading to decreased perceptual quality. To address this issue we introduce the Pyramidal Neural Representation for Videos (PNeRV) which is built on a multi-scale information connection and comprises a lightweight rescaling operator Kronecker Fully-connected layer (KFc) and a Benign Selective Memory (BSM) mechanism. The KFc inspired by the tensor decomposition of the vanilla Fully-connected layer facilitates low-cost rescaling and global correlation modeling. BSM merges high-level features with granular ones adaptively. Furthermore we provide an analysis based on the Universal Approximation Theory of the NeRV system and validate the effectiveness of the proposed PNeRV. We conducted comprehensive experiments to demonstrate that PNeRV surpasses the performance of contemporary NeRV models achieving the best results in video regression on UVG and DAVIS under various metrics (PSNR SSIM LPIPS and FVD). Compared to vanilla NeRV PNeRV achieves a +4.49 dB gain in PSNR and a 231% increase in FVD on UVG along with a +3.28 dB PSNR and 634% FVD increase on DAVIS.
-
Long-tail recognition is challenging because it requires the model to learn good representations from tail categories and address imbalances across all categories. In this paper we propose a novel generative and fine-tuning framework LTGC to handle long-tail recognition via leveraging generated content. Firstly inspired by the rich implicit knowledge in large-scale models (e.g. large language models LLMs) LTGC leverages the power of these models to parse and reason over the original tail data to produce diverse tail-class content. We then propose several novel designs for LTGC to ensure the quality of the generated data and to efficiently fine-tune the model using both the generated and original data. The visualization demonstrates the effectiveness of the generation module in LTGC which produces accurate and diverse tail data. Additionally the experimental results demonstrate that our LTGC outperforms existing state-of-the-art methods on popular long-tailed benchmarks.
-
Instance segmentation is data-hungry and as model capacity increases data scale becomes crucial for improving the accuracy. Most instance segmentation datasets today require costly manual annotation limiting their data scale. Models trained on such data are prone to overfitting on the training set especially for those rare categories. While recent works have delved into exploiting generative models to create synthetic datasets for data augmentation these approaches do not efficiently harness the full potential of generative models. To address these issues we introduce a more efficient strategy to construct generative datasets for data augmentation termed DiverGen. Firstly we provide an explanation of the role of generative data from the perspective of distribution discrepancy. We investigate the impact of different data on the distribution learned by the model. We argue that generative data can expand the data distribution that the model can learn thus mitigating overfitting. Additionally we find that the diversity of generative data is crucial for improving model performance and enhance it through various strategies including category diversity prompt diversity and generative model diversity. With these strategies we can scale the data to millions while maintaining the trend of model performance improvement. On the LVIS dataset DiverGen significantly outperforms the strong model X-Paste achieving +1.1 box AP and +1.1 mask AP across all categories and +1.9 box AP and +2.5 mask AP for rare categories. Our codes are available at https://github.com/aim-uofa/DiverGen.
-
This study focuses on a novel task in text-to-image (T2I) generation namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle in decoupling actions from context features including appearance. To overcome the preference for low-level features and the entanglement of high-level features we propose an inversion-based method Action-Disentangled Identifier (ADI) to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens thereby increasing the representational richness while distributing the inversion across different features. Then to block the inversion of action-agnostic features ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task we present an ActionBench that includes a variety of actions each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI.
-
We propose a framework for automatic colorization that allows for iterative editing and modifications. The core of our framework lies in an imagination module: by understanding the content within a grayscale image we utilize a pre-trained image generation model to generate multiple images that contain the same content. These images serve as references for coloring mimicking the process of human experts. As the synthesized images can be imperfect or different from the original grayscale image we propose a Reference Refinement Module to select the optimal reference composition. Unlike most previous end-to-end automatic colorization algorithms our framework allows for iterative and localized modifications of the colorization results because we explicitly model the coloring samples. Extensive experiments demonstrate the superiority of our framework over existing automatic colorization algorithms in editability and flexibility. Project page: https://xy-cong.github.io/imagine-colorization/.
-
This paper is not motivated to seek innovation within the attention mechanism. Instead it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning we recognize that model performance is more influenced by scale than by intricate design. Therefore we present Point Transformer V3 (PTv3) which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training PTv3 pushes these results to a higher level.
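The idea of replacing KNN search with a serialized neighbor mapping can be sketched with a space-filling curve: points are sorted by a curve index, and consecutive points in that order are grouped as neighbors. The abstract only says "specific patterns"; the Z-order (Morton) curve used here is an assumed example, and `morton3d`, `serialize`, `grid_size`, and `patch_size` are hypothetical names. Coordinates are assumed non-negative.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of three non-negative integers."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def serialize(points, grid_size, patch_size):
    """Sort point indices along the Z-order curve and cut them into patches.

    points: list of (x, y, z) floats with non-negative coordinates.
    Returns a list of patches, each a list of point indices.
    """
    codes = []
    for px, py, pz in points:
        # Quantize each coordinate to a voxel index before encoding.
        voxel = (int(px // grid_size), int(py // grid_size), int(pz // grid_size))
        codes.append(morton3d(*voxel))
    order = sorted(range(len(points)), key=codes.__getitem__)
    return [order[i:i + patch_size] for i in range(0, len(order), patch_size)]


# Four hypothetical points forming two spatial clusters.
points = [(0.0, 0.0, 0.0), (3.0, 3.0, 3.0), (0.1, 0.0, 0.0), (3.1, 3.0, 3.0)]
patches = serialize(points, grid_size=1.0, patch_size=2)
print(patches)
```

Sorting costs O(n log n) once, after which each patch is a contiguous slice; the grouping is only approximately spatial, which matches the abstract's stated trade of neighbor precision for efficiency.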
-
Precipitation nowcasting is an important spatio-temporal prediction task that predicts radar echo sequences based on current observations, which can serve both meteorological science and smart city applications. Due to the chaotic evolution of precipitation systems, it is a very challenging problem. Previous studies address the problem from the perspective of either deterministic modeling or probabilistic modeling. However, their predictions suffer from blurriness, fading of high-value echoes, and positional inaccuracy. The root cause of these issues is that the chaotic evolutionary precipitation systems are not appropriately modeled. Inspired by the nature of the systems, we propose to decompose and model them from the perspective of global deterministic motion and local stochastic variations with a residual mechanism. A unified and flexible framework that can equip any type of spatio-temporal model is proposed based on residual diffusion, which effectively tackles the shortcomings of previous methods. Extensive experimental results on four publicly available radar datasets demonstrate the effectiveness and superiority of the proposed framework compared to state-of-the-art techniques. Our code is publicly available at https://github.com/DeminYu98/DiffCast.
-
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community.
-
Pre-training a model and then fine-tuning it on downstream tasks has demonstrated significant success in the 2D image and NLP domains. However, due to the unordered and non-uniform density characteristics of point clouds, it is non-trivial to explore the prior knowledge of point clouds and pre-train a point cloud backbone. In this paper, we propose a novel pre-training method called Point cloud Diffusion pre-training (PointDif). We consider the point cloud pre-training task as a conditional point-to-point generation problem and introduce a conditional point generator. This generator aggregates the features extracted by the backbone and employs them as the condition to guide the point-to-point recovery from the noisy point cloud, thereby assisting the backbone in capturing both local and global geometric priors as well as the global point density distribution of the object. We also present a recurrent uniform sampling optimization strategy, which enables the model to uniformly recover from various noise levels and learn from balanced supervision. Our PointDif achieves substantial improvement across various real-world datasets for diverse downstream tasks such as classification, segmentation, and detection. Specifically, PointDif attains 70.0% mIoU on S3DIS Area 5 for the segmentation task and achieves an average improvement of 2.4% on ScanObjectNN for the classification task compared to TAP. Furthermore, our pre-training framework can be flexibly applied to diverse point cloud backbones and bring considerable gains. Code is available at https://github.com/zhengxiaozx/PointDif
-
In Visual Question Answering (VQA) recognizing and localizing entities pose significant challenges. Pretrained vision-and-language models have addressed this problem by providing a text description as the answer. However in visual scenes with multiple entities textual descriptions struggle to distinguish the entities from the same category effectively. Consequently the VQA dataset is limited by the limitations of text description and cannot adequately cover scenarios involving multiple entities. To address this challenge we introduce a Mask for Align (Mask4Align) method which can determine the entity's position in the given image that best matches the user-input question. This method incorporates colored masks into the image en
Introduction
Complete list of papers accepted at CVPR 2024, a top-ranking conference for the AI and robotics communities. Total accepted paper count: 2715.
