Research Projects
PREDITOR
We explore text guided image editing with a Hybrid Diffusion Model (HDM) architecture similar to DALLE-2. Our architecture consists of a diffusion prior model that generates CLIP image embedding conditioned on a text prompt and a custom Latent Diffusion Model trained to generate images conditioned on CLIP image embedding. We discover that the diffusion prior model can be used to perform text guided conceptual edits on the CLIP image embedding space without any finetuning or optimization. We combine this with structure preserving edits on the image decoder using existing approaches such as reverse DDIM to perform text guided image editing. Our approach, PRedItOR does not require additional inputs, fine-tuning, optimization or objectives and shows on par or better results than baselines qualitatively and quantitatively.
Enhancing Controllability of Diffusion Models
Inspired by techniques based on the latent space of GAN models for image manipulation, we propose to train a diffusion model conditioned on two latent codes, a spatial content mask and a flattened style embedding. We rely on the inductive bias of the progressive denoising process of diffusion models to encode pose/layout information in the spatial structure mask and semantic/style information in the style code. We extend the sampling technique from composable diffusion models to allow for some dependence between conditional inputs. This improves the quality of the generations significantly while also providing control over the amount of guidance from each latent code separately as well as from their joint distribution.
AESOP
We introduce AESOP: a new dataset that captures the creative process associated with visual storytelling. Visual panels are composed of clip-art objects with specific attributes
enabling a broad range of creative expression. Using AESOP, we propose foundational storytelling tasks that are generative variants
of story cloze tests, to better measure the creative and causal reasoning ability required for visual storytelling. We further develop a
generalized story completion framework that models stories as the co-evolution of visual and textual concepts. We benchmark the proposed
approach with human baselines and evaluate using comprehensive qualitative and quantitative metrics.
Visualize Your Story
UNDER REVIEW: Story illustration is the task of illustrating a natural language story with a coherent sequence of images.
We propose a more generalized task: Many-to-Many Story Illustration, i.e. automatic visualization of a textual story by a coherent sequence of
images of any length. We introduce a novel many-to-many dataset created by aligning natural language descriptions with corresponding
coherent sequence of images sampled from video clips. An end-to-end encoder-decoder neural architecture is proposed that sequentially
retrieves a coherent sequence of images given an input story. User studies show the applicability of the proposed task and dataset
and reveal that the illustrations generated by the proposed model are comparable to the ground truth.
GitEvolve
GitEvolve is a multi-task sequential deep network for simulation of future github events given past events for a particular repository. Each event is
characterized by a 3-tuple including type of the event, user cluster id and the time stamp of the event. The three tasks are
trained simultaneously. Social structure of Github is further modelled by automatically learning graph based representation for each
repository. The effectiveness of the proposed technique is evaluated using an array of metrics.
Show Me a Story
Story Illustration is the problem of retrieving/generating a sequence of images,
given a natural language story as input. We propose a hierarchical GRU network that learns
a representation for the input story and use it to retrieve an ordered set of images from a dataset.
In its core, the model is designed to explicitly model coherence between sentences in a story
optimized over sequential order embedding based loss function. The performance is qualitatively and quantitatively evaluated.
Anti-Forensic Enhancement
Digital images can be convincingly edited using image editing tools. In order to identify such image pro-cessing operations,
various forensic techniques have been proposed. In response, anti-forensic operationsdesigned as counter-measures have been
devised. We propose an anti-forensic technique tocounter spatial domain forensic detectors and demonstrate its accuracy on
popular image manipulation operations such as median filtering and contrast enhancement. Through a series of experiments,
we prove that the proposed algorithm canseverely degrade the performance of median filtering and contrast enhancement detectors.
The proposedalgorithm also outperforms popular anti-forensic algorithms.
Image Filtering Detection
Smart image editing and processing techniques make it easier to manipulate an image convincingly and also hide any artifacts of
tampering using operations like filtering, compression and/or format conversion to suppress forgery artifacts. We propose an
algorithm to detect if a given image has undergone filtering based enhancement irrespective of the format of image or the type
of filter applied using spatial domain quantization noise.