https://arxiv.org/pdf/1612.07899v1.pdf DARN: A Deep Adversarial Residual Network for Intrinsic Image Decomposition

We present a new deep supervised learning method for intrinsic decomposition of a single image into its albedo and shading components. Our contributions are based on a new fully convolutional neural network that estimates absolute albedo and shading jointly. As opposed to classical intrinsic image decomposition work, it is fully data-driven, hence does not require any physical priors like shading smoothness or albedo sparsity, nor does it rely on geometric information such as depth. Compared to recent deep learning techniques, we simplify the architecture, making it easier to build and train. It relies on a single end-to-end deep sequence of residual blocks and a perceptually-motivated metric formed by two discriminator networks. We train and demonstrate our architecture on the publicly available MPI Sintel dataset and its intrinsic image decomposition augmentation. We additionally discuss and augment the set of quantitative metrics so as to account for the more challenging recovery of non-scale-invariant quantities. Results show that our method outperforms state-of-the-art algorithms both qualitatively and quantitatively, while training convergence time is reduced.
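
A minimal sketch of this setup (assumed PyTorch; layer widths, depth, and the patch-discriminator design are illustrative, not the paper's exact configuration): a single residual generator predicts albedo and shading jointly, and two discriminators supply the adversarial, perceptually-motivated loss.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class IntrinsicGenerator(nn.Module):
    """Fully convolutional: RGB image -> (albedo, shading)."""
    def __init__(self, ch=64, n_blocks=8):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.albedo = nn.Conv2d(ch, 3, 3, padding=1)
        self.shading = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        h = self.blocks(torch.relu(self.head(x)))
        return torch.sigmoid(self.albedo(h)), torch.sigmoid(self.shading(h))

class PatchDiscriminator(nn.Module):
    """One instance per intrinsic component (albedo and shading)."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, padding=1))

    def forward(self, x):
        return self.net(x)

G = IntrinsicGenerator()
D_albedo, D_shading = PatchDiscriminator(3), PatchDiscriminator(1)
```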

https://arxiv.org/pdf/1606.06724v2.pdf Tagger: Deep Unsupervised Perceptual Grouping

We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features. Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. We enable a neural network to group the representations of different objects in an iterative manner through a differentiable mechanism. We achieve very fast convergence by allowing the system to amortize the joint iterative inference of the groupings and their representations. In contrast to many other recently proposed methods for addressing multi-object scenes, our system does not assume the inputs to be images and can therefore directly handle other modalities. We evaluate our method on multi-digit classification of very cluttered images that require texture segmentation. Remarkably, our method achieves improved classification performance over convolutional networks despite being fully connected, by making use of the grouping mechanism. Furthermore, we observe that our system greatly improves upon the semi-supervised result of a baseline Ladder network on our dataset. These results are evidence that grouping is a powerful tool that can help to improve sample efficiency.
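
A toy sketch of the grouping mechanism only (not the actual Tagger/Ladder architecture): keep K soft group masks, let a per-group model reconstruct its assigned part of the input, and reassign each element to the group that reconstructs it best, iterating a few times so inference is amortized. `reconstruct` is a placeholder for the learned per-group model.

```python
import torch

def iterative_grouping(x, reconstruct, n_groups=3, n_iters=5):
    """x: (batch, dim) inputs; reconstruct: callable mapping (x, mask) to a per-group
    reconstruction of shape (batch, dim). Returns soft group masks (n_groups, batch, dim)."""
    masks = torch.softmax(torch.randn(n_groups, *x.shape), dim=0)   # random initial grouping
    for _ in range(n_iters):
        recons = torch.stack([reconstruct(x, m) for m in masks])    # (K, batch, dim)
        errors = (recons - x.unsqueeze(0)) ** 2                     # per-group reconstruction error
        masks = torch.softmax(-errors, dim=0)                       # reassign elements to best group
    return masks
```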

https://arxiv.org/abs/1612.08510 Learning Non-Lambertian Object Intrinsics across ShapeNet Categories

We consider the non-Lambertian object intrinsic problem of recovering diffuse albedo, shading, and specular highlights from a single image of an object.

Our analysis shows that feature learning at the encoder stage is more crucial for developing a universal representation across categories.
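
A simplified sketch of an encoder-decoder of this kind (assumed PyTorch; the paper's skip/mirror links and exact layer sizes are omitted): a single shared encoder feeds three decoders for diffuse albedo, shading, and specular highlights, so most of the cross-category representation lives in the encoder.

```python
import torch
import torch.nn as nn

class NonLambertianNet(nn.Module):
    """Shared encoder with three decoder heads: albedo (RGB), shading, specular."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True))

        def decoder(out_ch):
            return nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1), nn.Sigmoid())

        self.albedo, self.shading, self.specular = decoder(3), decoder(1), decoder(1)

    def forward(self, x):
        h = self.encoder(x)
        return self.albedo(h), self.shading(h), self.specular(h)
```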

https://arxiv.org/pdf/1603.05631v2.pdf Generative Image Modeling using Style and Structure Adversarial Networks

Current generative frameworks use end-to-end learning and generate images by sampling from a uniform noise distribution. However, these approaches ignore the most basic principle of image formation: images are a product of (a) Structure: the underlying 3D model; and (b) Style: the texture mapped onto structure. In this paper, we factorize the image generation process and propose the Style and Structure Generative Adversarial Network (S²-GAN). Our S²-GAN has two components: the Structure-GAN generates a surface normal map; the Style-GAN takes the surface normal map as input and generates the 2D image. Apart from the real vs. generated loss function, we use an additional loss on surface normals computed from the generated images. The two GANs are first trained independently, and then merged together via joint learning. We show our S²-GAN model is interpretable, generates more realistic images and can be used to learn unsupervised RGBD representations.
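
A hedged sketch of the factorized generator step (module names are placeholders, not the authors' code): the Structure-GAN samples a surface normal map, the Style-GAN renders it into an image, and an extra loss compares normals re-estimated from the generated image against the input normals.

```python
import torch
import torch.nn.functional as F

def s2gan_generator_step(structure_G, style_G, D, normal_estimator, z_struct, z_style):
    """structure_G: noise -> surface normal map; style_G: (normals, noise) -> RGB image;
    D: image -> real/fake logits; normal_estimator: image -> normals (e.g. a pretrained FCN)."""
    normals = structure_G(z_struct)                  # Structure-GAN: sample a normal map
    fake = style_G(normals, z_style)                 # Style-GAN: render an image from the normals
    logits = D(fake)
    adv_loss = F.binary_cross_entropy_with_logits(   # fool the real-vs-generated critic
        logits, torch.ones_like(logits))
    normal_loss = F.mse_loss(normal_estimator(fake), normals)  # normals recomputed from the image
    return adv_loss + normal_loss
```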

http://cvgl.stanford.edu/projects/ucn/ Universal Correspondence Network

We have proposed a novel deep metric learning approach to visual correspondence estimation, that is shown to be advantageous over approaches that optimize a surrogate patch similarity objective.
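
A sketch of a correspondence contrastive loss on dense pixel features of this kind, in contrast to a patch-similarity objective (margin and sampling strategy are illustrative):

```python
import torch
import torch.nn.functional as F

def correspondence_contrastive_loss(feat_a, feat_b, label, margin=1.0):
    """feat_a, feat_b: (N, C) features sampled at paired pixel locations in two images;
    label: (N,) float, 1 for true correspondences, 0 for negatives.
    Pull corresponding features together, push non-corresponding ones beyond the margin."""
    d = F.pairwise_distance(feat_a, feat_b)
    pos = label * d.pow(2)
    neg = (1 - label) * F.relu(margin - d).pow(2)
    return 0.5 * (pos + neg).mean()
```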

https://imatge-upc.github.io/saliency-salgan-2017/ SalGAN: Visual Saliency Prediction with Generative Adversarial Networks

https://arxiv.org/abs/1701.02965 Revisiting Deep Image Smoothing and Intrinsic Image Decomposition

We propose an image smoothing approximation and intrinsic image decomposition method based on a modified convolutional neural network architecture applied directly to the original color image. Our network has a very large receptive field equipped with at least 20 convolutional layers and 8 residual units. When training such a deep model, however, it is quite difficult to generate edge-preserving images without undesirable color differences. To overcome this obstacle, we apply both image gradient supervision and a channel-wise rescaling layer that computes a minimum mean-squared error color correction. Additionally, to enhance piecewise-constant effects for image smoothing, we append a domain transform filter with a predicted refined edge map. The resulting deep model, which can be trained end-to-end, directly learns edge-preserving smooth images and intrinsic decompositions without any special design or input scaling/size requirements. Moreover, our method shows much better numerical and visual results on both tasks and runs in comparable test time to existing deep methods.
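
A sketch of the channel-wise rescaling idea (its exact placement inside the network and normalization are assumptions): per image and channel, scale the prediction by the closed-form least-squares factor that best matches the target, which removes global color shifts.

```python
import torch

def channelwise_mmse_rescale(pred, target, eps=1e-8):
    """pred, target: (B, C, H, W). For each image and channel, find the scalar s that
    minimizes ||s * pred - target||^2 (closed form: <pred, target> / <pred, pred>)
    and apply it to the prediction."""
    num = (pred * target).sum(dim=(2, 3), keepdim=True)
    den = (pred * pred).sum(dim=(2, 3), keepdim=True).clamp_min(eps)
    return pred * (num / den)
```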

https://arxiv.org/pdf/1609.09444v2.pdf Contextual RNN-GANs for Abstract Reasoning Diagram Generation

Understanding, predicting, and generating object motions and transformations is a core problem in artificial intelligence. Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting, simulation, or video generation. Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence. For this, we develop a novel Contextual Generative Adversarial Network based on Recurrent Neural Networks (Context-RNN-GANs), where both the generator and the discriminator modules are based on contextual history (modeled as RNNs) and the adversarial discriminator guides the generator to produce realistic images for the particular time step in the image sequence.
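
A minimal sketch of a context-conditioned generator (sizes and encodings are illustrative; the discriminator gets an analogous recurrent context): encode each previous diagram, run an RNN over the history, and decode the final state into the next image in the sequence.

```python
import torch
import torch.nn as nn

class ContextGenerator(nn.Module):
    """Encode each previous frame, run a GRU over the encodings, and decode the
    last hidden state into the next frame (flattened images; sizes illustrative)."""
    def __init__(self, frame_dim=64 * 64, hidden=256):
        super().__init__()
        self.encode = nn.Linear(frame_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.decode = nn.Linear(hidden, frame_dim)

    def forward(self, frames):                        # frames: (B, T, frame_dim)
        h, _ = self.rnn(torch.relu(self.encode(frames)))
        return torch.sigmoid(self.decode(h[:, -1]))   # predicted next frame
```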

https://arxiv.org/abs/1702.05421 The Effect of Color Space Selection on Detectability and Discriminability of Colored Objects

Overall, the best results were achieved in both simulated and real images using color spaces C1C2C3, UVW and XYZ. In addition, using a simulated environment, we show a practical application of color space selection in the context of top-down control in active visual search.
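
For reference, a sketch of the commonly used C1C2C3 conversion (the standard Gevers-style definition is assumed; the paper's exact normalization may differ): each component is the arctangent of one RGB channel over the maximum of the other two.

```python
import numpy as np

def rgb_to_c1c2c3(rgb, eps=1e-8):
    """rgb: float array (..., 3) in [0, 1]. Ci = arctan(channel_i / max(other two channels))."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    c1 = np.arctan2(r, np.maximum(g, b) + eps)
    c2 = np.arctan2(g, np.maximum(r, b) + eps)
    c3 = np.arctan2(b, np.maximum(r, g) + eps)
    return np.stack([c1, c2, c3], axis=-1)
```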

https://arxiv.org/pdf/1702.07971v1.pdf Seeing What Is Not There: Learning Context to Determine Where Objects Are Missing

https://arxiv.org/abs/1703.01560v1 LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation

We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods.
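
A sketch of the recursive compositing step (simplified; the paper learns appearance, mask, and pose, with the pose applied through a spatial transformer): warp the generated foreground appearance and shape mask by the predicted affine pose, then alpha-blend them onto the current canvas.

```python
import torch
import torch.nn.functional as F

def composite_foreground(canvas, fg_rgb, fg_mask, theta):
    """canvas: (B, 3, H, W) current background/composite; fg_rgb: (B, 3, h, w) appearance;
    fg_mask: (B, 1, h, w) shape mask in [0, 1]; theta: (B, 2, 3) affine pose parameters.
    Warp appearance and mask with a spatial transformer, then alpha-blend onto the canvas."""
    grid = F.affine_grid(theta, canvas.size(), align_corners=False)
    rgb = F.grid_sample(fg_rgb, grid, align_corners=False)
    mask = F.grid_sample(fg_mask, grid, align_corners=False)
    return mask * rgb + (1 - mask) * canvas
```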

https://arxiv.org/abs/1611.01843 Learning to Perform Physics Experiments via Deep Reinforcement Learning

We found that state-of-the-art deep reinforcement learning methods can learn to perform the experiments necessary to discover such hidden properties. By systematically manipulating the problem difficulty and the cost incurred by the agent for performing experiments, we found that agents learn different strategies that balance the cost of gathering information against the cost of making mistakes in different situations.

http://alexgkendall.com/computer_vision/have_we_forgotten_about_geometry_in_computer_vision/

https://arxiv.org/abs/1706.08616v1 Do Deep Neural Networks Suffer from Crowding?

Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) as well as a new variant of DCNNs which is (1) multi-scale and (2) has convolution filter sizes that change with eccentricity with respect to the center of fixation. Such networks, which we call eccentricity-dependent, are a computational model of the feedforward path of the primate visual cortex. Our results reveal that the eccentricity-dependent model, trained on target objects in isolation, can recognize such targets in the presence of flankers if the targets are near the center of the image, whereas DCNNs cannot.
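
A toy sketch of the eccentricity-dependent idea (not the authors' model): center crops of increasing size stand in for filters whose size grows with eccentricity; each crop is resized to a common resolution, passed through a shared tower, and responses are pooled across scales.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EccentricityNet(nn.Module):
    """Toy version: progressively larger crops around the image centre, shared weights."""
    def __init__(self, crop_fracs=(0.25, 0.5, 1.0), n_classes=10):
        super().__init__()
        self.crop_fracs = crop_fracs
        self.tower = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, x):                             # x: (B, 1, H, W)
        B, _, H, W = x.shape
        logits = []
        for f in self.crop_fracs:
            h, w = int(H * f), int(W * f)
            top, left = (H - h) // 2, (W - w) // 2
            crop = x[:, :, top:top + h, left:left + w]
            crop = F.interpolate(crop, size=(H, W), mode='bilinear', align_corners=False)
            logits.append(self.tower(crop))
        return torch.stack(logits).max(dim=0).values  # pool responses over scales
```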

https://arxiv.org/pdf/1706.04313.pdf Teaching Compositionality to CNNs

Convolutional neural networks (CNNs) have shown great success in computer vision, approaching human-level performance when trained for specific tasks via application-specific loss functions. In this paper, we propose a method for augmenting and training CNNs so that their learned features are compositional. It encourages networks to form representations that disentangle objects from their surroundings and from each other, thereby promoting better generalization. Our method is agnostic to the specific details of the underlying CNN to which it is applied and can in principle be used with any CNN. As we show in our experiments, the learned representations lead to feature activations that are more localized and improve performance over non-compositional baselines in object recognition tasks.

When there is only one object in the input image, teaching compositionality takes the form of ensuring that the activations within the region of that object remain invariant regardless of what background the object appears on. With multiple objects, we also explicitly ensure that the activations of each object remain the same as if that object were shown in isolation.
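
A sketch of that consistency term (layer choice, masking, and normalization are assumptions): activations computed on the full scene should match, inside the object's (downsampled) mask, the activations computed on the object shown in isolation.

```python
import torch
import torch.nn.functional as F

def compositional_consistency_loss(feats_scene, feats_isolated, object_mask):
    """feats_scene, feats_isolated: (B, C, H, W) activations of the same layer for the
    object in context and in isolation; object_mask: (B, 1, h, w) binary object region.
    Penalise activation differences only inside the object region."""
    mask = F.interpolate(object_mask, size=feats_scene.shape[-2:], mode='nearest')
    diff = (feats_scene - feats_isolated) ** 2
    return (mask * diff).sum() / mask.sum().clamp_min(1.0)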

https://arxiv.org/pdf/1709.08872v1.pdf Learning to Label Affordances from Simulated and Real Data

We define a novel cost function, which is able to handle (potentially multiple) affordances of objects and their parts in a pixel-wise manner even in the case of incomplete data. We perform qualitative as well as quantitative evaluations with simulated and real data assessing 15 different affordances. In general, we find that affordances which are well enough represented in the training data are correctly recognized with a substantial fraction of correctly assigned pixels.

obstruct: vertical surface that prevents locomotion, e.g. wall
break: detachable objects that can easily be damaged or destroyed, e.g. vase
sit: surface a human can sit on while having the feet on the ground, e.g. seat cushion
grasp: detachable objects that can be encompassed with one hand or only a few fingers and be moved with one arm, e.g. vase
pinch-pull: surfaces that can be pulled through a pinch movement (all directions), e.g. knob
hook-pull: surfaces that can be pulled by hooking up fingers (all directions), e.g. handle
tip-push: surfaces that trigger some action when being pushed, e.g. button panel
warmth: surfaces that emit warmth, e.g. fireplace
illumination: surfaces that emit visible light, e.g. bulb
observe: surfaces that present information or art, i.e. that can be read or watched, e.g. display
support: stable surfaces that provide support for standing (for the agent), except ground, e.g. wall
place-on: raised surfaces where objects can be placed on (this excludes the ground), e.g. tabletop
dry: surfaces capable of soaking up water, e.g. towel
roll: surfaces that can be used with wheels, e.g. road
walk: surfaces a human can walk on, e.g. grass
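
A sketch of a pixel-wise multi-label cost that tolerates incomplete data (the paper's actual weighting may differ): each pixel can carry several of the 15 affordances above, and pixels with no annotation for a given affordance are simply excluded from the sum.

```python
import torch
import torch.nn.functional as F

def masked_affordance_loss(logits, labels, valid):
    """logits: (B, 15, H, W) per-affordance scores; labels: same shape, float 0/1 targets;
    valid: same shape, 1 where an annotation exists, 0 where the data are incomplete."""
    loss = F.binary_cross_entropy_with_logits(logits, labels, reduction='none')
    return (valid * loss).sum() / valid.sum().clamp_min(1.0)
```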

https://sermanet.github.io/imitate/ Time-Contrastive Networks: Self-Supervised Learning from Video

https://arxiv.org/abs/1711.10402v1 An Adversarial Neuro-Tensorial Approach For Learning Disentangled Representations

In this paper, we propose the first unsupervised deep learning method for disentangling multiple latent factors of variation in face images captured in-the-wild. To this end, we propose a deep latent variable model, where the multiplicative interactions of multiple latent factors of variation are explicitly modelled by means of multilinear (tensor) structure. We demonstrate that the proposed approach indeed learns disentangled representations of facial expressions and pose, which can be used in various applications, including face editing, as well as 3D face reconstruction and classification of facial expression, identity and pose.
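
A toy sketch of the multilinear structure (dimensions are illustrative; the paper's deep encoders and adversarial loss are omitted): an image is decoded from separate identity, expression, and pose codes through a shared core tensor, so each factor acts multiplicatively and can be edited on its own.

```python
import torch
import torch.nn as nn

class MultilinearDecoder(nn.Module):
    """x = core ×1 identity ×2 expression ×3 pose, written as mode products via einsum."""
    def __init__(self, d_id=32, d_expr=16, d_pose=8, d_out=32 * 32 * 3):
        super().__init__()
        self.core = nn.Parameter(torch.randn(d_id, d_expr, d_pose, d_out) * 0.01)

    def forward(self, z_id, z_expr, z_pose):          # each code: (B, d_*)
        return torch.einsum('bi,be,bp,iepo->bo', z_id, z_expr, z_pose, self.core)
```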

https://arxiv.org/abs/1711.09020 StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

We empirically demonstrate the effectiveness of our approach on facial attribute transfer and facial expression synthesis tasks.
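
A sketch of the generator objective for multi-domain translation as described in the paper (loss weights, label encoding, and the critic form are illustrative): a single generator conditioned on the target-domain label is trained with an adversarial term, an auxiliary domain-classification term, and a cycle reconstruction back to the source domain.

```python
import torch
import torch.nn.functional as F

def stargan_generator_loss(G, D, x, label_src, label_trg, lambda_cls=1.0, lambda_rec=10.0):
    """G(image, domain_label) -> image; D(image) -> (real/fake logits, domain logits).
    label_src / label_trg: float multi-hot domain labels for the source and target domains."""
    fake = G(x, label_trg)
    src_logits, cls_logits = D(fake)
    adv = -src_logits.mean()                                    # e.g. Wasserstein-style critic
    cls = F.binary_cross_entropy_with_logits(cls_logits, label_trg)
    rec = F.l1_loss(G(fake, label_src), x)                      # cycle back to the source domain
    return adv + lambda_cls * cls + lambda_rec * rec
```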

https://arxiv.org/abs/1611.02401 Divide and Conquer Networks

https://github.com/alexnowakvila/DiCoNet