Toward a Formal Model of Cognitive Synergy

Cognitive synergy has been posited as a key feature of real-world general intelligence, and has been used explicitly in the design of the OpenCog cognitive architecture. Here category theory and related concepts are used to give a formalization of the cognitive synergy concept.

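The retrieval step implied by the shape-descriptor abstract below can be sketched as nearest-neighbour ranking in the shared word-vector space. This is a minimal sketch under three assumptions. First, descriptors have already been produced by some encoder. Second, cosine similarity is the ranking measure; the abstract does not name one. Third, the shape names and 3-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query_descriptor, shape_descriptors, k=3):
    """Return the names of the k stored shapes most similar to the query.

    The query (from a sketch, range scan or image encoder) and the stored
    shapes are assumed to live in the same word-vector space.
    """
    scored = [(name, cosine(query_descriptor, d)) for name, d in shape_descriptors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Hypothetical 3-d descriptors; real word embeddings have hundreds of dimensions.
shapes = {
    "chair_01": [0.9, 0.1, 0.0],
    "chair_02": [0.8, 0.2, 0.1],
    "table_01": [0.1, 0.9, 0.0],
    "lamp_01": [0.0, 0.1, 0.9],
}
sketch_query = [0.85, 0.15, 0.05]  # hypothetical output of a sketch encoder
top = retrieve(sketch_query, shapes, k=2)
```

Every query type is mapped into the same space. As a result, the ranking function does not depend on whether the query came from a sketch, a range scan, or an image.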
A neural network is trained to generate shape descriptors that lie close to a vector representation of the shape class, given a vector space of words. This method is easily extendable to range scans, hand-drawn sketches and images, which makes cross-modal retrieval possible without designing a different method for each query type. We show that sketch-based shape retrieval using semantic-based descriptors outperforms the state of the art by large margins, and that mesh-based retrieval generates results of higher relevance to the query than current deep shape descriptors.

It Takes Two to Tango: Towards Theory of AI’s Mind

In this work, we argue that for human-AI teams to be effective, humans must also develop a theory of AI’s mind – get to know its strengths, weaknesses, beliefs, and quirks. We instantiate these ideas within the domain of Visual Question Answering (VQA). We find that using just a few examples (50), lay people can be trained to better predict responses and oncoming failures of a complex VQA model. Surprisingly, we find that having access to the model’s internal states – its confidence in its top-k predictions, explicit or implicit attention maps which highlight regions in the image (and words in the question) the model is looking at (and listening to) while answering a question about an image – does not help people better predict its behavior.

SHAPEWORLD: A new test methodology for multimodal language understanding

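The ShapeWorld abstract below describes generating data so that evaluation requires combining known concepts in new ways. Below is a minimal sketch of such a compositional split in a toy world. It holds out some color/shape pairs from training while each color and each shape still appears individually. The colors, shapes and caption template are hypothetical, and this is not ShapeWorld's actual API.

```python
import itertools
import random

# Hypothetical toy vocabulary of primitive concepts (not ShapeWorld's).
COLORS = ["red", "green", "blue"]
SHAPES = ["circle", "square", "triangle"]


def compositional_split(held_out):
    """Split all color/shape combinations into training and evaluation sets.

    Combinations listed in `held_out` never appear during training.
    """
    all_pairs = list(itertools.product(COLORS, SHAPES))
    train = [pair for pair in all_pairs if pair not in held_out]
    test = [pair for pair in all_pairs if pair in held_out]
    return train, test


def parts_seen_in_training(test, train):
    """True if every color and shape used in a test pair occurs in some training pair."""
    train_colors = {color for color, _ in train}
    train_shapes = {shape for _, shape in train}
    return all(c in train_colors and s in train_shapes for c, s in test)


def sample_caption(pairs, rng):
    """Generate one caption from a randomly chosen allowed combination."""
    color, shape = rng.choice(pairs)
    return f"There is a {color} {shape}."


train_pairs, test_pairs = compositional_split({("red", "square"), ("blue", "circle")})
rng = random.Random(0)  # fixed seed so the generated data is reproducible
train_captions = [sample_caption(train_pairs, rng) for _ in range(5)]
test_captions = [sample_caption(test_pairs, rng) for _ in range(5)]
```

A model that scores well on `test_captions` must combine concepts it has only seen in other pairings. This is the kind of generalization the abstract says the framework can isolate.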
We introduce a novel framework for evaluating multimodal deep learning models with respect to their language understanding and generalization abilities. In this approach, artificial data is automatically generated according to the experimenter’s specifications. The content of the data, both during training and evaluation, can be controlled in detail, which enables tasks to be created that require true generalization abilities, in particular the combination of previously introduced concepts in novel ways. We demonstrate the potential of our methodology by evaluating various visual question answering models on four different tasks, and show how our framework gives us detailed insights into their capabilities and limitations. By open-sourcing our framework, we hope to stimulate progress in the field of multimodal language understanding.

Identifying First-person Camera Wearers in Third-person Videos

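The abstract below describes a triplet loss that pulls correct first-/third-person matches together and pushes incorrect ones apart. The paper calls its loss new and the abstract does not give its form. The sketch therefore shows only the standard hinge-style triplet loss, max(0, d(a, p) - d(a, n) + margin), with Euclidean distance and hypothetical 2-d embeddings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss: zero once the negative is `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def mean_triplet_loss(triplets, margin=1.0):
    """Average the loss over (anchor, positive, negative) triplets."""
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)


# Hypothetical embeddings: anchor = first-person clip, positive = the matching
# third-person wearer, negatives = other people in the third-person video.
first_person = [0.0, 0.0]
correct_match = [0.0, 0.5]
wrong_easy = [3.0, 4.0]  # already far away: contributes no loss
wrong_hard = [0.0, 1.0]  # not far enough beyond the margin: still penalised
loss = mean_triplet_loss([
    (first_person, correct_match, wrong_easy),
    (first_person, correct_match, wrong_hard),
])
```

Only the hard negative contributes a gradient signal. This is why triplet-based matching focuses training on confusable pairs.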
In this paper, we propose a new semi-Siamese Convolutional Neural Network architecture to address this novel challenge. We formulate the problem as learning a joint embedding space for first- and third-person videos that considers both spatial- and motion-domain cues. A new triplet loss function is designed to minimize the distance between correct first- and third-person matches while maximizing the distance between incorrect ones. This end-to-end approach performs significantly better than several baselines, in part by learning the first- and third-person features optimized for matching jointly with the distance measure itself.

Visual Attribute Transfer through Deep Image Analogy

Evaluating vector-space models of analogy

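The parallelogram model evaluated in the abstract below answers a:b :: c:? by looking for the word closest to b - a + c. This is a minimal sketch that assumes cosine similarity as the closeness measure. It uses a hypothetical 2-d toy vocabulary chosen so the analogy works exactly. Real embeddings are high-dimensional and, as the abstract notes, capture some relations better than others.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def parallelogram_analogy(a, b, c, vocab):
    """Solve a:b :: c:? by returning the word whose vector is closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in {a, b, c}]  # exclude the query words
    return max(candidates, key=lambda w: cosine(vocab[w], target))


# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vocab = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
answer = parallelogram_analogy("man", "king", "woman", vocab)
```

The geometric constraint the abstract points to is visible here. The model can only succeed when the relation is encoded as a consistent vector offset.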
We evaluate the parallelogram model of analogy as applied to modern word embeddings, providing a detailed analysis of the extent to which this approach captures human relational similarity judgments in a large benchmark dataset. We find that some semantic relationships are better captured than others. We then provide evidence for deeper limitations of the parallelogram model based on the intrinsic geometric constraints of vector spaces, paralleling classic results for first-order similarity.

Words, Concepts, and the Geometry of Analogy

Deep manifold-to-manifold transforming network for action recognition

Generative Models of Visually Grounded Imagination

Beyond Deep Residual Learning for Image Restoration: Persistent Homology-Guided Manifold Simplification

If an image contains many patterns and structures, the performance of these CNNs is still inferior. To address this issue, we propose a novel feature-space deep residual learning algorithm that outperforms existing residual learning. The main idea originates from the observation that the performance of a learning algorithm can be improved if the input and/or label manifolds can be made topologically simpler by an analytic mapping to a feature space.

One Model To Learn Them All

To allow training on input data of widely different sizes and dimensions, such as images, sound waves and text, we need sub-networks to convert inputs into a joint representation space. We call these sub-networks modality nets as they are specific to each modality (images, speech, text) and define transformations between these external domains and a unified representation.

Deep Learning: Generalization Requires Deep Compositional Feature Space Design

Generalization error defines the discriminability and the representation power of a deep model. In this work, we claim that feature space design using deep compositional functions plays a significant role in generalization, along with explicit and implicit regularization. We support our claims with several image classification experiments. We show that the information loss due to convolution and max pooling can be marginalized with the compositional design, improving generalization performance. We also show that learning rate decay acts as an implicit regularizer in deep model training.

Conceptual Spaces for Cognitive Architectures: A Lingua Franca for Different Levels of Representation

In particular, we claim that Conceptual Spaces offer a lingua franca that allows one to unify and generalize many aspects of the symbolic, sub-symbolic and diagrammatic approaches (by overcoming some of their typical problems) and to integrate them on a common ground. In doing so, we extend and detail some of the arguments explored by [Gärdenfors (1997)] for defending the need for a conceptual, intermediate representation level between the symbolic and the sub-symbolic one.

SCAN: Learning Abstract Hierarchical Compositional Visual Concepts

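SCAN builds on the beta-VAE. That objective adds a beta-weighted KL term to the reconstruction loss, so a larger beta pushes the approximate posterior toward the factorised prior. The sketch below rests on several assumptions. The encoder is a diagonal Gaussian, the prior is standard normal, and reconstruction uses squared error. The value beta = 4 is illustrative. SCAN's own symbol-association training is not shown.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian posterior."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Squared-error reconstruction (an assumption) plus the beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * gaussian_kl(mu, logvar)


# A posterior equal to the prior costs nothing; shifting one mean by 1 costs 0.5 nats.
kl_at_prior = gaussian_kl([0.0, 0.0], [0.0, 0.0])
kl_shifted = gaussian_kl([1.0], [0.0])
```

With beta greater than 1, each latent unit pays more for carrying information. This is the pressure toward the disentangled primitives that SCAN then associates with symbols.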
This paper describes SCAN (Symbol-Concept Association Network), a new framework for learning such concepts in the visual domain. We first use the previously published beta-VAE (Higgins et al., 2017a) architecture to learn a disentangled representation of the latent structure of the visual world, before training SCAN to extract abstract concepts grounded in such disentangled visual primitives through fast symbol association. Our approach requires very few pairings between symbols and images and makes no assumptions about the choice of symbol representations. Once trained, SCAN is capable of multimodal bi-directional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of compositional visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to invent and learn novel visual concepts through recombination of the few learnt concepts.

Sluice networks: Learning what to share between loosely related tasks

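The Sluice Networks abstract below describes trainable parameters that control how much two task networks share. Below is a minimal sketch of that idea with just two tasks, one layer, and a 2x2 mixing matrix `alpha` (a hypothetical name), held fixed here for illustration. In the described model these parameters are learned. Sharing can also be controlled per subspace, per layer and across skip connections, and this sketch omits all of that.

```python
def sluice_mix(h_a, h_b, alpha):
    """Mix two tasks' hidden vectors with a 2x2 sharing matrix.

    alpha[i][j] is how much task i's next layer reads from task j's current
    layer (hypothetical fixed values here; learned in the described model).
    """
    mixed_a = [alpha[0][0] * a + alpha[0][1] * b for a, b in zip(h_a, h_b)]
    mixed_b = [alpha[1][0] * a + alpha[1][1] * b for a, b in zip(h_a, h_b)]
    return mixed_a, mixed_b


# Two extremes of the sharing spectrum (hypothetical settings).
NO_SHARING = [[1.0, 0.0], [0.0, 1.0]]    # identity: tasks stay separate
FULL_SHARING = [[0.5, 0.5], [0.5, 0.5]]  # both tasks see the same average

h_task_a = [2.0, 4.0]
h_task_b = [0.0, 0.0]
separate = sluice_mix(h_task_a, h_task_b, NO_SHARING)
shared = sluice_mix(h_task_a, h_task_b, FULL_SHARING)
```

Hard parameter sharing and fully separate networks both appear as particular settings of `alpha`. Learning `alpha` lets training pick a point in between.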
To overcome this, we introduce Sluice Networks, a general framework for multi-task learning where trainable parameters control the amount of sharing – including which parts of the models to share. Our framework goes beyond and generalizes over previous proposals in enabling hard or soft sharing of all combinations of subspaces, layers, and skip connections. We perform experiments on three task pairs from natural language processing, and across seven different domains, using data from OntoNotes 5.0, and achieve up to 15% average error reductions over common approaches to multi-task learning. We analyze when the architecture is particularly helpful, as well as its ability to fit noise. We show that a) label entropy is predictive of gains in sluice networks, confirming findings for hard parameter sharing, and b) while sluice networks easily fit noise, they are robust across domains in practice.

Joint Multimodal Learning with Deep Generative Models

Solving Verbal Comprehension Questions in IQ Test by Knowledge-Powered Word Embedding

A Spacetime Approach to Generalized Cognitive Reasoning in Multi-scale Learning

Tropical hyperelliptic curves in the plane

Learning Robust Visual-Semantic Embeddings

Representation Learning by Learning to Count

The Size of a Hyperball in a Conceptual Space

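The conceptual-space distance described in the abstract below applies the Euclidean metric inside each domain and combines the per-domain distances with the Manhattan metric. It can be computed as follows. The domain layout and points are hypothetical. No per-domain weights are applied, since the abstract does not mention any.

```python
import math


def conceptual_distance(x, y, domains):
    """Distance between two points in a conceptual space.

    `domains` maps a (hypothetical) domain name to the indices of its dimensions.
    Euclidean distance is used inside each domain; the per-domain distances are
    then summed, i.e. combined with the Manhattan metric.
    """
    total = 0.0
    for dims in domains.values():
        total += math.sqrt(sum((x[i] - y[i]) ** 2 for i in dims))
    return total


# Hypothetical space: a 3-d "color" domain and a 1-d "size" domain.
domains = {"color": [0, 1, 2], "size": [3]}
a = [0.0, 0.0, 0.0, 1.0]
b = [3.0, 4.0, 0.0, 3.0]
d = conceptual_distance(a, b, domains)  # color: sqrt(9 + 16) = 5, size: 2
```

For these points, the plain Euclidean distance over all four dimensions would be sqrt(29), roughly 5.39, rather than 7. The domain structure therefore changes how far apart instances are.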
The cognitive framework of conceptual spaces [3] provides geometric means for representing knowledge. A conceptual space is a high-dimensional space whose dimensions are partitioned into so-called domains. Within each domain, the Euclidean metric is used to compute distances. Distances in the overall space are computed by applying the Manhattan metric to the intra-domain distances. Instances are represented as points in this space and concepts are represented by regions.

A Further Analysis of The Role of Heterogeneity in Coevolutionary Spatial Games

Surprisingly, results show that the heterogeneity of link weights (states) on their own does not always promote cooperation; rather cooperation is actually favoured by the increase in the number of overlapping states and not by the heterogeneity itself.

One Model To Learn Them All

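The abstract below mentions sparsely-gated layers. One common form of sparse gating is assumed here rather than taken from the abstract. It keeps only the k largest gate logits, renormalises them with a softmax, and evaluates only the selected experts. The experts and logits are hypothetical. Each expert is assumed to preserve the input dimension. The noise term often added to the gate logits is omitted.

```python
import math


def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


def sparse_moe(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    gates = top_k_gates(gate_logits, k)
    out = [0.0] * len(x)  # assumes each expert preserves the input dimension
    for g, expert in zip(gates, experts):
        if g > 0.0:
            out = [o + g * y for o, y in zip(out, expert(x))]
    return out


# Hypothetical experts; the third would fail loudly if it were ever evaluated.
def double(x):
    return [2.0 * v for v in x]


def zero(x):
    return [0.0 for _ in x]


def unused_expert(x):
    raise RuntimeError("expert should not be evaluated")


experts = [double, zero, unused_expert]
y = sparse_moe([1.0, 2.0], experts, [1.0, 1.0, -5.0], k=2)
```

Compute grows with k rather than with the number of experts. This is what makes sparse gating attractive when one model serves many tasks.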
Our model architecture incorporates building blocks from multiple domains. It contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks we train on. Interestingly, even if a block is not crucial for a task, we observe that adding it never hurts performance and in most cases improves it on all tasks. We also show that tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly if at all.

Event Representations with Tensor-based Compositions

An Adversarial Neuro-Tensorial Approach For Learning Disentangled Representations

In this paper, we propose the first unsupervised deep learning method for disentangling multiple latent factors of variation in face images captured in-the-wild. To this end, we propose a deep latent variable model, where the multiplicative interactions of multiple latent factors of variation are explicitly modelled by means of multilinear (tensor) structure. We demonstrate that the proposed approach indeed learns disentangled representations of facial expressions and pose, which can be used in various applications, including face editing, as well as 3D face reconstruction and classification of facial expression, identity and pose.
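The multiplicative interaction of latent factors described above can be illustrated with a bilinear (order-3 tensor) generator, x[d] = sum over i, j of W[d][i][j] · z_expr[i] · z_pose[j]. This is a minimal sketch with two hypothetical factors and a hand-written tensor `W`. The paper's model is a deep latent variable model trained adversarially, and that is not reproduced here.

```python
def bilinear_generate(W, z_expr, z_pose):
    """Contract an order-3 tensor W with two factor vectors.

    x[d] = sum_ij W[d][i][j] * z_expr[i] * z_pose[j], so the two latent
    factors interact multiplicatively rather than additively.
    """
    return [
        sum(W[d][i][j] * z_expr[i] * z_pose[j]
            for i in range(len(z_expr))
            for j in range(len(z_pose)))
        for d in range(len(W))
    ]


# Hypothetical 2 x 2 x 2 tensor (output dims x expression dims x pose dims).
W = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
x = bilinear_generate(W, [1.0, 2.0], [3.0, 4.0])
x_scaled = bilinear_generate(W, [2.0, 4.0], [3.0, 4.0])  # expression code doubled
x_no_pose = bilinear_generate(W, [1.0, 2.0], [0.0, 0.0])
```

The factors multiply. Doubling the expression code doubles the output, and a zero pose code silences it entirely. An additive model would not behave this way.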

We demonstrate the power of our methodology in expression and pose transfer, as well as discovering powerful features for pose and expression classification.

Navigability of complex networks