On Theory

Background

Richard Feynman, the 1965 Nobel Prize winner in Physics, gave an insightful lecture on computer heuristics. In his own unique way, he explains how computers perform their tasks in a manner accessible to almost anyone. In the lecture, he is asked whether a computer will eventually think like a human or, alternatively, whether a computer will ever achieve intelligence.

<insert feynman lecture here>

His thoughts on this matter drive the motivations for this book. If a computer can think like a human, then how is it able to do so? What theoretical tools, currently at our disposal, can be used to better understand this emerging phenomenon? DL systems are of course nowhere near having the capability of intelligence; however, in recent years DL systems have been able to perform tasks that Feynman believed to be within the purview of humans. This is a continent-shifting development and, as we shall see in this book, the “algorithms” of a DL system are remarkably different from conventional 'file handling systems'.

A major crutch that we carry along in neural network research is an artifact of the original 1957 perceptron proposal. This antiquated and cartoonish depiction of a biological neuron bakes too many questionable assumptions into our research. By raising the level of abstraction to abstract mathematics such as category theory, we gain a better perspective on the breadth of available alternative implementations. Ideally we want to construct our reasoning from first principles rather than from analogy. This is the kind of thinking that Feynman pursued.

Category Theory

In this book, we will employ Category Theory to provide a more formal discussion of the concepts. We aren't going to go too deep into the mathematics; rather, we employ it to avoid ambiguity and to provide better intuition about the concepts. Category Theory has been used to represent a wide variety of domains. There are Category Theory treatments of set theory, topology, physics, logic and programming. One of the nice advantages of using Category Theory is that the concepts can be represented in a visual manner. This further aids in understanding.

Category Theory is abstract mathematics that is able to describe many kinds of mathematical systems: linear algebra, vectors, tensors, Hilbert spaces, group theory, topology, logic and computation. Its strength is that it makes explicit the notion of a morphism. A morphism is a general concept that maps one object of a category to another object (a functor, in turn, maps one category to another). So in the domain of computation, it is able to represent data types and the functions that operate on those data types. It can represent general computation in the form of lambda calculus, and the notion of currying is naturally captured in Category Theory.
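
To make the computational reading concrete, here is a minimal sketch in Python. It illustrates only the two ingredients just mentioned, composition and currying; the helper names compose and curry are ours and do not come from any particular library.

    from typing import Callable, TypeVar

    A = TypeVar("A")
    B = TypeVar("B")
    C = TypeVar("C")

    def compose(g: Callable[[B], C], f: Callable[[A], B]) -> Callable[[A], C]:
        """Morphism composition: (g . f)(x) = g(f(x))."""
        return lambda x: g(f(x))

    def curry(f: Callable[[A, B], C]) -> Callable[[A], Callable[[B], C]]:
        """Currying: a two-argument map becomes a map into a function space."""
        return lambda a: lambda b: f(a, b)

    double = lambda n: 2 * n          # a morphism int -> int
    show = lambda n: "value=%d" % n   # a morphism int -> str
    print(compose(show, double)(21))  # prints "value=42"

    add = curry(lambda a, b: a + b)   # int -> (int -> int)
    print(add(1)(2))                  # prints 3

Objects here are types, morphisms are functions between them, and composition is associative with the identity function as its unit, which is all a category asks for.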

So when we apply Category Theory to Deep Learning, we have two categories. We have the category of the Model, in which morphisms map one Model to another Model. We also have the category of the Representation, wherein a Model serves as a morphism that maps one Representation into another Representation. The Representation category is equivalent to the category of vector spaces. A learning machine is able to iterate through training Representations to evolve a Model. When the Model evolution settles into an orbit, such that its sequence of Model morphisms forms a limit cycle, we can say that the machine has converged with its training data. There is also an additional mapping that takes instances in the Representation category and maps them into the Model category. We aren't certain whether these mappings form a category in their own right, which we might call Learning.
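
The following is a hypothetical sketch, in plain Python and NumPy, of how these two categories fit together; the names Model and learn_step are ours and do not refer to any framework.

    import numpy as np

    Representation = np.ndarray

    class Model:
        """A Model acts as a morphism: Representation -> Representation."""
        def __init__(self, weights: np.ndarray):
            self.weights = weights

        def __call__(self, rep: Representation) -> Representation:
            return np.tanh(rep @ self.weights)

    def learn_step(model: Model, rep: Representation, target: Representation,
                   lr: float = 0.1) -> Model:
        """A morphism in the Model category: training data maps one Model to the next."""
        pred = model(rep)
        grad = rep.T @ ((pred - target) * (1.0 - pred ** 2))  # chain rule through tanh
        return Model(model.weights - lr * grad)

    m0 = Model(np.zeros((3, 2)))
    m1 = learn_step(m0, rep=np.ones((1, 3)), target=np.array([[0.5, -0.5]]))

Iterating learn_step over the training Representations produces the trajectory of Models whose limit cycle we would interpret as convergence.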

Two mind-blowing ideas arise when we formalize DL in this way. The first is whether a Representation is also a Model. The second is whether the morphisms in a Model are also Models (i.e. whether the system is closed). If we think of learning systems in this way, learning would be able to bootstrap itself into more complex forms of learning. This is what we describe in the “Meta-Learning” pattern.

The power we gain from taking a category theory approach is that we can explicitly express the assumptions we make in the mathematics that drive the learning dynamics. DL currently requires differentiable layers, and this is to support the learning process known as stochastic gradient descent (SGD). Can more relaxed alternatives like genetic algorithms be employed? If so, what would be the most general properties of this learning process?
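
That assumption can be written down explicitly. Each SGD step follows the gradient of a loss L with respect to the parameters θ, so L must be differentiable in θ; in standard notation (a LaTeX rendering for reference),

    \theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)

where η is the learning rate. A gradient-free alternative such as a genetic algorithm replaces the ∇L term with mutation and selection, which is exactly why it relaxes the differentiability requirement.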

Dynamics

Neural Networks are dynamical systems, and as such we will leverage mathematics developed over the centuries to describe their behavior. The math really just involves calculus, linear algebra and probability.

However, unlike dynamical systems governed by equations that are meant to describe reality, DL systems are dynamical systems that learn how to describe reality. They are black-box, equation-deriving machines which coincidentally are themselves dynamical systems governed by mathematical equations. This of course brings up a question at an even higher meta-level: why can't a DL algorithm learn the dynamics of another DL algorithm? We shall see that this is an interesting possibility, but that's a topic for another book.

DL systems evolve from one Model state to another Model state. A Model state in a DL system consists of all the layers, where each layer consists of the weights and biases of the neurons contained in it. You may think of these weights and biases as the parameters of the Model. To permit a continuous transition from one Model state to another, we require that the layers be differentiable. The differentiability of the layers is the one common trait of all DL systems.

DL systems use backpropagation to evolve Model states. This evolution is what is referred to as training or learning. The equations of evolution are analogous to the equations of a dynamical system that maps one state in a multi-dimensional space into another state in the same space. We can therefore gain inspiration about the dynamics of learning from similar concepts found in the dynamics of physical systems. The dynamics of learning evolves from a non-equilibrium state toward a state of equilibrium. That is, we are trying to discover the Model state such that the “energy” of the state is minimized. That energy corresponds to the minimization of relative entropy: the relative entropy between the observations and the predictions derived from the Model.
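
Here is a minimal numerical sketch of that evolution in plain NumPy, outside of any DL framework. The single logistic "layer" and all variable names are illustrative only; the point is that each iteration moves the Model state downhill on the cross-entropy between observations and predictions.

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)                                     # observed Representations
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # observed labels

    w = np.zeros(3)          # the Model state (its parameters)
    lr = 0.5                 # learning rate
    for _ in range(200):     # evolution of the Model state
        p = 1.0 / (1.0 + np.exp(-(X @ w)))                    # predictions
        # cross-entropy = entropy of y plus the relative entropy between y and p
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        grad = X.T @ (p - y) / len(y)                         # gradient of the cross-entropy
        w -= lr * grad                                        # one step toward "equilibrium"
    print(loss)              # the minimized "energy"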

We therefore need a way to calculate that entropy. Unfortunately, we have only partial knowledge of reality. That is, we have only partial observations; we have only the luxury of observations from the training set. We therefore need some kind of bias in our model evolution to account for the unknown (i.e. the observations that we never see). We treat this bias as an additional potential (i.e. energy) that keeps the learning process from becoming overly confident about its model. It is also important to note that the choices made in constructing our Model are biases as well. The former can be considered a “Regularization in Training” and the latter a “Regularization in Construction”.
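
One standard way to express the “Regularization in Training” bias is as an additional potential added to the training objective; weight decay (an L2 penalty) is shown here purely as an example,

    L_{\text{total}}(\theta) = L_{\text{data}}(\theta) + \lambda \, \lVert \theta \rVert^{2}

where λ controls how strongly we temper our confidence in the training observations. The “Regularization in Construction” bias, by contrast, lives in the architecture itself (for example, the weight sharing of a convolutional layer) and does not appear as an explicit term in the objective.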

Just as dynamical systems are constrained in the way each particle may be interconnected with other particles, we can do something similar here. We can constrain the layers of the Model so that they form a hierarchy. These cross interactions can be expressed in construction, by virtue of the Model and Composite patterns used, or they can be expressed in training, as a consequence of the Regularization expressions used.

In dynamical systems, at the heart of the equations of motion lies the idea of least action: the idea that a particle takes the path that minimizes the action between two points. The equations of motion in physics, from classical to quantum physics, can all be derived from this idea. The equations of motion can be represented in physics by the Hamilton-Jacobi equations. The equivalent discrete form of the Hamilton-Jacobi equations is known as the Bellman equation. In physics, the “fitness” function is the minimization of energy cost. In ML, however, practitioners have generalized this further to develop all kinds of reward functions suitable for the domain. As we shall see, all learning follows equations analogous to the equations of motion. The distinction is in the reward function: where the equations of motion are governed by least action, machine learning can be driven by a variety of reward functions. The most commonly used reward function just happens to be the one leaning toward minimum relative entropy or, alternatively, maximum likelihood.
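
For reference, the Bellman equation mentioned above has the standard discrete-time form

    V(s) = \max_{a} \left[ r(s, a) + \gamma \, V\!\left( s'(s, a) \right) \right]

where V is the value of a state s, r(s, a) is the reward for taking action a in s, s'(s, a) is the resulting next state, and γ is a discount factor. Replacing the reward with a (negated) energy cost recovers the least-action flavor shared by the physical equations of motion.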

Information Theory

The field of statistics can be split into two parts: classical statistics, with all its methods invented prior to the existence of computers, and statistical learning, which covers methods discovered once statisticians had computers at their disposal. In the field of classical statistics, there are two well-known philosophies that are frequently at odds. These are the Frequentists and the Bayesians, and arguments between the two can get pretty messy. There is, however, a lesser-known philosophy that was pioneered by Ronald A. Fisher. Fisher's philosophy revolved around how well a model fits the observed data, so he built methods that pursue the idea of Maximum Likelihood Estimation. Statistical learning falls within this philosophy.
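
Fisher's principle can be stated in one line: choose the parameters θ under which the observed data x_1, ..., x_n are most probable,

    \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)

with the logarithmic form being the one actually optimized in practice.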

Much of the machine learning literature has a bias toward Bayesian terminology and methodology. That is, you hear terms like priors and posteriors. Bayesian philosophy asserts that, given a prior and enough observations, you can approximate the posterior. It is a leap of faith, and underneath the assertion are many assumptions: an assumption of independent distributions and an assumption of Gaussian smoothness. Many theories of classical statistics are grounded on mathematics of convenience. The domain of classical statistics revolves around plenty of samples (preferably infinite), low dimensions and models with a small number of parameters. DL, in contrast, covers domains with plenty of samples, high dimensions and models with millions of parameters. The objective of a classical statistician is to build simple models, which differs from an ML practitioner's goal of building an accurate machine predictor, with model complexity not being a priority. There is plenty of literature covering the overreach of classical statistics and the divide in mindset within the statistical community. You can read about that elsewhere.

So rather than being bogged down in the minefield of statistical methods, we take a less hazardous approach and leverage only information theoretic principles. Information theory revolves around the concept of entropy; relative entropy, in particular, happens to be a measure of similarity between two distributions. This is a simple enough idea that does not require us to conjure up philosophies dating back to an era when computers did not exist.
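
Concretely, the measure in question is the relative entropy, also known as the Kullback-Leibler divergence, between two distributions p and q,

    D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

which is zero exactly when the two distributions agree. Minimizing the relative entropy between the empirical distribution of the observations and the Model's predictive distribution is equivalent to maximum likelihood estimation, which ties this information theoretic view back to Fisher's philosophy.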

Computational Information Geometry

As we center our entire framework around the simple notion of machines that find the closest fit between two distributions, which is an entropic principle, we discover a rich mathematical approach. This approach is called Information Geometry and derives its foundations from the disciplines of probability, information theory and differential geometry. Information Geometry also shares the tensor mathematics found in Einstein’s Theory of Relativity. In Einstein’s theory, space is non-Euclidean, i.e. space is curved in the presence of gravity. As a result, light bends when it passes near a body with mass; this phenomenon is observable in light passing near our sun. Information geometry likewise employs non-Euclidean spaces to analyze model space.
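
The non-Euclidean structure alluded to here comes from the Fisher information metric, the Riemannian metric that information geometry places on a family of models p_θ parameterized by θ:

    g_{ij}(\theta) = \mathbb{E}_{x \sim p_{\theta}}\!\left[ \frac{\partial \log p_{\theta}(x)}{\partial \theta_i} \, \frac{\partial \log p_{\theta}(x)}{\partial \theta_j} \right]

To second order, the relative entropy between two nearby models behaves like the squared distance induced by this metric, which is what links the entropic view of the previous section to the geometric one.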

The importance of information geometry in the understanding of deep learning is that it provides a general framework that unifies various model estimation approaches. Deep Learning requires a rich understanding of the concept of similarity. It is critically important that we understand the different ways that we can measure similarity. From this vantage point, we then have at our disposal a much richer toolbox for constructing better classifiers. Another benefit is that, because information geometry is a unifying framework, we are able to combine more classical statistical mixtures to build accurate generative models. This understanding helps us move a step closer towards the holy grail of deep learning, that is, one-shot or even zero-shot learning.

Probabilistic Soft Logic

Computers are very good at working with exact logic. Neural networks, however, are unable to perform many conventional computational tasks. I hope it does not come as a surprise that performing basic arithmetic is a difficult problem for neural networks. Good Ol' Fashioned AI (GOFAI), otherwise known as 'symbolicism', conjectures that it would be possible to create a universal symbolic machine that is intelligent. Unfortunately, GOFAI systems have tended to be very brittle, complex and difficult to scale. In contrast, DL systems have been shown to perform very well on tasks that humans naturally perform (i.e. image and speech recognition). The holy grail, however, is to be able to fuse connectionist neural networks with the symbolic GOFAI approach.

A more common approach to fusing logic into DL systems is via probabilistic graphical models (PGMs). PGMs are fundamentally based on Bayesian statistics. A good coverage of this topic and its relationship with DL can be found in the Deep Learning book. A common approach in many papers is to employ Bayesian statistics to reason about the characteristics of a DL system: treat the representations of a DL system as probability distributions and employ the tools of the trade to reason further about the behavior of the system. How best to integrate PGMs into DL, however, remains a topic of active research.

Probabilistic soft logic (PSL) is a middle-ground solution to fusing connectionism and symbolicism. PSL defines logical concepts in terms of continuous functions rather than conventional, more rigid binary logic. We shall see in this book how many conventional computational tasks may be framed in terms of a PSL framework.
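
To make “continuous functions rather than binary logic” concrete, here is a minimal sketch of Łukasiewicz-style soft connectives in Python, the family of operators commonly associated with PSL; the function names are ours.

    # Łukasiewicz-style soft logic: truth values are reals in [0, 1] instead of {0, 1}.
    def soft_and(a: float, b: float) -> float:
        return max(0.0, a + b - 1.0)

    def soft_or(a: float, b: float) -> float:
        return min(1.0, a + b)

    def soft_not(a: float) -> float:
        return 1.0 - a

    # At the extremes the soft connectives agree with ordinary Boolean logic ...
    assert soft_and(1.0, 1.0) == 1.0 and soft_and(1.0, 0.0) == 0.0
    # ... while intermediate truth values interpolate smoothly between them.
    print(soft_and(0.9, 0.8))   # ~0.7
    print(soft_or(0.3, 0.4))    # ~0.7

Because these connectives are built from max, min, addition and subtraction, they are differentiable almost everywhere, which is what allows logical constraints to participate in gradient-based learning.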

Game Theory

Mathematics provides us with abstractions that aid our understanding of complex systems. However, every form of abstraction has its limitations in that certain details are glossed over. We can sketch out some intuition with geometry, dynamics and logic as to how these kinds of systems will tend to behave. What we begin to glean from this is that these systems consist of classifiers built from other classifiers. They are self-similar systems that should be treated as collectives of many interacting machines. Furthermore, these machines are designed to make predictions about the future. These predictions need to be made using incomplete data. Therefore we need a mathematical framework that studies the behavior of many interacting parties that have different sets of information.

The classical view of machine learning is that the problem can be cast as an optimization problem where all that is needed are algorithms that are able to search for an optimal solution. However, with machine learning we want to build machines that don’t overfit the data but rather are able to perform well on data they have yet to see. We want these machines to make predictions about the unknown. This requirement, which is called generalization, is very different from the classical optimization problem. It is very different from the classical dynamics problem where all information is expected to be available. That is why a lot of the engineering in deep learning requires additional constraints on the optimization problem. These are called ‘priors’ in some texts and regularizations in the optimization literature.

Where do these regularizations come from and how can we select a good regularization? How do we handle partial information? This is where a game theoretic viewpoint becomes important. Generalization is sometimes referred to as ‘structural risk minimization’. In other words, we build mechanisms to handle generalization using strategies similar to how parties mitigate risk. So we have actually come full circle. Game theory is described as “the study of mathematical models of conflict and cooperation between intelligent rational decision-makers.” In our quest to understand learning machines, we end up with mathematics that was meant for the study of the interactions of intelligent beings.

Ilities

In software architecture, there is a notion of “ilities”: non-functional qualities that are important in evaluating our solutions. Lacking in the DL literature is an understanding of the quality and effectiveness of a network. The usual analysis is based on the generalization capability of a network as quantified by its performance in classifying a validation set. There are, however, other network properties, such as trainability and expressivity, that are important albeit entangled with the concept of generalization. We would like to see a framework in which one understands how to compose various building blocks, driven by an understanding of how each block contributes to trainability, expressivity or generalization. The field is clearly very young in that we have few tools to evaluate the effectiveness of our solutions. Additional ilities may include interpretability, latency, adversarial stability and transferability.

Terminology

Deep Learning (DL) is a sub-specialty of Machine Learning. We will mostly use the term DL in this book; however, many of the ideas found here are also applicable to Machine Learning in general.

We will use the name Learning Machine, or Machine in its short form, rather than the words network or system to refer to the entire neural network. We will also use the term Deep Neural Network (DNN) to refer to an Artificial Neural Network. The collection of neurons in the machine, their connectivity and their layers we will refer to as the Model. This is consistent with the terminology used by many DL frameworks. We use the name Representation to refer to the data that is processed by and flows through each layer in the Model; in DL frameworks, the name tensor or vector is used to refer to this object. We shall use the name layer to refer to the collection of neurons that maps an input Representation into an output Representation.

Categories as well as the names of patterns will be capitalized in this text to reduce ambiguity about their use in a sentence. However, references to other technical names, such as category theory or information theory, are kept uncapitalized, following conventional technical writing guidelines that recommend avoiding over-capitalization.

This chapter will go into category theory, dynamical systems, information theory and probabilistic soft logic so as to build up a vocabulary for describing the patterns in this book. The patterns in this book form an even richer language that will enable the practitioner to reason about the construction of DL systems.

TensorFlow

TensorFlow is a library, with a C++ core and a Python interface, created by Google to support the creation of Deep Learning solutions. It borrows most of its ideas from the Python-based Theano library developed at the University of Montreal. Specifically, it supports the definition and execution of a computational graph, and it supports symbolic differentiation for calculating the gradients required by backpropagation learning.

Google adds to Theano's ideas by including capabilities that support distributed computation. These include the ability to perform computation on partial subgraphs and the presence of coordination mechanisms such as queues and synchronization. So one could arguably make the simplification that TensorFlow is a distributed implementation of Theano.

TensorFlow is not the first distributed Deep Learning framework. There are plenty of other alternatives such as H2O, DL4J, Spark MLlib, CNTK, etc. In fact, Google previously had a distributed framework called DistBelief, which it claims has been replaced by TensorFlow. The claimed benefit of TensorFlow is that it reduces the complexity of deploying Deep Learning solutions compared to the previous system. This is the core idea of TensorFlow: it reduces the complexity of formulating new DL models in a distributed context.

This book on occasion will provide concrete examples in TensorFlow so as to give the reader a closer connection to implementation.
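
As a first taste, here is a minimal sketch written against the TensorFlow 1.x-era graph API that the description above refers to; the tiny logistic model is purely illustrative.

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 2])   # input Representation
    y = tf.placeholder(tf.float32, shape=[None, 1])   # observed targets

    W = tf.Variable(tf.random_normal([2, 1]))         # Model parameters (weights)
    b = tf.Variable(tf.zeros([1]))                    # Model parameters (bias)

    pred = tf.sigmoid(tf.matmul(x, W) + b)            # one differentiable layer
    loss = tf.reduce_mean(tf.square(pred - y))        # objective to minimize

    grads = tf.gradients(loss, [W, b])                # symbolic differentiation

    with tf.Session() as sess:                        # execute the computational graph
        sess.run(tf.global_variables_initializer())
        g_W, g_b = sess.run(grads, feed_dict={x: [[0.0, 1.0]], y: [[1.0]]})
        print(g_W, g_b)

The graph is defined first and executed later inside a Session, and tf.gradients is the symbolic differentiation step on which backpropagation builds.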

References

http://colah.github.io/posts/2015-09-NN-Types-FP/

For a complete treatise (with type theory) of automatic differentiation in a functional setting (for neural nets and other things), have a look at Jeff Siskind and Barak Pearlmutter's work (here: http://www.bcl.hamilton.ie/~barak/publications.html )

Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-Mode AD in a functional framework: Lambda the ultimate backpropagator. TOPLAS 30(2):1-36, Mar 2008, doi:10.1145/1330017.1330018.

Alexey Radul, Barak A. Pearlmutter, and Jeffrey Mark Siskind, AD in Fortran, Part 2: Implementation via Prepreprocessor, Advances in Automatic Differentiation, Lecture Notes in Computational Science and Engineering, Springer, 2012. Also arXiv:1203.1450.

Barak A. Pearlmutter and Jeffrey Mark Siskind. Using programming language theory to make AD sound and efficient. In Proceedings of the 5th International Conference on Automatic Differentiation, pages 79-90, Bonn, Germany, Aug 2008, doi:10.1007/978-3-540-68942-3_8. Springer-Verlag.

A Rosetta Stone for Connectionism http://www.santafe.edu/media/workingpapers/90-004.pdf

http://arxiv.org/pdf/1407.7417v1.pdf ‘Almost Sure’ Chaotic Properties of Machine Learning Methods. We showcase some properties of this iteration, and establish in general that the iteration is ‘almost surely’ of chaotic nature. This theory explains the observation of the counterintuitive properties of deep learning methods.

http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2819.pdf Introduction to Statistical Learning Theory The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results.

https://www.andrew.cmu.edu/user/awodey/students/jackson.pdf Sheaf Theoretic Approach to Measure Theory

http://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html

https://en.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%80%93Bellman_equation

http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html

http://artent.net/category/category-theory/ Entropy preserving: One thing we noticed was that 1-1 functions are the only functions that conserve the entropy of categorical variables. For example, if X∈{1,2,3,4,5,6} is a die roll, then Y=f(X) has the same entropy as X only if f is 1-1.

http://www.argmin.net/2016/05/31/mechanics-of-lagrangians

http://www.argmin.net/2016/05/18/mates-of-costate/

A few years ago, Steve Wright introduced me to an older method from optimal control, called the method of adjoints, which is equivalent to backpropagation.

https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions

http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts

http://arxiv.org/pdf/1603.09260v2.pdf Degrees of Freedom in Deep Neural Networks

We show that the degrees of freedom in these models are related to the expected optimism, which is the expected difference between test error and training error. The degrees of freedom in deep networks is dramatically less than the number of parameters. In some real datasets, the number of parameters is several orders of magnitude larger than the degrees of freedom. Further, we observe that for fixed number of parameters, deeper networks have less degrees of freedom exhibiting a regularization-by-depth.

http://arxiv.org/abs/1301.6201

Causal Theories: A Categorical Perspective on Bayesian Networks

http://arxiv.org/abs/1508.06448

A Compositional Framework for Markov Processes

http://arxiv.org/abs/1402.3067

A Bayesian Characterization of Relative Entropy

http://www.cogsys.org/pdf/paper-1-2.pdf The Cognitive Systems Paradigm

In this essay, I review the motivations behind the cognitive systems movement and attempt to characterize the paradigm. I propose six features that distinguish research in this framework from other approaches to artificial intelligence, after which I present some positive and negative examples in an effort to clarify the field’s boundaries. In closing, I also discuss some avenues for encouraging research in this critical area.

https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-preface/

http://arxiv.org/abs/1604.03099v1 Symbolic Knowledge Extraction using Łukasiewicz Logics

http://norvig.com/chomsky.html On Chomsky and the Two Cultures of Statistical Learning

https://github.com/mtomassoli/papers/blob/master/inftheory.pdf

https://aeon.co/essays/the-evidence-is-in-there-is-no-language-instinct

https://cbmm.mit.edu/sites/default/files/publications/machines_that_think.pdf

http://www.cell.com/trends/cognitive-sciences/fulltext/S1364-6613(16)30043-2

http://biorxiv.org/content/biorxiv/early/2016/06/13/058545.full.pdf Towards an integration of deep learning and neuroscience

We hypothesize that (1) the brain optimizes cost functions, (2) these cost functions are diverse and differ across brain locations and over development, and (3) optimization operates within a pre-structured architecture matched to the computational problems posed by behavior. Such a heterogeneously optimized system, enabled by a series of interacting cost functions, serves to make learning data-efficient and precisely targeted to the needs of the organism.

https://arxiv.org/pdf/quant-ph/0101012v4.pdf Quantum Theory From Five Reasonable Axioms

https://www.linkedin.com/pulse/data-science-replacing-statistics-sam-savage

http://www.analyticbridge.com/profiles/blogs/the-8-worst-predictive-modeling-techniques

http://alex.smola.org/drafts/thebook.pdf

https://static.aminer.org/pdf/PDF/000/392/201/category_theory_applied_to_neural_modeling_and_graphical_representations.pdf Category Theory Applied to Neural Modeling and Graphical Representations

http://www.scientificamerican.com/article/evidence-rebuts-chomsky-s-theory-of-language-learning/ Evidence Rebuts Chomsky's Theory of Language Learning

In linguistics and allied fields, many researchers are becoming ever more dissatisfied with a totally formal language approach such as universal grammar—not to mention the empirical inadequacies of the theory. Moreover, many modern researchers are also unhappy with armchair theoretical analyses, when there are large corpora of linguistic data—many now available online—that can be analyzed to test a theory.

https://arxiv.org/pdf/1610.01549.pdf A Novel Representation of Neural Networks

https://arxiv.org/pdf/1610.02751v1.pdf A New Theoretical and Technological System of Imprecise-Information Processing

https://drive.google.com/a/codeaudit.com/file/d/0B_wzP_JlVFcKeVRfVVNKX0NMelE/view

https://www.linkedin.com/pulse/computer-vision-research-my-deep-depression-nikos-paragios

https://arxiv.org/pdf/1612.04799v1.pdf Deep Function Machines: Generalized Neural Networks for Topological Layer Expression

http://videolectures.net/course_information_theory_pattern_recognition Video lectures of David MacKay on Information Theory

http://www.complex-systems.com/pdf/01-5-6.pdf What are the mathematical principles underlying the hierarchical self-organization of internal representations in the network? What are the relative roles of the nonlinear input-output response, the learning rule, and the input statistics (second order? higher order?)? Why are some properties learned more quickly than others? What is a mathematical definition of category coherence, and how does it relate to the speed of category learning? How can we explain changing patterns of inductive generalization over developmental time scales?

https://arxiv.org/pdf/1703.02156v1.pdf On the Limits of Learning Representations with Label-Based Supervision

Will the representations learned from these generative methods ever rival the quality of those from their supervised competitors? In this work, we argue in the affirmative, that from an information theoretic perspective, generative models have greater potential for representation learning. Based on several experimentally validated assumptions, we show that supervised learning is upper bounded in its capacity for representation learning in ways that certain generative models, such as Generative Adversarial Networks (GANs) are not.

https://arxiv.org/pdf/1703.04361.pdf Toward a Formal Model of Cognitive Synergy

Cognitive synergy has been posited as a key feature of real-world general intelligence, and has been used explicitly in the design of the OpenCog cognitive architecture. Here category theory and related concepts are used to give a formalization of the cognitive synergy concept.

https://arxiv.org/abs/1703.07950 Failures of Deep Learning

In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four families of problems for which some of the commonly used existing algorithms fail or suffer significant difficulty. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.

Failure due to Non-Informative Gradients - we discuss a family of problems for which with high probability, at any fixed point, the gradient, ∇Fh(w), will be essentially the same regardless of the underlying target function h.

Decomposition vs. End-to-end - Helping the SGD process by decomposing the problem leads to much faster training. We present two experiments, motivated by questions each and every practitioner must answer when facing a new, non-trivial problem; what exactly is the needed data, what architecture is planned to be used, and what is the right distribution of development efforts, are all correlated questions with no clear answer. Our experiments and analysis show that taking the wrong choice may be expensive.

Architecture and Conditioning - We show how different architectures, all of them of sufficient expressive power for solving the problem, have orders-of-magnitude difference in their condition numbers. In particular, this becomes apparent when considering convolutional vs. fully connected layers. This sheds a new light over the success of convolutional neural networks, which is generally attributed to their sample complexity benefits. Moreover, we show how conditioning, applied in conjunction with a better architecture choice, can further decrease the condition number by orders of magnitude. The direct effect on the convergence rate is analyzed, and is aligned with the significant performance gaps observed empirically. We also demonstrate how performance may not significantly improve by employing deeper and more powerful architectures, as well as the price that comes with choosing a sub-optimal architecture.

Flat Activations - We consider different ways to implement, approximate or learn such activations, such that the error will effectively propagate through them. Using a different variant of a local search-based update, based on [17, 16] , we arrive at an efficient solution. Efficient learning of generalized linear and single index models with isotonic regression.

https://arxiv.org/ftp/arxiv/papers/1705/1705.03921.pdf Why & When Deep Learning Works: Looking Inside Deep Learning

https://arxiv.org/abs/1610.07690 Operational calculus on programming spaces

https://arxiv.org/abs/1707.01594 Theories for influencer identification in complex networks

https://arxiv.org/ftp/arxiv/papers/1708/1708.01636.pdf Game theory models for communication between agents: a review

https://arxiv.org/abs/1708.05070 Data-driven Advice for Applying Machine Learning to Bioinformatics Problems

https://www.edge.org/responses/what-scientific-term-or%C2%A0concept-ought-to-be-more-widely-known

Representational_Issues_in_the_Debate_on.pdf Representational Issues in the Debate on the Standard Model of the Mind

https://arxiv.org/abs/1711.10455v1 Backprop as Functor: A compositional perspective on supervised learning

To summarise, in this paper we have developed an algebraic framework to describe composition of supervised learning algorithms. In order to do this, we have identified the notion of a request function as the key distinguishing feature of compositional learning. This request function allows us to construct training data for all sub-parts of a composite learning algorithm from training data for just the input and output of the composite algorithm.

https://arxiv.org/pdf/1511.00422.pdf Abelian Logic Gates

https://arxiv.org/pdf/1801.10437v1.pdf Deep Learning Works in Practice. But Does it Work in Theory?

https://arxiv.org/pdf/1803.06376.pdf A Generalised Method for Empirical Game Theoretic Analysis

We provide insights in the empirical meta game showing that a Nash equilibrium of the meta-game is an approximate Nash equilibrium of the true underlying game. We investigate and show how many data samples are required to obtain a close enough approximation of the underlying game. Additionally, we extend the meta-game analysis methodology to asymmetric games. The state-of-the-art has only considered empirical games in which agents have access to the same strategy sets and the payoff structure is symmetric, implying that agents are interchangeable. Finally, we carry out an empirical illustration of the generalised method in several domains, illustrating the theory and evolutionary dynamics of several versions of the AlphaGo algorithm (symmetric), the dynamics of the Colonel Blotto game played by human players on Facebook (symmetric), and an example of a meta-game in Leduc Poker (asymmetric), generated by the PSRO multi-agent learning algorithm.

http://www.weizmann.ac.il/mcb/UriAlon/sites/mcb.UriAlon/files/network_motifs_nature_genetics_review.pdf Network motifs: theory and experimental approaches

https://arxiv.org/abs/1803.05316 Seven Sketches in Compositionality: An Invitation to Applied Category Theory

https://arxiv.org/abs/1603.04641v3 Compositional game theory

https://arxiv.org/abs/1808.05385 On the Decision Boundary of Deep Neural Networks

the last weight layer of a neural network converges to a linear SVM trained on the output of the last hidden layer, for both the binary case and the multi-class case with the commonly used cross-entropy loss. Furthermore, we show empirically that training a neural network as a whole, instead of only fine-tuning the last weight layer, may result in better bias constant for the last weight layer, which is important for generalization. In addition to facilitating the understanding of deep learning, our result can be helpful for solving a broad range of practical problems of deep learning, such as catastrophic forgetting and adversarial attacking.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.161.3556&rep=rep1&type=pdf A Beginner's Guide to the Mathematics of Neural Networks

https://arxiv.org/abs/1311.1539v1 Category-Theoretic Quantitative Compositional Distributional Models of Natural Language Semantics

https://graphicallinearalgebra.net/ Graphical Linear Algebra

https://arxiv.org/pdf/1810.10531.pdf A mathematical theory of semantic development in deep neural networks

The synaptic weights of the neural network extract from the statistical structure of the environment a set of paired object analyzers and feature synthesizers associated with every categorical distinction. The bootstrapped, simultaneous learning of each pair solves the apparent Gordian knot of knowing both which items belong to a category, and which features are important for that category: the object analyzers determine category membership, while the feature synthesizers determine feature importance, and the set of extracted categories are uniquely determined by the statistics of the environment.

https://arxiv.org/abs/1810.09274v1 From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference

DN layers constructed from these operations can be interpreted as max-affine spline operators (MASOs) that have an elegant link to vector quantization (VQ) and K-means.

https://arxiv.org/abs/1810.02054 Gradient Descent Provably Optimizes Over-parameterized Neural Networks

over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.

https://www.semanticscholar.org/paper/A-Tale-of-Three-Probabilistic-Families-%3A-%2C-and-Yingnian-Gao/fb25d311c1237bb542175480aacd67b77272edd7 A Tale of Three Probabilistic Families: Discriminative, Descriptive and Generative Models

https://arxiv.org/pdf/1803.06824.pdf Indeterminism in Physics, Classical Chaos and Bohmian Mechanics. Are Real Numbers Really Real?

© 2016-2017 Copyright - Carlos E. Perez