This identifies the pattern and should be representative of the concept that it describes. The name should be a noun that is easy to use within a sentence, so that practitioners can readily refer to the pattern in conversation.


Describes in a single concise sentence the meaning of the pattern.


This section describes the reason why this pattern is needed in practice. Other pattern languages indicate this as the Problem. In our pattern language, we express this in a question or several questions and then we provide further explanation behind the question.


This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern.


By now, you may have been asking yourself, “why has deep learning become more successful than any other machine learning algorithm?” The answer can be distilled down to one word: “modularity”. What is modularity?

In computer science, we build complex systems from modules, each of which is built in turn from simpler modules. This is what enables us to build our digital world from nothing more than NAND or NOR gates. Universal Boolean operators are necessary, but they are not sufficient to build complex systems. Complex computing systems require modularity so that we have a tractable way of managing complexity.
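As a small illustration of building modules from simpler modules, the sketch below derives the usual gates, a full adder, and a ripple-carry adder from a single NAND primitive. All function names are chosen here for illustration, and the little-endian bit ordering is an assumption of this sketch.

```python
def nand(a, b):
    """The single primitive: returns 0 only when both inputs are 1."""
    return 0 if (a and b) else 1

# First layer of modules: basic gates built only from NAND.
def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))

def xor(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

# Second layer: adders built only from the gate modules above.
def half_adder(a, b):
    return xor(a, b), and_(a, b)  # (sum, carry)

def full_adder(a, b, carry_in):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, or_(c1, c2)

# Third layer: a ripple-carry adder built only from full adders.
def add_bits(x_bits, y_bits):
    """Add two equal-length bit lists (least significant bit first)."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

# 3 + 5 = 8, with bits written least significant first.
print(add_bits([1, 1, 0], [1, 0, 1]))
```

Each layer refers only to the layer directly beneath it, which is what keeps the complexity manageable: the adder never needs to know that NAND exists.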

Many machine learning algorithms support modularity; however, deep learning is the only known one that is able to scale. We can survey the list of canonical patterns and see how each one enables modularity.

Modularity enables the patterns in this book to be mixed and matched to build more complex solutions.

Here are six core design operators essential for modular systems:

  • Splitting – Modules can be made independent.
  • Substituting – Modules can be substituted and interchanged.
  • Excluding – Existing Modules can be removed to build a usable solution.
  • Augmenting – New Modules can be added to create new solutions.
  • Inverting – The hierarchical dependencies between Modules can be rearranged.
  • Porting – Modules can be applied to different contexts.

These operators are general and inherent in any modular design. Operators are actions that change existing structures into new structures in well-defined ways. Applied to software, this can mean refactoring operators at the source code level, language constructs at specification time, or component models at configuration time. The operators are complete in that they are capable of generating any structure in computer design. Furthermore, the set is parsimonious: it is the smallest possible set of rules. Finally, unlike other definitions of the properties of modularity, it does not rely on qualitative measures.

The six-operator definition focuses on functional invariance in the presence of design transformations.
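To make the operators concrete, here is a minimal sketch that treats a design as an ordered pipeline of named modules and implements four of the operators as functions that return a new design. The representation (a list of name and function pairs) and every helper name are assumptions made for illustration. Inverting is left out because it needs an explicit dependency hierarchy, which a flat pipeline does not have. Porting appears simply as reusing a split-off part inside a different design.

```python
# A design is an ordered pipeline of (name, module) pairs (hypothetical representation).
def run(design, x):
    """Feed x through every module in order."""
    for _, module in design:
        x = module(x)
    return x

def split(design, index):
    """Splitting: cut one design into two independent designs."""
    return design[:index], design[index:]

def substitute(design, name, replacement):
    """Substituting: swap one module for another with the same interface."""
    return [(n, replacement if n == name else m) for n, m in design]

def exclude(design, name):
    """Excluding: remove a module and keep a usable design."""
    return [(n, m) for n, m in design if n != name]

def augment(design, name, module):
    """Augmenting: append a new module to create a new design."""
    return design + [(name, module)]

base = [("scale", lambda x: 2 * x), ("shift", lambda x: x + 1)]

# Substitution with an equivalent module leaves the overall function unchanged.
same = substitute(base, "scale", lambda x: x + x)
print(run(base, 3), run(same, 3))

# Porting: the split-off front part is reused in a different design.
front, back = split(base, 1)
ported = augment(front, "negate", lambda x: -x)
print(run(ported, 3))
```

The substitution case is the one that shows functional invariance most directly: the design changes, but the input-output behavior of the whole pipeline does not.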

In the context of deep learning, the modularity operators are enabled as follows:

  • Splitting – Pre-trained autoencoders can be split and reused as layers in another network.
  • Substituting – Through transfer learning, student networks can serve as substitutes for teacher networks.
  • Augmenting – New networks can be added later to improve accuracy. Networks can be jointly trained to improve generalization. Furthermore, the outputs of a neural network can be used as neural embeddings that serve as representations for other neural networks.
  • Inverting – Generative networks can be created that reverse the information flow. Bi-directional networks and ladder networks also have this property.
  • Porting – A neural network can be ported to a different context by replacing the top layers. Alternatively, through transfer learning a bigger network can be ported to a smaller network.
  • Excluding – Current deep learning practice does not have a method for excluding portions of a trained network. This is equivalent to forgetting, which at this time is not something a network can do with any precision. Furthermore, parameter entanglement and the distributed nature of the representations work against this capability.
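The splitting and porting entries above can be sketched with a toy network written in plain Python. Everything here is an assumption made for illustration: the "pre-trained" weights are chosen by hand rather than learned, and the `Layer`, `forward`, and `train_head` names belong to no library. The encoder is split from an autoencoder, frozen, and ported to a regression task by training only a new top layer.

```python
def relu(v):
    return [max(0.0, x) for x in v]

class Layer:
    """A dense layer computing activation(W v + b) on plain Python lists."""
    def __init__(self, weights, bias, activation=None):
        self.weights = weights
        self.bias = bias
        self.activation = activation

    def __call__(self, v):
        out = [sum(w * x for w, x in zip(row, v)) + b
               for row, b in zip(self.weights, self.bias)]
        return self.activation(out) if self.activation else out

def forward(layers, v):
    for layer in layers:
        v = layer(v)
    return v

# Hypothetical "pre-trained" autoencoder; the weights are hand-picked, not learned.
encoder = Layer([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0], relu)
decoder = Layer([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0])
autoencoder = [encoder, decoder]

# Splitting: detach the encoder. Porting: attach a new top layer for a new task.
head = Layer([[0.0, 0.0]], [0.0])
regressor = [encoder, head]

def train_head(data, lr=0.1, epochs=2000):
    """Fit only the head by SGD on squared error; the encoder is never updated."""
    for _ in range(epochs):
        for x, y in data:
            features = encoder(x)
            error = head(features)[0] - y
            head.weights[0] = [w - lr * error * f
                               for w, f in zip(head.weights[0], features)]
            head.bias[0] -= lr * error

# Toy task: y = x0 + 2 * x1 on non-negative inputs.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0), ([2.0, 1.0], 4.0)]
train_head(data)
print(forward(regressor, [3.0, 2.0]))
```

Because only the head's parameters change, the autoencoder built from the same encoder object keeps working after training, which is the property that makes the split-off module reusable.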

The modularity of deep learning networks, and the scalability that comes with it, gives deep learning an unmatched advantage over competing machine learning techniques.

Where are the module boundaries?

Known Uses

Here we review several projects or papers that have used this pattern.

Related Patterns

In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be precise or may be fuzzy, so we provide further explanation of the nature of each relationship. We also describe other patterns that may not be conceptually related but work well in combination with this pattern.

Relationship to Canonical Patterns

Relationship to other Patterns

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.


Neural Module Networks

In this paper, we have introduced neural module networks, which provide a general-purpose framework for learning collections of neural modules which can be dynamically assembled into arbitrary deep networks.

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

Learning and Transfer of Modulated Locomotor Controllers

A high-frequency, low-level “spinal” network with access to proprioceptive sensors learns sensorimotor primitives by training on simple tasks. This pre-trained module is fixed and connected to a low-frequency, high-level “cortical” network, with access to all sensors, which drives behavior by modulating the inputs to the spinal network.

Our design encourages the low-level controller to focus on the specifics of reactive motor control, while a high-level controller directs behavior towards the task goal by communicating a modulatory signal.

We believe that the general idea of reusing learned behavioral primitives is important, and the design principles we have followed represent possible steps towards this goal. Our hierarchical design with information hiding has enabled the construction of low-level motor behaviors that are sheltered from task-specific information, enabling their reuse.

Neurogenesis Deep Learning

Extending deep networks to accommodate new classes

A Base Camp for Scaling AI

Modern statistical machine learning (SML) methods share a major limitation with the early approaches to AI: there is no scalable way to adapt them to new domains. Human learning solves this in part by leveraging a rich, shared, updateable world model. Such scalability requires modularity: updating part of the world model should not impact unrelated parts. We have argued that such modularity will require both “correctability” (so that errors can be corrected without introducing new errors) and “interpretability” (so that we can understand what components need correcting). To achieve this, one could attempt to adapt state of the art SML systems to be interpretable and correctable; or one could see how far the simplest possible interpretable, correctable learning methods can take us, and try to control the limitations of SML methods by applying them only where needed. Here we focus on the latter approach and we investigate two main ideas: “Teacher Assisted Learning”, which leverages crowd sourcing to learn language; and “Factored Dialog Learning”, which factors the process of application development into roles where the language competencies needed are isolated, enabling non-experts to quickly create new applications.

Understanding Synthetic Gradients and Decoupled Neural Interfaces

When training neural networks, the use of Synthetic Gradients (SG) allows layers or modules to be trained without update locking - without waiting for a true error gradient to be backpropagated - resulting in Decoupled Neural Interfaces (DNIs). This unlocked ability of being able to update parts of a neural network asynchronously and with only local information was demonstrated to work empirically in Jaderberg et al (2016). However, there has been very little demonstration of what changes DNIs and SGs impose from a functional, representational, and learning dynamics point of view. In this paper, we study DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation. We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.

Limits of End-to-End Learning

Divide and Conquer Networks

Our model can be trained in weakly supervised environments, namely by just observing input-output pairs, and in even weaker environments, using a non-differentiable reward signal. Moreover, thanks to the dynamic aspect of our architecture, we can incorporate the computational complexity as a regularization term that can be optimized by backpropagation. We demonstrate the flexibility and efficiency of the Divide-and-Conquer Network on three combinatorial and geometric tasks: sorting, clustering and convex hulls.

Forward Thinking: Building and Training Neural Networks One Layer at a Time

We present a general framework for training deep neural networks without backpropagation. This substantially decreases training time and also allows for construction of deep networks with many sorts of learners, including networks whose layers are defined by functions that are not easily differentiated, like decision trees. The main idea is that layers can be trained one at a time, and once they are trained, the input data are mapped forward through the layer to create a new learning problem. The process is repeated, transforming the data through multiple layers, one at a time, rendering a new data set, which is expected to be better behaved, and on which a final output layer can achieve good performance. We call this forward thinking and demonstrate a proof of concept by achieving state-of-the-art accuracy on the MNIST dataset for convolutional neural networks. We also provide a general mathematical formulation of forward thinking that allows for other types of deep learning problems to be considered.

Learning to Reason: End-to-End Module Networks for Visual Question Answering

The recently proposed Neural Module Network (NMN) architecture implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss).

Modular Multitask Reinforcement Learning with Policy Sketches

Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them, specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). To learn from sketches, we present a model that associates every subtask with a modular subpolicy, and jointly maximizes reward over full task-specific policies by tying parameters across shared subpolicies. Optimization is accomplished via a decoupled actor–critic training objective that facilitates learning common behaviors from multiple dissimilar reward functions. We evaluate the effectiveness of our approach in three environments featuring both discrete and continuous control, and with sparse rewards that can be obtained only after completing a number of high-level subgoals. Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.

Boosted Backpropagation Learning for Training Deep Modular Networks

Above and Beyond the Landauer Bound: Thermodynamics of Modularity

Learning Hierarchical Information Flow with Recurrent Neural Modules

We propose a deep learning model inspired by neocortical communication via the thalamus. Our model consists of recurrent neural modules that send features via a routing center, endowing the modules with the flexibility to share features over multiple time steps. We show that our model learns to route information hierarchically, processing input data by a chain of modules. We observe common architectures, such as feed forward neural networks and skip connections, emerging as special cases of our architecture, while novel connectivity patterns are learned for the text8 compression task. We demonstrate that our model outperforms standard recurrent neural networks on three sequential benchmarks.

Better together? Statistical learning in models made of modules

Quality and Diversity Optimization: A Unifying Modular Framework

The optimization of functions to find the best solution according to one or several objectives has a central role in many engineering and research fields. Recently, a new family of optimization algorithms, named Quality-Diversity optimization, has been introduced, and contrasts with classic algorithms. Instead of searching for a single solution, Quality-Diversity algorithms are searching for a large collection of both diverse and high-performing solutions. The role of this collection is to cover the range of possible solution types as much as possible, and to contain the best solution for each type. The contribution of this paper is threefold. Firstly, we present a unifying framework of Quality-Diversity optimization algorithms that covers the two main algorithms of this family (Multi-dimensional Archive of Phenotypic Elites and the Novelty Search with Local Competition), and that highlights the large variety of variants that can be investigated within this family. Secondly, we propose algorithms with a new selection mechanism for Quality-Diversity algorithms that outperforms all the algorithms tested in this paper. Lastly, we present a new collection management that overcomes the erosion issues observed when using unstructured collections. These three contributions are supported by extensive experimental comparisons of Quality-Diversity algorithms on three different experimental scenarios.

Modeling Relationships in Referential Expressions with Compositional Modular Networks

In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions.

The evolutionary origins of modularity

Although most hypotheses assume indirect selection for evolvability, here we demonstrate that the ubiquitous, direct selection pressure to reduce the cost of connections between network nodes causes the emergence of modular networks.

Identifying core functional networks and functional modules within artificial neural networks via subsets regression

Dynamic Compositional Neural Networks over Tree Structure

Most existing models suffer from the underfitting problem: they recursively use the same shared compositional function throughout the whole compositional process and lack expressive power due to inability to capture the richness of compositionality. In this paper, we address this issue by introducing the dynamic compositional neural networks over tree structure (DC-TreeNN), in which the compositional function is dynamically generated by a meta network. The role of metanetwork is to capture the metaknowledge across the different compositional rules and formulate them. Experimental results on two typical tasks show the effectiveness of the proposed models.

Neural Task Programming: Learning to Generalize Across Hierarchical Tasks

In this work, we propose a novel robot learning framework called Neural Task Programming (NTP), which bridges the idea of few-shot learning from demonstration and neural program induction. NTP takes as input a task specification (e.g., video demonstration of a task) and recursively decomposes it into finer sub-task specifications. These specifications are fed to a hierarchical neural program, where bottom-level programs are callable subroutines that interact with the environment. We validate our method in three robot manipulation tasks.

Modular Representation of Layered Neural Networks

In this paper, we propose a new method for extracting a global and simplified structure from a layered neural network. Based on network analysis, the proposed method detects communities or clusters of units with similar connection patterns. We show its effectiveness by applying it to three use cases. (1) Network decomposition: it can decompose a trained neural network into multiple small independent networks thus dividing the problem and reducing the computation time. (2) Training assessment: the appropriateness of a trained result with a given hyperparameter or randomly chosen initial parameters can be evaluated by using a modularity index. And (3) data analysis: in practical data it reveals the community structure in the input, hidden, and output layers, which serves as a clue for discovering knowledge from a trained neural network.

Behavioral Communities and the Atomic Structure of Networks

We develop a theory of `behavioral communities' and the `atomic structure' of networks. We define atoms to be groups of agents whose behaviors always match each other in a set of coordination games played on the network. This provides a microfoundation for a method of detecting communities in social and economic networks. We provide theoretical results characterizing such behavior-based communities and atomic structures and discussing their properties in large random networks. We also provide an algorithm for identifying behavioral communities. We discuss applications including: a method of estimating underlying preferences by observing behavioral conventions in data, and optimally seeding diffusion processes when there are peer interactions and homophily. We illustrate the techniques with applications to high school friendship networks and rural village networks.

Incremental Learning Through Deep Adaptation

We propose a method called Deep Adaptation Networks (DAN) that constrains newly learned filters to be linear combinations of existing ones. DANs preserve performance on the original task, require a fraction (typically 13%) of the number of parameters compared to standard fine-tuning procedures and converge in less cycles of training to a comparable or better level of performance. When coupled with standard network quantization techniques, we further reduce the parameter cost to around 3% of the original with negligible or no loss in accuracy. The learned architecture can be controlled to switch between various learned representations, enabling a single network to solve a task from multiple different domains.

Diffusion-based neuromodulation can eliminate catastrophic forgetting in simple neural networks

Intelligence is associated with the modular structure of intrinsic brain networks

Modelling subject-specific brain network graphs from functional MRI resting-state data (N = 309), we found that intelligence was not associated with global modularity features (e.g., number or size of modules) or the whole-brain proportions of different node types (e.g., connector hubs or provincial hubs). In contrast, we observed characteristic associations between intelligence and node-specific measures of within- and between-module connectivity, particularly in frontal and parietal brain regions that have previously been linked to intelligence. We propose that the connectivity profile of these regions may shape intelligence-relevant aspects of information processing.

Learning to Compose Skills

Ecological and evolutionary dynamics of interconnectedness and modularity

This work contains two major theoretical contributions: Firstly, we define a general set of measures, referred to as interconnectedness, which generalizes and combines classical notions of diversity and modularity. Secondly, we analyze the temporal evolution of interconnectedness based on a microscale model of ecoevolutionary dynamics.

Backprop as Functor: A compositional perspective on supervised learning

To summarise, in this paper we have developed an algebraic framework to describe composition of supervised learning algorithms. In order to do this, we have identified the notion of a request function as the key distinguishing feature of compositional learning. This request function allows us to construct training data for all sub-parts of a composite learning algorithm from training data for just the input and output of the composite algorithm.

Autostacker: A Compositional Evolutionary Learning System

We introduce an automatic machine learning (AutoML) modeling architecture called Autostacker, which combines an innovative hierarchical stacking architecture and an Evolutionary Algorithm (EA) to perform efficient parameter search. Neither prior domain knowledge about the data nor feature preprocessing is needed. Using EA, Autostacker quickly evolves candidate pipelines with high predictive accuracy. These pipelines can be used as is or as a starting point for human experts to build on. Autostacker finds innovative combinations and structures of machine learning models, rather than selecting a single model and optimizing its hyperparameters. Compared with other AutoML systems on fifteen datasets, Autostacker achieves state-of-art or competitive performance both in terms of test accuracy and time cost.

Meta Multi-Task Learning for Sequence Modeling

Specifically, we use a shared meta-network to capture the meta-knowledge of semantic composition and generate the parameters of the task-specific semantic composition models.

Compositional Attention Networks for Machine Reasoning

The model approaches problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintains a separation between control and memory.

A mechanistic model of connector hubs, modularity, and cognition

Critically, we find evidence consistent with a mechanistic model in which connector hubs tune the connectivity of their neighbors to be more modular while allowing for task appropriate information integration across communities, which increases global modularity and cognitive performance.

Seven Sketches in Compositionality: An Invitation to Applied Category Theory

Are network motifs the spandrels of cellular complexity?

Networks pervade biology at multiple scales [17]. Although the observation of common subgraphs within networks is interesting, it should be viewed in the appropriate context: selection might have taken advantage of the network structure a posteriori. As Gould said: ‘although spandrels must originate as necessary side-consequences of an architectural decision, and not as forms explicitly chosen to serve a purpose, they still exist in undeniable abundance, and can therefore be secondarily used in important and interesting ways’ [12]. The reliable nature of cellular circuits might well be an example of how some abundant motifs can be used to process information under the presence of noise [18]. In this context, new motifs emerging as a consequence of random gene duplications (such as those highlighted in Figure 1) are known to reduce efficiently internal fluctuations and delays in cellular responses.

Evolvability of feed-forward loop architecture biases its abundance in transcription networks

Furthermore, we examined the functional robustness of the motifs facing mutational pressure in silico and observed that the abundance pattern is biased by the degree of their evolvability. Conclusions: The natural abundance pattern of the feed-forward loop can be reconstructed concerning its intrinsic plasticity. Intrinsic plasticity is associated to each motif in terms of its capacity of implementing a repertoire of possible functions and it is directly linked to the motif’s evolvability. Since evolvability is defined as the potential phenotypic variation of the motif upon mutation, the link plausibly explains the abundance pattern.

We claim, however, general applicability. FFL abundances are correlated with their plasticity and evolvability. Evolvability has been defined as a compromise between robustness against single mutations and the capability to modify the functional response upon increasing mutational pressure. The results indicate that a proper portion of intrinsic functional plasticity, which can be understood as a strategic trade-off between specialization and flexibility, is necessary to be abundant. Because only then one is suited to be readily evolvable in changing environments.

Topology of biological networks and reliability of information processing

The likelihood of reliable dynamical attractors strongly depends on the underlying topology of a network. Comparing with the observed architectures of gene regulation networks, we find that those 3-node subgraphs that allow for reliable dynamics are also those that are more abundant in nature, suggesting that specific topologies of regulatory networks may provide a selective advantage in evolution through their resistance against noise.

Biological information processing systems as circuits of intrinsically noisy elements are constrained by the need for reproducible output. Effects of fluctuations on the system level can be suppressed through a suitable circuit design.

Modular meta-learning

We train different modular structures on a set of related tasks and generalize to new tasks by composing the learned modules in new ways.

Automatically Composing Representation Transformations as a Means for Generalization

Our experimental findings in Sec. 4 highlight the limitations of monolithic static network architectures in generalizing to structurally similar but more complex problems than they have been trained on, providing evidence in favor of learners that dynamically re-program themselves, from a repertoire of learned reusable computational units, to the current problem they face.

Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science

Taking inspiration from the network properties of biological neural networks (e.g. sparsity, scale-freeness), we argue that (contrary to general practice) artificial neural networks, too, should not have fully-connected layers. Here we propose sparse evolutionary training of artificial neural networks, an algorithm which evolves an initial sparse topology (Erdős–Rényi random graph) of two consecutive layers of neurons into a scale-free topology, during learning. Our method replaces artificial neural networks fully-connected layers with sparse ones before training, reducing quadratically the number of parameters, with no decrease in accuracy.

Synthesis of Differentiable Functional Programs for Lifelong Learning

Our learning algorithm consists of: (1) a program synthesizer that performs a type-directed search over programs in this language, and decides on the library functions that should be reused and the architectures that should be used to combine them; and (2) a neural module that trains synthesized programs using stochastic gradient descent.

Combined Reinforcement Learning via Abstract Representations

In this paper we propose a new way of explicitly bridging both approaches via a shared low-dimensional learned encoding of the environment, meant to capture summarizing abstractions. We show that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space. In addition, this approach recovers a sufficient low-dimensional representation of the environment, which opens up new strategies for interpretable AI, exploration and transfer learning.

Composition and Decomposition of GANs

Systematic Generalization: What Is Required and Can It Be Learned?

Our findings show that the generalization of modular models is much more systematic and that it is highly sensitive to the module layout, i.e. to how exactly the modules are connected. We furthermore investigate if modular models that generalize well could be made more end-to-end by learning their layout and parametrization. We show how end-to-end methods from prior work often learn a wrong layout and a spurious parametrization that do not facilitate systematic generalization.

Neural Network Encapsulation

A General Method for Amortizing Variational Filtering