# Data Synthesis

http://www-etud.iro.umontreal.ca/~wardefar/publications/adv_perturb_chapter.pdf Perturbation, Optimization and Statistics

Simultaneously, the adversarial perspective can be fruitfully leveraged for tasks other than simple supervised learning. While the focus of generative modeling in the past has often been on models that directly optimize likelihood, many application domains express a need for realistic synthesis, including the generation of speech waveforms, image and video inpainting and super-resolution, the procedural generation of video game assets, and forward prediction in model-based reinforcement learning. Recent work (Theis et al., 2015) suggests that these goals may be at odds with this likelihood-centric paradigm. Generative adversarial networks and their extensions provide one avenue of attack on these difficult synthesis problems with an intuitively appealing approach: to learn to generate convincingly, aim to fool a motivated adversary. An important avenue for future research concerns the quantitative evaluation of generative models intended for synthesis; particular desiderata include generic, widely applicable evaluation procedures which nonetheless can be made to respect domain-specific notions of similarity and verisimilitude.

https://arxiv.org/abs/1612.07828 Learning from Simulated and Unsupervised Images through Adversarial Training

We propose Simulated+Unsupervised (S+U) learning, where the task is to learn a model to improve the realism of a simulator's output using unlabeled real data, while preserving the annotation information from the simulator. We develop a method for S+U learning that uses an adversarial network similar to Generative Adversarial Networks (GANs), but with synthetic images as inputs instead of random vectors.

https://arxiv.org/abs/1607.00070v1 A Sequence-to-Sequence Model for User Simulation in Spoken Dialogue Systems

User simulation is essential for generating enough data to train a statistical spoken dialogue system. Previous models for user simulation suffer from several drawbacks, such as the inability to take dialogue history into account, the need of rigid structure to ensure coherent user behaviour, heavy dependence on a specific domain, the inability to output several user intentions during one dialogue turn, or the requirement of a summarized action space for tractability. This paper introduces a data-driven user simulator based on an encoder-decoder recurrent neural network. The model takes as input a sequence of dialogue contexts and outputs a sequence of dialogue acts corresponding to user intentions. The dialogue contexts include information about the machine acts and the status of the user goal. We show on the Dialogue State Tracking Challenge 2 (DSTC2) dataset that the sequence-to-sequence model outperforms an agenda-based simulator and an n-gram simulator, according to F-score. Furthermore, we show how this model can be used on the original action space and thereby models user behaviour with finer granularity.

https://arxiv.org/abs/1606.03632v2 Natural Language Generation in Dialogue using Lexicalized and Delexicalized Data

https://arxiv.org/abs/1611.05416 Composing Music with Grammar Argumented Neural Networks and Note-Level Encoding

To transcend this inadequacy, we put forward a novel method for music composition that combines the LSTM with Grammars motivated by music theory. In this work, we hence improvise an LSTM with an original method known as the Grammar Argumented (GA) method, such that our model combines a neural network with grammars. We begin by training an LSTM neural network with a dataset from music composed by actual human musicians. In the training process, the machine learns the relationships within the sequential information as much as possible. Next we feed a short phrase of music to trigger the first phase of generation. Instead of adding the first phase of generated notes directly to the output, we evaluate these notes according to common music composition rules. Notes that go against music theory rules will be abandoned, and replaced by repredicted new notes that eventually conform to the rules. All amended results and their corresponding inputs will then be added to the training set. We then retrain our model with the updated training set and use the original generating method to do the second phase of (actual) generation.
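The generate, check, and re-predict loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `propose_note` is a hypothetical random stand-in for the trained LSTM, and `obeys_rules` encodes a single made-up rule (no melodic leap larger than seven semitones) in place of the paper's music-theory grammar. The retraining step is omitted; the sketch only collects the amended pairs that would be added to the training set.

```python
import random

# Hypothetical stand-in for the trained LSTM: proposes a next MIDI pitch.
def propose_note(history, rng):
    return rng.randint(48, 84)

# Hypothetical rule (assumption): no melodic leap larger than `max_leap` semitones.
def obeys_rules(history, note, max_leap=7):
    return not history or abs(note - history[-1]) <= max_leap

def generate_with_grammar(seed, length, rng, max_tries=100):
    """Generate notes, replacing rule-violating proposals by re-predictions."""
    melody = list(seed)
    amended = []  # (context, accepted note) pairs where a proposal was replaced
    while len(melody) < length:
        note = propose_note(melody, rng)
        replaced = False
        tries = 0
        while not obeys_rules(melody, note) and tries < max_tries:
            note = propose_note(melody, rng)  # re-predict
            replaced = True
            tries += 1
        if not obeys_rules(melody, note):
            note = melody[-1]  # fallback: repeating the last note is always legal here
        if replaced:
            amended.append((tuple(melody), note))
        melody.append(note)
    return melody, amended

melody, amended = generate_with_grammar([60], 16, random.Random(0))
print(melody)
print(len(amended), "amended examples collected for retraining")
```

In a fuller version, the `amended` pairs would be appended to the training data before a second training pass, as the abstract describes.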

https://arxiv.org/abs/1612.09322v2 Deep Learning Logo Detection with Data Expansion by Synthesising Context

Specifically, we design a novel algorithm for generating Synthetic Context Logo (SCL) training images to increase model robustness against unknown background clutters, resulting in superior logo detection performance.

https://arxiv.org/pdf/1702.08484v1.pdf Boosted Generative Models

We propose a new approach for using unsupervised boosting to create an ensemble of generative models, where models are trained in sequence to correct earlier mistakes. Our meta-algorithmic framework can leverage any existing base learner that permits likelihood evaluation, including recent latent variable models. Further, our approach allows the ensemble to include discriminative models trained to distinguish real data from model-generated data. We show theoretical conditions under which incorporating a new model in the ensemble will improve the fit and empirically demonstrate the effectiveness of boosting on density estimation and sample generation on synthetic and benchmark real datasets.

http://dustintran.com/blog/deep-and-hierarchical-implicit-models Deep and Hierarchical Implicit Models

Implicit probabilistic models are a very flexible class for modeling data. They define a process to simulate observations, and unlike traditional models, they do not require a tractable likelihood function. In this paper, we develop two families of models: hierarchical implicit models and deep implicit models. They combine the idea of implicit densities with hierarchical Bayesian modeling and deep neural networks. The use of implicit models with Bayesian analysis has in general been limited by our ability to perform accurate and scalable inference. We develop a variational inference algorithm for implicit models. Key to our method is specifying a variational family that is also implicit. This matches the model's flexibility and allows for accurate approximation of the posterior. Our method scales up implicit models to sizes previously not possible and opens the door to new modeling designs. We demonstrate diverse applications: a large-scale physical simulator for predator-prey populations in ecology; a Bayesian generative adversarial network for discrete data; and a deep implicit model for text generation.

https://arxiv.org/abs/1703.00868v1 Using Synthetic Data to Train Neural Networks is Model-Based Reasoning

We draw a formal connection between using synthetic training data to optimize neural network parameters and approximate, Bayesian, model-based reasoning. In particular, training a neural network using synthetic data can be viewed as learning a proposal distribution generator for approximate inference in the synthetic-data generative model.

https://arxiv.org/pdf/1703.01925v1.pdf Grammar Variational Autoencoder

However, generative modeling of discrete data such as arithmetic expressions and molecular structures still poses significant challenges. Crucially, state-of-the-art methods often produce outputs that are not valid. We make the key observation that frequently, discrete data can be represented as a parse tree from a context-free grammar. We propose a variational autoencoder which encodes and decodes directly to and from these parse trees, ensuring the generated outputs are always valid. Surprisingly, we show that not only does our model more often generate valid outputs, it also learns a more coherent latent space in which nearby points decode to similar discrete outputs. We demonstrate the effectiveness of our learned models by showing their improved performance in Bayesian optimization for symbolic regression and molecular synthesis.
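To illustrate why decoding to parse trees keeps outputs valid, here is a minimal sketch of grammar-masked decoding. It assumes a toy arithmetic grammar and hand-written per-step rule scores in place of a learned decoder; `RULES`, `masked_argmax`, and `decode` are illustrative names, not the paper's code. At each step only the production rules whose left-hand side matches the symbol on top of the stack are eligible, so an invalid rule can never be chosen however high its score.

```python
# Toy context-free grammar (assumption): each rule is (left-hand side, right-hand side).
RULES = [
    ("S", ["S", "+", "T"]),
    ("S", ["T"]),
    ("T", ["x"]),
    ("T", ["1"]),
]
NONTERMINALS = {"S", "T"}

def masked_argmax(scores, lhs):
    """Pick the best-scoring rule among those whose left-hand side is `lhs`."""
    valid = [i for i, (head, _) in enumerate(RULES) if head == lhs]
    return max(valid, key=lambda i: scores[i])

def decode(logits):
    """Expand the start symbol using one row of rule scores per nonterminal."""
    stack = ["S"]
    tokens = []
    step = 0
    while stack:
        symbol = stack.pop()
        if symbol not in NONTERMINALS:
            tokens.append(symbol)  # terminals go straight to the output
            continue
        if step >= len(logits):
            return None  # ran out of decoder steps before the derivation finished
        rule = masked_argmax(logits[step], symbol)
        step += 1
        # Push the right-hand side reversed so the leftmost symbol is expanded first.
        stack.extend(reversed(RULES[rule][1]))
    return " ".join(tokens)

# Row 0 scores the T rules highest, but the mask restricts the choice to S rules.
logits = [[2, 0, 5, 5], [0, 1, 9, 9], [9, 9, 0, 1], [9, 9, 3, 0]]
print(decode(logits))
```

A derivation that is still unfinished when the scores run out returns `None` here; how to handle that case is a design choice this sketch leaves open.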

https://arxiv.org/pdf/1703.00955v1.pdf Controllable Text Generation

We propose a new neural generative model which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns highly interpretable representations from even only word annotations, and produces realistic sentences with desired attributes.

https://arxiv.org/pdf/1704.01696v1.pdf A Syntactic Neural Model for General-Purpose Code Generation

https://arxiv.org/abs/1704.06851v1 Affect-LM: A Neural Language Model for Customizable Affective Text Generation

https://arxiv.org/pdf/1705.10843.pdf Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models

In unsupervised data generation tasks, besides the generation of a sample based on previous observations, one would often like to give hints to the model in order to bias the generation towards desirable metrics. We propose a method that combines Generative Adversarial Networks (GANs) [8] and reinforcement learning (RL) in order to accomplish exactly that. While RL biases the data generation process towards arbitrary metrics, the GAN component of the reward function ensures that the model still remembers information learned from data. We build upon previous results that incorporated GANs and RL in order to generate sequence data [20] and test this model in several settings for the generation of molecules encoded as text sequences (SMILES [12]) and in the context of music generation, showing for each case that we can effectively bias the generation process towards desired metrics.
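One simple way to picture the combination of a GAN signal with a task objective is a weighted sum of the two scores used as the RL reward. The sketch below assumes that form; the weight `lam` and both toy scoring functions (`balanced_parens` standing in for a discriminator, `carbon_fraction` standing in for a domain metric) are hypothetical and chosen only so the example runs on SMILES-like strings.

```python
def balanced_parens(s):
    """Toy 'discriminator' (assumption): 1.0 if parentheses are balanced, else 0.0."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return 0.0
    return 1.0 if depth == 0 else 0.0

def carbon_fraction(s):
    """Toy objective (assumption): fraction of characters that are 'C'."""
    return s.count("C") / len(s) if s else 0.0

def combined_reward(sample, discriminator, objective, lam=0.5):
    """Blend a realism score and a task score; lam=1 ignores the objective."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * discriminator(sample) + (1.0 - lam) * objective(sample)

print(combined_reward("CC(C)O", balanced_parens, carbon_fraction, lam=0.5))
```

Setting `lam` close to 1 keeps samples near the data distribution, while values close to 0 push generation toward the metric at the risk of drifting away from what was learned from data.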

https://arxiv.org/abs/1705.10929 Adversarial Generation of Natural Language

We present quantitative results on generating sentences from context-free and probabilistic context-free grammars, and qualitative language modeling results. A conditional version is also described that can generate sequences conditioned on sentence characteristics.

https://arxiv.org/abs/1705.07541v1 Learning from Complementary Labels

Collecting labeled data is costly and thus is a critical bottleneck in real-world classification tasks. To mitigate the problem, we consider a complementary label, which specifies a class that a pattern does not belong to. Collecting complementary labels would be less laborious than ordinary labels since users do not have to carefully choose the correct class from many candidate classes. However, complementary labels are less informative than ordinary labels and thus a suitable approach is needed to better learn from complementary labels. In this paper, we show that an unbiased estimator of the classification risk can be obtained only from complementary labels, if a loss function satisfies a particular symmetric condition. We theoretically prove the estimation error bounds for the proposed method, and experimentally demonstrate the usefulness of the proposed algorithms.

https://arxiv.org/abs/1707.05236v1 Artificial Error Generation with Machine Translation and Syntactic Patterns

This paper investigates two alternative methods for artificially generating writing errors, in order to create additional resources. We propose treating error generation as a machine translation task, where grammatically correct text is translated to contain errors. In addition, we explore a system for extracting textual patterns from an annotated corpus, which can then be used to insert errors into grammatically correct sentences. Our experiments show that the inclusion of artificially generated errors significantly improves error detection accuracy on both FCE and CoNLL 2014 datasets.
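The pattern-based variant can be sketched as a token-level substitution pass. This is an assumption-laden illustration: the `PATTERNS` list is invented for the example (the paper extracts patterns from an annotated corpus), and marking every token of a substituted span as an error is a simplification.

```python
import random

# Hypothetical (correct tokens, erroneous tokens) patterns, invented for this example.
PATTERNS = [
    (["an"], ["a"]),
    (["depends", "on"], ["depends", "of"]),
]

def corrupt(tokens, patterns, rng, prob=0.5):
    """Return (new_tokens, labels); labels[i] is 1 where an error was inserted."""
    out, labels = [], []
    i = 0
    while i < len(tokens):
        for correct, wrong in patterns:
            n = len(correct)
            if tokens[i:i + n] == correct and rng.random() < prob:
                out.extend(wrong)
                labels.extend([1] * len(wrong))
                i += n
                break
        else:
            # No pattern fired at this position: copy the token unchanged.
            out.append(tokens[i])
            labels.append(0)
            i += 1
    return out, labels

tokens = "it depends on an apple".split()
print(corrupt(tokens, PATTERNS, random.Random(0), prob=1.0))
```

The resulting token and label pairs have the shape an error detection model would train on, which is how generated errors can serve as additional training data.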

https://arxiv.org/abs/1709.01643v1 Learning to Compose Domain-Specific Transformations for Data Augmentation

https://arxiv.org/abs/1709.00938 ARIGAN: Synthetic Arabidopsis Plants using Generative Adversarial Network

We propose the Arabidopsis Rosette Image Generator (through) Adversarial Network: a deep convolutional network that is able to generate synthetic rosette-shaped plants, inspired by DCGAN.

https://arxiv.org/abs/1711.09534 Neural Text Generation: A Practical Guide

https://mlatgt.blog/2018/02/08/syntax-directed-variational-autoencoder-for-structured-data/ Syntax-Directed Variational Autoencoder for Structured Data

https://github.com/PonyGE/PonyGE2 PonyGE2: grammatical evolution and variants in Python

https://arxiv.org/abs/1804.06516v2 Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization

To handle the variability in real-world data, the system relies upon the technique of domain randomization, in which the parameters of the simulator (such as lighting, pose, and object textures) are randomized in non-realistic ways to force the neural network to learn the essential features of the object of interest.

We have demonstrated that domain randomization (DR) is an effective technique to bridge the reality gap. Using synthetic DR data alone, we have trained a neural network to accomplish complex tasks like object detection with performance comparable to more labor-intensive (and therefore more expensive) datasets. By randomly perturbing the synthetic images during training, DR intentionally abandons photorealism to force the network to learn to focus on the relevant features. With fine-tuning on real images, we have shown that DR both outperforms more photorealistic datasets and improves upon results obtained using real data alone.
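As a rough picture of what randomizing simulator parameters means in practice, the sketch below draws a fresh, deliberately unrealistic scene configuration for every training image. The parameter names and ranges in `SceneParams` and `sample_scene` are assumptions made for illustration; a real pipeline would pass such values to its renderer.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    # Hypothetical simulator knobs; names and ranges are illustrative only.
    light_intensity: float
    camera_yaw_deg: float
    texture_id: int
    num_distractors: int

def sample_scene(rng, num_textures=1000):
    """Draw one randomized scene configuration, independently per image."""
    return SceneParams(
        light_intensity=rng.uniform(0.1, 5.0),
        camera_yaw_deg=rng.uniform(0.0, 360.0),
        texture_id=rng.randrange(num_textures),
        num_distractors=rng.randint(0, 10),
    )

rng = random.Random(0)
for _ in range(3):
    print(sample_scene(rng))
```

Sampling each parameter independently and over wide ranges is what discourages a network from relying on any single rendering detail.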

https://arxiv.org/pdf/1805.10561v1.pdf Adversarial Constraint Learning for Structured Prediction

Learning requires a black-box simulator of structured outputs, which generates valid labels, but need not model their corresponding inputs or the input-label relationship. At training time, we constrain the model to produce outputs that cannot be distinguished from simulated labels by adversarial training. Providing our framework with a small number of labeled inputs gives rise to a new semi-supervised structured prediction model; we evaluate this model on multiple tasks (tracking, pose estimation and time series prediction) and find that it achieves high accuracy with only a small number of labeled inputs. In some cases, no labels are required at all.

https://arxiv.org/abs/1809.01219v1 Graph-based Deep-Tree Recursive Neural Network (DTRNN) for Text Classification

The DTG method can generate a richer and more accurate representation for nodes (or vertices) in graphs. It adds flexibility in exploring the vertex neighborhood information to better reflect the second order proximity and homophily equivalence in a graph.

https://arxiv.org/abs/1811.11264v1 Synthesizing Tabular Data using Generative Adversarial Networks

Generative adversarial networks (GANs) implicitly learn the probability distribution of a dataset and can draw samples from the distribution. This paper presents Tabular GAN (TGAN), a generative adversarial network which can generate tabular data like medical or educational records. Using the power of deep neural networks, TGAN generates high-quality and fully synthetic tables while simultaneously generating discrete and continuous variables. When we evaluate our model on three datasets, we find that TGAN outperforms conventional statistical generative models in both capturing the correlation between columns and scaling up for large datasets.