Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

Industry and Case Study Paper CIKM’18, October 22-26, 2018, Torino, Italy

“Deep” Learning for Missing Value Imputation


in Tables with Non-Numerical Data
Felix Biessmann∗ , David Salinas∗ , Sebastian Schelter, Philipp Schmidt, Dustin Lange
Amazon Research
{biessman,dsalina,sseb,phschmid,langed}@amazon.de

ABSTRACT central data quality problem is missing data. For instance in retail
The success of applications that process data critically depends scenarios with a large product catalog, a product with an empty
on the quality of the ingested data. Completeness of a data source value for a product attribute is difficult to search for and is less
is essential in many cases. Yet, most missing value imputation ap- likely to be included in product recommendations.
proaches suffer from severe limitations. They are almost exclusively Many methods for missing data imputation were proposed in
restricted to numerical data, and they either offer only simple impu- various application contexts: simple approaches such as mean or
tation methods or are difficult to scale and maintain in production. mode imputation as implemented in most APIs for data wrangling
Here we present a robust and scalable approach to imputation that and Machine Learning (ML) pipelines (see footnote 1 for details with
extends to tables with non-numerical values, including unstruc- respect to the pandas and Spark libraries), matrix completion for
tured text data in diverse languages. Experiments on public data recommendation systems [16] or supervised learning approaches
sets as well as data sets sampled from a large product catalog in for social science applications [28, 30]. Most of these imputation
different languages (English and Japanese) demonstrate that the approaches, however, are either limited to small data sets or focus
proposed approach is both scalable and yields more accurate im- on imputation of numerical data from other numerical data. But in
putations than previous approaches. Training on data sets with many real-world scenarios the data types are mixed and contain
several million rows is a matter of minutes on a single machine. text. This kind of data is not easily amenable to imputation with
With a median imputation F1 score of 0.93 across a broad selection existing methods or software packages, as discussed in more detail
of data sets our approach achieves on average a 23-fold improve- in Section 2. In these cases the gap between a data source, con-
ment compared to mode imputation. While our system allows users taining unstructured text data or categorical data, and a data sink,
to apply state-of-the-art deep learning models if needed, we find often requiring complete data, needs to be bridged by custom code
that often simple linear n-gram models perform on par with deep to extract numerical features, feed the numerical values into an
learning methods at a much lower operational cost. The proposed imputation method and transform imputed numerical values back
method learns all parameters of the entire imputation pipeline au- into their non-numerical representation. Such custom code can be
tomatically in an end-to-end fashion, rendering it attractive as a difficult to maintain and imposes technical debt on the engineering
generic plugin both for engineers in charge of data pipelines where team in charge of a data pipeline [31].
data completeness is relevant, as well as for practitioners without Here we propose an imputation approach for tables with at-
expertise in machine learning who need to impute missing values tributes containing non-numerical data, including unstructured
in tables with non-numerical data. text and categorical data. To reduce the amount of custom feature
extraction glue code for making non-numerical data amenable to
ACM Reference Format: standard imputation methods, we designed a system that allows
Felix Biessmann, David Salinas, Sebastian Schelter, Philipp Schmidt, Dustin its users to combine and automatically select feature extractors for
Lange. 2018. ”Deep” Learning for Missing Value Imputation in Tables with categorical and sequential non-numerical data, leveraging state of
Non-Numerical Data. In The 27th ACM Int’l Conf. on Information and the art deep learning methods and efficient optimization tools. Our
Knowledge Management (CIKM’18), Oct. 22–26, 2018, Torino, Italy. ACM, work extends existing imputation methods with respect to three
NY, NY, USA, 9 pages. https://doi.org/10.1145/3269206.3272005 aspects. First in contrast to existing simple and scalable imputa-
tion approaches such as mode imputation, the system achieves on
1 INTRODUCTION average a 23-fold increase in imputation quality as measured by
The success of many applications that ingest data critically depends the F1-score. Second in contrast to more sophisticated approaches,
on the quality of the data processed by those applications [26]. A such as k-nearest-neighbor based methods and other established
approaches [10, 30], the proposed approach scales to large data sets
*these authors contributed equally. as demonstrated in experiments on tables with millions of rows,
Permission to make digital or hard copies of all or part of this work for personal or
which is up to four orders of magnitude more data than consid-
classroom use is granted without fee provided that copies are not made or distributed ered in aforementioned imputation studies. And third in contrast to
for profit or commercial advantage and that copies bear this notice and the full citation other scalable data cleaning approaches, such as HoloClean [27] or
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or NADEEF [9], the proposed approach can be easily automated and
republish, to post on servers or to redistribute to lists, requires prior specific permission does not require human input. The motivation for using machine
and/or a fee. Request permissions from [email protected]. learning and a scalable, automatable implementation is to enable
CIKM ’18, October 22–26, 2018, Torino, Italy
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
data engineers in charge of data pipelines to ensure completeness
ACM ISBN 978-1-4503-6014-2/18/10. . . $15.00 as well as correctness of a data source. Beyond this application
https://doi.org/10.1145/3269206.3272005

2017
distribution
specific
specificfeature of values
extraction
feature
123
125 125 in amapping
extraction columnmapping
122 can
(e.g.c (x be) c2
color,
k used
(xR
D[? ]. size,
brand,
k c) 2 is R computed,
D
is computed,
...). where
If the dataD where denotes
k types Dare not theknown
k denotes dimensionality
upfront,
the dimensi heuristi
distribution ofc.values ca column
inconsidered canthreebe useddifferent
[? ].
for a
forlatent
a variable
latent variable
126 associated 123
associated with column
with column We
Once the the non-numerical data is encoded into their respective numerical representation, a colum
124 126 c. We considered three types
different of featurizers
types of featu
specific For
(.). (.). categorical
For
feature categorical
127
extraction
125 127
c data we
c data use
124
mapping we aOnceone-hot
use the
(xa theencoded
one-hot
) 2
non-numerical
R Dencoded
isembedding dataembedding
computed, (as
is encoded
whereknown
into(as
D
theirfrom
known
denotes word
respective from
theembeddings).
numerical
word
dimensionalit
representa
embed
For For specificc featurek extraction mapping c is computed, where
c (xk ) 2 Rpossibilities
D k k denotes
): an the
for acolumns c with sequential data we
c.consider two different for cD(x
c
columns 128
c with sequential string
125
data we consider two different possibilities cfor
Industry and Case Study Paper latent variable
126 128 associated 126 withfor stringcolumn
a latent variable
CIKM’18, We considered
associated
October with22-26,columnthree
2018, c. Wedifferent
considered
Torino, types
Italy threeof featurizer
different c (x
types
n-gramn-gram representation129
representation or a character-based embedding using a Long-Short-Term-Memory (LSTM)
(.). For categorical127 129 c data or we127ause
character-based
ca(.).
one-hot
For categorical encoded embedding we use ausing
dataembedding one-hot (asaencoded
Long-Short-Term-Memory
knownembedding from word (as knownembeddingsfrom wor (L
recurrent neural network
130 [? ] TODO: For CITATION.
columns c withFor theFor
sequential character
string n-gram
data representation,
we consider tworepresentation, (x ) is
c
different cpossibilities
Forrecurrent
columns neural c with network
128 130 sequential [?
128
] TODO:
string dataCITATION.we consider the
two character
different n-gram
possibilities for (x ): a
a hashing function that maps each n-gram, where nor2a {1, . . . , 5}, inembedding x to c
. . . ,the 5},character
131 129 n-gram representation character-based using asequence
Long-Short-Term-Me c c c
a hashing
n-gram function or
representation
129 131 that a maps each n-gram,
character-based
recurrent neural embedding
network where [? ] nusing
TODO:2 {1, a inthe
Long-Short-Term-Memory
CITATION. For thecharacter
character n-gram sequence
(LSTM
representa
a D dimensional vector; here Dc denotes here the number of hash buckets. Note that the hashingc
132
130
a D dimensional
recurrent
featurizer neural network
130 132
is a stateless
c
vector;
[? ] 131
component
here
TODO: a hashing
Dcdoes denotes
CITATION.function here that For the
maps theeachnumber
character
n-gram, ofwhereas
hashnbuckets.
n-gram
where 2representation,
{1, . . . , 5},Note in thethat the
character
(x
of c)th
h
athat not require any training, the other two typesc Note
133
c
featurizer
afeature
hashing is
function a stateless
131 133 that maps component
each
132 c dimensional
Dn-gram, that does
where
Imputation vector;
not n of here
require
2 D
{1,
attribute c .denotes
any
. . ,
color 5}, here
training,
in the
the number
whereas
character of hash
the buckets.
other
sequence two x tyt
scenario, our system’s simple API (presented in Section 4) ismapsalso contain parameters
134 133 that are learned
featurizer is a stateless using backpropagation.
component that does not require any training, whereas the oth
beneficial for other use cases: users without any machine a Dfeature
dimensional
learning maps contain
132 134vector;parameters
c here 134 Dfeature that
c denotes maps
Product
are here learned
contain the
parametersusingthat
number backpropagation.
ofarehash learned buckets. Note that the hashin
using backpropagation.
135
133
experience or coding skills who want to run simple For the case
featurizer
classification is aofstateless
categorical componentembeddings, that does we not
Type use Description
a standard
require any linear Size embedding
training, whereas the
Color fed otherinto one twofully types o
135 For maps
connected thelayer.
case of categorical
The hyperparameter
135 embeddings,
For the case of categorical weusing use awas standard
embeddings, we linear embedding
use a standard fed
linear embedding into on
fed
136
134 feature contain parameters connected layer. The hyperparameter for this featurizer was a single both
that for
are this
Shoelearned featurizer
Ideal for running a single
backpropagation.
… 12UK oneBlack and used to set one and the used
experiments on non-numerical data can leverage the136 system connected
embedding
as long layer. TheTraining
dimensionality hyperparameter
136
as 137
well
Rows as the number
embedding for this
dimensionality featurizer
of hidden as well unitswas a single
ofnumber
the output one layer. andInofused the to setlayer
LSTM bo
… as8GB the of hidden units the output
137
as they can export their data from an Excel sheet135 and
138 137
For the
we case
embedding
specify
case, the ofdimensionality
featurize categorical
xc by iterating embeddings,
138 as an well
case, weas
LSTM
SDCards
we
the use
number
Best SDCard
through
featurize xc by athe standard
of
ever
hidden
sequence
iterating an LSTM linear
ofunits embedding
of the
Blue
characters
through of xfed
output
the sequence that
c of into
areone
layer.
characters In the
each full
of xL
i
target column. 136
139 138 connected
case, welayer.
represented The To-Be-Imputed
asfeaturize
continuous hyperparameter
by iterating
xc vector 139 Row represented
via anfor
a character
DressLSTMthis
as featurizer
continuous
This through
embedding.
yellow vector
dress …was
thevia
The acharacter
aM single ?embedding.
sequence
sequence one and
ofofcharacters
charactersused to
The sequencexcset
of xisci both
that
of
thenchara th
ar
In order to demonstrate the scalability and performance
137 139 embedding
represented
of our dimensionality
as continuous as140
well mapped
vector as the
via to a
a sequence
number
character ofof states
hidden
embedding.
h (
c, units
1), . . . , of
The
h the
(c,S )
and
sequence
c
we
output take the
layer.
of last state
In
characters theh LSTM
(c,S )
c
x c, m
140 mapped to a sequence 1. of statesStringh (
c, 1),
a fully . . . , h (c,S
connected layer
c )
and
Feature we
as the
Columns take the last state
featurization of x . The hyperparameters
Label h
c Column
(c,S c )
, mapped through
of each LSTM
acase, we featurize xc by iterating
141
approach, we present experiments on samples 138 of
141 a140
large mapped
fully connected
product to a sequence
layer
Representation
as the of featurization
states
142 thenan LSTM
h(the
c, 1), of. x
number . through
.c .,ofh
The (c,Scthe
layers,
)
and sequence
hyperparameters
the numberwe take of
of hidden ofcharacters
the last
unitsstate
each LSTM
of theh of(c,S
LSTM that
xcic )cell
, mapped
featurizer andare theeac
are dith
139
catalog and on public datasets extracted from Wikipedia.
142 141
represented
thena thefully
Innumber
the as continuous
connected of layers,layerthe vector
as143the
number via
charactersa character
featurization embedding
of hidden embedding.
ofunits c
. of
xci and The the number
LSTM The
thehyperparameters sequence
of hidden
cell and unitstheof characters
ofof each
the final
dimension LSTMfully of x c
is the
featuri
connected
the
144 (the LSTM featurizer.
experiments on the product catalog, we impute140 143 mapped
characters
missing
142 then
productto
the a number
sequenceof
embedding coflayers,
states
and the
Numerical hthe c, number
number 1), One. of
. .Hot hof(c,S
,hidden c)
hidden and units
units wethe
of take of
final the
the last
LSTM
Hot fully
state
Hot cell
Oneconnected
(c,Sc )
hPandas,and , the
outputmapped layerthroug
dimension of
2. i Representation

CPU
Character One
141 a fully
the LSTM connected
characters
144 attributes.
attribute values for a variety of product types and 143 Thefeaturizer. layer as the
embedding ci andfeaturization
145 Finally
the number all
Encoder of of
feature x .hidden
c
The chyperparameters
vectors
Sequence (x c
are
unitsEncoder
) concatenated
of the final Encoderoffully
into each
one Dask
Spark
LSTM
feature
or
connected vector featurizer
2 RD wh
x̃output ar
la
then theLSTMnumber of layers, the is the sumofover all latent dimensions c . We will referandto the numerical representatio
c number hidden units of theDLSTM cell the dimension
142 144
sizes of the data sets sampled from the product catalog arethe
between featurizer.
146
P of th
145 Finally
characters all feature
embedding vectors ci and c (x the) are
147 in concatenated
the
number
to-be-imputed intounits
column one feature
as y 2 {1, vector
2, . . . , D x̃y 2 R
}, as inD
where D
standard supervised
= layer learnin
Dc
143
146 is is
symbols ofof hidden
thewilltarget columns of the
use thesame finalencoding
fully connected
as the aforementioned output categorio
thethe sumone over all latent dimensions
several 1,000 rows and several million rows, which between 148 c Dc . We refer toorinto
the numerical representation of
D the values
144 145 Finally
LSTM all feature
featurizer. 3.
vectors c (x ) are concatenated LSTM one feature vector x̃ 2
MxNet R where D=
to four orders of magnitude larger than data sets147in 146in the
previous to-be-imputed
is the worksum over all column
latentasdimensions
Featurizers
149y 2After {1,the 2, .featurization
Embedding. .c,.DWe
D },will
y n-gramasx̃hashing
inofrefer
standard
input Embedding
toorthe supervised
feature columns
numerical learning
and the settings.
representation encoding yThe
ofofP the
the
150c column we can cast the imputation problem as a supervised D problem by learning toD p

CPU/GPU
on missing value imputation on numerical data145 [10].
148
147
Finally
symbols
We all
evaluateof feature
the targetvectors
in the to-be-imputed column columns c (x use
151
) are
the concatenated
same
as y 2 {1,of2,
distribution
encoding
y from
intoas one
the feature
aforementioned vector
. . . ,x̃.Dy }, as in standard supervised learning setting x̃ R
categorical
2 where variables.
D =
is the sum over
symbols of theall latent
target dimensions useD c . We willencoding
refer to the numerical …representation of the varia value
x̃ columns the same and as the aforementioned categorical
146
the imputations on product data with very different languages …
149 148 After the featurization 4. Latent ofrepresentation
input or feature columns the encoding p(y|x̃,y✓),of thethe to-be-imputed
yImputation
{1,of2,y.isas . then performed
}, as inby modeling probability over all obse
Feature Concatenation
147able
(English and Japanese), and find that our system is
150
intothe
column dealto-be-imputed
with
we can cast the column
imputation as
152
153
2
problem
classes
.a, D
given supervised
any input or feature
standard
problem vector
supervised
by x̃learning
with some
learning
to predict
learned
settings.
the
parameters label Th
✓. T
148 149 After of
symbols thethefeaturization
target columns x̃154of
use input
the same or
is feature
encoding
modeled, as columnsas the and the
aforementioned encoding
Imputation y of thevariables.
categorical to-be-im
different languages without any language-specific 151 distribution of y from x̃.
preprocessing, p(y|x̃, ✓)
150 column we can cast the imputation problem as a supervised problem
= softmax by[W learning to predict th
such as tokenization. In the Wikipedia experiments,
149
152 151
After the featurization
we impute
distribution
Imputation is then fromx̃ x̃.
of yperformed of input
by modeling or feature p(y|x̃, columns
✓), the and probabilitythe✓)encoding
p(y|x̃,
over allyDobserved of
x̃ +the b] to-be-impute
values or
missing infobox attributes from the article abstracts
150
153
column
for a number
classes ofwey can given castan the
Figureinputimputation
or155
1: where
feature
Imputation problem
✓ = (W,
vector as a are
x̃z, with
example b) supervised
parameters
some problem
to learn withby
learned
on non-numerical parameters Wlearning
2 R with
data
⇥D
✓. The
y
to
,b2 predict
RD y andthe
probability z is lab
a ve
of infobox properties. 151 152 Imputation
distribution
p(y|x̃, ✓) is modeled, of yis then
fromdeep x̃.performed
as learning;
156 all by
symbolsmodeling
parameters of all
explained
learned
p(y|x̃, column
✓), the
in Section probability
featurizers c . over
Finally, all
softmax(q) observed denotes val th
154 softmax function P exp q
where qj is the j 3. element of a vector q.
A sketch of the proposed approach is shown in153Figureclasses
1. We of y given an input 157
or feature vectorexpx̃q with some j learned parameters ✓. The prob
Imputation is then performed by modeling p(y|x̃, ✓),[Wthe +probability over all observed values o
j

✓) = softmax the cross-entropy loss between the(1)


152
will use the running example of imputing missing 154
color p(y|x̃, ✓) is modeled, as 158 p(y|x̃,
attributes The parameters ✓ are learned byx̃minimizing b] predic
153 classes of y given an input or 159
feature vector labels
and the observed x̃ with some
y, e.g. learnedDparameters ✓. The probabilit
by taking
for products in a retail catalog. Our system operates
155
154 whereon ✓
a =
table (W,
p(y|x̃, ✓) is modeled, as z, b) • are parameters
End-to-end to learn
p(y|x̃,
optimization with ✓) W = of2 R
softmax
Dy ⇥D
imputation ,[Wb 2 x̃ R +
model. b] and Ourz is a vector
system containing
N X Dy
all parameters
is the of all learned column featurizers X y
with non-numerical data, where the column to 156 be imputed learns numerical feature c . Finally,
representations softmax(q)automatically denotes and the
>iselementwise

color attribute and the columns considered as 157 input155 where


softmax
data ✓ = (W,
forfunction
the Pz, b) arewhere
exp q parameters p(y|x̃,
qj is the to jlearn
✓) =
element softmax ✓ = arg
withofWa vector [W min
2✓R q. x̃ D + y ⇥D
b] , b log(p(y|x̃,
2 RD y and
✓))
zonehot(y)
is a vector cont (1
readily
exp q applicable as a plugin in data pipelines 1 1that require com-
all parameters of all learned column featurizers . Finally, denotes the eleme
j
imputation are other product attribute columns155suchwhere as product
✓ = (W, z, b) are
j
parameters to learn with softmax(q)
3). loss, between Rytheand is aNvector containin
156 ⇥D
pleteness 160for wheredata p(y|x̃,
sources ✓)) RWD 2denotes
2(Section R D
c y the output
y b 2 of D
model z and is the number of r
description. The proposed approach then trains 158 157 Thesoftmax
parameters
learn-function✓ are learnedP exp qby 161minimizing
where
a value qwasj is the the cross-entropy
observed j element
in. the target ofcolumn
asoftmax(q)
vector q.thedenotes
corresponding predicted
to y. We distribution
use one-hot(y)
all
156a machine
159
parameters
and the observed labels
of all learned
j exp qcolumn
y, e.g. by taking
j featurizers c Finally, the elementwis
ing model for each to be imputed column that learns softmax
to predictfunction
the P exp q where qj WORK is the j element of a vector q.
The parameters 2✓j are RELATED
157
158 exp qlearned
j by minimizing
N X Dy the cross-entropy loss between 4 the predicted distri
observed values of the to be imputed column from 159 the remaining X
158 The and the
parameters observed✓ are labels
Missing
learned
✓ = y,
data
by
arg e.g. by
is a common
minimizing
min takingthe problem
cross-entropy
log(p(y|x̃, in✓)) statistics
>
loss
onehot(y) and hasthe
between becomepredicted distributio(2)
columns (or a subset thereof). Each input column of the system is
159 and the
fed into a featurizer component that processes sequential data (such observed labels
morey,important e.g. by
✓ taking with
1 the
1XX N increasing
Dy availability of data and the
popularity✓ of = data science. Methods for dealing
log(p(y|x̃, with
> missing data
casewhere
as unstructured text) or categorical data. In the 160 color ✓)) 2 RDy denotes
of the p(y|x̃, argoutput
the min
X N X Dofy the model and N ✓)) is theonehot(y)
number of rows for which
can be divided into ✓ the following categories [21]:
attribute, the to be imputed column is modeled161as aa value was observed in the
categorical ✓ =target arg min column corresponding 1 1log(p(y|x̃, to ✓))y. Weonehot(y)
> use one-hot(y) 2 {0, 1}Dy to (2
where p(y|x̃, ✓)) 2 R y denotes the
variable that is predicted from the concatenation of160all featurizer (1) Remove
D ✓ cases with
1 output 1 incomplete data
of the model and N is the number of rows for
a value was (2) Add dedicated missingcorresponding value symbol
outputs. 160 161 where p(y|x̃, ✓))observed
2 RDy denotes in the target
the output column of4the model and to N y. is Wethe number use one-hot(y) of rows for 2 {0, whic 1
The reason we propose this imputation model a value
161 is that in an was ex-observed (3) Impute missing values
in the target column corresponding to y. We use one-hot(y) 2 {0, 1}Dy t
tensive literature review we found that the topic of imputation for Approach (1) is also known as complete-case analysis and is the
non-numerical data beyond rule-based systems was not covered simplest approach to implement – yet 4 it has the decisive disadvan-
very well. There exists a lot of work on imputation [30], and on tage of excluding a large part of the data. Rows of a table are often 4
modeling non-numerical data, but to the best of our knowledge not complete, especially when dealing with heterogeneous data
there is no work on end-to-end systems that learn how to extract sources. Discarding an entire row of a table if just one column has
features and impute missing values on non-numerical data at scale. a missing value would often discard a substantial part of the data.
This paper aims at filling this gap by providing the following con- Approach (2) is also simple to implement as it essentially only
tributions: introduces a placeholder symbol for missing data. The resulting data
is consumed by downstream ML models as if there were no missing
• Scalable deep learning for imputation. We present an impu- values. This approach can be considered the de-facto standard in
tation approach that is based on state of the art deep learning many machine learning pipelines and often achieves competitive
models (Section 3). results, as the missingness of data can also convey information. If
• High precision imputations. In extensive experiments on pub- the application is really only focused on the final output of a model
lic and private real-world datasets, we compare our imputation and the goal is to make a pipeline survive missing data cases, then
approach against standard imputation baselines and observe up this approach is sensible.
to 100-fold improvements of imputation quality (Section 6). Finally, approach (3) replaces missing values with substitute
• Language-agnostic text feature extraction. Our approach op- values (also known as imputation). In many application scenarios
erates on the character level and can impute with high precision including the one considered in this work, users are interested in
and recall independent of the language present in a data source imputing these missing values. For instance, when browsing for a
(Section 5, Section 6). product, a customer might refine a search by specific queries (say

2018
Industry and Case Study Paper CIKM’18, October 22-26, 2018, Torino, Italy

for size or color filters). Such functionalities require the missing impute missing values already when new products enter the catalog.
values of the respective attributes to be imputed. The simplest Ideally, we want to leverage information about products that are
imputation methods fill in the same value for all missing data cases already in the catalog. While this use case – imputation for new
for a given column of a table, similar to approach (2) outlined above. data – can be tackled with matrix factorization type methods and
Examples are mean imputation (for continuous numerical data) there are a number of solutions in the cold start recommender sys-
or mode imputation (for non-numerical data), where the mean or tems literature, it is much simpler to implement with the approach
mode of the respective attribute distribution is imputed. These presented here: as we do not learn a latent representation for each
imputation methods are implemented in most libraries and APIs row index of a table and use only the content of observed values
that offer imputation methods1 . To the best of our knowledge, few in the input columns for the imputations, our approach naturally
software packages for data wrangling/pipelining go beyond this lends itself to ingesting new data from rows that were not in the
limited functionality and integrate with those well established tools. set of training data.
While these approaches suffice for replacing missing data in order Another line of missing value imputation research in the data-
to not make a data pipeline fail, they are not useful for actual base community is using rule based systems, such as NADEEF [9],
imputation of missing data since their precision and recall levels which efficiently applies user-specified rules for the detection and
are rather low as demonstrated by our experiments in Section 6. repairing of quality constraint violations. Such rule based systems
For more sophisticated approaches to imputation, there is a sub- can achieve high precision for imputation, but this often requires
stantial body of literature using both supervised as well as un- a domain expert in the loop to generate and maintain the set of
supervised methods [30]. One example from supervised learning rules to apply. In our approach, we leverage machine learning to
is Hot Deck Imputation, which refers to the idea of using similar allow for automatic high precision imputations. Yet another line of
entries of a table as source for finding substitute values [2]. Replace- research is the direction of active learning. Cleaning real world data
ment values can be found using k-nearest neighbors [3]. Another sets requires ground truth data, which is most easily obtained from
approach leveraging supervised learning techniques is Multivari- human annotators. If there is a human in the loop, active learn-
ate imputation by chained equations (MICE) [28]. The idea is to ing methods allow us to select and prioritize human effort [34].
learn a supervised model on all but one column of a table and use ActiveClean uses active learning for prioritization and learns and
that model to impute the missing values. This approach has been updates a convex loss model at the same time [17]. HoloClean gen-
adopted in many other studies, see [10]. erates a probabilistic model over a dataset that combines integrity
Another line of research focuses on unsupervised methods for constraints and external data sources to generate data repair sug-
imputation, such as matrix factorization [16, 22, 33]. Building on gestions [27]. While such research is important if humans are in
research for recommender systems, matrix factorization techniques the loop, we focus on approaches that require as little human inter-
were improved with respect to stability, speed and accuracy of im- vention as possible. The solution presented in this work is however
putations. However, not all use cases naturally lend themselves to easily extendable to batch based active learning scenarios, which
a matrix completion model. For instance, when dealing with tables we consider a future research direction. Finally, a recent line of
containing multiple columns with free text data, it is not obvious work similar to our approach is [11], where the authors follow the
how to apply these methods. Text data needs to be vectorized before idea of MICE [28], but propose to leverage deep learning to impute
it can be cast into a matrix or tensor completion problem, which numerical values jointly (several columns at a time). Similar to the
often discards valuable information, such as the order of tokens previously mentioned approaches, this work only considers numer-
in a text. Another drawback of these methods is that they solve a ical values on small data sets. In addition, the evaluation metric
more difficult problem than the one we are actually interested in: in in this study is very different from ours, as the authors evaluate
many cases, we are merely interested in the imputation of a single the error on a downstream classification or regression task, which
cell, not the entire row of a table; learning a model that only tries to renders comparisons with our results difficult.
impute one column can be much faster and cheaper than learning a To summarize, to the best of our knowledge, there are few ma-
model for the entire table. The most important drawback of matrix chine learning based solutions focusing on scalable missing value
factorization methods for the application scenario we consider is imputation in tables with non-numerical data. A lot of the research
however that it is not straightforward to obtain imputations for in the field of imputation originates from the social sciences [28]
new rows that were not present in the training data table. This is or the life sciences [30], targeting tables with only dozens or hun-
because matrix factorization methods approximate each matrix en- dreds of rows. Next to the scalability issues, most of the existing
try as an inner product of some latent vectors. These latent vectors approaches assume that the data comes in matrix form and each col-
are learned during training on the available training data. Hence, umn is numeric. In contrast to these approaches, our work focuses
for rows that were not amongst the training data, there is no latent on imputation for large data with non-numerical data types, with
representation in the matrix factorization model. Computing such an emphasis on extensibility to more heterogeneous data types.
a latent vector for new rows of a table can be too costly at predic-
tion time. One use case we are investigating in this work is that of
3 IMPUTATION MODEL
product attribute imputation in a large product catalog. In such a
scenario, an important requirement is to be able to ingest new data. In this section, we describe our proposed model for imputation. The
New data should be as complete as possible, so we would want to overall goal of the model is to obtain a probability estimate of the
likelihood of all potential values of an attribute or column, given an
1 DataFrame.fillna in pandas/Python and ml.feature.Imputer in Spark/Scala imputation model and information extracted from other columns.

2019
Industry and Case Study Paper CIKM’18, October 22-26, 2018, Torino, Italy

For illustration, we work with the example use case presented in missing symbol, hence there are Mc + 1 values that x c can take. For
Figure 1: given a product catalog where some product attributes, notational simplicity, these scalar variables will be denoted as vector
say color of a product, are missing for some products, we want to xc in the following. We chose to base the indexing on histograms in
model the likelihood of all colors that could possibly be imputed. So order to retain the information on the symbol frequency and in or-
in the example in Figure 1, we would like to estimate the likelihood der to be able to discard too infrequent symbols more easily. For se-
for a product to have the color yellow given the information of all quential data, the numerical representation xc ∈ {0, 1, 2, . . . , Ac }Sc
other columns for this row/product as well as the trained imputation is a vector of length Sc , where Sc denotes the length of the sequence
model: or string in column c and Ac denotes the size of the set of all char-
acters observed in column c. Also here we include an additional
p(color=yellow | other columns, imputation model) (1) missing symbol that increases the number of possible symbols to
As the product description for this particular product contained the Ac + 1. The data types are determined using heuristics. In the data
word ’yellow’, we would expect the likelihood for this color value sets used in the experiments, the data types of the columns are easy
to be high. Generally, we would always predict the likelihood for all to separate into free text fields (product description, bullet
of the possible values that an attribute can take and then take the points, item name) and categorical variables (e.g. color, brand,
value that has the highest likelihood as the imputation for this value. size, . . . ). If the data types are not known upfront, heuristics
In practice it can be useful to tune the model for each potential based on the distribution of values in a column can be used for type
value to only make a prediction if the model achieves a certain detection.
precision or recall required by an application. All parameters and
their optimization are explained in the following sections, but the Feature extraction. Machine learning models often produce inac-
high level overview over the approach can be subdivided into four curate predictions if the feature representation is not optimized –
separate stages also indicated in Figure 1: and on the other hand, very simple machine learning models can
perform surprisingly well when the appropriate features are ex-
(1) String representation: In this stage, we separate the columns
tracted from the data prior to model training and prediction. Here
into input/feature and to-be-imputed/target columns. Data is
we employ state-of-the-art methods from deep learning as well
still in their textual representation. All rows that have an ob-
as simple, but highly performant established feature extractors to
served value are considered for training data (and validation or
derive useful features from the numerical representation of the
testing). Rows with missing values are considered for imputa-
data. Once the non-numerical data is encoded into their respec-
tion.
tive numerical representation, a column-specific feature extraction
(2) Numerical representation: In order to train machine learning
mapping ϕc (xc ) ∈ RD c is computed, where Dc denotes the dimen-
methods for imputation, we need to first create a numerical
sionality for a latent variable associated with column c. We consider
representation of input and target columns. Depending on the
three different types of featurizers ϕc (·):
type of data, we either model columns as categorical variables
or sequential variables (such as free text fields). • Categorical variables:
(3) Feature representation: The quality of predictions of ma- – One-hot encoded embeddings
chine learning models depends critically on the feature repre- • Sequential variables:
sentation used. We build on a large body of work on embeddings – Hashed character n-grams
for categorical and sequential data and use learnable feature – Long short-term memory neural networks
representations. For one-hot encoded categorical data we define a featurizer as an
(4) Imputation (Equation 3): We finally compute the likelihood embedding layer (as in word embeddings [24] or matrix factor-
of all potential values from the concatenation of all extracted ization [16]) that is fed into a single fully connected layer. The
features (Equation 3). hyperparameter for this featurizer is used to set both the embed-
In the following, we explain all of these stages in detail. Through- ding dimensionality as well as the number of hidden units of the
out the section, we use the index c ∈ {0, 1, 2, . . . , C} to refer to input output layer. For columns c with sequential string data, we con-
or feature columns/attributes, either as superscript for vectors (in- sider two different possibilities for ϕc (xc ): an n-gram represen-
dicated by boldface font) or subscript for functions. We omit row tation or a character-based embedding using a long short-term
indices to keep notation simple. When we mention input data/fea- memory (LSTM) recurrent neural network [13]. For the character
tures or target variables, we always refer to a single row without a n-gram representation, ϕc (xc ) is a hashing function that maps each
row index. n-gram, with n ∈ {1, . . . , 5}, in the character sequence xc to a Dc di-
mensional vector; here Dc denotes here the number of hash buckets.
Numerical encoding. In order to make the data amenable to ma- In the LSTM case, we featurize xc by iterating an LSTM through the
chine learning models, the first step in the model is to transform sequence of characters of xc that are each represented as a contin-
the string data of each column c for each row into a numerical uous vector via a character embedding. The sequence of characters
representation xc . We use different encoders for different non- xc is then mapped to a sequence of states h (c,1) , . . . , h (c,Sc ) ; we
numerical data types and distinguish between categorical and se- take the last state h (c,Sc ) , mapped through a fully connected layer
quential data. For categorical data, the numerical representation as the featurization of xc . The hyperparameters of each LSTM fea-
x c ∈ {1, 2, . . . , Mc } is the index of the value in the histogram of turizer include the number of layers, the number of hidden units
size Mc computed on column c; note that we include an additional of the LSTM cell, the dimension of the character embedding c, and

2020
Industry and Case Study Paper CIKM’18, October 22-26, 2018, Torino, Italy

the number of hidden units of the final fully connected output An important advantage of a system that automatically tunes
layer of the LSTM featurizer. Note that the hashing featurizer is a its parameters is that we can keep its interface simple and enable
stateless component that does not require any training, whereas practitioners without an ML background to use it. The API we
the other two types of feature maps contain parameters that are designed allows to impute missing values by just passing a table as
learned using backpropagation in an end-to-end fashion along with a pandas DataFrame to the imputation model and specifying the to
all other model parameters. Finally, all feature vectors ϕc (xc ) are be imputed column and input columns, as shown in the Python code
concatenated into one feature vector in Listing 1. All (hyper-)parameters are derived from the data and
learned automatically. For data type detection, we use heuristics;
x̃ = [ϕ 1 (x1 ), ϕ 2 (x2 ), . . . , ϕC (xC )] ∈ RD (2) for the differentiable loss functions of the entire imputation model,
where D = Dc is the sum over all latent dimensions Dc . As is
P we use backpropagation and stochastic gradient descent; and for
common in the machine learning literature, we refer to the numeri- hyperparameter optimization on non-differentiable loss functions
cal representation of the values in the to-be-imputed target column (as for instance the model architecture parameters such as number
as y ∈ {1, 2, . . . , Dy }. of hidden units of an LSTM), we apply grid search (alternatives are
random search or bayesian global optimization techniques).
Imputation model. After extracting the features x̃ of input columns # load training and test tables
table = pandas . read_csv ( ' products . csv ')
and the observed values y of the to be imputed column we cast the missing = table [ table [ ' color ' ]. isnull ()]
imputation problem as a supervised learning problem by learning to
predict the label distribution of y from x̃. Our imputation approach # instantiate and train imputer
model = Imputer (
models p(y| x̃,θθ ), the Dy -dimensional probability vector over all data_columns =[ ' description ', ' product_type ' , ' brand '] ,
possible values in the to be imputed column conditioned on some label_columns =[ ' color '])
. fit ( table )
learned model parameters θ and an input vector x̃ (containing in-
formation from other columns) with a standard logistic regression # impute missing values
type output layer imputed = model . transform ( missing )

Listing 1: Example of Python imputation API.


p(y| x̃,θθ ) = softmax [Wx̃ + b] (3)
We perform all encoding steps using custom Python code and stan-
where the learned parameters θ = (W, z, b) include the learned pa- dard libraries; for representing the table data we apply pandas,
rameters of the output layer (W, b) and z, comprising all parameters for the hashing vectorizer on character n-grams, we leverage the
of the learned column featurizers ϕc . Finally, softmax(q) denotes HashingVectorizer of scikit-learn [25]. We implement the
exp q
the element-wise softmax function P exp qj where qj is the j-th modeling steps performed after the numerical encoding of the
j
element of a vector q. The parameters θ are learned by minimizing data in Apache MXNet[7]. The featurization (except for the char-
the cross-entropy loss between the predicted distribution and the acter n-gram representation, which is passed to the network as a
observed labels y by computing sparse vector), is set up using the Symbolic Python API. We employ
the standard GPU-optimized version of MXNet for the LSTM-based
N
featurizations.
X
θ = arg min −log(p(y| x̃,θθ )) ⊤ onehot(y) (4)
θ 1
5 EXPERIMENTS
where log denotes element-wise logarithm and the sum runs over
We ran experiments on a large sample of a product catalog and
N rows for which a value was observed in the target column cor-
on public Wikipedia datasets. For both datasets, we impute a vari-
responding to y. We use onehot(y) ∈ {0, 1} Dy to denote a one-hot ety of different attributes. As the number of valid attribute values
encoding of the label y, which is a vector of zeros and a single present in a given to be imputed column of the training data has a
one in the entry k corresponding to the class index encoded by strong impact on the task difficulty, we applied filters to make sure
y. We apply standard backpropagation and stochastic gradient de- that the results are comparable across product types and attributes.
scent (SGD) [6] in order to optimize all parameters, including those We included only attribute values that were observed at least 100
of the featurization, in an end-to-end fashion. Training the model times (for a product type) or at least once (in the smaller Wikipedia
with SGD is very memory efficient, as it requires us to only store data set) and considered only the 100 most frequent attribute values.
one mini-batch of data at a time in memory, which typically con-
sists of a few hundred rows of a table. The approach thus easily Product attributes. We used samples of a large product catalog
scales to tables with millions of rows. in different languages to demonstrate the ability of our system to
reliably impute data in non-numerical tables independent of the lan-
4 IMPLEMENTATION AND API guage of the text in a table. In our experiments, we trained models
Building a real world machine learning application such as an end- for imputing a set of attributes for product types listed in Table 2
to-end imputation system poses not only algorithmic challenges, for English and Japanese product data. Note that these two lan-
but also requires careful thinking about the system design. The guages have very different alphabets and usually require language-
goal of our work is to free users of our system from the need of specific preprocessing. An example of such a language-specific
feature engineering. We use machine learning not only to learn the step would be tokenization, which can be difficult for some lan-
imputation model, but also to learn an optimal feature extraction. guages, including Japanese. We did not apply any language-specific

2021
Industry and Case Study Paper CIKM’18, October 22-26, 2018, Torino, Italy

preprocessing in our experiments and used the same imputation shell running Scala/Spark on a single host (36 vCPU, 60 GB RAM).
models and parameter sets for both languages. For each product The reason the experiments were performed on different hardware
type, we extracted all products that matched the language and the is that for the Spark experiments we did not leverage GPUs.
product type. As input columns, we used both columns containing
unstructured text data (title, product description, bullet
points) as well as columns containing categorical variables (e.g., Hyperparameter Range Best value(s)
brand, manufacturer, size, display technology). The cardinal- LSTM layers [2,4] 2
ity of the character set for sequential data was set to 100 (for English) LSTM hidden units [10, 150] {100, 120}
Dimensionality of LSTM output [10, 150] 100
and to 1000 (for Japanese); 1000 characters covered most Japanese Dimensionality of LSTM character embedding [10, 100] 50
letters and some Chinese symbols. The number of rows in the tables Dimensionality of hashing vectorizer output [210, 220 ] {210, 215, 218 }
Dimensionality of embeddings for categorical variables [10, 50] 10
sampled for these experiments was between 10,000 and 5,000,000. SGD learning rate [10−5, 10−1 ] {0.001, 0.008}
Weight Decay/L 2 Regularization [0, 10−2 ] {0.0001, 0}
Wikipedia. In addition to the product catalog data sample, we
extracted attributes found in the infoboxes of Wikipedia articles. Table 1: Ranges and optimal (for a given model/data set) hy-
The data is publicly available as part of the DBpedia project. For perparameters for model selection.
our experiments, we have used the 2016-10 version2 . DBpedia pro-
vides the extracted Wikipedia graph in the turtle format, where
each row consists of triplets to describe subject, predicate,
object. We have mapped the textual (long abstracts) descriptions Baseline methods. For comparison, we added two baseline meth-
of subjects to their corresponding infobox objects for birth_place, ods. The first baseline is a simple mode imputation, which always
genre and location. These predicates are most commonly found predicts the most frequent value of a column. The second baseline
in the infoboxes of Wikipedia articles. In many cases, each subject is a rule-based string matching approach that predicts the label that
may be related to several genre objects, e.g., by relating a band had most string matches in the input columns, similar to rule-based
to multiple genres. In order to transform the DBpedia data into a imputation engines, such as the approach presented in [9].
multi-class dataset, where each training instance has exactly one
label associated to it, we have excluded all of the Wikipedia articles 6 RESULTS
with multiple objects per subject. The number of rows in the result- We performed extensive evaluations on Wikipedia data sets and
ing tables were 129,729 for location, 333,106 for birthplace and on product attribute data for several product types and attributes
170,500 for genre. sampled from a large product catalog. Methods are compared with
respect to imputation quality, as measured by F1 scores weighted
Experimental settings. After extracting the string data tables, we by class frequency, as well as with respect to their operational cost.
selected featurizers for each column depending on the data type,
as described in Section 3. In our experiments, we used one LSTM Product attribute results. Results for the imputation tasks for a
per free text column in the case of LSTM featurizers; in the case of number of product types and various product attributes are listed
the n-gram featurizer we concatenated and hashed the texts of all in Table 2. Our proposed approach reaches a median F1 score of
columns into one feature vector using the same hashing function. 92.8% when using LSTM-based featurizers and a median F1 score
The LSTM hyperparameters were kept the same for the featurizers of 93% for a linear model with an n-gram featurizer. Both clearly
of all columns. For all sequential features, we applied a sequence outperform the baselines mode imputation (median F1 4.1%) and
length of 300 based on a heuristic using the length histograms of string matching of the label to the free form text columns (median
representative data. Both types of sequential featurizers were com- F1 30.1%). We argue that in the case we are considering, mode im-
bined with the categorical embedding featurizers for all categorical putation can be considered the de-facto standard for imputation.
columns in the data set, excluding the to be imputed column. For one it is implemented in popular libraries for data pipelines
We ran grid search for hyperparameter optimization. Next to the (footnote 1), hence it is the most accessible option for data engineers
model hyperparameters described in Section 3 we also optimized an working in these frameworks. Second, while there are a number
L 2 norm regularizer using weight decay. An overview of the hyper- of open source packages for imputation in Python (e.g. MIDAS,
parameters optimized can be found in Table 1. For all experiments, fancyimpute) and R (MICE), none of those packages address the
we split the available data into a 80%, 10%, 10% split for training, use case we are considering: all of those existing packages work
validation and test data, respectively. All metrics reported are com- on matrices containing only numeric data. In contrast we are con-
puted on test data which was not used for training or validation. All sidering unstructured data like text as additional input. This use
experiments were run on a single GPU instance (1 GPU with 12GB case is not accounted for in existing packages, to the best of our
VRAM, 4 vCPUs with 60GB RAM)3 . Training was performed with a knowledge. Compared to mode imputation, we see an up to 100-fold
batch size of 128 and Adam SGD [15] for a maximum of 50 epochs improvement in imputation F1 score (median 23-fold improvement)
and early stopping if the loss does not improve for 3 consecutive with our proposed approach. The string matching method gives
epochs. The two baseline approaches were performed in a Spark an F1 score close to the best performing models only in rare cases
2 http://wiki.dbpedia.org/downloads-2016-10
where the attribute value is usually included in the article name,
3A single virtual CPU or vCPU on the AWS EC2 cloud service is a single hyperthread such as for the brand of shoes.
and approximately equivalent to half a physical CPU.

2022
Industry and Case Study Paper CIKM’18, October 22-26, 2018, Torino, Italy

Dataset Attribute Mode String matching LSTM N-gram Dataset Attribute Mode String matching LSTM N-gram
brand 0.4% 80.2% 99.9% 99.8% birth place 0.3% 16.3% 54.1% 60.2%
dress, EN manufacturer 1.0% 22.6% 99.0% 99.6% Wikipedia, English genre 1.5% 6.4% 43.2% 72.4%
size 3.7% 0.1% 77.4% 74.4% location 0.7% 7.5% 41.8% 60.0%
brand 12.4% 41.6% 93.5% 88.4% Median 0.7% 7.5% 43.2% 60.2%
monitor, EN display 27.6% 12.2% 90.0% 90.2%
manufacturer 13.8% 30.6% 91.2% 86.9%
Table 3: F1 scores on held-out data for imputation task Wi-
brand 3.8% 47.9% 98.7% 97.8%
notebook, EN cpu 80.0% 85.1% 95.6% 96.7% kipedia. See Table 2 for a description of columns.
manufacturer 4.0% 33.4% 92.8% 93.0%
brand 0.5% 91.4% 99.8% 99.9%
manufacturer 0.5% 77.9% 97.1% 98.3%
shoes, EN
size 1.2% 0.0% 54.8% 45.3%
toe style 12.1% 21.7% 89.1% 92.3% compare favourably with more sophisticated neural network ar-
brand 2.6% 19.2% 98.4% 99.6%
chitectures [12, 14]. The considerably faster training for the linear
color 16.8% 48.1% 78.0% 82.5% model is an important factor for production settings. We therefore
shoes, JP
size 51.1% 1.7% 66.6% 66.1% compare the operational cost and training speed in the following
style 57.6% 12.6% 87.0% 94.0%
section.
Median 4.1% 30.1% 92.8% 93.0%
Operational cost comparison. One application scenario of the
Table 2: F1 scores on held-out data for imputation task proposed method is automatic imputations in data pipelines to en-
product attributes for mode imputation, string matching, sure completeness of data sources. In this setting, operational cost
LSTM and character n-gram featurizers. For each attribute, imposed by the memory footprint of a model and the training time
between 10,000 and 5,000,000 products were sampled. In- can be an important factor. We compare the models used in our
dependent of the featurizers used, LSTMs or n-grams, our experiments with respect to these factors. The size and training
imputation approach outperforms both baselines in terms speed of the models depends on the model selection process; we
of F1 score, achieving on average a 23-fold increase (com- measured model size in MB and sample throughput in samples per
pared to mode imputation) and a 3-fold increase (compared seconds during training for the models with the highest validation
to string matching). score. The model size in MB for n-gram models is 0.4/13.1/104.9
MB (5th/50th/95th percentile) and the model size for LSTM based
imputation models is 18.6/37.9/45.7 MB. Depending on the best
performing hyperparameter setting for a given data set, there are
N-Gram models vs. LSTM. In many experiments, we achieve high scores with both the deep learning LSTM model and the n-gram methods. The linear n-gram model often achieves competitive results: only in six out of 20 cases did the LSTM clearly perform better than the linear model. One reason could be that most of the tasks are too simple for an LSTM to achieve much higher performance; we assume that the advantage of the LSTM will become clearer on more difficult imputation problems. However, our results are in line with recent work that finds simple n-gram models to compare favourably with more sophisticated neural network architectures [12, 14]. The considerably faster training of the linear model is an important factor for production settings. We therefore compare the operational cost and training speed in the following section.

Operational cost comparison. One application scenario of the proposed method is automatic imputation in data pipelines to ensure completeness of data sources. In this setting, the operational cost imposed by the memory footprint of a model and by the training time can be an important factor. We compare the models used in our experiments with respect to these factors. The size and training speed of the models depend on the model selection process; we measured model size in MB and sample throughput in samples per second during training for the models with the highest validation score. The model size for n-gram models is 0.4/13.1/104.9 MB (5th/50th/95th percentile), and the model size for LSTM based imputation models is 18.6/37.9/45.7 MB. Depending on the best performing hyperparameter setting for a given data set, some n-gram models are much smaller or much larger than the average LSTM model, but the median model size of n-gram models is about three times smaller than that of LSTM models. Sample throughput during training was one to two orders of magnitude larger for imputation models using only character n-gram features (1,079/11,648/77,488 samples per second) than for deep learning based LSTM featurizers (107/290/994 samples per second). Assuming a sample throughput of 11,000 samples per second, one training pass through a table with 1,000,000 rows takes less than 90 seconds. For a data set of this size, typically fewer than 10 passes through the data are needed for training to converge.
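These throughput figures translate directly into training-time estimates; the following minimal sketch reproduces the back-of-the-envelope calculation above, using the median throughputs reported in this section.

    # Estimated time for one training pass (epoch) over a table with one
    # million rows, based on the median sample throughputs reported above.
    rows = 1_000_000
    median_throughput = {"character n-gram": 11_648, "LSTM": 290}  # samples per second

    for featurizer, samples_per_second in median_throughput.items():
        seconds_per_pass = rows / samples_per_second
        print(f"{featurizer}: ~{seconds_per_pass:,.0f} s per pass over {rows:,} rows")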
7 LESSONS LEARNED
When we set out to create a system for imputation on non-numerical data, we faced the question of choosing appropriate algorithms and execution platforms. Many of our initial decisions turned out to be suboptimal in one way or another. In this section, we describe some of the lessons we learned along the way.

Choice of imputation method and feature extractors. The first challenge was to decide which imputation method to use and which featurization method should precede it. As highlighted in Section 2, there seems to be a gap between the methods used in practice by data engineers and machine learning practitioners (simple approaches such as mode imputation) and the mathematically more sophisticated matrix factorization approaches, which are less often encountered in practice.
There are several reasons why we decided not to follow the research on matrix factorization for imputation and opted for the approach presented in this work. For one, the current approach is much simpler to implement, to extend, and to adapt to new scenarios and data types. For example, for image data, we can add an off-the-shelf pretrained neural network [18] and fine-tune it along with all other parameters of the imputation model. Secondly, the approach presented here is much cheaper to train. Matrix factorization models obtain an advantage over other methods by modeling latent relationships between all observed aspects of the data. In our approach, however, we only learn to impute one column at a time, which can be more efficient than modeling the entire table.

Another challenge we faced was the question of how to model non-numerical data. We ran extensive studies on different types of feature extractors, including linguistic knowledge, word embeddings, and also other types of sequential feature extractors, such as convolutional neural networks [20]. The main conclusion from those experiments is the same as the one we draw from the experiments presented in this work: in practice, many popular deep learning approaches did not outperform rather simple feature extractors. The systematic comparison in this study demonstrates that sparse linear models with hashed character n-gram features achieve state-of-the-art results on some tasks when compared directly to deep learning methods, similar to the findings in [14]. Such models are much faster during training and prediction, work well on CPUs, and require less memory. Yet, we emphasize that this could be related to the data sets we tested on; we hypothesize that in more complicated settings, LSTMs are more likely to produce better results than linear n-gram models.
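For illustration, a sparse linear model over hashed character n-gram features of the kind discussed above can be assembled in a few lines with scikit-learn [25]. This is only a sketch of the general technique, not the MXNet-based implementation used in our system, and the toy column names are hypothetical.

    import pandas as pd
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    # Toy table: impute the categorical 'color' attribute from a free-text column.
    df = pd.DataFrame({
        "title": ["red running shoe size 42", "blue leather boot size 39",
                  "red canvas sneaker size 44", "black leather boot size 41"],
        "color": ["red", "blue", "red", "black"],
    })

    # Hashed character n-grams (no vocabulary to store) feeding a sparse
    # linear classifier trained with stochastic gradient descent.
    model = make_pipeline(
        HashingVectorizer(analyzer="char_wb", ngram_range=(1, 5), n_features=2**18),
        SGDClassifier(),
    )
    model.fit(df["title"], df["color"])
    print(model.predict(["green leather boot size 40"]))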
System-specific challenges. As of today, there are no off-the-shelf solutions available for complex end-to-end machine learning deployments, and many data management related questions from the ML space are only beginning to attract the attention of the database community [19, 26, 31]. In practice, a wide variety of systems is applied for large-scale ML, with different advantages and drawbacks. These systems range from general-purpose distributed dataflow systems such as Apache Spark [35], which support complex preprocessing operations but are difficult to program for ML practitioners with a background in statistics or mathematics, to specialized deep learning engines such as Apache MXNet [8] or Google’s Tensorflow [1], which provide mathematical operators optimized for different hardware platforms but lack support for relational operations.

We started with an imputation approach built on a distributed dataflow system, in particular the SparkML [23] API. We designed an API on top of DataFrames, which allowed us to quickly build and try different featurizer and imputation model combinations. Spark turned out to be very helpful for large-scale data wrangling and preprocessing workloads consisting of relational operations mixed with UDFs, but is in our experience very difficult to use for complex machine learning tasks [5]. The complexity of the data (matrices and vectors) and of the operations to apply forced us to base our implementation on the low-level RDD API rather than the SparkSQL API, which would provide automatic query optimization. Programs on the RDD level represent hardcoded physical execution plans that are usually tailored to run robustly in production setups, and therefore naturally result in huge overheads when run on smaller data. Additionally, difficult choices about the materialization of intermediate results are entirely left to the user [29].

Next, we leveraged a recently developed deep learning system [8] for the imputation problem, which allows us to quickly design models and optimize them efficiently (even for non-neural-network models). This is due to dedicated mathematical operators, support for automatic differentiation, and out-of-the-box efficient model training with standard optimization algorithms on a variety of hardware platforms. A major obstacle in leveraging deep learning frameworks is the integration with Spark-based preprocessing pipelines. Deep learning toolboxes are typically used through their Python bindings, and while Python offers a great ecosystem for data analysis, we mostly aim to run preprocessing and feature extraction workloads on the type-safe and stable Spark/JVM platform. In order to keep the best of both worlds, we used a hybrid system that extracted features in Spark and communicated with MXNet via a custom disk-based serialization format. In practice, this system turned out to be difficult to use, as debugging often required us to dig through several layers of stack traces [31]. We had to set up two runtimes and tune their configurations (which can be especially difficult for Spark’s memory settings). Furthermore, experimentation was not simple, as the feature extraction step was rather involved and the required materialization of preprocessed features made it tedious to quickly try out different features or data sets. Finally, it is challenging to efficiently schedule the resulting workloads of such hybrid systems, as the results of the Spark-based preprocessing jobs, executed on clusters of commodity machines, need to be transferred to specialized GPU instances for the training of deep learning models.
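To illustrate the programming model such a framework offers (automatic differentiation and a ready-made training loop), the following is a minimal MXNet Gluon sketch of a single training step for a small classifier. It is a toy example with random stand-in data, not the architecture used in our system.

    import numpy as np
    import mxnet as mx
    from mxnet import autograd, gluon, nd

    # Small feed-forward classifier standing in for an imputation model head.
    net = gluon.nn.Sequential()
    net.add(gluon.nn.Dense(64, activation="relu"))
    net.add(gluon.nn.Dense(5))  # e.g. five possible attribute values
    net.initialize(mx.init.Xavier())

    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), "adam", {"learning_rate": 1e-3})

    # Random stand-in features and labels for one mini-batch.
    X = nd.array(np.random.rand(32, 100).astype("float32"))
    y = nd.array(np.random.randint(0, 5, size=32))

    with autograd.record():      # record operations for automatic differentiation
        loss = loss_fn(net(X), y)
    loss.backward()              # back-propagate through the recorded graph
    trainer.step(batch_size=32)  # one optimizer update
    print(loss.mean().asscalar())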
8 CONCLUSION
We have presented an approach to missing value imputation for tables containing non-numerical attributes using deep learning. The goal was to bridge the gap between existing imputation methods, which are primarily targeted at imputation of numerical data, and application scenarios where data comes in non-numerical tables. Our approach defines a simple imputation API over tables with non-numerical attributes that expects only the names of the to-be-imputed columns and the names of the columns used for imputation. Automatic hyperparameter optimization is used to determine the optimal combination of featurizer modules. If the data and its schema are not known, heuristics can be used to choose a custom architecture for featurizing non-numerical content.
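A minimal sketch of what a call to such an API can look like, based on the open-source DataWig package released with this work (linked at the end of this section). The class and parameter names (SimpleImputer, input_columns, output_column, output_path) follow the public package but should be treated as assumptions here, and the toy table is hypothetical.

    import pandas as pd
    import datawig

    # Hypothetical product table with missing values in the 'color' attribute.
    df = pd.DataFrame({
        "title":       ["red running shoe", "blue leather boot", "black sneaker", "red sandal"],
        "description": ["lightweight mesh upper", "full grain leather", "casual everyday", "summer style"],
        "color":       ["red", "blue", None, "red"],
    })

    # Only the names of the input columns and of the to-be-imputed column are required.
    imputer = datawig.SimpleImputer(
        input_columns=["title", "description"],
        output_column="color",
        output_path="imputer_model",   # directory where the trained model is stored
    )
    imputer.fit(train_df=df[df["color"].notnull()])
    imputed = imputer.predict(df[df["color"].isnull()])
    print(imputed)

In this sketch, rows with an observed value serve as training data and rows with a missing value are imputed.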
The presented system allows researchers and data engineers to plug an imputation component into a data pipeline to ensure completeness of a data source using state-of-the-art machine learning models. In extensive experiments on product data from a sample of a large product catalog as well as a number of data sets obtained from Wikipedia, we have shown that the approach efficiently and reliably imputes missing product attributes in tables with millions of rows. Model training time for a table of about one million rows and up to ten input columns is usually between a few minutes for a simple model configuration, such as a sparse linear character n-gram model, and around an hour for the most complex models, on a single GPU instance. Experiments on product data in English and Japanese demonstrate that our character based approach is language-agnostic and can be used for imputation of tables that contain very different languages.

In our experiments, we found that while deep learning methods perform very well, a simple character n-gram feature extraction often achieves competitive results. This could be due to the fact that the task was too easy. On the compute instances on which we ran experiments, simple linear character n-gram models achieved throughputs of several ten thousand samples per second during training, whereas models with LSTM-based featurizers could usually only process several hundred rows of a table per second during learning. Depending on the task and the best hyperparameter configuration, LSTM based models can be smaller than the n-gram models; on average, however, n-gram models are not only faster to train but also smaller. This finding confirms other work that highlights the potential of relatively simple n-gram models, especially when compared to more expensive-to-train neural network architectures [12, 14]. We note that while the current setting was restricted to imputation of categorical values, it can be extended straightforwardly by adding a standard numerical regression loss function. Another extension is to impute several columns at the same time, potentially of different types; this can easily be achieved by summing the column-specific losses. We did run experiments with such multi-task loss functions, but we found that single-output models perform best when evaluated on single columns only. Finally, we highlight that many existing imputation approaches are based on matrix factorization and learn latent representations associated with each row index of a table. This makes it more difficult to impute values for new rows, a use case that is relevant when new rows need to be appended to a table. Our approach was designed to allow for simple and efficient insertion of new rows while preserving the ability to compute a latent numerical representation for each row, which could be used for other purposes, such as information retrieval, recommendation systems, or nearest neighbor search in large tables containing non-numerical data [4].

The code used in this study is available as an open source package at https://github.com/awslabs/datawig.

REFERENCES
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] R. R. Andridge and R. J. Little. A review of hot deck imputation for survey non-response. International Statistical Review, 78(1):40–64, 2010.
[3] G. Batista and M. C. Monard. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6):519–533, 2003.
[4] R. Bordawekar and O. Shmueli. Using word embedding to enable semantic queries in relational databases. In Workshop on Data Management for End-to-End Machine Learning at SIGMOD, page 5, 2017.
[5] J.-H. Böse, V. Flunkert, J. Gasthaus, T. Januschowski, D. Lange, D. Salinas, S. Schelter, M. Seeger, and Y. Wang. Probabilistic demand forecasting at scale. PVLDB, 10(12):1694–1705, 2017.
[6] L. Bottou. On-line learning in neural networks, chapter On-line Learning and Stochastic Approximations, pages 9–42. Cambridge University Press, New York, NY, USA, 1998.
[7] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[8] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. Machine Learning Systems Workshop at NIPS, 2015.
[9] M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 541–552. ACM, 2013.
[10] P. J. García-Laencina, J.-L. Sancho-Gómez, and A. R. Figueiras-Vidal. Pattern classification with missing data: a review. Neural Computing and Applications, 19(2):263–282, 2010.
[11] L. Gondara and K. Wang. Multiple imputation using deep denoising autoencoders. CoRR, abs/1705.02737, 2017.
[12] E. Grave, T. Mikolov, A. Joulin, and P. Bojanowski. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 427–431, 2017.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov. 1997.
[14] A. Joulin, E. Grave, P. Bojanowski, M. Nickel, and T. Mikolov. Fast Linear Model for Knowledge Graph Embeddings. arXiv:1710.10881v1, 2017.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. Technical report, preprint arXiv:1412.6980, 2014.
[16] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[17] S. Krishnan, M. J. Franklin, K. Goldberg, J. Wang, and E. Wu. ActiveClean: An interactive data cleaning framework for modern machine learning. In SIGMOD’16, pages 2117–2120, 2016.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[19] A. Kumar, R. McCann, J. Naughton, and J. M. Patel. Model Selection Management Systems: The Next Frontier of Advanced Analytics. SIGMOD Record, 2015.
[20] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.
[21] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA, 1986.
[22] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
[23] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. MLlib: Machine learning in Apache Spark. JMLR, 17(1):1235–1241, 2016.
[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[26] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich. Data management challenges in production machine learning. In SIGMOD’17, pages 1723–1726, 2017.
[27] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.
[28] D. Rubin. Multiple imputation for nonresponse in surveys. Bioinformatics, 17(6):520–525, 1987.
[29] S. Schelter, A. Palumbo, S. Quinn, S. Marthi, and A. Musselman. Samsara: Declarative Machine Learning on Distributed Dataflow Systems. In Machine Learning Systems Workshop at NIPS’16.
[30] P. Schmitt, J. Mandel, and M. Guedj. A comparison of six methods for missing data imputation. J Biom Biostat, 6(224), 2015.
[31] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J. Crespo, and D. Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015.
[32] A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In ECML/PKDD, pages 358–373, 2008.
[33] O. G. Troyanskaya, M. N. Cantor, G. Sherlock, P. O. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.
[34] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279–289, Feb. 2011.
[35] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.