A Survey On Data Collection For Machine Learning
Fig. 1: A high-level research landscape of data collection for machine learning. The
topics that are at least partially contributed by the data management community are
highlighted using blue italic text. Hence, to fully understand the research landscape, one
needs to look at the literature from the viewpoints of both the machine learning and data
management communities.
While there are many surveys on data collection that are either limited to
one discipline or a class of techniques, to our knowledge, this survey is the first
to bridge the machine learning (including natural language processing and
computer vision) and data management disciplines. We contend that a machine
learning user needs to know the techniques on all sides to make informed
decisions on which techniques to use when. This convergence is part of a larger
integration of the areas of Big data and Artificial Intelligence (AI) where data
management plays a role in almost all aspects of machine learning [4, 5].
We note that many sub-topics including semi-supervised learning, active
learning, and transfer learning are large enough to have their own surveys. The
goal of this survey is not to go into all the depths of these sub-topics, but to
focus on breadth and identify what data collection techniques are relevant for
machine learning purposes and what research challenges exist. Hence, we will
only cover the most representative work of the sub-topics, which are either the
best-performing or most recent ones. In addition, sometimes the boundary
between operations (e.g., data acquisition and data labeling) is not clear cut. In
those cases, we will clarify that the techniques are relevant in both operations.
Fig. 2: A decision flow chart for data collection. Starting from the top left, Sally can
start by asking whether the existing data or model should be improved. Each following
question leads to a specific technique that can be used for acquiring and labeling data or
improving existing data or models. This flow chart does not cover all the details in this
survey. For example, data labeling techniques like self learning and crowdsourcing can
be performed together as described in Section 3.2.1. Also, some questions (e.g.,
“Enough labels for self learning?”) are not easy to answer and may require an in-depth
understanding of the application and given data.
Fig. 3: A running example for data collection. A smart factory may produce various
images of product components, which are classified as normal or defective by a
convolutional neural network model. Unfortunately, for an application this specific, it is often difficult to find enough data to train the model.
We review the data acquisition literature, which can be categorized into data
discovery, data augmentation, and data generation. Many of the techniques
require scalable solutions and have thus been studied by the data management
community (Section 2).
We review the data labeling literature and group the techniques into three
approaches: utilizing existing labels, using crowdsourcing techniques, and using
weak supervision. While data labeling is traditionally a machine learning topic,
it is also studied in the data management community as scalability becomes an
issue (Section 3).
We put all the techniques together and provide guidelines on how to decide
which data collection techniques to use when (Section 5).
Data discovery
  Sharing: Collaborative analysis [6, 7, 8]; Web [9, 10, 11, 12, 13, 14]
  Searching: Data lake [16, 17, 18]; Web [19, 20, 21, 22, 23, 24, 25, 26]
Data augmentation
  Embeddings [27, 28]; Joins [29, 30]
Data generation
  Crowdsourcing: Processing [39, 40, 36]
  Synthetic data: Text [49, 50, 51]
TABLE I: A classification of data acquisition techniques. Some of the techniques can be
used together. For example, data can be generated while augmenting existing data.
The goal of data acquisition is to find datasets that can be used to train machine
learning models. There are largely three approaches in the literature: data
discovery, data augmentation, and data generation. Data discovery is necessary
when one wants to share or search for new datasets and has become important
as more datasets become available on the Web and in corporate data lakes [52, 16].
Data augmentation complements data discovery where existing datasets are
enhanced by adding more external data. Data generation can be used when
there is no available external dataset, but it is possible to generate crowdsourced
or synthetic datasets instead. The following sections will cover the three
operations in more detail. The individual techniques are classified in Table I.
2.1 DATA DISCOVERY
Data discovery can be viewed as two steps. First, the generated data must be
indexed and published for sharing. Many collaborative systems are designed to
make this process easy. However, other systems are built without the intention
of sharing datasets. For these systems, a post-hoc approach must be used where
metadata is generated after the datasets are created, without the help of the
dataset owners. Next, someone else can search the datasets for their machine
learning tasks. Here the key challenges include how to scale the searching and
how to tell whether a dataset is suitable for a given machine learning task. We
discuss these two issues in the following sections.
Data Sharing
We study data systems that are designed with dataset sharing in mind. These
systems may focus on collaborative analysis, publishing on the Web, or both.
Collaborative Analysis In an environment where data scientists are
collaboratively analyzing different versions of datasets, DataHub [6, 7, 8] can be
used to host, share, combine, and analyze them. There are two components: a
dataset version control system inspired by Git (a version control system for
code) and a hosted platform on top of it, which provides data search, data
cleaning, data integration, and data visualization.
Data Searching
While the previous data systems are platforms for sharing datasets, we now
explore systems that are mainly designed for searching datasets where the
number of datasets tends to be high. This setting is common within large
companies or on the Web.
Data Lake Data searching systems have become more popular with the advent
of data lakes [52, 16] in corporate environments where many datasets are
generated internally, but they are not easily discoverable by other teams or
individuals within the company. Providing a way to search datasets and analyze
them has significant business value because the teams or individuals do not
have to make redundant efforts to re-generate the datasets for their machine
learning tasks. Most of the recent data lake systems have come from the
industry. In many cases, it is not feasible for all the dataset owners to publish
datasets through one system. Instead, a post-hoc approach becomes necessary
where datasets are processed for searching after they are created, and no effort
is required on the dataset owner’s side.
More recently, scalability has become a pressing issue for handling data lakes that contain most of the datasets in a large company. Google Dataset Search (GOODS) [17] is a system that catalogues the metadata of tens of billions of
datasets from various storage systems within Google. GOODS infers various
metadata including owner information and provenance information (by looking
up job logs), analyzes the contents of the datasets, and collects input from users.
At the core is a central catalog, which contains the metadata and is indexed
for data searching. Due to Google’s scale, there are many technical challenges
including scaling to the number of datasets, supporting a variety of data formats
where the costs for extracting metadata may differ, updating the catalog entries
due to the frequent churn of datasets, dealing with uncertainty in metadata
discovery, computing dataset importance for search ranking, and recovering
dataset semantics that are missing. To find datasets, users can issue keyword
queries on the GOODS frontend and view profile pages of the datasets that
appear in the search results. In addition, users can track the provenance of a
dataset to see which datasets were used to create the given dataset and those
that rely on it.
Finally, expressive queries are also important for searching a data lake.
While GOODS scales, one downside is that it only supports simple keyword
queries. The DATA CIVILIZER system [18, 53] complements GOODS by focusing
more on the discovery aspect of datasets. Specifically, DATA CIVILIZER consists of
a module for building a linkage graph of data. Assuming that the datasets have schemas, the nodes in the linkage graph are columns of tables, while edges are
relationships like primary key-foreign key (PK-FK) relationships. A data
discovery module then supports a rich set of discovery queries on the linkage
graph, which can help users more easily discover the relevant datasets.
Web As the Web contains large numbers of structured datasets, there have
been significant efforts to automatically extract the useful ones. One of the most
successful systems is WebTables [20, 19], which automatically extracts
structured data that is published online in the form of HTML tables. For
example, WebTables extracts all Wikipedia infoboxes. Initially, about 14.1
billion HTML tables are collected from the Google search web crawl. Then a
classifier is applied to determine which tables can be viewed as relational
database tables. Each relational table consists of a schema that describes the
columns and a set of tuples. In comparison to the above data lake systems,
WebTables collects structured data from the Web.
As Web data tends to be much more diverse than, say, data in a corporate
environment, the table extraction techniques have been extended in multiple
ways as well. One direction is to extend table extraction beyond identifying
HTML tags by extracting relational data in the form of vertical tables and lists
and leveraging knowledge bases [22, 23]. Table searching also evolved where, in
addition to keyword searching, row-subset queries, entity-attribute queries, and
column search were introduced [24]. Finally, techniques for enhancing the
tables [25, 26] were proposed where entities or attribute values are added to
make the tables more complete.
2.2 DATA AUGMENTATION
Embeddings
There are two possible models for training word vectors: Continuous Bag-of-
Words (CBOW) and Skip-gram. While CBOW predicts a word based on its
surrounding words, Skip-gram does the opposite and predicts the surrounding
words based on a given word. As a result, two words that occur in similar
contexts tend to have similar word vectors. A fascinating application of word
vectors is performing arithmetic operations on them. For example, subtracting the word vector of “queen” from that of “king” yields a result similar to subtracting the vector of “woman” from that of “man”. Since
word2vec was proposed, there have been many extensions including GloVe [28],
which improves word vectors by also taking into account global corpus
statistics, and Doc2Vec [56], which generates representations of documents.
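To make the analogy arithmetic above concrete, the following minimal sketch trains Skip-gram vectors on a toy corpus with the gensim library (version 4.x assumed) and queries an analogy; the corpus is an illustrative assumption, and a corpus this small will not produce meaningful analogies, so the snippet only demonstrates the API pattern.

# Minimal sketch: train Skip-gram word vectors and query analogy arithmetic.
# Assumes gensim 4.x is installed; the toy corpus is purely illustrative.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects the Skip-gram objective; sg=0 would select CBOW.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# Analogy: vector("king") - vector("man") + vector("woman") should be near vector("queen").
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))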
Entity Augmentation
Joins
Relational databases are widely used in the industry, and there is a growing
interest in using tables as training data for machine learning. However, most
machine learning toolkits assume that a training dataset is a single file and
ignore the fact that there are typically multiple tables in a database due to
normalization. The key question is whether joining the tables and augmenting
the information is useful for model training. The Hamlet system [29] addresses
this problem by determining if key-foreign key (KFK) joins are necessary for
improving the model accuracy of linear classifiers and proposes decision rules to
predict when it is safe to avoid joins and, as a result, significantly reduce the
total runtime.
2.3 DATA GENERATION
If there are no existing datasets that can be used for training, then another
option is to generate the datasets either manually or automatically. For manual
construction, crowdsourcing is the standard method where human workers are
given tasks to gather the necessary bits of data that collectively become the
generated dataset. Alternatively, automatic techniques can be used to generate
synthetic datasets. Note that data generation can also be viewed as data
augmentation if there is existing data where some missing parts need to be
filled in.
Crowdsourcing
Crowdsourcing is used to solve a wide range of problems, and there are many
surveys as well [57, 58, 59, 60]. One of the earliest and most popular platforms
is Amazon Mechanical Turk [61] where tasks (called HITs) are assigned to
human workers, and workers are compensated for finishing the tasks. Since
then, many other crowdsourcing platforms have been developed, and research
on crowdsourcing has flourished in the areas of data management, machine
learning, and human computer interaction. There is a wide range of
crowdsourcing tasks from simple ones like labeling images up to complex ones
like collaborative writing that involves multiple steps.
The objective of the generative network is to increase the error rate of the
discriminative network. That is, the generative network attempts to fool the
discriminative network into thinking that its candidates are from the true
distribution. GANs have been used to generate synthetic images and videos that
look realistic in many applications.
Text data Generating synthetic text data has mostly been studied in the natural
language processing community. Paraphrasing [49] is a classical problem of
generating alternative expressions that have the same semantic meaning. For
example “What does X do for a living?” is a paraphrase of “What is X’s job?”.
We briefly cover two recent methods, one syntax-based and the other semantics-based, that use paraphrasing to generate large amounts of
synthetic text data. Syntactically controlled paraphrase networks [50] (SCPNs)
can be trained to produce paraphrases of a sentence with different sentence
structures. Semantically equivalent adversarial rules for text [51] (SEARs) have
been proposed for perturbing input text while preserving its semantics. SEARs
can be used to debug a model by applying them on training data and seeing if
the re-trained model changes its predictions. In addition, there are many
paraphrasing techniques that are not covered in this survey.
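To illustrate the flavor of such rules, the sketch below applies SEAR-style, semantics-preserving replacement rules to a small set of sentences; the specific rules and sentences are illustrative assumptions rather than the published SEAR rule set.

# Minimal sketch of applying SEAR-style replacement rules to perturb training text.
# The rules and sentences are illustrative assumptions, not the published rule set.
import re

rules = [
    (r"\bWhat is\b", "What's"),   # contraction rule
    (r"\bis not\b", "isn't"),     # contraction rule
]

def perturb(sentence):
    """Return semantics-preserving variants of a sentence, one per matching rule."""
    variants = []
    for pattern, replacement in rules:
        if re.search(pattern, sentence):
            variants.append(re.sub(pattern, replacement, sentence))
    return variants

for s in ["What is X's job?", "The component is not defective."]:
    print(s, "->", perturb(s))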
3 DATA LABELING
Use existing labels
  Self-labeled: classification, all data types [69, 70, 71, 72, 73]; regression, all data types [74, 75, 76]
  Label propagation: classification, graph data [77]
Crowd-based
  Active learning with semi-supervised learning: classification, image data [83]; graph data [84]
Weak supervision
  Data programming: classification, all data types [87, 88, 89]
  Fact extraction: classification, text data [90, 91, 92, 93, 94, 95, 96, 97]
TABLE II: A classification of data labeling techniques. Some of the techniques can be
used for the same application. For example, for classification on graph data, both self-
labeled techniques and label propagation can be used.
Once enough data has been acquired, the next step is to label individual
examples. For instance, given an image dataset of industrial components in a
smart factory application, workers can start annotating whether there are any defects
in the components. In many cases, data acquisition is done along with data
labeling. When extracting facts from the Web and constructing a knowledge
base, each fact is assumed to be correct and thus implicitly labeled as true. Nevertheless, when discussing the data labeling literature, it is easier to treat it separately from data acquisition because the techniques can be quite different.
Use existing labels: An early idea of data labeling is to exploit any existing
labels. There is an extensive literature on semi-supervised learning where the
idea is to learn from the existing labels in order to predict the remaining ones.
Data type: Depending on the data type (e.g., text, images, and graphs) the data
labeling techniques differ significantly. For example, information extraction
from text is very different from object detection on images.
Classification
Single Classifier, Algorithm, and View The first category is where there is only
one classifier, learning algorithm, and view. Self-training [69] initially trains a
model on the labeled examples. The model is then applied to all the unlabeled
data where the examples are ranked by the confidences in their predictions. The
most confident predictions are then added into the labeled examples. This
process repeats until all the unlabeled examples are labeled. While there is no
assumption on the data, there is an implicit assumption that the predictions of
its own trained model tend to be correct, especially those with high
confidence. This method is simple and effective, and has been successfully used
in real applications.
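The sketch below shows the self-training loop just described, using a scikit-learn classifier; the confidence threshold, base model, and stopping criteria are illustrative assumptions.

# Minimal self-training sketch: repeatedly add the most confident predictions on
# unlabeled data to the labeled set and retrain. Threshold and model are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, confidence=0.95, max_iters=10):
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_iters):
        if len(X_unlab) == 0:
            break
        model.fit(X_lab, y_lab)
        probs = model.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= confidence
        if not confident.any():
            break  # no prediction is confident enough; stop early
        # Move the confidently predicted examples into the labeled set.
        pseudo_labels = model.classes_[probs[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo_labels])
        X_unlab = X_unlab[~confident]
    return model.fit(X_lab, y_lab)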
Regression
Relatively less research has been done for semi-supervised learning for
regression. Co-regularized least squares regression [74] is a least squares
regression algorithm based on the co-learning approach. Another co-regularized
framework [75] utilizes sufficient and redundant views similar to Co-training.
Co-training regressors [76] uses two k-nearest neighbor regressors with
different distance metrics. In each iteration, a regressor labels the unlabeled
data that can be labeled most confidently by the other regressor. After the
iterations, the final prediction of an example is made by averaging the
regression estimates by the two regressors. Co-training Regressors can be
extended by using any other base regressor.
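A simplified sketch of this co-training style for regression is given below; it uses two k-nearest neighbor regressors with different distance metrics, as in the paper, but approximates labeling confidence by the variance of a candidate's labeled neighbors, which is a simplification of the original selection criterion.

# Simplified co-training regressors sketch: two kNN regressors with different distance
# metrics each pseudo-label one unlabeled point per iteration for the other regressor.
# Confidence is approximated by low variance among labeled neighbors (a simplification).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, NearestNeighbors

def co_training_regressors(X_lab, y_lab, X_unlab, iters=20, k=3):
    metrics = ["euclidean", "manhattan"]                 # two regressors, two metrics
    labeled = [(X_lab.copy(), np.asarray(y_lab, dtype=float).copy()) for _ in metrics]
    pool = list(range(len(X_unlab)))
    for _ in range(iters):
        for i, metric in enumerate(metrics):
            if not pool:
                break
            X_i, y_i = labeled[i]
            reg = KNeighborsRegressor(n_neighbors=k, metric=metric).fit(X_i, y_i)
            nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X_i)
            # Pick the pooled point whose labeled neighbors agree the most (lowest variance).
            variances = [np.var(y_i[nn.kneighbors(X_unlab[[j]])[1][0]]) for j in pool]
            best = pool[int(np.argmin(variances))]
            y_hat = reg.predict(X_unlab[[best]])[0]
            other = 1 - i                                # hand the pseudo-label to the other regressor
            X_o, y_o = labeled[other]
            labeled[other] = (np.vstack([X_o, X_unlab[[best]]]), np.append(y_o, y_hat))
            pool.remove(best)
    # Final prediction averages the two regressors, as in the paper.
    regressors = [KNeighborsRegressor(n_neighbors=k, metric=m).fit(*labeled[i])
                  for i, m in enumerate(metrics)]
    return lambda X: np.mean([r.predict(X) for r in regressors], axis=0)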
3.2 CROWD-BASED TECHNIQUES
The most accurate way to label examples is to do it manually. A well-known use case is the ImageNet image classification dataset [104], where tens of millions of images were organized according to the WordNet semantic hierarchy using
Amazon Mechanical Turk. However, ImageNet is an ambitious project that took
years to complete, which most machine learning users cannot afford for their
own applications. Traditionally, active learning has been a key technique in the
machine learning community for carefully choosing the right examples to label
and thus minimize cost. More recently, crowdsourcing techniques for labeling
have been proposed where there can be many workers who are not necessarily
experts in labeling. Hence, there is more emphasis on how to assign tasks to
workers, what interfaces to use, and how to ensure high quality labels. While
crowdsourcing data labeling is closely related to crowdsourcing data acquisition,
the individual techniques are different.
Active Learning
There are various ways semi-supervised learning can be used with active
learning. McCallum and Nigam [81] improve the Query-By-Committee (QBC) technique and combine it with Expectation-Maximization (EM), which
effectively performs semi-supervised learning. Given a set of documents for
training data, active learning selects for labeling the documents that are close to other documents (and thus representative) but have high committee disagreement. In addition, the EM algorithm is used to infer the rest of the
labels.
The images that still have low confidence are selected to be labeled by
humans. Zhu et al. [84] combine semi-supervised and active learning under a
Gaussian random field model. The labeled and unlabeled examples are
represented as vertices in a graph where edges are weighted by similarities
between examples. This framework enables one to compute the next question
that minimizes the expected generalization error efficiently for active learning.
Once the new labels are added to the labeled data, semi-supervised learning is
performed using harmonic functions.
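For concreteness, the sketch below shows a basic query-by-committee step: a committee of classifiers is trained on bootstrap replicates of the labeled data, and the unlabeled example with the highest vote entropy (the most disagreement) is selected for human labeling. The committee construction is an illustrative assumption, and the EM-based semi-supervised step of [81] is omitted.

# Minimal query-by-committee sketch: train a committee on bootstrap replicates and
# query the unlabeled example the committee disagrees on most (highest vote entropy).
# Committee size and base model are illustrative; the EM step of [81] is omitted.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def query_by_committee(X_lab, y_lab, X_pool, n_models=5):
    committee = []
    for seed in range(n_models):
        X_b, y_b = resample(X_lab, y_lab, random_state=seed)  # bootstrap replicate
        committee.append(LogisticRegression(max_iter=1000).fit(X_b, y_b))

    votes = np.array([m.predict(X_pool) for m in committee])  # shape (n_models, n_pool)
    entropy = np.zeros(X_pool.shape[0])
    for c in np.unique(y_lab):
        p = (votes == c).mean(axis=0)                         # fraction voting for class c
        entropy -= p * np.log(np.clip(p, 1e-12, 1.0))
    return int(np.argmax(entropy))  # index of the example to send to a human labeler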
Crowdsourcing
3.3 WEAK SUPERVISION
Data Programming
As data labeling at scale becomes more important especially for deep learning
applications, data programming [89] has been proposed as a solution for
generating large amounts of weak labels using multiple labeling functions
instead of individual labeling. Figure 5 illustrates how data programming can be
used for Sally’s smart factory application. A labeling function can be any
computer program that either generates a label for an example or abstains from doing so.
For example, a labeling function that checks if a tweet has a positive sentiment
may check if certain positive words appear in the text. Since a single labeling
function by itself may not be accurate enough or not be able to generate labels
for all examples, multiple labeling functions are implemented and combined
into a generative model, which is then used to generate large amounts of weak
labels with reasonable quality. Alternatively, voting methods like majority
voting can be used to combine the labeling functions. Finally, a noise-aware
discriminative model is trained on the weak labels. Data programming has been
implemented in the state-of-the-art Snorkel system [115], which is becoming
increasingly popular.
Fig. 5: A workflow of using data programming for a smart factory application. In this
scenario, Sally is using crowdsourcing to annotate defects on component images. Next,
the annotations can be automatically converted to labeling functions. Then the labeling
functions are combined either into a generative model or using majority voting. Finally,
the combined model generates weak labels that are used to train a discriminative model.
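The sketch below illustrates this workflow with plain Python labeling functions combined by majority vote; the functions and examples are illustrative assumptions for a sentiment-style task, and Snorkel's generative model is replaced here by simple voting.

# Minimal data programming sketch: labeling functions that label or abstain, combined
# here by majority vote instead of a generative model. Functions and data are illustrative.
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_positive_words(text):
    return POSITIVE if any(w in text.lower() for w in ("great", "good", "love")) else ABSTAIN

def lf_negative_words(text):
    return NEGATIVE if any(w in text.lower() for w in ("bad", "broken", "defect")) else ABSTAIN

def lf_exclamation(text):
    return POSITIVE if text.endswith("!") else ABSTAIN

labeling_functions = [lf_positive_words, lf_negative_words, lf_exclamation]

def weak_label(text):
    """Majority vote over the labeling functions that do not abstain."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

examples = ["I love this product!", "The part arrived broken", "Works as expected"]
weak_labels = [weak_label(t) for t in examples]  # -> [1, 0, -1] (last example abstains)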
Instead, labeling functions that are more correlated with each other will
have less influence on the predicted label. In addition, labeling functions that
are outliers are also trained to have less influence in order to cancel out the
noise. Theoretical analysis [3] shows that, if the labeling functions are
reasonably accurate, then the predictions made by the generative model
become arbitrarily close to the true labels.
Empirically, a generative model performs better than majority voting when there are about 10 to 100 labeling functions that complement each other. If there are fewer than about 10 labeling functions, there are not enough correlated labeling functions to train the generative model accurately. On the
other hand, if there are more than about 100 labeling functions, a theoretical
analysis [117] shows that taking a majority vote produces arbitrarily accurate
labels and is sufficient for our purposes.
Compared to DeepDive, DDLite has a simpler Python syntax and does not require a complex setup involving databases. Snorkel [122, 115] is the most recent system
for data programming. Compared to DDLite, Snorkel is a full-fledged product
that is widely used in the industry. Snorkel enables users to use weak labels
from all available weak label sources, supports any type of classifier, and
provides rapid results in response to the user’s input. More recently, Snorkel has
been extended to support massive multi-task learning [123].
Improve data
  Data cleaning [126, 127, 128, 129]
  Re-labeling [130]
Improve model
  Robust against noise [131, 132, 133, 134]
  Transfer learning [135, 136, 137]
TABLE III: A classification of techniques for improving existing data and models.
A major problem in machine learning is that the data can be noisy and the labels
incorrect. This problem occurs frequently in practice, so production machine
learning platforms like TensorFlow Extended (TFX) [138] have separate
components [139] to reduce data errors as much as possible through analysis and
validation. In case the labels are also noisy, re-labeling the examples becomes
necessary as well. We explore recent advances in data cleaning with a focus on
machine learning and then techniques for re-labeling.
Data Cleaning
It is common for the data itself to be noisy. For example, some values may be
out of range (e.g., a latitude value is beyond [-90, 90]) or use different units by
mistake (e.g., some intervals are in hours while others are in minutes). There is an extensive literature on various integrity constraints (e.g., domain constraints,
referential integrity constraints, and functional dependencies) that can improve
data quality as well. HoloClean [128] is a state-of-the-art data cleaning system that
uses quality rules, value correlations, and reference data to build a probabilistic
model that captures how the data was generated. HoloClean then generates a
probabilistic program for repairing the data.
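The sketch below illustrates the simplest form of this idea: a detection function for a domain constraint (the latitude range above) paired with a repair function, written in pandas. The data and the repair policy are illustrative assumptions and not the HoloClean or BoostClean algorithms.

# Minimal sketch of a detect/repair pair for a domain constraint, following the
# latitude example above. DataFrame contents and the repair policy are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"sensor_id": [1, 2, 3], "latitude": [37.5, 95.2, -41.0]})

def detect_out_of_range(frame, column, low, high):
    """Boolean mask of rows violating the domain constraint low <= value <= high."""
    return (frame[column] < low) | (frame[column] > high)

def repair_to_missing(frame, column, mask):
    """Repair by setting violating values to NaN so they can be imputed later."""
    repaired = frame.copy()
    repaired.loc[mask, column] = np.nan
    return repaired

violations = detect_out_of_range(df, "latitude", -90.0, 90.0)   # flags the 95.2 row
clean_df = repair_to_missing(df, "latitude", violations)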
An interesting line of recent work is cleaning techniques that are tailored
for machine learning purposes. ActiveClean [126] is a model training framework
that iteratively suggests samples of data to clean based on how much the
cleaning improves the model accuracy and the likelihood that the data is dirty.
An analyst can then perform transformations and filtering to clean each sample.
ActiveClean treats the training and cleaning as a form of stochastic gradient
descent and uses convex-loss models (SVMs, linear and logistic regression) to
guarantee global solutions for clean models. BoostClean [127] solves an
important class of inconsistencies where an attribute value is outside an allowed
domain.
BoostClean takes as input a dataset and a set of functions that can detect
these errors and repair functions that can fix them. Each pair of detection and
repair functions can produce a new model trained on the cleaned data.
BoostClean uses statistical boosting to find the best ensemble of pairs that
maximize the final model’s accuracy. Recently, TARS [129] was proposed to
solve the problem of cleaning crowdsourced labels using oracles. TARS provides
two pieces of advice. First, given test data with noisy labels, it uses an estimation
technique to predict how well the model may perform on the true labels. The
estimation is shown to be unbiased, and confidence intervals are computed to
bound the error. Second, given training data with noisy labels, TARS determines
which examples to send to an oracle in order to maximize the expected model
improvement of cleaning each noisy label.
Re-labeling
Trained models are only as good as their training data, and it is important to
obtain high quality data labels. Simply labeling more data may not improve the
model accuracy further. Indeed, Sheng et al. [130] show that, if the labels are noisy, the model accuracy plateaus at some point and does not increase further, no matter how much more labeling is done. The solution is to improve the quality of existing labels. The authors show that repeated labeling using workers of certain individual qualities can significantly improve model accuracy, where a straightforward round-robin approach already gives substantial improvements, and being more selective about which examples to relabel gives even better results.
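A small simulation makes the benefit of repeated labeling concrete: if each worker is correct with some probability, a majority vote over several independent labels is noticeably more accurate than a single label. The worker accuracy and label counts below are illustrative assumptions, not values from [130].

# Small simulation of repeated labeling: each worker label is correct with probability p,
# and a majority vote over repeated labels is more accurate than a single noisy label.
# Worker accuracy and label counts are illustrative assumptions.
import random

def majority_vote_accuracy(p=0.7, labels_per_example=5, trials=10000, seed=0):
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        votes = sum(1 for _ in range(labels_per_example) if rng.random() < p)
        correct += votes > labels_per_example / 2   # majority agrees with the true label
    return correct / trials

print(majority_vote_accuracy(labels_per_example=1))   # ~0.70 (a single noisy label)
print(majority_vote_accuracy(labels_per_example=5))   # ~0.84
print(majority_vote_accuracy(labels_per_example=11))  # ~0.92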
4.2 IMPROVING MODELS
In addition to improving the data, there are also ways to improve the model
training itself. Making the model training more robust against noise or bias is an
active area of research. Another popular approach is to use transfer learning
where previously-trained models are used as a starting point to train the current
model.
Robust Against Noise and Bias
The idea is to model the relationships between images, class labels, and label
noise with a probabilistic graphical model and integrate it into the model
training. Label noise is categorized into two types: confusing noise, which is
caused by confusing content in the images, and pure random noise, which is
caused by technical bugs like mismatches between images and their
surrounding text.
The true labels and noise types are treated as latent variables, and an EM
algorithm is used for inference. Webly supervised learning [132] is a technique
for training a convolutional neural network on clean and noisy images on the
Web. First, the model is trained on top-ranked images from search engines,
which tend to be clean because they are highly-ranked, but also biased in the
sense that objects tend to be centered in the image with a clean background. Then relationships are discovered among the clean images, which are then used to adapt the model to noisier images that are harder to classify. This
method suggests that it is worth training on easy and hard data separately.
Even if the labels themselves are clean, it may be the case that the labels are
imbalanced. SMOTE [134] performs over-sampling for minority classes that
need more examples. Simply replicating examples may lead to overfitting, so the
over-sampling must be done with care. The basic SMOTE algorithm finds for
each minority example the k minority class nearest neighbors and then
generates synthetic examples along the line segments joining the minority
example and its nearest neighbors. The number of samples to generate per
original minority sample can be adjusted based on neighboring examples.
SMOTE can be combined with under-sampling the majority class where
examples in that class are randomly removed. This combination of sampling
results in better model accuracy than when only the majority class is under-
sampled. He and Garcia [133] provide a comprehensive survey on learning from
imbalanced data.
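The basic interpolation step of SMOTE can be sketched in a few lines, as below; the parameters are illustrative, and in practice a tested implementation such as the one in the imbalanced-learn library would normally be used.

# Minimal sketch of the basic SMOTE step: for each synthetic example, pick a minority
# point and interpolate toward one of its k nearest minority neighbors. Parameters
# are illustrative; tested library implementations should be preferred in practice.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_synthetic, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbors = nn.kneighbors(X_minority)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))             # a random minority example
        j = neighbors[i][rng.integers(1, k + 1)]      # one of its k minority neighbors
        gap = rng.random()                            # position along the line segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)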
Transfer Learning
Transfer learning is a popular approach for training models when there is not
enough training data or time to train from scratch. Starting from an existing
model that is well trained (also called a source task), one can incrementally
train a new model (a target task) that already performs well. For example, convolutional neural networks like AlexNet [142] and VGGNet [143] can be used as starting points to train a model for a different, but related, vision problem. Recently, Google
announced TensorFlow Hub [137], which enables users to easily re-use an
existing model to train an accurate model, even with a small dataset.
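The general fine-tuning pattern is sketched below in PyTorch with a torchvision backbone (torchvision 0.13+ assumed): the pretrained weights are frozen and only a new classification head for the target task is trained. The backbone choice, class count, and dummy batch are illustrative assumptions, and the sketch is not tied to TensorFlow Hub.

# Minimal transfer learning sketch: freeze a pretrained backbone (source task: ImageNet)
# and train only a new classification head for the target task (e.g., normal vs. defective).
# Backbone, class count, and the dummy batch are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False              # keep the pretrained weights fixed

num_classes = 2                              # target task: normal vs. defective components
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new, trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 RGB images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()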
Inductive transfer learning is used when the source task and target task are
different while the two domains may or may not be the same. Here a task can be
categorizing a document while a domain could be a set of university webpages to
categorize.
Transductive transfer learning is used when the source and target tasks
are the same, but the domains are different.
One approach is to learn a common latent feature space that unifies the source and target
features. Transfer learning has been successfully used in many applications
including text sentiment analysis, image classification, human activity
classification, software defect detection, and multi-language text classification.
We now return to Sally’s scenario and provide an end-to-end guideline for data
collection (summarized as the workflow in Figure 2). If there is no or little data
to start with, then Sally would need to acquire datasets. She can either search for relevant datasets on the Web or within the company data lake, or decide
to generate a dataset herself by installing camera equipment for taking photos of
the products within the factory. If the products also had some metadata, Sally
could augment that data with external information about the products.
Once the data is available, then Sally can choose among the labeling
techniques using the categories discussed in Section 3. If there are enough
existing labels, then self labeling using semi-supervised learning is an attractive
option. There are many variants of self labeling depending on the assumptions
on the model training, as discussed earlier. If there are not enough labels, Sally can
decide to generate some using the crowd-based techniques using a budget.
If there are only a few experts available for labeling, active learning may be
the right choice, assuming that the important examples that influence the model
can be narrowed down. If there are many workers who do not necessarily have
expertise, general crowdsourcing methods can be used. If Sally does not have
enough budget for crowd-based methods or if it is simply not worth the cost,
and if the model training can tolerate weak labels, then weak supervision
techniques like data programming and label propagation can be used.
If Sally has existing labels, she may also want to check whether their quality can be improved. If the data is noisy or biased, then the various data
cleaning techniques can be used. If there are existing models for product quality
through tools like TensorFlow Hub, they can be used to further improve the
model using transfer learning.
Although data collection was traditionally a topic in machine learning, as the amount of training data increases, data management research is becoming
just as relevant, and we are observing a convergence of the two disciplines. As
such, there needs to be more awareness on how the research landscape will
evolve for both communities and more effort to better integrate the techniques.
Data Evaluation An open question is how to evaluate whether the right data was collected and whether its quantity is sufficient.
First, it may not be clear if we have found the best datasets for a machine
learning task and whether the amount of data is enough to train a model with
sufficient accuracy. In some cases, there may be too many datasets, and simply
collecting and integrating all of them may have a negative effect on model
training. As a result, selecting the right datasets becomes an important problem.
Moreover, if the datasets are dynamic (e.g., they are streams of signals from
sensors) and change in quality, then the choice of datasets may have to change
dynamically as well. Second, many data discovery tools rely on dataset owners
to annotate their datasets for better discovery, but more automatic techniques
for understanding and extracting metadata from the data are needed.
While most of the data collection work assumes that the model training
comes after the data collection, another important avenue is to augment or
improve the data based on how the model performs. While there is an extensive literature on model interpretation [120, 121], it is not clear how to address
feedback on the data level. In the model fairness literature [145], one approach
to reducing unfairness is to fix the data. In data cleaning, ActiveClean and
BoostClean are interesting approaches for fixing the data to improve model
accuracy. A key challenge is analyzing the model, which becomes harder for
complex models like deep neural networks.
7 CONCLUSION
ACKNOWLEDGMENTS
[3] S. H. Bach, B. D. He, A. Ratner, and C. Ré, “Learning the structure of
generative models without labeled data,” in ICML, 2017, pp. 273–282.
[9] A. Y. Halevy, “Data publishing and sharing using fusion tables,” in CIDR,
2013.
[12] “CKAN,” http://ckan.org.
[13] “Quandl,” https://www.quandl.com.
[14] “DataMarket,” https://datamarket.com.
[15] “Kaggle,” https://www.kaggle.com/.
[20] M. J. Cafarella, A. Y. Halevy, H. Lee, J. Madhavan, C. Yu, D. Z. Wang, and
E. Wu, “Ten years of webtables,” PVLDB, vol. 11, no. 12, pp. 2140–2149, 2018.
[29] A. Kumar, J. Naughton, J. M. Patel, and X. Zhu, “To join or not to join?:
Thinking twice about joins before feature selection,” in SIGMOD, 2016, pp. 19–
34.
[30] V. Shah, A. Kumar, and X. Zhu, “Are key-foreign key joins safe to avoid
when learning high-capacity classifiers?” PVLDB, vol. 11, no. 3, pp. 366–379,
Nov. 2017.
[52] A. Y. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E.
Whang, “Managing Google’s data lake: An overview of the GOODS
system,” IEEE Data Eng. Bull., vol. 39, no. 3, pp. 5–14, 2016.
[54] M. Stonebraker and I. F. Ilyas, “Data integration: The current status and
the way forward,” IEEE Data Eng. Bull., vol. 41, no. 2, pp. 3–9, 2018.
[73] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-
training,” in COLT, 1998, pp. 92–100.
[78] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text
classifiers,” in SIGIR, 1994, pp. 3–12.
[80] R. Burbidge, J. J. Rowland, and R. D. King, “Active learning for regression
based on query by committee,” in IDEAL, 2007, pp. 209–218.
[88] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Ré, “Data
programming with ddlite: putting humans in a different part of the loop,”
in HILDA@SIGMOD, 2016, p. 13.
[89] A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré, “Data programming:
Creating large training sets, quickly,” in NIPS, 2016, pp. 3567–3575.
[110] Y. Gu, Z. Jin, and S. C. Chiu, “Combining active learning and semi-
supervised learning using local and global consistency,” in Neural Information
Processing, C. K. Loo, K. S. Yap, K. W. Wong, A. Teoh, and K. Huang,
Eds. Cham: Springer International Publishing, 2014, pp. 215–222.
[116] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Ré, “Data
programming with ddlite: Putting humans in a different part of the loop,”
in Proceedings of the Workshop on Human-In-the-Loop Data Analytics, ser.
HILDA ’16, 2016, pp. 13:1–13:6.
[122] A. J. Ratner, S. H. Bach, H. R. Ehrenberg, and C. Ré, “Snorkel: Fast
training set generation for information extraction,” in SIGMOD, 2017, pp.
1683–1686.
[134] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE:
Synthetic minority over-sampling technique,” J. Artif. Int. Res., vol. 16, no. 1,
pp. 321–357, Jun. 2002.
[137] “TensorFlow Hub,” https://www.tensorflow.org/hub/.
[139] “TensorFlow Data Validation,” https://www.tensorflow.org/tfx/data_validation/.