Active Learning for Machine Translation
Statistical machine translation methods rely on large quantities of training data to
learn various models. We are investigating applications of active learning both to
minimize the amount of training data required (for example, when moving to new domains or to
languages where large amounts of text are not available) and to integrate a human into the
training process to help answer difficult questions, such as translations of phrases, that will
help improve the overall translations.
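One standard active learning strategy in this setting is pool-based uncertainty sampling: send the sentences the current model is least confident about to a human translator. A minimal sketch, where the `confidence` scores are invented stand-ins for a real model's translation scores:

```python
def select_for_annotation(pool, score, k):
    """Pick the k sentences the model is least confident about
    (uncertainty sampling); these go to the human for translation."""
    return sorted(pool, key=score)[:k]

# Toy pool: the confidence scores are made up, standing in for model scores.
confidence = {"el gato": 0.95, "la casa azul": 0.90, "una frase rara": 0.20}
pool = list(confidence)
chosen = select_for_annotation(pool, confidence.get, 1)
```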
Phrasal Improvement of Machine Translation
Given a machine translation function and bilingual data, we can generate an aligned English/English
data set by translating the foreign sentences with the translation function. Differences between
these sentences point to possible mistakes in the translation function. We are investigating
various same language alignment methods on this data. Given these alignments, phrase correction
rules can be learned. Initial investigation shows major improvements in commercial translation
functions and some improvement on phrase based systems.
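As an illustration of the pipeline (using `difflib` as a stand-in for the same-language alignment methods actually under investigation), even a very simple monolingual alignment yields rewrite rules of the form MT phrase -> reference phrase:

```python
import difflib

def phrase_correction_rules(mt_tokens, ref_tokens):
    """Extract (MT phrase -> reference phrase) rewrite rules from one
    sentence pair via a simple same-language alignment."""
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=ref_tokens)
    rules = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":  # aligned but differing spans suggest a mistake
            rules.append((tuple(mt_tokens[i1:i2]), tuple(ref_tokens[j1:j2])))
    return rules

mt = "he go to school".split()      # machine translation output
ref = "he goes to school".split()   # reference English translation
rules = phrase_correction_rules(mt, ref)
```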
Unsupervised Pronoun Resolution
Research with Andy Kehler and Lara Taylor
Most data driven pronoun resolution techniques rely on labeled data for training. Recently,
there has been some work showing that unlabeled data can be used to train these pronoun/antecedent
models. Using a large corpus of text, unambiguous pronouns are identified and used to bootstrap
a model. Using this model, new pronouns are labeled and the process is continued until convergence.
The advantage of these unsupervised methods is that they have much larger data sets available than
supervised methods do. We are investigating an expanded set of features for pronoun resolution that
leverage this increased amount of data.
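The bootstrapping loop described above can be sketched as generic self-training; `train` and `predict` here are toy stand-ins (labeling strings by their first character) for the actual pronoun/antecedent model:

```python
from collections import Counter

def bootstrap(seed, unlabeled, train, predict, threshold=0.9, max_iters=10):
    """Self-training: start from unambiguous (seed) cases, then repeatedly
    fold the model's confident predictions back into the training set."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_iters):
        model = train(labeled)
        remaining = []
        for x in pool:
            label, prob = predict(model, x)
            if prob >= threshold:
                labeled.append((x, label))
            else:
                remaining.append(x)
        if len(remaining) == len(pool):  # converged: nothing new labeled
            break
        pool = remaining
    return train(labeled)

# Toy stand-ins: label each string by the majority label of its first character.
def train(labeled):
    model = {}
    for x, y in labeled:
        model.setdefault(x[0], Counter())[y] += 1
    return model

def predict(model, x):
    counts = model.get(x[0])
    if not counts:
        return None, 0.0
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())

model = bootstrap([("aa", "A"), ("ab", "A"), ("ba", "B")], ["ac", "bd"],
                  train, predict)
```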
Binarizing Translation Rules
Research with Kevin Knight and Daniel Marcu at ISI
Recent translation models for statistical machine translation
involve syntactically motivated translation rules
(Galley et al.).
Empirically and qualitatively this type of translation approach
seems to be a good way to integrate syntax into the translation
process while still maintaining many of the advantages of current
state of the art phrase based systems. Decoding with these rules
is similar to parsing, with additional complications. CKY-style, bottom-up
parsing methods appear to be a good approach for this model.
One of the complications for many bottom up algorithms is that the grammar
must be binary (i.e. at most two elements on the right hand side). Binarization
of PCFG rules is fairly straightforward. Binarization for translation rules is an
interesting area of research since the rules are more complicated (i.e. left hand side
consists of a tree, not just a non-terminal) and because other models are used during decoding
such as the language models. The binarization process can be biased so that partial hypothesis
scores and future cost estimates are more accurate.
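For ordinary (P)CFG rules the transformation looks like the sketch below, which introduces virtual nonterminals via simple left-branching binarization; for translation rules the interesting question is which branching order to choose, since that choice affects language model scoring and future cost estimation:

```python
def binarize(lhs, rhs):
    """Left-branching binarization: rewrite one rule so that every
    resulting rule has at most two symbols on the right hand side."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    prev = rhs[0]
    for i in range(1, len(rhs) - 1):
        virtual = f"{lhs}~{i}"  # virtual nonterminal introduced here
        rules.append((virtual, (prev, rhs[i])))
        prev = virtual
    rules.append((lhs, (prev, rhs[-1])))
    return rules

rules = binarize("NP", ["DT", "JJ", "JJ", "NN"])
```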
Hierarchical Clustering with Bregman Distances
Research with Sanjoy Dasgupta and Anjum Gupta
In 2003, we introduced an iterative hierarchical clustering method. The problem with previous
approaches is that they are greedy. Once a decision is made in constructing the hierarchy,
this decision can never be undone. The iterative approach we presented searches the space of
hierarchical clusterings and therefore does not have that problem.
The iterative approach attempts to minimize a criterion function based on the current
clustering. In our initial paper, we used a squared Euclidean cost function similar to that of
K-means. Squared Euclidean distance is just one of a family of distance measures called Bregman
distances. We are examining how these distances can be integrated with our iterative method.
The hope is that these new distance metrics will fit more naturally for a variety of different
data sets where Euclidean distance is not appropriate.
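Every Bregman distance is generated by a strictly convex function F via D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>; choosing F(x) = ||x||^2 recovers squared Euclidean distance, while F(x) = sum x_i log x_i gives a KL-style divergence. A small sketch:

```python
def bregman(F, grad_F, x, y):
    """Bregman distance D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - sum(g * (xi - yi)
                             for g, xi, yi in zip(grad_F(y), x, y))

# F(x) = ||x||^2 recovers the squared Euclidean cost from our initial paper.
sq_norm = lambda v: sum(vi * vi for vi in v)
grad_sq = lambda v: [2.0 * vi for vi in v]

d = bregman(sq_norm, grad_sq, [1.0, 2.0], [3.0, 0.0])  # equals ||x - y||^2
```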
Previous (Unpublished) Projects
Searching Research Papers
Research with Vibhu Mittal at Google
Information retrieval of research papers provides a number of interesting problems that are
different from traditional information retrieval. First, extracting information from non-text
files is non-trivial and information such as authors, titles and references must be extracted
from research articles before further processing. This can be difficult due to the wide
variation of formatting and conventions. Once this information has been extracted, the
citation graph of research papers provides a very interesting search structure that is much
more reliable than traditional web link structure.
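One simple way to exploit that citation graph (a sketch of the general idea, not necessarily the ranking used in this project) is a PageRank-style iteration in which a paper cited by highly ranked papers becomes highly ranked itself; for brevity, this version lets rank mass from papers with no extracted references leak away:

```python
def citation_rank(cites, iters=50, d=0.85):
    """PageRank-style scoring over a citation graph.
    cites[p] is the list of papers that paper p cites."""
    papers = set(cites) | {q for qs in cites.values() for q in qs}
    rank = {p: 1.0 / len(papers) for p in papers}
    for _ in range(iters):
        new = {p: (1 - d) / len(papers) for p in papers}
        for p, qs in cites.items():
            for q in qs:  # p passes rank to each paper it cites
                new[q] += d * rank[p] / len(qs)
        rank = new
    return rank

r = citation_rank({"A": ["C"], "B": ["C"]})
```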
Reinforcement Learning for Traffic Navigation
Preliminary paper available
In this paper we examine reinforcement learning problems which consist of a set of homogeneous
entities. These problems tend to have extremely large state spaces making standard approaches
unattractive. We study lane change selection in a car traffic control problem as an example of
such a problem. We show how a single agent problem can be translated into an approximating
multi-agent problem. We provide learning results in a traffic simulator using this multi-agent
approximation with Q-learning and R-learning. Learning in the multi-agent problem proceeds
quickly and outperforms heuristic methods. Experimental results show that learned methods
perform better than heuristic methods as traffic densities increase towards rush hour
conditions. We summarize the translation method used from a single agent problem to a related
multi-agent problem for car traffic control and propose this as a starting place for related problems.
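The Q-learning half of the experiments reduces to the standard tabular update; the state and action names below are invented placeholders, not the actual simulator features:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

Q = {}
# Hypothetical lane-change step: reward 1.0 for a successful change.
q_update(Q, "gap_ahead", "change_lane", 1.0, "clear_lane",
         ["stay", "change_lane"])
```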
Vision Techniques for Audio Processing
Preliminary paper available
In this paper we discuss the problem of audio retrieval. We examine artist
recognition/retrieval instead of the traditional genre classification task.
This problem has the motivating benefit that there is a
known, uncontroversial ground truth. Second, and more importantly, we suggest borrowing
research from the image retrieval community. We provide results from one image retrieval
technique ported over to audio retrieval. This technique consists of taking the discrete
wavelet transform of the audio, histogramming the results and using statistical histogram
comparison metrics to compare similarity. The results are not outstanding, but we do show that
this sort of research can be done fairly easily and productively.
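A minimal sketch of that pipeline in pure Python, using a one-level Haar transform as the discrete wavelet transform and histogram intersection as the comparison metric (the paper's actual transform depth, bin counts, and metrics may differ):

```python
def haar_step(signal):
    """One level of the Haar discrete wavelet transform:
    pairwise averages (approximation) and differences (detail)."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    diff = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return avg, diff

def coeff_histogram(coeffs, bins, lo, hi):
    """Histogram the wavelet coefficients into fixed-width bins."""
    hist = [0] * bins
    width = (hi - lo) / bins
    for c in coeffs:
        i = min(int((c - lo) / width), bins - 1)
        hist[max(i, 0)] += 1
    return hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: overlap of the two normalized histograms."""
    n1, n2 = sum(h1), sum(h2)
    return sum(min(a / n1, b / n2) for a, b in zip(h1, h2))

audio = [0.0, 1.0, 0.0, 1.0, 0.5, 0.5, 0.2, 0.8]  # toy stand-in for samples
_, detail = haar_step(audio)
h = coeff_histogram(detail, bins=4, lo=-1.0, hi=1.0)
```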
Dynamic Resource Allocation in Heterogeneous Computing Environments
The increasing power of modern personal computers is making clusters built from
desktop components increasingly popular. However, these clusters do have their drawbacks.
When clusters are upgraded, the whole cluster may not be upgraded all at once. This leads
to an environment where some nodes are faster than others. This irregular
environment can also arise simply when jobs are already running on some of the nodes:
these nodes will tend to run slower than the free nodes. There are a number of problems such
as RedBlack3D that have highly regular rectangular partitioning schemes. Unfortunately, if
these static partitions are used in the environment just described, idle waiting can occur when
faster nodes that have finished their iteration early are stuck waiting at a
synchronization point for slower processors to finish. We investigate a load balancer that can
adaptively tailor per-node workloads to node speed to help reduce this inefficiency.
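The core of such a balancer is a partitioner that sizes each node's contiguous block of the regular grid in proportion to its measured speed; a minimal sketch (the speed numbers here are invented measurements):

```python
def partition_rows(n_rows, speeds):
    """Assign contiguous row blocks proportional to measured node speed,
    so faster nodes get more work and all nodes finish an iteration
    at roughly the same time."""
    total = sum(speeds)
    cuts, assigned = [], 0
    for i, s in enumerate(speeds):
        if i < len(speeds) - 1:
            size = round(n_rows * s / total)
        else:
            size = n_rows - assigned  # last node absorbs rounding slack
        cuts.append((assigned, assigned + size))
        assigned += size
    return cuts

# Three nodes; the third is twice as fast, so it gets half the rows.
blocks = partition_rows(100, [1.0, 1.0, 2.0])
```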