
Commit 6e57312

some more references and editing
1 parent 5e7d1b2 commit 6e57312

3 files changed: +128, -73 lines


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -39,7 +39,10 @@ L2pickle/*.pickle
 ## All the output we're going to generate
 L1output/*.best
 L1output/*.oof
+L1output/*.results
 L2output/*.best
 L2output/*.oof
+L2output/*.results
 MRFoutput/*.best
 MRFoutput/*.oof
+MRFoutput/*.results

paper/semeval2013.bib

Lines changed: 37 additions & 0 deletions
@@ -107,3 +107,40 @@ @book{nltkbook
   publisher = {O'Reilly Media},
   year = 2009
 }
+
+@InProceedings{denero-klein:2007:ACLMain,
+  author = {DeNero, John and Klein, Dan},
+  title = {Tailoring Word Alignments to Syntactic Machine Translation},
+  booktitle = {Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics},
+  month = {June},
+  year = {2007},
+  address = {Prague, Czech Republic},
+  publisher = {Association for Computational Linguistics},
+  pages = {17--24},
+  url = {http://www.aclweb.org/anthology/P07-1003}
+}
+
+@Unpublished{daume04cg-bfgs,
+  author = {Hal {Daum\'e III}},
+  title = {Notes on {CG} and {LM-BFGS} Optimization of Logistic Regression},
+  month = {August},
+  keywords = {ml},
+  year = {2004}
+}
+
+@InProceedings{Toutanova03feature-richpart-of-speech,
+  author = {Kristina Toutanova and Dan Klein and Christopher D. Manning and Yoram Singer},
+  title = {Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network},
+  booktitle = {Proceedings of HLT-NAACL},
+  year = {2003},
+  pages = {252--259},
+  publisher = {}
+}
+
+@InProceedings{Schmid95improvementsin,
+  author = {Helmut Schmid},
+  title = {Improvements in Part-of-Speech Tagging with an Application to German},
+  booktitle = {Proceedings of the ACL SIGDAT-Workshop},
+  year = {1995},
+  pages = {47--50}
+}

paper/semeval2013.tex

Lines changed: 88 additions & 73 deletions
@@ -3,6 +3,7 @@
 \usepackage{times}
 \usepackage{latexsym}
 \setlength\titlebox{6.5cm} % Expanding the titlebox
+\usepackage{url}
 \usepackage{float}
 \floatstyle{boxed}
 \restylefloat{figure}
@@ -21,59 +22,73 @@
 
 %what resource did we use,
 \begin{abstract}
-
-We present our approaches to CL-WSD(Cross-Lingual Word Sense Disambiguation) for the Semeval 2013 Task 10, which came in
-three varieties:
-"One layer" Classifiers, which are single maximum-entropy classifiers making use of monolingual context features. %local context features,
-"Two layer" Classifiers,which are based on layer-one classifiers and also use multilingual features that are translations for four other languages.
-%which are the same as the one-layer classifiers except that they use the translation of the word of interest into four other target languages as features,
-And lastly, the "MRF(Markov Random Field)" Classifiers, which also use multilingual features. Instead of translate each language separately, they build a network of five layer-one classifiers to allow them to find the translation for five languages jointly.%solve the classification task jointly.
-%=We will also discuss the results and findings.
-
 We present our entries for the SemEval-2013 cross-language word-sense
-disambiguation task \cite{task10}. We submitted three systems based
-on classifiers trained on local context features, with some elaborations.
-Our three systems, in increasing order of complexity, were: maximum entropy
-classifiers trained to predict the desired target-language phrase using only monolingual features (we called this system ``L1"); similar classifiers, but with the desired target-language
-phrase for the other four languages as features (``L2"); and lastly, networks
-of five classifiers, over which we do loopy belief propagation in an attempt to
-solve the classification task jointly (``MRF").
+disambiguation task \cite{task10}. We submitted three systems based on
+classifiers trained on local context features, with some elaborations. Our
+three systems, in increasing order of complexity, were: maximum entropy
+classifiers trained to predict the desired target-language phrase using only
+monolingual features (we called this system \emph{L1}); similar classifiers,
+but with the desired target-language phrase for the other four languages as
+features (\emph{L2}); and lastly, networks of five classifiers, over which we
+do loopy belief propagation to solve the classification tasks jointly
+(\emph{MRF}).
 \end{abstract}
 
 \section{Introduction}
 In the cross-language word-sense disambiguation (CL-WSD) task, given an
 instance of an ambiguous word used in a context, we want to predict the
 appropriate translation into some target language. This setting for WSD has an
 immediate application in machine translation, since many words have many
-possible translations.
-
-Framing lexical ambiguities in this way, as an explicit classification task,
-has been shown to be improve machine translation even in the case of
-phrase-based SMT systems (cite Carpuat and Wu), which can mitigate the
-ambiguities through the use of a language model and phrase-tables with
-multi-word phrases.
-CL-WSD has been shown useful for statistical machine translation (cite Carpuat and Wu), although in future work we are particularly interested in applying it to rule-based systems. (XXX: is this relevant?)
-
-In the Semeval-2013 CL-WSD task \cite{task10}, we are asked to build a system that can provide
-translations for twenty ambiguous English nouns in their contexts. The five target languages in the shared task are Spanish, Dutch, German, Italian and French. There were two settings for the evaluation, ``best" and ``oof". In either case, systems may present multiple possible answers for a given translation, although in the ``best" setting, the first answer is given more weight, which encourages only returning the one-best. In the ``oof" setting, systems are encouraged to return the top-five most likely translations. For a complete explanation of the settings, please see the shared task description \cite{task10}.
+possible translations. Framing the resolution of lexical ambiguities as an
+explicit classification task has been shown to improve machine translation
+even in the case of phrase-based SMT systems \cite{carpuatpsd}, which can
+mitigate lexical ambiguities through the use of a language model and
+phrase-tables with multi-word phrases.
+
+XXX: work in Brown 1991 reference too:
+\cite{Brown91word-sensedisambiguation}
+
+In the Semeval-2013 CL-WSD task \cite{task10}, entrants are asked to build a
+system that can provide translations for twenty ambiguous English nouns, given
+appropriate contexts. The five target languages in the shared task are Spanish,
+Dutch, German, Italian and French. There were two settings for the evaluation,
+``best" and ``oof". In either case, systems may present multiple possible
+answers for a given translation, although in the ``best" setting, the first
+answer is given more weight in the evaluation, and this setting encourages only
+returning the top answer. In the ``oof" setting, systems are asked to
+return the top-five most likely translations. For a complete explanation of the
+task and its evaluation, please see the shared task description \cite{task10}.
 
 %% consider: maybe move this to related work?
 Following the work of Lefever and Hoste
 \shortcite{lefever-hoste-decock:2011:ACL-HLT2011}, we wanted to develop systems
-that make use of multiple bitext corpora for the CL-WSD task.
-ParaSense, the system of Lefever and Hoste, takes into account evidence from all of the available parallel corpora. Let $S$ be the set of five target languages and $t$ be the particular target language of interest at the moment; ParaSense creates bag-of-words features from the translations of the target sentence into the languages $S - \lbrace{t \rbrace}$. Given corpora that are parallel over many languages, this is straightforward to do at training time, however at testing time it requires the use of a complete MT system into the four other languages, which is computationally prohibitive. Thus in our work, we have developed systems that make use of many parallel corpora but require neither a locally running MT system nor access to an online translation API.
+that make use of multiple bitext corpora for the CL-WSD task. ParaSense, the
+system of Lefever and Hoste, takes into account evidence from all of the
+available parallel corpora. Let $S$ be the set of five target languages and $t$
+be the particular target language of interest at the moment; ParaSense creates
+bag-of-words features from the translations of the target sentence into the
+languages $S - \lbrace{t \rbrace}$. Given corpora that are parallel over many
+languages, this is straightforward to do at training time, however at testing
+time it requires the use of a complete MT system into the four other languages,
+which is computationally prohibitive. Thus in our work, we have developed
+systems that make use of many parallel corpora but require neither a locally
+running MT system nor access to an online translation API.
 
 We presented three systems in this competition, which were variations on the
 theme of a maximum entropy classifier for each ambiguous noun, trained on local
 context features similar to those used in previous work and familiar from the
 WSD literature.
 
-Our systems had similar results, but at the time of the evaluation, our simplest system came in first place for the out-of-five evaluation for three languages (Spanish, German, and Italian).
-However, after the evaluation, we fixed a simple (slightly embarrassing) bug in our MRF code, which resulted in the MRF system posting even better results for the OOF evaluation.
-
-on the \emph{oof} evaluation, we had the best results for Spanish, German, and Italian.
-All of our systems beat the ``most-frequent sense" baseline in every case.
+Our systems had similar results, but at the time of the evaluation, our
+simplest system came in first place for the out-of-five evaluation for three
+languages (Spanish, German, and Italian). However, after the evaluation
+deadline, we fixed a simple (slightly embarrassing) bug in our MRF code, which
+resulted in the MRF system producing even better results for the OOF
+evaluation.
 
+... on the \emph{oof} evaluation, we had the best results for Spanish, German,
+and Italian. All of our systems beat the ``most-frequent sense" baseline in
+every case.
 
 Our three systems made use of the same training data, which we extracted from
 the Europarl Intersection corpus, meaning that the English-language source
@@ -103,13 +118,15 @@ \section{L1}
 in question to the appropriate target-language lemma), we extract features from
 the English-language sentence.
 
-Several steps of preprocessing were needed. We first POS tagged the sentences, since we are only interested in nouns.
-Then align the words in each sentence pair, and lemmatize the target sentence.
-After locating words of interest in the
-Europarl Intersection corpus, training instances were extracted, and a maxent
-classifier was trained over local context features similar to those used by Lefever and
-Hoste.
+%% rework a bit
+Several steps of preprocessing were needed. We first POS tagged the sentences,
+since we are only interested in nouns. We then aligned the words in each
+sentence pair and lemmatized the target sentence. After locating words of
+interest in the Europarl Intersection corpus, training instances were
+extracted, and a maxent classifier was trained over local context features
+similar to those used by Lefever and Hoste.
 
+%% howto do a nested list?
 \begin{figure}
 \begin{itemize}
 \item word form
@@ -121,7 +138,7 @@ \section{L1}
 \item bigrams and tagged bigrams (just in case)
 \end{itemize}
 \label{features}
-\caption{some features}
+\caption{Features used in our classifiers}
 \end{figure}
 
 Note: word tag is different from word with tag (so as for bigram and bigram
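The local-context features named in the figure above (word form, bigrams, tagged bigrams) can be sketched as a simple dict-valued extractor over a POS-tagged sentence. The window size and feature-name scheme below are illustrative assumptions, not the exact ones used in the submitted systems:

```python
def local_context_features(tokens, tags, i, window=3):
    """Extract dict-of-features for the ambiguous noun at index i."""
    feats = {}
    feats["wordform=" + tokens[i]] = 1.0
    # surrounding word forms within the window
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats["w[%+d]=%s" % (offset, tokens[j])] = 1.0
    # bigrams and tagged bigrams around the target word
    if i > 0:
        feats["bigram=%s_%s" % (tokens[i - 1], tokens[i])] = 1.0
        feats["tbigram=%s/%s_%s/%s" % (tokens[i - 1], tags[i - 1],
                                       tokens[i], tags[i])] = 1.0
    if i + 1 < len(tokens):
        feats["bigram=%s_%s" % (tokens[i], tokens[i + 1])] = 1.0
        feats["tbigram=%s/%s_%s/%s" % (tokens[i], tags[i],
                                       tokens[i + 1], tags[i + 1])] = 1.0
    return feats

# toy POS-tagged sentence; the ambiguous noun is at index 1
tokens = ["the", "coach", "praised", "the", "team"]
tags = ["DT", "NN", "VBD", "DT", "NN"]
feats = local_context_features(tokens, tags, 1)
```

Each training instance's feature dict, paired with its gold target-language lemma, would then go to a maxent learner (megam, in the paper's pipeline).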
@@ -166,6 +183,8 @@ \section{MRF}
 Spanish, Italian and French. Can this closeness be represented by the pairwise
 potentials?
 
+%% TODO: build a diagram of the network. THE TRANSLATION PENTAGRAM.
+
 There was some concern about pairwise potential in MRF, which is joint probability. Consider a word which occurs 500 times in the training data, it could co-occur with
 We had some concern about pairwise potential in MRF, which is joint
 probability. Consider a word which occurs 500 times in the training data, it
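The MRF section describes a network of five per-language classifiers with pairwise potentials between them, solved jointly with loopy belief propagation. A toy sum-product sketch over a fully connected five-node graph follows; the unary potentials stand in for the per-language classifier scores and the pairwise potentials for translation co-occurrence, but all numbers here are made up for illustration and this is not the paper's actual implementation:

```python
import itertools
import random

random.seed(0)
N_NODES, N_LABELS = 5, 3  # five target languages, toy label set

# made-up positive potentials (real ones would come from classifiers/corpora)
unary = [[random.random() + 0.1 for _ in range(N_LABELS)]
         for _ in range(N_NODES)]
pairwise = {(i, j): [[random.random() + 0.1 for _ in range(N_LABELS)]
                     for _ in range(N_LABELS)]
            for i, j in itertools.combinations(range(N_NODES), 2)}

def psi(i, j, xi, xj):
    """Pairwise potential between node i in state xi and node j in state xj."""
    if (i, j) in pairwise:
        return pairwise[(i, j)][xi][xj]
    return pairwise[(j, i)][xj][xi]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# messages[(i, j)] is the message from node i to node j
messages = {(i, j): [1.0 / N_LABELS] * N_LABELS
            for i in range(N_NODES) for j in range(N_NODES) if i != j}

for _ in range(50):  # fixed number of BP sweeps
    new_messages = {}
    for (i, j) in messages:
        msg = []
        for xj in range(N_LABELS):
            total = 0.0
            for xi in range(N_LABELS):
                # unary belief at i times all incoming messages except j's
                b = unary[i][xi]
                for k in range(N_NODES):
                    if k != i and k != j:
                        b *= messages[(k, i)][xi]
                total += psi(i, j, xi, xj) * b
            msg.append(total)
        new_messages[(i, j)] = normalize(msg)
    messages = new_messages

# final beliefs: unary potential times all incoming messages, normalized
beliefs = []
for j in range(N_NODES):
    b = [unary[j][xj] for xj in range(N_LABELS)]
    for i in range(N_NODES):
        if i != j:
            for xj in range(N_LABELS):
                b[xj] *= messages[(i, j)][xj]
    beliefs.append(normalize(b))
```

On the fully connected "pentagram" the graph is loopy, so BP is approximate; a fixed number of normalized sweeps is the usual pragmatic stopping rule.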
@@ -186,16 +205,19 @@ \section{Resources and tools}
 
 
 \section{preprocessing steps}
-tools: NLTK, Stanford Tagger, Berkeley Aligner, TreeTagger (for lemmatization),
-megam for learning.
-
-We converted the English side of the text to ascii (XXX: why did we do that?
-does Stanford Tagger work better on ASCII? ...) We tokenized both the English
-and target language text with the default word tokenizer from NLTK. We aligned
-each of the English and target language pairs (en/de, en/es, etc) with the
-Berkeley Aligner, with very nearly the default settings, except that we ran 20
-iterations each of IBM Model 1 and 20 iterations of the HMM alignment
-algorithm. Is that the right number? Who can say, really?
+NLTK \cite{nltkbook}
+Stanford Tagger \cite{Toutanova03feature-richpart-of-speech}
+Berkeley Aligner \cite{denero-klein:2007:ACLMain}
+TreeTagger \cite{Schmid95improvementsin}
+megam for learning \footnote{\url{http://www.umiacs.umd.edu/~hal/megam/}}
+\cite{daume04cg-bfgs}
+
+We tokenized both the English and target language text with the default word
+tokenizer from NLTK. We aligned each of the English and target language pairs
+(en/de, en/es, etc) with the Berkeley Aligner, with very nearly the default
+settings, except that we ran 20 iterations each of IBM Model 1 and 20
+iterations of the HMM alignment algorithm. Is that the right number? Who can
+say, really?
 
 We also found another bug that made it seem like our alignments were awful.
 That was two more problems with TreeTagger -- it turned out that it was calling
@@ -214,26 +236,6 @@ \section{preprocessing steps}
 considering them likely alignment errors or other noise.
 
 \section{Results}
-For the \emph{best} evaluation, the more sophisticated classifiers usually do
-better. though not always. It's not totally clear that they're better in
-general.
-
-Also, in the out-of-five case, the L1 classifiers are usually better... in
-fact, they won the competition for three of the five languages.
-However ...
-
-It seems like Els's features really are richer -- she gets translations for all
-the different words in the source language and uses the other-target-language
-bag-of-words as features. That's a lot of features. We're kind of forcing the
-information through a narrower pass -- we just get one decision.
-
-However, in the one-best case, we get better results out of the L2 and MRF
-classifiers, so they do seem to help at least a bit. (though the results aren't
-all that much better...)
-
-What went wrong with the MRF classifiers with the OOF evaluation??!
-TODO(alexr): now we know what went wrong. Write about it!
-
 \begin{table*}[t!]
 \begin{center}
 \begin{tabular}{|r|r|r|r|r|r|}
@@ -242,6 +244,7 @@ \section{Results}
 \hline
 baseline & 23.23 & 20.66 & 17.43 & 20.21 & 25.74 \\
 best result & 32.16 & 23.61 & 20.82 & 25.66 & 30.11 \\
+\hline
 L1 & 29.01 & 21.53 & 19.5 & 24.52 & 27.01 \\
 L2 & 28.49 & \textbf{22.36} & \textbf{19.92} & 23.94 & \textbf{28.23} \\
 MRF & \textbf{29.36} & 21.61 & 19.76 & \textbf{24.62} & 27.46 \\
@@ -252,15 +255,15 @@ \section{Results}
 \end{center}
 \end{table*}
 
-
 \begin{table*}[t!]
 \begin{center}
 \begin{tabular}{|r|r|r|r|r|r|}
 \hline
 system & es & nl & de & it & fr \\
 \hline
 baseline & 53.07 & 43.59 & 38.86 & 42.63 & 51.36 \\
-best result & 61.69 & 47.83 (proycon c1l)& 44.02 & 53.98 & 59.8 (proycon c1lN) \\
+best result & 61.69 & 47.83 & 44.02 & 53.98 & 59.8 \\
+\hline
 L1 & 61.69 & 46.55 & 43.66 & 53.57 & 57.76 \\
 L2 & 59.51 & 46.36 & 42.32 & 53.05 & \textbf{58.2} \\
 MRF & \textbf{62.21} & \textbf{46.63} & \textbf{44.02} & \textbf{53.98} & 57.83 \\
@@ -271,6 +274,18 @@ \section{Results}
 \end{center}
 \end{table*}
 
+For the \emph{best} evaluation, the more sophisticated classifiers usually do
+better, though not always. It's not totally clear that they're better in
+general.
+
+It seems like Els's features really are richer -- she gets translations for all
+the different words in the source language and uses the other-target-language
+bag-of-words as features. That's a lot of features. We're kind of forcing the
+information through a narrower pass -- we just get one decision.
+
+TODO(alexr): now we know what went wrong. Write about it!
+
+
 
 \section{Further experiments}
 TODO(alexr): run the experiment where we do the L2 classifiers, but with the
