Add BLEU score metric #627

Merged: 19 commits into pytorch:master on Oct 31, 2019
Conversation

@sluks (Contributor) commented Oct 24, 2019

A quick review of MT quality scores:

BLEU (Bilingual Evaluation Understudy), also called N-Gram Co-Occurrence, was developed at IBM in 2001. It measures the similarity between a candidate translation and a set of reference translations, where each reference is considered equally valid. The essence is: count how many n-grams (up to 4-grams) the candidate and the references have in common. The focus is on precision: what fraction of the candidate's n-grams appear in the references? If we relied on precision alone, though, a system could output a single safe word and get near-perfect precision while the translation (and its recall) would be terrible. That is why BLEU adds a "brevity penalty" that punishes overly short translations, compensating for the missing recall term. A minimal sketch of the computation follows the list below.

  • Pros:
    • easy and fast to compute
    • a reasonable, though limited, estimate of Machine Translation (MT) quality
    • far more scalable than human evaluation
  • Cons:
    • doesn't take meaning or grammar into account: if the references are not exhaustive (e.g. they lack synonyms), the score is inaccurate
    • no notion of sentence structure or word order beyond n-gram overlap
    • the score can be misleading at sentence level; BLEU is only meant to be used at corpus level
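
To make this concrete, here is a minimal, self-contained sketch of the computation described above. It is a toy sentence-level version (real BLEU aggregates the clipped counts over the whole corpus), and none of the names below are taken from the torchtext implementation:

    import collections
    import math


    def toy_bleu(candidate, references, max_n=4):
        # Modified n-gram precision: candidate counts are clipped by the maximum
        # count of each n-gram across the references.
        precisions = []
        for n in range(1, max_n + 1):
            cand = collections.Counter(
                tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
            ref_counts = collections.Counter()
            for ref in references:
                ref_counts |= collections.Counter(
                    tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            clipped = sum(min(c, ref_counts[ng]) for ng, c in cand.items())
            precisions.append(clipped / max(sum(cand.values()), 1))
        if min(precisions) == 0:
            return 0.0
        # Brevity penalty against the reference whose length is closest to the candidate's.
        ref_len = min((len(r) for r in references),
                      key=lambda rl: abs(len(candidate) - rl))
        bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
        # Uniformly weighted geometric mean of the n-gram precisions, times the penalty.
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

With the "partial match" example used in the tests further down (candidate ['My', 'full', 'pytorch', 'test'] against references ['My', 'full', 'pytorch', 'test', '!'] and ['Different']), every n-gram precision is 1 and the brevity penalty is exp(1 - 5/4), giving roughly 0.7788.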

NIST is an extension of BLEU that weights n-gram matches by how informative they are: matching a "rare" n-gram earns more credit than matching a common one. The idea is to avoid giving weight to trivial, "stop word"-type n-grams (the weighting it uses is sketched after the list below).

  • Pros:
    • less weight put on unimportant words compared to BLEU, leading to potentially more accurate scores
  • Cons:
    • still doesn't take meaning into account, and this can hurt more than in BLEU: if the candidate uses a valid synonym that doesn't appear in the references, an informative n-gram match is missed and the score drops more sharply
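
For reference, the information weight NIST assigns to each matched n-gram is, if I recall the NIST 2002 definition correctly (so treat the exact form as an assumption), computed from counts over the reference corpus, so that rarer continuations earn more credit:

    \mathrm{Info}(w_1 \dots w_n) = \log_2 \frac{\mathrm{count}(w_1 \dots w_{n-1})}{\mathrm{count}(w_1 \dots w_n)}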

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) follows the same idea as BLEU but focuses on recall instead of precision: we look at how many n-grams of the reference translations appear in the candidate translation (the reverse of BLEU). There are 5 variants: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU. A minimal ROUGE-N sketch follows.
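
A toy single-reference sketch of the ROUGE-N recall idea (illustrative names only, not any library's API):

    import collections


    def toy_rouge_n(candidate, reference, n=2):
        # Fraction of the reference's n-grams that also appear in the candidate.
        ref = collections.Counter(
            tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        cand = collections.Counter(
            tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        overlap = sum(min(c, cand[ng]) for ng, c in ref.items())
        return overlap / max(sum(ref.values()), 1)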

METEOR
Similar to BLEU, but it also accounts for synonyms and stemming, which makes it heavier to compute.

@zhangguanheng66 (Contributor):

You can run flake8 to catch the lint error.

@sluks sluks force-pushed the master branch 5 times, most recently from a35e89f to a8ae4c8 on October 25, 2019 18:33
@sluks sluks changed the title [WIP] Add BLEU score [WIP] Add BLEU score metric Oct 25, 2019
@sluks sluks marked this pull request as ready for review October 25, 2019 21:07
@sluks sluks changed the title [WIP] Add BLEU score metric Add BLEU score metric Oct 25, 2019
assert max_n > 0

ngrams = [tuple(x.split(' ')) for x in ngrams_iterator(tokens, max_n)]
ngrams_counter = collections.Counter(ngrams)
Contributor:

Counter accepts iterators, so you can use collections.Counter(tuple(x.split(' ')) for x in ngrams_iterator(tokens, max_n)). The advantage is that you don't need to materialize the entire list, so you'll save on memory roundtrips.

Contributor:

I think this is a good idea. In general, we only materialize the strings if really necessary. Otherwise, we just send them to the next generator in the pipeline.
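
For illustration, the helper rewritten along these lines might look like the following (a sketch based on the snippet and suggestion above, not necessarily the exact code that landed):

    import collections

    from torchtext.data.utils import ngrams_iterator


    def _compute_ngram_counter(tokens, max_n):
        assert max_n > 0
        # Counter consumes the generator lazily, so no intermediate list of
        # n-gram tuples is ever materialized.
        return collections.Counter(
            tuple(x.split(' ')) for x in ngrams_iterator(tokens, max_n))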

# Partial match
candidate = [['My', 'full', 'pytorch', 'test']]
refs = [[['My', 'full', 'pytorch', 'test', '!'], ['Different']]]
assert round(metrics.bleu_score(candidate, refs), 4) == 0.7788
Contributor:

What's the source for these scores? Or how were they computed?

Contributor (author):

I computed them "by hand" by applying the math mentioned in the paper. Should I make this more explicit?

Contributor:

That's fine with me. Do you have more than 4 digits in your calculations?

By the way, you could use "close to 0.7788" instead of round and equal.

Contributor:

or torch.testing.assert_allclose
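
For example, the check above could become something like this (candidate and refs as in that hunk; the import path and the tolerance are assumptions, since the expected value is rounded to four digits):

    import torch
    from torchtext.data import metrics  # assumed location of the new metric

    candidate = [['My', 'full', 'pytorch', 'test']]
    refs = [[['My', 'full', 'pytorch', 'test', '!'], ['Different']]]
    # Compares within a tolerance instead of rounding and testing for equality.
    torch.testing.assert_allclose(metrics.bleu_score(candidate, refs), 0.7788,
                                  rtol=0, atol=1e-4)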

if min(clipped_counts) == 0:
return 0.0
else:
pn = [clipped_counts[i] / total_counts[i] for i in range(max_n)]
Contributor:

A lot of the math here can be done using torch.Tensors - since this is part of the DAPI ecosystem it's ok to have a hard dependency on that. Numpy on the other hand is something we can't have a hard dependency on, because pytorch/pytorch doesn't either.

Contributor (author):

good point, I'll look into it

Contributor (author):

I replaced the math part with tensor operations in 4fc19dc

Makes it more elegant without the list comprehensions!
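
For what it's worth, the tensor-based version of the snippet above presumably looks something like this (a sketch with toy counts, not the actual contents of 4fc19dc):

    import torch

    # Toy per-order counts for n = 1..4; in the real code these are accumulated
    # from the clipped-count bookkeeping shown in the hunk above.
    clipped_counts = torch.tensor([4.0, 3.0, 2.0, 1.0])
    total_counts = torch.tensor([4.0, 3.0, 2.0, 1.0])
    weights = torch.full((4,), 0.25)

    if clipped_counts.min() == 0:
        score = 0.0
    else:
        # Element-wise division replaces the per-index list comprehension.
        pn = clipped_counts / total_counts
        # Uniformly weighted geometric mean of the n-gram precisions, in log space.
        score = torch.exp(torch.dot(weights, torch.log(pn))).item()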


# Get the length of the reference that's closest in length to the candidate
refs_len_list = [float(len(ref)) for ref in refs]
refs_len = min(refs_len_list, key=lambda x: abs(len(candidate) - x))
Contributor:

If I'm not missing something, refs_len is being overwritten in each iteration. So in the end we only use the refs_len of the last (candidate, refs) pair?

@sluks (author) commented Oct 28, 2019:

That's a great catch!! It should be +=. My tests didn't catch it because I only tested on very small corpora.
Related to this, I was wondering how I should test that this works for a large corpus. Should I check that I get the same result as the nltk implementation for a random corpus hardcoded in the test_metrics.py file?

Contributor:

Yes, using a well-established reference implementation isn't a bad idea. However, the values to compare against are better stored as a static asset, so that even if the reference implementation changes, you still have the correct test data (we don't want a bug on their end to influence us).

So, use nltk to write out some test data, verify it's what you want, and then use that as a reference within the test.
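
Something along these lines, say (the corpus here is just a placeholder; the real test would use whatever data gets checked in as a static asset):

    from nltk.translate.bleu_score import corpus_bleu

    # Placeholder corpus: generate the value once, eyeball it, then hardcode it
    # in test_metrics.py rather than calling nltk at test time.
    candidates = [['My', 'full', 'pytorch', 'test'],
                  ['Another', 'Sentence']]
    references = [[['My', 'full', 'pytorch', 'test', '!'], ['Different']],
                  [['No', 'Match']]]

    # nltk expects the references first, then the hypotheses.
    print(round(corpus_bleu(references, candidates), 4))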

@cpuhrsch (Contributor):

Great job! Thanks for writing this :) Most of my comments are around computational efficiency.

from torchtext.data.utils import ngrams_iterator


def _compute_ngram_counter(tokens, max_n):
Contributor:

It seems that this func has only one line. You may not need a func here.

Contributor (author):

The assumption I made in keeping this as a function:

  • Right now I use this function in these 2 places:

        reference_counters = _compute_ngram_counter(refs[0], max_n)
        for ref in refs[1:]:
            reference_counters = reference_counters | _compute_ngram_counter(ref, max_n)

  • We might want to use it in other metrics in the future (it's a long shot as of now, though).
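
(For anyone reading along: the | on Counter takes the element-wise maximum of counts, which is exactly the "maximum reference count" that BLEU clips the candidate counts against. A tiny illustration:)

    import collections

    a = collections.Counter({('the',): 2, ('cat',): 1})
    b = collections.Counter({('the',): 1, ('mat',): 1})
    # Union keeps, for each n-gram, the largest count seen in any reference.
    print(a | b)  # Counter({('the',): 2, ('cat',): 1, ('mat',): 1})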

@cpuhrsch (Contributor) commented Oct 29, 2019

If you want you could, for fun (for now), benchmark this algorithm on some fixed dataset and see how some of the changes impact your runtime. You can then also compare this to known implementations such as https://github.com/mjpost/sacreBLEU

Once something like this is merged, we want to make sure that our implementations of these metrics are at least within 10x of other implementations. Then we might decide to bind against those or write C++ code, etc.
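
A rough skeleton for such a benchmark might be (everything here is a placeholder: the data, the repeat count, and the assumption that the new function lives at torchtext.data.metrics.bleu_score; sacrebleu also operates on detokenized strings, so the comparison is only approximate):

    import timeit

    import sacrebleu
    from torchtext.data.metrics import bleu_score  # assumed location of this PR's function

    # Placeholder data; a real benchmark would use a fixed, realistically sized corpus.
    candidates = [['My', 'full', 'pytorch', 'test']] * 1000
    references = [[['My', 'full', 'pytorch', 'test', '!'], ['Different']]] * 1000

    t_ours = timeit.timeit(lambda: bleu_score(candidates, references), number=10)

    # sacrebleu takes detokenized strings, one reference stream per reference position.
    sys_stream = [' '.join(c) for c in candidates]
    ref_streams = [[' '.join(refs[i]) for refs in references] for i in range(2)]
    t_sacre = timeit.timeit(
        lambda: sacrebleu.corpus_bleu(sys_stream, ref_streams), number=10)

    print(f"torchtext: {t_ours:.3f}s  sacrebleu: {t_sacre:.3f}s")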

@sluks sluks force-pushed the master branch 2 times, most recently from 0b04716 to c1fc7b0 on October 29, 2019 17:39
@cpuhrsch (Contributor):

Generally this looks great to me! I'd be happy to merge that in its current form. There's still plenty of room for performance optimizations, but the overall setup seems fine and we have tests :D - Great job!

@vincentqb (Contributor) left a comment:

Yup, I second @cpuhrsch. This looks good to me too!

@sluks (author) commented Oct 30, 2019

@cpuhrsch @zhangguanheng66 In terms of GitHub etiquette, should I squash all these commits into 3 (one for the metric, one for the tests, and one for the docs) or into just 1? For now I've kept all the changes relative to the original PR in separate commits to keep track of each change.

@cpuhrsch (Contributor):

@sluks - when merging we're choosing to squash and merge this PR into one. So, no need to worry about that.

@sluks (author) commented Oct 31, 2019

All right, the PR passes all the tests and I added a few comments to clarify the test; I'm comfortable merging this from my side. I also rebased on top of master.

@cpuhrsch cpuhrsch merged commit f765e51 into pytorch:master Oct 31, 2019