Introduction
This page provides performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. The benchmarks compare the previous version (nutch-0.7-dev) with the patches NUTCH-60-050526.patch, NUTCH-60-050605.patch and NUTCH-60-050607.patch (see NewLanguageIdentifier for more details).
These data can be useful if you want to help improve the performance or precision of the LanguageIdentifierPlugin, or if you want to fine-tune your Nutch configuration.
Performance
Data set
These performance benchmarks were produced by running the LanguageIdentifierPlugin on a set of 492 French files with a total size of 171.3 MB. The files were extracted from the European Parliament Proceedings Parallel Corpus 1996-2003 Release v2.
Raw results
The following table shows the LanguageIdentifierPlugin processing time in ms for several versions. Each patched version is configured to be comparable with the nutch-0.7-dev version, i.e. it uses 1-grams, 2-grams, 3-grams and 4-grams for the analysis. The Data Size column is the number of bytes of each file used to perform the identification. The % columns give the reduction in processing time relative to Nutch-0.7. The other columns represent the following configurations:
- Nutch-0.7: The nutch-0.7-dev LanguageIdentifierPlugin version (without patch).
- NUTCH-60-050526: The LanguageIdentifierPlugin code with NUTCH-60-050526.patch applied.
- NUTCH-60-050607: The LanguageIdentifierPlugin code with NUTCH-60-050607.patch applied.
| Data Size (bytes) | Nutch-0.7 time (ms) | NUTCH-60-050526 time (ms) | NUTCH-60-050526 reduction (%) | NUTCH-60-050607 time (ms) | NUTCH-60-050607 reduction (%) |
|---|---|---|---|---|---|
| 128 | 2410 | 1485 | 38.38 | 716 | 70.29 |
| 256 | 2842 | 1836 | 35.40 | 1048 | 63.12 |
| 512 | 3759 | 2305 | 38.68 | 1649 | 56.13 |
| 1024 | 5899 | 5130 | 13.04 | 2839 | 51.87 |
| 2048 | 8581 | 7462 | 13.04 | 4534 | 47.16 |
| 4096 | 12622 | 10513 | 16.71 | 8031 | 36.37 |
| 8192 | 21360 | 18289 | 14.38 | 13803 | 35.38 |
| 16384 | 32073 | 29488 | 8.06 | 23733 | 26.00 |
| 32768 | 58535 | 49417 | 15.58 | 41994 | 28.26 |
| 65536 | 99861 | 91285 | 8.59 | 81612 | 18.27 |
| 131072 | 184083 | 161258 | 12.40 | 140501 | 23.68 |
| 262144 | 309438 | 289395 | 6.48 | 244369 | 21.03 |
| 524288 | 504145 | 442028 | 12.32 | 377693 | 25.08 |
| Total | 1245608 | 1109891 | 10.90 | 942522 | 24.33 |
| Average | 95816 | 85376.23 | 10.90 | 72501.69 | 24.33 |
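The % figures are consistent with the relative reduction in processing time against the Nutch-0.7 baseline, that is (baseline - patched) / baseline. The short Python sketch below recomputes a few of them from the timings in the table; the function name `time_reduction_percent` is only illustrative and is not part of Nutch.

```python
def time_reduction_percent(baseline_ms: float, patched_ms: float) -> float:
    """Relative reduction in processing time versus the baseline, in percent."""
    return (baseline_ms - patched_ms) / baseline_ms * 100.0


# (data size, Nutch-0.7 time in ms, NUTCH-60-050607 time in ms), copied from the table
rows = [
    (128, 2410, 716),
    (65536, 99861, 81612),
    ("Total", 1245608, 942522),
]

for size, baseline, patched in rows:
    print(size, f"{time_reduction_percent(baseline, patched):.2f}")
```

The same function applied to the NUTCH-60-050526 timings gives the other % column.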
Graphical representation
http://frutch.free.fr/images/nutch/langid-benchs03.jpg
Graphical representation (log axis)
http://frutch.free.fr/images/nutch/langid-benchs04.jpg
Discussion
- The NUTCH-60-050607.patch reduces processing time by between 18.27% and 70.29% depending on the data size, and by 24.33% overall.
- Profiling of the code confirms what SamiSiren suggested in a previous message: "the most timeconsuming part of language identifier is splitting the text into ngrams and propably the biggest optimization could be done there". It shows that splitting the text takes around 25% of the whole process.
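To make the profiled step concrete, here is a minimal Python sketch of splitting text into 1- to 4-grams and counting them. It is an illustration only and not the plugin's code: the function name `ngram_counts`, the lowercasing and the `_` word-boundary padding are all assumptions.

```python
from collections import Counter


def ngram_counts(text: str, min_n: int = 1, max_n: int = 4) -> Counter:
    """Count character n-grams of length min_n..max_n within each word of text."""
    counts: Counter = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # assumed boundary marker, not taken from the plugin
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the padded word.
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


profile = ngram_counts("le parlement")
print(profile.most_common(5))
```

In this sketch every character position produces up to four substrings, so the work grows linearly with the amount of text analyzed, which is consistent with this step weighing heavily in a profile.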
Precision
Data set
These precision benchmarks were produced by running the LanguageIdentifierPlugin on the first Data Size bytes of each file in a set of:
- 492 French files,
- 487 English files,
- 488 German files.
(These files were extracted from the European Parliament Proceedings Parallel Corpus 1996-2003 Release v2.)
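The measurement described above can be sketched as follows: truncate each file to its first Data Size bytes, run an identifier on the sample, and report the percentage of files labelled with the expected language. This is a hedged Python sketch, not the harness that produced the results; `accuracy_percent`, `toy_identify` and the sample documents are hypothetical, and hooking in the actual plugin is left out.

```python
from typing import Callable, Iterable


def accuracy_percent(
    documents: Iterable[bytes],
    expected_lang: str,
    data_size: int,
    identify: Callable[[bytes], str],
) -> float:
    """Percentage of documents whose first data_size bytes are identified as expected_lang."""
    total = 0
    correct = 0
    for doc in documents:
        total += 1
        if identify(doc[:data_size]) == expected_lang:
            correct += 1
    return 100.0 * correct / total if total else 0.0


# Toy stand-in for a real identifier, for demonstration only.
def toy_identify(sample: bytes) -> str:
    return "fr" if b"le " in sample else "en"


docs = [b"le parlement europeen", b"nous avons le temps", b"le debat est clos"]
for size in (8, 64):
    print(size, accuracy_percent(docs, "fr", size, toy_identify))
```

Running such a function once per Data Size and per language gives one figure per language; averaging the three per-language figures gives an overall value for that size.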
Raw results
Each value is the percentage of files correctly identified; avg is the mean of the fr, en and de values.

| Data Size (bytes) | Nutch-0.7 avg | Nutch-0.7 fr | Nutch-0.7 en | Nutch-0.7 de | NUTCH-60-050605 avg | NUTCH-60-050605 fr | NUTCH-60-050605 en | NUTCH-60-050605 de | NUTCH-60-050607 avg | NUTCH-60-050607 fr | NUTCH-60-050607 en | NUTCH-60-050607 de |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 38.84 | 36.99 | 10.47 | 69.06 | 14.00 | 2.64 | 2.67 | 36.68 | 51.11 | 48.37 | 19.30 | 85.66 |
| 16 | 70.38 | 58.74 | 75.15 | 77.25 | 45.64 | 13.41 | 68.17 | 55.33 | 94.06 | 97.36 | 87.68 | 97.13 |
| 32 | 66.51 | 55.08 | 86.86 | 57.58 | 56.43 | 41.26 | 73.92 | 54.10 | 98.56 | 99.59 | 96.30 | 99.80 |
| 64 | 97.14 | 97.15 | 97.54 | 96.72 | 65.35 | 53.86 | 84.80 | 57.38 | 99.93 | 100 | 99.79 | 100 |
| 128 | 97.90 | 94.51 | 99.79 | 99.39 | 77.81 | 70.53 | 89.32 | 73.57 | 100 | 100 | 100 | 100 |
| 256 | 100 | 100 | 100 | 100 | 90.32 | 90.04 | 92.20 | 88.73 | 100 | 100 | 100 | 100 |
| 512 | 100 | 100 | 100 | 100 | 96.93 | 98.17 | 97.54 | 95.08 | 100 | 100 | 100 | 100 |
| 1024 | 100 | 100 | 100 | 100 | 99.59 | 99.80 | 99.79 | 99.18 | 100 | 100 | 100 | 100 |
| 2048 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Graphical representation
http://frutch.free.fr/images/nutch/langid-benchs05.jpg
Discussion
TODO