Language filter ineffective

Post by davide » Mon May 18, 2015 10:06 pm

EDIT: JS fiddle: https://jsfiddle.net/DavideBaldini/uu7upmeu/9/

~~

As the screenshot shows, the language filter is ineffective.
I restricted the search to English documents only, yet German results remain abundant. Is this perhaps intended?

In case it is useful: I wrote an excellent language detector in JavaScript that recognizes the language of even very short texts of just 5 to 10 words. It can currently distinguish English, Spanish, French, Italian, German and Russian.

It is available inside a Firefox add-on package of mine: https://addons.mozilla.org/en-US/firefo ... xt-reader/
To show the quality of the algorithm, here is a video example of the recognition: https://vimeo.com/113796496


Attachment: english_search.gif (104.38 KiB), viewed 5701 times
Last edited by davide on Thu Oct 29, 2015 6:07 pm; edited 5 times in total.
davide
Posts: 84
Joined: Fri Feb 15, 2013 8:03 am

Re: Language filter ineffective

Post by davide » Tue May 19, 2015 2:59 am

Since the JS package above isn't so straightforward to pick apart, I can extract the relevant parts and share them somewhere, if desired.
They would consist of the dictionary files plus a few functions grouped into a single file. The whole excerpt would weigh a few dozen KB.

Re: Language filter ineffective

Post by Orbiter » Tue May 19, 2015 1:37 pm

The language recognition is actually very fuzzy.
Orbiter
Posts: 5796
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: Language filter ineffective

Post by davide » Tue May 19, 2015 3:29 pm

Thanks for answering, Michael.
I think we can frankly agree that the current recognition method doesn't serve the average user's needs well. I understand that YaCy is predominantly a German-language project whose users and crawlers move almost exclusively in German waters, so it's hard to notice when the language detector doesn't work. But here's my report: it doesn't work.
Flashback: I have already raised the problem of a language gap in the YaCy community.

The reason I'm taking the time to post on this forum is not to criticize anyone. It's for YaCy to improve.
Practically speaking, I can't move my YaCy installation from testing to production unless I can rely on the language of the results, that is, unless I can filter out the German results that accompany every query I run.

Moreover, I offer my excellent open source algorithm for language detection. It works really well on both short and long texts and is easily extended to new languages. It is used by 1200 people right now.
Since it's written in JS, it should be easy to import into YaCy by adding only a few files.

If necessary, I can provide quick instructions on how to use it. The license is GPLv2, as specified on the page linked above.
In summary: it's computationally fast; it correctly detects 100% of documents longer than 10 words; it doesn't rely on external services; it's about 50 KB of JS.

Re: Language filter ineffective

Post by Orbiter » Tue May 19, 2015 4:00 pm

The language detection method has nothing to do with the language of the developers.
Language detection in JavaScript is unfortunately not applicable, since any detection method would need to identify the language from within the Java code.
Anyway, where is it?

Re: Language filter ineffective

Post by davide » Wed May 20, 2015 11:39 pm

I have taken the time to extract the relevant data from the above-mentioned package and packed the files into a working example, attached here.
The example is self-explanatory.

Code:
apt-get install nodejs
tar -axf LanguageDetect.tar.gz
cd ./scripts
nodejs languageDetectDemo.js


You'll notice the algorithm is very simple. The key lies in the size of the vocabularies and the order in which they are sorted. Keep this in mind when creating additional vocabularies:
  • The word count per dictionary must stay at about 1000. Deviating from this will cause bias.
  • Words must be sorted from the most frequently used in that language to the least used. Each dictionary contains only the top thousand words by frequency.
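
To illustrate the two rules above, here is a minimal sketch of how equally sized, frequency-sorted dictionaries could drive detection. It is not the code from the attached package: the tiny eight-word dictionaries, the function names `scoreLanguages` and `detectLanguage`, and the weighting (dictionary size minus rank) are all assumptions made for illustration, and the sketch assumes a current Node.js release.

```javascript
// Hypothetical miniature dictionaries, each sorted from most to least
// frequent word. Real ones would hold about 1000 words each; all lists
// here have the same length so that no language is favoured.
const dictionaries = {
  en: ["the", "of", "and", "to", "in", "is", "that", "it"],
  de: ["der", "die", "und", "in", "den", "von", "zu", "das"],
  it: ["di", "e", "il", "la", "che", "in", "un", "per"],
};

// Build a word -> rank lookup per language (rank 0 = most frequent).
const ranks = {};
for (const [lang, words] of Object.entries(dictionaries)) {
  ranks[lang] = new Map(words.map((word, index) => [word, index]));
}

// Score a text against every dictionary. A hit on a more frequent word
// earns more points (assumed weighting: dictionary size minus rank).
function scoreLanguages(text) {
  const tokens = text.toLowerCase().split(/[\s.,;:!?"'()]+/).filter(Boolean);
  const scores = {};
  for (const lang of Object.keys(ranks)) {
    let score = 0;
    for (const token of tokens) {
      const rank = ranks[lang].get(token);
      if (rank !== undefined) score += ranks[lang].size - rank;
    }
    scores[lang] = score;
  }
  return scores;
}

// Return the best-scoring language, or null when no word matched at all.
function detectLanguage(text) {
  const scores = scoreLanguages(text);
  let best = null;
  for (const [lang, score] of Object.entries(scores)) {
    if (score > 0 && (best === null || score > scores[best])) best = lang;
  }
  return best;
}
```

Calling `detectLanguage("Der Hund und die Katze")` picks the German dictionary, because only that one contains "der", "und" and "die". The sketch also shows why unequal dictionary sizes would bias the outcome: a longer list gives a language more chances to match and a higher weight per hit.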

If you get a syntax error, note that nodejs v0.12 is required.

I tried to write a patch myself by looking at YaCy's current implementation of the language detector, but I know nothing about Java.
Attachments
LanguageDetect.tar.gz
(37.92 KiB) Downloaded 166 times

Re: Language filter ineffective

Post by davide » Fri May 22, 2015 11:41 pm

This improvement matters a great deal to me, as I'm in the process of deciding whether to make a substantial hardware investment in YaCy.
I need results filtered by language not out of mere "personal taste", but because a parser program will process the results and needs to map English keywords.

The algorithm is likely very simple to port to Java, and it is also particularly effective.
If anything is wrong, please let me know.

Re: Language filter ineffective

Post by davide » Tue Jun 16, 2015 12:19 pm

Any progress?


Language detection is a core feature of a search engine, and in YaCy it barely works. As I understand it, detection is currently based on date format recognition and the <head> tag. We both already know this is fuzzy at best.

Simple as it is, the algorithm I implemented in JS for my TTS software works very well for all six supported languages. I also provided a demo package, ready to download and run, so you can quickly see how effective it is. I also know you have experience with JS, so you can understand the code.

I have submitted many patches to other FOSS projects in the past; it's unfortunate that Java is not yet part of my skill set. Can you at least tell me where a reimplementation of the language detector sits on the priority list?

Re: Language filter ineffective

Post by biolizard89 » Fri Jul 31, 2015 11:08 pm

davide wrote: Any progress?


Language detection is a core feature of a search engine, and in YaCy it barely works. As I understand it, detection is currently based on date format recognition and the <head> tag. We both already know this is fuzzy at best.

Simple as it is, the algorithm I implemented in JS for my TTS software works very well for all six supported languages. I also provided a demo package, ready to download and run, so you can quickly see how effective it is. I also know you have experience with JS, so you can understand the code.

I have submitted many patches to other FOSS projects in the past; it's unfortunate that Java is not yet part of my skill set. Can you at least tell me where a reimplementation of the language detector sits on the priority list?


I'd also like to see a response to this.
biolizard89
Posts: 61
Joined: Thu Jan 03, 2013 12:42 am

Re: Language filter ineffective

Post by Orbiter » Fri Jul 31, 2015 11:39 pm

Language detection in YaCy has always been fuzzy. Just recently I experimented with language detection based on Bayes filters in the loklak project. It works in many cases but also fails quite often; this is simply a complex problem. I may try to add the loklak method to YaCy. I have already added the Bayes classes, but they will be used for something else first.
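
For readers unfamiliar with the approach, here is a minimal sketch of what Bayes-based language detection can look like: a naive Bayes classifier over word tokens with add-one smoothing. It is not the loklak or YaCy code; the class name `NaiveBayesLanguageClassifier`, its methods, and the choice of whole words as features are assumptions made for illustration.

```javascript
// Naive Bayes text classifier over word tokens with add-one smoothing.
// All names here are hypothetical, not loklak or YaCy APIs.
class NaiveBayesLanguageClassifier {
  constructor() {
    this.wordCounts = new Map(); // language -> Map(word -> count)
    this.totalWords = new Map(); // language -> number of training tokens
    this.docCounts = new Map();  // language -> number of training texts
    this.vocabulary = new Set(); // every word seen during training
    this.totalDocs = 0;
  }

  static tokenize(text) {
    return text.toLowerCase().split(/[\s.,;:!?"'()]+/).filter(Boolean);
  }

  // Add one labelled training text.
  train(language, text) {
    if (!this.wordCounts.has(language)) {
      this.wordCounts.set(language, new Map());
      this.totalWords.set(language, 0);
      this.docCounts.set(language, 0);
    }
    const counts = this.wordCounts.get(language);
    for (const word of NaiveBayesLanguageClassifier.tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.totalWords.set(language, this.totalWords.get(language) + 1);
      this.vocabulary.add(word);
    }
    this.docCounts.set(language, this.docCounts.get(language) + 1);
    this.totalDocs += 1;
  }

  // Return the language with the highest log posterior, or null if untrained.
  classify(text) {
    const tokens = NaiveBayesLanguageClassifier.tokenize(text);
    let best = null;
    let bestLogProb = -Infinity;
    for (const [language, counts] of this.wordCounts) {
      // Start from the log prior, then add one log likelihood per token.
      let logProb = Math.log(this.docCounts.get(language) / this.totalDocs);
      const denominator = this.totalWords.get(language) + this.vocabulary.size;
      for (const word of tokens) {
        logProb += Math.log(((counts.get(word) || 0) + 1) / denominator);
      }
      if (logProb > bestLogProb) {
        bestLogProb = logProb;
        best = language;
      }
    }
    return best;
  }
}
```

Trained on a single sentence per language, such a classifier already separates texts that share words with the training data, but it has nothing to go on for unseen words. That is one plausible reason a Bayes approach works in many cases and still fails fairly often when training data is thin.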

Re: Language filter ineffective

Post by biolizard89 » Sat Aug 01, 2015 10:28 am

Orbiter wrote: Language detection in YaCy has always been fuzzy. Just recently I experimented with language detection based on Bayes filters in the loklak project. It works in many cases but also fails quite often; this is simply a complex problem. I may try to add the loklak method to YaCy. I have already added the Bayes classes, but they will be used for something else first.


I think the concern here is that davide has offered to assist, and his offer has, as far as I can tell from this thread, been met with silence. @Orbiter, is YaCy willing to look at davide's code?

Re: Language filter ineffective

Post by Orbiter » Sat Aug 01, 2015 9:06 pm

biolizard89 wrote: I think the concern here is that davide has offered to assist, and his offer has, as far as I can tell from this thread, been met with silence. @Orbiter, is YaCy willing to look at davide's code?

You recommended davide's code: did YOU actually test it?

Re: Language filter ineffective

Post by Cajun » Mon Aug 03, 2015 7:49 pm

Solr supports two implementations of language detection at index time, configured in solrconfig.xml; see: https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing. The better algorithm seems to achieve an accuracy of about 99.2% for most, though not all, languages.

(How) Could this approach be used as an alternative to the YaCy language filter?
Cajun
Posts: 10
Joined: Tue Nov 19, 2013 9:35 pm

Re: Language filter ineffective

Post by Orbiter » Thu Aug 06, 2015 9:52 pm

That's a good hint, thank you! It looks easy to do. Let me see, but give me time: cccamp15 is coming up and I doubt there will be time before the camp starts.

Re: Language filter ineffective

Post by biolizard89 » Sun Aug 09, 2015 10:08 am

Orbiter wrote:
biolizard89 wrote: I think the concern here is that davide has offered to assist, and his offer has, as far as I can tell from this thread, been met with silence. @Orbiter, is YaCy willing to look at davide's code?

You recommended davide's code: did YOU actually test it?


I am not clear on why you think I recommended davide's code. Look back through this thread; I never said that. I said something entirely different: davide expressed an interest in contributing, and his posts got a near-zero response. That discourages contribution, and is unfortunate.

Re: Language filter ineffective

Post by davide » Thu Oct 29, 2015 2:43 pm

I created a JSFiddle with my language detector.
Now you can test it directly in the browser:

https://jsfiddle.net/DavideBaldini/uu7upmeu/9/

Re: Language filter ineffective

Post by biolizard89 » Wed Nov 25, 2015 12:30 am

davide wrote: I created a JSFiddle with my language detector.
Now you can test it directly in the browser:

https://jsfiddle.net/DavideBaldini/uu7upmeu/9/


Hi,

Any idea how well your code performs in terms of accuracy compared to the two methods supported by Solr?

Re: Language filter ineffective

Post by davide » Wed Nov 25, 2015 1:03 am

biolizard89 wrote: Hi,

Any idea how well your code performs in terms of accuracy compared to the two methods supported by Solr?


I have no comparison figures. My implementation works quite well, however: on medium to long phrases the accuracy is virtually 100%, while on short phrases (3 words or fewer) it is roughly 50%. FWIW, I have never seen it miss on a phrase longer than 5 words.