Search result for short words


Search result for short words

Post by nstaudt » Tue Sep 07, 2010 7:16 am

Hi, I've noticed that if I search for "C#", I get no results - even though I recently created an index for stackoverflow.com and I also know there is plenty of other .NET related content in YaCy.

Any thoughts / comments?

Thanks,
Nathan.
nstaudt
 
Posts: 73
Joined: Fri Aug 13, 2010 10:54 am

Re: Search result for short words

Post by Low012 » Tue Sep 07, 2010 7:59 am

YaCy does not store words that are shorter than 2 characters. To YaCy, C# is a one-character word: it does not treat the # as a character and simply ignores it. I think it would be great to have some way to differentiate between words with "weird" characters, like C#, and strings which don't make any sense, but YaCy is not that smart so far.
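The behaviour described above can be illustrated with a minimal Python sketch. It is not YaCy's actual tokenizer: the function name `index_words`, the constant `MIN_WORD_LENGTH`, and the regular expression are assumptions made for illustration only.

```python
import re

# Assumption: mirrors the "shorter than 2 characters" rule described above.
MIN_WORD_LENGTH = 2

def index_words(text):
    """Split text into lowercase alphanumeric runs and drop the short ones."""
    # Anything that is not a letter or digit (such as '#' or '.') acts as a separator.
    candidates = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in candidates if len(word) >= MIN_WORD_LENGTH]

print(index_words("C# and .NET questions"))
```

With a rule like this, "C#" is reduced to "c" and then dropped, so neither the indexed page nor the search query contains a usable term for it.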

Another search term which doesn't work as expected is, for example, "c-base". YaCy stores words which are connected with a hyphen as two words, and since the "c" in "c-base" is only one character, it is omitted. A solution in this case could be to split hyphenated words as usual, except when one of the resulting words has only one character. But then "c-base" could only be found by searching for "c-base", not by searching for "base", unless both "c-base" and "base" were stored in the database. I don't know whether this would blow up the database considerably; I guess that would have to be investigated. To do things like this I like to write small programs in Perl and abuse Wikipedia as a text corpus by downloading random pages. I guess what would have to be done is to calculate the percentage of words with a hyphen and a single character in relation to all words. But this is getting off topic since it does not solve the C# problem.
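The proposed hyphen rule and the suggested measurement could look roughly like the following Python sketch (Python is used here instead of Perl). The function names and the word pattern are hypothetical, and the sample sentence is made up; a real investigation would run over a larger corpus.

```python
import re

def split_hyphenated(token):
    """Split at hyphens unless one of the parts would be a single character."""
    parts = token.split("-")
    if len(parts) > 1 and any(len(part) == 1 for part in parts):
        return [token]  # keep e.g. "c-base" as one word
    return [part for part in parts if part]

def single_char_hyphen_share(text):
    """Percentage of words that contain a hyphen and a one-character part."""
    # Assumption: a word is a run of letters/digits, optionally joined by hyphens.
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    if not words:
        return 0.0
    hits = sum(
        1 for word in words
        if "-" in word and any(len(part) == 1 for part in word.split("-"))
    )
    return 100.0 * hits / len(words)

print(split_hyphenated("c-base"))
print(split_hyphenated("peer-to-peer"))
print(single_char_hyphen_share("the c-base e-mail list"))
```

The share computed by the second function gives a first estimate of how many extra entries the exception would add to the index.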
Low012
 
Posts: 2214
Joined: Wed Jun 27, 2007 12:11 pm

Re: Search result for short words

Post by nstaudt » Wed Sep 08, 2010 11:31 am

Ah, that's a shame. Would it be an idea to have the search input interface apply "mappings" from small words to larger words? (For example, a mappings file which maps C# => CSharp, "DOT NET".)
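A query-side mapping of this kind might be sketched as follows. This is only an illustration in Python: `QUERY_MAPPINGS` and `expand_query` are hypothetical names, and the single entry is taken from the example above.

```python
# Hypothetical mapping table; the entry follows the example in the post above.
QUERY_MAPPINGS = {
    "c#": ["csharp", "dot net"],
}

def expand_query(query):
    """Replace each mapped term with its longer alternatives; keep other terms."""
    terms = []
    for term in query.lower().split():
        terms.extend(QUERY_MAPPINGS.get(term, [term]))
    return terms

print(expand_query("C# generics"))
```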
nstaudt
 
Posts: 73
Joined: Fri Aug 13, 2010 10:54 am

Re: Search result for short words

Post by Orbiter » Wed Sep 08, 2010 3:50 pm

That's a fantastic idea! It would also subsume a concept of synonym mapping, which would then enable all other kinds of 'strange' mappings and abbreviations.
For example:
c-base -> cbase
e-mail -> email
tel -> phone
tel. -> phone
telephone -> phone

The parser and the input field would need to use the same mapping, so the effect would be 'invisible' to the user. The mapping must also be used by the snippet fetcher, which makes things a little bit complicated.

We could have a default mapping dictionary which can be extended by user dictionaries to enable synonym matching, if wanted.
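The idea of one mapping shared by the parser and the input field, with a default dictionary extended by user dictionaries, can be sketched in Python as below. The names `DEFAULT_SYNONYMS`, `build_mapping`, and `normalize` are assumptions for illustration, the default entries are the ones listed above, and the snippet fetcher is left out.

```python
# Default entries taken from the list above; everything else is hypothetical.
DEFAULT_SYNONYMS = {
    "c-base": "cbase",
    "e-mail": "email",
    "tel": "phone",
    "tel.": "phone",
    "telephone": "phone",
}

def build_mapping(user_synonyms=None):
    """Merge the default dictionary with an optional user dictionary (user entries win)."""
    mapping = dict(DEFAULT_SYNONYMS)
    mapping.update(user_synonyms or {})
    return mapping

def normalize(text, mapping):
    """Map every whitespace-separated token to its canonical form."""
    return [mapping.get(token, token) for token in text.lower().split()]

mapping = build_mapping({"c#": "csharp"})

# The same normalize() call is used for the document and for the query.
indexed = set(normalize("Send an e-mail or call tel. 555 about C#", mapping))
query = normalize("telephone email csharp", mapping)
print(all(term in indexed for term in query))
```

Because both sides pass through the same function, a search for "telephone" matches a page that only contains "tel.", without the user seeing the canonical form.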
Orbiter
 
Posts: 5796
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: Search result for short words

Post by nstaudt » Wed Sep 08, 2010 5:37 pm

You would have to be careful... I think the list would have to be moderated (and then probably signed); otherwise an "angry" peer could set up mappings that make the results worse.
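Checking that a mapping list has not been tampered with could be sketched as follows. This is a simplification under stated assumptions: it uses an HMAC with a shared secret key from the Python standard library as a stand-in, whereas a list distributed between peers would more likely need a public-key signature from the moderator. The function names are hypothetical.

```python
import hashlib
import hmac
import json

def sign_mappings(mappings, key):
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of the list."""
    payload = json.dumps(mappings, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_mappings(mappings, tag, key):
    """Accept the list only if the tag matches; the comparison is constant-time."""
    return hmac.compare_digest(sign_mappings(mappings, key), tag)

key = b"moderator-secret"  # assumption: known only to the moderator and verifiers
approved = {"c#": "csharp", "e-mail": "email"}
tag = sign_mappings(approved, key)

print(verify_mappings(approved, tag, key))
print(verify_mappings({"c#": "java", "e-mail": "email"}, tag, key))
```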
nstaudt
 
Posts: 73
Joined: Fri Aug 13, 2010 10:54 am

