Crowd sourcing better search results


Post by sherlock » Sat Dec 10, 2011 6:05 pm

Translated by Google, because my German is very broken! :)

---------

I wonder whether some kind of machine-learning solution could be applied to improve Yacy's ranking algorithm, for example:

  • Take a list of search phrases for which we know there is a canonical URL that is clearly the optimal result (e.g. Ubuntu.com for Ubuntu).
  • Have software run through all the permutations of how the ranking criteria can be set, against the various keyphrase / URL list entries.
  • Record the average position of the "optimal URL" in the result list for each phrase / URL pair.

The combination of ranking criteria that consistently scores highest across the various phrase / URL pairs should (in theory) be roughly the best algorithm. As new ranking criteria are added (like my social-metrics idea), we can test them against the leading algorithm. This can also be used to determine whether a new ranking criterion actually adds real value to the results or not.

We crowdsource indexing. It would be great to crowdsource the search for the best ranking algorithm as well!
sherlock
 
Posts: 2
Joined: Sat Dec 10, 2011 6:03 pm

Re: Crowd sourcing better search results

Post by pokey909 » Sun Dec 11, 2011 12:52 pm

Please ask in English if your German isn't good enough, because the Google translation is not understandable at all.
pokey909
 
Posts: 1
Joined: Sun Dec 11, 2011 12:50 pm

Re: Crowd sourcing better search results

Post by sherlock » Mon Dec 12, 2011 8:37 pm

Ok, once more (with some extra stuff added) in English:

The big problem I see new users complain about is not the size of the index, not the resource usage or speed, not the user interface... but the search results when they first try Yacy. At first I thought the problem was just that the index is too small. Then I noticed that most of the time, the website they were looking for was in the index... but being ranked very low. For example, wikipedia.org currently shows up 3 spots lower than http://ja.wikipedia.org/wiki/Linux in a search for "wikipedia". This is a ranking problem, not an index problem.

Ranking criteria can be adjusted in Yacy, but the defaults aren't working. Can we use machine learning / crowdsourcing / something else to fix the default ranking settings? For example, what if:

  • Take a list of keyphrases with known "correct" URLs (a "canonical URL"). For example, phrase:"ubuntu" == url:"http://Ubuntu.com".
  • Have software search Yacy for each keyphrase, with all possible permutations of Yacy's ranking criteria.
  • Record the average ranking of the canonical URL in Yacy (which ideally should show up at #1).
  • Look at which combination of ranking criteria consistently gets each phrase's canonical URL the highest ranking.

Other than creating the list of phrase/url pairs, this is 100% automated.
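The procedure above is simple enough to sketch in code. A minimal, hypothetical Python sketch: `rank_fn` stands in for an actual YaCy query under a given set of ranking weights (no such API call is assumed to exist under that name), and the phrase / URL list is a made-up example.

```python
import itertools

# Hypothetical hand-made ground truth: phrase -> canonical URL.
CANONICAL = {
    "ubuntu": "http://ubuntu.com",
    "wikipedia": "http://wikipedia.org",
}

def average_rank(weights, rank_fn):
    """Average 1-based position of each canonical URL when searching
    with the given ranking `weights`.  `rank_fn(url, phrase, weights)`
    is a stand-in for an actual query against the engine."""
    ranks = [rank_fn(url, phrase, weights) for phrase, url in CANONICAL.items()]
    return sum(ranks) / len(ranks)

def best_weights(criteria, rank_fn, settings=range(16)):
    """Exhaustively score every combination of settings and keep the
    one that pushes the canonical URLs closest to #1.  (Exhaustive
    search is only feasible for a handful of criteria.)"""
    best, best_score = None, float("inf")
    for combo in itertools.product(settings, repeat=len(criteria)):
        weights = dict(zip(criteria, combo))
        score = average_rank(weights, rank_fn)
        if score < best_score:
            best, best_score = weights, score
    return best
```

With a real `rank_fn`, the returned dictionary would be the candidate default ranking configuration.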

The combination of ranking criteria that consistently gets the URL we want closest to #1 should (theoretically) be the best. Then we have a better default set of search ranking criteria to give all Yacy users. We can also test ideas for new ranking criteria (like my idea of social metrics à la Google / Facebook), to see whether they actually add real value to the results or not.

We already crowdsource crawling / indexing; it seems silly to stop there. We can use that same distributed database to help us get the best results for our searches! I'm sure there are even better ways of doing this than what I wrote above. I'm not a coder. My idea is just a proof of concept.
sherlock
 
Posts: 2
Joined: Sat Dec 10, 2011 6:03 pm

Re: Crowd sourcing better search results

Post by flami » Tue Dec 13, 2011 11:43 pm

sherlock wrote: Ok, once more (with some extra stuff added) in English:
  • Have software search Yacy for each keyphrase, with all possible permutations of Yacy's ranking criteria.


If I counted right, there are about 40 criteria and each can have 16 settings, so you would have 16^40 = 2^160 possibilities. That is about the size of trying to brute-force 3DES, so it would most likely take more energy than the sun has ever produced.
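The back-of-the-envelope count checks out (taking the stated figures of 40 criteria and 16 settings as given):

```python
# 40 criteria with 16 settings each gives 16^40 combinations.
# Since 16 = 2^4, that is 2^(4*40) = 2^160 -- the same order of
# magnitude as the 168-bit 3DES keyspace.
combos = 16 ** 40
assert combos == 2 ** 160
# combos is a 49-digit number, about 1.5 * 10^48.
```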

Machine learning would, for example, try to find which criterion has the largest impact and go from there. But in the end, if you only taught the machine that "search query appears in the hostname" means good, you would teach it to blindly boost pages that have the search word in their name. So e.g. someone could register yacy1.com, yacy2.com, ..., populate them with spam, and all of them would get a great ranking even though they have nothing to do with a search for yacy.
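That "largest impact first" idea can be made concrete as a greedy coordinate search over the criteria, which needs only sweeps * criteria * settings evaluations instead of settings^criteria. A hedged sketch, assuming the caller supplies a `score` function (e.g. the average rank of the canonical URLs from the proposal above):

```python
def coordinate_search(criteria, score, settings=range(16), sweeps=3):
    """Greedy coordinate descent: repeatedly sweep over the criteria,
    setting each one to the value that minimises score(weights) while
    the others stay fixed.  Cheap, but it can get stuck in a local
    optimum, and it is only as good as the training pairs behind
    `score` -- exactly the spam problem described above."""
    weights = {c: 0 for c in criteria}
    for _ in range(sweeps):
        for c in criteria:
            weights[c] = min(settings, key=lambda v: score({**weights, c: v}))
    return weights
```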
I don't know much about search-result ranking, but it is quite a large problem, and even though the learning part is 100% automated, that does not mean that generating the learning material is a trivial task (or that the learning part can be done in less than the universe's remaining lifetime).
flami
 
Posts: 19
Joined: Tue Nov 29, 2011 9:57 am

