Renovating P2P ranking challenge - proposal


Post by reger » Wed Jan 04, 2017 4:07 am

IMHO we have an unsolved challenge with result ranking, and I think it's time (between now and v2.0) to tackle this issue.

What challenges do we have
  • internally we work with 2 indexes and 2 ranking models (the Solr index/ranking and the RWI index/ranking)
  • the ranking figures (numbers) of these 2 are not comparable – currently they are just adjusted to fall into a similar number range
  • remote Solr results are ranked by the provided fields and boosts, but the same doc from different peers can come back with different ranking values
  • the ranking from RWI changes over the duration of a search because the result-feature ranges it factors in change: one ranking factor is, for example, (urllength - min.urllength) / (max.urllength - min.urllength), and as results are collected the min/max can change with every new result, while already-ranked items don't reflect that (see the sketch after this list)
  • in P2P (remote) searches the quickest available results are shown on the 1st page while better results may only show up later (because of this concurrency, the 1st page is, ranking-wise, especially the worst one)
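To make the RWI point above concrete, here is a minimal, self-contained Java sketch (illustrative names and values only, not YaCy code) of such a min-max normalized URL-length factor, recomputed as results arrive:

Code:
import java.util.ArrayList;
import java.util.List;

// Illustration of the feature normalization mentioned above:
// factor = (urllength - min.urllength) / (max.urllength - min.urllength),
// where min/max are taken from the results seen so far and therefore
// keep shifting while results are still arriving.
public class UrlLengthFactorDemo {
    public static void main(String[] args) {
        List<Integer> urlLengths = new ArrayList<>();
        int[] arriving = {40, 25, 90, 15, 120}; // URL lengths of incoming results

        for (int len : arriving) {
            urlLengths.add(len);
            int min = urlLengths.stream().min(Integer::compare).get();
            int max = urlLengths.stream().max(Integer::compare).get();
            // factor of the FIRST collected result, recomputed after each arrival
            double factor = (max == min) ? 0.0
                    : (urlLengths.get(0) - min) / (double) (max - min);
            System.out.println("after " + urlLengths.size()
                    + " results: factor of result #1 = " + factor);
        }
        // The same document gets a different factor every time min/max move;
        // items that were ranked earlier never see the updated value.
    }
}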


Proposal for a solution
If we do a local search in one index only, we don't have any issue, because we just keep the order in which the search process supplies the results. Hm..... that's pretty simple, so why not expand on this procedure?
That means: if we search in 2 indexes (A and B), we keep the order too and simply display A1, B1, A2, B2 …, and when remote peers come into play, we basically need to behave a bit like a metasearch engine, joining many already-ordered result lists into one final result list and expanding the pattern to [A..Z]1 [A..Z]2 …. etc.
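A minimal sketch of that interleaving merge (illustrative names, not existing YaCy classes), assuming each source already delivers its results in its own rank order:

Code:
import java.util.*;

// Round-robin merge of already-ordered result lists: A1, B1, C1, A2, B2, C2, ...
// Duplicate URLs from different sources are kept only once here; giving them
// extra ranking credit (as proposed below) would be a refinement on top of this.
public class RoundRobinMerge {

    static List<String> merge(List<List<String>> orderedSources) {
        List<String> merged = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        int rank = 0;
        boolean more = true;
        while (more) {
            more = false;
            for (List<String> source : orderedSources) {
                if (rank < source.size()) {
                    more = true;
                    String url = source.get(rank);
                    if (seen.add(url)) merged.add(url);
                }
            }
            rank++;
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> a = List.of("a1", "a2", "a3");   // e.g. local Solr
        List<String> b = List.of("b1", "a2", "b3");   // e.g. RWI or a remote peer
        System.out.println(merge(List.of(a, b)));     // [a1, b1, a2, a3, b3]
    }
}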

That's probably the simplest metasearch merge strategy available; see a description (with some tuning ideas) e.g. here:
http://www.technicaljournalsonline.com/ ... 12/231.pdf

The nice part is … it is relatively simple, and I actually haven't found a similarly easy but good approach that doesn't redo the whole ranking (which the search process already did) – something that is not really possible anyway without having all the ranking data (i.e. loading the resource).

Here is a nice, understandable overview of other methods: http://ijcsi.org/papers/IJCSI-9-4-3-239-251.pdf

So here is the proposal:
  • apply the position ranking (from above) to the P2P result list (maybe with later fine tuning that takes the original source ranking, a peer's DHT position, the number of available results etc. into account as a quality factor)
  • give ranking credit to the same result coming from different peers (see the sketch after this list)
  • still apply a post-ranking
  • still display the quickest results first, but make sure the 1st page does not show the worst results from one resource if we expect more to come in (maybe except when they come from the local index)
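A rough sketch of the "ranking credit" bullet, with hypothetical weights and figures, just to show the shape of the idea: count how many peers returned the same URL and use that count as a bonus on the merged position.

Code:
import java.util.*;

// Illustrative only: results returned by several peers get a position bonus.
// The merged position is lowered by a small credit per additional source.
public class DuplicateCredit {
    public static void main(String[] args) {
        // url -> number of peers that returned it (hypothetical figures)
        Map<String, Integer> sourceCount = Map.of("a1", 1, "b1", 3, "a2", 2);
        // url -> position from the round-robin merge (0 = best)
        Map<String, Integer> mergedPos = Map.of("a1", 0, "b1", 1, "a2", 2);

        List<String> resorted = new ArrayList<>(mergedPos.keySet());
        resorted.sort(Comparator.comparingDouble(
                (String url) -> mergedPos.get(url) - 0.75 * (sourceCount.get(url) - 1)));
        System.out.println(resorted); // b1 moves ahead of a1 thanks to its 3 sources
    }
}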

Comment:
I did some rudimentary verification testing of the 1st basic part (position ranking). The result was not THAT great (the best result was not first …. but within the first 20), but also not worse than current 1st pages (tested without any fine tuning or post-ranking). That makes sense, as today the quickest available Solr & RWI results are shown in turns, which is basically a position ranking of the first two available peer results.

P.S. To not break everything with an unknown (not completely tested) outcome, we could/should apply the changes to a clone or branch and have them tested, adjusted and fine-tuned by all who are interested.

Any better idea... let me know...
reger
 

Re: Renovating P2P ranking challenge - proposal

Post by luc » Thu Jan 05, 2017 8:32 am

Hi reger, I definitely agree it is time to tackle this issue. Thank you for your detailed synthesis.

To my mind the heart of the problem is that the current YaCy peer-to-peer search mode combines two models that may be incompatible:
- federated/meta search on web portal nodes and OpenSearch systems: somehow similar in this respect to systems such as Searx
- globally distributed (aka DHT or distributed RWI) search on DHT-enabled nodes: somehow similar in this respect to SolrCloud, but without its annoying fixed-size "shards" and master/slave concepts

I am not sure I fully embrace the details of your proposal, but I hope it won't in the end move too far in the federated/meta search direction. Personally, when I use YaCy, I am interested in the globally distributed index part...

So I would add the following proposals, which I believe to be compatible with the current YaCy behavior:
- add a more user-friendly way to enable/disable federated/meta search and/or DHT search: not only available through debugging/analysis configuration settings, but rather as search options
- add the possibility to save and reuse federated search profiles: I mean allowing federated/meta search among N explicitly chosen web portal peers and/or OpenSearch systems
- feed the globally distributed index with WebPortal nodes' index data as a regular background task (maybe this is already done; I did not dig into the code enough)
- make ranking customization easier:
  - possibility to customize the ranking profile as search options
  - possibility to easily save multiple ranking profiles
  - possibility to share ranking profiles with peers, as is done with blacklists

I don't know if these ideas are better, but at least I hope they would make ranking more flexible and eventually more deterministic from a user point of view...
luc
 

Re: Renovating P2P ranking challenge - proposal

Post by reger » Fri Jan 06, 2017 5:17 am

Hi, and thanks for the feedback.

luc wrote:- globally distributed (aka DHT or distributed RWI) search on DHT-enabled nodes


To clarify, I am only talking about this part, and about the difficulties between the internal DHT/RWI index, the internal (embedded) Solr index, and results coming from other peers in the YaCy network; the idea is to treat the results received from these peers the way a metasearch engine treats its retrieved results. The OpenSearch results come on top of this (but would fit into the same ranking scheme).
P.S. Part of the compatibility challenge is, for example, this current score calculation:
Code: Select all
// so far the Solr score is used (with an arbitrary factor to bring the value into a range similar to the RWI ranking values)
Float scorex = (Float) iEntry.getFieldValue("score");
if (scorex != null && scorex > 0)
    score = (long) ((1000000.0f * scorex) - iEntry.urllength());


cu
reger
 

Re: Renovating P2P ranking challenge - proposal

Post by davide » Fri Jan 06, 2017 2:40 pm

Hi reger & luc. I'll drop here an idea for your consideration.

To take a first small step toward a functional ranking algorithm able to combine results from both Solr and RWI, I suggest dropping the arbitrary regularization factor of 1000000.0f in favor of a min-max normalization. With a min-max normalization, each Solr score would be offset by the minimum score and divided by the score range (max - min), resulting in ranks between 0.0f and 1.0f. These ranks can then be easily integrated with the RWI scores, which undergo the same normalization. This would at least remove the arbitrary regularization constant of 1000000.0f, which may create an unwanted score bias between results from the two different systems.
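A minimal sketch of that min-max normalization (plain Java with example values, not YaCy code), assuming the full score list of one source is already known at normalization time:

Code:
// Min-max normalization of a score list into [0.0, 1.0]:
// normalized = (score - min) / (max - min).
// Applied to Solr scores and RWI scores separately, both end up on the
// same scale without the arbitrary 1000000.0f factor.
public class MinMaxNormalize {

    static float[] normalize(float[] scores) {
        float min = Float.MAX_VALUE, max = -Float.MAX_VALUE;
        for (float s : scores) {
            if (s < min) min = s;
            if (s > max) max = s;
        }
        float range = max - min;
        float[] normalized = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            normalized[i] = range == 0f ? 0f : (scores[i] - min) / range;
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] solr = {0.3f, 1.7f, 4.2f};       // raw Solr scores (example values)
        float[] rwi  = {120f, 740f, 990f};       // raw RWI scores (example values)
        System.out.println(java.util.Arrays.toString(normalize(solr)));
        System.out.println(java.util.Arrays.toString(normalize(rwi)));
    }
}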

If instead you're considering redesigning the whole ranking algorithm, wouldn't it be possible to preemptively generate a cache of "fake" RWI data from the local Solr database, keep the generated RWI local, and merge it with the real RWI data in order to apply one single ranking algorithm to both combined? I think combining Solr with RWI data in this way might be a trick to eliminate Solr's own ranking from the equation.
davide
 

Re: Renovating P2P ranking challenge - proposal

Post by reger » Sun Jan 08, 2017 4:50 am

davide wrote:To take a first small step

I agree, that could be a start. I tried it before my post and found it is not stable. All normalizations fight the timing constraint of the distributed index results.
Min/max (or whatever else) are unfortunately never correct/complete at the moment we need them for normalization. So one thought is to use statistics and guess, based on the number of samples, what the min/max might be.... but to find out what the true effect of that would be.... I don't have good/enough learning sample collections.

Nevertheless, I think we should start somewhere - and as every approach bears the risk of calculating with too small or wrong (merely the first) samples, I would combine the two ideas but keep it simple (for a start). I could imagine going with a linear correction factor, calculated from the top result of both stacks (Solr/RWI), to normalize the other: factor = top-fulltext.solr-score / rwi-score, followed by final-solr-score = stack-score * factor.
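A minimal sketch of that correction factor, with hypothetical score values (not taken from YaCy code):

Code:
// Linear correction as sketched above: scale the RWI stack so that its top
// result gets the same score as the top Solr result, then rank both together.
public class TopScoreAlignment {
    public static void main(String[] args) {
        float topSolrScore = 4.2f;    // best score currently in the Solr stack
        float topRwiScore  = 990f;    // best score currently in the RWI stack

        float factor = topSolrScore / topRwiScore;   // factor = top solr-score / rwi-score

        float someRwiScore = 740f;                   // any other RWI-ranked result
        float alignedScore = someRwiScore * factor;  // final score on the Solr scale
        System.out.println("factor=" + factor + " aligned=" + alignedScore);
    }
}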

davide wrote:eliminate Solr's own ranking

Thoughts in this direction would IMHO end up re-ranking one or the other with a single ranking class. I don't plan to implement the YaCy ranker as a custom Solr ranking procedure, nor vice versa, but would like to find a solution that can deal with results that arrive ranked differently.
reger
 

Re: Renovating P2P ranking challenge - proposal

Post by biolizard89 » Sat Jan 14, 2017 11:54 am

As I mentioned on GitHub, if we consider it acceptable to run Javascript in the browser, it's feasible to re-sort the results in JS after some of them have been sent to the browser. (Based on my initial testing of the branch I pushed, doing this results in much better ranking and IMO much better UX.) RWI post-ranking could presumably be recalculated as needed and any changed results pushed to the browser again; I'm not sure how badly that would affect CPU usage.
biolizard89
 

Re: Renovating P2P ranking challenge - proposal

Post by reger » Tue Jan 17, 2017 12:09 am

biolizard89 wrote:it's feasible to re-sort the results in JS after some of them have been sent to the browser

It is not the intention of this proposal to achieve a best/good ranking only by after-treatment (e.g. via JS).
I strongly believe that spitting things out in the desired order is one of the main and important tasks of the engine itself.

P.S. As mentioned elsewhere.... maybe have a look at the unmaintained JS search interface under htroot/yacy/ui/* to see whether the after-treatment path can be incorporated there to revitalize that route.

Maybe I should read your GitHub article again, because the algorithm (criteria) used for re-sorting via JS to improve the ranking order might be what I've overlooked, and it is what I'm after.
reger
 

Re: Renovating P2P ranking challenge - proposal

Post by biolizard89 » Sun Jan 22, 2017 4:16 am

reger wrote:
biolizard89 wrote:it's feasible to re-sort the results in JS after some of them have been sent to the browser

It is not the intention of this proposal to achieve a best/good ranking only by after-treatment (e.g. via JS).
I strongly believe that spitting things out in the desired order is one of the main and important tasks of the engine itself.

P.S. As mentioned elsewhere.... maybe have a look at the unmaintained JS search interface under htroot/yacy/ui/* to see whether the after-treatment path can be incorporated there to revitalize that route.

Maybe I should read your GitHub article again, because the algorithm (criteria) used for re-sorting via JS to improve the ranking order might be what I've overlooked, and it is what I'm after.


I was unaware of the existence of htroot/yacy/ui/*; I'll look at it and see if it's a good fit for Javascript re-sorting.

The algorithm used for re-sorting in Javascript isn't particularly complex; all it does is look at the score returned for each result (using the existing "ranking" field in the YaCy JSON API) and insert each result into the HTML DOM so that the results appear in descending score order.

I think it's probably impossible to have the sorting done solely server-side, unless we're willing to accept inaccurate sorting or a significant delay in getting the results. This is because different peers return results with different latencies -- unless you can re-sort client-side, you have to live with a server-side tradeoff between low latency and accurate sorting.
biolizard89
 

