What challenge do we have?
- internally we work with 2 indexes and 2 ranking models (Solr and the RWI index/ranking)
- the ranking scores of these 2 are not comparable; currently they are just adjusted to fall into a similar number range
- remote Solr results are ranked by the provided fields and boosts, but the same document can come back with different ranking scores from different peers
- the RWI ranking changes over the search time, because the result-feature ranges it normalizes against change (e.g. one ranking factor is (urllength - min.urllength) / (max.urllength - min.urllength)). As results are collected, the min/max can change with every new result, and already-ranked items don't reflect that (see the sketch after this list)
- in P2P (remote) searches the quickest available results are shown on the 1st page, while better results may arrive later (because of this concurrency, the 1st page is the worst one in terms of ranking)
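To make the RWI normalization drift concrete, here is a minimal, self-contained sketch; the names and the incremental min/max tracking are illustrative only, not YaCy's actual classes:

```java
// Minimal sketch of the RWI normalization drift described above.
// Names (urlLength, normalize) are illustrative, not YaCy's actual code.
public class NormalizationDrift {

    static double normalize(int urlLength, int min, int max) {
        if (max == min) return 0.0; // avoid division by zero for the first result
        return (double) (urlLength - min) / (max - min);
    }

    public static void main(String[] args) {
        // Results arrive one by one; min/max are tracked incrementally.
        int[] arrivingUrlLengths = {40, 60, 25, 90};
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;

        for (int len : arrivingUrlLengths) {
            min = Math.min(min, len);
            max = Math.max(max, len);
            // The score computed *now* uses the current min/max; earlier
            // results were scored against an older, narrower range and
            // are not re-scored.
            System.out.printf("len=%d scored with range [%d,%d] -> %.2f%n",
                    len, min, max, normalize(len, min, max));
        }
        // After all four results: the first result (40) was scored with
        // range [40,40] (score 0.0), but with the final range [25,90] it
        // would score (40-25)/(90-25) = 0.23 -- its ranking is stale.
    }
}
```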
Proposal for a solution
If we do a local search in one index only, we don't have any issue, because we just keep the order in which the search process supplies the results. Hm... that's pretty simple, so why not expand on this procedure?
That means: if we do a search in 2 indexes (A and B), why not keep the order too and simply display A1, B1, A2, B2, ...? And when remote peers come into play, where we basically need to behave a bit like a metasearch engine joining many already-ordered result lists into one final list, we extend the pattern to [A..Z]1, [A..Z]2, ... etc.
That's probably the simplest metasearch merge strategy available; see a description (with some tuning ideas) e.g. here:
http://www.technicaljournalsonline.com/ ... 12/231.pdf
The nice part is... it is relatively simple, and I actually haven't found any similarly easy but good method that doesn't redo the whole ranking (which the search process has already done). Redoing it isn't really possible anyway without having all the ranking data (which means loading the resource).
Here is a nice, understandable overview of other methods: http://ijcsi.org/papers/IJCSI-9-4-3-239-251.pdf
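For illustration, a minimal sketch of that round-robin merge; this is a generic toy, not the actual YaCy search code, and the source lists are assumed to be already ordered by their own ranking:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Round-robin merge: take the 1st result of every source, then the
// 2nd of every source, and so on, until all sources are exhausted.
public class RoundRobinMerge {

    public static <T> List<T> merge(List<List<T>> rankedLists) {
        List<T> merged = new ArrayList<>();
        List<Iterator<T>> iterators = new ArrayList<>();
        for (List<T> list : rankedLists) iterators.add(list.iterator());

        boolean anyLeft = true;
        while (anyLeft) {
            anyLeft = false;
            for (Iterator<T> it : iterators) {
                if (it.hasNext()) {
                    merged.add(it.next());
                    anyLeft = true;
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> sources = List.of(
                List.of("A1", "A2", "A3"),   // e.g. local Solr
                List.of("B1", "B2"),         // e.g. local RWI
                List.of("C1", "C2", "C3"));  // e.g. a remote peer
        System.out.println(merge(sources));
        // -> [A1, B1, C1, A2, B2, C2, A3, C3]
    }
}
```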
So, what is the proposal:
- apply the position ranking (from above) to the P2P result list (maybe later fine-tune it by taking the original source ranking, a peer's DHT position, the number of available results etc. into account as quality factors)
- give ranking credit to the same result coming from different peers (see the sketch after this list)
- still apply a post-ranking
- still display the quickest results first, but make sure the 1st page doesn't show the worst results from one resource if we expect more to come in (except maybe when they come from the local index)
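To make the "ranking credit for duplicates" point concrete, here is one possible sketch. The scoring scheme (first merge position minus a fixed bonus per extra occurrence) is my own assumption, just one way such a credit could look, not an existing YaCy formula:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// After the round-robin merge, a result that appears in several peer
// lists keeps its best (earliest) merge position and additionally gets
// a bonus per duplicate occurrence.
public class DuplicateCredit {

    static final int DUPLICATE_BONUS = 2; // hypothetical tuning parameter

    public static List<String> applyCredit(List<String> mergedWithDupes) {
        // effective position = first position - bonus * extra occurrences
        Map<String, Integer> score = new LinkedHashMap<>();
        for (int pos = 0; pos < mergedWithDupes.size(); pos++) {
            String url = mergedWithDupes.get(pos);
            Integer prev = score.get(url);
            score.put(url, prev == null ? pos : prev - DUPLICATE_BONUS);
        }
        List<String> result = new ArrayList<>(score.keySet());
        result.sort(Comparator.comparingInt(score::get)); // lower = better
        return result;
    }

    public static void main(String[] args) {
        // "x" was delivered by three peers, "a".."d" by one each.
        List<String> merged = List.of("a", "x", "b", "x", "c", "x", "d");
        System.out.println(applyCredit(merged));
        // x: first at pos 1, two duplicates -> 1 - 2*2 = -3, so x wins:
        // -> [x, a, b, c, d]
    }
}
```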
I did some rudimentary verification testing of the 1st basic part (position ranking). The result was not THAT great (the best result was not first... but within the first 20), but also no worse than current 1st pages (tested without any fine tuning or post-ranking). That makes sense, because today the quickest available Solr & RWI results are shown in turns, which is basically already a position ranking over the first 2 available peer results.
P.S. To not break everything with an unknown (not completely tested) outcome, we could/should apply the changes to a clone or branch and have them tested, adjusted and fine-tuned by everyone interested.
Any better idea... let me know...