Ranking Transparency Roadmap

Post by reger » Sat Oct 08, 2016 1:41 am

Regarding the recent pull requests dealing with showing ranking details:
Show ranking in HTML UI https://github.com/yacy/yacy_search_server/pull/78
Ranking Transparency https://github.com/yacy/yacy_search_server/pull/77

Here is a point of view on how this could become a useful feature, with the aim
  • to show/inform why a result is (the most) important
  • to make optimization easier by understanding the effect
  • optionally, to learn from feedback how to rank better / to get learning material on ranking fits and misfits

(a roadmap proposal)
  • show a ranking info symbol (or number, % etc.) with a link to a Ranking-Info-Servlet in each result
  • the ranking info servlet displays UNDERSTANDABLE info about the ranking (may be graphical, grouped or clustered by factors, weights of influence, etc.)
  • the servlet possibly offers a user feedback option (like the existing +/- button) for some kind of ranking optimization/learning

If the above gets endorsed, the next vital and big question is:
are there already realistic ideas for the 2 fuzzy points ("display UNDERSTANDABLE info" and "kind of optimization/learning")?

(If yes, let's hear them; if no... let it all die in silence ;) )
Last edited by reger on Sat Oct 08, 2016 4:38 pm, edited 1 time in total.
reger
 
Posts: 46
Joined: Wed Jan 02, 2013 9:23 am

Re: Ranking Transparency Roadmap

Post by luc » Sat Oct 08, 2016 7:40 am

Hi reger, I think it's a good idea that you created this roadmap.

I like the proposals, and to my mind they would fit really well with the YaCy philosophy ("Web Search by the People, for the People").

However, for now I also have no clear idea or inspiring examples of what the understandable info should look like...
About the feedback option, maybe the + button could be used to register "votes" in a Solr field that could then be used like any other field in ranking (note that I do not know this function very well, and I am not sure whether such a field already exists...).

Another idea I was wondering about is whether bookmarked pages could also be used in the ranking process (I am also not very sure how it precisely works currently...).

By the way, it would be great if some people with no developer skills would also share their point of view.
luc
 
Posts: 283
Joined: Wed Aug 26, 2015 1:04 am

Re: Ranking Transparency Roadmap

Post by biolizard89 » Fri Oct 14, 2016 6:32 am

Hi reger and luc,

UX is not really my specialty, but my guess is that to really define "understandable" we would have to have some idea of what a user's goal is.

Some examples:

1. "I think YaCy is ranking a website too high or low by accident, I want to figure out how this can be fixed by changing the ranking rules."
2. "I think YaCy is ranking a website too high because that website is using abusive SEO / spam techniques, I want to figure out what the website is doing so we can make YaCy penalize such sites."
3. "I have a new idea for a YaCy ranking method and I want to figure out whether it would be beneficial, and which pages would be most affected."
4. "I have a website and I want to figure out how to change the site to make it rank more highly in YaCy using the default YaCy settings."

There may be some information that is beneficial for some of these use cases, but is superfluous for others of these use cases. It might be useful to consider these use cases independently for the purpose of figuring out what information should be highlighted and how it should be visualized. I think a good first step is to simply make the raw data available, since this allows people to experiment with layers on top of it, but I fully agree that making raw data available is not really sufficient by itself for most real-world use cases. (Although for my particular use cases, it's sufficient given that I'm willing to code some Python scripts to do my additional analysis.)

In terms of optimization/learning, a common technique in machine learning is backpropagation. Basically, this uses the partial derivative of an output variable with respect to some input variable, to determine how to change the input variable in order to optimize the output variable. I'm playing around with this technique in the context of YaCy ranking, but I don't have any results to share yet. The important takeaway here is that because backpropagation needs partial derivatives, it needs to know what calculations were used to get the final ranking score.
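
To illustrate why the calculation behind the score has to be known, here is a minimal Python sketch. It assumes, purely for illustration, that the ranking score is a weighted sum of feature values; this is not YaCy's actual ranking formula, and the names `score`, `score_gradient`, `gradient_step`, and `learning_rate` are hypothetical.

```python
# Hypothetical model: score = sum(weight_i * feature_i).
# This is an assumption for illustration, not YaCy's real ranking computation.

def score(weights, features):
    """Ranking score as a weighted sum of feature values."""
    return sum(w * x for w, x in zip(weights, features))

def score_gradient(weights, features):
    """Partial derivatives d(score)/d(weight_i); for a weighted sum this is feature_i."""
    return list(features)

def gradient_step(weights, features, target, learning_rate=0.1):
    """One gradient-descent step on the squared error (score - target)**2."""
    error = score(weights, features) - target
    grad = score_gradient(weights, features)
    # Chain rule: d(error**2)/d(weight_i) = 2 * error * d(score)/d(weight_i)
    return [w - learning_rate * 2 * error * g for w, g in zip(weights, grad)]

weights = [0.5, 0.5]
features = [1.0, 2.0]
new_weights = gradient_step(weights, features, target=2.0)
print(score(weights, features), score(new_weights, features))
```

Without knowing that the score is a weighted sum, `score_gradient` could not be written, which is the reason the ranking calculation itself needs to be exposed.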

One potentially useful way to get data for deciding how to optimize ranking would be to use clickthrough data. There's not much of a privacy implication to collecting clickthrough data as long as it's not shared with peers, but my guess is that multiple nodes' clickthrough data would need to be combined in some way to get sufficiently noise-free data. There are some ways that this could be done; I'm investigating using a social graph method where users' own nodes do optimization using their own clickthrough data but share a weighted sum of their own optimizations and their friends' optimizations. This is reasonably private (users effectively act as a blind for the users in their social graph), and reasonably Sybil-resistant (social graphs have been reasonably well-studied for Sybil-resistance, including in the context of Freenet, for which the stakes are a lot higher). I don't have any practical results to share on that yet.
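
A rough Python sketch of the "weighted sum" idea follows. The blending rule (own weights mixed with the plain average of friends' weights) and the `own_share` parameter are assumptions made for illustration; nothing here has been tested against real YaCy data.

```python
def blend_weights(own, friends, own_share=0.5):
    """Blend a node's own ranking weights with the average of its friends' weights.

    `own` is a list of floats, `friends` a list of such lists.
    `own_share` (hypothetical parameter) is the fraction contributed by the node itself.
    """
    if not friends:
        return list(own)
    n = len(friends)
    friend_avg = [sum(f[i] for f in friends) / n for i in range(len(own))]
    return [own_share * o + (1 - own_share) * a for o, a in zip(own, friend_avg)]

# What this node would publish to its own friends: a mix, not its raw local data.
shared = blend_weights([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]])
print(shared)
```

Because only the blended values are shared, a peer receiving them cannot tell which part came from the node itself and which part came from its friends.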

Cheers!
-Jeremy
biolizard89
 
Posts: 61
Joined: Thu Jan 03, 2013 12:42 am

Re: Ranking Transparency Roadmap

Post by reger » Fri Oct 14, 2016 10:55 pm

biolizard89 wrote: One potentially useful way to get data for deciding how to optimize ranking would be to use clickthrough data.


One quick line about the history of the above suggestion: a "clickservlet" was previously proposed, but the idea was finally discarded,
see the comment at https://github.com/yacy/yacy_search_server/commit/61ae9d2d1187459ceb695ebc465cd7bd12905f9d
So I would not look into this option (again).
reger
 
Posts: 46
Joined: Wed Jan 02, 2013 9:23 am

Re: Ranking Transparency Roadmap

Post by biolizard89 » Fri Oct 14, 2016 11:29 pm

reger wrote:
biolizard89 wrote: One potentially useful way to get data for deciding how to optimize ranking would be to use clickthrough data.


One quick line about the history of the above suggestion: a "clickservlet" was previously proposed, but the idea was finally discarded,
see the comment at https://github.com/yacy/yacy_search_server/commit/61ae9d2d1187459ceb695ebc465cd7bd12905f9d
So I would not look into this option (again).


Is there more background on the discussion about the clickservlet? I'm curious what its intended use case was and why it was removed. The commit you linked only shows it as disabled by default.
biolizard89
 
Posts: 61
Joined: Thu Jan 03, 2013 12:42 am

Re: Ranking Transparency Roadmap

Post by reger » Sat Oct 15, 2016 12:06 am

biolizard89 wrote: I'm curious what its intended use case was and why it was removed.


The primary intended use was to make sure a search result that the user found worthwhile to look at (click on) is used to improve the local index.
fyi: intro of the servlet https://github.com/yacy/yacy_search_server/commit/d44d8996d03ecec0e3c78fb54ab39ae22caef7c1
past TODO-List Actions e.g. (0- = not implemented yet)
- crawl/recrawl the url
- crawl all links on page (with depth) / site
0- increase/create rating
0- add to a collection
0- connect query and url
0- learn and classify content - promote rating
0- add to click statistic url/cnt (maybe to use for boost)

P.S. a veto by Orbiter is then good enough for a delete.
reger
 
Posts: 46
Joined: Wed Jan 02, 2013 9:23 am

Re: Ranking Transparency Roadmap

Post by biolizard89 » Tue Oct 18, 2016 3:35 am

reger wrote: The primary intended use was to make sure a search result that the user found worthwhile to look at (click on) is used to improve the local index.
fyi: intro of the servlet https://github.com/yacy/yacy_search_server/commit/d44d8996d03ecec0e3c78fb54ab39ae22caef7c1
past TODO-List Actions e.g. (0- = not implemented yet)
- crawl/recrawl the url
- crawl all links on page (with depth) / site
0- increase/create rating
0- add to a collection
0- connect query and url
0- learn and classify content - promote rating
0- add to click statistic url/cnt (maybe to use for boost)

P.S. a veto by Orbiter is then good enough for a delete.


Hi reger,

Okay, so it looks to me like there are 2 different questions here, which I think are orthogonal.

1. How do we want to collect information that can be used to improve YaCy's results?
2. What do we want to do with that information once it's collected?

The various possible answers to (1) would seem to include the following:

* Include UI elements next to results (e.g. upvote/downvote buttons).
* A browser add-on that adds UI elements (e.g. upvote/downvote) while actually visiting the page.
* Allow users to opt into receiving notifications in the YaCy UI (with configurable frequency, e.g. once per week), of the form "Do you have a few seconds to help YaCy improve? If so, choose which of these 2 results for the search 'foo' is more relevant to you."
* Clickthrough data (not really any need for this to be part of YaCy if you guys don't wish it to be; it could easily be done as a browser add-on, e.g. a Greasemonkey script, for users who wish to opt in).

The possible answers to (2) would include the things you listed; the ones that occurred to me include:

* Recrawl URL or site (possibly with depth).
* Use as input to machine learning to improve ranking rules.

Both of these questions imply a 3rd question: what should the structure of the collected data be?

I would suggest that the collected data be a set of pairs (q, d), where q is a query, and d is a DAG (directed acyclic graph) of URL's. A DAG seems like a good fit, because it can be traversed to find pairs of URL's, where the first URL is more relevant than the second URL.
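
As a minimal Python sketch of that structure: the feedback store below maps each query to a DAG, and the DAG maps a URL to the set of URLs it is directly more relevant than. The dict-of-sets layout, the function name `preference_pairs`, and the choice to treat indirect (transitive) links as pairs too are all assumptions for illustration.

```python
def preference_pairs(dag):
    """Return every (better, worse) URL pair implied by the DAG.

    `dag` maps a URL to the set of URLs it is directly more relevant than.
    Indirect pairs are found by following links (reachability).
    """
    pairs = set()
    for start in dag:
        stack = list(dag[start])
        seen = set()
        while stack:
            url = stack.pop()
            if url in seen:
                continue
            seen.add(url)
            pairs.add((start, url))
            stack.extend(dag.get(url, ()))
    return pairs

# One DAG per query; the URLs are placeholders.
feedback = {
    "foo": {
        "a.example": {"b.example"},
        "b.example": {"c.example"},
    }
}
pairs = preference_pairs(feedback["foo"])
print(sorted(pairs))
```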

There are a number of UI's that could feed a DAG. For example, a simple UI could assign a URL to 3 categories: "exactly what I wanted", "relevant", "irrelevant". (Last I checked, this is what Google does to train their ranking.) Since these 3 categories are inherently ordered, the DAG would in effect have 3 layers, with each URL in the "exactly what I wanted" layer being more relevant than each URL in the "relevant" layer, etc.
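
A small Python sketch of how such category labels could be turned into ordered pairs: every URL in a higher layer is treated as more relevant than every URL in a lower layer, and URLs in the same layer are left unordered. The names `CATEGORIES` and `layered_pairs` are hypothetical.

```python
# Categories ordered from most to least relevant.
CATEGORIES = ["exactly what I wanted", "relevant", "irrelevant"]

def layered_pairs(labels):
    """Turn {url: category} labels into (better, worse) pairs across layers."""
    rank = {name: i for i, name in enumerate(CATEGORIES)}
    return {
        (better, worse)
        for better, better_cat in labels.items()
        for worse, worse_cat in labels.items()
        if rank[better_cat] < rank[worse_cat]
    }

labels = {
    "a.example": "exactly what I wanted",
    "b.example": "relevant",
    "c.example": "irrelevant",
    "d.example": "relevant",
}
pairs = layered_pairs(labels)
print(len(pairs))
```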

Alternatively, a UI could offer a "rank this URL more highly" button next to a search result; this would create a link in the DAG that makes the result more relevant than the result that appeared directly above it. If that button is clicked, the UI could immediately swap those 2 results in the results page, and if the user clicks the button again, the action would be repeated with whatever result is above it this time.
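
The button behaviour could look roughly like the following Python sketch. The function name `promote` and the use of a plain list for the results page and a set of `(better, worse)` tuples for the DAG links are assumptions for illustration, not existing YaCy code.

```python
def promote(results, index, edges):
    """Handle a click on a hypothetical "rank this URL more highly" button.

    Records that results[index] is more relevant than the result directly
    above it, swaps the two in place, and returns the clicked URL's new index.
    """
    if index <= 0:
        return index  # already at the top, nothing to record
    clicked, above = results[index], results[index - 1]
    edges.add((clicked, above))
    results[index - 1], results[index] = clicked, above
    return index - 1

results = ["a.example", "b.example", "c.example"]
edges = set()
pos = promote(results, 2, edges)    # first click on c.example
pos = promote(results, pos, edges)  # second click on the same result
print(results, sorted(edges))
```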

In terms of how algorithms would use the DAG, a recrawl could be initiated whenever a URL is assigned to the "more relevant" side of a link between 2 URL's. For machine learning algorithms, backpropagation could be used to try to decrease the distance between any URL's whose ranking is in the inverse order that the DAG has. If a genetic algorithm is used, then the fitness score could be the fraction of URL pairs from the DAG which are in the correct order using the given ranking.
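
The fitness score for the genetic-algorithm case can be sketched in a few lines of Python. The function name `pair_fitness`, the `scores` dict, and the decision to return 1.0 when there are no pairs are assumptions; the scores are made-up numbers.

```python
def pair_fitness(pairs, scores):
    """Fraction of (better, worse) pairs that the given ranking scores order correctly."""
    if not pairs:
        return 1.0  # assumption: no feedback means nothing is violated
    correct = sum(1 for better, worse in pairs if scores[better] > scores[worse])
    return correct / len(pairs)

# Pairs taken from a DAG for one query; scores come from a candidate ranking.
pairs = {
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("b.example", "c.example"),
}
scores = {"a.example": 0.9, "b.example": 0.2, "c.example": 0.5}
fitness = pair_fitness(pairs, scores)
print(fitness)
```

Here two of the three pairs are ordered correctly; the pair (b.example, c.example) is inverted, so a better candidate ranking would score higher.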

So yeah, lots of possibilities here, but basically all of the use cases I can think of can be met by a DAG-per-query structure, and the methods of feeding data into the DAG are orthogonal to the methods of using the DAG to improve results.

Cheers!
biolizard89
 
Posts: 61
Joined: Thu Jan 03, 2013 12:42 am

Re: Ranking Transparency Roadmap

Post by reger » Sat Oct 22, 2016 12:39 am

biolizard89 wrote: 1. How do we want to collect information that can be used to improve YaCy's results?
2. What do we want to do with that information once it's collected?

Right, we could divide the topic into these 2 sections/questions.

biolizard89 wrote: Alternatively, a UI could offer a "rank this URL more highly" button next to a search result;

fyi: for (1)
That is what came to my mind too, and I'm experimenting with it, with a focus on the button and its effect. I am far from happy with what I've tested so far (2 up/down buttons, a pie chart and 3 numbers), and I stumbled over other things to look at in the rwi ranking area.
As for how to represent this (the internal structure), I started with the rwi (reverse word index) and deal here just with result pairs for ranking parameters (as that is what machine learning could optimize).
I will likely have to read your nice reply a couple more times, and probably have to get closer to the details of the representation, to fully understand your query/URL/DAG idea. But I think I will get it; as with my sentence above, I'm working in the context of a search, which includes query & URL.
But spitting out rows of numbers etc. without an answer to your question (2), one that includes "is handled within YaCy by... or with tool xyz", is of no benefit to me.
reger
 
Posts: 46
Joined: Wed Jan 02, 2013 9:23 am

