reger wrote:
The primary intended use was to make sure a search result that the user found worthwhile to look at (click on) is used to improve the local index.
fyi: intro of the servlet https://github.com/yacy/yacy_search_server/commit/d44d8996d03ecec0e3c78fb54ab39ae22caef7c1
Past TODO-list actions, e.g. (0- = not implemented yet):
- crawl/recrawl the url
- crawl all links on page (with depth) / site
0- increase/create rating
0- add to a collection
0- connect query and url
0- learn and classify content - promote rating
0- add to click statistic url/cnt (maybe to use for boost)
P.S. a veto by Orbiter is then good enough for a delete.
Okay, so it looks to me like there are 2 different questions here, which I think are orthogonal.
1. How do we want to collect information that can be used to improve YaCy's results?
2. What do we want to do with that information once it's collected?
The various possible answers to (1) would seem to include the following:
* Include UI elements next to results (e.g. upvote/downvote buttons).
* A browser add-on that adds UI elements (e.g. upvote/downvote) while actually visiting the page.
* Allow users to opt into receiving notifications in the YaCy UI (with configurable frequency, e.g. once per week) of the form "Do you have a few seconds to help YaCy improve? If so, choose which of these 2 results for the search 'foo' is more relevant to you."
* Clickthrough data (not really any need for this to be part of YaCy if you guys don't wish it to be; it could easily be done as a browser add-on, e.g. a Greasemonkey script, for users who wish to opt in).
The possible answers to (2) would include the things you listed; the ones that occurred to me include:
* Recrawl URL or site (possibly with depth).
* Use as input to machine learning to improve ranking rules.
Both of these questions imply a 3rd question: what should the structure of the collected data be?
I would suggest that the collected data be a set of pairs (q, d), where q is a query and d is a DAG (directed acyclic graph) of URLs. A DAG seems like a good fit, because it can be traversed to find pairs of URLs in which the first URL is more relevant than the second.
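As a rough sketch of what "traversing the DAG to find pairs" could look like (the edge representation and function name here are my own illustration, not anything in YaCy):

```python
from collections import defaultdict

def relevance_pairs(edges):
    """Given DAG edges (a, b) meaning 'URL a is more relevant than URL b',
    return every implied (more_relevant, less_relevant) pair by following
    edges transitively."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)

    def reach(node, seen):
        # collect all URLs reachable (i.e. less relevant) from `node`
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                reach(nxt, seen)
        return seen

    pairs = set()
    for node in list(graph):
        for below in reach(node, set()):
            pairs.add((node, below))
    return pairs
```

So two explicit judgments a>b and b>c would also yield the implied pair a>c, which is exactly what a pairwise ranking algorithm wants as input.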
There are a number of UIs that could feed a DAG. For example, a simple UI could assign each URL to one of 3 categories: "exactly what I wanted", "relevant", "irrelevant". (Last I checked, this is what Google does to train their ranking.) Since these 3 categories are inherently ordered, the DAG would in effect have 3 layers, with each URL in the "exactly what I wanted" layer being more relevant than each URL in the "relevant" layer, etc.
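A minimal sketch of turning those 3-category judgments into DAG edges (category names and the dict shape are assumptions for illustration):

```python
def layered_dag(judgments):
    """judgments: dict mapping url -> 'exact' | 'relevant' | 'irrelevant'.
    Returns DAG edges (a, b) meaning 'a is more relevant than b', with
    edges only between adjacent layers; cross-layer pairs like
    exact > irrelevant are implied transitively."""
    order = ["exact", "relevant", "irrelevant"]
    layers = [[u for u, c in judgments.items() if c == cat] for cat in order]
    edges = []
    for upper, lower in zip(layers, layers[1:]):
        for a in upper:
            for b in lower:
                edges.append((a, b))
    return edges
```

Storing only adjacent-layer edges keeps the DAG small; the transitive pairs can be recovered by traversal when an algorithm needs them.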
Alternatively, a UI could offer a "rank this URL more highly" button next to each search result; clicking it would create an edge in the DAG marking that result as more relevant than the result that appeared directly above it. The UI could then immediately swap those 2 results in the results page, and if the user clicks the button again, the action would be repeated with whatever result is above it this time.
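That button's behavior could be as simple as this (a hypothetical sketch; the function and data shapes are mine, not YaCy's):

```python
def rank_higher(results, edges, url):
    """Handle a click on 'rank this URL more highly' for `url`:
    record a DAG edge saying it is more relevant than the result
    directly above it, then swap the two in the displayed list."""
    i = results.index(url)
    if i == 0:
        return  # already at the top; nothing above it to compare against
    above = results[i - 1]
    edges.append((url, above))          # url is more relevant than `above`
    results[i - 1], results[i] = url, above
```

Repeated clicks bubble the result upward, producing one DAG edge per comparison the user actually made.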
In terms of how algorithms would use the DAG, a recrawl could be initiated whenever a URL is assigned to the "more relevant" side of an edge between 2 URLs. For machine learning algorithms, backpropagation could be used to try to decrease the ranking distance between any URLs whose order is the inverse of what the DAG specifies. If a genetic algorithm is used, then the fitness score could be the fraction of URL pairs from the DAG that the given ranking places in the correct order.
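That fitness score is straightforward to express; here is an assumed-shape sketch where the ranking is a best-first list and the pairs come from the DAG:

```python
def fitness(ranking, pairs):
    """Fraction of (more_relevant, less_relevant) pairs from the DAG
    that the given ranking (a list of URLs, best first) gets right.
    Pairs mentioning URLs absent from the ranking are ignored."""
    pos = {url: i for i, url in enumerate(ranking)}
    scored = [(a, b) for a, b in pairs if a in pos and b in pos]
    if not scored:
        return 0.0
    correct = sum(1 for a, b in scored if pos[a] < pos[b])
    return correct / len(scored)
```

A genetic algorithm would then evolve ranking-rule parameters toward candidates whose produced orderings maximize this fraction.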
So yeah, lots of possibilities here, but basically all of the use cases I can think of can be met by a DAG-per-query structure, and the methods of feeding data into the DAG are orthogonal to the methods of using the DAG to improve results.