Improving ranking using neural networks and genetic algos

Discussion in English language.
Forumsregeln
You can start and continue with posts in english language in all other forums as well, but if you are looking for a forum to start a discussion in english, this is the right choice.

Improving ranking using neural networks and genetic algos

Beitragvon biolizard89 » Sa Feb 22, 2014 10:57 am

Hi, I have a proposal for improving YaCy search ranking. I'm taking a course on neural networks and genetic algorithms, and I'm planning to use YaCy ranking as my project for the course. Below is a slightly modified version of my proposal from the course; feedback would be greatly appreciated.

---

CS 5970 Artificial Neural Networks and Evolution – Instructor: Dr. Dean Hougen
Project Proposal by Jeremy Rand

I am interested in improving relevance of peer-to-peer web search engines using computational intelligence. Centralized search engines such as Google, Bing, and DuckDuckGo require the user to trust the search engine not to maliciously modify search results or log queries. As has been shown by enormous documentation, Google and Bing are definitely tampering with search results and logging queries, and DuckDuckGo has no way to prove that it is not doing so. An alternative search engine methodology has been proposed by the YaCy project, which uses a peer-to-peer distributed hash table to store a search index, with results being determined collectively by the network. YaCy has major civil liberties advantages, in that it is not possible to censor results, and logging queries is difficult. Unfortunately, YaCy's search ranking performs poorly compared to Google, Bing, and DuckDuckGo. I am interested in using computational intelligence to improve YaCy's ranking algorithms, hopefully making YaCy more competitive with Google, Bing, and DuckDuckGo. I am undecided whether to use neural networks, evolution, or both – this will be decided after additional material has been covered in class.

A decentralized collaborative search engine ranking system has a few potentially conflicting requirements:
1. Users should benefit from other users' experience.
2. Information about a user's search history should not leak to any other users.
3. Users should not be able to unethically influence the ranking to induce spam or censorship.

Requirements 1 and 2 can be fulfilled by simply having all users submit and retrieve data via Tor. However, this method is highly vulnerable to Sybil attacks. Introducing cryptographic proof of work would partially counter Sybil attacks while preserving anonymity, but would also significantly raise the cost of legitimate usage, and presumably a spammer has more resources than a legitimate user (spammers have botnet time at their disposal), so proof of work would probably still have a significant spam problem. Another rate-limiting method is based on IP addresses, but this eliminates anonymity, and botnets still have vastly more IP addresses available than legitimate users.

I think I have a reasonable compromise. Users are connected via a friend-to-friend network such as RetroShare. Users will be given the ability to upvote or downvote results for searches; this data will be saved locally but not shared (to protect privacy). Users also begin with a set of randomly generated search algorithms (in the form of neural network weights or evolutionary genotypes). The search algorithms will take as input some information about the search (e.g. the search tokens, although the exact set of input information is yet to be determined), and output a set of SOLR ranking parameters which can be fed to YaCy via its API. Periodically, each user sends some of its algorithms to all of its friends for evaluation. Those friends send the same algorithm unmodified to their friends for evaluation. This recursively continues until all connected users have received the algorithm; a user who receives the same algorithm a second time will drop the connection to avoid a loop. Each user who has received the algorithm then computes an evaluation based on its local search ranking upvote/downvote history, and returns a linear combination of its own evaluation and all of its friends' evaluations. The evaluation would consist of weight adjustments for a neural network, or a fitness value for an evolutionary genotype. Each user who received a copy of the search algorithm will save a copy for future reference, and the user who originated the algorithm will be able to apply the evaluation data to improve the algorithm. Over time, the algorithms will improve based on evaluation data.

The advantage of the linear combination system is that no user can reliably determine whether the evaluation received from a friend is primarily influenced by that friend, by that friend's friends, or by the friends of that friends' friends, etc. A fitness value and a weight adjustment, particularly in aggregate form, provide plenty of information to refine the search algorithms, but are unlikely to reveal any useful information about individual searches. The linear combination algorithm causes first-degree friends to have more influence on each other's ranking than second-degree friends, who have more influence than third-degree friends, etc. This makes spam and censorship difficult, because attackers will only have a large effect on their friends (and can be unfriended at any time). Sybil attacks are unfeasible on friend-to-friend networks, since all users know the identity of their friends.

Project Scope: This is a large and complex problem, and it is unlikely that a full treatment will be possible given the confines of this class. As necessary, components of the project may be cut or deferred so that something presentable is likely to exist at the end of the semester. ANNE-related coding and experimentation will have priority over the friend-to-friend infrastructure. The possibility exists of continuing the project after the semester is over, potentially for independent study credit (I would very much like to do so, assuming that the project is making progress).

References:
YaCy: http://yacy.net/en/index.html
RetroShare: http://retroshare.sourceforge.net/
biolizard89
 
Beiträge: 58
Registriert: Do Jan 03, 2013 12:42 am

Re: Improving ranking using neural networks and genetic algo

Beitragvon Orbiter » Sa Feb 22, 2014 11:27 am

Hi, thats a very interesting proposal!

Thats a very useful use of YaCy as a 'laboratory' for 'social search' & ranking research. Since that approach will come up with a set of ranking rules for YaCy, it will be very useful for us.

This reminds me that the documentation for the YaCy ranking mechanism is (still) very incomplete, so please use the opportunity to ask questions about it, I will do my best to write a documentation in the wiki to answer your questions.

I would like to add another reference for this idea: the french seeks project, now moved to https://github.com/beniz/seeks
Short about seeks: its a framework above of other search engines which applies 'social ranking rules' using the clicks on search results. They claim that they solved the problem to anonymously distribute the clicks on the results to main privacy of the users. I got this explained by the project maintainer, Emmanuel Benazera, in personal discussion, and it made pretty much sense. I hope there is some documentation visible about his approach, if you cannot find that then just ask him...

I would like to mention that we actually have an upvote/downvote mechanism at the search results which is not very much used; just a hint that there is already some framework that you could use. Please consider that you may get your hands on the code for that ;)

However, if you need something, just post here.
Orbiter
 
Beiträge: 5786
Registriert: Di Jun 26, 2007 10:58 pm
Wohnort: Frankfurt am Main

Re: Improving ranking using neural networks and genetic algo

Beitragvon biolizard89 » Sa Mär 01, 2014 9:07 am

Orbiter hat geschrieben:Hi, thats a very interesting proposal!

Thats a very useful use of YaCy as a 'laboratory' for 'social search' & ranking research. Since that approach will come up with a set of ranking rules for YaCy, it will be very useful for us.

This reminds me that the documentation for the YaCy ranking mechanism is (still) very incomplete, so please use the opportunity to ask questions about it, I will do my best to write a documentation in the wiki to answer your questions.

I would like to add another reference for this idea: the french seeks project, now moved to https://github.com/beniz/seeks
Short about seeks: its a framework above of other search engines which applies 'social ranking rules' using the clicks on search results. They claim that they solved the problem to anonymously distribute the clicks on the results to main privacy of the users. I got this explained by the project maintainer, Emmanuel Benazera, in personal discussion, and it made pretty much sense. I hope there is some documentation visible about his approach, if you cannot find that then just ask him...

I would like to mention that we actually have an upvote/downvote mechanism at the search results which is not very much used; just a hint that there is already some framework that you could use. Please consider that you may get your hands on the code for that ;)

However, if you need something, just post here.


Hi Orbiter,

Thanks for the reply.

I have one question about the ranking. I see there are two ranking mechanisms, SOLR and RWI. The documentation states that some items in the index only have RWI data available, but it's not clear to me what circumstances that occurs under. Is that just because some network nodes are on old YaCy versions that don't use SOLR internally? Or is there some other circumstance under which a search result will only have RWI data?

Regarding Seeks, there are a few reasons why I chose not to include it in my proposal (I'm familiar with it). It doesn't seem to be actively maintained and installation on current OS versions is extremely difficult (or at least completely undocumented). To my knowledge Seeks doesn't run on Windows, which limits its audience considerably. My best understanding of Seeks's privacy features is that it only shares your data with people who make similar searches based on a locality-sensitive hash function; this makes it hard to evaluate a search algorithm on a wide variety of searches. (I could be wrong here, as the documentation is extremely weak on this.) And I'm unaware of any anti-Sybil algorithms used by Seeks, which would make it somewhat vulnerable to spamming. (Again, maybe I'm wrong, as the documentation is very weak.)

I noticed that there are "bookmark"/"recommend"/"delete" icons next to each search result; is this the upvote/downvote mechanism you're talking about or is there something else I've missed? I haven't found any documentation on what those three icons currently do, is there documentation available for them that I've missed?

Thanks!
biolizard89
 
Beiträge: 58
Registriert: Do Jan 03, 2013 12:42 am

Re: Improving ranking using neural networks and genetic algo

Beitragvon Orbiter » Sa Mär 01, 2014 1:09 pm

biolizard89 hat geschrieben:I have one question about the ranking. I see there are two ranking mechanisms, SOLR and RWI. The documentation states that some items in the index only have RWI data available, but it's not clear to me what circumstances that occurs under. Is that just because some network nodes are on old YaCy versions that don't use SOLR internally? Or is there some other circumstance under which a search result will only have RWI data?

The RWIs are the data structure which is used to distribute the index. Its not 'old' becuase it was there first before Solr came into the architecture, it's still 'the' solution to the problem of distributed search (using the 'partition by word' approach). RWIs work only together with a metadata storage and thats what was replaced by Solr; Solr acts now in two roles: the metadata store for the RWIs and the search index for the 'Appliance Mode'. If YaCy runs without P2P, then only Solr is filled and Solr is the only place where things are searched. Furthermore, Solr is also used as Index when doing P2P search; its results are mixed to the distributed search results.

I made a picture of the Architecture:
Ranking.png
Ranking.png (146.07 KiB) 6953-mal betrachtet

Maybe I should do an explanation video for that. Because it's even a bit more complex, the index verification process (realtime loading of remote search result documents) is not shown in there.

biolizard89 hat geschrieben:Regarding Seeks, there are a few reasons why I chose not to include it in my proposal (I'm familiar with it). It doesn't seem to be actively maintained and installation on current OS versions is extremely difficult (or at least completely undocumented).

I'm afraid that this is true. But nevertheless Seeks should be mentioned since they made an interesting approach to the problem of distributed voting of search results.

biolizard89 hat geschrieben:I noticed that there are "bookmark"/"recommend"/"delete" icons next to each search result; is this the upvote/downvote mechanism you're talking about or is there something else I've missed? I haven't found any documentation on what those three icons currently do, is there documentation available for them that I've missed?

Yes. A click on the bookmark icon makes (obviously) a bookmark. A click on 'recommend' creates an upvote message, a click on delete deletes the link and creates a downvote message. These messages are part of peer-ping payloads; it will be distributed for some time in your peer seed. You can see these messages in /News.html?page=1 and the evaluation of up/downvotes influence the content of the page /Surftips.html where not only votes but also other urls are shown (i.e. public crawl starts and home pages of the online peers as entered in /ConfigProfile_p.html

If you want to do some experiements then these functions may be a good playground and open to changes if you like.
Orbiter
 
Beiträge: 5786
Registriert: Di Jun 26, 2007 10:58 pm
Wohnort: Frankfurt am Main

Re: Improving ranking using neural networks and genetic algo

Beitragvon biolizard89 » Di Sep 02, 2014 7:15 am

Hi,

Just wanted to mention that this is still being worked on. I've also recruited 2 other students to help me this semester. I'll attempt to keep everyone informed in this thread as significant progress happens.

Cheers.
biolizard89
 
Beiträge: 58
Registriert: Do Jan 03, 2013 12:42 am

Re: Improving ranking using neural networks and genetic algo

Beitragvon Low012 » Di Sep 02, 2014 8:49 am

Very cool! Thumbs up! 8-)
Low012
 
Beiträge: 2214
Registriert: Mi Jun 27, 2007 12:11 pm

Re: Improving ranking using neural networks and genetic algo

Beitragvon biolizard89 » Mo Apr 13, 2015 9:53 pm

Hello everyone,

I'm working on posting a writeup of last semester's effort; hopefully that will be up soon. Progress is continuing this semester.

Until then, a bit of interesting numbers.

As of the end of last semester, we were able to produce Solr ranking parameters which had a fitness score (measuring ability to learn ranking similarity to a given dataset) of 0.12 with a data sample of 300 search queries. We used Startpage results to train it (decentralized data gathering is still being worked on). For reference, Startpage would be 1.0 (since it matches itself perfectly), 0.0 would be random guessing, DuckDuckGo had a fitness of 0.47, and YaCy's default Solr ranking (without RWI) had a fitness of -0.006. So, basically, we've closed about a quarter of the gap between YaCy's default Solr settings and DuckDuckGo in terms of ability to mimic Startpage (this will probably extrapolate to non-Startpage training data, although we don't have evidence of this yet).

We also demonstrated that a decentralized fitness calculation of YaCy Solr parameters does converge to a similar fitness (0.12) as a centralized calculation (0.13), but takes approximately 4 times as many generations (in the case of our very simplified test social graph) to achieve that fitness. We don't think this increased generation count is likely to be a problem in practical situations, but more research is needed before we can be confident of this.

We are very curious how the results are affected by doing RWI as well as Solr. Unfortunately, since RWI in YaCy doesn't yet support debugQuery output, we don't have the ability to get those results.

Hope this quenches your thirst for information on our efforts here, at least for now. I will try to post a more thorough writeup when I have a chance.

Cheers!
biolizard89
 
Beiträge: 58
Registriert: Do Jan 03, 2013 12:42 am

Re: Improving ranking using neural networks and genetic algo

Beitragvon Cajun » Fr Jul 31, 2015 4:54 pm

Hi Biolizard89 and Orbiter,


Very interesting, indeed. The crucial point in this research, from my point of view is, to get the link between queries, and their exitpoint(s), that have been choosen from within the SERP. @biolizard89, I really hope, you've got the opportunities to go on, provide the yacy community with some more results, and accomplish your work!

@Orbiter - isn't there an axiomatic way to get fully anonymized listings of queries accomplished in the yacy-network, keeping the privacy of use, users and providers? And even more, to get them linked with the results choosen finally? What, if they would not leave the users terminal? What, if the users may deliberately decide to send such key-value pairs, anonymously and anonymized, to a central and open database server - open for researchers around the world?

I think, the idea behind is not only important for daily use; it touches some relevant 'philosophical' and innovative questions, too. I also assume, it migt be too challenging and a little bit too demanding, not to try to get it accomplished by some 'crowd-intelligence' approach ... what is the yacy-community thinking about the relevance and possible solutions for these issues?


Best

(P.S. in order to improve 'readability', did some small re-edits)
Zuletzt geändert von Cajun am Mo Aug 03, 2015 2:01 pm, insgesamt 4-mal geändert.
Cajun
 
Beiträge: 10
Registriert: Di Nov 19, 2013 9:35 pm

Re: Improving ranking using neural networks and genetic algo

Beitragvon biolizard89 » Fr Jul 31, 2015 11:36 pm

Hi Cajun,

Cajun hat geschrieben:Hi Biolizard89 and Orbiter,


Very, interesting, indeed. The very critical point in this research from my point of view is, to get the link betwen queries and their actual endpoints in the SERP - hope, you've got the opportunities to go on and accomplish your work!


I'm not certain what you mean by "endpoints" and "SERP" -- from context I infer that you're asking about detecting which result a user actually clicked on the results page? If so, I am doing some experiments on finding the best way to determine this. Ideally, it would be nice if YaCy included a callback mechanism for this, so that I could provide a REST interface that YaCy would call and provide me that info. I'm not sure if YaCy is interested in providing such a mechanism. If not, it's probably possible to do this with Javascript via a Greasemonkey script, although I haven't fully looked into this yet. You are correct that this information is very useful.

Cajun hat geschrieben:@Orbiter - is there a way to get fully anonymized listings of queries accomplished in the yacy-network? And, to get them linked them with results choosen finally? As long as they do not leave the users terminal? What, if the users may deliberately decide to send these key-value pairs, anonymously and anonymized, to a central and open database server - open for researchers around the world?


I think the problem with anonymously providing that data to a central server is that users could mass-submit spam. In theory something like Hashcash would be able to make spam less easy, but I don't think it's sufficient, and it also gives an advantage to attackers that have lots of computing resources (e.g. either the NSA or a botnet operator). I am investigating the possibility of using the user's social graph to share this information, which would make spam much more difficult (a spammer would only affect his immediate friends, who would probably unfriend him/her). However, privacy is very tricky in such a system, and I don't have any great solutions at this point. It would be possible to store such data locally, and simply submit ranking algorithms to your friends, and they just return the fitness value, which preserves privacy much more. I think this may be sufficient to get "good enough" data, though it's certainly not as effective as what Google is doing (they have an inherent advantage here, since they don't have privacy constraints).

You might also look at Blippex, a (now defunct) search engine which tried to crowdsource data from its users. They had some interesting ideas with privacy, but their design required a central server (which you didn't have to trust entirely for privacy purposes).

Cajun hat geschrieben:I think, the idea behind is not only important, but also critical, challenging and a little bit to demanding, not to try to get it accomplished by some 'crowd-intelliengce' approach ... don't you think so?


Having users cooperate definitely improves the efficiency of the system. I'd love to hear more ideas on this topic.
biolizard89
 
Beiträge: 58
Registriert: Do Jan 03, 2013 12:42 am

Re: Improving ranking using neural networks and genetic algo

Beitragvon Orbiter » Fr Jul 31, 2015 11:36 pm

there is a list of queries which had been made on the local YaCy in DATA/LOG - fully anonymized. It stores only the date and the request.
There is no such global list!
Orbiter
 
Beiträge: 5786
Registriert: Di Jun 26, 2007 10:58 pm
Wohnort: Frankfurt am Main

Re: Improving ranking using neural networks and genetic algo

Beitragvon Cajun » Mo Aug 03, 2015 3:44 pm

Hi biolizard89


I think, your concept of distributing ranking-algorithms, rather than exchanging data itselves (with all their privacy- and spam- concerns), might open a new door to look for alternative types of solutions aiming for the improvement of ranking-algorithms ... :?


Trying to bring things, thoughts, knowledge and discussion together:

A fully centralized approach is the domain of commercial SE's. It doesn't care much of privacy, but it supports best BIG-DATA anylytics in order to develop 'most-optimal' rankings in terms of customer satisfaction and commercial interests.

A 'cooperative approach' relying on centralized data, resulted in a crucial loss of privacy - or, in an increased risk of getting spammed and tampered with disinforming data, respectively. Obviously, we don't see any great solutions coming for that.

A 'cooperative approach', based on social graphs and using a restricted data exchange, might bring most of needs for privacy and data-secuity into balance. However, it is yet to be operationalized fully.

A pure algorithmic approach - this idea came up for me, when thinking about your concept of exchanging algorhithms - could provide all intelligence locally. It would not be restricted to influence processes at query-time, but it also might influence an (local?) index, by trying to avoid fetching and indexing false-positive hits. Ranking algoritms might be choosen explicitely by the user, or elected (or even trained) by the local search- & find- history. In such a scenario, the development focus shifted towards the competition of exchangeable 'intelligent' algorithm plugins - avoiding to get stuck with ambiguities of clustering, disinformation, and privacy, which can be considered typical side-effects of centralized data-store analytics.

The approaches mentioned last, might make use of YaCy's query-log, and be realized on the browser's side by "Greasemonkey" (or sthg. similar, however, restricting it's use to firefox and chrome by now), and on some logic preserved by the server.


Did you miss some relevant views/issues/points within this try of a review?


Best Regards
Cajun
 
Beiträge: 10
Registriert: Di Nov 19, 2013 9:35 pm

Re: Improving ranking using neural networks and genetic algo

Beitragvon biolizard89 » So Aug 09, 2015 10:36 am

Cajun hat geschrieben:Hi biolizard89


I think, your concept of distributing ranking-algorithms, rather than exchanging data itselves (with all their privacy- and spam- concerns), might open a new door to look for alternative types of solutions aiming for the improvement of ranking-algorithms ... :?


Trying to bring things, thoughts, knowledge and discussion together:

A fully centralized approach is the domain of commercial SE's. It doesn't care much of privacy, but it supports best BIG-DATA anylytics in order to develop 'most-optimal' rankings in terms of customer satisfaction and commercial interests.

A 'cooperative approach' relying on centralized data, resulted in a crucial loss of privacy - or, in an increased risk of getting spammed and tampered with disinforming data, respectively. Obviously, we don't see any great solutions coming for that.

A 'cooperative approach', based on social graphs and using a restricted data exchange, might bring most of needs for privacy and data-secuity into balance. However, it is yet to be operationalized fully.

A pure algorithmic approach - this idea came up for me, when thinking about your concept of exchanging algorhithms - could provide all intelligence locally. It would not be restricted to influence processes at query-time, but it also might influence an (local?) index, by trying to avoid fetching and indexing false-positive hits. Ranking algoritms might be choosen explicitely by the user, or elected (or even trained) by the local search- & find- history. In such a scenario, the development focus shifted towards the competition of exchangeable 'intelligent' algorithm plugins - avoiding to get stuck with ambiguities of clustering, disinformation, and privacy, which can be considered typical side-effects of centralized data-store analytics.

The approaches mentioned last, might make use of YaCy's query-log, and be realized on the browser's side by "Greasemonkey" (or sthg. similar, however, restricting it's use to firefox and chrome by now), and on some logic preserved by the server.


Did you miss some relevant views/issues/points within this try of a review?


Best Regards


This is an accurate summary, I think. Two additional things that I think are noteworthy.

(1) A somewhat large data set is necessary to avoid fitting the ranking algorithm to noise rather than signal (called "overfitting" in machine learning). If you have a small number of searches to train against, or a large number that are all similar in some (potentially non-obvious) way, then the algorithm you end up with is unlikely to perform well against different searches. In my initial experiment, I trained against the top 100 results from 300 Google searches (selected at random from the AOL dataset), using a genetic algorithm. The genetic algorithm took about 6 generations before it started overfitting. Changing various parameters of the genetic algorithm might improve this, but I suspect a larger sample size of training data would also be helpful. (To be fair, before it started overfitting, it had closed about a quarter of the gap between YaCy and DuckDuckGo, so it was still quite good for a first attempt.)

(2) However, the large data set doesn't have to be all on one user's computer. Based on a (relatively contrived) simulation I did where I divided the dataset into 8 parts, put each one on a separate node, and had the nodes exchange a linear combination of fitness values according to a simulated social graph, the genetic algorithm reached nearly the same fitness as the single-user version. It took about 4 times as many generations to do so, although the number of runs was small enough and the simulated social graph contrived enough that it's not at all clear whether a 4-fold increase will be representative of other cases. Even so, 4-fold increase in number of generations isn't a bad tradeoff for the better sample size and better privacy you get by exchanging only fitness values over a social graph.

Another thing that could be done on the single-user side, maybe, is some kind of custom link graph weighting based on the user's browsing habits. It's not particularly hard for a Greasemonkey script to keep track of which web pages a user visits, and that data could be combined with YaCy's link graph to estimate the user's likely interest in results. For example, a search result which has a large number of degree 2 or 3 link graph paths from domains that you visit, may be inferred to be more relevant to you than one that only has a small number of degree 5 or 6 link graph paths from domains that you visit. Graph theory is not really my thing, so I'm not sure how well this would work, but I think it's a logical extension of how Google does some of its ranking. I might play around with it later to see how well it ends up working.
biolizard89
 
Beiträge: 58
Registriert: Do Jan 03, 2013 12:42 am


Zurück zu English

Wer ist online?

Mitglieder in diesem Forum: 0 Mitglieder und 1 Gast