YaCy as ZeroNet search engine

Post by David » Sat Jan 23, 2016 2:32 pm

I have just started a reddit thread. In case someone is interested in ZeroNet:
https://www.reddit.com/r/zeronet/commen ... ch_engine/

ZeroNet project page: https://zeronet.io/
David
 
Posts: 170
Joined: Tue Mar 05, 2013 5:35 pm

Re: YaCy as ZeroNet search engine

Post by data2016 » Wed Jan 27, 2016 9:57 pm

Interesting idea!

Is that possible with YaCy?
I have ZeroNet running under localhost:43110, with links to actual pages with content like this: http://127.0.0.1:43110/1EU1tbG9oC1A8jz2 ... 5asrNsE4Vr

So how can I
a) get YaCy to index those sites
b) share my indexed sites with other people, so they can also query my results?

The idea is this:
I have ZeroNet and YaCy running locally on the same computer and want to add my own site as a ZeroNet page in the format above.
Then I want other people who also use YaCy + ZeroNet to find my freshly indexed website when they search for a keyword that matches content on my 0-page.

Would I need to define something like freenet, i.e. a new shared YaCy network group (e.g. zerofreenet), and share the index among the users who join this group?
How can I configure YaCy to test that?
data2016
 
Posts: 4
Joined: Fri Jan 01, 2016 8:12 am

Re: YaCy as ZeroNet search engine

Post by luc » Thu Jan 28, 2016 8:38 am

This also sounds quite interesting to me.

Unfortunately, as far as I can tell, current YaCy is not ready to do what you describe. In Search Portal mode or PeerToPeer mode (see /ConfigBasic.html), localhost URLs are not allowed to be crawled or shared, as they are not supposed to be reachable from an external network. I tried crawling a ZeroNet site (ZeroTalk with images) but YaCy gives me this message: Crawling of "http://127.0.0.1:43110/1C2JhCunGLtvyX56nQ88tcb87WnXspjWN" failed. Reason: denied_(the host '127.0.0.1' is local, but local addresses are not accepted: 127.0.0.1)/
What you can do is at least try crawling ZeroNet pages in YaCy Intranet mode. It worked for me, but for some reason it does not look very efficient, as links are not followed or indexed: in the end I got only one page indexed, with little metadata (see /ViewFile.html).
Try it yourself; you may have more success in this mode.

By the way, it might not require that much work to get YaCy working with ZeroNet, but I believe some thinking and refactoring is needed.
luc
 
Posts: 294
Joined: Wed Aug 26, 2015 1:04 am

Re: YaCy as ZeroNet search engine

Post by Orbiter » Thu Jan 28, 2016 12:27 pm

So many p2p hidden networks have come up in recent years (I2P, IPFS, the Freenet Project and ZeroNet). From what I know, most of these networks use a local proxy to connect to them, which means they appear to YaCy as localhost addresses. To explain this no-localhost restriction again: it is there to protect your privacy. Without this restriction, information from your private intranet could be shared with other peers.

This means that "YaCy not ready" for "X" (X being one of the hidden-web networks) just refers to a simple "if" statement, not to the capability to crawl or index such networks. What we need here is a detailed profile of such networks, so that we can create a network definition which opens the p2p restriction in YaCy in such a way that it detects that the "intranet" is a port to such a network, with a defined proxy port and other network filters, so that there is no danger that private data is shared by mistake.

However, you can instantly simulate such a YaCy network for "X" with the intranet network definition. This would give you a YaCy search engine for such a network, "just" without a sharing option. If this works, we should then find out how we can create a pre-defined network definition for such networks, which you could then select in /ConfigNetwork_p.html
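
To make the idea of a network profile with "a defined proxy port and other network filters" more concrete, here is a rough sketch in Python. It is not YaCy code: NetworkProfile, ZERONET and is_shareable are made-up names, and the path pattern assumes that ZeroNet site addresses look like Bitcoin-style base58 addresses starting with "1", as in the URLs posted in this thread.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit


# Hypothetical profile type: nothing like this exists in YaCy under this name.
@dataclass(frozen=True)
class NetworkProfile:
    name: str
    proxy_host: str
    proxy_port: int
    path_pattern: re.Pattern


# Assumption: site addresses are base58 strings with a leading "1".
ZERONET = NetworkProfile(
    name="zeronet",
    proxy_host="127.0.0.1",
    proxy_port=43110,
    path_pattern=re.compile(r"^/1[1-9A-HJ-NP-Za-km-z]{25,34}(/|$)"),
)


def is_shareable(url: str, profile: NetworkProfile) -> bool:
    """Accept only URLs on the profile's proxy host and port whose path
    starts with something that looks like a site address."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return False
    if parts.hostname != profile.proxy_host or parts.port != profile.proxy_port:
        return False
    return profile.path_pattern.match(parts.path) is not None
```

With such a filter, a ZeroNet page on port 43110 would pass, while any other service on localhost (or any path on the proxy that is not a site address) would still be rejected, which is the privacy property the restriction is meant to keep.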
Orbiter
 
Posts: 5793
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: YaCy as ZeroNet search engine

Post by data2016 » Thu Jan 28, 2016 3:32 pm

Sounds reasonable!
So how exactly can I start test-crawling localhost:43110 ZeroNet sites after setting YaCy to "intranet" mode?
data2016
 
Posts: 4
Joined: Fri Jan 01, 2016 8:12 am

Re: YaCy as ZeroNet search engine

Post by luc » Sat Jan 30, 2016 3:26 pm

First, intranet mode:
- Go to the Administration > "Use case & account" menu (/ConfigBasic.html), choose "Intranet Indexing", and click the "Set Configuration" button.

Then you can try in at least three ways:
- Choose "URL Viewer" in "Search Interfaces" (/ViewFile.html): paste your ZeroNet site's local URL there and click the "Show Metadata" button. You will be able to see how YaCy parses this page, switching between different view modes with the "View as" combobox.
- Go to the Administration > "Load Web pages, Crawler" menu (/CrawlStartSite.html): paste your ZeroNet URL and click "Start New Crawl". You will see the "Crawled Pages" results at the bottom of the Crawler_p.html screen.
- Go to the Administration > "Advanced Crawler" menu (/CrawlStartExpert.html): you can also paste your URL here and click "Start New Crawl Job". You will also see the "Crawled Pages" results at the bottom.

I just tried all of these but, beyond the problem of accessing a hidden network through a proxy, the main issue is that the content of every ZeroNet site is filled dynamically into an iframe with JavaScript. As far as I know, YaCy does not render JavaScript when crawling, so all it parses and indexes is the header of ZeroNet.
I guess this is a more general issue, as many websites provide dynamic content through JavaScript. Is there some option or solution with the existing YaCy crawler?
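
To illustrate why a crawler without a JavaScript engine sees so little, here is a small Python sketch using only the standard library. WRAPPER_HTML is a made-up stand-in for a wrapper page; the real markup served by ZeroNet may differ, and this is not how YaCy's parser is implemented.

```python
from html.parser import HTMLParser

# Hypothetical wrapper page: real ZeroNet markup may look different.
WRAPPER_HTML = """<html><head><title>Example site - ZeroNet</title></head>
<body>
<iframe id="inner-iframe" src="about:blank"></iframe>
<script>/* content is loaded into the iframe at runtime */</script>
</body></html>"""


class StaticView(HTMLParser):
    """Collects what a parser can see without executing any JavaScript."""

    def __init__(self):
        super().__init__()
        self.page_title = ""
        self.found_links = []
        self.found_iframes = []
        self.visible_text = []
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.found_links.append(attrs["href"])
        elif tag == "iframe":
            self.found_iframes.append(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # script and style bodies are not page content
        if self._in_title:
            self.page_title += data
        elif data.strip():
            self.visible_text.append(data.strip())


def static_view(html: str) -> StaticView:
    parser = StaticView()
    parser.feed(html)
    parser.close()
    return parser
```

For the stand-in page, static_view(WRAPPER_HTML) finds a title and one iframe, but no links and no body text, so there is nothing for a crawler to follow or index beyond the header.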

About access to hidden networks and sharing of indexed data: I suppose it would not be a problem when the network is not accessed through a local proxy but only relies on an alternate name resolution system such as Namecoin. Does anyone already crawl .bit sites?
luc
 
Posts: 294
Joined: Wed Aug 26, 2015 1:04 am

Re: YaCy as ZeroNet search engine

Post by data2016 » Sun Jan 31, 2016 8:34 pm

Hi luc,

thanks for the explanation. I tried all three suggestions, but without much success.

I was able, though, to index the data folder by pointing the crawler at the cache via file:///. This gave me pretty good results when searching for words or content I knew I had come across while surfing ZeroNet, but the generated links are not of much use, as they point to the location in my cache file structure...

So is there any way or idea how to index content that is loaded dynamically via JavaScript?

Or how to restore a useful link structure from the cache (which is always a full copy of a visited site, as far as I understand ZeroNet)?

Greetings, Clemens.
data2016
 
Posts: 4
Joined: Fri Jan 01, 2016 8:12 am

Re: YaCy as ZeroNet search engine

Post by luc » Mon Feb 01, 2016 8:26 am

Using the cache was a good idea!

For now I can imagine two solutions for better indexing of ZeroNet:
- Generic: optionally use a browser engine to render pages instead of reading raw HTML when crawling with YaCy. This would benefit any dynamically filled website. For example, JavaFX has a WebKit-based component that renders HTML with JavaScript and CSS support: http://docs.oracle.com/javafx/2/webview ... ebview.htm
- ZeroNet-specific: customize YaCy to make use of their API to correctly parse links. I don't know whether their API already includes what would be needed (http://zeronet.readthedocs.org/en/lates ... reference/)
luc
 
Posts: 294
Joined: Wed Aug 26, 2015 1:04 am

Re: YaCy as ZeroNet search engine

Post by biolizard89 » Thu Feb 04, 2016 12:32 am

Orbiter wrote: So many p2p hidden networks have come up in recent years (I2P, IPFS, the Freenet Project and ZeroNet). From what I know, most of these networks use a local proxy to connect to them, which means they appear to YaCy as localhost addresses. To explain this no-localhost restriction again: it is there to protect your privacy. Without this restriction, information from your private intranet could be shared with other peers.

This means that "YaCy not ready" for "X" (X being one of the hidden-web networks) just refers to a simple "if" statement, not to the capability to crawl or index such networks. What we need here is a detailed profile of such networks, so that we can create a network definition which opens the p2p restriction in YaCy in such a way that it detects that the "intranet" is a port to such a network, with a defined proxy port and other network filters, so that there is no danger that private data is shared by mistake.

However, you can instantly simulate such a YaCy network for "X" with the intranet network definition. This would give you a YaCy search engine for such a network, "just" without a sharing option. If this works, we should then find out how we can create a pre-defined network definition for such networks, which you could then select in /ConfigNetwork_p.html


How exactly is YaCy currently detecting whether a URL is local? I gather it just does a DNS lookup of the domain in the URL? There are a lot of intricacies of getting this right, which are usually dependent on what non-IP network is used.

In Tor's case, most hidden services are publicly accessible, but some require client authentication which is done by the Tor daemon. Indexing an authenticated hidden service would be very bad. I don't know if the Tor daemon's API gives an easy way to detect whether a hidden service used client authentication.

In Namecoin's case, a domain can resolve to any IPv4/IPv6 address (which should be easy to check for locality), but can also resolve to a Tor or I2P hidden service, as well as Freenet, ZeroNet, and CJDNS (although not all of these are widely supported by current software). Namecoin might also in the future support encrypted records. Unfortunately, I'm not even sure what the right policy is on indexing encrypted Namecoin sites, because some domain owners would only use encrypted records to make blockchain censorship more expensive, while others would be using it for privacy. This is definitely worth thinking about.

I don't know enough about the systems other than Tor and Namecoin to know what their requirements are, but I strongly suspect that many of them will have their own unique issues to deal with. It is not as simple as whitelisting .onion.
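
For the plain IPv4/IPv6 case, the locality check could look roughly like the Python sketch below. This is only an illustration with made-up function names, not a description of what YaCy actually does, and it says nothing about the harder cases (authenticated hidden services, encrypted records).

```python
import ipaddress
import socket


def is_local_address(ip: str) -> bool:
    """True for loopback, private, link-local and unspecified addresses."""
    # Drop an IPv6 zone id such as "%eth0", which older Python versions reject.
    addr = ipaddress.ip_address(ip.split("%", 1)[0])
    return (
        addr.is_loopback
        or addr.is_private
        or addr.is_link_local
        or addr.is_unspecified
    )


def host_is_local(host: str) -> bool:
    """Resolve a host name and report whether any resolved address is local.

    Treating the host as local if *any* address is local is the cautious
    choice for a crawler that must not leak intranet content.
    """
    infos = socket.getaddrinfo(host, None)
    return any(is_local_address(info[4][0]) for info in infos)
```

A check like this only covers names that resolve to ordinary IP addresses; a name that maps to a Tor, I2P or ZeroNet target never reaches this code path and needs its own policy.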
biolizard89
 
Posts: 61
Joined: Thu Jan 03, 2013 12:42 am

Re: YaCy as ZeroNet search engine

Post by Orbiter » Sat Feb 20, 2016 11:50 pm

ZeroNet stores visited web pages in a local data path. That path can easily be indexed with YaCy. The indexed file path could then be translated into the ZeroNet URL scheme using the site hash, which is also part of the data path. That requires some coding and extra logic in the crawler. So far this appears to be a workable option for accessing ZeroNet content. The next question to solve is how ZeroNet indexes should be shared. That requires a public network which handles only localhost addresses. That's strange, but obviously required.
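
The path translation could be sketched like this in Python. The layout <data root>/<site address>/<path inside the site> is an assumption based on the description above, the function name and the example path are invented, and the proxy address is the one used earlier in this thread.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

PROXY_BASE = "http://127.0.0.1:43110"  # local ZeroNet address from this thread


def cache_path_to_zeronet_url(file_path: str, data_root: str) -> str:
    """Map <data_root>/<site address>/<inner path> to a ZeroNet proxy URL.

    Raises ValueError if file_path is not below data_root or contains
    no site address component.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(data_root))
    if not relative.parts:
        raise ValueError("path does not contain a site address")
    site_address, *inner = relative.parts
    # Percent-encode each path segment so spaces etc. yield a valid URL.
    inner_path = "/".join(quote(part) for part in inner)
    url = f"{PROXY_BASE}/{site_address}"
    return f"{url}/{inner_path}" if inner_path else url
```

For example, with a hypothetical data root of /home/user/ZeroNet/data, the file /home/user/ZeroNet/data/1C2JhCunGLtvyX56nQ88tcb87WnXspjWN/index.html would map to http://127.0.0.1:43110/1C2JhCunGLtvyX56nQ88tcb87WnXspjWN/index.html. The result is still a localhost URL, so the sharing question remains open.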
Orbiter
 
Posts: 5793
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

