Yacy automatic web crawling

Here YaCy users can find help when something doesn't work, or works differently than expected. Please report obvious errors directly in the bug tracker (http://bugs.yacy.net).
Forum rules
This forum is for usage problems and requests for help. If a bug is identified along the way, the thread will be moved to the bug section for processing. So if you posted a thread here and now can't find it, you will surely find it again in the bug section.


Post by ainslied » Mon May 09, 2016 1:35 pm

Hi all,

I have learned German in the past, but I will continue in English, sorry for that!
First, let me say that I discovered YaCy a very short time ago, and it looks very powerful :P . Furthermore, the Windows installation is very easy, and it is available in several languages, which is quite impressive! Some pages of the website are translated into French, and some video tutorials exist in English, which is good too. Sadly, I can't find any French YaCy community :| . I have found some old articles from 2011, but it seems that YaCy has been strongly improved since then.

I have read a little of the documentation and have watched the tutorial videos. But I have not well understood the default behaviour of a fresh install (after the basic configuration is done):
:arrow: Do YaCy nodes permanently and automatically crawl the whole world wide web, or should I manually define the websites to be crawled on my computer?

In my understanding, by default, YaCy doesn't index anything until you configure some websites or sources to be crawled.
If that is true, I think it could be very interesting to develop a feature which allows all nodes to crawl the whole web automatically, following some basic rules about which pages should be crawled first (frequently updated pages, banned or priority topics defined by default or by node owners, etc.), and maybe introducing some coordination between nodes (don't crawl a page again if it has just been crawled by another node).
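The coordination rule described here could be sketched as a simple freshness check: skip any URL whose last crawl (by any peer) is newer than some threshold. A minimal illustrative sketch in Python follows; the function names, the interval, and the shared last-crawl store are all hypothetical and not part of YaCy:

```python
import time

# Hypothetical policy: recrawl a page at most once per week.
RECRAWL_INTERVAL = 7 * 24 * 3600

# Hypothetical shared record, e.g. gossiped between peers:
# {url: unix timestamp of the most recent crawl by any node}
last_crawled = {}

def should_crawl(url, now=None):
    """Return True if no peer has crawled this URL within the interval."""
    now = time.time() if now is None else now
    last = last_crawled.get(url)
    return last is None or (now - last) >= RECRAWL_INTERVAL

def record_crawl(url, now=None):
    """Note that some peer just crawled this URL."""
    last_crawled[url] = time.time() if now is None else now
```

With such a shared record, a peer picking the next URL from its queue would call `should_crawl` first and drop URLs that another node has just visited.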

I understand that this feature could require some new development :geek: , but imagine the power of this kind of system: very quickly, many more pages would be indexed by YaCy, we could expect to stop using proprietary search indexers :P , and we could invite our non-geek friends and family to install and use YaCy (themselves indexing the web without configuring anything)!

Thank you if you can answer me, and sorry if I have misunderstood the functionality of YaCy and how it should be used :roll: .
Let's discuss this feature in this topic if you are interested :!:

Regards,
Ainslie
ainslied
 
Posts: 1
Registered: Mon May 09, 2016 12:28 pm

Re: Yacy automatic web crawling

Post by sixcooler » Mon May 09, 2016 9:50 pm

Hi Ainslie,

you're right: a fresh YaCy install crawls nothing until you start a crawl by giving it a start point.
But your peer receives index data from other peers, and this already helps a lot when resolving a search request.
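Besides the web interface, a crawl start can also be submitted over HTTP to the peer's /Crawler_p.html servlet. The sketch below only builds such a request; the parameter names (crawlingURL, crawlingDepth, crawlingMode) are assumptions based on the crawl-start form, so verify them against your own peer's page before relying on them:

```python
from urllib.parse import urlencode

# Assumed form parameters of /Crawler_p.html -- check your peer's crawl-start page.
params = {
    "crawlingMode": "url",
    "crawlingURL": "http://example.org/",
    "crawlingDepth": 2,  # keep this small; the frontier grows fast with depth
}

# The request one would send to a local peer (default port 8090):
request = "http://localhost:8090/Crawler_p.html?" + urlencode(params)
print(request)
```

Sending this request (with your admin credentials) would have the same effect as filling in the crawl-start form by hand.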

I don't understand what you mean by the basic rules to crawl by default.
If you don't limit the crawler to the start domain and give it a high crawling depth, you will very soon get more than we are able to handle :-)
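The warning about high crawling depth can be made concrete with a rough estimate: if every page links to about `links` other pages (the figure of 50 below is an assumption, not a measurement), an unrestricted crawl of depth d can reach on the order of links**d pages:

```python
# Rough frontier-size estimate for an unrestricted crawl:
# with an assumed average of `links` out-links per page,
# depth d reaches about links**d pages.
links = 50  # assumed average out-links per page

for depth in range(1, 5):
    print(f"depth {depth}: up to ~{links ** depth:,} pages")
```

Already at depth 3 this is in the hundreds of thousands of pages, which is why limiting the crawl to the start domain or keeping the depth small matters.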

You can also help others who distribute their crawl jobs by enabling remote crawling (/RemoteCrawl_p.html).
Once you have index data, there will also be old pages in the index; these may need to be recrawled (/IndexReIndexMonitor_p.html).

For me, it is best that all users crawl pages of their own interest, to get a well-blended index.

Cu, sixcooler.
sixcooler
 
Posts: 487
Registered: Thu Aug 14, 2008 5:22 pm

