PPM limitations in YaCy?


Post by sbolokanov » Wed Nov 12, 2014 12:29 pm

Hello, YaCy folks!

Recently I decided to finally give YaCy a go.
After a few weeks of usage and learning in the process, I got to the point where I need help from somebody that knows more about it.

I'm interested in how PPM limitations work in YaCy.
Since some sites have limitations for crawlers to prevent DoS and others don't, I figure that YaCy must have such limitations in itself to prevent this from happening.

My question is:
Are these limitations hardcoded into YaCy, or are there settings that we can tweak to adjust them to our own usage needs?
If they are hardcoded and there are no settings for this, can somebody point me to the files that contain these limitations?
sbolokanov
 
Posts: 4
Joined: Mon Nov 10, 2014 6:24 pm

Re: PPM limitations in YaCy?

Post by Orbiter » Wed Nov 12, 2014 3:01 pm

There is actually a hardcoded limit of 2 documents per second for the same domain. This goes together with a proper identification of the crawler as 'yacybot'. The combination of the limit and the identification of the crawler is a promise to web hosters that YaCy is a well-behaved robot and does not overload web services.

Furthermore, we fully support the robots.txt standard, which may demand even slower crawling.

If you start several crawls for different domains, the rates add up: if you start 5 crawls, YaCy loads 10 web pages per second, but the limit for a single domain is always the same. For most cases this is sufficient, because you can load 172800 documents per day from a single domain. Most domains do not have that many documents, so this should work for most of us.
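The figures above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are made up for illustration and are not taken from the YaCy source.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

# Taken from the post above: at most 2 documents per second per domain.
# The name is hypothetical, not a YaCy identifier.
DOCS_PER_SECOND_PER_DOMAIN = 2


def docs_per_day_single_domain(rate_per_second=DOCS_PER_SECOND_PER_DOMAIN):
    """Upper bound on documents loaded from one domain in 24 hours."""
    return rate_per_second * SECONDS_PER_DAY


def total_pages_per_second(domains, rate_per_second=DOCS_PER_SECOND_PER_DOMAIN):
    """Combined rate when crawling several different domains in parallel."""
    return domains * rate_per_second


def pages_per_minute(rate_per_second):
    """Convert a per-second rate into PPM."""
    return rate_per_second * 60


print(docs_per_day_single_domain())  # 172800
print(total_pages_per_second(5))     # 10
print(pages_per_minute(2))           # 120
```

So the per-domain limit of 2 documents per second corresponds to 120 PPM for a single-domain crawl.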

If you want to index really large domains like Wikipedia, you can import the Wikipedia XML dump. The import reaches up to 60000 ppm.

If you run YaCy in an intranet, the limit is removed for intranet addresses. I tested this with more than 10000 ppm. Speed is not a problem for YaCy, but it is for most hosters.
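To illustrate the behaviour described in this post, here is a minimal sketch of a per-host throttle with an intranet exemption. It is not YaCy's code: the class, the function names and the rule for what counts as an intranet address (literal private or loopback IPs, plus "localhost") are assumptions made for the example.

```python
import ipaddress

# 2 documents per second per domain means at least 0.5 s between fetches.
# Hypothetical constant name, not taken from the YaCy source.
MIN_DELAY_SECONDS = 0.5


def is_intranet_host(host):
    """Assumption: only 'localhost' and literal private/loopback IPs are intranet."""
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A hostname rather than an IP; a real crawler would resolve it first.
        return False
    return addr.is_private or addr.is_loopback


class HostThrottle:
    """Tracks the last fetch per host and reports how long to wait."""

    def __init__(self):
        self._last_fetch = {}  # host -> timestamp of the last fetch

    def wait_time(self, host, now):
        """Seconds to wait before `host` may be fetched again at time `now`."""
        if is_intranet_host(host):
            return 0.0  # no limit for intranet addresses
        last = self._last_fetch.get(host)
        if last is None:
            return 0.0  # first fetch for this host
        return max(0.0, MIN_DELAY_SECONDS - (now - last))

    def record_fetch(self, host, now):
        """Remember that `host` was fetched at time `now`."""
        self._last_fetch[host] = now
```

Passing the clock value in as `now` keeps the sketch deterministic; a caller would typically supply `time.monotonic()`.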
Orbiter
 
Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: PPM limitations in YaCy?

Post by sbolokanov » Wed Nov 12, 2014 7:51 pm

Orbiter wrote: There is actually a hardcoded limit of 2 documents per second for the same domain. This goes together with a proper identification of the crawler as 'yacybot'. The combination of the limit and the identification of the crawler is a promise to web hosters that YaCy is a well-behaved robot and does not overload web services.


That's what I thought.

I tried to crawl a local site that's quite big at the current speed, around 120 PPM. After a few days of crawling my system crashed, which ruined the crawl. For some reason the crawl was gone on the next run of YaCy, so it never finished.
I want to increase the speed a little, say to 4 pages per second. The server can certainly handle more than 5-6 pages per second, so that will not be a problem.

I also wonder what happens if there is a robots.txt and it allows more than the hardcoded value.
Will YaCy adjust to the limit that robots.txt allows, or will it stick to the hardcoded value?

Thanks for the quick response!
sbolokanov
 
Posts: 4
Joined: Mon Nov 10, 2014 6:24 pm

Re: PPM limitations in YaCy?

Post by Orbiter » Wed Nov 12, 2014 9:33 pm

The robots.txt "Crawl-delay" feature uses integer values, which mean seconds. I was not able to find proper documentation stating that this value is actually integer-only, but in all these years I have only seen integer values for Crawl-delay. That means a Crawl-delay can only make crawling even slower.

If you used non-integer values for Crawl-delay, the current parser would not recognize them, and there is also no adaptation to delays below the hardcoded one, because there has been no need for that.
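The rule described here can be sketched roughly as follows. This is an illustration under stated assumptions, not YaCy's parser: the function names and the 0.5 second constant (derived from the 2 documents per second mentioned earlier in the thread) are hypothetical, and the sketch ignores User-agent groups for brevity.

```python
import re

# Hypothetical name; 0.5 s corresponds to 2 documents per second per domain.
HARDCODED_MIN_DELAY_SECONDS = 0.5

# Accept only whole-number values, optionally followed by a comment.
_CRAWL_DELAY = re.compile(r"^\s*crawl-delay\s*:\s*(\d+)\s*(?:#.*)?$", re.IGNORECASE)


def parse_crawl_delay(robots_txt):
    """Return the first integer Crawl-delay in seconds, or None if there is none."""
    for line in robots_txt.splitlines():
        match = _CRAWL_DELAY.match(line)
        if match:
            return int(match.group(1))
    return None  # missing, or not an integer (e.g. "0.25")


def effective_delay(robots_txt):
    """Delay between fetches: robots.txt can slow the crawl down, never speed it up."""
    delay = parse_crawl_delay(robots_txt)
    if delay is None:
        return HARDCODED_MIN_DELAY_SECONDS
    return max(HARDCODED_MIN_DELAY_SECONDS, float(delay))
```

With this logic, `Crawl-delay: 10` yields a 10 second delay, while `Crawl-delay: 0.25` is not recognized and the hardcoded delay stays in effect.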
Orbiter
 
Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: PPM limitations in YaCy?

Post by sbolokanov » Wed Nov 12, 2014 11:19 pm

I didn't know that it has to be an integer. Thanks for clearing that up for me.

The only thing left for me is to hunt down where that constant (the default maximum number of pages per domain per second) is stored in the source code.

Once again, thanks for the quick responses.
Good night.

edit:
Aaaand done. LOL
I must say, I really like this project. I can see it taking off in the future, especially among advanced users.

Thanks for the help, Orbiter. I appreciate it.
sbolokanov
 
Posts: 4
Joined: Mon Nov 10, 2014 6:24 pm

