Load Balance of API Crawls Timer problem.



Post by smokingwheels » Fri Oct 31, 2014 7:31 am

My idea is to download the API table, load-balance it, then Delete All and re-enter the new entries within a short space of time.

I have been trying to work out how to shift the "Date next Exec" time of the API calls so that I can offset them and load-balance my peer when I have a lot of, e.g., RSS feeds or web crawls from the same site.

Just a rough example, assuming I have the rig to handle it:
Crawl 1000 RSS feeds, each once per hour, 24 hours a day.
Ideally I need to schedule each new crawl 3.6 seconds apart (3600 s / 1000 feeds) so as not to overload any of the systems.
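The 3.6-second figure follows from spreading 1000 starts evenly over 3600 seconds. A minimal Python sketch of that arithmetic is below; the function name `staggered_starts` and the example start time are assumptions made for illustration and have nothing to do with YaCy itself.

```python
from datetime import datetime, timedelta, timezone

def staggered_starts(first_start, count, window_seconds=3600):
    """Return `count` start times spread evenly across one window."""
    spacing = window_seconds / count  # e.g. 3600 / 1000 = 3.6 seconds
    return [first_start + timedelta(seconds=i * spacing) for i in range(count)]

# Arbitrary example start time (UTC), chosen only for the demonstration.
base = datetime(2014, 10, 31, 0, 0, 0, tzinfo=timezone.utc)
starts = staggered_starts(base, 1000)

print(starts[1] - starts[0])  # gap between two consecutive crawls
print(starts[-1] - base)      # offset of the last crawl within the hour
```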

I have tried to calculate the API timer's resolution (counts) in order to recalculate Date next Exec, but I get two different results.
Does anybody have a figure for what the time base should be?

[Attachment: Yacy Timer API.JPG (API Timer Calc)]
smokingwheels
 
Posts: 126
Joined: Sat Aug 31, 2013 7:16 am

Re: Load Balance of API Crawls Timer problem.

Post by Orbiter » Fri Oct 31, 2014 10:36 am

This is in general a very good idea, but
smokingwheels wrote: I ideally need to schedule each new crawl 3.6 seconds apart as not to overload any of the systems.

(which is also not a bad idea) does not work. The reason is that the scheduler for the API actions does not work this way. Here is how it works:
- there is a "cleanup" process which runs every 10 minutes (you can change this in /PerformanceQueues_p.html, see the "Delay between busy loops" column for the "Cleanup" row)
- as part of the cleanup process (which does a lot, e.g. cleaning caches, running postprocessing etc.) the API table (see also: /Tables_p.html?table=api ) is checked for processes that are due, which are then all started at once

That means that even if you configure a distance of 3.6 seconds between such starts in the schedule, a set of them would be started at once.
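To make the batching effect concrete, here is a small Python simulation. It is only a sketch under simplifying assumptions: the check is assumed to fire at exact multiples of the interval counted from time zero, all times are integer milliseconds, and `batch_by_tick` is a name invented for this example.

```python
from collections import Counter

def batch_by_tick(due_times_ms, tick_ms):
    """Map each due time to the first scheduler check at or after it.

    Returns {tick_time_ms: number_of_actions_started_at_that_check}.
    """
    batches = Counter()
    for due in due_times_ms:
        # Integer ceiling division: the first tick that is >= due.
        tick = -(-due // tick_ms) * tick_ms
        batches[tick] += 1
    return dict(sorted(batches.items()))

# 1000 entries scheduled 3.6 s (3600 ms) apart, checked every 10 minutes.
due_ms = [i * 3600 for i in range(1000)]
batches = batch_by_tick(due_ms, tick_ms=600_000)

for tick, started in batches.items():
    print(f"check at {tick // 60_000:3d} min starts {started} actions")
```

With a 10-minute check, the evenly spaced entries collapse into a handful of bursts of well over a hundred starts each, which is the effect described above.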
Apart from the automation you want to set up, I suggest the following two options for solving the scheduling problem that results from the current architecture:

- either move the API scheduler process out of the cleanup process into its own busy thread, so you can set whatever frequency you want
- or add a delay option to the scheduled process start, so that the API calls are not made too quickly one after another

Although the first option is much more work, I think it is the better one. The second option could also be implemented independently of the first.
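The second option (a delay between process starts) might look roughly like this in outline. This is a hypothetical Python sketch, not YaCy code: `start_due_actions`, the `(due_time, callable)` pairs and the `delay_seconds` parameter are all invented for illustration.

```python
import time

def start_due_actions(actions, now, delay_seconds, sleep=time.sleep):
    """Start every action whose due time has passed, pausing between starts.

    `actions` is a list of (due_time, callable) pairs. This structure is
    hypothetical and only stands in for rows of an API action table.
    """
    started = 0
    for due, run in sorted(actions, key=lambda pair: pair[0]):
        if due > now:
            continue  # not due yet, leave it for a later check
        if started:
            sleep(delay_seconds)  # spacing between consecutive starts
        run()
        started += 1
    return started

log = []
actions = [
    (0, lambda: log.append("a")),
    (5, lambda: log.append("b")),
    (99, lambda: log.append("c")),  # not yet due at now=10
]
print(start_due_actions(actions, now=10, delay_seconds=0.01))
```

The `sleep` parameter is injectable only so the pause can be replaced in tests; a real scheduler would more likely queue the starts than block its own thread.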
Orbiter
 
Posts: 5787
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: Load Balance of API Crawls Timer problem.

Post by smokingwheels » Fri Oct 31, 2014 12:31 pm

Orbiter wrote: This is in general a very good idea, but

OK, many thanks. That gives me lots of options for homework over the next few months.
I really like the YaCy concept. In my tech English: "If the signal is there, why not tune in?"

Update 1-11-14
The time in the API table is in GMT/UTC, not your local time.
Code:
20141031223424889
YYYYMMDDhhmmssSSS  (year, month, day, hour, minute, second, millisecond)
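For anyone who wants to shift such a value programmatically, the following Python sketch parses the 17-digit layout shown above, adds an offset, and formats it back. The helper names `parse_api_time` and `format_api_time` are assumptions for this example; the layout itself is taken only from the sample value in this post.

```python
from datetime import datetime, timedelta, timezone

def parse_api_time(stamp):
    """Parse a 17-digit YYYYMMDDhhmmssSSS string as a UTC datetime."""
    seconds_part, millis = stamp[:14], int(stamp[14:17])
    parsed = datetime.strptime(seconds_part, "%Y%m%d%H%M%S")
    return parsed.replace(microsecond=millis * 1000, tzinfo=timezone.utc)

def format_api_time(moment):
    """Format a datetime back into the same 17-digit UTC layout."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y%m%d%H%M%S") + f"{utc.microsecond // 1000:03d}"

stamp = "20141031223424889"
shifted = parse_api_time(stamp) + timedelta(seconds=3.6)
print(format_api_time(shifted))  # prints 20141031223428489
```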

Re: Load Balance of API Crawls Timer problem.

Post by Orbiter » Wed Nov 12, 2014 10:50 am

I will probably implement the 'high-precision timer' as described above, but I am currently very busy with customer requests. Please hold on or remind me later.

Re: Load Balance of API Crawls Timer problem.

Post by Orbiter » Fri Nov 14, 2014 10:06 am

I found time to do this; it's now implemented. The check for due processes in the API action table now runs once every minute. However, this does not bring higher precision for process periods, which still have a minimum of 10 minutes to prevent this tool from being misused for DoS purposes. As far as I can see, this is not important for your idea because you want to use the event trigger? If so, then a feature is missing to set exact minutes in the event trigger; the trigger can currently only execute at full hours.

Re: Load Balance of API Crawls Timer problem.

Post by smokingwheels » Fri Nov 14, 2014 12:33 pm

OK cool, the 1 minute is fine. I changed the cleanup busy time to 30000 30000.
If I have too many scheduled crawls running at a particular time, I just edit the API table and offset the next start time, and it works a treat.
Cheers
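The manual offsetting described in this post could in principle be automated along these lines. This Python sketch is an assumption-laden illustration: it works on plain integer minute indices rather than real API table rows, the cap `max_per_minute` is an invented parameter, and entries are only ever pushed later, never earlier.

```python
from collections import defaultdict

def spread_overloaded_minutes(due_minutes, max_per_minute):
    """Push entries out of overfull minute slots into later minutes.

    `due_minutes` holds one integer minute index per scheduled call.
    Returns the new minute indices in ascending order of original due time.
    """
    load = defaultdict(int)  # minute index -> calls already placed there
    result = []
    for minute in sorted(due_minutes):
        slot = minute
        while load[slot] >= max_per_minute:
            slot += 1  # try the next one-minute check
        load[slot] += 1
        result.append(slot)
    return result

# Five calls due in minute 0, one in minute 1, one in minute 10.
print(spread_overloaded_minutes([0, 0, 0, 0, 0, 1, 10], max_per_minute=2))
# prints [0, 0, 1, 1, 2, 2, 10]
```

Minute granularity is used here because the due check is said above to run once every minute, so finer offsets would still be started together.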

Re: Load Balance of API Crawls Timer problem.

Post by smokingwheels » Fri Dec 09, 2016 10:58 pm

With the recent scheduler fixes you can add offsets by actually pausing in real time for a preset period before adding another site to crawl.

Thank you for your time.

