Recrawling


Recrawling

Post by irnerio » Tue Apr 11, 2017 2:32 pm

I'm not sure whether recrawling actually works in YaCy. By recrawling I mean crawling the websites of the index again, so that new pages are added and/or old, broken ones are removed.

On http://websitename.com/IndexReIndexMonitor_p.html there is a "Re-Crawl Index Documents" function, which I activated, but:
1) the number of documents to process does not correspond to the real total number of indexed documents;
2) it has not indexed all the new documents published on the websites I'm monitoring.

Kind regards.

Mario
irnerio
 
Posts: 11
Registered: Fri Mar 17, 2017 9:03 pm

Re: Recrawling

Post by luc » Fri Apr 14, 2017 2:26 pm

Hi Mario,
the Re-Crawl feature re-crawls and updates documents that are already indexed, but there the crawl depth is set to zero: this means that if an indexed page contains new links in a newly published version (not yet known to your YaCy peer), their content won't be crawled and added to your local index.
To my mind, if you want to be sure of keeping a fresh index of a website, you had better regularly run a full crawl (/CrawlStartExpert.html), possibly scheduled with the YaCy internal Process Scheduler (/Table_API_p.html), with a cron task, or with any other convenient external scheduler.
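For example, something like this should start such a full crawl from the command line (just a rough sketch: I assume here that YaCy lives in ~/yacy, and the Crawler_p.html parameters below are only placeholders; better copy the exact API URL that your own crawl start recorded in /Table_API_p.html):

    # start a full crawl of example.com with depth 3 through the YaCy API
    cd ~/yacy/bin
    ./apicall.sh "Crawler_p.html?crawlingstart=&crawlingMode=url&crawlingURL=http://example.com&crawlingDepth=3"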

Best regards,
Luc
luc
 
Posts: 264
Registered: Wed Aug 26, 2015 1:04 am

Re: Recrawling

Post by irnerio » Fri Apr 14, 2017 9:10 pm

Hi Luc! I'm trying the full recrawl plan now. By full recrawl I mean that I've entered all the websites of the index again and recrawled them. I don't understand how to set this up with cron. It's a shame that there isn't a built-in way to keep the index automatically updated. I'm not a programmer, and that's probably a big obstacle with this functionality.

Thank you again for your response. If somebody donates me a few bucks, I'll have the money to pay a programmer :-D

Mario
irnerio
 
Posts: 11
Registered: Fri Mar 17, 2017 9:03 pm

Re: Recrawling

Post by luc » Sat Apr 15, 2017 9:00 am

With the YaCy Process Scheduler (/Table_API_p.html) you just have to:
- find your last crawl start action(s): on that page you can search for "crawl" or, for example, sort by "Recording Date" by clicking on that column;
- in the "Scheduler" column, select the "activate scheduler" option in the combo box and then the appropriate periodicity.

With cron on a Linux machine it is also quite simple:
- first try to start your crawl from the command line, for example with the help of YaCy's bin/apicall.sh script, using the parameters recorded in /Table_API_p.html;
- then schedule this command line as a cron job, as sketched below; see for example the related Debian documentation.
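
A rough crontab sketch (same assumptions as above: YaCy in /home/yacy/yacy and placeholder crawl parameters; use the URL recorded in your own /Table_API_p.html):

    # run "crontab -e" as the user that runs YaCy, then add a line like
    # this (fields: minute hour day-of-month month day-of-week command)
    # to re-run the recorded crawl start every night at 00:00:
    0 0 * * * cd /home/yacy/yacy/bin && ./apicall.sh "Crawler_p.html?crawlingstart=&crawlingMode=url&crawlingURL=http://example.com&crawlingDepth=3" >/dev/null 2>&1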

But of course I agree it is not super user-friendly: why not share here some ideas about which features would make it easier? Some ideas:
- an option on the crawl start page to state directly that you want to run this task regularly?
- a new dedicated page with a list of websites/URLs whose index should be kept up to date?
...
luc
 
Posts: 264
Registered: Wed Aug 26, 2015 1:04 am

Re: Recrawling

Post by irnerio » Sat Apr 15, 2017 2:47 pm

I've just scheduled a process to recrawl the whole index at 00:00 every day. :shock:

But I've noticed that when you recrawl, the existing index is deleted. A way to add only new pages to the database, without deleting the old index, would be useful.

I.e.: 1) you recrawl; 2) the robot finds new pages; 3) it adds the new pages to the archive and deletes the ones that no longer work.

Yes, a new dedicated page where you can edit a list of web pages to keep updated would be great.

My YaCy engine is http://irnerio.sabatino.pro; it is a search engine on the topic of Italian law.



irnerio
 
Posts: 11
Registered: Fri Mar 17, 2017 9:03 pm

Re: Recrawling

Post by smokingwheels » Fri Apr 21, 2017 6:30 pm

In /CrawlStartExpert.html you may have to set the "Clean-Up before Crawl Start" options before your schedule is put in place.
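
Untested sketch: if you want the scheduled crawl to add new pages without wiping the old documents first, there should be a deleteold parameter in the recorded Crawler_p.html call (the parameter name and its off value are an assumption on my part, so check the exact URL in your own /Table_API_p.html):

    # same kind of apicall.sh start as above, but asking YaCy not to delete
    # previously indexed documents before the crawl begins (assumed flag)
    cd ~/yacy/bin
    ./apicall.sh "Crawler_p.html?crawlingstart=&crawlingMode=url&crawlingURL=http://example.com&crawlingDepth=3&deleteold=off"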

Also, if you use RSS feeds, it may be possible to overload your system (speaking from the point of view of a P4) if you have lots of schedules too close together.
smokingwheels
 
Posts: 107
Registered: Sat Aug 31, 2013 7:16 am

Re: Recrawling

Post by irnerio » Wed Apr 26, 2017 8:59 pm

I've never tried the RSS function. It seems interesting. Thx
irnerio
 
Posts: 11
Registered: Fri Mar 17, 2017 9:03 pm

