
Thank you very much

Posted: Sun Sep 24, 2017 1:23 pm

Code:
Postprocessing Progress 
busy:postprocessed 34300 from 106327778 collection documents; 1426 ppm; 74521 minutes remaining

I would like to thank all the devs involved in refactoring the postprocessing routines. The procedure now runs to my full satisfaction! The estimated time to completion dropped from over 700 years (before the routines were refactored) to 52 days.

Outstanding work! Thank you very much


Re: Thank you very much

Posted: Mon Sep 25, 2017 6:21 am
by luc
Good to know this task is starting to become useful within the bounds of a human life ;)

Do you run YaCy with the very latest sources from GitHub? (I wonder to what extent the latest Solr upgrades also contributed to this improvement in post-processing performance...)

Re: Thank you very much

Posted: Mon Sep 25, 2017 11:25 am
Hi Luc,

Exactly, I just pulled the newest commit with the command

git clone

then made a few hacks because of my giant index size of 200 million documents. But I didn't touch any code related to the postprocessing procedures, because I lack the Java skills. Then I just compiled the sources with the command

ant clean all

I also added the switches -XX:+UseParallelGC -XX:+UseNUMA to the startup script. In multiprocessor environments these switches increase the performance a bit.

Yes, you're right, I think the integration of the latest Solr version is partly responsible for the performance gain, too.

Re: Thank you very much

Posted: Fri Oct 06, 2017 7:25 am
After a few days the rate dropped to 160 ppm, so the process would now take over a year to complete again :-(
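
The figures quoted in this thread can be cross-checked with a bit of arithmetic. The sketch below assumes that "ppm" means documents processed per minute and that the remaining time is simply the outstanding documents divided by that rate; the function names are made up for illustration and are not part of YaCy. For the 160 ppm case it reuses the document counts from the first post, ignoring whatever was processed in between.

```python
import re

# Status line copied from the first post in this thread.
STATUS = ("busy:postprocessed 34300 from 106327778 collection documents; "
          "1426 ppm; 74521 minutes remaining")


def parse_status(line: str) -> tuple[int, int, int]:
    """Extract (done, total, ppm) from a status line shaped like STATUS."""
    match = re.search(r"postprocessed (\d+) from (\d+) .*?; (\d+) ppm", line)
    if match is None:
        raise ValueError("unrecognised status line")
    done, total, ppm = (int(group) for group in match.groups())
    return done, total, ppm


def minutes_remaining(done: int, total: int, ppm: float) -> float:
    """Estimate remaining minutes, assuming ppm means documents per minute."""
    return (total - done) / ppm


done, total, ppm = parse_status(STATUS)
for rate in (ppm, 160):
    minutes = minutes_remaining(done, total, rate)
    # 1440 minutes per day, 525600 minutes per (365-day) year
    print(f"{rate} ppm: {minutes / 1440:.0f} days ({minutes / 525600:.2f} years)")
```

Under these assumptions, 1426 ppm works out to roughly 52 days, close to the 74521 minutes in the status line, while 160 ppm works out to about 461 days, which fits the "over 1 year" observation.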

Question: If I crawl some sites on another peer and export that index via the XML export feature (Rich and full-text Solr data), has the postprocessing procedure already been run at that point? In other words, does the data dump already contain the postprocessing data, or does it need to be computed again?

Re: Thank you very much

Posted: Thu Oct 12, 2017 8:47 am
by luc
Hi LA_FORGE, sorry for the delayed answer, but as far as I know:
- post-processing runs only once all crawls are terminated (see the conditional check)
- once post-processed and committed, related Solr fields are indeed exported with the XML export feature, so they do not need to be computed again.

A few complementary remarks on export/import, however:
- the webgraph collection is not exported, so obviously you also lose any post-processing computation on webgraph collection fields when exporting
- the computation of some post-processed fields depends on the local peer data: for example, references post-processing uses the citation index, and possibly the webgraph collection if enabled. So to my mind, to be truly accurate, these values should be computed again when importing to another peer with a larger or a different index. But it won't be done automatically after import, as the fields marking that post-processing is needed (process_sxt and harvestkey_s) are cleaned up after a successful post-processing...

Re: Thank you very much

Posted: Fri Oct 13, 2017 9:33 am
Great! Thank you very much