Postprocess: nine days and counting


Post by davide » Fri Oct 16, 2015 12:39 pm

Nine days ago YaCy automatically paused a crawl for lack of disk space and consequently started post-processing. It has been running 24/7 since then and still has not finished.

It keeps alternating between a short CPU-bound phase and a much longer disk-bound phase: from all cores at 100% to a full queue of small hard disk read operations.
The index is moderate in size: 24M records in 250 GB. The OS file cache is 12 GB and the JVM heap is 4 GB.
Is there anything I can do to speed it up?
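
For scale, here is a quick back-of-the-envelope calculation from the figures above. It assumes decimal gigabytes and that the whole 250 GB is index data, neither of which I have verified:

```python
# Figures quoted in this post; decimal units are an assumption.
index_bytes = 250e9   # index size on disk
cache_bytes = 12e9    # OS file cache
records = 24e6        # indexed records

# Share of the index that can sit in the OS file cache at once.
cache_fraction = cache_bytes / index_bytes
# Average on-disk footprint per record, including any storage overhead.
avg_record_bytes = index_bytes / records

print(f"cache covers {cache_fraction:.1%} of the index")
print(f"average on-disk size per record: {avg_record_bytes / 1e3:.1f} kB")
```

So the cache can hold a bit under 5% of the index, at roughly 10 kB per record on average.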


Side consideration:
Judging solely from the observed behavior, the post-processing algorithm appears to perform something like a one-to-every comparison between indexed records, similar to:

Code:
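index="rec1 rec2 rec3"                # placeholder for the indexed records
compare() { echo "compare $1 $2"; }   # hypothetical stand-in for the real check
for record_a in $index; do
    for record_b in $index; do
        compare "$record_a" "$record_b"   # one-to-every comparison
    done
done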


If this is actually the case, then the post-process reads the whole index file over and over; since only a fraction of the index fits in the OS cache, this causes a massive number of small, random, inefficient read operations. If this hypothesis is true, would it be feasible to improve the algorithm in either of these ways:

  1. Keep reading the whole index file over and over, as it currently does, but with sequential reads rather than random ones. Rationale: reading the whole index file sequentially from head to tail is faster than reading it randomly with "partial field reads";
  2. Perform the comparison in large batches rather than record by record, replacing the one-to-every check with a many-to-every check. Consideration: a batch could be anything from 1 MB to 10 GB of indexed records, small enough to fit entirely in available RAM, so that each record read from disk is compared against many records at once.
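
Option 2 could look roughly like the sketch below. It is written in Python for brevity and is not taken from the YaCy source; `read_index`, `batches`, `batched_compare` and `is_match` are hypothetical names, and the real comparison is unknown to me, so a case-insensitive equality check stands in for it.

```python
from itertools import islice

def batches(records, size):
    """Yield successive lists of at most `size` items from `records`."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def batched_compare(read_index, batch_size, is_match):
    """Many-to-every comparison.

    `read_index` is a callable returning a fresh sequential iterator over
    the index. Each batch is held in RAM while the index is scanned once
    from head to tail, so the number of full scans is
    ceil(N / batch_size) instead of N.
    Returns (matches, number_of_full_scans).
    """
    matches = []
    scans = 0
    for batch in batches(enumerate(read_index()), batch_size):
        scans += 1
        for j, record in enumerate(read_index()):  # one sequential pass
            for i, held in batch:                  # comparisons done in RAM
                if i < j and is_match(held, record):
                    matches.append((held, record))
    return matches, scans

# Toy index standing in for the real records.
index = ["a", "b", "A", "c", "B"]
same = lambda x, y: x.lower() == y.lower()

one_to_every = batched_compare(lambda: iter(index), 1, same)
many_to_every = batched_compare(lambda: iter(index), 2, same)
print(one_to_every[1], "scans vs", many_to_every[1], "scans")
```

Both calls find the same pairs; only the number of passes over the index changes, and each pass is sequential as in option 1.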
davide
 
Posts: 78
Joined: Fri Feb 15, 2013 8:03 am

Re: Postprocess: nine days and counting

Post by Orbiter » Fri Oct 16, 2015 1:34 pm

Postprocessing has been deactivated by default for at least a year. It can be useful for generating SEO data, but it is not needed for standard YaCy operation.
If postprocessing is still active on your peer, go to /IndexSchema_p.html and deactivate the field process_sxt.
Orbiter
 
Posts: 5784
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: Postprocess: nine days and counting

Post by davide » Fri Oct 16, 2015 3:41 pm

Very good then.
I disabled process_sxt in the schema and reindexed.
Without post-processing, will YaCy still be able to filter duplicates out of the search results?
davide
 
Posts: 78
Joined: Fri Feb 15, 2013 8:03 am

Re: Postprocess: nine days and counting

Post by davide » Sun Oct 25, 2015 6:11 pm

Without post-processing, will YaCy still be able to filter duplicates out of the search results?
davide
 
Posts: 78
Joined: Fri Feb 15, 2013 8:03 am

Re: Postprocess: nine days and counting

Post by Orbiter » Wed Feb 24, 2016 11:07 am

Raw duplicates are identified by their URL hash. That is the default behaviour, and no postprocessing is needed for it.
'Special duplicates', such as the same URL with/without http(s) and/or with/without a leading 'www.', are handled by the postprocessing in such a way that one of these variants is set to be visible. Without the postprocessing, whichever variant appears first is preferred and the 'similar' URL gets a worse ranking.
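
To illustrate what 'special duplicates' means here, a minimal Python sketch follows. It is not YaCy's code: `variant_key` and `pick_visible` are hypothetical helpers, and choosing the first-seen variant as the visible one is only an assumption made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def variant_key(url):
    """Reduce a URL to a key that ignores the scheme and a leading 'www.'."""
    parts = urlsplit(url)
    host = parts.hostname or ""       # hostname is already lower-cased
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + parts.path
    if parts.query:
        key += "?" + parts.query
    return key

def pick_visible(urls):
    """Group URL variants; map the first-seen variant to its hidden siblings."""
    groups = defaultdict(list)
    for url in urls:
        groups[variant_key(url)].append(url)
    return {variants[0]: variants[1:] for variants in groups.values()}

urls = [
    "http://www.example.org/a",
    "https://example.org/a",
    "http://example.org/b",
]
visible = pick_visible(urls)
for shown, hidden in visible.items():
    print(shown, "hides", hidden)
```

The sketch ignores ports and trailing-slash differences; whether the real postprocessing treats those as variants is not stated in this thread.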
Orbiter
 
Posts: 5784
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

