It perpetually flips from a short phase of CPU burst to a much longer phase of hard disk burst: from all cores at 100% to a full queue of small hard disk read ops.
The index size is moderate with 24M records over 250GB. OS files cache is 12GB. JVM heap 4GB.
May I do something to speed it up?
Standing solely from the observed behavior, it appears that the post-processing algorithms perform something like a one-to-every comparison between indexed records, in a way similar to:
- Code: Alles auswählen
for record_a in index; do
for record_b in index; do
If this is actually the case, then the post-process reads over and over the whole index file; since only a fraction of the index fits in the OS cache, this causes a massive amount of small, random, inefficient read ops. Given this hypothesis true, would it be feasible to improve the algorithm by either:
- Keep it read the whole index file over and over, as it currently does, but performing sequential reads rather than random ones. Rationale: reading the whole index file sequentially from head to tail is faster than reading it randomly with "partial field reads";
- Perform the comparison in large batches rather than with individual records, so avoid the one-to-every check and instead perform a many-to-every check. Consideration: a large batch might consist of anything between 1 MB to 10 GB of indexed records, enough to entirely fit them within available RAM, so to perform multiple comparisons at a time against each record read from disk.