Multi-threaded access to SOLR index and RAID1 load balancing

Post by davide » Fri Sep 25, 2015 8:06 pm

Note: this is an intentional double post of viewtopic.php?f=23&t=5683 . That thread took an inappropriate turn, and it is best to close it.

We can expect that a considerable share of the major high-end YaCy nodes with large indexes store their laboriously crawled data on some sort of redundant RAID, to avoid losing months of crawling to data corruption.

On my own node, I have a medium-sized index of 21M records taking up 220 GB of storage, mirrored on a two-disk software RAID1 managed by the Linux md driver.
The md RAID1 driver can split concurrent read requests across its component devices, increasing read speed almost proportionally to the number of devices.
To benefit from this, however, md needs the requests to come from different threads. When they do, the IOPS across the mirror can reach appreciable values even on mechanical disks, perhaps high enough for YaCy to return responsive local results with near-real-time latency.
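
To check whether the mirror actually spreads reads that come from several threads, a small load generator can help. The following is only a minimal sketch, not anything YaCy or md ships: the function names, block size and thread count are my own assumptions. It issues random block reads from several threads using `os.pread`. Note that repeated reads may be served from the page cache, so for meaningful numbers the test file should be much larger than RAM (or caches should be dropped first).

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor


def random_reads(path, reads, block_size=4096, seed=0):
    """Read `reads` random blocks from `path`; return the number of bytes read."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    last_block = max(size // block_size - 1, 0)
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(reads):
            offset = rng.randint(0, last_block) * block_size
            # os.pread takes an explicit offset, so no file position is shared.
            total += len(os.pread(fd, block_size, offset))
    finally:
        os.close(fd)
    return total


def concurrent_random_reads(path, threads=4, reads_per_thread=1000, block_size=4096):
    """Run random_reads in `threads` worker threads; return total bytes read."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(random_reads, path, reads_per_thread, block_size, seed)
            for seed in range(threads)
        ]
        return sum(f.result() for f in futures)
```

Running it once with `threads=1` and once with a higher thread count against a large file on the mirror, while watching the per-disk read counters, should show whether the second disk starts taking a comparable share of the reads.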

However, running a YaCy search query on the local index does not appear to distribute the load across the RAID devices: one of the two disks receives 10 times more read requests than the other, as reported by `atop`. From this, it appears that YaCy (SOLR) performs most of the intensive index reads from a single thread and does not use the full hardware potential, which could be several times higher on large RAID setups.
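
For anyone who wants to reproduce the imbalance without `atop`, the per-device read counters can also be taken from `/proc/diskstats` before and after a query. This is a rough sketch under my own assumptions: the helper names are made up, and `sda`/`sdb` merely stand in for whatever the mirror members are called on a given system. It relies on the fourth field of each line being the number of reads completed.

```python
def parse_diskstats(text):
    """Map device name -> reads completed, from /proc/diskstats content."""
    reads = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        # Fields: major, minor, device name, reads completed, ...
        reads[fields[2]] = int(fields[3])
    return reads


def read_snapshot(path="/proc/diskstats"):
    """Take a snapshot of the read counters of all block devices."""
    with open(path) as f:
        return parse_diskstats(f.read())


def read_deltas(before, after, devices):
    """Reads completed per device between two snapshots."""
    return {dev: after[dev] - before[dev] for dev in devices}
```

Usage would be along the lines of `before = read_snapshot()`, then running the search, then `read_deltas(before, read_snapshot(), ["sda", "sdb"])`; strongly unequal values correspond to the imbalance described above.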

If this is correct, how could the issue be worked around?
davide
Posts: 84
Joined: Fri Feb 15, 2013 8:03 am

Re: Multi-threaded access to SOLR index and RAID1 load balan

Post by Orbiter » Sat Sep 26, 2015 11:25 pm

davide wrote: However, running a YaCy search query on the local index does not appear to distribute the load across the RAID devices; one of the two disks receives 10 times more read requests than the other, as reported by `atop`. For this, it appears that YaCy (SOLR) performs most of the intensive index reads from a single thread, and doesn't take advantage of the full hardware potential, which could be multiple times higher on large RAID setups.

I don't know whether RAID1 does load balancing, and Wikipedia says: "Actual read throughput of most RAID 1 implementations is slower than the fastest drive."
Your question also has a second component, "it appears that YaCy (SOLR) performs most of the intensive index reads from a single thread": this is a correct observation. The write operations in the YaCy-integrated Solr are single-threaded on purpose.
Reason: Solr provides concurrent multi-instance queries with several threads through its sharding option, called a "Solr cloud". This requires that you set up several Solr servers and configure them to operate as a Solr cloud. You can then assign this cloud to YaCy as an external Solr; to YaCy, the Solr cloud appears as a single instance. As the operator of the Solr cloud, you can place the servers' database files on different disks. This sounds like a complex solution, but it is also the most appropriate one, because it is a bad idea to do concurrent write/read operations on a single disk.
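
To illustrate what a query fanned out over several Solr servers can look like, here is a small sketch that builds a request URL using Solr's `shards` parameter for distributed search. The host names, port and core name are hypothetical, and this is not how YaCy itself talks to an external Solr; a properly configured Solr cloud normally routes queries across its shards on its own, so listing shards explicitly is only one way to do it.

```python
from urllib.parse import urlencode


def build_sharded_query(base_url, shards, query, rows=10):
    """Build a Solr select URL that fans the query out to the given shards."""
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        # Comma-separated list of host:port/path entries, one per shard.
        "shards": ",".join(shards),
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical hosts and core name, for illustration only.
url = build_sharded_query(
    "http://solr1.example:8983/solr/collection1",
    ["solr1.example:8983/solr/collection1", "solr2.example:8983/solr/collection1"],
    "text_t:yacy",
)
```

Each shard listed would keep its index files on its own disk, which is the point of the setup described above.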
Orbiter
Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

