eros » Mon Jul 24, 2017 12:53 pm

Hi there,

I've been running YaCy for a couple of months now and the index has grown pretty big (about 20 million documents). I don't want to delete the documents, but the index size is getting out of hand in terms of disk space: the "SEGMENTS" directory takes up more than 400 GB.

In my early tests with YaCy, I seem to recall that when I exported the index and then re-imported it into another YaCy installation (on a different machine), the index took up considerably less disk space. Is that possible?

I was toying with the idea of:

- exporting the index to an XML file
- emptying the "SEGMENTS" directory
- re-importing the index

So here are my questions:

a) Do you think I'll get the result I expect (i.e. reduce the amount of disk space taken up by the index)?
b) Is there a better way of doing this?

sixcooler » Mon Jul 24, 2017 8:03 pm


Following your idea will result in lower disk usage, but you will also lose a lot of additional information.

YaCy uses multiple indexes, so think about which ones you actually need.
At Index Administration -> Index Source & Targets (/IndexFederated_p.html) you can switch them on and off.

The Solr Search Index is the core, and YaCy isn't useful without it, so it can't be switched off.

There are two kinds of Web Structure Index: the citation index and the webgraph index. They record the references between pages and sites. Without them YaCy only loses some of its index quality. I don't use them in the freeworld network either.
If you uncheck the citation index, you can remove the SEGMENTS/default/citation.index* files once you have shut down YaCy.
If you uncheck the webgraph index, you can remove the SEGMENTS/solr_6_6/webgraph directory once you have shut down YaCy.
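
Before deleting anything, it may help to see how much space those files actually take. Here is a minimal dry-run sketch in Python that only lists the candidates named above with their sizes and deletes nothing. The function names and the SEGMENTS location in the example are my own assumptions, so adjust the path to your installation.

```python
from pathlib import Path


def removable_paths(segments_dir, citation_off=False, webgraph_off=False):
    """List paths that could be deleted for the indexes that are switched off.

    This is a dry run: nothing is deleted here.
    """
    segments = Path(segments_dir)
    candidates = []
    if citation_off:
        # Files of the citation index, as named in the post above.
        candidates += sorted((segments / "default").glob("citation.index*"))
    if webgraph_off:
        # Directory of the webgraph index, as named in the post above.
        webgraph = segments / "solr_6_6" / "webgraph"
        if webgraph.exists():
            candidates.append(webgraph)
    return candidates


def size_in_bytes(path):
    """Size of a file, or the total size of all files below a directory."""
    path = Path(path)
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Assumed location, not taken from the post; change it to your own.
    segments_dir = "DATA/INDEX/freeworld/SEGMENTS"
    for path in removable_paths(segments_dir, citation_off=True, webgraph_off=True):
        print(f"{size_in_bytes(path) / 1024**3:8.2f} GiB  {path}")
```

Run it only for the indexes you have really unchecked, and only delete the listed paths while YaCy is shut down.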

The Reverse Word Index is used to distribute your index to other peers in the network. That's why I don't recommend unchecking it, but if you do, you can remove the SEGMENTS/default/text.index* files once you have shut down YaCy.
There is, however, a way to reduce the amount of data used by the Reverse Word Index.
You can limit the number of references per word (hash) under System Administration -> Advanced Properties: set the key index.maxReferences to, for example, 10000. This will remove the oldest references per word during the merge of the SEGMENTS/default/citation.index* files.
If you have a system that can handle huge files (64-bit JRE), you can change the key filesize.max.other or filesize.max.win (depending on your OS) to, for example, 21474836470 (about 20 GB).
The bigger the file size, the less space is wasted.
(These changes require a restart of YaCy.)
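
If you want a different file size limit, the value is given in bytes. A small Python sketch of the arithmetic follows; the helper name gib_to_bytes is just something I made up for illustration.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def gib_to_bytes(gib):
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * GIB)


suggested = 21474836470          # example value quoted above
exact_20_gib = gib_to_bytes(20)  # 21474836480

print(exact_20_gib - suggested)       # 10, so the example is 10 bytes under 20 GiB
print(suggested == 10 * (2**31 - 1))  # True
```

So the example value is ten times the largest signed 32-bit integer, which lands just under 20 GiB.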

The Solr Search Index can be optimized into larger files too, with the same effect of wasting less space. At Index Administration -> Optimize Solr (/IndexControlURLs_p.html) you can merge the index into a few (larger) files without losing anything.

Please be careful while following these steps! (Backup, etc.)
And keep in mind that larger files really do save space, but there needs to be room to write them while the old ones have not yet been deleted.
You need at least as much free space as the next largest file will take.
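
A rough way to check this beforehand is sketched below in Python. It compares the free space on the volume with the largest file currently in the index directory. That is only a lower bound, since a merged file can be larger than any existing one, and the function names are my own.

```python
import shutil
from pathlib import Path


def largest_file_size(directory):
    """Size in bytes of the largest regular file below `directory` (0 if none)."""
    sizes = [p.stat().st_size for p in Path(directory).rglob("*") if p.is_file()]
    return max(sizes, default=0)


def enough_room_for_merge(directory, free_bytes=None):
    """True if the free space exceeds the largest existing file.

    `free_bytes` can be passed in for testing; by default the free space of
    the volume holding `directory` is used.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(directory).free
    return free_bytes > largest_file_size(directory)
```

Call enough_room_for_merge() with the path of your SEGMENTS directory. If it returns False, free up space before raising the file size limit or optimizing Solr.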

Cu, sixcooler.

smokingwheels » Thu Aug 03, 2017 12:46 pm

Going on the figures you supplied, your peer is using 20 GB per million docs.
When you crawl some sites, a lot of noise gets picked up, e.g. extra domains. I am unsure why that is.

My peer is below 4 GB per million docs. I think this is due to the long blacklist I have, which has over 80,000 entries now. However, it slows crawling down to about one fifth of the speed.
If anyone wants a copy I can put it on a cloud server to download.
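
For anyone who wants to compare their own peer, the per-million figures above are simple arithmetic. A small Python sketch follows; the function name is mine, and the 80 GB result is only a projection that assumes the 4 GB ratio would carry over to a 20 million document index.

```python
def gb_per_million_docs(index_size_gb, doc_count):
    """Disk usage in GB per one million indexed documents."""
    return index_size_gb / (doc_count / 1_000_000)


# Figures from the first post: about 400 GB for about 20 million documents.
print(gb_per_million_docs(400, 20_000_000))  # 20.0

# Projection only: the same 20 million documents at 4 GB per million.
print(4 * 20_000_000 / 1_000_000)  # 80.0
```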
