Hi there,

I've been running YaCy for a couple of months now and the index has become pretty big (about 20 million documents). I don't want to delete the documents, but the index size is getting out of hand in terms of disk space: the "SEGMENTS" directory takes up more than 400 GB.

In my early tests with YaCy, I seem to recall that when I exported the index and then re-imported it into another YaCy installation (on a different machine), the index took up considerably less disk space. Is that possible?

I was toying with the idea of:

- exporting the index to an XML file
- emptying the "SEGMENTS" directory
- re-importing the index

So here are my questions:

a) do you think I'll get the result I expect (i.e. reduce the amount of disk space taken up by the index)?
b) is there a better way of doing this?

Following your idea will result in lower disk usage, but you will also lose a lot of additional information.

YaCy uses multiple indexes, so think about which ones you need.
At Index Administration -> Index Source & Targets (/IndexFederated_p.html) you can switch them on and off.

The Solr Search Index is the core, and YaCy won't be useful without it, so it can't be switched off.

There are two kinds of Web Structure Index. They store the references between pages and sites. Without them YaCy will only lose some of its index quality. I also don't use them in the Freeworld.
If you uncheck the citation Index, you can remove the SEGMENTS/default/citation.index* files once you shut down YaCy.
If you uncheck the webgraph index you can remove the SEGMENTS/solr_6_6/webgraph directory once you shut down YaCy.

The Reverse Word Index is used to distribute your index to other peers in the network. That's why I don't recommend unchecking it, but if you do, you can remove the SEGMENTS/default/text.index* files once you shut down YaCy.
But there is a way to reduce the data used by the Reverse Word Index.
You may want to limit the number of references per word (hash): go to System Administration -> Advanced Properties and set the key index.maxReferences to, e.g., 10000. This will remove the oldest references per word during the merge of the SEGMENTS/default/text.index* files.
If you have a system that is able to handle huge files (64-bit JRE), you can change the key filesize.max.other / filesize.max.win (depending on your OS) to, e.g., 21474836470 (about 20 GB).
The bigger the file size, the less space is wasted.
(These changes need a restart of YaCy.)
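
Before switching off any of the indexes above, it can help to see how much disk space each one actually uses. The following is a minimal Python sketch, not part of YaCy: the function names and labels are made up for illustration, the path patterns are the ones named in this post, and the location of your SEGMENTS directory is something you have to supply yourself.

```python
import glob
import os


def path_size(path):
    """Return the size in bytes of a file, or of all files below a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def index_sizes(segments_dir, patterns):
    """Map each label to the bytes used by everything matching its glob pattern."""
    sizes = {}
    for label, pattern in patterns.items():
        matches = glob.glob(os.path.join(segments_dir, pattern))
        sizes[label] = sum(path_size(match) for match in matches)
    return sizes


# Patterns relative to SEGMENTS, as named in this post (labels are illustrative).
YACY_PATTERNS = {
    "citation index": "default/citation.index*",
    "webgraph index": "solr_6_6/webgraph",
    "reverse word index": "default/text.index*",
}


def report(segments_dir):
    """Print one line per index with its size in GiB."""
    for label, size in index_sizes(segments_dir, YACY_PATTERNS).items():
        print(f"{label}: {size / 1024**3:.2f} GiB")
```

Calling `report("/path/to/SEGMENTS")` (with your real path) prints one line per index. A pattern that matches nothing is reported as 0, so a missing directory does not raise an error.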

The Solr Search Index can be optimized into larger files too, with the same result of wasting less space. At Index Administration -> Optimize Solr (/IndexControlURLs_p.html) you can merge the index into a few (larger) files without losing anything.

Please be careful while following these steps! (Backup, etc.)
And keep in mind that larger files truly do save space, but there needs to be room to write them while the old ones have not yet been deleted.
You need at least enough free space for the next largest file.
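
A quick way to check that headroom before merging is sketched below in Python. It is only a rough estimate under an assumption of mine: that a merged file is at most as large as the files it replaces, so the combined size of the largest files in a directory is used as the space needed. The function names are hypothetical, and how YaCy picks files to merge is not modelled.

```python
import os
import shutil


def file_sizes(directory):
    """Return (size_in_bytes, path) for every file below directory, largest first."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                found.append((os.path.getsize(full), full))
    return sorted(found, reverse=True)


def merge_headroom(directory, n_files=2):
    """Compare free disk space with the combined size of the n largest files.

    Assumption: merging those files needs at most their combined size
    as temporary extra space, because the old files stay until the merge is done.
    Returns (needed_bytes, free_bytes, enough_room).
    """
    needed = sum(size for size, _path in file_sizes(directory)[:n_files])
    free = shutil.disk_usage(directory).free
    return needed, free, free >= needed
```

For example, `merge_headroom("/path/to/SEGMENTS/default")` returns the bytes needed, the bytes free on that volume, and whether the merge would fit under the assumption above.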

Cu, sixcooler.

Going on the figures you supplied, your peer is using 20 GB per million docs.
When you crawl some sites, a lot of noise is picked up, e.g. extra domains; I am unsure why that is.

My peer is below 4 GB per million docs. I think this is due to the long blacklist I have (it's over 80,000 entries now); however, it slows the crawling speed down to about 1/5th.
If anyone wants a copy I can put it on a cloud server to download.
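
For reference, the per-million figures above are plain division. A small Python sketch, with function names made up for illustration and numbers taken from this thread:

```python
def gb_per_million(total_gb, documents):
    """Disk usage in GB per one million indexed documents."""
    return total_gb / (documents / 1_000_000)


def projected_gb(rate_gb_per_million, documents):
    """Disk usage in GB for a document count at a given GB-per-million rate."""
    return rate_gb_per_million * documents / 1_000_000


# 400 GB for 20 million documents, as in the first post.
print(gb_per_million(400, 20_000_000))   # 20.0

# The same 20 million documents at 4 GB per million.
print(projected_gb(4, 20_000_000))       # 80.0
```

Since my peer is below 4 GB per million, 80 GB is an upper bound for that comparison, and it assumes the two peers index similar kinds of pages.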
