We have two Robinson-type cloud servers that must be replaced.
Saving the harvested data from each crawler is CRITICAL.
The data on each server is different, because the crawlers had different tasks.
- Functionally, they are not working properly and cannot easily be repaired (many repairs have been attempted, without success).
Additionally, simply copying the DATA folder sets is not useful: they contain errors that persist when the DATA set is inserted into a fresh crawler installation.
Each of them is out of space for crawling, and they cannot be expanded: the available cloud space is fixed in size.
They must be permanently shut down very soon, because the hosting agreement is ending.
We have already rescued, from each server, the lists of what to crawl and the crawl frequencies (thanks!).
There is no space on either server to make Solr backups of the mountains (many GB) of harvested data.
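Since there is no room to write a backup file locally, one workaround is to stream the archive straight to the new cloud space over ssh, so nothing is written on the dying server. This is only a sketch under assumptions: the remote host, user, and paths are placeholders, and the crawler should be stopped first so the files are consistent. The real command is shown in the comment; below it, the same pipeline is simulated locally with a dummy DATA folder so it can run without a remote host:

```shell
# Real-world form (host and paths are placeholders; stop the crawler first):
#
#   tar -C /path/to/crawler -czf - DATA | ssh user@new-server 'cat > /backup/crawler1-DATA.tar.gz'
#
# Local simulation: build a small dummy DATA folder, stream the archive
# through a pipe (standing in for ssh), then verify it restores identically.
rm -rf demo
mkdir -p demo/DATA/INDEX
echo "sample harvested document" > demo/DATA/INDEX/doc.txt

# Stream the tar output through a pipe instead of writing it in place.
tar -C demo -czf - DATA | cat > demo/backup.tar.gz

# Restore into a separate location and compare with the original.
mkdir -p demo/restore
tar -xzf demo/backup.tar.gz -C demo/restore
diff -r demo/DATA demo/restore/DATA && echo "restore OK"
```

Streaming this way needs only network bandwidth, not free disk, which matters on servers that are completely full.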
Due to many problems, it is not realistic to recrawl everything: there is too much data, and much of it is time-specific.
The installations are generic Ubuntu, with the crawler installed in each VM.
The crawlers use both internal Solr cores (including the one for webgraph edges) and do not write to any external Solr databases.
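Because copying the damaged DATA folders carries the errors along with them, an alternative worth considering is to export the documents out of each embedded Solr core over HTTP and re-index them into a fresh installation. A minimal sketch using Solr's cursorMark deep paging follows; the host, port 8090, and core name `collection1` are assumptions about the installation, and the sort field must be the core's unique key:

```shell
# Build the first page request for a full-core export via cursorMark
# deep paging (host, port, and core name are assumptions -- adjust them).
BASE="http://localhost:8090/solr/collection1/select"
CURSOR='*'     # '*' starts the export
ROWS=1000
# cursorMark paging requires a stable sort on the unique key field.
URL="${BASE}?q=*:*&sort=id+asc&rows=${ROWS}&wt=json&cursorMark=${CURSOR}"
echo "$URL"
# In practice, loop:  curl -s "$URL" > page.json
# then read nextCursorMark from page.json, substitute it for CURSOR,
# and repeat until nextCursorMark stops changing.
```

Exporting over HTTP reads only intact index records, so it may sidestep some of the on-disk corruption, though badly broken segments can still fail and would need repair first.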
We need to recover the folders with the harvested data. This is essential.
1. Which folders do we need to save, and
2. Where do we move them to?
We can make new generic crawlers in another cloud space.
Our goal is to add this harvested data to a private P2P/DHT environment.
However, at this time, all the new servers are Robinson servers that read each other.