
transition Robinson to P2P & adding a new server safely

Post by xioc752 » Mon Oct 06, 2014 6:31 pm

Hi, we currently have two servers, one in each of two Ubuntu clouds.
Each writes only to its own index; the two indexes are separate.
However, each can read the other, so one of the engines can present combined results in a single window.

We want to safely migrate them to a shared storage environment, optimizing storage use across the 2 servers.
Network configuration: yacy.network.allip.unit

In /ConfigNetwork_p.html, both peers are set to:
Robinson Mode
Public Peer
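
For the private-cluster question below (Question 1), the relevant switch appears to live in yacy.conf rather than in the web UI alone. A minimal sketch of pointing a peer at a private network definition, under stated assumptions: that yacy.conf sits at DATA/SETTINGS/yacy.conf, that network.unit.definition selects the unit file, and that ourcluster.unit is a hypothetical copy of yacy.network.allip.unit with its network name changed. Verify both key and paths against your YaCy version, and stop YaCy before editing.

[code]
# Hedged sketch: point this peer at a private network definition so it
# only clusters with your own servers. All paths and the config key are
# assumptions to verify against your YaCy version; stop YaCy first.
CONF = "DATA/SETTINGS/yacy.conf"        # assumption: default config location
KEY = "network.unit.definition"         # assumption: selects the unit file
VALUE = "defaults/ourcluster.unit"      # hypothetical private unit file

lines = open(CONF).read().splitlines()
replaced = False
for i, line in enumerate(lines):
    if line.split("=", 1)[0].strip() == KEY:
        lines[i] = f"{KEY}={VALUE}"     # overwrite the existing setting
        replaced = True
if not replaced:
    lines.append(f"{KEY}={VALUE}")      # or append it if it was missing
open(CONF, "w").write("\n".join(lines) + "\n")
[/code]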

In /IndexFederated_p.html:

Lazy Value Initialization
Use deep-embedded local Solr
Use remote Solr server(s) (format …)

Solr Hosts:
Solr Host Administration Interface      Index Size
http://xxx.xxx.x.xx:8090/solr/admin/    194358

Solr URL(s)
NOT write-enabled (if unchecked, the remote server(s) will only be used as search peers)
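
As a sanity check of the Index Size shown above, the remote index can be counted over Solr's standard query interface. A minimal sketch, assuming the remote core answers stock Solr queries and is named collection1 (YaCy's usual default, but an assumption here); the masked host mirrors the post:

[code]
# Count documents in the remote Solr index and compare with the Index
# Size on /IndexFederated_p.html. 'collection1' is an assumed core name.
import requests

SOLR = "http://xxx.xxx.x.xx:8090/solr/collection1/select"  # masked host as above

r = requests.get(SOLR, params={"q": "*:*", "rows": 0, "wt": "json"})
print("documents in remote index:", r.json()["response"]["numFound"])
[/code]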


It is possible that some documents are duplicated in both clouds.

We now have over 44 million web edges, and the two servers report their results together.

Question 1
How do we safely transition to peer-to-peer mode as a separate search cluster connected only to our own servers?
The key requirement is not to lose the large databases on our two servers.
Question 2
Will this find and eliminate any duplicates?
Question 3
Will moving to this P2P mode result in smaller or larger storage needs?
Question 4
Will this load balance the storage of data across the two servers?
Question 5
We are running out of storage space.
When we soon add another (third) server, what is the safe procedure, please?
Is there any leveling effect or data shifting when the new server is added, or will the third server remain largely empty in the beginning?
Our ultimate goal is to distribute search results evenly across all servers.
Many thanks.

Notes:
Remote Crawler Configuration /RemoteCrawl_p.html
"Your peer cannot accept remote crawls because you need senior or principal peer status for that!"
How do we fix that?
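
For context: senior status roughly means that other peers can open a TCP connection to your YaCy port from the outside; peers stuck behind NAT or a firewall stay junior. A minimal reachability probe, to be run from the other server (or any outside host); the host and port below are placeholders taken from the masked Solr URL above:

[code]
# Probe whether a peer's YaCy port is reachable from outside - the usual
# cause of the 'senior or principal peer status' message is that it is
# not. Run this from the other server; host and port are placeholders.
import socket

PEER = ("xxx.xxx.x.xx", 8090)   # masked host and port as in the post above

try:
    socket.create_connection(PEER, timeout=5).close()
    print("port reachable - the peer should be able to reach senior status")
except OSError as exc:
    print("port NOT reachable - check NAT/firewall port forwarding:", exc)
[/code]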
We want all our peers to be essentially 'local' to each other: no 'out of center', 'far away' circles with different DBs.
Thank you
xioc752
 
Posts: 68
Registered: Mon Jul 28, 2014 5:01 pm

Re: transition Robinson to P2P & adding a new server safely

Post by Orbiter » Tue Oct 07, 2014 8:36 am

If
xioc752 wrote: "migrate them to a shared storage environment, optimizing storage use across the 2 servers."
is the only reason, then I would not recommend connecting the peers with a different network setup. Instead, just move both DATA folders to the new shared storage space (e.g. renamed to DATA1 and DATA2) and replace the DATA folder on each server with a symbolic link to its shared storage location.
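
A minimal sketch of that move for one server, assuming /mnt/shared is the shared storage mount point and /opt/yacy the install directory (both hypothetical paths); YaCy must be stopped before the move:

[code]
# Move this server's DATA folder to shared storage and leave a symlink
# behind, as described above. Use DATA2 on the second server. The paths
# are assumptions - adjust them to your mounts. Stop YaCy first.
import os
import shutil

SHARED = "/mnt/shared"            # assumption: shared storage mount point
YACY_HOME = "/opt/yacy"           # assumption: YaCy install directory

src = os.path.join(YACY_HOME, "DATA")
dst = os.path.join(SHARED, "DATA1")   # renamed so the two servers don't collide

shutil.move(src, dst)             # move the index data to shared storage
os.symlink(dst, src)              # DATA now points at the shared copy
[/code]

After restarting, each peer keeps writing only to its own renamed folder, so nothing about the Robinson setup itself changes.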
Orbiter
 
Posts: 5769
Registered: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: transition Robinson to P2P & adding a new server safely

Post by xioc752 » Wed Oct 15, 2014 8:59 am

Duplicates appearing...
We have discovered that, at least in one data "provider" category, we are encountering multiple/duplicate entries for the same individual source pages.
Remember that we have two servers in Robinson configuration, cross-reading each other but not writing to each other.

We typically use specially constructed RSS feeds that crawl a target site based on an initial 'sample' list and harvest what is there.
What is important for us is to go back in history and collect as many 'archival' documents as possible from the target site.
Typically we seek to collect up to 1,000 document entries at a time.

How do we ensure we can collect earlier documents beyond the initial 'up to 1,000'?
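
One reasoning step worth making explicit: a feed only ever contains the entries its publisher currently includes, so anything older than the feed window is simply absent from the XML; reaching earlier documents needs either archive/paged feed URLs (site-dependent) or a regular crawl of the site. A minimal sketch of what one harvesting pass actually sees, with a hypothetical feed URL:

[code]
# List the entries currently visible in a feed, capped at 1000. Entries
# older than the feed window are not in the XML at all, which is why a
# feed alone cannot reach further back in history.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

FEED = "http://example.com/feed.rss"   # hypothetical feed URL

tree = ET.parse(urlopen(FEED))
links = [item.findtext("link") for item in tree.iter("item")][:1000]
print(len(links), "entries visible in this pass")
[/code]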

As we crawl daily, and in some cases hourly, it appears that the restriction "do not accept duplicates of the same page" is not being respected.
I do not know why this is so. Perhaps there is a conflicting configuration error? Admittedly, with two stand-alone servers, duplicates in pairs are to be expected; however, in some cases there are more than two copies.

The duplicates we find have different session numbers in the URL, but ultimately they are the same target document.
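
Since the copies differ only in session tokens, a canonical form of each URL makes them comparable before (or instead of) re-crawling. A hedged sketch; the parameter names below are common session keys and are assumptions to extend with whatever the target sites actually use:

[code]
# Strip session tokens from a URL so duplicates become comparable.
# SESSION_KEYS is an assumed list - extend it for your target sites.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_KEYS = {"sid", "sessionid", "session_id", "jsessionid", "phpsessid"}

def canonical(url: str) -> str:
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_KEYS]
    path = parts.path.split(";")[0]   # drop ';jsessionid=...' path suffixes
    return urlunsplit((parts.scheme, parts.netloc, path,
                       urlencode(query), ""))

assert canonical("http://example.com/doc;jsessionid=ABC?sid=1&page=2") == \
       "http://example.com/doc?page=2"
[/code]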

How do we remove duplicates in an automated manner?
Now that we have 51 million web edges, it is no longer feasible to find and clean them by hand.
Ideally, we want to keep the first copy of each collected document.
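
One automated route is to ask the Solr back end directly for URLs stored more than once and delete all but the oldest copy. A hedged sketch, not YaCy's own cleanup mechanism: it assumes the core is reachable as above, that the URL lives in the sku field, the fetch time in load_date_dt, and that id is the unique key. Verify the field names against your schema and test on a small collection first:

[code]
# Find URLs stored more than once and delete all but the oldest copy.
# Field names (sku, load_date_dt, id) and the core name are assumptions
# to check against your Solr schema; test on a small collection first.
import requests

SOLR = "http://xxx.xxx.x.xx:8090/solr/collection1"  # assumed core, masked host

# Facet on the URL field; any value with count >= 2 is a duplicate set.
resp = requests.get(SOLR + "/select", params={
    "q": "*:*", "rows": 0, "wt": "json",
    "facet": "true", "facet.field": "sku",
    "facet.mincount": 2, "facet.limit": 1000,
})
pairs = resp.json()["facet_counts"]["facet_fields"]["sku"]
dup_urls = pairs[0::2]  # Solr returns a flat [value, count, value, count, ...] list

for url in dup_urls:
    # Fetch all copies, oldest first, and keep only the first one.
    # (URLs containing quotes would need escaping in the query string.)
    docs = requests.get(SOLR + "/select", params={
        "q": 'sku:"%s"' % url, "fl": "id,load_date_dt",
        "sort": "load_date_dt asc", "rows": 100, "wt": "json",
    }).json()["response"]["docs"]
    for doc in docs[1:]:
        requests.post(SOLR + "/update", json={"delete": {"id": doc["id"]}},
                      params={"commit": "false"})

requests.get(SOLR + "/update", params={"commit": "true"})
[/code]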
Thank you!

As a recurring question: will shifting from a collection of stand-alone Robinson servers to a closed P2P environment consisting only of our own servers 'clean' this up?
What are the downsides, please? Thank you.
xioc752
 
Posts: 68
Registered: Mon Jul 28, 2014 5:01 pm

