Configuring to only process single (or limited) domains

Post by cypherpunks » Tue Sep 25, 2012 11:06 pm

Hi :)

We have an internal top-level domain with many thousands of servers.
Let's call it "g3n." (the trailing dot denotes the DNS root), so to be clear,
this is a top-level domain.

The servers might be:
a.g3n, b.g3n, c.g3n, w2x9y5z.g3n, etcetera.

Due to design and partitioning, the full list of servers is considered to be
both unknown and unlimited.

These servers may link, in their HTML, plain-text documents, and files
served via web/FTP, to the usual public internet sites (Google, Yahoo, etc.).

How do I configure YaCy to crawl and index ONLY the entire g3n domain?

How do I configure YaCy to crawl, index, and share subsets of the g3n
domain amongst YaCy peers? For example:

yacy1 gets to do 0.g3n through g.g3n, ignoring h.g3n through zzzz.g3n as it encounters them.
yacy2 gets to do h.g3n through zzzz.g3n, ignoring 0.g3n through g.g3n as it encounters them.

The server names do follow a regular expression, let's say: '^[0-9a-z]{10}\.tld$'
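
The split described above could be expressed as one hostname pattern per crawler. The following is only a sketch under stated assumptions: the peer names `yacy1` and `yacy2` come from the example, but partitioning by the first character of the hostname, the `PARTITIONS` table, and the `owner_of` helper are all hypothetical and are not YaCy settings.

```python
import re
from urllib.parse import urlsplit

# Hypothetical partition table: each crawler owns the hosts whose name
# starts with a character in its range (digits sort before letters).
PARTITIONS = {
    "yacy1": re.compile(r"[0-9a-g][0-9a-z]*\.g3n"),
    "yacy2": re.compile(r"[h-z][0-9a-z]*\.g3n"),
}

def owner_of(url):
    """Return the peer that should crawl this URL, or None if out of scope."""
    # hostname is lowercased by urlsplit; strip an explicit DNS root dot.
    host = (urlsplit(url).hostname or "").rstrip(".")
    for peer, pattern in PARTITIONS.items():
        if pattern.fullmatch(host):
            return peer
    return None

if __name__ == "__main__":
    for url in ("http://a.g3n/", "ftp://zzzz.g3n/file.txt", "http://www.google.com/"):
        print(url, "->", owner_of(url))
```

A crawler would skip any URL whose owner is a different peer or `None`, which also keeps links to public sites out of the crawl.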

We also have another domain, call it k5a.

How do I configure a limited list of only those domains (g3n, k5a, etc.) we're interested in?
And how would the work-dividing subset configuration fit within that?

The second level domain (2ld.tld) is usually the server. But in some cases, the third
or further levels are the servers (server.3ld.2ld.tld). How does that affect things?
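
One way to think about the domain list and the varying server depth is a single host pattern that accepts any number of labels in front of an allowed top-level domain. This is a minimal sketch, not a YaCy configuration: `ALLOWED_TLDS`, `build_host_pattern`, and `in_scope` are hypothetical names, and the label character set is an assumption based on the example hostnames.

```python
import re

# Assumed list of internal top-level domains taken from the question.
ALLOWED_TLDS = ["g3n", "k5a"]

def build_host_pattern(tlds):
    """Build a pattern matching hosts at any depth under the given TLDs."""
    labels = r"(?:[0-9a-z-]+\.)+"  # one or more labels, each followed by a dot
    alternatives = "|".join(re.escape(t) for t in tlds)
    # An optional trailing dot allows the fully qualified form "a.g3n."
    return re.compile(rf"{labels}(?:{alternatives})\.?", re.IGNORECASE)

HOST_PATTERN = build_host_pattern(ALLOWED_TLDS)

def in_scope(host):
    """True if the host sits under one of the allowed TLDs."""
    return HOST_PATTERN.fullmatch(host) is not None

if __name__ == "__main__":
    for host in ("a.g3n", "server.3ld.2ld.k5a", "www.google.com", "g3n.example.com"):
        print(host, in_scope(host))
```

Because the match is anchored on the final label, it does not matter whether the server is the second-level name or sits several levels deeper; a host such as g3n.example.com is still rejected. A bare TLD with no server label (just "g3n") is also rejected by this sketch.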

Thank you.

(I will try to use google translate on replies.)
cypherpunks
 
Posts: 2
Joined: Tue Sep 25, 2012 11:02 pm

Re: Configuring to only process single (or limited) domains

Post by Lotus » Wed Sep 26, 2012 5:27 pm

cypherpunks wrote: How do I configure YaCy to crawl and index ONLY the entire g3n domain?

You can use the yacy.network.***.unit files. The description can be found in <YACY>/defaults/yacy.network.readme

> How do I configure YaCy to crawl and index and share subsets of the g3n
> domain amongst YaCy peers?

This is possible only by crawling each specific subset on one server and not using index distribution (DHT).

As far as I know, a similar configuration is used on the Sciencenet (http://sciencenet.kit.edu/).

> (I will try to use google translate on replies.)

Please do not. It's awful ;)
Lotus
 
Posts: 1699
Joined: Wed Jun 27, 2007 3:33 pm
Location: Hamburg

Re: Configuring to only process single (or limited) domains

Post by cypherpunks » Thu Oct 11, 2012 6:26 am

> yacy.network.*

I'm looking through these...

> crawling the specific subset on one server only

We don't always know the hostnames, so the best we could do is supply a regex
to divide the crawling. We feel the YaCy server[s] can hold and index the crawl data,
but we want to split the crawling job across crawl servers for speed, etc. We would actually
want each user-facing search server to carry the results from all the subset crawls,
not just the regex crawl done by that particular server. Not sure if your 'not using
index distribution (DHT)' would prohibit that? That is, there would need to be some sort of
index distribution into the user-facing search engines, or awareness amongst them about
which engine holds a result if the front engine doesn't have it.
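
To make the "awareness amongst the engines" idea concrete, here is a toy sketch of a front end that asks every partition for a term and merges the answers. It says nothing about how YaCy actually distributes or queries its index: the `Index` type, `search_all`, and the two in-memory indexes are hypothetical stand-ins for real crawl servers.

```python
from typing import Dict, List

# Toy stand-in for a real index: search term -> list of matching URLs.
Index = Dict[str, List[str]]

def search_all(indexes: List[Index], term: str) -> List[str]:
    """Query every partition index and merge the hits, dropping duplicates."""
    seen = set()
    merged = []
    for index in indexes:
        for url in index.get(term, []):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical per-crawler indexes, one per hostname subset.
yacy1_index: Index = {"report": ["http://a.g3n/report.html"]}
yacy2_index: Index = {
    "report": ["http://zzzz.g3n/report.txt"],
    "status": ["http://h.g3n/status"],
}

if __name__ == "__main__":
    print(search_all([yacy1_index, yacy2_index], "report"))
```

The alternative raised in the post, copying every subset's index into each front end, trades this query-time fan-out for extra storage and transfer on every search server.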

Still getting used to the YaCy model, so my terms are probably way off :(

> sciencenet

I looked around sciencenet and didn't see their config posted?
cypherpunks
 
Posts: 2
Joined: Tue Sep 25, 2012 11:02 pm

