Questions on schema field: crawldepth_i

Post by davide » Tue May 19, 2015 10:05 am

I see the Solr schema has a field named crawldepth_i.

Two questions arise:

  1. How is the depth measured? Since the crawler may find the same document referenced from many different paths, the depth at which the document is located is relative. Is the lowest known depth assigned to crawldepth_i?
  2. I'm crawling a CMS where all the "significant" documents lie at the same depth, with the exception of a minority of "insignificant" node pages such as home pages or indexes. May I remove the crawldepth_i field from my index without compromising anything? Will unchecking the box in /IndexSchema_p.html take effect immediately?
davide
 
Posts: 78
Registered: Fri Feb 15, 2013 8:03 am

Re: Questions on schema field: crawldepth_i

Post by Orbiter » Tue May 19, 2015 1:35 pm

Re 1)
crawldepth_i is the smallest possible number, i.e. the depth at which the document was first recognized.

Re 2)
I believe this will cause an error, but I have never tried removing the field. Maybe it works.
Orbiter
 
Posts: 5769
Registered: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: Questions on schema field: crawldepth_i

Post by davide » Tue May 19, 2015 4:06 pm

Thanks for answering.
As I understand it, the first time the crawler encounters a document doesn't necessarily correspond to the lowest depth at which that document may be found.
For example, if the crawler is instructed to start its scan from multiple starting URLs, then as it descends from those URLs it may find the same document via several "crawling paths", potentially at different depths. In other words, the crawler may encounter an already-indexed document again, but at a lower depth.

In this case, which depth is used: the one where the document was first encountered, or the lowest known one?

Re: Questions on schema field: crawldepth_i

Post by Orbiter » Tue May 19, 2015 4:09 pm

The crawler processes the lowest depth first. Each depth is completed before the next one is started. Therefore crawldepth_i is actually always the lowest possible depth.
If the crawler encounters the same URL again at a deeper level, the depth there is higher, and the URL is not crawled again (it is a duplicate then).
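
As a rough illustration of the level-by-level order described above, here is a minimal Python sketch of a generic breadth-first traversal with several start URLs. It is not YaCy's crawler code; the function name `assign_crawl_depths` and the link graph are made up for the example, and the graph is assumed to fit in a plain dictionary.

```python
from collections import deque


def assign_crawl_depths(links, start_urls):
    """Breadth-first traversal over a link graph (hypothetical helper).

    links: dict mapping a URL to the list of URLs it links to.
    Returns a dict mapping each reachable URL to the depth
    at which it was first seen.
    """
    # All start URLs sit at depth 0.
    depth = {url: 0 for url in start_urls}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target in depth:
                # Already seen at an equal or shallower level:
                # treated as a duplicate and not queued again.
                continue
            depth[target] = depth[url] + 1
            queue.append(target)
    return depth


# Made-up site: "doc" is 3 hops away from "a" but only 1 hop away from "b".
links = {
    "a": ["a1"],
    "a1": ["a2"],
    "a2": ["doc"],
    "b": ["doc"],
}
depths = assign_crawl_depths(links, ["a", "b"])
print(depths["doc"])  # 1
```

Because the queue finishes every URL of one depth before any URL of the next, the first time "doc" is seen is also the shallowest way to reach it, even though the path from "a" is longer.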

Re: Questions on schema field: crawldepth_i

Post by davide » Tue May 19, 2015 4:20 pm

Perfect, that's really clear!

