Site count and indexed sites


Post by DNcrawler » Thu Mar 02, 2017 1:56 am

Hi,

Two questions, one easy and one perhaps not so easy:

1. Is there a way to get a count of the sites/domains in the index? I notice that neither http://localhost:8090/api/status_p.html nor its .xml variant lists such a stat.

2. I seem to max out at 2,530 sites in the index. When dumping all sites from HostBrowser.html or HostBrowser.xml, I only ever see 2,530 sites/domains. I request the URL "http://localhost:8090/HostBrowser.xml?admin=true&hosts=", which should dump all sites in the index. Once I noticed this, I created 50 new virtual hosts (vhost1.example.com, vhost2.example.com, etc.), each serving a single HTML page that says "this is vhost #". These sites are crawled but do not show up in the HostBrowser.xml site dump. Am I doing something wrong, or misunderstanding what should be in the index?

I've started looking at the Solr engine directly to see whether something there can give me a feel for what's in the index.
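
For illustration, here is a minimal sketch of what such a direct Solr query could look like, using a facet over the host field to count distinct hosts. It assumes the peer exposes a Solr select endpoint at /solr/select, that hosts are stored in a field named host_s, and that JSON output (wt=json) is available; none of this is confirmed in this thread, so check it against your own instance. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the endpoint path and the field name below are not confirmed
# in this thread; adjust them to match your own peer.
DEFAULT_BASE = "http://localhost:8090/solr/select"
DEFAULT_FIELD = "host_s"


def facet_url(base=DEFAULT_BASE, field=DEFAULT_FIELD):
    """Build a Solr facet query that returns no documents, only per-host counts."""
    params = {
        "q": "*:*",           # match every document
        "rows": 0,            # no documents needed, only the facet
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,    # -1 asks Solr for all facet values
        "facet.mincount": 1,  # skip hosts with zero documents
    }
    return base + "?" + urlencode(params)


def host_counts(response, field=DEFAULT_FIELD):
    """Turn Solr's flat [name, count, name, count, ...] facet list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def fetch_host_counts(base=DEFAULT_BASE, field=DEFAULT_FIELD):
    """Query a running peer and return {host: document_count}."""
    with urlopen(facet_url(base, field)) as resp:
        return host_counts(json.load(resp), field)
```

Against a running peer, len(fetch_host_counts()) would then give the number of distinct hosts, and the dictionary keys would be the host list itself.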

Thank you for any pointers.
DNcrawler
 
Posts: 18
Joined: Wed Dec 21, 2016 1:48 am

Re: Site count and indexed sites

Post by luc » Thu Mar 02, 2017 2:57 pm

Hi DNcrawler,
Indeed, I am also not sure whether there is an API that directly returns the total count of domain names.

About the HostBrowser page/API: there is a hardcoded maximum number of items, 2520 for authenticated users and 360 for unauthenticated users.

But if you want to get the whole list of domain names in your index without querying Solr directly, the /IndexControlURLs_p.html page may help: in the "Statistics about top-domains in URL Database" section, you can explicitly enter the maximum number of domains you want.

Have a nice day
luc
 
Posts: 283
Joined: Wed Aug 26, 2015 1:04 am

Re: Site count and indexed sites

Post by DNcrawler » Fri Mar 03, 2017 7:33 am

luc wrote: Hi DNcrawler,
About the HostBrowser page/api, there is a hardcoded maximum number of items : 2520 for authenticated users, 360 for unauthenticated.

But if you want to get the whole domain names list of your index without requesting Solr directly, the /IndexControlURLs_p.html page may help you : in the "Statistics about top-domains in URL Database" section, you can explicitly fill the maximum number of domains you want.

Have a nice day


Once again, thank you, luc. I may figure out a patch to make the maxcount a config file variable of some sort. I really want the HostBrowser.xml output per domain for all domains in the index, which should be more than 2520 at this point. Thank you.
DNcrawler
 
Posts: 18
Joined: Wed Dec 21, 2016 1:48 am

Re: Site count and indexed sites

Post by DNcrawler » Mon Mar 06, 2017 8:05 am

Hi,

Thanks again. I did a quick patch to HostBrowser.java and now I'm seeing all the expected sites, beyond the 2520 limit in the code. I don't have the patch as a config file option yet, just a new hardcoded limit. It doesn't appear to have impacted performance in any way.
DNcrawler
 
Posts: 18
Joined: Wed Dec 21, 2016 1:48 am

Re: Site count and indexed sites

Post by luc » Tue Mar 07, 2017 9:24 am

Great!
For the future, I believe it would be useful to add pagination to the host browser items, using the usual parameters such as offset and page size. You could then set the offset to zero and a large or unlimited page size (when authenticated) to get all items if desired.
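
As a rough illustration of that idea (this is not YaCy code, and the function and parameter names are hypothetical), offset and page size could behave as in the following sketch, where a page size of None stands for "unlimited":

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def paginate(items: Sequence[T], offset: int = 0,
             page_size: Optional[int] = None) -> List[T]:
    """Return one page of items; page_size=None means no upper limit."""
    if offset < 0:
        raise ValueError("offset must be >= 0")
    if page_size is not None and page_size < 0:
        raise ValueError("page_size must be >= 0 or None")
    if page_size is None:
        return list(items[offset:])
    return list(items[offset:offset + page_size])


# Hypothetical host list, mirroring the 50 test vhosts mentioned earlier.
hosts = [f"vhost{i}.example.com" for i in range(1, 51)]

first_page = paginate(hosts, offset=0, page_size=20)   # hosts 1 to 20
last_page = paginate(hosts, offset=40, page_size=20)   # hosts 41 to 50
everything = paginate(hosts)                           # offset 0, unlimited
```

A client that wants everything either passes no page size or keeps requesting pages, advancing the offset by the page size until a page comes back shorter than requested.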
luc
 
Posts: 283
Joined: Wed Aug 26, 2015 1:04 am

