Exporting the list of sites and related urls

Post by DNcrawler » Sun Jan 08, 2017 8:19 pm

Hi,

I'm trying to debug why my two-step process takes 2-3 days to complete. I'm not sure I'm approaching it correctly. Feedback is appreciated.

A daily cronjob runs to export the list of sites in the index. I scripted this command "http://localhost:8090/HostBrowser.xml?admin=true&hosts=" which works great. In about 30 seconds, I receive a nice XML file listing every site and a count of links per site. I then feed each site into the following command to get the details of each site: "http://localhost:8090/api/webstructure.xml?about=(insert site here)". I receive an XML file which has the details of the site and which other sites link in to the site and out from the site. I use this XML data to make pretty pictures of the structure of the intranet I'm crawling.
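
The two calls described above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual script: the helper names (`host_list_url`, `webstructure_url`, `fetch`, `timed`) are hypothetical, and only the two URLs come from the post. Timing each call is an added assumption, meant to show which hosts account for the slow runs.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8090"  # peer address as quoted in the post


def host_list_url(base=BASE):
    # Step 1: same query string as in the post (admin=true, empty hosts=).
    return f"{base}/HostBrowser.xml?" + urlencode({"admin": "true", "hosts": ""})


def webstructure_url(host, base=BASE):
    # Step 2: one request per host; urlencode escapes the value if needed.
    return f"{base}/api/webstructure.xml?" + urlencode({"about": host})


def fetch(url, timeout=300):
    # Plain HTTP GET that returns the raw XML bytes.
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def timed(fetch_fn, url):
    # Return the response body together with the elapsed wall-clock seconds.
    start = time.perf_counter()
    body = fetch_fn(url)
    return body, time.perf_counter() - start
```

A loop over the host names would call `timed(fetch, webstructure_url(host))` and log the elapsed seconds per host. Parsing the host names out of HostBrowser.xml is left out on purpose, because the thread does not show that file's layout.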

The HostBrowser.xml output is about 3000 sites. Therefore, I have to call webstructure about 3000 times (once per site). As previously mentioned, HostBrowser.xml takes about 30 seconds to complete. Looping through about 3000 sites with webstructure takes 2-3 days.
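
To put those figures in perspective, here is a quick back-of-the-envelope calculation of the average time per webstructure call. It assumes the calls run strictly one after another and that the loop itself adds no meaningful overhead; the function name is hypothetical.

```python
SITES = 3000                      # approximate host count from the post
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_call(total_days, calls=SITES):
    # Average wall-clock time of one call when calls run sequentially.
    return total_days * SECONDS_PER_DAY / calls


low = seconds_per_call(2)
high = seconds_per_call(3)
print(f"{low:.1f} to {high:.1f} seconds per webstructure call")
# prints "57.6 to 86.4 seconds per webstructure call"
```

Roughly a minute per host is far more than the 30 seconds the whole HostBrowser.xml export takes, which suggests the per-host call is where the time goes.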

Could I be doing something better to get the links in/out of each site? Could I query Solr directly for this data? If this is the best way, then for about 3000 sites and 20 million documents with 751 million edges, does running for 2-3 days sound about right?

I'm gathering system stats to see whether the server is disk bound, RAM bound, or limited by something else. I'm running a dual-CPU, quad-core FreeBSD system with 32 GB of RAM and 5 TB of disk in a mirrored array. The OpenJDK 7 JVM has 16 GB assigned to YaCy.

Thank you for any feedback or pointers.

Re: Exporting the list of sites and related urls

Post by luc » Mon Jan 09, 2017 7:18 pm

Hi DNcrawler,
A first hint that might help: when you set the "about" parameter, the webstructure.xml API does not rely on Solr; instead, it loads and parses the document at the specified URL. 3000 sites is not that many, but if they are slow to respond, it can take some time...
Doesn't the HostBrowser.xml API with the "path" parameter (which relies on Solr) give you the information you need?

Re: Exporting the list of sites and related urls

Post by DNcrawler » Mon Jan 09, 2017 8:44 pm

Thank you luc. I did try to use hostbrowser.xml first, but I couldn't figure out how to get the actual links (beyond the host and domain name) in the xml, like webstructure.xml exports.

webstructure.xml will export example.com with inbound links from example.net and example.ru.

hostbrowser.xml will show no inbound links.

I also notice I can only find inbound links, not outbound links.

example.com contains hrefs to example.net and example.ru. I can verify this by browsing the site or by running wget -m --spider and reading the list of links. Maybe this is a bug, or I'm misunderstanding something in YaCy.

I can find them because example.ru and example.net link back to example.com.

Re: Exporting the list of sites and related urls

Post by luc » Tue Jan 10, 2017 7:40 pm

Obviously these APIs lack documentation. I will try to clarify their usage and update the related Javadoc and wiki entries after checking everything works as expected.

Re: Exporting the list of sites and related urls

Post by DNcrawler » Tue Jan 10, 2017 8:28 pm

Thanks.

Originally, I read the docs and found http://yacy.net/en/API.html, which suggests webstructure.xml under the "Retrieval of the web page link structure" section. This wiki page seems fairly well documented as well: http://www.yacy-websuche.de/wiki/index.php/Dev:API

Re: Exporting the list of sites and related urls

Post by luc » Tue Jan 10, 2017 9:41 pm

Yes indeed, I also missed the comments in the webstructure.xml stream itself, which are rather detailed. But I was thinking of the wiki page http://www.yacy-websuche.de/wiki/index.php/Dev:APIwebstructure and the Java class Javadoc, which would deserve a little update.

Re: Exporting the list of sites and related urls

Post by luc » Thu Jan 12, 2017 6:17 pm

Hey DNcrawler, for now I have updated the webstructure Javadoc to reflect the current usage and implementation as closely as possible.

In my own tests it appears to work as expected, even though the "latest" parameter can be a little confusing at first.

Regarding your remark:
"I also notice I can only find inbound links, not outbound links."
Could it be that you crawled example.net and example.ru but not example.com (because of the crawl depth setting, for example)?
In that case it is normal behavior for the API to report only the inbound links from example.net and example.ru to example.com, because YaCy doesn't know the links going out from example.com until you crawl it...
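
The reasoning above can be illustrated with a toy model. It is only a sketch of the idea that a link becomes known once its source host has been crawled. It is not YaCy's actual data structure, and all names in it are hypothetical.

```python
def known_edges(all_links, crawled_hosts):
    # A link is only discovered when the page that contains it is crawled.
    return {(src, dst) for src, dst in all_links if src in crawled_hosts}


def inbound(edges, host):
    return sorted(src for src, dst in edges if dst == host)


def outbound(edges, host):
    return sorted(dst for src, dst in edges if src == host)


# Links that really exist on the intranet (hosts taken from the thread).
links = {
    ("example.com", "example.net"),
    ("example.com", "example.ru"),
    ("example.net", "example.com"),
    ("example.ru", "example.com"),
}

# Assumption: example.com itself was never crawled.
edges = known_edges(links, {"example.net", "example.ru"})
print(inbound(edges, "example.com"))   # ['example.net', 'example.ru']
print(outbound(edges, "example.com"))  # []
```

Adding example.com to the crawled set makes its two outbound links appear, which matches the behavior described above.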

Regarding performance, I have some ideas to explore, but it will take some time to measure and test.

Re: Exporting the list of sites and related urls

Post by DNcrawler » Thu Jan 12, 2017 9:59 pm

Wow, thank you so much for the updates.

I'm spelunking through the logs and Solr to figure out what has been crawled and what hasn't.

Thank you for the quick responses; they are appreciated.

Re: Exporting the list of sites and related urls

Post by luc » Mon Jan 30, 2017 9:20 am

Hello, in the end there were some points to fix in the webstructure.xml API and the HostBrowser.html page.

Improvements and possibly further fixes are still possible, but I have already pushed some modifications to GitHub, notably related to https/http. I also added an optional parameter to the webstructure.xml API that controls whether the document at the 'about' URL is reloaded and parsed. This option may help with your performance issue, as I did not find any worthwhile optimizations to the core of the webstructure.xml algorithm that would not break its compressed in-memory data structure.

Best regards
luc
 

Re: Exporting the list of sites and related urls

Post by DNcrawler » Mon Jan 30, 2017 4:04 pm

Thank you luc.

