About Yacy

Discussion in English.
Forum rules
You can start and continue posts in English in all other forums as well, but if you are looking for a forum to start a discussion in English, this is the right choice.

About Yacy

Post by bubul » Tue May 02, 2017 6:16 pm

I've been using YaCy for many months and have tried a lot of things. Here is what I think about it:

- Speed
YaCy is a good idea, but speed is a big problem. Even with a lot of websites to crawl, I often get less than 200 PPM (pages per minute); sometimes I get 2000 or more, and I don't know why!
And while it is crawling, it can never be used for search: when I search from my local YaCy during a crawl, it never finds anything from my own index. However, I have enabled the option to send links to other peers, and I have seen some of my crawled websites in search results from others (like http://search.yacy.net/).
It needs an option to put the search module on a different disk than the crawler.
The only way to run YaCy is to launch it from a secondary disk; it uses the disk too much if it's on the main drive.

- RSS
It's good that YaCy can crawl RSS feeds; I use that often.

- Sitemap
It doesn't seem to work very well. Maybe there is a problem with big sitemaps like the one below, and websites from other sitemaps are not listed in my index either:
https://www.pinterest.com/v2_sitemaps/w ... itemap.xml

And why is there no option, like for RSS, to crawl a sitemap every x amount of time?

- Crawling
When adding a new website to crawl, it sometimes takes more than one hour for the advanced crawler to load the website and show the crawler monitor page!

- Compression
I've compressed the YaCy folder on Windows and it shows 35.3 GB (compressed size) for 49.1 GB of real data. I think it's a good idea if you don't have a lot of disk space.
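
As a quick check of those figures, here is the arithmetic in a few lines of Python. It only uses the two sizes quoted above; the variable names are mine and nothing here is specific to YaCy.

```python
# Sizes quoted in the post, in GB (written "Go" in the original).
real_size_gb = 49.1
compressed_size_gb = 35.3

# Fraction of the original size that is still used after compression.
ratio = compressed_size_gb / real_size_gb
# Absolute and relative space recovered.
saved_gb = real_size_gb - compressed_size_gb
saved_pct = 100 * (1 - ratio)

print(f"compressed/original = {ratio:.2f}")
print(f"saved {saved_gb:.1f} GB ({saved_pct:.1f} %)")
```

So compression recovers roughly 14 GB here, a bit more than a quarter of the folder.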
bubul
 
Posts: 23
Joined: Mon Oct 24, 2016 11:57 am

Re: About Yacy

Post by luc » Wed May 03, 2017 7:21 am

Hi bubul,
"When adding a new website to crawl, it sometimes takes more than one hour for the advanced crawler to load the website and show the crawler monitor page"

Do you also have some examples to provide for this case? They could be valuable when digging for future performance improvements.

"And why is there no option, like for RSS, to crawl a sitemap every x amount of time?"

Do you mean a Schedule option directly in the Advanced Crawler page (/CrawlStartExpert.html)? There is already the generic Process Scheduler page (/Table_API_p.html) to schedule crawls from websites or sitemaps. There is also the "Scheduler and Profile editor" page (CrawlProfileEditor_p.html) dedicated to crawl scheduling.
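
To make the idea of re-crawling a sitemap every x amount of time concrete, here is a minimal Python sketch. It is independent of YaCy's scheduler pages and does not call any YaCy API: the function names `sitemap_urls` and `recrawl_due`, the sample sitemap, and the example.org URLs are all made up for illustration. It only shows reading the `<loc>` entries of a sitemap and checking whether a chosen interval has elapsed since the last crawl.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values of a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def recrawl_due(last_crawl, interval, now):
    """True when at least `interval` has passed since `last_crawl`."""
    return now - last_crawl >= interval


# Hypothetical sitemap content, inlined so the sketch runs without network access.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

if recrawl_due(datetime(2017, 5, 1), timedelta(days=1), datetime(2017, 5, 3)):
    for url in sitemap_urls(example):
        print("would re-crawl", url)
```

In a real setup the interval check would be left to the scheduler, and only the sitemap URL would be handed over to it.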

One last question: did you find a way to solve your search performance issue mentioned earlier (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5928)?
luc
 
Posts: 265
Joined: Wed Aug 26, 2015 1:04 am

Re: About Yacy

Post by bubul » Thu May 04, 2017 4:29 pm

luc wrote:
Hi bubul,
"When adding a new website to crawl, it sometimes takes more than one hour for the advanced crawler to load the website and show the crawler monitor page"

Do you also have some examples to provide for this case? They could be valuable when digging for future performance improvements.

"And why is there no option, like for RSS, to crawl a sitemap every x amount of time?"

Do you mean a Schedule option directly in the Advanced Crawler page (/CrawlStartExpert.html)? There is already the generic Process Scheduler page (/Table_API_p.html) to schedule crawls from websites or sitemaps. There is also the "Scheduler and Profile editor" page (CrawlProfileEditor_p.html) dedicated to crawl scheduling.

One last question: did you find a way to solve your search performance issue mentioned earlier (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5928)?


About crawling: in fact, it happens with all the websites I add.

About sitemaps: the problem is that they are not listed in the scheduled processes when I use the sitemap option while adding a new crawl. There is only "sitemap loader for: xxxx", and it is not listed under "recorded action".

And no, the only way I've found is to frequently send my URLs to other peers so that the URLs become available on the network!

YaCy is a very good idea, but it needs to be reprogrammed, maybe in C/C++/assembly and with another database, maybe MySQL or another one. It needs some testing.


I've also looked at the pages loaded in the crawler monitor, and it often crawls a lot of pages from the same websites: no more than 3 or 4 different websites for a long time (even though I have added many more websites to crawl, with the option to crawl linked websites too). So I think there is a problem in how the URLs to crawl are selected, because with only a few websites it can have a lot of URLs crawled at the same time. Recently I saw 2000 PPM and more, and I don't really know why: it was the same website being crawled each time, and not a website I had added but one discovered while crawling the web! There is a problem with how URLs are loaded and selected!
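
To illustrate the kind of URL selection that would spread a crawl over more websites, here is a small Python sketch of a round-robin queue keyed by host. This is not how YaCy's crawler is implemented; the class `HostBalancedFrontier` and the example URLs are hypothetical, and the sketch ignores politeness delays, crawl depth, and duplicate detection.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class HostBalancedFrontier:
    """Hypothetical crawl queue that hands out one URL per host in turn."""

    def __init__(self):
        # host -> pending URLs for that host, in insertion order of hosts
        self._queues = OrderedDict()

    def add(self, url):
        host = urlsplit(url).hostname or ""
        self._queues.setdefault(host, deque()).append(url)

    def next_url(self):
        """Pop one URL from the host at the front, then move that host to the back."""
        if not self._queues:
            return None
        host, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        del self._queues[host]
        if queue:
            # The host still has work left: re-insert it at the end of the rotation.
            self._queues[host] = queue
        return url


frontier = HostBalancedFrontier()
for u in ["http://a.example/1", "http://a.example/2", "http://a.example/3",
          "http://b.example/1", "http://c.example/1"]:
    frontier.add(u)

order = []
while (u := frontier.next_url()) is not None:
    order.append(u)
print(order)
```

With this rotation, a host with thousands of discovered URLs cannot monopolize the crawl as long as other hosts still have pending URLs.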

Re: About Yacy

Post by luc » Sat May 06, 2017 1:12 pm

Yes, YaCy needs testing and feedback, and you are providing some here, which is a good thing.

Given the current size of the project, I personally prefer to spend the little time I have improving the existing code base rather than restarting from scratch with a completely different technology stack...

By the way, if in the coming weeks I manage to find and publish some improvements on the points you mention, you will find my report here...

Re: About Yacy

Post by bubul » Wed May 17, 2017 12:03 pm

Maybe a good option would be to use Visual C#. I don't know enough programming to write a program like YaCy, but it is simpler than C++. I think Java is not good for YaCy because Java is not made for building big applications like YaCy.

Re: About Yacy

Post by luc » Wed May 17, 2017 12:19 pm

Mmmh, I understand everyone has their own programming language preferences, but I think you are misinformed about Java and big applications. Just one example: Apache Hadoop is written in Java and is used by many large companies for high-performance operations. Just check their "Powered By" page: Amazon is the first entry of a rather long list...

