yacy up to date for RSS feeds


Post by bebop » Wed Nov 11, 2009 10:27 pm

Sorry for writing in English, but the English forum is not very active.

I have been using YaCy for several months with quite some success:
some servers on the global network
some servers as a dedicated search engine

I have a question about using RSS feeds to keep YaCy up to date.
On my own search engine, I do not want YaCy to crawl new sites by itself; I would prefer that YaCy fetch my chosen websites through RSS feeds.

How does YaCy handle RSS feeds?
Is there a way to subscribe to feeds and have YaCy crawl them without delay as soon as each article is published?
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by Orbiter » Thu Nov 12, 2009 2:54 pm

You are right, YaCy should be able to read and parse RSS feeds, so that it would be possible to give RSS feeds as crawl start points and get the links from the RSS into the index. I will see what I can do.
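
To illustrate the idea of taking the links from a feed as crawl targets, here is a minimal sketch in Python. It is not YaCy code: the sample feed, the example.org URLs, and the function name `extract_item_links` are made up for illustration, and it only handles plain RSS 2.0.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document, used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <item>
      <title>First article</title>
      <link>http://example.org/articles/1</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.org/articles/2</link>
    </item>
  </channel>
</rss>"""


def extract_item_links(rss_text: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


if __name__ == "__main__":
    # Each printed URL would be a candidate page to fetch and index.
    for url in extract_item_links(SAMPLE_RSS):
        print(url)
```

The channel-level `<link>` is skipped on purpose, because only the article pages are of interest as crawl targets.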
Orbiter
 
Posts: 5793
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: yacy up to date for RSS feeds

Post by bebop » Fri Nov 13, 2009 11:04 am

I think this would be a really cool feature for keeping YaCy up to date for chosen websites:

When I manually add a website to YaCy (on the global network or on my own dedicated search engine),
I want to retrieve information from that website very precisely.

On the global network: YaCy spends 80% of its work on outside jobs that are not directly linked to my needs (and this is cool for the community).
On a dedicated web search: I do not particularly want YaCy to crawl every outgoing link from the first start point (and YaCy is mostly sleeping).
But in both cases it is hard to keep the information from my chosen websites up to date.

Otherwise, in fact, I still have to use my Google ... :(

Thank you if someone from the community could look into automatic parsing of RSS feeds (I have no programming skills myself).

From an end-user point of view, the killer feature would be the ability to add an RSS feed as both a crawl start point and a bookmark,
so that all followed feeds are listed under the RSS-type bookmarks.

This would make YaCy very useful for end users even if not all of the web is crawled: at the very least an RSS reader with a very strong way to share data with other people.

Because at the moment I very much like seeing my YaCy peers working for the community, but it is not so useful for me,
as I still have to use my Google search and reader :( ...

And search and reader could be closely linked and powerful applications inside a global YaCy network :)
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by Orbiter » Fri Nov 13, 2009 12:16 pm

Actually, YaCy already has a built-in RSS reader, which also works correctly if it is given RSS input. Theoretically it should be possible to use an RSS feed as a crawl start point. I tried that and found that it does not work, for the following reasons:
1) robots.txt often denies access to these pages
2) the server does not return a file name for the RSS page that ends with 'rss', or does not deliver a proper MIME type, so the YaCy parser does not select the RSS parser for such resources

Furthermore, it would be necessary to exclude the RSS content itself from the index and include only the pages that are linked from the RSS.
Problem 1) may once again trigger a discussion that we already had about crawl start points: is it correct to ignore robots.txt for crawl start points?
Point 2) would require a new crawl start property that forces the parser to select the RSS parser for the start point.
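
The parser-selection problem in 2) and the proposed crawl start property can be sketched roughly as follows. This is an illustration in Python, not YaCy's implementation: `select_parser`, the `force_rss` flag, and the MIME type table are hypothetical names chosen for this example.

```python
from typing import Optional

# Hypothetical table for illustration; not YaCy's real parser registry.
RSS_MIME_TYPES = {"application/rss+xml"}


def select_parser(url: str, content_type: Optional[str], force_rss: bool = False) -> str:
    """Pick a parser name for a fetched resource.

    force_rss models the proposed crawl start property: it overrides
    whatever the file name and Content-Type header suggest.
    """
    if force_rss:
        return "rss"
    # Strip header parameters such as "; charset=windows-1252".
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    # Ignore the query string when looking at the file name.
    path = url.split("?", 1)[0].lower()
    if mime in RSS_MIME_TYPES or path.endswith("rss"):
        return "rss"
    return "generic"


if __name__ == "__main__":
    # A feed served under a generic XML type is not recognised...
    print(select_parser("http://example.org/feed/?f=2", "text/xml; charset=windows-1252"))
    # ...unless the start point explicitly forces the RSS parser.
    print(select_parser("http://example.org/feed/?f=2", "text/xml", force_rss=True))
```

The override is deliberately checked first, so a user who knows the start point is a feed does not depend on the server sending a helpful file name or header.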

@all: what do you think about case 1)?
Orbiter
 
Posts: 5793
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: yacy up to date for RSS feeds

Post by bebop » Fri Nov 13, 2009 12:29 pm

It is cool if you think my needs are interesting for the community :)

thx
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by Lotus » Fri Nov 13, 2009 5:18 pm

Orbiter wrote: 1): is it correct to ignore robots.txt for crawl start points?
@all: what do you think about case 1)?

I think this is a matter of design. Personally I think it is OK to load such a page as a start point, but its contents should not be indexed; only its links should be followed (checking robots.txt again for each of them).
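
The policy described here (load the start page, do not index it, follow its links only where robots.txt allows) could look like this in Python. It is a sketch under simplifying assumptions: the robots.txt text, the URLs, the agent name `examplebot`, and the function `plan_crawl` are invented for illustration, and all links are assumed to live on one host so that a single robots.txt applies.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /feed/
Disallow: /private/
"""


def plan_crawl(start_url, linked_urls, robots_txt, agent="examplebot"):
    """Split URLs into 'load only' (the start point) and 'index' (allowed links)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    # The start point is loaded to read its links but is never indexed itself.
    # Every linked URL is checked against robots.txt before it is indexed.
    to_index = [u for u in linked_urls if rules.can_fetch(agent, u)]
    return {"load_only": [start_url], "index": to_index}


if __name__ == "__main__":
    plan = plan_crawl(
        "http://example.org/feed/rss.xml",
        ["http://example.org/articles/1", "http://example.org/private/draft"],
        ROBOTS_TXT,
    )
    print(plan)
```

A real crawler would fetch one robots.txt per host; the single rule set here only keeps the example short.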
Lotus
 
Posts: 1699
Joined: Wed Jun 27, 2007 3:33 pm
Location: Hamburg

Re: yacy up to date for RSS feeds

Post by bebop » Sat Nov 21, 2009 7:57 pm

Is there anything I can do now to get this working?

Or should I wait for another version of YaCy?
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by bebop » Fri Mar 05, 2010 10:12 am

Is it possible to find a way to force YaCy to include the updates from RSS feeds?
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by Orbiter » Fri Mar 05, 2010 12:18 pm

I see that this is important, and the problems described above still exist.
Can you give one example of a URL for an RSS feed that you would like to index?
Orbiter
 
Posts: 5793
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: yacy up to date for RSS feeds

Post by dulcedo » Fri Mar 05, 2010 1:48 pm

Orbiter wrote: 1) many times robots.txt deny access to these pages
...
This may trigger once again a discussion that we already had about crawl start points for problem 1): is it correct to ignore robots.txt for crawl start points?

No.
But from my point of view it is correct to analyse its contents for links to other domains and to generate a new (own) document containing those links.
dulcedo
 
Posts: 1006
Joined: Thu Oct 16, 2008 6:36 pm
Location: near Karlsruhe

Re: yacy up to date for RSS feeds

Post by Quix0r » Wed Mar 10, 2010 10:07 pm

@bebop: In the meantime, you can use sitemaps as crawl start points, which also give YaCy up-to-date URLs to fetch and parse.
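
As an illustration of why a sitemap can serve as a source of up-to-date URLs, the sketch below reads the `<loc>` and `<lastmod>` entries defined by the sitemaps.org format and keeps the recently changed ones. It is plain Python, not YaCy code; the sample sitemap and the function name `urls_modified_since` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap, used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/a</loc>
    <lastmod>2010-03-01</lastmod>
  </url>
  <url>
    <loc>http://example.org/b</loc>
    <lastmod>2010-03-09</lastmod>
  </url>
  <url>
    <loc>http://example.org/c</loc>
  </url>
</urlset>"""


def urls_modified_since(sitemap_text, since):
    """Return <loc> values whose <lastmod> date is on or after `since` (YYYY-MM-DD)."""
    root = ET.fromstring(sitemap_text)
    fresh = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # Entries without <lastmod> are kept, since their age is unknown.
        # ISO dates compare correctly as strings; [:10] drops any time part.
        if loc and (lastmod is None or lastmod.strip()[:10] >= since):
            fresh.append(loc.strip())
    return fresh


if __name__ == "__main__":
    for url in urls_modified_since(SAMPLE_SITEMAP, "2010-03-05"):
        print(url)
```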
Quix0r
 
Posts: 1345
Joined: Tue Jul 31, 2007 9:22 am
Location: Krefeld

Re: yacy up to date for RSS feeds

Post by bebop » Sun May 02, 2010 2:05 pm

I am back to working with YaCy.

I will start working with sitemaps, which seem to be the right way to start.


But for easier handling and for the development of the YaCy community, RSS integration could be a great feature.
Having a P2P RSS reader that is also a P2P search engine would be great ;)


One example of an RSS feed to follow:
http://feeds.feedburner.com/failblog
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by bebop » Sun May 02, 2010 6:03 pm

In fact, I cannot get crawling to work either from a sitemap or from an RSS feed.

Here are example addresses:

http://velorizontal.bbfr.net/sitemap.xml

http://velorizontal.bbfr.net/feed/?f=2
(I accepted dynamic URLs)


Error for the feed: cannot load: load error - REJECTED WRONG MIME TYPE, mime = text/xml; charset=windows-1252: no parser found
Error for the sitemap: no error is given; it simply includes the page in the index without crawling the pages listed in it
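
The feed above was served as `text/xml`, a type that does not say by itself whether the document is a feed. One possible workaround, sketched here in Python and not taken from YaCy, is to look at the root element of the document instead of the MIME type. The function name `looks_like_feed` and the sample inputs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def looks_like_feed(body: bytes) -> bool:
    """Guess from the root element whether an XML document is an RSS or Atom feed."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Drop a namespace prefix such as "{http://www.w3.org/2005/Atom}feed".
    tag = root.tag.rsplit("}", 1)[-1].lower()
    # "rss" is the RSS 2.0 root element, "feed" the Atom root element.
    return tag in {"rss", "feed"}


if __name__ == "__main__":
    print(looks_like_feed(b'<rss version="2.0"><channel/></rss>'))
    print(looks_like_feed(b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'))
```

The same check also tells a sitemap (`urlset` root) apart from a feed, which matters when both arrive with the same generic XML type.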
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by bebop » Fri May 07, 2010 12:43 pm

I run into problems if I use any crawl start option other than "from URL":

feed
sitemap
file


I cannot get any of them to work.

I tried on all the peers I would like to use (5 peers),
and no crawl starts in any of those cases.

Maybe I am doing it wrong; could you help me with the right procedure?
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am

Re: yacy up to date for RSS feeds

Post by Orbiter » Fri Aug 27, 2010 11:04 am

I'm happy to announce that YaCy now has a wonderful RSS reader and crawler:

- YaCy now recognizes all RSS feed links and stores them
- the stored feed links can be seen in /Load_RSS_p.html (click on Index Creation -> Indexing of RSS feeds)
- you can also manually insert an RSS feed address in /Load_RSS_p.html
- feeds can be placed into a loader and the loader can be used in the new Crawling scheduler
- scheduled RSS loadings are also shown in /Load_RSS_p.html together with statistics about how many feed items have been loaded and stored
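
The general idea behind a scheduled feed loader with per-run statistics can be sketched as below. This is only a conceptual illustration in Python and says nothing about how YaCy implements it: the class `FeedLoader`, its fields, and the example links are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FeedLoader:
    """Hypothetical loader that remembers which item links it has already stored."""
    feed_url: str
    seen: set = field(default_factory=set)
    runs: int = 0

    def load(self, item_links):
        """Store links not seen before and return how many were new in this run."""
        self.runs += 1
        # dict.fromkeys removes duplicates inside one run while keeping order.
        new_links = [u for u in dict.fromkeys(item_links) if u not in self.seen]
        self.seen.update(new_links)
        return len(new_links)


if __name__ == "__main__":
    loader = FeedLoader("http://example.org/feed")
    # Two scheduled runs; the second run sees one item again and one new item.
    print(loader.load(["http://example.org/1", "http://example.org/2"]))
    print(loader.load(["http://example.org/2", "http://example.org/3"]))
    print(loader.runs, len(loader.seen))
```

Remembering the links already stored is what lets a scheduler call the loader repeatedly without loading the same article twice.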

this is contained in SVN 7076 (and I am still working on more)
Orbiter
 
Posts: 5793
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: yacy up to date for RSS feeds

Post by bebop » Thu Dec 30, 2010 12:04 am

I came back to YaCy these last few days.


It seems that the new version gives pretty cool results for RSS and sitemaps.
The domain list restriction, as used for sciencenet, is also a pretty cool feature.

I think I have almost everything I wanted to get started.

Congratulations on all the work on YaCy!!
bebop
 
Posts: 20
Joined: Wed Apr 15, 2009 6:02 am
