Issue with indexing very active websites

Here YaCy users can find help when something doesn't work, or works differently than expected. For obvious errors, please report them directly in the bug tracker (http://bugs.yacy.net).
Forum rules
This forum is for usage problems and requests for help. If a bug is identified in the process, the thread is moved to the bug section for handling. So if you posted a thread here and can no longer find it, you will most likely find it in the bug section.

Issue with indexing very active websites

Post by bebop » Thu Jul 15, 2010 10:48 am

I want to use YaCy to follow some active to very active websites.

I mean websites that are updated several times a day: between 5 and 50 times a day.

I am not focused on crawling the whole web, just on keeping an up-to-date search index for a group of active webpages.

I do have some peers trying to do this job, but it is quite inefficient, as the index is almost never updated.
(I have trouble using standard RSS and sitemap information.)

And crawling links from the starting page is not very practical for frequent re-indexing.
bebop
 
Posts: 20
Registered: Wed Apr 15, 2009 6:02 am

Re: Issue with indexing very active websites

Post by joan » Fri Jul 16, 2010 7:37 pm

On a similar topic, I'm interested in knowing more about how content updating works. Maybe that will help us understand how to improve it for very active websites.

When a given page is crawled, it is parsed, split into words, and the words end up in the global index.
When the same page is crawled again later, it may have completely different content, so to keep the index consistent there should be a mechanism to remove all the word entries that were indexed the first time.

Is there such a mechanism, and how does it work (roughly)?

Is there a parallel index for each crawled page, keeping track of the hashes of the words that were parsed the last time we crawled it?
I can't think of another way to keep the list of words that will potentially need to be removed from the index, since we are not going to scan the whole index just to find the few words that hold an obsolete reference to this specific page.
Each time we re-crawl a page, that list would have to be sent out to the network, so that other peers that also stored its words can remove the corresponding entries, and maybe add the new ones.
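
A minimal sketch of that per-page bookkeeping idea (made-up names, purely to illustrate the question; Lotus's reply further down describes what YaCy actually does):
CODE: SELECT ALL
// Illustration only (hypothetical names, not YaCy data structures):
// remember, per page, the word hashes stored at the previous crawl, so a
// re-crawl can compute exactly which index entries have become obsolete.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PageWordTracker {
    // page hash -> word hashes that were indexed for that page last time
    private final Map<String, Set<String>> lastWords = new HashMap<>();

    /** Returns the word hashes that disappeared since the last crawl and
     *  would have to be removed locally and announced to other peers. */
    Set<String> recrawled(String pageHash, Set<String> currentWords) {
        Set<String> previous = lastWords.getOrDefault(pageHash, new HashSet<>());
        Set<String> obsolete = new HashSet<>(previous);
        obsolete.removeAll(currentWords);                      // words no longer on the page
        lastWords.put(pageHash, new HashSet<>(currentWords));  // remember the new state
        return obsolete;
    }
}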

I have seen some references to vertical partitioning of the DHT in the code, which seems to be related to this, but I couldn't figure out exactly what it does.

Maybe we can set up an experiment with a crawling job that works on only one page (depth 0), with the re-crawl interval set to 1 hour…
joan
 
Posts: 1
Registered: Mon Jul 12, 2010 11:32 am
Location: Bordeaux - France

Re: Issue with indexing very active websites

Post by bebop » Fri Jul 16, 2010 11:57 pm

Here are the explanations I received from quixor:

""
Your robots.txt must provide a link to your sitemap.xml, like this:
CODE: SELECT ALL
User-agent: *
Sitemap: http://your-site-here/sitemap.xml

--- this is OK only if I am the owner of the site ---

Then it should work. As I recall, sitemaps can't be "re-crawled"; to get the same behavior, look at "Steering" (left main menu) and look up the entry for your crawl start. Just call that URL, e.g. with lynx --dump http://127.0.0.1:8080/api/etc/etc/, and add that line to your crontab file:
CODE: SELECT ALL
0 */2 * * * lynx --dump http://127.0.0.1:8080/api/etc/etc/

-- I don't quite understand what the "Steering" part is; maybe it is a problem with the language version ....


This will call the API, and thus trigger the re-crawl, every 2 hours at minute 00.
""
bebop
 
Posts: 20
Registered: Wed Apr 15, 2009 6:02 am

Re: Issue with indexing very active websites

Post by bebop » Sat Jul 17, 2010 12:03 am

Here are some problems I ran into while trying to crawl from an RSS feed page or a sitemap:


Error for the feed: cannot load: load error - REJECTED WRONG MIME TYPE, mime = text/xml; charset=windows-1252: no parser found


Error for the sitemap: no error given; it simply includes the page itself in the index without being able to crawl the pages it points to.

viewtopic.php?f=5&t=2465
bebop
 
Posts: 20
Registered: Wed Apr 15, 2009 6:02 am

Re: Issue with indexing very active websites

Post by Lotus » Tue Jul 27, 2010 9:39 pm

joan wrote: When the same page is crawled again later, it may have completely different content, so to keep the index consistent there should be a mechanism to remove all the word entries that were indexed the first time.

Is there such a mechanism, and how does it work (roughly)?

All words on the indexed page are stored in a reverse word index, with a reference back to the page.
At search time, every word being searched for is verified to still be present on the found page. If not, the reference that failed verification is deleted.
This is the way YaCy handles it at the moment.
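
A minimal sketch of that verify-on-search behaviour (simplified, with hypothetical names; not YaCy's actual classes):
CODE: SELECT ALL
// Illustration only: the reverse word index maps a word hash to page references;
// stale references are detected and dropped lazily, when a search touches them.
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

class LazyReverseWordIndex {
    private final Map<String, Set<String>> wordToPages = new HashMap<>();

    void add(String wordHash, String pageUrl) {
        wordToPages.computeIfAbsent(wordHash, w -> new HashSet<>()).add(pageUrl);
    }

    /** Search for a word; references whose page no longer contains the word
     *  fail verification and are removed from the index on the spot. */
    Set<String> search(String wordHash, BiPredicate<String, String> pageStillContains) {
        Set<String> refs = wordToPages.getOrDefault(wordHash, Collections.emptySet());
        refs.removeIf(url -> !pageStillContains.test(url, wordHash)); // lazy cleanup
        return refs;
    }
}

The trade-off of this approach is that obsolete references stay in the index until a search for that word actually touches them.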
Lotus
 
Posts: 1699
Registered: Wed Jun 27, 2007 3:33 pm
Location: Hamburg

