Lazy question about FINAL_LOAD_CONTEXT and must-match filter

Here YaCy users find help when something does not work, or works differently than expected. Obvious bugs should be reported directly to the bug tracker (http://bugs.yacy.net).
Forum rules
This forum is for usage problems and requests for help. If a bug is identified, the thread is moved to the bug section for processing. So if you posted a thread here and cannot find it, you will most likely find it again in the bug section.

Lazy question about FINAL_LOAD_CONTEXT and must-match filter

Post by DNcrawler » Tue Jan 10, 2017 10:46 pm

Hi, I've started reading through the source code to see where this message originates, but I haven't found it quite yet.

FINAL_LOAD_CONTEXT url does not match must-match filter (smb|ftp|https?)://(www.)?(\Qexample.com\E.*)


I think it's here,
https://github.com/yacy/yacy_search_server/blob/c1401d821e2141fd3d1e1a1ec03ec8b20f8fcd86/source/net/yacy/crawler/CrawlStacker.java#L522


I'm wondering where I can change the parameters to include http, or rewrite the regex as (smb|ftp|http|https). It appears the regex
Code: Select all
(smb|ftp|https?)
is supposed to match http or https, but it doesn't seem to do so in practice.
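For what it's worth, the pattern can be exercised in isolation with java.util.regex, the same regex engine YaCy is built on. A minimal standalone sketch (not YaCy's actual code path), with example.com standing in for the real domain:

```java
import java.util.regex.Pattern;

public class MustMatchCheck {
    public static void main(String[] args) {
        // The must-match filter from the error message above;
        // \Q...\E quotes the domain so dots are matched literally.
        Pattern filter = Pattern.compile("(smb|ftp|https?)://(www.)?(\\Qexample.com\\E.*)");

        // https? matches both schemes when the whole URL is tested with matches():
        System.out.println(filter.matcher("http://example.com/page").matches());     // true
        System.out.println(filter.matcher("https://example.com/page").matches());    // true
        System.out.println(filter.matcher("http://www.example.com/page").matches()); // true

        // A URL on another host is rejected:
        System.out.println(filter.matcher("http://other.org/page").matches());       // false
    }
}
```

So `https?` does cover plain http at the pattern level; if http URLs are still rejected, the cause is likely elsewhere (e.g. which filter field the pattern was entered into).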

Thank you.
DNcrawler
 
Posts: 19
Registered: Wed Dec 21, 2016 1:48 am

Re: Lazy question about FINAL_LOAD_CONTEXT and must-match filter

Post by luc » Thu Jan 12, 2017 10:04 am

Hello,
the message can also originate a few lines higher up, when you use the "Load Filter on URLs" filter on the /CrawlStartExpert.html page.

Indeed, I just checked and had no problem with your regular expression filter. I used
Code: Select all
(smb|ftp|https?)://(www.)?(\Qen.wikipedia.org\E.*)
as a "must-match" "Load Filter on URLs" and used
Code: Select all
https://en.wikipedia.org/wiki/Main_Page
as the start URL with a crawl depth set to 1:
- URLs such as https://meta.wikimedia.org/wiki/Main_Page are successfully rejected
- URLs such as https://en.wikipedia.org/wiki/January_12 are successfully crawled

Another example with http://yacy.net as start URL and
Code: Select all
(smb|ftp|https?)://(www.)?(\Qyacy.net\E.*)
as the filter also worked as expected:
- URLs such as http://player.vimeo.com/video/102122237 are rejected
- URLs such as http://yacy.net/release_notes/ are successfully crawled

Did one of us miss something?
luc
 
Posts: 294
Registered: Wed Aug 26, 2015 1:04 am

Re: Lazy question about FINAL_LOAD_CONTEXT and must-match filter

Post by DNcrawler » Thu Jan 12, 2017 2:45 pm

Once again luc, thank you for the response.

As far as I can tell, the expert crawls were started with:

Load Filter on URLs
must-match: Use filter .* (must not be empty)
must-not-match: (empty)

Document Filter
These are limitations on index feeder. The filters will be applied after a web page was loaded.

Filter on URLs
must-match: .* (must not be empty)
must-not-match: (empty)

Filter on Content of Document
(all visible text, including camel-case-tokenized url and title)
must-match: .* (must not be empty)
must-not-match: (empty)


I'll keep digging. As there is no easy way to use an API to re-create the expert crawls, I'll have to restart a few of them to see if it changes.
DNcrawler
 
Posts: 19
Registered: Wed Dec 21, 2016 1:48 am

Re: Lazy question about FINAL_LOAD_CONTEXT and must-match filter

Post by luc » Thu Jan 12, 2017 5:30 pm

Indeed, there is a way to re-create expert crawls: on the Process Scheduler page (/Table_API_p.html), in the "Type" column, "crawler" entries have a "clone" link that opens the /CrawlStartExpert.html page with the same previously used parameters.
luc
 
Posts: 294
Registered: Wed Aug 26, 2015 1:04 am

