How to exclude section of a domain?


Post by Giorgos » Sun Dec 10, 2017 2:42 am

Hi! :D
How can I exclude a section of a domain from being scanned?
E.g. I want the crawler to scan the site foo.com, which has a wiki at foo.com/wiki and a forum at foo.com/forum.
Since scanning a big wiki, and especially a forum, would be overkill, how can I scan only the rest of the domain?

TIA! :D
G.

Re: How to exclude section of a domain?

Post by luc » Tue Dec 19, 2017 10:38 am

Hi Giorgos,
you can do this using the "Load Filter on URLs" field on the Advanced Crawler page (/CrawlStartExpert.html).
For example:
- tick the "Restrict to start domain(s)" radio button
- type a regular expression in the "must-not-match" field, something such as http://foo.com/((wiki)|(forum))/?.*

You can use the Regex Test page (/RegexTest.html) to adjust your filtering regular expression before launching the crawl.
Then, once the crawl is launched, you can check that the filtering is actually being applied on the "Rejected URLs" page (/IndexCreateParserErrors_p.html).
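
If you want a rough offline check before opening the Regex Test page, the pattern above can be tried against a few sample URLs. This is only a sketch: the sample URLs, the `is_excluded` helper and the `STRICTER` variant are made up for illustration, and it assumes the crawler matches the expression against the whole URL (which `re.fullmatch` imitates here). The Regex Test page remains the reference for how the crawler really behaves.

```python
import re

# Pattern exactly as suggested in the post above.
# Assumption: the filter must match the entire URL, hence fullmatch below.
MUST_NOT_MATCH = re.compile(r"http://foo.com/((wiki)|(forum))/?.*")

# Hypothetical stricter variant (not from the post): accepts http and https,
# an optional "www.", escapes the dots, and requires the path segment to end
# right after "wiki" or "forum".
STRICTER = re.compile(r"https?://(www\.)?foo\.com/(wiki|forum)([/?#].*)?")

# Made-up sample URLs, only for illustration.
SAMPLES = [
    "http://foo.com/",
    "http://foo.com/about.html",
    "http://foo.com/wiki",
    "http://foo.com/wiki/Main_Page",
    "http://foo.com/forum/viewtopic.php?t=1",
    "http://foo.com/wikipedia-notes.html",
    "https://foo.com/wiki/Main_Page",
]


def is_excluded(url, pattern=MUST_NOT_MATCH):
    """Return True if the whole URL matches the must-not-match pattern."""
    return pattern.fullmatch(url) is not None


if __name__ == "__main__":
    for url in SAMPLES:
        original = "skip " if is_excluded(url) else "crawl"
        stricter = "skip " if is_excluded(url, STRICTER) else "crawl"
        print(f"original: {original} | stricter: {stricter} | {url}")
```

With these samples, the pattern from the post also excludes http://foo.com/wikipedia-notes.html (anything whose path merely starts with "wiki") and does not exclude the https:// URL. Whether that matters depends on the site, so adjust the expression to the URLs you actually see in the "Rejected URLs" page.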

Happy crawling

Re: How to exclude section of a domain?

Post by Giorgos » Tue Dec 19, 2017 12:51 pm

THANKS luc!!! :D
I'll try it!

