How to configure yacy to search only for PDF files

Post by nergal » Wed Aug 16, 2017 1:18 am

Hello everyone, I'm new to YaCy and the project.
I already know about the filetype operator and so on, but what I want YaCy to do is crawl and search only for documents: PDF, DOCX and so on. I need this because I spend most of my time searching for service manuals and machine documentation on every single search engine and on P2P, torrent and eDonkey servers!
I have no prior knowledge of search engines, but I have a bit of experience.
So help me if you can, guys, and go easy on me. Thank you.
Posts: 1
Joined: Wed Aug 16, 2017 1:12 am

Re: How to configure yacy to search only for PDF files

Post by luc » Fri Aug 18, 2017 9:24 am

Hi nergal,
as you said, when searching you can restrict results to a given file type using the "Filetype" facet or the "filetype:" operator.

When crawling, as far as I know there is no option to filter directly on a selection of file types or MIME types (though it could be convenient), but a regular expression filter on the document URL will do the job. So for PDF files, for example, I would suggest crawling the websites you're interested in with the following options in the Advanced Crawler (CrawlStartExpert.html):
- Document Filter > Filter on URLs > must-match : .*\.pdf
- Index Attributes > Add Crawl result to collection(s) : your_pdf_collection_name

Leave the other options at their defaults, or set them as you wish. What is important here is to use the "Document Filter > Filter on URLs" option, and not the "Crawler Filter > Load Filter on URLs" one. The latter is too restrictive: it would prevent the loader from parsing HTML files and following their links, so the crawl task would terminate quickly.
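To try out a must-match pattern before starting a crawl, here is a minimal Python sketch. It rests on two assumptions that are not confirmed in this thread: that the filter has to match the whole URL (mimicked here with `re.fullmatch`), and that the field accepts the inline `(?i)` flag for case-insensitive matching. The extended pattern covering DOC, DOCX and ODT is only an illustration, and the URLs are made up.

```python
import re

# Pattern from the post: PDF files only.
PDF_FILTER = r".*\.pdf"

# Hypothetical extension to a few more document types (assumption, untested
# in YaCy itself). "(?i)" asks for case-insensitive matching.
DOC_FILTER = r"(?i).*\.(pdf|docx?|odt)"


def passes_filter(url, pattern=DOC_FILTER):
    """Return True if the whole URL matches the pattern."""
    # Assumption: the filter must match the entire URL, hence fullmatch.
    return re.fullmatch(pattern, url) is not None


if __name__ == "__main__":
    sample_urls = [
        "http://example.com/manuals/pump.pdf",
        "http://example.com/manuals/Pump.DOCX",
        "http://example.com/index.html",
        "http://example.com/get/manual.pdf?download=1",
    ]
    for url in sample_urls:
        print(url, passes_filter(url))
```

Note that under the whole-URL assumption a URL carrying a query string after the extension (the last sample) does not match, so the pattern may need a suffix such as `(\?.*)?` if the sites you crawl serve documents that way.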

Optionally, using a custom collection name can later help you restrict searches to your own PDF collection using the "Collection" facet or the "collection:" operator.
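As a rough illustration of combining the two operators, the Python sketch below builds a query string and a search URL. The helper names are hypothetical, and the base address `http://localhost:8090/yacysearch.json` with its `query` parameter is an assumption about a default local installation; adjust it to your own peer.

```python
from urllib.parse import urlencode


def build_query(terms, filetype=None, collection=None):
    """Append the filetype: and collection: operators to the search terms."""
    parts = [terms]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if collection:
        parts.append(f"collection:{collection}")
    return " ".join(parts)


def build_search_url(query, base="http://localhost:8090/yacysearch.json"):
    """Build a search URL; the base address and parameter name are assumptions."""
    return base + "?" + urlencode({"query": query})


if __name__ == "__main__":
    q = build_query("service manual", "pdf", "your_pdf_collection_name")
    print(q)
    print(build_search_url(q))
```

The same query string can simply be typed into the search box; the URL form is only useful if you want to script your searches.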

Have a nice day. Let us know if you somehow achieved what you want.
Posts: 314
Joined: Wed Aug 26, 2015 1:04 am
