How to enter a URL list to crawl with some parameters?

Discussion in the English language.
Forum rules
You can start and continue posts in English in all the other forums as well, but if you are looking for a forum to start a discussion in English, this is the right choice.

How to enter a URL list to crawl with some parameters?

Post by fff » Mon Jan 01, 2018 7:54 pm

Hi to all

(I am new here - happy 2018 btw !)

I'd like to enter a list of URLs to start crawling
- not one by one
- with parameters for each, different for each URL (the URL itself, of course, but also crawl depth, max page count, collection name, whether to restrict to a sub-path or not ...)

I have an Excel list of these, but of course I can convert it to whatever format is required.

Thanks in advance,
Fabrice
fff
Posts: 4
Joined: Mon Jan 01, 2018 7:43 pm

Re: How to enter a URL list to crawl with some parameters?

Post by ircamb » Mon Jan 15, 2018 6:15 pm

I think you have to enter them one by one if you want different parameters for each URL.
ircamb
Posts: 20
Joined: Mon Sep 11, 2017 1:00 am

Re: How to enter a URL list to crawl with some parameters?

Post by fff » Tue Jan 16, 2018 9:10 am

Thanks, ircamb.

.... not so good ...
Is everyone only entering their own site? Does anyone else have a trick? Any developer view? Luc? Orbiter?
fff
Posts: 4
Joined: Mon Jan 01, 2018 7:43 pm

Re: How to enter a URL list to crawl with some parameters?

Post by luc » Wed Jan 17, 2018 8:59 am

Hi Fabrice, welcome and happy new year!

Sorry for the late answer, but I confirm that, as far as I know, starting a crawl from a file with a list of entries is possible but currently limited:
- use the "From File (enter a path within your local file system)" field in the advanced crawler (the /CrawlStartExpert.html page)
- file format: must be HTML or plain text (converting from an Excel .xlsx file to .csv will work fine)
- all entries listed in the file will share the same crawl profile/parameters

So if you have different crawl parameters for each of your crawl start points, I would suggest writing an external script (.sh, .bat, or whatever you prefer, depending on your OS) that calls the /Crawler_p.html API with the appropriate parameters for each entry.
To help you build the API URLs, you can go to the Process Scheduler page (/Table_API_p.html) and pick up and adjust the URL of one of your previously recorded crawls.
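To make the idea concrete, here is a minimal sketch of such a script in Python instead of a shell/batch file. The peer address, the CSV column names, and the crawl parameter names used here (crawlingMode, crawlingURL, crawlingDepth, collection, mustmatch) are assumptions for illustration; confirm the exact parameter names by inspecting a previously recorded crawl on the /Table_API_p.html page, as described above.

```python
# Sketch: one /Crawler_p.html API call per row of a CSV exported from Excel.
# All parameter names below are assumptions; verify them against a recorded
# crawl on /Table_API_p.html before relying on this.
import csv
import urllib.parse
import urllib.request

YACY = "http://localhost:8090"  # assumed address of your YaCy peer

def crawl_start_url(row):
    """Build one crawl-start API URL from one spreadsheet row."""
    params = {
        "crawlingMode": "url",                 # start from a single URL
        "crawlingURL": row["url"],             # the crawl start point
        "crawlingDepth": row["depth"],         # per-site crawl depth
        "collection": row["collection"],       # per-site collection name
        # restrict to the sub-path by turning the start URL into a filter
        "mustmatch": row["url"] + ".*" if row["subpath_only"] == "yes" else ".*",
    }
    return YACY + "/Crawler_p.html?" + urllib.parse.urlencode(params)

def start_crawls(csv_path):
    # expected CSV columns (assumed): url, depth, collection, subpath_only
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            urllib.request.urlopen(crawl_start_url(row))  # fire one crawl per row
```

The same loop would translate directly to C# with HttpClient; the only part that matters is getting the query parameters right for each row.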

Best regards
luc
Posts: 314
Joined: Wed Aug 26, 2015 1:04 am

Re: How to enter a URL list to crawl with some parameters?

Post by fff » Thu Jan 18, 2018 12:43 pm

SUPER! Thanks a lot, Luc; hoping this will also be helpful for others.
All is great; no regret about having taken a VPS especially to test YaCy and add my small 24/7 node.

... and this API and its docs are super clear. I will use it with C# (I am less familiar with batch scripts).

But if you guys could write more English than German, it would help YaCy grow even more ;)
(but German is also OK for me, just harder)

Fabrice
PS: I have some other beginner questions, but will open separate threads for them.
fff
Posts: 4
Joined: Mon Jan 01, 2018 7:43 pm
