slow indexing



Post by lucipher » Thu Jul 20, 2017 7:44 am

I have added 7 files with links to the queue.
Each file has about 10,000 links.

But indexing this huge list is terribly slow. Can I tune something to make it faster?
As you can see below, the load is low and there are free resources available to do it faster.



Average PPM is 50 (30,000 is set)
Documents in index: 8,836,848

System
YaCy version: 1.921/9000
Uptime: 0 days 01:47
Java version: 1.8.0_131
Processors: 8
Load: 0.2
Threads: 563/19, peak:572, total:1360
Memory Usage
RAM used: 4.15 GB
RAM max: 12.72 GB
DISK used: (approx.) 193.49 GB
DISK free: 1,527.71 GB
Traffic
Proxy: 0 Bytes
Crawler: 138.2 MB
Incoming Connections
Active: 7 | Max: 200
Queues
Loader Queue: 2 | 500
Local Crawl 66,836
Remote triggered Crawl 0
Pre-Queueing 0
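
To put the numbers above in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: the function name `minutes_to_drain` is made up for illustration, and it assumes every queued URL counts as one page and that the rate stays constant, which a real crawl will not do.

```python
def minutes_to_drain(queue_size, pages_per_minute):
    """Minutes needed to work through a queue at a constant page rate."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return queue_size / pages_per_minute


# Figures taken from the status output above.
LOCAL_CRAWL_QUEUE = 66_836
OBSERVED_PPM = 50
CONFIGURED_PPM = 30_000

if __name__ == "__main__":
    observed = minutes_to_drain(LOCAL_CRAWL_QUEUE, OBSERVED_PPM)
    configured = minutes_to_drain(LOCAL_CRAWL_QUEUE, CONFIGURED_PPM)
    print(f"at {OBSERVED_PPM} PPM: {observed / 60:.1f} hours")
    print(f"at {CONFIGURED_PPM} PPM: {configured:.1f} minutes")
```

At the observed rate the local crawl queue alone would take most of a day, while the configured limit would allow it to finish in a few minutes, so the limit itself does not look like the bottleneck.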



Part of my log:



Code:
I 2017/07/20 02:33:01 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[bwFGuQBW4zEB (1573422242273951744)]} 0 2

I 2017/07/20 02:33:01 REJECTED http://www.tanio-buduje.pl/ - cannot load: load error - Client can't execute: www.tanio-buduje.pl duration=0 for url http://www.tanio-buduje.pl/

I 2017/07/20 02:33:01 LOADER Forcing sleep of 95 ms for host www.tanio-buduje.pl

I 2017/07/20 02:33:01 REJECTED https://sklep.premium.pl/osklep.pl - Parser Failure for extension 'pl' or mimetype 'text/plain': file extension 'pl' is denied (1); url = https://sklep.premium.pl/osklep.pl; url = https://sklep.premium.pl/osklep.pl

W 2017/07/20 02:33:01 SWITCHBOARD Unable to parse the resource 'https://sklep.premium.pl/osklep.pl'. Parser Failure for extension 'pl' or mimetype 'text/plain': file extension 'pl' is denied (1); url = https://sklep.premium.pl/osklep.pl; url = https://sklep.premium.pl/osklep.pl

W 2017/07/20 02:33:01 PARSER Parser Failure for extension 'pl' or mimetype 'text/plain': file extension 'pl' is denied (1); url = https://sklep.premium.pl/osklep.pl

I 2017/07/20 02:33:01 CrawlQueues placed NOLOAD URL on indexing queue: https://sklep.premium.pl/osklep.pl

I 2017/07/20 02:33:00 REJECTED https://www.alx.pl/robots.txt - no response body (http return code = 404)

I 2017/07/20 02:33:00 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[rmbM4ylpnyrA (1573422241271513088)]} 0 0

I 2017/07/20 02:33:00 REJECTED http://www.alx.pl/tech/php/ - cannot load: load error - CRAWLER Redirect of URL=http://www.alx.pl/tech/php/ to https://www.alx.pl/tech/php/ placed on crawler queue for double-check

I 2017/07/20 02:33:00 HostQueue opened HostQueue /home/zmudzmar/yacy/DATA/INDEX/webportal/QUEUES/CrawlerCoreStacks/www.alx.pl-#5ZIHKg.443 with 0 urls.

I 2017/07/20 02:33:00 LOADER CRAWLER ..Redirecting request to: https://www.alx.pl/tech/php/

I 2017/07/20 02:33:00 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.alx.pl/tech/php/

I 2017/07/20 02:33:00 org.apache.solr.update.DirectUpdateHandler2 end_commit_flush

I 2017/07/20 02:33:00 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[MeIE_QHOEqrA (1573422241111080960)]} 0 11

I 2017/07/20 02:33:00 REJECTED http://www.osklep.pl/ - cannot load: load error - CRAWLER Redirect of URL=http://www.osklep.pl/ to https://sklep.premium.pl/osklep.pl placed on crawler queue for double-check

I 2017/07/20 02:33:00 HostQueue opened HostQueue /home/zmudzmar/yacy/DATA/INDEX/webportal/QUEUES/CrawlerNoLoadStacks/sklep.premium.pl-#KLyFDg.443 with 0 urls.

I 2017/07/20 02:33:00 LOADER CRAWLER ..Redirecting request to: https://sklep.premium.pl/osklep.pl

I 2017/07/20 02:33:00 LOADER CRAWLER Redirection detected ('HTTP/1.1 302 Found') for URL http://www.osklep.pl/

I 2017/07/20 02:32:59 SWITCHBOARD *Indexed 1034 words in URL http://www.zarzadzaniebhp.pl/ [96lRtQIGYARC] Description: Zarządzanie bezpieczeństwem i higieną pracy MimeType: text/html | Charset: UTF-8 | Size: 20153 bytes | LinkStorageTime: 50 ms | indexStorageTime: 0 ms

I 2017/07/20 02:32:59 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[96lRtQIGYARC (1573422240713670656)]} 0 7

I 2017/07/20 02:32:59 Fulltext indexing: 96lRtQIGYARC http://www.zarzadzaniebhp.pl/

I 2017/07/20 02:32:59 SWITCHBOARD Excluded 2 words in URL http://www.zarzadzaniebhp.pl/

I 2017/07/20 02:32:59 HTCACHE storing content of url http://www.zarzadzaniebhp.pl/, 178912 bytes

I 2017/07/20 02:32:59 org.apache.solr.update.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@7dce4433

I 2017/07/20 02:32:59 org.apache.solr.update.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

I 2017/07/20 02:32:58 LOADER CRAWLER ..Redirecting request to: https://sklep.premium.pl/osklep.pl

I 2017/07/20 02:32:58 LOADER CRAWLER Redirection detected ('HTTP/1.1 302 Found') for URL http://www.osklep.pl/robots.txt

I 2017/07/20 02:32:58 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[E2tflfPlbfJA (1573422239312773120)]} 0 0

I 2017/07/20 02:32:58 REJECTED http://www.xn--ciarwki-o0a35a77a.info.pl/ - cannot load: load error - Client can't execute: www.xn--ciarwki-o0a35a77a.info.pl duration=0 for url http://www.xn--ciarwki-o0a35a77a.info.pl/

I 2017/07/20 02:32:58 LOADER Forcing sleep of 33 ms for host www.xn--ciarwki-o0a35a77a.info.pl

I 2017/07/20 02:32:58 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[E3rxestaJT4A (1573422239245664256)]} 0 0

I 2017/07/20 02:32:58 REJECTED http://www.produkcja.co.pl/ - cannot load: load error - CRAWLER Redirect of URL=http://www.produkcja.co.pl/ to http://www.parkingco.pl/ placed on crawler queue for double-check

I 2017/07/20 02:32:58 LOADER CRAWLER ..Redirecting request to: http://www.parkingco.pl/

I 2017/07/20 02:32:58 LOADER CRAWLER Redirection detected ('HTTP/1.1 302 Found') for URL http://www.produkcja.co.pl/

I 2017/07/20 02:32:58 LOADER CRAWLER ..Redirecting request to: http://www.parkingco.pl/

I 2017/07/20 02:32:58 LOADER CRAWLER Redirection detected ('HTTP/1.1 302 Found') for URL http://www.produkcja.co.pl/robots.txt

I 2017/07/20 02:32:57 SWITCHBOARD *Indexed 3 words in URL http://www.pietkiewicz.pl/ [M6Yo4Ql7s4aB] Description: MimeType: text/html | Charset: UTF-8 | Size: 14 bytes | LinkStorageTime: 1 ms | indexStorageTime: 0 ms

I 2017/07/20 02:32:57 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[M6Yo4Ql7s4aB (1573422238574575616)]} 0 0

I 2017/07/20 02:32:57 Fulltext indexing: M6Yo4Ql7s4aB http://www.pietkiewicz.pl/

I 2017/07/20 02:32:57 SWITCHBOARD Excluded 0 words in URL http://www.pietkiewicz.pl/

I 2017/07/20 02:32:57 SWITCHBOARD *Indexed 114 words in URL http://www.zielonykacik.com.pl/ [t7TDoLSZP5IA] Description: Zielony Kącik | Dom Opieki Częstochowa, Śląsk MimeType: text/html | Charset: UTF-8 | Size: 992 bytes | LinkStorageTime: 578 ms | indexStorageTime: 0 ms

I 2017/07/20 02:32:57 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[t7TDoLSZP5IA (1573422238572478464)]} 0 1

I 2017/07/20 02:32:57 Fulltext indexing: t7TDoLSZP5IA http://www.zielonykacik.com.pl/

I 2017/07/20 02:32:57 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[jkMcAQiEIMQD (1573422238560944128)]} 0 0

I 2017/07/20 02:32:57 REJECTED http://www.windykacjawpraktyce.pl/ - cannot load: load error - CRAWLER Redirect of URL=http://www.windykacjawpraktyce.pl/ to http://statima.pl/ placed on crawler queue for double-check

I 2017/07/20 02:32:57 HostQueue opened HostQueue /home/zmudzmar/yacy/DATA/INDEX/webportal/QUEUES/CrawlerCoreStacks/statima.pl-#lf73UA.80 with 0 urls.

I 2017/07/20 02:32:57 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[DPU9zJhZ1AUA (1573422238513758208)]} 0 0

I 2017/07/20 02:32:57 REJECTED http://www.sebimar.com.pl/ - cannot load: load error - Client can't execute: www.sebimar.com.pl duration=0 for url http://www.sebimar.com.pl/

I 2017/07/20 02:32:57 HTCACHE storing content of url http://www.pietkiewicz.pl/, 27 bytes

I 2017/07/20 02:32:57 LOADER Forcing sleep of 104 ms for host www.sebimar.com.pl

I 2017/07/20 02:32:57 REJECTED http://www.pietkiewicz.pl/robots.txt - no response body (http return code = 404)

I 2017/07/20 02:32:57 REJECTED http://la.org.pl/robots.txt - retry counter exceeded

I 2017/07/20 02:32:57 LOADER CRAWLER ..Redirecting request to: http://la.org.pl/robots.txt

I 2017/07/20 02:32:57 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:57 SWITCHBOARD Excluded 0 words in URL http://www.zielonykacik.com.pl/

I 2017/07/20 02:32:57 HTCACHE storing content of url http://www.zielonykacik.com.pl/, 17194 bytes

I 2017/07/20 02:32:57 LOADER CRAWLER ..Redirecting request to: http://statima.pl/

I 2017/07/20 02:32:57 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.windykacjawpraktyce.pl/

I 2017/07/20 02:32:57 LOADER CRAWLER ..Redirecting request to: http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:57 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://la.org.pl/robots.txt

I 2017/07/20 02:32:56 LOADER CRAWLER ..Redirecting request to: http://statima.plrobots.txt/

I 2017/07/20 02:32:56 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.windykacjawpraktyce.pl/robots.txt

I 2017/07/20 02:32:56 LOADER CRAWLER ..Redirecting request to: http://la.org.pl/robots.txt

I 2017/07/20 02:32:56 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:56 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[a0RVK75NilsD (1573422237373956096)]} 0 0

I 2017/07/20 02:32:56 REJECTED http://zagorski-softland.pl/ - cannot load: load error - CRAWLER Redirect of URL=http://zagorski-softland.pl/ to http://zagorski-softland.pl/index_.html placed on crawler queue for double-check

I 2017/07/20 02:32:56 HostQueue opened HostQueue /home/zmudzmar/yacy/DATA/INDEX/webportal/QUEUES/CrawlerCoreStacks/zagorski-softland.pl-#5NilsD.80 with 0 urls.

I 2017/07/20 02:32:56 LOADER CRAWLER ..Redirecting request to: http://zagorski-softland.pl/index_.html

I 2017/07/20 02:32:56 LOADER CRAWLER Redirection detected ('HTTP/1.1 302 Moved Temporarily') for URL http://zagorski-softland.pl/

I 2017/07/20 02:32:56 SWITCHBOARD *Indexed 430 words in URL http://securelist.pl/?VLredirect=1 [dkvH5756-IvB] Description: SecureList.pl MimeType: text/html | Charset: UTF-8 | Size: 4784 bytes | LinkStorageTime: 15 ms | indexStorageTime: 0 ms

I 2017/07/20 02:32:56 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[dkvH5756-IvB (1573422237230301184)]} 0 4

I 2017/07/20 02:32:56 Fulltext indexing: dkvH5756-IvB http://securelist.pl/?VLredirect=1

I 2017/07/20 02:32:56 SWITCHBOARD Excluded 3 words in URL http://securelist.pl/?VLredirect=1

I 2017/07/20 02:32:56 HTCACHE storing content of url http://securelist.pl/?VLredirect=1, 53503 bytes

I 2017/07/20 02:32:56 LOADER CRAWLER ..Redirecting request to: http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:56 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://la.org.pl/robots.txt

I 2017/07/20 02:32:56 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[TEL793LDKHAA (1573422236852813824)]} 0 0

I 2017/07/20 02:32:56 REJECTED http://www.tpnet.ltd.pl/ - cannot load: load error - Client can't execute: www.tpnet.ltd.pl duration=0 for url http://www.tpnet.ltd.pl/

I 2017/07/20 02:32:55 LOADER CRAWLER ..Redirecting request to: http://la.org.pl/robots.txt

I 2017/07/20 02:32:55 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:55 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[btTO5B3D7ATA (1573422236598009856)]} 0 0

I 2017/07/20 02:32:55 REJECTED http://www.jordek-spedycja.slask.pl/ - cannot load: load error - Client can't execute: www.jordek-spedycja.slask.pl duration=0 for url http://www.jordek-spedycja.slask.pl/

I 2017/07/20 02:32:55 LOADER Forcing sleep of 105 ms for host www.jordek-spedycja.slask.pl

I 2017/07/20 02:32:55 LOADER CRAWLER ..Redirecting request to: http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:55 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://la.org.pl/robots.txt

I 2017/07/20 02:32:55 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[W4ddnbyiaxGA (1573422235917484032)]} 0 0

I 2017/07/20 02:32:55 REJECTED http://www.la.org.pl/ - cannot load: load error - CRAWLER Redirect of URL=http://www.la.org.pl/ to http://la.org.pl/ placed on crawler queue for double-check

I 2017/07/20 02:32:55 HostQueue opened HostQueue /home/zmudzmar/yacy/DATA/INDEX/webportal/QUEUES/CrawlerCoreStacks/la.org.pl-#OSx9zA.80 with 0 urls.

I 2017/07/20 02:32:55 LOADER CRAWLER ..Redirecting request to: http://la.org.pl/

I 2017/07/20 02:32:55 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.la.org.pl/

I 2017/07/20 02:32:55 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[9EGyN6NsJczA (1573422235845132288)]} 0 0

I 2017/07/20 02:32:55 REJECTED http://www.ssp10sto.waw.ids.pl/ - cannot load: load error - Client can't execute: www.ssp10sto.waw.ids.pl duration=0 for url http://www.ssp10sto.waw.ids.pl/

I 2017/07/20 02:32:55 LOADER Forcing sleep of 59 ms for host www.ssp10sto.waw.ids.pl

I 2017/07/20 02:32:54 REJECTED http://www.la.org.pl/robots.txt - retry counter exceeded

I 2017/07/20 02:32:54 LOADER CRAWLER ..Redirecting request to: http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:54 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://la.org.pl/robots.txt

I 2017/07/20 02:32:54 LOADER CRAWLER ..Redirecting request to: http://la.org.pl/robots.txt

I 2017/07/20 02:32:54 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:54 LOADER CRAWLER ..Redirecting request to: http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:54 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://la.org.pl/robots.txt

I 2017/07/20 02:32:53 LOADER CRAWLER ..Redirecting request to: http://la.org.pl/robots.txt

I 2017/07/20 02:32:53 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:53 LOADER CRAWLER ..Redirecting request to: http://www.la.org.pl/robots.txt

I 2017/07/20 02:32:53 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://la.org.pl/robots.txt
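
To see how many log records end in an indexed document versus a rejected URL, an excerpt like the one above could be tallied with a small script. This is a minimal sketch in Python and not part of YaCy: `summarize`, `REJECT_REASONS`, and `SAMPLE` are hypothetical names, the reason labels are just substrings copied from the log lines above, and the first sample line is shortened.

```python
import re
from collections import Counter

# One record looks like: "I 2017/07/20 02:33:01 REJECTED http://... - reason"
LINE_RE = re.compile(r"^[A-Z] \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (.*)$")

# Substrings copied from the log excerpt; the grouping is only illustrative.
REJECT_REASONS = (
    "Client can't execute",
    "Redirect of URL",
    "retry counter exceeded",
    "no response body",
    "Parser Failure",
)


def summarize(log_text):
    """Count indexed documents and rejected URLs (by reason) in a log excerpt."""
    outcomes = Counter()
    reasons = Counter()
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # blank lines or anything that is not a log record
        message = match.group(1)
        if message.startswith("SWITCHBOARD *Indexed"):
            outcomes["indexed"] += 1
        elif message.startswith("REJECTED"):
            outcomes["rejected"] += 1
            # First matching substring wins; unknown reasons go to "other".
            reason = next((r for r in REJECT_REASONS if r in message), "other")
            reasons[reason] += 1
    return outcomes, reasons


SAMPLE = """\
I 2017/07/20 02:32:57 SWITCHBOARD *Indexed 3 words in URL http://www.pietkiewicz.pl/ [M6Yo4Ql7s4aB]
I 2017/07/20 02:32:57 REJECTED http://www.sebimar.com.pl/ - cannot load: load error - Client can't execute: www.sebimar.com.pl duration=0 for url http://www.sebimar.com.pl/
I 2017/07/20 02:32:57 REJECTED http://la.org.pl/robots.txt - retry counter exceeded
I 2017/07/20 02:32:57 LOADER Forcing sleep of 104 ms for host www.sebimar.com.pl
"""

if __name__ == "__main__":
    outcomes, reasons = summarize(SAMPLE)
    print(dict(outcomes))
    print(dict(reasons))
```

Running it over the full log file instead of `SAMPLE` would show whether most of the crawl time goes into hosts that cannot be loaded and redirect loops rather than into pages that actually get indexed.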
lucipher

Posts: 11
Registered: Wed Jun 28, 2017 11:07 am
