Intranet: crawled repository documents are not found


Post by flegno » Sat Sep 06, 2014 5:56 pm

Hello,

here is an excerpt from the log file that records the crawl of the documents in the folder DATA/HTDOCS/repository:
Code:
I 2014/09/06 17:50:46 Crawl Start continue=localcrawler
D 2014/09/06 17:50:46 HostBalancer (re-)initialized the round-robin queue with one host
I 2014/09/06 17:51:21 HTCACHE storing content of url http://localhost:8099/repository/, 578 bytes
S 2014/09/06 17:51:21 AbstractBlockingThread thread 'java.lang.reflect.Method.parseDocument.0' deplo
yed, starting loop.
D 2014/09/06 17:51:21 SWITCHBOARD processResourceStack processCase=LOCAL_CRAWLING, depth=0, maxDepth
=99, must-match=https?+://(www.)?\Qlocalhost\E.*, must-not-match=, initiatorHash=2qTkDO4j_tSJ, url=h
ttp://localhost:8099/repository/
I 2014/09/06 17:51:21 SWITCHBOARD CRAWL: ADDED 3 LINKS FROM http://localhost:8099/repository/, STACK
ING TIME = 5, PARSING TIME = 9
S 2014/09/06 17:51:21 AbstractBlockingThread thread 'java.lang.reflect.Method.job.1' deployed, start
ing loop.
S 2014/09/06 17:51:21 AbstractBlockingThread thread 'java.lang.reflect.Method.condenseDocument.2' de
ployed, starting loop.
S 2014/09/06 17:51:21 AbstractBlockingThread thread 'java.lang.reflect.Method.webStructureAnalysis.3
' deployed, starting loop.
S 2014/09/06 17:51:22 AbstractBlockingThread thread 'java.lang.reflect.Method.storeDocumentIndex.4'
deployed, starting loop.
I 2014/09/06 17:51:22 SWITCHBOARD Excluded 0 words in URL http://localhost:8099/repository/
I 2014/09/06 17:51:22 Fulltext indexing: jnm6Aggzy7ic http://localhost:8099/repository/
I 2014/09/06 17:51:22 SWITCHBOARD *Indexed 22 words in URL http://localhost:8099/repository/ [jnm6Ag
gzy7ic]
        Description:  Directory: /repository/
        MimeType: text/html | Charset: UTF-8 | Size: 162 bytes |
        LinkStorageTime: 196 ms | indexStorageTime: 0 ms
I 2014/09/06 17:51:22 IODispatcher appended dump job for file citation.index.20140906155122668.blob
I 2014/09/06 17:51:22 ReferenceContainerCache creating rwi heap dump 'citation.index.201409061551226
68.blob', 1 rwi's
I 2014/09/06 17:51:23 HostQueue opened HostQueue T:\0_Tools\YaCy\yacy_en\DATA\INDEX\intranet\QUEUES\
CrawlerCoreStacks\localhost.8099 with 0 urls.
D 2014/09/06 17:51:30 HostBalancer (re-)initialized the round-robin queue with one host
D 2014/09/06 17:51:30 HostBalancer (re-)initialized the round-robin queue with one host
I 2014/09/06 17:51:30 HostQueue forcing crawl-delay of 4199 milliseconds for localhost: minimumDelta
= 10, host.average = 8451, robots.delay = 0, ((waitig = 4225) - (timeSinceLastAccess = 28)) = 4197
I 2014/09/06 17:51:30 HostQueue waiting for localhost: 4 seconds remaining...
I 2014/09/06 17:51:31 HostQueue waiting for localhost: 3 seconds remaining...
I 2014/09/06 17:51:31 ReferenceContainerCache finished rwi heap dump: 1 words, 0 word/URL relations
in 6900 milliseconds
I 2014/09/06 17:51:32 HTCACHE storing content of url http://localhost:8099/repository/freex42005pr.p
df, 81427 bytes
D 2014/09/06 17:51:32 SWITCHBOARD processResourceStack processCase=LOCAL_CRAWLING, depth=1, maxDepth
=99, must-match=https?+://(www.)?\Qlocalhost\E.*, must-not-match=, initiatorHash=2qTkDO4j_tSJ, url=h
ttp://localhost:8099/repository/freex42005pr.pdf
I 2014/09/06 17:51:32 HostQueue waiting for localhost: 2 seconds remaining...
I 2014/09/06 17:51:33 HostQueue waiting for localhost: 1 seconds remaining...
I 2014/09/06 17:51:34 HeapReader generating index for T:\0_Tools\YaCy\yacy_en\DATA\INDEX\intranet\SE
GMENTS\default\citation.index.20140906155122668.blob, 0 MB. Please wait.
I 2014/09/06 17:51:34 HeapReader finished index generation for T:\0_Tools\YaCy\yacy_en\DATA\INDEX\in
tranet\SEGMENTS\default\citation.index.20140906155122668.blob, 1 entries, 0 gaps.
D 2014/09/06 17:51:34 HostBalancer (re-)initialized the round-robin queue with one host
I 2014/09/06 17:51:34 HostQueue forcing crawl-delay of 3634 milliseconds for localhost: minimumDelta
= 10, host.average = 7326, robots.delay = 0, ((waitig = 3663) - (timeSinceLastAccess = 32)) = 3631
I 2014/09/06 17:51:35 HostQueue waiting for localhost: 3 seconds remaining...
I 2014/09/06 17:51:36 HTCACHE storing content of url http://localhost:8099/repository/YaCy-Flyer.pdf
, 84483 bytes
I 2014/09/06 17:51:36 HostQueue waiting for localhost: 2 seconds remaining...
I 2014/09/06 17:51:37 HostQueue waiting for localhost: 1 seconds remaining...
I 2014/09/06 17:51:42 Crawl Start continue=remotecrawler
I 2014/09/06 17:51:48 Crawl Start pause=remotecrawler
I 2014/09/06 17:51:49 HTCACHE storing content of url http://localhost:8099/, 10823 bytes
I 2014/09/06 17:51:54 SWITCHBOARD CRAWL: ADDED 0 LINKS FROM http://localhost:8099/repository/freex42
005pr.pdf, STACKING TIME = 0, PARSING TIME = 22445
D 2014/09/06 17:51:54 SWITCHBOARD processResourceStack processCase=LOCAL_CRAWLING, depth=1, maxDepth
=99, must-match=https?+://(www.)?\Qlocalhost\E.*, must-not-match=, initiatorHash=2qTkDO4j_tSJ, url=h
ttp://localhost:8099/repository/YaCy-Flyer.pdf
I 2014/09/06 17:51:55 SWITCHBOARD Excluded 0 words in URL http://localhost:8099/repository/freex4200
5pr.pdf
I 2014/09/06 17:51:55 Fulltext indexing: 59af1Mgzy7ic http://localhost:8099/repository/freex42005pr.
pdf
I 2014/09/06 17:51:55 SWITCHBOARD *Indexed 659 words in URL http://localhost:8099/repository/freex42
005pr.pdf [59af1Mgzy7ic]
        Description:  \\server\e\freex 2005-04\022 ne
        MimeType: application/pdf | Charset: UTF-8 | Size: 9592 bytes |
        LinkStorageTime: 10 ms | indexStorageTime: 0 ms
I 2014/09/06 17:52:03 SWITCHBOARD CRAWL: ADDED 0 LINKS FROM http://localhost:8099/repository/YaCy-Fl
yer.pdf, STACKING TIME = 0, PARSING TIME = 9179
D 2014/09/06 17:52:03 SWITCHBOARD processResourceStack processCase=LOCAL_CRAWLING, depth=1, maxDepth
=99, must-match=https?+://(www.)?\Qlocalhost\E.*, must-not-match=, initiatorHash=2qTkDO4j_tSJ, url=h
ttp://localhost:8099/
I 2014/09/06 17:52:04 SWITCHBOARD CRAWL: ADDED 25 LINKS FROM http://localhost:8099/, STACKING TIME =
8, PARSING TIME = 48
I 2014/09/06 17:52:04 SWITCHBOARD Excluded 0 words in URL http://localhost:8099/repository/YaCy-Flye
r.pdf
I 2014/09/06 17:52:04 Fulltext indexing: 5WtquMgzy7ic http://localhost:8099/repository/YaCy-Flyer.pd
f
I 2014/09/06 17:52:04 SWITCHBOARD *Indexed 357 words in URL http://localhost:8099/repository/YaCy-Fl
yer.pdf [5WtquMgzy7ic]
        Description:  YaCy-Flyer.pdf
        MimeType: application/pdf | Charset: UTF-8 | Size: 4892 bytes |
        LinkStorageTime: 45 ms | indexStorageTime: 0 ms
I 2014/09/06 17:52:04 SWITCHBOARD Excluded 0 words in URL http://localhost:8099/
I 2014/09/06 17:52:04 Fulltext indexing: zFOvZggzy7ic http://localhost:8099/
I 2014/09/06 17:52:04 SWITCHBOARD *Indexed 87 words in URL http://localhost:8099/ [zFOvZggzy7ic]
        Description:  YaCy '_anonw-50226812-78': Search Page
        MimeType: text/html | Charset: UTF-8 | Size: 611 bytes |
        LinkStorageTime: 361 ms | indexStorageTime: 0 ms
I 2014/09/06 17:52:04 REJECTED http://www.yacystats.de/peer/2qTkDO4j_tSJ - denied_(the host 'www.yac
ystats.de' is global, but global addresses are not accepted: 62.75.214.113)
W 2014/09/06 17:52:09 SWITCHBOARD Crawl job '62_remotetriggeredcrawl' is paused: user request in Cra
wler_p from localhost
I 2014/09/06 17:52:10 HostQueue opened HostQueue T:\0_Tools\YaCy\yacy_en\DATA\INDEX\intranet\QUEUES\
CrawlerCoreStacks\localhost.8099 with 0 urls..

My problem is that a search finds none of the words from the crawled documents. The screenshot below shows the contents of the folder DATA\INDEX\intranet\SEGMENTS\solr_47\collection1\data\index. What else can I do to make the crawled documents searchable?
Attachments
test_intranet.jpg (118.93 KiB)
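
One way to check which documents the indexer actually processed is to pull the '*Indexed N words' entries out of the crawl log above. The following is only a minimal sketch: it assumes the log was pasted with the same hard line wraps as in this post, and the name indexed_documents is made up for illustration and is not part of YaCy.

```python
import re

# Excerpt copied from the crawl log in this post, including the console line wraps.
LOG_EXCERPT = """\
I 2014/09/06 17:51:22 SWITCHBOARD *Indexed 22 words in URL http://localhost:8099/repository/ [jnm6Ag
gzy7ic]
I 2014/09/06 17:51:55 SWITCHBOARD *Indexed 659 words in URL http://localhost:8099/repository/freex42
005pr.pdf [59af1Mgzy7ic]
I 2014/09/06 17:52:04 SWITCHBOARD *Indexed 357 words in URL http://localhost:8099/repository/YaCy-Fl
yer.pdf [5WtquMgzy7ic]
I 2014/09/06 17:52:04 SWITCHBOARD *Indexed 87 words in URL http://localhost:8099/ [zFOvZggzy7ic]
"""

# One "*Indexed" entry: word count, URL, and the URL hash in square brackets.
INDEXED = re.compile(r"\*Indexed (\d+) words in URL (\S+) \[([\w-]+)\]")


def indexed_documents(log_text):
    """Return (url, word_count, url_hash) for every '*Indexed' entry (hypothetical helper)."""
    # Assumption: the wraps cut lines mid-token, so joining them without a
    # separator restores the original URLs and hashes.
    unwrapped = log_text.replace("\n", "")
    return [(url, int(count), url_hash)
            for count, url, url_hash in INDEXED.findall(unwrapped)]


for url, words, url_hash in indexed_documents(LOG_EXCERPT):
    print(f"{words:5d} words  {url}  [{url_hash}]")
```

For this excerpt the sketch lists the repository directory page, the two PDF files and the start page. No .txt document appears among the indexed URLs in this part of the log.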

Re: Intranet: crawled repository documents are not found

Post by flegno » Sat Sep 06, 2014 6:08 pm

Here is the log of a search for the word 'global', which occurs in a .txt document in the repository folder:

Code:
I 2014/09/06 19:03:01 LOCAL_SEARCH ACCESS CONTROL: WHITELISTED CLIENT FROM 127.0.0.1 gets no search
restrictions
I 2014/09/06 19:03:01 LOCAL_SEARCH INIT WORD SEARCH: global:nHCTOv9rKm0I - 10 links to be computed,
10 lines to be displayed
I 2014/09/06 19:03:01 SearchEventCache getEvent: 1 in cache
I 2014/09/06 19:03:01 Protocol SOLR QUERY: defType=edismax&start=0&rows=10&facet=true&facet.mincount
=1&facet.limit=30&facet.sort=count&facet.method=fcs&facet.field=%7B%21ex%3Dcoordinate_p%7Dcoordinate
_p&facet.field=%7B%21ex%3Dhost_s%7Dhost_s&facet.field=%7B%21ex%3Durl_file_ext_s%7Durl_file_ext_s&fac
et.field=%7B%21ex%3Durl_protocol_s%7Durl_protocol_s&facet.field=%7B%21ex%3Dlanguage_s%7Dlanguage_s&f
l=*%2Cscore&q=httpstatus_i%3A200+AND+-url_file_ext_s%3A%28jpg+OR+png+OR+gif%29+AND+%28%28url_paths_s
xt%3A%22global%22%5E3.0%29+OR+%28synonyms_sxt%3A%22global%22%5E0.5%29+OR+%28title%3A%22global%22%5E5
.0%29+OR+%28text_t%3A%22global%22%5E1.0%29+OR+%28host_s%3A%22global%22%5E6.0%29+OR+%28h1_txt%3A%22gl
obal%22%5E5.0%29+OR+%28h2_txt%3A%22global%22%5E3.0%29%29&bq=clickdepth_i%3A0%5E0.8+clickdepth_i%3A1%
5E0.4
I 2014/09/06 19:03:01 YACY SEARCH (solr), returned 0 out of 0 documents from shard query = defType=e
dismax&start=0&rows=10&facet=true&facet.mincount=1&facet.limit=30&facet.sort=count&facet.method=fcs&
facet.field=%7B%21ex%3Dcoordinate_p%7Dcoordinate_p&facet.field=%7B%21ex%3Dhost_s%7Dhost_s&facet.fiel
d=%7B%21ex%3Durl_file_ext_s%7Durl_file_ext_s&facet.field=%7B%21ex%3Durl_protocol_s%7Durl_protocol_s&
facet.field=%7B%21ex%3Dlanguage_s%7Dlanguage_s&fl=*%2Cscore&q=httpstatus_i%3A200+AND+-url_file_ext_s
%3A%28jpg+OR+png+OR+gif%29+AND+%28%28url_paths_sxt%3A%22global%22%5E3.0%29+OR+%28synonyms_sxt%3A%22g
lobal%22%5E0.5%29+OR+%28title%3A%22global%22%5E5.0%29+OR+%28text_t%3A%22global%22%5E1.0%29+OR+%28hos
t_s%3A%22global%22%5E6.0%29+OR+%28h1_txt%3A%22global%22%5E5.0%29+OR+%28h2_txt%3A%22global%22%5E3.0%2
9%29&bq=clickdepth_i%3A0%5E0.8+clickdepth_i%3A1%5E0.4&hl=true&hl.fragsize=220&hl.simple.post=%3C%2Fb
%3E&hl.simple.pre=%3Cb%3E&hl.snippets=5&hl.fl=description_txt&hl.fl=h4_txt&hl.fl=h3_txt&hl.fl=h2_txt
&hl.fl=h1_txt&hl.fl=text_t
S 2014/09/06 19:03:01 BusyThread thread 'Balancer waiting for localhost: 1881 milliseconds' breaks f
or intermission: 2 seconds
I 2014/09/06 19:03:01 LOCAL_SEARCH EXIT WORD SEARCH: global - local_rwi_available(0), local_rwi_stor
ed(0), remote_rwi_available(0), remote_rwi_stored(0), remote_rwi_peerCount(0), local_solr_available(
0), local_solr_stored(0), remote_solr_available(0), remote_solr_stored(0), remote_solr_peerCount(0),
216 ms
I 2014/09/06 19:03:01 DidYouMean found 0 unsorted terms, returned 0 sorted suggestions; execution ti
me: 3ms
S 2014/09/06 19:03:01 BusyThread thread 'BusyThread net.yacy.crawler.data.CrawlQueues.remoteTriggere
dCrawlJob' breaks for intermission: 1 seconds
S 2014/09/06 19:03:02 BusyThread thread 'BusyThread net.yacy.contentcontrol.SMWListSyncThread.run' b
reaks for intermission: 1 seconds
S 2014/09/06 19:03:02 BusyThread thread 'BusyThread net.yacy.contentcontrol.ContentControlFilterUpda
teThread.run' breaks for intermission: 1 seconds
I 2014/09/06 19:03:03 DidYouMean found 0 unsorted terms, returned 0 sorted suggestions; execution ti
me: 1ms
D 2014/09/06 19:03:34 SWITCHBOARD Cleaning Incoming News, 0 entries on stack
I 2014/09/06 19:03:35 YACY rulebasedUpdateInfo: not an automatic update selected
I 2014/09/06 19:03:35 RESOURCE OBSERVER Volume T:\0_Tools\YaCy\yacy_en\DATA: free space (2926 MB) is
low, but nominal (< 4096 MB)
I 2014/09/06 19:03:35 NoticedURL CLEARING ALL STACKS
I 2014/09/06 19:03:35 SWITCHBOARD Solr auto-optimization: idleSearch=33640, idleAdmin=221, deltaOpti
mize=8258008, proccount=0
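
The URL-encoded Solr query in the log above is hard to read as pasted. As a rough sketch, it can be decoded with Python's standard library to see which fields the search actually covers. The helper name decode_solr_query is an assumption for illustration, not a YaCy or Solr function, and the facet parameters are left out to keep the example short.

```python
import re
from urllib.parse import parse_qs

# Copied from the "SOLR QUERY" log entry above. The string pieces are joined
# at the console line wraps; the facet.* parameters are omitted for brevity.
RAW_QUERY = (
    "defType=edismax&start=0&rows=10"
    "&q=httpstatus_i%3A200+AND+-url_file_ext_s%3A%28jpg+OR+png+OR+gif%29+AND+%28%28url_paths_s"
    "xt%3A%22global%22%5E3.0%29+OR+%28synonyms_sxt%3A%22global%22%5E0.5%29+OR+%28title%3A%22global%22%5E5"
    ".0%29+OR+%28text_t%3A%22global%22%5E1.0%29+OR+%28host_s%3A%22global%22%5E6.0%29+OR+%28h1_txt%3A%22gl"
    "obal%22%5E5.0%29+OR+%28h2_txt%3A%22global%22%5E3.0%29%29&bq=clickdepth_i%3A0%5E0.8+clickdepth_i%3A1%"
    "5E0.4"
)

# Matches one boosted clause such as (text_t:"global"^1.0).
CLAUSE = re.compile(r'\((\w+):"([^"]*)"\^([\d.]+)\)')


def decode_solr_query(raw):
    """Return the decoded parameters and the (field, term, boost) clauses of q."""
    # parse_qs turns '+' into spaces and resolves the %xx escapes.
    params = {name: values[0] for name, values in parse_qs(raw).items()}
    clauses = [(field, term, float(boost))
               for field, term, boost in CLAUSE.findall(params["q"])]
    return params, clauses


params, clauses = decode_solr_query(RAW_QUERY)
print(params["q"])
for field, term, boost in clauses:
    print(f"{field:15} {term!r} boost={boost}")
```

Decoded, the query only accepts documents with httpstatus_i:200, excludes the extensions jpg, png and gif, and looks for "global" in seven fields, among them text_t, which presumably holds the document body. The log reports 0 matching documents for this query.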

