new page =failed. Reason: exist-test failed: Error executing

Post by xioc752 » Tue Jul 12, 2016 12:32 pm

Hello,
I had the same problem before on 1.83, and it is now appearing again on a fresh install of 1.90.
The DATA folder was moved over from the previous install (it is essential to preserve it).

When adding a never-crawled URL to the Advanced Crawler for a specific page, I get:
Crawling of "http://www.theguardian.com/politics/2016/jul/11/.... " failed. Reason: exist-test failed: Error executing query/

Running Windows 10, in MS Azure cloud Basic A1 (1 Core, 1.75 GB memory)
1,200 MB dedicated to YaCy (no competing programs running simultaneously in the same Windows vm space)
127 GiB attached disk, but still running on YaCy's 'native' internal Solr
YaCy version: 1.90/9000
Uptime: 0 days 00:58
Java version: 1.8.0_91
Processors: 1
Memory Usage
RAM used: 389.35 MB
RAM max: 1.13 GB
DISK used: (approx.) 5.25 GB
DISK free: 107.44 GB
31747 documents
Robinson Mode
Documents
solr search api 31,747 2
Webgraph Edges
solr search api 1,962,370 1
Citations
(reverse link index) 492,217 1
RWIs
(P2P Chunks)
---
Crawler PPM 0
Postprocessing Progress
idle 00:00
pending: collection=29762 webgraph=1962370
Traffic (Crawler) 0.64 MB
Load -1

How can I get this to start crawling and indexing normally again, please?
Thank you for your patient advice.
xioc752
 
Posts: 68
Registered: Mon Jul 28, 2014 5:01 pm

Post by luc » Wed Jul 13, 2016 1:05 pm

Hi, I just successfully crawled "http://www.theguardian.com/politics/2016/jul/11/who-will-be-in-theresa-mays-cabinet-government" and other similar pages with a YaCy 1.91/9013 peer and the default advanced parameters.
There have not been many changes since 1.90/9000, so can you detail the other parameters you used in /CrawlStartExpert.html?
luc
 
Posts: 235
Registered: Wed Aug 26, 2015 1:04 am

Post by xioc752 » Wed Jul 13, 2016 2:48 pm

Hi, and thanks for replying.
I, too, just crawled that page and got this:
Crawling of "http://www.theguardian.com/politics/2016/jul/11/who-will-be-in-theresa-mays-cabinet-government " failed. Reason: exist-test failed: Error executing query/

The server runs in a Windows vm as a standalone Robinson-mode public peer
(a Microsoft Windows 10 instance in the Azure cloud computing environment).
The idea is to later convert several such Robinson servers into a dedicated private group of servers with full DHT + P2P, for searching a special topic.
Network definition = defaults/yacy.network.allip.unit

Re: the parameters used in /CrawlStartExpert.html
Defaults, except for these:
Crawling Depth = 0
Use Special User Agent and robot identification = Random browser or 'greedy' mode

The full settings are listed below...
Crawling Depth 1 + also all linked non-parsable documents [selected]
Unlimited crawl depth for URLs matching with [not selected]
Maximum Pages per Domain Use: [not selected] Page-Count: [not selected] 10000
misc. Constraints Accept URLs with query-part ('?'): [selected]
Obey html-robots-noindex: [selected]
Obey html-robots-nofollow: [not selected]
Load Filter on URLs must-match
Restrict to start domain(s) [selected]
Restrict to sub-path(s) [not selected]
Use filter .* [not selected] (must not be empty)
must-not-match
Load Filter on IPs .* [not selected] must-match (must not be empty)
must-not-match [not selected]
Must-Match List for Country Codes: no country code restriction
Use filter [not selected]
AD,AL,AT,BA,BE,BG,BY,CH,CY,CZ,DE,DK,EE,ES,FI,FO,FR,GG,GI,GR,HR,HU,IE,IM,IS,IT,JE,LI,LT,LU,LV,MC,MD,MK,MT,NL,NO,PL,PT,RO,RU,SE,SI,SJ,SK,SM,TR,UA,UK,VA,YU
Document Filter [not selected]
These are limitations on index feeder. The filters will be applied after a web page was loaded.

Filter on URLs
must-match .* [not selected] (must not be empty)
must-not-match [not selected]
Filter on Content of Document [not selected]
(all visible text, including camel-case-tokenized url and title)
must-match .* [not selected](must not be empty)
must-not-match [not selected]
Clean-Up before Crawl Start No Deletion Do not delete any document before the crawl is started.
Delete sub-path [not selected] For each host in the start url list, delete all documents (in the given subpath) from that host.
Delete only old [not selected] Treat documents that are loaded ago as stale and delete them before the crawl is started.
Double-Check Rules
No Doubles [selected] Never load any page that is already known. Only the start-url may be loaded again.
Re-load [not selected] Treat documents that are loaded ago as stale and load them again. If they are younger, they are ignored.
Document Cache
Store to Web Cache [selected]
Policy for usage of Web Cache
no cache [not selected] if fresh [selected] if exist [not selected] cache only [not selected]
Robot Behaviour
Use Special User Agent and robot identification [Random Browser]
Snapshot Creation
Max Depth for Snapshots -1
Multiple Snapshot Versions [selected] replace old snapshots with new one [not selected] add new versions for each crawl
must-not-match filter for snapshot generation
Index Attributes
Indexing
index text: [selected] index media: [selected] Add Crawl result to collection(s) user
Time Zone Offset -120

---
How and where did you get YaCy 1.91/9013, please? I'd like to update everything to the latest version ASAP. Thanks!
I cannot find it: the updater in the Windows version 1.90/9000 does not show it as available, and from here I somehow cannot find it through Google search either.

Many thanks for your patient help!
xioc752
 

Post by luc » Wed Jul 13, 2016 11:22 pm

I retried in Robinson mode with the "allip" network config and the same parameters as yours, and still got no error. To be sure, tomorrow I will retry on Windows with YaCy 1.90/9000...

From a quick look for the error "exist-test failed" in the source code, it seems YaCy has a problem accessing your Solr index when checking whether the URL has already been crawled. Did you try again after restarting your YaCy peer? And does a basic crawl (/CrawlStartSite.html) work?
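
To poke at the index directly, a rough Python sketch of such an existence query could look like the one below. It is not YaCy's own exist-test code; it only builds a URL for the peer's Solr select interface. The `/solr/select` path, port 8090, and `sku` as the field holding the document URL are assumptions here and should be checked against your peer before relying on them.

```python
from urllib.parse import urlencode


def build_exists_query(peer_base, page_url, url_field="sku"):
    """Build a Solr select URL asking whether page_url is already indexed.

    Assumptions (not confirmed in this thread): the peer serves Solr at
    /solr/select and stores the document URL in the field named by url_field.
    """
    # Escape backslashes and double quotes so the URL can sit inside a quoted phrase.
    escaped = page_url.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{url_field}:"{escaped}"',
        "rows": 0,      # only the hit count matters, not the documents
        "wt": "json",
    }
    return f"{peer_base.rstrip('/')}/solr/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local peer address; adjust to your own setup.
    print(build_exists_query(
        "http://localhost:8090",
        "http://www.theguardian.com/politics/2016/jul/11/"
        "who-will-be-in-theresa-mays-cabinet-government",
    ))
```

If opening the printed URL in a browser also returns an error instead of a hit count, that would suggest the problem lies in the Solr index itself and not in the crawl start parameters.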

The peer I run is in a Docker container on a VPS, and it has this version because it is automatically built from the latest sources in the main YaCy git repository. You can find it on Docker Hub.

If you really want the very latest changes running on Windows, I am afraid you will have to build it yourself from the latest sources, or wait for a new official build to be made available on http://kaskelix.de/update/, by Orbiter I guess.
luc
 

Post by xioc752 » Thu Jul 14, 2016 7:54 am

Thank you for your patient and detailed reply and testing.

Your comment on potential problems reading the Solr files flags a recurring issue.
We have seen this problem before, in 1.83/9857 and related generations.
It shows up notably when we need to move DATA sets to a fresh, certified, public build.
It happens most frequently when the move becomes urgently needed.
It happens when /ConfigHeuristics_p.html is suddenly unavailable and shows Solr error messages, due to a breakdown in the surrounding machine.
Typically it is on the Heuristics page that we see this failure, and when it fails, it produces what seems to be a non-recoverable error.

Following instructions from Orbiter, some years ago, for this specific type of case, we have relied on moving DATA sets to fresh, healthy installs.
This has not always been successful, to say the least.
We have lost access to DATA sets.
Our DATA sets are frequently on the order of 20 GB or more.

Despite carefully recovering and patiently moving them, we frequently have huge trouble getting the moved DATA pack to be read at all.
This shows up as the new 'fresh build' vm refusing to start.
Removing the DATA set and letting the new vm build a new, empty DATA set shows that the vm is comparatively healthy.
Stopping it properly, removing the 'fresh' empty DATA set, replacing it with our DATA set, and restarting, even in 'administrator' mode in the Windows version, produces a 'no start' reaction.
I have struggled with this in Ubuntu installs for 3+ years and more recently in Windows installs.
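
One thing that can at least be ruled out before the restart is an incomplete or damaged copy. The Python sketch below compares the original DATA folder with the moved one, file by file, using SHA-256 digests. It knows nothing about YaCy's internal file formats and cannot say why a peer refuses to start; the function names and the example paths are purely illustrative.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so multi-GB index files do not fill memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def compare_trees(source, copy):
    """Return (missing, extra, changed) relative file paths of copy vs. source."""
    src, dst = tree_digests(source), tree_digests(copy)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    changed = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, extra, changed


if __name__ == "__main__":
    # Hypothetical paths; run this only while both peers are stopped.
    missing, extra, changed = compare_trees("D:/old_yacy/DATA", "D:/new_yacy/DATA")
    print("missing:", missing)
    print("extra:", extra)
    print("changed:", changed)
```

Three empty lists mean the copy is byte-identical, so any start-up failure would have to come from something other than the transfer.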

I am wondering about a UTF-8 issue.
I may be totally wrong, of course.
Our focused crawls are done in up to 40 languages. Our DATA sets are, of necessity, heavily loaded with many languages and types of characters.
Looking at the 'symptoms' in front-end search results may give us a clue to the underlying problem, or to one of them.
Users need to search from the front end in any of the source languages, of course.

I have noticed, also, that sometimes when a front end search is done specifically in Ukrainian or Russian (for example) the results are very limited, even though there has been a lot of suitable original data mined and cross indexed, previously.
Results in Ukrainian are smaller than Russian results and sometimes only a few pages of results display.
The front end search, using a built in generic display panel, tends to go back to Page 1 - even when a Page 3 or Page 4 is selected to be displayed next.
This happens also when we know that there is substantial 'extra' data results that can be displayed.
The number of potential results shown at the top of the results page also seems smaller than what should be available.
Reverting to Page 1 happens in cases where we know there is more data and larger available results are shown to be available.
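
To separate the display panel from the index when checking this, the same query could be sent page by page outside the front end. The Python sketch below only builds the request URLs and shows that a Cyrillic search term is percent-encoded as UTF-8. The `/yacysearch.json` path and the `query`, `startRecord` and `maximumRecords` parameter names are assumptions about the peer's search API and should be verified before use.

```python
from urllib.parse import urlencode


def paged_search_url(peer_base, query, page, page_size=10):
    """Build a search URL for a given 1-based result page.

    The endpoint path and parameter names are assumptions, not confirmed
    in this thread. urlencode percent-encodes non-ASCII text as UTF-8.
    """
    params = {
        "query": query,
        "startRecord": (page - 1) * page_size,  # offset of the first result
        "maximumRecords": page_size,
    }
    return f"{peer_base.rstrip('/')}/yacysearch.json?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local peer; request pages 1 to 4 for a Ukrainian term.
    for page in range(1, 5):
        print(paged_search_url("http://localhost:8090", "Київ", page))
```

If pages 3 and 4 return results this way while the panel jumps back to Page 1, the paging problem would sit in the display and not in the stored data.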

Can this create a weakness in the later readability of DATA sets that have been moved to a fresh vm?
If 'yes,' how can this be 'strengthened' to make the DATA sets more reliably readable, when moved?

I appreciate your thoughts on this, please.
As noted above, I may be totally off base in this examination, but some of the impacts could - perhaps - be as we are seeing them.

Thank you very kindly, once again, for your patient help!
Your seasoned expertise is most appreciated :)
xioc752
 

Post by luc » Thu Jul 14, 2016 2:14 pm

Hi, thanks for your very detailed operational feedback. I think it is very valuable for further improvements.
I am not very experienced in running highly available YaCy peers. I have only been running one or two always-on remote peers for personal use for some months now, with index sizes of only a few gigabytes. I have also lost some indexed DATA, but mainly because of handling errors when I started playing with Docker.

Maybe backing up your index with the 'Index Export' feature (/IndexExport_p.html) and then importing it would be more reliable than directly reusing DATA folders. Have you experimented with this? (I don't know whether it would run in a reasonable amount of time for data sets as large as yours...)

Another possibility I am thinking of (but have not tried yet) would be to use an external Solr server rather than the default embedded one in YaCy. Embedded Solr works fine and makes YaCy an autonomous application, but I am not sure it is a good option for large production data sets (see the Solr wiki and documentation about this).

By the way, this kind of issue is not the most obvious to investigate and solve. It would certainly help to do some debugging when you encounter the issue... Maybe you have some people with development skills in your group?
luc
 

Post by smokingwheels » Wed Jul 20, 2016 11:50 am

If you want to run the latest code from GitHub...
Muddle and adjust your way through this wiki: https://github.com/loklak/loklak_server/wiki/Setting-up-Loklak-on-Windows-32-Bit
Take the time to read the sources carefully.
I did it for Loklak.org and have run YaCy for several weeks.
Then follow the Linux instructions for YaCy.
https://github.com/yacy/yacy_search_server
Since you are on Windows, you must start YaCy with the BAT file.
If you want.
smokingwheels
 
Posts: 102
Registered: Sat Aug 31, 2013 7:16 am

Post by xioc752 » Wed Jul 20, 2016 2:59 pm

Thank you, smokingwheels.
I will go and look. :) Thanks for the tips, too.

We are still trying to find a strong, reliable solution to two problems:
1) the "new page =failed. Reason: exist-test failed: Error executing" issue, where pages known to be at their first crawl are incorrectly reported as failed, so the initial crawl is stopped [as described above], and

2) previously harvested, formerly functional DATA sets not being readable when moved to a fresh YaCy install.

I am wondering whether it is necessary to reinstall the same, by now older, version / generation of YaCy to achieve readability.
It was my understanding that this was not required.

We have many large, older DATA sets after several years of cloud-based harvesting, and now we need to 'mount' the older data sets and use them.

Many thanks for your expertise!
xioc752
 