Import Mediawiki

Post by promocore » Sat Apr 01, 2017 1:17 pm

Hello,

I would like to import the current Wikipedia, but it does not work via the import function [ IndexImportMediawiki_p.html ].

YaCy runs on Debian without a desktop on my machine.
Does the import file have to be on localhost (the server), or can it also be on the client from which I open the web interface?
Can I also trigger the import from the Debian console/shell?
Has anyone already tested the import function successfully?

Regards, promocore

Re: Import Mediawiki

Post by luc » Mon Apr 03, 2017 3:58 pm

Hello promocore, I hope an answer in English is better than nothing.
Indeed, the MediaWiki dump import works, but the user interface is currently a bit confusing/buggy. To make it work, you must either:
- put the dump file in your YaCy server install folder and then choose it with the browser upload field (this only works if your YaCy server is running on the same computer as your browser)
- OR call the URL directly this way:
Code:
http://peerhost:8090/IndexImportMediawiki_p.html?file=file:///absolute/server/path/to/yourdump.xml.bz2
(the importmediawiki.sh script runs this way)

In both cases, the dump file has to be on the same computer as the YaCy peer.
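
The direct URL form described above can also be assembled programmatically. The following is a minimal Python sketch, not part of YaCy: the helper name `build_import_url` is hypothetical, and the host, port, and path are the placeholder values from the example URL. It assumes a Linux-style absolute path on the server and only builds the URL string; it does not contact the peer.

```python
from urllib.parse import quote


def build_import_url(host, port, dump_path):
    """Build the IndexImportMediawiki_p.html URL for a dump stored on the YaCy server.

    dump_path must be the absolute path of the dump as seen on the server,
    not on the machine running the browser.
    """
    if not dump_path.startswith("/"):
        raise ValueError("dump_path must be an absolute path on the YaCy server")
    # "file://" + "/absolute/path" gives the three-slash form file:///absolute/path.
    # quote() keeps "/" and escapes characters such as spaces.
    file_url = "file://" + quote(dump_path)
    return f"http://{host}:{port}/IndexImportMediawiki_p.html?file={file_url}"


# Placeholder values taken from the example URL in this post.
print(build_import_url("peerhost", 8090, "/absolute/server/path/to/yourdump.xml.bz2"))
```

Rejecting relative paths early reflects the point made above: the path is resolved on the YaCy peer, so it has to be unambiguous there.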

One last confusing thing: the browser then regularly refreshes the /IndexImportMediawiki_p.html page, showing the progress, but never clearly indicates when the task has finished (at least in my last import test).

It also bothered me last time I used this feature. I will try to find some time to improve these points.

Have a nice day

Re: Import Mediawiki

Post by promocore » Tue Apr 04, 2017 8:36 am

Hi luc, thanks for your reply.
I understand: I can only import wiki dumps if I have a browser on my YaCy server.
In my case, YaCy runs on Linux and I only have a shell to control YaCy on the local machine.

Do you know a solution to import wiki dumps without a browser?

Re: Import Mediawiki

Post by luc » Tue Apr 04, 2017 12:26 pm

From a Linux shell you can proceed as follows:
- get the dump onto your Linux machine, for example with curl, wget, or scp
- on a Debian install, go to /usr/share/yacy/bin
- then run
Code:
sh importmediawiki.sh /yourpath/yourdump.xml.bz2

- with your browser you can then visit the /IndexImportMediawiki_p.html page to follow the progress of the import task

As I said, you could also trigger the import remotely from your browser (URL /IndexImportMediawiki_p.html?file=file:///absolute/server/path/to/yourdump.xml.bz2): what matters is that you first download the dump to your Linux machine and then set the "file" parameter to the dump file path as it appears on that remote Linux machine.
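
If you prefer to drive the shell steps from a script, they could be wrapped roughly as follows. This is only a sketch: the function names are hypothetical, and the directory is the Debian location given above. The dump path is made absolute before the working directory changes, because a relative path would otherwise no longer point at the dump.

```python
import os
import subprocess

# Location of the YaCy helper scripts on a Debian install, as given in this post.
YACY_BIN_DIR = "/usr/share/yacy/bin"


def build_import_command(dump_path):
    """Return the argument list for importmediawiki.sh with an absolute dump path."""
    # The script is started from YACY_BIN_DIR, so resolve a relative path
    # against the caller's current directory first.
    return ["sh", "importmediawiki.sh", os.path.abspath(dump_path)]


def run_import(dump_path):
    """Run the import script from the YaCy bin directory and return its exit code."""
    result = subprocess.run(build_import_command(dump_path), cwd=YACY_BIN_DIR)
    return result.returncode
```

Calling `run_import("/yourpath/yourdump.xml.bz2")` on the YaCy machine would then be equivalent to the `sh importmediawiki.sh` command shown above.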

Re: Import Mediawiki

Post by luc » Fri May 05, 2017 8:57 am

For information, I pushed a few improvements related to the MediaWiki dump import, hopefully making it a bit more reliable and easier to handle without command line operations (notably direct import from a dump HTTP URL, and scheduling).

They are already available in the latest YaCy sources on GitHub or as a developer system update ("dev 1.92/9199 (unsigned)" entry in /ConfigUpdate_p.html), for those who don't use a package manager.

Re: Import Mediawiki

Post by promocore » Mon May 08, 2017 6:30 pm

Hi luc, thanks for the information.

I will try to import the wiki again in the future. With 1.92.000, the import starts on my YaCy, but stops after importing 10-15 documents.

Re: Import Mediawiki

Post by luc » Thu May 11, 2017 5:43 am

Ok, if you tell me here exactly which dump file failed, I can already check whether it works with the latest modifications (I tested mostly with French and English wiki dumps from dumps.wikimedia.org).

Re: Import Mediawiki

Post by promocore » Fri May 19, 2017 11:13 pm

Hi luc,

I used a German Wikipedia dump for my import, but I have since deleted the file from my drive.

I have now installed my YaCy client from source, and the import of the current German Wikipedia dump works like a charm.
Thanks for your great improvement.

My YaCy only fails if I import and index at the same time, or if YaCy is running huge crawls.
When that happens, the GC memory goes out of range and the import and indexing fail, but it may well be that my RAM settings are not optimized.

Re: Import Mediawiki

Post by luc » Sat May 20, 2017 11:13 am

Thank you promocore for your feedback, and glad to know the import worked with the German Wikipedia dump.

What are your memory settings? If you wish to share your log traces from when the import and indexing processes failed, they can always be useful for further improvements.

Have a nice day

Re: Import Mediawiki

Post by promocore » Wed May 24, 2017 4:46 pm

After 3 days of YaCy running, I got the memory error: Caused by: java.lang.OutOfMemoryError: Java heap space

I am only indexing at 50 PPM on 1 domain.

My YaCy PC has 15 GB RAM and YaCy can use up to 8 GB.

If I do something simultaneously, like importing and crawling, or only heavy crawling, I get the error within a few hours.
Have I set something grossly wrong?


[Attachment: RAM-Settings.JPG (127.02 KiB)]
[Attachment: yacylog.zip (199.91 KiB)]

Re: Import Mediawiki

Post by luc » Sat May 27, 2017 9:30 am

Ok, as far as I know your settings look fine, and 8 GB dedicated to YaCy should be enough. When I have some time I will try to run a scenario similar to the one you describe (MediaWiki import while crawling) and check what happens.

For information, here is a tip that may help when debugging this kind of failure: running YaCy with the
Code:
-XX:+HeapDumpOnOutOfMemoryError
advanced JVM option (present as a comment in the build.xml file of the YaCy sources) will produce a heap memory dump once the OutOfMemoryError occurs. This dump can then be opened with a tool such as JVisualVM, and it can really help find which part of the code is using too much memory.

Re: Import Mediawiki

Post by promocore » Sat May 27, 2017 12:14 pm

I found the file and the options.

Code:
  <java classname="net.yacy.yacy" fork="yes">
      <classpath>
        <pathelement location="${build}"/>
        <pathelement location="${htroot}"/>
        <pathelement location="${lib}" />
        <fileset dir="${lib}" includes="**/*.jar" />
      </classpath>
      <arg line="-start"/>
      <jvmarg line="-Xdebug"/>
      <jvmarg line="-Xnoagent"/>
      <jvmarg line="-Djava.compiler=none"/>
      <jvmarg line="-Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=y"/>
      <!-- Dump memory heap when an OutOfMemoryError occurs -->
      <!-- <jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/> -->
      <!-- Dump path -->
      <!-- <jvmarg line="-XX:HeapDumpPath=/your_path/"/> -->
      <!-- Log JAXP XML parsers Debug information -->
      <!-- <jvmarg line="-Djaxp.debug=1"/> -->
    </java>


I have now uncommented these settings and edited the path:

<jvmarg line="-XX:+HeapDumpOnOutOfMemoryError"/>
<jvmarg line="-XX:HeapDumpPath=/your_path/"/>

Re: Import Mediawiki

Post by promocore » Sun May 28, 2017 12:06 pm

I got the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error again, but I cannot find a dump in the folder I specified for YaCy.
Is there anything else I have to do?

[Attachment: yacy-log.7z (241.97 KiB)]

Re: Import Mediawiki

Post by luc » Sun May 28, 2017 2:32 pm

It looks like I confused you a little with the example in the build.xml file: that one is only useful if you run YaCy in debug mode from compiled sources with Apache Ant. If you run YaCy using the startYACY.sh or startYACY.bat script, it is in that script that you have to add the JVM option "-XX:+HeapDumpOnOutOfMemoryError" at the appropriate place to get the dump generated...
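
Both options quoted in this thread, -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath, are standard HotSpot JVM flags. Where exactly the start script assembles its Java options is not shown here, so the following Python sketch only illustrates composing the option list without duplicating a flag. The helper name and the "-Xmx8g" starting option are hypothetical.

```python
def with_heap_dump_options(jvm_args, dump_dir=None):
    """Return a copy of jvm_args with the HotSpot heap-dump options added once."""
    result = list(jvm_args)
    flag = "-XX:+HeapDumpOnOutOfMemoryError"
    if flag not in result:
        result.append(flag)
    # Only add a dump path if one was requested and none is set already.
    has_path = any(arg.startswith("-XX:HeapDumpPath=") for arg in result)
    if dump_dir is not None and not has_path:
        result.append(f"-XX:HeapDumpPath={dump_dir}")
    return result


# Hypothetical existing options; the real ones depend on your start script.
args = with_heap_dump_options(["-Xmx8g"], dump_dir="/your_path/")
print(" ".join(args))
```

The resulting string is what would have to end up on the `java` command line that the start script builds.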

Re: Import Mediawiki

Post by promocore » Mon May 29, 2017 12:07 pm

Thanks luc, I got the dump.
Maybe you or somebody else can take a look at the dump.
It would be really great to find out why I get the error.

Download Log
Download Dump

Re: Import Mediawiki

Post by promocore » Sat Jun 03, 2017 4:35 pm

I now know the reason.
8 GB RAM was not enough. In the last few days, YaCy would not start anymore. After expanding the RAM, everything works fine.

Re: Import Mediawiki

Post by luc » Mon Jun 05, 2017 1:10 am

Hi promocore, good to know everything now works fine for you!

And sorry, I missed your answer with the log and dump so I could not have a look at them. Your links now return an HTTP 404 status, but if you would like to share the files again, I am still interested in checking what consumes so much memory...

Re: Import Mediawiki

Post by promocore » Fri Jun 09, 2017 7:38 pm

Hi luc,

I deleted the files from my webspace and I only have a backup of my logs.
Download Log

I found another dump from testing the dump setting in startYACY.sh.
It should contain the same memory error.

Download Dump

[Attachment: yacy-eclips-memory.jpg (99.32 KiB)]


Thanks for your help :)

Re: Import Mediawiki

Post by luc » Sat Jun 10, 2017 9:34 pm

Ok, I was able to download your dump. Thank you for sharing such a big file!
I just need some time now to analyse it and check to what extent the memory footprint could be improved...

Re: Import Mediawiki

Post by luc » Sat Jun 17, 2017 1:49 pm

As you noticed @promocore, in your dump about 90% of the RAM is used by RAMIndex instances, which are themselves used by the various YaCy internal data tables that store all persistent information other than the Solr index.

Looking carefully at the existing code, this appears to be the normal behavior: these internal tables are backed by files, but to improve performance they are also duplicated as memory structures loaded in RAM.

So whatever amount of RAM you dedicate to YaCy, if you run YaCy long enough to get some data from crawls, DHT or search queries, RAM will fill up with the content of internal tables until only 200 MB or 10% of the total available memory remains free (see the Table constructor). When this threshold is reached, some tables lose their RAM copy, and accesses are made through the backing file.
It looks like this condition is indeed respected in your case (about 90% of RAM used by RAMIndex instances), but the remaining 10% is not enough to run some operations concurrently without OutOfMemory errors.

So what's next? Covering and automatically fixing all possible out-of-memory scenarios is far from easy, especially with quite complex systems such as YaCy. But I am considering a pragmatic solution that is simple to code: make the total amount of RAM occupied by YaCy internal tables user-configurable. The default would remain the same (stop filling when free memory is below 200 MB or 10%), but the user could modify it so that internal tables use, for example, at most 70% of the available RAM. It is not perfect, as it needs some trial and error, but at least once you have the right setting you would not get OutOfMemory errors anymore.
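
To make the rule concrete, here is a small Python sketch of the threshold check and of the proposed configurable variant. It is not YaCy code: the function and parameter names are hypothetical, and it assumes that caching stops as soon as free memory falls below either limit, which is one reading of "200 MB or 10%". The 70% cap is modeled simply as keeping 30% of memory free.

```python
MB = 1024 * 1024


def may_cache_table_content(free_bytes, total_bytes,
                            min_free_bytes=200 * MB, min_free_fraction=0.10):
    """Return True while table content may still be copied into RAM.

    Assumption: caching stops as soon as free memory drops below either
    limit (the absolute one or the fraction of total memory).
    """
    threshold = max(min_free_bytes, min_free_fraction * total_bytes)
    return free_bytes >= threshold


total = 8 * 1024 * MB  # 8 GB dedicated to YaCy, as in this thread

# Default rule: with 8 GB, the 10% limit (about 819 MB) is the larger one.
print(may_cache_table_content(1024 * MB, total))  # True: 1 GB free is above the limit
print(may_cache_table_content(500 * MB, total))   # False: below 10% of 8 GB

# Proposed setting, modeled as "keep 30% free" instead of 10%.
print(may_cache_table_content(1024 * MB, total, min_free_fraction=0.30))  # False
```

With a larger reserved fraction, tables stop being cached earlier, which leaves more headroom for concurrent operations at the cost of more file accesses.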

So what do you think about it?

Re: Import Mediawiki

Post by promocore » Sat Jun 17, 2017 3:34 pm

Hi luc,

thank you for debugging and for your reply.

Setting a new limit for caching internal tables would be a great solution, because it would allow YaCy peers to build a larger library even if they cannot expand their RAM.

With 80,000,000 documents I need 1 GB RAM for the YaCy processes. I think setting the limit to 70-80% of RAM for the internal tables leaves enough room to scale up the number of YaCy documents.

Re: Import Mediawiki

Post by luc » Tue Jun 20, 2017 8:39 am

Arr, after deeper analysis and more debugging, it appears I was a bit confused by YaCy's deep internals... The rules I mentioned about memory allocation for the internal tables cache are right, but that is not the scenario you ran into in the dump you provided.
In your case, 90% of memory is not dedicated to the cache of internal table content, but only to the index keys of these tables. And that is not the same thing: while caching table content/values is optional and depends on the available amount of memory (the values can be accessed directly through the backing files), with the current implementation table keys must be held entirely in memory for things to work properly.
So with the number of documents on your peer... as far as I understand, there is no simple and easy solution to reduce the memory footprint. :?
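
A back-of-the-envelope Python sketch shows why this scales with the number of documents: memory for keys grows linearly with the entry count and cannot be released the way cached values can. The function name is hypothetical and the bytes-per-entry figure is a made-up illustrative value, not a measured YaCy number; only the document count comes from this thread.

```python
def estimate_key_index_bytes(entry_count, bytes_per_entry):
    """Rough lower bound for the RAM held by one in-memory key index."""
    return entry_count * bytes_per_entry


# 80,000,000 documents is the figure mentioned earlier in this thread.
# 16 bytes per entry is purely illustrative, not a measured YaCy value.
needed = estimate_key_index_bytes(80_000_000, 16)
print(f"{needed / 1024**3:.2f} GiB for one such index")
```

Since several internal tables each keep their own keys, the real total would be a multiple of whatever the per-table figure turns out to be.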

But I am still digging into the subject. These core internals are not so easy to deal with, and there are certainly optimization paths I missed.