Import Mediawiki

Post by promocore » Sat Apr 01, 2017 1:17 pm

Hello,

I would like to import the current Wikipedia, but it does not work through the import function [ IndexImportMediawiki_p.html ].

YaCy runs on Debian without a desktop on my machine.
Does the import file have to be on the YaCy host itself, or can it also be on the client from which I open the web interface?
Can I also start the import from the Debian console/shell?
Has anyone already tested the import function successfully?



Regards, promocore
promocore
 
Posts: 55
Registered: Mon Feb 08, 2016 8:50 pm

Re: Import Mediawiki

Post by luc » Mon Apr 03, 2017 3:58 pm

Hello promocore, I hope an answer in English is better than nothing.
Indeed, the MediaWiki dump import works, but the user interface is currently a bit confusing/buggy. To make it work, you must either:
- put the dump file in your YaCy server install folder and then choose it with the browser upload field (this only works if your YaCy server is running on the same computer as your browser)
- OR call the URL directly, like this:
Code:
http://peerhost:8090/IndexImportMediawiki_p.html?file=file:///absolute/server/path/to/yourdump.xml.bz2
(the importmediawiki.sh script runs this way)

In both cases, the dump file has to be on the same computer as the YaCy peer.
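
The direct URL call can also be made from a shell instead of a browser. What follows is only a minimal sketch, assuming curl is installed and the peer listens on port 8090 as in the URL above. The function name `build_import_url`, the `admin` account name and the `YACY_PASSWORD` variable are placeholders for illustration, not something taken from importmediawiki.sh, and whether credentials are needed at all depends on the peer's admin settings.

```shell
#!/bin/sh
# Build the import URL for a dump file that already lies on the YaCy server.
# Arguments: host, port, absolute path of the dump as seen on the server.
# Note: the path is inserted as-is, so paths containing spaces or other
# special characters would need URL encoding first (not handled here).
build_import_url() {
    # "file://" followed by an absolute path (starting with "/") gives "file:///..."
    printf 'http://%s:%s/IndexImportMediawiki_p.html?file=file://%s\n' "$1" "$2" "$3"
}

# Example call, commented out so nothing is sent by accident.
# The account name and YACY_PASSWORD are hypothetical placeholders.
# curl --silent --anyauth --user "admin:$YACY_PASSWORD" \
#     "$(build_import_url localhost 8090 /absolute/server/path/to/yourdump.xml.bz2)"
```

Keeping the URL construction in its own function makes it easy to print and check the URL before actually sending the request.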

One last confusing thing: the browser then regularly refreshes the /IndexImportMediawiki_p.html page, showing the progress, but never clearly indicates when the task has finished (at least on my last import test).

It also bothered me last time I used this feature. I will try to find some time to improve these points.

Have a nice day
luc
 
Posts: 265
Registered: Wed Aug 26, 2015 1:04 am

Re: Import Mediawiki

Post by promocore » Tue Apr 04, 2017 8:36 am

Hi luc, thanks for your reply.
If I understand correctly, I can only import wiki dumps if I have a browser on my YaCy server.
In my case, YaCy runs on Linux and I only have a shell to control YaCy on the local machine.

Do you know a way to import wiki dumps without a browser?

Re: Import Mediawiki

Post by luc » Tue Apr 04, 2017 12:26 pm

From a Linux shell you can proceed as follows:
- get the dump onto your Linux machine, for example with curl, wget, or scp
- on a Debian install, go to /usr/share/yacy/bin
- then run
Code:
sh importmediawiki.sh /yourpath/yourdump.xml.bz2

- with your browser you can then visit the /IndexImportMediawiki_p.html page to follow the progress of the import task
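
The steps above could be wrapped in a small helper. This is only a sketch under assumptions: the `YACY_BIN` variable, the `import_dump` function and the `DUMP_URL` placeholder are made up for illustration, and the only things taken from this thread are the Debian path and the importmediawiki.sh call.

```shell
#!/bin/sh
# Folder holding importmediawiki.sh; the default is the Debian path named above.
YACY_BIN=${YACY_BIN:-/usr/share/yacy/bin}

# Run the import script for one dump file after checking that it is readable.
# Pass an absolute path, because the function changes directory before
# calling the script.
import_dump() {
    dump=$1
    if [ ! -r "$dump" ]; then
        echo "dump file not readable: $dump" >&2
        return 1
    fi
    # Same call as above, run from the script's own folder in a subshell
    (cd "$YACY_BIN" && sh importmediawiki.sh "$dump")
}

# Example, commented out: fetch the dump first, then import it.
# DUMP_URL is a placeholder for the dump you picked on dumps.wikimedia.org.
# wget -O /yourpath/yourdump.xml.bz2 "$DUMP_URL"
# import_dump /yourpath/yourdump.xml.bz2
```

The readability check is there so that a mistyped path fails immediately instead of starting an import task that cannot find its file.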

As I said, you could also trigger the import remotely from your browser (URL /IndexImportMediawiki_p.html?file=file:///absolute/server/path/to/yourdump.xml.bz2). What matters is that you first download the dump to your Linux machine and then set the "file" parameter to the dump file path as it appears on the remote Linux machine.

Re: Import Mediawiki

Post by luc » Fri May 05, 2017 8:57 am

For information, I pushed a few improvements to the MediaWiki dump import, hopefully making it a bit more reliable and easier to handle without command line operations (notably direct import from a dump HTTP URL, and scheduling).

They are already available in the latest YaCy sources on GitHub, or as a developer system update ("dev 1.92/9199 (unsigned)" entry in /ConfigUpdate_p.html) for those who don't use a package manager.

Re: Import Mediawiki

Post by promocore » Mon May 08, 2017 6:30 pm

Hi luc, thanks for the information.

I will try to import the wiki again in the future. With 1.92.000, the import starts on my YaCy, but stops after importing 10-15 documents.

Re: Import Mediawiki

Post by luc » Thu May 11, 2017 5:43 am

OK, if you tell me here exactly which dump file failed, I can already check whether it works with the latest modifications (I tested mostly with French and English wiki dumps from dumps.wikimedia.org).

Re: Import Mediawiki

Post by promocore » Fri May 19, 2017 11:13 pm

Hi luc,

I used a German Wikipedia dump for my import, but I have since deleted the file from my drive.

I have now installed my YaCy client from source, and the import of the current German Wikipedia dump works like a charm.
Thanks for your great improvement.

My YaCy only has problems if I import and index at the same time, or if YaCy is running huge crawls.
When that happens, the GC memory goes out of range and both import and indexing fail, but it may well be that my RAM settings are not optimized.

Re: Import Mediawiki

Post by luc » Sat May 20, 2017 11:13 am

Thank you promocore for your feedback, and glad to know the import worked with the German Wikipedia dump.

What are your memory settings? If you wish to share your log traces from when the import and indexing processes failed, they could be useful for further improvements.

Have a nice day

Re: Import Mediawiki

Post by promocore » Wed May 24, 2017 4:46 pm

After YaCy had been running for 3 days, I got this memory error: Caused by: java.lang.OutOfMemoryError: Java heap space

I am only indexing at 50 PPM on 1 domain.

My YaCy PC has 15 GB of RAM and YaCy can use up to 8 GB.

If I do several things simultaneously, like import and crawling, or just heavy crawling, I get the error within a few hours.
Have I set something grossly wrong?
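
To see how often this error turns up in a log file, a count like the following may help. It is a minimal sketch: the function name `count_heap_errors` is made up for illustration, and the location of the YaCy log files depends on the install, so the path has to be supplied by the caller.

```shell
#!/bin/sh
# Count the log lines that report an exhausted Java heap.
# Argument: path to a log file (where YaCy keeps its logs depends on the install).
count_heap_errors() {
    # -F matches the text literally (no regex), -c prints the number of matching lines
    grep -c -F 'java.lang.OutOfMemoryError: Java heap space' "$1"
}

# Example, commented out; the path is a placeholder:
# count_heap_errors /path/to/yacy/log/file
```

Comparing the count between a run with only crawling and a run with import plus crawling could show whether the combination is what exhausts the heap.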


Attachment: RAM-Settings.JPG (127.02 KiB)
Attachment: yacylog.zip (199.91 KiB)