The YaCy Grid

Posted by Orbiter » Wed Mar 29, 2017 9:58 am

I'm currently working hard on a YaCy/2, now called "YaCy Grid".
The main idea, as a first step, is for it to become a large-scale search appliance.
In a second step, we can do two things: first, replace the old code parts in "Legacy YaCy" with the grid elements, and second, turn the YaCy Grid into a peer-to-peer architecture (again).
YaCy Grid is therefore a 'professional YaCy', with the vision that it remains a modern piece of software that may power the next generation of p2p search.

I posted a milestone plan and an architecture picture here:
https://twitter.com/yacy_search/status/ ... 1844357120
[Image]
Orbiter

Posts: 5796
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: The YaCy Grid

Posted by Orbiter » Wed Mar 29, 2017 10:02 am

"Legacy YaCy" (YaCy/1) will benefit from milestone 2: we will get a WARC parser that produces Elasticsearch-like JSON index files, and YaCy will get a surrogate parser to read those files.
Then it will be easy to use crawlers outside of YaCy, such as wget:
Code:
wget "http://yacy.net" --warc-file="yacy"

...will generate a WARC file, which YaCy/1 can then index using the Grid Parser.

Re: The YaCy Grid

Posted by reger » Fri Mar 31, 2017 12:50 am

Oops,

I was looking into a WARC importer in parallel and only read your post afterwards; see commit https://github.com/yacy/yacy_search_ser ... fd248d51f3

P.S. I looked at your grid prototype. I haven't grasped all the communication details yet, but I was a little surprised by the prerequisites (RabbitMQ and FTP), currently with no way around them.
At least for FTP, I embedded the Apache FTP server (https://mina.apache.org/ftpserver-proje ... erver.html) for my first tests. Maybe something to consider.
reger

Posts: 46
Joined: Wed Jan 02, 2013 9:23 am

Re: The YaCy Grid

Posted by Orbiter » Sat Apr 01, 2017 12:37 am

Great work on the WARC importer!
reger wrote: prerequisites (RabbitMQ and FTP), currently with no way around them.

Well, actually, if the MCP does not find an FTP service, it will host the files itself. The same goes for the queue: if there is no RabbitMQ, it will handle queues with a poor man's queue implementation using an embedded MapDB.
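
As a rough illustration of this fallback idea (not the actual MCP code), here is a minimal Python sketch. The function names, the backend labels and the probe mechanism are all assumptions made for the example; the thread does not say how the MCP detects an external service.

```python
import socket
from typing import Callable


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    with socket.create_connection((host, port), timeout=timeout):
        return True


def select_backend(probe: Callable[[], bool], external: str, embedded: str) -> str:
    """Prefer the external service; fall back to the embedded one.

    The labels passed in are illustrative only, not YaCy Grid identifiers.
    """
    try:
        return external if probe() else embedded
    except OSError:
        # A refused or timed-out connection counts as "service not found".
        return embedded


if __name__ == "__main__":
    # Assumed default ports: 21 for FTP, 5672 for RabbitMQ (AMQP).
    print(select_backend(lambda: port_open("127.0.0.1", 21), "ftp", "self-hosted files"))
    print(select_backend(lambda: port_open("127.0.0.1", 5672), "rabbitmq", "embedded mapdb queue"))
```

Passing the probe in as a callable keeps the decision logic testable without a running FTP or RabbitMQ server.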

reger wrote: At least for FTP, I embedded the Apache FTP server (https://mina.apache.org/ftpserver-proje ... erver.html) for my first tests. Maybe something to consider.

I considered that as well, but we can add that as an add-on later. The same goes for SMB or other protocols; any kind of file sharing should be usable. The idea is that everyone can choose their own place to share WARC/index files.

Re: The YaCy Grid

Posted by Huppi » Sat Apr 01, 2017 11:11 am

@Orbiter: Thanks for sharing your plan! Looks great!
Huppi

Posts: 898
Joined: Fri Jun 29, 2007 9:49 am
Location: Kürten

Re: The YaCy Grid

Posted by Orbiter » Mon Apr 24, 2017 3:47 pm

YaCy Grid: Parser Microservice

You can now send a WARC file to a yacy_grid_parser microservice
and get the parsed full text and links as JSON:

Code:
# Fetch the page; wget writes the archive to ffii.org.warc.gz
wget https://www.ffii.org --warc-file=ffii.org
# Upload the WARC file to the parser microservice
curl -X POST -F "sourcebytes=@ffii.org.warc.gz" http://yacygrid.com:8500/yacy/grid/parser/parser.json


Here we still use wget as the loader. That component will soon be replaced with a headless browser that
generates WARC files.
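
For anyone who wants to script the upload instead of calling curl, the following is a minimal Python sketch using only the standard library. It assumes that the parser endpoint accepts a plain multipart/form-data POST with the file in the sourcebytes field, exactly as in the curl call above; the helper names build_multipart and post_warc are made up for this example, and the response is returned as raw text because the JSON layout is not described in this thread.

```python
import os
import urllib.request
import uuid

# Endpoint taken from the curl example; it may no longer be reachable.
PARSER_URL = "http://yacygrid.com:8500/yacy/grid/parser/parser.json"


def build_multipart(field, filename, payload, boundary=None):
    """Encode one file as a multipart/form-data body (like curl -F field=@file)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_warc(path, url=PARSER_URL):
    """POST a local WARC file to the parser and return the response text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            "sourcebytes", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling post_warc("ffii.org.warc.gz") would then correspond to the curl command, provided the service is running.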

Re: The YaCy Grid

Posted by reger » Sat May 06, 2017 11:10 pm

Orbiter wrote: YaCy Grid: Parser Microservice

You can now send a WARC file to a yacy_grid_parser microservice
and get the parsed full text and links as JSON:

If you want to test the feature without wget/curl, you could use the sourceurl parameter with a WARC file stored online:

Code:
http://yacygrid.com:8500/yacy/grid/parser/parser.json?sourceurl=https://archive.org/download/warc-www_c-l-o-u-d_us/www.c-l-o-u-d.us-2016-10-03-bd9783dc-00000.warc.gz
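
If the WARC URL contains a query string or other special characters of its own, it is safer to percent-encode it before appending it as the sourceurl value. A small Python sketch, assuming nothing beyond the sourceurl parameter shown above (the helper name parser_request_url is made up for this example):

```python
from urllib.parse import urlencode

# Endpoint taken from the post above; it may no longer be reachable.
PARSER_URL = "http://yacygrid.com:8500/yacy/grid/parser/parser.json"


def parser_request_url(warc_url: str, base: str = PARSER_URL) -> str:
    """Build a parser request URL with the WARC location percent-encoded."""
    return base + "?" + urlencode({"sourceurl": warc_url})


if __name__ == "__main__":
    print(parser_request_url(
        "https://archive.org/download/warc-www_c-l-o-u-d_us/"
        "www.c-l-o-u-d.us-2016-10-03-bd9783dc-00000.warc.gz"
    ))
```

For the archive.org link above, pasting the plain URL works as well; the encoding only matters once the source URL carries characters such as & or ?.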

Re: The YaCy Grid

Posted by Orbiter » Tue May 09, 2017 11:06 pm

Good idea!

Re: The YaCy Grid

Posted by Orbiter » Sun May 14, 2017 8:22 pm

The yacy_grid_loader is ready, running, and able to act as a listener on the MCP event queue!

The RabbitMQ message server attached to the MCP can dispatch work tasks to the YaCy Grid microservices. The yacy_grid_loader service is the first one that actually listens on such a queue and acts on messages.

The yacy_grid_loader is now running at yacygrid.com:8200. It has a servlet interface, but it gets interesting when it is accessed using a message. To do that, first store a message object named 'job.json' with the following content:
Code:
{
  "metadata": {
    "process": "yacy_grid_loader",
    "count": 1
  },
  "data": [{"collection": "test"}],
  "actions": [{
    "urls": ["http://yacy.net"],
    "collection": "test",
    "targetasset": "test3/yacy.net.warc.gz",
    "type": "loader",
    "queue": "webloader"
  }]
}
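
The job message above could also be generated programmatically. This is a minimal Python sketch that mirrors the example field by field; the helper names build_loader_job and write_job are made up, and the value of "count" is simply copied from the example because its exact meaning is not stated in this thread.

```python
import json


def build_loader_job(urls, collection, targetasset):
    """Return a loader job message shaped like the job.json example."""
    return {
        "metadata": {
            "process": "yacy_grid_loader",
            "count": 1,  # copied from the example; meaning not documented here
        },
        "data": [{"collection": collection}],
        "actions": [{
            "urls": list(urls),
            "collection": collection,
            "targetasset": targetasset,
            "type": "loader",
            "queue": "webloader",
        }],
    }


def write_job(job, path="job.json"):
    """Write the message to disk so it can be uploaded as a file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(job, fh, indent=2)


if __name__ == "__main__":
    write_job(build_loader_job(["http://yacy.net"], "test", "test3/yacy.net.warc.gz"))
```

Run as a script, it writes a job.json equivalent to the one shown above.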


Then upload the message with:
Code:
curl -X POST -F "message=@job.json" -F "serviceName=loader" -F "queueName=webloader" http://yacygrid.com:8100/yacy/grid/mcp/messages/send.json


The result is an asset at test3/yacy.net.warc.gz that contains the web page as loaded by a headless browser, so it reflects executed JavaScript code!

To check the content, you can parse the asset with:
Code:
curl "http://yacygrid.com:8500/yacy/grid/parser/parser.json?sourceasset=test3/yacy.net.warc.gz"