
The YaCy Grid

Posted: Wed Mar 29, 2017 9:58 am
by Orbiter
I'm currently working hard on a YaCy/2, now called "YaCy Grid".
The main idea for now is that it becomes a large-scale search appliance, as a first step.
In a second step, we can do two things: first, replace the old code parts in "Legacy YaCy" with the grid elements, and second, turn the YaCy Grid into a peer-to-peer architecture (again).
YaCy Grid is therefore a 'professional YaCy' with the vision that it stays a modern piece of software that may power the next-generation p2p search.

I posted a milestone plan and an architecture picture here:
https://twitter.com/yacy_search/status/ ... 1844357120

Re: The YaCy Grid

Posted: Wed Mar 29, 2017 10:02 am
by Orbiter
"Legacy YaCy" (YaCy/1) will benefit from milestone 2: we will get a WARC parser which produces elasticsearch-like JSON index files, and YaCy will get a surrogate parser to read those files.
Then it will be easy to use crawlers from outside of YaCy, such as wget:
Code:
wget "http://yacy.net" --warc-file="yacy"

This will generate a WARC file, which YaCy/1 can then index using the Grid Parser.

Re: The YaCy Grid

Posted: Fri Mar 31, 2017 12:50 am
by reger
Oops,

I was looking into a WARC importer in parallel and only read your post afterwards; see commit https://github.com/yacy/yacy_search_ser ... fd248d51f3

P.S. I looked at your grid prototype and haven't grasped all the communication details so far, but I was a little surprised by the prerequisites (rabbit & ftp), currently without a way around them.
At least for the ftp, I embedded the Apache FTP server for my first testing (https://mina.apache.org/ftpserver-proje ... erver.html). Maybe something to consider.

Re: The YaCy Grid

Posted: Sat Apr 01, 2017 12:37 am
by Orbiter
Great work with the WARC importer!
reger wrote: prerequisites (rabbit & ftp), currently without a way around them,

Well, actually, if the MCP does not find an FTP service, it will host the files itself. The same goes for the queue: if there is no RabbitMQ, it will handle queues with a poor man's queue implementation using an embedded MapDB.
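
The fallback described here can be pictured with a small Python sketch. It is an illustration only and is not taken from the MCP source: the class names, the `broker_reachable` flag, and the in-memory `deque` (standing in for both RabbitMQ and the embedded MapDB store) are all assumptions made for this example.

```python
from collections import deque


class BrokerUnavailable(Exception):
    """Raised when no external message broker can be reached."""


class ExternalQueue:
    """Hypothetical stand-in for a queue backed by an external broker."""

    def __init__(self, broker_reachable):
        # A real client would try to open a network connection here.
        if not broker_reachable:
            raise BrokerUnavailable("no external broker found")
        self._items = deque()

    def send(self, message):
        self._items.append(message)

    def receive(self):
        # Return the oldest message, or None if the queue is empty.
        return self._items.popleft() if self._items else None


class EmbeddedQueue:
    """Hypothetical stand-in for the embedded fallback queue (in memory only)."""

    def __init__(self):
        self._items = deque()

    def send(self, message):
        self._items.append(message)

    def receive(self):
        return self._items.popleft() if self._items else None


def open_queue(broker_reachable):
    """Prefer the external broker; fall back to the embedded queue."""
    try:
        return ExternalQueue(broker_reachable)
    except BrokerUnavailable:
        return EmbeddedQueue()


if __name__ == "__main__":
    queue = open_queue(broker_reachable=False)
    queue.send("job-1")
    print(type(queue).__name__, queue.receive())
```

Both queue classes expose the same `send`/`receive` pair, so calling code does not need to know which one it received. The same pattern would apply to file storage falling back from FTP to self-hosting.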

reger wrote: At least for the ftp, I embedded the Apache FTP server for my first testing (https://mina.apache.org/ftpserver-proje ... erver.html). Maybe something to consider.

I considered that as well, but we can add that as an add-on later. The same goes for SMB or other protocols: any kind of file sharing should be usable. The idea is that everyone can choose their own place to share WARC/index files.

Re: The YaCy Grid

Posted: Sat Apr 01, 2017 11:11 am
by Huppi
@Orbiter: Thanks for sharing your plan! Looks great!

Re: The YaCy Grid

Posted: Mon Apr 24, 2017 3:47 pm
by Orbiter
YaCy Grid: Parser Microservice

You can now send a WARC file to a yacy_grid_parser microservice
and get the parsed fulltext and links back as JSON:

Code:
wget https://www.ffii.org --warc-file=ffii.org
curl -X POST -F "sourcebytes=@ffii.org.warc.gz"  http://yacygrid.com:8500/yacy/grid/parser/parser.json


Here we still use wget as the loader. That component will soon be replaced with a headless browser which
generates WARC files.

Re: The YaCy Grid

Posted: Sat May 06, 2017 11:10 pm
by reger
Orbiter wrote: YaCy Grid: Parser Microservice

you can now send a WARC file to a yacy_grid_parser microservice
and get the parsed fulltext and links as JSON:



If you want to test the feature without wget/curl, you can use the sourceurl parameter with a WARC file that is stored online:

Code:
http://yacygrid.com:8500/yacy/grid/parser/parser.json?sourceurl=https://archive.org/download/warc-www_c-l-o-u-d_us/www.c-l-o-u-d.us-2016-10-03-bd9783dc-00000.warc.gz
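
When such a request is built from a script rather than pasted into a browser, it is safer to percent-encode the nested URL so that characters like `?` or `&` inside it cannot be mistaken for part of the outer query string. A minimal Python sketch, assuming the parser service decodes query parameters in the usual way; the function name `parser_url_for` is made up for this example:

```python
from urllib.parse import urlencode

# Endpoint as given in this thread.
PARSER_ENDPOINT = "http://yacygrid.com:8500/yacy/grid/parser/parser.json"


def parser_url_for(source_url, endpoint=PARSER_ENDPOINT):
    """Return a parser request URL with source_url percent-encoded."""
    return endpoint + "?" + urlencode({"sourceurl": source_url})


if __name__ == "__main__":
    warc = ("https://archive.org/download/warc-www_c-l-o-u-d_us/"
            "www.c-l-o-u-d.us-2016-10-03-bd9783dc-00000.warc.gz")
    print(parser_url_for(warc))
```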

Re: The YaCy Grid

Posted: Tue May 09, 2017 11:06 pm
by Orbiter
Good idea!

Re: The YaCy Grid

Posted: Sun May 14, 2017 8:22 pm
by Orbiter
The yacy_grid_loader is ready, running, and able to act as a listener on the MCP event queue!

The RabbitMQ message server attached to the MCP can dispatch work tasks to the YaCy Grid microservices. The yacy_grid_loader service is the first one that actually listens on such a queue and acts on messages.

The yacy_grid_loader is now running at yacygrid.com:8200. It has a servlet interface, but it gets interesting when it is accessed through a message. To do that, first store a message object named 'job.json' with the following content:
Code:
{
  "metadata": {
    "process": "yacy_grid_loader",
    "count": 1
  },
  "data": [{"collection": "test"}],
  "actions": [{
    "urls": ["http://yacy.net"],
    "collection": "test",
    "targetasset": "test3/yacy.net.warc.gz",
    "type": "loader",
    "queue": "webloader"
  }]
}
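
If several such jobs are needed, the message can also be generated instead of written by hand. The following Python sketch reproduces exactly the structure shown above; the function names are invented for this example, and because the thread does not explain what "count" measures, its value is simply copied from the post rather than computed.

```python
import json


def build_loader_job(urls, collection, target_asset):
    """Build a loader job message shaped like the job.json in this post."""
    return {
        "metadata": {
            "process": "yacy_grid_loader",
            # Copied from the post; its exact meaning is not explained there.
            "count": 1,
        },
        "data": [{"collection": collection}],
        "actions": [{
            "urls": list(urls),
            "collection": collection,
            "targetasset": target_asset,
            "type": "loader",
            "queue": "webloader",
        }],
    }


def job_to_json(job):
    """Serialize a job message as indented JSON text."""
    return json.dumps(job, indent=2)


if __name__ == "__main__":
    job = build_loader_job(["http://yacy.net"], "test", "test3/yacy.net.warc.gz")
    print(job_to_json(job))
```

Redirecting the printed output into a file named job.json gives a message equivalent to the one above.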


Then upload the message with
Code:
curl -X POST -F "message=@job.json" -F "serviceName=loader" -F "queueName=webloader" http://yacygrid.com:8100/yacy/grid/mcp/messages/send.json


The result is an asset at test3/yacy.net.warc.gz containing the web page as it was loaded by a headless browser, so it reflects the page after its JavaScript code has been executed!

To check the content, you can parse the asset with
Code:
curl "http://yacygrid.com:8500/yacy/grid/parser/parser.json?sourceasset=test3/yacy.net.warc.gz"