Using WARC as import option & YaCy2 architecture

Forum for developers

Post by Orbiter » Fri Sep 18, 2015 11:02 am

In the context of a YaCy re-design towards a YaCy2, I plan to split YaCy into stand-alone modules, make room for funded (and commercially usable) plug-in parts, and then pack the resulting modules together again into different appliances. This could lead to a 'new' YaCy which is compatible with the old network but is composed of the new modules. Another goal is to create professional appliance packages, which may include parts that are not applicable to p2p search but are necessary for customers.
One of the tasks in creating that architecture is identifying the standards which the modules of YaCy2 should support. I found that WARC is really amazing and important, and it would fit the demand of YaCy users to collect large amounts of web data. WARC is the file standard of the Internet Archive, http://archive.org
There are a lot of interesting applications available to create and process WARC:

  • WARC can be created by wget and it's really simple:
    Code:
    # -r crawls recursively; wget appends .warc.gz to the --warc-file base name
    wget -r --warc-file=yacy.net http://yacy.net
    .. creates a full archive of yacy.net
  • WARC can be created using http://webrecorder.io (this is similar to the YaCy proxy idea: record everything which you browse) .. this will also be open source soon!
  • There is archive software for WARC which is 'like archive.org', so it's a DIY archive.org: https://github.com/ikreymer/pywb (check out the other tools at https://github.com/ikreymer .. they are mostly about WARC)
  • There is a Java library available to parse WARC files; it could be integrated into YaCy to feed its parser: https://sbforge.org/display/JWAT/Usage
  • I tried webarchiveplayer https://github.com/ikreymer/webarchiveplayer and liked it a lot - it starts a web server, and within your browser you get all the content from a specific WARC file served (unfortunately it starts on our default port 8090, so you might have to stop YaCy to try it :( .. or change the port)
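
To give a feeling for how simple the format is, here is a minimal sketch of a WARC record reader in Python using only the standard library. It is not how JWAT or pywb do it; it only assumes the basic record layout (a `WARC/...` version line, `Name: value` header lines, an empty line, then `Content-Length` bytes of payload). It ignores folded header lines and treats header names as case-sensitive. The function names and the sample record are made up for illustration.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream (gzip handles multi-member files)."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


# A tiny hand-made record, so the sketch runs without any download.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://yacy.net/\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

For a file written by wget, the same loop would run over `open_warc(...)` with the path of that file instead of the in-memory sample.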

I also like the idea that in a YaCy2 architecture we should be able to share on two levels: in addition to p2p index sharing we could do WARC sharing as well. I am considering adding a BitTorrent tracker for that, together with WARC archive management, to the list of modules which could be glued together into YaCy2.

What do you think? Please try the wget command above and maybe start to collect WARC archives which we can share to bootstrap a huge YaCy2 index when software modules are ready!

Re: Using WARC as import option & YaCy2 architecture

Post by David » Sun Sep 20, 2015 7:55 pm

Sounds good!

Since 2014 I have been automatically recording nearly all the pages I visit with the Shelve Firefox add-on, and I plan to convert the data to WARC files using Wget and then build my private Wayback Machine with OpenWayback. Of course, I will use YaCy as the search engine.

Re: Using WARC as import option & YaCy2 architecture

Post by luc » Mon Sep 21, 2015 8:18 am

Common Crawl also uses the WARC format to store very large crawl archives on Amazon Web Services: http://commoncrawl.org/the-data/get-started/. Maybe their data could also be a source to feed some YaCy nodes?

Re: Using WARC as import option & YaCy2 architecture

Post by luc » Thu Sep 24, 2015 11:43 am

Shouldn't every public website be encouraged to provide its own up-to-date WARC archive of its contents, and even its own reverse index: in effect, to run its own YaCy node instance with a WARC and an index covering only its own contents?
Maybe a specific YaCy distribution would help with that... Or maybe websites already do this with the current YaCy release and I am just not aware of it?

Re: Using WARC as import option & YaCy2 architecture

Post by Orbiter » Thu Sep 24, 2015 12:44 pm

luc wrote: Common Crawl also uses the WARC format to store very large crawl archives on Amazon Web Services: http://commoncrawl.org/the-data/get-started/. Maybe their data could also be a source to feed some YaCy nodes?

They have a strange way of presenting links to their WARC files ("replace xxx with yyy", which then gives a link to a gzipped file, which in turn contains strings that are parts of links), but I finally managed to load one of these dumps. What I saw was a mixture of documents from different domains, originating from a 'wide crawl' over different domains, in the same way as YaCy does it. One month amounts to about 50 terabytes in those archives. It could be interesting to download all of that and filter it according to some rules, but indexing would be hard work requiring a lot of servers. It's possible but not easy...
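
The "gzipped file which contains parts of links" step can be sketched like this in Python. This is only an illustration under assumptions: the download prefix is a guess that has to be checked against the Common Crawl get-started page, the listing content is invented, and `expand_paths` is a hypothetical helper name.

```python
import gzip

# Assumed download prefix; check Common Crawl's "get started" page for the current one.
PREFIX = "https://data.commoncrawl.org/"


def expand_paths(gz_bytes, prefix=PREFIX):
    """Turn a gzipped listing of relative paths into full download URLs."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical listing content, standing in for a real downloaded paths file.
listing = gzip.compress(
    b"crawl-data/EXAMPLE/segments/1/warc/part-00000.warc.gz\n"
    b"crawl-data/EXAMPLE/segments/1/warc/part-00001.warc.gz\n"
)
urls = expand_paths(listing)
for url in urls:
    print(url)
```

Each resulting URL would then point to one WARC file of the crawl, which could be fetched and filtered one at a time instead of downloading the whole month.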

luc wrote: Shouldn't every public website be encouraged to provide its own up-to-date WARC archive of its contents, and even its own reverse index: in effect, to run its own YaCy node instance with a WARC and an index covering only its own contents?

That's a good point, since there is also a copyright issue: while it is simple to create a WARC with wget, publishing a full crawl of a single domain could be a copyright problem because it's like a re-distribution. I don't know whether that is already a problem; I am unsure. For a YaCy2 architecture I would suggest implementing a 'closed-group' torrent-based file sharing infrastructure for such files. Your suggestion could then lead to a marker in the web pages containing a copyright note. I believe such things exist, but no one uses such markers. For domains with free licenses we could create open-group sharing instead of closed-group sharing.
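
One marker of this kind that does exist is the HTML `rel="license"` link type. As a rough sketch (assuming a page declares its license that way, which many pages do not), a crawler could look for it like this in Python; `LicenseFinder` and `find_licenses` are hypothetical names, and deciding between open-group and closed-group sharing from the result is left open.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collect href values of <a> and <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels and attr.get("href"):
            self.licenses.append(attr["href"])


def find_licenses(html):
    """Return the license URLs declared in an HTML document (possibly empty)."""
    finder = LicenseFinder()
    finder.feed(html)
    return finder.licenses


# Made-up page used only to exercise the sketch.
page = (
    '<html><head><link rel="license" '
    'href="https://creativecommons.org/licenses/by/4.0/"></head>'
    "<body><p>Hello</p></body></html>"
)
licenses = find_licenses(page)
print(licenses)
```

A page without such a marker simply yields an empty list, which would be the case where closed-group sharing is the cautious default.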

Re: Using WARC as import option & YaCy2 architecture

Post by luc » Thu Sep 24, 2015 5:51 pm

To eventually build an index on these Common Crawl datasets, I suppose it would require some Hadoop programming skills to run the jobs on Amazon EC2 or another cloud. It would certainly take some time but seems very interesting... despite the fact that the data is stored on a commercial, centralized cloud system.

Re: Using WARC as import option & YaCy2 architecture

Post by reger » Wed Oct 21, 2015 2:06 am

Modular is fine. Is there already an idea of a framework, or will it be handcrafted ... or do you mean really stand-alone (communicating over a file system ;-( )?

Orbiter wrote: I identified that WARC is really amazing and important

I don't get the discussion about WARC. Is it about the idea of distributing (selling ;-) ) an index without crawling? Does that work for us?
Or is it just .... basically to have a module that writes the crawler cache in a different (reusable) format ....

Re: Using WARC as import option & YaCy2 architecture

Post by luc » Sat Dec 05, 2015 1:30 pm

To extend the idea of a modular architecture based on standards, don't you think integrating the Apache Nutch web crawler (https://nutch.apache.org/) and the Apache Tika parsers (https://tika.apache.org/) would be a good idea? Any parts of the YaCy crawling or parsing system not already in these libraries could be contributed to those projects... It would allow even more code review and testing of such core components.

Re: Using WARC as import option & YaCy2 architecture

Post by Orbiter » Sun Dec 27, 2015 8:01 pm

I worked with a commercial partner who selected YaCy over Nutch as their crawler because, already some years ago, they considered Nutch old and badly maintained. Since then I have worked with this partner to enhance the YaCy crawler even further. Given this experience, turning to Nutch would be a huge step back.

Apache Tika, which Solr uses for content extraction, bundles a set of parsers and unifies their metadata into a common metadata structure. YaCy does the same, and the parsers YaCy uses are a superset of those in Tika. Furthermore, the metadata structure in YaCy is much, much richer than the one used in Tika. That means: Tika is great, but its functions are already subsumed by those in YaCy.

What is great about Nutch and Tika is the 'thinking in modules'. That's exactly the idea behind the YaCy2 components.

reger wrote: I don't get the discussion about WARC. Is it about the idea of distributing (selling ;-) ) an index without crawling?

WARC is a great format and there are already a lot of tools for it, so it's just a good choice. This is not about 'selling' data. The word 'distribution' refers to usability for the (YaCy!) community.

reger wrote: Or is it just .... basically to have a module that writes the crawler cache in a different (reusable) format ....

That as well!

