One of the tasks to create that architecture is the identification of standards which the modules of YaCy2 should support. I identified that WARC is really amazing and important and would fit into the YaCys user demand to collect large amounts of web data. WARC is the file standard of the internet archive http://archive.org
There are a lot of interesting applications available to create and process WARC:
- WARC can be created by wget and it's really simple:
- Code: Alles auswählen
wget -r --warc-file=yacy.net.warc http://yacy.net
- WARC can be created using http://webrecorder.io (this is similar to the YaCy proxy idea: record everything which you browse) .. this will also be open source soon!
- There is an archive software for WARC which is 'like archive.org', so its a DIY archive.org software: https://github.com/ikreymer/pywb (check out the other tools of https://github.com/ikreymer .. its mostly about WARC)
- there is a java library available to parse WARC files; that could be integrated in YaCy to feed it's parser: https://sbforge.org/display/JWAT/Usage
- I tried and liked a lot the webarchiveplayer https://github.com/ikreymer/webarchiveplayer - this starts a web server and within your browser you get all the content from a specific WARC file served (unfortunately this starts on our default port 8090 so you might have to stop YaCy to try that .. or change the port)
I also like to idea that in a YaCy2 architecture we should be able to share on two levels: additionally to p2p index sharing we could do a WARC sharing as well. I consider to add a bittorrent tracker for that together with a WARC archive management to the list of modules which could be glued together to YaCy2.
What do you think? Please try the wget command above and maybe start to collect WARC archives which we can share to bootstrap a huge YaCy2 index when software modules are ready!