HandleSet & RowSet

Forum for developers

Post by gaston » Tue May 15, 2012 11:42 am

Hi,

could someone please make HandleSet and RowSet serializable (if that is possible)? I can't get it to work; just adding Serializable is unfortunately not enough. Thanks.
gaston

Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm
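
A note on why adding Serializable alone is often not enough: Java serialization throws NotSerializableException as soon as a non-transient field refers to a type that is not itself serializable. The sketch below only illustrates that general mechanism. Ordering, NaiveSet and FixedSet are hypothetical stand-ins; they do not reflect the real fields of HandleSet or RowSet, nor the change in the commit linked in the reply.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

final class SerializableSketch {

    // Hypothetical helper type that does NOT implement Serializable.
    static class Ordering {
        final boolean ascending;
        Ordering(final boolean ascending) { this.ascending = ascending; }
    }

    // Only the marker interface was added: writing fails because of the Ordering field.
    static class NaiveSet implements Serializable {
        private static final long serialVersionUID = 1L;
        final Ordering order = new Ordering(true);
        byte[] keys = new byte[0];
    }

    // Working variant: the problematic field is transient and written/restored by hand.
    static class FixedSet implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Ordering order = new Ordering(true);
        byte[] keys = new byte[0];

        private void writeObject(final ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();               // writes the non-transient field "keys"
            out.writeBoolean(this.order.ascending); // writes the state of the transient field
        }

        private void readObject(final ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();                 // restores "keys"
            this.order = new Ordering(in.readBoolean());
        }
    }

    // Serializes and deserializes in memory; checked exceptions are wrapped for brevity.
    static Object roundTrip(final Object o) {
        try {
            final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            final ObjectOutputStream out = new ObjectOutputStream(buffer);
            out.writeObject(o);
            out.close();
            final ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            try {
                return in.readObject();
            } finally {
                in.close();
            }
        } catch (final IOException e) {
            throw new IllegalStateException(e);
        } catch (final ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns false if the object graph cannot be serialized.
    static boolean canSerialize(final Object o) {
        try {
            roundTrip(o);
            return true;
        } catch (final IllegalStateException e) {
            return false;
        }
    }

    public static void main(final String[] args) {
        System.out.println("NaiveSet serializable: " + canSerialize(new NaiveSet()));
        final FixedSet set = new FixedSet();
        set.keys = new byte[] {1, 2, 3};
        final FixedSet copy = (FixedSet) roundTrip(set);
        System.out.println("FixedSet keys restored: " + Arrays.equals(set.keys, copy.keys));
    }
}
```

Whether a transient field with custom writeObject/readObject is the right choice depends on the class; making the referenced type serializable as well is the other common option.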

Re: HandleSet & RowSet

Post by Orbiter » Tue May 15, 2012 11:57 am

Hello gaston,
since that sounds as if you want to join in the hacking, I went ahead and did it just for fun!
https://gitorious.org/yacy/rc1/commit/1 ... 7deffa0264

What are you planning to do?
Orbiter

Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: HandleSet & RowSet

Post by gaston » Tue May 15, 2012 5:01 pm

wow, that was quick :)

I would like to save and load the DHT hash cache. I had this working once before, but back then "Set" was still used.

protected HashMap<String, Set<String>> cachedUrlHashs;

save: ObjectSerializer.save(getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT));
load: (getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT)).addAll((Set)ObjectSerializer.read());

currently it hangs while loading
private final ConcurrentMap<String, HandleSet> cachedUrlHashs;
(getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT)).putAll((HandleSet)ObjectSerializer.read());

Could it be that a plain "(HandleSet)" cast is no longer enough?
Last edited by gaston on Tue May 15, 2012 5:07 pm, edited 1 time in total.
gaston

Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm

Re: HandleSet & RowSet

Post by Orbiter » Tue May 15, 2012 5:05 pm

gaston wrote: I would like to save and load the DHT hash cache. I had this working once before, but back then HashMap was still used.

good idea

gaston wrote: (getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT)).putAll((HandleSet)ObjectSerializer.read());

Could it be that a plain "(HandleSet)" cast is no longer enough?

I don't know, you have to look in detail at where it hangs. To do that, make a thread dump with kill -3 on the Java process. The dump then appears in the terminal. Then look for where a deadlock or something similar is.
Orbiter

Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main
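
As a complement to kill -3, a hang can also be inspected from inside the JVM. This is a minimal sketch, assuming a JVM on which the java.lang.management API supports deadlock detection; the class name DeadlockCheck is made up for illustration and is not part of YaCy. A load that hangs is not necessarily a deadlock, so the sketch also lists the stacks of all live threads, roughly what a thread dump shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;

final class DeadlockCheck {

    /** Returns a description of deadlocked threads, or an empty string if none are found. */
    static String report() {
        final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        final long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return "";
        }
        final StringBuilder sb = new StringBuilder();
        for (final ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            if (info != null) {
                sb.append(info.toString());
            }
        }
        return sb.toString();
    }

    /** Lists the name, state and stack frames of every live thread. */
    static String allStacks() {
        final StringBuilder sb = new StringBuilder();
        for (final Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(e.getKey().getName()).append("\" state=")
              .append(e.getKey().getState()).append('\n');
            for (final StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        final String deadlocks = report();
        System.out.println(deadlocks.isEmpty() ? "no deadlock detected" : deadlocks);
        System.out.println(allStacks());
    }
}
```

If report() stays empty while the program still hangs, the stack of the hanging thread in allStacks() shows which call it is blocked in.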

Re: HandleSet & RowSet

Post by gaston » Wed May 16, 2012 1:36 pm

It works now.

In Switchboard.java, this line has to be added in close().
Code:
urlBlacklist.saveDHTCache();

Blacklist.java
Code:
// Blacklist.java
// (C) 2005 by Michael Peter Christen; mc@yacy.net, Frankfurt a. M., Germany
// first published 11.07.2005 on http://yacy.net
//
// This is a part of YaCy, a peer-to-peer based web search engine
//
// $LastChangedDate$
// $LastChangedRevision$
// $LastChangedBy$
//
// LICENSE
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
package net.yacy.repository;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

import net.yacy.kelondro.data.meta.DigestURI;
import net.yacy.kelondro.data.meta.URIMetadataRow;
import net.yacy.kelondro.index.HandleSet;
import net.yacy.kelondro.index.RowSpaceExceededException;
import net.yacy.kelondro.logging.Log;
import net.yacy.kelondro.util.FileUtils;
import net.yacy.kelondro.util.SetTools;

public class Blacklist {
    private static final File BLACKLIST_CACHEFILE = new File("DATA/DHT_Blacklist_Cache.ser");

    public static final String BLACKLIST_DHT = "dht";
    public static final String BLACKLIST_CRAWLER = "crawler";
    public static final String BLACKLIST_PROXY = "proxy";
    public static final String BLACKLIST_SEARCH = "search";
    public static final String BLACKLIST_SURFTIPS = "surftips";
    public static final String BLACKLIST_NEWS = "news";
    public final static String BLACKLIST_FILENAME_FILTER = "^.*\\.black$";

    public static enum BlacklistError {

        NO_ERROR(0),
        TWO_WILDCARDS_IN_HOST(1),
        SUBDOMAIN_XOR_WILDCARD(2),
        PATH_REGEX(3),
        WILDCARD_BEGIN_OR_END(4),
        HOST_WRONG_CHARS(5),
        DOUBLE_OCCURANCE(6),
        HOST_REGEX(7);
        final int errorCode;

        BlacklistError(final int errorCode) {
            this.errorCode = errorCode;
        }

        public int getInt() {
            return this.errorCode;
        }

        public long getLong() {
            return this.errorCode;
        }
    }
    protected static final Set<String> BLACKLIST_TYPES = new HashSet<String>(Arrays.asList(new String[]{
                Blacklist.BLACKLIST_CRAWLER,
                Blacklist.BLACKLIST_PROXY,
                Blacklist.BLACKLIST_DHT,
                Blacklist.BLACKLIST_SEARCH,
                Blacklist.BLACKLIST_SURFTIPS,
                Blacklist.BLACKLIST_NEWS
            }));
    public static final String BLACKLIST_TYPES_STRING = "proxy,crawler,dht,search,surftips,news";
    private File blacklistRootPath = null;
    private final ConcurrentMap<String, HandleSet> cachedUrlHashs;
    private final ConcurrentMap<String, Map<String, List<String>>> hostpaths_matchable; // key=host, value=path; mapped url is http://host/path; path does not start with '/' here
    private final ConcurrentMap<String, Map<String, List<String>>> hostpaths_notmatchable; // key=host, value=path; mapped url is http://host/path; path does not start with '/' here

    public Blacklist(final File rootPath) {

        setRootPath(rootPath);

        // prepare the data structure
        this.hostpaths_matchable = new ConcurrentHashMap<String, Map<String, List<String>>>();
        this.hostpaths_notmatchable = new ConcurrentHashMap<String, Map<String, List<String>>>();
        this.cachedUrlHashs = new ConcurrentHashMap<String, HandleSet>();

        for (final String blacklistType : BLACKLIST_TYPES) {
            this.hostpaths_matchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());
            this.hostpaths_notmatchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());           
            if (blacklistType.equals(Blacklist.BLACKLIST_DHT)) {
                loadDHTCache();
            } else {
                this.cachedUrlHashs.put(blacklistType, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));   
            }
        }
    }

    public final void setRootPath(final File rootPath) {
        if (rootPath == null) {
            throw new NullPointerException("The blacklist root path must not be null.");
        }
        if (!rootPath.isDirectory()) {
            throw new IllegalArgumentException("The blacklist root path is not a directory.");
        }
        if (!rootPath.canRead()) {
            throw new IllegalArgumentException("The blacklist root path is not readable.");
        }

        this.blacklistRootPath = rootPath;
    }

    protected Map<String, List<String>> getBlacklistMap(final String blacklistType, final boolean matchable) {
        if (blacklistType == null) {
            throw new IllegalArgumentException("Blacklist type not set.");
        }
        if (!BLACKLIST_TYPES.contains(blacklistType)) {
            throw new IllegalArgumentException("Unknown blacklist type: " + blacklistType + ".");
        }

        return (matchable) ? this.hostpaths_matchable.get(blacklistType) : this.hostpaths_notmatchable.get(blacklistType);
    }

    protected HandleSet getCacheUrlHashsSet(final String blacklistType) {
        if (blacklistType == null) {
            throw new IllegalArgumentException("Blacklist type not set.");
        }
        if (!BLACKLIST_TYPES.contains(blacklistType)) {
            throw new IllegalArgumentException("Unknown blacklist type: " + blacklistType + ".");
        }

        return this.cachedUrlHashs.get(blacklistType);
    }

    public void clear() {
        for (final Map<String, List<String>> entry : this.hostpaths_matchable.values()) {
            entry.clear();
        }
        for (final Map<String, List<String>> entry : this.hostpaths_notmatchable.values()) {
            entry.clear();
        }
//        for (final HandleSet entry : this.cachedUrlHashs.values()) {
//            entry.clear();
//        }
    }

    public int size() {
        int size = 0;
        for (final String entry : this.hostpaths_matchable.keySet()) {
            for (final List<String> ientry : this.hostpaths_matchable.get(entry).values()) {
                size += ientry.size();
            }
        }
        for (final String entry : this.hostpaths_notmatchable.keySet()) {
            for (final List<String> ientry : this.hostpaths_notmatchable.get(entry).values()) {
                size += ientry.size();
            }
        }
        return size;
    }

    public void loadList(final BlacklistFile[] blFiles, final String sep) {
        for (final BlacklistFile blf : blFiles) {
            loadList(blf.getType(), blf.getFileName(), sep);
        }
    }

    /**
     * create a blacklist from file, entries separated by 'sep'
     * duplicate entries are removed
     * @param blFile
     * @param sep
     */
    private void loadList(final BlacklistFile blFile, final String sep) {
        final Map<String, List<String>> blacklistMapMatch = getBlacklistMap(blFile.getType(), true);
        final Map<String, List<String>> blacklistMapNotMatch = getBlacklistMap(blFile.getType(), false);
        Set<Map.Entry<String, List<String>>> loadedBlacklist;
        Map.Entry<String, List<String>> loadedEntry;
        List<String> paths;
        List<String> loadedPaths;

        final Set<String> fileNames = blFile.getFileNamesUnified();
        for (final String fileName : fileNames) {
            // make sure all requested blacklist files exist
            final File file = new File(this.blacklistRootPath, fileName);
            try {
                file.createNewFile();
            } catch (final IOException e) { /* */ }

            // join all blacklists from files into one internal blacklist map
            loadedBlacklist = SetTools.loadMapMultiValsPerKey(file.toString(), sep).entrySet();
            for (final Iterator<Map.Entry<String, List<String>>> mi = loadedBlacklist.iterator(); mi.hasNext();) {
                loadedEntry = mi.next();
                loadedPaths = loadedEntry.getValue();

                // create new entry if host mask unknown, otherwise merge
                // existing one with path patterns from blacklist file
                paths = (isMatchable(loadedEntry.getKey())) ? blacklistMapMatch.get(loadedEntry.getKey()) : blacklistMapNotMatch.get(loadedEntry.getKey());
                if (paths == null) {
                    if (isMatchable(loadedEntry.getKey())) {
                        blacklistMapMatch.put(loadedEntry.getKey(), loadedPaths);
                    } else {
                        blacklistMapNotMatch.put(loadedEntry.getKey(), loadedPaths);
                    }
                } else {
                    // check for duplicates? (refactor List -> Set)
                    paths.addAll(new HashSet<String>(loadedPaths));
                }
            }
        }
    }

    public void loadList(final String blacklistType, final String fileNames, final String sep) {
        // method for not breaking older plasmaURLPattern interface
        final BlacklistFile blFile = new BlacklistFile(fileNames, blacklistType);

        loadList(blFile, sep);
    }

    public void removeAll(final String blacklistType, final String host) {
        getBlacklistMap(blacklistType, true).remove(host);
        getBlacklistMap(blacklistType, false).remove(host);
    }

    public void remove(final String blacklistType, final String host, final String path) {

        final Map<String, List<String>> blacklistMap = getBlacklistMap(blacklistType, true);
        List<String> hostList = blacklistMap.get(host);
        if (hostList != null) {
            hostList.remove(path);
            if (hostList.isEmpty()) {
                blacklistMap.remove(host);
            }
        }

        final Map<String, List<String>> blacklistMapNotMatch = getBlacklistMap(blacklistType, false);
        hostList = blacklistMapNotMatch.get(host);
        if (hostList != null) {
            hostList.remove(path);
            if (hostList.isEmpty()) {
                blacklistMapNotMatch.remove(host);
            }
        }
    }

    public void add(final String blacklistType, final String host, final String path) {
        if (host == null) {
            throw new IllegalArgumentException("host may not be null");
        }
        if (path == null) {
            throw new IllegalArgumentException("path may not be null");
        }

        final String p = (path.length() > 0 && path.charAt(0) == '/') ? path.substring(1) : path;

        final Map<String, List<String>> blacklistMap = getBlacklistMap(blacklistType, isMatchable(host));

        // avoid PatternSyntaxException e
        final String h =
                ((!isMatchable(host) && host.length() > 0 && host.charAt(0) == '*') ? "." + host : host).toLowerCase();

        List<String> hostList;
        if (!(blacklistMap.containsKey(h) && ((hostList = blacklistMap.get(h)) != null))) {
            blacklistMap.put(h, (hostList = new ArrayList<String>()));
        }

        hostList.add(p);
    }

    public int blacklistCacheSize() {
        int size = 0;
        final Iterator<String> iter = this.cachedUrlHashs.keySet().iterator();
        while (iter.hasNext()) {
            size += this.cachedUrlHashs.get(iter.next()).size();
        }
        return size;
    }

    public boolean hashInBlacklistedCache(final String blacklistType, final byte[] urlHash) {
        return getCacheUrlHashsSet(blacklistType).has(urlHash);
    }

    public boolean contains(final String blacklistType, final String host, final String path) {
        boolean ret = false;

        if (blacklistType != null && host != null && path != null) {
            final Map<String, List<String>> blacklistMap =
                    getBlacklistMap(blacklistType, isMatchable(host));

            // avoid PatternSyntaxException e
            final String h =
                    ((!isMatchable(host) && host.length() > 0 && host.charAt(0) == '*') ? "." + host : host).toLowerCase();

            final List<String> hostList = blacklistMap.get(h);
            if (hostList != null) {
                ret = hostList.contains(path);
            }
        }
        return ret;
    }

    public boolean isListed(final String blacklistType, final DigestURI url) {
        if (url == null) {
            throw new IllegalArgumentException("url may not be null");
        }

        if (url.getHost() == null) {
            return false;
        }
        final HandleSet urlHashCache = getCacheUrlHashsSet(blacklistType);
        if (!urlHashCache.has(url.hash())) {
            final boolean temp = isListed(blacklistType, url.getHost().toLowerCase(), url.getFile());
            if (temp) {
                try {
                    urlHashCache.put(url.hash());
                } catch (final RowSpaceExceededException e) {
                    Log.logException(e);
                }
            }
            return temp;
        }
        return true;
    }

    public static boolean isMatchable(final String host) {

        return (
                (Pattern.matches("^[a-z0-9.-]*$", host))            // simple Domain (yacy.net or www.yacy.net)
                || (Pattern.matches("^\\*\\.[a-z0-9-.]*$", host))   // start with *. (not .* and * must follow a dot)
                || (Pattern.matches("^[a-z0-9-.]*\\.\\*$", host))   // ends with .* (not *. and before * must be a dot)
                );
    }

    public String getEngineInfo() {
        return "Default YaCy Blacklist Engine";
    }

    public boolean isListed(final String blacklistType, final String hostlow, final String path) {
        if (hostlow == null) {
            throw new IllegalArgumentException("hostlow may not be null");
        }
        if (path == null) {
            throw new IllegalArgumentException("path may not be null");
        }

        // getting the proper blacklist
        final Map<String, List<String>> blacklistMapMatched = getBlacklistMap(blacklistType, true);

        final String p = (path.length() > 0 && path.charAt(0) == '/') ? path.substring(1) : path;

        List<String> app;
        boolean matched = false;
        String pp = ""; // path-pattern

        // try to match complete domain
        if (!matched && (app = blacklistMapMatched.get(hostlow)) != null) {
            for (int i = app.size() - 1; !matched && i > -1; i--) {
                pp = app.get(i);
                if (pp.indexOf("?*",0) > 0) {
                    // prevent "Dangling meta character '*'" exception
                    Log.logWarning("Blacklist", "ignored blacklist path to prevent 'Dangling meta character' exception: " + pp);
                    continue;
                }
                matched |= (("*".equals(pp)) || (p.matches(pp)));
            }
        }
        // first try to match the domain with wildcard '*'
        // [TL] While "." are found within the string
        int index = 0;
        while (!matched && (index = hostlow.indexOf('.', index + 1)) != -1) {
            if ((app = blacklistMapMatched.get(hostlow.substring(0, index + 1) + "*")) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
            if ((app = blacklistMapMatched.get(hostlow.substring(0, index))) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
        }
        index = hostlow.length();
        while (!matched && (index = hostlow.lastIndexOf('.', index - 1)) != -1) {
            if ((app = blacklistMapMatched.get("*" + hostlow.substring(index, hostlow.length()))) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
            if ((app = blacklistMapMatched.get(hostlow.substring(index + 1, hostlow.length()))) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
        }


        // loop over all regex entries
        if (!matched) {
            final Map<String, List<String>> blacklistMapNotMatched = getBlacklistMap(blacklistType, false);
            String key;
            for (final Entry<String, List<String>> entry : blacklistMapNotMatched.entrySet()) {
                key = entry.getKey();
                try {
                    if (Pattern.matches(key, hostlow)) {
                        app = entry.getValue();
                        for (int i = 0; i < app.size(); i++) {
                            if (Pattern.matches(app.get(i), p)) {
                                return true;
                            }
                        }
                    }
                } catch (final PatternSyntaxException e) {
                    //System.out.println(e.toString());
                }
            }
        }
        return matched;
    }

    public BlacklistError checkError(final String element, final Map<String, String> properties) {

        // constant-first comparison avoids a NullPointerException if "allowRegex" is not set
        final boolean allowRegex = (properties != null) && "true".equalsIgnoreCase(properties.get("allowRegex"));
        int slashPos;
        final String host, path;

        if ((slashPos = element.indexOf('/')) == -1) {
            host = element;
            path = ".*";
        } else {
            host = element.substring(0, slashPos);
            path = element.substring(slashPos + 1);
        }

        if (!allowRegex || !RegexHelper.isValidRegex(host)) {
            final int i = host.indexOf('*');

            // check whether host begins illegally
            if (!host.matches("([A-Za-z0-9_-]+|\\*)(\\.([A-Za-z0-9_-]+|\\*))*")) {
                if (i == 0 && host.length() > 1 && host.charAt(1) != '.') {
                    return BlacklistError.SUBDOMAIN_XOR_WILDCARD;
                }
                return BlacklistError.HOST_WRONG_CHARS;
            }

            // in host-part only full sub-domains may be wildcards
            if (host.length() > 0 && i > -1) {
                if (!(i == 0 || i == host.length() - 1)) {
                    return BlacklistError.WILDCARD_BEGIN_OR_END;
                }

                if (i == host.length() - 1 && host.length() > 1 && host.charAt(i - 1) != '.') {
                    return BlacklistError.SUBDOMAIN_XOR_WILDCARD;
                }
            }

            // check for double occurrences of "*" in host
            if (host.indexOf("*", i + 1) > -1) {
                return BlacklistError.TWO_WILDCARDS_IN_HOST;
            }
        } else if (allowRegex && !RegexHelper.isValidRegex(host)) {
            return BlacklistError.HOST_REGEX;
        }

        // check for errors on regex-compiling path
        if (!RegexHelper.isValidRegex(path) && !"*".equals(path)) {
            return BlacklistError.PATH_REGEX;
        }

        return BlacklistError.NO_ERROR;
    }

    public static String defaultBlacklist(final File listsPath) {
        final List<String> dirlist = FileUtils.getDirListing(listsPath, Blacklist.BLACKLIST_FILENAME_FILTER);
        if (dirlist.isEmpty()) {
            return null;
        }
        return dirlist.get(0);
    }

    /**
     * Checks if a blacklist file contains a certain entry.
     * @param blacklistToUse The blacklist.
     * @param newEntry The Entry.
     * @return True if file contains entry, else false.
     */
    public static boolean blacklistFileContains(final File listsPath, final String blacklistToUse, final String newEntry) {
        final Set<String> blacklist = new HashSet<String>(FileUtils.getListArray(new File(listsPath, blacklistToUse)));
        return blacklist != null && blacklist.contains(newEntry);
    }
   
    public final void saveDHTCache() {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(BLACKLIST_CACHEFILE));
            try {
                out.writeObject(getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT));
            } finally {
                // close the stream even if writing fails
                out.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public final void loadDHTCache() {
        try {
            if (BLACKLIST_CACHEFILE.exists()) {
                ObjectInputStream in = new ObjectInputStream(new FileInputStream(BLACKLIST_CACHEFILE));
                try {
                    this.cachedUrlHashs.put(Blacklist.BLACKLIST_DHT, (HandleSet) in.readObject());
                } finally {
                    // close the stream even if reading fails
                    in.close();
                }
            }
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // fall back to an empty cache if no file exists or loading failed,
        // so that getCacheUrlHashsSet(BLACKLIST_DHT) never returns null
        if (!this.cachedUrlHashs.containsKey(Blacklist.BLACKLIST_DHT)) {
            this.cachedUrlHashs.put(Blacklist.BLACKLIST_DHT, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));
        }
    }
}

The DHT cache is no longer cleared by Blacklist.clear(), because otherwise the cache would be reset on every start.

That doesn't bother me. If necessary, I simply delete the file "DATA/DHT_Blacklist_Cache.ser".
Last edited by gaston on Fri May 25, 2012 8:02 pm, edited 1 time in total.
gaston

Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm
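
One possible hardening of the save/load pair from the post above, as a sketch only: the cache is written to a temporary file first and then moved into place, and loading never returns null. A HashSet<String> stands in for HandleSet so that the example runs without YaCy; the class name CacheFileSketch and the ".tmp" suffix are assumptions made for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashSet;

final class CacheFileSketch {

    /** Writes to a temporary sibling file first, then replaces the target file. */
    static boolean save(final HashSet<String> cache, final File target) {
        final File tmp = new File(target.getPath() + ".tmp");
        try {
            final FileOutputStream fout = new FileOutputStream(tmp);
            try {
                final ObjectOutputStream out = new ObjectOutputStream(fout);
                out.writeObject(cache);
                out.flush(); // push buffered data to the file before closing it
            } finally {
                fout.close();
            }
        } catch (final IOException e) {
            tmp.delete();
            return false;
        }
        // File.renameTo does not overwrite on every platform, so remove the old file first.
        if (target.exists() && !target.delete()) {
            return false;
        }
        return tmp.renameTo(target);
    }

    /** Never returns null: a missing or unreadable file yields an empty cache. */
    @SuppressWarnings("unchecked")
    static HashSet<String> load(final File source) {
        if (!source.isFile()) {
            return new HashSet<String>();
        }
        try {
            final FileInputStream fin = new FileInputStream(source);
            try {
                final Object o = new ObjectInputStream(fin).readObject();
                return (o instanceof HashSet) ? (HashSet<String>) o : new HashSet<String>();
            } finally {
                fin.close();
            }
        } catch (final IOException e) {
            return new HashSet<String>();
        } catch (final ClassNotFoundException e) {
            return new HashSet<String>();
        }
    }

    public static void main(final String[] args) {
        final File file = new File(System.getProperty("java.io.tmpdir"), "dht-cache-sketch.ser");
        final HashSet<String> cache = new HashSet<String>();
        cache.add("exampleUrlHash");
        System.out.println("saved: " + save(cache, file));
        System.out.println("loaded entries: " + load(file).size());
        file.delete();
    }
}
```

Writing to a temporary file means an interrupted shutdown leaves at worst a stale ".tmp" file rather than a half-written cache. Deleting the old file before the rename is not atomic, so this trades a small window without any cache file for portability.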

Re: HandleSet & RowSet

Post by Quix0r » Wed May 16, 2012 2:10 pm

Hello gaston,

first of all, a warm welcome to the community. I don't want to take the wind out of your sails right away, because I think it's great when others want to contribute, but there are a few conventions you will have to adapt to (that's the case everywhere), so please don't be disappointed. :mrgreen:

If you only want to submit a few changes, please be so kind as to create a diff patch with your IDE (Eclipse/NetBeans, or what do you use?). That is better and has a greater chance of being accepted than posting a whole file here. Then upload the patch here in the forum.

If, however, you want to get involved more seriously and contribute more, please create an account at gitorious.org, since YaCy's code is hosted there. Then clone YaCy's rc1 repository (this can be done via the website) and add rc1 as a "remote-tracking" repository. If you then fetch it (e.g. git fetch rc1) and merge it (e.g. git merge rc1/master), you will also receive every change. After that you first have to commit your changes (don't forget git add some/foo/bar/some.java, i.e. first add your changes to the commit) and then push them (git push) so that they are copied to the Git server.

Afterwards Michael can possibly merge your changes on his side and include them. There is a better guide for this in the wiki.
Quix0r

Posts: 1345
Joined: Tue Jul 31, 2007 9:22 am
Location: Krefeld

Re: HandleSet & RowSet

Post by gaston » Wed May 16, 2012 3:01 pm

Hello Quix0r,
OK, here is a diff of Blacklist.java. For the one line in Switchboard.java there is no diff, because of other changes.
Unfortunately I don't have time for more.
Code:
diff --git "a/C:\\Blacklist-HEAD-left.java" "b/C:\\Blacklist.java"
index 32b4f30..33ce140 100644
--- "a/C:\\Blacklist-HEAD-left.java"
+++ "b/C:\\Blacklist.java"
@@ -26,7 +26,12 @@
package net.yacy.repository;

import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
@@ -49,6 +54,7 @@ import net.yacy.kelondro.util.FileUtils;
import net.yacy.kelondro.util.SetTools;

public class Blacklist {
+    private static final File BLACKLIST_CACHEFILE = new File("DATA/DHT_Blacklist_Cache.ser");

     public static final String BLACKLIST_DHT = "dht";
     public static final String BLACKLIST_CRAWLER = "crawler";
@@ -107,8 +113,12 @@ public class Blacklist {

         for (final String blacklistType : BLACKLIST_TYPES) {
             this.hostpaths_matchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());
-            this.hostpaths_notmatchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());
-            this.cachedUrlHashs.put(blacklistType, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));
+            this.hostpaths_notmatchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());           
+            if (blacklistType.equals(Blacklist.BLACKLIST_DHT)) {
+                loadDHTCache();
+            } else {
+                this.cachedUrlHashs.put(blacklistType, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));   
+            }
         }
     }

@@ -155,9 +165,9 @@ public class Blacklist {
         for (final Map<String, List<String>> entry : this.hostpaths_notmatchable.values()) {
             entry.clear();
         }
-        for (final HandleSet entry : this.cachedUrlHashs.values()) {
-            entry.clear();
-        }
+//        for (final HandleSet entry : this.cachedUrlHashs.values()) {
+//            entry.clear();
+//        }
     }

     public int size() {
@@ -507,4 +517,34 @@ public class Blacklist {
         final Set<String> blacklist = new HashSet<String>(FileUtils.getListArray(new File(listsPath, blacklistToUse)));
         return blacklist != null && blacklist.contains(newEntry);
     }
+   
+    public final void saveDHTCache() {
+        try {
+            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(BLACKLIST_CACHEFILE));
+            out.writeObject(getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT));
+            out.close();
+           
+        } catch (IOException e) {
+            e.printStackTrace();
+        }
+    }
+
+    public final void loadDHTCache() {
+        try {
+            if (BLACKLIST_CACHEFILE.exists()) {
+                ObjectInputStream in;
+                in = new ObjectInputStream(new FileInputStream(BLACKLIST_CACHEFILE));
+                this.cachedUrlHashs.put(Blacklist.BLACKLIST_DHT, (HandleSet) in.readObject());
+                in.close();
+            } else {
+                this.cachedUrlHashs.put(Blacklist.BLACKLIST_DHT, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));               
+            }
+        } catch (ClassNotFoundException e) {
+            e.printStackTrace();
+        } catch (FileNotFoundException e) {
+            e.printStackTrace();
+        } catch (IOException e) {
+            e.printStackTrace();
+        }
+    }
}
gaston
 
Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm

Re: HandleSet & RowSet

Post by Quix0r » Wed May 16, 2012 3:17 pm

I patched your patch a little so that it fits YaCy:
Code:
diff --git "a/C:\\Blacklist-HEAD-left.java" "b/C:\\Blacklist.java"
index 32b4f30..33ce140 100644
--- "a/C:\\Blacklist-HEAD-left.java"
+++ "b/C:\\Blacklist.java"
@@ -26,7 +26,12 @@
package net.yacy.repository;

import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
@@ -49,6 +54,7 @@ import net.yacy.kelondro.util.FileUtils;
import net.yacy.kelondro.util.SetTools;

public class Blacklist {
+    private static final File BLACKLIST_CACHEFILE = new File("DATA/DHT_Blacklist_Cache.ser");

     public static final String BLACKLIST_DHT = "dht";
     public static final String BLACKLIST_CRAWLER = "crawler";
@@ -107,8 +113,12 @@ public class Blacklist {

         for (final String blacklistType : BLACKLIST_TYPES) {
             this.hostpaths_matchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());
-            this.hostpaths_notmatchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());
-            this.cachedUrlHashs.put(blacklistType, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));
+            this.hostpaths_notmatchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());           
+            if (blacklistType.equals(Blacklist.BLACKLIST_DHT)) {
+                loadDHTCache();
+            } else {
+                this.cachedUrlHashs.put(blacklistType, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));   
+            }
         }
     }

@@ -155,9 +165,9 @@ public class Blacklist {
         for (final Map<String, List<String>> entry : this.hostpaths_notmatchable.values()) {
             entry.clear();
         }
-        for (final HandleSet entry : this.cachedUrlHashs.values()) {
-            entry.clear();
-        }
+//        for (final HandleSet entry : this.cachedUrlHashs.values()) {
+//            entry.clear();
+//        }
     }

     public int size() {
@@ -507,4 +517,34 @@ public class Blacklist {
         final Set<String> blacklist = new HashSet<String>(FileUtils.getListArray(new File(listsPath, blacklistToUse)));
         return blacklist != null && blacklist.contains(newEntry);
     }
+   
+    public final void saveDHTCache() {
+        try {
+            final ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(BLACKLIST_CACHEFILE));
+            out.writeObject(getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT));
+            out.close();
+           
+        } catch (final IOException e) {
+            Log.logException(e);
+        }
+    }
+
+    public final void loadDHTCache() {
+        try {
+            if (BLACKLIST_CACHEFILE.exists()) {
+                final ObjectInputStream in = new ObjectInputStream(new FileInputStream(BLACKLIST_CACHEFILE));
+                this.cachedUrlHashs.put(Blacklist.BLACKLIST_DHT, (HandleSet) in.readObject());
+                in.close();
+            } else {
+                this.cachedUrlHashs.put(Blacklist.BLACKLIST_DHT, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));               
+            }
+        } catch (final ClassNotFoundException e) {
+            Log.logException(e);
+        } catch (final FileNotFoundException e) {
+            Log.logException(e);
+        } catch (final IOException e) {
+            Log.logException(e);
+        }
+    }
}

The 'final' in the catch blocks is a good idea: it tells the compiler that the variable "e" must not be reassigned, which would make no sense here anyway. It mainly documents intent; any optimization gain for the Java compiler is likely small. Printing the exception with e.printStackTrace(); sends it to the console, where it is usually no longer readable (and therefore not debuggable). Calling Log.logException(e); instead hands it to YaCy's own logger, which writes it to the log.

The rest, apart from the missing 'final' for "in" and "out", is acceptable for now. However, you could consider making loadDHTCache() and saveDHTCache() 'private' if you only use these methods in Blacklist.java. That can prevent many a silly mistake.
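
The points above (logging instead of printStackTrace(), 'final' catch parameters) can be sketched in isolation. This is only a minimal sketch under stated assumptions: HashSet<String> stands in for YaCy's HandleSet, java.util.logging stands in for YaCy's Log class, and the class name CacheStore is hypothetical. It also returns an empty set when reading fails; in the patch above, a failed read appears to leave no entry for BLACKLIST_DHT in cachedUrlHashs.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashSet;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of YaCy; needs Java 7 or later.
public class CacheStore {
    // Stand-in for YaCy's own logger (assumption).
    private static final Logger LOG = Logger.getLogger(CacheStore.class.getName());

    /** Writes the set to the file; failures go to the logger, not the console. */
    static void save(final File file, final HashSet<String> hashes) {
        // try-with-resources closes the stream even if writeObject() fails
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(hashes);
        } catch (final IOException e) {
            LOG.log(Level.WARNING, "could not save cache to " + file, e);
        }
    }

    /** Reads the set back; returns an empty set if the file is missing or unreadable. */
    @SuppressWarnings("unchecked")
    static HashSet<String> load(final File file) {
        if (file.exists()) {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (HashSet<String>) in.readObject();
            } catch (final IOException | ClassNotFoundException | ClassCastException e) {
                LOG.log(Level.WARNING, "could not load cache from " + file, e);
            }
        }
        // fallback: the caller always gets a usable (empty) cache
        return new HashSet<String>();
    }
}
```

With this shape the caller never has to deal with a missing cache: a corrupt or absent file just means starting with an empty set.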
Quix0r
 
Posts: 1345
Joined: Tue Jul 31, 2007 9:22 am
Location: Krefeld

Re: HandleSet & RowSet

Post by Orbiter » Thu May 17, 2012 8:30 am

Great! Please put it into a git clone.
Orbiter
 
Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: HandleSet & RowSet

Post by Quix0r » Fri May 25, 2012 3:20 pm

But where is saveDHTCache() supposed to be called? I only see a call to loadDHTCache().

Also, the lines in clear() that should empty cachedUrlHashs are commented out. Is that okay?
Quix0r
 
Posts: 1345
Joined: Tue Jul 31, 2007 9:22 am
Location: Krefeld

Re: HandleSet & RowSet

Post by gaston » Fri May 25, 2012 8:01 pm

gaston wrote: In Switchboard.java, this line must be added in close().
Code:
urlBlacklist.saveDHTCache();
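
Saving only in close() means that an interrupted shutdown could leave a half-written cache file behind. One common way to guard against that is to write to a temporary file and rename it over the target afterwards. The following is a rough sketch, not what the patch does; the class name SafeSave and the ".tmp" suffix are assumptions for illustration.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of YaCy; needs Java 7 or later.
public class SafeSave {
    /** Serializes data to a sibling temp file, then renames it over the target. */
    static void save(final Path target, final Serializable data) throws IOException {
        final Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (OutputStream raw = Files.newOutputStream(tmp);
             ObjectOutputStream out = new ObjectOutputStream(raw)) {
            out.writeObject(data);
        }
        // Only a completely written file replaces the previous cache.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

If writing fails, the exception propagates before the move, so the previous cache file stays untouched.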

Blacklist.java
Code:
// Blacklist.java
// (C) 2005 by Michael Peter Christen; mc@yacy.net, Frankfurt a. M., Germany
// first published 11.07.2005 on http://yacy.net
//
// This is a part of YaCy, a peer-to-peer based web search engine
//
// $LastChangedDate$
// $LastChangedRevision$
// $LastChangedBy$
//
// LICENSE
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
package net.yacy.repository;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

import net.yacy.kelondro.data.meta.DigestURI;
import net.yacy.kelondro.data.meta.URIMetadataRow;
import net.yacy.kelondro.index.HandleSet;
import net.yacy.kelondro.index.RowSpaceExceededException;
import net.yacy.kelondro.logging.Log;
import net.yacy.kelondro.util.FileUtils;
import net.yacy.kelondro.util.SetTools;

public class Blacklist {
    private static final File BLACKLIST_CACHEFILE = new File("DATA/DHT_Blacklist_Cache.ser");

    public static final String BLACKLIST_DHT = "dht";
    public static final String BLACKLIST_CRAWLER = "crawler";
    public static final String BLACKLIST_PROXY = "proxy";
    public static final String BLACKLIST_SEARCH = "search";
    public static final String BLACKLIST_SURFTIPS = "surftips";
    public static final String BLACKLIST_NEWS = "news";
    public final static String BLACKLIST_FILENAME_FILTER = "^.*\\.black$";

    public static enum BlacklistError {

        NO_ERROR(0),
        TWO_WILDCARDS_IN_HOST(1),
        SUBDOMAIN_XOR_WILDCARD(2),
        PATH_REGEX(3),
        WILDCARD_BEGIN_OR_END(4),
        HOST_WRONG_CHARS(5),
        DOUBLE_OCCURANCE(6),
        HOST_REGEX(7);
        final int errorCode;

        BlacklistError(final int errorCode) {
            this.errorCode = errorCode;
        }

        public int getInt() {
            return this.errorCode;
        }

        public long getLong() {
            return this.errorCode;
        }
    }
    protected static final Set<String> BLACKLIST_TYPES = new HashSet<String>(Arrays.asList(new String[]{
                Blacklist.BLACKLIST_CRAWLER,
                Blacklist.BLACKLIST_PROXY,
                Blacklist.BLACKLIST_DHT,
                Blacklist.BLACKLIST_SEARCH,
                Blacklist.BLACKLIST_SURFTIPS,
                Blacklist.BLACKLIST_NEWS
            }));
    public static final String BLACKLIST_TYPES_STRING = "proxy,crawler,dht,search,surftips,news";
    private File blacklistRootPath = null;
    private final ConcurrentMap<String, HandleSet> cachedUrlHashs;
    private final ConcurrentMap<String, Map<String, List<String>>> hostpaths_matchable; // key=host, value=path; mapped url is http://host/path; path does not start with '/' here
    private final ConcurrentMap<String, Map<String, List<String>>> hostpaths_notmatchable; // key=host, value=path; mapped url is http://host/path; path does not start with '/' here

    public Blacklist(final File rootPath) {

        setRootPath(rootPath);

        // prepare the data structure
        this.hostpaths_matchable = new ConcurrentHashMap<String, Map<String, List<String>>>();
        this.hostpaths_notmatchable = new ConcurrentHashMap<String, Map<String, List<String>>>();
        this.cachedUrlHashs = new ConcurrentHashMap<String, HandleSet>();

        for (final String blacklistType : BLACKLIST_TYPES) {
            this.hostpaths_matchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());
            this.hostpaths_notmatchable.put(blacklistType, new ConcurrentHashMap<String, List<String>>());           
            if (blacklistType.equals(Blacklist.BLACKLIST_DHT)) {
                loadDHTCache();
            } else {
                this.cachedUrlHashs.put(blacklistType, new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0));   
            }
        }
    }

    public final void setRootPath(final File rootPath) {
        if (rootPath == null) {
            throw new NullPointerException("The blacklist root path must not be null.");
        }
        if (!rootPath.isDirectory()) {
            throw new IllegalArgumentException("The blacklist root path is not a directory.");
        }
        if (!rootPath.canRead()) {
            throw new IllegalArgumentException("The blacklist root path is not readable.");
        }

        this.blacklistRootPath = rootPath;
    }

    protected Map<String, List<String>> getBlacklistMap(final String blacklistType, final boolean matchable) {
        if (blacklistType == null) {
            throw new IllegalArgumentException("Blacklist type not set.");
        }
        if (!BLACKLIST_TYPES.contains(blacklistType)) {
            throw new IllegalArgumentException("Unknown blacklist type: " + blacklistType + ".");
        }

        return (matchable) ? this.hostpaths_matchable.get(blacklistType) : this.hostpaths_notmatchable.get(blacklistType);
    }

    protected HandleSet getCacheUrlHashsSet(final String blacklistType) {
        if (blacklistType == null) {
            throw new IllegalArgumentException("Blacklist type not set.");
        }
        if (!BLACKLIST_TYPES.contains(blacklistType)) {
            throw new IllegalArgumentException("Unknown backlist type.");
        }

        return this.cachedUrlHashs.get(blacklistType);
    }

    public void clear() {
        for (final Map<String, List<String>> entry : this.hostpaths_matchable.values()) {
            entry.clear();
        }
        for (final Map<String, List<String>> entry : this.hostpaths_notmatchable.values()) {
            entry.clear();
        }
//        for (final HandleSet entry : this.cachedUrlHashs.values()) {
//            entry.clear();
//        }
    }

    public int size() {
        int size = 0;
        for (final String entry : this.hostpaths_matchable.keySet()) {
            for (final List<String> ientry : this.hostpaths_matchable.get(entry).values()) {
                size += ientry.size();
            }
        }
        for (final String entry : this.hostpaths_notmatchable.keySet()) {
            for (final List<String> ientry : this.hostpaths_notmatchable.get(entry).values()) {
                size += ientry.size();
            }
        }
        return size;
    }

    public void loadList(final BlacklistFile[] blFiles, final String sep) {
        for (final BlacklistFile blf : blFiles) {
            loadList(blf.getType(), blf.getFileName(), sep);
        }
    }

    /**
     * create a blacklist from file, entries separated by 'sep'
     * duplicit entries are removed
     * @param blFile
     * @param sep
     */
    private void loadList(final BlacklistFile blFile, final String sep) {
        final Map<String, List<String>> blacklistMapMatch = getBlacklistMap(blFile.getType(), true);
        final Map<String, List<String>> blacklistMapNotMatch = getBlacklistMap(blFile.getType(), false);
        Set<Map.Entry<String, List<String>>> loadedBlacklist;
        Map.Entry<String, List<String>> loadedEntry;
        List<String> paths;
        List<String> loadedPaths;

        final Set<String> fileNames = blFile.getFileNamesUnified();
        for (final String fileName : fileNames) {
            // make sure all requested blacklist files exist
            final File file = new File(this.blacklistRootPath, fileName);
            try {
                file.createNewFile();
            } catch (final IOException e) { /* */ }

            // join all blacklists from files into one internal blacklist map
            loadedBlacklist = SetTools.loadMapMultiValsPerKey(file.toString(), sep).entrySet();
            for (final Iterator<Map.Entry<String, List<String>>> mi = loadedBlacklist.iterator(); mi.hasNext();) {
                loadedEntry = mi.next();
                loadedPaths = loadedEntry.getValue();

                // create new entry if host mask unknown, otherwise merge
                // existing one with path patterns from blacklist file
                paths = (isMatchable(loadedEntry.getKey())) ? blacklistMapMatch.get(loadedEntry.getKey()) : blacklistMapNotMatch.get(loadedEntry.getKey());
                if (paths == null) {
                    if (isMatchable(loadedEntry.getKey())) {
                        blacklistMapMatch.put(loadedEntry.getKey(), loadedPaths);
                    } else {
                        blacklistMapNotMatch.put(loadedEntry.getKey(), loadedPaths);
                    }
                } else {
                    // check for duplicates? (refactor List -> Set)
                    paths.addAll(new HashSet<String>(loadedPaths));
                }
            }
        }
    }

    public void loadList(final String blacklistType, final String fileNames, final String sep) {
        // method for not breaking older plasmaURLPattern interface
        final BlacklistFile blFile = new BlacklistFile(fileNames, blacklistType);

        loadList(blFile, sep);
    }

    public void removeAll(final String blacklistType, final String host) {
        getBlacklistMap(blacklistType, true).remove(host);
        getBlacklistMap(blacklistType, false).remove(host);
    }

    public void remove(final String blacklistType, final String host, final String path) {

        final Map<String, List<String>> blacklistMap = getBlacklistMap(blacklistType, true);
        List<String> hostList = blacklistMap.get(host);
        if (hostList != null) {
            hostList.remove(path);
            if (hostList.isEmpty()) {
                blacklistMap.remove(host);
            }
        }

        final Map<String, List<String>> blacklistMapNotMatch = getBlacklistMap(blacklistType, false);
        hostList = blacklistMapNotMatch.get(host);
        if (hostList != null) {
            hostList.remove(path);
            if (hostList.isEmpty()) {
                blacklistMapNotMatch.remove(host);
            }
        }
    }

    public void add(final String blacklistType, final String host, final String path) {
        if (host == null) {
            throw new IllegalArgumentException("host may not be null");
        }
        if (path == null) {
            throw new IllegalArgumentException("path may not be null");
        }

        final String p = (path.length() > 0 && path.charAt(0) == '/') ? path.substring(1) : path;

        final Map<String, List<String>> blacklistMap = getBlacklistMap(blacklistType, isMatchable(host));

        // avoid PatternSyntaxException e
        final String h =
                ((!isMatchable(host) && host.length() > 0 && host.charAt(0) == '*') ? "." + host : host).toLowerCase();

        List<String> hostList;
        if (!(blacklistMap.containsKey(h) && ((hostList = blacklistMap.get(h)) != null))) {
            blacklistMap.put(h, (hostList = new ArrayList<String>()));
        }

        hostList.add(p);
    }

    public int blacklistCacheSize() {
        int size = 0;
        final Iterator<String> iter = this.cachedUrlHashs.keySet().iterator();
        while (iter.hasNext()) {
            size += this.cachedUrlHashs.get(iter.next()).size();
        }
        return size;
    }

    public boolean hashInBlacklistedCache(final String blacklistType, final byte[] urlHash) {
        return getCacheUrlHashsSet(blacklistType).has(urlHash);
    }

    public boolean contains(final String blacklistType, final String host, final String path) {
        boolean ret = false;

        if (blacklistType != null && host != null && path != null) {
            final Map<String, List<String>> blacklistMap =
                    getBlacklistMap(blacklistType, isMatchable(host));

            // avoid PatternSyntaxException e
            final String h =
                    ((!isMatchable(host) && host.length() > 0 && host.charAt(0) == '*') ? "." + host : host).toLowerCase();

            final List<String> hostList = blacklistMap.get(h);
            if (hostList != null) {
                ret = hostList.contains(path);
            }
        }
        return ret;
    }

    public boolean isListed(final String blacklistType, final DigestURI url) {
        if (url == null) {
            throw new IllegalArgumentException("url may not be null");
        }

        if (url.getHost() == null) {
            return false;
        }
        final HandleSet urlHashCache = getCacheUrlHashsSet(blacklistType);
        if (!urlHashCache.has(url.hash())) {
            final boolean temp = isListed(blacklistType, url.getHost().toLowerCase(), url.getFile());
            if (temp) {
                try {
                    urlHashCache.put(url.hash());
                } catch (final RowSpaceExceededException e) {
                    Log.logException(e);
                }
            }
            return temp;
        }
        return true;
    }

    public static boolean isMatchable(final String host) {

        return (
                (Pattern.matches("^[a-z0-9.-]*$", host))            // simple Domain (yacy.net or www.yacy.net)
                || (Pattern.matches("^\\*\\.[a-z0-9-.]*$", host))   // start with *. (not .* and * must follow a dot)
                || (Pattern.matches("^[a-z0-9-.]*\\.\\*$", host))   // ends with .* (not *. and before * must be a dot)
                );
    }

    public String getEngineInfo() {
        return "Default YaCy Blacklist Engine";
    }

    public boolean isListed(final String blacklistType, final String hostlow, final String path) {
        if (hostlow == null) {
            throw new IllegalArgumentException("hostlow may not be null");
        }
        if (path == null) {
            throw new IllegalArgumentException("path may not be null");
        }

        // getting the proper blacklist
        final Map<String, List<String>> blacklistMapMatched = getBlacklistMap(blacklistType, true);

        final String p = (path.length() > 0 && path.charAt(0) == '/') ? path.substring(1) : path;

        List<String> app;
        boolean matched = false;
        String pp = ""; // path-pattern

        // try to match complete domain
        if (!matched && (app = blacklistMapMatched.get(hostlow)) != null) {
            for (int i = app.size() - 1; !matched && i > -1; i--) {
                pp = app.get(i);
                if (pp.indexOf("?*",0) > 0) {
                    // prevent "Dangling meta character '*'" exception
                    Log.logWarning("Blacklist", "ignored blacklist path to prevent 'Dangling meta character' exception: " + pp);
                    continue;
                }
                matched |= (("*".equals(pp)) || (p.matches(pp)));
            }
        }
        // first try to match the domain with wildcard '*'
        // [TL] While "." are found within the string
        int index = 0;
        while (!matched && (index = hostlow.indexOf('.', index + 1)) != -1) {
            if ((app = blacklistMapMatched.get(hostlow.substring(0, index + 1) + "*")) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
            if ((app = blacklistMapMatched.get(hostlow.substring(0, index))) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
        }
        index = hostlow.length();
        while (!matched && (index = hostlow.lastIndexOf('.', index - 1)) != -1) {
            if ((app = blacklistMapMatched.get("*" + hostlow.substring(index, hostlow.length()))) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
            if ((app = blacklistMapMatched.get(hostlow.substring(index + 1, hostlow.length()))) != null) {
                for (int i = app.size() - 1; !matched && i > -1; i--) {
                    pp = app.get(i);
                    matched |= (("*".equals(pp)) || (p.matches(pp)));
                }
            }
        }


        // loop over all Regexentrys
        if (!matched) {
            final Map<String, List<String>> blacklistMapNotMatched = getBlacklistMap(blacklistType, false);
            String key;
            for (final Entry<String, List<String>> entry : blacklistMapNotMatched.entrySet()) {
                key = entry.getKey();
                try {
                    if (Pattern.matches(key, hostlow)) {
                        app = entry.getValue();
                        for (int i = 0; i < app.size(); i++) {
                            if (Pattern.matches(app.get(i), p)) {
                                return true;
                            }
                        }
                    }
                } catch (final PatternSyntaxException e) {
                    //System.out.println(e.toString());
                }
            }
        }
        return matched;
    }

    public BlacklistError checkError(final String element, final Map<String, String> properties) {

        final boolean allowRegex = (properties != null) && properties.get("allowRegex").equalsIgnoreCase("true");
        int slashPos;
        final String host, path;

        if ((slashPos = element.indexOf('/')) == -1) {
            host = element;
            path = ".*";
        } else {
            host = element.substring(0, slashPos);
            path = element.substring(slashPos + 1);
        }

        if (!allowRegex || !RegexHelper.isValidRegex(host)) {
            final int i = host.indexOf('*');

            // check whether host begins illegally
            if (!host.matches("([A-Za-z0-9_-]+|\\*)(\\.([A-Za-z0-9_-]+|\\*))*")) {
                if (i == 0 && host.length() > 1 && host.charAt(1) != '.') {
                    return BlacklistError.SUBDOMAIN_XOR_WILDCARD;
                }
                return BlacklistError.HOST_WRONG_CHARS;
            }

            // in host-part only full sub-domains may be wildcards
            if (host.length() > 0 && i > -1) {
                if (!(i == 0 || i == host.length() - 1)) {
                    return BlacklistError.WILDCARD_BEGIN_OR_END;
                }

                if (i == host.length() - 1 && host.length() > 1 && host.charAt(i - 1) != '.') {
                    return BlacklistError.SUBDOMAIN_XOR_WILDCARD;
                }
            }

            // check for double-occurences of "*" in host
            if (host.indexOf("*", i + 1) > -1) {
                return BlacklistError.TWO_WILDCARDS_IN_HOST;
            }
        } else if (allowRegex && !RegexHelper.isValidRegex(host)) {
            return BlacklistError.HOST_REGEX;
        }

        // check for errors on regex-compiling path
        if (!RegexHelper.isValidRegex(path) && !"*".equals(path)) {
            return BlacklistError.PATH_REGEX;
        }

        return BlacklistError.NO_ERROR;
    }

    public static String defaultBlacklist(final File listsPath) {
        final List<String> dirlist = FileUtils.getDirListing(listsPath, Blacklist.BLACKLIST_FILENAME_FILTER);
        if (dirlist.isEmpty()) {
            return null;
        }
        return dirlist.get(0);
    }

    /**
     * Checks if a blacklist file contains a certain entry.
     * @param blacklistToUse The blacklist.
     * @param newEntry The Entry.
     * @return True if file contains entry, else false.
     */
    public static boolean blacklistFileContains(final File listsPath, final String blacklistToUse, final String newEntry) {
        final Set<String> blacklist = new HashSet<String>(FileUtils.getListArray(new File(listsPath, blacklistToUse)));
        return blacklist != null && blacklist.contains(newEntry);
    }
   
    public final void saveDHTCache() {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(BLACKLIST_CACHEFILE));
            out.writeObject(getCacheUrlHashsSet(Blacklist.BLACKLIST_DHT));
            out.close();
           
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public final void loadDHTCache() {
        // start with an empty cache so that the DHT entry exists even if reading fails
        HandleSet cache = new HandleSet(URIMetadataRow.rowdef.primaryKeyLength, URIMetadataRow.rowdef.objectOrder, 0);
        if (BLACKLIST_CACHEFILE.exists()) {
            ObjectInputStream in = null;
            try {
                in = new ObjectInputStream(new FileInputStream(BLACKLIST_CACHEFILE));
                cache = (HandleSet) in.readObject();
            } catch (final ClassNotFoundException e) {
                Log.logException(e);
            } catch (final FileNotFoundException e) {
                Log.logException(e);
            } catch (final IOException e) {
                Log.logException(e);
            } finally {
                // close the stream even if readObject() failed
                if (in != null) try { in.close(); } catch (final IOException e) { /* ignore */ }
            }
        }
        this.cachedUrlHashs.put(Blacklist.BLACKLIST_DHT, cache);
    }
}

Blacklist.clear() no longer empties the DHT cache, because otherwise the cache would be reset on every start.

That doesn't bother me. If need be, I simply delete the file "DATA/DHT_Blacklist_Cache.ser".
gaston
 
Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm

Re: HandleSet & RowSet

Post by Quix0r » Fri May 25, 2012 10:40 pm

Oops, I must have overlooked something there. Sorry. :shock:

Here is the commit:
https://gitorious.org/~quix0r/yacy/quix ... 0ef341a06c
Quix0r
 
Posts: 1345
Joined: Tue Jul 31, 2007 9:22 am
Location: Krefeld

Re: HandleSet & RowSet

Post by gaston » Sat Jun 09, 2012 1:41 pm

Too bad that it is not in the current code yet. For me it saves a lot of unnecessary traffic.

My list currently has 549,118 entries at a file size of a good 7 MB.
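
As a rough plausibility check of these numbers: assuming each cached URL hash is 12 bytes long (an assumption; the actual value is URIMetadataRow.rowdef.primaryKeyLength), the raw keys alone come to about 6.6 MB, which is in the same range as the reported file size once serialization overhead is added. The class name below is hypothetical.

```java
// Hypothetical back-of-the-envelope estimate, not part of YaCy.
public class CacheSizeEstimate {
    /** Raw size of the keys alone, without any serialization overhead. */
    static long rawBytes(final long entries, final int keyLength) {
        return entries * keyLength;
    }

    public static void main(final String[] args) {
        final long entries = 549118L; // number of entries reported in the post
        final int keyLength = 12;     // assumed URL hash length in bytes
        System.out.println("raw key bytes: " + rawBytes(entries, keyLength)); // prints 6589416
        System.out.println("bytes per entry at 7 MB: " + 7000000.0 / entries);
    }
}
```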

The number does surprise me, though, because I think I really only block useless pages.

Anyone can download my blacklist from my peer (Peer 28112011).


YaCy could use a whitelist for the crawler for sites such as dmoz.org. They are good for crawling, but in my opinion just ballast for the YaCy index.

gaston
gaston
 
Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm

Re: HandleSet & RowSet

Post by Quix0r » Mon Jun 11, 2012 7:55 pm

It should be included with the latest merges (with slight modifications). So, back to topic ...
Quix0r
 
Posts: 1345
Joined: Tue Jul 31, 2007 9:22 am
Location: Krefeld

Re: HandleSet & RowSet

Post by gaston » Wed Jun 13, 2012 3:55 pm

Clearing this.cachedUrlHashs.values() in clear() must be prevented, otherwise the file
DATA/WORK/blacklistCache_DHT.ser is reset on every start. Then saving makes no sense ;)

Code:
source/net/yacy/repository/Blacklist.java |    4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/source/net/yacy/repository/Blacklist.java b/source/net/yacy/repository/Blacklist.java
index 7675d7a..d69af6f 100644
--- a/source/net/yacy/repository/Blacklist.java
+++ b/source/net/yacy/repository/Blacklist.java
@@ -162,9 +162,9 @@ public class Blacklist {
         for (final Map<String, List<Pattern>> entry : this.hostpaths_notmatchable.values()) {
             entry.clear();
         }
-        for (final HandleSet entry : this.cachedUrlHashs.values()) {
+/*      for (final HandleSet entry : this.cachedUrlHashs.values()) {
             entry.clear();
-        }
+        }*/
     }

     public int size() {
gaston
 
Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm

Re: HandleSet & RowSet

Post by gaston » Tue Jul 10, 2012 8:35 am

push push push, would someone please change this... it is the only reason why I always have to "build" YaCy myself :(
gaston
 
Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm

Re: HandleSet & RowSet

Post by Orbiter » Tue Jul 10, 2012 12:11 pm

yesss... last week everything related to the talk at RMLL had priority.
Here is the fix: http://gitorious.org/yacy/rc1/commit/ae ... 7c956fcba6
It is a bit more extensive and also different, though. The semantics of clear() must not be changed, but the question of when and where clear() is called then has to be answered. The cache is now active for all blacklists and can be found under DATA/LISTS.
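
The idea of leaving clear() alone and persisting one cache per blacklist type can be sketched roughly as follows. This is not the code from the commit: HashSet<String> stands in for HandleSet, and the class name TypedCaches and the file naming are hypothetical (the naming is only modeled on the blacklistCache_DHT.ser mentioned earlier in the thread).

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not part of YaCy; needs Java 7 or later.
public class TypedCaches {
    private final Map<String, HashSet<String>> caches = new ConcurrentHashMap<String, HashSet<String>>();

    TypedCaches(final String... types) {
        for (final String type : types) caches.put(type, new HashSet<String>());
    }

    HashSet<String> get(final String type) { return caches.get(type); }

    /** Unchanged semantics: clear() empties every in-memory cache. */
    void clear() {
        for (final HashSet<String> cache : caches.values()) cache.clear();
    }

    private static File fileFor(final File dir, final String type) {
        return new File(dir, "blacklistCache_" + type + ".ser"); // hypothetical naming
    }

    /** Writes one file per blacklist type into dir. */
    void saveAll(final File dir) throws IOException {
        for (final Map.Entry<String, HashSet<String>> e : caches.entrySet()) {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileFor(dir, e.getKey())))) {
                out.writeObject(e.getValue());
            }
        }
    }

    /** Replaces each in-memory cache by the stored one, if a file exists. */
    @SuppressWarnings("unchecked")
    void loadAll(final File dir) throws IOException, ClassNotFoundException {
        for (final String type : caches.keySet()) {
            final File f = fileFor(dir, type);
            if (!f.exists()) continue; // keep the empty cache for this type
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
                caches.put(type, (HashSet<String>) in.readObject());
            }
        }
    }
}
```

In this sketch the order of calls decides whether the stored cache survives a start: calling loadAll() after the last clear() of the startup sequence keeps it, while calling clear() after loadAll() discards it in memory without touching the files.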
Orbiter
 
Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: HandleSet & RowSet

Post by gaston » Tue Jul 10, 2012 6:08 pm

Thanks :)
gaston
 
Posts: 143
Joined: Fri Jan 06, 2012 2:22 pm

