Aggregate / Clear duplicate results

Post by davide » Wed Mar 18, 2015 12:25 am

As you can see from the attached screenshot, many YaCy search results are duplicates or near-duplicates, positioned right next to each other in the results list. They are so similar as to be practically indistinguishable to the user, so they add nothing but clutter to the ranking list.

I think it would be best for YaCy to recognize these duplicates and tidy them up.

What do you think?
Attachments
yacy_search.jpg (204.03 KiB) viewed 1693 times
davide
 
Posts: 78
Joined: Fri Feb 15, 2013 8:03 am

Re: Aggregate / Clear duplicate results

Post by freak » Thu Apr 02, 2015 4:39 pm

I don't think this is a YaCy issue, because YaCy has no control over page titles.

If you only look at the page titles shown in the result list, then yes, they look like duplicates. But YaCy (like any other search engine/crawler) uses the actual URL as the unique identifier of a search result.

As you can see in your result list, the page title is the same, but the URL itself differs from result to result:

Title: Flash Memory - Buy Card ....
Url: http://www.misco.it/Cat/Fotografia-e-Vi ... orie-Flash?Viewtype=list


Title: Flash Memory - Buy Card ....
Url: http://www.misco.it/Cat/Fotografia-e-Vi ... orie-Flash?InStock=True

Title: Flash Memory - Buy Card ....
Url: http://www.misco.it/Cat/Fotografia-e-Vi ... orie-Flash?Viewtype=gallery


To clean up the results the YaCy way, you can do one of the following:

Blacklist
Decide whether you want results with the parameter Viewtype in your index. To block such URLs you can create a blacklist entry like (not tested) .*Viewtype=.* or something similar.

Dynamic URLs
Decide whether you want dynamic URLs to be crawled at all.
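To illustrate how the two options above would behave, here is a small Python sketch. The URLs are hypothetical stand-ins for the truncated misco.it links above, and the regex is the untested `.*Viewtype=.*` pattern suggested here; YaCy's actual blacklist syntax may differ.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical URLs standing in for the truncated misco.it examples.
urls = [
    "http://www.example.com/Cat/Memorie-Flash?Viewtype=list",
    "http://www.example.com/Cat/Memorie-Flash?InStock=True",
    "http://www.example.com/Cat/Memorie-Flash?Viewtype=gallery",
]

# A blacklist pattern like the suggested ".*Viewtype=.*" catches the
# two Viewtype variants but not the InStock one.
blacklist = re.compile(r".*Viewtype=.*")
blocked = [u for u in urls if blacklist.match(u)]

# Alternative: treat dynamic URLs as one page by normalizing away the
# query string, so all three variants collapse to a single URL.
def strip_query(url: str) -> str:
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

unique = {strip_query(u) for u in urls}
```

With these inputs, the blacklist blocks two of the three URLs, while query-string normalization collapses all three into one entry.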

I hope this helps clarify the "duplicates" topic. :)
freak
 
Posts: 21
Joined: Thu Oct 10, 2013 10:59 pm

Re: Aggregate / Clear duplicate results

Post by Winter_fox » Fri Apr 03, 2015 12:44 pm

I think Google solves this by not showing multiple pages from the same domain on the same results page.
Winter_fox
 
Posts: 2
Joined: Fri Apr 03, 2015 12:33 pm

Re: Aggregate / Clear duplicate results

Post by smokingwheels » Mon Apr 06, 2015 10:11 am

Interesting. I have a similar problem with Twitter showing up in many languages, with no way to limit it.
Maybe use a delimiter character in the URL string when you do a crawl, e.g. cut the URL short, or process it on the peer later on.

Why don't you file a wishlist report at http://mantis.tokeek.de/my_view_page.php ?
smokingwheels
 
Posts: 102
Joined: Sat Aug 31, 2013 7:16 am

Re: Aggregate / Clear duplicate results

Post by MikeS » Mon Apr 06, 2015 12:02 pm

I have a similar problem with Twitter showing in many languages


Twitter has the annoying habit of sending a page in the language specified in the "Accept-Language" request header. That means that if you crawl the same Twitter URL with different languages listed in the Accept-Language header, you will get different results.

This may be fine when a page is requested with an actual browser, but it can be really confusing when a crawler does it.
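One way a crawler can avoid this is to pin the Accept-Language header so that repeated fetches of the same URL always request the same language variant. A minimal Python sketch (the URL is a placeholder, and Accept-Language itself is standard HTTP; this is not YaCy's own crawler code):

```python
from urllib.request import Request

def make_request(url: str, lang: str = "en") -> Request:
    """Build a request with a fixed Accept-Language header, so the
    server cannot pick a language variant on its own."""
    return Request(url, headers={"Accept-Language": lang})

req = make_request("http://example.com/some-page")
```

Every fetch built this way asks for the same language, so the same URL should index to the same content regardless of the crawling peer's locale.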
MikeS
 
Posts: 88
Joined: Mon Feb 25, 2008 6:30 pm

Re: Aggregate / Clear duplicate results

Post by smokingwheels » Tue Apr 07, 2015 1:18 pm

MikeS wrote:
I have a similar problem with Twitter showing in many languages


Twitter has the annoying habit of sending a page in the language specified in the "Accept-Language" request header. That means that if you crawl the same Twitter URL with different languages listed in the Accept-Language header, you will get different results.

This may be fine when a page is requested with an actual browser, but it can be really confusing when a crawler does it.


There is a much better scraper for Twitter now, but I'm not going to share my peer with the YaCy network because my VM runs out of space every 2 days.
It's http://loklak.org , using RSS feeds into YaCy or a reader.
smokingwheels
 
Posts: 102
Joined: Sat Aug 31, 2013 7:16 am

Re: Aggregate / Clear duplicate results

Post by biolizard89 » Mon Apr 13, 2015 10:14 pm

Winter_fox wrote: I think Google solves this by not showing multiple pages from the same domain on the same results page.


That sounds like an interesting approach. 2nd-level ICANN-approved domains are somewhat expensive, which acts as a rate limiter on spamming the same content across domains. 3rd-level domains on the same 2nd-level domain, however, are very cheap for the owner of that 2nd-level domain. Does Google require the 2nd-level domain to be unique?

I suppose another approach would be to use a similarity algorithm on the content of the Solr fields for the pages. For example, you could construct a float vector of words/phrases and collapse groups that have a very high cosine similarity. This idea totally fails the KISS test compared to your approach, though.
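As a sketch of the cosine-similarity idea, here is a minimal bag-of-words version in Python. A real implementation would work on the indexed Solr fields and a tuned threshold; the word-count vectors and the 0.95 cutoff mentioned below are illustrative assumptions.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors
    of two text snippets (1.0 = identical word counts)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Titles like the three "Flash Memory - Buy Card ...." entries above would score at or near 1.0 against each other and could be collapsed under a threshold such as 0.95, while unrelated results would score much lower.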
biolizard89
 
Posts: 58
Joined: Thu Jan 03, 2013 12:42 am

Re: Aggregate / Clear duplicate results

Post by davide » Thu May 21, 2015 10:11 am

What about hashing the snippet data of each result, such as the title, body, or URL, with a perceptual hashing library like pHash (which is not written in Java) and checking for "visually" duplicate entries among the results?

For example, once all the titles of the returned results are hashed, their fingerprints can be compared to obtain a linear indication of how much the titles differ from one another. If the difference is below a threshold, they are very similar and hence duplicates. The hashing is fast enough to be performed on the fly as results are returned.
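To make the thresholding idea concrete without pulling in pHash, here is a Python sketch using difflib's edit-similarity ratio as a simple stand-in for a perceptual fingerprint. The 0.9 threshold and the greedy keep-first strategy are assumptions for illustration, not YaCy behavior.

```python
from difflib import SequenceMatcher

def collapse_similar(titles: list[str], threshold: float = 0.9) -> list[str]:
    """Greedily keep a title only if no already-kept title is
    nearly identical (similarity ratio >= threshold)."""
    kept: list[str] = []
    for title in titles:
        if all(SequenceMatcher(None, title.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(title)
    return kept
```

Applied on the fly to a result page, this would collapse the three near-identical "Flash Memory" titles from the screenshot into one entry while leaving distinct results untouched.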
davide
 
Posts: 78
Joined: Fri Feb 15, 2013 8:03 am

