Aggregate / Clear duplicate results

Discussion in English.
Forum rules
You can also post in English in all the other forums, but if you are looking for a forum to start a discussion in English, this is the right place.

Aggregate / Clear duplicate results

Post by davide » Wed Mar 18, 2015 12:25 am

As you can see from the attached screenshot, many YaCy search results are duplicates or near-duplicates, and they appear next to each other throughout the results list. They are so similar as to be practically indistinguishable to the user, so having them occupy multiple slots in the ranking list serves no purpose.

I think it would be best for YaCy to recognize these duplicates and tidy them up.

What do you think?
Attachments
yacy_search.jpg
yacy_search.jpg (204.03 KiB) viewed 3029 times
davide
 
Posts: 84
Joined: Fri Feb 15, 2013 8:03 am

Re: Aggregate / Clear duplicate results

Post by freak » Thu Apr 02, 2015 4:39 pm

I don't think this is a YaCy issue, because YaCy has no control over page titles.

If you only look at the page titles shown in the result list, then you are right to call them duplicates. But YaCy (like any other search engine/crawler) uses the actual URL as the unique identifier of a search result.

As you can see in your result list, the page titles are the same, but the URLs differ from result to result:

Title: Flash Memory - Buy Card ....
Url: http://www.misco.it/Cat/Fotografia-e-Vi ... orie-Flash?Viewtype=list


Title: Flash Memory - Buy Card ....
Url: http://www.misco.it/Cat/Fotografia-e-Vi ... orie-Flash?InStock=True

Title: Flash Memory - Buy Card ....
Url: http://www.misco.it/Cat/Fotografia-e-Vi ... orie-Flash?Viewtype=gallery


To clean up the results the YaCy way, you can do one of the following:

Blacklist
Decide whether you want results with the parameter Viewtype in your index. To block such URLs, you can create a blacklist entry like (not tested) .*Viewtype=.* or something like that.

Dynamic URLs
Decide whether you want dynamic URLs crawled at all.

Hope this helps to clarify the "duplicates" topic. :)
freak
 
Posts: 21
Joined: Thu Oct 10, 2013 10:59 pm

Re: Aggregate / Clear duplicate results

Post by Winter_fox » Fri Apr 03, 2015 12:44 pm

I think Google solves this by not showing multiple pages from the same domain on the same results page.
Winter_fox
 
Posts: 2
Joined: Fri Apr 03, 2015 12:33 pm

Re: Aggregate / Clear duplicate results

Post by smokingwheels » Mon Apr 06, 2015 10:11 am

Interesting. I have a similar problem with Twitter showing up in many languages, with no way to limit it.
Maybe use a delimiter character in the URL string when you do a crawl, e.g. cut the URL short, or process it on the peer later on.

Why don't you file a report on the wishlist: http://mantis.tokeek.de/my_view_page.php
smokingwheels
 
Posts: 137
Joined: Sat Aug 31, 2013 7:16 am

Re: Aggregate / Clear duplicate results

Post by MikeS » Mon Apr 06, 2015 12:02 pm

smokingwheels wrote:
I have similar problem with Twitter showing in many languages


Twitter has the annoying habit of serving a page in the language specified in the Accept-Language request header. This means that if you crawl the same Twitter URL with different languages listed in the Accept-Language header, you will get different results.

This may be nice when a page is requested with an actual browser, but it can be really confusing when done by a crawler.
MikeS
 
Posts: 88
Joined: Mon Feb 25, 2008 6:30 pm

Re: Aggregate / Clear duplicate results

Post by smokingwheels » Tue Apr 07, 2015 1:18 pm

MikeS wrote:
I have similar problem with Twitter showing in many languages


Twitter has the annoying habit of serving a page in the language specified in the Accept-Language request header. This means that if you crawl the same Twitter URL with different languages listed in the Accept-Language header, you will get different results.

This may be nice when a page is requested with an actual browser, but it can be really confusing when done by a crawler.


There is a much better scraper for Twitter now, but I'm not going to share my peer with the YaCy network because my VM runs out of space every 2 days.
It's http://loklak.org, using RSS feeds into YaCy or a reader.
smokingwheels
 
Posts: 137
Joined: Sat Aug 31, 2013 7:16 am

Re: Aggregate / Clear duplicate results

Post by biolizard89 » Mon Apr 13, 2015 10:14 pm

Winter_fox wrote: I think Google solves this by not showing multiple pages from the same domain on the same page.


That sounds like an interesting approach. 2nd-level ICANN-approved domains are somewhat expensive, which acts as a rate limiter on spamming the same content across domains. 3rd-level domains on the same 2nd-level domain, however, are very cheap for the owner of that 2nd-level domain. Does Google require the 2nd-level domain to be unique?

I suppose another approach would be to use a similarity algorithm on the content of the Solr fields for the pages. For example, you could construct a float vector of words/phrases and collapse groups that have a very high cosine similarity. This idea totally fails the KISS test compared to your approach, though.
biolizard89
 
Posts: 61
Joined: Thu Jan 03, 2013 12:42 am

Re: Aggregate / Clear duplicate results

Post by davide » Thu May 21, 2015 10:11 am

What about hashing the snippet result data, such as title, body, or URL, with a perceptual hashing library like pHash (which is not written in Java), and checking for "visually" duplicate entries among the results?

For example, once all the titles of the returned results are hashed, their fingerprints can be compared to obtain a linear measure of how much the titles differ from each other. If the difference is below a threshold, they are very similar and hence duplicates. The hashing is so fast it can be performed on the fly when results are returned.
davide
 
Posts: 84
Joined: Fri Feb 15, 2013 8:03 am

Re: Aggregate / Clear duplicate results

Post by davide » Fri Jun 30, 2017 8:29 pm

Reviving this proposal after two years.

The advantage of using a perceptual hash library to process an already ranked list of results before it is presented to the user is that such a hash can be extracted indiscriminately from the text snippets which accompany the results, as well as from thumbnail images, and can be used to numerically determine the visual difference between the results presented by YaCy.

For text results, this could be effective at detecting and removing results which look very similar; for image results, it would detect identical images which differ only in resolution or aspect ratio.

To demonstrate how simple the principle is, here is a program I wrote years ago using the pHash library.
It takes as arguments the filenames of two images to compare, and reports via its exit status whether the images are almost identical, differing only in resolution or cropping.

Code: Select all
/* IMAGE COMPARER PROGRAM
*
* Synopsis: ./program image0 image1
* Exit status: 0: the images likely (95%+) represent the same object;
*              1: no resolute answer;
*              2: error.
*
* Notes: the program recognizes two images as the same only if they
* differ marginally, whether in size, aspect ratio, cropping, or
* contrast. Very similar images which represent marginally different
* objects don't normally match.
*/

#include <iostream>
#include <pHash.h>

#define THRESHOLD 10

int main(int argc, char *argv[]) {
  const char *f0, *f1;
  ulong64 hash0, hash1;
  int distance;

  if (argc != 3) return 2;  // exactly two image filenames required
  f0 = argv[1];
  f1 = argv[2];

  std::cout << "Image0: " << f0 << '\n'
            << "Image1: " << f1 << '\n';

  // Compute the 64-bit DCT perceptual hash of each image.
  if (ph_dct_imagehash(f0, hash0) != 0) return 2;
  if (ph_dct_imagehash(f1, hash1) != 0) return 2;

  // The smaller the Hamming distance between the two hashes,
  // the more visually similar the images.
  distance = ph_hamming_distance(hash0, hash1);
  std::cout << "Distance: " << distance << '\n';

  return distance > THRESHOLD;
}
davide
 
Posts: 84
Joined: Fri Feb 15, 2013 8:03 am

Re: Aggregate / Clear duplicate results

Post by smokingwheels » Sat Jul 01, 2017 4:05 am

It's good you brought this subject up again as a revisit.
The country probably has a very small primary industry.
It looks like Google has rejected their site (not too sure yet). They sell just about anything related to computers.
Their site is massive, it has improved over time, and it has valid email addresses for contact, unlike some of the sites.

If I want to buy things online now, I run a web portal on the site of interest with the category I need as a starting point; then the bargains are easier to locate.
In (/IndexSchema_p.html) there are some settings that may help.
smokingwheels
 
Posts: 137
Joined: Sat Aug 31, 2013 7:16 am

