Any one interested in tracking users on Social networks?

Discussion in English.
Forum rules
You can start and continue posts in English in all other forums as well, but if you are looking for a forum to start a discussion in English, this is the right choice.


Post by smokingwheels » Wed Jul 02, 2014 2:15 pm

If you want to crawl a Twitter profile, here are some settings I have found to work.
Use https://mobile.twitter.com/username as the start URL; the results are better for tweets.

Advanced Crawler
No page count limit, i.e. leave the option unticked.
Restrict to sub-path(s) only.
Do not delete any document before the crawl is started.
No Doubles: never load any page that is already known. Only the start URL may be loaded again.
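
A minimal Python sketch of what the "restrict to sub-path" and "no doubles" options amount to. This is only an illustration of the idea, not YaCy's actual implementation; the function names `in_subpath` and `urls_to_load` and the sample links are made up for the example.

```python
from urllib.parse import urlsplit


def in_subpath(start_url: str, candidate: str) -> bool:
    """True if candidate is on the same host and at or below the start URL's path."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    base = start.path.rstrip("/")
    same_host = start.netloc.lower() == cand.netloc.lower()
    return same_host and (cand.path == base or cand.path.startswith(base + "/"))


def urls_to_load(start_url: str, discovered: list[str], known: set[str]) -> list[str]:
    """Apply 'restrict to sub-path' and 'no doubles' to a list of discovered links."""
    selected = [start_url]  # only the start URL may be loaded again
    for url in discovered:
        if url == start_url or url in known or url in selected:
            continue  # 'no doubles': skip anything already known or already queued
        if in_subpath(start_url, url):
            selected.append(url)
    return selected


# Hypothetical example data
start = "https://mobile.twitter.com/username"
links = [
    "https://mobile.twitter.com/username/status/1",
    "https://mobile.twitter.com/username/status/2",
    "https://mobile.twitter.com/other_user",
    "https://mobile.twitter.com/username2",
]
already_known = {"https://mobile.twitter.com/username/status/1"}
to_load = urls_to_load(start, links, already_known)
print(to_load)
```

Only the start URL and the second status page survive here: the first status page is already known, and the other two links are outside the `/username` sub-path.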

Start Crawl
Max PPM (crawler speed): about 45 pages per minute, which stays within the delay requested in Twitter's robots.txt:
# Wait 1 second between successive requests. See ONBOARD-2698 for details.
Crawl-delay: 1
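
To see where a PPM ceiling comes from, here is a small sketch using Python's standard `urllib.robotparser`. The `User-agent: *` line is an assumption added so the parser accepts the snippet (the excerpt above does not show which user agent the rule belongs to), and `max_ppm` is a made-up helper name.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# The robots.txt excerpt from the post; the User-agent line is an assumption.
ROBOTS_SNIPPET = """\
User-agent: *
# Wait 1 second between successive requests. See ONBOARD-2698 for details.
Crawl-delay: 1
"""


def max_ppm(robots_txt: str, user_agent: str = "*") -> Optional[float]:
    """Return the highest pages-per-minute rate allowed by Crawl-delay, if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    if not delay:
        return None  # no Crawl-delay given for this user agent
    return 60 / delay


ceiling = max_ppm(ROBOTS_SNIPPET)
print(ceiling)
```

A delay of 1 second works out to at most 60 pages per minute, so a setting of about 45 leaves some headroom.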

Wait a while, then export the URLs with titles; you can then look through the data and extract anything of interest. PM me for details.
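
As a rough sketch of the "look at the data" step, assuming the export has been saved as plain text with one `URL<TAB>title` pair per line. That layout is an assumption for illustration, so adapt the parsing to whatever the export actually produces. The helper names and the sample rows are made up.

```python
def parse_export(text: str) -> list[tuple[str, str]]:
    """Split an assumed 'URL<TAB>title' export into (url, title) pairs."""
    rows = []
    for line in text.splitlines():
        url, sep, title = line.partition("\t")
        if sep and url.strip():  # skip lines without a tab or without a URL
            rows.append((url.strip(), title.strip()))
    return rows


def filter_titles(rows: list[tuple[str, str]], keyword: str) -> list[tuple[str, str]]:
    """Keep only rows whose title contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [(url, title) for url, title in rows if needle in title.lower()]


# Hypothetical export content
SAMPLE_EXPORT = (
    "https://mobile.twitter.com/username/status/1\tusername: testing my YaCy crawler\n"
    "https://mobile.twitter.com/username/status/2\tusername: lunch time\n"
)

hits = filter_titles(parse_export(SAMPLE_EXPORT), "yacy")
for url, title in hits:
    print(url, "|", title)
```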

Search results for tweets are best if you search for: username action site:twitter.com
Last edited by smokingwheels on Thu Feb 05, 2015 1:23 am, edited 1 time in total.
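
A tiny sketch of building that query string and URL-encoding it for a search request. The base URL and the `query` parameter name in the usage lines are assumptions about a local YaCy peer, and the username and action word are placeholders.

```python
from urllib.parse import urlencode


def tweet_query(username: str, action: str) -> str:
    """Build the 'username action site:twitter.com' search string."""
    return f"{username} {action} site:twitter.com"


def search_url(base_url: str, query: str, param: str = "query") -> str:
    """Append the URL-encoded query to a search endpoint."""
    return f"{base_url}?{urlencode({param: query})}"


# Assumed address of a local search page; adjust to your own peer.
query = tweet_query("username", "retweeted")
url = search_url("http://localhost:8090/yacysearch.html", query)
print(query)
print(url)
```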
smokingwheels
 
Posts: 136
Joined: Sat Aug 31, 2013 7:16 am

Re: Any one interested in tracking users on Social networks?

Post by Orbiter » Wed Jul 23, 2014 8:43 pm

Since Twitter decided to switch off its RSS feeds, it is no longer easy to integrate tweets into YaCy search results. We would need a Twitter scraper, which might be possible by setting specific crawl filter rules. Someone needs to invest some work to find out exactly what to do to crawl Twitter accounts in a nice way.
(To everyone): please invest some time to find a solution.
Orbiter
 
Posts: 5792
Joined: Tue Jun 26, 2007 10:58 pm
Location: Frankfurt am Main

Re: Any one interested in tracking users on Social networks?

Post by banana » Sat Aug 02, 2014 8:13 pm

Orbiter wrote: Since Twitter decided to switch off its RSS feeds, it is no longer easy to integrate tweets into YaCy search results. We would need a Twitter scraper, which might be possible by setting specific crawl filter rules. Someone needs to invest some work to find out exactly what to do to crawl Twitter accounts in a nice way.
(To everyone): please invest some time to find a solution.


Hi,
I looked into this, just to figure out how to scrape Twitter. It is possible to get tweets via their API, but the API comes with limitations (terms of service and query restrictions).

There are a few doable solutions:
1) Read the Twitter streaming JSON API and receive all tweets*, then parse them with a suitable existing application that turns them into RSS, which YaCy can read directly.
2) Same as above, but use a Java library compatible with the Twitter JSON API and integrate it into YaCy.*

*You receive only ~1% of the tweets per token with the streaming API, and there are also restrictions on what you can do with the data. Option 1 is quite easy to do, but I don't know how it will look in search results.
I don't have enough programming knowledge to do option 2 at this time, but on the flip side, it would be excellent programming experience and a learning opportunity.
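
To give an idea of option 1, here is a rough Python sketch that turns newline-delimited tweet JSON into a small RSS 2.0 feed using only the standard library. It does not connect to Twitter; it assumes the tweets were already received and saved as lines of JSON. The field names (`text`, `id_str`, `created_at`, `user.screen_name`) follow the tweet objects of the Twitter API of that time, while the sample tweet, the channel text, and the function name `tweets_to_rss` are made up.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime


def tweets_to_rss(json_lines, channel_title: str) -> str:
    """Convert an iterable of JSON tweet lines into an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = "https://twitter.com/"
    ET.SubElement(channel, "description").text = "Tweets converted from JSON"
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between messages
        tweet = json.loads(line)
        if "text" not in tweet:
            continue  # not a tweet (e.g. a delete or limit notice)
        name = tweet["user"]["screen_name"]
        link = f"https://twitter.com/{name}/status/{tweet['id_str']}"
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{name}: {tweet['text']}"
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "guid").text = link
        # created_at looks like "Wed Jul 23 20:43:00 +0000 2014"
        created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
        ET.SubElement(item, "pubDate").text = format_datetime(created)
    return ET.tostring(rss, encoding="unicode")


# Hypothetical sample input: one tweet, one blank line, one non-tweet message
sample_lines = [
    json.dumps({
        "id_str": "1234567890",
        "text": "Hello from the crawler",
        "created_at": "Wed Jul 23 20:43:00 +0000 2014",
        "user": {"screen_name": "example_user"},
    }),
    "",
    json.dumps({"limit": {"track": 5}}),
]
feed = tweets_to_rss(sample_lines, "Example tweets")
print(feed)
```

The resulting feed could be written to a file or served locally and then added as an RSS source, subject to the same API restrictions mentioned above.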

Of course it would be nice in some cases, but it would also be extremely creepy to build a social search into YaCy that creates a social profile of every person based on crawling, showing profile pictures, all social media accounts, and other information such as friends.
banana
 
Posts: 2
Joined: Wed Jul 30, 2014 8:46 pm

Re: Any one interested in tracking users on Social networks?

Post by smokingwheels » Thu Feb 05, 2015 12:22 pm

I read that Google is going to get a pipeline of tweets into their engine, but how would one harvest Google?
I prefer 50 Twitter users per YaCy server, with QuickBASIC to scrape the info I need. PM me if interested.

Re: Any one interested in tracking users on Social networks?

Post by Orbiter » Fri Feb 06, 2015 8:24 am

Meanwhile I invested some work in an algorithm to scrape Twitter search results from their HTML search result pages. I first tried last year but stopped because the work seemed absurd: the HTML is extremely bloated and therefore inefficient to use as a search result compared to a true API, which was removed. But in January this year I nearly completed a scraper, which is currently an external project and not part of YaCy:

http://twitterscraper.host-browser.org

I want to publish the code in the near future, as soon as everything I want for the initial source release is in place (mostly license-related tidying). The code also contains some 'nasty' I-am-a-browser fake user-agent stuff, which Twitter could fight back against as soon as they read the source code. The scraper does not work without pretending to be a browser. Therefore this code cannot be included in YaCy, because I think it could harm YaCy's "properly-behaved robot" status.

