Best way to automatically find links to your content?

So, here is the task I've found myself thinking about. Pretend for a moment that I have a large body of content. I want to see what websites are linking to my content. I know that I could look into TrackBack or PingBack, but what about sites that aren't using tools capable of dealing with those?
It would seem that some form of web crawler that looks for pages linking to the original document might be useful. My question to the greater community is: what would be the best way to get started here? Do TrackBack and PingBack do more than I assume? Are there services or tools out there that already do what I'm thinking of?

Google is your friend!
Use the link: prefix:
link:whatsite.com
And yes, trackbacks do more.

If you have HTTP referrers set up in your logs, you can mine them.
You can even discover linking pages you would not otherwise know about (see the sketch below).
Otherwise, there is the paid Linkscape from SEOmoz, or the free MajesticSEO (if you confirm ownership of the domain).
MajesticSEO has a bigger backlink index and an API (login required).
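For illustration, here is a minimal Python sketch of mining referrers from an access log. It assumes a combined-format Apache/Nginx log at a hypothetical path and a placeholder hostname; adjust both for your setup, since field positions vary by server configuration.

```python
# Sketch: pull external referrers out of a "combined" format access log.
# LOG_PATH and MY_HOST are placeholders for this example.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path
MY_HOST = "example.com"                  # your own domain, to skip internal links

# In the combined log format the referrer is the second-to-last quoted field.
referrer_re = re.compile(r'"([^"]*)" "[^"]*"$')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = referrer_re.search(line.strip())
        if not match:
            continue
        ref = match.group(1)
        if ref in ("-", "") or MY_HOST in ref:
            continue  # no referrer, or a link from your own pages
        counts[ref] += 1

# Print the top 20 external pages sending you traffic.
for ref, n in counts.most_common(20):
    print(f"{n:6d}  {ref}")
```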

Related

How are handwritten notes from apps stored in databases?

At the beginning I'd like to say it's not an emergency :D
I was thinking about project ideas recently - projects that I could try to create to learn something more, something new, or just to leave my comfort zone. I've picked a notes-app project that supports handwritten notes. And here's the first problem: with my limited knowledge I can't come up with an idea of how to store these handwritten notes in a database.
The database and other technologies haven't been picked yet, so this isn't a "How do I store it in MySQL?" question and so on... I'm just thinking theoretically about how it could be done. I searched Google and here on Stack Overflow but didn't find anything similar, just some questions about how to verify or recognize handwritten notes.
Does anybody have an idea or a lead I could follow?
Here I am assuming your "handwritten notes" are images. A simple solution might be uploading your images somewhere (e.g. Amazon S3, but there are countless options out there). Then, in some database you might have a reference to the URL of the image. In your code you can then download the images using the URL and process them as you see fit.
Note: I am making many assumptions here but I hope this helps.
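As a hedged illustration of that idea, here is a minimal Python sketch that uploads an image to S3 with boto3 and stores the resulting URL in a SQLite table. The bucket name, file name, table schema, and URL format are all assumptions made up for the example.

```python
# Sketch: store a handwritten-note image in object storage and keep only a URL
# reference in the database. Bucket, key, and table names are illustrative.
import sqlite3
import boto3

BUCKET = "my-notes-bucket"        # hypothetical S3 bucket
s3 = boto3.client("s3")

def save_note(user_id: int, image_path: str, db: sqlite3.Connection) -> str:
    key = f"notes/{user_id}/{image_path.rsplit('/', 1)[-1]}"
    s3.upload_file(image_path, BUCKET, key)            # push the image bytes to S3
    url = f"https://{BUCKET}.s3.amazonaws.com/{key}"   # assumes the bucket policy allows access
    db.execute(
        "INSERT INTO notes (user_id, image_url) VALUES (?, ?)",
        (user_id, url),
    )
    db.commit()
    return url

conn = sqlite3.connect("notes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, user_id INTEGER, image_url TEXT)"
)
print(save_note(1, "scan.png", conn))
```

The design point is simply that the database holds a lightweight reference while the heavy image bytes live in storage built for them; your code downloads the image by URL whenever it needs to display or process the note.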

Is it possible to find the source of website data?

When a website is constantly updating its information from a data source, is it possible to find out what or where that data source is?
An example of what I'm talking about is stock prices. I'm curious to learn for educational purposes.
It depends on how the site has implemented the solution. If the data is fetched server-side, you probably won't be able to find out. If it's fetched in JavaScript, you can find out by looking at the page source and the requests it makes, for example via Chrome -> Inspect. But you'll need to know your way around.

How to collect data from a website

Preface: I have broad, college-level knowledge of a handful of languages (C++, VB, C#, Java, many web languages), so go with whichever you like.
I want to make an Android app that compares numbers, but in order to do that I need a database. I'm a one-man team, and the numbers get updated biweekly, so I want to grab those numbers off of a wiki that gets updated as well.
So my question is: how can I access information from a website using one of the languages above?
What I understand the problem to be: some entity generates a data set (i.e. numbers) every other week, and you need to download that data set for processing (e.g. sorting).
Ideally, the website maintaining the wiki would provide a service, such as a RESTful interface, to easily gather the data. If that were the case, I'd go with any language that makes HTTP requests and responses easy to handle and makes your data manipulation easy; as a previous poster said, Java would work well (a sketch follows at the end of this answer).
If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that). Or you can use tools built for that purpose such as the aforementioned Jsoup.
Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.
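If the site does expose such an interface, the fetch side is short in any language. Here is a hedged Python sketch using requests against a made-up endpoint that returns JSON; the URL and field names are placeholders, not something from the question.

```python
# Sketch: pull a biweekly data set from a hypothetical REST endpoint and sort it.
# Substitute the real API URL and field names if one exists.
import requests

API_URL = "https://example.org/api/numbers"   # hypothetical endpoint

response = requests.get(API_URL, timeout=30)
response.raise_for_status()                    # fail loudly on HTTP errors
records = response.json()                      # expect a JSON list of records

# Process the data, e.g. sort by a hypothetical "value" field.
for record in sorted(records, key=lambda r: r["value"], reverse=True):
    print(record)
```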
http://jsoup.org/ is a great Java tool for accessing content on HTML pages.
Consider https://scraperwiki.com/ - it's a site where users can contribute scrapers. It's free as long as you let your scraper be public. The results of your scraper are exposed as CSV and JSON.
If you don't know what a "scraper" is, google "screen scraping" - it's a long and frustrating tradition for coders, who have dealt with the same problem you have since the beginning of networked computing.
You could check out http://web-harvest.sourceforge.net/
For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.
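If you end up parsing the wiki page itself, a BeautifulSoup version might look roughly like this. The URL and the table layout are assumptions about the page, not something taken from the question; inspect the real page first and adjust the selectors.

```python
# Sketch: scrape numbers out of the first table on a hypothetical wiki page.
import requests
from bs4 import BeautifulSoup

WIKI_URL = "https://example.org/wiki/Numbers"   # hypothetical wiki page

html = requests.get(WIKI_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
table = soup.find("table")                # first table on the page (assumption)
for tr in table.find_all("tr")[1:]:       # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

for row in rows:
    print(row)
```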

What's a good way to protect a link database from automatic scrapers?

I have a large link database that I want to protect against others who would copy it. Is there anything I can do other than force people to enter a CAPTCHA before each link?
You can output the links using ROT13, and then use JavaScript to put them back to normal.
This way, scrapers must support JavaScript in order to steal your links, which should cut down on the number of eligible scrapers.
Bonus points: replace ROT13 with something harder, and obfuscate your "decode" JavaScript.
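As an illustration of the idea (not the poster's actual code), here is the server-side half in a Python sketch: each link is emitted ROT13-encoded in a data attribute, and a few lines of client-side JavaScript would decode it back into a real href on page load. The attribute name and markup are made up for the example.

```python
# Sketch of the server-side half: emit each link ROT13-encoded in a data attribute,
# so plain HTML scrapers only see gibberish. The client-side JavaScript decoder
# that restores data-href into href on page load is not shown here.
import codecs
import html

def obfuscated_link(url: str, label: str) -> str:
    encoded = codecs.encode(url, "rot13")   # trivially reversible; only defeats dumb scrapers
    return f'<a href="#" data-href="{html.escape(encoded)}">{html.escape(label)}</a>'

print(obfuscated_link("https://example.com/secret-list", "My links"))
```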
The JavaScript suggestion could work, but you would render your page inaccessible to those using assistive technologies like screen readers, as well as to anyone without JavaScript.
Another possible option would be to generate a cryptographic nonce. This technique is currently used to protect against CSRF attacks, but it could also be used to ensure that a scraper would have to request a page from your site before accessing a link (see the sketch at the end of this answer). This approach may not be appropriate if you support hotlinking, but if you just want to make sure that someone went to your site first, it could work.
Another somewhat ghetto option would be to use referrers. These can easily be faked, but it might stop some of the dumber scrapers. This also requires that you know where your users came from before they hit your site.
Can you let us know if you are hotlinking or if the user comes to your site before going to the protected link? We might be able to provide better advice that way.
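Here is a hedged Python sketch of the nonce idea, using the standard-library secrets module and an in-memory set as the token store; a real app would keep nonces in a session or cache with expiry, and the routes shown are invented for the example.

```python
# Sketch: issue a one-time token when the listing page is served, and require it
# when the protected link is fetched. The in-memory store is illustrative only.
import secrets

issued_nonces = set()

def render_listing_page() -> str:
    nonce = secrets.token_urlsafe(16)
    issued_nonces.add(nonce)
    # Embed the nonce in every protected link on the page.
    return f'<a href="/links/42?nonce={nonce}">protected link</a>'

def serve_link(link_id: int, nonce: str) -> str:
    if nonce not in issued_nonces:
        return "403 Forbidden: fetch the listing page first"
    issued_nonces.discard(nonce)   # one-time use
    return f"302 Redirect to destination for link {link_id}"

page = render_listing_page()
print(page)
# Simulate a legitimate click using the nonce embedded in the page:
token = page.split("nonce=")[1].split('"')[0]
print(serve_link(42, token))
```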

User behavior inside my web page

Does anybody know if there is a way to understand what users are doing on my web page? I can see overall stats with Google Analytics but I can't really understand what users are looking at, if they are reading my content, etc.
Does anybody know if there is a free or paid service that I could use?
Dani
You could try one of the analytics tools that does heatmaps.
These give a visual representation of where people are clicking on the site, and I think some may let you track where the cursor is as well.
This is about as close as you can get to what you want - there's no way to track what someone's eyes are looking at.
Wikipedia claims that Google Analytics can do this. I've not seen it in Analytics, but it may well be hiding in there somewhere. I have used ClickTale a bit before, though:
http://www.clicktale.com/
Try Mouseflow. You can even record your visitors and watch the videos. Haven't tried it but looks promising.
The Site Overlay report in Google Analytics is probably what Wikipedia is referring to. However, it doesn't work most of the time (it's a "piece of crap" in Google's own words).
ClickTale is okay. I recommend CrazyEgg: http://crazyegg.com/
