How are geolocation databases assembled?

I am not asking which geolocation service to use or how to use one.
I am asking: how do these companies know so well where every IP address is? Is some breach of privacy involved?
I looked at the Wikipedia page, and all it had to say was that they use the WHOIS service, which obviously doesn't work by itself: my IP is owned by a company listed in another state.

It has a lot to do with where the ISPs are logically located and with the fact that ARIN knows where networks are assigned.
They can also determine your location based on routers.
Run this in a command/terminal window:
tracert google.com
I'm sure you can see some location-based info in your tracert.
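For example, a reverse-DNS lookup on the hop addresses often exposes city or airport codes that backbone operators embed in router names. A minimal Python sketch (the sample address and hostname pattern are only illustrative; substitute the hops from your own tracert output):

import socket

# Sample hop address; replace with hops from your own traceroute.
hop_ips = ["4.69.148.142"]

for ip in hop_ips:
    try:
        # Reverse DNS often yields names like "ae-1.r20.chcgil09...",
        # where "chcgil" hints at Chicago, IL.
        name, _, _ = socket.gethostbyaddr(ip)
        print(ip, "->", name)
    except socket.herror:
        print(ip, "-> no reverse DNS entry")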

Many such databases appear to extract it from the 'whois' databases held by the Regional Internet Registries (RIPE, ARIN, etc.).
These are not the same as the domain-name 'whois' lookups; these relate specifically to IP addresses.
Such data extraction is a breach of the registries' database copyrights and strictly against their T&Cs.
See How does geographic lookup by IP work? for more details.
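For the curious, the registry 'whois' service is just a plain TCP protocol (RFC 3912) on port 43, so it is easy to see what a bulk extractor would harvest. A minimal sketch in Python:

import socket

def whois_query(server, query):
    # WHOIS (RFC 3912): send the query line over TCP port 43, read until EOF
    with socket.create_connection((server, 43), timeout=10) as s:
        s.sendall((query + "\r\n").encode())
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# ARIN's response includes organisation and address fields -- exactly
# the kind of data a geolocation vendor would mine.
print(whois_query("whois.arin.net", "8.8.8.8"))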

Let me answer with an analogy. Each car has a unique number that identifies it to its manufacturer. The manufacturer has a list of all the cars that were sent to each major distributor in each part of the world; each of those distributors has several dealers to which they assign a set of cars to sell, and each of those dealers sells the cars to end customers. So in theory, if the manufacturer wants to know where in the world a car is, he doesn't have to ask, because he knows in which country it landed.
Translating that to IPs: every company that sells public IP addresses has a record of who owns them, and they normally give them away in blocks of thousands to ISPs (phone numbers used to be assigned like this). For example, I can tell whether an IP is from my country just by looking at the first two octets. Hosting providers and data centers work the same way, and they almost always know where the machine physically is. Last but not least, a trace will jump hops to the IP closest to the box (theoretically, as you can force the traces to be what you want), which means you can guess the location if you have that of the previous hop.
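A rough Python sketch of that "first groups" idea; the two /8 delegations below are real entries from the IANA IPv4 registry, but a production database would load the registries' full delegation files down to ISP level:

import ipaddress

# Two top-level delegations from the IANA IPv4 address space registry.
allocations = {
    ipaddress.ip_network("41.0.0.0/8"): "AFRINIC (Africa)",
    ipaddress.ip_network("190.0.0.0/8"): "LACNIC (Latin America/Caribbean)",
}

def region_of(ip):
    addr = ipaddress.ip_address(ip)
    for net, region in allocations.items():
        if addr in net:
            return region
    return "unknown"

print(region_of("190.12.34.56"))  # -> LACNIC (Latin America/Caribbean)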

Those companies pay for the data.
There are many ways to get this data (not all of them illegal). One simple way, for example, is providing free services that encourage you to volunteer information about your actual location, as DSLReports does. Once they know one IP and the ISP, it is easy to correlate other IPs from the same area.
As you can see, here one company recommends the other, so you can see the connection.
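As a toy illustration of that correlation step (the addresses below are from the TEST-NET-3 documentation range; real services work from much larger samples and smarter clustering):

import ipaddress

# IPs whose owners volunteered their city (e.g. via a free speed-test site)
known = {"203.0.113.5": "Springfield", "203.0.113.77": "Springfield"}

def guess_city(ip):
    # Crude assumption: one ISP's /24 usually serves one area
    net = ipaddress.ip_network(ip + "/24", strict=False)
    for seen_ip, city in known.items():
        if ipaddress.ip_address(seen_ip) in net:
            return city
    return None

print(guess_city("203.0.113.40"))  # -> Springfield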

I was wondering the same thing. Check out the response from Ken Norton, Project Manager at Google, on how Google acquires geolocation data: http://www.quora.com/How-does-Google-keep-its-geolocation-database-updated-with-new-MAC-addresses.

Can schemas be used in a product where the sending address is configurable?

Not a dev question, but since this is where the docs point, I'm asking it here; hopefully it won't be closed. ;)
Given this line in https://developers.google.com/gmail/schemas/registering-with-google :
Emails must come from a static email address, e.g. foo@bar.com
I take it there's no way to make use of Schemas yet in a product where users configure their own mail server and from address?
If that is correct, are there any plans to allow schemas to work in this sort of environment in the future?
We'd love to bake it into our product's emails, but it seems we can't with this constraint, since each install sends emails from its own configured 'from' address.
Only static email addresses are supported in this initial rollout, but we might reconsider this decision in the future. Please fill in the form at https://developers.google.com/gmail/schemas/registering-with-google to describe your use case anyway and tell us why you have this requirement.
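For anyone landing here, a rough sketch of what sending a schema-marked-up email looks like with Python's standard library; the addresses are placeholders, and the exact JSON-LD types and fields to use are defined in the Gmail schema docs:

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Schema.org JSON-LD embedded in the HTML part; Gmail matches the
# registered (static) address against the From: header below.
html = """<html><body>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "EmailMessage",
  "description": "Order confirmation"
}
</script>
<p>Thanks for your order.</p>
</body></html>"""

msg = MIMEMultipart("alternative")
msg["From"] = "foo@bar.com"        # must be the static, registered address
msg["To"] = "customer@example.com"
msg["Subject"] = "Your order"
msg.attach(MIMEText(html, "html"))
print(msg.as_string())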

How can I get product information into a database without having to populate it manually?

I am looking for a method of dynamically linking product information based on the name of the product.
For example: a user types in "Playstation 3", and the site then goes out and grabs any information it can, such as a picture, retail price, etc. Ideally, it would let you choose the correct item (a search might return both a PS3 controller and a PS3 console; the user picks which). It would then use this information in a product listing.
The easiest way I can think of to implement this is to use the existing API of a major retailer such as Amazon. I have a couple of completely different ideas for sites, one of which would involve selling from Amazon (which I assume they would be OK with) and another which would only be data-mining the information. I am concerned they would not take it kindly if I were just taking their images and descriptions.
Is there another, less "sneaky" way to accomplish this that wouldn't be legally frowned upon?
Many web-commerce companies expose their data through an API: eBay, Etsy, and Amazon all have API feeds for their products. If you can convince the company to allow you access to their API (usually they will give you a key/password), then you can directly query their product catalog, typically read-only. Depending on the company, you can just write to them directly for access.
You are correct that most companies wouldn't take kindly to someone web-scraping their product directory and re-using it. That is unethical, and could lead to big trouble with larger companies that have a significant legal presence.
On the other hand, there is nothing to prevent you from cobbling together several API feeds into a Mash-Up - try Yahoo Pipes! to learn the basics of API/Mash-Up integration:
Yahoo Pipes:
http://pipes.yahoo.com/pipes/
Here is the link to Amazon's Product Advertising API program:
https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
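To show the shape of the API approach, here is a hedged sketch against a made-up product-search endpoint; the URL, parameters, and response fields are invented, and real APIs like Amazon's Product Advertising API additionally require request signing:

import requests

API_URL = "https://api.example-retailer.com/products/search"  # hypothetical

def search_products(name, api_key):
    resp = requests.get(
        API_URL,
        params={"q": name},
        headers={"Authorization": "Bearer " + api_key},
        timeout=10,
    )
    resp.raise_for_status()
    # Return (title, price, image) so the user can pick the right match,
    # e.g. the PS3 console rather than the PS3 controller.
    return [(p["title"], p["price"], p["image_url"])
            for p in resp.json().get("results", [])]

for title, price, image in search_products("Playstation 3", "YOUR_KEY"):
    print(title, price, image)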
Good luck, and happy development!
Many online retailers provide a product feed - either well-publicized (William M-B has listed some examples), or sorta-kinda hidden, for the purposes of affiliate marketing. They usually have terms of use around those product feeds, describing in detail what you're allowed to do with them, and exactly how many of your limbs are at risk if you don't play by their rules.
However, the mechanism you're describing sounds remarkably similar to a search engine; there's a well-established precedent for search engines indexing sites, and using their content to reason about the underlying site. Get a lawyer to validate this, but there's a good chance that your intended purpose falls under "fair use".
I'm a representative of http://aerse.com.
We are building a service that does the following:
search products by name, for example: galaxy s3, galaxy s 3, or galaxy sIII
return technical specifications (CPU, RAM, etc.) and product images (thumbnails and high-res images)
provide an API: http://aerse.com/p
deal with legal issues, provide licenses, etc.

Best way to build a DataMart from multiple external systems?

I'm in the planning stages of building a SQL Server DataMart for mail/email/SMS contact info and history. Each piece of data is located in a different external system. Because of this, email addresses do not have account numbers and SMS phone numbers do not have email addresses, etc. In other words, there isn't a shared primary key. Some data overlaps, but there isn't much I can do except keep the most complete version when duplicates arise.
Is there a best practice for building a DataMart with this data? Would it be an acceptable practice to create a key table with a column for each external key? Then, a unique primary ID can be assigned to tie this to other DataMart tables.
Looking for ideas/suggestions on approaches I may not have yet thought of.
Thanks.
The email address or phone number itself sounds like a suitable business key. Typically a "staging" database is used to load the data from multiple sources and then assign surrogate keys and do other transformations.
Are you familiar with data warehouse methods and design patterns? If you don't have previous knowledge or experience then consider hiring some help. BI / data warehouse projects have a very high failure rate and mistakes can be expensive.
Found more information here:
http://en.wikipedia.org/wiki/Extract,_transform,_load#Dealing_with_keys
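To illustrate the staging and surrogate-key idea described above, a toy in-memory sketch (in SQL Server this would typically be an IDENTITY column on a staging table; the sample rows are made up):

import itertools

surrogate = {}            # business key -> surrogate contact id
next_id = itertools.count(1)

def contact_id(business_key):
    # Mint a new surrogate key the first time a business key appears
    if business_key not in surrogate:
        surrogate[business_key] = next(next_id)
    return surrogate[business_key]

rows = [("email", "a@example.com"),
        ("sms", "+15551234567"),
        ("email", "a@example.com")]   # duplicate collapses onto one key

for source, key in rows:
    print(source, key, "->", contact_id(key))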
Well, with no other information to tie the disparate pieces together, your datamart is going to be pretty rudimentary. You'll be able to get the types of data (sms, email, mail), metrics for each type over time ("this week/month/quarter/year we averaged 42.5 sms texts per day, and 8000 emails per month! w00t!"). With just phone numbers and email addresses, your "other datamarts" will likely have to be phone company names, or internet domains. I guess you could link from that into some sort of geographical information (internet provider locations?), or maybe financial information for the companies. Kind of a blur if you don't already know which direction you want to head.
To be honest, this sounds like someone high-up is having a knee-jerk reaction to the "datamart" buzzword coupled with hearing something about how important communication metrics are, so they sent orders on down the chain to "get us some datamarts to run stats on all our e-mails!"
You need to figure out what it is that you or your employer is expecting to get out of this project, and then figure out if the data you're currently collecting gives you a trail to follow to that information. Right now it sounds like you're doing it backwards ("I have this data, what's it good for?"). It's entirely possible that you don't currently have the data you need, which means you'll need to buy it (who knows if you could) or start collecting it, in which case you won't have nice looking graphs and trend-lines for upper-management to look at for some time... falling right in line with the warning dportas gave you in his second paragraph ;)

Booking logic and architecture, database sync: Hotels, tennis courts reservation system

Imagine that you want to design a tennis booking system.
You have 5 tennis clubs as partners with no online API allowing you to check on their side whether a court is booked or not: you have to build this part as well.
Every time a booking is made on their side, you want it to be known by our system, probably using a POST request from the tennis partner to our server.
Every time a booking is made on our website, we want to push the booking to their system. The difficulty is that their system needs to be online and accessible from outside. The IP may change, so we have to use a DNS updater.
In case their system is not available, we still accept the booking and fall back to an asynchronous email with an 'I confirm booking / reject booking' link sent to the club.
I find the whole process quite complex and was wondering how online hotel booking systems and hotels work. Do they all have their data open and online?
The good thing is that the data will grow large and fits nicely with some NoSQL ;) like CouchDB.
There are several questions here, let me try and address each one...
Since this appears to be an internet application with federated servers, using the implied HTTP protocol makes a lot of sense. This could be done via form POSTs, GETs, or even REST-ful submission of some custom data structure. In the end, the exact approach will come down to the size and complexity of the information being communicated. Many architectures employ these approaches and often combine them with encrypted, signed, and/or encoded payloads for security. One shortfall to consider is that they require you to clearly communicate all request/response message formats, field ranges, and variations, since these mechanisms are not really self-describing. On the other hand, these patterns use very common protocols, are easily understood, easy to implement, and are typically lean on the wire.
In contrast, architectures with very complex structures often choose to use WSDL-based web services. Also driven by common standards, these tend to be self-describing and inherently versionable, although they can take more time and energy to implement. There are a lot of advantages to web services, driven by the many WS-* standards, which may be worth investigating further in your case.
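As a sketch of the plain HTTP-POST variant, including the asker's e-mail fallback (standard library only; every URL and address here is a placeholder):

import json
import smtplib
import urllib.request
from email.message import EmailMessage

def push_booking(booking, partner_url, club_email):
    data = json.dumps(booking).encode()
    req = urllib.request.Request(
        partner_url, data=data,
        headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 300   # partner acknowledged
    except OSError:
        # Partner system offline: keep the booking and ask by e-mail
        msg = EmailMessage()
        msg["From"] = "bookings@example.com"
        msg["To"] = club_email
        msg["Subject"] = "Please confirm booking"
        msg.set_content(
            "Confirm or reject: https://example.com/bookings/123/confirm")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
        return False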
As for the reservation process... many similar architectures will employ an orchestration model such as the following:
Find open booking spaces
Make a reservation for a booking space. This places an expiring lock on a space while the requestor fills in all required booking information. This mitigates race conditions that could lead to multiple bookings for the same space
Once all required booking information is received and validated, the booking is confirmed and permanently locked from use by other requestors
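A minimal in-memory sketch of that expiring-lock step (a real system would persist the hold in the database and enforce it transactionally; identifiers are made up):

import time
import uuid

HOLD_SECONDS = 300          # how long a tentative hold lasts

holds = {}                  # space_id -> (token, expiry time)
booked = set()

def reserve(space_id):
    # Step 2: place an expiring lock while booking details are filled in
    now = time.time()
    hold = holds.get(space_id)
    if space_id in booked or (hold and hold[1] > now):
        return None         # someone else holds or owns this space
    token = str(uuid.uuid4())
    holds[space_id] = (token, now + HOLD_SECONDS)
    return token

def confirm(space_id, token):
    # Step 3: permanently book if the caller still holds a valid lock
    hold = holds.get(space_id)
    if hold and hold[0] == token and hold[1] > time.time():
        booked.add(space_id)
        del holds[space_id]
        return True
    return False

t = reserve("court-3/2024-06-01T18:00")
print(confirm("court-3/2024-06-01T18:00", t))   # True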
As for the SQL-style DB comment, I can't really say given the amount of information supplied. With that said, my instincts tell me a SQL-style DB is completely reasonable for this problem set. I have databases with many petabytes and very high SLAs. You implied a need for high availability, and SQL-based databases have a few decades of proven support behind them in this area.
Hope this helps.
I think you will find most on-line hotel reservation systems aren't really on-line. My experience is that those companies (not the hotels themselves) offering on-line booking systems also insist that the hotel itself books its rooms on-line using the same system.
Everything works fine as long as connectivity is not an issue - and in the small-motel scenario it often will be. Of course, the bigger hotels use the same system the airlines do, and they have dedicated communication links for the purpose. The reservations are of course maintained on one central computer with appropriate backup links, etc.
It is very easy for individual tennis clubs to offer their own real-time online booking systems using their own database/website with programs like MyCourts; however, once you want to link more than one club's facilities, you really don't have much option other than a centralized server that both the user and the club have to use to reserve facilities.

Open Source Address Scrubber?

I have a set of names and addresses that have been entered into an Excel spreadsheet, but the problem is that the many people who entered the addresses used many different non-standard formats. I want to scrub the addresses before transferring all of them to my database. Looking around, all I really found in the way of address scrubbers (parsers or formatters) is the one put out by Semaphore. For my purposes I don't really need all of that, and I don't want to pay their licensing fees. Is there anything out there that is free and/or open source that will do the scrubbing for me?
Since I work in the mailing business ...
A mailable address is not a geo-code. One lets the USPS deliver mail to a point; the other tells you where on earth that point is. The USPS does not geo-code its mailable addresses. Geo-coding is useful for marking areas/regions of people for targeting.
You're not buying a license to the software, you're buying the data. The post office has lots of rules, especially if you're doing this commercially and trying to get a better rate than first class. See the USPS Domestic Mail Manual for the complete list. The USPS moves ZIPs and households between ZIPs all the time. The company I work for pays the USPS for its updated mailing list so we can keep our DBs updated. Weekly.
Back to your question. Do you want to change the data into a common format (street -> st), or are you looking for duplicates and want to store only real mailable addresses?
For the common format: you can break the address into pieces, clean up the whitespace, and apply a dictionary of terms/translations, then use some SQL to find the duplicates (see the sketch at the end of this answer). Keep in mind households (1 main st) are different from persons (john doe, 1 main st).
For the mailable addresses: well, some of you (the readers) won't like this answer, but you want information, and that isn't free. Someone spends time or money to acquire and maintain these lists. So find a business model to fund the list, or go to someone who will do it for you. Data and mail management
Realistically, Semaphore is pretty cheap; just keep in mind that the address DB will have to be updated quarterly, and $19/quarter is not much.
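To illustrate the "common format" option above, a minimal sketch; the abbreviation dictionary here is only a sample, and USPS Publication 28 has the full suffix list:

import re

ABBREV = {"street": "st", "avenue": "ave", "boulevard": "blvd",
          "apartment": "apt", "north": "n", "south": "s"}  # sample only

def scrub(address):
    # Strip punctuation, lowercase, collapse whitespace, apply dictionary
    words = re.sub(r"[.,]", " ", address.lower()).split()
    return " ".join(ABBREV.get(w, w) for w in words)

a = scrub("1  Main Street,  Apartment 4")
b = scrub("1 main st apt 4")
print(a == b)   # duplicates now compare equal: True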
Another address-scrubbing product is SAP PostalSoft. I don't know what the data will cost, though.
I actually work in the address verification industry... Jim's answer is a smart accept. Unfortunately for those of us with low budgets, official USPS data is pricey and the systems are complicated. (I know by experience, since the company I work for, SmartyStreets, provides address verification at lower rates than most.)
The best I can do here to help is recommend a low-cost/free alternative (depending on your volume) such as LiveAddress, where for a list of addresses there's no minimum purchase, and the API is super-cheap and super-easy, comparatively.
A .NET wrapper for the USPS APIs
http://www.codeproject.com/KB/cs/USPS_Web_Tools_Wrapper.aspx
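For reference, the HTTP call such wrappers make underneath looks roughly like this (a hedged sketch: the USERID is issued when you register with USPS, and the authoritative field layout is in the Web Tools documentation):

import urllib.parse
import urllib.request

# AddressValidateRequest per the USPS Web Tools docs; Address1 is the
# unit, Address2 the street line.
xml = """<AddressValidateRequest USERID="YOUR_USERID">
  <Address ID="0">
    <Address1>Apt 4</Address1>
    <Address2>1 main street</Address2>
    <City>Springfield</City>
    <State>IL</State>
    <Zip5></Zip5>
    <Zip4></Zip4>
  </Address>
</AddressValidateRequest>"""

url = ("https://secure.shippingapis.com/ShippingAPI.dll?API=Verify&XML="
       + urllib.parse.quote(xml))
with urllib.request.urlopen(url, timeout=10) as resp:
    print(resp.read().decode())   # standardized address returned as XML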
Most of the software that I've worked with to do this is very expensive (or to put it another way, marketing departments are naive and have huge budgets).
This sort of work is a precursor to geo-coding. The linked Wiki article includes a list of geocoding software, some of which is free. If you're lucky, some of the free ones may include address-standardization routines.
If you find a good one, let me know.
We use AccuZIP. It's a lot cheaper than most solutions (~$700/year) and comes with bi-monthly updates. It uses the USPS address standardization API, for which I've written a .NET wrapper. This allows me to run it in real time (AccuZIP, by default, comes only with a batch mode).
