Open Source Address Scrubber?

I have a set of names and addresses that were entered into an Excel spreadsheet, but the many people who entered the addresses used many different non-standard formats. I want to scrub the addresses before transferring all of them to my database. Looking around, all I really found in the way of address scrubbers (parsers or formatters) is the one put out by Semaphore. For my purposes, I don't really need all of that, and I don't want to pay the licensing fees for the software. Is there anything out there that is free and/or open source that will do the scrubbing for me?

Since I work in the mailing business ...
A mailable address is not the same thing as geocoding. The first lets the USPS deliver mail to a location; the second tells you where on earth that point is. The USPS does not geocode its mailable addresses. Geocoding is useful for marking areas/regions of people for targeting.
You're not buying a license to the software, you're buying the data. The post office has lots of rules especially if you're doing this commercially and trying to get a better rate than first class. See USPS Domestic Mail Manual for the complete list of rules. The USPS moves zips and households between zips all the time. The company (I work for) pays the USPS for its updated mailing list so we can keep our DBs updated. Weekly.
Back to your question. Do you want to change the data into a common format (street -> st), or are you looking for duplicates and want to store only real mailable addresses?
For a common format, you can break the address into pieces, clean up the whitespace, and apply a dictionary of terms/translations. Then apply some SQL to find the duplicates. Keep in mind that households (1 main st) are different from persons (john doe, 1 main st).
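The common-format approach can be sketched in Python as below. This is only a minimal illustration: the abbreviation table is a tiny made-up sample, the function names are hypothetical, and the duplicate grouping is done in Python here rather than in SQL as the answer suggests.

```python
import re

# Illustrative abbreviation table (an assumption; a real one would be far longer).
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "road": "rd",
    "boulevard": "blvd",
    "north": "n",
    "south": "s",
    "east": "e",
    "west": "w",
    "apartment": "apt",
    "suite": "ste",
}


def normalize_address(raw: str) -> str:
    """Lower-case, drop punctuation, collapse whitespace, abbreviate known terms."""
    cleaned = re.sub(r"[.,#]", " ", raw.lower())
    tokens = cleaned.split()  # split() also collapses runs of whitespace
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def find_duplicate_households(rows):
    """Group (name, address) rows by normalized address, i.e. by household."""
    households = {}
    for name, address in rows:
        households.setdefault(normalize_address(address), []).append(name)
    # Only addresses shared by more than one row are reported.
    return {addr: names for addr, names in households.items() if len(names) > 1}
```

Grouping by the normalized address alone finds households; grouping by name plus normalized address would find duplicate persons instead.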
For the mailable addresses, well, some of you (the readers) won't like this answer, but you want information, and that isn't free. Someone spends time or money to acquire and maintain these lists. So find a business model that funds the list, or go to someone who will do it for you (data and mail management).
Realistically, Semaphore is pretty cheap. Just keep in mind that the address DB has to be updated quarterly; at $19/quarter that is still inexpensive.
Another address-scrubbing product is SAP PostalSoft. I don't know what the data will cost, though.

I actually work in the address verification industry... Jim's answer is a smart accept. Unfortunately for those of us with low budgets, official USPS data is pricey and the systems are complicated. (I know by experience, since the company I work for, SmartyStreets, provides address verification at lower rates than most.)
The best I can do here to help is recommend a low-cost/free alternative (depending on your volume) such as LiveAddress, where for a list of addresses there's no minimum purchase, and the API is super-cheap and super-easy, comparatively.

A .NET wrapper for the USPS APIs
http://www.codeproject.com/KB/cs/USPS_Web_Tools_Wrapper.aspx

Most of the software that I've worked with to do this is very expensive (or to put it another way, marketing departments are naive and have huge budgets).
This sort of work is a precursor to Geo-coding. This linked Wiki article includes a list of Geocoding software, some of which is free. If you're lucky, some of the free ones may include address standardizing routines.
If you find a good one, let me know.

We use Accuzip. It's a lot cheaper than most solutions (~$700/year) and comes with bi-monthly updates. It uses the USPS address standardization API, for which I've written a .NET wrapper. This allows me to run it in real-time (Accuzip, by default, comes only with a batch mode).

Related

Store street address & prevent duplicates

I have a database that I am accessing through Django & Python. We want to store buildings based on their addresses (not names, since some buildings simply don't have names).
We need to prevent users from entering duplicate entries into our database for the same building. This is made difficult by the way people could type in the addresses (eg. "1000 Main Street" vs. "1000 Main St.")
In what way can we reliably prevent duplicates? I am using a MySQL database.
Thanks

If you're working only with the U.S., you can use the USPS Address Standardization web service to resolve duplicates:
http://www.usps.com/webtools/address.htm

Address de-duplication is a complicated task. While the USPS web service is alright, it's seriously lacking in some important features. Plus, it's quite inefficient to perform batch de-duplication using a regular web service, performing requests, etc.
And, it appears the USPS has updated their site, so the link Dan posted, while useful, is now broken.
As an updated answer, I'd like to point out that I work for SmartyStreets and we remove duplicates from address lists. You could, for example, upload your list to CASS-Certified Scrubbing and the addresses will be standardized and flagged for duplicates. It's really easy this way. If you need point-of-entry validation, take a look at LiveAddress, which provides more important information than the USPS service alone does.

How to get book metadata?

My application needs to retrieve information about any published book based on a provided ISBN, title, or author. This is hardly a unique requirement---sites like Amazon.com, Chegg.com, and even software like Book Collector seem to be able to do this easily. But I have not been able to replicate it.
To clarify, I do not need to search the entire database of books---only a limited subset which have been inputted, as in a book collection. The database would simply allow me to tag the inputted books with the necessary metadata to enable search on that subset of books. So scale is not the issue here---getting the metadata is.
The options I have tried are:
Scrape Amazon. Scraping the regular Amazon pages was not very robust to things like missing authors, and while scraping the smaller mobile pages was faster, they shared the same issues with robustness of extraction. Plus, building this into an application is a clear violation of Amazon's Terms of Service.
Scrape the Library of Congress. While this seems to have fewer legal ramifications, ease and robustness were again issues.
ISBNdb.com API. While the service is free up to a point, and does a good job of returning the necessary metadata, I need to do this for over 500 books on a daily basis, at which point this service costs money proportional to use. I'd prefer a free or one-time payment solution that allows me to do the same.
Google Book Data API. While this seems to provide the information I need, I cannot display the book preview as their terms of service requires.
Buy a license to a database of books. For example, companies like Ingram or Baker & Taylor provide these catalogs to retailers and libraries. This solution is obviously expensive, so I'm hoping that there's a more elegant solution I've missed. But if not, and someone on SO has had a good experience with a particular database, I'm willing to go with that.
I've tried to describe my approach in detail so others with fewer books can take advantage of the above solutions. But given my requirements, I'm at my wits' end for retrieving book metadata.

Since it is unlikely that you have to retrieve the same 500 books every day: store the data retrieved from isbndb.com in a database and fill it up book by book.
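A minimal sketch of that caching idea, using SQLite from Python's standard library. The table layout and function names are assumptions, and `fetch_remote` is a hypothetical callable standing in for whatever code actually queries isbndb.com; no real API call is shown.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the local metadata cache."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "isbn TEXT PRIMARY KEY, metadata TEXT NOT NULL)"
    )
    return conn


def get_book(conn, isbn, fetch_remote):
    """Return cached metadata, calling fetch_remote(isbn) only on a cache miss."""
    row = conn.execute(
        "SELECT metadata FROM books WHERE isbn = ?", (isbn,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    metadata = fetch_remote(isbn)  # hypothetical remote lookup
    conn.execute(
        "INSERT INTO books (isbn, metadata) VALUES (?, ?)",
        (isbn, json.dumps(metadata)),
    )
    conn.commit()
    return metadata
```

With this in place, only books that have never been seen before count against the paid quota.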

Instead of scraping Amazon, you can use the API they expose for their affiliate program: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
It allows about 3k requests per hour and returns well-formed XML. It requires you to set a link to the book that you show the information about, and you must state that you are an affiliate partner.

This might be what you're looking for. They even offer a complete download!
https://openlibrary.org/data

It seems that a lot of libraries and other organisations make information such as the ISBN available through MAchine-Readable Cataloging, aka MARC; you can find more information about it here as well.
Now knowing the "right" term to search for I discovered WorldCat.org.
Maybe this whole MARC thing gives you a new kind of an idea :)

Booking logic and architecture, database sync: Hotels, tennis courts reservation system

Imagine that you want to design a tennis booking system.
You have 5 tennis clubs as partners, with no online API that lets you check on their side whether a court is booked: you have to build this part as well.
Every time a booking is made on their side, we want our system to know about it, probably via a POST request from the tennis partner to our server.
Every time a booking is made on our website, we want to push the booking to their system. The difficulty is that their system needs to be online and accessible from outside. The IP may change, so we have to use a DNS updater.
If their system is not available, we still accept the booking and fall back to an async email with an 'I confirm booking/reject booking' link sent to the club.
I find the whole process quite complex and was wondering how online hotel booking systems and hotels work together. Do they all have their data open and online?
The good thing is that the data will grow large and should fit nicely into some NoSQL store ;) like CouchDB.

There are several questions here, let me try and address each one...
Since this appears to be an internet application with federated servers, using the implied HTTP protocol makes a lot of sense. This could be done via form POSTs, GETs, or even RESTful submission of some custom data structure. In the end, the exact approach will come down to the size and complexity of the information being communicated. Many architectures employ these approaches and often combine them with encrypted, signed, and/or encoded payloads for security. One shortfall to consider is that these approaches require you to clearly communicate all request/response message formats, field ranges, and variations, since these mechanisms are not really self-describing. On the other hand, these patterns use very common protocols, are easily understood, are easy to implement, and are typically lean on the wire.
In contrast, architectures with very complex structures often choose WSDL-based web services. Also driven by common standards, these tend to be self-describing and inherently versionable, although they can take more time and energy to implement. Web services have a lot of advantages, driven by the many WS-* standards, which may be worth investigating further in your case.
As for the reservation process... many similar architectures will employ an orchestration model such as the following:
Find open booking spaces
Make a reservation for a booking space. This places an expiring lock on the space while the requestor fills in all required booking information, which mitigates race conditions that could lead to multiple bookings for the same space.
Once all required booking information is received and validated, the booking is confirmed and permanently locked from use by other requestors.
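The find / reserve / confirm steps above can be sketched as follows. This is an in-memory, single-process illustration with hypothetical names; a real system would keep the holds in a database and take the lock inside a transaction, which this sketch does not attempt.

```python
import time


class CourtBookings:
    """Tracks temporary holds and confirmed bookings per slot."""

    def __init__(self, hold_seconds=300, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock      # injectable so tests can control time
        self.holds = {}         # slot -> (user, expires_at)
        self.confirmed = {}     # slot -> user

    def is_open(self, slot):
        """Step 1: a slot is open if it is neither confirmed nor actively held."""
        if slot in self.confirmed:
            return False
        hold = self.holds.get(slot)
        return hold is None or hold[1] <= self.clock()

    def reserve(self, slot, user):
        """Step 2: place an expiring lock while the user fills in details."""
        if not self.is_open(slot):
            return False
        self.holds[slot] = (user, self.clock() + self.hold_seconds)
        return True

    def confirm(self, slot, user):
        """Step 3: turn a still-valid hold into a permanent booking."""
        hold = self.holds.get(slot)
        if hold is None or hold[0] != user or hold[1] <= self.clock():
            return False
        del self.holds[slot]
        self.confirmed[slot] = user
        return True
```

An expired hold is simply overwritten by the next `reserve` call, so no background cleanup job is needed in this sketch.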
As for the SQL-style DB comment, I can't really say given the amount of information supplied. With that said, my instincts tell me a SQL-style DB is completely reasonable for this problem set. I have databases holding many petabytes that operate under very demanding SLAs. You implied a need for high availability, and SQL-based databases have a few decades of proven support behind them in this area.
Hope this helps.

I think you will find most on-line hotel reservation systems aren't really on-line. My experience is that those companies (not the hotels themselves) offering on-line booking systems also insist that the hotel itself also books their rooms on-line using the same system.
Everything works fine as long as connectivity is not an issue, and in a small-motel scenario it normally will be. Of course, the bigger hotels use the same system the airlines do, and they have dedicated communications links for the purpose. The reservations are maintained on one central computer with appropriate backup links and so on.
It is very easy for individual tennis clubs to offer their own real-time online booking systems using their own database/website, with programs like the one MyCourts offers. However, once you want to link the facilities of more than one club, you really don't have much option other than a centralized server that both the user and the club have to use to reserve facilities.

Where can I find a city/neighborhood database?

Where can I find a database of cities and neighborhoods using MySQL? I'm only interested in US areas. Price doesn't matter.
The database must help identify locations by ZIP code. I've already got a database showing cities and states, but I need to find surrounding neighborhoods as well.
I saw a good example on http://www.oodle.com/.

The Zillow Neighborhood data has a CC-sharealike license and it is pretty comprehensive. It is widely used in the Geospatial world nowadays.
Cheers

For a fee... you can subscribe to Maponics' Neighborhood dataset
While Maponics provides mostly GIS data, (eg. allowing one to pinpoint on a map the boundaries of neighborhoods and such), the simple neighborhood list is also available, I think.
Another commercial offering is Urban Mapping's
If you target particular cities/counties, there are plenty of free resources to be found, often on the .gov / .us sites for specific cities and counties. Unfortunately, aside from the difficulty of locating such resources (there doesn't seem to be any practical directory of such local government-managed databases), there is no standard for the format in which the data is stored or for the specific semantics of the data collected. Luckily, the ZIP code is rather unambiguous, and the neighborhood concept is relatively general (even though the neighborhoods themselves can be quite dynamic, with both the introduction of new neighborhood names and some minor shifting of boundaries).
The overall complexity of the task of compiling such databases, the long half-life of the data, and the potentially lucrative uses of such data, seem to explain why it is hard to find non-commercial sources.

This is an old question - but there is a far better and EASIER way of doing it as of June 2015:
http://maps.googleapis.com/maps/api/geocode/json?address=YOUR_ADDRESS&sensor=false
Example:
http://maps.googleapis.com/maps/api/geocode/json?address=11%20W%2053rd%20St%20New%20York&sensor=false
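As a small sketch of how the example URL above is formed, the address can be percent-encoded with Python's standard library. The endpoint and the `address` and `sensor` parameters are copied from this answer; the function name is hypothetical, no request is sent, and whether the service still accepts this exact form is not something the sketch checks.

```python
from urllib.parse import quote, urlencode

# Endpoint as quoted in the answer above.
GEOCODE_ENDPOINT = "http://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str) -> str:
    """Build the request URL, encoding spaces as %20 like the example."""
    query = urlencode({"address": address, "sensor": "false"}, quote_via=quote)
    return f"{GEOCODE_ENDPOINT}?{query}"
```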

Here's a great site offering free databases for both cities and countries:
http://ipinfodb.com/ip_database.php

Yelp has a neighborhood API.
http://www.yelp.com/developers/documentation/technical_overview

It might be worth checking out some of the links in this article. There are several where you might find the data you're after.

Infochimps has the Zillow Neighborhoods API:
http://www.infochimps.com/datasets/zillow-neighborhoods

Maponics has over 150,000 neighborhoods worldwide available in MySQL and other formats, as well as an API.

Urban Mapping has an API to find neighborhoods by address, City/State, and as you need in your case, Zip Code (called the getNeighborhoodsByPostalCode method).
Here is a link to their demo apps which show how it works:
URBANWARE API Demo Applications
Edit:
Urban Mapping doesn't exist anymore, and the Demo link has linkrot; here's what it did look like, via Wayback Machine
While this isn't a database per se, you could quickly populate your own database by calling their API for every Zip code you'd be interested in seeing.
Note that this is part of their Premium API. If you have the long/lat coordinates of each city, you can use their free API to get a list of neighborhoods whose boundaries contain the long/lat coordinates.

Software evaluation licensing [closed]

My company is looking to start distributing some software we developed and would like to be able to let people try the software out before buying. We'd also like to make sure it can't be copied and distributed to our customers' customers.
One model we've seen is tying a license to a MAC address so the software will only work on one machine.
What I'm wondering is, what's a good way to generate a license key with different information embedded in it such as license expiration date, MAC address, and different software restrictions?

I've used both FLEXlm from Macrovision (formerly Globetrotter) and the newer RLM from Reprise Software (as I understand, written by FlexLM's original authors). Both can key off either the MAC address or a physical dongle, can be either node-locked (tied to one machine only) or "floating" (any authorized machine on the network can get a license doled out by a central license server, up to a maximum number of simultaneously checked-out copies determined by how much they've paid for). There are a variety of flexible ways to set it up, including expiration dates, individual sub-licensed features, etc. Integration into an application is not very difficult. These are just the two I've used, I'm sure there are others that do the job just as well.
These programs are easily cracked, meaning that there are known exploits that let people bypass the security of an application that uses them, either by cutting their own licenses to spoof the license server or by merely patching your binary to bypass the license check (essentially replacing the subroutine call to their library with code that just says "return 'true'"). It's more complicated than that, but that's what it mostly boils down to. You'll see cracked versions of your product posted to various Warez sites. It can be very frustrating and demoralizing, all the more so because the crackers are often interested in cracking for cracking's sake and don't even have any use for your product or knowledge of what to do with it. (This is obvious if you have a sufficiently specialized program.)
Because of this, some people will say you should write your own, maybe even change the encryption scheme frequently. But I disagree. It's true that rolling your own means that known exploits against FLEXlm or RLM won't instantly work for your application. However, unless you are a total expert on this kind of security (which clearly you aren't or you wouldn't be asking the question), it's highly likely that in your inexperience you will end up writing a much less secure and more crackable scheme than the market leaders (weak as they may be).
The other reason not to roll your own is simply that it's an endless cat and mouse game. It's better for your customers and your sales to put minimal effort into license security and spend that time debugging or adding features. You need to come to grips with the licensing scheme as merely "keeping honest people honest", but not preventing determined cracking. Accept that the crackers wouldn't have paid for the software anyway.
Not everybody can take this kind of zen attitude. Some people can't sleep at night knowing that somebody somewhere is getting something for nothing. But try to learn to deal with it. You can't stop the pirates, but you can balance your time/effort/expense trying to stop all piracy versus making your product better for users. Remember, sometimes the most pirated applications are also the most popular and profitable. Good luck and sleep well.

I'd suggest you take the pieces of information you want in the key, hash them with md5, and then just take the first X characters (where X is a key length you think is manageable).
Cryptographically, it's far from perfect, but this is the sort of area where you want to put in the minimum amount of effort that will stop a casual attacker; anything more quickly becomes a black hole.
Oh, I should also point out that you will want to provide the expiration date (and any other information you might want to read out yourself) in plain text (or slightly obfuscated) as part of the key as well if you go down this path. The md5 is just there to stop the end user from changing the expiration date to extend the license.
The easiest thing would be a key file like this...
# License key for XYZZY
expiry-date=2009-01-01
other-info=blah
key=[md5 hash of MAC address, expiry date, other-info]
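A minimal Python sketch of generating and checking a key file like the one above. The function names, the 16-character key length, and especially the `SECRET` value mixed into the hash are assumptions that are not part of the answer; without some private value, anyone who knows the recipe could recompute the hash after editing the date.

```python
import hashlib

SECRET = "replace-with-a-private-value"  # assumption: not in the original answer
KEY_LENGTH = 16                          # the "first X characters" of the digest


def make_key(mac, expiry, other_info):
    """Hash the license fields and keep the first KEY_LENGTH hex characters."""
    payload = "|".join([SECRET, mac.lower(), expiry, other_info])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:KEY_LENGTH]


def parse_license(text):
    """Read name=value lines, skipping blanks and '#' comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        fields[name.strip()] = value.strip()
    return fields


def verify(fields, mac):
    """Recompute the hash from the plain-text fields and compare with the key."""
    expected = make_key(mac, fields.get("expiry-date", ""), fields.get("other-info", ""))
    return fields.get("key") == expected
```

Checking that the expiry date has not passed would be a separate step after `verify` succeeds; the hash only shows that the plain-text fields were not edited.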

We've used the following algorithm at my company for years without a single incident.
Decide the fields you want in the code. Bit-pack as much as possible. For example, dates could be "number of days since 2007," and then you can get away with 16 bits.
Add an extra "checksum" field. (You'll see why in a second.) The value of this field is a checksum of the packed bytes from the other fields. We use "first 32 bits from MD5."
Encrypt everything using TEA. For the key, use something that identifies the customer (e.g. company name + personal email address), that way if someone wants to post a key on the interweb they have to include their own contact info in plain text.
Convert hex to a string in some sensible way. You can do straight hex digits but some people like to pick a different set of 16 characters to make it less obvious. Also include dashes or something regularly so it's easier to read it over the phone.
To decrypt, convert the string back to hex and decrypt with TEA. But then there's this extra step: compute your own checksum of the fields (ignoring the checksum field) and compare it to the given checksum. This is the step that ensures no one tampered with the key.
The reason is that TEA mixes the bits completely, so if even one bit is changed, all other bits are equally likely to change during TEA decryption, therefore the checksum will not pass.
Is this hackable? Of course! Almost everything is, but this is tight enough and simple to implement.
If tying to contact information is not sufficient, then include a field for "Node ID" and lock it to MAC address or somesuch as you suggest.
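The scheme described in this answer can be sketched in Python roughly as follows. Several details are assumptions chosen to keep the example short: the layout (16-bit day count, 16-bit feature flags, 32-bit checksum, exactly one 64-bit TEA block), deriving the 128-bit TEA key as the MD5 of the customer string, and all function names. It is an illustration, not the poster's actual code.

```python
import hashlib
import struct
from datetime import date, timedelta

DELTA = 0x9E3779B9
MASK = 0xFFFFFFFF
EPOCH = date(2007, 1, 1)


def tea_key(customer):
    # Assumption: the 128-bit TEA key is the MD5 digest of the customer identity.
    return struct.unpack(">4I", hashlib.md5(customer.encode("utf-8")).digest())


def tea_encrypt(v0, v1, k):
    total = 0
    for _ in range(32):
        total = (total + DELTA) & MASK
        v0 = (v0 + (((v1 << 4) + k[0]) ^ (v1 + total) ^ ((v1 >> 5) + k[1]))) & MASK
        v1 = (v1 + (((v0 << 4) + k[2]) ^ (v0 + total) ^ ((v0 >> 5) + k[3]))) & MASK
    return v0, v1


def tea_decrypt(v0, v1, k):
    total = (DELTA * 32) & MASK
    for _ in range(32):
        v1 = (v1 - (((v0 << 4) + k[2]) ^ (v0 + total) ^ ((v0 >> 5) + k[3]))) & MASK
        v0 = (v0 - (((v1 << 4) + k[0]) ^ (v1 + total) ^ ((v1 >> 5) + k[1]))) & MASK
        total = (total - DELTA) & MASK
    return v0, v1


def checksum(packed):
    # "First 32 bits from MD5" of the packed fields.
    return struct.unpack(">I", hashlib.md5(packed).digest()[:4])[0]


def make_license(customer, expiry, features):
    # Bit-pack: days since 2007 (16 bits) + feature flags (16 bits).
    packed = struct.pack(">HH", (expiry - EPOCH).days, features)
    v0 = struct.unpack(">I", packed)[0]
    c0, c1 = tea_encrypt(v0, checksum(packed), tea_key(customer))
    digits = "%08X%08X" % (c0, c1)
    # Dashes every four characters make the code easier to read aloud.
    return "-".join(digits[i:i + 4] for i in range(0, 16, 4))


def read_license(customer, code):
    """Return (expiry, features), or None if the code fails the checksum."""
    digits = code.replace("-", "")
    if len(digits) != 16:
        return None
    try:
        c0, c1 = int(digits[:8], 16), int(digits[8:], 16)
    except ValueError:
        return None
    v0, v1 = tea_decrypt(c0, c1, tea_key(customer))
    packed = struct.pack(">I", v0)
    if checksum(packed) != v1:
        return None  # tampered code or wrong customer identity
    days, features = struct.unpack(">HH", packed)
    return EPOCH + timedelta(days=days), features
```

Because the customer string is the decryption key, a code only validates when it is presented together with the contact details it was issued for, which is the deterrent the answer describes.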

Don't use MAC addresses. On some hardware we've tested - in particular some IBM Thinkpads - the MAC address can change on a restart. We didn't bother investigating why this was, but we learned quite early during our research not to rely on it.
Obligatory disclaimer & plug: the company I co-founded produces the OffByZero Cobalt licensing solution. So it probably won't surprise you to hear that I recommend outsourcing your licensing, & focusing on your core competencies.
Seriously, this stuff is quite tricky to get right, & the consequences of getting it wrong could be quite bad. If you're low-volume high-price a few pirated copies could seriously dent your revenue, & if you're high-volume low-price then there's incentive for warez d00dz to crack your software for fun & reputation.
One thing to bear in mind is that there is no such thing as truly crack-proof licensing; once someone has your byte-code on their hardware, you have given away the ability to completely control what they do with it.
What a good licensing system does is raise the bar sufficiently high that purchasing your software is a better option - especially with the rise in malware-infected pirated software. We recommend you take a number of measures towards securing your application:
get a good third-party licensing system
pepper your code with scope-contained checks (e.g. no one global variable like fIsLicensed, don't check the status of a feature near the code that implements the feature)
employ serious obfuscation in the case of .NET or Java code

The company I worked for actually used a USB dongle. This was handy because:
Our software was also installed on that USB stick.
The program would only run if it found the (unique) hardware key (any standard USB key has one, so you don't have to buy something special; any stick will do).
It was not restricted to one computer, but could be installed on another system if desired.
I know most people don't like dongles, but in this case it was quite handy because the software was a special-purpose media player that we also delivered. The USB keys could thus be used as a demo on any PC and also, without any modifications, in the real application (i.e. the real players) once the client was satisfied.

We keep it simple: store all the license data in an XML file (easy to read and manage), create a hash of the whole XML, and then encrypt it with a utility (also our own, and simple).
This is also far from perfect, but it can hold for some time.

Almost every commercial license system has been cracked. We have used many over the years, and all of them eventually got cracked. The general rule is: write your own, change it every release, and once you're happy, try to crack it yourself.
Nothing is really secure. Ultimately, look at the big players such as Microsoft: they go with the model that honest people will pay and others will copy. Don't put too much effort into it.
If your application is worth paying money for, people will pay.

I've used a number of different products that do the license generation and have created my own solution but it comes down to what will give you the most flexibility now and down the road.
Topics that you should focus on for generating your own license keys are...
Hex formatting, elliptic curve cryptography, and any of the encryption algorithms such as AES/Rijndael, DES, Blowfish, etc. These are great for creating license keys.
Of course, it isn't enough to have a key; you also need to associate it with a product and program the application to lock down based on the key system you've created.
I have messed around with creating my own solution but in the end when it came down to making money with the software I had to cave and get a commercial solution that would save me time in generating keys and managing my product line...
My favorite so far has been License Vault from SpearmanTech but I've also tried FlexNet (costly), XHEO (way too much programming required), and SeriousBit Ellipter.
I chose the License Vault product in the end because I would get it for much cheaper than the others and it simply had more to offer me as we do most of our work in .NET 3.5.

It is difficult to provide a good answer without knowing anything about your product and customers. For enterprise software sold to technical people you can use a fairly complex licensing system and they'll figure it out. For consumer software sold to the barely computer-literate, you need a much simpler system.
In general, I've adopted the practice of making a very simple system that keeps the honest people honest. Anyone who really wants to steal your software will find a way around any DRM system.
In the past I've used Armadillo (now Software Passport) for C++ projects. I'm currently using XHEO for C# projects.

If your product requires the use of the internet, then you can generate a unique id for the machine and use that to check with a license web service.
If it does not, I think going with a commercial product is the way to go. Yes, they can be hacked, but for the person who is absolutely determined to hack it, it is unlikely they ever would have paid.
We have used: http://www.aspack.com/asprotect.aspx
We also use a function call in their sdk product that gives us a unique id for a machine.
Good company although clearly not native English speakers since their first product was called "AsPack".
