Store street address & prevent duplicates - database

I have a database that I am accessing through Django & Python. We want to store buildings based on their addresses (not names, since some buildings simply don't have names).
We need to prevent users from entering duplicate entries into our database for the same building. This is made difficult by the different ways people can type the same address (e.g. "1000 Main Street" vs. "1000 Main St.").
In what way can we reliably prevent duplicates? I am using a MySQL database.
Thanks

If you're working only with the U.S., you can use the USPS Address Standardization web service to resolve duplicates:
http://www.usps.com/webtools/address.htm

Address de-duplication is a complicated task. While the USPS web service is alright, it's seriously lacking in some important features. Plus, it's quite inefficient to perform batch de-duplication using a regular web service, performing requests, etc.
And, it appears the USPS has updated their site, so the link Dan posted, while useful, is now broken.
As an updated answer, I'd like to point out that I work for SmartyStreets and we remove duplicates from address lists. You could, for example, upload your list to CASS-Certified Scrubbing and the addresses will be standardized and flagged for duplicates. It's really easy this way. If you need point-of-entry validation, take a look at LiveAddress, which provides more important information than the USPS service alone does.
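For cases where calling an external service on every save is not practical, a rough local normalization can at least catch the "Street" vs. "St." kind of duplicate from the question. The following is only a sketch under stated assumptions: the function name `normalize_address` and the tiny abbreviation table are made up for illustration, and it does not validate that an address exists or handle unit numbers, misspellings, or non-U.S. formats.

```python
import re

# Hypothetical, deliberately small abbreviation table; a real one would be
# much longer (USPS Publication 28 lists the standard street suffixes).
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "boulevard": "blvd",
    "road": "rd",
    "drive": "dr",
    "north": "n",
    "south": "s",
    "east": "e",
    "west": "w",
}


def normalize_address(raw: str) -> str:
    """Return a crude comparison key for a street address."""
    # Lowercase, then drop punctuation such as the dot in "St."
    cleaned = re.sub(r"[^\w\s]", "", raw.lower())
    # Splitting collapses repeated whitespace; known words are abbreviated.
    words = [ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)
```

Both "1000 Main Street" and "1000 Main St." produce the key "1000 main st". Storing that key in its own column with a unique constraint (in Django, a field declared with `unique=True`) would let the database reject the second entry, while the address as typed is kept for display.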


Protecting (or tracking plagiarism of) Openly Available Web Content (database/list/addresses)

We have put together a very comprehensive database of retailers across the country with specific criteria. It took over a year of phone interviews, etc., to put together the list. The list is, of course, not openly available on our site to download as a flat file...that would be silly.
But all the content is searchable on the site via Google Maps. So theoretically with enough zip-code searches, someone could eventually grab all the retailer data. Of course, we don't want that since our whole model is to do the research and interviews required to compile this database and offer it to end-users for consumption on our site.
So we've come to the conclusion there isn't really any way to protect the data from being taken en masse by a potentially competing website. But is there a way to watermark the data? Since the Lat/Lon is pre-calculated in our db, we don't need the address to be 100% correct. We're thinking of, say, replacing "1776 3rd St" with "1776 Third Street" or replacing standard characters with unicode replacements. This way, if we found this data exactly on a competing site, we'd know it was plagiarism. The downside is if users tried to cut-and-paste the modified addresses into their own instance of Google Maps -- in some cases the modification would make it difficult.
How have other websites with valuable openly-distributed content tackled this challenge? Any suggestions?
Thanks
It is a question of "openly distribute" vs "not openly distribute" if you ask me. If you really want to distribute it, you should acknowledge that someone can receive the data.
With certain kinds of data (media like photos, movies, etc) you can watermark or otherwise tamper with the data so it becomes trackable, but if your content is like yours that will become hard, and even harder to defend: if you use "third street" and someone else also uses it, do you think you can make a case against them? I highly doubt it.
The only steps I can think of are:
Making it harder to get all the information. Hide it behind scripts and stuff instead of putting it on google maps, make sure it is as hard as you can make it for bots to get the information, limit the amount of results shown to one user, etc. This could very well mean your service is less attractive to the end user; that is the trade-off.
Sort of the opposite of above: use somewhat the same technique to HIDE some of the data for the common user instead of showing it to them. This would be FAKE data, that a normal person shouldn't see. If these retailers show up at your competitors, you've caught them red-handed. This is certainly not fool-proof: they can check their results for validity and remove your fake stuff, there is always a possibility that a user with a strange system gets the fake data, which makes your served content less correct, and lastly, if your competitor's scraper looks too much like a real user, it won't get the fake data at all.
Provide 2-step info: in step one you get the "about" info, anyone can find that. In step 2, after you've confirmed that this is what the user wants, maybe a login, maybe just limited in requests etc, you give everything. So if the user searches for easy-to-reach retailers, first say in which area you have some, and show it 'roughly' on the map, and if they have chosen something, show them in a limited environment what the real info is.
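The fake-data idea can be checked mechanically once you have a list of records taken from a suspect site. This is a minimal sketch, assuming you keep a private set of decoy entries; the decoy names, the ZIP codes, and the helper names `record_key` and `find_decoys` are all invented for the example.

```python
# Hypothetical decoy entries that exist only in our own database.
DECOYS = {
    ("zzyzx hardware", "90210"),
    ("quillfeather outfitters", "10001"),
}


def record_key(name: str, zip_code: str) -> tuple[str, str]:
    """Normalize a retailer record so trivial edits do not hide a match."""
    return (" ".join(name.lower().split()), zip_code.strip())


def find_decoys(records):
    """Return the decoy keys that appear in a list of (name, zip) records."""
    seen = {record_key(name, zip_code) for name, zip_code in records}
    return sorted(seen & DECOYS)
```

Any non-empty result is a strong hint that the list was copied from you, since no independent researcher would have found a retailer that does not exist. As noted above, it proves nothing if the scraper filtered the decoys out first.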

Is there value in producing code so flexible that it will never need to be updated?

I am currently involved in a debate with my coworkers surrounding how I should design an API that will be used by my department. Specifically, I am tasked with writing an API that will serve as a wrapper facade to access Active Directory information - tailored to my company's/department's needs. I am aware that open source wrapper facades already exist, but that is not the crux of this question and is merely being used to serve as an example.
When I presented my design proposal to my team, they shot me down because the API was not "configurable" enough. They claimed that they did not want the API to make the link between "Phone number" and <Obscure Active Directory representation of Phone number>. Every person in the meeting (except for me) agreed that they would prefer to ask around, "What is the correct field in Active Directory to use for the user's phone number?", and plug that into their respective apps (LOL!).
They asked me, "What if our company decides to use a different field for phone number and you weren't around to make the change in your source code?" They eventually admitted that they were afraid to be tasked with changing someone else's source code, even if the code was pristine and had extensive unit tests. Every senior IT person in my department was in agreement on this.
Is this really the correct attitude to have when designing software?!
http://en.wikipedia.org/wiki/Inner_platform_effect
While hard-coding too many assumptions into your program is bad, overzealously avoiding hard-coded assumptions can be just as bad. If you try to make code excessively flexible, it becomes essentially impossible to configure, as the configuration scheme becomes almost a programming language in itself. I think in general, phone number is a common enough thing that it can just be hard coded as a field.
If I understood correctly, they want to have the option of mapping the links outside the code, be it through a configuration file, a database, whatever. If that is correct, I think they have a valid point - why be forced to change any code at all if all you need to do is to change a configuration mapping.
If possible, you should always err on the side of more configurable. It will save you headaches later.
Column Names
Specifically in your case, columns in tables are an inherently non-static variable. They will commonly change as your needs change.
If you have a "phonenum" column, then they add a second phone number, they change the column to "phonenum1" and "phonenum2". It would need to be changed in the code. Then if they change them to "Home_Phone", "Work_Phone", "Cell_Phone" then the code would again have to be changed. If, however, you had a mapping file (a key/value config file) then all these changes would be extremely simple to make.
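To make the mapping-file idea concrete, here is a small sketch using Python's standard `configparser`. The file contents are inlined so the example runs on its own, and the attribute names on the right-hand side (`telephoneNumber`, `mail`) are assumptions about what a directory might use, not something taken from the question.

```python
import configparser

# Contents of a hypothetical mapping file, inlined so the example runs.
MAPPING_FILE = """
[fields]
phone = telephoneNumber
email = mail
"""


def load_field_mapping(text: str) -> dict[str, str]:
    """Parse the [fields] section into {logical name: directory attribute}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["fields"])


def directory_field(mapping: dict[str, str], logical_name: str) -> str:
    """Translate a logical name such as 'phone' into the configured field."""
    try:
        return mapping[logical_name]
    except KeyError:
        raise KeyError(f"no mapping configured for {logical_name!r}") from None
```

If the company later moves phone numbers to a different attribute, only the line `phone = ...` in the file changes; callers keep asking for `"phone"`.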
In General
I disagree with dsimcha that an application can be 'too configurable'. What he is talking about is 'feature bloat', where there are so many intertwining configurables that it becomes impossible to change any one without futzing all the others. This is a very real problem. However, the problem is not the number of configuring options, the problem is how they are presented to the user.
If you present all the configuration options in a concise, clear, streamlined manner, with comments to explain each one and how it interacts with the others, you can have as many configuration variables as you want, because you have been careful to keep them segregated into singles or pairs, and have marked them as such.
You should be writing applications so that external (environmental) changes do NOT require code changes. Things such as
Database user password changes
Column names change
"Temp folder" location changes
Target Machine name/ip change
App needs to be run twice a day instead of once
Logging levels
None of those changes affect the function of the application and so there should be NO CODE CHANGES required. That is the metric you should use if you ever wonder whether hard-coding is all right.
If the functionality needs to change, it should be a code change. Otherwise, make it configurable.
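One common way to keep environmental details like these out of the code is to read them from environment variables with sensible defaults. A rough sketch, in which the `APP_*` variable names and the `Settings` fields are invented for illustration:

```python
import logging
import os
import tempfile
from dataclasses import dataclass

# Accepted values for the hypothetical APP_LOG_LEVEL variable.
LOG_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


@dataclass(frozen=True)
class Settings:
    db_host: str
    temp_dir: str
    log_level: int


def load_settings(env=os.environ) -> Settings:
    """Build settings from an environment mapping, falling back to defaults."""
    level_name = env.get("APP_LOG_LEVEL", "INFO").upper()
    return Settings(
        db_host=env.get("APP_DB_HOST", "localhost"),
        temp_dir=env.get("APP_TEMP_DIR", tempfile.gettempdir()),
        log_level=LOG_LEVELS.get(level_name, logging.INFO),
    )
```

Moving the database to another machine or raising the logging level then means changing a variable, not shipping a new build. Passing a plain dict as `env` also makes the loader easy to unit test.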
It seems easy enough to do both: produce a flexible API which allows the field to be specified, and then a wrapper around it which knows about the obscure ActiveDirectory name.
Of course, you could build that flexible solution later and just hard code the name for the moment. If that's significantly easier than the two-pronged approach, it's reasonable to argue for it - but if you'd probably end up with that sort of separation internally anyway, then it doesn't do much harm to expose it.
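A sketch of the two-layer idea, assuming a dictionary stands in for the real directory connection. The class names and the `telephoneNumber` default are illustrative choices, not part of any existing wrapper.

```python
class DirectoryClient:
    """Flexible layer: callers name the directory attribute themselves."""

    def __init__(self, entries):
        # `entries` stands in for a real directory connection:
        # {username: {attribute_name: value}}.
        self._entries = entries

    def get_attribute(self, username, attribute):
        """Return the attribute value, or None if the user lacks it."""
        return self._entries[username].get(attribute)


class EmployeeDirectory:
    """Convenience layer: knows which attribute holds which piece of data."""

    # Assumed attribute name; swap in whatever your directory uses.
    PHONE_ATTRIBUTE = "telephoneNumber"

    def __init__(self, client, phone_attribute=PHONE_ATTRIBUTE):
        self._client = client
        self._phone_attribute = phone_attribute

    def phone_number(self, username):
        return self._client.get_attribute(username, self._phone_attribute)
```

Most applications would call `EmployeeDirectory(client).phone_number("jdoe")` and never see the obscure name, while a team with unusual needs can pass a different `phone_attribute` or drop down to `DirectoryClient` directly.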
I can honestly say I have been in your position before and I agree with the argument they are presenting you. Especially with an in-house app you will see feature creep. The more useful your application, the worse the feature creep. It is possible your application could be used in another office and they will have fields mapped differently than your current office. If you hard code mappings you are then stuck with different versions for different locations. Maintaining separate versions of source code quickly becomes a nightmare for a programmer. If you design in configurability now and your application is forgotten you have lost very little, but if your application becomes a standard across the company you have saved yourself an immense amount of time in the future.
Fear of change, as well as fear of accountability for making a change, is not uncommon in IT software organizations. Often, the culture in the organization is what drives this attitude.
Having said that, in your specific example, you are describing an API that provides a facade on top of the ActiveDirectory service - one that appears to be intended to be used by different groups and/or projects in your organization.
In that particular scenario, it makes sense to make portions of your API support configurability, since you may ultimately find that the requirements for different projects or groups could be different, or change over time.
As a general practice, when I build code that involves a mapping of one programming interface to another and there are data mapping considerations involved, I try to make the mapping configurable. I've found that this helps both unit testing as well as dealing with evolving requirements or contradictory requirements from different consumers.
If you're saying "should I hard code everything", then I think it's not a good idea.
In 2 years you will be gone and there will be a programmer that will waste a lot of time trying to update your legacy code when updating a configuration file would have been way easier.
In some cases it makes sense to hard code information, but I don't think that your situation is one of these cases. I'd need more knowledge of the situation to be sure; this is just my guess from what you said.
I think it depends on why the API is being created, and what problems you're aiming to solve. If the aim of the API is to be a service that lives on a server somewhere and manages requests from different applications, then I think your approach is probably the way to go, with the addition of a database or config files to perhaps customize the LDAP paths of certain properties.
However, if the goal of the API is to simply be a set of classes that abstract away the details of accessing Active Directory, but not what properties are being accessed, then what your coworkers have specified is the way to go.
Either approach isn't necessarily right or wrong, so it ultimately depends on your overall reasons for creating the API in the first place.

Where can we find database design schemes (ERD or other) for very common use cases?

The question is simpler than it looks. There are many use cases that are well known and that people have put a lot of thought into, for example audit trails, user logins, and so on. We are looking for a good resource site that presents the DB design for those common use cases.
Check out
http://databaseanswers.org/
There are over 500 data models available for free there. It's not hard to convert them into a working database. For some of them, if you contact the webmaster, they will send you a working MS Access application with a built in database.
Even if you don't use Access this could serve as a prototype.
Try The Data Model Resource Book by Len Silverston. It has 3 volumes and not only shows but also explains the usual use cases.

how to restrict or filter database access according to application user attributes

I've thought about this too much now with no obviously correct solution. It might be a real wood-for-the-trees situation, so I need stackoverflow's help.
I'm trying to enforce database filtering on a regional basis. My system has various users and each one is assigned to a regional office. I only want users to be able to see data that is associated with their regional office.
Put simply my application is: Java App -> JPA (hibernate) -> MySQL
The database contains object from all regions, but I only want the users to be able to manipulate objects from their own region. I've thought about the following ways of doing it:
1) modify all database queries so they read something like select * from tablex where region="myregion". This is nasty. It doesn't work too well with JPA, e.g. the entitymanager.find() method only accepts a primary key. Of course I can go native, but I only have to miss one select statement and my security is shot
2) use a mysql proxy to filter results. kind of funky, but then the mysql proxy just sees the raw call and doesn't really know how it should be filtering them (i.e. which region the user that made this request belongs to). Ok, I could start a proxy for each region, but it starts getting a little messy..
3) use separate schemas for each region. yeah, simple, I'm using spring so I could use the RoutingDataSource to route the requests via the correct datasource (1 datasource per schema). Of course the problem now is somewhere down the line I'm going to want to filter by region and some other category. oops.
4) ACL - not really sure about this. If I did a select * from tablex; would it quietly filter out objects I don't have access to or would a load of access exceptions be thrown?
But am I thinking too much about this? This seems like a really common problem. There must be some easy solution I'm just too dumb to see. I'm sure it'll be something close to / or in the database as you want to filter as near to source as possible, but what?
Not looking to be spoonfed - any links, keywords, ideas, commercial/opensource product suggestions would be really appreciated!! thanks.
I've just been implementing something similar (REALbasic talking to MySQL) over the last couple of weeks for a hierarchical multi-company extension to an accounting package.
There's a large body of existing code which composes SQL statements so we had to live with that and just do a lot of auditing to ensure the restrictions were included in each table as appropriate. One gotcha was related lookups where lookup tables were normally only used in combination with a primary table but for some maintenance GUIs would load the lookup table itself, directly.
There's a danger of giving away implied information such as revealing that Acme Pornstars are a client of some division of the company ;-)
The only solution for that part was very careful construction of DB diagrams to show all implied relationships and lots of auditing and grepping source code, with careful commenting to indicate areas which had been OK'd as not needing additional restrictions.
The one pattern I've come up with to make this more generalised in future is, rather than explicit region=currentRegionVar type searches, using an arbitrary entityID which is supplied by a global CurrentEntityForRole("blah") function.
This abstraction allows for sharing of some data as well as implementing pseudo-entities which represent other restriction boundaries.
I don't know enough about Java and Spring to be able to tell but is there a way you could use views to provide a single-key lookup, where the views are restricted by the region filter?
The desire to provide aggregations and possible data sharing was why we didn't go down the separate database route.
Good Question.
Seems like #1 is the best since it's the most flexible.
Region happens to be what you're filtering on today, but it could be region + department + colour of hair tomorrow.
If you start carving up the data too much it seems like you'll be stuck working harder than necessary to glue them all back together for reporting.
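The main risk with option #1 is forgetting the filter in one query. One way to reduce that risk is to route every read through a single object that adds the region condition itself, including for primary-key lookups. The sketch below shows the idea in Python with an in-memory SQLite database rather than the Java/JPA/MySQL stack from the question; the class name, column names, and sample rows are all made up.

```python
import sqlite3


class RegionScopedRepository:
    """Runs queries against one table, always restricted to one region."""

    def __init__(self, connection, table, region):
        # The table name is interpolated into SQL, so accept identifiers only.
        if not table.isidentifier():
            raise ValueError(f"bad table name: {table!r}")
        self._conn = connection
        self._table = table
        self._region = region

    def find(self, primary_key):
        """Look up by id, returning None for rows in another region."""
        sql = (
            f"SELECT id, region, name FROM {self._table} "
            "WHERE id = ? AND region = ?"
        )
        return self._conn.execute(sql, (primary_key, self._region)).fetchone()

    def all(self):
        """Return every row that belongs to this repository's region."""
        sql = (
            f"SELECT id, region, name FROM {self._table} "
            "WHERE region = ? ORDER BY id"
        )
        return self._conn.execute(sql, (self._region,)).fetchall()


def demo_connection():
    """Build an in-memory database with made-up rows for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tablex (id INTEGER PRIMARY KEY, region TEXT, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO tablex VALUES (?, ?, ?)",
        [(1, "north", "alpha"), (2, "south", "beta"), (3, "north", "gamma")],
    )
    return conn
```

A repository created for "north" returns nothing for id 2, even though the row exists. Adding a second criterion later (department, say) means changing this one class instead of hunting through every query.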
I am having the same problem. It is hard to believe that such a common task (filtering a list of model entities based on the user profile) has no 'standard' way, pattern, or best practice.
I've found pgacl, a PostgreSQL module. Basically, you do your query like you normally would, and then you tack on an acl_access() predicate to work as a filter.
Maybe there is something similar for MySQL.
I suggest you use an ACL. It is more flexible than the other choices. Use Spring Security; you can use it without the rest of the Spring Framework. Read the tutorial from link text

Good methods for human-readable & human-maintained databases

So this is the scenario:
You have a bunch of data that needs to end up in SQL.
It needs to be entered by hand.
It is not an "enter once and you're done" scenario: it will need to be modified and expanded by humans in an ongoing iterative way. Comments will be associated with entries. It is also useful for data entry people to be able to see related entries near each other.
Different parts of data will need to be worked on simultaneously by different people.
Some error checking also needs to happen. (Let the data entry people correct their mistakes before SQL picks them up)
I have one answer, which is how my project currently operates, but it occurred to me that maybe there are other awesome ways of doing this which don't have the problems of my current method.
Look at YAML as a way to represent the data as plain, human-readable, and human-fixable text.
A very simple program can parse the YAML, locate errors and (if there are no errors) update the database.
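The "locate errors, then update the database only if there are none" step might look like the sketch below. It assumes the YAML has already been parsed into a list of dictionaries (for instance with PyYAML's `yaml.safe_load`, which is not in the standard library and is left out here so the example runs as is). The `items` table and the required fields are invented for illustration.

```python
import sqlite3

# Assumed shape of an entry: a name and a year are required,
# and a free-text comment is optional.
REQUIRED_FIELDS = {"name": str, "year": int}


def find_errors(entries):
    """Return human-readable problems; an empty list means the data is clean."""
    errors = []
    for index, entry in enumerate(entries, start=1):
        for field, expected in REQUIRED_FIELDS.items():
            if field not in entry:
                errors.append(f"entry {index}: missing '{field}'")
            elif not isinstance(entry[field], expected):
                errors.append(
                    f"entry {index}: '{field}' should be {expected.__name__}"
                )
    return errors


def load_entries(conn, entries):
    """Insert entries only if validation passes; return the error list."""
    errors = find_errors(entries)
    if not errors:
        with conn:  # one transaction: all rows are committed, or none
            conn.executemany(
                "INSERT INTO items (name, year, comment) VALUES (?, ?, ?)",
                [(e["name"], e["year"], e.get("comment")) for e in entries],
            )
    return errors
```

The error messages refer to entries by position so the data entry person can find the offending record in the text file and fix it before anything reaches SQL.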
These are some really basic requirements, and you probably have more issues than those stated. Nonetheless, you need a simple admin utility to enter data into your database.
A straight SQL query/update utility doesn't cut it because your team needs validation and such. You need multi-user access to the same data with transactional support. You also want to annotate your data entries and allow "related entries" to be viewed by your other users.
You need a database-maintenance application.
Consider using something like Django and its built-in admin utilities. It might be more than you're expecting, but I imagine you have more needs in your future than what you've stated here.
My answer is basically
Have the data entry work in Prolog files (Prolog facts)
Have multiple files, split up in a way that is sane for the data.
Have a script that converts the Prolog facts to SQL.
Have some tests in Prolog that validate the Prolog facts.
CONS of this approach:
a little bit annoying to have to check across multiple files to see if an entry already exists, or has been moved etc.
Writing Prolog, as simple as this is, is pretty scary for non-programmers (compared to say, filling out an Excel spreadsheet, or some guided process)
maybe: Merging is tricky, or maybe my VCS is just not very smart (see Which SCM/VCS cope well with moving text between files?)
So this works pretty well, but maybe there is something better that I've never thought of!
If the constraints you're referring to can be enforced at the database level, free software like Quest Toad could allow them to enter data directly into the db. It feels very much like using a spreadsheet when in grid view and displays an error when constraints are violated.
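As an illustration of what "enforced at the database level" can mean, here is a small sketch using SQLite from Python. The `entries` table, its columns, and the year range are made up; whichever tool the data entry people use, a rejected row would come back with an error message like the one `try_insert` returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entries (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE,
        year    INTEGER NOT NULL CHECK (year BETWEEN 1900 AND 2100),
        comment TEXT
    )
    """
)


def try_insert(name, year, comment=None):
    """Insert a row; return None on success or the database's error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO entries (name, year, comment) VALUES (?, ?, ?)",
                (name, year, comment),
            )
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None
```

A duplicate name or an out-of-range year is refused by the database itself, so no front end can bypass the rule.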
Alternatively, depending on what existing stack you have available, .Net grid views make it easy to slap together crud screens in little time.
