Store location information, or use a third party source? - database

I'm working on a location-based web app (learning purposes), where users will rate local businesses. I then want users to be able to see local businesses, based on where they live and a given range (i.e., businesses within 10 miles of 123 Street. City St, 12345).
I'm wondering what I should use for location info; some 3rd party source (like Googles geocoding API) or host my own location database? I know of zip-code databases that come with lat/lon of each zip code along with other data, but these databases are often not complete, and definitely not global.
I know that most API's set usage limits, which may be a deciding factor. I suppose what I could do is store all data retrieved from Google into a database of my own, so that I never make the same query twice.
What do you guys suggest? I've tried looking at existing answers on SO, but nothing helped.
EDIT To be clear, I need a way to find all businesses that fall within a certain range of a given location. Is there a service that I could use to do this (i.e., return all cities, zips, etc. that fall within range of a given location)

Storing the data you retrieve in a local cache is always a good idea. It will reduce lag and keep from taxing whatever API you are using. It can also help keep you under usage limits as you stated. You can always place size limits on that cache and clear it out as it ages if the need arises.
Using an API means that you'll only be pulling data for sites you need information on, versus buying a bunch of data and having to load/host it all yourself (these can tend to get huge). I suggest using and API+caching

Related

How to cache batches of IDs "locally" in a serverless environment?

Traditionally, in a non-serverless environment, I would have the following system. Say I have a custom ID generation protocol for all my models. Say I also have 20 servers scattered around. I give each server a slice of IDs to work with off the whole stack of IDs. When they are done or the server goes down, it returns the IDs back to the system so they don't get wasted. The reason for sending each server a batch of IDs is so that every time a new record is created you don't need to fetch from a central ID server to get the next ID. Instead they have a local set they can work with freely.
How would you do this sort of thing in a serverless system? I am deploying to Vercel and wondering what the appropriate architecture might be for such an ID batching system. There are other use cases for needed a persistent copy of data in a local server, so if you don't like the ID example just imagine another sort of system. How do you solve this optimization problem in a serverless environment?
Serverless is an approach. Like all such things (solutions), it should be matched to the problem - not the other way around. Is this simply a case where serverless is a good solution choice for dealing with 80% of your problem, and that all you need to do is choose something appropriate to deal to the other 20%?
Assuming you have the freedom to do this, can't you just have the serverless parts of the solution consume non-serverless services - e.g. an ID Service?
Separately to this, caching comes to mind - just the general idea of having some data close by which might be mastered somewhere else. Caching patterns like Write Behind would allow you to work with local copies (i.e. immediate consumption) whilst farming out the cache-master communication.

Protecting (or tracking plagiarism of) Openly Available Web Content (database/list/addreses)

We have put together a very comprehensive database of retailers across the country with specific criteria. It took over a year of phone interviews, etc., to put together the list. The list is, of course, not openly available on our site to download as a flat file...that would be silly.
But all the content is searchable on the site via Google Maps. So theoretically with enough zip-code searches, someone could eventually grab all the retailer data. Of course, we don't want that since our whole model is to do the research and interviews required to compile this database and offer it to end-users for consumption on our site.
So we've come to the conclusion there isnt really any way to protect the data from being taken en-masse but a potentially competing website. But is there a way to watermark the data? Since the Lat/Lon is pre-calculated in our db, we dont need the address to be 100% correct. We're thinking of, say, replacing "1776 3rd St" with "1776 Third Street" or replacing standard characters with unicode replacements. This way, if we found this data exactly on a competing site, we'd know it was plagiarism. The downside is if users tried to cut-and-paste the modified addresses into their own instance of Google Maps -- in some cases the modification would make it difficult.
How have other websites with valuable openly-distributed content tackled this challenge? Any suggestions?
Thanks
It is a question of "openly distribute" vs "not openly distribute" if you ask me. If you really want to distribute it, you should acknowledge that someone can receive the data.
With certain kinds of data (media like photos, movies, etc) you can watermark or otherwise tamper with the data so it becomes trackable, but if your content is like yours that will become hard, and even harder to defend: if you use "third street" and someone else also uses it, do you think you can make a case against them? I highly doubt it.
The only steps I can think of is
Making it harder to get all the information. Hide it behind scripts and stuff instead of putting it on google maps, make sure it is as hard as you can make it for bots to get the information, limit the amount of results shown to one user, etc. This could very well mean your service is less attractive to the end user, this is a trade-off
Sort of the opposite of above: use somewhat the same technique to HIDE some of the data for the common user instead of showing it to them. This would be FAKE data, that a normal person shouldn't see. If these retailers show up at your competitors, you've caught them red-handed. This is certainly not fool-proof, as they can check their results for validity and remove your fake stuff, there is always a possibility a user with a strange system gets the fake data which makes your served content less correct, and lastly if your competitors' scraper looks too much like real user, it won't get the data.
provide 2-step info: in step one you get the "about" info, anyone can find that. In step 2, after you've confirmed that this is what the user wants, maybe a login, maybe just limited in requests etc, you give everything. So if the user searches for easy-to-reach retailers, first say in which area you have some, and show it 'roughly' on the map, and if they have chosen something, show them in a limited environment what the real info is.

Environmental database design

I've never designed a database before, but I've had experience programming in a few languages and assembler throughout college, as well as some web design, so I'm able to at least pick up what I need to know if I can be pointed in the right direction. One of the tasks of my job is to sort through some data that we've been collecting in the field, using a "sonde" which measures temperature, pH, conductivity, and other parameters. The device sits in a stream 24/7 (except for when we take it out and switch it with our other sonde every couple weeks, so that we can put in a newly calibrated one in the stream and retrieve the data from the one that was in the field). It collects data every 15 minutes or so, and has done so since 2007. Currently, all of our data is spread across multiple excel spreadsheets, and we have additional data from a weather station and another instrument that all gets compiled into quarterly documents. My goal is to design as simple of a database as possible with most of the functionality of a database like this: http://hudson.dl.stevens-tech.edu/hrecos/d/index.shtml. Ours would be significantly simpler as it is not live data (but would instead retrieve data from files that we upload once we'd finished handling the formatting and compilation of all our data). I would very much like the graphing ability on the site that the above database has, but I at least need to be able to select a range of data and select as many variables as I want within that time range and then be able to download a spreadsheet with the generated data (or at least a CSV file).
I realize this is a tough task, and as I have not designed a database before, I suspect it is very much an uphill task. However if I would be able to learn the things necessary to do this, and make it web-accessible, that would be a huge accomplishment and very much impress my boss. Any advice or tips to go off in the right direction would be very much appreciated.
Thanks for your help!
There are actually 2 parts to the solution you're looking for:
The database, which will store your data in a single organized place, and
The application, which is the interface used by people to interact with the database.
Basically, a database by itself is just a container. You need some kind of application which accept criteria from a user, pull the appropriate data meeting the criteria from the database, and display it to the user in a meaningful fashion - in this case, a graph or a spreadsheet.
Normally for web-based apps the database and application are two separate components. However, for a small app with a fairly small number of users, and especially for someone just starting out, you may want to consider an all-in-one solution like InfoDome, sort of like MSAccess for the web.
Either way, you're still going to need to learn about database design. There's many good tutorials out there, just do some searching. DatabaseAnswers.org has been useful for me. They have a set of tutorials as well as a large collection of sample database schemas.

Google Map & search by location feature

I'm building a site where my users will be able to specify locations (say, their residence, etc.). Then, I want to do 2 things to this information:
Plot these location on a mapping service (such as Google Map)
Allow users to search by location (e.g. find all users that live in or are in a certain radius from XYZ city)
The question I have is this: what is the best way for me to implement such features?
My concern is that my database of location information may or may not be in sync with the mapping service I use.
For example, say I have a list of cities and a user picks XYZ city from my list. Later, it turns out that city is not recognized by the mapping service. No way to plot the location on the map (but I could still provide the search by location feature).
If I try to use the database of the mapping service (if I can get a hold of something like that), I may end up being "locked in" to using that mapping service. Plus, these databases tend to be HUGE (and probably too much information for a simple application like mine).
Any recommendations on how to go about solving this problem? Thanks.
You should not rely only on the Cities but take the coordinates into account (watch out for different mappings of the world - this will change the coordinates). If you use the coordinates, you are independet from the service because x,y will always be x,y no matter the application or geo-provider. WIth the help of reverser geo-location services, you should be able to have something like that pretty easy.
This is very biased opinion: if(!!) you stick to, say Google, you can be quite sure that they won't quit the service in the next 1-2 years (maybe if there is a new uber-technology) but I doubt that very much. So I think it is ok to stick to a single service if you trust it and are sure that it will live for more than a few years.

Document/Image Database Repository Design Question

Question:
Should I write my application to directly access a database Image Repository or write a middleware piece to handle document requests.
Background:
I have a custom Document Imaging and Workflow application that currently stores about 15 million documents/document images (90%+ single page, group 4 tiffs, the rest PDF, Word and Excel documents). The image repository is a commercial, 3rd party application that is very expensive and frankly has too much overhead. I just need a system to store and retrieve document images.
I'm considering moving the imaging directly into a SQL Server 2005 database. The indexing information is very limited - basically 2 index fields. It's a life insurance policy administration system so I index images with a policy number and a system wide unique id number. There are other index values, but they're stored and maintained separately from the image data. Those index values give me the ability to look-up the unique id value for individual image retrieval.
The database server is a dual-quad core windows 2003 box with SAN drives hosting the DB files. The current image repository size is about 650GB. I haven't done any testing to see how large the converted database will be. I'm not really asking about the database design - I'm working with our DBAs on that aspect. If that changes, I'll be back :-)
The current system to be replaced is obviously a middleware application, but it's a very heavyweight system spread across 3 windows servers. If I go this route, it would be a single server system.
My primary concerns are scalabity and performace - heavily weighted toward performance. I have about 100 users, and usage growth will probably be slow for the next few years.
Most users are primarily read users - they don't add images to the system very often. We have a department that handles scanning and otherwise adding images to the repository. We also have a few other applications that receive documents (via ftp) and they insert them into the repository automatically as they are received, either will full index information or as "batches" that a user reviews and indexes.
Most (90%+) of the documents/images are very small, < 100K, probably < 50K, so I believe that storage of the images in the database file will be the most efficient rather than getting SQL 2008 and using a filestream.
Oftentimes scalability and performance are ultimately married to each other in the sense that six months from now management comes back and says "Function Y in Application X is running unacceptably slow, how do we speed it up?" And all too the often the answer is to upgrade the back end solution. And when it comes to upgrading back ends, its almost always going to less expensive to scale out than to scale up in terms of hardware.
So, long story short, I would recommend building a middleware app that specifically handles incoming requests from the user app and then routes them to the appropriate destination. This will sufficiently abstract your front-end user app from the back end storage solution so that when scalability does become an issue only the middleware app will need to be updated.
This is straightforward. Write the application to an interface, use some kind of factory mechanism to supply that interface, and implement that interface however you want.
Once you're happy with your interface, then the application is (mostly) isolated from the implementation, whether it's talking straight to a DB or to some other component.
Thinking ahead a bit on your interface design but doing bone stupid, "it's simple, it works here, it works now" implementations offers a good balance of future proofing the system while not necessarily over engineering it.
It's easy to argue you don't even need an interface at this juncture, rather just a simple class that you instantiate. But if your contract is well defined (i.e. the interface or class signature), that is what protects you from change (such as redoing the back end implementation). You can always replace the class with an interface later if you find it necessary.
As far as scalability, test it. Then you know not only if you may need to scale, but perhaps when as well. "Works great for 100 users, problematic for 200, if we hit 150 we might want to consider taking another look at the back end, but it's good for now."
That's due diligence and a responsible design tactic, IMHO.
I agree with gabriel1836. However, an added benefit would be that you could for a time run a hybrid system for a time since you aren't going to convert 14 millions documents from your proprietary system to you home grown system overnight.
Also, I would strongly encourage you to store the documents outside of a database. Store them on a file system (local, SAN, NAS it doesn't matter) and store pointers to the documents in the database.
I'd love to know what document management system you are using now.
Also, don't underestimate the effort of replacing the capture (scanning and importing) provided by the proprietary system.

Resources