We have a database with roughly 1.5 million rows, soon to be closer to 3 million. Each row has an address. Our service is responsible for visualizing each row on a map in several ways. The issue is that a single map will often display well over a thousand rows, which makes it impractical for the client (or the server) to look up all 1,000+ coordinates on the fly through something such as the Google Maps API v3.
Ideally, we'd like to store the coordinates in the table so they are ready whenever we need them. However, with rate limiting it would take months to cache all the data.
Is there a service that has no limit, or maybe allows for multiple addresses to be sent at a time to expedite the process?
You could try LiveAddress by SmartyStreets -- the addresses will not only be geocoded but also verified. (Though you won't get geocoding results for addresses that don't verify.)
You could upload a list with all your addresses or process them through our API. (If your addresses aren't split into components, you'll need to use the API for now, which can receive 100 addresses per request.) Granted, for 1 million+ rows it's not free (unless you're a non-profit or educational institution), but the service scales in the cloud and can handle thousands of addresses per second. There are plans that fit millions of lookups, all the way up to plain Unlimited. (By the way, I work at SmartyStreets.)
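For the API route, the chunking itself is simple. Below is a rough TypeScript sketch of sending addresses 100 at a time; the endpoint URL, auth parameter, and response shape are placeholders for illustration, not the actual SmartyStreets contract.

```typescript
// Hypothetical batch-geocoding client: chunk addresses into groups of 100
// and POST each group. Endpoint, auth, and response fields are placeholders.
type GeocodeResult = { address: string; lat?: number; lng?: number };

async function geocodeAll(addresses: string[]): Promise<GeocodeResult[]> {
  const results: GeocodeResult[] = [];
  const batchSize = 100; // the API described above accepts 100 addresses per request

  for (let i = 0; i < addresses.length; i += batchSize) {
    const batch = addresses.slice(i, i + batchSize);
    const resp = await fetch("https://example-geocoder.invalid/batch?auth-token=YOUR_KEY", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(batch.map((street) => ({ street }))),
    });
    if (!resp.ok) throw new Error(`Batch starting at ${i} failed: ${resp.status}`);
    results.push(...((await resp.json()) as GeocodeResult[]));
  }
  return results;
}
```

At 100 addresses per request, even 1.5 million rows is only about 15,000 API calls, so working through the backlog is a matter of hours rather than months.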
Most addresses have "Zip9" precision (meaning the coordinates are accurate to the 9-digit ZIP code level, which is roughly block-level). Some are Zip7 or Zip5, which are less accurate but might be good enough for your needs.
If you need to arrange precision skydiving, though, you might consider a more dedicated mapping service that gives you rooftop-level precision and allows you to store the data. I know that you can store and cache SmartyStreets data, but map services have different restrictions. For example, Google has rooftop-level data for most US addresses, and lets you cache their data to improve performance, but you aren't allowed to store it in a database and build your own data set. You could also pay Google to raise your rate limits, though it's a little pricey.
I'm not sure what terms the other mapping providers have. (Geocoding services like TAMU have better accuracy but less capable infrastructure, thus rate limits, although you can probably pay to have those raised or lifted.)
Related
I'm trying to retrieve a random entry from a database using the Notion API. There is a limit on how many entries you can retrieve at once, so pagination is used to sift through the database 100 entries at a time. Since there is no database attribute telling you how many entries there are, you have to page through in order until reaching the end before you can choose a random entry. This is fine for small databases, but I have a cron job that regularly chooses a random entry from a Notion database with thousands of entries. Additionally, if I make too many calls simultaneously, I risk being rate limited pretty often. Is there a better way to go about choosing a random value from a database that uses pagination? Thanks!
I don't think there is a better way to do it right now (sadly). If your entries don't change often, think about caching the pages; that saves you a lot of execution time in your cron job. For the rate limit, if you use Node.js, you can build a rate-limited queue (3 requests/second) pretty easily with something like bull.
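A minimal sketch of that suggestion, assuming the bull package and a local Redis instance; the job handler is a placeholder standing in for whatever Notion call you actually make:

```typescript
import Queue from "bull";

// Queue whose jobs are processed at most 3 per second.
// Assumes a local Redis instance; the job payload and handler are placeholders.
const notionQueue = new Queue("notion-requests", "redis://127.0.0.1:6379", {
  limiter: { max: 3, duration: 1000 }, // 3 jobs per 1000 ms
});

notionQueue.process(async (job) => {
  // Replace with your actual Notion API call, e.g. one databases.query page fetch.
  return fetchNotionPage(job.data.startCursor);
});

// Enqueue one job per page cursor; bull spaces them out to respect the limit.
async function enqueuePage(startCursor?: string) {
  await notionQueue.add({ startCursor });
}

// Placeholder for the real Notion request.
declare function fetchNotionPage(startCursor?: string): Promise<unknown>;
```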
I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real-time analysis (i.e., apply a predictive model) and update them as needed while parsing the messages.
Considering the high throughput and the need for real-time processing, which DB solution would be the best choice?
I was thinking about an in-memory key-value DB that periodically persists data to disk (like Redis).
Thanks in advance for the help.
The best choice depends on many factors we don't know, like what tech stack your team is already using, how open they are to learning new things, how much operational burden you are willing to take on, etc.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
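As a rough illustration (not the exact code from the AWS docs), an atomic counter boils down to a single UpdateItem with an ADD expression; bucketing the key by id and hour gives you per-time-frame counts. The table and attribute names below are assumptions.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Atomically increment the counter for (id, hour bucket).
// "IdCounters" and the attribute names are made up for this sketch.
async function incrementCounter(id: string, now = new Date()): Promise<number> {
  const hourBucket = now.toISOString().slice(0, 13); // e.g. "2024-01-01T15"
  const result = await ddb.send(
    new UpdateCommand({
      TableName: "IdCounters",
      Key: { pk: id, sk: hourBucket },
      UpdateExpression: "ADD #count :inc",
      ExpressionAttributeNames: { "#count": "count" },
      ExpressionAttributeValues: { ":inc": 1 },
      ReturnValues: "UPDATED_NEW",
    })
  );
  return result.Attributes?.count as number;
}
```

Reading the last hour (or the last 24 hourly buckets) is then a cheap Query on the id, summed client-side.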
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.
I am considering using S3 for back-end persistent storage.
However, depending on architecture choices, I predict some buckets may need to store billions of small objects.
How will GET Object and PUT Object perform under these conditions, assuming I am using UUIDs as keys? Can I expect O(1), O(logN), or O(n) performance?
Will I need to rethink my architecture and subdivide bigger buckets in some way to maintain performance? I need object lookups (GET) in particular to be as fast as possible.
Though it is probably meant for S3 customers with truly outrageous request volume, Amazon does have some tips for getting the most out of S3, based on the internal architecture of S3:
- Performing PUTs against a particular bucket in alphanumerically increasing order by key name can reduce the total response time of each individual call. Performing GETs in any sorted order can have a similar effect. The smaller the objects, the more significantly this will likely impact overall throughput.
- When executing many requests from a single client, use multi-threading to enable concurrent request execution.
- Consider prefacing keys with a hash utilizing a small set of characters. Decimal hashes work nicely.
- Consider utilizing multiple buckets that start with different alphanumeric characters. This will ensure a degree of partitioning from the start. The higher your volume of concurrent PUT and GET requests, the more impact this will likely have.
- If you'll be making GET requests against Amazon S3 from within Amazon EC2 instances, you can minimize network latency on these calls by performing the PUT for these objects from within Amazon EC2 as well.
Source: http://aws.amazon.com/articles/1904
Here's a great article from AWS that goes into depth about the hash prefix strategy and explains when it is and isn't necessary:
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
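As a concrete illustration of the hash-prefix idea, a sketch like the following derives a short, evenly distributed prefix from each object's UUID; the prefix length and separator are arbitrary choices.

```typescript
import { createHash, randomUUID } from "node:crypto";

// Derive a short, evenly distributed prefix from the object's UUID so that
// keys spread across S3's internal partitions instead of clustering.
function makeObjectKey(id: string = randomUUID()): string {
  const prefix = createHash("md5").update(id).digest("hex").slice(0, 2); // e.g. "3f"
  return `${prefix}/${id}`; // e.g. "3f/0b7e7dc2-...-9c1d"
}
```

Note that random UUID keys already give you this kind of spread, which is why the bottom line below holds; an explicit prefix mainly matters when keys would otherwise be sequential (timestamps, auto-incrementing IDs).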
Bottom line: Your plan to put billions of objects in a single bucket using UUIDs for the keys should be fine. If you have outrageous request volume, you might split it into multiple buckets with different leading characters for even better partitioning.
If you are going to be spending a lot of money with AWS, consider getting in touch with Amazon and talking through the approach with them.
S3 behaves like an external disk: the time for a GET or PUT depends on the size of the object, regardless of how many other objects are in the bucket.
From the FAQ page:
Since Amazon S3 is highly scalable and you only pay for what you use, developers can start small and grow their application as they wish, with no compromise on performance or reliability. It is designed to be highly flexible: Store any type and amount of data that you want; read the same piece of data a million times or only for emergency disaster recovery; build a simple FTP application, or a sophisticated web application such as the Amazon.com retail web site. Amazon S3 frees developers to focus on innovation, not figuring out how to store their data.
If you want to know the time complexity of a key lookup in S3, it is difficult to say, since I don't know how it is implemented internally. But it is at least better than O(n): O(1) if it uses hashing, O(log n) if it uses trees. Either way it is very scalable.
Bottom line: don't worry about it.
I have a database that stores many posts, like a blog. The problem is that there are many users, and these users create many posts at the same time. So when a user requests the home page, I have to query the database for those posts. In short, I have to re-fetch the posts I've already shown in order to show the new ones. How can I avoid this performance problem?
Before going down a caching path, ensure that you:
- Review the logic (are you undertaking unnecessary steps, can you populate some memory variables with slow-changing data and so reduce DB calls, etc.)
- Keep DB operations as lean as possible (minimum rows and columns returned)
- Normalise data to at least third normal form, then selectively denormalise it, with appropriate data-handling routines for the denormalised data
- After normalisation, tune the DB instance (server performance, disk I/O, memory, etc.)
- Tune the SQL statements
Then ...
Consider caching. Even though it is not possible to cache all data, if you can get a significant percentage into cache for a reasonable period of time (and those values vary according to site) you remove load from the DB server and so other queries can be served faster.
Do you do any type of pagination? If not, database pagination would be the best bet: start with the first 10 posts, and after that only return the full list if the user requests it from a link or some other input.
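A minimal sketch of that, assuming node-postgres and a hypothetical posts table:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* env vars

// Fetch one page of posts, newest first. Table and column names are assumptions.
async function getPosts(page = 0, pageSize = 10) {
  const { rows } = await pool.query(
    "SELECT id, title, created_at FROM posts ORDER BY created_at DESC LIMIT $1 OFFSET $2",
    [pageSize, page * pageSize]
  );
  return rows;
}
```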
The standard solution is to use something like memcached to offload common reads to a caching layer. So you might decide to only refresh the home page once every 5 minutes rather than hitting the database repeatedly with the same exact query.
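The pattern is easy to sketch without committing to memcached specifically. Here is a hedged in-memory version of the "refresh at most every 5 minutes" idea; in production you would point it at memcached or Redis so all app servers share the cache, and getLatestPosts is a placeholder for your actual home-page query.

```typescript
// Tiny cache-aside helper: return the cached value if it is younger than
// ttlMs, otherwise run the loader (e.g. the home-page query) and cache it.
const cache = new Map<string, { value: unknown; expiresAt: number }>();

async function cached<T>(key: string, ttlMs: number, loader: () => Promise<T>): Promise<T> {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value as T;

  const value = await loader();
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Usage: serve the home page from cache, refreshing at most every 5 minutes.
// const posts = await cached("home-posts", 5 * 60 * 1000, () => getLatestPosts());
```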
If there is data that is requested very often, you should cache it. Try using an in-memory cache such as memcached to store things that are likely to be re-requested within a short time. You need free RAM for this: try using spare memory on your frontend machine(s); serving HTTP requests and applying templates is usually less RAM-intensive. By the way, you can cache not only raw DB records but also ready-made pieces of pages, with formatting and all.
If your load cannot be reasonably handled by one machine, try sharding your database. Put the data of some of your users (posts, comments, etc.) on one machine, the data of other users on another machine, and so on. This will make some joins impossible at the database level, because the data lives on different machines, but the joins you do often will be parallelized.
Also, take a look at document-oriented 'NoSQL' data stores like [MongoDB](http://www.mongodb.org/). It allows you, for example, to store a post and all its comments in a single record and fetch them in one operation, without any joins. But regular joins are next to impossible. Probably a mix of SQL and NoSQL storage is the most efficient (and harder to manage).
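To show what the document-store suggestion looks like in practice, here is a small sketch using the official MongoDB Node.js driver; the database, collection, and field names are assumptions.

```typescript
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://127.0.0.1:27017");

// Store a post with its comments embedded, then read both back in one round trip.
async function demo() {
  const posts = client.db("blog").collection("posts");

  await posts.insertOne({
    title: "Hello world",
    body: "First post",
    createdAt: new Date(),
    comments: [{ author: "alice", text: "Nice post!", createdAt: new Date() }],
  });

  // One query returns the post and all of its comments -- no join needed.
  const post = await posts.findOne({ title: "Hello world" });
  console.log(post?.comments.length);
}

demo().finally(() => client.close());
```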
I am developing an application which involves multiple user interactivity in real time. It basically involves lots of AJAX POST/GET requests from each user to the server - which in turn translates to database reads and writes. The real time result returned from the server is used to update the client side front end.
I know optimisation is quite a tricky, specialised area, but what advice would you give me to get maximum speed of operation here - speed is of paramount importance, but currently some of these POST requests take 20-30 seconds to return.
One way I have thought about optimising it is to batch the POST requests and send them to the server in groups of 8-10, instead of firing individual requests. I am not currently using caching on the database side, and I don't really have much knowledge of what it is or whether it would be beneficial in this case.
Also, do the AJAX POST and GET requests incur the same overhead in terms of speed?
Rather than continuously hitting the database, cache frequently used data items (with an expiry time based upon how infrequently the data changes).
Can you reduce your communication with the server by caching some data client side?
The purpose of GET is as its name implies - to GET information. It is intended to be used when you are reading information to display on the page. Browsers will cache the result from a GET request and if the same GET request is made again then they will display the cached result rather than rerunning the entire request. This is not a flaw in the browser processing but is deliberately designed to work that way so as to make GET calls more efficient when the calls are used for their intended purpose. A GET call is retrieving data to display in the page and data is not expected to be changed on the server by such a call and so re-requesting the same data should be expected to obtain the same result.
The POST method is intended to be used where you are updating information on the server. Such a call is expected to make changes to the data stored on the server and the results returned from two identical POST calls may very well be completely different from one another since the initial values before the second POST call will be different from the initial values before the first call because the first call will have updated at least some of those values. A POST call will therefore always obtain the response from the server rather than keeping a cached copy of the prior response.
The optimization tricks you'd use are generally the same tricks you'd use for a normal website, just with a faster turn around time. Some things you can look into doing are:
- Prefetch GET requests that have high odds of being loaded by the user.
- Use a caching layer in between, as Mitch Wheat suggests. Depending on your technology platform, you can look into memcache; it's quite common and there are libraries for just about everything.
- Look at denormalizing data that is going to be queried at a very high frequency. Assuming that reads are more common than writes, you should get a decent performance boost if you move the workload to the write portion of the data access (as opposed to adding database load via joins).
- Use delayed inserts to give priority to writes and let the database server optimize the batching.
- Make sure you have intelligent indexes on the table and figure out what benefit they're providing. If you're rebuilding the indexes very frequently due to a high write:read ratio, you may want to scale back the queries.
- Look at retrieving data in more general queries and filtering the data when it makes it to the business layer of the application. MySQL (for instance) uses a very specific query cache that matches against a specific query. It might make sense to pull all results for a given set, even if you're only going to be displaying x%.
- For writes, look at running asynchronous queries to the database if it's possible within your system; data synchronization doesn't have to be instantaneous, it just needs to appear that way most of the time (see the sketch after this list).
- Cache common pages on disk/memory in a fully formatted state so that the server doesn't have to do much processing of them.
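As a sketch of the asynchronous-writes point above: an in-process buffer that flushes batched inserts in the background. The flush function and batch size are placeholders, and a real system would normally put a durable queue (or at least a retry log) here instead.

```typescript
// Buffer writes in memory and flush them to the database in batches, so the
// request path never waits on an INSERT. A crash loses the unflushed buffer,
// which is why production systems usually use a durable queue here instead.
type PendingWrite = { table: string; row: Record<string, unknown> };

const buffer: PendingWrite[] = [];

export function queueWrite(write: PendingWrite): void {
  buffer.push(write); // returns immediately; the caller does not wait on the DB
}

setInterval(async () => {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, 500); // flush up to 500 rows at a time
  try {
    await flushToDatabase(batch); // e.g. one multi-row INSERT per table
  } catch {
    buffer.unshift(...batch); // put the batch back and retry on the next tick
  }
}, 1000);

// Placeholder for the real batched INSERT.
declare function flushToDatabase(batch: PendingWrite[]): Promise<void>;
```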
All in all, there are lots of things you can do (and they generally come down to general development practices on a more bite sized scale).
The common tuning tricks would be:
- use more indexing
- use less indexing
- use more or less caching on filesystem, database, application, or content
- provide more bandwidth or more cpu power or more memory on any of your components
- minimize the overhead in any kind of communication
Of course an alternative would be to:
0 Develop a set of tests, preferably automated, that can determine whether your application works correctly.
1 Measure the 'speed' of your application.
2 Determine how fast it has to become.
3 Identify the source of the performance problems: typical problems are network throughput, file I/O, latency, locking issues, insufficient memory, CPU.
4 Fix the problem.
5 Make sure it is actually faster.
6 Make sure it is still working correctly (hence the tests above).
7 Return to 1.
Have you tried profiling your app?
Not sure what framework you're using (if any), but frankly from your questions I doubt you have the technical skill yet to just eyeball this and figure out where things are slowing down.
Bluntly put, you should not be messing around with complicated ways to try to solve your problem, because you don't really understand what the problem is. You're more likely to make it worse than better by doing so.
What I would recommend you do is time every step. Most likely you'll find that either:
- you've got one or two really long-running bits, or
- you're running a shitton of queries because of an N+1 error or the like.
When you find what's going wrong, fix it. If you don't know how, post again. ;-)