I am developing an application where the user can request data based on their current location. For example, I have a big pool of local stores all over the world, and when the user runs a query, only the stores near their location should be searched for matches. Same as in Tinder: there are a lot of people in the database, but one user can only see people around their location. How should the database be structured? I guess just querying the whole database pool, possibly millions of entries, to find people that match your geolocation is bad practice. What architecture do similar applications use? Thanks for tips.
I would say querying the whole database pool and letting the database take care of optimization is actually good practice. However, you need to be careful in defining your table so that these queries will be efficient.
In particular, you'll want a two-dimensional spatial index on the location column. PostGIS is great for this: http://revenant.ca/www/postgis/workshop/indexing.html
Then, when you query, use ST_DWithin.
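For example, here is a minimal sketch of that setup via psycopg2 (the stores table, its geography(Point, 4326) geom column, and the 10 km radius are all hypothetical; it assumes a PostGIS-enabled Postgres server):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    # The spatial index is what keeps this from scanning every store.
    cur.execute("CREATE INDEX IF NOT EXISTS stores_geom_idx "
                "ON stores USING GIST (geom);")

    # Stores within 10 km of the user's position; with geography
    # columns, ST_DWithin takes its distance in meters.
    cur.execute("""
        SELECT id, name
        FROM stores
        WHERE ST_DWithin(
            geom,
            ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
            10000
        );
    """, {"lon": -73.98, "lat": 40.75})
    nearby = cur.fetchall()

Note that the query stays the same no matter how many rows the table holds; the index does the filtering.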
If you are interested in learning what this actually looks like when laid out in memory or on disk, start with R-tree indexes: https://en.wikipedia.org/wiki/R-tree
In many sub-system designs for messaging applications (Twitter, Facebook, etc.) I notice duplication in where user message history is stored. On one hand they use a tokenizing indexer like Elasticsearch or Solr, which is good for search. On the other hand they still use some sort of DB for history. Why duplicate? Why can't the same instance of ES/Solr/EarlyBird be used for history? It is in fact capable of it.
The usual problem is the following: you want to search, and ideally you also want to be able to index the data in a different manner (e.g. wipe the index and try a new awesome analyzer that you forgot to include initially). Separating the data source and the index from each other makes the system less coupled, and you don't have to be afraid of losing data in Elasticsearch/Solr.
I am usually strongly against calling Elasticsearch/Solr a database, since in fact it's not: for example, neither of them supports transactions, which makes your life harder if you want to update multiple documents following standard relational logic.
Last but not least, one of the hardest operations in Elasticsearch/Solr is retrieving stored values, since neither is much optimized for it, especially if you want to return 10k documents at once. In this case a separate datasource also helps, since you can return only the matched document ids from Elasticsearch/Solr and later retrieve the needed content from the datasource and return it to the user.
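As a rough illustration of that two-step pattern (the messages index, its field names, and the SQLite table are hypothetical; assumes the 8.x Elasticsearch Python client):

    from elasticsearch import Elasticsearch
    import sqlite3

    es = Elasticsearch("http://localhost:9200")
    db = sqlite3.connect("messages.db")

    def search_messages(query_text, size=10):
        # Ask Elasticsearch only for matching ids, not stored documents.
        resp = es.search(
            index="messages",
            query={"match": {"body": query_text}},
            source=False,   # skip _source entirely; ids are enough
            size=size,
        )
        ids = [hit["_id"] for hit in resp["hits"]["hits"]]
        if not ids:
            return []
        # Fetch the authoritative content from the primary datastore.
        placeholders = ",".join("?" * len(ids))
        return db.execute(
            "SELECT id, body FROM messages WHERE id IN (%s)" % placeholders,
            ids,
        ).fetchall()

This way Elasticsearch only does what it is good at (matching and ranking), and the heavy stored values never leave the datasource until you actually need them.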
The summary is simple: Elasticsearch/Solr should be thought of as search engines, not data storage.
True that ES is NOT a database per se and will never be. But no one says you cannot use it as such, and many people actually do. It really depends on your specific use case(s), and in the end it's all a question of the trade-offs you are ready to make to support your specific needs. As with pretty much any technology in general, there is no one-size-fits-all approach and with ES (and the like) it's no different.
A primary source of truth is not necessarily a relational DBMS, and such systems are not necessarily "duplicating" the data in the sense that you meant; the source of truth can be anything that has a copy of your data and allows you to rebuild your ES indexes in case something goes wrong. I've seen many, many different "sources of truth". It could simply be:
your raw flat files containing your historical logs or business data
Kafka topics that you can replay anytime easily
a snapshot that you take from ES on a regular basis
a relational DB
you name it...
The point is that if something goes wrong for any reason (and that happens), you want to be able to recreate your ES indexes, be it from a real DB, from backups or from raw data. You should see that as a safety net. Even if all you have is a MySQL DB, you usually have a backup of it, so you're already "duplicating" the data in some way.
One thing that you need to think about, though, when architecting your system, is that you might not need to have the entirety of your data in ES. Since ES is a search and analytics engine, you should only store in it what is necessary to support your search and analytics needs, and you should be able to recreate that information anytime. In the end, ES is just a subsystem of your whole architecture, just like your DB, your messaging queue or your web server.
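For instance, a sketch of indexing only the searchable subset of a record (the field names and the messages index are made up; assumes the Elasticsearch Python client):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    full_record = {
        "id": 42,
        "author": "alice",
        "body": "full message text ...",
        "attachments": ["..."],   # large blobs stay in the primary store
        "audit_trail": ["..."],
    }

    # Only what search needs goes into ES; everything else can be
    # re-fetched from the source of truth by id.
    es.index(
        index="messages",
        id=full_record["id"],
        document={"author": full_record["author"],
                  "body": full_record["body"]},
    )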
Also worth reading: Using Elasticsearch as primary source for part of my DB
Currently using a SQL Server database to keep records of the storage and retrieval of high-resolution media files located in several SANs. However, with several SQL Server databases involved, I'm curious to learn whether a NoSQL design approach would be better. Thank you.
NoSQL solutions are great at non-structured or variably-structured information like the layout of a web page, while SQL solutions are better at structured information. The general rule is that if you can fit it into a table structure, use that. Since you're just storing a pointer to a file location I don't think you'll get much by switching data engines, unless you're also storing other information along with the location like a bunch of hashtags that describe the video (for easier searching or "others like this one"). And even then it's probably easier to just have an indexed table of hashtags. If you want to share more information about your architecture the geniuses here can probably give you some pointers.
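To make the hashtag idea concrete, here is a small sketch using SQLite for brevity (the table and column names are hypothetical; the same layout carries over to SQL Server unchanged):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE media_files (
            id       INTEGER PRIMARY KEY,
            san_path TEXT NOT NULL       -- pointer to the file on the SAN
        );
        CREATE TABLE media_tags (
            media_id INTEGER NOT NULL REFERENCES media_files(id),
            tag      TEXT NOT NULL
        );
        -- The index is what keeps tag lookups fast as the table grows.
        CREATE INDEX idx_media_tags_tag ON media_tags(tag);
    """)

    db.execute("INSERT INTO media_files VALUES (1, '/san3/vol7/clip_0001.mxf')")
    db.executemany("INSERT INTO media_tags VALUES (?, ?)",
                   [(1, "interview"), (1, "4k")])

    # Find all media tagged 'interview' via the index.
    rows = db.execute("""
        SELECT f.id, f.san_path
        FROM media_files f JOIN media_tags t ON t.media_id = f.id
        WHERE t.tag = ?
    """, ("interview",)).fetchall()

The point is just that an ordinary indexed join handles this fine; there is no need for a different data engine.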
I have a very basic question: if we are talking about millions of records to be manipulated, why do we need to store millions of records in memory? Whatever records we need, we can fetch from the database, do the manipulation in memory using some data structures, and write back to the database.
I'll give one example.
At work we deal with language learning, so enormous data sets of words and phrases (hundreds of thousands of words, though it could easily reach the millions as time goes on).
Good use of data structures is crucial to a successful application. Like @Juan Lopes said, keeping everything in databases is slow and impractical. What happens if I need to manipulate multiple values or run an algorithm on a data set? I need to retrieve that data from my database first in order to do this.
An argument can be made that the algorithms could be added to the database to solve this problem. More often than not, however, you will not have ownership of the database, or you will be consuming data from a server whose code you have no permission to modify.
Also, depending on which data structures you use, you can save large amounts of time! Take the map/dictionary: by doing one O(n) pass over the data to create a map, I am then able to access any of the data in O(1) time if I know the key I'm looking for. Running a database query will rarely produce such fast results. Moreover, in modern applications the database is often on a server far away from your program, so the time to retrieve the data is compounded with that of an HTTP request, which could very well take 10x the time of the query itself.
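A tiny sketch of that trade-off (the word list is just a stand-in):

    # One O(n) pass to build the map, then O(1) lookups with no
    # round trip to a database for each one.
    records = [("hola", "hello"), ("adios", "goodbye"), ("gato", "cat")]

    by_word = {word: translation for word, translation in records}

    for word in ("gato", "hola"):
        print(word, "->", by_word[word])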
In the end, there is a good reason that data structures are a fundamental part of any good programmer's toolbox, and why they are taught so vigorously in universities.
Imagine an enormous collection of locations of interest, and given any point on the map we would like to list all such locations within, say, 5 km of it.
This seems like a reasonably simple idea, so I expect there is already a well-thought-out solution; I just don't know how to Google for it.
How would the location data be stored in a database to make searching fast? I'm assuming that a SQL database (which is based around relational tabular data) will not work, since I don't see an obvious way to use SQL's tabular nature to filter out most locations further than 5 km away and keep each query fast.
Maybe databases like Postgres have some kind of spatial extension that allows what I am asking to be done fast. If so, how is such a thing implemented?
And if one were implementing a database from scratch for spatial queries like mine, how would they be implemented?
The spatial extension for Postgres is called PostGIS. It has special data types to represent maps and locations, and it also has special indexes to speed up queries on spatial data (GIN and GiST indexes).
Here is a list of PostGIS Frequently Asked Questions; it has an answer to your question.
http://postgis.net/docs/manual-2.1/PostGIS_FAQ.html
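If it helps, here is roughly what that looks like end to end from Python (the places table and its columns are hypothetical; assumes psycopg2 and a PostGIS-enabled server):

    import psycopg2

    conn = psycopg2.connect("dbname=geo")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS places (
            id   serial PRIMARY KEY,
            name text,
            geom geography(Point, 4326)  -- lon/lat on the WGS84 spheroid
        );
        -- GiST index: this lets the radius query skip most rows.
        CREATE INDEX IF NOT EXISTS places_geom_idx
            ON places USING GIST (geom);
    """)

    # All places within 5 km (5000 m) of a given lon/lat point.
    cur.execute("""
        SELECT name
        FROM places
        WHERE ST_DWithin(
            geom,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            5000
        );
    """, (13.405, 52.52))  # example point: Berlin
    print(cur.fetchall())

As for implementing it from scratch: the GiST index on spatial data is essentially an R-tree, which recursively groups nearby objects into bounding boxes so a search can discard whole subtrees at once.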
Ok so I'm working on an ASP MVC web application that queries a fairly large amount of data from a SQL Server 2008 database. When the application starts, the user is presented with a search mask that includes several fields. Using the search mask, the user can search for data in the database and also filter the search by specifying parameters in the mask. In order to speed up searching, I'm storing the result set returned by the database query in the server session. During subsequent searches I can then search the data I have in the session, thus avoiding unnecessary trips to the DB.
Since the amount of data that can be returned by a database query can be quite large, the scalability of the web application is severely limited. If there are, let's say, 100 users using the application at the same time, the server will keep search results in its session for each separate user, which will eventually eat up quite a bit of memory.

My question now is: what's the best alternative to storing the data in session? The query in the DB can take quite a while at times, so, at the moment, I would like to avoid having to run the query on subsequent searches if the data I already retrieved earlier contains the data that is now being searched for. One option I've considered is creating a temp table in the DB in my search query that stores the retrieved data and can be used for subsequent searches. The problem with that is, I don't have all too much experience with SQL Server, so I don't know whether SQL Server would create temp tables for each user if there are multiple users performing the search. Are there any other possibilities? Could the idea with the temp table in SQL Server work, or would it only lead to memory issues on the SQL Server? Thanks for the help! :)
Edit: Thanks a lot for the helpful and insightful answers, guys! However, I failed to mention a detail that's kind of important. When I query the database, the format of the result set can vary from user to user. This is because the user can decide which columns the result table should have by selecting columns from a predefined multiselect box in the search mask. If user A wants ColA, ColB and ColC to be displayed in his result table, he selects those values from the multiselect box in the search mask. User B, however, might select ColA and ColC only. Therefore, caching the results in a single table for all users might be a bit tricky, since the table columns are not necessarily going to be the same for all users. So I'm thinking I'll almost have to use an alternative that saves each user's cached table separately. The HTML5 Local Storage option mentioned below sounds interesting. Since this is an intranet application, it might be fair to assume (or require) that users have an up-to-date browser that supports HTML5. What do you guys think? Again, thanks for the help :)
If you want to cache query results, they'll have to be either on the web server or client in some form or another. All options will require memory, and since search results are user-specific, that memory usage will increase as a linear function of the number of current users.
My suggestions are to limit the number of rows returned from SQL (with TOP) and/or to look into optimizing your query on the SQL end. If your DB query takes a noticeable amount of time there's a good chance it can be optimized in SQL.
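For example, capping the result set could look like this (sketched in Python with pyodbc for brevity; the connection string, table, and columns are hypothetical):

    import pyodbc

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=dbhost;DATABASE=app;Trusted_Connection=yes;")
    cur = conn.cursor()

    # SQL Server accepts a parameterized TOP when it is parenthesized,
    # so the page size can be tuned without rebuilding the query.
    cur.execute("""
        SELECT TOP (?) OrderId, CustomerName, CreatedAt
        FROM Orders
        WHERE CustomerName LIKE ?
        ORDER BY CreatedAt DESC
    """, 200, "%smith%")
    rows = cur.fetchall()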
Have you already thought about NoSQL databases?
The idea of a NoSQL database is to store information in a form that is optimized for reading or writing and is accessed with "easy queries" (for example a look-up on search terms). They scale horizontally easily and would allow you to search through a whole lot of data very fast (think of Google's BigTable, for example!).
If HTML5 is a possibility, you could use Local Storage.
You could try turning on SQL session state.
http://support.microsoft.com/kb/317604
Plus: it's effortless, and you will find out whether this fixes the memory pressure and has acceptable performance (i.e. reading and writing the cache to the DB). If the perf is okay, then you may want to implement the sort of thing that SQL session state does yourself, because there is a ...
Downside: if you aren't aggressive about removing it from session, it will be serialized and deserialized on each request, even on unrelated pages.