I know this is a topic that's been addressed ad nauseam but I also know there are people who enjoy opining about databases so I figured I'd just go ahead and ask the question again.
I'm building out a web application that on a very basic level displays a list of objects that meet a user-defined search criteria. The primary function of the application will be to provide an interface by which a user can perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data.
Of course there will be ancillary information too: user accounts, lookup tables etc.
My background is entirely in relational database development, primarily SQL Server with a little bit of MySQL. However, I'm intrigued by the possible applicability of an object-relational approach or even a full-on document database. Without the experience of working in those paradigms I'm not sure what I might be getting myself into.
Here are some further considerations that may affect the decision:
The schema will likely evolve considerably over time as more properties and search options are added, creating the typical versioning/deployment challenges. This is the primary reason why I would consider a document database.
The application itself will likely be written in Node/Express with an Angular or React front-end using Typescript and so the code will be interacting with data in json format. In other words, regardless of what comes back from the db server, we want json on the code level. (Another case for a doc database.)
There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha. This would seem to me to be a strong case against a document db.
A potential use case would involve a user adjusting a slider control (let's say it controls high and low price parameters or a distance range). The selected parameters would then be packaged as a json object and sent to a search controller, which would then pass these parameters to the db server on change and expect a list of objects in return. In other words, the user would generally not be pushing a button to refine search criteria. The search update would happen each time they change a parameter.
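Roughly, I picture the client side doing something like this (a minimal sketch; the /api/search endpoint, field names, and the renderResults hook are all made up):

interface SearchFilters {
  priceMin: number;
  priceMax: number;
  distanceKm: number;
}

// Debounce slider changes so the server isn't hit on every pixel of movement,
// then POST the current filters as a JSON object to the search controller.
let debounceTimer: ReturnType<typeof setTimeout> | undefined;

function onFilterChange(filters: SearchFilters): void {
  if (debounceTimer !== undefined) clearTimeout(debounceTimer);
  debounceTimer = setTimeout(async () => {
    const response = await fetch("/api/search", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(filters),
    });
    const results: unknown[] = await response.json();
    renderResults(results); // assumed rendering hook in the UI layer
  }, 250); // wait 250 ms after the last change before querying
}

declare function renderResults(results: unknown[]): void;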
I don't know the extent to which this is a thing, but it would also be great if there were some way to leverage technology that could cache search results and then search within those results when the search is narrowed, performing a second search only on the smaller subset returned by the first search rather than on the entire universe of available objects.
I guess while I'm at it I should ask about ORMs. That's also something I'm not very experienced with (I've used Entity Framework a bit), but I'm wondering if I should expand my horizons.
Thanks and I look forward to your opinions!
I think that your requirement of "There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha" makes a strong case for using a relational database to store the data.
Leveraging an ORM that can support data in JSON format would seem to be ideal for your use case. Schema evolution for a system in production is certainly a challenge (though not an insurmountable one), but it would be nice to use an ORM product that can at least easily support schema evolution during the development stage, when things are more likely to change and evolve rapidly.
Given the kind of queries you would typically be issuing (e.g., adjusting a slider control), an ORM that supports prepared statements that can have range criteria would be more efficient.
Also, given your need to "perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data", an ORM product that can easily support one-to-one, one-to-many, and many-to-many relationships and path-expressions in search criteria should simplify your development process.
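To illustrate (not prescribe), here is a minimal sketch of that kind of parameterized range query in the Node/TypeScript stack you describe, using node-postgres; the table and column names are invented, and an ORM would generate something similar under the hood:

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from environment variables

// Parameterized range criteria: the database can prepare and reuse the plan,
// and the rows come back as plain objects that serialize straight to JSON.
async function searchByPrice(low: number, high: number) {
  const { rows } = await pool.query(
    `SELECT i.id, i.name, i.price, c.name AS category
       FROM items i
       JOIN categories c ON c.id = i.category_id   -- a one-to-many relationship
      WHERE i.price BETWEEN $1 AND $2
      ORDER BY i.price`,
    [low, high]
  );
  return rows;
}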
Related
I'm looking to store around 50-100 million documents in a database and be able to do queries at a very fast speed. A document would look something like this:
{
name: 'example',
value: '300,201,512'
}
The value column is always unique, name is not.
Now I want to be able to only check if there exists a document with a specific value using a query. What database would be the best choice and which design would be best to approach the fastest speed for a query like that?
NoSQL databases try to offer certain functionality that more traditional relational database management systems do not. Whether it is for holding simple key-value pairs for shorter lengths of time for caching purposes, or keeping unstructured collections of data that could not easily be dealt with using relational databases and the Structured Query Language (SQL) – they are here to help.
In order to better understand the roles and underlying technology of each database management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management systems, simply because they can be considered the most basic, backbone implementation of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary. There is no structure and no relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42), which can later be retrieved the same way by supplying the key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic information after performing, for example, a CPU- and memory-intensive computation. They are extremely performant, efficient and usually easily scalable.
Note: In computing, a dictionary usually refers to a specific sort of data structure: a collection of entries in which each individual key is matched to a value.
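To make that concrete, here is a minimal key / value round trip in TypeScript with the ioredis client (a local Redis server and the example key are assumptions):

import Redis from "ioredis";

const redis = new Redis(); // defaults to 127.0.0.1:6379

async function demo(): Promise<void> {
  // Store a value under a key...
  await redis.set("the_answer_to_life", "42");
  // ...and retrieve it later by supplying the same key.
  const value = await redis.get("the_answer_to_life"); // "42"
  console.log(value);
  await redis.quit();
}

demo();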
Column Based
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones.
Despite their complicated-to-understand image on the internet, these databases work very simply: they create collections of one or more key / value pairs that make up a record.
Unlike the traditionally defined schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information, and each record's columns can be different.
Basically, column-based NoSQL databases are two-dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it, and these management systems allow very large amounts of unstructured data to be kept and used (e.g. a record with tons of information).
These databases are commonly used when simple key / value pairs are not enough and storing very large numbers of records with very large amounts of information is a must. DBMSs implementing column-based, schema-less models can scale extremely well.
Document Based
Document based NoSQL database management systems can be considered the latest craze, and they have taken a lot of people by storm. These DBMSs work in a similar fashion to column-based ones; however, they allow much deeper nesting and more complex structures to be achieved (e.g. a document, within a document, within a document).
Documents overcome the constraint of one or two levels of key / value nesting in columnar databases. Basically, any complex and arbitrary structure can form a document, which can be stored using these management systems.
Despite their powerful nature and the ability to query records by individual keys, document based management systems have their own issues and downsides compared to the others. For example, retrieving a single value of a record means fetching the whole record, and the same goes for updates, both of which hurt performance.
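As a rough illustration (made-up collection and field names), here is how deep nesting and key-based querying look with the official MongoDB Node.js driver, along with the retrieval caveat just mentioned:

import { MongoClient } from "mongodb";

async function demo(): Promise<void> {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const orders = client.db("shop").collection("orders");

  // A document, within a document, within a document.
  await orders.insertOne({
    name: "example",
    customer: { name: "Ada", address: { city: "London" } },
    items: [{ sku: "A1", qty: 2 }],
  });

  // Querying by a nested key is easy, but by default the whole document
  // comes back; projections can limit that, at the cost of extra care.
  const order = await orders.findOne({ "customer.address.city": "London" });
  console.log(order);

  await client.close();
}

demo();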
Graph Based
Finally, a very interesting flavour of NoSQL database management system is the graph based one.
Graph based DBMSs represent the data in a completely different way from the previous three models. They use graph structures, with nodes connected to each other by edges that represent relationships.
As in mathematics, certain operations are much simpler to perform using this type of model, thanks to the way it links and groups related pieces of information (e.g. connected people).
These databases are commonly used by applications that need to model connections between records explicitly. For example, when you register with a social network of any sort, your friends' connections to you and their friends' friends' relations to you are much easier to work with using graph based database management systems.
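As a sketch of that friend-of-friend case, here is roughly what it could look like in Cypher through the neo4j-driver package (the connection details, labels and relationship name are placeholders):

import neo4j from "neo4j-driver";

const driver = neo4j.driver(
  "bolt://localhost:7687",
  neo4j.auth.basic("neo4j", "password") // placeholder credentials
);

// Find friends-of-friends: two hops over FRIEND relationships.
async function friendsOfFriends(name: string): Promise<string[]> {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (me:Person {name: $name})-[:FRIEND]-()-[:FRIEND]-(fof:Person)
       WHERE fof.name <> $name
       RETURN DISTINCT fof.name AS name`,
      { name }
    );
    return result.records.map((record) => record.get("name") as string);
  } finally {
    await session.close();
  }
}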
Fastest document based db
1) MongoDB
2) DynamoDB
Here is the difference, for your reference.
I would give preference to DynamoDB.
Currently we are working on an AWS data lake and getting really fast performance:
store the data in S3 and get it back via Athena.
If you want to import the data into some database, then try MS SQL Server 2008 R2, because it is very user friendly and allows you to do your work more accurately and precisely. If you want to do that without any cost, then MySQL would be a better option (a good MySQL editor is SQLyog). I hope this is helpful.
Short Answer:
I think 100 million documents with the structure and conditions you mention is not BIG ENOUGH to require NoSQL. You can handle them with PostgreSQL, MySQL, etc.
Note that for a long time Wikipedia used MySQL (though not anymore); see Reference.
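To make the relational suggestion concrete, here is a hedged sketch with node-postgres: a hypothetical documents table with a unique constraint on value, so the existence check is an index lookup rather than a scan over 100 million rows:

import { Pool } from "pg";

const pool = new Pool();

// One-time setup: the UNIQUE constraint creates the index that makes
// "does this value exist?" cheap.
async function setup(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS documents (
      id    bigserial PRIMARY KEY,
      name  text NOT NULL,
      value text NOT NULL UNIQUE
    )`);
}

async function valueExists(value: string): Promise<boolean> {
  const { rows } = await pool.query(
    "SELECT EXISTS (SELECT 1 FROM documents WHERE value = $1) AS found",
    [value]
  );
  return rows[0].found as boolean;
}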
I am working on a group project and we are having a discussion about whether to calculate data that we want from an existing database and store it in a new database to query later, or calculate the data from the existing database every time we need to use it. I was wondering what the pros and cons may be for either implementation. Is there any advice you could give?
Edit: Here is a more elaborate explanation. We have a large database that has a lot of information submitted to it daily. We are building a system to track certain points of data. For example, we are getting the count of how many times a user does something that is entered in the database. Using this example (our actual idea is a bit more complex), we are discussing two methods of getting the count of actions per user. The first method is to create a database that stores the users and their action counts, and query this database every time we need an action count. The second method would be to query the large database and count the actions per user every time we need them. I hope this explanation helps. Thoughts?
Edit 2: Two more things that may be useful to point out are: 1) I only have read access to the large database, and 2) my ultimate goal is to display this information on a web page for end users.
This is a generic question about optimization by caching. The following was my answer to essentially the same question. Even though that question provided a bunch of different details, none of them were specific enough to merit a non-generic answer either:
The more you want to calculate at query time, the more you want views,
calculated columns and stored or user routines. The more you want to
calculate at normalized base update time, the more you want cascades
and triggers. The more you want to calculate at some other (scheduled
or ad hoc) time, the more you use snapshots aka materialized views and
updated denormalized bases. You can combine these. Any time the
database is accessed it can be enabled by and restricted by stored
routines or other api.
Until you can show that they are inadequate, views and calculated
columns are the simplest.
The whole idea of a DBMS is to store a representation of your
application state as the database (which normalization reduces the
redundancy of) and then you query and let the DBMS implement and
optimize calculation of the answer. You haven't presented a reason for
not doing that in the most straightforward way possible.
[sic]
Always make sure an application is reading its own personal ("external") database that is a view of "the" ("conceptual") database so that when you change the implementation of the former (plus the rest of some combined interface) by the latter (plus the rest of some combined mechanisms) your applications do not have to change ("logical independence"). Here the applications are your users' and your trackers'.
Ultimately you must instrument and guesstimate. When it is worth it you start caching. Preferably as much as possible in terms of high-level notions like views and snapshots and as little as possible in non-DBMS code. One of the benefits of the relational model is that it is easy to describe a straightforward relational interface in terms of another straightforward relational interface. You protect your applications from change by offering an interface that hides secrets of implementation or which of a family of interfaces is the current one.
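As a purely illustrative example of "views first, snapshots later" for the action-count case (table and column names are made up, and it assumes you can create views somewhere you do have write access, e.g. your own reporting database):

import { Pool } from "pg";

const pool = new Pool();

// Start with a plain view over the base data; switch to a materialized view
// (a cached snapshot refreshed on a schedule) only if the view proves too slow.
async function createView(): Promise<void> {
  await pool.query(`
    CREATE OR REPLACE VIEW user_action_counts AS
    SELECT user_id, count(*) AS action_count
      FROM actions
     GROUP BY user_id`);
  // Cached variant, refreshed on demand or from a scheduled job:
  //   CREATE MATERIALIZED VIEW user_action_counts_mv AS
  //     SELECT user_id, count(*) AS action_count FROM actions GROUP BY user_id;
  //   REFRESH MATERIALIZED VIEW user_action_counts_mv;
}

async function actionCount(userId: number): Promise<number> {
  const { rows } = await pool.query(
    "SELECT action_count FROM user_action_counts WHERE user_id = $1",
    [userId]
  );
  return rows.length > 0 ? Number(rows[0].action_count) : 0;
}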
I want to make a database that can store any kind of object, with different features for each class of objects.
Given some of the questions I asked on different forums, the solution is http://en.wikipedia.org/wiki/Entity-attribute-value_model or http://en.wikipedia.org/wiki/Xml with some kind of validation before storage.
Can you please give me an alternative to the ones above or some advantages or examples that would help decide which of the two methods is the best one in my case?
Thanks
UPDATE 1 :
Is your db read or write intensive?
will be both -> auction engine
Will you ever conceivably move off SQL Server and onto another platform?
I won't move it, I will use a WCF Service to expose functionality to mobile devices.
How do you plan to surface your data to the application?
Entity Framework for the DAL and a WCF Service Layer for Business
Will people connect to your data through means other than those you control?
No
While #marc_s is correct in his cautions, there unarguably are situations where the relational model is just not flexible enough. For quite a number of years now, I've been working with a database that is straightforwardly relational for the largest part, but has a small EAV part. This is because users can invent new properties any time for observation purposes in trials.
Admittedly, it is awkward with respect to querying and reporting, to name a couple of things, but no other strategy would suffice here. We use stored procedures with T-SQL's PIVOT to offer flattened data structures for reporting, and grids with dynamic columns for display. Once the infrastructure is in place it's pretty comfortable altogether.
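For readers unfamiliar with the technique, here is a rough sketch of the kind of PIVOT we mean, with invented attribute names. The PIVOT itself is plain T-SQL; it is wrapped in the Node mssql client only to keep the examples in this thread in one language:

import sql from "mssql";

// Flatten an EAV table (EntityId, AttributeName, Value) into ordinary columns
// so reporting code never has to know about the EAV shape.
async function flattenedObservations() {
  const pool = await sql.connect(process.env.MSSQL_CONNECTION_STRING as string);
  const result = await pool.request().query(`
    SELECT EntityId, [Height], [Weight], [Colour]
      FROM (SELECT EntityId, AttributeName, Value FROM Observations) AS src
     PIVOT (MAX(Value) FOR AttributeName IN ([Height], [Weight], [Colour])) AS p`);
  return result.recordset;
}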
We never considered using XML data because it wasn't there yet and, apart from its common limitations, it has some drawbacks in our context:
The EAV data is queried heavily. A development team needs more than standard SQL knowledge because of the special syntax. Indexing is possible, but "there is a cost associated with maintaining the index during data modification" (as per MSDN).
The XML datatype is far less accessible than regular tables and fields when it comes to data processing and reporting.
Hardly ever do users fetch all attribute values of an entity, but the whole XML would have to be crunched anyway.
And, not unimportant: XML datatype is not (yet) supported by Entity Framework.
So, to conclude, I would go for a design that is relational as much as possible but EAV where necessary. Auction items could have a number of fixed fields plus EAV for the flexible data.
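Purely as an illustration of that hybrid shape (all names invented), an auction item could keep its well-known fields as real columns and push only the flexible, user-invented properties into an EAV table:

import sql from "mssql";

async function createAuctionTables(): Promise<void> {
  const pool = await sql.connect(process.env.MSSQL_CONNECTION_STRING as string);
  // Fixed, well-known fields stay as ordinary relational columns.
  await pool.request().query(`
    CREATE TABLE AuctionItem (
      ItemId     int IDENTITY PRIMARY KEY,
      Title      nvarchar(200) NOT NULL,
      StartPrice decimal(12,2) NOT NULL,
      EndsAt     datetime2 NOT NULL
    )`);
  // Only the flexible, per-item properties go into the EAV part.
  await pool.request().query(`
    CREATE TABLE AuctionItemAttribute (
      ItemId        int NOT NULL REFERENCES AuctionItem(ItemId),
      AttributeName nvarchar(100) NOT NULL,
      Value         nvarchar(400) NULL,
      PRIMARY KEY (ItemId, AttributeName)
    )`);
}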
I will use my answer from another question:
EAV:
Storage. If your values are used often across different products (e.g. clothes, where the attribute "size" and its values are repeated often), your attribute/value tables will be smaller. Meanwhile, if the values are mostly unique rather than repeated (e.g. values of the attribute "page count" for books), you will end up with a fairly large values table, where every value is linked to just one product.
Speed. This schema is not the weakest part of the project, because the data here changes rarely. And remember that you can always denormalize the database schema to prepare a DW-like solution. You can also use caching if the database part turns out to be slow.
Elasticity. This is the strongest part of the solution. You can easily add/remove attributes and values, and even move values from one attribute to another!
XML storage is more like NoSQL: you give up database functionality, so you have to carefully prepare your solution to:
Not lose data integrity.
Not rewrite all the database functionality in the application (that would be senseless).
I think there is way too much context missing for anyone to add any kind of valid comment to the discussion.
Is your db read or write intensive?
Will you ever conceivably move off SQL Server and onto another platform?
How do you plan to surface your data to the application?
Will people connect to your data through means other than those you control?
First, do not go either route unless the structure truly cannot be known in advance. Using EAV or XML because you don't want to actually define the requirements will result in an unmaintainable mess, and a badly performing mess at that. Usually at least 90+% (a conservative estimate based on my own experience) of the fields can be known in advance and should be in ordinary relational tables. Only use special techniques for structures that can't be known in advance. I can't stress this strongly enough. EAV tables look simple but are actually very hard to query, especially for complex reporting queries. Sure, it is easy to get data into them, but very, very difficult to get the data back out.
If you truly need to go the EAV route, consider using a NoSQL database for that part of the application and a relational database for the rest. NoSQL databases simply handle EAV better.
We have to redesign a legacy POI database from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as good design approach for the new database. However new design should allow:
any number of attributes/properties for any entity, i.e. the number of attributes for an entity is not fixed and may change on a regular basis.
allow content admins to add new properties to existing entities on the fly through admin interfaces, rather than making changes to the db schema all the time.
There are quite a few discussions about performance issues of EAV but if we don't go with a hybrid-EAV we end up:
having a lot of empty columns (we still have to go and add new columns even if 99% of the data does not have those properties)
spending more time maintaining the database, especially when attributes keep changing.
no way of allowing content admins to add new properties to existing entities
Anyway here's what we are thinking about the new design (basic ERD included):
Have separate tables for each entity containing some basic info that is exclusive to it, e.g. id, name, address, contact, created, etc.
Have two tables, attribute type and attribute, to store property information.
Link each entity to an attribute using a many-to-many relation.
Store addresses in different table and link to entities using foreign key.
We think this will allow us to be more flexible when adding, removing, or updating properties.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given stadium, we might have a query with 20+ joins to fetch all related attributes in a single row.
What are your thoughts on this design, and what would be your advice to improve it?
Thank you for reading.
I'm maintaining a 10 year old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. Postgresql has transactional DDL, which is one killer feature for automating schema changes.
Postgresql doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client side, like json or protobuf, if you don't need to touch anything specific on the database side.
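A rough sketch of that serialize-into-the-row idea with parallel arrays (hypothetical table, node-postgres client; hstore or client-side JSON would be the alternatives mentioned above):

import { Pool } from "pg";

const pool = new Pool();

// One wide row per entity instead of hundreds of skinny attribute rows.
async function setup(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS entity_attrs (
      entity_id   bigint PRIMARY KEY,
      attr_ids    int[]  NOT NULL,
      attr_values text[] NOT NULL
    )`);
}

// Read and write the whole attribute batch in a single statement.
async function saveAttributes(entityId: number, attrs: Map<number, string>): Promise<void> {
  await pool.query(
    `INSERT INTO entity_attrs (entity_id, attr_ids, attr_values)
     VALUES ($1, $2, $3)
     ON CONFLICT (entity_id) DO UPDATE
       SET attr_ids = EXCLUDED.attr_ids, attr_values = EXCLUDED.attr_values`,
    [entityId, [...attrs.keys()], [...attrs.values()]]
  );
}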
Don't go out of your way to put everything into one single entity table. If the entities don't share any attributes in a sensible way, use multiple instantiations of the specific EAV pattern you use. But do try to use the same pattern and share any accessor code between the different instantiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of understanding of the system, versioning tools, QA procedures, and staging as your code does, but it will have none of the tooling. In essence you will be doing programming in a very awkward, non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new release of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard and tedious work with limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seducing it is. But it's something of an easy way out. An easy way out of doing proper analysis and design.
I think "entity" is a bit over the top too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model. Would they also have adresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minium. Whenever you find yourself wanting to show a certain subset of the attributes, or test for existance of the value, or determining behaviour based on the value you should really have it modelled as a columns.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in Magento, a CMS for e-commerce. There is a lot of bad talk about EAV these days, but I challenge anyone to come up with a solution other than EAV for dealing with an unbounded set of product attributes.
Sure, you can go about enumerating all the columns you would need for every product in the world, but that would take you a lot of time and you would inevitably forget product attributes along the way.
So the bottom line is: use EAV for the unbounded stuff, but don't rely on EAV for all of the database's tables. A hybrid EAV and relational db, when done right, is a powerful tool that could not be accomplished by using only fixed columns.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.
I am building a system that allows front-end users to define their own business objects. Defining a business object involves creating data fields for that business object and then relating it to other business objects in the system - fairly straight forward stuff. My question is, what is the most efficient storage strategy?
The requirements are:
Must support business objects with potentially 100+ fields (of all common data types)
The system will eventually support hundreds of thousands of business object instances
Business objects sometimes display data and aggregates from their relationships with other business objects
Users must be able to search for business objects by their data fields (and fields from related business objects)
The two possible solutions I can envisage are:
Have a dynamic schema such that when a new business object type is created a new table is created for storing instances of that object. The object's fields become columns in the storage table.
Have a fixed schema where instance data fields are stored as rows in basically a big long table.
I can see pros and cons to both approaches:
the dynamic schema allows me to index search columns
the dynamic tables are potentially limited in width by the maximum number of columns (and row size) a table allows
dynamic schemas rule out / cause issues with replication
the static schema means less or even no dynamic sql generation
my guess is the static schema may perform like a dog when it comes to searching across 100,000+ objects
So what is the best solution? Is there another approach I haven't thought of?
Edit: The requirement I have been given is to build a generic system capable of supporting front-end user defined business objects. There will of course be restrictions on how these objects can be constructed and related, but the requirement itself is not up for negotiation.
My client is a service provider and requires a degree of flexibility in servicing their own clients, hence the need to create business objects.
I think your problem matches very well to a graph database like Neo4j, as it's built for the requested kind of flexibility from the beginning. It stores data as nodes and relationships/edges, and both nodes and relationships can hold arbitrary properties (in a key/value fashion). One important difference to a RDBMS is that a graph database won't need to lookup the relationships in a big long table (like in your fixed schema solution), so there should be a significant performance gain there. You can find out about language bindings for Neo4j in the wiki and read what others say about it in this stackoverflow thread. Disclaimer: I'm part of the Neo4j team.
Without much understanding of your situation...
Instead of writing a general purpose one-size-fits-all business objects system (which is the holy grail for Oracle, Microsoft, SAS, etc.), why not do it the typical way, where the requirements are gathered, and a developer designs and implements the users' business objects in an effective manner?
If your users are typical, they will create a monster, which will end up running slow, and they will hate it. Most users will view the data as an Excel sheet and not understand relationships like parent/child. As a result there will be some crazy objects built, and reports that are impossible to produce. You'll be forced to create scripts to manually convert many old objects to better, properly defined ones, etc...
Your requirements sound a little bit like an associative database with a front end to compose and edit entities.
I agree with KM above: unless you have a very compelling reason not to, you would be better off using a traditional approach. There are a lot of development tools and practices that allow you to build a robust and scalable system. Otherwise you will have to implement much of this yourself.
I don't know the best way to do this, because it sounds like something that has already been implemented by others. If I were asked to implement this feature, I would recommend buying a wheel instead of reinventing it.
Perhaps there are reasons you have to invent your own? If so, then you should add those reasons to the requirements you listed.
If you absolutely must be this generic, I still recommend buying a system that has been architected for this requirement. Not just the storage requirements, which are the least of the problems your customer will have; but also: how do you keep the customer from screwing up totally when given this much freedom. Some of the commercial systems already meet this challenge without going out of business because of customers messing up.
If you still need to do this on your own, then I suggest that your requirements (or perhaps those of another vendor?) must include: allow the customer to get it right, and help keep the customer from getting it wrong. You'll need some sort of UI to allow the customer to define these business objects, and the UI should validate the model that the customer builds.
I recommend a UI that works at a conceptual level. As an example, see NORMA, a Visual Studio add-in for Object-Role Modeling (the "other" ORM). Consider it as an example only, if your end users cannot afford a Visual Studio Standard license. Otherwise, you'll find that it is extensible, already produces many types of artifacts (from SQL in various dialects to code), and will validate the model to see that it makes sense. End users would also be able to enter sample data that they believe should be valid, and the system will validate the data against the model.
If your customers are producing sensible (if dynamic) business objects, then the question of storage will be much simpler.
Have you thought about an XML based solution? The requirements suggested to me "Build a system that allows users to dynamically generate an XML Schema and work with XML documents based on that schema." I don't know enough about storing and querying XML documents to comment on your original question.
Another possibility might be to leverage NHibernate's ability to generate database schemas. If you can dynamically generate business objects, then you can generate XML mappings or Fluent mappings and use that to generate a normalized database schema.
Every user that I have ever talked to has always wanted "everything" in their project. Part of the job of gathering requirements is to guide the user, not just write down everything they say.
Your only hope is to build several template objects that they can add properties to. You can code your application to handle each of these object types, while still allowing the user to modify each slightly as necessary.
You need to inform the user upfront of the major flaws this type of design has. This will help you in the end, when it runs slow, or if they screw up and need help fixing something. I'd put this in writing.
How many possible objects would they really need? Perhaps you could set these up using your system first. I have developed several very customizable systems over the years, and when the user is sitting at an empty screen, they are like a deer in the headlights.
In any event, good luck.