I am in the architecture stage of an academic project involving billions of records. The project should be very lightweight in terms of computing power and highly scalable.
The information structure is very simple: I need to store a list of items, each one with different features. The features are integers, decimals, dates, strings etc. When the data is imported, the type of each feature is known. Also, features can be used to reference other items.
I need to be able to get and sort a list of items by its features (more than one), possibly using queries such as >, <, =, and regexes, length, left, right, mid for strings, comparing feature values with each other and against arbitrary user input.
Reporting in the sense of sums, averages, and grouping is also necessary, but the demands for that are more relaxed: there is no need for full cube capabilities, but more is better.
I am very new to the whole NoSQL world. What would you recommend?
If you check out the tutorials for MongoDB, they have, in my opinion, the best introduction to the Map/Reduce system that is used to query and aggregate.
I do wonder, though, why you have concluded in advance that NoSQL is the route to go. Although different items may have different schemas, is there a fixed number of entities and attributes? And why have you (if you have) ruled out SQL, which, after all, has decades of accumulated features for storing and querying data?
If you are going to use aggregates then you could use map reduce to populate aggregate tables and then serve that data.
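As a rough illustration of that idea, here is a minimal in-memory sketch in Python of a map step, a reduce step, and the resulting aggregate table. The sample records, the field names (category, price) and the function names are all made up for the example; a real system would run the same two steps on a cluster and write the result to a table that is then served to readers.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample records: each item carries a few typed features.
items = [
    {"id": 1, "category": "book", "price": 12.5},
    {"id": 2, "category": "book", "price": 7.5},
    {"id": 3, "category": "film", "price": 20.0},
]

def map_item(item):
    """Map step: emit a (key, partial aggregate) pair for one record."""
    return item["category"], {"count": 1, "total": item["price"]}

def reduce_partials(a, b):
    """Reduce step: merge two partial aggregates that share a key."""
    return {"count": a["count"] + b["count"], "total": a["total"] + b["total"]}

def build_aggregate_table(records):
    """Group the mapped pairs by key, reduce each group, derive the average."""
    groups = defaultdict(list)
    for key, partial in map(map_item, records):
        groups[key].append(partial)
    table = {}
    for key, partials in groups.items():
        merged = reduce(reduce_partials, partials)
        merged["average"] = merged["total"] / merged["count"]
        table[key] = merged
    return table

# The precomputed table that reporting queries would read from.
aggregate_table = build_aggregate_table(items)
```

Because the reduce step only merges partial sums and counts, it can be applied in any order, which is what makes this kind of aggregation easy to distribute.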
Writing map reduce for every query may be cumbersome, so you can also have a look at Apache Pig and Hive. This is especially helpful for the kind of ad hoc queries you are talking about.
I am currently working with Java Spring and Postgres.
I have a query on a table, many filters can be applied to the query and each filter needs many joins.
This query is very slow, both because of the number of joins that must be performed and because there are many elements in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres to store the item list flat?
I know there is a json datatype. Could I use this to hold the information needed for the search and query it with jsonPath? Is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, such as a graph database. At this point the only problem I have is with this specific table; the rest of the problem is simple queries that adapt very well to relational databases.
Is there any rule of thumb, based on ratios and the number of items, for which kind of database to choose?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
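A minimal sketch of the trigger-maintained search table described in the question could look like the following. It uses Python's built-in sqlite3 module only to stay self-contained; the question is about Postgres, where the trigger syntax differs. The table and column names (item, item_tag, info_search) are assumptions made for the example, and only inserts are handled; updates and deletes would need their own triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized source tables (hypothetical schema).
CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE item_tag (item_id INTEGER REFERENCES item(id), tag TEXT NOT NULL);

-- Denormalized copy: one row per item, tags flattened into one column.
CREATE TABLE info_search (item_id INTEGER PRIMARY KEY, name TEXT, tags TEXT);

-- Create the flat row whenever an item is inserted.
CREATE TRIGGER item_after_insert AFTER INSERT ON item BEGIN
    INSERT INTO info_search (item_id, name, tags) VALUES (NEW.id, NEW.name, '');
END;

-- Rebuild the flattened tag list whenever a tag is attached to an item.
CREATE TRIGGER item_tag_after_insert AFTER INSERT ON item_tag BEGIN
    UPDATE info_search
    SET tags = (SELECT group_concat(t.tag, ' ') FROM item_tag AS t
                WHERE t.item_id = NEW.item_id)
    WHERE item_id = NEW.item_id;
END;
""")

conn.execute("INSERT INTO item (id, name) VALUES (1, 'chair')")
conn.execute("INSERT INTO item_tag VALUES (1, 'red')")
conn.execute("INSERT INTO item_tag VALUES (1, 'wood')")
conn.execute("INSERT INTO item (id, name) VALUES (2, 'table')")
conn.execute("INSERT INTO item_tag VALUES (2, 'metal')")

def search(tag):
    """Search the flat table only: no join is needed at query time."""
    rows = conn.execute(
        "SELECT name FROM info_search WHERE tags LIKE ? ORDER BY item_id",
        (f"%{tag}%",),
    )
    return [row[0] for row in rows]
```

The extra work on every insert is the write-time cost mentioned above; the read path touches a single table.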
There are third party tools that are specifically designed for this use-case, including search tools (like ElasticSearch, Solr, etc) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
I know this is a 'soft' question, which is usually frowned upon on SO, but I have been using BigQuery to do data analysis on (obviously) flat data, which contains both structs and repeated data. Let's just use a very basic example, a row might look like this:
ID
Title (str)
ReleaseYear (int)
Genres (str[])
Credits (struct[])
And an example piece of data might look like:
{
    "ID": "T-1997",
    "Title": "Titanic",
    "ReleaseYear": 1997,
    "Genres": ["Drama", "Romance"],
    "Credits": {
        "Actors": ["Leonardo DiCaprio", "Kate Winslet"],
        "Directors": ["James Cameron"]
    }
}
My question is basically what type of operations or queries can be done in a native document store, such as MongoDB or CouchBase, that couldn't be done in a relational DB that supports arbitrarily-nested data. In other words, my assumption (and I hope I'm wrong or misguided) is that as long as a DB supports structs, it can do everything that a document-store can do. If not, what are some places where it is either: (1) something that can be done in MongoDB (or any other document-store) that cannot be done in BigQuery (or any other database that supports structs)? and (2) something that can be done much more easily in MongoDB than in a relational DB?
what type of operations or queries can be done in a native document
store, such as MongoDB or CouchBase, that couldn't be done in a
relational DB that supports arbitrarily-nested data.
Even though BigQuery does support nested data, it allows less nesting than MongoDB.
In BigQuery, your schema cannot contain more than 15 levels of nested STRUCTs. MongoDB supports up to 100 levels of nesting for BSON documents.
In other words, my assumption (and I hope I'm wrong or misguided) is
that as long as a DB supports structs, it can do everything that a
document-store can do.
Not exactly - nested columns are columns within columns. But sharding in an RDBMS is a complex endeavor compared to a NoSQL database like Mongo. Technically you can do it, but it wasn't designed for that purpose. It's like using a wrench as a hammer: sure you can, but its purpose was something different. You should use the right tool for the right purpose.
If not, what are some places where it is either: (1) something that
can be done in MongoDB (or any other document-store) that cannot be
done in BigQuery (or any other database that supports structs)? and
(2) something that can be done much more easily in MongoDB than in a
relational DB?
The crux of the matter is that an RDBMS may tack on features to "technically" allow you to do some things that you can do in a NoSQL database, but that doesn't mean they will work just as well. For example, because of the features that make an RDBMS an RDBMS (ACID compliance, transactions etc.), there will always be an additional performance hit compared to a NoSQL database. If an RDBMS removes these features, then it is no longer an RDBMS!
This answer illustrates how MongoDB achieves better performance because it doesn't need to support RDBMS features:
https://softwareengineering.stackexchange.com/questions/54373/when-would-someone-use-mongodb-or-similar-over-a-relational-dbms
MongoDB has a lower latency per query & spends less CPU time per query because it is doing a lot less work (e.g. no joins, transactions).
As a result, it can handle a higher load in terms of queries per second and is thus often used if you have a massive # of users.
MongoDB is easier to shard (use in a cluster) because it doesn't have to worry about transactions and consistency.
MongoDB has a faster write speed because it does not have to worry about transactions or rollbacks (and thus does not have to worry about locking).
MongoDB does not have a schema in case you have a special use case that can take advantage of that.
Another feature is sharding. Sharding is easier with MongoDB because it doesn't need to support many of the features which make an RDBMS an RDBMS, such as being ACID compliant. In contrast, sharding is complex for an RDBMS because an RDBMS must remain ACID compliant.
Take a look at the following two images:
The speed boat would outperform the "amphibious car" in the water 10/10 times. The amphibious car technically can navigate in water, but it wasn't designed to, hence it is much slower and unsuited to that purpose.
Likewise, look at the difference in aerodynamics between the speed boat and this sweet automobile. Even if you tacked wheels onto the boat, it's not going to perform as well as this car on land. (As an analogy, you could say that NoSQL databases don't do joins, so you have to implement them yourself. But will that perform better than an RDBMS for join-heavy operations?)
The point I'm making with the analogies is that each kind of database was initially designed for a specific goal, and over time features have been added to try to make it solve problems it was not designed for (hence it doesn't do them as well as something specifically designed for that purpose).
Hence in your question, even if BigQuery or some RDBMS can do something, it doesn't mean that you should use them for the job. The same applies for NoSQL databases. You should use the best tool for the job.
Disclaimer: I don't have experience in MongoDB or CouchBase. My answer is based on BigQuery's capability on STRUCT.
Performance
BigQuery's STRUCT is optimized for querying. For example, if you run select a.nested_b.nested_c.nested_d from table_t, the query only scans data for the leaf STRUCT field nested_d, so it is fast and cheap.
Usability
If your data is write-once or append-only, then a STRUCT column is comparable with a document store AFAIK.
But if you want to update only a certain nested field later, a nested STRUCT makes that pretty difficult, because there is no way to update a single item in a REPEATED field: you have to load the whole array, scan and change it, and repack it to update the column. You will be writing something like:
UPDATE table
SET Credits.Actors = (SELECT ARRAY_AGG(...) FROM UNNEST(Credits.Actors) WHERE ...)
WHERE ...
It may become a bigger problem when there is an array of structs of arrays (and even more nested levels). Based on my understanding of document stores, updating a single nested field of a document should be easier than this. Basically, this is the price you have to pay to get the performance benefit mentioned earlier.
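To make the contrast concrete, here is a small sketch in plain Python (not BigQuery or MongoDB code) of the two update styles, using the Titanic row from the question. The function names are invented for the illustration: the first mimics the UNNEST / ARRAY_AGG pattern by reading every element and writing the whole array back, and the second mimics a targeted change to one element, which is what the answer assumes a document store makes easier.

```python
import copy

# The example row from the question, trimmed to the fields used here.
row = {
    "ID": "T-1997",
    "Title": "Titanic",
    "Credits": {
        "Actors": ["Leonardo DiCaprio", "Kate Winslet"],
        "Directors": ["James Cameron"],
    },
}

def update_by_rebuilding(record, old_name, new_name):
    """Scan the whole array, build a new one, and replace the column value."""
    updated = copy.deepcopy(record)
    actors = record["Credits"]["Actors"]
    updated["Credits"]["Actors"] = [
        new_name if actor == old_name else actor for actor in actors
    ]
    return updated

def update_single_element(record, old_name, new_name):
    """Locate one element and overwrite it, leaving the rest untouched."""
    updated = copy.deepcopy(record)
    actors = updated["Credits"]["Actors"]
    actors[actors.index(old_name)] = new_name
    return updated

rebuilt = update_by_rebuilding(row, "Kate Winslet", "K. Winslet")
targeted = update_single_element(row, "Kate Winslet", "K. Winslet")
```

Both produce the same result; the difference is how much data has to be read and rewritten, which grows with every extra level of nesting in the first style.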
I have a legacy in-house human resources web app that I'd like to rebuild using more modern technologies. Doctrine 2 is looking good. But I've not been able to find articles or documentation on how best to organise the Entities for a large-ish database (120 tables). Can you help?
My main problem is the Person table (of course! it's an HR system!). It currently has 70 columns. I want to refactor that to extract several subsets into one-to-one sub tables, which will leave me with about 30 columns. There are about 50 other supporting one-to-many tables called person_address, person_medical, person_status, person_travel, person_education, person_profession etc. More will be added later.
If I put all the doctrine associations (http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/working-with-associations.html) in the Person entity class along with the set/get/add/remove methods for each, along with the original 30 columns and their methods, and some supporting utility functions then the Person entity is going to be 1000+ lines long and a nightmare to test.
FWIW I plan to create a PersonRepository to handle the common bulk queries, a PersonProfessionRepository for the bulk queries / reports on that sub table etc, and Person*Services which will contain some of the more complex business logic where needed. So organising the rest of the app logic is fine: this is a question about how to correctly organise lots of sub-table Entities with Doctrine that all have relationships / associations back to one primary table. How do I avoid bloating out the Person entity class?
Identifying types of objects
It sounds like you have a nicely normalized database and I suggest you keep it that way. Removing columns from the people table to create separate tables for one-to-one relations isn't going to help performance or maintainability.
The fact that you recognize several groups of properties in the Person entity might indicate you have found cases for a Value Object. Even some of the one-to-many tables (like person_address) sound more like Value Objects than Entities.
Starting with Doctrine 2.5 (which is not yet stable at the time of this writing), embedding single Value Objects will be supported. Unfortunately we will have to wait for a future version for support of collections of Value Objects.
Putting that aside, you can mimic embedding Value Objects; Ross Tuck has blogged about this.
Lasagna Code
Your plan of implementing an entity, repository, service (and maybe controller?) for Person, PersonProfession, etc sounds like a road to Lasagna Code.
Without extensive knowledge about your domain, I'd say you want to have an aggregate Person, of which the Person entity is the aggregate root. That aggregate needs a single repository. (But maybe I'm off here and being simplistic, as I said, I don't know your domain.)
Creating a service for Person (and other entities / value objects) indicates data-minded thinking. For services it's better to think of behavior. Think of what kind of tasks you want to perform, and group coherent sets of tasks into services. I suspect that for an HR system you'll end up with many services that revolve around your Person aggregate.
Is Doctrine 2 suitable?
I would say: yes. Doctrine itself has no problems with large numbers of tables and columns. But performance highly depends on how you use it.
OLTP vs OLAP
For OLTP systems an ORM can be very helpful. OLTP involves many short transactions, each writing a single aggregate (or a short list of aggregates) to the database.
For OLAP systems an ORM is not suited. OLAP involves many complex analytical queries, usually resulting in large object-graphs. For these kinds of operations, native SQL is much more convenient.
Even in case of OLAP systems Doctrine 2 can be of help:
You can use DQL queries (instead of native SQL) to use the power of your mapping metadata. Then use scalar or array hydration to fetch the data.
Doctrine also supports arbitrary joins, which means you can join entities that are not associated with each other according to the mapping metadata.
And you can make use of the NativeQuery object with which you can map the results to whatever you want.
I think an HR system is a perfect example of where you have both OLTP and OLAP. OLTP when it comes to adding a new Person to the system for example. OLAP when it comes to various reports and analytics.
So there's nothing wrong with using an ORM for transactional operations, while using plain SQL for analytical operations.
Choose wisely
I think the key is to carefully choose when to use what, on a case by case basis.
Hydrating entities is great for transactional operations. Make use of lazy loading associations which can prevent fetching data you're not going to use. But also choose to eager load certain associations (using DQL) where it makes sense.
Use scalar or array hydration when working with large data sets. Data sets usually grow where you're doing analytical operations, where you don't really need full blown entities anyway.
@Quicker makes a valid point by saying you can create specialized View objects. You can fetch only the data you need in specific cases and manually mold that data into objects. This goes together with his point not to bloat the user interface with options that a user with a certain role doesn't need.
A technique you might want to look into is Command Query Responsibility Segregation (CQRS).
I understood that you have a fully normalized table persons and now you are asking for how to denormalize that best.
As long as you do not hit any technical constraints (such as a 64 KB maximum), I find 70 columns definitely not overloaded for a persons table in an HR system. Do yourself a favour and do not segment that information, for the following reasons:
selects potentially become more complex
each extracted table needs one or more extra indexes, which increases your overall memory utilization. This sounds like a minor issue as disk is cheap. However, keep in mind that, via caching, the ratio of RAM to disk space utilization determines your performance to a huge extent
changes become more complex, as extra relations demand extra care
as any edit/update/read view can be restricted to deal only with slices of your physical data from the tables, no "cosmetics" pressure arises from an end user (or even admin) perspective
In summary, subsetting the table causes lots of issues and effort but adds little if any value.
By the way, databases are optimized for data storage. Millions of rows and some dozens of columns are no-brainers on that end.
So I'm making a database for a personal project just to get more than my feet wet with PostgreSQL and certain languages and applications that can use a PostgreSQL database.
I've come to the realization that using an array isn't necessarily even compliant (Arrays are not atomic, right?) with 1NF. So my question is: Is there a lack of efficiency or data safety this way? Should I learn early to not use arrays?
Short answer to the title: No
A bit longer answer:
You should learn to use arrays when appropriate. Arrays are not bad design in themselves; they are as atomic as a character varying field (an array of characters, no?) and they exist to make our lives easier and our databases faster and lighter. There are issues concerning portability (most database systems don't support arrays, or do so in a different way than Postgres).
Example:
You have a blog with posts and tags, and each post may have 0 or more tags. The first thing that comes to mind is to make a different table with two columns postid and tagid and assign the tags in that table.
If we need to search through posts with tagid, then the extra table is necessary (with appropriate indexes of course).
But if we only want the tag information to be shown as the post's extra info, then we can easily add an integer array column to the posts table and extract the information from there. This can still be done with the extra table, but using an array reduces the size of the database (no extra tables or extra rows are needed), simplifies our select queries by joining one less table, and seems easier for a human to understand (the last part is in the eye of the beholder, but I think I speak for a majority here). If our tags are preloaded, then not even one join is necessary.
The example may be poor but it's the first that came to mind.
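The two layouts from the example can be sketched side by side as follows. This is only an approximation: it runs on Python's built-in sqlite3, which has no array type, so a JSON text column stands in for the Postgres integer array, and all table names (tag, post_a, post_tag, post_b) are made up for the illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Variant A: classic junction table with postid and tagid.
CREATE TABLE post_a (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE post_tag (postid INTEGER REFERENCES post_a(id),
                       tagid INTEGER REFERENCES tag(id));

-- Variant B: tag ids stored on the post row itself.
-- JSON text is a stand-in here for a Postgres integer[] column.
CREATE TABLE post_b (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                     tag_ids TEXT NOT NULL);

INSERT INTO tag VALUES (1, 'postgres'), (2, 'arrays');
INSERT INTO post_a VALUES (10, 'Hello');
INSERT INTO post_tag VALUES (10, 1), (10, 2);
INSERT INTO post_b VALUES (10, 'Hello', '[1, 2]');
""")

def tags_via_junction(post_id):
    """Variant A: one join is needed to list a post's tags."""
    rows = conn.execute(
        "SELECT t.name FROM post_tag AS pt JOIN tag AS t ON t.id = pt.tagid "
        "WHERE pt.postid = ? ORDER BY t.id",
        (post_id,),
    )
    return [row[0] for row in rows]

# Tags preloaded once, as in the "not even one join" case.
TAG_NAMES = dict(conn.execute("SELECT id, name FROM tag"))

def tags_via_array(post_id):
    """Variant B: read the ids from the post row, no join at query time."""
    (raw,) = conn.execute(
        "SELECT tag_ids FROM post_b WHERE id = ?", (post_id,)
    ).fetchone()
    return [TAG_NAMES[tag_id] for tag_id in json.loads(raw)]
```

Variant A stays the better choice when posts have to be found by tag, since the junction table can be indexed on tagid.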
Conclusion:
Arrays are not necessary. They can be harmful if you use them wrong. You can live without them and have a great, fast and optimized database. When you are considering portability (e.g. rewriting your system to work with other databases) then you must not use arrays.
If you are sure you'll stick with Postgres, then you can safely use arrays where you find appropriate. They exist for a reason and are neither bad design nor non-compliant. When you use them in the right places, they can help a little with simplicity of database structures and your code, as well as space and speed optimization. That is all.
Whether an array is atomic depends on what you're interested in. If you generally want the whole array then it's atomic. If you are more interested in the individual elements then it is being used as structure. A text field is basically a list of characters. However, we're usually interested in the whole string.
Now - from a practical viewpoint, many frameworks and ORMs don't automatically unpack PostgreSQL's array types. Also, if you want to port the database to e.g. MySQL then you'll
Likewise foreign-key constraints can't be added to an array (EDIT: this is still true as of 2021).
Short answer: Yes, it is bad design. Using arrays will guarantee that your design is not 1NF, because to be 1NF there must be no repeating values. Proper design is unequivocal: make another table for the array's values and join when you need them all.
Arrays may be the right tool for the job in certain limited circumstances, but I would still try hard to avoid them. They're a feature of last resort.
The biggest problem with arrays is that they're a crutch. You know them already and you want to use them because they're familiar to you. But they do not work quite like you expect, and they will only allow you to postpone a true understanding of SQL and relational databases. You're much better off waiting until you're forced to use them than learning them and looking for opportunities to rely on them.
I believe arrays are a useful and appropriate design in cases where you're working with array-like data and want to use the power of SQL for efficient queries and analysis. I've begun using PostgreSQL arrays regularly for data science purposes, as well as in PostGIS for edge cases, as examples.
In addition to the well-explained challenges mentioned above, I'm finding the biggest problem in getting third-party client apps to be able to handle the array fields in ways I'd expect. In Tableau and QGIS, for example, arrays are treated as strings, so array operations are unavailable.
Arrays are a first class data type in the SQL standard, and generally allow for a simpler schema and more efficient queries. Arrays, in general, are a great data type. If your implementation is self-contained, and doesn't need to rely on third-party tools without an API or some other middleware that can deal with incompatibilities, then use the array field.
If, however, you interface with third-party software that directly queries the DB, and arrays are used to produce queries, then I'd avoid them in favor of simpler lookup tables and other traditional relational approaches.
Consider Microsoft SQL Server 2008
I need to create a table which can be created two different ways as follows.
Structure Columnwise
StudentId number, Name Varchar, Age number, Subject varchar
eg.(1,'Dharmesh',23,'Science')
(2,'David',21,'Maths')
Structure Rowwise
AttributeName varchar,AttributeValue varchar
eg.('StudentId','1'),('Name','Dharmesh'),('Age','23'),('Subject','Science')
('StudentId','2'),('Name','David'),('Age','21'),('Subject','Maths')
In the first case there will be fewer records; in the 2nd approach there will be 4 times as many, but with 2 fewer columns.
So which approach is better in terms of performance, disk storage and data retrieval?
Your second approach is commonly known as an EAV design - Entity-Attribute-Value.
IMHO, 1st approach all the way. That allows you to type your columns properly allowing for most efficient storage of data and greatly helps with ease and efficiency of queries.
In my experience, the EAV approach usually causes a world of pain. Here's one example of a previous question about this, with good links to best practices. If you do a search, you'll find more - well worth a sift through.
A common reason why people head down the EAV route is to model a flexible schema, which is relatively difficult to do efficiently in an RDBMS. Other approaches include storing data in XML fields. This is one area where NoSQL (non-relational) databases can come in very handy due to their schemaless nature (e.g. MongoDB).
The first one will be better for performance, disk storage and data retrieval.
Having attribute names as varchars will make it impossible to change names, datatypes or apply any kind of validation
It will be impossible to index desired search actions
Saving integers as varchars will use more space
Ordering, adding or summing integers will be a headache, and will have bad performance
The programming language using this database will have no way to work with strongly typed data
There are many more reasons for using the first approach.
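A short sketch makes the difference in query effort visible. It uses Python's built-in sqlite3 rather than SQL Server, purely to stay runnable, and it adds an EntityId column to the row-wise table: that column is an assumption, since the two-column layout in the question has no way to tell which attribute rows belong to the same student.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column-wise: every attribute has its own typed column.
CREATE TABLE student (StudentId INTEGER PRIMARY KEY, Name TEXT,
                      Age INTEGER, Subject TEXT);
INSERT INTO student VALUES (1, 'Dharmesh', 23, 'Science'),
                           (2, 'David', 21, 'Maths');

-- Row-wise (EAV): every value is a string.
-- EntityId is a hypothetical addition that groups rows per student.
CREATE TABLE student_eav (EntityId INTEGER, AttributeName TEXT,
                          AttributeValue TEXT);
INSERT INTO student_eav VALUES
    (1, 'StudentId', '1'), (1, 'Name', 'Dharmesh'),
    (1, 'Age', '23'), (1, 'Subject', 'Science'),
    (2, 'StudentId', '2'), (2, 'Name', 'David'),
    (2, 'Age', '21'), (2, 'Subject', 'Maths');
""")

# Column-wise: filter on one column, aggregate another, no casting.
columnwise_avg = conn.execute(
    "SELECT AVG(Age) FROM student WHERE Subject = 'Science'"
).fetchone()[0]

# EAV: a self-join per attribute involved, plus a cast back to a number.
eav_avg = conn.execute("""
    SELECT AVG(CAST(age.AttributeValue AS INTEGER))
    FROM student_eav AS age
    JOIN student_eav AS subj ON subj.EntityId = age.EntityId
    WHERE age.AttributeName = 'Age'
      AND subj.AttributeName = 'Subject'
      AND subj.AttributeValue = 'Science'
""").fetchone()[0]
```

Both queries return the same average, but the EAV version needs one more self-join for every additional attribute that appears in a filter or an aggregate.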