I am wondering how JSON is stored in a NoSQL DB like MongoDB and others. If I were to store JSON data in a SQL DB, I could choose to store it in a text (varchar) column, but then I would lose the benefits of a NoSQL DB. Does a NoSQL DB save JSON in a file? How does an update of a field happen? Is the complete file read into memory, updated, and then written back to the file?
The broad answer -- especially because you say "MongoDB and others" -- is "in many ways, each probably unique to the database engine ingesting the JSON and to the target field type." Even most newer relational DBs have special performance and type handling for JSON data, the PostgreSQL jsonb column type being a notable standout. There is no easy, consistently applied answer here.
Many NoSQL databases save JSON as a string (VARCHAR or STRING), while others use a binary encoding; MongoDB, for example, stores documents as BSON. Different NoSQL databases use different strategies for saving data on disk. For example, Cassandra keeps its own data files for each table. For every update, Cassandra (C*) just appends the new data instead of rewriting the existing row. Background processes such as compaction then merge the data in those files: when multiple rows exist for a single primary key, compaction keeps a single row, and the row's timestamp decides which data wins.
In-place update operations tend to be time and resource intensive. Many NoSQL databases avoid them: an update operation can be turned internally into an insert operation. That means that, for a single primary key, multiple rows can exist at the same time. The compaction process takes care of merging those rows into a single row.
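To make the append-then-compact idea concrete, here is a minimal sketch in Python. It is not Cassandra's actual storage engine: the `AppendOnlyTable` class and its method names are made up for illustration, and it resolves conflicts per whole row, whereas a real engine may merge at a finer granularity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str        # primary key of the row
    timestamp: int  # write time; the newest write wins
    value: dict     # the JSON-like payload


class AppendOnlyTable:
    """Toy table (hypothetical): updates are appended, never applied in place."""

    def __init__(self):
        self.log = []

    def upsert(self, key, timestamp, value):
        # An "update" is just another appended row for the same key.
        self.log.append(Write(key, timestamp, value))

    def read(self, key):
        # Several versions may coexist, so a read picks the newest one.
        versions = [w for w in self.log if w.key == key]
        if not versions:
            return None
        return max(versions, key=lambda w: w.timestamp).value

    def compact(self):
        # Keep only the newest write per primary key.
        latest = {}
        for w in self.log:
            current = latest.get(w.key)
            if current is None or w.timestamp >= current.timestamp:
                latest[w.key] = w
        self.log = list(latest.values())


table = AppendOnlyTable()
table.upsert("user:1", 1, {"name": "Ann"})
table.upsert("user:1", 2, {"name": "Anna"})  # update, stored as a second row
table.upsert("user:2", 1, {"name": "Bob"})
rows_before = len(table.log)
table.compact()
rows_after = len(table.log)
```

Before compaction the log holds both versions of `user:1`; afterwards only the newer one remains, and reads return the same answer either way.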
I have time series data in a relational database (Postgres). Data is imported into the database every 5 minutes, but the input gets overwritten during the day, meaning that at the end of the day there is only 1 record for that day for a specific id (id and date form a composite PK).
The current process is like this: data comes in and is evaluated the same way, 1:1 (each table receives the data exactly as it is in the source, so there is a lot of redundancy).
3 problems:
Currently, getting data out of the database (reading) is fast (good performance).
The frontend queries this database and shows the data, and the query results come back very fast. If I normalize, queries become slower, but writing and updating become easier.
How can I optimize this database?
Missing data (ignore this problem).
If we are able to store more records daily (the history of one ID at different points in time every day), then we can show a comparison of two points in time within a day. Does the database support a huge amount of data every day?
DWH
There is just one source; all data comes from it. Can we have a DWH for it, or is there no need for one since there is only one source?
Edit:
How can I optimise this database?
Currently there is only one schema in the database. Data comes in and is evaluated the same way, 1:1. Writing is hard since we have redundancy.
My solution:
I want to create 3 schemas for this database.
Schema 1: for inserting data into tables; the table structure is based on the data source. (I assume data remains here temporarily and will be transferred to the second schema.)
Schema 2: incoming data is stored here, structured in 3NF.
Schema 3: the data is denormalised again because we need fast queries (fast reading is required).
Your three schema model is exactly how this has been done for many years.
Schema 1:
Names: Staging/Landing/Ingestion
The schema matches the source system, but it is cleared and reloaded for every load batch. It typically has a "looser" schema definition to allow for the import and capture of bad data.
Schema 2:
Names: Replica/ODS/Persisted data store
Schema 2 is never cleared; it's permanent. Following a data load, this layer should look exactly like your source systems. Data in schema 1 is "merged" into schema 2 each time. For example, on a daily load cycle, schema 1 contains just that day's data, but schema 2 contains the entire history of data loaded. Reference data is merged on a known primary key. Transactional data might be merged on a key, or it might be merged on a "windowing" basis, i.e. delete the last day's data from schema 2 and load schema 1 in.
Some people like to have a "point in time view" where they can recreate what the source system looked like at a historical point in time. I've never seen anyone use that though.
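The two merge styles for schema 2 can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for Postgres, and the table names (`staging_reading`, `ods_reading`), the columns and both function names are assumptions, not anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Schema 1 (staging): loose definition, cleared on every batch
    CREATE TABLE staging_reading (id INTEGER, day TEXT, value REAL);
    -- Schema 2 (persisted): never cleared, keyed like the source
    CREATE TABLE ods_reading (
        id INTEGER, day TEXT, value REAL,
        PRIMARY KEY (id, day)
    );
""")


def load_batch_by_key(conn, rows):
    """Reload staging, then merge into the persisted table on (id, day)."""
    with conn:  # one transaction per batch
        conn.execute("DELETE FROM staging_reading")
        conn.executemany("INSERT INTO staging_reading VALUES (?, ?, ?)", rows)
        # Rows whose key already exists are overwritten; new keys are added.
        conn.execute(
            "INSERT OR REPLACE INTO ods_reading "
            "SELECT id, day, value FROM staging_reading"
        )


def load_batch_by_window(conn, rows, day):
    """Reload staging, then replace one whole day in the persisted table."""
    with conn:
        conn.execute("DELETE FROM staging_reading")
        conn.executemany("INSERT INTO staging_reading VALUES (?, ?, ?)", rows)
        conn.execute("DELETE FROM ods_reading WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO ods_reading "
            "SELECT id, day, value FROM staging_reading WHERE day = ?",
            (day,),
        )


load_batch_by_key(conn, [(1, "2024-01-01", 10.0), (2, "2024-01-01", 5.0)])
load_batch_by_key(conn, [(1, "2024-01-01", 12.0), (1, "2024-01-02", 7.0)])
after_key_merge = conn.execute(
    "SELECT id, day, value FROM ods_reading ORDER BY id, day"
).fetchall()
```

In Postgres the key-based merge would more likely be written with `INSERT ... ON CONFLICT ... DO UPDATE`; the idea is the same.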
Schema 3:
Names: Business Layer/Star Schema/Reporting Layer/Datamart/Semantic Layer
Layer 2 is usually a replica of an OLTP data model (OLTP is optimised for entering data). In layer 3, this is transformed into a data model that is optimised for reporting.
The tried and tested data model here is a star schema. It's been around for decades. If you research any reporting tool (e.g. Power BI), they all say that the preferred data model to report from is a star schema. Yes, a star schema is denormalised, and it has other benefits beyond performance: for example, it is more easily understood by a business user, it supports slowly changing dimensions, etc.
All these concepts are explained further online, but if you have any specific questions I'm happy to expand further.
Currently I'm storing two days' data (200M rows with 5 columns each) in an RDBMS (MSSQL), in yesterday's and today's tables, so I keep removing older tables which are no longer useful. I always read and update data in yesterday's table and insert data into today's table.
Whenever I read some rows from yesterday's table, I increment a status column in the table for those rows by 1, so that I'd read those rows again only after I've read all the other rows which match the criteria (some criteria based on other columns).
I want to move to a NoSQL DB for my use case. Please suggest which ones might be worth looking at.
It is not very clear why you would want to move to a NoSQL database - I am not suggesting you shouldn't.
You could take a look at Oracle NoSQL Database. It will allow you to map your current model to its Table API, and it also supports time-to-live, which purges data based on elapsed time (it is not clear whether you purge tables only after some time window has elapsed). It supports JSON documents and a key-value API as well.
We are building a Spreadsheet web app for our clients. They can upload any csv (20 MB+) and then perform operations (listed below) on the data. The data is highly unstructured.
Over the last few months we have experimented with a couple of architectures:
Initially, we stored the whole grid as an array of row objects, e.g. [ {a: 'b', x:'y'}, {a: 'e'} ], inside PostgreSQL's JSON data type. But then any cell update required the whole CSV to be rewritten in the database. This made the app extremely slow.
Next, we moved to MongoDB. This improved the performance but we are still running into performance and scalability issues. Below is our structure.
Our current database design:
PostgreSql Structure:
Table - datasets
id, name, description, etc...
Mongo Structure:
Row 1
_id, column1: value1, column2: value2, _data_set_id = datasets.id
Row 2
_id, column1: value1, column2: value2, _data_set_id = datasets.id
and so on...
Also, we have a mongo index on _data_set_id key to support faster queries of the following types.
( db.coll.find({_data_set_id: xyz}) )
We are also using hosted mongo from a third party vendor who takes care of sharding, backups, uptime etc. (we don't have devops)
The operations on data are of 2 types:
Row operations e.g adding or deleting a row
Column Operations e.g adding or deleting a column
Most of the operations on the data are column-level operations, i.e. they update only one column in each of the rows.
We have optimized to the point where Mongo works fairly well with datasets of fewer than 10k rows. But beyond that, we are not able to scale. We currently have ~25GB of data in Mongo, and within the next few weeks we will hit 50GB.
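To illustrate why column-level operations are expensive with one document per row, here is a small in-memory sketch in Python. It is not MongoDB code: the column-oriented layout and all function names are hypothetical, shown only to compare how many stored items each layout has to touch.

```python
# Row-oriented layout: one document per spreadsheet row (as in the Mongo design).
rows = [
    {"_id": 1, "a": "b", "x": "y"},
    {"_id": 2, "a": "e"},
]


def add_column_row_layout(rows, name, default=None):
    """Add a column to every row; returns the number of documents written."""
    touched = 0
    for row in rows:
        row[name] = default
        touched += 1
    return touched


# Column-oriented layout (hypothetical): one entry per column,
# mapping row id -> cell value. Missing cells are simply absent.
columns = {
    "a": {1: "b", 2: "e"},
    "x": {1: "y"},
}


def add_column_column_layout(columns, name):
    """Add a column; only one stored item is created."""
    columns[name] = {}
    return 1


def update_cell(columns, row_id, column, value):
    """Update a single cell (row x, column y) without touching other columns."""
    columns.setdefault(column, {})[row_id] = value


row_writes = add_column_row_layout(rows, "z")
column_writes = add_column_column_layout(columns, "z")
update_cell(columns, 2, "x", "w")
```

With the row layout the cost of a column operation grows with the number of rows; with the column layout it stays constant, at the price of more expensive row inserts and deletes.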
Our current product is a prototype and now, we are reconsidering our database architecture in order to scale better.
The most critical requirements for our database are:
Fast Read-Writes.
Column querying and updates.
Updating single cell (i.e row x, column y) value.
So,
Is Mongo the right database for this use case ?
If yes, what else (other than indexing, sharding) can we do to scale Mongo ?
P.S
We do realise we can achieve only 2 of CAP, and we have also gone through the Cassandra vs MongoDB vs CouchDB vs Redis comparison.
We are also evaluating CouchDB (master-master replication, MVCC etc., but no dynamic querying), Cassandra (querying on unstructured data is not possible) and HBase (column store) as alternatives.
I strongly suspect your database is not actually sharded. If you're paying for sharding, you're probably not getting the benefit.
You can shard by the indexed key, which should save you time: the data for a given _data_set_id will end up stored on one or two shard servers, which can then respond more quickly to queries on that key.
Try typing:
sh.status()
This should show how well distributed your database is. It will probably be on only one shard.
Have a good read of these bits before setting up your shard. It's very difficult to redo the sharding without rebuilding your entire collection!
http://docs.mongodb.org/manual/tutorial/choose-a-shard-key/
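As a rough illustration of why the choice of shard key matters, the sketch below routes documents to shards by hashing `_data_set_id`. It is a simplification and not how MongoDB places data internally; `shard_for` and `distribution` are made-up helper names.

```python
import hashlib
from collections import Counter


def shard_for(key_value, shard_count):
    """Map a shard-key value to a shard number with a stable hash (illustrative)."""
    digest = hashlib.md5(str(key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribution(docs, shard_count):
    """Count how many documents land on each shard."""
    return Counter(shard_for(d["_data_set_id"], shard_count) for d in docs)


# 20 datasets with 50 rows each, one document per row.
docs = [
    {"_id": f"{ds}-{row}", "_data_set_id": f"dataset-{ds}"}
    for ds in range(20)
    for row in range(50)
]

per_shard = distribution(docs, 4)

# Every row of one dataset maps to the same shard, so a query such as
# find({_data_set_id: xyz}) only needs to ask that one shard.
shards_for_one_dataset = {
    shard_for(d["_data_set_id"], 4)
    for d in docs
    if d["_data_set_id"] == "dataset-7"
}
```

The flip side is that one very large dataset cannot be spread across shards with this key, which is part of why the shard key deserves careful thought up front.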
I've seen a few questions on this topic already but I'm looking for some insight on the performance differences between these two techniques.
For example, let's say I am recording a log of events which will come into the system with a dictionary of key/value pairs for the specific event. I will record an entry in an Events table with the base data, but then I need a way to also link the additional key/value data. I will never know what kinds of keys or values will come in, so any sort of predefined enum table seems out of the question.
This event data will be constantly streaming in, so insert times are just as important as query times.
When I query for specific events I will be using some fields on the Event as well as data from the key/value data. For the XML way I would simply use an Attributes.exists('xpath') statement as part of the where clause to filter the records.
The normalized way would be to use a Table with basically Key and Value fields with a foreign link to the Event record. This seems clean and simple but I worry about the amount of data that is involved.
You've got three major options for a 'flexible' storage mechanism.
XML fields are flexible but put you in the realm of blob storage, which is slow to query. I've seen queries against small data sets of 30,000 rows take 5 minutes when it was digging stuff out of the blobs with Xpath queries. This is the slowest option by far but it is flexible.
Key/value pairs are a lot faster, particularly if you put a clustered index on the event key. This means that all attributes for a single event will be physically stored together in the database, which will minimise the I/O. The approach is less flexible than XML but substantially faster. The most efficient queries to report against it would involve pivoting the data (i.e. a table scan to make an intermediate flattened result); joining to get individual fields will be much slower.
The fastest approach is to have a flat table with a set of user defined fields (Field1 - Field50) and hold some metadata about the contents of the fields. This is the fastest to insert and fastest and easiest to query, but the contents of the table are opaque to anything that does not have access to the metadata.
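The key/value option with a pivoting query might look like the sketch below. It uses Python's sqlite3 rather than SQL Server, so the clustered index is approximated with a `WITHOUT ROWID` table keyed on `(event_id, attr_key)`; the table names, column names and sample attributes are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- WITHOUT ROWID stores rows in primary-key order, so all attributes
    -- of one event sit next to each other (similar to a clustered index).
    CREATE TABLE event_attribute (
        event_id   INTEGER NOT NULL REFERENCES event(id),
        attr_key   TEXT NOT NULL,
        attr_value TEXT,
        PRIMARY KEY (event_id, attr_key)
    ) WITHOUT ROWID;
""")

with conn:
    conn.executemany(
        "INSERT INTO event (id, name) VALUES (?, ?)",
        [(1, "login"), (2, "logout")],
    )
    conn.executemany(
        "INSERT INTO event_attribute VALUES (?, ?, ?)",
        [(1, "user", "ann"), (1, "ip", "10.0.0.1"), (2, "user", "bob")],
    )

# Pivot: one pass over the attributes, flattened to one row per event.
pivoted = conn.execute("""
    SELECT e.id,
           e.name,
           MAX(CASE WHEN a.attr_key = 'user' THEN a.attr_value END) AS user_name,
           MAX(CASE WHEN a.attr_key = 'ip'   THEN a.attr_value END) AS ip_addr
    FROM event AS e
    LEFT JOIN event_attribute AS a ON a.event_id = e.id
    GROUP BY e.id, e.name
    ORDER BY e.id
""").fetchall()
```

Note that every value is stored as text here, which is exactly the datatype trade-off of a single value column.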
The problem with the key/value table approach, I think, is the datatypes: if a value could be a datetime, a string, a unicode string or an integer, then how do you define the column? This dilemma means the value column has to be a datatype which can contain all the different types of data, which then raises the question of efficiency and ease of querying. Alternatively, you have multiple columns of specific datatypes, but I think this is a bit clunky.
For a true flexible schema, I can't think of a better option than XML. You can index XML columns.
This article off MSDN discusses XML storage in more detail.
I'd assume the normalized way would be faster for both INSERT and SELECT operations, if only because that's what any RDBMS would be optimized for. The "amount of data involved" part might be an issue too, but a more solvable one - how long do you need that data immediately on hand, can you archive it after a day, or a couple weeks, or 3 months, etc? SQL Server can handle an awful lot.
This event data will be constantly streaming in so insert times is just as important as query times.
Option 3: If you really have a lot of data constantly streaming in, create a separate queue (in shared memory, in-process SQLite, a separate db table, or even its own server) to store the incoming raw events and attributes, and have another process (scheduled task, Windows service, etc.) parse that queue into whatever format you prefer, tuned for speedy SELECTs. Optimal input, optimal output, ready to scale in either direction, everyone's happy.
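A minimal in-process sketch of that queue-plus-parser idea, using a thread and `queue.Queue` from the Python standard library. The `store` dict merely stands in for the query-tuned table, and `parse_worker` and `run_demo` are hypothetical names; a real setup would use a durable queue and a separate process.

```python
import queue
import threading

_STOP = object()  # sentinel telling the parser to shut down


def parse_worker(raw_events, store):
    """Drain raw events and reshape them into the read-optimized store."""
    while True:
        item = raw_events.get()
        if item is _STOP:
            break
        name, attributes = item
        # Group by event name; stands in for writing a query-tuned table.
        store.setdefault(name, []).append(dict(attributes))


def run_demo():
    raw_events = queue.Queue()  # cheap to append to: the "optimal input" side
    store = {}                  # shaped for lookups: the "optimal output" side
    consumer = threading.Thread(target=parse_worker, args=(raw_events, store))
    consumer.start()

    # Producers only enqueue; they never wait for parsing or indexing.
    raw_events.put(("login", {"user": "ann"}))
    raw_events.put(("login", {"user": "bob"}))
    raw_events.put(("error", {"code": "500"}))

    raw_events.put(_STOP)
    consumer.join()
    return store
```

Calling `run_demo()` returns the grouped events once the parser has drained the queue.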
I am trying to decide between two possible implementations and am eager to choose the best one :)
I need to add an optional BLOB field to a table which currently only has 3 simple fields. It is predicted that the new field will be used in fewer than 10%, maybe even less than 5% of cases so it will be null for most rows - in fact most of our customers will probably never have any BLOB data in there.
A colleague's first inclination was to add a new table to hold just the BLOBs, with a (nullable) foreign key in the first table. He predicts this will have performance benefits when querying the first table.
My thoughts were that it is more logical and easier to store the BLOB directly in the original table. None of our queries do SELECT * from that table so my intuition is that storing it directly won't have a significant performance overhead.
I'm going to benchmark both choices but I was hoping some SQL gurus had any advice from experience.
Using MSSQL and Oracle.
For MSSQL, the blobs will be stored on a separate page in the database so they should not affect performance if the column is null.
If you use the IMAGE data type then the data is always stored out of the row.
If you use the varbinary(max) data type, then data larger than about 8 KB is stored outside the row; otherwise it may be stored in the row, depending on the table options.
If you only have a few rows with blobs the performance should not be affected.