Consider a system that has the following characteristics:
Stores time series data/metrics collected from multiple sensors/inputs.
Data points (metrics) are collected from many different systems at different times.
Each of these metrics is generally one data point (e.g. temp and humidity are not reported at the same time, but rather individually, and each will have a different timestamp).
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and CPU; tomorrow a sensor may be added that monitors CO2 and RAM).
A summary of all metrics for a given time bucket needs to be obtained via a query and is likely to be the most common querying scenario.
I can think of three ways of modeling this.
1. Wide table - with table per category (covered)
Notes: has lots of sparse values due to the data points being collected individually. Storing a new metric requires a new column.
2. Narrow table - with table per metric (covered)
Notes: Storing a new metric requires a new table.
3. Typed table - with single metric table (not covered)
Notes: Storing a new metric requires only a new row in the metricType table, with no schema changes. I am concerned about performance implications due to chunk size, although grouping by a time bucket across all metrics would not require joins and could therefore be faster? (I have sketched all three shapes below.)
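To make the three shapes concrete, here is a rough sketch; the table and column names are illustrative placeholders only, not a proposed design:

```sql
-- 1. Wide table: one column per metric, mostly NULL in any given row.
CREATE TABLE reading_wide (
    recorded_at  TIMESTAMP    NOT NULL,
    sensor_id    INT          NOT NULL,
    temp         DECIMAL(9,3) NULL,
    humidity     DECIMAL(9,3) NULL,
    cpu          DECIMAL(9,3) NULL
    -- a new metric means ALTER TABLE ... ADD COLUMN
);

-- 2. Narrow table: one table per metric.
CREATE TABLE reading_temp (
    recorded_at   TIMESTAMP    NOT NULL,
    sensor_id     INT          NOT NULL,
    metric_value  DECIMAL(9,3) NOT NULL
);
-- ... plus reading_humidity, reading_cpu, and a new table per new metric.

-- 3. Typed table: one row per reading, typed by a lookup table.
CREATE TABLE metric_type (
    metric_type_id  INT          NOT NULL PRIMARY KEY,
    name            VARCHAR(30)  NOT NULL
);

CREATE TABLE reading (
    recorded_at     TIMESTAMP    NOT NULL,
    sensor_id       INT          NOT NULL,
    metric_type_id  INT          NOT NULL REFERENCES metric_type,
    metric_value    DECIMAL(9,3) NOT NULL
    -- a new metric is just a new row in metric_type
);
```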
I was wondering if anyone could comment on the options presented, point me to some performance benchmarks that include 3 as well as 1 and 2, or generally give any advice on the suitability of each approach. I'm planning to run my own experiments on this and I will post the results when done, but any insight at this stage would be gratefully received. :)
Please note, do not suggest a NoSQL solution; I'm aware of the options in that space and am assessing them separately.
1 Proposal
"Wide table"
That has gross Normalisation errors (and, if taken seriously, masses of Nulls and integrity problems). It is unusable; no further comment is required.
"Narrow table"
That is free of errors, but the Normalisation is not yet complete.
"Typed table"
That is sort of complete, the "best" of your three scenarios. But it views the issue through a narrow lens, and in total isolation from the context in which the issue exists. Thus it is in error for reasons other than those you inquire about.
2 Problem
The first problem is that you are comparing three things which are not reasonably comparable, not reasonably equal to each other.
The second problem is, EAV is the flavour of the month, and many people are attracted to it. However, it has major problems, and requires an additional set of "metadata" tables if it is to be implemented with some data integrity. The point is, EAV is not needed.
3 Solution
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and CPU; tomorrow a sensor may be added that monitors CO2 and RAM).
This is actually a straightforward Relational database problem, solved by a perfectly ordinary Relational design, which provides full Relational Power; Relational Integrity; and Speed (which other designs will not have).
3.1 Caveat
But there are a few caveats, due to the fact that what is marketed as "relational" is not Relational.
Get rid of the Record ID fields, they are anti-Relational.
Record IDs reduce your schema to a 1970's style Record Filing system (located in an SQL container for convenience).
Record IDs do not provide row uniqueness, which is demanded by the Relational Model.
Further, they require one additional field and one additional index per file.
When modelling a database (Relational or not), perceive the data, as data, and nothing but data. Do not view the data in terms of your need re the GUI, or some query or other.
It is an error to concern yourself with performance issues at this (modelling) stage. First get it right. Second, make it fast. Do not reverse the prescribed sequence.
Relational Keys provide meaning, as well as Relational Integrity (which is Logical, and distinct from Referential Integrity, which is a physical facility of SQL). What this addresses is the context in which an object exists.
A Sensor does not exist in isolation (except when it is in a package on a shelf in a shop ... but even then, it exists in the context of the shop inventory).
An active Sensor exists only in the context of the object in which it is housed. You have not provided any info regarding that. Let's call the thing Article as a generic label.
Further, it is the Article that requires a limit on the Metric that is being measured by the Sensor (for the purpose of out-of-range alarms, etc), and not the Sensor itself. (The Sensor may have a range, which is a different thing.)
Likewise, a Sensor exists in a Location, which is a second vector. Or else, the Article exists in a Location, and the Article Key carries the Location. I have modelled the latter.
3.2 Data Model
Here is the solution:
Sensor Data Model
Inline graphics may not show up in some browsers. In that case, here it is in PDF.
It will satisfy both OLTP and OLAP (Dimension-Fact) requirements.
If you provide more context, we can get that modelled precisely. This may take a bit of to-and-fro.
It is limited to the info provided.
I have taken MetricType and SensorType to be synonymous.
Article is shown as Dependent on (exists within) Location, alternately they could be separate vectors. In any case, Article and Location together qualify Sensor.
Since SensorSerialNo is unique (AK2), Reading(SensorSerialNo, DateTime) is therefore unique, and a separate index is not required. However, in the event there are many queries on Reading via SensorSerialNo alone, such an index will boost performance. (A rough DDL sketch of these tables follows.)
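To make the notes concrete, here is a minimal DDL sketch only, not the model in the linked graphic: the table, column and key names (Location, Article, SensorType, Sensor, Reading, ReadingDtm, Value) are assumptions inferred from the notes above, and the data types are placeholders.

```sql
CREATE TABLE Location (
    Location   CHAR(6)      NOT NULL,   -- short code the users actually use
    Name       VARCHAR(60)  NOT NULL,
    CONSTRAINT PK_Location      PRIMARY KEY (Location),
    CONSTRAINT AK_Location_Name UNIQUE (Name)
);

CREATE TABLE Article (                  -- Article exists within a Location
    Location   CHAR(6)      NOT NULL,
    Article    CHAR(12)     NOT NULL,
    CONSTRAINT PK_Article          PRIMARY KEY (Location, Article),
    CONSTRAINT FK_Article_Location FOREIGN KEY (Location) REFERENCES Location (Location)
);

CREATE TABLE SensorType (               -- taken as synonymous with MetricType
    SensorType CHAR(6)      NOT NULL,
    Name       VARCHAR(30)  NOT NULL,
    CONSTRAINT PK_SensorType PRIMARY KEY (SensorType)
);

CREATE TABLE Sensor (                   -- assumes one Sensor per SensorType per Article
    Location       CHAR(6)   NOT NULL,
    Article        CHAR(12)  NOT NULL,
    SensorType     CHAR(6)   NOT NULL,
    SensorSerialNo CHAR(12)  NOT NULL,  -- AK2 in the notes
    CONSTRAINT PK_Sensor            PRIMARY KEY (Location, Article, SensorType),
    CONSTRAINT AK2_Sensor           UNIQUE (SensorSerialNo),
    CONSTRAINT FK_Sensor_Article    FOREIGN KEY (Location, Article)
        REFERENCES Article (Location, Article),
    CONSTRAINT FK_Sensor_SensorType FOREIGN KEY (SensorType)
        REFERENCES SensorType (SensorType)
);

CREATE TABLE Reading (
    Location   CHAR(6)       NOT NULL,
    Article    CHAR(12)      NOT NULL,
    SensorType CHAR(6)       NOT NULL,
    ReadingDtm DATETIME      NOT NULL,
    Value      DECIMAL(12,4) NOT NULL,
    CONSTRAINT PK_Reading        PRIMARY KEY (Location, Article, SensorType, ReadingDtm),
    CONSTRAINT FK_Reading_Sensor FOREIGN KEY (Location, Article, SensorType)
        REFERENCES Sensor (Location, Article, SensorType)
);
```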
Please feel free to ask questions, and I will answer.
For those who are completely new to IDEF1X, refer to IDEF1X Introduction.
For those who are familiar with IDEF1X, and only want a brush-up, refer to IDEF1X Anatomy.
4 Performance
Your concern re performance is good, but far too premature to be applied at this stage. First get the data model right, second get the data structures fast. The reasons for that are many, not the least of which is, when the data is Normalised, Relationally, the structures are already very fast. Further, one should never optimise for a particular query (one can add indices, if necessary, in the second stage).
Nevertheless, I will respond to your stated concerns.
Eg. a ClusteredIndex on the prescribed Reading PK will:
Serve most queries, most Dimensions (except queries that use SensorSerialNo alone, in which case I have suggested an additional index). A sketch of the most common query follows this list.
Serve all OLTP Transactions and ensure the highest concurrency, because the Sensors are distributed per the real world: across Locations and Articles.
Whereas an Index on a Record ID guarantees a HotSpot on every single INSERT. Great for creating Deadlocks.
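For instance, the most common query the OP describes (a summary of all metrics per time bucket) runs straight off that ClusteredIndex, with no joins. A sketch only, reusing the assumed names from the DDL sketch above; the hour-bucket expression is T-SQL flavoured and the date range is hypothetical:

```sql
SELECT  R.SensorType,
        DATEADD(HOUR, DATEDIFF(HOUR, 0, R.ReadingDtm), 0)  AS HourBucket,
        COUNT(*)      AS Readings,
        AVG(R.Value)  AS AvgValue,
        MIN(R.Value)  AS MinValue,
        MAX(R.Value)  AS MaxValue
  FROM  Reading R
 WHERE  R.ReadingDtm >= '20240101'          -- hypothetical range
   AND  R.ReadingDtm <  '20240102'
 GROUP BY R.SensorType,
        DATEADD(HOUR, DATEDIFF(HOUR, 0, R.ReadingDtm), 0)
 ORDER BY HourBucket, R.SensorType;
```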
4.1 Benchmark
I do have a hundred or so benchmarks for data structures such as this, collected over the last four decades for both OLTP & OLAP use. Most of my customers are banks (Think: Sensor Readings are very much like Stock Prices that change over the period of a day; several vectors (Dimensions); billions of rows). Banks are very strict about confidentiality, so I cannot publish the benchmarks as is, and redacting them will take time and effort.
I do have one benchmark for a very similar requirement, that is public. In fact, it was included in an Answer to a SO Question re Time Series data, but the seeker got the moderators to excise it (it is embarrassing to Oracle). Here is the Benchmark Summary for the Sybase ASE vs Oracle 10.2 benchmark on a fixed DDL (Time Series data) and population.
Finally, the structures and code required are simple enough for you to run your own benchmark.
5 Response to Other Answers
Re Neville's comments:
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least one self-join.
His comments concern EAV, but they may be read as applying equally to the table in question here (Reading, an ordinary, non-EAV Relational table), because they concern the query type rather than the EAV concept vs the Relational concept:
The declaration does not apply to Relational tables (it may well apply to EAV, with the masses of problems introduced by Record IDs, etc.)
As long as you have
a genuine Relational database schema (as I have suggested), and
a genuine SQL platform (not a pretend "sql", which does not comply but fraudulently uses the name), and
you understand IN and NOT IN, and how to compare Sets in SQL
... such queries are straightforward to code.
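For example, a sketch of the quoted query against the Reading table, reusing the assumed names from the DDL sketch above; 'CPU' and 'HUM' are hypothetical SensorType codes, the syntax is T-SQL flavoured, and the "for more than 3 hours" test would be layered on top (e.g. by counting consecutive qualifying hour buckets):

```sql
SELECT DISTINCT CONVERT(DATE, RC.ReadingDtm)  AS ReadingDay
  FROM  Reading RC
 WHERE  RC.SensorType = 'CPU'
   AND  RC.Value > 30
   AND  EXISTS (                        -- a humidity Reading below 56 on the same
            SELECT 1                    -- Article, within an hour either side,
              FROM Reading RH           -- since the metrics arrive individually
             WHERE RH.Location   = RC.Location
               AND RH.Article    = RC.Article
               AND RH.SensorType = 'HUM'
               AND RH.Value      < 56
               AND RH.ReadingDtm BETWEEN DATEADD(HOUR, -1, RC.ReadingDtm)
                                     AND DATEADD(HOUR,  1, RC.ReadingDtm)
        );
```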
6 Response to Comments
Record ID is Anti-Relational
Do you have any links on the record_id being anti-relational, I don't disbelieve you for a second but I'm interested to learn more about why this anti-pattern is so prevalent.
In this mess of anti-science, the academics manufacture and contrive various "solutions" to "problems" that do not exist in the Relational Model, and then you have a second level of endless "debates" about which correction to the non-problem is better or worse.
You don't need links because there is nothing to "debate", and whatever "debate" you might happen to read misses the above point.
The one and only authority is the great Dr E F Codd. All the authors of all books and textbooks alleging to be about the Relational Model, other than Codd, are actually false: they are about implementing 1970's style Record Filing Systems, and anti-Relational (no Relational Power; no Relational Integrity; no Relational Speed). They made the mistake, from 1970, of trying to fit the RM into their 1970's RFS mindset, rather than releasing it and taking on the RM mindset. And they have spent the last FIVE DECADES reinforcing that, even justifying it with "mathematical definitions"; 17 "relational algebras"; 42 abnormal "normal forms". All completely anti-Relational. And they cite each other, so they get published.
The second problem is, sites such as SO are predicated on the basis of populism. The popular answer is not the best or correct answer. For that you need an Authority (very scary to populists), and objective, absolute truth. (People love their relative or subjective "truths", that change all the time).
Therefore, you need just the single, authoritative definition: the original paper, the Relational Model.
Yes, the terms are out-dated, and not well understood these days.
Yes, it is seminal (every word counts, has deep meaning).
No, you need not read section 2 (math).
You need to glean from that, that:
the Relational Key is “made up from the data” (my paraphrase of the several relevant passages, which are layers in the RM), which is Logical
that surrogates are (a) against that definition; (b) the pre-Relational paradigm, that is, physical pointers, the very thing the RM replaces; and (c) explicitly prohibited.
Very important, you need to understand not only the definition of the Relational Key, but the whys and the wherefores.
Eg. that it transcends import/export problems that pointer-based systems have.
Eg. the temporal definition (seminal; 8 letters; scary).
Therefore, there is no argument, no "debate", to be had.
Anyone going against that is anti-Relational. Not because I say so, but because it contradicts evidenced facts, and the single Authority.
I have named the explicit technical benefits of using the RM correctly (Relational Power; Relational Integrity; Relational Speed), but an expansion of that requires a fair amount of effort.
The consequence of NOT complying with the RM is, you get (a) none of the benefits, AND (b) you get the complete set of problems that pre-Relational Record Filing Systems had in 1970, AND (c) the contrived "solutions" supplied by the "academics" that have never worked.
If you need an expansion of those benefits of the RM, which of course you do need to understand to some degree, because each one is very deep and very important, the best I can provide is this. As you can imagine, this is a battle that I have to fight on every Answer that relates to this subject, so I have posted a fair amount, over the years, across many Answers.
Go to my profile, select All Answers, and read any that relate to this subject.
Why is this Record ID anti-pattern so prevalent ?
The short answer is, people love their ignorance, their subjective "truths", and will fight tooth and nail to protect it. They quickly accept and repeat any justification for remaining the same. Learning something that is a paradigm shift away from what they know, is very scary, because it threatens their comfortable ignorance, and exposes it for what it really is. They will have to admit that what they have been writing for FIVE DECADES is wrong. That is why populism thrives. In ignorance.
The slightly longer answer is this. Just look at the internet. In the old days, for any particular subject, we had one source, one absolute authority: eg. buy the Encyclopædia Britannica; spend your entire childhood devouring it. Permanent truth. Honest history. But now anyone with a keyboard and two fingers plus some connective tissue (no brain required) can post. As an instant "authority". The web is chock-full of (a) superficial answers (the antithesis of "Now THAT is an answer") (b) in many flavours (c) that get upvoted due to populism (d) that are nowhere near the correct or full answer. Sound bites that can be easily understood by the populace. Very few want the depth of the full answer.
Even when an authority of sorts becomes established (eg. Wikipedia; Stack Overflow), it is easily subverted, because there are literally millions of people who change the entries (truth does not change, therefore, as long as something is changing, it is not truth). Mostly to serve their political positions; their ideologies; their re-writes of history to make the past wrong (it wasn't, it already happened), and the present insanity "good".
The definitive answer is this: academic envy. It took a whole decade for Codd's Relational Model to be understood and accepted. And even then, only by the few. IBM, and Britton-Lee (which became Sybase) implemented Codd's RM, in spirit and word. (Digital Equipment Corp did as well, but they are defunct.) Those academics who appeared to be working with Codd turned out to be actually working against him (by virtue of the evidence). They hated the fact that they did not come up with it themselves, that one man came up with the first real model, with a sound, logical, mathematical foundation, complete with a Relational Algebra. All integrated. All requirements of the day (eg. the Bill Of Materials problem) answered. That has stood the test of time: five decades and nothing has been added or changed.
Typically they will declare, "but Codd did not define this or that, so here I am defining it ...". So they came up with their own RA. Now they have 17, all irrelevant. And abnormal "normal forms" to elevate fragmented bits of their Record Filing Systems to seem "relational". Now they have 42, all irrelevant. And many books, alleging to be "relational", but by evidenced fact, anti-Relational. Each "academic" seeks to reinforce their "academic" position, against all others.
Which is why I say, again, go to the one and only Authority. Read nothing from the anti-Relational crowd, because it will diminish your understanding of the RM (at best), or poison your mind (at worst).
One Clarification
If you examine a Relational PK, eg. Location.Location, it may seem odd. This is a %Code or %ShortName that is data, that the user actually uses. Usually 4 to 6 characters, max 12. As distinct from the long Name, which has to exist, and which is an Alternate Key. And of course, it is definitely not a number of any kind (a number is not data, not something that the user uses). Users, too, like their short forms. Obviously, use any International Standard if such exists.
The Key must be stable (not static, nothing in the universe is static), and one that is used in the real world to uniquely identify the object (data row).
Eg. for a Security, which is a company listed on a stock exchange: in America it would be TickerSymbol, in Australia ASXCode. The ISO code, the ISINCode, is an AlternateKey.
For cities, use one of the geographic location standards: ISO; FIPS; etc. (I use Statoids because it existed long before the others, but those days are numbered). At worst, use Airport Code.
Genuine SQL Platform
What do you consider to be genuine SQL? Sql Server, Postgres, MySQL, Oracle I guess all would be?
No. I mean any platform that actually complies with the published SQL Standard, and therefore can actually support relational tables; relational processing of Sets; and ACID Transactions.
That automatically excludes freeware/vapourware/nowhere/"open source", for which bits are written by 10,000 developers spread across the universe, with no governing principles. Eg. no ACID Transactions, or the structures needed for them, which are required in every code segment. Too late to insert that now, because it would require a 100% re-write, and heaven forbid ... a Server Architecture.
Commercial
Commercial means paid-for and supported, which is also important. Either you have a maintenance contract and support is immediate, or you post a bug report and you check for updates every day for the next year or three.
Server Architecture
If either scalability or performance (high throughput; high concurrency; low latency) is required, then the Server Architecture is most important. Again, that excludes the freeware, and Oracle, because they have no Architecture: they are massive collections of interacting programs that get the o/s to perform all the functions that an architected Database Server would normally perform.
Check this Comparison of Oracle vs Sybase Architecture.
The exact same applies to PostgreSQL and other freeware. PostgreSQL (son of the total failure Ingres) famously failed under pressure, with masses of locking problems and very low concurrency.
1 High-End, Commercial, SQL Compliant
Something like 5% market share, but 95% of the Financial Services and Automation markets. Great Architecture, hopeless marketing.
Sybase ASE
IBM DB2
2 Commercial, SQL Compliant
MS SQL Server
Easily the most common. Good Architecture (originally stolen from Sybase) and then "progressed" in the usual insane MS style. Pain to use; masses of overhead; poorly integrated with various add-ons and must-uses.
3 Commercial, SQL Non-Compliant
Hopeless Architecture, great marketing.
Oracle
Generally, Oracle developers are quite good at using the product in the ways that are required to get it to work, but that means they have strayed quite far away from the Relational Model.
Eg. in the Time Series benchmark, the whole point was, Oracle cacks itself when a Subquery is requested, so it has to use an "Inline View". The OP alleged that this was just as fast as a Subquery (avoiding the fact that it requires far more code, and the coder must step outside the Relational mindset). The benchmark proved that to be hilariously false in each scenario tested: Oracle was 3 to 4.8 times slower than Sybase on a COUNT(), and 26 to 36 times slower on a SUM(), and the Subquery (Sybase 2.1 secs) had to be abandoned on Oracle after 120 mins.
Eg. Oracle is non-compliant re ACID Transactions, and developers work around that obstacle to a degree, but Phantom Updates and Lost Updates (technical terms) are simply not prevented. If the work-arounds are not written properly, entire rows (UPDATEs or INSERTs) are lost.
All that applies to the below ...
4 Non-Commercial, SQL Non-Compliant
These guys spend an awful lot of time developing "features" that are not required for a Relational database, but very attractive to the anti-Relational Record ID Filing Systems.
Eg. "deferred constraint checking"; ENUMs; etc.
They lack the basics of SQL compliance. Eg. no genuine ACID Transactions.
Further, as explained above, zero Architecture. This results in systems that perform wonderfully under single-use, and fail miserably under any order of pressure from concurrency or scalability.
Due to their non-compliance with the SQL requirement, they take pains to post a notice of compliance on every page in the Commands manual. (Just one declaration of compliance at the front of the manual is all that is required.) Of course, the missing commands are simply missing, so, gee whiz, those pages carry no compliance declaration.
PostgreSQL
The worst piece of software I have ever had to examine since the days of Ingres. Dearly loved by the "academic" crowd, simply because it was scrawled by a fellow "academic".
Five users max, or deal with the concurrency problems (just take a cursory look at the problems reported on SO).
MySQL
Head and shoulders above PostgreSQL, but still in this category.
The InnoDB engine is distinctly better in the performance department, but nowhere near the Sybase/DB2 level (still no genuine Server Architecture). No respite in the SQL non-compliance department.
5 Summary
You get what you pay for.
Server Architecture shows up, most visibly, as performance in every scenario.
SQL Compliance, thought through deeply, and implemented in every applicable code segment.
Last but not least, Support.
Whatever you choose, remember, when you port it to another platform, your SQL code will require a complete check-and-change, because the "flavours" of SQL (or NON-sql) are very different. For the Non-Commercial program suites, that means a complete rewrite. Therefore choose carefully, with the long term implementation in mind.
It depends largely on the types of query you'll need to run. I think performance may not be your biggest concern if, as you say
A summary of all metrics for a given time bucket needs to be obtained via a query and is likely to be the most common querying scenario.
As queries in all scenarios would hit an indexable timestamp column, it really is just a question of the performance of joins, and pretty much every relational database is really good at that.
If your queries really are just "show data for a time range", your option 3 (an entity/attribute/value design) is most effective from a development effort point of view.
Your query would have a single, inner join, and the timestamp column would provide a good index. As you say, you wouldn't need to change schema or queries when collecting new measurement points.
The alternative designs would require outer joins for each table. In performance terms, that's not a huge deal, but managing the schema and associated queries would be a pain.
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least one self-join.
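For illustration, a sketch of that self-join pattern, assuming a hypothetical EAV-style table metric(ts, metric_type, metric_value): each criterion needs its own alias of the same table, and the "more than 3 hours" duration condition is not even attempted here.

```sql
SELECT m_cpu.ts
  FROM metric AS m_cpu
  JOIN metric AS m_hum
    ON m_hum.ts = m_cpu.ts            -- in practice a tolerance window or a shared
                                      -- time bucket, since readings arrive individually
 WHERE m_cpu.metric_type = 'cpu'      AND m_cpu.metric_value > 30
   AND m_hum.metric_type = 'humidity' AND m_hum.metric_value < 56;
```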
TimescaleDB's documentation discusses wide versus narrow data models:
https://docs.timescale.com/timescaledb/latest/overview/data-model-flexibility/
In summary:
"A narrow model makes sense if you collect each metric independently. It allows you to add new metrics as you go by adding a new tag without requiring a formal schema change."
"If you typically query multiple metrics together, it is both faster and easier to store them in a wide table format"
Indeed, option 3 is a sort of EAV modelling on relational storage, with the timestamp included in the EAV key.
+---------+ +-----+ +-------------+
| Sensors | -- 1:M --< | EAV | >-- M:1 -- | Value kinds |
+---------+ +-----+ +-------------+
A summary of all metrics for a given time bucket needs to be obtained via a query and is likely to be the most common querying scenario
If queries don't require joins but do need to be grouped by time, a clustered index on the timestamp column ensures good performance.
However, any queries with joins (i.e. comparing values from different sensors) risk degrading performance. One solution can be a separate OLAP store for the collected EAV data.
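For illustration, a hedged sketch of that no-join, time-bucketed summary, assuming a single table eav(ts, sensor_id, value_kind_id, metric_value) with a clustered index led by ts; the daily bucket and date range are arbitrary:

```sql
SELECT value_kind_id,
       CAST(ts AS DATE)  AS day_bucket,
       COUNT(*)          AS readings,
       AVG(metric_value) AS avg_value
  FROM eav
 WHERE ts >= '2024-01-01' AND ts < '2024-02-01'   -- hypothetical range
 GROUP BY value_kind_id, CAST(ts AS DATE);
```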
From a developer's point of view, I would recommend the third option. In your third option, you might consider having indexes on the MetricType (i.e. typeId) and the timestamp column, which will greatly improve query performance.
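A sketch of those indexes only; the table and column names (metric, type_id, ts, metric_value) are placeholders rather than your actual schema:

```sql
CREATE INDEX ix_metric_type_ts ON metric (type_id, ts);  -- per-metric, time-ordered lookups
CREATE INDEX ix_metric_ts      ON metric (ts);           -- time-bucket summaries across all metrics
```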
Your first option, by contrast, may require system downtime: when a new column needs to be added, you may have to take the live system down, add the column, initialise it with default or null values, and then bring the system back up. In my opinion, it will also contain unnecessary (garbage) data in the earlier rows for any column added later. The table will be huge and might contain a significant amount of such garbage data, which affects query time.
The second option is an improvement over the first; even though it avoids much of that garbage data, it requires joining multiple tables, which will increase query time as the system grows. You also cannot cover the queries with indexes spread across many tables as effectively as you can with the single pair of indexes in the third option.
Hence I think going for the third option is the most effective. The tables are normalized and effective indexes will provide efficient query results.
I would like to suggest one more thing. You might also consider having a separate table that contains aggregated data. For example, if your system requires aggregated data, you could store it in a denormalised style in a separate table, where the aggregated values for a certain timeline are kept, so that you can remove already-processed data from your original table. I am referring to an OLAP-style database, which you might consider looking into.
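A sketch of that aggregated-table idea, under the same assumed names; the hourly granularity and the rollup window are assumptions, and the hour-bucket expression is T-SQL flavoured:

```sql
CREATE TABLE metric_hourly (
    bucket_start   DATETIME      NOT NULL,
    type_id        INT           NOT NULL,
    reading_count  INT           NOT NULL,
    avg_value      DECIMAL(12,4) NOT NULL,
    min_value      DECIMAL(12,4) NOT NULL,
    max_value      DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (bucket_start, type_id)
);

-- Periodic rollup of a window that is already closed.
INSERT INTO metric_hourly (bucket_start, type_id, reading_count, avg_value, min_value, max_value)
SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, ts), 0),
       type_id,
       COUNT(*), AVG(metric_value), MIN(metric_value), MAX(metric_value)
  FROM metric
 WHERE ts >= '20240101' AND ts < '20240102'       -- hypothetical, already-finalised window
 GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, ts), 0), type_id;
```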
I wouldn't recommend an ERD design where you need to ALTER the table whenever you add a sensor (given that you know you will). That's why I believe you should eliminate option 1. Whenever you alter the table you will get plenty of null values and unnecessary work in your code.
The same applies to option 2, except perhaps for the nulls, but you will still get unnecessary work whenever you add a new data source to your system.
Option 3 looks like a good fit to me, as it is ready for expanding data sources and keeps the data clean and neat.
I read a topic on Cassandra DB. It said that Cassandra is good for applications that do not require the ACID properties.
Is there any application or situation where ACID is not important?
There are many (most?) scenarios where ACID isn't needed. For example, if you have a database of products where one table is id -> description and another is id -> todays_price, and yet another is id -> sales_this_week, then there's no need to have all tables locked when updating something. Even if (for some reason) there are some common bits of data across the three tables, depending on usage, having them out of sync for a few seconds may not be a problem. Perhaps the monthly sales are only needed at the end of the month when aggregating a report. Not being ACID compliant means not necessarily satisfying all four properties of ACID... in most business cases, as long as things are eventually consistent with each other, it may be good enough.
It's worth mentioning that commits affecting the same Cassandra partition ARE atomic. If something does need to be atomically consistent, then your data model should strive to put that bit of information in the same partition (and as such, in the same table). When we talk about eventual consistency in the context of Cassandra, we mean things affecting different partitions (which could be different rows in the same table), not the same partition.
The canonical example of a "transaction" is a debit from one account, and a credit to another... this is "important" because "banks". In reality, this is not how banks operate. What such a system does need is a definitive list of transactions (read: transfers). If you were to model this for Cassandra, you could have a table of transfers consisting of (from_account, to_account, amount, time, etc.). These records would be atomically consistent. Your "accounts" tables would be updated from this list of transfers. How soon this gets reflected depends on the business. For example, in the UK, transfers from Lloyds bank to Lloyds bank are almost instant, whereas some inter-bank transfers can take a couple of days. In the case of the latter, your account's balance usually shows the un-deducted amount of pending transfers, while a separate "available balance" takes the pending transfers into account.
Different things operate at different latencies, and in some cases ACID, and the resultant immediate consistency across all updated records may be important. For a lot of others though, specially when dealing with distributed systems with lots of data, ACID at the database level may not be required.
Even where "visible consistency" is required, it can often be handled with coordination mechanisms at the application level, CRDTs, etc. To the end user, the system is atomic - either something succeeds, or it doesn't, and the user gets a confirmation. Internally, the system may be updating multiple database, dealing with external services, etc., and only confirming when everything's peachy. So, ACID for different rows in a table, or across tables in a single database, or even across multiple databases may not be sufficient for externally visible consistency. Cassandra has tunable consistency where by you can use data modelling, and deal with the tradeoffs to make a "good enough" system that meets business requirements. If you need ACID transactionality across tables, though - Cassandra wouldn't be fit for that use case. However, you may be able to model your business requirements within cassandra's constraints, and use it to get the other scalability benefits it provides.
BASE stands for 'Basically Available, Soft state, Eventually consistent'
So, I've come this far: "Basically Available: the system is available, but not necessarily all items in it at any given point in time" and "Eventually Consistent: after a certain time all nodes are consistent, but at any given time this might not be the case" (please correct me if I'm wrong).
But, what is meant exactly by 'Soft State'? I haven't been able to find any decent explanations on the internet yet.
This page (originally here, now available only from the web archive) may help:
[soft state] is information (state) the user put into the system that
will go away if the user doesn't maintain it. Stated another way, the
information will expire unless it is refreshed.
By contrast, the position of a typical simple light-switch is
"hard-state". If you flip it up, it will stay up, possibly forever. It
will only change back to down when you (or some other user) explicitly
comes back to manipulate it.
The BASE acronym is a bit contrived, and most NoSQL stores don't actually require data to be refreshed in this way. There's another explanation suggesting that soft-state means that the system will change state without user intervention due to eventual consistency (but then the soft-state part of the acronym is redundant).
There are some specific usages where state must indeed be refreshed by the user; for example, in the Cassandra NoSQL database, one can give all rows a time-to-live to make them completely soft-state (they will expire unless refreshed), but this is an unusual mode of usage (a transient cache, essentially).
"Soft-state" might also apply to the gossip protocol within Cassandra; a new node can determine the state of the cluster from the gossip messages it receives, and this cluster state must be constantly refreshed to detect unresponsive nodes.
I was taught in classes that "soft state" means that the state of the system could change over time (even during periods without input), because there may be changes going on due to eventual consistency. That's why it is called "soft" state.
Some source: link
Soft state means data that is not persisted on the disk, yet in case of failure it could be possible to restore it (e.g. recreate a lower quality image from a high quality one). A good article that addresses this and other interesting issues is Cluster-Based Scalable Network Services
A BASE system gives up on consistency to improve the performance of the database. Hence most of the well-known NoSQL databases are more highly available and scalable than ACID-compliant relational databases.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time.
For example, consider two systems, A and B. If a user writes data to system A, there will be some delay before the written data is reflected on B, usually within milliseconds (depending on the network speed and the design for syncing).