I have inherited a legacy content delivery system and I need to re-design & re-build it. The content is delivered by content suppliers (e.g. Sony Music) and is ingested by a legacy .NET app into a SQL Server database.
Each content item has some common properties (e.g. Title and Artist Name) as well as some content-type-specific properties (e.g. Bit Rate for MP3 files and Frame Rate for video files).
This information is stored in a relational database across multiple tables. These tables may have null values in some of their fields because those fields do not apply to every content type. The database is constantly under write load because the ingestion system continuously receives content files from the suppliers and adds their metadata to the database.
There is also a public-facing web application which lets end users buy the ingested content (e.g. music, videos, etc.). This web application relies entirely on an Elasticsearch index; in fact it does not see the database at all and uses the Elasticsearch index as its source of data. The reason is that SQL Server does not perform as fast or as efficiently as Elasticsearch when it comes to text search.
To keep the database and Elasticsearch in sync, there is a Windows service which reads the updates from SQL Server and writes them to the Elasticsearch index!
As you can see there are a few problems here:
The data is saved in a relational database, which makes it hard to manage. For example, there is a table with 3 billion records that stores the metadata of each content item as key-value pairs! To me, using a NoSQL database or index would make a lot more sense, as it allows storing documents with different structures (a small sketch of this follows this list).
The Elasticsearch index needs to be kept in sync with the database. If the Windows service stops working for any reason, the index does not get updated. Also, when there are too many inserts/updates in the database, it takes a while for the index to catch up.
We need to maintain two sources of data, which adds cost and operational overhead.
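Purely for illustration, here is a minimal sketch (plain Python, with hypothetical field names) of how the same content items could be stored as self-describing documents instead of key-value rows:

```python
# Hypothetical examples: each content item is a self-describing document,
# so type-specific fields (bit_rate_kbps, frame_rate_fps, ...) only appear
# where they apply, instead of living in a giant key-value table or in
# nullable columns.
mp3_doc = {
    "content_id": "A1B2C3",
    "type": "mp3",
    "title": "Some Song",
    "artist_name": "Some Artist",
    "bit_rate_kbps": 320,          # MP3-specific property
}

video_doc = {
    "content_id": "D4E5F6",
    "type": "video",
    "title": "Some Video",
    "artist_name": "Some Artist",
    "frame_rate_fps": 25,          # video-specific property
    "resolution": "1920x1080",
}

# A document store (or search index) can hold both shapes in the same collection/index.
catalog = [mp3_doc, video_doc]
```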
Now my question: is there a NoSQL database which has these characteristics?
1. Allows me to store documents with different structures in it.
2. Provides good text-search features and performance, e.g. fuzzy search (see the sketch after this list).
3. Allows multiple concurrent updates to its data. In my experience, Elasticsearch has problems with concurrent updates.
4. Can be installed and used on Amazon AWS infrastructure, because our new products will be hosted on AWS. Auto-scaling and clustering are important, e.g. DynamoDB.
5. Has some kind of GUI so that support staff or developers can modify the data to some extent.
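To illustrate characteristic 2, here is a minimal sketch of a fuzzy search against Elasticsearch using the official Python client. The index and field names are made up, and the exact call signature (e.g. the `body=` argument) depends on the client version you install:

```python
from elasticsearch import Elasticsearch

# Hypothetical cluster URL and index name.
es = Elasticsearch("http://localhost:9200")

# Fuzzy match on the title field: tolerates small typos such as "Bohemiam".
query = {
    "query": {
        "match": {
            "title": {
                "query": "Bohemiam Rhapsody",
                "fuzziness": "AUTO",
            }
        }
    }
}

response = es.search(index="content", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```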
A combination of DynamoDB and ElasticSearch may work for your use case.
DynamoDB certainly supports characteristics 1, 3, 4, and 5.
There is now a Logstash Input Plugin for DynamoDB that can be combined with an ElasticSearch output plugin to keep your table and index in sync in real time. ElasticSearch provides characteristic 2.
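The Logstash route is configuration-driven; purely as an illustration of what such a pipeline does under the hood, here is a rough Python sketch (boto3 + elasticsearch-py) that replays a DynamoDB Stream into an Elasticsearch index. The stream ARN, key attribute and index name are placeholders, and checkpointing/error handling are omitted:

```python
import boto3
from boto3.dynamodb.types import TypeDeserializer
from elasticsearch import Elasticsearch

# Placeholder names: adjust the stream ARN, region, key attribute and index for your setup.
STREAM_ARN = "arn:aws:dynamodb:eu-west-1:123456789012:table/Content/stream/2024-01-01T00:00:00.000"
es = Elasticsearch("http://localhost:9200")
streams = boto3.client("dynamodbstreams", region_name="eu-west-1")
deserializer = TypeDeserializer()

def sync_one_shard(shard_id):
    """Replay one stream shard into the 'content' index (no checkpointing, for brevity)."""
    iterator = streams.get_shard_iterator(
        StreamArn=STREAM_ARN, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    while iterator:
        page = streams.get_records(ShardIterator=iterator)
        if not page["Records"]:
            break  # simplified stop condition; a real syncer would keep polling the shard
        for record in page["Records"]:
            key = record["dynamodb"]["Keys"]["content_id"]["S"]  # hypothetical key attribute
            if record["eventName"] in ("INSERT", "MODIFY"):
                # Convert the DynamoDB-typed image into a plain dict and index it.
                doc = {k: deserializer.deserialize(v)
                       for k, v in record["dynamodb"]["NewImage"].items()}
                es.index(index="content", id=key, body=doc)
            else:  # REMOVE
                es.delete(index="content", id=key, ignore=[404])
        iterator = page.get("NextShardIterator")

for shard in streams.describe_stream(StreamArn=STREAM_ARN)["StreamDescription"]["Shards"]:
    sync_one_shard(shard["ShardId"])
```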
Looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and as a result, there is a 24 hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data during the weekend, and then replicate writes in real time during weekdays.
Thanks.
ElasticSearch's main use case is providing search-type capabilities on top of large volumes of unstructured, text-based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool for parsing out pieces of those emails based on rules you set up, enabling search (and to some degree querying) of those email messages.
If your data is already in SQL Server, it sounds like it's already structured, and therefore there's not much to gain from ElasticSearch in terms of reporting and availability. Rather, you'd likely be introducing extra complexity into your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look at building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features you could look into are AlwaysOn Availability Groups, Replication, and SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) has different pros and cons. For example, AlwaysOn Availability Groups are very easy to set up and offer the ability to automatically fail over if your main server has an outage, but they clone the entire database to a replica. Replication lets you more granularly choose to copy only specific tables and views, but then you can't fail over as easily if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before turning to replication as a reporting solution (although it's a fairly common one). You may find that a simple architectural change, like adding a columnstore index to the right table, improves your reporting capabilities immensely.
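As a hedged illustration of that last point, this is roughly what adding a nonclustered columnstore index for reporting queries could look like. The connection string, table and column names are placeholders; test on a copy first, since columnstore indexes change write behaviour:

```python
import pyodbc

# Placeholder connection string and object names.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=Sales;Trusted_Connection=yes"
)
cursor = conn.cursor()

# A nonclustered columnstore index over the columns the reports aggregate on.
cursor.execute("""
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_OrderLines_Reporting
    ON dbo.OrderLines (OrderDate, ProductId, Quantity, LineTotal);
""")
conn.commit()
conn.close()
```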
I've been down both paths, implementing ElasticSearch and implementing a data warehouse using all three of the main data-synchronization features above, for structured data and for unstructured large text data, and have seen the proper use cases for both. One data warehouse I've managed in the past had tables with billions of rows (each table terabytes in size), and it was highly performant for reporting on fairly modest hardware in AWS (we weren't even using Redshift).
I have a primary use case where I want to have a transactional relational database for which I am using Postgres.
I also need to run frequent aggregate queries (count, sum, average) on the data. These statistics cannot be precomputed, because we have to support multiple search filters.
I was initially thinking of using Redshift as secondary storage that can serve these queries, but then I would also need to build a system to keep the data in sync between the two stores.
Is there a better way to achieve this?
Take a look at AWS DMS; you can set it up to keep a near-real-time replica of your Postgres data in Redshift.
It is reliable and requires minimal maintenance (e.g. if you add new columns to your source data).
Read both of these carefully, especially limitations and requirements.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
and
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html
Unless you need them, I recommend excluding text (and other large object) columns from the sync. This can be done easily by setting a flag, or it can be tailored column by column (see the table-mapping sketch below).
The source Postgres database does not have to be held on AWS.
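To make the column-exclusion point concrete, here is a hedged sketch of DMS table mappings submitted via boto3. All schema, table and column names, ARNs and the task identifier are placeholders; check the DMS documentation linked above for the full rule syntax:

```python
import json
import boto3

# Replicate everything in the "public" schema, but drop one large text column from
# the sync via a remove-column transformation rule (names below are placeholders).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-public-schema",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "drop-large-text-column",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "public",
                "table-name": "orders",
                "column-name": "raw_payload",
            },
            "rule-action": "remove-column",
        },
    ]
}

dms = boto3.client("dms", region_name="eu-west-1")
dms.create_replication_task(
    ReplicationTaskIdentifier="pg-to-redshift",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",      # placeholder ARNs
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```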
I'm looking for an open source data store that scales as easily as Cassandra but whose data can be queried as documents, like MongoDB.
Are there currently any databases out that do this?
On the website http://nosql-database.org you can find a list of many NoSQL databases sorted by data store type; you should check the document stores there.
I'm not naming any specific database to avoid a biased/opinion-based answer, but if you are interested in a data store that is as scalable as Cassandra, you probably want to check those which use master-master/multi-master/masterless (you name it, the idea is the same) architecture, where both writes and reads can be split among all nodes in the cluster.
I know Cassandra is optimized for writes rather than reads, but without further details in the question I can't refine the answer any more.
Update:
Disclaimer: I haven't used CouchDB at all, and haven't tested its performance either.
Since you mentioned CouchDB, I'll add what I've found in the official documentation, in the distributed database and replication section.
CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected. Those changes can then be replicated bi-directionally later.

The CouchDB document storage, view and security models are designed to work together to make true bi-directional replication efficient and reliable. Both documents and designs can replicate, allowing full database applications (including application design, logic and data) to be replicated to laptops for offline use, or replicated to servers in remote offices where slow or unreliable connections make sharing data difficult.

The replication process is incremental. At the database level, replication only examines documents updated since the last replication. Then for each updated document, only fields and blobs that have changed are replicated across the network. If replication fails at any step, due to network problems or crash for example, the next replication restarts at the same document where it left off.

Partial replicas can be created and maintained. Replication can be filtered by a JavaScript function, so that only particular documents or those meeting specific criteria are replicated. This can allow users to take subsets of a large shared database application offline for their own use, while maintaining normal interaction with the application and that subset of data.
This looks quite scalable to me, as it seems you can add new nodes to the cluster and have all the data replicated to them.
Also, partial replicas seem like an interesting option for really big data sets, though I'd configure them very carefully to prevent situations where a given query might not yield valid results, for example during a network partition when only a partial replica is reachable.
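To make the filtered-replication idea from the quoted documentation concrete (again, I haven't used CouchDB myself), here is a minimal Python sketch that posts a replication request to CouchDB's documented /_replicate endpoint. Node addresses, database names, credentials and the filter name are all made up; the referenced filter would be a JavaScript function stored in a design document on the source database:

```python
import requests

# Hypothetical CouchDB node, credentials and database names.
COUCH = "http://admin:password@couch-node-1:5984"

replication = {
    "source": "content",                        # local database on couch-node-1
    "target": "http://couch-node-2:5984/content",
    "continuous": True,                         # keep replicating as documents change
    "filter": "replication/only_published",     # JS filter in _design/replication on the source
}

resp = requests.post(f"{COUCH}/_replicate", json=replication)
resp.raise_for_status()
print(resp.json())
```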
I have an Oracle 11g database with an extremely complex and badly designed schema. This is a legacy system with many dependencies that supports several critical applications, modifying the schema of this database is unfortunately out of the question.
I am developing a web application (ASP.NET MVC 5) that acts as a read-only status dashboard for the information in this database. Currently, I rely on purpose-built database views to get only the information the web application needs. Given the complexity of the schema, many of these views perform very poorly. When the web application is busy with many users, the database struggles to keep up, usually resulting in timeout errors. Also, when the database does fail for whatever reason, the web application cannot show any data. Users would still like the web application to show a snapshot of the data from before the failure.
The nature of the data is very dynamic, rows are being added/updated/deleted by several external systems and processes constantly, and I have no way of knowing when there is a change to the underlying data, so I have to re-query the view to get fresh data.
Because of this situation, we are considering removing the direct link between this database and the web application and instead creating some sort of intermediary cache/database/magic layer between them. This way, the web application would get its data from this intermediary layer without placing heavy load on the complex database. When the complex database fails, the web application can still query the last snapshot of data from this intermediary layer.
The question is, what should this intermediary layer be? Because I don't know when and how the underlying data changes I can't maintain a live cache of the data. Instead I would need to rely on snapshots of the views in this database.
This is our current idea:
We create a new, intermediary SQL Server/Oracle database. A job runs every 2 minutes for each database view we are currently using, queries it, then dumps the results into a table in our intermediary SQL Server/Oracle database. This would require truncating the intermediary table, then refilling it with the fresh view results (a rough sketch of such a refresh job appears at the end of this question). In the meantime, the web application would be querying these intermediary tables for data. The obvious concern is what happens when the web application tries to query the intermediary table while it is being truncated and repopulated with fresh data. Another concern is dealing with possible concurrency issues when grabbing data from views that share foreign keys or related data.
Normally, a web application would simply maintain a cache, but this would require being hooked into the add/update/delete events in the database to maintain the state of the cache. On top of that, the cache wouldn't be a full snapshot of the database. If the database were to fail, the cache would be unable to provide a snapshot of data from before the failure.
Any other suggestions on what this magical intermediary layer should be? We are looking at solutions available on both the database end (either SQL Server or Oracle) as well as solutions on the web application side (ASP.NET MVC 5, IIS).
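Purely to make the truncate-and-refill idea above concrete (not a recommendation), here is a rough Python sketch of one refresh cycle against a SQL Server intermediary. All view, table and connection names are invented, and it leans on a single transaction so readers never see a half-empty table; a staging-table swap would avoid even the brief blocking:

```python
import cx_Oracle
import pyodbc

# Placeholder connection details for the legacy Oracle database and the intermediary cache.
oracle = cx_Oracle.connect("dashboard_ro", "secret", "legacy-db:1521/ORCL")
sqlserver = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=cache-db;DATABASE=DashboardCache;UID=loader;PWD=secret"
)

def refresh_snapshot():
    """One refresh cycle: read the Oracle view, then replace the cached copy."""
    src = oracle.cursor()
    src.execute("SELECT status_id, system_name, status, updated_at FROM vw_system_status")
    rows = src.fetchall()

    dst = sqlserver.cursor()
    # pyodbc runs with autocommit off, so the delete + inserts below form one
    # transaction: readers never observe a half-empty table, they just wait briefly.
    dst.execute("DELETE FROM dbo.SystemStatusSnapshot")
    dst.executemany(
        "INSERT INTO dbo.SystemStatusSnapshot (status_id, system_name, status, updated_at) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    sqlserver.commit()

refresh_snapshot()  # in practice this would be scheduled, e.g. every 2 minutes
```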
I am currently looking at an issue where I am trying to integrate Hadoop with a database, since Hadoop offers parallelism but not performance. I was referring to the HadoopDB paper. Hadoop usually takes a file, splits it into chunks, and places these chunks on different datanodes. During processing the namenode tells where a chunk can be found, and a map runs on that node. I am looking at the possibility of the user telling the namenode which datanode to run the map on, with the namenode then running the map to get the data either from a file or from a database. Can you kindly tell me whether it is feasible to tell the namenode which datanode to run the map on?
Thanks!
I'm not sure why you would want to tie a map/reduce task to a particular node. What happens if that particular node goes down? In Hadoop, map/reduce operations cannot be tied to a particular node in the cluster; that is what makes Hadoop scalable.
Also, you might want to take a look at Apache Sqoop for importing/exporting data between Hadoop and a database.
If you are looking to query data from a distributed data store, then why don't you consider storing your data in HBase, which is a distributed database built on top of Hadoop and HDFS? It stores data in HDFS in the background and gives you query semantics like a big database, so you don't have to worry about issuing queries to the right datanode; HBase (also known as the Hadoop database) takes care of that for you.
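As a small illustration of those query semantics (not part of the original answer), here is a hedged sketch of random writes and reads against HBase via the happybase Python client. Table, column-family, row-key and host names are made up, and a Thrift gateway is assumed to be running on the cluster:

```python
import happybase

# Hypothetical Thrift gateway host and table name.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("content")

# Random write: one row keyed by content id, values stored in a 'meta' column family.
table.put(b"content#A1B2C3", {
    b"meta:title": b"Some Song",
    b"meta:artist": b"Some Artist",
    b"meta:bit_rate_kbps": b"320",
})

# Random read: fetch a single row by key without caring which region server holds it.
row = table.row(b"content#A1B2C3")
print(row[b"meta:title"])
```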
If your data is time-series data, you can also consider OpenTSDB for easy querying and storage on top of HBase; it is a wrapper around HBase that provides easy tag-based query semantics and integrates nicely with GNUplot to give you graph-like visualizations of your data.
HBase is very well suited for random reads/writes to a very large distributed data store. However, if your queries operate on bulk reads/writes, Hive may be a better fit for your case. Like HBase, it is built on top of Hadoop and HDFS, and it converts each query into underlying map-reduce jobs. The best thing about Hive is that it provides SQL-like semantics, so you can query it much as you would a relational database.
As far as the organization of data and a basic introduction to Hive's features are concerned, you may like to go through the following points:
Hive adds structure to the data stored on HDFS. The schema of its tables is stored in a separate metadata store, and it converts SQL-like queries into multiple map-reduce jobs running on HDFS in the background.
Traditional databases follow a schema-on-write policy: once a schema is designed for a table, every write is checked against that pre-defined schema at write time, and if the data does not conform, the write is rejected.
Hive is the opposite: it uses a schema-on-read policy. Both policies have their own trade-offs. With schema on write, loads are slower because schema conformance is verified at load time, but queries are faster because the data can be indexed on the predefined columns. However, there are cases where the indexing cannot be specified when the data is first loaded, and this is where schema on read comes in handy: it lets you have two different schemas on the same underlying data, depending on the kind of analysis required.
Hive is well suited for bulk access and bulk updates of data, since an update requires a completely new table to be constructed. Also, query time is slower compared to traditional databases because of the absence of indexing.
Hive stores the metadata into a relational database called the “Metastore”.
There are 2 kinds of tables in Hive:
Managed tables - The data file for the table is moved into the Hive warehouse directory on HDFS (or any other Hadoop filesystem). When the table is dropped, both the metadata and the data are deleted from the filesystem.
External tables - Here data can be added to the table lazily. No data is moved to the Hive warehouse directory, and the schema/metadata is loosely coupled to the actual data. When the table is dropped, only the metadata is deleted and the actual data is left untouched. This is helpful if you want the data to be used by multiple databases, or when you need multiple schemas on the same underlying data.
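A hedged sketch of what those two flavours look like in practice, issued through the PyHive client; the HiveServer2 host, table names, columns and HDFS path are all made up:

```python
from pyhive import hive

# Hypothetical HiveServer2 host and credentials; the DDL itself is standard HiveQL.
cursor = hive.Connection(host="hiveserver2-host", port=10000, username="etl").cursor()

# Managed table: loading data moves the files into the Hive warehouse directory,
# and DROP TABLE would delete both the metadata and the data.
cursor.execute("""
    CREATE TABLE page_views (user_id STRING, url STRING, view_ts TIMESTAMP)
    STORED AS ORC
""")

# External table: the data stays where it is (here an arbitrary HDFS path),
# and DROP TABLE only removes the metadata.
cursor.execute("""
    CREATE EXTERNAL TABLE page_views_raw (user_id STRING, url STRING, view_ts TIMESTAMP)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/page_views_raw'
""")
```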