Elasticsearch database: Where does Elasticsearch store data?

I am curious to know how exactly Elasticsearch manages data on its server. Does it have a built-in NoSQL database, does it store data in files, or does it use an existing database such as MySQL or MongoDB?

Elasticsearch internally uses Lucene, which stores the actual data in segments (files on the file system) and uses an inverted index to enable fast search.
Please refer to the official Elasticsearch blog post on Elasticsearch from the bottom up, which explains the above statement in detail with examples.
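To make the inverted index idea concrete, here is a minimal toy sketch in Python. It is not how Lucene is implemented (real segments are immutable, compressed on-disk structures); the function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents that contain every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Elasticsearch uses Lucene",
    2: "Lucene stores segments on disk",
    3: "Segments hold an inverted index",
}
index = build_inverted_index(docs)
```

A lookup such as search(index, "lucene segments") only touches the postings of the two terms instead of scanning every document, which is the reason the structure makes search fast.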

Related

Best way to store a blog post in DynamoDB?

I'm creating a blog section for a website with Amazon Web Services. I'm comparing database solutions, and I came across DynamoDB. I'd like to know if it'd be a good idea to use DynamoDB for storing a blog post of more than 1500 words (6KB approximately). Should I save the article as a file in S3 instead and store its link in my DynamoDB table? What is the right way to implement this?
Thanks in advance
DynamoDB is a key-value, NoSQL database that delivers single-digit millisecond performance at scale. It is a fully managed durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications. More information here.
You can certainly use DynamoDB to build a blog application. You would need to model your data and, depending on the language you use, you can use a DynamoDB mapper. For example, if you build your application with the Spring Framework, you can use the Enhanced Client.
Assuming you do build with the Spring Framework, you could build it much like this tutorial, replacing the relational database with DynamoDB. In my view, using DynamoDB rather than reading a file stored in Amazon S3 is the better way to proceed here.
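As a rough sanity check on the size question, the sketch below models a post as a single item and compares an approximate size against DynamoDB's documented 400 KB item limit. The attribute names and helper functions are hypothetical, and the size estimate is a simplification (attribute names plus string values as UTF-8 bytes), not the exact accounting DynamoDB uses.

```python
# Documented DynamoDB limit: a single item may be at most 400 KB.
DYNAMODB_MAX_ITEM_BYTES = 400 * 1024

def make_post_item(post_id, title, body):
    """Build a blog post as one item (attribute names are illustrative only)."""
    return {"post_id": post_id, "title": title, "body": body}

def approx_item_size(item):
    """Rough estimate: UTF-8 bytes of attribute names plus string values."""
    return sum(
        len(name.encode("utf-8")) + len(str(value).encode("utf-8"))
        for name, value in item.items()
    )

def fits_in_dynamodb(item):
    """True if the approximate size is within the item size limit."""
    return approx_item_size(item) <= DYNAMODB_MAX_ITEM_BYTES

# A post of roughly 6 KB, as described in the question.
small_post = make_post_item("post-1", "Hello", "x" * 6000)
# A hypothetical very large post that would need S3 instead.
huge_post = make_post_item("post-2", "Huge", "x" * 500_000)
```

Under these assumptions a 6 KB post fits with plenty of room, so keeping the body in the item is reasonable; storing the text in S3 and keeping only a link becomes interesting once posts approach the limit.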

Can I use Apache Jena and persist the ontology using Apache Solr

We have a cloud-based Java application that uses Oracle DB and Apache Solr for document indexing/searching. I need to implement an ontology and I intend to use Apache Jena. It's uncharted territory for me. According to the docs, it seems that using TDB we can use Oracle DB for storage/query, but it's not clear to me whether we can use Apache Solr for the same purpose. Is that possible? What are the pros/cons? Can you give me a brief comparison between TDB and Solr in that regard?
tl;dr You can do this, but it's obviously not meant to be this way.
The base question here is: can we store ontological data in something as flat as a Lucene/Solr index? Well, with enough work and dedication you can. I wrote a Lucene-based store for Topic Maps data several years ago. It earned me a master's degree in Comp. Sci. But that is not what you want, I suppose.
The Apache Jena extension TDB is a database of its own, designed for easy use in Jena. As far as I know, there is no such connector for using Solr as a store. If you insist on using Solr as the datastore, you will have to a) think hard about how to flatten the ontological data into index tables and b) implement the connector yourself.
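To illustrate what "flattening" might look like, here is a minimal Python sketch that groups triples into one flat, multi-valued document per subject, the kind of shape a Lucene/Solr index can hold. The function name, the "id" field and the sample triples are assumptions for illustration; this is not a Jena or Solr API.

```python
from collections import defaultdict

def flatten_triples(triples):
    """Group (subject, predicate, object) triples into one flat document per subject.

    Each predicate becomes a multi-valued field. Assumes no predicate is
    literally named "id", since that key is reserved for the subject here.
    """
    fields_by_subject = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        fields_by_subject[subject][predicate].append(obj)
    return [
        {"id": subject, **{pred: values for pred, values in fields.items()}}
        for subject, fields in fields_by_subject.items()
    ]

# Hypothetical ontology fragment.
triples = [
    ("ex:Dog", "rdfs:subClassOf", "ex:Animal"),
    ("ex:Dog", "rdfs:label", "Dog"),
    ("ex:Animal", "rdfs:label", "Animal"),
]
flat_docs = flatten_triples(triples)
```

The sketch also shows what is lost: the link from ex:Dog to ex:Animal is now just a string value, so graph-style queries (for example, following subclass chains) would have to be implemented by hand on top of the index.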
I'd say, go with TDB and if you want to do text search with Jena, use something like the TEXT QUERY extension.

Recommended Setup for BigData Application

I am currently working on a long term project that will need to support:
Lots of fast Read/Write operations via RESTful Services
An Analytics Engine continually reading and making sense of data
It is vital that the performance of the Analytics Engine not be affected by the volume of Reads/Writes coming from the API calls.
Because of that, I'm thinking that I may have to use a "front-end" database and some sort of "back-end" data warehouse. I would also need to have something like Elastic Search or Solr indexing the data stored in the data warehouse.
The Questions:
Is this a Recommended Setup? What would the alternative be?
If so...
I'm considering either Hive or Pig for the data-warehousing, and Elastic Search or Solr as a Search Engine. Which combination is known to work better together?
And finally...
I'm seriously considering Cassandra as the "front-end" database. What is the relation between Cassandra and Hadoop, and when/why should they be put to work together instead of having just Cassandra?
Please note, my intention is NOT to start a debate about which of these is better, but to understand how they can be put to work together more efficiently. If it makes any difference, the main code is being written in Scala and Java.
I truly appreciate your help. I'm basically learning as I go and all comments will be very helpful.
Thank you.
First let's talk about Cassandra
This is a NoSQL database with eventual consistency, which basically means that different nodes in a Cassandra cluster may hold different 'snapshots' of the data when there is a communication/availability problem within the cluster. The data will eventually become consistent, however.
Since you are considering it as a 'frontend' database, what you need to understand is how you will model your data. Cassandra can take advantage of indexes; however, you still need to define your access patterns upfront.
Normally there is no relation between Cassandra and Hadoop (except that both are written in Java); however, the Datastax distribution (enterprise version) has Hadoop support directly from Cassandra.
As a general workflow, you would read/write the most current data (let's say the last 24 hours) from your 'small' database that has enough performance (Cassandra has excellent support for this), and you would move anything older than X (older than 24 hours) to 'long-term storage' such as Hadoop, where you can run all sorts of MapReduce jobs etc.
As for text search, it really depends on what you need. Elasticsearch and Solr compete with each other. You can see for yourself how they compare here: http://solr-vs-elasticsearch.com/
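The hot/cold split described above can be sketched in a few lines of Python. The 24-hour window comes from the example in the text; the record shape, the "ts" field and the function names are assumptions, and in a real system the two lists would be written to Cassandra and to the long-term store respectively.

```python
from datetime import datetime, timedelta, timezone

# Window from the example above: data newer than this stays in the fast store.
HOT_WINDOW = timedelta(hours=24)

def choose_store(event_time, now):
    """Return "hot" for recent data, "cold" for data past the window."""
    return "hot" if now - event_time <= HOT_WINDOW else "cold"

def partition_for_archive(records, now):
    """Split records (dicts with a "ts" datetime) into (hot, cold) lists."""
    hot, cold = [], []
    for record in records:
        target = hot if choose_store(record["ts"], now) == "hot" else cold
        target.append(record)
    return hot, cold

# Fixed reference time so the example is deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
records = [
    {"id": 1, "ts": now - timedelta(hours=1)},
    {"id": 2, "ts": now - timedelta(hours=30)},
    {"id": 3, "ts": now - timedelta(hours=24)},
]
hot, cold = partition_for_archive(records, now)
```

A periodic job could run this partitioning and move the cold records to the archive, which keeps the analytics workload away from the store that serves API traffic.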
As for your third question, I think Cassandra is more of a database for storing data.
Hadoop provides a computation model that lets you analyze the large data held in Cassandra.
So it is very helpful to combine Cassandra with Hadoop.
There are also other options you can consider, such as combining Mongo with Hadoop, since Mongo supports mongo-connector between Hadoop and its data.
Also, if you have search requirements, you can use Solr and generate the index directly from Mongo.

solr - can I use it for this?

Is Solr just for searching, i.e. is it not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at Solr as an alternative, I see that you make your queries through HTTP requests.
My first thought was: how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding Solr, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data in the file system and does not provide the features of a relational database (ACID, nested entities, etc.).
The usual model is to use a relational database for your data management and replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.
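One way to picture "securing the URLs" is an allow-list in whatever proxy or application layer sits in front of Solr. The Python sketch below only lets requests through whose final path segment is the select (query) handler and rejects everything else, including the update handler. The function name, host and core name are hypothetical, and this is an illustration of the idea rather than a complete security setup; in practice Solr is usually also kept off the public network.

```python
from urllib.parse import urlsplit

# Handlers that public clients may call; "select" is Solr's query handler.
READ_ONLY_HANDLERS = {"select"}

def is_public_request_allowed(url):
    """Allow a request only if its last path segment is a read-only handler."""
    path = urlsplit(url).path.rstrip("/")
    handler = path.rsplit("/", 1)[-1]
    return handler in READ_ONLY_HANDLERS

# Hypothetical host and core name.
query_url = "http://search.example.com/solr/posts/select?q=title:solr"
update_url = "http://search.example.com/solr/posts/update?commit=true"
```

With a check like this in front of Solr, the website can still search, while inserts and updates are only possible from the internal process that replicates data from the relational database.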

Setting up Elasticsearch with PostgreSQL

Where do I find a how-to for setting up Elasticsearch with Postgres?
My fields will be about 350 MB each (yes, MB). I have a text output of all of the US Code and all decisions from all the courts, the Statutes at Large, pretty much everything you would find in a library, and I need to be able to do full-text searches and return the exact point in the field to the app, so that it can return the exact page in PDF form. Postgres can easily handle the datastore, but I've never used Elasticsearch and have no idea how it integrates into the indexing, etc.
As of 2015, there's ZomboDB (https://github.com/zombodb/zombodb). As the author, I'm a bit biased, but it's quite powerful. ;)
It's a Postgres extension and Elasticsearch plugin that allows you to "CREATE INDEX"s that use a remote Elasticsearch cluster, and it exposes a fairly powerful query language for performing full-text searches.
Because it's an actual index in Postgres, the ES cluster is automatically synchronized as you INSERT/UPDATE/DELETE records. As such, there's no need for asynchronous synchronization processes.
Additionally, because it's an actual index, it is transaction-safe, which means concurrent Postgres sessions will only see results that are consistent with their current transaction.
Here's a link to ZomboDB's tutorial. It should give you an idea of how easy ZomboDB is to use.
There is an application that you can use to import SQL Server, Oracle, PostgreSQL, MySQL, etc. into an Elasticsearch index.
http://code.google.com/p/ogr2elasticsearch/
Please let me know if you have any trouble building or using it. ~Adam
You can explore using pgsync.
PGSync is an open-source middleware (written in python) for syncing data from Postgres to Elasticsearch effortlessly. It allows you to keep Postgres as your source of truth and expose structured denormalized documents in Elasticsearch.
GitHub link: https://github.com/toluaina/pgsync
It's possible to insert/update/delete Postgres data in Elasticsearch without middleware other than the pgsql_http extension. Using triggers, you can get a pretty much real-time index update.
You can also query Elasticsearch and use the results within Postgres to do joins etc. with other tables/data in your database.
See the elasticsearch examples: https://github.com/sysadminmike/pgsql-http_examples
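Whichever approach does the syncing, the row changes eventually have to be turned into Elasticsearch write requests. As a minimal sketch, the Python function below converts a list of change events into a body for the Elasticsearch bulk API, which takes newline-delimited JSON with an action line followed by an optional document line. The change tuple format, index name and sample rows are assumptions for illustration; this is not code from any of the tools mentioned above, and actually sending the payload is left out.

```python
import json

def to_bulk_ndjson(index_name, changes):
    """Convert (operation, row_id, document) changes into a bulk API body.

    INSERT and UPDATE both become "index" actions (create or overwrite);
    DELETE becomes a "delete" action with no document line.
    """
    lines = []
    for operation, row_id, document in changes:
        if operation in ("INSERT", "UPDATE"):
            lines.append(json.dumps({"index": {"_index": index_name, "_id": str(row_id)}}))
            lines.append(json.dumps(document))
        elif operation == "DELETE":
            lines.append(json.dumps({"delete": {"_index": index_name, "_id": str(row_id)}}))
        else:
            raise ValueError(f"unknown operation: {operation}")
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical changes captured from a Postgres table.
changes = [
    ("INSERT", 1, {"title": "US Code Title 1"}),
    ("UPDATE", 2, {"title": "Statutes at Large"}),
    ("DELETE", 3, None),
]
payload = to_bulk_ndjson("documents", changes)
```

Batching changes this way is a common design choice because one bulk request is much cheaper than one HTTP request per row; a trigger-based setup trades that efficiency for lower latency.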
