Elastic search, multiple indexes vs one index and types for different data sets? - database

I have an application developed using the MVC pattern and I would like to index now multiple models of it, this means each model has a different data structure.
Is it better to use mutliple indexes, one for each model or have a type within the same index for each model? Both ways would also require a different search query I think. I just started on this.
Are there differences performancewise between both concepts if the data set is small or huge?
I would test the 2nd question myself if somebody could recommend me some good sample data for that purpose.

There are different implications to both approaches.
Assuming you are using Elasticsearch's default settings, having 1 index for each model will significantly increase the number of your shards as 1 index will use 5 shards, 5 data models will use 25 shards; while having 5 object types in 1 index is still going to use 5 shards.
Implications for having each data model as index:
Efficient and fast to search within index, as amount of data should be smaller in each shard since it is distributed to different indices.
Searching a combination of data models from 2 or more indices is going to generate overhead, because the query will have to be sent to more shards across indices, compiled and sent back to the user.
Not recommended if your data set is small since you will incur more storage with each additional shard being created and the performance gain is marginal.
Recommended if your data set is big and your queries are taking a long time to process, since dedicated shards are storing your specific data and it will be easier for Elasticsearch to process.
Implications for having each data model as an object type within an index:
More data will be stored within the 5 shards of an index, which means there is lesser overhead issues when you query across different data models but your shard size will be significantly bigger.
More data within the shards is going to take a longer time for Elasticsearch to search through since there are more documents to filter.
Not recommended if you know you are going through 1 terabytes of data and you are not distributing your data across different indices or multiple shards in your Elasticsearch mapping.
Recommended for small data sets, because you will not waste storage space for marginal performance gain since each shard take up space in your hardware.
If you are asking what is too much data vs small data? Typically it depends on the processor speed and the RAM of your hardware, the amount of data you store within each variable in your mapping for Elasticsearch and your query requirements; using many facets in your queries is going to slow down your response time significantly. There is no straightforward answer to this and you will have to benchmark according to your needs.

Although Jonathan's answer was correct at the time, the world has moved on and it now seems that the people behind ElasticSearch have a long term plan to drop support for multiple types:
Where we want to get to: We want to remove the concept of types from Elasticsearch, while still supporting parent/child.
So for new projects, using only a single type per index will make the eventual upgrade to ElasticSearch 6.x be easier.

Jonathan's answer is great. I would just add few other points to consider:
number of shards can be customized per solution you select. You may have one index with 15 primary shards, or split it to 3 indexes for 5 shards - performance perspective won't change (assuming data are distributed equally)
think about data usage. Ie. if you use kibana to visualize, it's easier to include/exclude particular index(es), but types has to be filtered in dashboard
data retention: for application log/metric data, use different indexes if you require different retention period

Both the above answers are great!
I am adding an example of several types in an index.
Suppose you are developing an app to search for books in a library.
There are few questions to ask to the Library owner,
Questions:
How many books are you planning to store?
What kind of books are you going to store in the library?
How are you going to search for books?
Answers:
I am planning to store 50 k – to 70 k books (approximately)
I will have 15 k -20 k technology related books (computer science, mechanical engineering, chemical engineering and so on), 15 k of historical books, 10 k of medical science books. 10 k of language related books (English, Spanish and so on)
Search by authors first name, author last name, year of publish, name of the publisher. (This gives you the idea about what information you should store in the index)
From the above answers we can say the schema in our index should look somewhat like this.
//This is not the exact mapping, just for the example
"yearOfPublish":{
"type": "integer"
},
"author":{
"type": "object",
"properties": {
"firstName":{
"type": "string"
},
"lastName":{
"type": "string"
}
}
},
"publisherName":{
"type": "string"
}
}
In order to achieve the above we can create one index called Books and can have various types.
Index: Book
Types: Science, Arts
(Or you can create many types such as Technology, Medical Science, History, Language, if you have lot more books)
Important thing to note here is the schema is similar but the data is not identical. And the other important thing is the total data you are storing.
Hope the above helps when to go for different types in an Index, if you have different schema you should consider different index. Small index for less data . big index for big data :-)

Related

How to store datapoints in the magnitudes of 1+ trillion?

So, I have astronomical spectroscopy data in the following format:
{
"molecule": "CO2",
"blahblah":
"5 more simple fields"
"arrayofvalues": [lengths can go up to 2 million]
}
of this data, I have 600,000 files, so that means that there are 1 trillion individual datapoints that I want to search through, and do computations with.
So can someone please direct me to a source of maybe bigData or bigQueries on how I can efficiently lookup this data for computations and graphing? I want to like for example search certain molecules, under certain condition, what data they show etc.
I want to make a website where people can pick some variables, and a value range, and get graphical or textual data.
Now I tried to put some of this stuff on PostgresQL, but obviously when I do a get request, (and store even just 5 files) it will crash Postman, because its too much data.
Without knowing more details, you can take advantage of the data modeling options available in bigquery, such as:
nested data
arrays and structs
partitioned tables
clustering
Take a look at the data types: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
And also the partitioning and clustering techniques.
https://towardsdatascience.com/how-to-use-partitions-and-clusters-in-bigquery-using-sql-ccf84c89dd65?gi=cd1bc7f704cc

Flat data with struct type vs document store

I know this is a 'soft' question, which is usually frowned upon on SO, but I have been using BigQuery to do data analysis on (obviously) flat data, which contains both structs and repeated data. Let's just use a very basic example, a row might look like this:
ID
Title (str)
ReleaseYear (int)
Genres (str[])
Credits (struct[])
And an example piece of data might look like:
{
"ID": "T-1997",
"Title": "Titanic",
"ReleaseYear": 1997,
"Genres": ["Drama", "Romance"],
"Credits": {
"Actors": ["Leonardo DiCaprio", "Kate Winslet"],
"Directors": ["James Cameron"]
}
}
My question is basically what type of operations or queries can be done in a native document store, such as MongoDB or CouchBase, that couldn't be done in a relational DB that supports arbitrarily-nested data. In other words, my assumption (and I hope I'm wrong or misguided) is that as long as a DB supports structs, it can do everything that a document-store can do. If not, what are some places where it is either: (1) something that can be done in MongoDB (or any other document-store) that cannot be done in BigQuery (or any other database that supports structs)? and (2) something that can be done much more easily in MongoDB that in a relational DB?
what type of operations or queries can be done in a native document
store, such as MongoDB or CouchBase, that couldn't be done in a
relational DB that supports arbitrarily-nested data.
Even if does support arbitrarily nested data, BigQuery allows limited nesting compared to MongoDB .MongoDB supports more levels of nesting.
In BigQuery, your schema cannot contain more than 15 levels of nested STRUCTs. MongoDB supports unto 100 levels of nesting for BSON documents.
In other words, my assumption (and I hope I'm wrong or misguided) is
that as long as a DB supports structs, it can do everything that a
document-store can do.
Not exactly - nested columns are columns within columns. But sharding in an RDBMS is a complex endeavor compared to a NoSQL database like Mongo. Technically you can do, but it wasn't designed for the same purpose. Its like using a wrench as a hammer - sure you can, but its purpose was something different. You should use the right tool for the right purpose.
If not, what are some places where it is either: (1) something that
can be done in MongoDB (or any other document-store) that cannot be
done in BigQuery (or any other database that supports structs)? and
(2) something that can be done much more easily in MongoDB that in a
relational DB?
The crux of the matter is, an RDBMS may tack on features to "technically" allow you to do some things that you can do in a NoSQL database. But it doesn't mean it may work just as well. For example, because of the features that make an RDBMS an RDBMS (ACID compliance, transactions etc), there will always be an additional performance hit compared to a NoSQL database. If an RDBMS removes these features, then it is no longer an RDBMS!
This answer illustrates how MongoDB achieves better performance because it doesn't need to support RDBMS features :
https://softwareengineering.stackexchange.com/questions/54373/when-would-someone-use-mongodb-or-similar-over-a-relational-dbms
MongoDB has a lower latency per query & spends less CPU time per query because it is doing a lot less work (e.g. no joins,
transactions).
As a result, it can handle a higher load in terms of queries per second and is thus often used if you have a massive # of users.
MongoDB is easier to shard (use in a cluster) because it doesn't have to worry about transactions and consistency. - MongoDB has a
faster write speed because it does not have to worry about
transactions or rollbacks (and thus does not have to worry about
locking).
MongoDB does not have a schema in case you have a special use case that can take advantage of that.
Another feature is sharding - sharding is easier with mongodb because it doesn't need to support many of the features which make an RDBMS an RDBMS, such as being ACID compliant. In contrast, sharding is complex for an RDBMS because an RDBMS must remain ACID compliant.
Take a look at the following two images:
The speed boat would out perform the "amphibious car" in the water 10/10 times. The amphibious car technically can navigate in water, but it wasn't designed to, hence is much slower and unsuited for its purpose.
Like wise, look at the difference in aerodynamics of the speed boat and this sweet automobile. Even if you tacked on wheels to the boat, its not going to perform as well as this car on land. (As an analogy you could say that NoSQL databases don't do joins - you have to implement them yourself. - but will it perform better than an RDBMS for join heavy operations ?)
The point I'm making with the analogies, is that each kind of database was initially designed for a specific goal, and over time features have been added to try and make it solve problems it was not designed for (hence it doesn't do it as well as something specifically designed for that purpose).
Hence in your question, even if BigQuery or some RDBMS can do something, it doesn't mean that you should use them for the job. The same applies for NoSQL databases. You should use the best tool for the job.
Disclaimer: I don't have experience in MongoDB or CouchBase. My answer is based on BigQuery's capability on STRUCT.
Performance
BigQuery's STRUCT is optimized for query. For example, if you query select a.nested_b.nested_c.nested_d from table_t, the query only scans data for the left STRUCT field nested_d, it is fast and cheap.
Usability
If your data is write-once or append-only, then STRUCT column is comparable with document store AFAIK.
But if you want to update only certain nested field later, nested STRUCT makes it pretty difficult to do, because there is no way to update single item in REPEATED field, you have to load the whole array, scan and change, and repack to update a column. You will be writing something like:
UPDATE table
SET Credits.Actors = (SELECT ARRAY_AGG(...) FROM UNNEST(Credits.Actors) WHERE ...)
WHERE ...
It may become a bigger problem when there is array of struct of arrays (and even more nested levels). Based on my understanding of document store, updating single nested field of a document should be easier than this. Basically, this is kind of the price you have to pay to get the performance benefit mentioned earlier.

Elastic Search - MultiLog design

we are trying to implement a new log system for our IoT device, different applications in the cloud (api, spa, etc). We are trying to design the "Schema" to be the most efficient as possible and we feel there are many good solutions, but it's hard to select one.
Here is a general structure : under the devices node we have our 3 different kinds of IoT devices and similar for infra : different applications and more.
So we were thinking of creating one index for each blue circle and create a hierarchical naming with our indexes so we can take advantage of the wildcard when execute search.
For example :
logs-devices-modules
logs-devices-edges
logs-devices-modules
logs-infra-api
logs-infra-portal
And for mapping, we have different log type in each index and should we map only the common field or everything ? Should we map common field and let the dynamic mapping for the logs type specifics?
Please share your opinion and tips if you have !
Ty.
I would generally map everything to ECS, since Kibana knows the meaning of many fields and it aligns with other inputs.
How much data do you have and how different are your fields? If you don't have too much data (every shard should have >10GB — manage with rollover / ILM ideally) and less than 100 fields in total, I would go for a single index and add a field with with the different names, so you can easily filter on that. Though different retention lengths of the data would favor multiple indices, so you will have to pick the right tradeoffs for your system.

Location based horizontal scalable dating app database model

I am assessing backend for location base dating app similar to Tinder.
App feature is showing nearby online users (with sex, and age filter)
Some database engines in mind are Redis, Cassandra, MySQL Cluster
The app should scale horizontally by adding node at high traffic time
After researching, I am very confused whether there is a common "best practice" data model, algorithm for this.
My approach is using Redis Cluster:
// Store all online users in same location (city) to a Set. In this case, store user:1 to New York set
SADD location:NewYork 1
// Store all users age to Sorted Set. In this case, user:1 has age 30
ZADD age 30 "1"
// Retrieve users in NewYork age from 20 to 40
ZINTERSTORE tmpkey 2 location:NewYork age AGGREGATE MAX
ZRANGEBYSCORE tmpkey 20 40
I am inexperienced and can not foresee potential problem if scaling happen for million of concurrent users.
Hope any veteran could shed some light.
For your use case, mongodb would be a good choice.
You can store each user in single document, along with their current location.
Create indexes on fields you want to do queries on, e.g. age, gender, location
Mongodb has inbuilt support for geospatial queries, hence it is easy to find users within 1 km radius of another user.
Most noSQL Geo/proximity index features rely on the GeoHash Algorithm
http://www.bigfastblog.com/geohash-intro
It's a good thing to understand how it works, and it's really quite fascinating. This technique can also be used to create highly efficient indexes on a relational database.
Redis does have native support for this, but if you're using ElastiCache, that version of Redis does not, and you'll need to mange this in your API.
Any Relational Database will give you the most flexibility and simplest solution. The problem you may face is query times. If you're optimizing for searches on your DB instance (possibly have a 'search db' separate to profile/content data), then it's possible to have the entire index in memory for fast results.
I can also talk a bit about Redis: The sorted set operations are blazingly fast, but you need to filter. Either you have to scan through your nearby result and lookup meta information to filter, or maintain separate sets for every combination of filter you may need. The first will have more performance overhead. The second requires you to mange the indexes yourself. EG: What if someone removes one of their 'likes'? What if they move around?
It's not flash or fancy, but in most cases where you need to search a range of data, relational databases win due to their simplicity and support. Think of your search as a replica of your master source, and you can always migrate to another solution, or re-shard/scale if you need to in the future.
You may be interested in the Redis Geo API.
The Geo API consists of a set of new commands that add support for storing and querying pairs of longitude/latitude coordinates into Redis keys. GeoSet is the name of the data structure holding a set of (x,y) coordinates. Actually, there isn’t any new data structure under the hood: a GeoSet is simply a Redis SortedSet.
Redis Geo Tutorial
I will also support MongoDB on the basis of requirements with the development of MongoDB compass you can also visualize your geospatial data.The link of mongodb compass documentation is "https://docs.mongodb.com/compass/getting-started/".

"Parametrized" database model & backend storage system as well as data mining manipulation

I have implicitly made this a community wiki seeing that the answers can be quite broad.
I'm working with a start-up company to accomplish the following goal.
In a medical research, a patient medical record can have infinite amount of data regarding a patient for a specific diagnosis, e.g. a smoker has a higher chance of catching lung cancer but that doesn't necessarily mean that a non-smoker can catch lung cancer. My goal is to create/use a database model that can deal with such parameters.
Now, I also have to come up with ways to data mine these parametrized data to create statistical data e.g. see the trends on all 40 year old female who suffered from lung cancer. That report can be generic, (graph, tabular, etc.) where doctors can see trends or analyse possible solutions that can work....
My questions are:
1) Which Database systems allows for parametrized backend storage (e.g. Cassandra) that can easily be used in java, and is very efficient in data retrieval, linkage, etc. We are dealing with high amount of patient records per states.
2) What algorithms or AI techniques can I use for data mining? Is there any mining techniques out there that can help me do this?
PS How does Google Analytics deal with parametrised data?
PPS A parametrized data is data which has a key, and data where data can be value, another key-value pair, a list of value, a set of parametrized data (organized, unorganized)
I'm looking forward for suggestive answers! :-D
I'll try to answer your first question only.
Cassandra is a key-value datastore (in your case parametrized). If you use Cassandra, you need higher computation time to derive complex reports. The reason being - it stores data in raw format. Cassandra like NOSQL databases are good if you want to scale very very big. They are eventually consistent and compromise on data replication and latency.
In your case as a patient can have data in infinitely any form, try to fit the model of a Triple Store (Semantic Web frameworks like Jena, OpenSesame, etc). They allow you to have a lousy data structures and can be molded at runtime. Also, their querying engines (SPARQL, SeRQL) give you more power than NOSQL stores (like Cassandra), but these querying capabilities are obviously lesser than RDBMS.
For this question, this is how we have implemented this.
We created a keyspace called medical and a supercolumn family called patient.
under the supercolumn family, we have a general supercolumn which basically store the patient details, and another supercolumn called operation to keep recording of the user occupation.
Don't forget that the general supercolumn keeps record of the patient as he/she comes to the doctor. That way, we know exactly the patient's exact condition before, during and after operation.
I know some data can be duplicates, but no supercolumns can be identical as there is no way that you can have exactly 2 different patient of identical attributes and sickness.
So basically, Cassandra allows 3 layers of abstraction, Keyspace, Column/Supercolumn family, Column/Supercolumn.
Hope this can help somebody.

Resources