What is a good web application SQL Server data mart implementation in ElasticSearch? - sql-server

Coming from a RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, geographic location that pertains to the rest of the searchable record, title and description (which are free text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server, we have other tables for user configuration, based on user ID's, I take it that I could then create new ES types in user indices for configuration. Is this a recommended pattern? I would rather not have two data base systems for this web application.
Suggestions welcome, thank you.

I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data where the totality of the information resides in each document unlike with a star schema. If you can live with denormalized, that is fine but I assume that since you already have star schema, denormalized data is not an option because you don't want to go and update millions of documents each time the location name change for example(if i understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think of how to put star schema like data in a system like Elasticsearch. There are a few options in the documentation, the main ones i focused were
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . In nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may make it not optimal becasue the document will be huge and again, a change in the location name will require to update the entire document. So this is better but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the User records would be kepts in separate indices similarly to a relational database. This seems to be the right modeling for what we need. The only major issue with this option is the fact that Kibana 4 does not provide ways to manipulate/aggregate documents based on parent/child relationship as of this writing. So if you main driver for using Elasticsearch is Kibana(this was mine), that kind of eliminates the option. If you want to benefit from the elasticsearch speed as an engine this seems to be the desired option for your use case.
In my opinion once you got right the data modeling all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize that is by having a separate cluster of 3 elasticsearch nodes behind a Load Balancer(all of that is hosted on a cloud) and then have all your Web Applications connect to that cluster using the Elasticsearch API.
Hope that helps.

Related

Considering database options for new web application

I know this is a topic that's been addressed ad nauseam but I also know there are people who enjoy opining about databases so I figured I'd just go ahead and ask the question again.
I'm building out a web application that on a very basic level displays a list of objects that meet a user-defined search criteria. The primary function of the application will be to provide an interface by which a user can perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data.
Of course there will be ancillary information too: user accounts, lookup tables etc.
My background is entirely in relational database development, primarily SQL Server with a little bit of MySQL. However, I'm intrigued by the possible applicability of an object-relational approach or even a full-on document database. Without the experience of working in those paradigms I'm not sure what I might be getting myself into.
Here are some further considerations that may affect the decision:
The schema will likely evolve considerably over time as more properties and search options are added, creating the typical versioning/deployment challenges. This is the primary reason why I would consider a document database.
The application itself will likely be written in Node/Express with an Angular or React front-end using Typescript and so the code will be interacting with data in json format. In other words, regardless of what comes back from the db server, we want json on the code level. (Another case for a doc database.)
There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha. This would seem to me to be a strong case against a document db.
A potential use case would involve a user adjusting a slider control (let's say it controls high and low price parameters or a distance range). The selected parameters would then be packaged as a json object and sent to a search controller, which would then pass these parameters to the db server on change and expect a list of objects in return. In other words, the user would generally not be pushing a button to refine search criteria. The search update would happen each time they change a parameter.
I don't know the extent to which this a thing or not, but it would also be great if there were some way to leverage technology that could cache search results and then search within those results if the search were narrowed, thus performing a second search only on the smaller subset of the first search rather than the entire universe of available objects.
I guess while I'm at it I should ask about ORMs. Also something I'm generally not some experienced with (I've used Entity Framework a bit) but wondering if I should expand my horizons.
Thanks and I look forward to your opinions!
I think that your requirement of "There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha" makes a strong case for using a relational database to store the data.
Leveraging an ORM that can support data in JSON format would seem to be ideal in your use case. Schema evolution for a system in production would sure be a challenge (not unsurmountable though) but it would be nice to use an ORM product that can at least easily support schema evolution during the development stage when things are more likely to change and evolve rapidly.
Given the kind of queries you would typically be issuing (e.g., adjusting a slider control), an ORM that supports prepared statements that can have range criteria would be more efficient.
Also, given your need to "perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data", an ORM product that can easily support one-to-one, one-to-many, and many-to-many relationships and path-expressions in search criteria should simplify your development process.

Full Text Search in ERP application - is Apache Lucene\Solr the right choice?

Im currently investigating the tools necessary to add a fast, full text search to our ERP SAAS application with the aim of providing a single search entry point in the application that could search over the many different kind of objects that compose the domain of the software.
​
The application (a Spring Java web application) is backed by a Sql Server RDBMS (usign Hibernate as ORM), there are hundreds of different tables, dozens of which (but maybe more) should be searchable (usually there are one or more varchar columns in evenry table that should be indexed/searched).
Think for example of a single search bar where i can search customers, contracts, employees, articles..), this data is also very often updated (new inserts, deletes, updates..)
​
I found this article (www.chrisumbel.com/article/lucene_solr_sql_server) that shows how to connect a Sql Server db with Solr, posting a query example on the database that extracts the data used by Solr during the data import.
Since we have dozens (and more) tables containing the searchable data that means that we should pass for a first step that integrate all the sql queries that extracts this data with Solr, in order to build the index?
Second question: not all the data is searchable by everyone (permissions and ad hoc filters), so how could we complement the full text search provided by Solr with the need of putting in place more complex queries (join on other tables for example) on this data?
​
Thanks
You are nearly asking for a full blown consulting project :-) But a few suggestions are possible.
Define Search Result Types: Search engines use denormalized data, i.e. you won't do any joins while querying (if you think you do, stick to your DB:-) That means you need to do the necessary joins while filling the index. This defines what you can search for. Most people "just" index documents or log-lines, so there is just one type of result. Sometimes people's profiles are included, sometimes a difference is made between results from different source systems where the documents come from, but in the end, there is a limited number of types of search results. And even more, they are nevertheless indexed into one and the same schema (where schemas are very malleable for search engines).
Index: You know your SQL statements to extract your data. Converting to JSON and shoveling it into a search engine is not difficult. One thing to watch out for: while your DB changes, you keep indexing, incremental or full "crawl" depends on how much logic you want to add. The most tricky part is to get deletes on the DB side into the index. If its gone, its gone: how do you know there was something that needs to be purged from the index :-)
Secure Search Since you don't really join, applying access rights at query time amounts requires two steps. During indexing, write principle (group, user) names of those who may read your search result. At query time, get the user ID and expand it, recursively, to get all groups of the user. Add this as a query filter. Make sure to cache the filter or even pre-compute for all users quite regularly and store it in a fast store (the search index is one place, DB would do too:-) Obviously you need to re-index if access rights change. The good thing is: as long as things only change in LDAP/AD, you don't need to index the data, only the expanded groups of the affected users.
ad hoc filters If you want to filter for X, put X as a field into the index. At query time, apply the filter.

What database to choose for the hierarchical content with relations?

I want to have a reviews-like website, but not only with reviews, other types of content as well. The design of the website combines both hierarchical structure (each content object/record/entity has a parent - kind of container), and relations - each content object/record/entity has a number of related other objects:
an author of the content (i.e. user)
related comments (with their own relations, particularly authors)
item being reviewed as a separate record in DB
images from the gallery
One of the most important things is performance. Relations used to be inefficient in the NoSQL, as I've read on the net and already tried out with other projects. On the other hand, the general design, apart from the relations mentioned, has an obvious content repository like structure, which is the exact reflection of hierarchical arrangement of objects (documents, articles, reviews) websites are designed. Also, I really like the loose structure of the records in NoSQL. Yet, I don't care about (nor use) things like versioning and other things related to NoSQL.
So I want to combine both wordls: hierarchical and relational within one project, or actually, its model. Apart from it, I want the project to be restful, so that a mobile apps could use the same content available through the API. Another requirement is that the content should be searchable.
What type of storage would you choose for a project like this?
I decided to go with the Graph DBs. Here's why I rejected the other ones:
I don't want to use NoSQL (Documents), since relations are hard to maintain and often require extra code infrastructure (often custom) to handle them, see e.g. Diaspora NoSQL problems
I don't want to use RDBMS, since the structure based DBs impose well known limitations and doesn't reflect the domain
I rejected the key-value and big table DBs as they have very specific use cases
Graph Databases have been used in number of content-oriented projects, and appeared to be doing the job surprisingly well.
You can easily model a hierarchical data structure in SQL with the following (using PostgreSQL):
CREATE TABLE comments (
id INTEGER,
parent INTEGER,
content VARCHAR(1024)
)
Where parent refers to the id of the parent comment.
If you are after a NoSQL database that exposes a RESTful interface, you could consider CouchDB.
You can then replicate CouchDB to Elasticsearch for more robust searching.
But if your data is relational then I would very much recommend you consider a SQL database like PostgreSQL first.

couchdb multiple databases

I'm used to working with mysql but for my next series of projects CouchDB (NoSQL) seems to be the way to go, basically to avoid EAV in mysql and to embrace all the cool features it has to offer.
After lots of investigation and reading documentation etc, there is one thing I don't seem to understand quite well.
Lets assume I host three web applications on my server and thus need three databases accordingly. For instance one is a webshop with product and invoice tables, one is a weblog with article and comment tables and another one is a web based game with game stats tables (simplification obviously).
So I host multiple sites on one installation of mysql, and each application I run on my server gets its own database with tables, fields and content.
Now, with CouchDb I want do the exact same thing. The problem seems to be that creating a database in CouchDb, is more similar to creating a table in mysql. I.e. I create databases called 'comments', 'articles' etc. for my weblog and inside I create a document per article or a document per comment.
So my question is: how can I separate my data from multiple web applications on one CouchDB installation?
I think I am doing something fundamentally wrong here but hopefully one of you guys can help me get on the right track.
In CouchDB, there's no explicit need to separate unrelated data into multiple databases. If you've constructed your documents and views correctly, only relevant data will appear in your queries.
If you do decide to separate your data into separate databases, simply create a new database.
$ curl -X PUT http://localhost:5984/somedb
{"ok":true}
From my experience with couchdb, separating unrelated data into different databases is very important for performance and also a no-brainer. The view generation is a painful part of couchdb. Everytime the database is updated, the views (think of them as indexes in a traditional relational sql db) have to be regenerated. This involves iterating every document in the database. So if you have say 2 million documents of type A, and you have 300 documents of type, B. And you need to regenerate a view the queries type B, then all 2 million and 300 hundred enumerations will be performed during view generation and it will take a long time (it might even do a read-timeout).
Therefore, having multiple databases is a no-brainer when it comes to keeping views (how you query in couchdb, an obviously important and unavoidable feature) updated.
#Zombies is extremely right about performance. CouchDB isn't suited to perform on a lot of documents in a single database. If you need to perform on, let's say, more than 5000 documents, MongoDB will outperfom CouchDB.
Views in CouchDB are essential, but painful, with limited JavaScript options to build your queries (don't even think about document references or nested objects). Considering having multiples databases for different documents is quite the solution. Some people will say something like:
CouchDB is a NoSQL database, and as such you should not need to order your documents nor filtering them using something else than views. NoSQL database core feature is the ability to store scheme-less documents [...]
And I find it very annoying when you need to find a workaround to performance and querying. You should not mind creating a few databases to separate your data if it allows you to split your data, it will still be on a 'single CouchDB installation'. Don't forget that CouchDB is suited for small databases. The smallest a database will be, the fastest your query will be, the better the performance will be.
(I do not know if there are any english mistakes, pardon me if so)
EDIT
Some companies like ArangoDB made a comparison between themselves, MongoDB and CouchDB, and it is confirming my saying about the number of documents. This is the result:
There are a lot of other resources on their website. On the other hand, this statement was a personnal experience, and from benchmarking them for my internship, with a .PHP benchmarking software I found on the Internet. The results are provided below:

Alternative to Apatar for scripted data migration

I'm looking for the fastest-to-success alternative solution for related data migration between Salesforce environments with some specific technical requirements. We were using Apatar, which worked fine in testing, but late in the game it has started throwing the dreaded socket "connection reset" errors and we have not been able to resolve it - it has some other issues that are leading me to ditch it.
I need to move a modest amount of data (about 10k rows total) between several sandboxes and ultimately to a production environment. The data is spread across eight custom objects. There is a four-level-deep master-detail relationship, which obviously must be preserved.
The target environment tables are 100% empty.
The trickiest object has a master-detail and two lookup fields.
Ideally, the data from one table near the top of the hierarchy should be filtered by a simple WHERE, and then children not under matching rows not migrated, but I'll settle for a solution that migrates all the data wholesale.
My fallback in this situation is going to be good old Data Loader, but it's not ideal because our schema is locked down and does not contain external ID fields, so scripting a solution that preserves all the M-D and lookups will take a while and be more error prone than I'd like.
It's been a long time since I've done a survey of the tools available, and don't have much time to do one now, so I'm appealing to the crowd. I need an app that will be simple (able to configure and test very quickly), repeatable, and rock-solid.
I've always pictured an SFDC data migration app that you can just check off eight checkboxes from a source environment, point it to a destination environment, and it just works, preserving all your relationships. That would be perfect. Free would be nice too. Does such a shiny thing exist?
Sesame Relational Junction seems to best match what you're looking for. I haven't used it, though; so, I can't comment on its effectiveness for what you're attempting.
The other route you may want to look into is using the Bulk API or using the Data Loader CLI with Task Scheduling.
You may find this information (below), from an answer to a different question, helpful.
Here is a list of integration services (other than Apatar):
Informatica Cloud
Cast Iron
SnapLogic
Boomi
JitterBit
Sesame Relational Junction
Information on other tools, to integrate Salesforce with other databases, is available here:
Salesforce Web Services API
Salesforce Bulk API
Relational Junction has a unique feature set that supports cloning, splitting, and merging of Salesforce orgs, and will keep the relationships intact in a one-pass load. It works like this:
Download source org to an empty database schema (any relationship DBMS)
Download target org to a second empty database schema
Run some scripts to condition the data; this varies by object. Sesame provides guidance and sample scripts, but essentially you have to set a control field to tell Relational Junction to create or update Salesforce. This is also where you may need to replace source ID's with target ID's if some objects have been pre-populated during sandbox creation
Replicate the second database to the target org
Relational Junction handles the socket disconnects, timeouts, and whatever havoc happens during the unload/reload process gracefully and without creating duplicates.
This process was developed for a proof of concept at a large Silicon Valley network vendor in 2007, who became a customer. The entire down and up of 15 GB of data took 46 hours, plus about 2 days of preparation.

Resources