SolrNet and many to many relations - solr

I haven't been able to find an example or documentation showing how to model a many-to-many relation in SolrNet, so I hope one of you experts has a clue or a link that can point me in the right direction.

There is no many-to-many relationship in Solr; in fact, there are no relationships at all. Solr's index is a flat structure, so you must denormalize your data, and how you do that depends on which searches you need to support. See http://wiki.apache.org/solr/SchemaDesign
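As a rough illustration of the denormalization described above, here is a minimal Python sketch (not SolrNet code). The book/author data, the field names, and the `denormalize` helper are all hypothetical; the idea is only that each indexed document carries copies of the related values, which in Solr would typically live in a multi-valued field.

```python
# Hypothetical relational data: books and authors linked many-to-many.
books = {1: "Book A", 2: "Book B"}
authors = {10: "Ann", 11: "Bob"}
book_authors = [(1, 10), (1, 11), (2, 10)]  # (book_id, author_id) link rows


def denormalize(books, authors, links):
    """Build one flat document per book, copying author names into a list field."""
    docs = {
        book_id: {"id": book_id, "title": title, "author_names": []}
        for book_id, title in books.items()
    }
    for book_id, author_id in links:
        docs[book_id]["author_names"].append(authors[author_id])
    return list(docs.values())


# Each dict is what would be sent to the index as a single flat document.
docs = denormalize(books, authors, book_authors)
```

If searches need to start from authors instead, the same data could be flattened the other way round (one document per author with a list of book titles); which shape is right depends on the queries you need.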

Related

Database choice for crawled page semantics

I'm not sure whether this question has already been asked in the past.
I'm writing a web crawler intended to extract promotions, prices and product descriptions from multiple websites.
Which database would be ideal for doing an in-memory comparison of promotions and prices, based on identifying the same product across multiple websites?
I know the design of the Scraper, HTMLDataProcessor and Storage for wrangling is going to be complex, but here I'm only looking for a recommendation for the data layer.
Appreciate the help on this.
I'd suggest you first create your object model or entity-relationship diagram (a.k.a. ER diagram) for all the entities.
For instance you can see the tutorial here: http://creately.com/blog/diagrams/er-diagrams-tutorial/
Once you have the diagram and the relationships between your entities, you can decide whether you need a relational database or not.
You need to answer questions like:
Do you care about FK (foreign key) constraints?
What is the most common query and do you care about its performance?
Is an in-memory database sufficient or do you need data to be persisted?
Think along those lines.

When should I use a Column Family NoSQL solution vs Key-Value, Document Store, Graph

I understand the technical differences between the different solutions. But I can't seem to find concrete examples of the pros/cons of the different types of NoSQL solutions, and when to use one type over the other.
All of the information I find online gives very vague suggestions of when to use one type vs the other. And they all seem to be able to be interchangeably used without a clear indication of the advantage of using one over the other.
Document-oriented
Examples: MongoDB, CouchDB
Strengths: Heterogeneous data, working object-oriented, agile development
Their advantage is that they do not require a consistent data structure. They are useful when your requirements, and thus your database layout, change constantly, or when you are dealing with datasets which belong together but still look very different. When you have a lot of tables with two columns called "key" and "value", then these might be worth looking into.
Graph databases
Examples: Neo4j, GiraffeDB
Strengths: Data Mining
Their focus is on defining data by its relation to other data. When you have a lot of tables whose primary keys are made up of the primary keys of two other tables (and maybe some data describing the relation between them), then these might be something for you.
Key-Value Stores
Examples: Redis, Cassandra, MemcacheDB
Strengths: Fast lookup of values by known keys
They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
Unfortunately they assume that you know exactly what you are looking for. You need the profile of User157641? No problem, it will only take microseconds. But what if you want the names of all users who are aged between 16 and 24, have "waffles" as their favorite food and logged in within the last 24 hours? Tough luck. When you don't have a definite and unique key for a specific result, you can't get it out of your K-V store that easily.
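The difference between the two kinds of access can be sketched in Python, using a plain dict as a stand-in for a key-value store. The profiles, field names and the fixed "current time" are made up for illustration; a real store would add networking and persistence, but the access pattern is the same.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 2, 12, 0)  # fixed "current time" so the example is reproducible

# Hypothetical key-value store: key -> profile value.
store = {
    "User157641": {"name": "Ann", "age": 20, "food": "waffles",
                   "last_login": now - timedelta(hours=3)},
    "User157642": {"name": "Bob", "age": 30, "food": "waffles",
                   "last_login": now - timedelta(hours=1)},
    "User157643": {"name": "Cy", "age": 17, "food": "waffles",
                   "last_login": now - timedelta(days=3)},
}

# Lookup by a known key: a single hash lookup.
profile = store["User157641"]

# Query by attributes: there is no key to go on, so every value is inspected.
matches = [
    p["name"]
    for p in store.values()
    if 16 <= p["age"] <= 24
    and p["food"] == "waffles"
    and now - p["last_login"] <= timedelta(hours=24)
]
```

The second query works here only because the whole data set fits in one process; with a remote key-value store it would mean fetching every value or maintaining extra lookup keys by hand.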
There is an excellent article describing the types of NoSQL databases and when to use each one; read this.
It will give you a good understanding.

Cakephp lookup tables and saving data to subordinate models

So I've just read a bunch of the CakePHP questions about saving related model data here on Stack Overflow, but I am not finding what I'm looking for. Beyond the obvious technical issue, I have the distinct feeling that I am doing it wrong. My question is this: if you have an organizations table and a users table, and you want to link them with a lookup table, so that neither is associated with the other except through the linking association in the lookup table, how would you do it? Is a lookup table advisable in CakePHP, or is that a horrible hold-over from my SQL days that needs to die? What is best practice here? Is HABTM what I need? Furthermore, how do you learn this stuff? I try things I think might work, but they turn out kludgy at best.

How do the advanced features in Relational databases work?

To make a long question short, I know the basics of relational databases (indexing, replication, locking, concurrency, etc.) and SQL syntax (SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER, TRUNCATE) when used with simple expressions such as:
SELECT EventID,EventName FROM Events WHERE CustomerID=5 ORDER BY EventType
But I don't understand any of the "advanced" topics in Relational databases, like:
Domains
Constraints
Indices
Will anyone please give me a quick primer, an approximate explanation of what these aspects do and how they work?
You may down-vote and totally trash this question, but please explain to me approximately how these topics work, because I need to get up to speed on relational databases very quickly.
The Wikipedia articles on Relational Databases and the Relational Model are a good place to start. They have links to other articles on the specific topics you mention and these have examples, such as:
Domains
Constraints
Index
Primary Key and Foreign Key
I think that one issue you're going to face with this is that features vary widely between different RDBMS implementations. Locking, consistency and concurrency are very different in Oracle to <insert random name of other system here>. If there is a particular RDBMS that you have an interest in then I'd urge you to investigate how that particular system implements them, because the devil is in the details, as they say.
For example, start with the Oracle Concepts Guide, available in HTML and PDF from http://docs.oracle.com for each version.
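As a small hands-on complement to the reading suggested above, the sketch below uses Python's built-in `sqlite3` module to show constraints and an index in action. The table layout is hypothetical, loosely based on the `Events` query in the question. SQLite has no `CREATE DOMAIN`, so a `CHECK` constraint stands in for a domain here; other systems implement these features differently, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL
);
CREATE TABLE Events (
    EventID    INTEGER PRIMARY KEY,
    EventName  TEXT NOT NULL,
    -- CHECK restricts the allowed values, approximating a domain
    EventType  TEXT NOT NULL CHECK (EventType IN ('concert', 'talk')),
    -- foreign key: every event must belong to an existing customer
    CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID)
);
-- index: lets the engine find a customer's events without scanning the table
CREATE INDEX idx_events_customer ON Events(CustomerID);
""")

conn.execute("INSERT INTO Customers VALUES (5, 'Ann')")
conn.execute("INSERT INTO Events VALUES (1, 'Spring gig', 'concert', 5)")


def try_insert(sql):
    """Return the constraint error message, or None if the insert succeeded."""
    try:
        conn.execute(sql)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


bad_type = try_insert("INSERT INTO Events VALUES (2, 'Picnic', 'party', 5)")  # violates CHECK
bad_fk = try_insert("INSERT INTO Events VALUES (3, 'Q&A', 'talk', 99)")       # no customer 99

rows = conn.execute(
    "SELECT EventID, EventName FROM Events WHERE CustomerID = 5 ORDER BY EventType"
).fetchall()
```

Both bad inserts are rejected by the database itself rather than by application code, which is the point of declaring constraints in the schema.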

How to store directory / hierarchy / tree structure in the database?

How do I store a directory / hierarchy / tree structure in the database? Namely MSSQL Server.
@olavk: Doesn't look like you've seen my own answer. The way I use is way better than recursive queries :)
p.p.s. This is the way to go!
There are many ways to store hierarchies in SQL databases. Which one to choose depends on which DBMS product you use, and how the data will be used. As you have used the MSSQL2005 tag, I think you should start considering the "Adjacency List" model; if you find that it doesn't perform well for your application, then have a look at Vadim Tropashko's comparison which highlights differences between models with a focus on multiple performance characteristics.
If using SQL Server 2008 is an option, you should check out the new hierarchyid data type.
There also is the Nested-Set Model of Trees which has some advantages over the ParentID model. See http://www.evanpetersen.com/item/nested-sets.html and http://falsinsoft.blogspot.nl/2013/01/tree-in-sql-database-nested-set-model.html
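To give a feel for the nested-set idea, here is a minimal in-memory Python sketch (not taken from the linked articles). Each node gets a left and right number from one depth-first walk, and a node's descendants are exactly the nodes whose numbers fall strictly inside its own pair. The tree and the helper names are hypothetical.

```python
# Hypothetical tree as parent -> ordered list of children.
tree = {"root": ["docs", "src"], "docs": [], "src": ["lib", "app"], "lib": [], "app": []}


def number_nodes(tree, root):
    """Assign (lft, rgt) bounds to every node with a single depth-first walk."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        lft = counter            # number taken on the way down
        for child in tree[node]:
            visit(child)
        counter += 1
        bounds[node] = (lft, counter)  # number taken on the way back up

    visit(root)
    return bounds


def descendants(bounds, node):
    """All nodes whose interval lies strictly inside the given node's interval."""
    lft, rgt = bounds[node]
    return sorted(n for n, (l, r) in bounds.items() if lft < l and r < rgt)


bounds = number_nodes(tree, "root")
src_descendants = descendants(bounds, "src")
```

In a SQL table the same subtree test becomes a range condition on two integer columns, with no recursion needed; the price is that inserting or moving a node means renumbering part of the tree.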
This is more of a bookmark for me than a question, but it might help you too. I've used this article's approach to store a directory / tree structure in the database.
There are some useful code snippets in the article as well.
Hope this helps.
I'm not affiliated with that website in any way
I faced a similar problem in one of my projects. We had a huge hierarchy that would keep growing forever.
I needed to traverse it fast and then find the right group after some complex validations.
I could have stayed in SQL Server and scratched my head over how to do this efficiently, knowing that recursive queries were the only viable solution there. But do you really know whether recursive queries can be optimized at all? Is there any guarantee that your hierarchy will not grow, so that one fine day you find your recursive queries are too slow to be used in production?
So I decided to give Neo4J a shot. It's a graph database with many useful built-in algorithms, amazingly fast traversal, and decent documentation and examples.
Store the hierarchy in Neo4J and access it through a Thrift service (or something else).
Yes, you will have to write code that integrates your SQL queries with Neo4J, but you will have a scalable and more future-proof solution.
Hope you find this useful.
Are you using SQL Server 2005? Recursive queries make querying hierarchical data much more elegant.
Edit: I do think materialized paths are a bit of a hack. The paths contain non-normalized, redundant data, and you have to use triggers or something similar to keep them updated. E.g. if a node changes parent, every node in its subtree has to have its path updated. And subtree queries have to use some ugly substring matching rather than an elegant and fast join.
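The trade-off described above can be seen in a small Python sketch of materialized paths. The node ids, the `/1/2/` path format and the helper functions are assumptions for illustration: subtree lookup is a prefix match on the stored path, and moving a node means rewriting the path of every descendant.

```python
# Hypothetical nodes: id -> materialized path from the root, with a trailing slash
# so that "/1/2/" does not accidentally match a sibling such as "/1/20/".
paths = {
    1: "/1/",
    2: "/1/2/",
    3: "/1/2/3/",
    4: "/1/4/",
}


def subtree(paths, node_id):
    """Descendants are the nodes whose path starts with this node's path."""
    prefix = paths[node_id]
    return sorted(n for n, p in paths.items() if p.startswith(prefix) and n != node_id)


def move(paths, node_id, new_parent_id):
    """Re-parent a node; the node and all of its descendants get new paths."""
    old_prefix = paths[node_id]
    new_prefix = f"{paths[new_parent_id]}{node_id}/"
    for n, p in list(paths.items()):
        if p.startswith(old_prefix):
            paths[n] = new_prefix + p[len(old_prefix):]


before = subtree(paths, 1)  # everything under node 1
move(paths, 2, 4)           # node 2 (and its child 3) now hang under node 4
after = subtree(paths, 4)
```

In SQL the prefix test would typically be a `LIKE 'prefix%'` condition, and the loop in `move` corresponds to the update that triggers or application code have to perform.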
The question is similar to this question that was closed. I found answers to both questions very helpful in my pursuits, and they eventually led me to the MongoDB manual that presents 5 different ways to model tree structures:
https://docs.mongodb.com/manual/applications/data-models-tree-structures/
While MongoDB is not a relational database, the models presented are applicable to relational databases, as well as other formats such as JSON. You clearly need to figure out which model is right based on the pros/cons presented.
The author of this question found a solution that combined both the Parent and Materialized Paths models. Maintaining the depth and parent could present some problems (extra logic, performance), but there are clearly upsides for certain needs. For my project, Materialized Paths will work best and I overcame some issues (sorting and path length) through techniques from this article.
The typical way is a table with a foreign key (e.g. "ParentId") onto itself.
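A minimal sketch of that self-referencing table, together with the kind of recursive query mentioned in other answers, using Python's built-in `sqlite3` module. The `Folder` table and its rows are hypothetical. SQLite spells the common table expression `WITH RECURSIVE`; SQL Server uses plain `WITH` for the same construct, so the query would need small adjustments there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Folder (
    Id       INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL,
    ParentId INTEGER REFERENCES Folder(Id)  -- NULL marks the root
);
INSERT INTO Folder VALUES
    (1, 'root', NULL),
    (2, 'docs', 1),
    (3, 'src',  1),
    (4, 'lib',  3);
""")

# Walk down from a starting node, carrying the depth along.
subtree = conn.execute("""
WITH RECURSIVE Sub(Id, Name, Depth) AS (
    SELECT Id, Name, 0 FROM Folder WHERE Id = ?
    UNION ALL
    SELECT f.Id, f.Name, s.Depth + 1
    FROM Folder f JOIN Sub s ON f.ParentId = s.Id
)
SELECT Id, Name, Depth FROM Sub ORDER BY Depth, Id
""", (1,)).fetchall()
```

Inserting or moving a node only touches one row (its `ParentId`), which is the main attraction of this model; the cost is that reading a whole subtree needs a recursive query or repeated lookups.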

Resources