Denormalization of database tables for Lucene indexing

Denormalization of database tables for Lucene indexing - database

I am just starting up with Lucene, and I'm trying to index a database so I can perform searches on the content. There are 3 tables that I am interested in indexing:
1. Image table - this is a table where each entry represents an image. Each image has an unique ID and some other info (title, description, etc).
2. People table - this is a table where each entry represent a person. Each person has a unique ID and other info like (name, address, company, etc)
3. Credited table - this table has 3 fields (image, person, and credit type). It's purpose is to associate some people to a image as the credits for that image. Each image can have multiple credited people (there's the director, photographer, props artist, etc). Also, a person is credited in multiple images.
I'm trying to index these tables so I can perform some searching using Lucene but as I've read, I need to flatten the structure.
The first solution the came to me would be to create Lucene documents for each combination of Image/Credited Person. I'm afraid this will create a lot of duplicate content in the index (all the details of an image/person would have to be duplicated in each Document for each person that worked on the image).
Is there anybody experienced with Lucene that can help me with this? I know there is no generic solution to denormalization, that is why I provided a more specific example.
Thank you, and I will gladly provide more info on the database is anybody needs
PS: Unfortunately, there is no way for me to change the structure of the database (it belongs to the client). I have to work with what I have.

You could create a Document for each person with all the associated images' descriptions concatenated (either appended to the person info or in a separate Field).
Or, you could create a minimal Document for each person, create a Document for each image, puts the creators' names and credit info in a separate field of the image Document and link them by putting the person ID (or person Document id) a third, non-indexed field. (Lucene is geared toward flat document indexing, not relational data, but relations can be defined manually.)
This is really a matter of what you want to search for, images or persons, and whether each contains enough keywords for search to function. Try several options, see if they work well enough and don't exceed the available space.
The credit table will probably not be a good candidate for Document construction, though.

Related

NetSuite - UNION ALL equivalent in saved search?

I'm in the process of writing a SuiteTalk integration, and I've hit an interesting data transformation issue. In the target system, we have a sort of notes table which has a category column and then the notes column. Data going into that table from NetSuite could be several different fields on a single entity in NetSuite terms, but several records of different categories in our terms.
If you take the example of a Sales Order, you might have two text fields that we need to bring across as notes. For each of those fields I need to create a row, with both the notes field in the same column but separate rows. This would allow me to add a dynamic column that give the category for each of those fields.
So instead of
SO number notes 1 notes 2
SO1234567 some text1 some text2
You’d get
SO Number Category Text
SO1234567 category 1 some text1
SO1234567 category 2 some text2
The two problems I’m really trying to solve here are:
Where can I store the category name? It can’t be the field name in NetSuite. It needs to be configurable per customer as the number of notes fields in each record type might vary across implementations. This is currently my main blocker.
Performance – I could create a saved search for each type of note, and bring one row across each time, but that’s not really an acceptable performance hit if I can do it all in one call.
I use Saved Searches in NetSuite to provide a configurable way of filtering the data to import into the target system.
If I were writing a SQL query, i would use the UNION clause, with the first column being a dynamic column denoting the category and the second column being the actual data field from NetSuite. My ideal would be if I could somehow do a similar thing either as a single saved search, or as one saved search per entity, without having to create any additional fields within NetSuite itself, so that from the SuiteTalk side I can just query the search and pull in the data.
As a temporary kludge, I now have multiple saved searches in NetSuite, one per category, and within the ID of the saved search I expect the category name and an indicator of the record type. I then have a parent search which gives me the searches for that record type - it's very clunky, and ultimately results in far too many round trips for me to be satisfied.
Any idea if something like this is at all possible?? Or if not, is there a way of solving this without hard-coding the category values in the front end? Even if I can bring back multiple recordsets in one call, that would be a performance enhancement.
I've asked the same question on the NetSuite forums but to no avail.
Thanks

At first read it sounds like you are trying to query a set of fields from entities. The fields may be custom fields or built in fields. Can you not just query the entities where your saved search has all the potential category columns and then transform the received data into categories?
Otherwise please provide more specifics in Netsuite terms about what you are trying to do.

Database Architecture: Multiple Type Relations

I'm building a Laravel application that has "listings". These listings can be things like boats, planes, and automobiles; each with their own specific fields.
I will also have an images table that should relate to each type of listing and a users table that needs to map to each type of listing. I'm trying to determine the best way to map each listing type back to images and users.
One way I've thought of doing this was having separate boats, planes and automobiles tables with their specific fields and then having specific boat_images plane_images and automobile_images tables to map to each respective type. But then relating each type to a user would be a bit tricky.
I don't think one giant listing table with all fields I'd ever use through these 3 (which could grow in size later) would make sense --- and I also don't believe having a general metadata field that has a JSON object full of specifications for each listing would work well either when I want to have a searchable database.
I know of pivot tables, but I'm trying to grasp the overall architecture here. Any help would be greatly appreciated. Thanks!

You could have a listings table, holding only id and name. Boats, planes, automobiles and others should be a subset table.
Each table will have its own entity. And the Listing entity will have multiple hasMany relationships with its subset tables. These relationships will be named like boats(), planes(), etc. Each subset listing entity will hold a single belongsTo relationship.
Using these subset tables should also help to compartmentalize form validation.
You can have a single images table and use a polymorphic relationship towards the listings table. This one is a huge savior.

Where to Store Metadata

We've built an algorithm that helps us deliver relevant articles to our users. In the background during certain intervals, the algorithm will calculate metadata, such as average age, age spread, and gender coefficient from a slew of data related to views, comments, and votes.
With that said, are there any downsides to storing this metadata as fields on the Articles table? Or, should I create a separate table, such as Article_Data, to store the information? I am just not sure how much the updating of this metadata will interfere with selecting the articles.
For the most part, we will be SELECTing articles and its metadata and JOINing it on user data (age, gender, etc) to show users relevant content. The only time we don't need the metadata is when we show a particular article to a user.

If the fields are clearly defined, and there are a limited number of them, put them in the Articles table.
If you are going to store more than one record of metadata fields per article, you need another table, in a one-to-many relationship with the Articles table.
If the fields are not clearly defined, user-defined, or there are many of them, you probably need a new table with one row per metadata item. But this is more difficult to work with in the long run.
See Also
http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

Database design for a product aggregator

I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to able to search and filter results based on any of those product categories. I also expect to have a large number of users. My initial thought was having one big table with every product in it with a column for each piece of information and an index on anything I need to be able to search by but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote a tree-like navigation of tables but because you can search by anything I'm not sure how I would organize the tables.
Any thoughts on some good practices?

One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at the all the data you have, determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of what data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables, others (price) would be simple scalars. Appropriate indexing and you'll be fine. For advanced analytics, consider a dimensionally modeled star-schema, but perhaps not for your live transaction system - depends what your data flow/workflow/transactions are. Or consider some benefits of its principles in your transactional database. Ralph Kimball is source of good information on dimensional modeling.

I dont see any need for the tree structure here. You can do with single table.
if you insist on tree structure with hierarchy here is an example to get you started.

For text based search, and ease of startup & design, I strongly recommend Apache SOLR. The SOLR API is easy to use (especially JSON). Databases do text search poorly, and I would instead recommend that you just make sure that they respond to primary/unique key queries properly, and those are the fields you should index.

One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index on the top 5 or 10 columns you think users are likely to search for, unless it's possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID PARENT_ID NAME
1 ROOT
2 1 SOCKS
3 1 HELICOPTER PARTS
4 2 ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.

Full-text Search on Joined, Hierarchical Records in SQL Server 2008

Probably a noob question, but I'll go for it nevertheless.
For sake of example, I have a Person table, a Tag table and a ContactMethod table. A Person will have multiple Tag records and multiple ContactMethod records associated with them.
I'd like to have a forgiving search which will search among several fields from each table. So I can find a person by their email (via ContactMethod), their name (via Person) or a tag assigned to them.
As a complete noob to FTS, two approaches come to mind:
Build some complex query which addresses each field individually
Build some sort of lookup table which concatenates the fields I want to index and just do a full-text query on that derived table.
(Feel free to edit for clarity; I'm not in it for the rep points.)

If your sql server supports it you can create an indexed view and full text search that; you can use containstable(*,'"chris"') to read all the columns.
If it doesn't support it as the fields are all coming from different tables I think for scalability; if you can easily populate the fields into a single row per record in a separate table I would full text search that rather than the individual records. You will end up with a less complex FTS catalog and your queries will not need to do 4 full text searches at a time. Running lots of separate FTS queries over different tables at the same time is a ticket to query performance issues in my experience. The downside with doing this is you lose the ability to search for Surname on its own; if that is something you need you might need to look at an alternative.
In our app we found that the single table was quicker (we can't rely on customers having enterprise sql at hand); so we populate the data with spaces into an FTS table through an update sp then our main contact lookup runs a search over the list. We have two separate searches to handle finding things with precision (i.e. names or phone numbers) or just for free text. The other nice thing about the table is it is relatively easy and low cost to add further columns to the lookup (we have been asked for social security number for example; to do it we just added the column to the update SP and we were away with little or no impact.

One possibility is to make a view which has these columns: PersonID, ContentType, Content. ContentType would be something like "Email", "PhoneNumber", etc... and Content would hold that. You'd be searching on the Content column, and you'd be able to see what the person's ID is. I'm not 100% sure how full text search works though, so I'm not sure if you could use that on a view.

The FTS can search multiple fields out-of-the-box. The CONTAINS predicate accepts a list of columns to search. Also CONTAINSTABLE.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight