I want to create a system that stores books (and some other documents). Users will be able to log into the system where they can either see a list of all books or enter a search string and get a list of the books containing that string. My problem is that I don't know how I should go about storing my books. The books obviously have to be searchable, and the search needs to return the book's ID, name, and preferably the page. Anything more, like the text surrounding the search term, would be a nice extra.
Some facts that might help you help me get the best answer.
The database does not have to be free. If SQL Server or an Oracle DB will help me, then I'm all for that.
There will be about 100 books (2-600 pages each).
There will be about 1,000 documents (10-50 pages each).
Adding books and documents will be a slow process that will happen infrequently so any type of re-indexing of tables does not need to be fast.
I have not decided how to search the documents. I do need my search results to be ranked based on relevance somehow. This might become the subject of another question in the future.
Do not use an RDBMS. Relational databases are good for storing relational data, and what you are trying to store is a set of documents. Use a document store like CouchDB or MongoDB. However, since you have to search this data, it is better to index it in Lucene, which is built for exactly this kind of need.
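To make that concrete, here is a minimal sketch of what indexing and searching could look like with Lucene.Net 3.0.x, assuming each book can be split into pages first; the field names (bookId, name, page, text), the index path, and the sample values are purely illustrative:

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    var indexDir = FSDirectory.Open(new DirectoryInfo(@"C:\indexes\books"));   // illustrative path
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);

    // Index one Lucene document per page, so a hit can report book ID, name and page.
    using (var writer = new IndexWriter(indexDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        doc.Add(new Field("bookId", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("name", "Moby Dick", Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("page", "17", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("text", "Call me Ishmael...", Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
        writer.Commit();
    }

    // Search: results come back ranked by relevance, with the stored fields available per hit.
    using (var searcher = new IndexSearcher(indexDir, true))
    {
        var query = new QueryParser(Version.LUCENE_30, "text", analyzer).Parse("ishmael");
        foreach (var hit in searcher.Search(query, 10).ScoreDocs)
        {
            var d = searcher.Doc(hit.Doc);
            Console.WriteLine($"{d.Get("bookId")} | {d.Get("name")} | page {d.Get("page")}");
        }
    }

Because the page text is stored, you can also show a snippet around the match; the Highlighter contrib package is one way to do that.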
Provided you don't intend to search the entire text of the book (perhaps consider initial processing to store a serialized hash of unique words?):
SQL Server 2008 R2 has the FILESTREAM feature, which enforces relational integrity through the database engine while keeping the files themselves on the file system.
It's the "best of both worlds", and you won't have to worry about how your database backup plan affects your BLOBs.
http://msdn.microsoft.com/en-us/library/cc949109(v=sql.100).aspx
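For reference, reading a FILESTREAM column from .NET goes through a Win32 streaming API rather than pulling the whole BLOB through the normal result set. A rough sketch, assuming a hypothetical dbo.Books table with an Id column and a Content VARBINARY(MAX) FILESTREAM column:

    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;

    var connectionString = "Server=.;Database=Books;Integrated Security=true";   // illustrative

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var tx = conn.BeginTransaction())   // FILESTREAM access requires a transaction
        {
            var cmd = new SqlCommand(
                "SELECT Content.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                "FROM dbo.Books WHERE Id = @id", conn, tx);
            cmd.Parameters.AddWithValue("@id", 42);

            string path;
            byte[] txContext;
            using (var reader = cmd.ExecuteReader(CommandBehavior.SingleRow))
            {
                reader.Read();
                path = reader.GetString(0);
                txContext = reader.GetSqlBytes(1).Buffer;
            }

            // Stream the content directly from the NTFS store behind the table.
            using (var source = new SqlFileStream(path, txContext, FileAccess.Read))
            using (var target = File.Create(@"C:\temp\book.pdf"))   // illustrative destination
            {
                source.CopyTo(target);
            }

            tx.Commit();
        }
    }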
SharePoint Foundation 2010 and 2013 could be a perfect solution for you, and it is free to use. You can store large numbers of documents in different document libraries, add and edit their metadata, and search them using metadata such as Title and Author, and even the text content inside the book.
I'm currently investigating the tools necessary to add fast, full-text search to our ERP SaaS application, with the aim of providing a single search entry point in the application that can search over the many different kinds of objects that make up the domain of the software.
The application (a Spring Java web application) is backed by a SQL Server RDBMS (using Hibernate as the ORM). There are hundreds of different tables, dozens of which (but maybe more) should be searchable (usually there are one or more varchar columns in every table that should be indexed/searched).
Think, for example, of a single search bar where I can search customers, contracts, employees, articles, and so on. This data is also updated very often (new inserts, deletes, updates...).
I found this article (www.chrisumbel.com/article/lucene_solr_sql_server) that shows how to connect a SQL Server database to Solr, with an example of the query that Solr runs against the database during the data import.
Since we have dozens (or more) of tables containing searchable data, does that mean that, as a first step, we would have to integrate all the SQL queries that extract this data with Solr in order to build the index?
Second question: not all the data is searchable by everyone (permissions and ad hoc filters), so how could we combine the full-text search provided by Solr with the more complex queries (joins on other tables, for example) that this requires?
Thanks
You are nearly asking for a full-blown consulting project :-) But a few suggestions are possible.
Define Search Result Types: Search engines use denormalized data, i.e. you won't do any joins while querying (if you think you do, stick to your DB :-) That means you need to do the necessary joins while filling the index, and this defines what you can search for. Most people "just" index documents or log lines, so there is only one type of result. Sometimes people's profiles are included, sometimes a distinction is made between results from the different source systems the documents come from, but in the end there is a limited number of types of search results, and they are nevertheless all indexed into one and the same schema (schemas are very malleable for search engines).
Index: You know the SQL statements that extract your data. Converting the rows to JSON and shoveling them into a search engine is not difficult. One thing to watch out for: while your DB changes, you keep indexing; whether that is an incremental or a full "crawl" depends on how much logic you want to add. The trickiest part is getting deletes on the DB side into the index. If it's gone, it's gone: how do you know there was something that needs to be purged from the index? :-)
Secure Search: Since you don't really join, applying access rights at query time requires two steps. During indexing, write the principal (group, user) names of those who may read a document into the index. At query time, get the user ID and expand it, recursively, to get all the user's groups. Add this as a query filter. Make sure to cache the filter, or even pre-compute it for all users fairly regularly and store it in a fast store (the search index is one place, a DB would do too :-) Obviously you need to re-index if access rights change. The good thing is: as long as things only change in LDAP/AD, you don't need to re-index the data, only the expanded groups of the affected users.
Ad hoc filters: If you want to filter on X, put X as a field in the index and apply the filter at query time.
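As a rough illustration of the last two points, the security filter and an ad hoc filter can both be applied as Solr filter queries (fq) over plain HTTP. The core name ("erp") and the acl and customer_id fields are assumptions, and C# is used here only as an example client; the same request can be issued from any HTTP library:

    using System;
    using System.Net.Http;

    // Groups expanded for the current user at query time.
    var groups = new[] { "sales", "managers", "user:jdoe" };
    var acl = "acl:(" + string.Join(" OR ", Array.ConvertAll(groups, g => "\"" + g + "\"")) + ")";

    var url = "http://localhost:8983/solr/erp/select"
            + "?q=" + Uri.EscapeDataString("contract renewal")
            + "&fq=" + Uri.EscapeDataString(acl)                     // secure search filter
            + "&fq=" + Uri.EscapeDataString("customer_id:1042")      // ad hoc filter
            + "&wt=json";

    using (var http = new HttpClient())
    {
        Console.WriteLine(http.GetStringAsync(url).Result);
    }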
Coming from an RDBMS background and trying to wrap my head around Elasticsearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, the geographic location that pertains to the rest of the searchable record, and title and description (which are free-text search fields).
I would like to move this over to Elasticsearch, and have read about creating a separate index per user. If I understand correctly, with this suggestion I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for the user indices that will keep Kibana analysis simple?
One issue I have with this recommendation is: how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices scattered all over the place, would you?
Is it so bad to have one index per application, and a type per SQL Server table?
Since in SQL Server we have other tables for user configuration, keyed by user ID, I take it that I could then create new ES types in the user indices for configuration. Is this a recommended pattern? I would rather not have two database systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where the totality of the information resides in each document, unlike with a star schema. If you can live with denormalized data, that is fine, but I assume that since you already have a star schema, denormalization is not an option, because you don't want to go and update millions of documents each time a location name changes, for example (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data into a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html. With nested objects the entire information is kept in a single document, meaning one location and its related users would live in a single document. That may make it sub-optimal, because the document will be huge and, again, a change to the location name will require updating the entire document. So this is better, but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html. In this case the location and the user records are kept in separate types within the same index, similarly to tables in a relational database (see the mapping sketch below). This seems to be the right modeling for what we need. The only major issue with this option is that Kibana 4 does not provide ways to manipulate/aggregate documents based on the parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (it was mine), that pretty much eliminates this option. If you want to benefit from Elasticsearch's speed as an engine, this seems to be the desired modeling for your use case.
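For reference, a parent/child mapping of that shape in the Elasticsearch versions contemporary with Kibana 4 (1.x/2.x) could be created roughly like this. The index, type and field names are assumptions based on the RecordData example in the question, and the HTTP call is shown from C# only for illustration:

    using System;
    using System.Net.Http;
    using System.Text;

    // Hypothetical index "recorddata" with a "location" parent type and a "record" child type.
    var mapping = @"{
      ""mappings"": {
        ""location"": {
          ""properties"": { ""name"": { ""type"": ""string"" } }
        },
        ""record"": {
          ""_parent"": { ""type"": ""location"" },
          ""properties"": {
            ""title"":       { ""type"": ""string"" },
            ""description"": { ""type"": ""string"" }
          }
        }
      }
    }";

    using (var http = new HttpClient())
    {
        // Create the index with both types; child documents are then indexed with ?parent=<locationId>.
        var response = http.PutAsync("http://localhost:9200/recorddata",
            new StringContent(mapping, Encoding.UTF8, "application/json")).Result;
        Console.WriteLine(response.StatusCode);
    }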
In my opinion, once you get the data modeling right, all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize it is by having a separate cluster of three Elasticsearch nodes behind a load balancer (all of it hosted in a cloud), and then having all the web applications connect to that cluster using the Elasticsearch API.
Hope that helps.
The project I have been given is to store and retrieve unstructured data from a third party. This could be HR information (users, pictures, CVs, voice mail, etc.) or factory-related material (work items, parts lists, time sheets, etc.). Basically almost any type of data.
Some of these items may be linked, so a user may have a picture, for example. I don't need to examine the content of the data, as my storage solution will receive the data as XML and send it out as XML. It's down to the recipient to convert the XML back into a picture or sound file, etc. The recipient may request all users, so I need to be able to find user records and their related "child" items such as pictures, or the recipient may just want the pictures.
My database is MS SQL Server and I have to stick with that. My question is: are there any patterns or existing solutions for handling unstructured data in this way?
I've done a bit of Googling and have found some sites that talk about this kind of problem, but they are more interested in drilling into the data to allow searches on its content. I don't need to know the content, just what type it is (picture, user, job sheet, etc.).
To those who have given their comments:
The problem I face is linking objects together. A user object may be added to the data store, and then at a later date the user's picture may be added. When the user is requested, I will need to return both the user object and its associated picture. The user may also update their picture, so you can see I need to keep relationships between objects. That is what I was trying to get across in the second paragraph. The difficulty is that my solution must be very generic, as I should be able to store anything and link these objects according to the end users' requirements, e.g. users, pictures and emails, or work items, parts lists, etc. I see that Microsoft has developed Zentity, which looks like it may be useful, but I don't need to drill into the data contents, so it's probably overkill for what I need.
I have been using Microsoft Zentity since version 1, and whilst it is excellent at storing huge amounts of structured data and allowing (relatively) simple access to that data, if your data structure is likely to change then recreating the 'data model' (and the regression testing) would probably remove the benefits of using such a system.
Another point worth noting is that Zentity requires FILESTREAM storage, so you would need to have the correct version of SQL Server installed (2008, I think) and FILESTREAM storage enabled.
Since you deal with XML, it's not really unstructured data. Microsoft SQL Server 2005 or later has an XML column type that you can use.
Now, if you don't need to access XML nodes, and you think you never will, go with plain varbinary(max). For your information, storing XML content in an XML-typed column lets you not only retrieve XML nodes directly through database queries, but also validate the XML data against schemas, which may be useful to ensure that the content you store is valid.
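A minimal sketch of what storing one of those XML payloads could look like from .NET, assuming a hypothetical dbo.StoredObjects table with an ObjectType discriminator column and an XML-typed Payload column (if you later need to look inside the payload, SQL Server's XML methods such as value() and exist() work directly against that column):

    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;
    using System.Xml;

    var connectionString = "Server=.;Database=Store;Integrated Security=true";    // illustrative
    var xml = "<user id=\"42\"><name>Jane</name></user>";                         // payload from the third party

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "INSERT INTO dbo.StoredObjects (ObjectType, Payload) VALUES (@type, @payload)", conn))
    {
        cmd.Parameters.AddWithValue("@type", "User");
        cmd.Parameters.Add("@payload", SqlDbType.Xml).Value =
            new SqlXml(XmlReader.Create(new StringReader(xml)));

        conn.Open();
        cmd.ExecuteNonQuery();
    }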
Don't forget to use FILESTREAM (SQL Server 2008 or later) if your XML data grows in size (2 MB+). This is probably your case, since voice mail or pictures can easily be larger than 2 MB, especially when they are Base64-encoded inside an XML file.
Since your data is quite free-form and changeable, your best bet is to put it on a plain old file system, not a relational database. By all means store some meta-information in SQL where it makes sense to search through structured data relationships, but if your main data content is not structured with data relationships then you're doing yourself a disservice by using an SQL database.
The file system is blindingly fast at looking up files and streaming them, especially if this is an intranet application. All you need to do is share a folder and apply sensible file permissions, and a large chunk of unnecessary development disappears. If you need to deliver this over the web, consider using WebDAV with IIS.
A reasonably clever file and directory naming convention, with a small piece of software you write to help people get to the right path, will hands down beat any SQL database for both access speed and sequential data streaming. File system paths and file names will always beat any clever SQL index for data-location speed. And plain old files are the ultimate unstructured, flexible data store.
Use SQL for what it's good for. Use files for what they are good for. Best tools for the job and all that...
You don't really need any pattern for this implementation. Store all your data as a BLOB entry, read from it when required, and then send it out again.
You would probably also need to investigate other infrastructure aspects, like periodically cleaning up the DB to remove expired entries.
Maybe I'm not understanding the problem clearly.
So am I right in saying that all you need to store is a blob of XML with whatever binary information is contained within? Why can't you have a users table and then a linked (foreign key) table with user objects in it, linked by userId?
We're trying to identify the locations of certain information stored across our enterprise in order to bring it into compliance with our data policies. On the file end, we're using Nessus to search through various files, but I'm wondering about the database end.
Using Nessus there would seem largely pointless, because it would output the raw data without telling us what table or row it was in, or giving us much other useful information, especially considering these databases are quite large (hundreds of gigabytes).
Also worth noting: this system needs to be able to do pattern-based matching (such as with regular expressions), not just a "dumb search".
I've investigated data mining and data warehousing in order to find this data, but it seems like they're more for analysing data than for actually just finding it.
Is there a better method of searching through large amounts of data in a database to find this information? We're using both Oracle 11g and SQL Server 2008 and need to perform the searches on both, so I'd like to stay away from server-specific approaches (although if I have to rewrite some code to translate from T-SQL to PL/SQL, and vice versa, I don't mind).
On SQL Server, for searching through large amounts of text, you can look into Full-Text Search.
Read more here: http://msdn.microsoft.com/en-us/library/ms142559.aspx
But if I am reading your question right, you want to spider your database in a similar fashion to how a web search engine spiders web sites and web pages.
You could use a set of full-text queries that bring back results spanning multiple tables.
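For example, once a full-text index exists on the relevant columns, such a query is just an ordinary parameterized command from .NET; the table and column names below are made up:

    using System;
    using System.Data.SqlClient;

    var connectionString = "Server=.;Database=Enterprise;Integrated Security=true";   // illustrative

    // Assumes a full-text index has been created on dbo.Customers(Notes).
    var sql = @"SELECT CustomerId, Name
                FROM dbo.Customers
                WHERE CONTAINS(Notes, @term)";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@term", "\"credit card\"");   // phrase search
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                Console.WriteLine($"{reader["CustomerId"]}: {reader["Name"]}");
        }
    }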
Oracle supports regular expressions with the REGEXP_LIKE() function, and it ought to be fairly straightforward to automate the generation of the code you need based on system metadata (to find all text columns over a certain length, for example, and include them in a predicate against that table to find the rows and values that match your regexp). It doesn't sound too challenging, really. In theory you could add check constraints to columns to prevent the insertion of values that match a regexp, but that might be overkill.
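A sketch of that metadata-driven approach from .NET, using ALL_TAB_COLUMNS to enumerate text columns and then probing each one with REGEXP_LIKE. The schema filter, the regex, and the use of the ODP.NET managed driver are all assumptions:

    using System;
    using System.Collections.Generic;
    using Oracle.ManagedDataAccess.Client;

    var connectionString = "User Id=scott;Password=tiger;Data Source=orcl";   // illustrative
    var pattern = @"\d{3}-\d{2}-\d{4}";                                       // e.g. SSN-like values

    using (var conn = new OracleConnection(connectionString))
    {
        conn.Open();

        // Find candidate text columns from the data dictionary.
        var meta = new OracleCommand(
            "SELECT owner, table_name, column_name FROM all_tab_columns " +
            "WHERE data_type IN ('VARCHAR2', 'CHAR', 'CLOB') AND owner = 'APPDATA'", conn);

        var targets = new List<string[]>();
        using (var r = meta.ExecuteReader())
            while (r.Read())
                targets.Add(new[] { r.GetString(0), r.GetString(1), r.GetString(2) });

        // Probe each column for values matching the pattern.
        foreach (var t in targets)
        {
            var probe = new OracleCommand(
                $"SELECT COUNT(*) FROM {t[0]}.{t[1]} WHERE REGEXP_LIKE({t[2]}, :pattern)", conn);
            probe.Parameters.Add("pattern", OracleDbType.Varchar2).Value = pattern;
            var hits = Convert.ToInt64(probe.ExecuteScalar());
            if (hits > 0)
                Console.WriteLine($"{t[0]}.{t[1]}.{t[2]}: {hits} matching rows");
        }
    }

On the SQL Server side there is no built-in REGEXP_LIKE, so the equivalent probe would have to fall back on LIKE/PATINDEX patterns or a CLR regex function.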
Oracle Text is suited to searching for words/phrases in larg(ish) bits of text (e.g. PDF, HTML, TXT or DOC files) held in the database. There is some limited fuzzy searching, but not regular expressions per se.
You don't really go into what sort of data you are looking for or what you have in your databases. Nessus indicates you are looking for security issues, but the title of "Data Correlation" suggests something completely different.
Really the data structures should provide the information about what to look for and where. That's what databases are about - structuring data for accessibility. A database backing a CMS, forum software or similar would be a different kettle of fish.
Has anyone used Lucene.NET rather than the full-text search that comes with SQL Server?
If so, I would be interested in how you implemented it.
Did you, for example, write a Windows service that queried the database every hour and then saved the results to the Lucene.NET index?
Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.
I've used Lucene.NET as a general database indexer, so what I got back was basically DB IDs (of indexed email messages), and I've also used it to return enough info to populate search results without touching the database. It worked great in both cases, though the SQL side can get a little slow, as you pretty much have to get an ID, select by that ID, and so on. We got around this by making a temp table (with just the ID column in it) and bulk-inserting into it from a file (which was the output from Lucene), then joining to the message table. That was a lot quicker.
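A sketch of that temp-table trick; this version pushes the IDs with SqlBulkCopy instead of the file-based bulk insert we used, and the table and column names are made up:

    using System;
    using System.Data;
    using System.Data.SqlClient;

    var connectionString = "Server=.;Database=Mail;Integrated Security=true";   // illustrative
    var luceneHitIds = new[] { 101, 202, 303 };                                 // IDs read back from the Lucene index

    var ids = new DataTable();
    ids.Columns.Add("MessageId", typeof(int));
    foreach (var id in luceneHitIds)
        ids.Rows.Add(id);

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        new SqlCommand("CREATE TABLE #Hits (MessageId INT PRIMARY KEY)", conn).ExecuteNonQuery();

        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Hits" })
        {
            bulk.WriteToServer(ids);
        }

        var cmd = new SqlCommand(
            "SELECT m.* FROM dbo.Messages m JOIN #Hits h ON h.MessageId = m.MessageId", conn);
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                Console.WriteLine(reader["MessageId"]);
        }
    }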
Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very, very good at what it does. It's worth a look, and, I'm told, it doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's full-text indexing does.
BTW, we were dealing with 20-50 million emails (and around 1 million unique attachments), totalling about 20 GB of Lucene index, I think, and 250+ GB of SQL database + attachments.
Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (when it merges index segments). There is no issue with having more than one segment, but there can be a BIG problem if you try to merge two segments which each have a million items in them and you have a watcher thread that kills the process if it takes too long... (yes, that kicked our arse for a while). So keep the maximum number of documents per segment LOW (i.e. don't set it to maxint like we did!).
EDIT: Corey Trager documented how to use Lucene.NET in BugTracker.NET here.
I have not done it against a database yet; your question is kind of open-ended.
If you want to search a DB, and can choose to use Lucene, I also assume that you can control when data is inserted into the database.
If so, there is little reason to poll the DB to find out whether you need to reindex: just index as you insert, or create a queue table which can be used to tell Lucene what to index.
I think we don't need another indexer that is ignorant of what it is doing, reindexing every time or wasting resources.
I have also used Lucene.NET as a storage engine, because it's easier to distribute and set up alternate machines with an index than with a database: it's just a file system copy, so you can index on one machine and simply copy the new files to the other machines to distribute the index. All the searches and details are served from the Lucene index, and the database is only used for editing. This setup has proven to be a very scalable solution for our needs.
Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full-text search is that the service is decoupled from the relational engine, so joins, ordering, aggregates and filters between the full-text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating the full-text search into the relational engine, but I haven't tested it. They also made the whole full-text search much more transparent; in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, whereas in the new version it is easier to see how they work.
In my experience, if SQL Server meets your requirements it will be the easiest way; if you expect a lot of growth, complex queries, or need a lot of control over the full-text search, you might consider working with Lucene from the start, because it will be easier to scale and customise.
I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudocode it looks like this:
Store record:
    insert the text and other data into the table
    get the last inserted ID
    create a Lucene document
    put (ID, text) into the Lucene document
    update the Lucene index

Querying:
    search the Lucene index
    for each Lucene doc in the result set, load the data from the DB by the stored record's ID
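In actual Lucene.Net 3.0.x code, the query side of the steps above might look roughly like this; the field names come from the pseudocode, the index path is illustrative, and the final SQL step is left as a comment:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    var ids = new List<string>();

    using (var searcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo(@"C:\indexes\records")), true))
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var query = new QueryParser(Version.LUCENE_30, "text", analyzer).Parse("some search terms");

        foreach (var hit in searcher.Search(query, 50).ScoreDocs)
            ids.Add(searcher.Doc(hit.Doc).Get("ID"));   // the primary key stored at index time
    }

    // Then load the full rows from the database, e.g.
    // SELECT * FROM records WHERE id IN (...)  -- built from the collected IDs
    Console.WriteLine(string.Join(", ", ids));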
Just to note, I later switched from Lucene to Sphinx due to its superb performance.