Data Correlation in Large Databases - sql-server

We're trying to identify the locations of certain information stored across our enterprise in order to bring it into compliance with our data policies. On the file end we're using Nessus to search through various files, but I'm wondering about the database end.
Using Nessus would seem largely pointless because it would output the raw data without telling us which table or row it was in, or giving us much useful information, especially considering these databases are quite large (hundreds of gigabytes).
Also worth noting: this system needs to be able to do pattern-based matching (such as using regular expressions), not just a "dumb search" engine.
I've investigated the use of Data Mining and Data Warehousing in order to find this data but it seems like they're more for analysis of data than actually just finding data.
Is there a better method of searching through large amounts of data in a database to find this information? We're using both Oracle 11g and SQL Server 2008 and need to perform the searches on both, so I'd like to stay away from server-specific paradigms (although if I have to rewrite some code to translate between T-SQL and PL/SQL, and vice versa, I don't mind).

On SQL Server, for searching through large amounts of text, you can look into Full-Text Search.
Read more here: http://msdn.microsoft.com/en-us/library/ms142559.aspx
But if I am reading this right, you want to spider your database in a similar fashion to how a web search engine spiders web sites and web pages.
You could use a set of full-text queries that bring back results spanning multiple tables.
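As a rough illustration (the table and column names here are hypothetical, and full-text indexes are assumed to already exist on the searched columns), such a query set could look like:

    -- T-SQL sketch: union full-text hits from several tables into one result,
    -- tagging each hit with the table it came from.
    SELECT 'dbo.Customers' AS source_table, CustomerID AS row_id
    FROM dbo.Customers
    WHERE CONTAINS(Notes, '"account number"')
    UNION ALL
    SELECT 'dbo.Orders', OrderID
    FROM dbo.Orders
    WHERE CONTAINS(Comments, '"account number"');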

Oracle supports regular expressions with the REGEXP_LIKE() function, and it ought to be fairly straightforward to automate the generation of the code you need based on system metadata (to find all text columns over a certain length, for example, and include them in a predicate against that table to find the rows and values that match your regexp). Doesn't sound too challenging, really. In theory you could add check constraints to columns to prevent the insertion of values that match a regexp, but that might be overkill.
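A minimal sketch of that generator (the pattern here is a made-up SSN-style regexp; adjust it and the length threshold for your data):

    -- Emit one search statement per candidate text column, driven by the data dictionary.
    SELECT 'SELECT ''' || table_name || '.' || column_name
           || ''' AS hit, ROWID FROM ' || table_name
           || ' WHERE REGEXP_LIKE(' || column_name
           || ', ''[0-9]{3}-[0-9]{2}-[0-9]{4}'')' AS search_sql
    FROM user_tab_columns
    WHERE data_type IN ('VARCHAR2', 'CHAR', 'CLOB')
      AND (char_length >= 11 OR data_type = 'CLOB');

Spooling the generated statements to a script and running it gives you the table, row and matching value for every hit.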

Oracle Text is suited to searching for words/phrases in large(ish) bits of text (e.g. PDF, HTML, TXT or DOC files) held in the database. There is some limited fuzzy searching, but not regular expressions per se.
You don't really go into what sort of data you are looking for or what you have in your databases. Nessus indicates you are looking for security issues, but the title of "Data Correlation" suggests something completely different.
Really the data structures should provide the information about what to look for and where. That's what databases are about - structuring data for accessibility. A database backing a CMS, forum software or similar would be a different kettle of fish.

Related

Considering database options for new web application

I know this is a topic that's been addressed ad nauseam but I also know there are people who enjoy opining about databases so I figured I'd just go ahead and ask the question again.
I'm building out a web application that on a very basic level displays a list of objects that meet a user-defined search criteria. The primary function of the application will be to provide an interface by which a user can perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data.
Of course there will be ancillary information too: user accounts, lookup tables etc.
My background is entirely in relational database development, primarily SQL Server with a little bit of MySQL. However, I'm intrigued by the possible applicability of an object-relational approach or even a full-on document database. Without the experience of working in those paradigms I'm not sure what I might be getting myself into.
Here are some further considerations that may affect the decision:
The schema will likely evolve considerably over time as more properties and search options are added, creating the typical versioning/deployment challenges. This is the primary reason why I would consider a document database.
The application itself will likely be written in Node/Express with an Angular or React front-end using Typescript and so the code will be interacting with data in json format. In other words, regardless of what comes back from the db server, we want json on the code level. (Another case for a doc database.)
There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha. This would seem to me to be a strong case against a document db.
A potential use case would involve a user adjusting a slider control (let's say it controls high and low price parameters or a distance range). The selected parameters would then be packaged as a json object and sent to a search controller, which would then pass these parameters to the db server on change and expect a list of objects in return. In other words, the user would generally not be pushing a button to refine search criteria. The search update would happen each time they change a parameter.
I don't know the extent to which this is a thing or not, but it would also be great if there were some way to leverage technology that could cache search results and then search within those results if the search were narrowed, thus performing a second search only on the smaller subset of the first search rather than the entire universe of available objects.
I guess while I'm at it I should ask about ORMs. That's also something I'm not very experienced with (I've used Entity Framework a bit), but I'm wondering if I should expand my horizons.
Thanks and I look forward to your opinions!
I think that your requirement of "There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha" makes a strong case for using a relational database to store the data.
Leveraging an ORM that can support data in JSON format would seem ideal in your use case. Schema evolution for a system in production would surely be a challenge (not insurmountable, though), but it would be nice to use an ORM product that can at least easily support schema evolution during the development stage, when things are more likely to change and evolve rapidly.
Given the kind of queries you would typically be issuing (e.g., adjusting a slider control), an ORM that supports prepared statements that can have range criteria would be more efficient.
Also, given your need to "perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data", an ORM product that can easily support one-to-one, one-to-many, and many-to-many relationships and path-expressions in search criteria should simplify your development process.
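For instance, the price-slider case boils down to a prepared range query over an indexed column (names below are hypothetical; placeholder syntax varies by driver):

    -- Prepared statement bound by the ORM each time the slider moves.
    SELECT id, name, price
    FROM listings
    WHERE price BETWEEN ? AND ?
      AND distance_km <= ?;

    -- An index on the most selective filter keeps this fast as data grows.
    CREATE INDEX idx_listings_price ON listings (price);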

Best database and database design for this particular need?

I'm looking to store around 50-100 million documents in a database and be able to do queries at a very fast speed. A document would look something like this:
{
    name: 'example',
    value: '300,201,512'
}
The value column is always unique, name is not.
Now I want to be able to check, using a query, only whether a document with a specific value exists. What database would be the best choice, and which design would best achieve the fastest speed for such a query?
NoSQL databases try to offer certain functionality that more traditional relational database management systems do not. Whether it is holding simple key-value pairs for short lengths of time for caching purposes, or keeping unstructured collections of data that could not be easily dealt with using relational databases and the structured query language (SQL) – they are here to help.
In order to better understand the roles and underlying technology of each database management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management systems, simply because they can be considered the most basic and backbone implementation of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary. There is no structure and no relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42) which can later be retrieved the same way by supplying the key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic information after performing, for example, a CPU and memory intensive computation. They are extremely performant, efficient and usually easily scalable.
Note: When it comes to computers, a dictionary usually refers to a special sort of data object: a collection of individual keys, each matched to a value.
Column Based
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones.
Despite their complicated-to-understand image on the internet, these databases work very simply: they create collections of one or more key / value pairs that match a record.
Unlike the traditionally defined schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information, and each column of each record can be different.
Basically, column-based NoSQL databases are two-dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it. These management systems allow very large and unstructured data to be kept and used (e.g. a record with tons of information).
These databases are commonly used when simple key / value pairs are not enough, and storing very large numbers of records with large amounts of information is a must. DBMSs implementing column-based, schema-less models can scale extremely well.
Document Based
Document based NoSQL database management systems can be considered the latest craze, one that managed to take a lot of people by storm. These DBMSs work in a similar fashion to column-based ones; however, they allow much deeper nesting and more complex structures (e.g. a document, within a document, within a document).
Documents overcome the constraint of one or two levels of key / value nesting in columnar databases. Basically, any complex and arbitrary structure can form a document, which can be stored using these management systems.
Despite their powerful nature, and the ability to query records by individual keys, document based management systems have their own issues and downsides compared to others. For example, retrieving a single value of a record means fetching the whole record, and the same goes for updates, both of which hurt performance.
Graph Based
Finally, the very interesting flavour of NoSQL database management systems is the graph based ones.
Graph based DBMSs model the data in a completely different way than the previous three models. They use graph structures with nodes and edges connecting each other through relations.
As in mathematics, certain operations are much simpler to perform using these types of models thanks to the way they link and group related pieces of information (e.g. connected people).
These databases are commonly used by applications where connections between pieces of data need to be modeled explicitly. For example, when you register on a social network of any sort, your friends' connections to you, and their friends' friends' relations to you, are much easier to work with using graph-based database management systems.
Fastest document-based DBs:
1) MongoDB
2) DynamoDB
Here is the difference, for your reference.
I would give preference to DynamoDB.
Currently we are working on an AWS data lake, with really fast performance:
store data in S3 and get it back via Athena.
If you want to import the data into some database, then try using MS SQL Server 2008 R2, because it is very user friendly and allows you to do your work more accurately and precisely. If you want to do that without any cost, then MySQL will be a better option (a good MySQL editor is SQLyog). I hope this is beneficial for you.
Short Answer:
I think 100 million documents in your mentioned structure and conditions is not BIG ENOUGH to warrant NoSQL. You can handle them with PostgreSQL, MySQL, etc.
Note that Wikipedia used MySQL for a long time (not any more). See Reference.
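As a minimal relational sketch (PostgreSQL-flavoured, using the names from the question):

    CREATE TABLE documents (
        id    BIGSERIAL PRIMARY KEY,
        name  TEXT NOT NULL,
        value TEXT NOT NULL UNIQUE  -- the UNIQUE constraint builds a B-tree index
    );

    -- Existence check: a single index lookup, fast even at 100 million rows.
    SELECT EXISTS (SELECT 1 FROM documents WHERE value = '300,201,512');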

Appropriate Database to store 20 GB for Delphi, Firemonkey

I don't have experience in database development, so I need your suggestions on choosing a database that can be used with FireMonkey.
I need to store HTML files (without media for now, but they may include it later); their total size is around 20 GB (uncompressed text). The main requirement is maximally fast text search in the database, and it must be possible to implement human-friendly searching (like Google). Also, compression would be welcome (20 GB is too much to store; if compression makes searching slow, it's not required).
What kind of databases are appropriate for my concern?
Thanks a lot for your suggestions!
Edited
Requirements:
Price: Free
Location: local or remote
Operating system support: Windows
System requirements: a large footprint is fine (hopefully in exchange for better performance)
Performance: fast text searching
Concurrent users: 20
Full text indexing and searching: human (Google-like) fast text searching is required
Manageability: doesn't matter much
I know an online web legal database that can search through 100 GB of information in milliseconds. I need the same performance, and Google-like searching is required.
Delphi's database access layer is separate from FireMonkey; it's the same one used by the VCL (although FM, AFAIK, relies only on LiveBindings to access data, but that's not an issue in your case).
Today, 20 GB is really not much data. Almost any database will handle it without much effort if properly configured. What engine to choose depends on:
Price: how much are you going to spend for it?
Location: do you need a local database (same machine) or a remote one (LAN or WAN)?
Operating system support: which OS should it run on?
System requirements: do you need a database with a small footprint, or can you use one with a larger footprint (hopefully in exchange for better performance)?
Performance: what performance is required?
Concurrent users: how many users will connect to the database concurrently?
Full text indexing and searching: not all databases offer it out of the box
Manageability: some databases may require more management than others.
There is no "one database fits all" yet.
I'm no DBA so I can't say directly, and honestly I'm not sure that any one person could give a direct answer to this question, as it's one of those "it just depends" scenarios.
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
That's a good starting point to compare features and platform compatibility. I think the major thing to consider here is what hardware will be running it and how can you best utilize that to accomplish the task at hand.
If you have a server farm, be sure your DB supports distribution and some sort of load balancing (most do to some degree, from what I understand).
To speed up searching, I think you're going to want to keep the data uncompressed, unless you code up a custom algorithm that searches the compressed version somehow. Searching the compressed data might actually be faster if you can use the compressed file's index to compare against your plain-text search parameters: you first look for matches among the index keys, and only if any are found do you check for them within the compressed data. But without tons of custom code, I haven't heard of any DB that supports this idea of searching compressed text (though I could easily be wrong on this point).
If the entire data set needs to be decompressed before doing the search it will very likely be much slower (memory is relatively cheap compared to CPU time). It looks like Firemonkey has a limited selection of DBs to use so that will help to narrow your choices down as well.
What I would suggest, based on your edited question, is to write (or find) a parser or regular expression to extract all the important elements from the HTML that you would like to be searchable, then store those in a database along with a reference to where they were found in the HTML. In terms of Google-like searching, if you mean how it can correct misspellings and use synonyms, you probably need some custom code to do dictionary look-ups for spelling and thesaurus look-ups for synonyms. I believe full-text searching in any modern DB will handle the need to query with LIKE or similar statements in the WHERE clause.
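A sketch of that extraction table (all names here are made up):

    -- One row per searchable element pulled out of the HTML.
    CREATE TABLE html_elements (
        id       INTEGER PRIMARY KEY,
        doc_id   INTEGER NOT NULL,  -- which HTML file the element came from
        tag      VARCHAR(32),       -- e.g. 'title', 'h1', 'p'
        position INTEGER,           -- offset within the document
        content  TEXT
    );

    SELECT doc_id, position
    FROM html_elements
    WHERE content LIKE '%search term%';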
Looks like ldsandon's answer covers most of this anyhow. TL;DR; if not, thanks for reading.
I would recommend PostgreSQL for this task. It has good performance and built-in full-text search capability for Google-like searching. And it's free and open source.
Unfortunately Delphi doesn't come with Postgres data access components out of the box. You can connect by ODBC, or you can purchase components available from, for example, Devart, DA-Soft or microOLAP.
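A minimal sketch of what that full-text capability looks like (table and column names are hypothetical):

    -- Expression index so the tsvector doesn't have to be recomputed per query.
    CREATE TABLE pages (
        id   SERIAL PRIMARY KEY,
        body TEXT
    );
    CREATE INDEX pages_fts_idx
        ON pages USING GIN (to_tsvector('english', body));

    -- Word-based query with stemming ('searching' also matches 'search').
    SELECT id
    FROM pages
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'fast & searching');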
Have you considered NoSQL databases? The Wikipedia article explains their differences to SQL databases and also mentions that they are suited as document store.
http://en.wikipedia.org/wiki/NoSQL
The article lists around twelve implementations in the document store category, many are open source. (Jackrabbit, CouchDB, MongoDB).
This question on Stackoverflow contains some pointers to Delphi clients:
Delphi and NoSQL
I would also consider caching on the application server, to speed up search. And of course a text indexing solution like Apache Lucene.
I would take Microsoft SQL Server Express Edition. I think 2008 R2 is the latest stable version, but there is also Denali (2011). It matches all the criteria you have.
You can use ADO to work with it.
Try the Advantage Database Server.
It's easy to manage and configure.
Both dbase-like and SQL data management languages.
Fast indexed full text search capabilities.
Plus, unparalleled support from the developers themselves.
The local server (stand-alone version, as opposed to the network based server) is free.
devzone.advantagedatabase.com
There is a Firebird version with full text search, according to its documentation - http://www.red-soft.biz/en/document_21 - it uses Apache Lucene, a popular search engine library.

What strategy to migrate data from a spreadsheet to an RDBMS?

This is linked to my other question when to move from a spreadsheet to RDBMS
Having decided to move to an RDBMS from an Excel workbook, here is what I propose to do.
The existing data is loosely structured across two sheets in a workbook. The first sheet contains the main records. The second sheet holds additional data.
My target DBMS is MySQL, but I'm open to suggestions.
Define RDBMS schema
Define, say, web services to interface with the database, so the same interface can be used for both the UI and migration.
Define a migration script to
Read each group of affiliated rows from the spreadsheet
Apply validation/constraints
Write to RDBMS using the web-service
Define macros/functions/modules in the spreadsheet to enforce validation where possible. This will allow use of the existing system while the new one comes up. At the same time, (I hope) it will reduce migration failures when the move is eventually made.
What strategy would you follow?
There are two aspects to this question.
Data migration
Your first step will be to "Define RDBMS schema" but how far are you going to go with it? Spreadsheets are notoriously unnormalized and so have lots of duplication. You say in your other question that "Data is loosely structured, and there are no explicit constraints." If you want to transform that into a rigorously defined schema (at least 3NF) then you are going to have to do some cleansing. SQL is the best tool for data manipulation.
I suggest you build two staging tables, one for each worksheet. Define the columns as loosely as possible (big strings basically) so that it is easy to load the spreadsheets' data. Once you have the data loaded into the staging tables you can run queries to assess the data quality:
how many duplicate primary keys?
how many different data formats?
what are the look-up codes?
do all the rows in the second worksheet have parent records in the first?
how consistent are code formats, data types, etc?
and so on.
These investigations will give you a good basis for writing the SQL with which you can populate your actual schema.
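A rough sketch of the staging approach (MySQL-flavoured; the column names are invented):

    -- Everything is a big string, so the load never fails on data types.
    CREATE TABLE stg_main (
        row_num     INT,
        member_id   VARCHAR(255),
        member_name VARCHAR(255),
        joined_date VARCHAR(255)
    );

    -- Duplicate would-be primary keys:
    SELECT member_id, COUNT(*)
    FROM stg_main
    GROUP BY member_id
    HAVING COUNT(*) > 1;

    -- A crude look at how many date formats are in play:
    SELECT LENGTH(joined_date) AS fmt_length, COUNT(*)
    FROM stg_main
    GROUP BY LENGTH(joined_date);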
Or it might be that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure, we just have to dig deep enough).
Data Loading
Your best bet is to export the spreadsheets to CSV format. Excel has a wizard to do this; use it (rather than doing Save As...). If the spreadsheets contain any free text at all, the chances are you will have sentences which contain commas, so make sure you choose a really safe separator, such as ^^~
Most RDBMSs have a facility to import data from CSV files. PostgreSQL and MySQL are the obvious options for an NGO (I presume cost is a consideration), but both SQL Server and Oracle come in free (if restricted) Express editions. SQL Server obviously has the best integration with Excel. Oracle has a nifty feature called external tables which allows us to define a table whose data is held in a CSV file, removing the need for staging tables.
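For example, the MySQL load of the staging table sketched above might look like this (the Postgres equivalent is COPY, which accepts only a single-character delimiter, so choose accordingly):

    -- MySQL: load the exported sheet; the separator matches the export step.
    LOAD DATA INFILE '/tmp/main_sheet.csv'
    INTO TABLE stg_main
    FIELDS TERMINATED BY '^^~'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;  -- skip the header row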
One other thing to consider is Google App Engine. This uses Big Table rather than an RDBMS but that might be more suited to your loosely-structured data. I suggest it because you mentioned Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less, they start charging if usage exceeds some very generous thresholds) and it would solve the app sharing issue with those other NGOs. Obviously your organisation may have some qualms about Google hosting their data. It depends on what field they are operating in, and the sensitivity of the information.
Obviously, you need to create a target DB and the necessary table structure.
I would skip the web services and write a groovy script which reads the .xls (using the POI library), validates and saves the data in the database.
In my view, anything more involved (web services, GUI...) is not justified: these kinds of tasks are very well suited for scripts because they're concise and extremely flexible while things like performance, code base scalability and such are less of an issue here. Once you have something that works, you will be able to adapt the script to any future document with different data anomalies you run into in a matter of minutes or a few hours.
This is all assuming your data isn't in perfect order and needs to be filtered and/or cleaned.
Alternatively, if the data and validation rules aren't too complex, you can probably get good results with a visual data transfer tool like Kettle: you just define the .xls as your source, the database table as the target, add some validation/filter rules if needed, and trigger the loading process. Quite painless.
If you'd rather use a tool than roll your own, check out SeekWell, which lets you write to your database from Google Sheets. Once you define your schema, select the tables into a Sheet, then edit or insert the records and mark them for the appropriate action (e.g. update, insert, etc.). Set the schedule for the update and you're done. Read more about it here. Disclaimer: I'm a co-founder.
Hope that helps!
You might be doing more work than you need to. Excel spreadsheets can be saved as CSV or XML files, and many RDBMS clients support importing these files directly into tables.
This could allow you to skip writing web service wrappers and migration scripts. Your database constraints would still be properly enforced during any import. If your RDBMS data model or schema is very different from your Excel spreadsheets, however, then some translation would of course have to take place via scripts or XSLT.

Configure Lucene.Net with SQL Server

Has anyone used Lucene.NET rather than the full text search that comes with SQL Server?
If so I would be interested on how you implemented it.
Did you, for example, write a Windows service that queried the database every hour and then saved the results to the Lucene.NET index?
Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.
I've used Lucene.NET as a general database indexer, so what I got back was basically DB IDs (to indexed email messages), and I've also used it to get back enough info to populate search results and such without touching the database. It's worked great in both cases, though the SQL can get a little slow, as you pretty much have to get an ID, select by that ID, etc. We got around this by making a temp table (with just the ID column in it) and bulk-inserting from a file (which was the output from Lucene), then joining to the message table. It was a lot quicker.
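That trick looked roughly like this (T-SQL; the names are illustrative, not our actual schema):

    -- Land the IDs Lucene returned, then join to the real table once.
    CREATE TABLE #lucene_hits (MessageID INT PRIMARY KEY);

    BULK INSERT #lucene_hits
    FROM 'C:\temp\lucene_hits.txt';  -- file produced from the Lucene results

    SELECT m.MessageID, m.Subject, m.SentDate
    FROM Messages AS m
    JOIN #lucene_hits AS h ON h.MessageID = m.MessageID;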
Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very very good at what it does. Worth a look, and, I'm told, doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's FTI does.
BTW, we were dealing with 20-50 million emails (and around 1 million unique attachments), totalling about 20 GB of Lucene index I think, and 250+ GB of SQL database + attachments.
Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (when it merges index segments). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which have 1 million items each, and you have a watcher thread which kills the process if it takes too long..... (yes, that kicked our arse for a while). So keep the max number of documents per segment LOW (i.e. don't set it to maxint like we did!).
EDIT: Corey Trager documented how to use Lucene.NET in BugTracker.NET here.
I have not done it against a database yet; your question is kinda open.
If you want to search a DB and can choose to use Lucene, I also guess that you can control when data is inserted into the database.
If so, there is little reason to poll the DB to find out if you need to reindex - just index as you insert, or create a queue table which can be used to tell Lucene what to index.
I think we don't need another indexer that is ignorant about what it is doing, reindexing every time, or using resources wastefully.
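A sketch of that queue-table idea (T-SQL; the Documents table is hypothetical):

    -- Rows to (re)index land here; the indexer drains this instead of rescanning.
    CREATE TABLE IndexQueue (
        DocumentID INT NOT NULL,
        QueuedAt   DATETIME NOT NULL DEFAULT GETDATE()
    );
    GO

    CREATE TRIGGER trg_Documents_Queue
    ON Documents
    AFTER INSERT, UPDATE
    AS
        INSERT INTO IndexQueue (DocumentID)
        SELECT DocumentID FROM inserted;
    GO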
I have used Lucene.NET as a storage engine too, because it's easier to distribute and set up alternate machines with an index than with a database - it's just a filesystem copy; you can index on one machine and just copy the new files to the other machines to distribute the index. All the searches and details are shown from the Lucene index, and the database is just used for editing. This setup has proven to be a very scalable solution for our needs.
Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full text search is that the service is decoupled from the relational engine, so joins, orders, aggregates and filters between the full text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating the full text search inside the relational engine, but I haven't tested it. They also made the whole full text search much more transparent; in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, and in the new version it is easier to see how they work.
In my experience, if SQL Server meets your requirements, it will be the easiest way; if you expect a lot of growth, complex queries, or need a lot of control over the full text search, you might consider working with Lucene from the start, because it will be easier to scale and personalise.
I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudocode it looks like this:
Store record:
    insert text and other data into the table
    get latest inserted ID
    create Lucene document
    put (ID, text) into Lucene document
    update Lucene index
Querying:
    search Lucene index
    for each Lucene doc in the result set, load data from the DB by the stored record's ID
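The MySQL half of that flow, as a sketch (the table name is made up):

    -- Store: insert the record, then grab the ID for the Lucene document.
    INSERT INTO articles (body, other_data) VALUES (?, ?);
    SELECT LAST_INSERT_ID();

    -- Query: after Lucene returns matching IDs, load them in one round trip.
    SELECT id, body, other_data
    FROM articles
    WHERE id IN (?, ?, ?);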
Just to note, I switched from Lucene to Sphinx due to its superb performance.
