What is the most effective method for handling large-scale dynamic data for a recommendation system?

We're planning a recommendation system based on large-scale data, and we're looking for a sound way to keep a dynamic DB structure that stays fast to work with. We have considered several approaches. The first is to keep the data in an ordinary SQL database, but that would be slower than using plain files. The second is to use a NoSQL graph database, but that does not fit the algorithms we use, since we continuously pull all the data into a matrix. The last is to keep the data in plain files, but then it is harder to track and inspect changes, since there is no query method or editor. Each method has its pros and cons. What would your choice be, and why?

I'm not sure why you mention "files" and "file structure" so many times, so maybe I'm missing something, but for efficient data processing, you obviously don't want to store things in files. It is expensive to read/write data to disk, and it's hard to find an efficient, flexible way to query files in a file system.
I suppose I'd start with a product that already does recommendations:
http://mahout.apache.org/
You can pick from various algorithms to run on your data for producing recommendations.
If you want to do it yourself, maybe a hybrid approach would work? You could still use a graph database to represent relationships, but then each node/vertex could be a pointer to a document database or a relational database where a more "full" representation of the data would exist.
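A minimal sketch of that hybrid idea, under loud assumptions: a plain dict stands in for the graph database, an in-memory SQLite table stands in for the "full" record store, and the `items` table, the `similar_to` edges and the `recommend` helper are all made up for illustration.

```python
import sqlite3

# Full records live in a relational store (in-memory SQLite for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO items (id, title, genre) VALUES (?, ?, ?)",
    [(1, "Alien", "sci-fi"), (2, "Aliens", "sci-fi"), (3, "Heat", "crime")],
)

# The graph side only stores ids; this dict is a stand-in for a graph database.
similar_to = {1: [2], 2: [1], 3: []}


def recommend(item_id):
    """Follow graph edges first, then load the full records by id."""
    neighbour_ids = similar_to.get(item_id, [])
    if not neighbour_ids:
        return []
    placeholders = ",".join("?" * len(neighbour_ids))
    rows = conn.execute(
        f"SELECT id, title, genre FROM items WHERE id IN ({placeholders}) ORDER BY id",
        neighbour_ids,
    )
    return rows.fetchall()
```

The point of the split is that the relationship structure stays small and traversable, while the heavier records are only fetched for the ids the traversal actually reaches.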

Related

Which is faster: interacting with a database or using a file system for input/output?

I was wondering what threshold of data volume may determine whether to use a database or a simple file I/O, assuming that fresh data needs to be handled quite frequently.
Edit: There is no multi-threading in my application. Data needs to be stored and then retrieved sequentially and at this point I am not really worried about anyone else accessing the data/data safety.
Given this backdrop is there still any advantage to using databases over files?

It depends and you probably should consider other factors as well.
If you use a database, there is an overhead for transactions, security, index management etc. on the one hand. On the other hand you can get caching (which could significantly speed up your application) and better performance for random access, if you have a lot of data. In a multithreaded environment I suggest using a database because of its properly implemented locking mechanism.
Flat files are OK for really simple and small data. Do you really need to open and close them so often?

If your table is indexed correctly, I think a database will give you better performance than the file system. Also, even if the database grows to millions of records, performance should not suffer the way it would with that much data in the file system.
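To see what "indexed correctly" changes, here is a small sketch using Python's built-in sqlite3 module. The `readings` table, the index name and the `query_plan` helper are assumptions made up for the example; the exact wording of SQLite's plan text varies between versions, so the sketch only collects it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [(f"sensor-{i % 100}", float(i)) for i in range(10_000)],
)


def query_plan(sql, params=()):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


lookup = "SELECT value FROM readings WHERE sensor = ?"

plan_before = query_plan(lookup, ("sensor-7",))  # full table scan
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
plan_after = query_plan(lookup, ("sensor-7",))   # lookup through the index
```

Without the index the plan is a scan of the whole table; with it, the plan names `idx_readings_sensor`, which is what keeps lookups fast as the row count grows.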

A database is probably preferred, and in this case I'd suggest using SQLite instead of SQL Server or MySQL, as the data is small.

In this case I would say DB. You are writing and reading, and that's what DBs are good at.
On the flip side, if you are holding a tiny amount of data, that's a lot of overhead for not much data.
It also depends on licensing etc. A file will be a lot quicker.

Storing and processing high data volume

Good day!
I have 350GB unstructured data disaggregated by 50-80 columns.
I need to store this data in NoSQL database and do a variety of selection and map / reduce queries filtered by 40 columns.
I would like to use mongodb, so I have a certain question: is this database able to cope with this task and what do I need to implement its architecture within the existing provider hetzner.de?

Yes, large datasets are easy.
Perhaps Apache Hadoop is also worth looking at. It is aimed at handling/analyzing large/huge amounts of data.

MongoDB is a very scalable and flexible database, if used properly. It can store as much data as you need, but the bottom line is whether you can query your data efficiently.
Comments:
You will need to make sure you have the proper indexes in place and that a fair amount of them can fit in RAM.
To achieve that you may need to use sharding to split the working set.
The current mapreduce is easy to use and can iterate over all your data, but it is rather slow. It should become faster in the next MongoDB release, and there will also be a new aggregation framework to complement mapreduce.
The bottom line is that you should not treat MongoDB as a magical store that will be perfect out of the box. Make sure you read the good docs and materials :)

database vs. flat files

The company I work for is trying to switch a product that uses a flat file format to a database format. We're handling pretty big files of data (e.g. 25GB/file) and they get updated very quickly. We need to run queries that access the data randomly, as well as contiguously. I am trying to convince them of the advantages of using a database, but some of my colleagues seem reluctant. So I was wondering if you guys can help me out here with some reasons or links to posts on why we should use databases, or at least clarify why flat files are better (if they are).

1) Databases can handle querying tasks, so you don't have to walk over files manually. Databases can handle very complicated queries.
2) Databases can handle indexing tasks, so tasks like "get record with id = x" can be VERY fast.
3) Databases can handle multiprocess/multithreaded access.
4) Databases can handle access from the network.
5) Databases can watch for data integrity.
6) Databases can update data easily (see 1).
7) Databases are reliable.
8) Databases can handle transactions and concurrent access.
9) Databases + ORMs let you manipulate data in a very programmer-friendly way.
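The integrity and transaction points in the list above can be sketched with Python's built-in sqlite3 module. The `accounts` table and the `transfer` helper are hypothetical; the idea is only that a failed step undoes the whole unit of work, which you would otherwise have to hand-code for a flat file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint rejected a negative balance
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account fails on the second update, and the credit already applied to the destination is rolled back with it.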

This is an answer I've already given some time ago:
It depends entirely on the domain-specific application needs. A lot of times direct text file/binary file access can be extremely fast and efficient, as well as providing you with all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or it is easy to make one) for specific parsing.
If what you need is many appends (INSERTS?), sequential or infrequent access, and little or no concurrency, files are the way to go.
On the other hand, when you have requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, or your data is relational by nature etc., you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C, and highly ubiquitous (if it isn't already included in your programming language, as it is for example in Python, there is surely a binding available). It can be useful even on db files as big as 140 terabytes, or 128 tebibytes (Link to Database Size), possibly more.
If your requirements were bigger, there wouldn't even be a discussion: go for a full-blown RDBMS.
As you say in a comment that "the system" is merely a bunch of scripts, then you should take a look at pgbash.
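For the append-heavy, sequential-read case that this answer says files handle well, a minimal Python sketch might look like the following. The JSON-lines layout, the `events.jsonl` file name and both helper functions are assumptions chosen for illustration, not something the answer prescribes.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_records(path):
    """Stream records back in the order they were written."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


log_path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
for i in range(3):
    append_record(log_path, {"seq": i, "payload": f"event-{i}"})

seqs = [record["seq"] for record in read_records(log_path)]
```

This stays simple exactly as long as access is sequential and single-writer; once you need lookups by key or concurrent writers, the trade-off described above shifts toward a database.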

Don't build it if you can buy it.
I heard this quote recently, and it really seems fitting as a guide line. Ask yourself this... How much time was spent working on the file handling portion of your app? I suspect a fair amount of time was spent optimizing this code for performance. If you had been using a relational database all along, you would have spent considerably less time handling this portion of your application. You would have had more time for the true "business" aspect of your app.

They're faster; unless you're loading the entire flat file into memory, a database will allow faster access in almost all cases.
They're safer; databases are easier to safely backup; they have mechanisms to check for file corruption, which flat files do not. Once corruption in your flat file migrates to your backups, you're done, and you might not even know it yet.
They have more features; databases can allow many users to read/write at the same time.
They're much less complex to work with, once they're setup.
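As one concrete instance of the backup and corruption-check points above, SQLite exposes both through Python's built-in sqlite3 module. This is a sketch with a throwaway table `t`; other database engines have their own equivalents, which are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t (payload) VALUES (?)", [("a",), ("b",)])
conn.commit()

# Ask the engine to verify its own file structure.
check_result = conn.execute("PRAGMA integrity_check").fetchall()

# Take a consistent copy of the database, even while it is open.
backup_conn = sqlite3.connect(":memory:")
conn.backup(backup_conn)
backup_count = backup_conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Running the integrity check before each backup is one way to avoid the scenario described above, where corruption silently migrates into the backups.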

The type of files is not mentioned. If they're media files, go ahead with flat files. You probably just need a DB for tags and some way to associate the "external BLOBs" with the records in the DB. But if full text search is something you need, there's no other way to go but to migrate to a full DB.
Another thing: your filesystem might impose a ceiling on the number of physical files.

Databases all the way.
However, if you still have a need for storing files and don't have the capacity to take on a new RDBMS (like Oracle, SQL Server, etc.), then look into XML.
XML is a structured file format which lets you store things as a file but gives you query power over the file and the data within it. XML files are easier to read than flat files and can easily be transformed by applying an XSLT for even better human readability. XML is also a great way to transport data around if you must.
I strongly suggest a DB, but if you can't go that route, XML is an ok second.
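A small sketch of the "query power" XML gives you, using Python's standard xml.etree.ElementTree and its limited XPath support. The `<library>` document and its element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = """<library>
  <record id="1" kind="sensor"><name>alpha</name><value>3.5</value></record>
  <record id="2" kind="log"><name>beta</name><value>1.0</value></record>
  <record id="3" kind="sensor"><name>gamma</name><value>7.25</value></record>
</library>"""

root = ET.fromstring(doc)

# Select records by attribute value, then read a child element from each.
sensor_names = [r.findtext("name") for r in root.findall("./record[@kind='sensor']")]

# Aggregate over the selected records.
sensor_total = sum(float(r.findtext("value")) for r in root.findall("./record[@kind='sensor']"))
```

Note that this parses the whole file into memory on every load, which is fine for modest files but is not a substitute for an indexed database at the 25GB scale mentioned in the question.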

What about a non-relational (NoSQL) database such as Amazon's SimpleDB, Tokyo Cabinet, etc? I've heard that Google, Facebook and LinkedIn are using these to store their huge datasets.
Can you tell us if your data is structured, if your schema is fixed, if you need easy replicability, if access times are important, etc?

The differences between databases and flat files are given below:
Databases provide more flexibility, whereas flat files provide less.
Database systems provide data consistency, whereas flat files cannot.
Databases are more secure than flat files.
Databases support DML and DDL, whereas flat files do not.
There is less data redundancy in a database and more in flat files.

SQL ad hoc query abilities are enough of a reason for me. With a good schema and indexing on the tables, this is fast and effective and will have good performance.
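As an illustration of an ad hoc query, the sketch below answers a question nobody planned for when the data was stored ("which items have at least two ratings, best first?") in a single statement. The `ratings` table and its contents are made up, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id TEXT, item TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("u1", "A", 5), ("u2", "A", 3), ("u1", "B", 4), ("u3", "B", 5), ("u2", "C", 1)],
)
conn.execute("CREATE INDEX idx_ratings_item ON ratings (item)")

top_items = conn.execute(
    """
    SELECT item, AVG(score) AS avg_score, COUNT(*) AS n
    FROM ratings
    GROUP BY item
    HAVING COUNT(*) >= 2
    ORDER BY avg_score DESC
    """
).fetchall()
```

With flat files, the same question would mean writing and testing a small program each time the question changes.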

Unless you are loading the files into memory each time you boot, use a database. Simple as that.
That is assuming that your colleagues already have a program to handle queries against the files. If not, then use a database.

Although other answers are good, I would like to emphasize a point that was not really well talked about:
The developer's ease of use. Databases are much simpler to work with! If you don't have any strong reason(s) for using files, use a database.

Database recommendation

I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want any thing complicated. Till now, I'm leaning towards SQLite. It's small, runs within the client process and is claimed to be fast. Besides I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion whether it is a good choice given my requirements. Also is there any equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the above-said CAD objects that I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors etc.) and save back to the library. I also expect users to add their own newly-created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)

The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism to write data structures to a file and read them back in again called serialization. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program and use one of these serialization libraries to populate the instances of that class or data structure.
edit: The requirement that the structures be mutable doesn't really affect much with regard to my answer - I still think that serialization and deserialization is the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshall SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
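The answer's language is C#, but the idea can be sketched with Python's pickle, which it mentions. Everything here is an assumption for illustration: the library is a plain dict of shapes, the file name is arbitrary, and the write-to-a-temp-file-then-replace step is just one way to make sure a save either completes fully or leaves the old file untouched.

```python
import os
import pickle
import tempfile


def save_library(shapes, path):
    """Write the whole library to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(shapes, f)
        os.replace(tmp_path, path)  # the old file is replaced in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_library(path):
    # Only unpickle files you created yourself; pickle is not safe for untrusted input.
    with open(path, "rb") as f:
        return pickle.load(f)


library_path = os.path.join(tempfile.mkdtemp(), "shapes.pickle")
shapes = {
    "corner_pair": {"color": "grey", "vertices": [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]},
}
save_library(shapes, library_path)

# A user edit: load, change, save back.
edited = load_library(library_path)
edited["corner_pair"]["color"] = "red"
save_library(edited, library_path)
reloaded = load_library(library_path)
```

With around 300 small objects, rewriting the whole file on every save is cheap, which is why the extra planning mentioned above can stay this small.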

I suggest you consider using H2; it's really lightweight and fast.

When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.

How many objects are you shipping? Can you define each of these Objects and their coordinates in an xml file? So basically use a distinct xml file for each object? You can place these xml files in a directory. This can be a simple structure.

I would not use a SQL database. You can easily describe every 3D object with an XML file. Put these files in a directory and pack (zip) them all. If you need easy access to the metadata of the objects, you can generate an index file (with only the name or description) so that not all objects have to be parsed and loaded into memory (nice if you have something like a library manager).
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code you can use for this).
Many similar applications use XML today. It's easy to parse and write, human readable, and does not need much space when zipped.
I have used SQLite; it's easy to use and easy to integrate with your own objects. But I would prefer a SQL database like SQLite for applications where you need good search tools over a huge number of data records.
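The layout described above (one XML file per object, zipped together with a small index) could be sketched in Python roughly as follows. The element names, the `index.json` file and all helper functions are hypothetical, and the archive is kept in memory only to keep the example self-contained.

```python
import io
import json
import zipfile
import xml.etree.ElementTree as ET


def object_to_xml(name, vertices):
    """Serialize one object as a small XML document."""
    root = ET.Element("object", name=name)
    for x, y, z in vertices:
        ET.SubElement(root, "vertex", x=str(x), y=str(y), z=str(z))
    return ET.tostring(root, encoding="unicode")


def build_archive(objects):
    """Pack one XML file per object plus an index listing names and files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        index = []
        for name, vertices in objects.items():
            filename = f"{name}.xml"
            zf.writestr(filename, object_to_xml(name, vertices))
            index.append({"name": name, "file": filename})
        zf.writestr("index.json", json.dumps(index))
    return buffer.getvalue()


def list_objects(archive_bytes):
    """Read only the index, without parsing any object file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return [entry["name"] for entry in json.loads(zf.read("index.json"))]


def load_object(archive_bytes, name):
    """Parse a single object on demand."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        root = ET.fromstring(zf.read(f"{name}.xml"))
    return [(float(v.get("x")), float(v.get("y")), float(v.get("z"))) for v in root.findall("vertex")]


archive = build_archive({"cube": [(0, 0, 0), (1, 1, 1)], "line": [(0, 0, 0), (2, 0, 0)]})
```

A library manager would call `list_objects` to populate its list and `load_object` only for the object the user actually opens.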

For the specific requirement i.e. to provide a library of objects shipped with the application a database system is probably not the right answer.
First thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
Second thing is that the data you're shipping is immutable - for this purpose therefore you don't need the capabilities of a relational db, just to be able to access a particular model with adequate efficiency.
For simplicity (sort of) an XML file would do nicely as you've got good structure. Using that as a basis you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET) etc, etc.
Obviously if SQLite stores its data in a single file per database and if you have other reasons to need the capabilities of a db in you storage system then yes, but I'd want to think about the utility of the db to the app as a whole first.

SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible

Databases versus plain text

When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.

1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Isolation, and Durability are not challenges with a flat file, then go with a flat file.

For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
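A sketch of a query that involves more than a single relationship, using SQLite from Python. The three tables and the numbers in them are invented sample data; the query finds genes whose level in the "treated" sample is more than double the level in the "control" sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samples (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE genes (id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE expression (
        sample_id INTEGER REFERENCES samples(id),
        gene_id INTEGER REFERENCES genes(id),
        level REAL
    );
    INSERT INTO samples VALUES (1, 'control'), (2, 'treated');
    INSERT INTO genes VALUES (1, 'TP53'), (2, 'BRCA1');
    INSERT INTO expression VALUES (1, 1, 2.0), (2, 1, 8.5), (1, 2, 3.0), (2, 2, 3.1);
    """
)

# Join genes to their expression rows twice, once per sample label.
upregulated = conn.execute(
    """
    SELECT g.symbol, c.level, t.level
    FROM genes g
    JOIN expression c ON c.gene_id = g.id
    JOIN samples sc ON sc.id = c.sample_id AND sc.label = 'control'
    JOIN expression t ON t.gene_id = g.id
    JOIN samples st ON st.id = t.sample_id AND st.label = 'treated'
    WHERE t.level > 2 * c.level
    ORDER BY g.symbol
    """
).fetchall()
```

Doing the same over three flat files means hand-writing the lookups and the pairing logic, which is the complexity the answer says a set-based language removes.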

I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact

I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I only had a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
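In Python, the "two lines away" claim looks roughly like this. SQLite is used through the standard sqlite3 module; for the key/value side, the standard dbm module is used here as a stand-in, since BerkeleyDB bindings are not part of the standard library. The file names and the stored values are arbitrary.

```python
import dbm
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

# Relational: connect (the file is created if missing) and start issuing SQL.
conn = sqlite3.connect(os.path.join(workdir, "project.db"))
conn.execute("CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO runs (note) VALUES (?)", ("first run",))
conn.commit()
notes = [row[0] for row in conn.execute("SELECT note FROM runs")]
conn.close()

# Key/value: open (flag "c" creates the store if missing) and use it like a dict.
with dbm.open(os.path.join(workdir, "cache"), "c") as kv:
    kv["seq:42"] = "ACGT"
    cached = kv["seq:42"]  # values come back as bytes
```

Neither store needs a server process or any configuration, which is what makes them practical for prototypes.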

The problem with small projects is that they become bigger before we know it. And once they do, we start missing the SQL capabilities.
Always design such that a db can be utilized later on if required without ripping apart half of the application.
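One way to follow that advice is to hide storage behind a tiny interface so the rest of the code never touches files or SQL directly. The sketch below is an assumption-laden illustration: the `put`/`get` interface, both store classes and `record_result` are made-up names, not an established pattern from any library.

```python
import json
import os
import sqlite3
import tempfile


class FlatFileStore:
    """Keeps all records in one JSON file; fine while the data stays small."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def put(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get(self, key):
        return self._load().get(key)


class SqliteStore:
    """Same interface, backed by a SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def put(self, key, value):
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, json.dumps(value))
            )

    def get(self, key):
        row = self.conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return json.loads(row[0]) if row else None


def record_result(store, run_id, score):
    """Application code: depends only on put/get, not on how data is stored."""
    store.put(run_id, {"score": score})


workdir = tempfile.mkdtemp()
stores = [FlatFileStore(os.path.join(workdir, "results.json")), SqliteStore(":memory:")]
for store in stores:
    record_result(store, "run-1", 0.93)
results = [store.get("run-1") for store in stores]
```

Switching from the file-backed store to the database-backed one is then a one-line change where the store is constructed.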

It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.
If what you need is many appends (INSERTS?) and sequential/few access little/no concurrency, files are the way to go.
On the other hand, when your requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, your data is relational by the nature etc., you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C, and highly ubiquitous (if it isn't already included in your programming language, as it is for example in Python, there is surely a binding available). It can be useful even on db files as big as 1GB, possibly more.
If your requirements were bigger, there wouldn't even be a discussion: go for a full-blown RDBMS.

For the kind of applications you are developing in bioinformatics, you are often doing one-shot applications (often scripts that define a workflow of calculations) that answer a specific question, and you are not likely to reuse these applications after you have answered your question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, run some local algorithms on the data gathered from different sources, and produce some tabular or structured output format (xml, json, etc).
For that, I would suggest you use workflow tools like Knime, or a commercial solution like Inforsense KDE, Accelrys's Pipeline pilot, or Snaplogic. They allow you to query data in a variety of formats and locations (RDBMS, flat files, web services), run algorithms, and build powerful web apps that let you easily publish your workflows to your users and let them interact at specific points.
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change everyday, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal reports, jasper reports or whatever reporting solution available out there and show data to your users in a better shape than a spreadsheet or a csv file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.

In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.

For bioinformatics you may be interested in this: Blast on DB. The guy working on it is a friend of mine who works on fast sequence similarity search, and he found that building his own binary storage worked better than using databases at this point.
I don't know the specific details of his solution, but you can probably exchange an idea or two by mailing him, or even share code.

Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full on database.

First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.

If you know the format of your data, flat files, if faster/easier to develop with, will be fine. If you expect your record formats to change frequently during development then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
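The schema flexibility mentioned above ("ALTER TABLE is your friend") can be sketched with SQLite from Python. The `samples` table and the new `batch` column are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO samples (label) VALUES (?)", [("a",), ("b",)])

# The record format changes mid-development: add a field without touching existing rows.
conn.execute("ALTER TABLE samples ADD COLUMN batch TEXT DEFAULT 'unknown'")

columns = [row[1] for row in conn.execute("PRAGMA table_info(samples)")]
rows = conn.execute("SELECT id, label, batch FROM samples ORDER BY id").fetchall()
```

Existing rows pick up the default value, whereas a flat file would need every record rewritten, or every reader taught to handle both layouts.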

Use whatever persistence technology you're most comfortable with, and scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).

As someone also doing research in Bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure you need one. If you are on the fence, go with the databaseless solution and stick with flat files. It is also important to note that traditionally Bioinformatics researchers have gone the flat file route, which means there are well defined file formats for most types of data in the field. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.