Is it recommended to store certain data in a config file while referencing it by ID in the database? Specifically, when storing information that's rarely or never updated.
For example, I may have different roles for users in my application. As these roles are hardly ever (or never) updated, is it really necessary to store them in a roles table and reference them by ID in the users table? I could just as well have the roles defined in an array. That way the database wouldn't need to be called every time to get role information (which is required on every page).
Another option would be to cache the whole bunch of role-information from the database. Not sure if this is any better than simply having an array stored somewhere.
The same question can be asked about storing any application-related data that's edited not by the users but by the developers.
Is caching role information from the database any better than simply having an array stored somewhere? It may not seem so, until you have a bug in the application, a network failure or a good old-fashioned power cut. At that point, the cached solution will just re-read the "master" data from the database and keep churning away, safe in the knowledge that the data was protected by database transactions and a backup regime.
If such a calamity befell the "separate array" solution, it could corrupt the array storage or leave it inconsistent with the rest of the data (there is no referential integrity).
On top of that, storing data centrally means changes are automatically visible to all clients. You can store files "centrally" via shared folders or FTP and such, but why not use the database if it's already there?
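To make the cached-from-database approach concrete, here is a minimal sketch (my own illustration, not part of the original answer). It assumes a Python application, a hypothetical roles table with id and name columns, and an sqlite3 connection; any real database driver would work the same way.

```python
import sqlite3
import time

_CACHE = {"roles": None, "loaded_at": 0.0}
CACHE_TTL_SECONDS = 300  # re-read from the database at most every 5 minutes

def get_roles(conn: sqlite3.Connection) -> dict[int, str]:
    """Return {role_id: role_name}, hitting the database only when the
    in-memory cache is empty or stale."""
    now = time.time()
    if _CACHE["roles"] is None or now - _CACHE["loaded_at"] > CACHE_TTL_SECONDS:
        rows = conn.execute("SELECT id, name FROM roles").fetchall()
        _CACHE["roles"] = {role_id: name for role_id, name in rows}
        _CACHE["loaded_at"] = now
    return _CACHE["roles"]
```

The roles remain authoritative in the database, with referential integrity from users.role_id, while a typical page view never has to touch the roles table.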
I am now working on a project which requires showing the transaction history of a customer and whether the product the customer buys is under warranty or not. I need to use data from the current system, which can provide a Web API that returns a .csv file. So how can I make use of the current system's data?
A solution I'm thinking of is to download all the .csv files and write scripts to insert every record into a database I've built, which contains the necessary tables and relations to hold the data I retrieve. Then I'd have the new database I want. Because I've never done this before, I'd like to know whether it is feasible.
One more question: should I store the data locally or use a cloud database like Firebase?
High-end databases like SQL Server and Oracle come with utilities that allow you to read directly from a .csv file. Check the docs. Having done this many times, the best procedure I have found is to read the file into a single holding table first. This gives you the chance to examine the data, find any unexpected quirks or missing fields, and correct the data where possible.
Then write the scripts to move the data from the holding table into the proper tables you have designed. This must be done in a logical manner. For example, move the customer data before the buy transactions. Thus any error messages you get will not be because you tried to store a transaction before you stored the customer. (You will have referential integrity set up, yes?) This gives you more chances to correct or adjust the data or just identify problems more or less at your leisure.
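As a rough sketch of that holding-table workflow (my own example, not the answerer's code): the bulk-load utilities mentioned above, such as BULK INSERT on SQL Server, are one route; the alternative below uses Python with pandas and SQLAlchemy. The table and column names, file name and connection string are all invented for illustration.

```python
# Load a CSV export into a holding table, then move the rows into the
# real tables in dependency order (customers first, then purchases).
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server"
)

# 1. Dump the raw CSV into a holding/staging table, no cleanup yet.
raw = pd.read_csv("transactions_export.csv")
raw.to_sql("staging_transactions", engine, if_exists="replace", index=False)

# 2. Inspect and fix the data in staging, then move it in a logical order.
with engine.begin() as conn:
    conn.execute(text("""
        INSERT INTO customers (customer_id, name, email)
        SELECT DISTINCT customer_id, customer_name, customer_email
        FROM staging_transactions
        WHERE customer_id NOT IN (SELECT customer_id FROM customers)
    """))
    conn.execute(text("""
        INSERT INTO purchases (customer_id, product_id, purchased_at, warranty_months)
        SELECT customer_id, product_id, purchased_at, warranty_months
        FROM staging_transactions
    """))
```

Loading into staging first means a bad row fails in a step you can simply re-run, rather than half-populating the target tables.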
Whether or not to store the data in the cloud is strictly according to the preferences of your employer.
I'm intending to use both SQL Server and simple text files to save my data.
Information like user data is going to be stored in SQL Server. The RSS feed for each user is going to be stored in a folder with the user ID as its name, and inside this folder I put the files that store the data. Each file can hold only 20 lines; if there are more than 20, I create a new file.
When I need to read this data I simply open the latest file in the user's folder.
I'd like to know the advantages and disadvantages of using this method.
Thanks.
I would suggest you store the text file data in either a VARCHAR(8000) column or a BLOB column inside a table in the database.
The advantages of storing it in the database are:
All your data is stored in a single place. It is very easy to back up and restore elsewhere, if required.
The database comes with concurrency control by default; if you have, say, multiple users trying to access the same row or the same table, the database handles it inherently.
With a hybrid files-plus-database approach you are going for distributed storage, and you always have to make sure the two stay consistent.
If you just want to store the latest text file content, go for UPDATE. If you want to keep a history of earlier text file content, go for SCD Type 2 style storage or a historical table containing the previous text file data (a sketch of the history-table idea follows below).
The database is a single self-contained unit, so you can apply transparent data encryption, masking, access control and all the other security-related features in one place. In a hybrid approach you have to manage security in two places.
When all your data is in a single place and you have proper indexes, you can write SQL queries to cover many different reporting use cases. If the data is distributed, you have to work out how to handle each reporting use case yourself.
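As a small sketch of the history-table idea mentioned above (my illustration, not the answerer's code), using Python's built-in sqlite3 so it runs as-is; on SQL Server the content column would be the suggested VARCHAR(8000)/VARBINARY(MAX), and the table and column names here are invented.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("feeds.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS user_feed_history (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id     INTEGER NOT NULL,
        content     TEXT    NOT NULL,   -- the full text-file payload
        captured_at TEXT    NOT NULL    -- when this version was stored
    )
""")

def save_feed(user_id: int, content: str) -> None:
    """Append a new version instead of overwriting, so older feed
    content stays queryable (the history-table approach)."""
    conn.execute(
        "INSERT INTO user_feed_history (user_id, content, captured_at) VALUES (?, ?, ?)",
        (user_id, content, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

def latest_feed(user_id: int):
    row = conn.execute(
        "SELECT content FROM user_feed_history WHERE user_id = ? ORDER BY id DESC LIMIT 1",
        (user_id,),
    ).fetchone()
    return row[0] if row else None
```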
The question is not quite the right one to ask.
You should start by clarifying the requirements for the application. Ask yourself the following questions:
What types of queries need to be executed (selects, updates, reports)?
How many users will there be? How often will their requests come in? Must data be synchronized across users (concurrency)?
Is authentication, authorization or localization needed?
Is modification-history support needed?
Etc.
Databases usually have all these mechanisms built in, so you do not have to implement them in your application.
Depending on your application's needs, you can then decide what strategy to use for storing the data: a database, files, or both.
The project I have been given is to store and retrieve unstructured data from a third party. This could be HR information (Users, Pictures, CVs, voicemail, etc.) or factory-related data (work items, parts lists, time sheets, etc.). Basically almost any type of data.
Some of these items may be linked, so a User may have a Picture, for example. I don't need to examine the content of the data, as my storage solution will receive the data as XML and send it out as XML. It's down to the recipient to convert the XML back into a picture or sound file etc. The recipient may request all Users, so I need to be able to find User records and their related "child" items such as pictures, or the recipient may just want the pictures.
My database is MS SQL and I have to stick with that. My question is: are there any patterns or existing solutions for handling unstructured data in this way?
I've done a bit of Googling and found some sites that talk about this kind of problem, but they are more interested in drilling into the data to allow searches on its content. I don't need to know the content, just what type it is (Picture, User, Job Sheet, etc.).
To those who have given their comments:
The problem I face is the linking of objects together. A User object may be added to the data store, then at a later date the user's picture may be added. When the User is requested I will need to return both the User object and its associated Picture. The user may update their picture, so you can see I need to keep relationships between objects. That is what I was trying to get across in the second paragraph. The problem I have is that my solution must be very generic, as I should be able to store anything and link these objects according to the end users' requirements, e.g. Users, Pictures and emails, or Work items, Parts lists, etc. I see that Microsoft has developed Zentity, which looks like it may be useful, but I don't need to drill into the data contents so it's probably overkill for what I need.
I have been using Microsoft Zentity since version 1, and whilst it is excellent at storing huge amounts of structured data and allowing (relatively) simple access to it, if your data structure is likely to change then recreating the 'data model' (and the regression testing) would probably remove the benefits of using such a system.
Another point worth noting is that Zentity requires filestream storage so you would need to have the correct version of SQL Server installed (2008 I think) and filestream storage enabled.
Since you deal with XML, it's not really unstructured data. Microsoft SQL Server 2005 and later have an XML column type that you can use.
Now, if you don't need to access XML nodes and you think you never will, go with plain varbinary(max). For your information, storing XML content in an XML-typed column lets you not only retrieve XML nodes directly through database queries, but also validate the XML data against schemas, which may be useful to ensure that the content you store is valid.
Don't forget to use FILESTREAM (SQL Server 2008 or later) if your XML data grows in size (2 MB+). This is probably your case, since voicemail or pictures can easily be larger than 2 MB, especially when they are Base64-encoded inside an XML file.
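For illustration only (table, column and connection-string names are my assumptions, not from the answer), creating such an XML-typed table from Python via pyodbc and querying by item type without ever looking inside the XML might look like this:

```python
import pyodbc

# Placeholder connection details; the server, database and table are invented.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

cursor.execute("""
    IF OBJECT_ID('dbo.StoredItems') IS NULL
    CREATE TABLE dbo.StoredItems (
        ItemId   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
        ItemType NVARCHAR(50)     NOT NULL,  -- 'User', 'Picture', 'JobSheet', ...
        Content  XML              NOT NULL   -- the payload exactly as received
    )
""")
conn.commit()

# The recipient asks for a type, not for anything inside the XML:
cursor.execute(
    "SELECT ItemId, Content FROM dbo.StoredItems WHERE ItemType = ?", ("Picture",)
)
rows = cursor.fetchall()
```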
Since your data is quite freeform and changeable, your best bet is to put it on a plain old file system, not a relational database. By all means store some meta-information in SQL where it makes sense to search through structured data relationships, but if your main data content is not structured with data relationships then you're doing yourself a disservice by using an SQL database.
The filesystem is blindingly fast at looking up files and streaming them, especially if this is an intranet application. All you need to do is share a folder and apply sensible file permissions, and a large chunk of unnecessary development disappears. If you need to deliver this over the web, consider using WebDAV with IIS.
A reasonably clever file and directory naming convention, with a small piece of software you write to help people get to the right path, will, hands down, always beat any SQL database for both access speed and sequential data streaming. Filesystem paths and file names will always beat any clever SQL index for data-location speed. And plain old files are the ultimate unstructured, flexible data store.
Use SQL for what it's good for. Use files for what they are good for. Best tools for the job and all that...
You don't really need any pattern for this implementation. Store all your data in a BLOB entry. Read from it when required and then send it out again.
You would probably need to investigate other infrastructure aspects, like periodically cleaning up the database to remove expired entries.
Maybe I'm not understanding the problem clearly.
So am I right in saying that all you need to store is a blob of XML with whatever binary information it contains? Why can't you have a users table and then a linked (foreign key) table of user objects, linked by userId?
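Purely as an illustration of that comment (the schema and names are assumptions, not the commenter's design, and sqlite3 is used only so the snippet runs standalone), a generic object table plus a link table covers the "User plus its Pictures" case without knowing anything about the payload:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (
        object_id   TEXT PRIMARY KEY,   -- e.g. a UUID
        object_type TEXT NOT NULL,      -- 'User', 'Picture', 'WorkItem', ...
        payload     BLOB NOT NULL       -- the XML exactly as received
    );
    CREATE TABLE object_links (
        parent_id TEXT NOT NULL REFERENCES objects(object_id),
        child_id  TEXT NOT NULL REFERENCES objects(object_id),
        PRIMARY KEY (parent_id, child_id)
    );
""")

def children_of(parent_id: str):
    """Return (object_type, payload) for everything linked to a parent,
    e.g. all pictures belonging to a user."""
    return conn.execute("""
        SELECT o.object_type, o.payload
        FROM object_links AS l
        JOIN objects AS o ON o.object_id = l.child_id
        WHERE l.parent_id = ?
    """, (parent_id,)).fetchall()
```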
I have an internet application that supports an offline mode, where users might create data that will be synchronized with the server when the user comes back online. Because of this I'm using UUIDs for identity in my database, so disconnected clients can generate new objects without fear of using an ID already taken by another client, etc. However, while this works great for objects that are owned by one user, there are objects that are shared by multiple users. For example, tags used by a user might be global, and there's no possible way the remote database could hold all possible tags in the universe.
Say an offline user creates an object and adds some tags to it, and those tags don't exist in the user's local database, so the software generates a UUID for each of them. When those tags are synchronized, there needs to be a resolution process to resolve any overlap: some way to match up existing tags in the remote database with the local versions.
One way is to use a process by which global objects are resolved by a natural key (the name, in the case of a tag), and the local database has to replace its existing object with the one from the global database. This can be messy when there are many connections to other objects. Something tells me to avoid this.
Another way to handle this is to use two IDs. One global ID and one local ID. I was hoping using UUIDs would help avoid this, but I keep going back and forth between using a single UUID and using two split IDs. Using this option makes me wonder if I've let the problem get out of hand.
Another approach is to track all changes through the non-shared objects; in this example, the object to which the user assigned the tags. When the user synchronizes their offline changes, the server might replace the local tag with the global one. The next time the client synchronizes with the server, it detects a change in the non-shared object. When the client pulls down that object it receives the global tag, and the software simply re-saves the non-shared object pointing at the server's tag, orphaning the local version. Some issues with this are extra round trips to fully synchronize, and extra data in the local database that is just orphaned. Are there other issues or bugs that could happen while the system is between synchronization states (i.e. trying to talk to the server and sending it local UUIDs for objects, etc.)?
Another alternative is to avoid common objects. In my software that could be an acceptable answer. I'm not doing a lot of sharing of objects across users, but that doesn't mean I won't be in the future, which means choosing this option could paralyze my software later, should I need to add these types of features. There are consequences to this choice, and I'm not sure I've completely explored them.
So I'm looking for any sort of best practice, existing algorithms for handling this type of system, guidance on choices, etc.
Depending on what application semantics you want to offer users, you may pick different solutions. E.g., if you are actually talking about tagging objects created by an offline user with a keyword, and wanting to share the tags across multiple objects created by different users, then using the text itself as the tag's identity is fine, as you suggested. Once everyone's changes are merged, tags with the same text, like, say, "THIS IS AWESOME", will be shared.
There are other ways to handle disconnected updates to shared objects. SVN, CVS, and other version control systems try to resolve conflicts automatically, and when they cannot, they just tell the user there is a conflict. You can do the same: tell the users there have been concurrent updates and let them handle the resolution.
Alternatively, you can also log updates as units of change, and try to compose the changes together. For example, if your shared object is a canvas, and your application semantics allows shared drawing on the same canvas, then a disconnected update that draws a line from point A to point B, and another disconnected update drawing a line from point C to point D, can be composed. In this case, if you keep those two updates as just two operations, you can order the two updates and on re-connection, each user uploads all its disconnected operations and applies missing operations from other users. You probably want some kind of ordering rule, perhaps based on version number.
Another alternative: if updates to shared objects cannot be automatically reconciled, and your application semantics does not support notifying the user and asking them to resolve conflicts due to disconnected updates, then you can also use a version tree to handle this. Each update to a shared object creates a new version, with the past version as the parent. When there are disconnected updates to a shared object from two different users, two separate child versions/leaf nodes result from the same parent version. If your application's internal representation of state is this version tree, then your application's internal state remains consistent despite disconnected updates, and you can handle the two branches of the version tree in some other way (e.g. letting users know about the branches and creating tools for them to merge branches, as in source control systems).
Just a few options. Hope this helps.
Your problem is quite similar to that of versioning systems like SVN. You could take inspiration from them.
Each user would have a set of personal objects plus any shared objects that they need. Locally, they will work as if they own all the objects.
During sync, the client would first download any changes in the objects, and automatically synchronize what is obvious. In your example, if there is a new tag coming from the server with the same name, then it would update the UUID correspondingly on the local system.
This would also be a nice place in which to detect and handle cases like data committed from another client, but by the same user.
Once the client has an updated and merged version of the data, you can do an upload.
There will be two round trips, but I see no way of doing this without overcomplicating the data structure and introducing potential pitfalls in the way you do the sync.
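A rough sketch of the name-matching step described above (my own example, not the answerer's code): the tags and item_tags tables and their columns are assumed, and sqlite3 stands in for whatever the local store really is.

```python
import sqlite3

def sync_tags(local_db: sqlite3.Connection, server_tags_by_name: dict) -> None:
    """server_tags_by_name maps tag name -> server UUID, fetched from the
    server at the start of the sync. Assumes local tables
    tags(tag_id TEXT PRIMARY KEY, name TEXT) and item_tags(item_id, tag_id)."""
    for local_id, name in local_db.execute("SELECT tag_id, name FROM tags").fetchall():
        server_id = server_tags_by_name.get(name)
        if server_id is None or server_id == local_id:
            continue  # tag is new to the server, or already uses the canonical UUID
        # Adopt the server's UUID: make sure its row exists locally,
        # repoint references, then drop the provisional local row.
        local_db.execute("INSERT OR IGNORE INTO tags (tag_id, name) VALUES (?, ?)",
                         (server_id, name))
        local_db.execute("UPDATE item_tags SET tag_id = ? WHERE tag_id = ?",
                         (server_id, local_id))
        local_db.execute("DELETE FROM tags WHERE tag_id = ?", (local_id,))
    local_db.commit()
```

After this pass, every tag name the server already knows is referenced locally by the server's UUID, so the subsequent upload only contains genuinely new tags.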
As a totally out of left-field suggestion, I'm wondering if using something like CouchDB might work for your situation. Its replication features could handle a lot of your online/offline synchronisation problems for you, including mechanisms to allow the application to handle conflict resolution when it arises.
With really small sets of data, the policy where I work is generally to stick them into text files, but in my experience this can be a development headache. Data generally comes from the database and when it doesn't, the process involved in setting it/storing it is generally hidden in the code. With the database you can generally see all the data available to you and the ways with which it relates to other data.
Sometimes for really small sets of data I just store them in an internal data structure in the code (like a Perl hash), but then when a change is needed, it's in the hands of a developer.
So how do you handle small sets of infrequently changed data? Do you have set criteria for when to use a database table versus a text file, or...?
I'm tempted to just use a database table for absolutely everything but I'm not sure if there are any implications to this.
Edit: For context:
I've been asked to put a new contact form on the website for a handful of companies, with more to be added occasionally in the future. Except companies don't have contact email addresses; the users inside these companies do (as they post jobs through their own accounts). Now, though, we want a "speculative application" type of functionality, and the form needs an email address to send these applications to. But we also don't want to put an email address as a property in the form, or else spammers can just use it as an open email gateway. So clearly, we need an ID -> contact_email type relationship with companies.
So I can either add a column to a table with millions of rows that will be used, literally, about 20 times, or create a new table that at most is going to hold about 20 rows. Typically how we've handled this in the past is just to create a nasty text file and read from there. But this creates maintenance nightmares, and these text files are frequently overlooked when the data they depend on changes. Perhaps this is a fault with the process, but I'm just interested in hearing views on this.
Put it in the database. If it changes infrequently, cache it in your middle tier.
The example that springs to mind immediately is what is appropriate to have stored as an enumeration and what is appropriate to have stored in a "lookup" database table.
I tend to "draw the line" with the rule that if it will result in a column in the database containing a "magic number" that maps to an enumeration value, then the enumeration should really exist as a lookup table. If it's unrelated to the data stored in the database (e.g. application configuration data rather than user-generated data), then it's an enumeration all the way.
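To make that rule concrete (my example, not the answerer's), a configuration-only value can stay as an enumeration in code, while anything whose value ends up in a user-data column belongs in a lookup table:

```python
from enum import Enum

# Configuration-only value: it is never stored against user data,
# so an in-code enumeration is enough.
class LogLevel(Enum):
    DEBUG = "debug"
    INFO = "info"
    ERROR = "error"

# By contrast, if orders.status_id would hold a "magic number", the statuses
# belong in a lookup table that the column can reference:
#
#   CREATE TABLE order_status (status_id INT PRIMARY KEY, name VARCHAR(50));
#   ALTER TABLE orders ADD CONSTRAINT fk_orders_status
#       FOREIGN KEY (status_id) REFERENCES order_status (status_id);
```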
Surely it depends on the user of the software tool you've developed to consume the set of data, regardless of size?
It might just be that they know Excel, so your tool would have to parse a .csv file that they create.
If it's written for the developers, then who cares what you use. I'm not a fan of cluttering databases with minor or transient data however.
We have a standard config file format (key:value) and a class to handle it. We just use that on all projects. Mostly we're just setting persistent properties for our applications (mobile phone development) so that's an appropriate thing to do. YMMV
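Something along these lines, presumably; this is only a guessed-at sketch of such a key:value handler (the class name and format details are mine, not the poster's actual code):

```python
class KeyValueConfig:
    """Tiny reader for a plain key:value config file (blank lines and
    lines starting with '#' are ignored)."""

    def __init__(self, path: str):
        self._values = {}
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                key, _, value = line.partition(":")
                self._values[key.strip()] = value.strip()

    def get(self, key: str, default=None):
        return self._values.get(key, default)

# Usage:
#   cfg = KeyValueConfig("app.conf")
#   retries = int(cfg.get("max_retries", "3"))
```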
In cases where the program accesses a database, I'll store everything in there: easier for backup and moving data around.
For small programs without database access I store my data in the .NET settings, which are stored in an XML file. Of course this is a feature of C#, so it might not apply to you.
Anyway, I make sure to store all data in one place. Usually a database.
Have you considered SQLite? It's file-based, which addresses your feeling that "just a file might do" (zero configuration), but it's a perfectly good database and scales remarkably well. It supports a number of APIs and there are numerous front ends for administering it.
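For example (table, file and values invented for illustration), Python's standard library already ships an SQLite driver:

```python
import sqlite3

conn = sqlite3.connect("settings.db")  # a single file on disk, no server to configure
conn.execute(
    "CREATE TABLE IF NOT EXISTS contact_email (company_id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
)
conn.execute(
    "INSERT OR REPLACE INTO contact_email (company_id, email) VALUES (?, ?)",
    (42, "jobs@example.com"),
)
conn.commit()

row = conn.execute(
    "SELECT email FROM contact_email WHERE company_id = ?", (42,)
).fetchone()
print(row[0] if row else "no contact address")
```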
If this is small config-like data, I use some simple and common format: INI, JSON and YAML are usually fine. Java and .NET fans also like XML. In short, use something that you can easily read into an in-memory object and forget about.
I would add it to the database, in the main table, for the following reasons:
Backup and recovery (you do want to recover this text file, right?)
Ad hoc querying (since you can do it with a SQL tool and join it to the other database data)
If the database column is empty, the storage requirements for it should be minimal (nothing if it's a NULL column at the end of the table in Oracle)
It will be easier if you want to have multiple application servers as you will not need to keep multiple copies of some extra config file around
Putting it into a little child table only complicates the design without giving any real benefits
You may well already be going to that same row in the database as part of your processing anyway, so performance is not likely to be a problem. If you are not, you could cache it in memory.
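A minimal sketch of that "column on the main table, cached in the middle tier" combination (my illustration; the companies table, the column name, and the use of sqlite3 and lru_cache are all assumptions):

```python
from functools import lru_cache
import sqlite3

conn = sqlite3.connect("app.db")
# Stand-in for the existing, much larger companies table.
conn.execute("CREATE TABLE IF NOT EXISTS companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
try:
    # Nullable column: existing rows are untouched and the storage cost is negligible.
    conn.execute("ALTER TABLE companies ADD COLUMN speculative_contact_email TEXT")
except sqlite3.OperationalError:
    pass  # column already added by an earlier run

@lru_cache(maxsize=None)
def contact_email(company_id: int):
    """Read-through cache: the handful of rows that have an address are
    fetched once per process, not on every form submission."""
    row = conn.execute(
        "SELECT speculative_contact_email FROM companies WHERE id = ?",
        (company_id,),
    ).fetchone()
    return row[0] if row else None
```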