Experimenting with data to determine value - migration/methods?

I have a LOT of data available to me, and want to capture and experiment with data that isn't currently being used in production. I do not want to immediately add this to my existing data store, since that would undoubtedly mess with production. The obvious solution seems to be to make a copy of the production data and integrate it with what I want to play around with (applications accessing this data, etc.), but I was wondering if there is a better (less expensive?) way to do this.
Both isolation and integration are important. I'd like to be able to keep lightweight/experimental data assets apart from high-volume production data, but also be able to integrate them relatively painlessly if the experimental assets are deemed useful.
Thanks.

I would prefer to start with a small amount of data, create a duplicate of the production environment, and then begin working on the migration. If I am successful within a reasonable time, the same steps can be applied to the rest.
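One pattern that gives both the isolation and the integration path described above, assuming a PostgreSQL-style database (an assumption on my part, since the question doesn't name one): restore a copy of production, then keep all experimental tables in their own schema inside that copy. They never touch the copied production tables, but promoting them later is cheap. A rough sketch using psycopg2 (schema and table names are invented):

    # Rough sketch, not production code: isolate experimental tables in their own
    # schema on a *copy* of the production database (assumes PostgreSQL + psycopg2).
    import psycopg2

    conn = psycopg2.connect("dbname=prod_copy user=dev")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # Keep experimental assets apart from the copied production tables.
        cur.execute("CREATE SCHEMA IF NOT EXISTS experiment")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS experiment.derived_metrics (
                id        serial PRIMARY KEY,
                source_id integer,          -- would reference a copied production row
                metric    text,
                value     double precision
            )
        """)
        # If the experiment proves useful, promotion can be as simple as:
        # cur.execute("ALTER TABLE experiment.derived_metrics SET SCHEMA public")
    conn.close()

The schema boundary is what buys the isolation; integration later is mostly a SET SCHEMA (or a join across schemas) rather than a second migration project.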

Related

In need of an embeddable NoSQL database that handles ~1Gb datasets, persisted on disk

I am building an Electron app, for which I need to select an embeddable NoSQL database. In fact, this database is supposed to hold a local subset of data stored on an ArangoDB remote backend. I have been searching the Internet a lot, but have so far failed to converge on a final candidate. I hope that somebody can advise me from experience.
Typical datasets amount to perhaps tens of thousands of documents, and I can imagine cases where the set would grow to ~1 GB over time. Furthermore, I need secondary indexes.
I have looked at PouchDB, UnQlite, LokiJS, LevelDB, NeDB, LinvoDB...
In the end, NeDB and LinvoDB seem like reasonable candidates with persistence to disk (SQLite-like). NeDB cannot handle large datasets, something which LinvoDB, a fork of NeDB, seems able to handle. LinvoDB does not load the whole database into memory, but appears to index "everything" by default and keep those indexes in memory.
On the other hand, I have tried to follow several conversations regarding their indexes. NeDB's documentation appears to suggest that indexes are persisted to disk once built (https://github.com/louischatriot/nedb#indexing), which appears then again to be contradicted by LinvoDB (sorry, I lost many of the quotes/sources in the vast number of tabs I had open...), suggesting indexes have to be rebuilt from scratch on launch. (It may also be that I misunderstand NeDB's documentation altogether.)
Basically, what I need is a JS database solution for an Electron app, which may hold "considerable" but not "huge" amounts of data. The app's loading times should be reasonable (i.e., not discourage usage), while the app stays responsive (i.e., the database should support secondary indexes) and respects the user's resources as much as possible.
Questions:
Does anybody have experience with the above or other embedded NoSQL databases, and could any of these (or others) be recommended for my use case?
If indeed LinvoDB's indexes need to be rebuilt from scratch every time I launch the app, could that be a significant performance hit (loading times on the order of seconds)? (Surely I'd have to benchmark this...)
ArangoDB is not embeddable, but perhaps I should consider just deploying it as a service alongside my native app? This link (NoSQL database: ArangoDB) appears to suggest that the developers themselves do not discourage this. Would this be overkill and/or not user-friendly? A performance hit?
Any advice would really be appreciated.
I have the same need; linvodb3 seems like the best choice currently. It is under active development, and it specifically targets the Electron desktop environment.
Have you considered SQLite?
There is an npm package and it works with Electron; I have tried it myself.
You just have to rebuild it against Electron, which can cause some problems.
Here are your answers:
yes I have, but not much
no, I've never tried LinvoDB
no, I've never tried ArangoDB either

Configuration vs Database storage

I will keep this short. I am looking to store product plan data; these are the plans that users would pick for their payment options. This data includes how much the plan costs and what the unit details of the plan are, like what makes up a unit (day/week/month), and fairly simple data about the plan. These plans may or may not change once a month or once a year; the company is a start-up and things are always changing at the 11th hour, and constantly, so there is no real way to predict when they will change. A co-worker and I are discussing whether these values should be stored in the web.config (where they currently are) or moved to the database.
I have done some googling and I have not found any good resource that helps draw a clear line between when something should be in the database and when it should be in the web.config. I wanted to know what your thoughts on this are and see if someone could clearly define when data should be stored in config and when in the database.
Thanks for the help!
From the brief description you provide, it seems to me that the configuration data may eventually be accessed not just by your web server-based application running on one computer, but also by other supporting applications, such as end-of-month batch jobs, that you may want to run on other computers. To support that possibility, it would be a good idea to store the data in some sort of centralized repository that can be accessed remotely from multiple computers.
Storing the configuration data in a database is the obvious way to meet that requirement. But if you don't want to do that, then another approach would be to store the configuration data in a file on a company-internal (rather than public) web/ftp server. Then an application can use a utility such as curl to retrieve the configuration file from the web/ftp server.
Of those two approaches, I think using a database is probably best, because it provides an ergonomic way to not just read the configuration data, but also update it.
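To make that concrete, here is a minimal sketch of what the database-backed plan data might look like; table and column names are invented, and Python/SQLite is used only so the example is self-contained (the actual app is ASP.NET, so only the shape of the idea carries over). The application reads the plans at startup or on demand instead of from web.config, and an admin screen or plain SQL can update them without a redeploy:

    # Minimal sketch (invented names): plan configuration kept in a table
    # instead of web.config, so it can be read and updated centrally.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE plan (
            code        TEXT PRIMARY KEY,   -- e.g. 'basic', 'pro'
            price_cents INTEGER NOT NULL,
            unit        TEXT NOT NULL       -- 'day' | 'week' | 'month'
        )
    """)
    conn.executemany("INSERT INTO plan VALUES (?, ?, ?)",
                     [("basic", 999, "month"), ("pro", 4999, "month")])

    def load_plans(db):
        """Read all plans once; the app can cache this and re-read when plans change."""
        return {code: {"price_cents": price, "unit": unit}
                for code, price, unit in db.execute("SELECT code, price_cents, unit FROM plan")}

    print(load_plans(conn))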

Recommendation for maintaining a dev database

Right now, the devs all have their own local dev environments with a snapshot of the production database, which they can twist, churn, and beat up without affecting anyone but themselves.
These snapshots are starting to get large, and a data import of them is starting to take close to an hour.
Any better recommendations for maintaining dev data? The dev data can be ripped apart for potential changes, and then needs to be put back together if a change idea was bad, etc.
I try to use the following approach:
Developers maintain a baseline script which is in version control and sets up the database schema from scratch. It creates the schema just as it exists in the production database.
They also maintain a 'script' to set up test data. This 'script' actually uses production classes and sometimes a little DSL on top of that. In order to be reasonably fast, the script generates only minimal test data. I recommend making it part of the definition of done to create some test data for any new feature built.
Developers can run these scripts at will on their database (or database schema). The first script is also used as a basis for running automatic database tests.
The result of any work done by the developers is a migration script, i.e. a script that can be applied to the production database to bring it to the new desired state, including updates to data (a minimal runner for such scripts is sketched after this answer).
These migrations can be tested on snapshots of the production database. Snapshots of the production database are also used to run load and performance tests.
Only for the snapshots I use database specific tools. Mostly everything else is written in the main programming language (java for me) so the developers feel comfortable using it.
I often encounter resistance to this approach ("too many scripts", "too many databases", "I don't want to use version control, because my db modelling tool doesn't support it"). But apart from loads of manual work I don't really see an alternative.
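A minimal sketch of the kind of migration runner this workflow implies, written against SQLite so it is self-contained; the file layout (migrations/001_*.sql and so on) and the version table name are assumptions, not the tooling the answer actually uses. It applies numbered .sql files in order and records which ones have already run:

    # Minimal migration-runner sketch (assumed layout: migrations/001_*.sql, 002_*.sql, ...).
    import sqlite3
    from pathlib import Path

    def migrate(db_path="dev.db", migrations_dir="migrations"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS schema_version (filename TEXT PRIMARY KEY)")
        applied = {row[0] for row in conn.execute("SELECT filename FROM schema_version")}

        for script in sorted(Path(migrations_dir).glob("*.sql")):
            if script.name in applied:
                continue                                # already applied on this database
            conn.executescript(script.read_text())      # schema and data changes together
            conn.execute("INSERT INTO schema_version VALUES (?)", (script.name,))
            conn.commit()
            print("applied", script.name)

    if __name__ == "__main__":
        migrate()

The same runner can be pointed at a production snapshot to test the migrations, which is exactly how the snapshots are used above.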
In my experience, having a centralized DB and data set for each environment (Development, Testing+Integration, and Production) has been the best approach.
Development: let the developers do whatever they want with it. If production-like data is required, obfuscate/remove sensitive data (see the obfuscation sketch after this answer). The more lightweight this database is, the easier it is for you to move, maintain, and back up.
Testing: use it to simulate the production environment and let the testers input/retrieve all the data they want, but only through your application interfaces. This environment also allows you to test your deployments before sending them to production; you don't want a bad DB installer to leave the production app in an unusable state. If required, you can populate this environment with production data, but obfuscate/remove sensitive data here too. You could use high volumes to spot performance issues before they get to production.
Production: leave your production data/environment alone; you don't want sensitive data to end up in the wrong hands, or a DB misconfiguration to allow developers to change data accidentally.
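The obfuscate/remove step mentioned for the Development and Testing copies can be a small script run right after each restore. A hedged sketch (table and column names are invented; the SQL is deliberately plain so it could be ported to whatever engine is in use):

    # Sketch of a post-restore obfuscation pass (invented table/column names).
    # Run against the dev/test copy only, never against production.
    import sqlite3

    def obfuscate(conn):
        conn.executescript("""
            -- Replace personally identifying fields with deterministic placeholders.
            UPDATE customer
               SET email     = 'user' || id || '@example.test',
                   full_name = 'Customer ' || id,
                   phone     = NULL;
            -- Invalidate credentials so a leaked dev copy is useless.
            UPDATE app_user
               SET password_hash = 'RESET';
        """)
        conn.commit()

    if __name__ == "__main__":
        demo = sqlite3.connect(":memory:")
        demo.executescript("""
            CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT, full_name TEXT, phone TEXT);
            CREATE TABLE app_user (id INTEGER PRIMARY KEY, password_hash TEXT);
            INSERT INTO customer VALUES (1, 'real@person.com', 'Real Person', '555-0100');
            INSERT INTO app_user VALUES (1, 'realhash');
        """)
        obfuscate(demo)
        print(list(demo.execute("SELECT email, full_name FROM customer")))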
Usually, as a developer, you want a few things from the dev database set up.
You want it to be easy to work with - it should be straightforward to make changes, keep those changes versioned, and apply them to other environments.
You want to have representative data - and have that data be predictable. For instance, if you're building an invoicing system, you want clients with known credit limits so you can write test cases to track what happens to them as you issue an invoice, have it paid, etc. (integration tests, rather than unit tests; see the seed-data sketch after this answer).
You want to be able to query against representative data volumes so performance issues arise in dev as well as production.
You never, ever want to be able to affect "real" data - for instance, you want email addresses and names to be anonymized, and you want passwords to be reset.
Continuous Database Integration offers a solution to most of this - and also solves the "it takes an hour to set up a database for a development environment" issue.
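For the "representative and predictable" point, the invoicing example might translate into a deterministic seed script like this sketch (schema, names, and amounts are all invented). Because the values are fixed and the script is idempotent, integration tests can assert exact outcomes, such as an invoice pushing a client over its known credit limit:

    # Deterministic seed-data sketch for integration tests (invented schema).
    import sqlite3

    KNOWN_CLIENTS = [
        # (id, name,        credit_limit_cents)
        (1, "Acme Ltd",     100_000),    # small limit: easy to exceed in a test
        (2, "Globex Corp",  5_000_000),
    ]

    def seed(conn):
        conn.execute("CREATE TABLE IF NOT EXISTS client "
                     "(id INTEGER PRIMARY KEY, name TEXT, credit_limit_cents INTEGER)")
        conn.execute("DELETE FROM client")     # idempotent: safe to re-run any time
        conn.executemany("INSERT INTO client VALUES (?, ?, ?)", KNOWN_CLIENTS)
        conn.commit()

    if __name__ == "__main__":
        db = sqlite3.connect(":memory:")
        seed(db)
        print(list(db.execute("SELECT * FROM client")))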
I'm in the same situation. I had the idea to move archive data to a read-only filegroup so that I only need to back it up and restore it once. The non-archive data would be much smaller and could be copied more frequently to backup storage and to the dev machines.
Of course that only works if it is possible to split a big portion of the database size off to a read-only filegroup.
A different idea would be to restore once on a dev machine and use a database snapshot for quick restore to a clean state. I found that one particularly useful.

How to organize a database in Django with multiple, distinct apps?

I'm new to Django (and databases in general), and I'm not sure how to structure the following. The sources of data I'll have for my site are:
a blog
for a few different games:
a high score list
user-created levels
If I were storing the data in ordinary files, I'd just have one file for each of the above. In Django, ideally (I think) I'd have a separate database for each of these, but apparently multiple database support isn't there for Django yet. I'm worried (unnecessarily?) about keeping everything in one database for two reasons:
If I screw something up in one of the sections, I don't want to mess up the rest of the data.
When I'm working on one of these sections, I'd like the freedom to easily change the model around. Since I've learned that syncdb doesn't, in fact, sync the database, I've decided that the easiest thing to do when messing around with a model is to simply wipe the database and start over. Again, I'm worried about messing up the other sections. I looked at South, and it seems like more trouble than it's worth during the planning stages of an app (but I'll reconsider later when there's actually valuable data).
Part of the problem is that I'm not really comfortable keeping my data in a binary format. I'm used to text, so I can easily diff it, modify it in an editor, etc., without going through some magical database interface (I'm using PostgreSQL, by the way).
Are my fears unfounded? How do people normally handle this problem?
For what it's worth, I totally understand your frustration as I went through a really similar thought process when starting. Unfortunately, there isn't much you can do (easily, anyway) besides get familiar with the tools you'll be using.
First of all, you don't need multiple databases for what you're doing - one will do. Each app will create its own tables in the database which are somewhat isolated from one another. As czarchaic mentioned, you can do python manage.py reset app_name to reset just one of them in case you change your model. You will lose that data, though.
To get data in a relatively easy-to-work-with format, you can use the command python manage.py dumpdata > file_name.json, and then reload it later with python manage.py loaddata file_name.json. You can use this for backups, local test data, or as a poor man's migration (hint: South would be easier).
Yet another option is to use a NoSQL database for any data you think will need extra flexibility. Just keep in mind that Django doesn't have support for any of these at the moment. That means no admin support or ModelForms. Of course, having a model may become unnecessary.
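dumpdata also accepts an app label, so you can snapshot and reload a single app's data while leaving the rest of the database alone (using the blog app from the question as the example):

    python manage.py dumpdata blog --indent 2 > blog.json
    # ...change the blog models, wipe/recreate its tables...
    python manage.py loaddata blog.json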
In short, your fears are unfounded. You should "organize" your database by project, to use the Django term. Each model in each app will have its own table, but they will all be in the same database. Putting them in separate databases isn't a good idea for a whole host of reasons; the biggest is that you cannot query across databases.
While I agree that South is probably a bit heavy for your initial design/dev stages, it should be considered seriously for anything even resembling a beta version, and it is absolutely necessary in production.
If you're going to be messing with your models a bunch during development, the best thing to do is use fixtures to load data in quickly after running syncdb. Or, if you are going to be changing a bunch of required fields, then write some quick Python to create dummy data for you (a sketch follows this answer).
As for not trusting your data to a binary format, a simple "pg_dump" will get you a textual version of your data. It sounds to me like you're working on your app against live production data, which is a mistake. Your goal should be to get your application built, working, and tested against fake data or at least a copy of your production data, and then, when you're sure everything is golden, migrate it into production. This is where things like South come in handy, as you can automate this deployment and it will help you handle any database table/column changes you need to make.
I'm sure all of this sounds like a pain, but all of it can be automated, and trust me, it makes your life down the road much, much easier.
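The "quick Python to create dummy data" suggestion might look like the sketch below for a hypothetical Post model in the blog app (the model and its fields are assumptions, not from the question). Run it inside python manage.py shell, or wrap it in a management command, after the schema has been synced:

    # Dummy-data sketch for development (hypothetical Post model in a 'blog' app).
    # Run inside `python manage.py shell` so Django settings are loaded.
    from datetime import date, timedelta
    from blog.models import Post   # assumed model

    def make_dummy_posts(n=50):
        for i in range(n):
            Post.objects.create(
                title="Dummy post %d" % i,
                body="Lorem ipsum for layout and pagination testing.",
                published=date.today() - timedelta(days=i),
            )

    make_dummy_posts()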
I generally just reset the app:
python manage.py reset blog
This will reset all tables belonging to the blog app in INSTALLED_APPS.
I'm not sure if this answers your question, but it's much less destructive than wiping the whole DB.
Syncdb should really only be used for development. That's why it doesn't really matter if you wipe the tables and start again, perhaps exporting lookup data into a JSON file that you can import each time you sync.
When your site reaches production, however, you have a little more work. Any changes you make to your models that need to be reflected in the database need to be emitted as SQL and run manually on the database. There's a django-admin.py command to emit the suggested SQL, which you can use to build up a script to run on the database to migrate it. Like you mention, a migrations app like South can be beneficial here, but it's not strictly needed.
As far as your separation of sites goes, run them as separate sites/projects. You can have a separate settings file per project which allows you to run two different databases. This is in contrast to running the two sites as separate apps within the same project. If they're totally separate they probably shouldn't be in the same project unless you need to leverage common code.

Database Refresh

How often do you refresh your development database from the production database? Since there are many types of projects (targeting different domains), I would like to know how it is done and at what intervals (days/months/years).
Thanks.
While working at Callaway Golf we had an automated build that would completely refresh the database from a baseline. This baseline would be updated (from production) almost daily. We had a set of scripts (DTS) that would do this for us. So if there was some new and interesting information, we could easily do it a couple of times a day, once a week, etc. The key here is automation. If the task is easy to do, then when it is done really only depends on how performing it impacts the load on the production database, the network, and the amount of time it takes to complete. This could of course be set up as a scheduled task to run at off-peak hours and before the dev team gets in in the morning.
The key things in refreshing your development database are:
(1) Automate the refresh through a script (a sketch follows this list).
(2) Obfuscate the data in your development database, since you do not want your developers to see the real data; alternatively, you could work with a sample of your production database.
(3) Decide the frequency of the refresh -- I usually do it once a week.
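A sketch of what point (1) can look like when the production database happens to be PostgreSQL (my assumption; the thread also mentions DTS and MySQL), driving pg_dump/pg_restore from Python so it can run as a scheduled off-peak job:

    # Sketch of an automated dev refresh (assumes the PostgreSQL client tools are
    # on PATH and that the obfuscation/sampling step (2) runs afterwards).
    import subprocess

    DUMP_FILE = "/tmp/prod_nightly.dump"     # hypothetical path

    def refresh_dev():
        # Take a compressed, custom-format dump of production.
        subprocess.run(["pg_dump", "--format=custom", "--file", DUMP_FILE, "prod_db"],
                       check=True)
        # Drop and recreate objects in the dev database from that dump.
        subprocess.run(["pg_restore", "--clean", "--if-exists", "--dbname", "dev_db", DUMP_FILE],
                       check=True)
        # Obfuscation or sampling of sensitive data would follow here.

    if __name__ == "__main__":
        refresh_dev()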
Depends on what kind of work you're doing. If you're debugging issues that are closely related to the data, then frequent updates are good.
If you're doing data quality assurance (which often involves writing code to detect and repair bad data, code that you have to develop and test away from the production server), then you need extremely fresh data. The bad data that is the most valuable to fix is the data that was just inserted or updated today.
If you are writing client code, then infrequent updates are good. Often when I'm writing C# UI code, I could care less what the data is, I just care if it shows up in the right box on the screen.
If you have data with any security issues, you should stop using production data--i.e. never update from production--and get a good sample data generator. Writing a good sample data generator is hard, so 3rd-party products are the way to go. MS Data Dude comes to mind, and I recommend Redgate's SQL Data Generator.
And finally, how hard is it to get a copy of the production data? If it is cheap and automatable, just get a new copy every night. If it is expensive (requires the attention of a very busy DBA), well, resource constraints might answer the question for you regardless of these other concerns.
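If production data is off-limits, a generator does not have to be elaborate to be useful. The commercial tools mentioned above go much further (referential integrity, realistic distributions), but a small sketch with the Faker library (my suggestion, not the answerer's) shows the basic shape of never copying real customer data:

    # Tiny sample-data generator sketch using the Faker library (pip install Faker).
    import csv
    from faker import Faker

    fake = Faker()
    Faker.seed(42)   # deterministic output so test runs are reproducible

    with open("customers_sample.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email", "signup_date"])
        for _ in range(1000):
            writer.writerow([fake.name(), fake.email(),
                             fake.date_between(start_date="-2y", end_date="today")])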
We tend to refresh every couple of days, or perhaps once a week or so if things are "normal," though if we're investigating something amiss we may do so much more often.
Our production database is about 1GB, so it's not a trivial thing to copy around. Also, for us there's generally no burning need to get current data from production into the dev systems.
The "how" is simply a MySQL "backup" and "restore"
In a lot of cases, refreshing the dev database really isn't that important. Often production systems have far more data than is required for development, and working with such a large dataset can be a hassle for several reasons. An example is development on the interface, where it's more important to have some data than any specific data. In this case, it's more customary to thin out the production database to a smaller subset of real data. After you do this once, it's not really that important to refresh, as long as all schema changes are pushed through to the dev database(s).
On the other hand, performance bugs may often require production-sized databases to be able to reproduce and identify bottlenecks, so in this scenario it is extremely useful to have an almost-realtime database. Many issues may only present themselves with the exact data used in production.
We tend to always go back to an on-demand schedule. We have many different databases that are used in a suite of applications. We stay away from automatically refreshed DEV databases because many of our code changes involve database changes, and I don't want anything overwritten.
Not at all; the dev databases (one per developer) get set up by a script or similar a couple of times a day, possibly a couple hundred times when running DB tests locally. This includes a couple of records for playing around with.
Of course we still need a database with at least parts of production in it, for integration and scalability tests. We are aiming for a daily automated refresh, but we aren't there yet.
