We're building an app that will have a number of games. Kids will learn math as they play these games. All the user profile data, game data and lesson/question data is stored in the app and will sync to a MySQL database on the server side.
There is also a ton of event data that we would like to capture and analyze to improve our games. These events could be the start of a lesson, touching a game object, choosing the correct game object but targeting it wrongly, answering correctly but timing out, and so on. We expect this to be hundreds of rows for each game that a kid plays. The data stored will also depend on the type of event.
The database should allow us to analyze the data and answer questions like: which games are tough on kids, which lessons are too easy, whether kids from some countries find some of the lessons tough, how long each of these games is able to hold a kid's attention, and so on.
Which database would allow us to store so many different types of events, scale to millions of rows a day and allow for all these kinds of analysis? Given the changing nature of the data model, NoSQL seems to be an obvious choice. But which one would allow us to do all of these analyses? Or should we go with Hadoop / Hive?
Thanks in advance.
You can do this using Hadoop/Hive, but you won't get real-time performance, as Hive is best suited to batch processing. HBase would be a better choice in such a scenario. You could create an OLAP-style data cube whose dimensions are the info you specified, like session info, info about each kid, etc. Or you could serialize all of this information as JSON objects and then store them in HBase cells. You could also store each of these events in individual cells, but that would consume unnecessary space and would be less efficient when fetching the data back.
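A minimal sketch of the JSON option, in Python. It only shows how differently shaped events could be packed into one cell value and read back; it does not talk to HBase. The field names, the `payload` convention and the row key layout (one row per play session) are all assumptions for illustration, not something from the question.

```python
import json
import time


def make_event(kid_id, game_id, event_type, payload, ts=None):
    """Build one event record; `payload` holds the event-specific fields."""
    return {
        "kid_id": kid_id,
        "game_id": game_id,
        "type": event_type,
        "ts": ts if ts is not None else time.time(),
        "payload": payload,
    }


def row_key(kid_id, game_id, session_id):
    # Hypothetical row key layout: one row per play session.
    return f"{kid_id}:{game_id}:{session_id}"


def encode_session(events):
    """Serialize all events of one session into a single cell value (bytes)."""
    return json.dumps(events, separators=(",", ":"), sort_keys=True).encode("utf-8")


def decode_session(cell_value):
    """Inverse of encode_session: bytes from the cell back to a list of dicts."""
    return json.loads(cell_value.decode("utf-8"))
```

Because each event carries its own `type` and `payload`, new event types need no schema change; the trade-off is that any analysis has to decode the cell before it can filter on event fields.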
HTH
Related
I'm with a company that is building a venue / artist database for live music and recently came across Freebase. It looks very compelling, even if the data isn't there for new, up-and-coming bands. For those of you who have worked with Freebase, I have a couple questions:
Are there downsides to integrating all of the data entry with Freebase? We are not looking to sell or privatize this information.
What are the weaknesses of Freebase with regard to usability?
Disclosure: I work on Freebase at Google.
The music data in Freebase is one of our strongest areas and is going to continue to get broader and richer as we continue to load more datasets. For example, we import data from MusicBrainz, clean it up and match the topics against existing topics in Freebase to avoid duplicates.
In terms of downsides, you should be prepared to work with a lot of data. For example, Freebase currently has 4 musical artists named "John Smith", which may or may not be useful for your application, but you'll still need to figure out which one(s) map to the John Smith that your users are interested in. We call this "reconciliation", and it's necessary so that your app knows precisely which topics to query the API for.
Since you mentioned music venues I should also point out that while Freebase has a lot of data about places, we don't yet have a geosearch API so you'd need to roll your own if that's something you need.
Since anyone can edit Freebase, you should also consider using as_of_time to protect your site against vandalism.
Freebase is great for developers because you can easily jump in and clean up bad data or add missing topics. However, one area that has always been a challenge is loading large amounts of data from outside of Google. We've built OpenRefine, which allows folks to upload datasets, but these datasets must pass a QA process that takes some time to complete. It's necessary to have these QA processes to maintain the level of quality in Freebase, but it does slow down the process of loading large datasets.
I really hope that you choose to make use of Freebase music data to build your company. I know that there are already a number of music startups happily using our data.
I'm undertaking a project for learning purposes. Since the topic is compelling to me, I want to build good foundations and maybe eventually put it live.
Since my project is quite complex, I'm going to explain my question using a fictional project: an agenda application.
This web application will have a calendar where the user can add events and reminders. It will be used by, let's say, 10,000 users, and those 10,000 users will add thousands of events and reminders.
My question is: which of these two database structures would you recommend?
Should I create a separate database with reminders and events tables for each user (on user creation) and relate the databases to a user in a separate database,
or should I make one table for events, one for reminders and one for users and relate them to one another in a single database?
I haven't built any multi-user web applications so far, and I am not familiar with database design approaches for many users. If there are any design patterns you can think of, I would appreciate you sharing them :)
Here's my opinion:
No, you should not create a separate database for each user. It can't scale. It means that every time you add a user, you have to create a new database? Never.
One database, multiple users - that's what relational databases are born for.
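To make that concrete, here is a rough sketch of the single-database layout, using Python's built-in sqlite3 so it runs anywhere. The table and column names are my own assumptions for the agenda example, and a production system would more likely sit on MySQL or PostgreSQL; the point is only that every row carries a user_id instead of living in a per-user database.

```python
import sqlite3

# One shared schema: users, events and reminders are related by user_id.
SCHEMA = """
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE events (
    id        INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL REFERENCES users(id),
    title     TEXT NOT NULL,
    starts_at TEXT NOT NULL
);
CREATE TABLE reminders (
    id        INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL REFERENCES users(id),
    event_id  INTEGER REFERENCES events(id),
    remind_at TEXT NOT NULL
);
CREATE INDEX idx_events_user ON events(user_id, starts_at);
CREATE INDEX idx_reminders_user ON reminders(user_id, remind_at);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript(SCHEMA)
    return conn


def add_user(conn, email):
    return conn.execute("INSERT INTO users (email) VALUES (?)", (email,)).lastrowid


def add_event(conn, user_id, title, starts_at):
    cur = conn.execute(
        "INSERT INTO events (user_id, title, starts_at) VALUES (?, ?, ?)",
        (user_id, title, starts_at),
    )
    return cur.lastrowid


def events_for(conn, user_id):
    """Return one user's events in chronological order."""
    rows = conn.execute(
        "SELECT title, starts_at FROM events WHERE user_id = ? ORDER BY starts_at",
        (user_id,),
    )
    return rows.fetchall()
```

Adding a user is just an INSERT, and the index on (user_id, starts_at) keeps per-user lookups cheap even when the events table holds millions of rows.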
10,000 users is not that large an audience. Even if each one creates a thousand events and a thousand reminders, that's 10M events and 10M reminders, which is not considered a large relational database.
You may need to worry about partitioning and purging old records. What kind of policy will you have in place for keeping those events and reminders? What access will users have after a year? Five years? Ten years? Those would be good topics to think about, too.
Get a good book about entity/relationship modeling and read it carefully. Anything modern on Amazon will do.
I used to work with a database where each user's data was held in a separate database (your option 1), and believe me, it was a nightmare to work with. The company spent an enormous amount of resources consolidating all these databases into one single database, and it was not an easy task.
As @duffymo stated: one database, multiple users. That's what relational databases are for.
So, I've just decided to build my own fantasy sports web site.
You know the type of site where you can pick players from your favourite league and depending on how they do you get a certain amount of points in your team. There are fantasy teams for all types of leagues and sports, I'm sure you know what I'm talking about.
I haven't settled for a specific sport or league just yet because I want the basics to fit to different types of team-based sports.
I have a few expectations for it myself. If you can come up with any others, I'll be glad to hear them.
I expect the site to be dynamic and have many visits during a game, but almost only static content otherwise.
Player points should be updated in real-time during a game.
I would need a list that shows each game being played and the points of every player in that game. It should also show minutes played, goals, assists etc.
Each registered user would be able to see the points and players of his/her team updated in real time.
I need the site to scale so that if I start with 1000 teams I could end up with 5 million.
I probably won't be needing language support right now, but who knows in the future.
Based on these prerequisites, what would be best to use in terms of language (PHP, .NET, Drupal or other CMSs), database (MySQL, SQL Server, XML) and other techniques?
Maybe it doesn't really matter what I use?
I guess the dynamic and real time update of each player's points is where I need help the most.
Thanks in advance!
/Niklas
EDITED
I could use an array with the following data for a specific game week:
Player ID
Minutes played
Sport-specific points (goals, assists, penalties, yellow cards, man of the match bonus, etc.)
Total points in current game week
When the game is over I'd add these to a DB and sum this data with any previous game weeks, plus player value, number of teams that have selected this player, etc.
You are probably going to have to go down the custom route for your "Game" code rather than using a CMS, although depending on your experience, you may be able to leverage a framework (e.g. CodeIgniter) to speed up some of your dev time.
This type of site would be pretty language agnostic, however it would depend on the actual numbers of users you are looking at as to the most scalable solution / set of techniques to deploy.
One of the biggest considerations you are going to have to look at would be the design of the data model, and the platform that this sits on.
If you want to process near-real-time updates, you are going to want to focus your efforts on making the DB queries and processing as efficient as possible.
One big consideration that you have not discussed here is caching. There is some data on your site that I am sure will be static for long periods of time (such as weekly totals etc), and there is data that will be very much real time (but only during match days).
However, on match days you will have a lot more traffic than on non-match days, and you will therefore have a lot of requests for the same data in a short period of time. Employing a good caching strategy will save you masses of CPU power. What I am thinking of is to calculate a player's score and then cache it for 1 minute at a time, so that each time that specific player is requested, you are retrieving from a cache rather than recalculating.
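As a rough illustration of the one-minute cache, here is a small in-process sketch in Python. The `TTLCache` class and the scoring rule in `player_score` are made up for the example; on a real site you would more likely put this in a shared cache such as memcached or Redis, but the logic is the same.

```python
import time


class TTLCache:
    """Tiny time-based cache: a value is reused until its TTL expires."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]           # still fresh: skip the recalculation
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


def player_score(stats):
    # Hypothetical scoring rule, only here to give the cache something to store.
    return 5 * stats["goals"] + 3 * stats["assists"] - stats["yellow_cards"]
```

With this in place, a thousand requests for the same player inside one minute trigger one calculation; the cost is that a displayed score can lag the real one by up to the TTL.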
I've never designed a database before, but I've had experience programming in a few languages and assembler throughout college, as well as some web design, so I should be able to pick up what I need to know if I'm pointed in the right direction.
One of the tasks of my job is to sort through data that we've been collecting in the field using a "sonde", which measures temperature, pH, conductivity, and other parameters. The device sits in a stream 24/7, except when we swap it with our other sonde every couple of weeks so that a newly calibrated one goes into the stream and we can retrieve the data from the one that was in the field. It collects data every 15 minutes or so, and has done so since 2007. Currently, all of our data is spread across multiple Excel spreadsheets, and we have additional data from a weather station and another instrument that all gets compiled into quarterly documents.
My goal is to design as simple a database as possible with most of the functionality of a database like this: http://hudson.dl.stevens-tech.edu/hrecos/d/index.shtml. Ours would be significantly simpler, as it is not live data; it would instead load data from files that we upload once we've finished formatting and compiling it. I would very much like the graphing ability that the above site has, but at a minimum I need to be able to select a time range, select as many variables as I want within that range, and then download a spreadsheet (or at least a CSV file) with the resulting data.
I realize this is a tough task, and as I have not designed a database before, I suspect it is very much an uphill one. However, if I am able to learn the things necessary to do this and make it web-accessible, that would be a huge accomplishment and would very much impress my boss. Any advice or tips to point me in the right direction would be very much appreciated.
Thanks for your help!
There are actually 2 parts to the solution you're looking for:
The database, which will store your data in a single organized place, and
The application, which is the interface used by people to interact with the database.
Basically, a database by itself is just a container. You need some kind of application which accepts criteria from a user, pulls the data meeting those criteria from the database, and displays it to the user in a meaningful fashion: in this case, a graph or a spreadsheet.
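To give a feel for how small that application layer can start out, here is a sketch in Python using the built-in sqlite3 and csv modules. The `readings` table, its column names and the three variables are assumptions based on the parameters mentioned in the question; the function takes a time range and a list of variables and returns CSV text.

```python
import csv
import io
import sqlite3

# Only these column names may be interpolated into the SQL text.
ALLOWED = {"temperature", "ph", "conductivity"}


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        " taken_at TEXT PRIMARY KEY,"   # ISO 8601 text sorts chronologically
        " temperature REAL, ph REAL, conductivity REAL)"
    )
    return conn


def export_csv(conn, start, end, variables):
    """Return CSV text for the chosen variables between start and end (inclusive)."""
    cols = [v for v in variables if v in ALLOWED]
    if not cols:
        raise ValueError("no known variables requested")
    sql = (
        f"SELECT taken_at, {', '.join(cols)} FROM readings "
        "WHERE taken_at BETWEEN ? AND ? ORDER BY taken_at"
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["taken_at", *cols])
    writer.writerows(conn.execute(sql, (start, end)))
    return out.getvalue()
```

The time range goes in as bound parameters, while column names are checked against a fixed whitelist, since SQL placeholders cannot stand in for column names.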
Normally for web-based apps the database and application are two separate components. However, for a small app with a fairly small number of users, and especially for someone just starting out, you may want to consider an all-in-one solution like InfoDome, sort of like MSAccess for the web.
Either way, you're still going to need to learn about database design. There are many good tutorials out there; just do some searching. DatabaseAnswers.org has been useful for me. They have a set of tutorials as well as a large collection of sample database schemas.
I'm in the planning stages of building a SQL Server DataMart for mail/email/SMS contact info and history. Each piece of data is located in a different external system. Because of this, email addresses do not have account numbers and SMS phone numbers do not have email addresses, etc. In other words, there isn't a shared primary key. Some data overlaps, but there isn't much I can do except keep the most complete version when duplicates arise.
Is there a best practice for building a DataMart with this data? Would it be an acceptable practice to create a key table with a column for each external key? Then, a unique primary ID can be assigned to tie this to other DataMart tables.
Looking for ideas/suggestions on approaches I may not have yet thought of.
Thanks.
The email address or phone number itself sounds like a suitable business key. Typically a "staging" database is used to load the data from multiple sources and then assign surrogate keys and do other transformations.
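A rough sketch of that staging step in Python, to show the idea rather than a real ETL tool. The `KeyMap` class, the source type labels and the normalization rules are all assumptions; in SQL Server this would typically be a key table with an identity column.

```python
from itertools import count


class KeyMap:
    """Assigns one surrogate ID per distinct (source_type, business_key)."""

    def __init__(self):
        self._ids = {}
        self._next = count(1)

    @staticmethod
    def normalize(source_type, value):
        # Assumed cleanup rules so that duplicates from different systems match.
        value = value.strip()
        if source_type == "email":
            return value.lower()
        if source_type == "sms":
            return "".join(ch for ch in value if ch.isdigit())
        return value

    def surrogate_id(self, source_type, value):
        key = (source_type, self.normalize(source_type, value))
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]
```

Every fact row loaded afterwards stores the surrogate ID rather than the raw address or number, so the other DataMart tables can join on one stable key.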
Are you familiar with data warehouse methods and design patterns? If you don't have previous knowledge or experience then consider hiring some help. BI / data warehouse projects have a very high failure rate and mistakes can be expensive.
Found more information here:
http://en.wikipedia.org/wiki/Extract,_transform,_load#Dealing_with_keys
Well, with no other information to tie the disparate pieces together, your datamart is going to be pretty rudimentary. You'll be able to get the types of data (sms, email, mail), metrics for each type over time ("this week/month/quarter/year we averaged 42.5 sms texts per day, and 8000 emails per month! w00t!"). With just phone numbers and email addresses, your "other datamarts" will likely have to be phone company names, or internet domains. I guess you could link from that into some sort of geographical information (internet provider locations?), or maybe financial information for the companies. Kind of a blur if you don't already know which direction you want to head.
To be honest, this sounds like someone high-up is having a knee-jerk reaction to the "datamart" buzzword coupled with hearing something about how important communication metrics are, so they sent orders on down the chain to "get us some datamarts to run stats on all our e-mails!"
You need to figure out what it is that you or your employer is expecting to get out of this project, and then figure out if the data you're currently collecting gives you a trail to follow to that information. Right now it sounds like you're doing it backwards ("I have this data, what's it good for?"). It's entirely possible that you don't currently have the data you need, which means you'll need to buy it (who knows if you could) or start collecting it, in which case you won't have nice looking graphs and trend-lines for upper-management to look at for some time... falling right in line with the warning dportas gave you in his second paragraph ;)