I am about to build a service that logs clicks and transactions from an e-commerce website. I expect to log millions of clicks every month.
I will use this to run reports to evaluate marketing efforts and site usage (similar to Google Analytics*). I need to be able to make queries, such as best selling product, most clicked category, average margin, etc.
*As some actions occur later or offline, GA doesn't fulfill all our needs.
The reporting system will not have a heavy load and it will only be used internally.
My plan is to place loggable actions in a queue and have a separate system store these to a database.
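To make the plan concrete, here is a minimal sketch; queue.Queue and sqlite3 are just stand-ins for the real queue service and whichever database gets picked, and all names are illustrative:

    # Minimal sketch of the "queue + separate writer" plan.
    # queue.Queue stands in for the production queue service and
    # sqlite3 stands in for the final database choice.
    import queue
    import sqlite3
    from datetime import datetime, timezone

    events = queue.Queue()

    def log_click(product_id, category, user_id):
        """Called from the web tier: just enqueue, never touch the database here."""
        events.put({
            "ts": datetime.now(timezone.utc).isoformat(),
            "product_id": product_id,
            "category": category,
            "user_id": user_id,
        })

    def store_events(db_path="clicks.db"):
        """Separate consumer process: drain the queue into the database."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS clicks "
                    "(ts TEXT, product_id TEXT, category TEXT, user_id TEXT)")
        while not events.empty():
            e = events.get()
            con.execute("INSERT INTO clicks VALUES (?, ?, ?, ?)",
                        (e["ts"], e["product_id"], e["category"], e["user_id"]))
        con.commit()
        con.close()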
My question is what database I should use for this. Due to corporate IT policy I only have these options: SimpleDB (AWS), DynamoDB (AWS), or MS SQL/MySQL.
Thanks in advance!
Best regards,
Fredrik
Have you checked this excellent Amazon documentation page? http://aws.amazon.com/running_databases/ It helps to pick the best database from their products.
From my experience, I would advise that you do not use DynamoDB for this purpose. There is no real SELECT equivalent and you will have a hard time modeling your data. It is feasible, but not trivial.
On the other hand, SimpleDB provides a select operation that would considerably simplify the model. Nonetheless, it is advised against for volumes > 10 GB: http://aws.amazon.com/running_databases/
For the last resort option, RDS, I think you can do pretty much everything with it.
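To illustrate why the relational option maps well onto the reports you mention (best-selling product, most-clicked category), here is a rough sketch; the table and column names are my assumptions, not from your post:

    # Rough sketch of the reporting queries against a hypothetical schema
    # (table/column names are assumptions, not from the question).
    import sqlite3

    con = sqlite3.connect("clicks.db")

    best_selling = con.execute("""
        SELECT product_id, SUM(quantity) AS units_sold
        FROM transactions
        GROUP BY product_id
        ORDER BY units_sold DESC
        LIMIT 10
    """).fetchall()

    most_clicked_category = con.execute("""
        SELECT category, COUNT(*) AS clicks
        FROM clicks
        GROUP BY category
        ORDER BY clicks DESC
        LIMIT 1
    """).fetchone()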
I am going to create a web site (or you may call it a web application), and I intend to use the GAE data store. This website would be used by many people to search for companies and to create profiles (accounts). I am not sure how much it is going to cost, as there would be many requests to fetch company profiles and create new profiles. So I need some advice: is my idea going to cost a lot? Does the GAE data store fit this kind of website/application?
Thanks in advance for your reply.
First of all, to estimate the cost you need to answer some questions. What number of read and write operations do you expect? How many entities will you be storing? What is the approximate size of each entity? With those values you can estimate the cost based on the current pricing.
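As a back-of-the-envelope illustration of that estimate (the operation counts and per-unit rates below are placeholders, not real Datastore prices; take the real numbers from the pricing page):

    # Back-of-the-envelope Datastore cost sketch.
    # All numbers below are placeholders, NOT real GAE prices.
    reads_per_month = 10_000_000
    writes_per_month = 2_000_000
    stored_gb = 5

    price_per_100k_reads = 0.06   # placeholder
    price_per_100k_writes = 0.18  # placeholder
    price_per_gb_month = 0.18     # placeholder

    monthly_cost = (reads_per_month / 100_000) * price_per_100k_reads \
                 + (writes_per_month / 100_000) * price_per_100k_writes \
                 + stored_gb * price_per_gb_month
    print(f"Estimated monthly cost: ${monthly_cost:.2f}")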
You didn't write what your requirements are, but I think one of them would be scalability. Take a look at this excerpt from the docs.
The App Engine Datastore is a schemaless object datastore providing robust, scalable storage for your web application, with no planned downtime, atomic transactions, high availability of reads and writes, strong consistency for reads and ancestor queries, and eventual consistency for all other queries.
If this doesn't fit your needs you can also use Google Cloud SQL.
I'm running a classifieds website that has ads and comments on it. Traffic has grown to a considerable amount, and the number of ads in the system has reached over 1.5 million, of which nearly 250K are active ads.
Now the problem is that the system has been designed to be very dynamic in terms of the categories of ads and the properties each kind of ad can have based on its category or sub-category, so to display an ad I have to join nearly 4 to 5 tables.
To solve this issue I have created a flat table (conceptually what I call a publishing table) and populate that table with an SQL Job every 3 to 4 minutes. Now for web requests I query that table to show ad listings or details.
I have also implemented a data cache of around 1 minute for each unique URL combination, both for ad listings and for each ad detail.
I do the same thing for comments on ads (i.e. I cache the comments, and since the comments are hierarchical, I use the flat-table publishing model for them as well, again populated with an SQL job).
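For concreteness, a stripped-down sketch of what the publishing job does; the schema below is invented for illustration and the real one joins more tables:

    # Simplified sketch of the "publishing table" refresh job: join the
    # normalized tables once and rewrite a flat table that web requests read.
    # Table and column names are made up for illustration.
    import sqlite3

    def refresh_published_ads(db_path="classifieds.db"):
        con = sqlite3.connect(db_path)
        # both statements run in one transaction; commit() makes the swap visible
        con.execute("DELETE FROM published_ads")
        con.execute("""
            INSERT INTO published_ads (ad_id, title, category_name, price, property_summary)
            SELECT a.id, a.title, c.name, a.price,
                   GROUP_CONCAT(p.name || '=' || ap.value, '; ')
            FROM ads a
            JOIN categories c          ON c.id = a.category_id
            LEFT JOIN ad_properties ap ON ap.ad_id = a.id
            LEFT JOIN properties p     ON p.id = ap.property_id
            WHERE a.is_active = 1
            GROUP BY a.id
        """)
        con.commit()
        con.close()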
My questions are as follows:
Is the publishing model with a background SQL job a good design approach?
What approach would you take or people take for scenarios like this?
How does a website like Facebook show comments in real time with millions of users, while making sure they do not lose any comment data by only keeping it in the cache and doing batch updates?
Starting at the end:
3. How does a website like Facebook show comments in real time with millions of users, while making sure they do not lose any comment data by only keeping it in the cache and doing batch updates?
Three things:
Smarter programming than yours. They can put a large team on solving this problem for months.
Ignorance. They really don't care too much about a cache being a little outdated. No one will really notice.
Hardware ;) More, and more powerful, servers than yours.
That said, your approach sounds sensible.
We are considering moving a planning and budgeting app to the Salesforce platform. The existing app is built on a dimensional data model, and has extensive ad-hoc query capability implemented through star joins.
We see how the platform will allow us to put together the data entry screens quickly, but the underlying data model and query languages do not seem suitable for our reporting requirements.
Is it possible to have fast and flexible reporting with this platform? If not, how cumbersome is it to extract the data on a regular basis to bring it into an analytical application?
Hmm - I guess I'll answer my own question? The relative silence on this (even with a bounty - who wants to have anything to do with something that is ignored on Stack Overflow?) is a kind of answer.
So - no, this platform is not well suited for applications that have any kind of ROLAP requirements. I guess shame on me for asking a silly question, but I welcome any responses...
Doing native, fast, OLAP-like queries: possible, but somewhat cumbersome, since SFDC is basically a traditional-style RDBMS with limited joining capability in its native reporting. You can do OLAP-like things with custom code, but it can get unwieldy if you are used to established high-end OLAP solutions.
Extracting data from SFDC to use in other applications: really easy and supported across a number of technologies; the most common approach is extracting CSV files or using the data web service. There are tools like the SFDC data loader which also let you extract/load data via a command line or UI. That's probably what I would recommend to a client who has pre-existing expertise in a given analysis tool.
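If the analysis tool just needs periodic extracts, a sketch along these lines is typical; using the simple_salesforce Python library here is my own assumption, and the data loader or raw web-service calls work the same way:

    # Sketch: pull Opportunity rows out of SFDC via the REST API and dump them
    # to CSV for an external analytics tool. Uses the simple_salesforce library
    # (an assumption -- the data loader or raw web-service calls work similarly).
    import csv
    from simple_salesforce import Salesforce

    sf = Salesforce(username="user@example.com", password="...", security_token="...")

    result = sf.query_all("SELECT Id, Name, Amount, CloseDate FROM Opportunity")

    with open("opportunities.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Id", "Name", "Amount", "CloseDate"])
        writer.writeheader()
        for record in result["records"]:
            writer.writerow({k: record[k] for k in writer.fieldnames})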
I would not attempt to build an OLAP data model in salesforce. The limitations in both the joins and roll-up of data from child to parent make it difficult to implement a star schema with aggregations.
There are some products such as IQ 20/20 that can integrate with salesforce and provide near real time business intelligence functionality.
Analytical snapshots can also help as they provide a way to build aggregate tables. The snapshots pull data from a report and can be scheduled to run periodically. The different salesforce editions give different features regarding the scheduling so it is best to check the limits for your edition before going too far into the design.
I have a web design question regarding performance. On a web site there is a lot of personalized information, for example the friends of a user on Facebook. By personalized I mean different users have different friend lists.
Suppose the friend list is stored in a database like Oracle or MySQL. Each time the user clicks Home on his/her Facebook page or logs in, we need to read the database again. Each time the user adds/removes a friend, the database needs some update operations.
My question is: I think the performance capacity (e.g. concurrency of read/write transactions) of a database is limited, and if Facebook were using a database to store friend lists, it would be hard to achieve good performance. But if they are not using a database (e.g. MySQL or Oracle), how did Facebook implement such personalization?
This is a pretty good article about the technology behind Facebook.
As Justin said, it looks like a combination of Memcached and Cassandra.
Facebook and other large sites typically use a caching layer to store that kind of data so that you don't have to make a round trip to the database each time you need to fetch it.
One of the most popular is Memcached (which, last I remember reading, is used by Facebook).
You could also check out how some sites are using NoSQL databases as their caching layer. I actually just read an article yesterday about how StackOverflow utilizes Redis to handle their caching.
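As a concrete illustration of that cache-aside pattern, here is a rough sketch with redis-py; the key names and TTL are arbitrary choices:

    # Cache-aside sketch for a friend list: try the cache first, fall back to
    # the database, and write the result back with a short TTL.
    # load_friends_from_db / write_to_db are stand-ins for the real database calls.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def get_friends(user_id, load_friends_from_db, ttl_seconds=60):
        key = f"friends:{user_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        friends = load_friends_from_db(user_id)      # hit the database only on a miss
        r.setex(key, ttl_seconds, json.dumps(friends))
        return friends

    def add_friend(user_id, friend_id, write_to_db):
        write_to_db(user_id, friend_id)              # the database stays the source of truth
        r.delete(f"friends:{user_id}")               # invalidate so the next read repopulates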
From what I can gather they use a MySQL cluster and memcached and lots of custom written software. They open source plenty of it: http://developers.facebook.com/opensource/
The solution is to use a super-fast NoSQL-style database. Start with Simon Willison's excellent tutorial on Redis, and it will all begin to become clear :)
I have an application that requires analytics at different levels of aggregation, i.e. an OLAP workload. I also want to update my database pretty frequently.
e.g., here is what my updates look like (the schema is: time, dest, source ip, browser -> visits):
(15:00-1-2-2010, www.stackoverflow.com, 128.19.1.1, safari) --> 105
(15:00-1-2-2010, www.stackoverflow.com, 128.19.2.1, firefox) --> 110
...
(15:00-1-5-2010, www.cnn.com, 128.19.5.1, firefox) --> 110
And then I want to ask, for example, what the total visits to www.stackoverflow.com from a Firefox browser were last month.
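To make the kind of roll-up concrete, here is a rough sketch in plain Python over records shaped like the example above; a real system would of course push this down into the OLAP engine:

    # Rough sketch of the roll-up over records shaped like the example above:
    # (time, dest, source_ip, browser) -> visits.
    from collections import defaultdict

    records = [
        ("15:00-1-2-2010", "www.stackoverflow.com", "128.19.1.1", "safari", 105),
        ("15:00-1-2-2010", "www.stackoverflow.com", "128.19.2.1", "firefox", 110),
        ("15:00-1-5-2010", "www.cnn.com", "128.19.5.1", "firefox", 110),
    ]

    def month_of(ts):
        # timestamps look like "HH:MM-M-D-YYYY"; keep just month and year
        _, month, _, year = ts.split("-")
        return (year, month)

    totals = defaultdict(int)
    for ts, dest, src_ip, browser, visits in records:
        totals[(month_of(ts), dest, browser)] += visits

    # total visits to www.stackoverflow.com from firefox in January 2010
    print(totals[(("2010", "1"), "www.stackoverflow.com", "firefox")])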
I understand the Vertica system can do this in a relatively cheap way (performance- and scalability-wise, but probably not cost-wise). I have two questions here.
1) Is there an open-source product that I can build upon to solve this problem? In particular, how well does a Mondrian system work (scalability and performance)?
2) Is there an HBase- or Hypertable-based solution for this (obviously, a naked HBase/Hypertable can't do it)? If there is a project based on HBase/Hypertable, scalability probably won't be an issue IMO.
Thanks!
You can download a free edition (the single-node edition) of the Greenplum database. I haven't tried it myself, but I think/guess it is a powerful beast. Read here: http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/
Another option is MongoDB; it is fast and free, and you can write MapReduce functions in JavaScript to do analytics.
My reputation here is too low to add a hyperlink to MongoDB, so you will have to google it. I can add only one hyperlink per post.
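For example, the JavaScript map/reduce for per-site visit totals could be driven from Python roughly like this; note this assumes the classic Collection.map_reduce call, which recent PyMongo releases have removed in favour of the aggregation pipeline:

    # Sketch: JavaScript map/reduce over visit documents, driven from Python.
    # Assumes the classic Collection.map_reduce API (removed in recent PyMongo
    # releases, which favour the aggregation pipeline instead).
    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["analytics"]

    mapper = Code("""
        function () { emit({ dest: this.dest, browser: this.browser }, this.visits); }
    """)
    reducer = Code("""
        function (key, values) { return Array.sum(values); }
    """)

    result = db.visits.map_reduce(mapper, reducer, "visit_totals")
    for doc in result.find():
        print(doc["_id"], doc["value"])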
The zohmg project aims to solve this problem using Hadoop and HBase.
Facebook also built Hive on top of Hadoop. Pretty simple to get going - reasonable query API too.
http://mirror.facebook.net/facebook/hive/
Is your data model more complex than that? If it isn't, you might be better off just writing custom code for it. Then you can really tune it to your data. Real products have to offer a lot of flexibility, need a lot of complexity to achieve that, and suffer in speed as a result.
Your question is not clear in one aspect: when you talk about scalability, what do you mean by that? Are you collecting data from lots of sites but only have a limited number of query users, or do you also have a lot of users? Those situations lead to significantly different models.