I am currently working on a B2B platform where I have to implement a feature that lets each customer download their log entries from up to 2 years back.
There can be up to 1 million log entries per day per customer. That is quite a lot of data, but it is retrieved on average only 5-6 times a month per customer. In other words, a lot of data is stored, but relatively little of it ever needs to be read.
We host on AWS, and our main database is currently Postgres, which can of course handle this, but I wonder whether there are more suitable candidates.
I also had CloudWatch in mind, but I don't know whether it should be used operationally for this purpose.
Thanks for the help!
I'm currently receiving 2000 prices per second from a stock exchange and need to save them in an appropriate database. My current choice is PostgreSQL, which is way too slow. I need to save those prices (ticks) in an aggregated form such as OHLC. So if I want to save D1 data, for instance, I first need to get the previous D1 record for the stock from the database, check whether the high or low price has changed, set a new close price, and then save it to the database again. This takes forever and is not feasible with Postgres. I would rather not store the OHLC data at all; I prefer to query (aggregate) it in real time.
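Roughly, the per-tick read-modify-write I'm doing today looks like the minimal sketch below (the table and column names are made up for illustration, not my real schema); every tick costs a select plus an insert/update and a commit, which is why it falls over at thousands of writes per second:

# Hypothetical illustration of the per-tick read-modify-write described above.
# The table ohlc_d1 and its columns are assumptions, not a real schema.
import psycopg2

conn = psycopg2.connect("dbname=ticks user=postgres")

def upsert_d1(symbol, day, price):
    with conn.cursor() as cur:
        # 1) read the existing D1 bar for this symbol/day
        cur.execute(
            "SELECT open, high, low, close FROM ohlc_d1 WHERE symbol = %s AND day = %s",
            (symbol, day),
        )
        row = cur.fetchone()
        if row is None:
            # first tick of the day opens the bar
            cur.execute(
                "INSERT INTO ohlc_d1 (symbol, day, open, high, low, close) "
                "VALUES (%s, %s, %s, %s, %s, %s)",
                (symbol, day, price, price, price, price),
            )
        else:
            _o, h, l, _c = row
            # widen high/low if needed and move the close
            cur.execute(
                "UPDATE ohlc_d1 SET high = %s, low = %s, close = %s "
                "WHERE symbol = %s AND day = %s",
                (max(h, price), min(l, price), price, symbol, day),
            )
    conn.commit()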
So my requirements are:
persistence
fast writes (currently 2k per second, up to 10k)
queries, e.g. aggregating OHLC data in real-time (50-100 per second)
adaptable to any modern programming language without writing raw queries (an SDK for Python or JS for that database)
deployable on AWS or GCP without hassle
I was thinking about Apache Cassandra. I'm not familiar with Cassandra, so are powerful queries like the OHLC one possible? Are there any alternatives to Cassandra?
Thanks in advance!
Given what I've understood from your question, I believe Cassandra should easily fit your use-case.
Regarding your requirements:
persistence: Cassandra will not only persist your data but also cover redundancy with minimal configuration;
fast writes: this is what Cassandra is most optimized for, and while the exact throughput depends on a lot of factors, in general Cassandra will manage writes measured in the thousands/sec/core. Also, the eventual number of writes is not really relevant, as Cassandra scales linearly with no real penalty, so 5k, 10k, 100k or more are all doable;
adaptability: Cassandra has official drivers for the most common languages (Python, C family, Node.js, Java, Ruby, PHP, Scala) as well as community-developed ones for more languages (list of drivers);
deployable: it's very easy to deploy in the cloud. You can choose to deploy it manually on independent instances, or use a managed Cassandra cluster (AWS has one called 'Amazon Keyspaces', DataStax (the company driving most of the development behind Cassandra) has one called 'Astra', and there are more options). Given that Cassandra is one of the major players when it comes to big-data storage, finding a place for your DB in the cloud should be easy.
I have only mentioned 4 of the 5 requirements. That is because when talking about reading, things get more complex and a larger discussion is needed.
50-100 reads/s against 2k+ writes/s is in line with the general idea of Cassandra being optimized for write-intensive workloads. In Cassandra, the way you model your tables dictates how well things work. For a task like the one you describe, my first thoughts are:
You bucket each stock per day => you get a partition with around 30k rows (1 update/s over 8 trading hours) and a size of under 0.2 MB (30k * 4 B). This is well within the recommended values and clearly under the worst-case ones;
when you need the aggregated data you have 2 options:
2a. You read the partition as-is and aggregate it application-side (what I would recommend; see the sketch after this list);
2b. You implement a "User-Defined Aggregate" function in the database that does the work (docs). This should be doable, although I won't guarantee it. Apart from being harder to implement, the problem is that putting this kind of extra workload on the DB might not be what you want given your apparent use case. Let me explain: I'd expect your read load to be heaviest at certain times (before, during and after trading hours) and lighter at others. Depending on your architecture, you could have multiple application instances up during peak times and then scale them back during off-peak hours to lower costs. While applications can easily be scaled up and down on cloud providers like AWS and GCP, Cassandra cannot be scaled up and down like this (5 nodes in the morning, 3 at night, and so on); well, it could, but it's not designed for that and it would be a terrible decision. So moving as much of the non-constant workload as possible to the application seems the best idea;
(Optional) have a worker that, at the end of the trading day, aggregates the values for each stock and saves them to another table, so that looking at historic data becomes easier. This data could even be bucketed by week, month or even year, depending on how much space the aggregated data takes.
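To make points 1 and 2a concrete, here is a minimal sketch using the official Python driver; the contact point, keyspace, table and column names are placeholders I made up for illustration, not a recommendation for a production schema:

# Sketch of options 1 + 2a: one partition per (symbol, day), ticks clustered
# by time, OHLC computed application-side. All names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS market
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS market.ticks_by_day (
        symbol text,
        day date,
        ts timestamp,
        price decimal,
        PRIMARY KEY ((symbol, day), ts)
    )
""")

def write_tick(symbol, day, ts, price):
    # Writes stay cheap: a single append-style insert per tick.
    session.execute(
        "INSERT INTO market.ticks_by_day (symbol, day, ts, price) "
        "VALUES (%s, %s, %s, %s)",
        (symbol, day, ts, price),
    )

def ohlc(symbol, day):
    # Read the whole partition (ordered by ts) and aggregate in the application.
    rows = list(session.execute(
        "SELECT price FROM market.ticks_by_day WHERE symbol = %s AND day = %s",
        (symbol, day),
    ))
    if not rows:
        return None
    prices = [r.price for r in rows]
    return {"open": prices[0], "high": max(prices), "low": min(prices), "close": prices[-1]}

A read pulls one partition of roughly 30k rows and aggregates it in the application, which is exactly option 2a above.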
You could also add Spark and Kafka in front of Cassandra for a more powerful approach to real-time aggregation, but we shouldn't deviate that much from the question at hand.
Cassandra is very powerful with the right modeling and the right architecture. At first glance, what you need seems to be a good fit for Cassandra; however, as powerful as it can be, it can be just as bad if you use it in ways it wasn't designed for. I hope this answer puts you on a path toward making the right decision.
Cheers.
I'm running a classifieds website that has ads and comments on it. Traffic has grown considerably, and the number of ads in the system has reached over 1.5 million, of which nearly 250K are active.
The problem is that the system has been designed to be very dynamic in terms of ad categories and the properties each kind of ad can have based on its category or sub-category, so to display an ad I have to join 4 to 5 tables.
To solve this issue I have created a flat table (conceptually what I call a publishing table) and populate that table with an SQL Job every 3 to 4 minutes. Now for web requests I query that table to show ad listings or details.
I have also implemented a data cache of around 1 minute for each unique URL combination for ad listings and for each ad detail.
I do the same thing for comments on ads (i.e. cache the comments; and as the comments are hierarchical, I have used a flat-table publishing model for them too, again populated by a SQL Job).
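To illustrate the publishing idea conceptually (the table names and columns below are made up for this sketch, not my real schema), the job essentially rebuilds one denormalized table from the normalized ones inside a transaction:

# Toy sketch of the "publishing table" pattern using an in-memory SQLite DB.
# All tables and columns are invented purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ads (id INTEGER PRIMARY KEY, title TEXT, category_id INTEGER, active INTEGER);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ad_properties (ad_id INTEGER, name TEXT, value TEXT);
    CREATE TABLE published_ads (ad_id INTEGER PRIMARY KEY, title TEXT, category TEXT, props TEXT);
""")

def publish():
    # Rebuild the flat table in one transaction so readers never see a half-built copy.
    with conn:
        conn.execute("DELETE FROM published_ads")
        conn.execute("""
            INSERT INTO published_ads (ad_id, title, category, props)
            SELECT a.id, a.title, c.name,
                   COALESCE(GROUP_CONCAT(p.name || '=' || p.value, ';'), '')
            FROM ads a
            JOIN categories c ON c.id = a.category_id
            LEFT JOIN ad_properties p ON p.ad_id = a.id
            WHERE a.active = 1
            GROUP BY a.id, a.title, c.name
        """)

publish()  # in production this is the scheduled SQL Job that runs every 3-4 minutes

Web requests then only ever read the flat published table (plus the short-lived cache in front of it).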
My questions are as follows:
Is the publishing model with a background SQL job a good design approach?
What approach would you (or others) take for scenarios like this?
How does a website like Facebook show comments in real time to millions of users while making sure that no comment data is lost, given that the data is only kept in a cache and written in batch updates?
Starting at the end:
3. How does a website like Facebook show comments in real time to millions of users while making sure that no comment data is lost, given that the data is only kept in a cache and written in batch updates?
Three things:
Smarter programming than yours. They can put a large team on solving this problem for months.
Ignorance. They really don't care too much about a cache being a little outdated. No one will really notice.
Hardware ;) More, and more powerful, servers than yours.
That said, your approach sounds sensible.
I have evaluated several development tools for converting my Informix SQL-based app: Genero/4Js, FileMaker, Oracle APEX, VFP, Clarion and Access 2007. I have a CRUD pawnshop app (see the video demo at www.frankcomputer.com). The app centers on customers who pawn, sell or buy merchandise. I need one CRUD multi-table form which displays a customer master record on the top half and all of that customer's associated items (pawned, sold or purchased) on the bottom half. Can CRUD be accomplished within one form in Access? The main reason I'm leaning towards Access is its integration with Excel, Word and other modules; also, many people have it and are experienced with it, and it's Microsoft. Can anyone who has developed apps with Access tell me whether I can mimic my INFORMIX-SQL based app, and what limitations Access has? Also, can a touch-screen POS front-end like Microsoft Dynamics be used, or are there other POS application generators/rapid development systems available for rewriting my current app?
I suspect that if you ask a FoxPro developer, they will tell you that's the best tool to choose.
And I'm sure that if you ask a FileMaker developer, they will tell you to choose their tool.
So for the most part, if you ask an Access developer, that developer will also answer yes.
I would be hard pressed to imagine that any of the tools you mention lacks the capability of displaying information from more than one table on a screen. That's pretty much a requirement for any development system today. So in a nutshell, you're really asking the wrong question here.
I don't think the question is whether they can display information from more than one table; they all can. Perhaps a fairer question would be how much work it takes, and how well each product slices and dices those multiple tables together.
In Access you place text boxes and controls on a form, and to display related data you can place a control called a sub-form control. This approach lets you model the classic and typical master-to-child record relationship without having to write one line of code.
And of course you're not limited to one-to-many: you can actually insert two sub-forms side by side, have a one-to-many, and in turn have the second sub-form control display many more records from that second table.
Here's a screen shot of what I mean:
In the above you have one main record at the top with information about the donation date and event. On the left you have a list of people and their donation amounts (one-to-many).
Then on the right side, for each person, you have the donation amount split out into multiple accounts (and the green box turns red when the amounts don't balance).
So the above models that classic accounting scenario that just about every accounting package, from QuickBooks all the way up to high-end packages, has handled from day one when splitting out funds to multiple accounts.
The above form has very little code in it, and most of the relationships, setup, filtering and display of the child records is handled automatically by Access.
So at the end of the day, I'm pretty much of the view that all of the products you mention are capable of modeling and building these types of screens, and they will all result in a screen and user experience relatively similar to what you have now.
Now of course I'm biased towards Access, and I believe that I can build screens like the above faster, and with less hassle, code and effort, than with most of the other products you mentioned.
However, at the end of the day, which platform and tools you find appropriate is certainly not going to be decided by the ONE QUESTION and ONE CONCEPT of needing to display multiple pieces of information from multiple tables on a form. As mentioned, that is a given for any modern development system, including web-based ones.
Other considerations are what types of reports and outputs you need. Do you need columnar reports, or do you need to send an invoice-style report to a printer loaded with preprinted invoice forms? I think these are bigger questions than your current one.
The real question here is not whether any modern development system can display multiple pieces of data on a form; they all can. The REAL factors and issues are what platforms, hardware requirements and systems you need the software to run on.
So the issues are: will some of the locations have multiple users? Will some of the locations need secure backups or some type of encryption? How do you plan to ship bug fixes and updates to the next great version of the software?
Other issues are how many developers you will have working on this, what kind of distribution method you will use for the software, and what kind of support infrastructure you will have for supporting customers and installing the software. The list goes on and on, and all of these issues dwarf the question of displaying multiple pieces of information on one form.
In addition to all of the above, you need to consider your own training and skill set in software development. To really master any development system, you need to invest a considerable amount of your own time. While I think Access is a very good RAD (rapid application development) tool, I will actually say that Access has a considerably larger learning curve than, say, VB6.
Choosing a platform is very much like a marriage: you have to invest enormous amounts of your time (months, even years) to really learn and become proficient at developing software with such a system.
If you're jumping into a new set of tools, then the following list of skill stages needs to be taken into consideration:
Stage 1 Innocent (never heard of the product)
Stage 2 Aware (Has read an article about X)
Stage 3 Apprentice (has attended a three-day seminar)
Stage 4 Practitioner (ready to use X on a real project)
Stage 5 Journeyman (uses X naturally and automatically in his job)
Stage 6 Master (has internalized X, knows when to break the rules)
Stage 7 Expert (writes books, gives lectures, looks for ways to extend X)
One should NEVER attempt a project with a team consisting of Stage 3 or lower people. (Page-Jones, Meilir. "The Seven Stages of Expertise in Software Engineering", American Programmer, July-Aug 1990.)
So you just can't jump into a new tool and expect to be proficient at developing complex applications. I have an article about converting a legacy application to MS Access.
There are some great lessons in this article:
Notes on Conversion of a Pick (Multi-Value database) Application to a Relational database system.
http://www.members.shaw.ca/AlbertKallal/Articles/fog0000000003.html
Good luck in whichever platform you choose.
Does anybody know how the data in Google Analytics is organized? They perform complex selections over large amounts of data very fast; what database structure makes that possible?
AFAIK Google Analytics is derived from Urchin. As has been said, it is possible that, now that Analytics is part of the Google family, it uses MapReduce/BigTable. I assume Google has integrated the old Urchin DB format with the newer BigTable/MapReduce.
I found these links, which talk about the Urchin DB. Some of it is probably still in use at the moment.
http://www.advanced-web-metrics.com/blog/2007/10/16/what-is-urchin/
this says:
[snip] ...still use a proprietary database to store reporting data, which makes ad-hoc queries a bit more limited, since you have to use Urchin-developed tools rather than the more flexible SQL tools.
http://www.urchinexperts.com/software/faq/#ques45
What type of database does Urchin use?
Urchin uses a proprietary flat-file database for report data storage. The high-performance database architecture handles very high-traffic sites efficiently. Some of the benefits of the database architecture include:
* Small database footprint approximately 5-10% of raw logfile size
* Small number of database files required per profile (9 per month of historical reporting)
* Support for parallel processing of load-balanced webserver logs for increased performance
* Databases are standard files that are easy to back up and restore using native operating system utilities
More info about Urchin
http://www.google.com/support/urchin45/bin/answer.py?answer=28737
A long time ago I used a tracker, and on their site they discussed data normalization: http://www.2enetworx.com/dev/articles/statisticus5.asp
There you can find a bit of info about how to reduce the data in the DB, and it may be a good starting point for research.
BigTable
Google publication: Chang, Fay, et al. "Bigtable: A Distributed Storage System for Structured Data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008):
Bigtable is used by more than sixty Google products and projects,
including Google Analytics, Google Finance, Orkut, Personalized
Search, Writely, and Google Earth.
I'd assume they use their 'Big Table'
I can't know exactly how they implement it.
But because I've made a product that extracts non-sampled, non-aggregated data from Google Analytics, I have learned a thing or two about the structure.
It makes sense that the data is populated via BigTable.
BT offers data locality awareness and map/reduce querying across n nodes.
Distinct counts
(Whether a data service can provide distinct counts or not is a simple measure of flexibility of a data model - but it's typically also a measure of cost and performance)
Google Analytics is not built to do distinct counts. GA can count users across almost any dimension, but it can't count, e.g., sessions per ga:pagePath.
How so...
Well, they only register a session on the first pageview of that session.
This means that we can only count how many landing pages have had a session.
We have no count for all the other 99% of pages on your site. :/
The reason for this is that Google chose NOT to compute distinct counts at all. It simply doesn't scale well economically when serving millions of sites for free.
They needed an approach that avoids counting distinct values. Distinct counting is all about sorting and grouping lists of IDs for every cell in the data intersection.
But...
Isn't it simple to count the distinct number of sessions for a ga:pagePath value?
I'll answer this in a bit
The User and data partitioning
The choice they made was to partition data on users (clientIds or userIds).
When they know that clientId/userId X is only present in one particular table in BT, they can run a map/reduce function that counts users without worrying that the same user is present in another dataset, and without being forced to store all clientIds/userIds in a list, group them, and then count them distinctly.
Since the current GA tracking script is called Universal Analytics, they have to be able to count users correctly, especially when focusing on cross-device tracking.
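A toy sketch of that idea (the data and the partitioning scheme are invented purely for illustration): because a user id hashes to exactly one partition, counting users is a local count per partition plus a sum, with no cross-partition deduplication.

# Illustration only: user-keyed partitioning makes "distinct users" cheap,
# since the same user id can never land in two partitions.
hits = [("user_a", "/home"), ("user_b", "/home"), ("user_a", "/pricing")]

NUM_PARTITIONS = 4
partitions = [set() for _ in range(NUM_PARTITIONS)]
for user_id, _page in hits:
    partitions[hash(user_id) % NUM_PARTITIONS].add(user_id)  # map step

total_users = sum(len(p) for p in partitions)  # reduce step: just sum local counts
print(total_users)  # 2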
OK, but how does this affect session count?
You have a set of users, each having multiple sessions, and each session having a list of page hits.
When counting within a specific session and looking at pagePaths, you will find the same page multiple times, but you must not count the page more than once.
You need to write down that you've already seen this page.
When you have traversed all pages within that session, you count the session only once per page. This procedure requires state/memory, and since the counting is probably done in parallel on the same server, you can't be sure that a specific session is handled by a single process, which makes the counting even more memory-consuming.
Google decided not to chase that rabbit any further and simply accepts that the session count is wrong for pagePath and other hit-scoped dimensions.
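To make the problem concrete, here is a toy sketch (with invented data) of what counting distinct sessions per page requires; the seen set is exactly the per-cell state that doesn't scale cheaply:

# Illustration only: counting sessions per pagePath is a distinct count, so you
# must remember every (page, session) pair you have already seen.
hits = [
    ("session_1", "/pricing"),
    ("session_1", "/pricing"),   # same session hits the same page again
    ("session_1", "/docs"),
    ("session_2", "/pricing"),
]

seen = set()            # state that grows with distinct (page, session) pairs
sessions_per_page = {}
for session_id, page in hits:
    if (page, session_id) not in seen:
        seen.add((page, session_id))
        sessions_per_page[page] = sessions_per_page.get(page, 0) + 1

print(sessions_per_page)  # {'/pricing': 2, '/docs': 1}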
"Cube" storage
The reason I write "cube" is that I don't know exactly whether they use a traditional OLAP cube structure, but I do know they have up to 100 cubes populated for answering different dimension/metric combinations.
By isolating/grouping dimensions into smaller cubes, the data won't explode the way it would if they put all dimensions into a single cube.
The drawback is that not all dimension combinations are allowed, which we know is the case.
E.g. ga:transactionId and ga:eventCategory can't be queried together.
By choosing this structure, the dataset can scale well both economically and performance-wise.
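A back-of-the-envelope sketch of that trade-off (the dimension names and cardinalities below are invented): grouping only the dimensions that need to be queried together keeps the number of potential cells manageable, at the cost of some combinations being unavailable.

# Illustration only: potential cells grow with the product of the cardinalities
# of every dimension placed in the same cube.
cardinalities = {"pagePath": 10_000, "country": 200, "deviceCategory": 3,
                 "source": 500, "eventCategory": 100}

one_big_cube = 1
for c in cardinalities.values():
    one_big_cube *= c              # all dimensions together: ~3e11 potential cells

small_cubes = (10_000 * 3) + (200 * 500) + (100 * 3)   # a few 2-dimension cubes
print(one_big_cube, small_cubes)   # the single cube is about six orders of magnitude larger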
Many places and applications in the Google portfolio use the MapReduce algorithm for storage and processing of large quantities of data.
See the Google Research Publications on MapReduce for further information and also have a look at page 4 and page 5 of this Baseline article.
Google Analytics runs on 'Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing'.
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf
"Mesa is a highly scalable analytic data warehousing systemthat stores critical measurement data related to Google’sInternet advertising business."
I've followed the CouchDB project with interest over the last couple of years, and see it is now an Apache Incubator project. Prior to that, the CouchDB web site was full of 'do not use for production code' type disclaimers, so I'd done no more than keep an eye on it. I'd be interested to know your experiences if you've been using CouchDB either for a live project or for a technology pilot.
After 18 months of prototypes, testing and waiting for CouchDB to get ready, we moved an internal application over to CouchDB in December 2008. So far I'm very happy with that move. It gets rid of a lot of filesystem objects for us (PDFs and JPEGs are now stored as attachments in CouchDB). This lets us get rid of NFS and makes it easier to cluster/replicate our frontend webservers.
To what degree CouchDB is ready for you depends very much on the culture of your organization. We have an in-house development team maintaining several internal Erlang applications. Since CouchDB is written in Erlang and the codebase is of quite decent quality, we felt confident that we could fix showstopper issues in CouchDB should the need arise, or at least get our data back out. We also hired one of the CouchDB core team as a consultant, just in case.
But CouchDB certainly isn't 1.0 yet. There are crashes in the web worker processes all the time (if you misuse them). Replication breaks for us, and we don't get error messages about it. Documentation is still lacking. Still, I'm confident that it will not eat our data, and development moves forward at a reasonable pace.
To give you an idea about our application: currently our biggest database is about 512,000 records, taking 7.5 GB of disk space.
I use CouchDB to power a Facebook application (over 35k monthly active users). For a while it used MySQL, but after porting the entire project from Perl to Erlang, I decided to go for gold and migrate all of the data into CouchDB and use that instead.
CouchDB has been a great data store to work with. I think that it is on track to becoming a major player in web-based services.
I got to know one of the people working on it (Jan) a while ago (about 6 months) and have been playing with it ever since. I found the community around CouchDB to be both very knowledgeable and helpful, so whenever I ran into an issue it was resolved within minutes, or hours at most.
We just kicked off a project the other week which basically requires us to store data in a non-relational way, and because of CouchDB's document-oriented store we selected it as one of the technologies to use. So this is actually the first time I will run it in production, but I'm still pretty confident about it. :)
Just an update here (2009-10-25):
Our first CouchDB install is 20 GB and hosts 40 million records. It's been running in production since January 2009, and it's been great. Read (GET) speed is outstanding; we use it as a store for complex data, and then it's just pull.
Our second CouchDB installation has two databases: one is 160,000,000+ documents (210 GB), growing by 150,000-300,000 documents a day; the other is only 35,000,000 documents (7 GB). This setup has a lot more reads and writes, and initial tests are performing very well.
View building on the 160,000,000 document database took roughly a week, but since then we upgraded to a larger Amazon EC2 instance and we are also getting ready to update to CouchDB 0.10.x (from 0.9.1) as this release includes a lot of performance improvements in view building.
I am using CouchDB in a few scenarios: as a document store for http://devk.it (under development) and, at a much larger scale, as a template store for a distributed email delivery system.
CouchDB is very slick for what it does, but I was not able to get it to run at as high a concurrency level as I would have expected. Also note that the maximum document size is fairly limiting at 1 MB, due to the hardcoded max input buffer size in mochiweb. You can, however, alter a header file and recompile to get around this limit.
I'm using CouchDB to store (and serve) article ratings on my blog. It's not exactly heavy traffic but it's been rock solid so far.
Also planning on adding comments sometime which I'll most likely also store in CouchDB.
I've found it quite easy to get started with: on OS X you can just download CouchDBX to get going quickly. I use a Sinatra backend with RestClient to interact with 'the couch' through straight HTTP verbs and such.
Great fun.
At the moment I'm working with CouchDB for a computer science thesis. I'm writing about my progress and opinions on my blog, http://metalelf0dev.blogspot.com. I think the project is well done, but the existing documentation isn't organized as it should be. A quick tutorial about the Futon web interface would be really useful for starters, IMHO :)
I have used CouchDB twice in production. The first was a wiki-like project, and I think CouchDB was a perfect candidate for that role. Keeping versions of all docs helps a lot.
The second project was quite query-heavy: the idea was to dump social data first and then query it with various filters. Standard CouchDB query features looked a bit poor for our needs, so we added Lucene as a full-text indexer and ran most of the queries through the Lucene part. That solution has worked well enough.