I'm building a digital product for a big community of users (2 million +), using Express + GraphQL for the API server and React + Apollo for the web app. Then I'm going to build mobile applications using React Native when the web part is completed.
Right now I'm struggling thinking how to develop the part that is going to gather all the statistics for the user generated content in the platform. To simplify things let's say I have to record:
unique user views of each article
total number of views of each
article
visits on each user profile
I have a couple of questions for those who had previous experience in developing such systems to gather data.
How should I record the raw data?
Should I create a kind of a log in a database and use that later to generate aggregate data depending on my needs?
Something like (article view example):
{
'user_id' : String,
'article_id' : String,
'date' : Date,
}
or should I use a different approach? And which database you recommend to use? Right now I'm thinking about using MongoDb since I'm already using it for the rest of the application.
Indeed there is no single "right" solution but some approaches may be chosen. I'd like suggest the combined approach used in several of my projects: store the most significant (and queryable) part of data as structured but put also raw data as semi-structured. The DBMS like SQL Server (faster but limited in free edition) or PostgreSQL (slower but may be sufficient) can do the job.
You may take a look at the chapter "Semi-structured data and high load" in my book for more details.
I have seen so many e-commerce websites that provides search box to search products. In that search features most of the search fields are auto-complete. If we enter a letter on field, then it will show the data which is including that letter as suggestions from database. As I know basics on developing that functionality.
But what if database contains huge amount of data?
For example, e-commerce websites like flipkart and amazon had a lot of products in their database. so, if user enter a letter in search field, it have to search for data including that letter among all the data in database and match data including that letter and display data as suggestions. The websites are processing it within nano seconds of time. I wonder how they achieved that functionality? I can't understand what are the technologies they are using.
As a learner I wanna know the functional flow and if possible demo for that feature.
I think your question can be divided into two parts. 1) how to design the database for the search technology. 2) how to implement an effect search... They belong to the field of search engine technology.
About the Q1, you can create a table to save the keywords for search, and in the table, you'd better to design a column or similar method to describe the "search-weight". As well known, a view is a practical solution to accelerate the access of the data.
About the Q2, the search engine technique is No longer mysterious, some open source projects can simulate the feature of search engine, such as Apache Lucene, visit please Apache Lucene.
more discuss:
And specially, in your front system for example, the ASP/JSP or even simple HTML page, you should use some scripts e.g. Ajax, to popup, drawdown, of caurse, simple DOM Javascript+DIV can reach it too, but with jQuery or other libarary can make it easily. Here is an example.
Here is the backend system example
To reduce the burden on the host and reduce the requirement of network's bandwidth, the front javascript should active the autocomplete feature with more than three characters.
Please pay attention in your actual application, that your server has calculation's limitation, and the client page has usually many elements, all will reduce user friendliness. Please do not make the request and response too complex.
An alternative simulation can be: make a FIFO logic, save some usual search keyword in the "cache" or temp-table|view, and the amount of data will be reduced.
There are too many solutions, I can only think of these tricks at this moment.
regards
Azure Search service maxes out at 300GB of data. As of today, we've exceeded that. Our database table consists mainly of unstructured text from website news articles around the world.
Do we have any options at all here? We like Azure Search and have built our entire back-end infrastructure around it, but now we're dead in the water with being able to add any more documents to it. Does Azure Search allow compression on the documents?
Azure Search offers a variety of SKUs. The biggest one allows you to index up to 2.4 TB per service. You can find more details here.
Note, changing SKUs requires re-indexing the data.
We don't provide data compression. If you'd like to talk to Azure Search program managers about your capacity requirements, feel free to reach out to #liamca.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Let us consider the following problem. We have a system containing a huge amount of data (Big-Data). So, in fact we have a data base. As the first requirement we want to be able to write to and to read from the data base quickly. We also want to have a web-interface to the data-bases (so that different clients can write to and read from the data base remotely).
But the system that we want to have should be more than a data base. First, we want to be able to run different data-analysis algorithm on the data to find regularities, correlations, abnormalities and so on (as before we do care a lot about the performance). Second, we want to bind a machine learning machinery to the data-base. Which means that we want to run machine learning algorithms on the data to be able to learn "relations" present on the data and based on that predict the values of entries that are not yet in the data base.
Finally, we want to have a nice clicks based interface that visualize the data. So that the users can see the data in form of nice graphics, graphs and other interactive visualisation objects.
What are the standard and widely recognised approaches to the above described problem. What programming languages have to be used to deal with the described problems?
I will approach your question like this: I assume you are firmly interested in big data database use already and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some that meet your 5 requirements - mainly Cassandra and Hadoop.
1) The first requirement we want to be able to write to and to read from the database quickly.
You'll want to explore NoSQL databases which are often used for storing “unstructured” Big Data. Some open-source databases include Hadoop and Cassandra. Regarding the Cassandra,
Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities (ref).
References:
Big Data and NoSQL: Five Key Insights
NoSQL standouts: New databases for new applications
Big data woes: Which database should I use?
Cassandra and Spark: A match made in big data heaven
List of NoSQL databases (currently 150)
2) We also want to have a web interface to the database
See the list of 150 NoSQL databases to see all the various interfaces available, including web interfaces.
Cassandra has a cluster admin, a web-based environment, a web-admin based on AngularJS, and even GUI clients.
References:
150 NoSQL databases
Cassandra Web
Cassandra Cluster Admin
3) We want to be able to run different data-analysis algorithm on the data
Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
References:
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
Cassandra at eBay - Cassandra Summit
An Introduction to Real-Time Analytics with Cassandra and Hadoop
4) We want to run machine learning algorithms on the data to be able to learn "relations"
Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in
2010, and became a top-level Apache project in February, 2014. It has
since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).
Regarding Hadoop,
With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.
References:
Getting Started with Apache Spark and Cassandra
What is Apache Mahout?
Data Science with Apache Hadoop: Predicting Airline Delays
5) Finally, we want to have a nice clicks-based interface that visualize the data.
Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
References:
Data Science Central - Resources
Big Data Visualization
Start looking at:
what kind of data you want to store in the Database?
what kind of relationship between data you got?
how this data will be accessed? (for instance you need to access a certain set of data quite often)
are they documents? text? something else?
Once you got an answer for all those questions, you can start looking at which NoSQL Database you could use that would give you the best results for your needs.
You can choose between 4 different types: Key-Value, Document, Column family stores, and graph databases.
Which one will be the best fit can be determined answering the question above.
There are ready to use stack that may really help to start with your project:
Elasticsearch that would be your Database (it has a REST API that you can use to write them to the DB and to make queries and analysis)
Kibana is a visualization tool, it will allows you to explore and visualize your data, it it quite powerful and will be more than enough for most of your needs
Logstash can centralize the data processing and help you process and save them in elasticsearch, it already support quite few sources of logs and events, and you can also write your own plugin as well.
Some people refers to them as the ELK stack.
I don't believe you should worry about the programming language you have to use at this point, try to select the tools first, sometimes the choices are limited by the tools you want to use and you can still use a mixture of languages and make the effort only if/when it make sense.
A common way to solve such a requirements is to use Amazon Redshift and the ecosystem around it.
Redshift is a peta-scale data warehouse (it can also start with giga-scale), that exposes Ansi SQL interface. As you can put as much data as you like into the DWH and you can run any type of SQL you wish against this data, this is a good infrastructure to build almost any agile and big data analytics system.
Redshift has many analytics functions, mainly using Window functions. You can calculate averages and medians, but also percentiles, dense rank etc.
You can connect almost every SQL client you want using JDBS/ODBC drivers. It can be from R, R studio, psql, but also from MS-Excel.
AWS added lately a new service for Machine Learning. Amazon ML is integrating nicely with Redshift. You can build predictive models based on data from Redshift, by simply giving an SQL query that is pulling the data needed to train the model, and Amazon ML will build a model that you can use both for batch prediction as well as for real-time predictions. You can check this blog post from AWS big data blog that shows such a scenario: http://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R
Regarding visualization, there are plenty of great visualization tools that you can connect to Redshift. The most common ones are Tableau, QliView, Looker or YellowFin, especially if you don't have any existing DWH, where you might want to keep on using tools like JasperSoft or Oracle BI. Here is a link to a list of such partners that are providing free trial for their visualization on top of Redshift: http://aws.amazon.com/redshift/partners/
BTW, Redshift also provides a free trial for 2 months that you can quickly test and see if it fits your needs: http://aws.amazon.com/redshift/free-trial/
Big Data is a tough problem primarily because it isn't one single problem. First if your original database is a normal OLTP database that is handling business transactions throughout the day, you will not want to also do your big data analysis on this system since the data-analysis you will want to do will interfere with the normal business traffic.
Problem #1 is what type of database do you want to use for data-analysis? You have many choices ranging from RDBMS, Hadoop, MongoDB, and Spark. If you go with RDBMS then you will want to change the schema to be more compliant with data-analysis. You will want to create a data warehouse with a star schema. Doing this will make many tools available to you because this method of data analysis has been around for a very long time. All of the other "big data" and data analysis databases do not have the same level of tooling available, but they are quickly catching up. Each one of these will require research on which one you will want to use based on your problem set. If you have big batches of data RDBMS and Hadoop will be good. If you have streaming types of data then you will want to look at MongoDB and Spark. If you are a Java shop then RDBMS, Hadoop or Spark. If you are JavaScript MongoDB. If you are good with Scala then Spark.
Problem #2 is getting your data from your transactional database into your big data storage. You will need to find a programming language that has libraries to talk to both databases and you will have to decide when and where you will be moving this data. You can use Python, Java or Ruby to do this work.
Problem #3 is your UI. If you decide to go with RDBMS then you can use many of the available tools available or you can build your own. The other data storage solutions will have tool support but it isn't as mature is that available for the RDBMS. You are most likely going to build your own here anyway because your analysts will want to have the tools built to their specifications. Java works with all of these storage mechanisms but you can probably get Python to work too. You may want to provide a service layer built in Java that provides a RESTful interface and then put a web layer in front of that service layer. If you do this, then your web layer can be built in any language you prefer.
These three languages are most commonly used for machine learning and data mining on the Server side: R, Python, SQL. If you are aiming for heavy mathematical functions and graph generation, Haskell is very popular.
How are document collaboration tools such as Google Docs and Sharepoint implemented in the backend? What kind of database architecture in the backend is used to implement features such as multiple people editting the document simultaneously. How is this done efficiently efficiently for large documents without having each edit update an entire database entry?
And how do they maintain the complete version history of every single edit while not using up tons of disk space?
Do Google Docs and Sharepoint have degrading performance for very very large documents?