I'm starting to work on a financial information website (somewhat like google finance or bloomberg).
My website needs to display live currency, commodity, and stock values. I know how to do this frontend-wise, but I have a back-end data-storage question (I already have the data feed APIs):
How would you guys go about this - would you set up your own database and save all the data in the db with some kind of a backend worker, and then plug in your frontend to your db, or would you plug your frontend directly to the API and not mine the data?
Mining the data could be good for later reference (statistics and other things that the API won't allow), but can such a big quantity of ever-growing information be stored in a database? Is this feasible? What other things should I be considering?
Thank you - any comment would be much appreciated!
First, I'd cleanly separate the front end from the code that reads the source APIs. Having done that, I could have the code that reads the source APIs feed the front end directly, feed a database, or both.
I'm a database guy. I'd lean toward feeding data from the APIs into a database, and connecting the front end to the database. But it really depends on the application's requirements.
Feeding a database makes it simple and cheap to change your mind. If you (or whoever) decides later to keep no historical data, just delete old data after storing new data. If you (or whoever) decides later to keep all historical data, just don't delete old data.
Feeding a database also gives you fine-grained control over who gets to see the data, relatively independent of their network operating system permissions. Depending on the application, this may or may not be a good thing.
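If it helps, here is a minimal sketch of what that middle layer could look like, assuming a hypothetical JSON quote feed and a JDBC-accessible database (the URL, table, and column names are invented for illustration, not from your feed):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Background worker sketch: polls a (hypothetical) quote API and stores each
    // snapshot in a database. The front end only ever talks to the database.
    public class QuoteFeedWorker {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            try (Connection db = DriverManager.getConnection(
                    "jdbc:mysql://localhost/finance", "feed_user", "secret")) {
                PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO quote_history (symbol, price, captured_at) VALUES (?, ?, NOW())");
                while (true) {
                    // Fetch the latest quote; real feeds return JSON/XML, parsing is left out here.
                    HttpRequest req = HttpRequest.newBuilder(
                            URI.create("https://example-feed.com/quotes?symbol=EURUSD")).build();
                    String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
                    double price = parsePrice(body);   // feed-specific parsing goes here

                    insert.setString(1, "EURUSD");
                    insert.setDouble(2, price);
                    insert.executeUpdate();

                    Thread.sleep(5_000);               // poll interval: tune to the feed's limits
                }
            }
        }

        private static double parsePrice(String feedBody) {
            // Placeholder: extract the price from whatever format the feed actually uses.
            return Double.parseDouble(feedBody.trim());
        }
    }

Keeping history is then just a question of whether some separate job deletes old rows from quote_history or not.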
Can we use a Hazelcast database to store and link the data for a tracker with bar graphs? Below are the points I need to confirm to build the application for the hardware:
- I am using a temperature sensor interfaced with an Arduino Yun and want to upload the readings from the sensor to a Hazelcast server.
- Using the single database output uploaded to the Hazelcast server, the data is read back from that database by an Arduino MKR1000.
- Link the data to different development tools to design different types of dashboards, like pie charts, bar charts, line charts, etc.
Please suggest the best way to link these pieces together and create the database in the data grid.
How you want to use data on your dashboard will basically depend on how you have modelled your data - one map or multiple maps, etc. Then you can retrieve data through single key-based lookups or by running queries and use that for your dashboard. You can define the lifetime of the data - be it a few minutes, hours, or days. See eviction: http://docs.hazelcast.org/docs/3.10.1/manual/html-single/index.html#map-eviction
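As a rough illustration (the map name, key format, and TTL are just placeholders), a map holding sensor readings with a bounded lifetime could look like this with the Hazelcast 3.x API:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    // Sketch: a "readings" map whose entries expire after one hour,
    // read back either by key or by a query over the values.
    public class ReadingsMapExample {
        public static void main(String[] args) {
            Config config = new Config();
            MapConfig readingsConfig = new MapConfig("readings");
            readingsConfig.setTimeToLiveSeconds(3600);      // lifetime of each entry
            config.addMapConfig(readingsConfig);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            IMap<String, Double> readings = hz.getMap("readings");

            readings.put("sensor-1|2018-06-01T10:15:00", 23.7);

            // Single key-based lookup...
            Double value = readings.get("sensor-1|2018-06-01T10:15:00");

            // ...or a query across values, e.g. with a Predicate from com.hazelcast.query:
            // readings.values(Predicates.greaterThan("this", 25.0));
            System.out.println(value);
            hz.shutdown();
        }
    }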
If you decide to use a visualisation tool for the dashboard that can use JMX, then you can latch on to the JMX beans Hazelcast exposes, which give you information about the data stored in the cluster and a lot more. Check out this: http://docs.hazelcast.org/docs/3.10.1/manual/html-single/index.html#monitoring-with-jmx
You can configure Hazelcast to use a MapLoader - MapStore to persist the cached data to any back-end persistence mechanism – relational or no-sql databases may be good choices.
On your first point, I wouldn’t expect anything running on the Arduino to update the database directly, but the MKR1000 is going to get you connectivity, so you can use Kafka/MQTT/… - take a look at https://blog.hazelcast.com/hazelcast-backbone-iot-internet-things/.
If you choose this route, you'd set up a database that is accessible to all cluster members, create the MapLoader/MapStore class (see the example code for help) and configure the cluster to read/write.
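A minimal MapStore sketch could look like the following - the table and column names ("readings", reading_key, reading_value) and the connection string are assumptions for illustration, not taken from the Hazelcast docs:

    import com.hazelcast.core.MapStore;
    import java.sql.*;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a MapStore that persists sensor readings to a relational table
    // "readings(reading_key VARCHAR PRIMARY KEY, reading_value DOUBLE)".
    public class ReadingMapStore implements MapStore<String, Double> {

        private Connection connect() throws SQLException {
            return DriverManager.getConnection("jdbc:mysql://db-host/sensors", "app", "secret");
        }

        @Override
        public void store(String key, Double value) {
            try (Connection c = connect();
                 PreparedStatement ps = c.prepareStatement(
                     "REPLACE INTO readings (reading_key, reading_value) VALUES (?, ?)")) {
                ps.setString(1, key);
                ps.setDouble(2, value);
                ps.executeUpdate();
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public Double load(String key) {
            try (Connection c = connect();
                 PreparedStatement ps = c.prepareStatement(
                     "SELECT reading_value FROM readings WHERE reading_key = ?")) {
                ps.setString(1, key);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getDouble(1) : null;
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }

        // The remaining callbacks can delegate to the two methods above.
        @Override public void storeAll(Map<String, Double> map) { map.forEach(this::store); }
        @Override public void delete(String key) { /* DELETE FROM readings WHERE reading_key = ? */ }
        @Override public void deleteAll(Collection<String> keys) { keys.forEach(this::delete); }
        @Override public Map<String, Double> loadAll(Collection<String> keys) {
            Map<String, Double> result = new HashMap<>();
            keys.forEach(k -> { Double v = load(k); if (v != null) result.put(k, v); });
            return result;
        }
        @Override public Iterable<String> loadAllKeys() { return null; } // null = no eager pre-load
    }

You would then reference this class in the map's MapStoreConfig so the cluster writes through to the database for you.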
Once the data is in the cluster, access is easy and you can use a dashboard tool of your choice to present the data.
(edit) - to your question about presenting historical data on your dashboard:
Rahul’s blog post describes a very cool implementation of near/real-time data management in a Hazelcast RingBuffer. In that post, I think he mentioned collecting data every second and buffering two minutes worth.
The ring buffer has a configured capacity, but note that he is overwriting on add - this is kind of a given for real-time systems, where the choice is between losing older data or crashing.
For a generalized query-tool approach, I think you'd augment this. Off the top of my head, I could see using the ring buffer in conjunction with a distributed map. You could (but wouldn't need to) populate the map, using a map-event interceptor to populate the ring buffer. That should leave the existing functionality intact. The map, though, would allow you to configure a map-store/map-loader, so that your data is saved in a backing store. The map would support queries - but keep in mind that IMDG queries do not read through to the backing store.
This would give you flexibility, at the cost of some complexity. The real-time data in the ring buffer would always be available, quickly and easily. Data returned from querying the map would be very quick, too. For ‘historical’ data, you can query your backing store - which is slower, but will probably have relatively great storage capacity. The trick here is to know when to query each. The most recent data is a given, with its fixed capacity. You need to know how much is in the cluster - i.e. how far back your in-memory history goes. I think it best to configure the expiry to a useful limit and provision the storage so that data leaves the map by expiration - not eviction. In this way, you can know what the beginning of the in-memory history is. Monitoring eviction events would tell you that your cluster has a complete view of data back to a known time.
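To make that a little more concrete, here is a rough sketch (map and ring-buffer names invented) of letting every map put also feed the ring buffer - shown here via an entry listener rather than an interceptor - so the map keeps the query/persistence role and the ring buffer keeps the fixed-capacity real-time view:

    import com.hazelcast.core.EntryAdapter;
    import com.hazelcast.core.EntryEvent;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.ringbuffer.Ringbuffer;

    // Sketch: writes go to the distributed map (which can have a MapStore behind it);
    // a listener copies each new value into a ring buffer for the real-time view.
    public class RealTimePlusHistory {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, Double> readings = hz.getMap("readings");
            Ringbuffer<Double> recent = hz.getRingbuffer("recent-readings");

            readings.addEntryListener(new EntryAdapter<String, Double>() {
                @Override
                public void entryAdded(EntryEvent<String, Double> event) {
                    recent.add(event.getValue()); // overwrites the oldest item once capacity is reached
                }
            }, true); // true = include values in the event

            readings.put("sensor-1|12:00:01", 23.7); // lands in the map and the ring buffer
        }
    }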
I was reading this article about ETL for analytics databases, and I came across this interesting note:
If you discover that your internal applications are deleting data that's important for analysis, you have two options: either ask your software engineers to modify the application code to avoid deletions, or implement a data pipeline that includes Change Data Capture (CDC). CDC preserves the state of a database at every point in its history so that, even if data is deleted from the production schema, it is still available for analysis. This solution is often far less invasive than re-architecting an application to avoid deletions.
I'm relatively new to these tools. If I have a Ruby on Rails app with typical CRUD actions (on a MySQL database), instead of re-writing my code to preserve data:
Could I actually implement something like RJMetrics so I don't need to modify my code but still get to keep all my data? If not RJMetrics,
Are there services out there that give me a stream of my data so I don't have to re-write code?
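For reference, one lightweight way to get a change stream out of MySQL without touching the application code is to tail the binary log, which is essentially what the CDC tools do under the hood. A rough sketch using the open-source mysql-binlog-connector-java library (host, credentials, and what you do with the captured rows are placeholders, not anything from the article):

    import com.github.shyiko.mysql.binlog.BinaryLogClient;
    import com.github.shyiko.mysql.binlog.event.DeleteRowsEventData;
    import com.github.shyiko.mysql.binlog.event.EventType;

    // Sketch: listens to the MySQL binlog and reacts to deletions, e.g. by copying
    // the deleted rows into an archive/analytics store before they are gone.
    public class DeleteCapture {
        public static void main(String[] args) throws Exception {
            BinaryLogClient client = new BinaryLogClient("db-host", 3306, "replica_user", "secret");
            client.registerEventListener(event -> {
                EventType type = event.getHeader().getEventType();
                if (type == EventType.DELETE_ROWS || type == EventType.EXT_DELETE_ROWS) {
                    DeleteRowsEventData data = event.getData();
                    // data.getRows() holds the deleted row values; ship them to the warehouse here.
                    System.out.println("Captured deleted rows: " + data.getRows());
                }
            });
            client.connect(); // blocks and streams events as they happen
        }
    }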
Imagine a large corp with dozens of companies, each with its own website, and each website will have its own unique functional requirements
Most data on each website will be specific to that website
Each website can edit its own data
Some data will be shared across all websites
There will be a central CMS that is allowed to edit this data, but other websites can read and use that data
e.g. say you're planning the infrastructure for a company that owns multiple sub-companies that make different kinds of products, some in the same category (cereal, food), others in completely different categories (books, instruments). Some are marketing websites, some are for CRM, some are online stores
there is a list of regulatory requirements that affect all products
each company should manage the status of compliance of its own products to each requirement
when a new requirement surfaces, details regarding that requirement should only be entered once
How would the multiple databases be coordinated?
edit: added more info per Bob's suggestions
Thanks for the incredibly insightful questions!
compliance data is not shared; it is siloed within each site
shared data is only on the one enterprise-wide database, they will mostly be "types of [thing]"
no conclusive list of instances where they'll be used but currently it'd be to populate CMS dropdowns for individual sites.
changes to shared data would occur a few times a year.
Ideally changes would be reflected within a few minutes, but an hour or so should be acceptable
very low volume in shared data.
All DBs will be new, decision on which DB is pending current investigation.
Sub-systems will expose REST api
Here are some ways I have seen this handled; you need to think about the implications of each structure based on the details of your particular business domain. All can work, but all have to be carefully set up if they are going to work.
One database for shared information and one for each client for client-specific information. Set up the overall application so that the first thing you enter in the application on log-in is the client, and it connects to the correct client database. People might also need a way to change the client if users will handle multiples.
Separate servers for each client if they completely need to be siloed. Database changes are made by script (and kept in source control) and are applied to each server as needed. So the changes to the central database might have a job that runs to push any data changes to the other servers.
All the data in one database, but making sure each table has a client_id so that the data is always filtered correctly by client. You can set up separate views by client, so that the users can only see the clients they are supposed to see. This only works if the data for each client is substantially in the same form.
And since you are in a regulatory environment, I strongly urge that you create an audit database that is updated by database triggers (never audit from the application, you will lose changes to the data) for each database.
I agree with Chris that, even after both the sets of questions, there is still a big set of possible solutions. For instance, if the databases were the same technology, and the shared data were stored in the same way in each one, you could do db-level replication from the central db to the others. Is it OK to have 2 separate dbs per application (one with shared stuff and one with not-shared?) - this would influence the kind of replication.
Or you could have a purely code solution, where clicking publish in a GUI that updates the central db calls a set of APIs that also update the other dbs. Or micro-services - updating the central db also creates a message on a shared queue, that is picked up by services that each look after a different db and apply the updates in whatever form makes sense for that db.
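As a sketch of the queue variant (topic name and payload shape are invented here), the publish action in the central CMS could just emit an event, with each sub-site running a small consumer that applies the change to its own database however makes sense for it:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Sketch: the central CMS publishes "shared data changed" events to a topic;
    // each sub-site runs a small consumer that applies the change to its own DB.
    public class SharedDataPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Payload shape is up to you; JSON describing the changed record works well.
                producer.send(new ProducerRecord<>("shared-data-changes",
                        "requirement-42",
                        "{\"type\":\"regulatory_requirement\",\"action\":\"upsert\",\"id\":42}"));
            }
        }
    }

Given the stated volume (a few changes a year) and the acceptable latency (minutes to an hour), even a simple scheduled pull over each sub-system's REST API would also do the job.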
It depends on (among the things already mentioned) what your organisation's technology strategy is, what technology and skills you already have in-house, and so on.
So this is as much an architecture question as it is a db question.
I don't think this question is sufficiently clear to get a single answer. However, there are a few possibilities.
In many cases where you have shared data, you want to have a single point of ownership of that information. It could be in a database, in an Excel file (which can then be turned into CSV and periodically loaded on all DBs), or some other form. The specifics depend on what is shared exactly.
Now in this case it sounds like you are going to have some sort of legal department in charge of some shared information and they will manage that data, which will then be shared to the other sites. This might be done with an application they manage which aggregates information from the other companies or it could be data which is pushed to their systems.
A final point:
Software is at its best when it facilitates human solutions to human problems, not when it tries to solve those problems directly. In these cases, you probably want a good human solution in place and then to look at what software can do to support that. A lot of the issues (who owns the information?) will already have been solved and you will be simply automating what is already done.
These days I am trying to design architecture of a new MMORPG mobile game for my company. This game is similar to Mafia Wars, iMobsters, or RISK. Basic idea is to prepare an army to battle your opponents (online users).
Although I have previously worked on multiple mobile apps, this is something new to me. After a lot of struggle, I have come up with an architecture, which is illustrated with the help of a high-level flow diagram:
We have decided to go with client-server model. There will be a centralized database on server. Each client will have its own local database which will remain in sync with server. This database acts as a cache for storing things that do not change frequently e.g. maps, products, inventory etc.
With this model in place, I am not sure how to tackle following issues:
What would be the best way of synchronizing server and client databases?
Should an event get saved to local DB before updating it to server? What if app terminates for some reason before saving changes to centralized DB?
Will simple HTTP requests serve the purpose of synchronization?
How to know which users are currently logged in? (One way could be to have client keep on sending a request to server after every x minutes to notify that it is active. Otherwise consider a client inactive).
Are client side validations enough? If not, how to revert an action if server does not validate something?
I am not sure if this is an efficient solution and how it will scale. I would really appreciate if people who have already worked on such apps can share their experiences which might help me to come up with something better. Thanks in advance.
Additional Info:
The client side is implemented in a C++ game engine called Marmalade. This is a cross-platform game engine, which means you can run your app on all major mobile OSes. We can certainly achieve threading, which is also illustrated in my flow diagram. I am planning to use MySQL for the server and SQLite for the client.
This is not a turn-based game, so there is not much interaction with other players. The server will provide a list of online players, and you can battle them by clicking the battle button; after some animation, the result will be announced.
For database synchronization I have two solutions in mind:
Store a timestamp for each record, and also keep track of when the local DB was last updated. When synchronizing, only select the rows that have a greater timestamp and send them to the local DB. Keep an isDeleted flag for deleted rows so every deletion simply behaves as an update (a rough sketch of this approach is below). But I have serious doubts about performance, as for every sync request we would have to scan the complete DB and look for updated rows.
Another technique might be to keep a log of each insertion or update that takes place against a user. When the client app asks for a sync, go to this table and find out which rows of which table have been updated or inserted. Once these rows are successfully transferred to the client, remove this log. But then I think of what happens if a user uses another device. According to the logs table, all updates have been transferred for that user, but actually that was done on another device. So we might have to keep track of the device as well. Implementing this technique is more time-consuming, but I'm not sure if it outperforms the first one.
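A minimal sketch of the timestamp-based approach (option 1), with purely illustrative table and column names and assuming a JDBC-accessible server database:

    import java.sql.*;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: return every row changed since the client's last sync, including
    // soft-deleted rows (is_deleted = 1), so deletions replicate as updates.
    public class SyncService {

        public List<String> changesSince(Connection server, Timestamp lastSync) throws SQLException {
            List<String> changes = new ArrayList<>();
            try (PreparedStatement ps = server.prepareStatement(
                    "SELECT id, payload, is_deleted, updated_at FROM inventory WHERE updated_at > ?")) {
                ps.setTimestamp(1, lastSync);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // In a real implementation this would be serialized for the client.
                        changes.add(rs.getLong("id") + "|" + rs.getString("payload")
                                + "|" + rs.getBoolean("is_deleted"));
                    }
                }
            }
            return changes;
        }
    }

With an index on updated_at, the full-scan worry largely goes away: the query only touches rows newer than the client's last sync timestamp.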
I've actually worked on some of the titles you mentioned.
I do not recommend using MySQL; it doesn't scale up correctly, even if you shard. And if you do shard, you are losing any benefits you might have had in using a relational database.
You are probably better off using a NoSQL database. It is faster to develop with, easy to scale, and it is simple to change the document structure, which is a given for a game.
If your game data is simple you might want to try CouchDB; if you need advanced querying you are probably better off with MongoDB.
Take care of security at the start. They will try to hack the game for sure, and if you have a number of clients released it is hard to make security changes backward compatible. SSL won't do much, as the end user is the problem, not an eavesdropper. Signing or encrypting your data will make it harder for a user to add items and gold to their accounts.
You should also define your architecture to support multiple clients without having a bunch of ifs and case statements. Read the client version and dispatch that client to the appropriate codebase.
Have a maintenance mode with flags for upgrading, maintenance, etc. It will cut you some slack if you need to re-shard your DB or any other change that might require downtime.
Client-side validations are not enough, especially if you are using in-app purchases. I agree with the above post: the server should control the game logic.
As for DB sync, it's best to memcache read-only data. Typical examples are buyable items, maps, news, etc. User data is harder, as you might not be able to afford losing any modified data. The easiest setup is to cache user data for a couple of hours and write directly to the DB every time. If you are using NoSQL it will probably withstand a high load without the need for a persistence queue.
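As an illustration of the "cache the read-only stuff" part (the names are made up, and a real setup would more likely use memcached or similar rather than an in-process map), even something this simple keeps the item catalogue off the database on every request:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    // Sketch: read-mostly data (buyable items, maps, news) is loaded once and
    // refreshed on a fixed interval; user data is written straight through to the DB.
    public class CatalogCache {
        private static final long TTL_MILLIS = 60 * 60 * 1000; // refresh hourly

        private final Supplier<Map<String, String>> loader;    // e.g. wraps "SELECT * FROM items"
        private volatile Map<String, String> items = new ConcurrentHashMap<>();
        private volatile long loadedAt = 0;

        public CatalogCache(Supplier<Map<String, String>> loader) {
            this.loader = loader;
        }

        public Map<String, String> items() {
            long now = System.currentTimeMillis();
            if (now - loadedAt > TTL_MILLIS) {
                items = new ConcurrentHashMap<>(loader.get()); // reload from the DB
                loadedAt = now;
            }
            return items;
        }
    }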
I see two potential problems hidden in the fact that you store all the state on the client, and then update the state on the server using a background thread.
How can the server validate the data being posted? If someone hacked your application, they could modify the code so whenever they swing their sword (or whatever they do in your game), it is always a hit. Doing that in a single player game is not that big a deal, but doing that in an MMORPG can ruin the experience for everyone else. So the server should validate every update of data - or even better, the server should be in charge of every business rule. So when you swing your sword against an opponent, that should be a server call, and the server returns whether or not it is a hit, and how many hit points the opponent lost.
What about interaction with other players (since you say it is an MMORPG, there will be interaction with other players)? Since you say that you update the server and get updates in a background thread, interaction will be sluggish. When you communicate with another character you first have to wait for your background thread to sync data, but you also have to wait on the background thread of the other player to sync data.
Looks nice. But what is the client side made of? Web? Can you use threading to synchronize both DBs? I would build the game so that it interacts immediately with the local DB, and let some background mechanism do the sync (something like a snapshot). This leads me to think about MySQL replication. I think it is worth trying, but I have never done it. It would also answer some of your other questions. But what about the load (how many customers are connected at the same time)?
http://dev.mysql.com/doc/refman/5.0/en/replication.html
Make your client issue commands to the server ("hit player"), and have the server send (relevant) events to the client ("player was killed"). I wouldn't advise going with data synchronization. The server should be responsible for all important game decisions.
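A rough sketch of what that split can look like (the message shapes are invented; any transport works):

    // Sketch: the client only ever sends intents; the server decides the outcome
    // and broadcasts the resulting events.
    public class CommandEventExample {

        // Sent by the client: "I want to attack player 42" - no outcome included.
        static final class AttackCommand {
            final long attackerId;
            final long targetId;
            AttackCommand(long attackerId, long targetId) {
                this.attackerId = attackerId;
                this.targetId = targetId;
            }
        }

        // Sent by the server after applying the game rules server-side.
        static final class BattleResultEvent {
            final long winnerId;
            final long loserId;
            final int damageDealt;
            BattleResultEvent(long winnerId, long loserId, int damageDealt) {
                this.winnerId = winnerId;
                this.loserId = loserId;
                this.damageDealt = damageDealt;
            }
        }

        // Server-side handler: validates and resolves the battle, never trusting the client.
        static BattleResultEvent handle(AttackCommand cmd) {
            int damage = rollDamage(cmd.attackerId, cmd.targetId); // the game rules live here
            return new BattleResultEvent(cmd.attackerId, cmd.targetId, damage);
        }

        private static int rollDamage(long attackerId, long targetId) {
            return 7; // placeholder for the real combat calculation
        }
    }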
I'm in the process of setting up a database with customer information. The database will handle customer data (customer id, address, phone number, etc.) as well as some basic information about which kind of advertisement a specific customer has been subjected to, and how they reacted to it.
The data will be maintained from a central data warehouse, but additional information about customers and the advertisements will also be fed in from other sources. For example, if an external advertisement agency runs a campaign, I want them to be able to feed back data about opt-outs, e-mail bounces, etc. I guess what I need is an API that can be easily handed out to any number of agencies.
My first thought was to set up a web service API for all external sources, but since we'll probably be talking large amounts of data (millions of records per batch) I'm not sure a web service is the best option.
So my question is, what's the best practice here? I need a solution simple enough for advertisement agencies (likely with moderately skilled IT-people) to make use of. Simplicity is of the essence – by which I mean “simplicity over performance” in this case. If the set up gets too complex, it won't work.
The system will very likely be based on Microsoft technology.
Any suggestions?
The process you're describing is commonly referred to as Data Integration using ETL processes. ETL stands for Extract-Transform-Load. The idea is to build up your central data warehouse by extracting information from a lot of different data-sources, transform it and then load it into your data warehouse.
A variety of (also graphical) tools exist to implement such a process. Since you said you'll probably be running a Microsoft stack, I suggest having a look at SQL Server Integration Services (SSIS).
Regarding your suggestion to implement integration using a web service, I don't think that's a good idea either. Similarly, I don't think shifting the burden of data integration to your customers is a good idea. You should agree with your customers on some form of data exchange format; it could be as simple as a CSV file, or XML, Excel sheets, Access databases - use whatever suits your needs.
Any modern ETL tool like SSIS is capable of working with those different data sources.
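If the agencies end up delivering flat files, the receiving side does not need to be fancy. As a sketch (the file layout "customer_id,campaign_id,event", the staging table, and the connection string are assumptions), a small loader that batch-inserts a CSV drop looks like this - whether you hand-roll it or let SSIS do the equivalent graphically:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    // Sketch: loads an agency feedback file "customer_id,campaign_id,event" into a staging table.
    public class FeedbackLoader {
        public static void main(String[] args) throws Exception {
            List<String> lines = Files.readAllLines(Paths.get("optouts_2018-06.csv"));
            try (Connection db = DriverManager.getConnection(
                    "jdbc:sqlserver://dwh-host;databaseName=marketing", "loader", "secret");
                 PreparedStatement ps = db.prepareStatement(
                     "INSERT INTO staging_campaign_feedback (customer_id, campaign_id, event) VALUES (?, ?, ?)")) {
                for (String line : lines.subList(1, lines.size())) {   // skip the header row
                    String[] cols = line.split(",");
                    ps.setLong(1, Long.parseLong(cols[0]));
                    ps.setLong(2, Long.parseLong(cols[1]));
                    ps.setString(3, cols[2]);
                    ps.addBatch();
                }
                ps.executeBatch();   // for millions of rows, execute in chunks in a real loader
            }
        }
    }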