Persisting data from Apache Storm, any Framework? - database

We are using Kafka-Storm in our project. In Storm we will use multiple bolts for transformations, but first, as part of a POC, we want to persist data into a DB. Which framework should we use, and which ones suit a big-data scenario? Is Trident applicable here? For persistence I am looking for something like Hibernate/JPA. If possible, please provide sample code.

You can use almost any framework you want, although your specific choice may impact your topology slightly. Choose your favorite framework and give it a try.
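Since the question asks for sample code, here is a minimal sketch of the persistence step on its own, written in Python with the standard-library sqlite3 module standing in for the real database. It is not Storm API code: the class name BatchingWriter, the events table, and its columns are hypothetical, and the wiring into a bolt (calling write for each tuple and flush on shutdown or on a timer) is left out. The idea it illustrates is buffering rows and writing them in batches rather than issuing one insert per tuple.

```python
import sqlite3


class BatchingWriter:
    """Buffers rows in memory and writes them to the database in batches.

    Hypothetical helper for illustration; a bolt would call write() once
    per incoming tuple and flush() when it shuts down.
    """

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []
        # Assumed schema: one row per message, keyed by an opaque string.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events (msg_key TEXT, payload TEXT)"
        )

    def write(self, msg_key, payload):
        """Queue one row; flush automatically once the batch is full."""
        self.buffer.append((msg_key, payload))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Insert all buffered rows in a single transaction."""
        if not self.buffer:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO events (msg_key, payload) VALUES (?, ?)",
                self.buffer,
            )
        self.buffer.clear()


def count_rows(conn):
    """Return the number of persisted rows (used to inspect the result)."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

With a batch size of 2, the first write stays in the buffer and the second one triggers the insert. Whatever framework you pick (plain JDBC, Hibernate/JPA, or a Trident state), the same trade-off applies: larger batches mean fewer round trips but more rows to replay if a worker fails before flushing.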

Related

Pros/Cons of incorporating multiple database types into same project

I'm starting my first online project, which I expect will need to scale, so I have opted for a NoSQL DB. After some reading and some modeling of what my queries would look like, I am considering two databases. Cassandra seems like the right choice for item lookups by keyword, but MongoDB sounds like the right choice for initially entering the data, as it can retain the account structure in document form.
This split decision has left me wondering: Are there any major companies that use multiple database types for storage of different items as in using both Cassandra and Mongo together?
I would think scaling up would be more difficult but are the added benefits (if there are any) worth the trouble? I'm not the expert on this. I'm hoping you are. Thanks in advance for sharing your experience.
Cassandra can handle both use cases so you can use the same database for your purposes.
Stargate (https://stargate.io/) is an open-source API platform which provides a data gateway to Cassandra with REST API, GraphQL API, Document API and even native CQL access.
The Document API lets you save and search schemaless JSON documents to/from Cassandra directly from your app.
You can try it out for free on Astra with no credit card required. In just a few clicks, you can launch a Cassandra cluster with Stargate pre-configured, so you can use the Document API straight out of the box and build a proof-of-concept app immediately without having to download, install, or configure a Cassandra cluster.
There are even sample apps you can access straight from the Astra dashboard so you can see Stargate in action. For more info, see Using the Document API on Astra. Cheers!
Using multiple database technologies in the same project is somewhat common nowadays and it is called "Polyglot persistence".
Many people use this approach to take advantage of multiple systems. As you mentioned, Cassandra is right for some things and something else (maybe MongoDB) is better for others, so a combination can give you the best of both worlds.
Scaling, replication, and support can be more costly when you use multiple technologies, because you need expertise in both to support them.
So if you really have use cases where Cassandra won't be a good choice, and some primary use cases where Cassandra is the best choice, then yes, going with two databases can be the best option, provided you are ready to take on the trouble of supporting two systems.

How do I use Zookeeper as "Database" with Play Framework instead of traditional JDBC?

I am using Play Framework 2.0.1 in Scala to create an application whose model contains basic data, nothing very complicated. I would like the information in the model to be saved to, requested from, updated in, and deleted from a ZooKeeper instance.
How can I best go about this without fundamentally breaking Play Framework?
The simple answer would be to tell you that ZooKeeper is not meant to be used as a general datastore/database; however, I am inclined to believe that you are really looking for something like MongoDB.
Check out MongoDB Replica Sets and Election/Voting. This should give you what you want. It is much easier to manage than ZooKeeper and more useful for general application data storage needs.

Building OLAP style applications with SalesForce/Apex

We are considering moving a planning and budgeting app to the Salesforce platform. The existing app is built on a dimensional data model, and has extensive ad-hoc query capability implemented through star joins.
We see how the platform will allow us to put together the data entry screens quickly, but the underlying datamodel and query languages do not seem suitable for our reporting requirements.
Is it possible to have fast and flexible reporting with this platform? If not, how cumbersome is it to extract the data on a regular basis to bring it into an analytical application?
Hmm - I guess I answer my own question? The relative silence on this (even with a bounty; who wants to have anything to do with something that is ignored on Stack Overflow?) is a kind of answer.
So - No, this platform is not well suited for applications that have any kind of ROLAP requirements. I guess shame on me for asking a silly question, but I welcome any responses...
Doing native, fast, OLAP-like queries: possible, but somewhat cumbersome since SFDC is basically a traditional-style RDBMS with somewhat limited joining capability within its native reporting. You can do OLAP-like things with custom code but it can get cumbersome if you are used to using established high-end OLAP solutions.
Extracting data from SFDC to use in other applications: really easy and supported across a number of technologies, the most common is extracting CSV files or using the data web service. There are tools like the SFDC data loader which also let you extract/load data via command line or UI. That's probably what I would recommend to a client who has pre-existing expertise in a given analysis tool.
I would not attempt to build an OLAP data model in Salesforce. The limitations in both the joins and the roll-up of data from child to parent make it difficult to implement a star schema with aggregations.
There are some products, such as IQ 20/20, that can integrate with Salesforce and provide near-real-time business intelligence functionality.
Analytical snapshots can also help, as they provide a way to build aggregate tables. The snapshots pull data from a report and can be scheduled to run periodically. The different Salesforce editions offer different scheduling features, so it is best to check the limits for your edition before going too far into the design.

AI Framework for managing automatic responses

What is the best technique or framework for implementing a chat robot on a website, something like the one on the IKEA site:
http://193.108.42.79/ikea-us/cgi-bin/ikea-us.cgi
I don't care about the frontend; what I really need to know is which framework or technique to use to manage the artificial intelligence.
Best Regards,
Pedro
While it is possible that the only technique they are using is keyword spotting, it is more likely that they are also using naive Bayesian classification.
There are some client-side classifiers written in JavaScript, such as Brain, which eliminate the need for server-side processing.
Brain: http://harthur.github.com/brain/#bayesian
A more comprehensive solution may include Classifier4J or OpenNLP, either of which could be utilized in a Java Applet or on a server if need be.
OpenNLP: http://incubator.apache.org/opennlp/
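To make the naive Bayesian idea concrete, here is a minimal sketch in Python of how such a classifier could map a visitor's question to an intent, which the bot would then answer with a canned response. It is not how the IKEA bot or any of the libraries above is implemented; the class name, the intent labels, and the training phrases are all made up for illustration. It uses a multinomial model over whitespace-separated words with add-one (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesIntentClassifier:
    """Tiny multinomial naive Bayes classifier over whitespace tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def train(self, text, label):
        """Record one labelled example phrase."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability for the text."""
        words = text.lower().split()
        total_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, example_count in self.label_counts.items():
            # Log prior: how common this intent is in the training data.
            score = math.log(example_count / total_examples)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: two intents with two phrases each.
classifier = NaiveBayesIntentClassifier()
classifier.train("what are your opening hours", "hours")
classifier.train("when does the store open", "hours")
classifier.train("how much does delivery cost", "delivery")
classifier.train("do you deliver to my address", "delivery")
```

After training, classifier.classify("what time does the store open") picks the "hours" intent and classifier.classify("how much is delivery") picks "delivery". A real bot would need far more training phrases per intent, plus a fallback response when no intent scores clearly above the others.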

Django database scalability

We have a new Django-powered project that is likely to see heavy traffic (meaning heavy DB interaction), so we need to consider database scalability in advance. After some research, the following questions are still not clear to us:
coarse-grained: how do we assign one DB table (a Django model) to a specific DB (possibly on another server)?
fine-grained: how do we assign a group of table rows to a specific DB (so-called sharding, possibly also on another DB server)?
how do we direct writes and reads to different DBs? (This will be helpful for future MySQL master/slave replication.)
We are looking for a solution that:
is transparent to the application program (meaning we don't need additional code in views.py)
works at the ORM level (meaning it only needs to be specified in models.py)
is compatible with the current (or future) Django release (to keep changes minimal when upgrading Django in the future)
I'm still doing the research and will share in this thread later if I find anything useful.
I hope someone with experience can answer. Thanks.
Don't forget about caching either. Using memcached to relieve your DB of load is key to building a high performance site.
As Alex said, Django core doesn't support your specific requests for those features, though they are definitely on the todo list.
If you don't do this in the application layer, you're basically asking for performance trouble. There aren't any really good open source automation layers for this sort of task, since it tends to break SQL axioms. If you're really concerned about it, you should be coding the entire application for it, not simply hoping that your ORM will take care of it.
There is a GSoC project by Alex Gaynor that will in future allow the use of multiple databases in one Django project. But for now there is no working cross-RDBMS solution.
There is no solution for this right now either.
And again, there is no cross-RDBMS solution. But if you are using MySQL, you can try an excellent third-party Django application called mysql_replicated. It makes it easy to set up a master-slave replication scenario.
For some reason we are using Django with SQLAlchemy here. Maybe a combination of Django and SQLAlchemy would also work for your needs.
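For later readers: Django releases from 1.2 onward ship multiple-database support, and the first and third questions can be handled with a database router listed in the DATABASE_ROUTERS setting. Below is a minimal sketch of such a router. The method names and signatures follow the router interface in the Django documentation, but the class name, the app label "analytics", and the database aliases "analytics_db", "replica1", and "replica2" are hypothetical and would have to match your own DATABASES setting. Row-level sharding is not covered here and still needs application-level logic.

```python
import random


class AppAndReplicaRouter:
    """Sketch of a Django database router (all aliases are hypothetical).

    Models of pinned apps live in their own database; everything else
    is written to "default" and read from a randomly chosen replica.
    """

    # Assumed mapping of app label -> database alias.
    app_to_db = {"analytics": "analytics_db"}
    # Assumed aliases of read replicas of the "default" database.
    replicas = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        pinned = self.app_to_db.get(model._meta.app_label)
        if pinned:
            return pinned
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        return self.app_to_db.get(model._meta.app_label, "default")

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects stored in the same database.
        return self.db_for_write(type(obj1)) == self.db_for_write(type(obj2))

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Create tables only on the database that owns the app; replicas
        # are assumed to receive their schema through replication.
        return db == self.app_to_db.get(app_label, "default")
```

Because the router is consulted by the ORM itself, views.py needs no changes, which matches the "transparent to the application" requirement. One caveat with read replicas is replication lag: a read issued right after a write may not see the new row, so some code paths may still need to pin a query to the primary explicitly.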
