Reading and Processing Millions of documents - database

I have over 1 million documents stored in cloud storage. I want to annotate these documents with an application and then store the results in a database. The data is loaded once and is not real-time; it might be updated once a week.
My approach: this is what I have done and am trying to implement:
List all the documents and store their paths in MongoDB, each with a flag marking it for processing
Write some code to read the files that have the flag set
Get the result for each file and store it in MongoDB
Read the document from the DB directly
This approach is very slow and takes a lot of effort.
My question is: how can I speed up the process of listing all the documents, cleaning them, passing them to the application for annotation, and then storing the results in the DB?
I also want to explore whether I can use other tools to reduce the coding effort needed for this process.
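One common way to speed up a flag-driven pipeline like the one above is to process the flagged documents in parallel instead of one at a time. The following is only a minimal sketch under stated assumptions: the list of dictionaries stands in for the MongoDB collection, `storage` stands in for the cloud storage bucket, and `annotate` is a hypothetical placeholder for the annotation application.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(text):
    # Hypothetical stand-in for the real annotation application.
    return {"length": len(text), "upper": text.upper()}


def process_pending(docs, read_file, workers=8):
    """Annotate every document whose 'process' flag is set, using a thread pool."""
    pending = [d for d in docs if d["process"]]

    def work(doc):
        # Reading from storage is I/O bound, so threads overlap the waiting time.
        return doc["path"], annotate(read_file(doc["path"]))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(work, pending))

    # Store each result and clear the flag (in a real setup: one bulk DB update).
    for doc in pending:
        doc["result"] = results[doc["path"]]
        doc["process"] = False
    return len(pending)


# In-memory stand-ins for cloud storage and the MongoDB collection (assumptions).
storage = {"a.txt": "hello", "b.txt": "world", "c.txt": "done"}
docs = [
    {"path": "a.txt", "process": True},
    {"path": "b.txt", "process": True},
    {"path": "c.txt", "process": False},
]
processed = process_pending(docs, storage.__getitem__)
print(processed)
```

Threads suit this case if most of the time is spent waiting on storage or on the annotation service; if the annotation itself is CPU bound in Python, a process pool would likely be the better fit.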

Related

How to find delta between two SOLR collections

We are using LucidWorks Solr version 4.6.
Our source system stores data in two destination systems (one in real time and the other in batch mode). Data is ingested into Solr through the real-time route.
We need to periodically sync the data ingested into Solr with the data ingested into the batch system.
The design we are currently evaluating is to import the data from the batch system into another Solr collection, but we are not sure how to sync the two collections (i.e. the one with real-time data and the one populated by batch import).
I read about data import handlers, but these would overwrite the existing data in Solr. Is there any way to identify the delta between the two collections and ingest only that?
There is no good way, but there are a couple of things you can do:
When data comes into the real-time system, it gets an import timestamp. A range query on that field then pulls in the new documents. I think newer versions of Solr already have a field for this.
Log the IDs of documents going into the first Solr collection and then index those.
Use a separate queue for the other collection
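The timestamp and ID suggestions above can be sketched as follows. This is an illustration under assumptions, not a Solr client: both collections are represented as plain dictionaries keyed by document ID, and `import_ts` is a hypothetical field name. The query string uses Solr's standard range syntax, `field:[start TO *]`.

```python
def since_query(field, start):
    """Build a Solr range query selecting documents imported at or after `start`."""
    return "%s:[%s TO *]" % (field, start)


def find_delta(realtime_docs, batch_docs, ts_field="import_ts"):
    """Return batch documents that are missing from, or newer than, the real-time set."""
    delta = []
    for doc_id, doc in batch_docs.items():
        existing = realtime_docs.get(doc_id)
        if existing is None or doc[ts_field] > existing[ts_field]:
            delta.append(doc)
    return delta


# Toy data standing in for the two collections (assumption).
realtime = {
    "1": {"id": "1", "import_ts": 10},
    "2": {"id": "2", "import_ts": 20},
}
batch = {
    "1": {"id": "1", "import_ts": 10},  # unchanged, not part of the delta
    "2": {"id": "2", "import_ts": 25},  # newer in the batch system
    "3": {"id": "3", "import_ts": 5},   # missing from the real-time collection
}
delta = find_delta(realtime, batch)
query = since_query("import_ts", "2014-01-01T00:00:00Z")
print(sorted(d["id"] for d in delta), query)
```

Only the documents in `delta` would then need to be indexed, which avoids overwriting the whole collection.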

Automatically push engine datastore data to bigquery tables

To move data from Datastore to BigQuery tables I currently follow a manual and time-consuming process: backing up to Google Cloud Storage and restoring to BigQuery. There is scant documentation on the restoring part, so this post is handy http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
There is also a seemingly outdated article (with code) on how to do it https://cloud.google.com/bigquery/articles/datastoretobigquery
I have, however, been waiting for access to this experimental tester program that seems to automate the process, but have gotten no access for months https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to BigQuery as it comes (inserts and possibly updates). For more business-intelligence-style analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of entering data into bigquery:
through the UI
through the command line
via API
If you choose the API, there are two different ways to load data: "batch" mode or the streaming API.
If you want to send data "as it comes", you need to use the streaming API. Every time you detect a change in your datastore (or maybe once every few minutes, depending on your needs), you call the insertAll method of the API. Please note that the table must be created beforehand with a structure matching your datastore. (This can also be done via the API if needed.)
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to match your datastore and you should be good to go.
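As a rough illustration of the streaming route, the sketch below builds a request body in the shape the insertAll method expects: a `rows` list where each row carries the record under `json` and an optional `insertId` that BigQuery uses for best-effort de-duplication. It makes no network call; the entity fields are hypothetical, and deriving `insertId` from a hash of the row is one possible choice rather than a requirement of the API.

```python
import hashlib
import json


def build_insert_all_body(entities):
    """Build an insertAll request body from a list of datastore-like dicts."""
    rows = []
    for entity in entities:
        # A stable hash of the content makes retries of the same row idempotent.
        payload = json.dumps(entity, sort_keys=True)
        insert_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        rows.append({"insertId": insert_id, "json": entity})
    return {"rows": rows}


# Hypothetical entities that changed since the last push.
changed = [
    {"name": "alice", "score": 10},
    {"name": "bob", "score": 7},
]
body = build_insert_all_body(changed)
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the body of the insertAll call by whichever client library is in use; the target table still has to exist with a matching schema.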

Choosing a DB for a caching system

I am working on a financial database that I need to develop caching for. I have a MySQL database with a lot of raw, real-time data. This data is then provided over an HTTP API using Flask (Python).
Before the raw data is returned, it is manipulated by my Python code. This manipulation can involve a lot of data, so a caching system is in order.
The cached data never changes. For example, if someone queries for data for a time range of 2000-01-01 till now, the data will get manipulated, returned, and stored in the cache as the manipulated data from 2000-01-01 till now. If the same manipulated data is queried again later, the cache will return the values from 2000-01-01 till the last time it was queried, eliminating the need to manipulate that entire period again. Then it will manipulate only the new data from that point till now, and add that to the cache too.
The data size shouldn't be enormous (under 5GB I would say at max).
I need to be able to retrieve from the cache using date ranges.
Which DB should I be looking at? MongoDB? Redis? CouchDB?
Thanks!
Using a BigData solution for such a small data set seems like a waste and might still not yield the required latency.
It seems that what you need is not one of the BigData solutions like MongoDB or CouchDB but a distributed cache (or In-Memory Data Grid).
One of the leading solutions (of which I am one of the contributors) that seems like a perfect match for your needs is XAP Elastic Caching.
For more details see: http://www.gigaspaces.com/datagrid
And you can find a post describing exactly this case on how you can use DataGrid to scale MySQL: "Scaling MySQL" - http://www.gigaspaces.com/mysql
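Independently of which store is chosen, the incremental date-range caching described in the question can be sketched in a few lines. This is a minimal in-memory illustration under assumptions: results are cached per day (the question does not state a granularity), `RangeCache` is a hypothetical name, and `manipulate` stands in for the expensive Python processing of one day's raw data.

```python
from datetime import date, timedelta


class RangeCache:
    """Cache manipulated values per day and compute only the days not seen yet."""

    def __init__(self, compute):
        self._compute = compute  # expensive manipulation of one day's raw data
        self._days = {}          # date -> manipulated value (never changes once stored)

    def get(self, start, end):
        result = []
        day = start
        while day <= end:
            if day not in self._days:
                self._days[day] = self._compute(day)
            result.append(self._days[day])
            day += timedelta(days=1)
        return result


calls = []


def manipulate(day):
    # Hypothetical stand-in for the real data manipulation.
    calls.append(day)
    return day.toordinal() % 7


cache = RangeCache(manipulate)
first = cache.get(date(2000, 1, 1), date(2000, 1, 3))
second = cache.get(date(2000, 1, 1), date(2000, 1, 5))
print(len(calls))
```

The second call reuses the three cached days and computes only the two new ones, which is the behaviour the question asks for. The same per-day keys could be stored in an external key-value store instead of a dictionary.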

Dumping Twitter Streaming API tweets as-is to Apache Cassandra for post-processing

I am using the Twitter Streaming API to monitor several keywords/users. I am planning to dump the tweet JSON strings I get from Twitter as-is into a Cassandra database and do post-processing on them later.
Is such a design practical? Will it scale up when I have millions of tweets?
Things I will do later include getting top followed users, top hashtags etc. I would like to save the stream as is for mining them later for any new information that I may not know of now.
What is important is not so much the number of tweets as the rate at which they arrive. Cassandra can easily handle thousands of writes per second, which should be fine (Twitter currently generates around 1200 tweets per second in total, and you will probably only get a small fraction of those).
However, tweets per second are highly variable. In the aftermath of a heavy spike in writes, you may see some slowdown in range queries. See the Acunu blog posts on Cassandra under heavy write load part i and part ii for some discussion of the problem and ways to solve it.
In addition to storing the raw json, I would extract some common features that you are almost certain to need, such as the user ID and the hashtags, and store those separately as well. This will save you a lot of processing effort later on.
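A minimal sketch of that extraction step is shown below. It assumes the classic tweet JSON layout, with the author under `user.id` and hashtags under `entities.hashtags[].text`; the returned dictionary is a hypothetical record shape, and the write to Cassandra is left out.

```python
import json


def extract_features(raw_tweet):
    """Pull out commonly needed fields while keeping the raw JSON untouched."""
    tweet = json.loads(raw_tweet)
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    hashtags = [h["text"].lower() for h in entities.get("hashtags", [])]
    return {
        "tweet_id": tweet.get("id"),
        "user_id": user.get("id"),
        "hashtags": hashtags,
        "raw": raw_tweet,  # stored as-is for later mining
    }


# Hand-written sample tweet (assumption), trimmed to the fields used here.
sample = json.dumps({
    "id": 1,
    "text": "Trying out #Cassandra and #NoSQL",
    "user": {"id": 42, "screen_name": "example"},
    "entities": {"hashtags": [{"text": "Cassandra"}, {"text": "NoSQL"}]},
})
features = extract_features(sample)
print(features["user_id"], features["hashtags"])
```

Storing `user_id` and `hashtags` in their own columns alongside `raw` means later jobs such as counting top hashtags do not have to re-parse every JSON string.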
Another factor to consider is to plan for how the data stored will grow over time. Cassandra can scale very well, but you need to have a strategy in place for how to keep the load balanced across your cluster and how to add nodes as your database grows. Adding nodes can be a painful experience if you haven't planned out how to allocate tokens to new nodes in advance. Waiting until you have an overloaded node before adding a new one is a good way to make your cluster fall down.
You can easily store millions of tweets in Cassandra.
For processing the tweets and getting stats such as top followed users and top hashtags, look at Brisk from DataStax, which builds on top of Cassandra.

Retrieving information from aggregated weblogs data, how to do it?

I would like to know how to retrieve data from aggregated logs? This is what I have:
- about 30GB daily of uncompressed log data loaded into HDFS (and this will grow soon to about 100GB)
This is my idea:
- each night this data is processed with Pig
- logs are read and split, and a custom UDF retrieves data such as timestamp, url, and user_id (let's say this is all I need) from each log entry and loads it into HBase (log data will be stored indefinitely)
Then, if I want to know which users saw a particular page within a given time range, I can quickly query HBase without scanning the whole log data for each query (and I want fast answers; minutes are acceptable). Multiple queries will be running simultaneously.
What do you think about this workflow? Do you think that loading this information into HBase would make sense? What are the other options, and how do they compare to my solution?
I appreciate all comments/questions and answers. Thank you in advance.
With Hadoop you are always doing one of two things (either processing or querying).
For what you are looking to do, I would suggest using Hive http://hadoop.apache.org/hive/. You can create an M/R job to process your data and push it into Hive tables in whatever shape you like. You can even partition the tables on the data, where appropriate, so that queries skip data they do not need, which helps with speed. From there you can query your results as you like. Here is a very good online tutorial http://www.cloudera.com/videos/hive_tutorial
There are lots of ways to solve this, but it sounds like HBase is a bit overkill unless you want to set up all the servers required to run it as an exercise to learn it. HBase would be good if you had thousands of people simultaneously looking to get at the information.
You might also want to look into Flume, which is a new import server from Cloudera. It will get your files from some place straight into HDFS http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/
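Whichever store ends up holding the extracted fields, the query from the question (which users saw a particular page within a given time range) has the following shape. This is a small in-memory sketch under assumptions: the tab-separated line format and the timestamp format are hypothetical, and the list of tuples stands in for the HBase or Hive table.

```python
from datetime import datetime


def parse_line(line):
    """Split one log line into (timestamp, url, user_id)."""
    ts, url, user_id = line.rstrip("\n").split("\t")
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S"), url, user_id


def users_who_saw(records, url, start, end):
    """Return the set of users who requested `url` between `start` and `end`, inclusive."""
    return {user for ts, u, user in records if u == url and start <= ts <= end}


# Hypothetical log lines in an assumed tab-separated format.
log_lines = [
    "2010-07-01T10:00:00\t/home\tu1\n",
    "2010-07-01T11:30:00\t/home\tu2\n",
    "2010-07-01T12:00:00\t/about\tu1\n",
    "2010-07-02T09:00:00\t/home\tu3\n",
]
records = [parse_line(line) for line in log_lines]
seen = users_who_saw(
    records, "/home", datetime(2010, 7, 1), datetime(2010, 7, 1, 23, 59, 59)
)
print(sorted(seen))
```

In a table store, the same filter maps naturally onto partitioning or keying by date and url, so that a query only touches the relevant slice instead of scanning everything.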
