elasticsearch maximum requests per second - database

I installed Elasticsearch on a Linux server as a single node. I have a CSV file of several tens of thousands of lines containing IDs, and the goal is to iterate over this file and retrieve the corresponding data from the Elasticsearch index.
The problem is that after a few thousand requests, Elasticsearch crashes.
How many requests per second can a single Elasticsearch node handle?
Thanks
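
A frequent cause of this pattern is issuing one HTTP request per ID, which a small single node can struggle with. As a minimal sketch, the lookups could be batched with the multi-get API; this assumes the 7.x Java high-level REST client, and the host, index name my-index, file name ids.csv, and batch size of 500 are placeholders, not details from the question.

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.get.MultiGetItemResponse;
    import org.elasticsearch.action.get.MultiGetRequest;
    import org.elasticsearch.action.get.MultiGetResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchedLookup {
        public static void main(String[] args) throws Exception {
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")));
                 BufferedReader reader = new BufferedReader(new FileReader("ids.csv"))) {

                List<String> batch = new ArrayList<>();
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.trim().isEmpty()) {
                        continue;
                    }
                    batch.add(line.trim());
                    if (batch.size() == 500) {          // fetch 500 documents per request
                        fetch(client, batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    fetch(client, batch);
                }
            }
        }

        private static void fetch(RestHighLevelClient client, List<String> ids) throws Exception {
            MultiGetRequest request = new MultiGetRequest();
            for (String id : ids) {
                request.add("my-index", id);            // one item per ID, single round trip
            }
            MultiGetResponse response = client.mget(request, RequestOptions.DEFAULT);
            for (MultiGetItemResponse item : response.getResponses()) {
                if (item.getResponse() != null && item.getResponse().isExists()) {
                    System.out.println(item.getResponse().getSourceAsString());
                }
            }
        }
    }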

Related

Unbalanced distribution of tuples despite partitioning with Apache Flink

I have a batch job running with Flink on EMR which enriches some data stored as CSV on AWS S3 and indexes the tuples with Elasticsearch.
For some reason, one of the hosts is getting a lot more work than the others. I tried to address that by hash-partitioning on several fields of the tuples, but it makes no difference: one node still gets more work than the others. See host 40705 in the screenshot below.
I need to distribute the indexing across the various nodes as well as possible to optimize throughput.
I tried using rebalance() but the result is the same. Any clues?
EDIT: [screenshot of the job overview screen]
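
For reference, the two approaches mentioned, hash partitioning on tuple fields and rebalance(), look roughly like this in the DataSet API. This is only a sketch; the S3 path, field types, and field positions are assumptions, not details from the question.

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple3;

    public class PartitioningSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder: three string fields per CSV line.
            DataSet<Tuple3<String, String, String>> records = env
                    .readCsvFile("s3://my-bucket/input/")   // hypothetical path
                    .types(String.class, String.class, String.class);

            // Variant 1: hash-partition on selected tuple fields. If the chosen
            // fields have a skewed key distribution, one subtask still gets most data.
            DataSet<Tuple3<String, String, String>> byHash = records.partitionByHash(0, 1);

            // Variant 2: round-robin redistribution, independent of key values.
            DataSet<Tuple3<String, String, String>> rebalanced = records.rebalance();

            // Downstream enrichment / indexing would consume one of these data sets.
            rebalanced.first(10).print();   // print() triggers execution in the DataSet API
        }
    }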

Running Solr index on Hadoop

I have a huge amount of data that needs to be indexed, and it takes more than 10 hours to get the job done. Is there a way I can do this on Hadoop? Has anyone done this before? Thanks a lot!
You haven't explained where the 10 hours go. Is the time spent extracting the data, or just indexing it?
If the extraction is what takes long, then you may use Hadoop. Solr has a bulk insert feature, so in your map function you could accumulate thousands of records and commit them to Solr for indexing in one shot. That will improve your performance a lot.
Also, what size is your data?
You could collect a large number of records in the reduce function of a MapReduce job. You have to generate appropriate keys in your map so that a large number of records go to a single reduce call. In your custom reducer class, initialize the Solr client in the setup/configure method (depending on your Hadoop version) and close it in the cleanup method. Create a document collection object (in SolrNet or SolrJ) and commit all of the documents in one shot.
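A rough sketch of that reduce-side pattern with SolrJ; the class name, field mapping, core URL, and batch size of 1000 are assumptions rather than details from the answer.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class SolrIndexReducer extends Reducer<Text, Text, Text, Text> {

        private SolrClient solr;
        private final List<SolrInputDocument> buffer = new ArrayList<>();

        @Override
        protected void setup(Context context) {
            // One client per reducer task; the core URL is a placeholder.
            solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/mycore").build();
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", key.toString());      // placeholder field mapping
                doc.addField("body", value.toString());
                buffer.add(doc);
                if (buffer.size() >= 1000) {             // send documents in large batches
                    flush();
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            flush();                                     // commit whatever is left, then close
            try {
                solr.commit();
            } catch (Exception e) {
                throw new IOException(e);
            }
            solr.close();
        }

        private void flush() throws IOException {
            if (buffer.isEmpty()) {
                return;
            }
            try {
                solr.add(buffer);                        // one HTTP round trip for the whole batch
            } catch (Exception e) {
                throw new IOException(e);
            }
            buffer.clear();
        }
    }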
If you are using Hadoop, there is another option called Katta. You can look at it as well.
You can write a MapReduce job over your Hadoop cluster which simply takes each record and sends it to Solr over HTTP for indexing. AFAIK Solr currently doesn't support indexing across a cluster of machines, so it would be worth looking into Elasticsearch if you also want to distribute your index over multiple nodes.
There is a Solr Hadoop output format which creates a new index in each reducer, so you distribute your keys according to the indices you want and then copy the HDFS files into your Solr instance after the fact.
http://www.datasalt.com/2011/10/front-end-view-generation-with-hadoop/

Can Neo4j handle traversal with more than 60,000 nodes?

I'm working with a fairly big Neo4j setup with more than 60,000 nodes. Each node has about 4-5 properties and a simple parent->child relationship. When working with those 60,000 nodes, especially in queries that are expensive and repetitive, I'm getting various HTTP 500 errors through Neo4j's REST interface.
After going through the logs, I found that Java heap space was the problem. I raised the 512 MB limit to 2048 MB, but it still gives me 500s. If I set the heap to something like 3 GB or 4 GB, Neo4j doesn't even start. I'm testing this on a fairly good laptop (i5, 4 GB RAM), and I really want to know whether this is a configuration problem or whether the application will perform OK on my server (an Amazon Extra-Large High-CPU instance). Is there some sort of caching that can help speed things up? Basically, I'm iterating over the entire network of nodes multiple times.
I'm running two queries. The first is:
start referrer=node(3) match path=referrer-[*1..1]->referral return referral
This discovers the nodes which are Tier 1 for Referrer #3. Then I have to discover all nodes from all of its tiers, returning the node, the nodes from the first tier, and the tier number.
start referrer=node(3) match path=referrer-[*1..1]->firsttier-[*0..]->referral return referral, firsttier, length(path)
It works perfectly and it's quite fast. However, I'm doing this for ALL the nodes in my network. I'm running both queries (and applying business logic with them) inside a for loop. The loop runs 60,000 times.
Right now I'm testing this on my laptop; however, this "task" has been prepared for distributed processing, since I built everything with ZeroMQ. The for loop sends messages to workers, and the workers run the queries.
60,000 nodes is small for Neo4j -- it can go up to 32 billion+ -- but you need to increase the heap size in the config.
See http://blog.neo4j.org/2011/03/neo4j-13-abisko-lampa-m04-size-really.html
However, you probably want to limit the number of nodes you return over REST and page them.
Or you might consider returning all the IDs, caching them in your app or something like Redis, and then doing a multi-get with Cypher on the IDs. This way you aren't running the query every time.
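A minimal sketch of that ID-based multi-get, assuming the current Bolt-based Java driver rather than the REST interface and Cypher 1.x syntax used in the question; the connection details and the ID list are placeholders.

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Record;
    import org.neo4j.driver.Result;
    import org.neo4j.driver.Session;
    import org.neo4j.driver.Values;

    import java.util.List;

    public class MultiGetByIds {
        public static void main(String[] args) {
            // Placeholder connection details; credentials and URI are illustrative.
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {

                List<Long> ids = List.of(3L, 17L, 42L);   // IDs cached from a previous run

                // One parameterized query fetches all requested nodes in a single round trip.
                Result result = session.run(
                        "MATCH (n) WHERE id(n) IN $ids RETURN id(n) AS id, n",
                        Values.parameters("ids", ids));

                while (result.hasNext()) {
                    Record record = result.next();
                    System.out.println(record.get("id").asLong() + " -> " + record.get("n").asMap());
                }
            }
        }
    }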

Dumping Twitter Streaming API tweets as-is to Apache Cassandra for post-processing

I am using the Twitter Streaming API to monitor several keywords/users. I am planning to dump the tweet JSON strings I get from Twitter as-is into a Cassandra database and do post-processing on them later.
Is such a design practical? Will it scale up when I have millions of tweets?
Things I will do later include finding the top followed users, top hashtags, etc. I would like to save the stream as-is so I can mine it later for any new information that I may not know I need now.
What is important is not so much the number of tweets as the rate at which they arrive. Cassandra can easily handle thousands of writes per second, which should be fine (Twitter currently generates around 1200 tweets per second in total, and you will probably only get a small fraction of those).
However, tweets per second are highly variable. In the aftermath of a heavy spike in writes, you may see some slowdown in range queries. See the Acunu blog posts on Cassandra under heavy write load part i and part ii for some discussion of the problem and ways to solve it.
In addition to storing the raw json, I would extract some common features that you are almost certain to need, such as the user ID and the hashtags, and store those separately as well. This will save you a lot of processing effort later on.
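As a sketch of that layout with the DataStax Java driver (4.x), storing the raw JSON next to a few extracted fields; the keyspace, table, and column names are illustrative, not from the answer.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;

    import java.util.Set;

    public class TweetWriter {
        public static void main(String[] args) {
            // Assumes a reachable Cassandra node and an existing keyspace named "tweets_ks".
            try (CqlSession session = CqlSession.builder().withKeyspace("tweets_ks").build()) {

                // Raw payload plus a few extracted fields that will be queried later.
                session.execute("CREATE TABLE IF NOT EXISTS tweets ("
                        + " tweet_id bigint PRIMARY KEY,"
                        + " user_id bigint,"
                        + " hashtags set<text>,"
                        + " raw_json text)");

                PreparedStatement insert = session.prepare(
                        "INSERT INTO tweets (tweet_id, user_id, hashtags, raw_json)"
                        + " VALUES (?, ?, ?, ?)");

                // In a real ingester these values come from the streaming API payload.
                long tweetId = 123456789L;
                long userId = 42L;
                Set<String> hashtags = Set.of("cassandra", "twitter");
                String rawJson = "{\"id\":123456789,\"text\":\"...\"}";

                session.execute(insert.bind(tweetId, userId, hashtags, rawJson));
            }
        }
    }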
Another factor to consider is to plan for how the data stored will grow over time. Cassandra can scale very well, but you need to have a strategy in place for how to keep the load balanced across your cluster and how to add nodes as your database grows. Adding nodes can be a painful experience if you haven't planned out how to allocate tokens to new nodes in advance. Waiting until you have an overloaded node before adding a new one is a good way to make your cluster fall down.
You can easily store millions of tweets in Cassandra.
For processing the tweets and getting stats such as top followed users and hashtags, look at Brisk from DataStax, which builds on top of Cassandra.

Does it make sense to use Hadoop for import operations and Solr to provide a web interface?

I'm looking at the need to import a lot of data in real time into a Lucene index. This will consist of files of various formats (Doc, Docx, PDF, etc.).
The data will be imported as batches of compressed files, so they will need to be decompressed, indexed as individual files, and somehow related to the file batch as a whole.
I'm still trying to figure out how to accomplish this, but I think I can use Hadoop for the processing and import into Lucene. I can then use Solr as a web interface.
Am I overcomplicating things, since Solr can already process data? Because the CPU load for import is very high (due to pre-processing), I believe I need to separate importing from everyday searching regardless of the implementation.
Q: "Please define a lot of data and realtime"
"A lot" of data is 1 Billion email messages per year (or more), with an average size of 1K, with attachments ranging from 1K to 20 Megs with a small amount of data ranging from 20 Megs to 200 Megs. These are typically attachments that need indexing referenced above.
Realtime means it supports searching within 30 minutes or sooner after it is ready for import.
SLA:
I'd like to provide a search SLA of 15 seconds or less for searching operations.
If you need the processing done in real time (or near real time, for that matter), then Hadoop may not be the best choice for you.
Solr already handles all aspects of processing and indexing the files. I would stick with a Solr-only solution first. Solr allows you to scale to multiple machines, so if you find that the CPU load is too high because of the processing, then you can easily add more machines to handle the load.
I suggest that you use Solr Replication to ease the load, by indexing on one machine and retrieving from others. Hadoop is not suitable for real-time processing.
1 billion documents per year translates to approximately 32 documents per second if spread uniformly (1,000,000,000 / (365 × 24 × 3,600 s) ≈ 32).
You could run text extraction on a separate machine and send the indexable text to Solr. I suppose, at this scale, you have to go for multi-core Solr, so you can send indexable content to different cores. That should speed up indexing.
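A sketch of that routing idea, hashing a document ID across several cores with SolrJ; the core URLs and field names are placeholders, not part of the answer.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CoreRouter {
        // Each core gets its own client; the URLs are placeholders.
        private final SolrClient[] cores = {
                new HttpSolrClient.Builder("http://solr-host:8983/solr/core0").build(),
                new HttpSolrClient.Builder("http://solr-host:8983/solr/core1").build(),
                new HttpSolrClient.Builder("http://solr-host:8983/solr/core2").build()
        };

        // Route a document to a core based on a hash of its ID, so indexing
        // load (and index size) spreads roughly evenly across the cores.
        public void index(String id, String extractedText) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("text", extractedText);

            int core = Math.floorMod(id.hashCode(), cores.length);
            cores[core].add(doc);               // commits can be batched separately
        }
    }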
I have indexed small structured documents in the range of 100 million without much trouble on a single core. You should be able to scale to a few hundred million documents with a single Solr instance. (The text extraction service could run on another machine.)
Read about large-scale search on the HathiTrust blog for various challenges and solutions. They use Lucene/Solr.
