Designing a near real-time streaming backend - Solr

I have the following requirements for designing a streaming backend :
Documents are added at ~20 docs/sec. Each doc has a timestamp field.
Searches are primarily based on a timestamp range (e.g. show me documents that arrived in the last 20 minutes).
Search QPS: 100 searches/sec.
Documents older than 2 days can be continuously deleted for optimization purposes (by a cron job).
I am thinking of using Solr (with SolrReplication/NRT). The problem with Solr is the frequent updates/deletes: for the freshest data I would need to commit on each update (otherwise the data won't be visible to searchers), and setting pollInterval to ~1 minute might kill both the master and the slaves. NRT/SolrCloud could be one of the options, but I am not very sure about their stability.
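To make the commit concern concrete, here is roughly the ingest/query path I have in mind, sketched with the pysolr client (the client choice, core URL, field names, and the one-second commitWithin window are illustrative assumptions):

import pysolr

# URL and field names are placeholders for my setup.
solr = pysolr.Solr('http://localhost:8983/solr/stream', timeout=10)

# Ingest (~20 docs/sec): commitWithin=1000 asks Solr to make the document
# searchable within one second, so Solr batches commits internally instead
# of taking one hard commit per update.
solr.add([{'id': 'doc-1', 'timestamp': '2012-12-01T12:00:00Z'}],
         commitWithin=1000)

# Search (~100 QPS): timestamp range query over the last 20 minutes.
results = solr.search('timestamp:[NOW-20MINUTES TO NOW]', rows=100)

# Cron cleanup: delete documents older than 2 days.
solr.delete(q='timestamp:[* TO NOW-2DAYS]')

Even with commitWithin batching commits on the master, I am unsure replication can keep the slaves this fresh without the polling overhead mentioned above.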
Any other approaches/suggestions based on SQL/NoSQL architectures?

MySQL + memcached. Facebook runs their entire site on these two widely available, widely supported, free and open-source packages.
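A rough sketch of how that pairing could serve your workload (schema, names, and the 5-second cache TTL are illustrative, not prescriptive): index the timestamp column, serve the hot last-20-minutes query from memcached with a short TTL, and prune old rows from cron.

import json

import pymysql
from pymemcache.client.base import Client

db = pymysql.connect(host='localhost', user='app', password='...', database='stream')
cache = Client(('localhost', 11211))

def recent_docs(minutes=20):
    # A 5-second TTL keeps results near-fresh while absorbing most of the
    # 100 searches/sec against a single cached result.
    key = 'recent:%d' % minutes
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    with db.cursor() as cur:
        # Relies on an index on the ts column for the range scan.
        cur.execute("SELECT id, ts, body FROM docs "
                    "WHERE ts >= NOW() - INTERVAL %s MINUTE", (minutes,))
        rows = cur.fetchall()
    cache.set(key, json.dumps(rows, default=str), expire=5)
    return rows

def prune_old_docs():
    # Run from cron: delete documents older than 2 days.
    with db.cursor() as cur:
        cur.execute("DELETE FROM docs WHERE ts < NOW() - INTERVAL 2 DAY")
    db.commit()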

Which open source CEP should I choose for distributed and pipelined processing: Siddhi, Flink, or Esper?

I am a little biased towards Siddhi CEP because it has the Siddhi query language, but it uses Storm for distributed processing, and WSO2 provides a web interface/dashboard to create and deploy applications. I think this will give me less independence to enhance or use some features.
Flink, on the other hand, seems like a good choice, but it requires a lot of code to implement even simple logic.
Is there a better option than these? I am confused.
What do you mean by less independence? You can use Siddhi 4.x [1] without depending on Storm, by using its source and sink features to receive and send messages from one instance to another over TCP, Kafka, HTTP, etc.
WSO2 Stream Processor also uses the new version of Siddhi, and with its editor you can simulate events and also debug.
Update: From 4.1, [WSO2 Stream Processor][2] can run on top of Kafka in fully distributed mode. See https://docs.wso2.com/display/SP4xx/Fully+Distributed+Deployment.
[1] https://wso2.github.io/siddhi/
[2] https://wso2.com/analytics
I would do a test: create 10 queries in each system, something like:
select * from SomeEvent where value = 1
select * from SomeEvent where value = 2
...
select * from SomeEvent where value = 9
select * from SomeEvent where value = 10
The idea is to see how easy it is to create the queries, how the API or deploy steps work and how performance changes with the number of queries.
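To make that concrete, here is a small load-generator sketch in Python (the HTTP endpoint and event shape are hypothetical placeholders for whatever source the engine under test exposes):

import time

import requests  # assumes the requests library is available

ENDPOINT = 'http://localhost:8080/SomeEvent'  # hypothetical HTTP source

def send_events(n):
    start = time.time()
    for i in range(n):
        # Cycle value through 1..10 so every deployed filter query matches.
        requests.post(ENDPOINT, json={'value': (i % 10) + 1})
    elapsed = time.time() - start
    print('%d events in %.2fs (%.0f events/sec)' % (n, elapsed, n / elapsed))

# Measure once with 1 query deployed, again with all 10, and compare.
send_events(10000)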

Quickly adding edge counts to a document in ArangoDB

Not too complicated: I want to count the edges of each document and save the number in the document. I've come up with two queries that work; unfortunately since I have millions of edges both are quite slow. Is there a faster way to update documents with a property storing their number of edges? (just a count at a point in time)
AQL queries that are functional but slow:
FOR doc IN Documents
LET inEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'inbound', maxDepth: 1}))
LET outEdgesCount = LENGTH(GRAPH_NEIGHBORS('edgeGraph', doc, {direction: 'outbound', maxDepth: 1}))
UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
or:
FOR e IN Edges
COLLECT docId = e._to WITH COUNT INTO counter
UPDATE SPLIT(docId, '/')[1] WITH {inEdgeCount: counter} IN Documents
(and then repeat for outbound edges)
As an aside, is there any way to view either query speed (e.g. FOR executions per second) or percentage completion? I've been trying to judge speed by using LIMITed queries to start with, but the time required doesn't seem to scale linearly.
With ArangoDB 2.8 you can use graph pattern matching traversals to execute this with better performance:
FOR doc IN Documents
LET inEdgesCount = LENGTH(FOR v IN 1..1 INBOUND doc GRAPH 'edgeGraph' RETURN 1)
LET outEdgesCount = LENGTH(FOR v IN 1..1 OUTBOUND doc GRAPH 'edgeGraph' RETURN 1)
UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
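If you drive this from an application, a sketch with the python-arango driver (the driver choice and connection details are assumptions) would look like:

from arango import ArangoClient

# Connection details are placeholders.
client = ArangoClient(hosts='http://localhost:8529')
db = client.db('mydb', username='root', password='...')

aql = """
FOR doc IN Documents
  LET inEdgesCount = LENGTH(FOR v IN 1..1 INBOUND doc GRAPH 'edgeGraph' RETURN 1)
  LET outEdgesCount = LENGTH(FOR v IN 1..1 OUTBOUND doc GRAPH 'edgeGraph' RETURN 1)
  UPDATE doc WITH {inEdgesCount: inEdgesCount, outEdgesCount: outEdgesCount} IN Documents
"""
db.aql.execute(aql)  # runs the bulk update server-side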
Currently ArangoDB doesn't have a way to monitor the progress of long-running tasks. With ArangoDB 3.0 we're going to introduce a new monitoring framework that allows better inspection of what's actually going on in the server. However, with 3.0 it won't be able to gather live statistics; we may see this further down the 3.x road later this year. Judging percentage completion may become possible for easy tasks like creating indices, but for queries it's rather going to be the number of documents read/written so far.
We did similar queries to validate whether a graph obeys a power law.

App Engine query in admin datastore viewer returning different results than programmatic query

I'm flummoxed.
I noticed today that some data I thought should be present in my production App Engine app wasn't showing up. I connected to the app via the remote console and ran the queries manually. Sure enough, it looked like I only had 15 of the 101 rows I was expecting to see.
Then I went to my admin console at appengine.google.com and fired up the datastore viewer with the following query:
SELECT * FROM Assignment where game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')
The result I see is the first page of 20 results. I page through those results, and am able to see all 101 entities. HOORAY! My data is still there. BUT why then can't I access it via the db API? (NOTE: I've already tried clearing memcache via the memcache viewer, even though this particular query isn't manually memcached.)
From the remote console:
> from google.appengine.ext.db import GqlQuery
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
15
The remote console agrees with the app itself, which only seems to be able to see 15 of the expected 101 rows.
What gives?
UPDATE:
I suspect this might be an indexing issue. If I issue get_by_key_name for one of the missing rows, it subsequently shows up in db api queries.
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
15
> entities.Assignment.get_by_key_name('201212-assignment-135.9')
<entities.Assignment object at 0xa11eb6c>
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb')").count()
16
So should I (or can I) rebuild my indexes to remedy this problem?
UPDATE #2:
I attempted to build a perfect index for this query, and have just verified that even when the query does use the just-built index (via query.index_list()), the results are still only limited to a small subset of items available via the datastore viewer. Infuriatingly, it's actually a different subset than is available with the previous index (20 items vs 15 items). So now adding an additional filter term results in an additional 5 rows returned. So dumb.
All indexes claim to be "serving" so there shouldn't be any reason that the indexes are this far off.
UPDATE #3:
Sometimes, using my new index, I'll get the right answer:
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb') and user = 'zee'").count()
101
However, if I issue this query 10 times, it comes back with the 'bad' results about half the time:
> GqlQuery("SELECT * FROM Assignment WHERE game = KEY('Game', '201212-foo') and player = KEY('Player', 'player-mb') and user = 'zee'").count()
16
So maybe it's an issue of a bad/behind Bigtable replica that I'm hitting half the time, or something else completely opaque that we won't get an answer to (App Engine status does list a service disruption today), but I have a feeling that this will be fixed on its own. Will update again if it does.
FINAL UPDATE:
As I suspected, when I woke up this morning my app (and manual queries) now see a consistent, correct view of the data. Would still love an answer as to why this happened, but until I get that I'm going to chalk it up to internal Google Bigtable weirdness.
I filed this issue against appengine to see if I can get an answer from someone in the know.
For HRD applications, this is working as intended. App Engine High Replication Datastore (HRD) stores your data synchronously in multiple datacenters. However, the delay from the time a write is committed until it becomes visible in all datacenters means that queries across multiple entity groups (non-ancestor queries) can only guarantee eventually consistent results. [1]
In your specific case, the discrepancy between the results from your application and the Admin Console Datastore Viewer is most likely because they are reading from different Datastore servers with different consistency.
If you require a consistent view of your data, I advise taking a closer look at the article "Structuring Data for Strong Consistency" [1].
[1] https://developers.google.com/appengine/docs/java/datastore/structuring_for_strong_consistency
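To illustrate the pattern from that article: if Assignment entities are created with their Game as parent (i.e. in the same entity group), an ancestor query over them is strongly consistent, unlike the global query above. A sketch using the db API (the modeling is illustrative, not your actual schema):

from google.appengine.ext import db

game_key = db.Key.from_path('Game', '201212-foo')

# Non-ancestor (global) query: eventually consistent, so it can briefly
# miss recently written entities, as seen above.
q1 = db.GqlQuery("SELECT * FROM Assignment WHERE game = :1", game_key)

# Ancestor query: strongly consistent, but requires each Assignment to be
# created with parent=game_key so it lives in the Game's entity group.
q2 = db.GqlQuery("SELECT * FROM Assignment WHERE ANCESTOR IS :1", game_key)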

Solr 3.5 indexing taking very long

We recently migrated from Solr 3.1 to Solr 3.5; we have one master and one slave configured. The master has two cores:
1) Core1 – 44555972 documents
2) Core2 – 29419244 documents
We commit every 5000 documents, but lately a commit is taking very long, 15 minutes or more in some cases. What could have caused this? I have checked the logs, and the only warning I can see is:
“WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.”
Memory details:
export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
I also noticed that the top command shows almost 350GB of virtual memory usage.
What could be causing this, as everything was running fine a few days back?
Do you have a large search warming query? Our commits take up to 2 minutes because of the search warming we have in place. Wondering if that is the case.
The large virtual memory usage would explain this.

What's your experience developing on Google App Engine?

Is GQL easy to learn for someone who knows SQL? How is Django/Python? Does App Engine really make scaling easy? Is there any built-in protection against "GQL Injections"? And so on...
I'd love to hear the not-so-obvious ups and downs of using app engine.
Cheers!
My experience with Google App Engine has been great, and the 1000-result limit has been removed; here is a link to the release notes:
app-engine release notes
No more 1000 result limit - That's right: with addition of Cursors and the culmination of many smaller Datastore stability and performance improvements over the last few months, we're now confident enough to remove the maximum result limit altogether. Whether you're doing a fetch, iterating, or using a Cursor, there's no limits on the number of results.
The most glaring and frustrating issue is the datastore API, which looks great and is very well thought out and easy to work with if you are used to SQL, but has a 1000-row limit across all query result sets, and you can't access counts or offsets beyond that. I've run into weirder issues, with not actually being able to add or access data for a model once it goes beyond 1000 rows.
See the Stack Overflow discussion about the 1000 row limit
Aral Balkan wrote a really good summary of this and other problems
Having said that, App Engine is a really great tool to have at one's disposal, and I really enjoy working with it. It's perfect for deploying micro web services (e.g. JSON APIs) to use in other apps.
GQL is extremely simple - it's a subset of the SQL 'SELECT' statement, nothing more. It's only a convenience layer over the top of the lower-level APIs, though, and all the parsing is done in Python.
Instead, I recommend using the Query API, which is procedural, requires no run-time parsing, and makes 'GQL injection' vulnerabilities totally impossible (though they are impossible in properly written GQL anyway). The Query API is very simple: call .all() on a Model class, or call db.Query(modelname). The Query object has .filter(field_and_operator, value), .order(field_and_direction) and .ancestor(entity) methods, in addition to all the facilities GQL objects have (.get(), .fetch(), .count(), etc.). Each of the Query methods returns the Query object itself for convenience, so you can chain them:
results = MyModel.all().filter("foo =", 5).order("-bar").fetch(10)
Is equivalent to:
results = MyModel.gql("WHERE foo = 5 ORDER BY bar DESC LIMIT 10").fetch()
A major downside when working with App Engine was the 1k query limit, which has been mentioned in the comments already. What I haven't seen mentioned, though, is the fact that there is a built-in sortable order with which you can work around this issue.
From the appengine cookbook:
def deepFetch(queryGen, key=None, batchSize=100):
    """Iterator that yields an entity in batches.

    Args:
        queryGen: should return a Query object
        key: used to .filter() for __key__
        batchSize: how many entities to retrieve in one datastore call

    Retrieved from http://tinyurl.com/d887ll (AppEngine cookbook).
    """
    from google.appengine.ext import db

    # AppEngine will not fetch more than 1000 results
    batchSize = min(batchSize, 1000)
    query = None
    done = False
    count = 0
    if key:
        key = db.Key(key)
    while not done:
        print count
        query = queryGen()
        if key:
            # Page by key: only fetch entities past the last one we saw.
            query.filter("__key__ > ", key)
        results = query.fetch(batchSize)
        for result in results:
            count += 1
            yield result
        if batchSize > len(results):
            done = True
        else:
            key = results[-1].key()
The above code together with Remote API (see this article) allows you to retrieve as many entities as you need.
You can use the above code like this:
def allMyModel():
    return MyModel.all()

myModels = deepFetch(allMyModel)
At first I had the same experience as others who transitioned from SQL to GQL -- kind of weird to not be able to do JOINs, count more than 1000 rows, etc. Now that I've worked with it for a few months I absolutely love the app engine. I'm porting all of my old projects onto it.
I use it to host several high-traffic web applications (at peak time one of them gets 50k hits a minute.)
Google App Engine doesn't use an actual relational database; apparently it uses some sort of distributed hash map. This lends itself to some different behaviors that people who are accustomed to SQL just aren't going to see at first. So, for example, getting a COUNT of items in regular SQL is expected to be a fast operation, but with GQL it's just not going to work the same way.
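For example, the usual workaround for counts is to maintain them yourself with a sharded counter instead of counting entities at read time. A minimal sketch using the db API (the model name and shard count are illustrative):

import random

from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    count = db.IntegerProperty(default=0)

def increment():
    # Pick a random shard so concurrent writes spread across entity groups.
    shard_name = 'shard-%d' % random.randint(0, NUM_SHARDS - 1)
    def txn():
        shard = CounterShard.get_by_key_name(shard_name)
        if shard is None:
            shard = CounterShard(key_name=shard_name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count():
    # Sum over at most NUM_SHARDS entities instead of counting rows.
    return sum(shard.count for shard in CounterShard.all())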
Here are some more issues:
http://blog.burnayev.com/2008/04/gql-limitations.html
In my personal experience, it's an adjustment, but the learning curve is fine.
