efficient way to get deleted documents - cloudant

I'm searching for an efficient way to get the list of documents deleted in a Cloudant database.
Background: I have a Cloudant database containing 4 million records. The business logic also allows documents to be deleted. Data from this database is loaded daily into a SQL data warehouse, where the deleted documents also need to be marked as deleted.
A full reload is not an option since it takes too long. Querying the entire _changes stream also does not seem to scale well when the Cloudant database contains this many documents.

I would use the _changes feed and apply a server-side filter function (http://guide.couchdb.org/draft/notifications.html) to eliminate all documents that don't have the _deleted property set. Your change-feed listener would therefore only be notified whenever a DELETE operation is reported, and network traffic would be kept to a minimum.
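A rough sketch of that approach in Python with the requests library (the account URL, credentials, design document name "sync", and filter name "deleted_only" are placeholders, not part of the original answer):

```python
import requests

CLOUDANT = "https://ACCOUNT.cloudant.com"   # placeholder account URL
DB = "mydb"                                  # placeholder database name
AUTH = ("user", "password")                  # use an API key in practice

# 1. Store a design document whose filter only passes deleted documents.
design_doc = {
    "_id": "_design/sync",
    "filters": {
        # Server-side JavaScript filter: only tombstones pass through.
        "deleted_only": "function(doc, req) { return doc._deleted === true; }"
    }
}
requests.put(f"{CLOUDANT}/{DB}/_design/sync", json=design_doc, auth=AUTH)

# 2. Read the filtered _changes feed; only deletions come back.
resp = requests.get(
    f"{CLOUDANT}/{DB}/_changes",
    params={"filter": "sync/deleted_only", "since": "0"},
    auth=AUTH,
)
for change in resp.json()["results"]:
    print(change["id"], "was deleted")
```

The daily warehouse load could persist the last_seq value returned by the feed and pass it as the since parameter on the next run, so only new deletions are transferred each day.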

Related

idiomatic way to do many dynamic filtered views of a Flink table?

I would like to create a per-user view of data tables stored in Flink, which is constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this is possible.
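A minimal PyFlink sketch of that per-user query idea (it assumes a table named foo is already registered in the catalog; the userid value, job name, and print sink are placeholders for illustration only):

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Ad-hoc per-user query; "foo" is assumed to exist in the catalog.
user_view = t_env.sql_query("SELECT * FROM foo WHERE userid = 'X'")

# The changelog stream starts with +I rows for the current state of the
# filtered table, followed by -U/+U/-D rows as the source data changes;
# this is what a UI session would subscribe to.
changelog = t_env.to_changelog_stream(user_view)
changelog.print()

env.execute("per-user-view")
```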
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state, since I could group the current rows behind the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to re-query the state periodically, which is not ideal for my use case (the per-user state can sometimes be large but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?

cloudant dashdb sync issue

We have created a warehouse with a source database in Cloudant.
We initially ran the schema discovery process on about 40,000 records. Our Cloudant database contains around 2 million records.
The issue we are now facing is that many records have ended up in the _OVERFLOW table in dashDB (meaning they were rejected), with errors like "[column does not exist in the discovered schema. Document has not been imported.]"
It seems to me that the issue is that the Cloudant database, which is actually the result of a dbcopy, contains fields in the docs that are generated internally by Cloudant, with values (like "40000000-5fffffff") that can only be known after they are created. Those fields were not seen by the schema discovery process, so all docs containing such undiscovered fields are now being rejected by the cloudant-dashdb sync.
Does anyone have any idea how to resolve this?
The best option for you to resolve this is with a simple trick: feed the schema discovery algorithm exactly one document with the structure you want to create in your dashDB target.
If you can build such a "template" document ahead of time, have the algorithm discover that one and load it into dashDB. With the continuous replication from Cloudant to dashDB you can then have dbcopy load your actual documents into the database that serves as source for your cloudant-dashdb sync.
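A rough sketch of how such a template document could be pushed into the Cloudant source database with Python and requests before running discovery (the account URL, database name, document id, and field names are all invented for illustration):

```python
import requests

CLOUDANT = "https://ACCOUNT.cloudant.com"   # placeholder account URL
DB = "warehouse_source"                      # database the sync reads from
AUTH = ("user", "password")

# One "template" document containing every field (with representative values)
# that the dashDB target table should have. The field names below are
# hypothetical; the last one stands in for the internally generated field
# mentioned in the question.
template = {
    "_id": "schema-template-001",
    "customer_id": "",
    "order_total": 0.0,
    "created_at": "1970-01-01T00:00:00Z",
    "range": "40000000-5fffffff",
}
requests.put(f"{CLOUDANT}/{DB}/{template['_id']}", json=template, auth=AUTH)
```

Schema discovery run against this single document would then produce a dashDB table containing all of the columns, so the real documents replicated afterwards are not rejected for unknown fields.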
We initially ran the schema discovery process on about 40,000 records.
Our database contains around 2 million records.
Do all these 2 million records share the same schema? I believe not.
"[column does not exist in the discovered schema. Document has not been imported.]"
It means that during your initial 40,000-record scan the application didn't find any document with that field.
Let's say the sequence of documents in your Cloudant db is:
500,000 docs that match schema A
800,000 docs that match schema B
700,000 docs that match schema C
And your discovery process checked just the first 40,000, so it never got to schemas B and C.
I would recommend re-running the discovery process over all 2 million records. It will take time, but it will guarantee that all fields are discovered.

Couchdb document deletion and performance

When a CouchDB document is deleted with a DELETE HTTP request, the document isn't actually removed; instead, it still exists with "_deleted": true. The deletion is therefore an update to the document, which means the view indexes have to be updated as well (which I think is costly). So my question is: if space is no concern, is there any performance gain to be had from deleting documents?
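For context, a small Python sketch of what such a deletion looks like against the HTTP API (the server URL, database name, and document id are placeholders):

```python
import requests

COUCH = "http://localhost:5984"   # placeholder CouchDB URL
DB = "mydb"
DOC_ID = "some-doc-id"

# A DELETE needs the current revision of the document.
doc = requests.get(f"{COUCH}/{DB}/{DOC_ID}").json()
requests.delete(f"{COUCH}/{DB}/{DOC_ID}", params={"rev": doc["_rev"]})

# The document is gone from normal reads...
assert requests.get(f"{COUCH}/{DB}/{DOC_ID}").status_code == 404

# ...but a tombstone revision with "_deleted": true remains, and the
# deletion shows up in the changes feed, which is why views have to
# reprocess it.
changes = requests.get(f"{COUCH}/{DB}/_changes").json()
print([c for c in changes["results"] if c.get("deleted")])
```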
I did the below test a while back. It's in a database with some pretty large documents.
I replicated the database and reran a view in a design document. Each view in this document has hundreds of emits per document, with a total of around 10 million emits on the database. This took around 3 hours to generate the index file from scratch.
I ran a view in another design document that only uses the doc._id field, and has only one emit per document. This took around 3 minutes.
I then deleted all documents and ran the views again, both completed in less than 2 minutes.

Handling large number of ids in Solr

I need to perform an online-user search in Solr, i.e. a user needs to find the list of users who are online and match particular criteria.
How I am handling this: we store the ids of online users in a table, and I send all of them in the Solr request like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the list of ids becomes large, Solr takes too long to resolve the query, and we need to transfer a large request over the network.
One solution could be to use a join in Solr, but the online data changes regularly and I can't re-index that often (every 5-10 minutes is out; it should be at least an hour apart).
Another solution I thought of is firing this query internally from Solr based on a certain parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.
With Solr4's soft commits, committing has become cheap enough that it might be feasible to actually store the "online" flag directly in the user record and just have &fq=online:true on your query. That reduces the overhead involved in sending 5000 ids over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set commitWithin on the update. It's worth a shot, anyway.
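A hedged sketch of what that could look like over Solr's HTTP API from Python (the core name users, the field name online, and the 1-second commitWithin window are assumptions, not part of the original answer):

```python
import requests

SOLR = "http://localhost:8983/solr/users"   # placeholder core URL

def set_online(user_id: str, online: bool) -> None:
    """Atomically update the user's online flag; commitWithin asks Solr to
    make the change visible via a cheap (soft) commit within ~1 second."""
    doc = {"id": user_id, "online": {"set": online}}
    requests.post(
        f"{SOLR}/update",
        params={"commitWithin": 1000},
        json=[doc],
    )

# Query side: filter on the flag instead of shipping thousands of ids.
params = {"q": "some criteria", "fq": "online:true"}
results = requests.get(f"{SOLR}/select", params=params).json()
```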
We worked around this issue by implementing sharding of the data.
Basically, without going heavily into code detail:
Write your own indexing code:
- use consistent hashing to decide which ID goes to which Solr server (a sketch follows after this list)
- index each user's data to the relevant shard (it can be several machines)
- make sure you have redundancy
Query the Solr shards:
- do sharded queries in Solr using the shards parameter
- start an EmbeddedSolr and use it to do a sharded query
Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard.
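A minimal illustration of the consistent-hashing step mentioned above (the shard URLs, the MD5 hash choice, and the number of virtual nodes are arbitrary choices for this sketch):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map user ids to Solr shard URLs on a consistent-hash ring, so that
    adding or removing a shard only moves a fraction of the ids."""

    def __init__(self, shards, vnodes=64):
        self.ring = []  # sorted list of (hash, shard) points on the ring
        for shard in shards:
            for i in range(vnodes):
                point = self._hash(f"{shard}#{i}")
                bisect.insort(self.ring, (point, shard))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, user_id: str) -> str:
        h = self._hash(user_id)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing([
    "http://solr1:8983/solr/users",   # placeholder shard URLs
    "http://solr2:8983/solr/users",
])
print(ring.shard_for("user-42"))      # index this user's docs on that shard
```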
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited for searches on indexes that are constantly changing, and if you mainly search by IDs, then a search engine is not needed at all.
For our project we basically implemented all the index building, load balancing and query engine ourselves and use Solr mostly as storage. But we started using Solr back when its sharding was flaky and not performant; I am not sure what the state of it is today.
One last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache (say memcached or Redis) to store all the users that are currently online, and at request time simply iterate over all of them and filter according to the criteria. The filtering by criteria can be cached independently and updated incrementally, and iterating over 5000 records is not necessarily very time consuming if the matching logic is simple.
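For example, a minimal version of that cache-based approach with Redis (the key name online_users, connection defaults, and the criteria function are placeholders):

```python
import redis

r = redis.Redis()   # placeholder connection; configure host/port as needed

def login(user_id: str) -> None:
    """Record the user as online when they log in."""
    r.sadd("online_users", user_id)

def logout(user_id: str) -> None:
    """Remove the user from the online set on logout."""
    r.srem("online_users", user_id)

def online_users_matching(criteria) -> list:
    """Iterate the (few thousand) online ids and apply the matching
    criteria in application code instead of sending them all to Solr."""
    return [uid.decode() for uid in r.smembers("online_users")
            if criteria(uid.decode())]

# Hypothetical usage: find online users whose id carries a "premium-" prefix.
matches = online_users_matching(lambda uid: uid.startswith("premium-"))
```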
Any robust solution will involve bringing your data close to Solr (in batch) and using it internally, NOT running a very large request at search time, which is a low-latency path.
You should develop your own filter; the filter would cache the online-users data once in a while (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.
You can find a good example of filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
One solution could be to use a join in Solr, but the online data changes regularly and I can't re-index that often (every 5-10 minutes is out; it should be at least an hour apart).
I think you could very well use Solr joins, but with a little bit of improvisation.
The solution I propose is as follows:
You can have 2 indexes (Solr cores):
1. Primary index (the one you have now)
2. Secondary index with only two fields, "ID" and "IS_ONLINE"
You could then update the secondary index frequently (on the order of seconds) and keep it in sync with the table you use for storing online users.
NOTE: this secondary index, even if updated frequently, would not degrade performance, provided we make the necessary tweaks, such as using appropriate queries during delta-import.
You could then perform a Solr join on the ID field across these two indexes to achieve what you want. Here is the link on how to perform Solr joins between indexes/Solr cores.
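A sketch of what that cross-core join query could look like over Solr's HTTP API (the host, the core names primary and online, and the primary core's id field are assumptions; the ID and IS_ONLINE fields come from the proposal above):

```python
import requests

SOLR = "http://localhost:8983/solr"   # placeholder Solr base URL

# Join from the small, frequently updated "online" core onto the primary
# core: keep only primary documents whose id appears with IS_ONLINE:true.
params = {
    "q": "some criteria",
    "fq": "{!join from=ID to=id fromIndex=online}IS_ONLINE:true",
}
results = requests.get(f"{SOLR}/primary/select", params=params).json()
```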

Preventing duplicates with MapReduce to BigQuery pipeline

I was reading the answer by Michael to this post here, which suggests using a pipeline to move data from datastore to cloud storage to big query.
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a bigquery table. That means I have to have some way of knowing if the entities have been processed, so they don't get repeatedly submitted to bigquery during mapreduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options. I can put a flag on the entities, update it when each entity is processed, and filter on it in subsequent runs - or - I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for opinions or see if there are any gotchas.
Assuming you have some stream of activity represented as entities, you can use query cursors to start up one query where a prior one left off. Query cursors are perfect for the type of incremental situation that you've described, because they avoid the overhead for marking entities as having been processed.
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
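For illustration, a hedged sketch of the cursor approach using the old App Engine NDB Python API (the Activity model, its fields, and the batch size are invented for this example; how the cursor is persisted between runs is left open):

```python
from google.appengine.ext import ndb

class Activity(ndb.Model):
    """Stand-in for the entities being exported to BigQuery."""
    created = ndb.DateTimeProperty(auto_now_add=True)
    payload = ndb.JsonProperty()

def fetch_batch(cursor_urlsafe=None, batch_size=500):
    """Fetch the next batch of entities, resuming from a saved cursor."""
    cursor = ndb.Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    query = Activity.query().order(Activity.created)
    entities, next_cursor, more = query.fetch_page(
        batch_size, start_cursor=cursor)
    # Persist next_cursor.urlsafe() somewhere durable so the next export run
    # picks up exactly where this one stopped, instead of flagging each
    # entity as processed.
    next_token = next_cursor.urlsafe() if (more and next_cursor) else None
    return entities, next_token
```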
