I'm writing an application that syncs users and groups from Active Directory.
Specifically, I need to track their IDs, DNs, and group memberships, and save them to a local database.
I'm worried about the member attribute, as it can have millions of values.
Production environments have been reported to exceed 4 million members, and Microsoft's scalability testing has reached 500 million members.
How can I track changes to such gigantic multi-valued attributes?
I'm using LDAP with the UnboundID SDK.
Is it possible to query an attribute's value count?
Is it possible to know whether a multi-valued attribute has been updated, without reading it?
How can I get incremental updates, similar to DirSync, but with the USNChanged approach?
Here is what I know:
As mentioned in the Microsoft docs, there are three ways to do synchronization:
USNChanged -- the most compatible way.
DirSync -- requires near-admin privileges and can only sync a whole domain (partition); syncing an arbitrary subtree is not possible. Returns only updated attributes, and incremental updates for multi-valued attributes are possible.
Change Notifications -- an async search request; the scope can be BASE or ONE_LEVEL, with up to 5 searches per connection. Each change sends the whole object.
I'm implementing USNChanged, since that's the recommended approach.
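As a rough illustration of the polling loop, here is a minimal sketch with the UnboundID SDK. The filter, attribute list, and the pollChanges/lastCommittedUsn names are my own illustrative choices; connection setup and watermark persistence are assumed to exist elsewhere, and a real implementation would also page the results and read highestCommittedUSN from the rootDSE.

    import com.unboundid.ldap.sdk.*;

    public class UsnChangedPoll {
        // One incremental pass: fetch entries whose uSNChanged is above the
        // last persisted watermark and return the new watermark.
        public static long pollChanges(LDAPConnection conn, String baseDN,
                                       long lastCommittedUsn) throws LDAPException {
            long highestSeen = lastCommittedUsn;
            String filter = "(&(|(objectClass=user)(objectClass=group))"
                    + "(uSNChanged>=" + (lastCommittedUsn + 1) + "))";
            SearchRequest req = new SearchRequest(baseDN, SearchScope.SUB, filter,
                    "objectGUID", "distinguishedName", "uSNChanged");
            SearchResult result = conn.search(req);
            for (SearchResultEntry entry : result.getSearchEntries()) {
                long usn = Long.parseLong(entry.getAttributeValue("uSNChanged"));
                highestSeen = Math.max(highestSeen, usn);
                // ... upsert the entry's GUID, DN, and membership into the local DB ...
            }
            return highestSeen; // persist this as the new sync watermark
        }
    }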
This is how to read an attribute with a lot of values.
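What follows is a minimal sketch of Active Directory range retrieval with the UnboundID SDK, assuming an existing connection; the 1500-value page size matches AD's default MaxValRange, and readAllValues is an illustrative name.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import com.unboundid.ldap.sdk.*;

    public class RangeRetrieval {
        // Read every value of a huge multi-valued attribute (e.g. "member")
        // in pages, using the attribute;range=low-high option.
        public static List<String> readAllValues(LDAPConnection conn, String dn,
                                                 String attrName) throws LDAPException {
            List<String> values = new ArrayList<>();
            final int pageSize = 1500;
            int start = 0;
            while (true) {
                String ranged = attrName + ";range=" + start + "-" + (start + pageSize - 1);
                SearchResultEntry entry =
                        conn.searchForEntry(dn, SearchScope.BASE, "(objectClass=*)", ranged);
                if (entry == null) {
                    break; // entry not found
                }
                boolean lastPage = true;
                for (Attribute attr : entry.getAttributes()) {
                    if (attr.getBaseName().equalsIgnoreCase(attrName)) {
                        Collections.addAll(values, attr.getValues());
                        // AD answers with e.g. member;range=0-1499 while more
                        // values remain, and member;range=3000-* (or the plain
                        // attribute name) on the final page.
                        lastPage = attr.getName().equals(attrName)
                                || attr.getName().endsWith("-*");
                    }
                }
                if (lastPage) {
                    break;
                }
                start += pageSize;
            }
            return values;
        }
    }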
Related
I am working on a Salesforce integration for a high-traffic app where we want to automate the process of importing records from Salesforce into our app. To be clear, I am not working on the Salesforce side (i.e. Apex), but rather using the Salesforce REST API from within the other app.
The first idea was to use a cutoff time based on when a record was created, advancing that time on each poll to the creation time of the newest record from the previous poll. We quickly realized this wouldn't work here: the query can contain other filters, for example on a status field in Salesforce, where a record should only be imported after a certain status is set. That makes checking the creation time (or anything like it) unreliable, since an older record could later become relevant to our auto-importing.
My next idea was to poll the Salesforce API for records every few hours. To avoid importing the same record twice, the only way I could think of was to keep track of the IDs we already attempted to import and use them in a NOT IN condition:
SELECT #{columns} FROM #{sobject_name}
WHERE Id NOT IN #{ids_we_already_imported} AND #{other_filters}
My big concern at this point was whether Salesforce limits the length of the WHERE clause. Through some research I see there are actually several limitations:
https://developer.salesforce.com/docs/atlas.en-us.salesforce_app_limits_cheatsheet.meta/salesforce_app_limits_cheatsheet/salesforce_app_limits_platform_soslsoql.htm
The next thing I considered was querying for all of the IDs in Salesforce that meet the conditions of the other filters, without checking the ID itself. We could then take that list of IDs, remove the ones we already track on our end, and end up with a smaller IN condition that fetches the full data for just the records we actually need.
This still doesn't seem completely reliable, though. I see a single query can only return 2000 rows and only supports an offset of up to 2000. If we have already imported 2000 records, the first query might not contain any rows we want to import, and we can't offset past them because of these limitations.
With these limitations, I can't figure out a reliable way to find the relevant records to import as the number of already-imported records grows. This feels like common usage of a Salesforce integration, but I can't find anything on it. How can I do this without running into issues once we reach a high volume?
Not sure what all of your requirements are or if the solution needs to be generic, but you could do a few things:
Flag records that have been imported. That means making a call back to Salesforce to update the records, but those calls can be bulkified to reduce their number; then modify your query to exclude the flag (a sketch of this option follows the list).
Reverse the flow so you push instead of pull: have Salesforce push records to your app whenever a record meets the criteria, using workflow rules and outbound messages.
Use the Streaming API to set up a PushTopic that your app can subscribe to, so it is notified whenever a record meets the criteria.
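Here is a minimal sketch of the first option over the REST API, assuming OAuth is handled elsewhere; Applicant__c, Imported__c, and Status__c are hypothetical object and field names.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class SalesforcePoll {
        // Fetch records that match the business filters and have not been
        // flagged as imported yet. Already-imported records never come back,
        // so no NOT IN list is needed.
        public static String fetchUnimported(String instanceUrl, String accessToken)
                throws Exception {
            String soql = "SELECT Id, Name FROM Applicant__c "
                    + "WHERE Imported__c = false AND Status__c = 'Approved'";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(instanceUrl + "/services/data/v57.0/query?q="
                            + URLEncoder.encode(soql, StandardCharsets.UTF_8)))
                    .header("Authorization", "Bearer " + accessToken)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The JSON body contains "records", "done", and, for large result
            // sets, a "nextRecordsUrl" for fetching the next batch of rows.
            return response.body();
        }
    }

After a successful import you would flip Imported__c back in bulk (e.g. via the composite or Bulk API) so the next poll skips those records.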
I'm searching for an efficient way to get the list of documents deleted in a Cloudant database.
Background: I have a Cloudant database containing 4 million records. The business logic also allows documents to be deleted. Data from this database is loaded daily into a SQL data warehouse, where the deletions need to be reflected as well.
A full reload is not an option, since it takes too long. Querying the _changes stream also does not seem to scale well when the Cloudant database contains this many documents.
I would use the _changes feed and apply a server-side filter function (http://guide.couchdb.org/draft/notifications.html) to eliminate all documents that don't have the _deleted property set. Your change-feed listener would then only be notified when a DELETE operation is reported, and network traffic is kept to a minimum.
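A minimal sketch of such a filter, stored in a design document (the changes/deleted_only names are hypothetical):

    {
      "_id": "_design/changes",
      "filters": {
        "deleted_only": "function (doc, req) { return doc._deleted === true; }"
      }
    }

The daily warehouse load would then poll something like GET /mydb/_changes?filter=changes/deleted_only&since=<last_seq>, persisting the last_seq it receives so each run resumes where the previous one stopped.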
As described in the title, can Solr provide search service while reindexing?
If not, is there a solution that covers such a scenario?
Until you commit (or auto-commit), no changes submitted to Solr are visible to clients, so you can continue providing the search service. After a commit, the searcher is reopened and clients will see the new content.
If the changes are significant, you may consider rebuilding the index in a separate collection from scratch (on the same or a different server) and then, once it is done, swapping the cores (standalone) or changing the alias (SolrCloud). Either approach keeps the same name but points it at your new collection.
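For example (the core and collection names here are hypothetical), the standalone swap uses the CoreAdmin API and the SolrCloud variant uses the Collections API:

    http://localhost:8983/solr/admin/cores?action=SWAP&core=products&other=products_rebuild
    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2

CREATEALIAS atomically repoints an existing alias, so clients querying 'products' never see the half-built index.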
I am implementing a license key system on Google App Engine. Keys are generated ahead of time and emailed to users. Then they log into the system and enter the key to activate a product.
I could have potentially several hundred people submitting their keys for validation at the same time. I need the transactions to be strongly consistent so that the same license key cannot be used more than once.
Option 1: Use the datastore
To use the datastore, I need it to be strongly consistent, so I would put the license keys in a single entity group. However, an entity group is limited to roughly 1 write per second. App Engine requests must complete within 60 seconds, so this would mean either notifying users offline once their key was activated, or having them poll in a loop until their key was accepted.
Option 2: Use Google Cloud SQL
Even the smallest tier of Google Cloud SQL can handle 250 concurrent connections. I don't expect these queries to take very long, so this seems like it would be a lot faster and would handle hundreds or thousands of simultaneous license key requests without any issues.
The downside to Google Cloud SQL is that it is limited to 500 GB per instance. If I run out of space, I'll have to create a new database instance and then query both for the submitted license key. I think it will be a long time before I use up that 500 GB, and it looks like you can even increase the size by contacting Google.
It seems like Option 2 is the way to go, but I'm wondering what others think. Do you find entity group performance for transactions acceptable?
Option 2 seems more feasible, neat, and clean in your case, but you have to manage database connections yourself, and that becomes a hassle under increasing load if connection pooling isn't used properly.
The datastore can also be used for a license key system by defining multiple entity groups with dummy ancestors based on a few leading or trailing digits of the key, to work around the 1 write/second limit per entity group. This way you can also easily determine the entity group of any generated or submitted license key.
For example, if 4321 G42T 531P 8922 is a license key, then 4321 can be used as the entity group, and all keys starting with 4321 will be part of that entity group. This is a sharding-like mechanism that avoids simultaneous writes to a single entity group (see the sketch below).
If you need to query on columns other than the license key, a separate mapping table can be maintained without an entity group.
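A minimal sketch of that idea with the App Engine datastore API; the KeyShard/LicenseKey kind names and the activated property are illustrative, and the key entities are assumed to have been created when the keys were generated.

    import com.google.appengine.api.datastore.*;

    public class LicenseKeys {
        // Atomically mark a key as used. The first key group (e.g. "4321")
        // becomes a dummy ancestor, so keys sharing a prefix form one entity
        // group and activation is a strongly consistent transaction.
        public static boolean activate(String licenseKey) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
            Key parent = KeyFactory.createKey("KeyShard", licenseKey.substring(0, 4));
            Key keyId = KeyFactory.createKey(parent, "LicenseKey", licenseKey);
            Transaction txn = ds.beginTransaction();
            try {
                Entity e = ds.get(txn, keyId);
                if (Boolean.TRUE.equals(e.getProperty("activated"))) {
                    return false;                   // key already used
                }
                e.setProperty("activated", true);
                ds.put(txn, e);
                txn.commit();
                return true;
            } catch (EntityNotFoundException unknownKey) {
                return false;                       // no such key was generated
            } finally {
                if (txn.isActive()) {
                    txn.rollback();
                }
            }
        }
    }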
You can mix them: use Google Cloud SQL to hold only the keys and emails. With 500 GB, I believe you could store keys for every person on the planet.
On the other hand, you can ask Google to increase the data size limit.
I would go with Option 1, the datastore; it's much faster and more scalable.
And I don't know why you would need to create a shared entity group: you could make the license key itself the entity's Key, so that each entity is in its own entity group. Only this will keep things scalable.
I'm trying to find a way to access a centralized database for both retrieval and update.
The following is what I'm looking for:
Server 1 holds this variable, for example:
int counter;
Server 2 will be interacting with the user and will increase the counter whenever the user uses the service, until a certain threshold is reached. When that threshold is reached, server 2 will start rejecting the user's access.
Also, the user will be able to use multiple servers (like server 2) from multiple locations, and each time the user accesses any server, the counter will be increased.
I tried Google, but it's hard to search for something you don't have a name for.
One approach to designing this is to shard by user, i.e. split the users between your servers based on the user's ID. For instance, if you have 10 servers, users whose IDs end in 2 would have all of their data stored on server 2, and so on (see the sketch after the next paragraph). This assumes that user IDs are distributed uniformly.
Another approach is to shard the users by location, if you have servers in Asia vs. Europe, for example. You'd need a property in the User record that tells you where the user is located; based on that, you'll know which server to route them to.
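A minimal sketch of the ID-based routing, assuming numeric user IDs and a fixed server list (both assumptions for illustration):

    public class ShardRouter {
        private final String[] servers;   // e.g. 10 database hosts

        public ShardRouter(String[] servers) {
            this.servers = servers;
        }

        // With 10 servers this is simply the last digit of a non-negative ID.
        public String serverFor(long userId) {
            return servers[(int) (userId % servers.length)];
        }
    }

The counter for a given user then lives only on serverFor(userId), so increments can be made atomic on that one server without cross-server coordination.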
Ultimately, all of these design options come down to one question: where does the master record for a user reside? Each approach attempts to answer it definitively.
A different category of approaches involves multi-master replication, which some database vendors support. It does not scale as well (it's hard to get it to scale to 20 servers, for instance), but you might want to look into it too.