How can I poll the Salesforce API to find records that meet criteria and have not been seen by my app before? - salesforce

I am working on a Salesforce integration for a high-traffic app where we want to automate the process of importing records from Salesforce into our app. To be clear, I am not working from the Salesforce side (i.e. Apex), but rather using the Salesforce REST API from within the other app.
The first idea was to use a cutoff time based on when records were created, increasing that cutoff on each poll to the creation time of the newest record from the previous poll. I quickly realized this wouldn't work. The query can contain other filters, such as a status field in Salesforce where the record should only be imported after a certain status is set. That makes any check based on creation time unreliable, since an older record could later become relevant to our auto-importing.
My next idea was to poll the Salesforce API for matching records every few hours. To avoid importing the same record twice, the only way I could think of was to keep track of the IDs we already attempted to import and use them in a NOT IN condition:
SELECT #{columns} FROM #{sobject_name}
WHERE Id NOT IN #{ids_we_already_imported} AND #{other_filters}
My big concern at this point was whether Salesforce limits the length of the WHERE clause. Through some research I see there are actually several limitations:
https://developer.salesforce.com/docs/atlas.en-us.salesforce_app_limits_cheatsheet.meta/salesforce_app_limits_cheatsheet/salesforce_app_limits_platform_soslsoql.htm
The next thing I considered was querying for all of the IDs in Salesforce that meet the other filters, without checking the ID itself. We could then take that list, remove the IDs we already track on our end, and end up with a smaller IN condition that fetches the full data for only the records we actually need.
This still doesn't seem completely reliable, though. I see a single query can only return 2000 rows, and OFFSET only goes up to 2000. If we have already imported 2000 records, the first query might not contain any rows we'd want to import, and we can't offset past them because of these limitations.
With these limitations I can't figure out a reliable way to find the relevant records to import as the number of records we already imported grows. I feel like this would be common usage of a Salesforce integration, but I can't find anything on this. How can I do this without having to worry about issues when we reach a high volume?

Not sure what all of your requirements are or if the solution needs to be generic, but you could do a few things:
Flag records that have been imported. That means making a call back to Salesforce to update the records, but the calls can be bulkified to reduce their number, and you can modify your query to exclude the flag.
Reverse the direction so you push instead of pull: have Salesforce push records to your app whenever a record meets the criteria, using workflow rules and outbound messages.
Use the Streaming API to set up a PushTopic that your app can subscribe to, so it gets notified whenever a record meets the criteria.
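A minimal sketch of the first option in Python. The Contact object, the Imported__c checkbox field, the Status__c filter, and the helper name build_poll_query are all placeholders for whatever your org actually uses:

```python
# Sketch of the flag-field approach. The Contact object, the Imported__c
# checkbox field, and the Status__c filter are placeholders -- substitute
# whatever your org actually uses.
def build_poll_query(sobject, columns, filters, flag_field="Imported__c"):
    """Build a SOQL query that skips records already flagged as imported."""
    where = [f"{flag_field} = false"] + list(filters)
    return f"SELECT {', '.join(columns)} FROM {sobject} WHERE {' AND '.join(where)}"

soql = build_poll_query("Contact", ["Id", "FirstName"], ["Status__c = 'Ready'"])

# With the simple-salesforce library, each poll would then look roughly like:
#   sf = Salesforce(username=..., password=..., security_token=...)
#   records = sf.query_all(soql)["records"]
#   ...import the records into your app...
#   sf.bulk.Contact.update([{"Id": r["Id"], "Imported__c": True} for r in records])
```

Because the flag lives in Salesforce, the NOT IN list never grows on your side, and the bulk update keeps the number of API calls down.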


Does QuickBooks have any kind of audit log?

QuickBooks allows users to change posted periods. How can I tell if a user does this?
I actually don't need an audit log, just the ability to see recently added/edited data whose transaction date is more than a month in the past.
In a meeting today it was suggested that we may need to refresh data for all our users going back as far as a year on a regular basis. This would be pretty time consuming, and I think unnecessary when the majority of the data isn't changing. But I need to find out how I can see whether data (such as an expense) has been added to a prior period, so I know when to pull it again.
Is there a way to query for data (in any object or report) based not on the date of the transaction, but based on the date it was entered/edited?
I'm asking this in regard to using the QBO API, but if you know how to find this information from the web portal, that may also be helpful.
QuickBooks has a ChangeDataCapture endpoint, which exists for exactly the purpose you are describing. It's documented here:
https://developer.intuit.com/app/developer/qbo/docs/api/accounting/all-entities/changedatacapture
The TLDR summary is this:
The change data capture (cdc) operation returns a list of objects that have changed since a specified time.
In other words: you can continually query this endpoint, and you'll only get back the data that has actually changed since the last time you hit it.
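Assuming Python and an already-obtained OAuth2 access token, the polling call could be sketched like this; the realm id and entity list are placeholders:

```python
# Sketch of a CDC poll. The realm id and the entity list are illustrative;
# the cdc_url helper is ours, not part of any SDK.
from urllib.parse import urlencode

def cdc_url(realm_id, entities, changed_since,
            base="https://quickbooks.api.intuit.com"):
    """Build the Change Data Capture URL for the given entities and cutoff."""
    qs = urlencode({"entities": ",".join(entities), "changedSince": changed_since})
    return f"{base}/v3/company/{realm_id}/cdc?{qs}"

url = cdc_url("1234567890", ["Invoice", "Customer"], "2015-10-04T10:30:00-07:00")

# GET this URL with an "Authorization: Bearer <token>" header and
# "Accept: application/json"; the response groups the changed (and deleted)
# objects by entity type.
```

Store the timestamp of each poll and pass it as changedSince on the next one, so each call returns only what changed in between.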

Handling large volume of data using Web API

We have a long-running DB query that populates a temporary table (we are not supposed to change this behavior), which results in 6 to 10 million records, around 4 to 6 GB of data.
I need to use a .NET Web API, hosted on IIS, to fetch the data from the SQL DB. When a request comes from the client, the query runs for at least 5 minutes (depending on the amount of data in the joined tables) and populates the temp table. The API then has to read the data from the temp table and send it to the client.
Without blocking the client, without losing the DB temp table, and without blocking IIS, how can we achieve this requirement?
Just thinking: if I use an async API, will I be able to achieve this?
There are things you need to consider and things you can do.
If you kick off the query execution as the result of an API call, what happens if you get 10 calls to that endpoint at the same time? A dead API, that's what's going to happen.
You might be able to find a different trigger for the query, so you run it once per day, for example, or once every 4 hours, and then store the result in a permanent table. The API's job then is only to read from this table and return some data, without waiting for anything.
The second thing you can do is return only the data you need for the screen you are displaying. You are not going to show 4-6 GB worth of data in one go; I suspect you have some pagination there, and you can rejig the code a little to return one page of data at a time.
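The app in question is .NET, but the paging idea is language-agnostic; here it is sketched against an in-memory SQLite table (table and column names are made up). Keyset paging ("WHERE id > last_seen ORDER BY id LIMIT n") stays fast even deep into a large result set, unlike a growing OFFSET:

```python
# Keyset-pagination sketch: each request carries the last id the client saw,
# and the query seeks past it via the primary-key index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO report (payload) VALUES (?)",
                 [(f"row-{i}",) for i in range(1, 10001)])

def fetch_page(conn, last_seen_id, page_size=100):
    """Return one page of rows with id greater than the last id already sent."""
    cur = conn.execute(
        "SELECT id, payload FROM report WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size))
    return cur.fetchall()

first_page = fetch_page(conn, 0)                    # rows 1..100
second_page = fetch_page(conn, first_page[-1][0])   # rows 101..200
```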
You don't say what kind of data you have, but if it is something that doesn't require running that query very often, then you can definitely make some improvements.
---- edited after clarification that this is a report ----
OK, since it's a report, here's another idea.
The aim is to keep the pressure off the API itself, which needs to stay responsive and quick. Let the API receive the request with the needed parameters, then offload the actual report generation to another service.
Keep track of what this service is doing so you can report on the status of the job: has it started, is it finished, whatever else you need. You can use a queue for that, or simply track jobs in the database.
Generate the report file and store it somewhere.
Email the user with the file attached, or email a link so the user can download it. Another option is to provide a link to the report somewhere in the UI.
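The flow above, sketched in Python; an in-memory dict stands in for the jobs table or queue, and the function names (submit_report, run_worker) are illustrative, not from any framework:

```python
# Minimal job-tracking sketch: the API only records the request and returns
# immediately; a separate worker does the slow work and updates the status
# row the client can poll.
import uuid

jobs = {}  # in production: a database table or a queue with a status store

def submit_report(params):
    """API handler: enqueue the job and respond at once with its id."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"params": params, "status": "queued", "file": None}
    return job_id

def run_worker(job_id):
    """Background service: generate the report and record the result."""
    job = jobs[job_id]
    job["status"] = "running"
    job["file"] = f"/reports/{job_id}.csv"   # generate + store the file here
    job["status"] = "done"                   # then email a download link

job_id = submit_report({"from": "2020-01-01", "to": "2020-12-31"})
# client polls jobs[job_id]["status"] -> "queued", later "running", then "done"
run_worker(job_id)
```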

Google App Engine - Event when queue finishes

I'm starting to build a bulk upload tool and I'm trying to work out how to accomplish one of the requirements.
The idea is that a user will upload a CSV file and the tool will parse it and send each row of the CSV to the task queue as a task to be run. Then once all the tasks (relating to that specific CSV file) are completed, a summary report will be sent to the user.
I'm using Google App Engine, and in the past I've used the standard Task Queue to handle tasks. However, with the standard Task Queue there is no way to know when the queue has finished; no event fires to trigger the report generation, so I'm not sure how to achieve this.
I've looked into it more and I understand Google also offers Pub/Sub. This is more sophisticated and seems better suited, but I still can't find out how to trigger an event when a Pub/Sub queue is finished. Any ideas?
It seems you could use a counter for this. Create an entity with an integer property set to the number of lines in the CSV file. Each task decrements the counter in a transaction when it finishes processing its row. The task that brings the counter to 0 can trigger the event. This might cause too much contention, though.
Another possibility is to have each task create an entity of a specific kind when it finishes processing a row. You can then count those entities to determine when all the rows have been processed.
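The shape of the counter idea, sketched in plain Python. On App Engine the counter would live in a datastore entity and each decrement would run in a datastore transaction; a threading.Lock stands in for that here, and RowCounter is an illustrative name:

```python
# Completion-counter sketch: each task reports its row done, and the task
# that brings the counter to zero fires the completion event exactly once.
import threading

class RowCounter:
    def __init__(self, total_rows, on_complete):
        self.remaining = total_rows
        self.on_complete = on_complete
        self._lock = threading.Lock()   # a datastore transaction on GAE

    def row_done(self):
        """Called by each task when its row is processed."""
        with self._lock:
            self.remaining -= 1
            if self.remaining == 0:
                self.on_complete()

done = []
counter = RowCounter(3, lambda: done.append("send summary report"))
for _ in range(3):
    counter.row_done()
# the completion callback ran exactly once
```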
It might be easier to use the GAE Pipeline API, which takes care of this as a basic part of its functionality.
There's a nice article explaining it a bit here.
And a related SO question which happens to mention the same reason for moving to this API and has an excellent answer: Google AppEngine Pipelines API
I haven't used it myself yet, but it's just a matter of time :)
It's also possible to implement a scheme to track the related tasks still being active, see Figure out group of tasks completion time using TaskQueue and Datastore.
You can also check the queue (approximate) status, see Get number of tasks in a named queue?
I faced a similar problem earlier this week and found a nice workaround. I created an extra column in the table the tasks insert data into. Once a specific task completes, it updates this task_status column to 'done'; otherwise it stays at the default NULL. Then when the user refreshes the page, goes to a specific URL, or you make an AJAX call to query the task status for a specific id in your table, you can see whether it is complete:
select * from table where task_status is not null and id = ?;
You can also create a separate tasks table to store the relevant columns, instead of modifying existing tables.
Hope this is of some use.

About Youtube views count

I'm implementing an app that keeps track of how many times a post is viewed, but I'd like a 'smart' way of keeping track. This means I don't want to increase the view counter just because a user refreshes his browser.
So I decided to only increase the view counter if the IP and user agent (browser) combination is unique, which is working so far.
But then I thought: if YouTube is doing it this way, with videos that have thousands or even millions of views, their views table in the database would be overly populated with IPs and user agents...
Which brings me to the assumption that their video table has a counter cache for views (i.e. views_count). This means, when a user clicks on a video, the IP and user agent are stored, plus the counter cache column in the video table is incremented.
Otherwise, every time a video is clicked, YouTube would need to query the views table and count the number of entries. Won't this affect performance drastically?
Is this how they do it? Or is there a better way?
I would leverage client-side browser fingerprinting to uniquely identify viewers. This library seems to be getting significant traction:
https://github.com/Valve/fingerprintJS
I would also recommend using Redis for anything to do with counts. Its atomic increment commands are easy to use and guarantee your counts never get corrupted by race conditions.
This would be the command you would want to use for incrementing your counters:
http://redis.io/commands/incr
The key in this case would be the browser fingerprint hash sent to you from the client. You could then have a Redis "set" that would contain a list of all browser fingerprints known to be associated with a given user_id (the key for the set would be the user_id).
Finally, if you really need to, you can run a cron job or other async process that dumps the view counts for each user into the counter cache field in your relational database.
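A sketch of that Redis logic in Python. FakeRedis below is a tiny in-memory stand-in so the snippet is self-contained; a real redis-py client (redis.Redis()) exposes the same incr and sadd commands, executed atomically on the server:

```python
# Count a view only once per (video, fingerprint) pair. SADD returns 1 only
# on first insertion, so INCR runs at most once per pair -- no race-prone
# read-then-write.
class FakeRedis:
    """Minimal in-memory stand-in for the two Redis commands used here."""
    def __init__(self):
        self.counters, self.sets = {}, {}

    def incr(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

    def sadd(self, key, member):
        s = self.sets.setdefault(key, set())
        added = member not in s
        s.add(member)
        return int(added)

def record_view(r, video_id, fingerprint):
    """Increment the view counter only if this fingerprint is new."""
    if r.sadd(f"video:{video_id}:viewers", fingerprint):
        r.incr(f"video:{video_id}:views")

r = FakeRedis()
record_view(r, 42, "fp-abc")
record_view(r, 42, "fp-abc")  # a refresh: same fingerprint, not counted again
record_view(r, 42, "fp-xyz")
# the counter behind "video:42:views" is now 2
```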
You could also take the approach where you store user_id, browser fingerprint, and timestamps in a relational database (mysql?) and counter cache them into your user table periodically (probably via cron).
First of all, AFAIK YouTube uses BigTable, so don't worry about querying the count; we don't know the exact structure of their database anyway.
Assuming you are on a relational model, create a view_count column, but do not update it on every refresh. Record the visits and update the cached count periodically.
Also, you can generate a hash from the IP, browser, date, and any other information you use to detect a unique view, rather than storing the whole data.
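For example, hashing the identifying fields down to a fixed-size key; the field choice and the view_key name here are illustrative:

```python
# Keep only a fixed-size hash of the fields that define a "unique view",
# instead of storing raw IP + user agent strings.
import hashlib

def view_key(ip, user_agent, date):
    """Derive a compact, fixed-length key for uniqueness checks."""
    raw = f"{ip}|{user_agent}|{date}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

key = view_key("203.0.113.7", "Mozilla/5.0 ...", "2014-05-01")
# always 64 hex characters, however long the user agent string is
```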
Also, you can use a session/cookie to record that the video has been viewed. Since it expires, it won't be much of a memory problem; I don't believe anyone views thousands of videos in one session.
If you want to store all the IPs and browsers, then make sure you have enough DB storage space, add an index, and that's it.
If not, then you can use the Rails session to store the list of videos a user has visited, and only increment a video's view_count attribute when they visit a new video.

timeout workarounds in app engine

I am building a demo for a banking application in App Engine.
I have a Users table and Stocks table.
In order for me to be able to list the "Top Earners" in the application, I save a "Total Amount" field in each User's entry so I will later be able to SELECT it with ORDER BY.
I am running a cron job that runs over the Stocks table and updates each user's "Total Amount" in the Users table. The problem is that I often get TIMEOUTS, since the Stocks table is pretty big.
Is there any way to overcome the time-limit restriction in App Engine, or any workaround for this kind of update (where you MUST select many entries from a table, which results in a timeout)?
Joel
The usual way is to split the job into smaller tasks using the task queue.
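The shape of that pattern, sketched in plain Python: on App Engine each call to process_chunk would be a separate task queue task that re-enqueues itself with a datastore query cursor; here a list index stands in for the cursor and a plain list for the queue, and the data is made up:

```python
# Chunked-summarization sketch: each "task" processes one slice of the
# Stocks table (sized to finish well inside the request deadline) and then
# enqueues the next slice.
stocks = [{"user": f"u{i % 100}", "value": i} for i in range(10_000)]
totals = {}
CHUNK = 1000

def process_chunk(cursor):
    """Process one slice of the Stocks table, then enqueue the next slice."""
    for row in stocks[cursor:cursor + CHUNK]:
        totals[row["user"]] = totals.get(row["user"], 0) + row["value"]
    next_cursor = cursor + CHUNK
    if next_cursor < len(stocks):
        queue.append(next_cursor)   # taskqueue.add(...) in real GAE code

queue = [0]
while queue:                        # App Engine's task queue does this for you
    process_chunk(queue.pop())

# totals now holds every user's "Total Amount" without any single long request
```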
You have several options, all will involve some form of background processing.
One choice would be to use your cron job to kick off a task that starts as many tasks as needed to summarize your data. Another would be to use one of Brett Slatkin's patterns and keep the data updated in (nearly) real time. Check out his high-throughput data pipelines talk for details:
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
You could also check out the Mapper API (App Engine MapReduce) and see if it can do what you need:
http://code.google.com/p/appengine-mapreduce/
