Not able to retrieve all data from datastore - google-app-engine

My datastore has one kind, EFlow, with 7000 entities. The first 1000 entities have these fields:
(ID/Name, appliedBy, approved, childEflowName, completed, completedApprovers, created_on, dueDate, eflowDispName, eflowName, isResubmitted, modified_on, nextApprover, parentEflowName, ruleEmailReceivers, ruleNames, upComingApprovers, workFlowName, workFlowVersion, approvalStateValues)
and the remaining 6000 entities have these fields:
(ID/Name, appliedBy, approvalStateValues, approved, childEflowName, completed, completedApprovers, created_on, draft, dueDate, dynamicApprovalStates, eflowApprovers, eflowDispName, eflowName, fieldValues, isResubmitted, modified_on, nextApprover, parentEflowName, ruleEmailReceivers, ruleNames, upComingApprovers, workFlowName, workFlowVersion)
I have added the new fields draft, dynamicApprovalStates, eflowApprovers, and fieldValues.
My problem is that when I retrieve data from the datastore, I get only the first 1000 entities.
How can I retrieve all records?
My query is:
List<EFlow> lst = this.entityManager.createQuery("select from " + this.clazz.getName() + " i where i.completed = false and i.approved = false").getResultList();

First, it looks like you are using JPA. From our docs:
Warning: We think most developers will have a better experience using
the low-level Datastore API, or one of the open-source APIs developed
specifically for Datastore, such as Objectify. JPA was designed for
use with traditional relational databases, and so has no way to
explicitly represent some of the aspects of Datastore that make it
different from relational databases, such as entity groups and
ancestor queries. This can lead to subtle issues that are difficult to
understand and fix.
However, if you need to keep using JPA:
As the number of results can be large, you need to handle pagination with your query.
The best way to achieve this is with cursors.
import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.datanucleus.query.JPACursorHelper;
...
Query query = this.entityManager.createQuery(
        "select from " + this.clazz.getName()
        + " i where i.completed = false and i.approved = false");
query.setMaxResults(1000); // fetch one page per iteration
Cursor cursor = null;
List<EFlow> lst;
do {
    if (cursor != null) {
        // Resume where the previous page left off
        query.setHint(JPACursorHelper.CURSOR_HINT, cursor);
    }
    lst = query.getResultList();
    // ... Do stuff on lst here ... //
    // Get the cursor so you can see if there are more results
    cursor = JPACursorHelper.getCursor(lst);
} while (cursor != null && !lst.isEmpty());
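If moving off JPA is an option, the docs' advice above is easy to follow for this query. A minimal sketch with the low-level Datastore API (assuming the kind is named EFlow; mapping the entities back onto EFlow objects is left out):
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("EFlow").setFilter(CompositeFilterOperator.and(
        new FilterPredicate("completed", FilterOperator.EQUAL, false),
        new FilterPredicate("approved", FilterOperator.EQUAL, false)));
// asIterable() pages through results with cursors under the hood,
// so there is no 1000-entity ceiling to work around
for (Entity e : datastore.prepare(q).asIterable()) {
    // ... map each Entity onto an EFlow here ...
}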

Related

Searching on arrays using Datastore Queries

I'm interested in migrating from JDO queries to Datastore queries to make use of the AsyncDatastore API.
However, I'm unable to make the following query work in Datastore queries:
// JDO query (working correctly)
PersistenceManager pm = PMF.get().getPersistenceManager();
Query query = pm.newQuery("SELECT FROM "
        + Tasks.class.getName()
        + " WHERE archivado==false & arrayUsers=="
        + user.getId()
        + " & taskDate != null & taskDate > best_before_limit "
        + "PARAMETERS Date best_before_limit "
        + "import java.util.Date");
List<Tasks> results = (List<Tasks>) pm.newQuery(query).execute(new Date());
// Datastore query (returning zero entities)
AsyncDatastoreService datastore = DatastoreServiceFactory.getAsyncDatastoreService();
com.google.appengine.api.datastore.Query query = new com.google.appengine.api.datastore.Query("Tasks");
Filter userFilter = new FilterPredicate("arrayUsers", FilterOperator.EQUAL, user.getId());
Filter filterPendingTasks = new FilterPredicate("taskDate", FilterOperator.LESS_THAN_OR_EQUAL, new Date());
Filter completeFilter = CompositeFilterOperator.and(filterPendingTasks, userFilter);
query.setFilter(completeFilter);
List<Entity> results = datastore.prepare(query).asList(FetchOptions.Builder.withDefaults());
Apart from the fact that I have to build my Task objects out of the Entities resulting from the query, these should be the same.
The problem is that the query must look up if the passed user id (user.getId()) is present in the array (arrayUsers). JDO does this without any issues, but no joy with Datastore queries so far.
Any ideas about what is wrong with my code?
As the commenters pointed out, your datastore query filters on different properties than your JDO query does (the JDO version also checks archivado == false and taskDate != null, and compares taskDate with > rather than <=). If you have a composite query like this and you don't have EXACTLY the index for it, it won't work; a sketch of the required index follows below. Without seeing what indexes you have, I'd say the query looks good to me, so either you don't have data that matches it (unlikely, since your JDO query returns results), or you're missing a filter.
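For reference, the composite index such a query needs would look something like this in datastore-indexes.xml (property names taken from the query above; the dev server normally generates this file for you when you run the query locally):
<datastore-indexes autoGenerate="true">
    <datastore-index kind="Tasks" ancestor="false">
        <property name="arrayUsers" direction="asc"/>
        <property name="taskDate" direction="asc"/>
    </datastore-index>
</datastore-indexes>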
In general, when querying the datastore for one of a multi-valued property's values to equal something specific, you would indeed use something like this:
new Query("Widget").setFilter(new FilterPredicate("x", FilterOperator.EQUAL, 1))
Since you're using an equality filter, you won't get funky results (see the section "Properties with multiple values can behave in surprising ways" in the docs).
Currently, there is no way to do this using Datastore queries, due to the lack of a "CONTAINS" operator or similar.
The alternative is to keep using JDO, at least for this kind of query.

Does the NDB membership query ("IN" operation) performance degrade with lots of possible values?

The documentation for the IN query operation states that those queries are implemented as a big OR'ed equality query:
qry = Article.query(Article.tags.IN(['python', 'ruby', 'php']))
is equivalent to:
qry = Article.query(ndb.OR(Article.tags == 'python',
                           Article.tags == 'ruby',
                           Article.tags == 'php'))
I am currently modelling some entities for a GAE project and plan on using these membership queries with a lot of possible values:
qry = Player.query(Player.facebook_id.IN(list_of_facebook_ids))
where list_of_facebook_ids could have thousands of items.
Will this type of query perform well with thousands of possible values in the list? If not, what would be the recommended approach for modelling this?
This won't work with thousands of values (in fact I bet it starts degrading with more than 10 values). The only alternative I can think of is some form of precomputation. You'll have to change your schema.
One way you can do it is to create a new model called FacebookPlayer, which is an index keyed by facebook_id. You would update it whenever you add a new player. It looks something like this:
class FacebookPlayer(ndb.Model):
    player = ndb.KeyProperty(kind='Player', required=True)
Now you can avoid queries altogether. You can do this:
# Build keys from facebook ids.
facebook_id_keys = []
for facebook_id in list_of_facebook_ids:
    facebook_id_keys.append(ndb.Key('FacebookPlayer', facebook_id))

keysOfUsersMatchedByFacebookId = []
for facebook_player in ndb.get_multi(facebook_id_keys):
    if facebook_player:
        keysOfUsersMatchedByFacebookId.append(facebook_player.player)
usersMatchedByFacebookId = ndb.get_multi(keysOfUsersMatchedByFacebookId)
If list_of_facebook_ids is thousands of items, you should do this in batches, as sketched below.
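A minimal batching sketch (the 1000-key batch size is an assumption; tune it to your latency budget):
BATCH_SIZE = 1000  # assumed batch size

player_keys = []
for start in range(0, len(list_of_facebook_ids), BATCH_SIZE):
    batch = list_of_facebook_ids[start:start + BATCH_SIZE]
    keys = [ndb.Key('FacebookPlayer', fid) for fid in batch]
    # get_multi returns None for ids with no FacebookPlayer entity
    for facebook_player in ndb.get_multi(keys):
        if facebook_player:
            player_keys.append(facebook_player.player)
usersMatchedByFacebookId = ndb.get_multi(player_keys)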

Why is my django bulk database population so slow and frequently failing?

I decided I'd like to use django's model system rather than coding raw SQL to interface with my database, but I am having a problem that surely is avoidable.
My models.py contains:
class Student(models.Model):
    student_id = models.IntegerField(unique=True)
    form = models.CharField(max_length=10)
    preferred = models.CharField(max_length=70)
    surname = models.CharField(max_length=70)
and I'm populating it by looping through a list as follows:
from models import Student

for id, frm, pref, sname in large_list_of_data:
    s = Student(student_id=id, form=frm, preferred=pref, surname=sname)
    s.save()
I don't really want to be saving this to the database each time but I don't know another way to get django to not forget about it (I'd rather add all the rows and then do a single commit).
There are two problems with the code as it stands.
It's slow -- about 20 students get updated each second.
It doesn't even make it through large_list_of_data, instead throwing a DatabaseError saying "unable to open database file". (Possibly because I'm using sqlite3.)
My question is: How can I stop these two things from happening? I'm guessing that the root of both problems is that I've got the s.save() but I don't see a way of easily batching the students up and then saving them in one commit to the database.
So it seems I should have looked harder before posing the question.
Some solutions are described in this stackoverflow question (the winning answer is to use django.db.transaction.commit_manually) and also in this one on aggregating saves.
Other ideas for speeding up this type of operation are listed in this stackoverflow question.
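To make these concrete, here is a minimal sketch of both ideas, assuming a Django version that has bulk_create (commit_manually was the idiom when this was asked; transaction.atomic replaces it in newer Django):
from django.db import transaction
from models import Student

# Option 1: one transaction around the whole loop instead of a commit per save()
with transaction.atomic():
    for id, frm, pref, sname in large_list_of_data:
        Student(student_id=id, form=frm, preferred=pref, surname=sname).save()

# Option 2: a single batched INSERT, skipping per-row save() entirely
Student.objects.bulk_create([
    Student(student_id=id, form=frm, preferred=pref, surname=sname)
    for id, frm, pref, sname in large_list_of_data
])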

How to use MS Sync Framework to filter client-specific data?

Let's say I've got a SQL 2008 database table with lots of records associated with two different customers, Customer A and Customer B.
I would like to build a fat client application that fetches all of the records that are specific to either Customer A or Customer B based on the credentials of the requesting user, then stores the fetched records in a temporary local table.
Thinking I might use the MS Sync Framework to accomplish this, I started reading about row filtering when I came across this little chestnut:
Do not rely on filtering for security. The ability to filter data from the server based on a client or user ID is not a security feature. In other words, this approach cannot be used to prevent one client from reading data that belongs to another client. This type of filtering is useful only for partitioning data and reducing the amount of data that is brought down to the client database.
So, is this telling me that the MS Sync Framework is only a good option when you want to replicate an entire table between point A and point B?
Doesn't that seem to be an extremely limiting characteristic of the framework? Or am I just interpreting this statement incorrectly? Or is there some other way to use the framework to achieve my purposes?
Ideas anyone?
Thanks!
No, it is only a security warning.
We use filtering extensively in our semi-connected app.
Here is some code to get you started:
// helper
void PrepareFilter(string tablename, string filter)
{
    SyncAdapters.Remove(tablename);
    var ab = new SqlSyncAdapterBuilder(this.Connection as SqlConnection);
    ab.TableName = "dbo." + tablename;
    ab.ChangeTrackingType = ChangeTrackingType.SqlServerChangeTracking;
    ab.FilterClause = filter;
    var cpar = new SqlParameter("@filterid", SqlDbType.UniqueIdentifier);
    cpar.IsNullable = true;
    cpar.Value = DBNull.Value;
    ab.FilterParameters.Add(cpar);
    var nsa = ab.ToSyncAdapter();
    nsa.TableName = tablename;
    SyncAdapters.Add(nsa);
}

// usage
void SetupFooBar()
{
    var tablename = "FooBar";
    var filter = "FooId IN (SELECT BarId FROM dbo.GetAllFooBars(@filterid))";
    PrepareFilter(tablename, filter);
}
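At sync time the client supplies the actual value for that parameter. A sketch, assuming the standard Sync Services SyncParameter mechanism and a hypothetical currentCustomerId variable:
// Pass the current customer's id into the @filterid filter parameter
var agent = new SyncAgent();
// ... assign agent.LocalProvider / agent.RemoteProvider here ...
agent.Configuration.SyncParameters.Add(
    new SyncParameter("@filterid", currentCustomerId));
agent.Synchronize();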

Hitting the 2100 parameter limit (SQL Server) when using Contains()

from f in CUSTOMERS
where depts.Contains(f.DEPT_ID)
select f.NAME
depts is a list (IEnumerable<int>) of department ids
This query works fine until you pass a large list (say around 3000 dept ids), then I get this error:
The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Too many parameters were provided in this RPC request. The maximum is 2100.
I changed my query to:
var dept_ids = string.Join(" ", depts.ToStringArray());
from f in CUSTOMERS
where dept_ids.IndexOf(Convert.ToString(f.DEPT_id)) != -1
select f.NAME
Using IndexOf() fixed the error but made the query slow. Is there any other way to solve this? Thanks so much.
My solution (Guids is a list of ids you would like to filter by):
List<MyTestEntity> result = new List<MyTestEntity>();
for (int i = 0; i < Math.Ceiling((double)Guids.Count / 2000); i++)
{
    var nextGuids = Guids.Skip(i * 2000).Take(2000);
    result.AddRange(db.Tests.Where(x => nextGuids.Contains(x.Id)));
}
this.DataContext = result;
Why not write the query in sql and attach your entity?
It's been a while since I worked in Linq, but here goes:
IQuery q = Session.CreateQuery(@"
    select *
    from customerTable f
    where f.DEPT_id in (" + string.Join(",", depts.ToStringArray()) + ")");
q.AttachEntity(CUSTOMER);
Of course, you will need to protect against injection, but that shouldn't be too hard.
You will want to check out the LINQKit project, since somewhere in there is a technique for batching up such statements to solve this issue. I believe the idea is to use the PredicateBuilder to break the local collection into smaller chunks, but I haven't reviewed the solution in detail because I've instead been looking for a more natural way to handle this.
Unfortunately, it appears from Microsoft's response to my suggestion to fix this behavior that there are no plans to address it in .NET Framework 4.0 or even subsequent service packs.
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=475984
UPDATE:
I've opened up some discussion regarding whether this was going to be fixed for LINQ to SQL or the ADO.NET Entity Framework on the MSDN forums. Please see these posts for more information regarding these topics and to see the temporary workaround that I've come up with using XML and a SQL UDF.
I had a similar problem and found two ways to fix it:
the Intersect method
a join on the IDs
To get values that are NOT in the list, I used the Except method or a left join.
Update
EntityFramework 6.2 runs the following query successfully:
var employeeIDs = Enumerable.Range(3, 5000);
var orders =
    from order in Orders
    where employeeIDs.Contains((int)order.EmployeeID)
    select order;
Your post was from a while ago, but perhaps someone will benefit from this. Entity Framework does a lot of query caching; every time you send in a different parameter count, a new entry gets added to the cache. Using a "Contains" call will cause SQL to generate a clause like "WHERE x IN (@p1, @p2, ... @pn)", and bloat the EF cache.
Recently I looked for a new way to handle this, and I found that you can create an entire table of data as a parameter. Here's how to do it:
First, you'll need to create a custom table type, so run this in SQL Server (in my case I called the custom type "TableId"):
CREATE TYPE [dbo].[TableId] AS TABLE(
    Id [int] PRIMARY KEY
)
Then, in C#, you can create a DataTable and load it into a structured parameter that matches the type. You can add as many data rows as you want:
DataTable dt = new DataTable();
dt.Columns.Add("id", typeof(int));
This is an arbitrary list of IDs to search on. You can make the list as large as you want:
dt.Rows.Add(24262);
dt.Rows.Add(24267);
dt.Rows.Add(24264);
Create an SqlParameter using the custom table type and your data table:
SqlParameter tableParameter = new SqlParameter("@id", SqlDbType.Structured);
tableParameter.TypeName = "dbo.TableId";
tableParameter.Value = dt;
Then you can call a bit of SQL from your context that joins your existing table to the values from your table parameter. This will give you all records that match your ID list:
var items = context.Dailies.FromSqlRaw<Dailies>("SELECT * FROM dbo.Dailies d INNER JOIN @id id ON d.Daily_ID = id.id", tableParameter).AsNoTracking().ToList();
You could always partition your list of depts into smaller sets before you pass them as parameters to the IN statement generated by Linq, as sketched below. See here:
Divide a large IEnumerable into smaller IEnumerable of a fix amount of item
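A minimal partitioning sketch along those lines (the InBatches helper and the 2000 batch size are illustrative; newer .NET ships an equivalent Enumerable.Chunk):
// Hypothetical helper: split the ids into fixed-size batches and query each one
static IEnumerable<List<int>> InBatches(IReadOnlyList<int> source, int size)
{
    for (int i = 0; i < source.Count; i += size)
        yield return source.Skip(i).Take(size).ToList();
}

var names = new List<string>();
foreach (var batch in InBatches(depts.ToList(), 2000)) // stays under the 2100-parameter cap
{
    names.AddRange(from f in CUSTOMERS
                   where batch.Contains(f.DEPT_ID)
                   select f.NAME);
}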
