is the number of colums limited for an Entity? - google-app-engine

I might have an Entity with possibly thousands of columns, and was wondering if it would pose any problem (nothing will be indexed):
Will queries be slower if the number of columns increases?
Can there be in theory an unlimited number of columns?

While I never had thousands of columns to know about the speed and performance, as it looks from the data viewer on the dashboard the number of columns should be unlimited:

Considering that the GAE Datastore is essentially a very large key-value store right down to property level, in principle an unlimited number of properties are allowed. Just not all together in one record for space reasons, as others already said.
Datastore is schemaless, but many libraries such as JDO, JPA and Objectify aim to "fix" this "deficiency" by introducing some schema of their own. That is unhelpful in your scenario.
I suggest you bypass those libraries and directly call the Datastore low-level API as per this example instead. You can avoid the overheads of indexing if you change the setProperty calls to setUnindexedProperty as often as possible. Remember to test for a null return from a getProperty call for a property that may be absent in some records.

Related

Google Cloud Datastore queries too slow when fetching all records

I am experiencing extremely slow performance of Google Cloud Datastore queries.
My entity structure is very simple:
calendarId, levelId, levelName, levelValue
And there are only about 1400 records and yet the query takes 500ms-1.2 sec to give back the data. Another query on a different entity also takes 300-400 ms just for 313 records.
I am wondering what might be causing such delay. Can anyone please give some pointers regarding how to debug this issue or what factors to inspect?
Thanks.
You are experiencing expected behavior. You shouldn't need to get that many entities when presenting a page to user. Gmail doesn't show you 1000 emails, it shows you 25-100 based on your settings. You should fetch a smaller number (e.g., the first 100) and implement some kind of paging to allow users to see other entities.
If this is backend processing, then you will simply need that much time to process entities, and you'll need to take that into account.
Note that you generally want to fetch your entities in large batches, and not one by one, but I assume you are already doing that based on the numbers in your question.
Not sure if this will help but you could try packing more data into a single entity by using embedded entities. Embedded entities are not true entities, they are just properties that allow for nested data. So instead of having 4 properties per entity, create an array property on the entity that stores a list of embedded entities each with those 4 properties. The max size an entity can have is 1MB, so you'll want to pack the array to get as close to that 1MB limit as possible.
This will lower the number of true entities and I suspect this will also reduce overall fetch time.

Database Structure for hierarchical data with horizontal slices

We're currently looking at trying to improve performance of queries for our site, the core hierarchical data-structure has 5 levels, each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (ie. we have full control over data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this datasource a lot, as-in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be to the current data structure.
We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents, I'm thinking that some sort of Tree-based or Document based database might be more suitable.
My question is: what's the best way to model this data for efficient read performance?
Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
Combining relational and OLAP:
http://en.wikipedia.org/wiki/HOLAP
Tableau:
http://www.tableausoftware.com/
*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystem's Cache is one Object Database I know of that sounds like a more appropriate fit based on what you've said.
http://www.intersystems.com/cache/
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, I know on MS SQL that a switch we did from having some core queries use a VARCHAR field to using an INTEGER field made a huge difference in speed. Text data is one of the THE MOST expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.
Document/Tree based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design -- I fail to see any? Querying one level up and down doesn't count: it is a simple join. Please have in mind that going "Document/Tree based database" route you would compromise your general querying ability. To summarize, just hire a competent db specialist who would analyze your performance bottlenecks -- they are usually cured with mundane index addition.
there's not really enough info here to say much useful - you'd need to measure things, look at "explains", etc - but one option that goes beyond the usual indexing would be to shard by level 3 instances. that would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.
the only reason i mention this really is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted - i have no idea what tools mysql has for this).
and if your data volume isn't so high then with sharding you might be able to get it down to ssds...

Datastore Write Operations limit exceeding

I am using Java SDK of app-engine.
I am using Master-slave datastore.
I have only two tables, each having 30 columns and none of them has size greater than 20 bytes.
After entering 300 rows in each table, it shows Datastore write operations 0.03 Millions out of 0.05 Millions.
I have checked the tables. They contain 300 entries only. There is no infinite loop kind of bug in my code.
Would someone please help me to point me out where I might be going wrong?
Thanks,
Amrish.
As noted in the previous answer, those write totals include your index writes.
All entity properties have associated default indexes (unless the property is configured to be unindexed), even if you have not defined any custom indexes.
See http://code.google.com/appengine/articles/indexselection.html for more detail, and http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost for more specifics on write costs.
For example, a new entity 'put' is:
2 Writes + 2 Writes per indexed property value (these are for the default indexes for that property) + 1 Write per composite index value (for any relevant custom indexes that you have defined).
Datastore writer operations include index updates. Make sure you don't have any exploding indexes. Keep in mind also that by default all fields have a built-in index; make any fields that you're not using unindexed to save quota.
Also, for better reliability and availability, consider switching to the high-reliability datastore (this doesn't directly fix your problem though).
I think there is problem due to size of list_flightinfo. Also, this code might have been called several times per second.
The key of the entity is:
src+"_"+dest
Which is not getting changed in the loop, hence same entity is getting overwritten again and again.

Does having large number of properties in an Entity effect datastore read/write performance?

I have couple of entities with properties numbering in the range of 40 - 50. All these properties are unindexed. These entities are a part of a larger entitygroup tree structure, and are always retrieved by using their key. None of the properties (except the key property) are indexed. I am using Objectify to work with entities on BigTable.
I want to know if there is any performance impact in reading or writing an entity with large number of properties from/to BigTable.
Since these large entities are only fetched by their keys are never participate in any query, I was wondering if I should serialize the entity pojo and store as a blob. It is pretty straightforward to do this in Objectify using the #Serialized annotation. I understand that by serializing my entity and storing it as a blob, I render the blob totally opaque to any other program or non-Java code, but this is not a concern.
I am yet to benchmark the performance difference, but before doing so, I want to know if anybody has done this before or has any advice/opinion to share.
there is always an overhead for number of properties. and serializing won't help much as it just moves processing from one point to another.
i have entities with number of property up to 25 and i fetch them almost on all request by key. the performance difference is negligible for me. hardly +- 1ms. performance problems normally occurs on query parts. number of unindexed property wont count much in performance. while indexed property can significantly delayed put due to modification of index.
if you must, you can break up property in to multiple table if you not going to need them at once.
Going purely by what little I know of how it works, I'd say having a bunch of unindexed properties wouldn't be any different from having the whole thing serialized.

Access database performance

For a few different reasons one of my projects is hosted on a shared hosting server
and developed in asp.Net/C# with access databases (Not a choice so don't laugh at this limitation, it's not from me).
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1- Is the order of the records in the database only visual or is there an actual difference internally. More specifically, the reason I ask is that the way it is currently designed all records (for all databases in this project) are ordered by a row identifying key (which is an auto number field) ascending but since over 80% of my queries will be querying fields that should be towards the end of the table would it increase the query performance if I set the table to showing the most recent record at the top instead of at the end?
2- Are there any other performance tuning that can be done to help with access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes and have a primary key (automatically indexes) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing, I'm already doing all that can be done for access performance. I'll give the question "accepted answer" to the one that was the fastest to answer.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes there is an actual order to the records in the database. Setting the defaults on the table preference isn't going to change that.
I would ensure there are indexes on all your where clause columns. This is a rule of thumb. It would rarely be optimal, but you would have to do workload testing against different database setups to prove the most optimal solution.
I work daily with legacy access system that can be reasonably fast with concurrent users, but only for smallish number of users.
You can use indexes on the fields you search for (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it's the same as SQL server you won't need to include the primary key in the index (assuming it's clustering on this) as it's included by default.
If you're running queries on a small sub-set of fields you could get your index to be a 'covering' index by including all the fields required, there's a space trade-off here, so I really only recommend it for 5 fields or less, depending on your requirements.
Are you actually experiencing a performance problem now or is this just a general optimization question? Also from your post it sounds like you are talking about a db with 1 table, is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if only 80% of your table rows are not accessed (as implied in your question), create an archive table for older records. Or, if the bulk of your performance hits are from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
In regard to table order, Jet/ACE writes the actual table date in PK order. If you want a different order, change the PK.
But this oughtn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field Length, create your fields as large as you'll need them but don't go over - for instance, if you have a number field and the value will never be over 1000 (for the sake of argument) then don't type it as a Long Integer, something smaller like Integer would be more appropriate, or use a single instead of a double for decimal numbers, etc. By the same token, if you have a text field that won't have more than 50 chars, don't set it up for 255, etc, etc. Sounds obvious, but it's done, often times with the idea that "I might need that space in the future" and your app suffers in the mean time.
Not to beat the indexing thing to death...but, tables that you're joining together in your queries should have relationships established, this will create indexes on the foreign keys which greatly increases the performance of table joins (NOTE: Double check any foreign keys to make sure they did indeed get indexed, I've seen cases where they haven't been - so apparently a relationship doesn't explicitly mean that the proper indexes have been created)
Apparently compacting your DB regularly can help performance as well, this reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under tools Analyze > Performance, it might be worth running it on your tables & queries at least to see what it comes up with. The table analyzer (available from the same menu) can help you split out tables with alot of redundant data, obviously, use with caution - but it's could be helpful.
This link has a bunch of stuff on access performance optimization on pretty much all aspects of the database, tables, queries, forms, etc - it'd be worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here it is useful to consider how access works, in an un-indexed table there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed by the virtue of the fact that Access / the JET engine is an ISAM database it's the other way around. (http://en.wikipedia.org/wiki/ISAM) That's rather moot however as I would never suggest putting frequently accessed values at the top of a table, it is best as others have said to rely on useful indexes.

Resources