Apache Commons cluster dimensions to be normalized - apache-commons-math

I want to cluster instances of MyClass, which implements Clusterable and takes 4 dimensions into account. It works quite well, but I am wondering: must the dimensions be normalized so they are comparable, or is normalization integrated into the clustering?
Clusterer<MyClass> clusterer = new KMeansPlusPlusClusterer<>(8);
final List<? extends Cluster<MyClass>> clusters = clusterer.cluster(listToCluster);
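For context, normalizing here would mean rescaling each of the 4 dimensions to a comparable range before clustering, so that a dimension with large absolute values does not dominate the distance measure. A minimal sketch of min-max scaling, written in Python purely for illustration (the sample values are placeholders, not part of Commons Math):

# Hypothetical 4-dimensional points; min-max scale each dimension to [0, 1]
# so that no single dimension dominates the Euclidean distance.
points = [
    [1200.0, 3.5, 0.2, 40.0],
    [800.0, 1.0, 0.9, 10.0],
    [1500.0, 2.2, 0.5, 25.0],
]

mins = [min(col) for col in zip(*points)]
maxs = [max(col) for col in zip(*points)]

normalized = [
    [(v - lo) / (hi - lo) if hi > lo else 0.0
     for v, lo, hi in zip(p, mins, maxs)]
    for p in points
]
print(normalized)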

Related

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have an application running in the Google App Engine Standard Environment which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, each of my WeatherData entities stores the relevant fields that I want to graph, in addition to a very large JSON string containing forecast data that I do not need for this graphing application. So the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties of the entity, as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast JSON string for graphing, only a few other fields stored in the entity. The other approach I tried was to fetch only a couple of entities at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\n' + ','.join(d)

# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...
    # Function for querying data below a given ancestor between two optional times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time >= start, cls.time <= end, ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be that Python's garbage collector is not freeing the already-fetched entities that are still referenced, but even when I include gc.collect() I see no improvement.
EDIT:
Following the advice below, I fixed the problem using projection queries. Rather than having a separate projection for each custom query, I simply ran the same projection each time: namely, querying all properties of the entity excluding the JSON string. While this is not ideal, since it still pulls gratuitous information from the database each time, generating an individual projection for each specific query is not scalable due to the exponential growth of necessary indices. For this application, since each additional property adds negligible memory (aside from that JSON string), it works!
You can use projection queries to fetch only the properties of interest from each entity (a minimal sketch follows these suggestions). Watch out for the limitations, though, and this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally, you could try to create and store just portions of the graphs and stitch them together at the end (only if it comes down to that; I'm not sure exactly how it would be done, and I imagine it wouldn't be trivial).
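To illustrate the first suggestion above, here is a minimal sketch of a projection query with ndb. The property names temperature and humidity are assumptions for illustration; only indexed properties can be projected (a composite index may be required), which works out here because the large forecast string would typically be an unindexed TextProperty/JsonProperty and is exactly what we want to leave out:

from google.appengine.ext import ndb

qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),
                                     start=start_date, end=end_date)

# Only the projected columns are read from the index and materialized;
# the large forecast JSON string never enters memory.
rows = qry.fetch(projection=[WeatherData.time,
                             WeatherData.temperature,
                             WeatherData.humidity])
for row in rows:
    print row.time, row.temperature, row.humidity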

Google App Engine ndb performance on repeated property

Do I pay a penalty on query performance if I choose to query repeated property? For example:
class User(ndb.Model):
    user_name = ndb.StringProperty()
    login_providers = ndb.KeyProperty(repeated=True)

fbkey = ndb.Key("ProviderId", 1, "ProviderName", "FB")
for entry in User.query(User.login_providers == fbkey):
    # Do something with entry.key
vs
class User(ndb.Model):
    user_name = ndb.StringProperty()

class UserProvider(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    login_provider = ndb.KeyProperty()

for entry in UserProvider.query(
        UserProvider.user_key == auserkey,
        UserProvider.login_provider == fbkey
):
    # Do something with entry.user_key
Based on the documentation from GAE, it seems that the Datastore takes care of indexing, and the first, less verbose option would use that index. However, I failed to find any documentation to confirm this.
Edit
The sole purpose of UserProvider in the second example is to create a one-to-many relationship between a user and its login_provider. I wanted to understand whether it is worth the trouble of creating a second entity instead of querying on the repeated property. Also, assume that all I need is the key from the User.
No. But you'll raise your write costs because each entry needs to be indexed, and write costs are based on the number of indexes updated.

Should the size of entities be as small as possible when I count them with the count() method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100,000,000 entities of this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this small model make counting faster?
And does it use less memory?
By the way, I'm planning to count them like the following:
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100,000,000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat().all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think that you have to create another entity like this.
This entity will just count the number of messages by category.
Just change your category to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the Message class, change the category property to reference the Category class instead:
category = db.ReferenceProperty(Category)
When you create a new Message object you have to update the counter: increment it when you create a message and decrement it when you delete one.
The best way to work with counters on GAE is to use sharding counters.
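A minimal sketch of that bookkeeping (simplified; this is not the full sharded-counter recipe from the link above, and the helper name add_message is made up for illustration):

from google.appengine.ext import db

def add_message(category_key, title, body):
    # Bump the per-category counter inside a transaction so concurrent
    # writers don't lose increments.
    def bump():
        category = db.get(category_key)
        category.totalOfMessages += 1
        category.put()
    db.run_in_transaction(bump)
    # Store the message itself, pointing back at its Category.
    Message(title=title, message=body, category=category_key).put()

Note that the counter update and the message put are not atomic here, and a single counter entity can become a write bottleneck, which is why the sharded-counter approach is recommended for high write rates.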
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

What ORMs work well with Scala? [closed]

I'm about to write a Scala command-line application that relies on a MySQL database. I've been looking around for ORMs, and am having trouble finding one that will work well.
The Lift ORM looks nice, but I'm not sure it can be decoupled from the entire Lift web framework. ActiveObjects also looks OK, but the author says that it may not work well with Scala.
I'm not coming to Scala from Java, so I don't know all the options. Has anyone used an ORM with Scala, and if so, what did you use and how well did it work?
There are several reasons why JPA-oriented frameworks (Hibernate, for instance) do not fit into idiomatic Scala applications elegantly:
there are no nested annotations, as stated in the Scala 2.8 Preview -- that means you cannot use annotations as mapping metadata for complex applications (even the simplest ones often use @JoinTable -> @JoinColumn);
inconsistencies between Scala and Java collections make developers convert collections; there are also cases when it is impossible to map Scala collections to associations without implementing complex interfaces of the underlying framework (Hibernate's PersistentCollections, for example);
some very common features, such as domain model validation, require JavaBeans conventions on persistent classes -- this stuff is not quite the "Scala way" of doing things;
of course, the interop problems (like Raw Types or proxies) introduce a whole new level of issues that cannot be worked around easily.
There are more reasons, I'm sure. That's why we have started the Circumflex ORM project. This pure-Scala ORM tries its best to eliminate the nightmares of classic Java ORMs. Specifically, you define your entities in pretty much the way you would do it with classic DDL statements:
class User extends Record[User] {
  val name = "name".TEXT.NOT_NULL
  val admin = "admin".BOOLEAN.NOT_NULL.DEFAULT('false')
}

object User extends Table[User] {
  def byName(n: String): Seq[User] = criteria.add(this.name LIKE n).list
}

// example with foreign keys:
class Account extends Record[Account] {
  val accountNumber = "acc_number".BIGINT.NOT_NULL
  val user = "user_id".REFERENCES(User).ON_DELETE(CASCADE)
  val amount = "amount".NUMERIC(10,2).NOT_NULL
}

object Account extends Table[Account]
As you can see, these declarations are a bit more verbose than classic JPA POJOs. But in fact there are several concepts assembled together:
the precise DDL for generating schema (you can easily add indexes, foreign keys and other stuff in the same DSL-like fashion);
all queries can be assembled inside that "table object" instead of being scattered around in DAO; the queries themselves are very flexible, you can store query objects, predicates, projections, subqueries and relation aliases in variables so you can reuse them, and even make batch update operations from existing queries (insert-select for example);
transparent navigation between associations (one-to-one, many-to-one, one-to-many and many-to-many-through-intermediate-relation) can be achieved either by lazy or by eager fetching strategies; in both cases the associations are established on top of the foreign keys of underlying relations;
validation is part of the framework;
there is also a Maven2 plugin that allows generating schema and importing initial data from handy XML formatted files.
The only things Circumflex ORM lacks are:
multi-column primary keys (although it is possible to create multi-column foreign keys backed by multi-column unique constraints, but it is only for data integrity);
full-fledged documentation (although we are actively working on it);
success stories of ten-billion-dollar production systems that have Circumflex ORM as its core technology.
P.S. I hope this post will not be considered an advertisement. It isn't so, really -- I was trying to be as objective as possible.
I experimented with EclipseLink JPA and basic operations worked fine for me. JPA is a Java standard and there are other implementations that may also work (OpenJPA, etc). Here is an example of what a JPA class in Scala looks like:
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity { val name = "Users" }
class User {
  @Id
  @GeneratedValue
  var userid: Long = _
  var login: String = _
  var password: String = _
  var firstName: String = _
  var lastName: String = _
}
Slick is a perfect match for a functional world. Traditional ORMs are not a perfect fit for Scala. Slick composes well and uses a DSL that mimics Scala collection classes and for-comprehensions.
I am happy to announce the 1st release of a new ORM library for Scala. MapperDao maps domain classes to database tables. It currently supports mysql and postgresql (an oracle driver will be available soon); one-to-one, many-to-one, one-to-many and many-to-many relationships; autogenerated keys; transactions; and it optionally integrates nicely with the Spring framework. It allows freedom in the design of the domain classes, which are not affected by persistence details, encourages immutability and is type safe. The library is not based on reflection but rather on good Scala design principles, and it contains a DSL to query data which closely resembles select queries. It doesn't require implementing equals() or hashCode() methods, which can be problematic for persisted entities. Mapping is done using type-safe Scala code.
Details and usage instructions can be found at the mapperdao's site:
http://code.google.com/p/mapperdao/
The library is available for download on the above site and also as a maven dependency (documentation contains details on how to use it via maven)
Examples can be found at:
https://code.google.com/p/mapperdao-examples/
Very brief introduction of the library via code sample:
class Product(val name: String, val attributes: Set[Attribute])
class Attribute(val name: String, val value: String)
...
val product = new Product("blue jean", Set(new Attribute("colour", "blue"), new Attribute("size", "medium")))
val inserted = mapperDao.insert(ProductEntity, product)
// the persisted entity has an id property:
println("%d : %s".format(inserted.id,inserted))
Querying is very familiar:
val o=OrderEntity
import Query._
val orders = query(select from o where o.totalAmount >= 20.0 and o.totalAmount <= 30.0)
println(orders) // a list of orders
I encourage everybody to use the library and give feedback. The documentation is currently quite extensive, with setup and usage instructions. Please feel free to comment and get in touch with me at kostas dot kougios at googlemail dot com.
Thanks,
Kostantinos Kougios
Here's basically the same example with the @Column annotation:
/*
Corresponding table:
CREATE TABLE `users` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `admin` tinyint(1) default '0',
  PRIMARY KEY (`id`)
)
*/
import _root_.javax.persistence._

@Entity
@Table{val name="users"}
class User {
  @Id
  @Column{val name="id"}
  var id: Long = _
  @Column{val name="name"}
  var name: String = _
  @Column{val name="admin"}
  var isAdmin: Boolean = _
  override def toString = "UserId: " + id + " isAdmin: " + isAdmin + " Name: " + name
}
Of course, any Java database access framework will work in Scala as well, with the usual issues that you may encounter, such as collections conversion, etc. jOOQ, for instance, has been observed to work well in Scala. An example of jOOQ code in Scala is given in the manual:
object Test {
  def main(args: Array[String]): Unit = {
    val c = DriverManager.getConnection("jdbc:h2:~/test", "sa", "");
    val f = new Factory(c, SQLDialect.H2);
    val x = T_AUTHOR as "x"
    for (r <- f
         select (
           T_BOOK.ID * T_BOOK.AUTHOR_ID,
           T_BOOK.ID + T_BOOK.AUTHOR_ID * 3 + 4,
           T_BOOK.TITLE || " abc" || " xy"
         )
         from T_BOOK
         leftOuterJoin (
           f select (x.ID, x.YEAR_OF_BIRTH)
           from x
           limit 1
           asTable x.getName()
         )
         on T_BOOK.AUTHOR_ID === x.ID
         where (T_BOOK.ID <> 2)
         or (T_BOOK.TITLE in ("O Alquimista", "Brida"))
         fetch
    ) {
      println(r)
    }
  }
}
Taken from
http://www.jooq.org/doc/2.6/manual/getting-started/jooq-and-scala/

Co-occurrence of words in documents with Google big table

Given document D1 containing words (w1, w2, w3), document D2 containing words (w2, w3, ...), and document Dn containing words (w1, w2, wn), can I structure my data in Bigtable to answer questions like:
which words occur most frequently with w1,
or which words occur most frequently with w1 and w2?
What I am trying to achieve is to find a third word Wx (a suggestion) which occurs most frequently in documents together with the given words W1 and W2.
I know the solution in SQL, but is it possible with Google Bigtable?
I know I would have to build my indices myself; the question is how I should structure them to avoid index explosion.
thanks
almir
The only way to do this that I'm aware of is to index all 3-tuples of words, with their counts. Your kind would look something like this:
class Tuple(db.Model):
    words = db.StringListProperty()
    count = db.IntegerProperty()
Then, you need to insert or update the appropriate tuple entity for each set of 3 unique words in your text. E.g., the string "the king is dead" would result in the tuples (the, king, is), (the, king, dead), (the, is, dead), (king, is, dead)... This obviously results in a combinatorial explosion of entries, but I'm not aware of any way around that for what you want to do.
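A minimal sketch of that indexing step (simplified: the read-modify-write on count is not transactional here, the helper name index_document is made up, and real text would need proper tokenization and stop-word handling):

import itertools

def index_document(text):
    # Unique, sorted words so each 3-word combination has one canonical order.
    words = sorted(set(text.lower().split()))
    for combo in itertools.combinations(words, 3):
        key_name = '|'.join(combo)
        tup = Tuple.get_or_insert(key_name, words=list(combo), count=0)
        tup.count += 1
        tup.put()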
To find the suggestions, you'd do something like this:
q = Tuple.all().filter('words =', w1).filter('words =', w2).order('-count')
In the broader sense of recommendation algorithms, however, there is a lot of research into more efficient ways to do this. It's an open question, as evidenced by the existence of the Netflix challenge.
Using list-properties and merge-join is the best way to answer set membership questions in Google App Engine: Building Scalable, Complex Apps on App Engine.
You could set up your model as follows:
class Document(db.Model):
word = db.StringListProperty()
name = db.StringProperty()
...
doc.word = ["google", "app", "engine"]
Then it would be easy to query for co-occurrence. For example, which documents have the words google and engine?
results = db.GqlQuery(
    "SELECT * FROM Document "
    "WHERE word = 'google'"
    " AND word = 'engine'")
docs = [d.name for d in results]
There are some limitations, though. From the presentation:
Index writes are done in parallel on Bigtable, and they are fast: e.g., a list property of 1000 items is updated with 1000 simultaneous row writes. This scales linearly with the number of items, but is limited to 5000 indexed properties per entity.
But queries must unpackage all result entities. When the list size is greater than about 100, reads become too expensive: slow in wall-clock time and costing too much CPU.
You could also create a model of words and save only their keys in the StringListProperty, but depending on the size of your documents even that might not be feasible.
There is nothing inherent to the App Engine datastore that will help you with this problem. You will need to index the words in the documents programmatically.
