SSIS/VB.NET Equivalent of SQL IN (anonymous array.Contains())

I've got some SQL which performs complex logic on combinations of GL account numbers and cost centers like this:
WHEN (@IntGLAcct In (
882001, 882025, 83000154, 83000155, 83000120, 83000130,
83000140, 83000157, 83000010, 83000159, 83000160, 83000161,
83000162, 83000011, 83000166, 83000168, 83000169, 82504000,
82504003, 82504005, 82504008, 82504029, 82530003, 82530004,
83000000, 83000100, 83000101, 83000102, 83000103, 83000104,
83000105, 83000106, 83000107, 83000108, 83000109, 83000110,
83000111, 83000112, 83000113, 83100005, 83100010, 83100015,
82518001, 82552004, 884424, 82550072, 82552000, 82552001,
82552002, 82552003, 82552005, 82552012, 82552015, 884433,
884450, 884501, 82504025, 82508010, 82508011, 82508012,
83016003, 82552014, 81000021, 80002222, 82506001, 82506005,
82532001, 82550000, 82500009, 82532000))
Overall, the whole thing performs poorly in a UDF, especially when it's all nested and the order of the steps is important, etc. I can't make it table-driven just yet, because the business logic is so terribly convoluted.
So I'm doing a little exploratory work in moving it into SSIS to see about doing it in a little bit of a different way. Inside my script task, however, I've got to use VB.NET, so I'm looking for an alternative to this:
Select Case IntGLAcct = 882001 OR IntGLAcct = 882025 OR ...
Which is obviously a lot more verbose, and would make it terribly hard to port the process.
Even something like ({90605, 90607, 90610} AS List(Of Integer)).Contains(IntGLAcct) would be easier to port, but I can't get the initializer to give me an anonymous array like that. And there are so many of these little collections, I'm not sure I can create them all in advance.
It really all NEEDS to be in one place. The business changes this logic regularly. My strategy was to use the UDF to mirror their old "include" file, but performance has been poor. Each of the functions takes just two or three parameters. It turns out that in a dark corner of the existing system they actually build a multi-million row table of all these results - even though the pre-calculated table is not used much.
So my new experiment is (since I'm still building the massive cross-join table to reconcile that part of the process) to use the table instead of the code, but to populate that table during an SSIS phase instead of calling the UDF 12 million times - because my UDF version basically stopped completing within a reasonable time frame and the DBAs are not much help right now. Yet I know that SSIS can process these rows efficiently - each month I bring in the known-good results of dozens of multi-million row tables from the legacy system in minutes AND run queries to reconcile that there are no differences with the new versions.
The SSIS code would theoretically become the keeper of the business logic, and the efficient table would be built from that (based on all known parameter combinations). Of course, if I can simplify the logic down to a real logic table, that would be the ultimate design - but that's not really foreseeable at this point.

Try this:
Array.IndexOf(New Integer() {90605, 90607, 90610}, IntGLAcct) > -1
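If it helps to keep each account list defined in exactly one place, the same test can be wrapped in named shared arrays and helper functions inside the script task. This is only a sketch - the group name CapExAccounts and the handful of accounts shown are made up - and if .NET 3.5 or later is available to the script task, a HashSet(Of Integer) would give constant-time lookups instead:
' Sketch only: define each account group once and reuse it everywhere.
Private Shared ReadOnly CapExAccounts As Integer() = New Integer() { _
    882001, 882025, 83000154, 83000155, 83000120}

Private Shared Function IsCapExAccount(ByVal intGLAcct As Integer) As Boolean
    ' Equivalent of the SQL IN (...) test for this group.
    Return Array.IndexOf(CapExAccounts, intGLAcct) > -1
End Function
Each list then lives in one shared spot in the script task, which keeps the port from the SQL IN lists mechanical.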

What if you used a conditional split transform on your incoming data set and then used expressions or something similar (I'm not sure if your GL Accounts are fixed or if you're going to dynamically pass them in) to apply to the results? You can then take the resulting data from that and process as necessary.

Related

Angularjs slow with a table

I have a table with about 30 rows and about 10 columns. The rows are a subrange of a much bigger set (which I handle manually in order to avoid a huge DOM). The columns are stored in a list like [{name: "firstname", width: 200}, {name: "married", type: "bool"}], which allows some flexibility (like showing the property "married" as a checkbox).
So there are just about 300 fields, yet the digest cycle takes about one second (on my i5-2400 CPU @ 3.10GHz).
I'm having trouble interpreting the Batarang performance page. It says:
p.name | translate 16.0% 139.6ms
e[c.name] 15.8% 138.4ms
c.name | translate 11.1% 96.3ms
The meaning of the (sparsely named) variables is clear to me:
e stands for entity, i.e., table row.
p stands for property and only occurs outside of the table.
c stands for column.
e[c.name] stands for the field content (from entity e the property named by c).
But the performance figures make little sense:
p.name is only used maybe 10 times; how can it take that long?
c.name | translate also only occurs about 10 times (in the header row); how can it take that long?
I'm aware of {::a_once_only_bound_expression}, and I tried it, but without much success. What I'd actually need is the following:
When c changes, re-create the whole table (this happens only exceptionally, so I don't care about speed).
When e changes, re-create its whole row (when there's a change, then only in a single row).
Any way to achieve this?
A solution idea
I guess what I need could be achieved using a directive that strips all the Angular machinery off the row after rendering:
drop all child scopes
with all their watches
but keep all the HTML and listeners
I could add a single watch per row responsible for the repaint if needed.
Does it make sense?
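A rough, untested sketch of that idea (assuming AngularJS 1.2+; the directive name fastRow and the columns array on the scope are my own invention): render the row's cells once with no per-cell bindings, and repaint the whole row from a single watch.
angular.module('app').directive('fastRow', function () {
    return {
        restrict: 'A',
        link: function (scope, element, attrs) {
            // Rebuild every <td> of this row from the entity and the column defs.
            function render() {
                var entity = scope.$eval(attrs.fastRow); // the row entity "e"
                element.empty();
                angular.forEach(scope.columns, function (c) {
                    var cell = angular.element(document.createElement('td'));
                    cell.text(entity[c.name] == null ? '' : String(entity[c.name]));
                    element.append(cell);
                });
            }
            render();
            // The single watch per row: repaint only when the entity changes.
            scope.$watch(attrs.fastRow, render, true);
        }
    };
});
It would sit on the repeated row, e.g. <tr ng-repeat="e in entities" fast-row="e">; the cells carry no bindings, so the digest evaluates one (deep) watch per row instead of one watcher per field.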
Update
I've been rather busy working on the application - improving other things than performance. I was very lucky and got some performance as a bonus. Then I simplified the pages a little bit and the speed is acceptable now. At least for now.
Still:
I don't trust the above Batarang performance values.
I'm still curious how to implement the above solution idea and if it makes sense.
You may want to look at ngTable, which builds rows and columns from JSON data; it might solve your performance issue. I recommend checking it out.

Neo4j output format

After working with neo4j, and now coming to the point of considering writing my own entity manager (object manager) to work with the fetched data in the application, I wonder about neo4j's output format.
When I run a query, the result is always returned as tabular data. Why is this?
Sure, tables have a big place in data and processing, but it seems strange that a graph database can only output in this format.
Now when I want to create an object graph in my application, I have to hydrate all the objects, and this is not really good for performance and doesn't leverage true graph performance.
Consider MATCH (A)-->(B) RETURN A, B when there is one A and three B's, it would return:
A B
1 1
1 2
1 3
That's the same A passed down 3 times over the database connection, while I only need it once and I know this before the data is fetched.
Something like this seems great: http://nigelsmall.com/geoff
load2neo is nice; a load-from-neo would also be nice! Either in the Geoff format or any of the other formats out there: https://gephi.org/users/supported-graph-formats/
Each language could then implement its own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve its own post): what's a good way to model relations in an object graph? As objects? Or as data/methods inside the node objects?
@Kikohs
Q: What do you mean by "Each language could then implement it's own functions to create the objects directly."?
A: With a (partial) graph provided by the database (as the result of a query), a language like PHP could provide a factory method (preferably in C) to construct the object graph (this is usually an expensive operation). But only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: you want to dump to the disk the subgraph created from the result of a query ?
A: No, existing interfaces like REST are preferred to get the query result.
Q: do you want to create the subgraph which comes from a query in memory and then request it in another language ?
A: No, I want the result of the query in a format other than tabular (examples mentioned above).
Q: You make a query which only returns the name of a node, in this case, would you like to get the full node associated or just the name ? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) the node ID, its labels and its properties, and B) the relations to other nodes that are in the same result.
Note that the first part of the question is not a concrete "how-to" question, but rather "why is this not possible?" (or, if it is, I'd like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to answer "how to get graph data efficiently into PHP."
@Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It is reasonable to say that an alternative output format would only complement the table format, not replace it.
I understand from your answer that the natural (raw-ish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). In that case I understand that it's left to another program (in the dev stack) to do the mapping. So my conclusion on neo4j implementing something like this:
Pros - not having to do this in every implementation language (of the application)
Cons - 1) no application-specific mapping is possible, 2) no performance gain if the implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't entirely understand this point: are you saying that these formats are not able to hold deduplicated results (in certain cases)? So, in fact, there is no possible textual format in which a graph can be described without duplication?
"There is also the questions on what you want to include in your output?"
I was under the assumption that the cypher language was powerful enough to specify this in the query. And so the output format would have whatever the database can provide as result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion, I'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?
There is more to it.
First of all as you said tabular results from queries are really commonplace and needed to integrate with other systems and databases.
Secondly oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced, extracted information out of your graph. So the relationships to the original graph data are already lost in most of the results of queries I see being used.
The only time people need / use the raw graph data is when they want to export subgraph data from the database as a query result.
The problem with doing that as a de-duplicated graph is that the db has to fetch all the result data into memory first to deduplicate it, extract the needed relationships, etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potential duplicate nodes and relationships).
There is also the question of what you want to include in your output. Just the nodes and rels returned? Or additionally all the other rels between the nodes that you return? Or all the rels of the returned nodes (but then you also have to include the end-nodes of those relationships)?
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. what you can do is use collect to aggregate data per node (i.e. a kind of subgraph):
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality it makes sense to write a server-extension that uses cypher for query specification, but doesn't allow return statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages). And then render it as a suitable format (e.g. JSON or one of those listed above).
These could be starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.
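For reference, that d3 format is roughly a flat array of nodes plus an array of links whose source/target refer back to the nodes; the property names beyond nodes, links, source and target are just illustrative:
{
  "nodes": [
    { "id": 0, "labels": ["Person"], "name": "A" },
    { "id": 1, "labels": ["Person"], "name": "B" }
  ],
  "links": [
    { "source": 0, "target": 1, "type": "KNOWS" }
  ]
}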
I've been working with neo4j for a while now, and I can tell you that if you are concerned about memory and performance you should drop Cypher altogether and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
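A minimal sketch of that traversal style with the embedded Java API (Neo4j 2.x; the relationship type CONNECTED_TO is made up, and this is the embedded API rather than the REST interface the question uses):
import org.neo4j.graphdb.*;

public class TraversalExample {
    // Print the ids and labels of all nodes reachable over one outgoing relationship.
    public static void printNeighbours(GraphDatabaseService db, long startNodeId) {
        try (Transaction tx = db.beginTx()) {
            Node start = db.getNodeById(startNodeId);
            for (Relationship rel : start.getRelationships(
                    Direction.OUTGOING, DynamicRelationshipType.withName("CONNECTED_TO"))) {
                Node other = rel.getEndNode();
                System.out.println(other.getId() + " " + other.getLabels());
            }
            tx.success();
        }
    }
}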
As the documentation says, Cypher is not intended for in-app usage, but more as an administration tool. Furthermore, in production-scale environments it is VERY easy to crash the server by running the wrong query.
In the second place, there is no mention in the docs of an API method to retrieve the output as a graph-like structure. You will have to process the output of the query and build it yourself.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r

Entity Framework: Max. number of "subqueries"?

My data model has an entity Person with 3 related (1:N) entities Jobs, Tasks and Dates.
My query looks like
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name),
                   TaskDates = x.Tasks.Select(y => y.Date),
                   DateInfos = x.Dates.Select(y => y.Info)
               }).ToList();
Everything seems to work fine, but the lists JobNames, TaskDates and DateInfos are not all filled.
For example, TaskDates and DateInfos have the correct values, but JobNames stays empty. But when I remove TaskDates from the query, then JobNames is correctly filled.
So it seems that EF can only handle a limited number of these "subqueries"? Is this correct? If so, what is the maximum number of these "subqueries" for a single statement? Is there a way to work around this issue without having to make more than one call to the database?
(ps: I'm not entirely sure, but I seem to remember that this query worked in LINQ2SQL - could it be?)
UPDATE
This is driving me crazy. I tried to repro the issue from the ground up using a fresh, simple project (to post the entire piece of code here, not just an oversimplified example) - and I found I wasn't able to repro it. It still happens within our existing code base (apparently there's more behind this problem, but I cannot share this closed code base, unfortunately).
After hours and hours of playing around I found the weirdest behavior:
It works great when I don't SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; before calling the LINQ statement
It also works great (independent of the above) when I don't use a .Take() to only get the first X rows
It also works great when I add an additional .Where() statement to cut the number of rows returned from SQL Server
I didn't find any comprehensible reason why I see this behavior, but I started to look at the SQL: although EF generates the exact same SQL, the execution plan is different when I use READ UNCOMMITTED. It returns more rows on a specific index in the middle of the execution plan, which curiously ends in fewer rows returned for the entire SQL statement - which in turn results in the missing data that is the reason for my question to begin with.
This sounds very confusing and unbelievable, I know, but this is the behavior I see. I don't know what else to do, I don't even know what to google for at this point ;-).
I can fix my problem (just don't use READ UNCOMMITTED), but I have no idea why it occurs and if it is a bug or something I don't know about SQL Server. Maybe there's some "magic max number of allowed results in sub-queries" in SQL Server? At least: As far as I can see, it's not an issue with EF itself.
A little late, but does calling ToList() on each subquery produce the required effect?
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name).ToList(),
                   TaskDates = x.Tasks.Select(y => y.Date).ToList(),
                   DateInfos = x.Dates.Select(y => y.Info).ToList()
               }).ToList();

How to make datastore keys mapreduce-friendly(-er)?

Edit: See my answer. Problem was in our code. MR works fine, it may have a status reporting problem, but at least the input readers work fine.
I ran an experiment several times now and I am now sure that mapreduce (or DatastoreInputReader) has odd behavior. I suspect this might have something to do with key ranges and splitting them, but that is just my guess.
Anyway, here's the setup we have:
we have an NDB model called "AdGroup"; when creating new entities of this model we use the same id returned from AdWords (it's an integer), but we use it as a string: AdGroup(id=str(adgroupId))
we have 1,163,871 of these entities in our datastore (that's what the "Datastore Admin" page tells us - I know it's not an entirely accurate number, but we don't create/delete adgroups very often, so we can say for sure that the number is 1.1 million or more).
mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
    job_name='AdGroup-process',
    mapper_spec='process.adgroup_mapper',
    reducer_spec='process.adgroup_reducer',
    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_params={
        'entity_kind': 'model.AdGroup',
        'shard_count': 120,
        'processing_rate': 500,
        'batch_size': 20,
    },
)
So, I've tried to run this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, the mapper-calls counter had a different value, ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader - mapper-calls should be equal to the number of entities. So it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround - write all keys of all adgroups into a blobstore file (one per line) and then use BlobstoreLineInputReader. The writing to blob part would have to be written in a way that does not utilize DatastoreInputReader, of course. Should I go with this for now, or can you suggest something better?
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally, the questions. Does it matter how you generate ids for your entities? Is it better to use int ids instead of str ids? In general, what can I do to make it easier for mapreduce to find and map all of my entities?
PS: I'm still in the process of experimenting with this, I might add more details later.
After further investigation we have found that the error was actually in our code. So, mapreduce actually works as expected (mapper is called for every single datastore entity).
Our code was calling some Google services functions that were sometimes failing (the wonderful cryptic ApplicationError messages). Due to these failures, MR tasks were being retried. However, we had set a limit on taskqueue retries. MR did not detect or report this in any way - it still showed "success" on the status page for all shards. That is why we thought everything was fine with our code and that something was wrong with the input reader.
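For context, the taskqueue retry limit mentioned above is the kind of setting configured in queue.yaml; the queue name and values here are made up:
queue:
- name: mapreduce-workers
  rate: 50/s
  retry_parameters:
    task_retry_limit: 7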

Multi-location entity query solution with geographic distance calculation

In my project we have an entity called Trip. This trip has two points: start and finish. Start and finish are geo coordinates with some added properties like address etc.
What I need is to query for all Trips that satisfy the search criteria for both start and finish.
Something like
select from trips where start near 16,16 and finish near 18,20 where type = type
So my question is: which database can offer such functionality?
What I have tried
I have explored MongoDB, which has support for geo indexes but does not support this use case. The current solution stores the points as separate documents which have a reference to a Trip. We run two separate queries for starts and finishes, then extract the ids of their associated trips, then select the trip ids that are found in both starts and finishes, and finally return a collection of trips.
On a small sample it works fine, but with a larger collection it gets slow, and it's like scratching my left ear with my right hand.
So I am looking for a better solution.
I know about neo4j and its spatial plugin, but I couldn't even make it work on Windows. Would it support our use case?
Or are there any better solutions? Preferably with an object mapper written in PHP.
Like edze already said, Postgres (PostGIS) or SQLite (SpatiaLite) is what you're looking for:
SELECT *
FROM trips
WHERE ST_Distance(ST_StartPoint(way), ST_GeomFromText('POINT(16 16)', 4326)) < 5
  AND ST_Distance(ST_EndPoint(way), ST_GeomFromText('POINT(18 20)', 4326)) < 5
  AND type = 'type'
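A variation of the same idea that can take advantage of spatial indexes: ST_DWithin instead of ST_Distance(...) < x, plus functional GiST indexes on the two endpoints. The column name way and the distance of 5 are taken from the query above, and note that with SRID 4326 the units are degrees, not meters.
CREATE INDEX trips_start_gist ON trips USING GIST (ST_StartPoint(way));
CREATE INDEX trips_end_gist ON trips USING GIST (ST_EndPoint(way));

SELECT *
FROM trips
WHERE ST_DWithin(ST_StartPoint(way), ST_GeomFromText('POINT(16 16)', 4326), 5)
  AND ST_DWithin(ST_EndPoint(way), ST_GeomFromText('POINT(18 20)', 4326), 5)
  AND type = 'type';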
