Creating and using multiple datasets in SPSS

Forgive the likely naive question, but despite experience in databases I'm new to SPSS and am probably overlooking something simple.
I have data about Patients (unique-pt-identifier, age, gender, etc.)
The patients take multiple different kinds of Tests, each of which can require a few hundred to several thousand fields (unique-pt-identifier, testtype, testdate, testdata1, testdata2, ... testdata2000). I have sizable datasets of these test results.
I'd like to compute things about the test results, but those computations sometimes need to reference properties of the patients. I know I can add columns to the Test dataset, adding the patient data to each row, but this seems awkward and redundant (patients take the same type of test multiple times, so I'd end up adding the same info multiple times).
This seems conceptually straightforward, but unless I'm just using the wrong terminology, I can't find anything about this in either SPSS command syntax or in multiple web searches. Happy to read the right documentation if pointed to it.
Many thanks.

In SPSS you need to have all the data you want to work with together in the same dataset. So yes - you have to get the patients' properties together with the test results in the same dataset. If this makes for (too) big datasets, there are two simple ways to get what you need with a smaller dataset:
First, you don't necessarily have to bring together ALL test results and ALL patient properties - just the ones relevant for each analysis. For example:
* Both files must be sorted by patientID; keep only the variables needed for this analysis.
match files /file=testresults /table=patients /by=patientID
/keep=patientID test1 test2 property1 property2.
exe.
dataset name dataForAnalysis1.
The second approach is to first aggregate the test data to patient level, and only then match the datasets.
dataset activate testdata.
dataset declare agg1.
aggregate outfile=agg1 /break=patientID /test1 test2=mean(test1 test2).
match files /file=agg1 /table=patients /by patientID.
exe.
dataset name dataForAnalysis1.

Related

Pattern matching based on variable name for variable selection in a Postgres query?

I'm trying to query some data in Postgres and I'm wondering how I might use some sort of pattern matching not merely to select rows - e.g. SELECT * FROM schema.tablename WHERE varname ~ 'phrase' - but to select columns in the SELECT statement, specifically doing so based on the common names of those columns.
I have a table with a bunch of estimates of rates over the course of many years - say, of apples picked per year - along with upper and lower 95% confidence intervals for each year. (For reference, each estimate and 95% CI comes from a different source - a written report, to be precise; these sources are my rows, and the table describes various aspects of each source. Based on a critique below, I think it's important that readers know that the unit of analysis in this relational database is a written report with different estimates of things picked per year - apples in one table, oranges in another, pears in a third.)
So in this table, each year has three columns / variables:
rate_1994
low_95_1994
high_95_1994
The thing is, the CIs are mostly null - they haven't been filled in. In my query, I'm really only trying to pull out the rates for each year: all the variables that begin with rate_. How can I phrase this in my SELECT statement?
I'm trying to employ regexp_matches to do this, but I keep getting back errors.
I've done some poking around StackOverflow on this, and I'm getting the sense that it may not even be possible, but I'm trying to make sure. If it isn't possible, it's easy to break up the table into two new ones: one with just the rates, and another with the CIs.
(For the record, I've looked at posts such as this one:
Selecting all columns that start with XXX using a wildcard? )
Thanks in advance!
If what you are basically asking is whether columns can be selected dynamically based on an execution-time condition:
No.
You could, however, use PL/pgSQL to build the query up as a string and then run it with a dynamic EXECUTE.
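If dynamic SQL is acceptable, here is a rough PL/pgSQL sketch of that idea (schema.tablename is the placeholder from the question, and the temp table name rates_only is made up): it collects every column whose name starts with rate_ from information_schema and builds the SELECT from that list.
DO $$
DECLARE
    col_list text;
BEGIN
    -- Collect the names of all columns that start with rate_ (underscore escaped for LIKE).
    SELECT string_agg(quote_ident(column_name), ', ')
      INTO col_list
      FROM information_schema.columns
     WHERE table_schema = 'schema'
       AND table_name   = 'tablename'
       AND column_name LIKE 'rate\_%';

    -- Build the statement and run it; a temp table makes the result visible outside the DO block.
    EXECUTE format('CREATE TEMP TABLE rates_only AS SELECT %s FROM schema.tablename', col_list);
END $$;

SELECT * FROM rates_only;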

Is it possible to use LookupSet/Lookup with Running Value in SSRS

This is my first question on StackOverflow so apologies if there is not enough appropriate information.
Rather than having four different tables that I try to position 'just so' so that they look like one table, I was hoping to have all of my data in one visible table and hide the rest.
To do this I was trying to use LookupSet/Lookup with Running Value (I need a cumulative figure for each fortnight from a start date).
I have used the following code which supplies me with figures in the table - however the figures seem to be nearly double what they actually are.
=Lookup(Fields!StartFortnightDate.Value, Fields!StartFortnightDate.Value,
Fields!RowIdentifier.Value, "KPI004")
Is it possible to use Lookup with RunningValue? It won't let me use ReportItems either; it's obviously only pulling from the first box and therefore just repeats the first figure again and again.
Any help, guidance, or even a simple "it's not possible" would be appreciated.
Edited to add more information as suggested:
It's difficult to add example data without worrying about data protection etc.
Report design is currently: [screenshot: ReportDesign]
Each table has its own dataset - I'm trying to get them all into one table.
Let's say the first dataset is the number of cars sold in each fortnight.
The second dataset (table) is number of meetings held.
The third dataset is number of days weather was sunny/cloudy/rainy etc.
(This obviously isn't what the datasets are, but I'm trying to show that they don't actually relate to each other that much and therefore can't all be in the same script)
All datasets have a table of the fortnightly dates within that quarter; my hope was to get one table that showed the cumulative figures for each item even though they're not in the same dataset - the tables are all grouped by the StartOfFortnightDate.
The script =RunningValue(Fields!NumberOfFordCarsSold.Value, Count, Nothing) and similar works fine in the separate tables, however if I add a row to the top table and try to use RunningValue with Lookup it doesn't work.
When I used the script mentioned at the top (the Lookup script) I get inflated figures (top row of the image) compared to the expected figures (bottom row of the image): [screenshot: IncorrectAndCorrectFigures]
Apologies if this doesn't make sense, it's likely that my complete confusion in trying to find the answer is coming across in the question.
If the resulting datasets are all similar then why can you not combine them?
From the output they seem to be just Indicator & Date.
Add an extra column to indicate which set of data each row belongs to (Cars, Meetings, etc.); this will also help with grouping rows in the report.
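For what it's worth, a hedged SQL sketch of that idea (every table name below is invented; only NumberOfFordCarsSold comes from the question): one dataset query that stacks the three sources and tags each row, so a single tablix can group on Indicator and apply RunningValue per group.
-- Hypothetical combined dataset query: one row set, tagged by source.
SELECT StartFortnightDate, 'Cars sold' AS Indicator, NumberOfFordCarsSold AS Figure FROM dbo.CarSales
UNION ALL
SELECT StartFortnightDate, 'Meetings held' AS Indicator, MeetingsHeld AS Figure FROM dbo.Meetings
UNION ALL
SELECT StartFortnightDate, 'Sunny days' AS Indicator, SunnyDays AS Figure FROM dbo.Weather
ORDER BY StartFortnightDate, Indicator;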

How do you verify the correct data is in a data mart?

I'm working on a data warehouse and I'm trying to figure out how to best verify that data from our data cleansing (normalized) database makes it into our data marts correctly. I've done some searches, but the results so far talk more about ensuring things like constraints are in place and that you need to do data validation during the ETL process (E.g. dates are valid, etc.). The dimensions were pretty easy as I could easily either leverage the primary key or write a very simple and verifiable query to get the data. The fact tables are more complex.
Any thoughts? We're trying to make this very easy for a subject matter expert to run a couple of queries, see some data from both the data cleansing database and the data marts, and visually compare the two to ensure they are correct.
You test your fact table loads by implementing a simplified, pared-down subset of the same data manipulation elsewhere, and comparing the results.
You calculate the same totals, counts, or other figures at least twice. Once from the fact table itself, after it has finished loading, and once from some other source:
the source data directly, controlling for all the scrubbing steps in between source and fact
a source system report that is known to be correct
etc.
If you are doing this in the database, you could write each test as a query that returns no records if everything is correct. Any records that get returned are exceptions: count of x by (y,z) does not match.
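As a rough illustration (every object name below is invented), such an exception query might aggregate the same measure from the fact table and from the cleansed source and return only the rows where the two disagree:
-- Compare daily totals in the fact table against the cleansed source; any row returned is an exception.
SELECT COALESCE(f.order_date, s.order_date) AS order_date,
       f.fact_total,
       s.source_total
FROM (SELECT order_date, SUM(amount) AS fact_total
        FROM dw.fact_sales
       GROUP BY order_date) AS f
FULL OUTER JOIN
     (SELECT order_date, SUM(amount) AS source_total
        FROM cleansed.sales
       GROUP BY order_date) AS s
  ON f.order_date = s.order_date
WHERE COALESCE(f.fact_total, -1) <> COALESCE(s.source_total, -1);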
See this excellent post by ConcernedOfTunbridgeWells for more recommendations.
Although it has some drawbacks and potential problems if you do a lot of cleansing or transforming, I've found you can round-trip an input file by re-generating it from the star schema(s) and then simply comparing the original input file to the regenerated one. It might require some massaging to make them match (one is left padded, the other right padded).
Typically, I had a program which used the same layout the ETL used and did a compare, ignoring alignment within a field. The files might also have to be sorted first - I used a command-line sort for that.
If your ETL applies a transform incorrectly and the regeneration reverses it in the same incorrect way, the errors cancel out, so this method won't show every problem in the DW and I wouldn't claim it gives complete coverage, but it's a pretty good first whack at a regression unit test for each load.

Preferred way of retrieving row with multiple relating rows

I'm currently hand-writing a DAL in C# with SqlDataReader and stored procedures. Performance is important, but it still should be maintainable...
Let's say there's a table recipes
(recipeID, author, timeNeeded, yummyFactor, ...)
and a table ingredients
(recipeID, name, amount, yummyContributionFactor, ...)
Now I'd like to query like 200 recipes with their ingredients. I see the following possibilities:
Query all recipes, then query the ingredients for each recipe.
This would of course result in maaany queries.
Query all recipes and their ingredients in one big joined list. This will cause a lot of useless traffic, because every recipe's data will be transmitted multiple times.
Query all recipes, then query all the ingredients at once by passing the list of recipeIDs back to the database. Alternatively, issue both queries at once and return multiple resultsets. Back in the DAL, associate the ingredients to the recipes by their recipeID.
Exotic way: cursor through all recipes and return, for each recipe, two separate resultsets for recipe and ingredients. Is there a limit on resultsets?
For more variety, the recipes can be selected by a list of IDs from the DAL or by some parametrized SQL condition.
Which one do you think has the best performance/mess ratio?
If you only need to join two tables and an "ingredient" isn't a huge amount of data, the best balance of performance and maintainability is likely to be a single joined query. Yes, you are repeating some data in the results, but unless you have 100,000 rows and it's overloading the database server/network, it's too soon to be optimizing.
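For the two-table case here, a minimal sketch of that single joined query (using the column names from the question) would be:
-- One round trip; recipe columns are simply repeated on every ingredient row.
SELECT r.recipeID, r.author, r.timeNeeded, r.yummyFactor,
       i.name, i.amount, i.yummyContributionFactor
FROM recipes AS r
JOIN ingredients AS i ON i.recipeID = r.recipeID
ORDER BY r.recipeID;
The DAL can then start a new recipe object each time recipeID changes while reading the rows.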
The story is a little bit different if you have many layers of joins each with decreasing cardinality. For example, in one of my apps I have something like the following:
Event -> EventType   -> EventCategory
                     -> EventPriority
      -> EventSource -> EventSourceType -> Vendor
A query like this results in a significant amount of duplication which is unacceptable when there are 100k events to retrieve, 1000 event types, maybe 10 categories/priorities, 50 sources, and 5 vendors. So in that case, I have a stored procedure that returns multiple result sets:
All 100k Events with just EventTypeID
The 1000 EventTypes with CategoryID, PriorityID, etc. that apply to these Events
The 10 EventCategories and EventPriorities that apply to the above EventTypes
The 50 EventSources that generated the 100k events
And so on, you get the idea.
Because the cardinality goes down so drastically, it is much quicker to download only what is needed here and use a few dictionaries on the client side to piece it together (if that is even necessary). In some cases the low-cardinality data may even be cached in memory and never retrieved from the database at all (except on app start or when the data is changed).
The determining factors in using an approach such as this are a very high number of results and a steep decrease in cardinality for the joins, in other words fanning in. This is actually the reverse of most usages and probably the reverse of what you are doing here. If you are selecting "recipes" and joining to "ingredients", you are probably fanning out, which can make this approach wasteful, especially if there are only two tables to join.
So I'm just putting it out there that this is a possible alternative if performance becomes an issue down the road; at this point in your design, before you have real-world performance data, I would simply go the route of using a single joined result set.
The best performance/mess ratio is 42.
On a more serious note, go with the simplest solution: retrieve everything with a single query. Don't optimize before you encounter a performance issue. "Premature optimization is the root of all evil" :)
One stored proc that returns 2 datasets: "recipe header" and "recipe details"?
This is what I'd do if I needed the data all at once in one go. If I don't need it in one go, I'd still get 2 datasets but with less data.
We've found it slightly easier to work with this in the client rather than one big query as Andomar suggested, but his/her answer is still very valid.
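A hedged T-SQL sketch of that shape (the procedure name and the @minYummyFactor filter are invented; the tables and columns come from the question):
CREATE PROCEDURE dbo.GetRecipesWithIngredients
    @minYummyFactor int
AS
BEGIN
    SET NOCOUNT ON;

    -- Result set 1: the recipe headers.
    SELECT recipeID, author, timeNeeded, yummyFactor
    FROM recipes
    WHERE yummyFactor >= @minYummyFactor;

    -- Result set 2: only the ingredients belonging to those recipes.
    SELECT i.recipeID, i.name, i.amount, i.yummyContributionFactor
    FROM ingredients AS i
    WHERE i.recipeID IN (SELECT recipeID
                           FROM recipes
                          WHERE yummyFactor >= @minYummyFactor);
END;
On the client, read the first result set, call SqlDataReader.NextResult(), and match the ingredients to the recipes by recipeID.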
I would look at the bigger picture - do you really need to retrieve ingredients for 200 recipes? What happens when you have 2,000?
For example, if this is in a web page I would have the 200 recipes listed (if not less because of paging), and when the user clicked on one to see the ingredient then I would get the ingredients from the database.
If this isn't doable, I would have 1 stored proc that returns one DataSet containing 2 tables. One with the recipes and the second with the list of ingredients.
"I'm currently hand-writing a DAL in C#..." As a side note, you might want to check out the post: Generate Data Access Layer Methods From Stored Procs. It can save you a lot of time.

Help determining Maintenance Item table structure best practice

I have two different existing database designs in the legacy apps I inherited. On one side they created two tables: tblAdminMaintCategory and tblAdminMaintItems. On the other side, individual tables for everything: tblAdminPersonTitle and tblAdminRace.
Method #1
Now in the first example there would be an entry in tblAdminMaintCategory for Race (say its ID is 2), and then in tblAdminMaintItems each individual race would have an entry with the corresponding CategoryID. Then to query it for race options, for example, you would go --> SELECT * FROM tblAdminMaintItems WHERE CategoryID = 2
Method #2
Now in the second example there would just be an entry in tblAdminRace for each individual race option. To query that you would go --> SELECT * FROM tblAdminRace.
Now, I am trying to figure out, going forward, which of these paths I want to follow in my own apps. I don't like that the first method, seemingly, introduces magic numbers. I don't like that the second method introduces many, many small tables, but I am not sure that is the END OF THE WORLD!!
I am curious as to others opinions on how they would proceed or how they have proceeded. What worked for you? What are some reasons I shouldn't use one or the other?
Thanks!
It is a reasonable design to have separate entities in different tables, e.g. Race, Car, Person, Location, Maintenance task, Maintenance schedule. If you don't like joins in queries, simply design several views. I may have misunderstood the example, but here is one suggestion.
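As one illustration of the views idea (tblPerson and every column name below are assumptions, not taken from the original schema):
-- A view that joins the small lookup tables in once, so everyday queries don't repeat the joins.
CREATE VIEW vwPerson AS
SELECT p.PersonID,
       p.LastName,
       t.TitleDescription,
       r.RaceDescription
FROM tblPerson AS p
JOIN tblAdminPersonTitle AS t ON t.TitleID = p.TitleID
JOIN tblAdminRace AS r ON r.RaceID = p.RaceID;
Consumers then just do SELECT * FROM vwPerson and never see the lookup keys.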
Magic numbers always seem like a bad idea, specifically where you can use more descriptive codes in tables that do not have that many entries (Titles, Races, etc.).
On the other hand, having a gazillion small tables to link to not only makes the queries more difficult to maintain, but also harder to read, with more joins to parse.
EDIT:
Smaller reference tables will make maintenance easier: change in one place, change it everywhere. But it will definitely clutter up your table structure designer, and forgetting to populate these metadata tables will give you a lot of issues at new install sites.
