create a star schema database from an xml file - sql-server

Is it possible to create a Star Schema Database from data available in an XML file or a text file?
Your valuable inputs appreciated.
Thanks in advance.
Regards,
Sam.

It's a very vague question, but you can pretty much describe anything logical with an XML file. XML is "extensible", so you can make it describe nearly any set of data in any form as long as there is consistency.

I've done both. I wrote some JavaScripts that parse files and create records in an Oracle database. One script processed the Tycho-2 data from a formatted text file; I've only processed two of the 20 text files (250,000 stars) because of their size and because I'm doing this on my laptop. A similar script created a table for the Hipparcos data (118,000 stars). The one I just finished processed the data from a file called "Stars - 2300+6000-2.xml." That file contains some stars with names, and I created a table of the named stars that matched the Hipparcos stars. Alas, only 92 of them showed up.
For the XML I used the Microsoft.XMLDOM interface, and to add the data to Oracle I used ADODB.Connection. The XML script also used ADODB.Recordset, because I had to run queries against the Hipparcos table to find the Hipparcos number (where there was a match) for each star in the XML file.
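Not those scripts, but a rough Python sketch of the same flow, in case it helps: parse the XML, keep only named stars that match a row in an already-loaded Hipparcos table, and insert them. The element and attribute names (star, name, hip), the table and column names, and the use of sqlite3 in place of Oracle/ADODB are all assumptions to keep the sketch self-contained.

import sqlite3                               # stand-in for the Oracle connection
import xml.etree.ElementTree as ET           # stand-in for Microsoft.XMLDOM

conn = sqlite3.connect("stars.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS named_stars (hip INTEGER, name TEXT)")

tree = ET.parse("Stars - 2300+6000-2.xml")   # the XML catalogue
for star in tree.getroot().iter("star"):     # hypothetical element name
    name = star.get("name")
    hip = star.get("hip")                    # hypothetical attribute
    if not (name and hip):
        continue
    # keep only stars that also exist in the (already loaded) Hipparcos table
    cur.execute("SELECT 1 FROM hipparcos WHERE hip = ?", (hip,))
    if cur.fetchone():
        cur.execute("INSERT INTO named_stars (hip, name) VALUES (?, ?)", (hip, name))

conn.commit()
conn.close()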
If that is what you're looking for, let me know and I can send the JavaScripts.
Frank Perry, MSEE

Related

SSIS: add non-existent column to a CSV source

I am loading a large set (tens of thousands) of CSV files into a single staging SQL Server table, using the standard SSIS approach.
The vast majority of the source CSV files have an identical column structure (order, set of columns, data types). There are around 140 columns altogether.
However, in some (<1%) cases a source file will be lacking certain columns (I know exactly which columns they are, and there are three possible combinations of missing columns). This is by design, i.e. it is a valid business scenario (meh).
Can I somehow create a "virtual" column (filled with NULL/empty/blank values) for a source CSV connection if (and only if) that column does not exist in the physical source CSV file?
I know I can read the CSV header with a C# scripting component, create multiple source connections, and redirect to the right data flow based on the existence (or absence) of certain columns, but I am hoping for a more "elegant" solution: a single CSV data source "smart" enough to "artificially" add the blank columns that are missing from the source file.
For simplicity let's assume that the full column set is:
ID;C1;C2;C3
And that C3 is missing occasionally i.e. some CSV files are:
ID;C1;C2
Any hints welcome.
No, there is no "smart" CSV data source built in to SSIS.
You are certainly going to need to use a script component, but instead of using a Script Task outside the dataflow that directs the control flow to the correct dataflow, you can simply create one dataflow that has a script component as the data source. The script component reads the CSV that is currently being imported, and if the column in question is missing, it supplies it with NULL or default values.
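This is not an SSIS feature, but a minimal Python sketch of the logic such a script-component source implements: read each CSV and, when an expected column is absent, emit it anyway with NULL values. The column names follow the ID;C1;C2;C3 example from the question; the file name is made up.

import csv

EXPECTED = ["ID", "C1", "C2", "C3"]   # full column set; C3 is sometimes missing

def read_with_padding(path):
    # yields every row with all expected columns; absent ones come back as None (NULL)
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter=";"):
            yield [row.get(col) for col in EXPECTED]

for values in read_with_padding("some_source_file.csv"):   # hypothetical file name
    print(values)   # in the real package these rows go to the staging-table buffer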

Best way to move data from SQL to file system (FileStream?)

At the moment I have a SQL Server Express instance with a database nearing the 10gig limit. I have a table that stores details about activities, one column being "Note" which can be up to 3000 characters long.
I would like to take this table and convert the "Note" column to use the file system so that it frees up some database space. I also need this column to allow full-text searching. I've pored over all the documents on FileStream that I can find, but I still can't figure out the best way to do this.
My main question is how to get my existing data (nvarchar(3000)) moved over to this new FileStream column while still allowing full-text searches to work on it. Everything I've read states that you have to have a column that denotes the file extension being used -- but I don't know how to get my data into the new field as a specific file type. I'd be fine using .txt as the type, if I could figure out how.
Any ideas or best practices would be appreciated. I'm not hung up on the FileStream, so I'd take any other recommendations as well. My company is too cheap to buy SQL Standard, so I'd l
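Since the question is open to alternatives to FileStream, here is a hedged Python sketch of the simplest "move it out of the database" idea: write each Note to a .txt file and keep only the path in the table. The table and column names (Activity, ActivityID, Note, NotePath), the folder, and the connection string are assumptions, and full-text search over the exported files is not covered here.

import os
import pyodbc   # assumes an ODBC driver for SQL Server is installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDb;Trusted_Connection=yes")
cur = conn.cursor()
os.makedirs(r"C:\Notes", exist_ok=True)

rows = cur.execute(
    "SELECT ActivityID, Note FROM Activity WHERE Note IS NOT NULL").fetchall()
for activity_id, note in rows:
    path = os.path.join(r"C:\Notes", f"{activity_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(note)                 # the note text moves to the file system
    cur.execute("UPDATE Activity SET NotePath = ? WHERE ActivityID = ?",
                path, activity_id)    # only the path stays in the database

conn.commit()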

Need to map csv file to target table dynamically

I have several CSV files, and each has a corresponding table in the database with the same name and the same columns as the CSV (with appropriate data types). So every CSV has a table in the database.
I somehow need to map them all dynamically: once I run the mapping, the data from all the CSV files should be transferred to the corresponding tables. I don't want to have a different mapping for every CSV.
Is this possible through Informatica?
Appreciate your help.
PowerCenter does not provide such a feature out of the box. Unless the structures of the source files and target tables are the same, you need to define separate source/target definitions and create mappings that use them.
However, you can use Stage Mapping Generator to generate a mapping for each file automatically.
My understanding is that you have many CSV files with different column layouts and you need to load them into the appropriate tables in the database.
Approach 1: Any RDBMS you use should have some kind of import option. Explore that route to create tables based on the CSV files. This is a manual task.
Approach 2: Open the CSV file and write formulas using the header to generate a CREATE TABLE statement. Execute the formula result in your DB, so the tables get created. Now use Informatica to read the CSVs, import the table definitions, and load the data into the tables.
Approach 3: Using Informatica alone, you need to do a lot of coding to create a dynamic mapping on the fly.
Proposed solution:
Mapping 1:
1. Read the CSV file and pass the header information to a Java transformation.
2. The Java transformation should normalize and split the header column into rows; you can write them to a text file.
3. Now you have all the columns in a text file. Read this text file and use a SQL transformation to create the tables in the database.
Mapping 2:
Now that the table is available, read the CSV file (excluding the header) and load the data into the table created by mapping 1 via a SQL transformation (INSERT statement).
You can follow this approach for all the CSV files. I haven't tried this solution at my end, but I am sure the above approach would work; a standalone sketch of the same idea follows.
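As a rough illustration of those two mappings outside Informatica, here is a hedged Python sketch: mapping 1 corresponds to turning the header into a CREATE TABLE statement, mapping 2 to loading the remaining rows. SQLite, the file names, and the all-TEXT columns are assumptions made only to keep the example self-contained.

import csv
import os
import sqlite3

def load_csv(path, conn):
    table = os.path.splitext(os.path.basename(path))[0]   # table named after the file
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                              # "mapping 1": header -> DDL
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(                                  # "mapping 2": rows -> inserts
            f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()

conn = sqlite3.connect("staging.db")
for path in ["customers.csv", "orders.csv"]:               # hypothetical file names
    load_csv(path, conn)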
If you're not using any transformations, it's wise to use the import option of the database (e.g. a BTEQ script in Teradata). But if you are doing transformations, then you have to create as many sources and targets as the number of files you have.
On the other hand, you can achieve this in one mapping:
1. Create a separate flow for every file (i.e. Source-Transformation-Target) in the single mapping.
2. Use the target load plan to choose which file gets loaded first.
3. Configure the file names and corresponding database table names in the session for that mapping.
If all the mappings (should you have to create them separately) are the same, use the indirect file method. In the session properties, under the Mapping tab's source options, you will find this setting; the default is Direct, change it to Indirect. An indirect source reads a file list, i.e. a text file containing one source file path per line, all sharing the same structure.
I don't have the tool at hand right now to explore further and guide you precisely, but look into this indirect file load type in Informatica. I am sure it will solve the requirement.
I have written a workflow in Informatica that does this, but some of the complex steps are handled inside the database. The workflow watches a folder for new files. Once it sees all the files that constitute a feed, it starts to process the feed: it takes a backup in a time-stamped folder and then copies all the data from the files in the feed into an Oracle table. An Oracle procedure then gets to work and transfers the data from that table into the corresponding destination staging tables and finally the data warehouse. So if I have to add a new file or a feed, I only have to make changes in configuration tables; no changes are required to either the Informatica objects or the DB objects. So the short answer is yes, this is possible, but it is not an out-of-the-box feature.
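The real implementation described above lives in Informatica plus Oracle procedures; as a hedged sketch of just the configuration-driven, folder-watching part, something like the following Python could stand in. The folder names, feed patterns, and staging-table mapping are assumptions, and the actual load is left as a stub.

import shutil
import time
from datetime import datetime
from pathlib import Path

INCOMING = Path("incoming")           # folder the workflow watches
BACKUP = Path("backup")

# configuration, not code, decides where each feed goes: adding a new file or feed
# means adding an entry here (or a row in a config table), not changing the loader
FEED_CONFIG = {"sales_*.csv": "STG_SALES", "stock_*.csv": "STG_STOCK"}

def load_into(staging_table, path):
    # placeholder: the workflow above hands this step to an Oracle table and a
    # stored procedure that moves the data on to the warehouse
    print(f"would load {path} into {staging_table}")

def process_once():
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    for pattern, staging_table in FEED_CONFIG.items():
        for path in sorted(INCOMING.glob(pattern)):
            backup_dir = BACKUP / stamp
            backup_dir.mkdir(parents=True, exist_ok=True)
            moved = shutil.move(str(path), str(backup_dir / path.name))  # time-stamped backup
            load_into(staging_table, moved)

while True:                            # crude folder watcher: poll once a minute
    process_once()
    time.sleep(60)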

SQL2008 Integration Services - Loading CSV files with varying file schema

I'm using SQL Server 2008 to load sensor data into a table with Integration Services. I have to deal with hundreds of files. The problem is that the CSV files all have slightly different schemas. Each file can have a maximum of 20 data fields, and all files draw from this common set of fields: some files have all of the fields, others have only some of them. In addition, the order of the fields can vary.
Here's an example of what the file schemas look like:
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,RD_1,SH_1,CL_2
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,WS_1,WD_1,WSM_1,WDM_1,SH_1
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,RS_1,RI_1,PR_1,RD_1,WS_1,WD_1,WSM_1,WDM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,PW_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,WS_1,WD_1,WSM_1
I'm using a Script Component in the Data Flow to process the data via CreateNewOutputRows() and MyOutputBuffer.AddRow(). I have a working package to load the data; however, it isn't reliable or robust, because as I add more files the package fails whenever a file's schema has not been defined in CreateNewOutputRows().
I'm looking for a dynamic solution that can cope with the variation in the file schemas. Does anyone have any ideas?
Who controls the data model for the output of the sensors? If it's not you, do they know what they are doing? If they create new and inconsistent models every time they invent a new sensor, you are pretty much up the creek.
If you can influence or control the evolution of the schemas for the CSV files, try to come up with a top-level data architecture. In the bad old days before there were databases, files made up of records often had, as the first field of each record, a "record type". CSV files could be organized the same way: the first field of every record could indicate what type of record you are dealing with. When you get an unknown type, put it in a "bad input" file until you can maintain your software (a sketch of this convention follows this answer).
If that isn't dynamic enough for you, you may have to consider artificial intelligence, or looking for a different job.
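A small Python sketch of that "record type in the first field" convention: each row declares its own layout, known types are dispatched to a handler, and unknown types are parked in a bad-input file until the software is maintained. The record-type codes and file names are made up.

import csv

HANDLERS = {
    "T1": lambda fields: print("temperature record:", fields),   # assumed record types
    "W1": lambda fields: print("wind record:", fields),
}

with open("sensor_data.csv", newline="") as src, \
     open("bad_input.csv", "w", newline="") as bad:
    bad_writer = csv.writer(bad)
    for row in csv.reader(src):
        if not row:
            continue
        record_type, fields = row[0], row[1:]
        handler = HANDLERS.get(record_type)
        if handler:
            handler(fields)
        else:
            bad_writer.writerow(row)      # unknown type goes to the "bad input" file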
Maybe the command line is an option: from the command line you can use SQL Server's CSV import (for example, bcp or BULK INSERT).
If the CSV files that have identical formats share a file name convention, or if they can be separated out in some fashion, you can use a ForEach Loop Container for each file schema type.
One possible way to separate out the CSV files is to run a script (in VB) in SSIS that reads the first row of each CSV file, checks which of the differing schemas it has (assuming the column names are in the first row), and then moves the file to the appropriate folder for use in its ForEach Loop Container.
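The answer above suggests a VB script inside SSIS; as a hedged illustration, the same header-sniffing idea looks like this in standalone Python: read the first row of each CSV, derive a schema key from it, and move the file into a per-schema folder so each ForEach Loop Container only ever sees one layout. The folder names are assumptions.

import csv
import hashlib
import shutil
from pathlib import Path

SOURCE = Path("incoming")     # where the mixed-schema CSVs arrive
SORTED = Path("sorted")       # one sub-folder per distinct header

for path in SOURCE.glob("*.csv"):
    with open(path, newline="") as f:
        header = next(csv.reader(f))                     # first row = column names
    schema_key = hashlib.md5(",".join(header).encode()).hexdigest()[:8]
    target_dir = SORTED / f"schema_{schema_key}"
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target_dir / path.name))  # identical headers end up together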

import CSV into SQLite WITHOUT a table schema

I know that I can import .csv file into a pre-existing table in a sqlite database through:
.import filename.csv tablename
However, is there a method/library that can automatically create the table (and its schema), so that I don't have to manually define column1 = string, column2 = int, etc.?
Or maybe we can import everything as strings; to my limited understanding, sqlite3 seems to treat all fields as strings anyway?
Edit:
The names of the columns are not so important here (assume we can get them from the first row of the CSV file, or that they could be arbitrary names). The key is to identify the value types of each column.
This seems to work just fine for me (in sqlite3 version 3.8.4):
$ echo '.mode csv
> .import data_with_header.csv some_table' | sqlite3 db
It creates the table some_table with field names taken from the first row of the data_with_header.csv file. All fields are of type TEXT.
You said yourself in the comments that it's a non-trivial problem to determine the types of columns. (Imagine a million rows that all look like numbers, but one of those rows has a Z in it -- now that column has to be typed as "string".)
Though non-trivial, it's also pretty easy to get the 90% scenario working. I would just write a little Python script to do this. Python has a very nice library for parsing CSV files, and its interface to SQLite is simple enough.
Just load the CSV, guess and check the column types, devise a CREATE TABLE that encapsulates this information, then emit your INSERT INTOs. I can't imagine this taking more than 20 lines of Python.
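A hedged sketch of that "little Python script" (not a definitive implementation): guess each column's type from the data, with a single non-numeric or empty value demoting the whole column to TEXT, then build the CREATE TABLE and the inserts. The file and table names match the earlier example.

import csv
import sqlite3

def guess_type(values):
    # try INTEGER, then REAL; any value that fails the cast (a stray "Z",
    # an empty field) demotes the whole column to TEXT
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in values:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"

with open("data_with_header.csv", newline="") as f:
    rows = list(csv.reader(f))
header, data = rows[0], rows[1:]

types = [guess_type([r[i] for r in data]) for i in range(len(header))]
cols = ", ".join(f'"{name}" {sql_type}' for name, sql_type in zip(header, types))

conn = sqlite3.connect("db")
conn.execute(f'CREATE TABLE IF NOT EXISTS some_table ({cols})')
conn.executemany(
    f'INSERT INTO some_table VALUES ({", ".join("?" for _ in header)})', data)
conn.commit()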
This is a little off-topic, but it might help to use a tool that gives you full SQL functionality on an individual CSV file without actually using SQLite directly.
Take a look at TextQL, a utility that allows querying CSV files directly using an in-memory SQLite engine:
https://github.com/dinedal/textql
textql -header -sql "select * from tbl" -source some_file.csv
