Extract data from Word documents with SSIS to ETL into SQL - sql-server

I could really use some help with extracting data from Word documents using SSIS and inserting the extracted data into SQL Server. There are 10,000 to 13,000 Word files to process, and their layout most likely isn't consistent over the years. Any help is greatly appreciated!
Below is the example data from the Word documents that I'm interested in capturing. Note that Date and Job No are in the Header section.
Customer : Test Customer
Customer Ref. : 123456
Contact : Test Contact
Part No. : 123456789ABCDEFG
Manufacturer : Some Mfg.
Package : 123-456
Date Codes : 1234
Lot Number : 123456
Country of Origin : Country
Total Incoming Qty : 1 pc
XRF Test Result : PASS
HCT Result : PASS
Solder Test Result : PASS

My approach would be this:
Create a Python script that extracts the data from the Word files and saves it in XML or JSON format.
Create an SSIS package that loads the data from each XML/JSON file into SQL Server.
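As a rough illustration of the first step, here is a minimal Python sketch that turns "Label : Value" lines like the ones in the question into a JSON record. The function name parse_fields and the regular expression are assumptions made for this example; real documents that vary over the years will probably need extra rules.

```python
import json
import re

# Matches lines such as "Customer Ref. : 123456" (label, colon, value).
# The label is everything before the first colon.
FIELD_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.*?)\s*$")

def parse_fields(text):
    """Return a dict of label -> value for every 'Label : Value' line."""
    record = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            record[match.group(1)] = match.group(2)
    return record

# Sample lines copied from the question.
sample = """Customer : Test Customer
Customer Ref. : 123456
Part No. : 123456789ABCDEFG
XRF Test Result : PASS"""

record = parse_fields(sample)
print(json.dumps(record, indent=2))
```

Each Word file would then produce one JSON document that an SSIS package (or a staging table) can pick up.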

1. Using a script component as a source
To import data from Microsoft Word into SQL Server, you can use a script component as a data source where you can implement a C# script to parse document files using Office Interoperability libraries or any third-party assembly.
Example of reading tables from a Word file
2. Extracting XML from DOCX file
A DOCX file is a ZIP archive composed of several embedded files, and the text is mainly stored in an XML file inside it. You can use a Script Task or an Execute Process Task to extract the DOCX file content, then use an XML Source to read the data.
How can I extract the data from a corrupted .docx file?
How to extract just plain text from .doc & .docx files?
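To make the DOCX structure concrete, here is a hedged Python sketch that opens the archive with the standard library and collects paragraph text from the header parts and from word/document.xml. The function names are made up for this example, and it assumes the header parts follow the usual word/header1.xml naming. It only works for .docx files; legacy binary .doc files are not ZIP archives.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml and the header parts.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def part_paragraphs(xml_bytes):
    """Return the text of each non-empty paragraph (w:p) in one XML part."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            lines.append(text.strip())
    return lines

def docx_paragraphs(path):
    """Collect paragraph text from the headers, then the body, of a .docx."""
    lines = []
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        # Assumption: header parts are named word/header1.xml, header2.xml, ...
        headers = sorted(
            n for n in names
            if n.startswith("word/header") and n.endswith(".xml")
        )
        for name in headers + ["word/document.xml"]:
            lines.extend(part_paragraphs(archive.read(name)))
    return lines
```

Reading the header parts matters here because the question notes that Date and Job No live in the header section rather than in the document body.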
3. Converting the Word document into a text file
The third approach is to convert the Word document into a text file and use a flat-file connection manager to read the data.
convert a word doc to text doc using C#
Converting a Microsoft Word document to a text file in C#

Related

How to Extract .owl and save to mysql

I have a file, ontobible.owl. How can I extract its contents and save the data to MySQL (because I want to display data from ontobible.owl on a website)? Can anyone help me?
edited:
Here is my ontobible.owl file (https://teamtrainit.com/ontobible.owl)
I've tried opening ontobible.owl with Sublime Text 3 and it contains this:
<Verse rdf:about="http://www.semanticweb.org/budsus/ontologies/2021/7/ontobible#HOS5_2">
<verseID>HOS5_2</verseID>
<verse_text>And the revolters are profound to make slaughter, though I have been a rebuker of them all.</verse_text>
</Verse>
<Verse rdf:about="http://www.semanticweb.org/budsus/ontologies/2021/7/ontobible#2CH2_1">
<hasPerson rdf:resource="http://semanticbible.org/ns/2006/NTNames#god_1324"/>
<hasPerson rdf:resource="http://www.co-ode.org/roberts/family-tree.owl#solomon_2762"/>
<verseID>2CH2_1</verseID>
<verse_text>And Solomon determined to build an house for the name of the LORD, and an house for his kingdom.</verse_text>
</Verse>
How can I convert these XML tags to an array or JSON so I can save them to a MySQL database?
You have several options for extracting data from OWL:
Use the OWL API and write Java code (I think the OWL API is accessible from other languages) to extract the data and pack it in the format you need. You can also use SPARQL queries to extract data via the Jena API.
Install Protégé, open your file in Protégé, and save it in JSON-LD format. This format is very similar to regular JSON, and you can easily transform it for your needs.
Install a Fuseki server, add your file, and extract the data from there using SPARQL queries.
I think the second option is the easiest to start with if you don't want to write queries or code, and it won't take long.
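If the file really is RDF/XML shaped like the excerpt in the question, a plain XML parser may also be enough for a first pass. The sketch below is not one of the options listed above; it is a simplified alternative using only the Python standard library. It matches elements by local name so it does not depend on which namespace the file declares, and the function names are invented for this example.

```python
import json
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def extract_verses(owl_xml):
    """Return one dict per <Verse> element found in the RDF/XML text."""
    root = ET.fromstring(owl_xml)
    verses = []
    for elem in root.iter():
        if local_name(elem.tag) != "Verse":
            continue
        row = {"about": elem.get("{%s}about" % RDF_NS), "persons": []}
        for child in elem:
            name = local_name(child.tag)
            if name == "hasPerson":
                # hasPerson carries its value in the rdf:resource attribute.
                row["persons"].append(child.get("{%s}resource" % RDF_NS))
            else:
                # verseID, verse_text and similar carry their value as text.
                row[name] = (child.text or "").strip()
        verses.append(row)
    return verses

def verses_to_json(owl_xml):
    """Serialize the extracted verses as a JSON array."""
    return json.dumps(extract_verses(owl_xml), indent=2)
```

Each resulting dict maps naturally to one row in a verses table, with the persons list going to a separate link table. This ignores OWL semantics entirely, so the OWL API or SPARQL routes remain the safer choice for anything beyond flat extraction.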

SSIS - Create Summary Output File

I would like to use SSIS to perform transformations on multiple files (CSV, Excel) that come from various data sources; the output should always be CSV files in a certain structure.
One of the requirements after performing the transformation steps is to create an output summary file (manifest file) about the results of the process, in the following structure.
BATCH_ID|EXTRACTED_FILE_NAME|MODEL_TYPE|RECORD_COUNT|TOTAL_QTY|GENERATED_ON_TE|CONTENTS_FROM_DATE|CONTENTS_TO_DATE|WORKSET_ID|FILE_STATUS|FILE_STATUS_TS
000005|NSL_B_YFRCARRAB0_PRODUCT_MASTER_20171122.txt|B|829||20171122121525|||||
Important columns:
Batch ID: ID of run
EXTRACTED_FILE_NAME: Name of created CSV file by SSIS (output file)
RECORD_COUNT: Number of rows in output file
TOTAL_QTY: SUM of column QTY
GENERATED_ON_TE: When the file was generated
STATUS_TS: Status - OK / FAIL
Is this output possible to achieve in SSIS? Can I create it without using a script component? If I have to use a script component, can you guide me a little?
Many thanks,
Martin!

SSIS error handling: redirect rows that have zip code field more than 5 from a flat file

I have been given the task of loading a simple flat file into another using an SSIS package. The source flat file contains a zip code field. My task is to load only the rows with a correct (5-digit) zip code into the destination flat file and redirect the invalid rows to a new file.
Since I am new to SSIS, any help or ideas are much appreciated.
You can add a Derived Column that computes the length of the field, then add a Conditional Split based on that column: a length <= 5 goes down the good path and a length > 5 goes down the reject path.
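The routing logic itself is small. Here is a Python sketch of the same rule, outside SSIS, just to show what the derived column and the split compute. The field name "zip" and the function names are hypothetical; in the package the length would come from an expression on your own column.

```python
def zip_length(row, field="zip"):
    """Derived column: length of the zip code field, ignoring padding."""
    return len((row.get(field) or "").strip())

def split_rows(rows, field="zip", max_len=5):
    """Conditional split: length <= max_len is good, anything longer is rejected."""
    good, reject = [], []
    for row in rows:
        if zip_length(row, field) <= max_len:
            good.append(row)
        else:
            reject.append(row)
    return good, reject

rows = [{"zip": "12345"}, {"zip": "12345-6789"}, {"zip": " 54321 "}]
good, reject = split_rows(rows)
```

Note that this follows the length rule from the answer, so values shorter than five characters also go down the good path. If only exactly five digits should pass, the condition would need to be tightened accordingly.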

Reading and writing to xls and doc files in c

I have a problem where I have to write a C program that reads numerical data from a text file. The data is tab-delimited. Here is a sample from the text file.
1 23099 345565 345569
2 908 66766 66768
This is client data, one row per client. The columns are customer no., previous balance, previous reading, and current reading. I then have to generate a document that summarizes all this information and calculates the balance. I can write a function that does the calculation, but how do I create an XLS document and a Word document that summarize the results from within the program? The text file contains only numerical data. Any ideas?
The easiest way is to create a CSV file rather than an XLS file.
Office can open CSV files with good results,
and it is much easier to create an ASCII text file with comma-separated values
than to write a closed format such as the MS Office formats.
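To show how little is needed for the CSV route, here is a short sketch. It is written in Python for brevity rather than C, but the same loop translates directly to fscanf and fprintf. The "usage" column (current reading minus previous reading) is an assumption added for illustration; the question does not give the actual balance formula.

```python
import csv
import io

def readings_to_csv(tab_text):
    """Convert whitespace/tab-delimited client rows into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["customer_no", "previous_balance",
                     "previous_reading", "current_reading", "usage"])
    for line in tab_text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        customer, balance, prev, curr = (int(f) for f in fields)
        # Hypothetical derived value: units consumed since the last reading.
        writer.writerow([customer, balance, prev, curr, curr - prev])
    return out.getvalue()

# Sample rows from the question.
print(readings_to_csv("1\t23099\t345565\t345569\n2\t908\t66766\t66768\n"))
```

Saving that text with a .csv extension gives a file that a spreadsheet application can open directly.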
The simplest way to create a spreadsheet that contains formulas and formatting, and can be opened by Excel, is to create an XML Spreadsheet file.

components of an SPSS project

I have given some data in an Excel sheet to a third party for SPSS data processing. After the processing is complete, what files should I get back from them?
I have received one file with a ".sav" extension. I presume this file contains the imported data (from my Excel file).
I have also received documents (.rtf, rich text format) containing only the charts and graphs. Is there anything else I need to get so that I can use the files later for further analysis?
Thanks in advance
V Karthick
Yes, the ".sav" file is the data file. You should also request the syntax file(s), which have the ".sps" extension. The syntax file is a record of all the data transformations that have been performed and allows you to review their work. It can be opened with Notepad or any text editor.
Arthur
