Importing loosely structured data into a database

I get daily data feeds with data that is only loosely structured. I need to import it into a database so I can run a report that finds new records and changes to existing records.
The data looks like this:
--------------------------------
blah:
foo
bar
lorum: ipsum
dolor: sit
foo: bar
bar: foo
123-555-1212
Lorum / Ipsum / Dolor / Sit
Foo / Bar
--------------------------------
As you can see, there are some field headings like "blah", "lorum", etc., but some data lacks a heading, like the phone number or the slash-delimited list. And some headings are on the same line as their values, while others are not.
Just to keep us on our toes, the records do not have the same number of fields.
So I'm thinking the parser needs at least three rules to handle the data, like:
if "heading:$" then grab the following lines until the next "*.:" is read
and
grab "heading: value"
and
if a line starts with a number, assume a heading of "phone"
and
if a line contains a slash-delimited list, assume a heading of "features" until the next "--------..."
But I have no idea how to start coding something like this. The language is open at this point, although I have to run the code on macOS.
I suppose Perl might be good for this, but I have very poor Perl-fu.
Don't even know where to start with this one.

You always need to assume something about your text; otherwise you have an exercise in NLP.
Can we assume that the non-key/value part is at the end? If so, the following regexes will help you:
# split the text into records (separator lines of dashes):
my @records = split /^-+$/m, $text;
# this will find a key/value pair that has another key/value pair after it:
qr/\A(\w+):(.*?)(?=\n\w+:)/ms
# then the last key/value, which probably must be one line:
qr/^(\w+):(.*)/
I recommend that after each successful match you remove the matched text and continue.
Other useful assumptions: that the phone number can appear only once in a record (and not as part of another key/value pair), and that the tags come at the end.
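Putting those pieces together, here is a minimal Perl sketch of that match-and-remove loop, combining the regexes above with the "phone" and "features" fallbacks from the question. The heuristics are only illustrative; tune them against your real feed before loading anything into the database:

#!/usr/bin/perl
use strict;
use warnings;

my $text = do { local $/; <> };    # slurp the whole feed

for my $rec (grep /\S/, split /^-+$/m, $text) {
    my %record;

    # key/value pairs that are followed by another pair
    # (the value may span several lines)
    while ($rec =~ s/\A\s*(\w+):(.*?)(?=\n\w+:)//ms) {
        my ($key, $val) = ($1, $2);
        $val =~ s/^\s+|\s+$//g;    # trim
        $record{$key} = $val;
    }

    # the last pair, which must fit on one line
    if ($rec =~ s/^(\w+):[ \t]*(.*)$//m) {
        $record{$1} = $2;
    }

    # fallback heuristics for the unlabeled leftovers
    for my $line (grep /\S/, split /\n/, $rec) {
        if ($line =~ /^[\d-]+$/) {
            $record{phone} = $line;
        }
        elsif ($line =~ m{ / }) {
            push @{ $record{features} }, $line;
        }
    }

    # %record now holds one parsed record, ready for the database
}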

Replace All and reduce text string

I have Google Forms results from live workshops that I want to edit in Sheets. I then want to calculate averages, pivot etc. to explore the data for insights.
I want to clean and standardize the data. I'd like to replace a series of long strings and reduce them just to their leading number:
UPDATE: as requested, here is a view-only sample Sheet with raw data on the 1st tab and how I'd like it formatted on the 2nd tab:
https://docs.google.com/spreadsheets/d/1pP8YV3oJXWGt3-88qgzMuup1pm1IsIgD4SHDL09MXrc/edit?usp=sharing
Example:
I'm aware of certifications I should be starting.
Replace with
3
Example 2:
I'm currently progressing a powerful certification.
Replace with:
4
In Excel I would simply use * as a wildcard for the rest of the string, but Sheets appears to work differently. I've read documentation and posts about regular expressions etc., and I'm not sure whether that's overkill or how to proceed.
I'd THEN like to create a macro which does that for the whole sheet:
All strings which begin with 1)*
--> Replace with 1
All strings which begin with 2)*
--> Replace with 2
try:
=REGEXEXTRACT(A1; "^\d+")
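To cover the whole column at once (the macro-like behavior asked for above), one sketch is to wrap the extraction in ARRAYFORMULA and VALUE, assuming the raw responses sit in column A starting at row 2 and keeping the semicolon argument separator used above:

=ARRAYFORMULA(IF(A2:A=""; ; VALUE(REGEXEXTRACT(A2:A; "^\d+"))))

VALUE turns the extracted digits into real numbers, so averages and pivots treat them numerically; any response that doesn't start with a digit will surface as an error you can clean up by hand.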

Word wrap issues with SSIS Flat file destination

Background: I need to generate a text file with 5 records, each 1565 characters long. This text file is then used to feed the data into another piece of software.
There are some required fields and some optional fields. I created a query that concatenates all the fields into one single field, populating the optional fields with blanks.
For example:
Here is the sample input layout for each field:
Field CharLength Required
ID 7 Yes
Name 15 Yes
Address 15 No
DOB 10 Yes
Age 1 No
Information 200 No
IDNumber 13 Yes
I then generated a query that combines the above fields into a single row per unique ID, which looks like the following:
SELECT CAST(1 AS CHAR(7)) + CAST('XYZ' AS CHAR(15)) + CAST('' AS CHAR(15)) + CAST('22/12/2014' AS CHAR(10))
     + CAST('' AS CHAR(1)) + CAST('' AS CHAR(200)) + CAST('123456' AS CHAR(13))
UNION
SELECT CAST(2 AS CHAR(7)) + CAST('XYZ' AS CHAR(15)) + CAST('' AS CHAR(15)) + CAST('22/12/2014' AS CHAR(10))
     + CAST('' AS CHAR(1)) + CAST('' AS CHAR(200)) + CAST('123456' AS CHAR(13))
Then, I created an SSIS package to produce the output text file through Flat file destination delimited.
Problem:
Even though the flat file is generated at the desired length (1565), the text file looks different depending on whether word wrap is on or off. When word wrap is off, I get each record on a single line. If word wrap is on, the line is broken into multiple lines. The length of the record is the same in either case.
I even tried using VARCHAR plus SPACE() in the query instead of CHAR for each field, but with no success. It still breaks the line at the blank fields.
For example, for the Information field: CAST('' AS VARCHAR(1)) + SPACE(200 - LEN(CAST('' AS VARCHAR(1))))
Question: How do I make each record stay on a single line even when word wrap is on?
Since it's my first post, please excuse the formatting of the question.
The purpose of word wrap is to put characters on the next line when they overflow, rather than creating a document that scrolls far off to the right.
Word wrap is a feature of most text editors, word processors, and web browsers that breaks lines between words rather than within words, when possible.
Because this is what word wrap is, there's nothing you can do to change its behavior. What does it matter anyway? The document should still be parsed as you would expect. Just don't turn word wrap on.
As far as I'm aware, having word wrap on or off has no impact on the document itself, it's simply a presentation option.
Applications parsing a document parse it as if word wrap were off. Something that could throw off parsing is breaks for a new line, but that is a completely different thing from word wrap.
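If you want to convince yourself that the file itself is fine, a quick sanity check (a sketch; the file name is whatever your Flat File destination writes) is to print each record's length:

perl -lne 'print "record $.: ", length' output.txt

Every record should report 1565 (or 1566 if the file uses CRLF line endings, since -l strips only the linefeed); if so, any wrapping you see is purely a display setting.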

Using org-mode as a flat file database and sanitizing input

I'd like to use an org-mode file as a flat file database that can be edited both programmatically and by hand. An example follows showing a list of bookmarks.
* Somebody's blog :: I like org-mode
:url: http://somebody.com/org
** Quotation 1
:date: 2013/01/13 08:32:11 EST
Very interesting observations here.
** Quotation 2
:date: 2013/01/13 08:33:46 EST
A marvelous code snippet
* Man bites dog
:url: http://newssite.com/today
I'd like emacs or a webserver cgi-script or similar to edit such a file (in the example above, to add more bookmarks or more quotations to existing bookmarks).
The problem is that when accepting arbitrary selections from websites to insert under an org-mode heading, it becomes necessary to sanitize the input so that, at a minimum, quoted lines starting with asterisks don't affect the file's structure: if a quotation starting with "* this is a pathological example" is inserted into the file under some heading, it will appear as a new first-level (h1) heading when I open the file in emacs.
How can I meet the twin goals of (i) an editable org-mode flat file database (this rules out escaping and all our XML tricks) and (ii) isolating arbitrary inputs?
anti-solution: #+BEGIN_QUOTE wouldn't work because lines starting with "* " are rendered as new headings.
possibility 1: box/rebox everything from the outside world (http://www.emacswiki.org/emacs/BoxQuote), though this seems excessive.
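possibility 2 (a minimal sketch): exploit the fact that an org headline must start with an asterisk at column 0, and indent every inserted line by two spaces. Pathological lines become plain text, and the file stays hand-editable. In Emacs Lisp (the function name is just illustrative):

(defun my/org-insert-sanitized (text)
  "Insert TEXT at point, indenting every line by two spaces
so that lines beginning with `*' cannot be parsed as headlines."
  (insert (replace-regexp-in-string "^" "  " text)))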

Fix CSV file with new lines

I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained newlines. I chose to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with newlines are wrapped in quotes, but some aren't; I'm not sure why (it seems to quote fields if they contain more than one newline, but not if they contain only one. Thanks, Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows come out wrong because of the newlines: it thinks one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
The problem with this is that it matches the last element of one line and the first element of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
    // quote any field that contains a raw line break
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}

$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo
If you want a programmatic Java solution, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl command to clean up the CRLFs.
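For the batch route, here is a minimal Perl sketch that repairs a file like the sample above by joining physical lines until each logical row holds five fields. It assumes (as in the sample) that no field contains a comma of its own, and it replaces each embedded newline with a space; the file name is illustrative:

perl -ne 'chomp; $buf = length($buf) ? "$buf $_" : $_;
    if (($buf =~ tr/,//) >= 4) { print "$buf\n"; $buf = "" }' broken.csv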

Make SQL Server index small numbers

We're using SQL Server 2005 in a project. The users of the system have the ability to search some objects using 'keywords'. The way we implement this is by creating a full-text catalog for the significant columns in each table that may contain these 'keywords', and then using CONTAINS to search that catalog for the keywords the user enters in the search box.
So, for example, let's say you have the Movie object and you want to let the user search for keywords in the title and plot; we'd index both the Title and Plot columns and then do something like:
SELECT * FROM Movies WHERE CONTAINS(Title, keywords) OR CONTAINS(Plot, keywords)
(It's actually a bit more advanced than that, but nothing terribly complex)
Some users are adding numbers to their search, so for example they want to find 'Terminator 2'. The problem here is that, as far as I know, by default SQL Server won't index short words, thus doing a search like this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"')
is actually equivalent to doing this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator"') <-- notice the missing '2'
and we are getting a plethora of spurious results.
Is there a way to force SQL Server to index small words? Preferably, I'd rather index only numbers like 1, 2, 21, etc. I don't know where to define the indexing criteria, or even if it's possible to be as specific as that.
Well, I did that, removed the "noise words" from the list, and now the behaviour is a bit different, but still not what you'd expect.
When I search for "Terminator 2" (I'm just making this up; my employer might not be really happy if I disclose what we are doing. Anyway, the terms are a bit different but the principle is the same), I don't get anything back, but I know there are objects containing the two words.
Maybe I'm doing something wrong? I removed all the numbers 1 ... 9 from my noise configuration for ENG, ENU and NEU (neutral), regenerated the indexes, and tried the search.
These "small words" are considered "noise words" by the full text index. You can customize the list of noise words. This blog post provides more details. You need to repopulate your full text index when you change the noise words file.
I knew about the noise words file, but I'm not sure why your "Terminator 2" example is still giving you issues. You might want to try asking this on the MSDN Database Engine forum, where people who specialize in this sort of thing hang out.
You can combine CONTAINS (or CONTAINSTABLE) with simple where conditions:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"') and Title like '%Terminator 2%'
While the CONTAINS finds all the Terminator titles, the WHERE clause will eliminate 'Terminator 1'.
Of course, the engine is smart enough to start with the CONTAINS, not the LIKE condition.
