advice for transforming data gathered from screen scrapers

advice for transforming data gathered from screen scrapers - database

good day folks,
I have my screen scraper (scrapy) collecting data of property listings on several property websites. They all have several common fields like price, floor area etc. However, like all scraped data, the values for the fields are rather undesirable right now. For instance, in price, I have obvious values like $1,000,000,000, but I also have stuff like $1,000,000,000 Price on Ask and Price on Ask. So currently, I stored all my scraped fields as char in my database.
I would like to transform these string fields in my database from characters to the appropriate type e.g string to int, so I can index them accordingly. Can someone offer me some advice what would be sensible procedure and method to begin transforming the data?

You want to throw away the "Price On Ask" string? Or is that valuable information?
If there is a lot of noise in the data, and it is all of no-interest, I'd run a filter to remove all non-digits.
But, if time allows, I prefer to process the data explicitly with pattern matching (sample code is PHP):
//$price is raw string
$price=str_replace(',','',$price); //Get rid of commas
$price=str_replace('$','',$price); //Get rid of dollar signs
if($price=='Price On Ask')$price=null;
elseif(preg_match('/^\d+$/',$price))$price=(int)$price; //Simple number
elseif(preg_match('/^(\d+) Price On Ask$/i',$price,$parts)){
$price=(int)$parts[1];
}
else{
echo "Unexpected price string: $price\n";
$price=null;
}
I then have the structure to set flags for some of the strings. Also, when a new string appears in the data my script gets noisy and I can decide if it matters or not.
(Note: setting $price to null implies putting a NULL in the database, not a zero.)

Related

SQL Server validating postcodes

I have a table containing postcodes but there is no validation built in to the entry form so there is no consistency in the way they are stored in the database, sample below:
ID Postcode
001742 B5
001745
001746
001748 DY3
001750
001751
001768 B276LL
001774 B339HY
001776 B339QY
001780 WR51DD
I want to use these postcode to map the distance from a central point but before I can do that I need to put them into a valid format and filter out any blanks or incomplete postcodes.
I had considered using
left(postcode,3) + ' ' + right(postcode,3)
To correct the formatting but this wouldn't work for postcodes like 'M6 8HD'
My aim is to get the list of postcodes in a valid format but I don't know how to account for different lengths of postcode. Is this there a way to do this in SQL Server?

As discussed in the comments, sometimes looking at a problem the other way around presents a far simpler solution.
You have a list of arbitrary input provided by users, which frequently doesn't contain the correct spacing. You also have a list of valid postcodes which are correctly spaced.
You're trying to solve the problem of finding the correct place to insert spaces into your arbitrary inputs to make them match the list of valid codes, and this is extremely difficult to do in practice.
However, performing the opposite task - removing the spaces from the valid postcodes - is remarkably easy to do. So that is what I'd suggest doing.
In our most recent round of data modelling, we have modelled addresses with two postcode columns - PostCode containing the postcode as provided from whatever sources, and PostCodeNoSpace, a computed column which strips whitespace characters from PostCode. We use the latter column for e.g. searches based on user input. You may want to do something similar with your list of Valid postcodes, if you're keeping it around permanently - so that you can perform easy matches/lookups and then translate those matches back into a version that has spaces - which is actually a solution to the original question posed!

Remove Duplicate adjacent Sub-String from String in Microsoft SQL Server

I am using SQL Server 2008 and I have a column in a table, which has values like below. It basically shows departure and arrival information.
-->Heathrow/Dublin*Dublin/Heathrow
-->Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick
-->Heathrow/Dublin*Liverpool/Heathrow
(The 3rd example shown above is slightly different where the person did not depart from Dublin, instead departed from a Liverpool).
This makes the column too lengthy, and I want to remove only the adjacent duplicates, so the information can be shown like below:
-->Heathrow/Dublin/Heathrow
-->Gatwick/Liverpool/Carlisle/Gatwick
-->Heathrow/Dublin***Liverpool/Heathrow
So, this would still show the correct travel route, but omits only the contiguous duplicates. Also, in the 3rd case, since the departure and arrival information location is not the same, Iwould like to show it as ***.
I found a post here that removes all duplicates (Find and Remove Repeated Substrings) but this is slightly different from the solution that I need.
Could someone share any thoughts please?

The first step is to adapt the process defined in the following link so that it splits based on /:
T-SQL split string
This returns a table which you would then loop through checking if the value contains an *. In that case you would get the text values before and after the * and compare them. Use CHARINDEX to get the position of the *, and SUBSTRING to get the values before and after. Once you have those check both values and append to your output string accordingly.

So you have a database column that contains this text string? Is your concern to display the data to the user in a new format, or to update the data in your database table with a new value?
Do you have access to the original data from which this text string was built? It would probably be easier to re-create the string in the format you desire than it would be to edit the existing string programmatically.
If you don't have access to this data, it would probably be a lot simpler to update your data (or reformat it for display) if you do the string manipulation in a high-level language such as c# or java.
If you're reformatting it for display, write the string manipulation code in whatever language is appropriate, right before displaying it. If you're updating your table, you could write a program to process the table, reading each record, building the replacement string, and updating the record before moving on to the next one.
The bottom line is that T-SQL is just not a good language for doing this sort of string examination and manipulation. If you can build a fresh string from the original data, or do your manipulation in a high-level language, you'll have an easier job of it and end up with more maintainable code.

I wrote a code for the first example you gave. You still need to
improve it for the rest ...
DECLARE #STR VARCHAR(50)='Heathrow/Dublin*Dublin/Heathrow'
IF (SELECT SUBSTRING(#STR,CHARINDEX('/',#STR)+1,CHARINDEX('*',#STR)-CHARINDEX('/',#STR)-1)) =
(SELECT SUBSTRING(#STR,CHARINDEX('*',#STR)+1,LEN(SUBSTRING(#STR,CHARINDEX('/',#STR)+1,CHARINDEX('*',#STR)-CHARINDEX('/',#STR)-1))))
BEGIN
SELECT STUFF(#STR,CHARINDEX('*',#STR),LEN(SUBSTRING(#STR,CHARINDEX('/',#STR)+1,CHARINDEX('*',#STR)-CHARINDEX('/',#STR)-1))+1,'')
END
ELSE
BEGIN
SELECT STUFF(#STR,CHARINDEX('*',#STR),LEN(SUBSTRING(#STR,CHARINDEX('*',#STR)+1,LEN(SUBSTRING(#STR,CHARINDEX('/',#STR)+1,CHARINDEX('*',#STR)-CHARINDEX('/',#STR)-1)))),'***')
END

Using text in a column of Date/Time type in access

I have a column in MS Access in which the data could be any of the following:
A date
Text string: "n/a"
Text string: "n/e"
The vast majority of entries will be dates but a very few will need to be these specified text strings. I would like to still be able to perform date calculations on the column. Whats the best datatype to use?

In my opinion the best approach would be to leave the date field as Date/Time and then add another field to indicate the status if the Date/Time field is Null. Something like:
DateField DateStatus
--------- ----------
2014-09-21
n/a
2014-09-23
2014-09-25
n/e
You could use a single Text field, but then any time you wanted to use the field value as a proper Date/Time value you'd have to convert it using CDate(). You would also have the possibility of other junk getting in there, or dates getting entered in different formats (e.g. d/m/yyyy vs. m/d/yyyy). And finally, you would lose the ability to easily determine whether a Date/Time value is in a particular row (which in my approach would simply be ... WHERE DateField IS [NOT] NULL).

I agree with Gord Thompson's answer - mainly because it's so non-intuitive to have, essentially, two completely different types of data in a single column, and because it's going to make validation/data integrity stuff so much harder with little upside - and, as he indicates with the CDate() reference, dates basically only work reliably like dates if they're in a "date/time" field. Microsoft has a page on choosing a data type that explains some of the Access-specific differences in more detail.
I also suggest that you don't actually have a text field for those "comments," since you say there's only a handful of potential options - use a Long Integer and connect back to a separate table with the list of allowable entries. This will allow you to run reports more easily, change the "display text" in one step instead of potentially dozens of times, etc. It also saves a relatively small amount of space per record (long integer = 4 bytes; text = up to 255 bytes.)
You can also do fun data/reporting stuff with that Comment (long integer) field and dates - even combined into ranges, by the way - queries let you use the two different columns to create a single answer. I have a report that's grouped so that you can see stats for everything that's active (by quarter in which they start) plus everything that's pending (with the code indicating who's responsible for watching this record,) plus everything that's not pending but still doesn't have a start date (with the reason code displayed,) plus everything that's expired (by quarter in which they ended.) It looks like each of those things is in a single column in the report, but it's actually like five columns that have been concatenated with the IIf function.
(Almost every argument I can come up with boils down to "this is what relational databases are all about and why they're so awesome.)

How to use NSPredicate with NSPredicateEditor on different data (Multiple Predicates?)

I've got an array of filepaths and I've got a NSPredicateEditor setup in my UI where the user can combine a NSPredicate to find a file. He should be able to filter by name, type, size and date.
There are several problems I have now:
I can only get one predicate object from the editor. When I use
"predicateForRow:" it returns (null)
If the user wants to filter the file by name AND size or date, I
can't just use this predicate on my array anymore because those
information are not contained in it
Can I split up a predicate into different predicates without
converting it into a NSString object, then search for every #" OR " |
#" AND " and seperating the components into an array and then
converting every NSString into a new predicate?
In the NSPredicateEditor settings I've some options for the "left Expression":
Keypaths, Constant Values, Strings, Integer Numbers, Floating Point Numbers and Dates. I want to display a dropdown menu to the user with "name", "type", "date", "size". But then the generated predicate automatically looks like this:
"name" MATCHES[c] "nameTest" OR "type" MATCHES[c] "jpg" OR size == 100
Because the array is filled with strings, a search for "name", "type" etc. and those strings do not respond to #"myString"*.name*m the filter always returns 0 objects. Is there a way to show the Name, Type, Size and Date in the Menu, but write "self" into the predicate without doing it by hand?
I've already searched in the official Apple tutorials, on Stackoverflow, Google, and even Youtube to find a clue. This problem troubles me for almost one week now. Thanks for you time! If you need more information please let me know!

You have come to the right place! :)
I can only get one predicate object from the editor.
Correct. It is an NSPredicateEditor, not an NSPredicatesEditor. ;)
When I use "predicateForRow:" it returns (null)
I'm not sure I would use that method. My general rule of thumb is to largely ignore that NSPredicateEditor is a subclass of NSRuleEditor, mainly because it's such a highly specialized subclass that many of the superclass methods don't make that much sense on a predicate editor (like all the stuff about criteria, row selection, etc). It's possible that they're somehow relevant, but if they are, I haven't figured out how yet.
To get the predicate from the editor, you do:
NSPredicate *predicate = [myPredicateEditor objectValue];
If the user wants to filter the file by name AND size or date
You mean (name = [something]) AND (size = [something] OR date = [something])?
If so, NSPredicateEditor can do that if you've set the nesting mode to "Compound".
I can't just use this predicate on my array anymore because those information are not contained in it
What information do you need?
Can I split up a predicate into different predicates without converting it into a NSString object, then search for every #" OR " | #" AND " and seperating the components into an array and then converting every NSString into a new predicate?
Yes, but that is a BAD idea. It's bad because NSPredicate already contains all the information you need, and converting it to a different format and doing string manipulations just isn't necessary and can potentially lead to complications (like if someone can type in a value for "name", what happens if they type in " OR "?).
I'm having a hard time trying to figure out what it is you're trying to do. It sounds like you have an array of NSString objects that you want to filter based on a predicate that the user creates? If so, then what do these name, date, and size key paths mean? What are you trying to do?

SQL Server String Manipulation for URLs?

I need to append a paramter-value 'xval=9' to all non-blank SQL server column values in a multi-million row table. The column contains URLs and they have a random amount of "querystring" parameters appended to the column. So when I append, I may need to append '?xval=9' or I may need to append '&val=9', depending on if parameters already exist.
So the URL values could like like any of these:
http://example.com/example
http://example.com/example/?aval=1
http://example.com/example/?aval=1&bval=2
http://example.com/example/index.html?aval=1&bval=2
'aval' and 'bval' are just samples, really any kind of key/value pair might be on the end of the URL.
What is the smartest pure-TSQL way to manipulate that, hopefully utilizing some kind of indexing?
Thanks.

Do that on Presentation or Model layer, not on Data layer.
i.e. read all data and manipulate using C# or other language you use.

Maybe this should work
SELECT CASE CHARINDEX('?', Url) WHEN 0 THEN Url+'?foo=boo' ELSE Url+'&foo=boo' END AS Url FROM Whatever

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight