Parsing structured text data - text-parsing

I extracted a blob field from a MySQL table as text:
CAST(orders AS CHAR(10000) CHARACTER SET utf8)
Now each field looks like this:
a:2:{s:4:"Cart";a:5:{s:4:"cart";a:2:{i:398;a:7:{s:2:"id";s:3:"398";s:4:"name";s:14:"Some product 1";s:5:"price";i:780;s:3:"uid";s:5:"FN-02";s:3:"num";s:1:"1";s:6:"weight";s:1:"0";s:4:"user";s:1:"4";}i:379;a:7:{s:2:"id";s:3:"379";s:4:"name";s:14:"Some product 2";s:5:"price";i:750;s:3:"uid";s:5:"FR-01";s:3:"num";s:1:"1";s:6:"weight";s:1:"0";s:4:"user";s:1:"4";}}s:3:"num";i:2;s:3:"sum";s:7:"1530.00";s:6:"weight";i:160;s:8:"dostavka";s:3:"180";}s:6:"Person";a:17:{s:4:"ouid";s:6:"103-47";s:4:"data";s:10:"1278090513";s:4:"time";s:8:"21:33 pm";s:4:"mail";s:15:"mail#mailer.com";s:11:"name_person";s:8:"John Doe";s:8:"org_name";s:13:"John Doe Inc.";s:7:"org_inn";s:12:"667110804509";s:7:"org_kpp";s:0:"";s:8:"tel_code";s:3:"343";s:8:"tel_name";s:7:"2670039";s:8:"adr_name";s:26:"London, 221b, Baker street";s:14:"dostavka_metod";s:1:"8";s:8:"discount";s:0:"";s:7:"user_id";s:2:"13";s:6:"dos_ot";s:0:"";s:6:"dos_do";s:0:"";s:11:"order_metod";s:1:"1";}}
The text follows the pattern [type]:[length]:[data];, where the type s stands for string and a stands for array (a dictionary in Python terms). It also has i:[number]; groups with no [length] part.
I don't see a better solution than parsing it with regexes in several passes, though I don't clearly understand how to parse the nested dictionaries that way.
The question: is this a standard data format that already has a parser?

This looks like the output of the PHP serialize function, so you need to unserialize it:
http://php.net/manual/en/function.serialize.php
If you are working in Python, there is a port of the serialize and unserialize functions here:
https://pypi.python.org/pypi/phpserialize
Anatomy of a serialize()'ed value:
String
s:size:value;
Integer
i:value;
Boolean
b:value; (does not store "true" or "false", does store '1' or '0')
Null
N;
Array
a:size:{key definition;value definition;(repeated per element)}
Object
O:strlen(object name):object name:object size:{s:strlen(property name):property name:property definition;(repeated per property)}
String values are always in double quotes
Array keys are always integers or strings:
"null => 'value'" equates to 's:0:"";s:5:"value";',
"true => 'value'" equates to 'i:1;s:5:"value";',
"false => 'value'" equates to 'i:0;s:5:"value";',
"array(whatever the contents) => 'value'" equates to an "illegal offset type" warning because you can't use an array as a key; however, if you use a variable containing an array as a key, it will equate to 's:5:"Array";s:5:"value";',
and attempting to use an object as a key will result in the same behavior as using an array.
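
The grammar above is regular enough that a small recursive-descent parser can read it, which also shows why nested arrays are awkward for regexes. The following is a minimal Python sketch, not a replacement for the phpserialize package linked above. It assumes that only the s, i, b, N and a types appear (no objects or floats), that the input is UTF-8 bytes, and that string sizes count bytes rather than characters. The names parse_php and _parse_value are made up for this example.

```python
def parse_php(data: bytes):
    """Parse a PHP serialize() byte string (s, i, b, N and a types only)."""
    value, pos = _parse_value(data, 0)
    if pos != len(data):
        raise ValueError("trailing data at offset %d" % pos)
    return value


def _parse_value(data: bytes, pos: int):
    """Return (value, position just after the value)."""
    kind = data[pos:pos + 1]
    if kind == b"N":                        # N;
        return None, pos + 2
    if kind in (b"i", b"b"):                # i:value;  or  b:value;
        end = data.index(b";", pos)
        number = int(data[pos + 2:end])
        return (number if kind == b"i" else bool(number)), end + 1
    if kind == b"s":                        # s:size:"value";
        colon = data.index(b":", pos + 2)
        size = int(data[pos + 2:colon])
        start = colon + 2                   # skip the colon and opening quote
        raw = data[start:start + size]      # size is a byte count
        return raw.decode("utf-8"), start + size + 2  # skip closing quote and ;
    if kind == b"a":                        # a:size:{key;value;...}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2                     # skip the colon and opening brace
        result = {}
        for _ in range(count):
            key, pos = _parse_value(data, pos)
            result[key], pos = _parse_value(data, pos)
        return result, pos + 1              # skip the closing brace
    raise ValueError("unsupported type %r at offset %d" % (kind, pos))


sample = b'a:2:{s:2:"id";s:3:"398";s:5:"price";i:780;}'
print(parse_php(sample))
```

PHP arrays come back as Python dicts, so a nested array such as the cart in the question becomes a dict of dicts keyed by the integer product ids.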

Related

EF Core array intersection

I use PostgreSQL and have a column codes of type text[], and I have another array filterCodes of type string[]. When I query data from the table, I need to check that codes contains at least one element from filterCodes. I tried using Intersect and Any, but neither seems to work.
How can I do this without writing custom functions?
patientQuery.Where(p => p.Codes.Intersect(filterCodes).Any());
According to the Array Type Mapping documentation (look at the translation of array1 && array2), it should be:
patientQuery = patientQuery
.Where(p => p.Codes.Any(c => filterCodes.Contains(c)));

Return values in an array of hashes

I have an assignment and cannot figure out where my mistake lies. I have a large array of hashes, all returned by the method twitter_data. Each hash is structured as follows.
def twitter_data
[{"User"=>
{"description"=>
"Description here",
"last twenty tweets"=>
["tweets written out here"],
"number of followers"=>1000,
"number of friends"=>100,
"latest tweet"=>
"tweet written out here",
"number of tweets"=>1000,
"location"=>"Wherever, Wherever"}},]
end
Now if I wanted to, for instance, list all of the users and their descriptions, I thought the code would read as follows.
twitter_data.each do |twitter_data|
puts "#{twitter_data[:twitter_data]}: #{twitter_data[:description]}"
end
But the output for that just gives me about seven :, without the username in front of it or the description afterwards.
As you can see, the description key is nested inside another hash whose key is User. I don't know which other key you want to print because the data seems incomplete, but if you wanted to print just the descriptions, this should work:
twitter_data.each do |user_data|
description = user_data["User"]["description"]
puts description
end
There are a couple of reasons why this does not work:
1) The twitter_data element inside the each looks like this: { 'User' => { 'description'.... On that hash, the value stored under the :description key is nil.
2) Even if you were to refer to the correct hash via twitter_data['User'], you would still be using symbols (e.g. :description) instead of strings. So even then, the value stored for the keys would be nil.
3) You are referencing elements that do not seem to exist in the hash even if one were to use strings (e.g. :twitter_data). Now this might simply be due to the example selected.
What will work is to correctly reference the hashes:
twitter_data.each do |data|
user_hash = data['User']
puts "#{user_hash['twitter_data']}: #{user_hash['description']}"
end

Ruby: Using a string variable to call an array element set is not working

I'm trying to call a set of elements in the example below.
session_times: {
thursday: ["10:20am", "12:30pm", "6:40pm"],
friday: ["10:20am", "12:30pm", "6:40pm"],
saturday: ["10:20am", "12:30pm", "6:00pm"],
sunday: ["10:20am", "12:30pm", "6:30pm"]
}
I tried the following:
days_all = movie[:session_times]
string = ':' + 'thursday'
var1 = days_all[:thursday]
var2 = days_all["#{string}"]
var3 = days_all[string]
The variable var1 comes out perfectly fine, but I don't understand why var2 and var3 don't give me the same result. Shouldn't they come out the same, since they refer to the same key?
Help would be much appreciated :)
No, there is a difference between a symbol and a string, and they are not always interchangeable. A symbol is not the same as a string starting with a colon (that's still a string). When you use the key: val hash syntax the keys are symbols; "key" => val would be a string key.
Any of these would work:
string = "thursday" # don't put the colon in here
days_all[:"#{string}"]
days_all[string.to_sym]
days_all["#{string}".to_sym]
If you install the gem activesupport and then require active_support/all (this is done automatically in Rails), it's less strict about what key you need to use:
days_all = days_all.with_indifferent_access
days_all["thursday"]
days_all[:thursday]
# Note: days_all.thursday does not work; with_indifferent_access does not add method-style access
In Ruby, :x refers to a Symbol and "x" refers to a String. A Symbol is an "interned string": it acts more like an arbitrary constant, and every instance of :x is identical to every other; they're literally the same object.
The same is not true for strings: each one may be different, and normally occupies a different chunk of memory. This is why you see Symbols used for keys in hashes; repeating the same string for every key would be wasteful.
You can reference your structure any of the following ways:
days_all[:thursday] # With a plain symbol
days_all["thursday".to_sym] # With a converted string
days_all[:"thursday"] # With a long-form symbol
Another thing to note is that you probably don't want to stick with this data structure if you can avoid it. It isn't very "machine readable", since names like :thursday are completely arbitrary. It's much better to use a consistent index like 0 meaning Sunday, 1 meaning Monday and so on. That way methods like Date#wday, which returns 0 for Sunday, can be used to look things up in a regular Array.
The same goes for human-annotated times like "10:30pm", where a value like 1350 (22 hours times 60, plus 30 minutes) is easier to work with, or even 2230 if you don't mind gaps between your intervals. Those are easy to compare: 1130 < 230 is never suddenly true due to ASCII sorting issues.

PostgreSQL: How to implement GIN?

I want to plug my text hash function into a GIN index.
See the extensibility documentation:
http://www.postgresql.org/docs/9.0/static/gin-extensibility.html
I understand compare:
int compare(Datum a, Datum b)
But what about extractValue, extractQuery and consistent?
Datum *extractValue(Datum inputValue, int32 *nkeys)
Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n, bool **pmatch, Pointer **extra_data)
bool consistent(bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck)
The manual doesn't help me implement them.
I want to know how to implement them. In detail:
What's passed to inputValue of extractValue?
What's returned by extractValue?
What's passed to query of extractQuery?
What's returned by extractQuery?
What's passed to query of consistent?
What's passed to check of consistent?
The index storage (hashed key) will be int4. The input type is text.
Did you read all the documentation? Do you know what an inverted index does? I can't answer your question fully because you haven't specified what your queries will look like. But here is an attempt (based on the information at http://www.postgresql.org/docs/9.2/static/gin-extensibility.html and http://www.sai.msu.su/~megera/wiki/Gin). Also, look at the tsearch example.
The input to compare is two key values, so in your case two integers.
The input to extractValue is your input type, text. The output is an array of keys: in your case apparently an array of integers.
extractQuery gets as input your query type, which might be a string (you don't specify), and returns a list of interesting keys you want the system to find matches for. It can also return extra information for the consistent method.
After the system has found values that are interesting according to what you returned from extractQuery, the consistent method returns whether the value actually matched the query.
Since you haven't specified your query type, I'll give an example with full-text search.
There, the query type is a string like 'foo and bar'. extractQuery would return the keys 'foo' and 'bar', plus some data so that the consistent function knows that both terms must be present.
But really, this is all described on the above pages.
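
To make the division of labour between the three functions concrete, here is a toy model in Python. It is only a sketch of the idea, not the PostgreSQL C API: the function names mirror the GIN support functions, but the signatures are simplified, and the choice of zlib.crc32 as the hash, whitespace splitting, and "all words must be present" query semantics are assumptions made for this example.

```python
import zlib
from collections import defaultdict


def hash_key(word):
    """Hypothetical key function: a 32-bit hash of the lower-cased word."""
    return zlib.crc32(word.lower().encode("utf-8"))


def extract_value(text):
    """Indexed value (text) -> sorted list of distinct integer keys."""
    return sorted({hash_key(word) for word in text.split()})


def extract_query(query):
    """Query string -> list of keys the index should look up."""
    return [hash_key(word) for word in query.split()]


def consistent(check):
    """check[i] says whether the i-th query key was found for a row.

    Here every key must be present. Because the keys are hashes, a real
    implementation would also ask for a recheck against the original text.
    """
    return all(check)


def build_index(rows):
    """Inverted index: key -> set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for key in extract_value(text):
            index[key].add(row_id)
    return index


def search(index, query):
    keys = extract_query(query)
    postings = [index.get(key, set()) for key in keys]
    candidates = set().union(*postings)
    return sorted(
        row_id for row_id in candidates
        if consistent([row_id in posting for posting in postings])
    )


rows = {1: "foo bar", 2: "foo baz", 3: "bar qux"}
index = build_index(rows)
print(search(index, "foo bar"))
```

The index only ever stores what extract_value returns, and the query side only ever sees what extract_query returns; consistent is the single place where the per-key hits are combined into a yes or no for each row.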

App engine - easy text search

I was hoping to implement an easy but effective text search for App Engine that I could use until official text search capabilities for App Engine are released. I see there are libraries out there, but it's always a hassle to install something new. I'm wondering if this is a valid strategy:
1) Break each property that needs to be text-searchable into a set(list) of text fragments
2) Save record with these lists added
3) When searching, just use equality filters on the list properties
For example, if I had a record:
{
firstName="Jon";
lastName="Doe";
}
I could save a property like this:
{
firstName="Jon";
lastName="Doe";
// not case sensitive:
firstNameSearchable=["j","o","n","jo","on","jon"];
lastNameSearchable=["d","o","e","do","oe","doe"];
}
Then to search, I could do this and expect it to return the above record:
//pseudo-code:
SELECT person
WHERE firstNameSearchable=="jo" AND
lastNameSearchable=="oe"
Is this how text searches are implemented? How do you keep the index from getting out of control, especially if you have a paragraph or something? Is there some other compression strategy that is usually used? I suppose if I just want something simple, this might work, but it's nice to know the problems that I might run into.
Update:
Ok, so it turns out this concept is probably legitimate. This blog post also refers to it: http://googleappengine.blogspot.com/2010/04/making-your-app-searchable-using-self.html
Note: the source code in the blog post above does not work with the current version of Lucene. I installed the older version (2.9.3) as a quick fix, since Google is supposed to come out with their own text search for App Engine soon enough anyway.
The solution suggested in the response below is a nice quick fix, but due to Bigtable's limitations it only works if you are querying on one field, because you can only use inequality operators on one property in a query:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
If you want to query on more than one property, you can save indexes for each property. In my case, I'm using this for some auto-suggest functionality on small text fields, not actually searching for word and phrase matches in a document (you can use the blog post's implementation above for this). It turns out this is pretty simple and I don't really need a library for it. Also, I anticipate that if someone is searching for "Larry" they'll start by typing "La..." as opposed to starting in the middle of the word: "arry". So if the property is for a person's name or something similar, the index only has the substrings starting with the first letter, so the index for "Larry" would just be {"l", "la", "lar", "larr", "larry"}.
I did something different for data like phone numbers, where you may want to search for one starting from the beginning or middle digits. In this case, I just stored the entire set of substrings of length 3 or more, so the phone number "123-456-7890" would be: {"123", "234", "345", ..... "123456789", "234567890", "1234567890"}, a total of (10*(10+1)/2)-(10+9) = 36 index entries... actually what I did was a little more complex in order to remove some unlikely-to-be-used substrings, but you get the idea.
Then your query would be:
(Pseudocode)
SELECT * FROM Person WHERE
firstNameSearchIndex == "lar" AND
phonenumberSearchIndex == "1234"
The way that App Engine works is that if the query substring matches any of the substrings in the list property, then that is counted as a match.
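
The index lists and the matching rule described above can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not App Engine code: prefixes, substrings and matches are hypothetical helper names, and matches only imitates the "equal to any element of the list" behaviour of an equality filter on a list property.

```python
def prefixes(text):
    """All lower-cased prefixes, e.g. for auto-suggest on a name."""
    text = text.lower()
    return [text[:i] for i in range(1, len(text) + 1)]


def substrings(text, min_len=3):
    """All digit substrings of at least min_len characters."""
    digits = "".join(ch for ch in text if ch.isdigit())
    n = len(digits)
    return {
        digits[i:j]
        for i in range(n)
        for j in range(i + min_len, n + 1)
    }


def matches(entity, **filters):
    """Imitate equality filters on list properties: a filter matches if
    its value equals any element of the corresponding list."""
    return all(value in entity[prop] for prop, value in filters.items())


person = {
    "firstNameSearchIndex": prefixes("Larry"),
    "phonenumberSearchIndex": substrings("123-456-7890"),
}
print(matches(person, firstNameSearchIndex="lar",
              phonenumberSearchIndex="1234"))
```

Storing only prefixes keeps the list linear in the length of the name, while storing every substring grows quadratically, which is why the approach is limited to short fields.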
In practice, this won't scale. A string of n characters has n(n+1)/2 substrings, so a 500-character string would need up to 125,250 index entries to capture all possible substrings. Writing that many index entries for every entity is slow and expensive.
Implementations like search.SearchableModel create one index entry per word, which is a bit more realistic. You can't search for arbitrary substrings, but there is a trick that lets you match prefixes:
From the docs:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
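
Why the range trick works can be seen without a datastore. The sketch below keeps the property values in a sorted Python list and uses bisect to pick out the same half-open range; prefix_scan is a hypothetical helper, and Python's code-point ordering of strings is assumed to stand in for the ordering the index uses.

```python
import bisect


def prefix_scan(sorted_values, prefix):
    """Return the values v with prefix <= v < prefix + U+FFFD."""
    low = prefix
    high = prefix + "\ufffd"
    start = bisect.bisect_left(sorted_values, low)
    stop = bisect.bisect_left(sorted_values, high)
    return sorted_values[start:stop]


values = sorted(["ab", "abc", "abcd", "abd", "abcz", "b"])
print(prefix_scan(values, "abc"))
```

Because all matching values sit next to each other in sorted order, the scan touches only one contiguous slice of the index, which is why a single pair of inequality filters on one property is enough.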

Resources