AT NEW with substring access? - loops

I have a solution that includes a LOOP I would like to avoid, so I wonder whether you know a better way to do this.
My goal is to loop through an internal, alphabetically sorted standard table. This table has two columns: a name and a table, let's call it subtable. For every subtable I want to do some stuff (open an XML page in my XML framework).
Now, every subtable has a corresponding name, and I want to group the subtables according to the first letter of this name (meaning, put the pages of these subtables on one main page, one main page per letter). By grouping I mean that, while looping through the table, I want to deal with the subtables differently according to the first letter of their name.
So far I came up with the following solution:
TYPES: BEGIN OF l_str_tables_extra,
         first_letter(1) TYPE c,
         name            TYPE string,
         subtable        TYPE REF TO if_table,
       END OF l_str_tables_extra.

DATA: ls_tables_extra TYPE l_str_tables_extra.
DATA: lt_tables_extra TYPE TABLE OF l_str_tables_extra.

FIELD-SYMBOLS: <ls_tables>       TYPE str_table. "LIKE LINE OF lt_tables
FIELD-SYMBOLS: <ls_tables_extra> TYPE l_str_tables_extra.
*"--- PROCESSING LOGIC ------------------------------------------------
SORT lt_tables ASCENDING BY name.

"Add a first-letter column in order to use AT NEW later on.
"This is the loop I would like to avoid.
LOOP AT lt_tables ASSIGNING <ls_tables>.
  ls_tables_extra-first_letter = <ls_tables>-name+0(1). "new column
  ls_tables_extra-name         = <ls_tables>-name.
  ls_tables_extra-subtable     = <ls_tables>-subtable.
  APPEND ls_tables_extra TO lt_tables_extra.
ENDLOOP.

LOOP AT lt_tables_extra ASSIGNING <ls_tables_extra>.
  AT NEW first_letter.
    "Do something with the subtables that share this first letter.
  ENDAT.
ENDLOOP.
I wish I could use
AT NEW name+0(1)
instead of
AT NEW first_letter
but offsets and lengths are not allowed in AT NEW.
You see, I have to include this first loop just to add another column to my table, which is kind of unnecessary because no new information is gained.
In addition, I am interested in other solutions because I get into trouble with the framework later on for different reasons. A different way to do this might help me out there, too.
I am happy to hear any thoughts about this! I could not find anything related here on Stack Overflow, but I might not have used the best search terms ;)

Maybe the GROUP BY addition on LOOP can help you in this case:
LOOP AT i_tables INTO DATA(wa_line)
     " group lines by the first letter of NAME;
     " substring( ) is used because a normal offset/length access
     " would be evaluated immediately
     GROUP BY ( name = substring( val = wa_line-name len = 1 ) )
     INTO DATA(o_group).

  " loop over the group of all tables whose name starts with o_group-name
  LOOP AT GROUP o_group ASSIGNING FIELD-SYMBOL(<fs_table>).
    " <fs_table> contains one line of your original table
  ENDLOOP.

ENDLOOP.

Why not use an IF comparison?
DATA: lf_prev_first_letter(1) TYPE c.

LOOP AT lt_table ASSIGNING <ls_table>.
  IF <ls_table>-name(1) <> lf_prev_first_letter. "equivalent of AT NEW
    "do something
    lf_prev_first_letter = <ls_table>-name(1).
  ENDIF.
ENDLOOP.

Related

Identify Unique Sentences in SQL String with Repeat Sentences

I have a VARCHAR(MAX) string field in SQL that, in some instances, contains up to 100k characters. Let's call it [Notes_Comments].
The contents of the field are notes and comments, and in several instances those notes and comments have been copied and pasted in multiple times. There is no delineation as to where the old and new parts start or end. Punctuation is also hit and miss.
I would like to figure out how to identify the unique sentences, or strings/substrings, within the [Notes_Comments] field, to essentially get a pure, deduplicated view of its contents.
I built a SplitString function and was able to start parsing out two-, three- and five-word phrases, but it isn't really getting me to full sentences.
Example [Notes_Comments]:
I want to get this
10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. Notes One more section.
From this
10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words Notes Some duplicate stuff 10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. 10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. Notes One more section.
Here is my SplitString function:
CREATE FUNCTION dbo.SplitString
(
    @str NVARCHAR(MAX),
    @separator CHAR(1)
)
RETURNS TABLE
AS
RETURN (
    -- recursive CTE: p = token number, a = token start, b = next separator
    WITH tokens(p, a, b) AS (
        SELECT
            CAST(1 AS bigint),
            CAST(1 AS bigint),
            CHARINDEX(@separator, @str)
        UNION ALL
        SELECT
            p + 1,
            b + 1,
            CHARINDEX(@separator, @str, b + 1)
        FROM tokens
        WHERE b > 0
    )
    SELECT
        p - 1 AS ItemIndex,
        SUBSTRING(
            @str,
            a,
            CASE WHEN b > 0 THEN b - a ELSE LEN(@str) END) AS word
    FROM tokens
);
GO
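A quick usage note on the function above: it is built on a recursive CTE, so SQL Server's default limit of 100 recursion levels applies, which matters for fields with 100k characters. The caller can lift the cap with an OPTION hint on the outer query; a minimal sketch:
-- split a sample string on spaces; ItemIndex starts at 0
SELECT ItemIndex, word
FROM dbo.SplitString(N'one two three', ' ')
OPTION (MAXRECURSION 0);  -- lift the default 100-level recursion cap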
To help going forward, I would recommend taking that Notes column off the table it is currently on and making a new table with EntityId and Note columns, EntityId being that of the parent record you want to associate the note with. That way you have a one-to-many relationship between your entity and its notes.
This will require quite a bit of work to change over, but maintaining it will be much easier, and it will prevent issues similar to the one you are currently having.
For a purely SQL approach, I would change the table relationships first. Perform your split and insert each individual message into the new table. After doing that, you can write a separate procedure that identifies duplicates and deletes all but one of each.
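For illustration, here is a minimal sketch of that suggested one-to-many structure and the follow-up dedupe step. The table and column names are hypothetical, and it assumes SQL Server:
-- hypothetical one-to-many notes table
CREATE TABLE EntityNotes (
    NoteId   INT IDENTITY(1,1) PRIMARY KEY,
    EntityId INT NOT NULL,           -- the parent record
    Note     NVARCHAR(MAX) NOT NULL  -- one individual message
);

-- after splitting and inserting, delete every duplicate note,
-- keeping the earliest row per (EntityId, Note) pair
DELETE en
FROM EntityNotes en
WHERE EXISTS (
    SELECT 1
    FROM EntityNotes earlier
    WHERE earlier.EntityId = en.EntityId
      AND earlier.Note     = en.Note
      AND earlier.NoteId   < en.NoteId
);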
I should note that what follows below is an application approach.
If you choose to update your table relationships, implement that first; de-duping then becomes a two-step process, but one that is relatively simple to implement.
For de-duping, you have to account for the fact that, as you said, punctuation is hit or miss.
I would split on the :?? AM and :?? PM syntax of your timestamps. You'll have to work out the exact pattern, but that should help you identify where an entry begins and ends. Split the entries into a list, Trim() each one as you add it, and if you don't need the punctuation at all, do a Replace() on ".", "," and any other punctuation characters you don't want as you add them to your collection as well.
Now you have two options. If you went with the table update, you can insert all of the values and then filter out duplicates afterwards, in either SQL or code.
Or you can create a new list, do some looping, and add only unique items to this new list. If you kept your single column, concatenate them into one string and update your DB; if you went with the one-to-many design, perform an insert for each unique entry.
If you want to do less exact matching, build a hash for each substring, perhaps "timestamp" + first X chars + last Y chars, then compare your hashes for equality, again keeping only one. I recommend this pattern because it looks from your example like duplicate notes have duplicate timestamps. If not, try only the first X and last Y characters as a hash.
I highly recommend updating your table structure to save yourself similar headaches in the future. The rest is a little bit of regex, string splitting, and comparison. Also, if you try the hashing, experiment with it in debug to make sure you don't cut out too much before committing anything to your DB.
If you are going to split the strings in code, you'll need to start by putting together a regex. I would match on the substring ":.. AM" or ":.. PM". Matching on just a ":" seems a little dangerous given that this is a note field and users can probably type just about anything they want.
Please note this is untested, but the logic should be sound. You may need to tinker with the match object accessors, test that the index value splits exactly where you want, etc.
Start by reading your field into a string, then use the regex to find the messages within the string. Something like the following:
// matches the ":mm AM/PM" part of a timestamp, e.g. ":46 PM"
string pattern = @":\d\d [AP]M";
Regex rgx = new Regex(pattern, RegexOptions.None);
MatchCollection matches = rgx.Matches(input);
The match objects in the matches collection contain an index for the starting point where the match was found. You'll want to loop through them and extract the substrings into individual strings. Something like:
List<string> allNotes = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
    string nextNote = string.Empty;
    // back up two characters to pick up the hour digits before the colon
    int currentIndex = matches[i].Index - 2;
    int endIndex = -1;
    if (i + 1 < matches.Count)
    {
        endIndex = matches[i + 1].Index - 2;
    }
    if (endIndex != -1)
    {
        nextNote = input.Substring(currentIndex, endIndex - currentIndex);
    }
    else
    {
        nextNote = input.Substring(currentIndex);
    }
    if (!string.IsNullOrEmpty(nextNote))
    {
        allNotes.Add(nextNote);
    }
}
These substrings should span the beginning of the timestamp through the character before the next time stamp.
You can then write another function to de-dupe the list. Something like:
Dictionary<string, string> dedupedDict = new Dictionary<string, string>();
foreach (string note in allNotes)
{
    // hash = first 6 + last 7 characters of the note
    string noteHash = string.Concat(note.Substring(0, 6), note.Substring(note.Length - 7));
    if (!dedupedDict.ContainsKey(noteHash))
    {
        dedupedDict.Add(noteHash, note);
    }
}
return dedupedDict.Values.ToList();
You might need to include some checks for strings being a minimum length to avoid index-out-of-bounds exceptions, and/or change the hash size (my example takes roughly the first and last six characters). You may also want to Trim() and/or Replace(".", "") on the punctuation at some point if you think it will help your matching.
Then use the list of deduped notes to populate your new CallNotes table. You'll need your call Id as well.
Once you've cleaned up your current values, you can use the same methods to intercept new notes before they are put into your database.

In the code below, what do the parentheses do? Or rather, what does this whole call ACTUALLY call?

[SELECT i.Name, (SELECT Name FROM Line_items__r ORDER BY Name)
 FROM Invoice__c i
 WHERE i.Name = :invoiceName LIMIT 1];
Why does there need to be an i variable and the parentheses? For instance, why not just do:
SELECT Name FROM Invoice__c WHERE Name = :invoiceName LIMIT 1? Are the parentheses just a way to specifically get the line items?
A walk through of the code would be extremely helpful, thank you!
The following part is a subquery:
SELECT Name FROM Line_items__r ORDER BY Name
It is SOQL's way of traversing a parent-to-child relationship; see Using Relationship Queries. As it is a child relationship, multiple sub-results can be returned. The parentheses are required as part of the subquery syntax.
The i variable probably isn't strictly necessary. However, it helps to disambiguate the Name field of the Invoice__c record from the Name field of the child Line_items__r records.

Arrays in Rails

After much googling and console testing, I need some help with arrays in Rails. In a method I search the DB for all rows matching a certain requirement and put them in a variable. Next I want to call each on that array and loop through it. My problem is that sometimes only one row is matched in the initial search, and .each causes a NoMethodError.
I called .class in both situations. When there are multiple rows, the variable I dump them into is of class Array. If there is only one row, it is of the class of the model.
How can I have an each loop that won't break when there's only one instance of an object in my search? I could hack something together with lots of conditional code, but I feel like I'm not seeing something really simple here.
Thanks!
Requested Code Below
@user = User.new(params[:user])
if @user.save
  # scan the invites db table and, if the user's email is present, add the new uid to the table
  @talentInvites = TalentInvitation.find_by_email(@user.email)
  unless @talentInvites.nil?
    @talentInvites.each do |tiv|
      tiv.update_attribute(:user_id, @user.id)
    end
  end
  # ...more code...
Use find_all_by_email; it will always return an array, even an empty one.
@user = User.new(params[:user])
if @user.save
  # scan the invites db table and, if the user's email is present, add the new uid to the table
  @talentInvites = TalentInvitation.find_all_by_email(@user.email)
  unless @talentInvites.empty?
    @talentInvites.each do |tiv|
      tiv.update_attribute(:user_id, @user.id)
    end
  end

Selective PostgreSQL database querying

Is it possible to have selective queries in PostgreSQL which select different tables/columns based on values of rows already selected?
Basically, I've got a table in which each row contains a sequence of two to five characters (tbl_roots), optionally with a length field which specifies how many characters the sequence is supposed to contain (it's meant to be made redundant once I figure out a better way, i.e. by counting the length of the sequences).
There are four tables containing patterns (tbl_patterns_biliteral, tbl_patterns_triliteral, ...etc), each of which corresponds to a root_length, and a fifth table (tbl_patterns) which is used to synchronise the pattern tables by providing an identifier for each row, so that row #2 in tbl_patterns_biliteral corresponds to the same row in tbl_patterns_triliteral. The four pattern tables are restricted such that no row in tbl_patterns_(bi|tri|quadri|quinqui)literal can have a pattern_id that doesn't exist in tbl_patterns.
Each pattern table has nine other columns, each of which corresponds to an identifier (root_form).
The last table in the database (tbl_words) contains a column for each of the major tables (word_id, root_id, pattern_id, root_form, word). Each word is defined as being a root of a particular length and form, spliced into a particular pattern. The splicing is relatively simple: translate(pattern, '12345', array_to_string(root, '')) as word_combined does the job.
Now, what I want to do is select the appropriate pattern table based on the length of the sequence in tbl_roots, and select the appropriate column in the pattern table based on the value of root_form.
How could this be done? Can it be combined into a simple query, or will I need to make multiple passes? Once I've built up this query, I'll then be able to code it into a PHP script which can search my database.
EDIT
Here's some sample data (it's actually the data I'm using at the moment) and some more explanations as to how the system works: https://gist.github.com/823609
It's conceptually simpler than it appears at first, especially if you think of it as a coordinate system.
I think you're going to have to change the structure of your tables to have any hope. Here's a first draft for you to think about. I'm not sure what the significance of the "i", "ii", and "iii" in your column names is; in my ignorance, I'm assuming they're meaningful to you, so I've preserved their information in the table below as integers. (It's easy to change that to lowercase Roman numerals if it matters.)
create table patterns_bilateral (
    pattern_id integer not null,
    root_num   integer not null,
    pattern    varchar(15) not null,
    primary key (pattern_id, root_num)
);

insert into patterns_bilateral values
(1,1, 'ya1u2a'),
(1,2, 'ya1u22a'),
(1,3, 'ya12u2a'),
(1,4, 'me11u2a'),
(1,5, 'te1u22a'),
(1,6, 'ina12u2a'),
(1,7, 'i1u22a'),
(1,8, 'ya1u22a'),
(1,9, 'e1u2a');
I'm pretty sure a structure like this will be much easier to query, but you know your field better than I do. (On the other hand, database design is my field . . . )
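For example, picking a pattern then becomes a simple keyed lookup instead of a choice between nine columns (a sketch against the draft table above):
select pattern
from patterns_bilateral
where pattern_id = 1
  and root_num = 5;  -- returns 'te1u22a'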
Expanding on my earlier answer and our comments, take a look at this query. (The test table isn't even in 3NF, but the table's not important right now.)
create table test (
    root_id integer,
    root_substitution varchar[],
    length integer,
    form integer,
    pattern varchar(15),
    primary key (root_id, length, form, pattern)
);

insert into test values
(4, '{s,ş,m}', 3, 1, '1o2i3');
This is the important part.
select root_id
, root_substitution
, length
, form
, pattern
, translate(pattern, '12345', array_to_string(root_substitution, ''))
from test;
That query returns, among other things, the translation soşim.
Are we heading in the right direction?
Well, that's certainly a bizarre set of requirements! Here's my best guess, but obviously I haven't tried it. I used UNION ALL to combine the patterns of different sizes and then filtered them based on length. You might need to move the length condition inside each of the subqueries for speed reasons, I don't know. Then I chose the column using the CASE expression.
select word,
       translate(
         case root_form
           when 1 then patinfo.pattern1
           when 2 then patinfo.pattern2
           ... up to pattern9
         end,
         '12345',
         array_to_string(root.root, '')) as word_combined
from tbl_words word
join tbl_root root
  on word.root_id = root.root_id
join tbl_patterns pat
  on word.pattern_id = pat.pattern_id
join (
      select 2 as pattern_length, pattern_id, pattern1, ..., pattern9
      from tbl_patterns_biliteral bi
      union all
      select 3, pattern_id, pattern1, pattern2, ..., pattern9
      from tbl_patterns_triliteral tri
      union all
      ...same for quadri and quinqui...
     ) patinfo
  on patinfo.pattern_id = pat.pattern_id
 and length(root.root) = patinfo.pattern_length
Consider combining all the different patterns into one pattern_details table with a root_length field to filter on; I think that would be easier than combining them all with UNION ALL. It might be even simpler with multiple rows in the pattern_details table, filtered on root_form. Perhaps the best layout would be a pattern_details table with fields pattern_id, root_length, root_form, and pattern. Then you just join from the word table through the pattern table to the pattern detail that matches all the right criteria.
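As a rough sketch of that last layout (pattern_details is a hypothetical name, and the root array column is assumed from the queries above):
create table pattern_details (
    pattern_id  integer     not null,
    root_length integer     not null,  -- 2 to 5
    root_form   integer     not null,  -- 1 to 9
    pattern     varchar(15) not null,
    primary key (pattern_id, root_length, root_form)
);

select w.word_id,
       translate(pd.pattern, '12345', array_to_string(r.root, '')) as word_combined
from tbl_words w
join tbl_roots r
  on r.root_id = w.root_id
join pattern_details pd
  on pd.pattern_id  = w.pattern_id
 and pd.root_form   = w.root_form
 and pd.root_length = length(array_to_string(r.root, ''));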
Of course, maybe I've completely misunderstood what you're looking for. If so, it would be clearer if you posted some example data and an example result.

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns: one has the ID, and the other has the abstract of a document, about 300-500 words long. There are about 500 rows.
The other table has only one column and more than 18,000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract in table 1 and identify which acronyms from table 2 are present in it.
Finally, the program will create a separate table where the first column contains the content of the first column of table 1 (i.e. the ID) and the second contains the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo-SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(documents.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zero discusses the value of ABC in the context of..."), ...]

joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, its running time is proportional to documents × words × acronyms, since acronyms.index performs a linear search (over our largest list, no less). We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zero discusses the value of ABC in the context of..."), ...]

index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
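For instance, a rough sketch of that join done directly in the database, assuming tables acronym(id, value) and document(id, abstract) and standard-SQL || concatenation:
-- pad the abstract with spaces so the LIKE only matches whole words;
-- slow on large tables, but the database handles the join for you
SELECT a.id AS acronym_id,
       d.id AS document_id
FROM acronym a
JOIN document d
  ON ' ' || d.abstract || ' ' LIKE '% ' || a.value || ' %';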
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc.; however, it did not work in Microsoft Access.
The second and the third are for Python, I assume. Can I feed the acronyms and documents as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
