I have a dataset annotated in the spaCy 2 format, like below:
td = ["Where is Shaka Khan lived.I Live In London.", {"entities": [(9, 19, "FRIENDS"),(32, 37, "JILLA")]}]
My dataset has sequence lengths greater than 512 and I am trying to migrate to Hugging Face, so I would like to split each document into sentences and at the same time update the entity tagging. Are there any tools available for that? My expected result should be like below:
td = [["Where is Shaka Khan lived.", {"entities": [(9, 19, "FRIENDS")]}],["I Live In London.", {"entities": [(10, 16, "JILLA")]}],]
Why do it with spaCy? Write a small parser that splits the text, then run spaCy on the already split sentences; it will give you the same result you want.
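There is no ready-made tool that I know of; a rough sketch of such a parser (splitting naively on the period and assuming the character offsets in your data are consistent with the text; split_annotated_doc is just a made-up name):

import re

def split_annotated_doc(text, entities):
    # Split one annotated document into per-sentence examples, shifting each
    # entity's character offsets into its own sentence.
    result = []
    for match in re.finditer(r'[^.]+\.', text):   # naive split: up to each '.'
        sent_start, sent_end = match.start(), match.end()
        sent_text = text[sent_start:sent_end]
        shift = sent_start + (len(sent_text) - len(sent_text.lstrip()))
        sent_text = sent_text.lstrip()            # drop a leading space, if any
        sent_entities = [(start - shift, end - shift, label)
                         for start, end, label in entities
                         if start >= shift and end <= sent_end]
        result.append([sent_text, {"entities": sent_entities}])
    return result

td = ["Where is Shaka Khan lived.I Live In London.",
      {"entities": [(9, 19, "FRIENDS"), (32, 37, "JILLA")]}]
print(split_annotated_doc(td[0], td[1]["entities"]))

In practice you would swap in a proper sentence splitter (spaCy's sentencizer, nltk, ...), but the offset arithmetic stays the same.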
In my mobile app, I retrieve data from a table in my SQL Server database. I'm using EF, and I use pagination for better performance. I need to retrieve data starting from the last element of the table, so if the table has 20 rows, I need IDs 20, 19, 18, 17, 16 for page 0, then IDs 15, 14, 13, 12, 11 for page 1, and so on...
The problem is this: what if, while user "A" is downloading data from the table, user "B" adds a row? If user "A" gets page 0 (so IDs 20, 19, 18, 17, 16), and user "B" at the same moment adds a row (so ID 21), then with the classic query user "A" will get IDs 16, 15, 14, 13, 12 for page 1... so ID 16 a second time.
My code is very simple:
int RecordsForPagination = 5;
var list = _context.NameTable
    .Where(my_condition)
    .OrderByDescending(my_condition_for_ordering)
    .Skip(RecordsForPagination * Page)
    .Take(RecordsForPagination)
    .ToList();
Of course, Page is an int that comes from the frontend.
How can I solve the problem?
I found a solution, but I don't know if it's the perfect one. I could use
.SkipWhile(x => x.ID >= LastID)
instead of
.Skip(RecordsForPagination * Page)
and of course LastID is always sent from the frontend.
Do you think that performance is always good with this code? Is there a better solution?
The performance impact will depend greatly on your SQL index implementation and the order by clause. But it's not so much a question about performance as it is about getting the expected results.
Stack Overflow is a great example: there is such a volume of activity that when you get to the end of any page, the next page may contain records from the page you just viewed, because the underlying recordset has changed (more posts have been added).
I bring this up because in a live system this is generally accepted and in some cases expected behaviour. As developers we appreciate the added overhead of trying to maintain a single result set and recognise that there is usually much lower value in trying to prevent what look like duplications as you iterate the pages.
It is often enough to explain to users why this occurs; in many cases they will accept it.
If it is important to you to maintain your place in the original result set, then you should constrain the query with a Where clause, but you'll need to retrieve either the Id or the timestamp in the original query. In your case you are attempting to use LastID, but getting the last ID would require a separate query on its own, because the OrderBy clause will affect it.
You can't really use .SkipWhile(x => x.ID >= LastID) for this, because Skip is a sequential process that is affected by the order and is disengaged the first time the expression evaluates to false; so if your order is not based on Id, your SkipWhile might result in skipping no records at all.
int RecordsForPagination = 5;
int? MaxId = null;
...
var query = _context.NameTable.Where(my_condition);
// We need the Id to constrain the original search
if (!MaxId.HasValue)
    MaxId = query.Max(x => x.ID);
var list = query.Where(x => x.ID <= MaxId)
               .OrderByDescending(my_condition_for_ordering)
               .Skip(RecordsForPagination * Page)
               .Take(RecordsForPagination)
               .ToList();
It is generally simpler to filter by a point in time, as this is known on the client without a round trip to the DB, but depending on the implementation, filtering on dates can be less efficient.
I'm basically importing both tables from this website.
Q1. When I import data via IMPORTHTML and sort the results, some results show up as "5 cases or less", which makes them hard to sort. I want these to come after 0. How can I make sure of that?
Q2. I wish to import only the number from the element; it's basically a whole phrase. How do I import only the last number, in this case 1,991?
Here is my Sample Google Sheet
try:
=ARRAY_CONSTRAIN(SORT({IMPORTHTML(
"https://santemontreal.qc.ca/en/public/coronavirus-covid-19/",
"table", 1), REGEXREPLACE(SUBSTITUTE(IMPORTHTML(
"https://santemontreal.qc.ca/en/public/coronavirus-covid-19/", "table", 1),
"5 cases or less", 0), "\*", )*1}, 6, 0), 99^99, 2)
Not sure if you want the 3rd and 4th columns too; if you do, change 2 to 4.
=REGEXREPLACE(IMPORTXML(
 "https://santemontreal.qc.ca/en/public/coronavirus-covid-19/",
 "//div[@id='c36390']/h4[2]"), "(.*) ", )*1
I am trying to find results by color. In the database, it is recorded in rgb format: an array of three numbers representing red, green, and blue values respectively.
Here is how it is stored in the db and elasticsearch record (storing 4 rgb colors in an array):
"color_data":
[
[253, 253, 253],
[159, 159, 159],
[102, 102, 102],
[21, 21, 21]
]
Is there a query strategy that will allow me to find similar colors? i.e. exact match or within a close range of rgb values?
Here is a method I am trying, but the addressing method to access array values doesn't work:
curl -X GET 'http://localhost:9200/_search' -d '{
  "from": 0,
  "size": 50,
  "range": {
    "color_data.0.0": {
      "gte": "#{b_lo}",
      "lte": "#{b_hi}"
    },
    "color_data.0.1": {
      "gte": "#{g_lo}",
      "lte": "#{g_hi}"
    }
  }
}'
(r_lo, r_hi, etc. are set to +/- 10 from the rgb values recorded in the color_data variable)
First, you should move the channel data to separate fields (or at least to an object field).
If you need a simple matching algorithm (± deviation without scoring), then you can perform simple filter > range queries, passing your fuzziness threshold in the query.
If you need scoring (how similar the docs are), then you need to use scripted queries. Take a look at this article.
Btw, I strongly recommend working in the HSL space if you need such operations; you'll get much better results. Take a look at this example.
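For example, here is a rough sketch with the Python client, assuming each color is re-indexed as an object with separate r/g/b fields (e.g. "colors": [{"r": 253, "g": 253, "b": 253}, ...]) and that "colors" is mapped as a nested field; the index name, field names and tolerance are placeholders, not something from your current index:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def similar_colors(r, g, b, tolerance=10, index="items"):
    # Filter documents having a nested color within +/- tolerance per channel.
    query = {
        "size": 50,
        "query": {
            "nested": {
                "path": "colors",
                "query": {
                    "bool": {
                        "filter": [
                            {"range": {"colors.r": {"gte": r - tolerance, "lte": r + tolerance}}},
                            {"range": {"colors.g": {"gte": g - tolerance, "lte": g + tolerance}}},
                            {"range": {"colors.b": {"gte": b - tolerance, "lte": b + tolerance}}},
                        ]
                    }
                }
            }
        }
    }
    # body= is the older client style; newer clients also accept the query
    # pieces as keyword arguments.
    return es.search(index=index, body=query)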
I'm starting with MongoDB.
I have a text file with 3,400,000 lines of data and I want to upload that data to a MongoDB database.
The text file looks like this:
360715;157.55.34.97;Mozilla/5.0;/pub/index.asp;NULL
3360714;157.55.32.233;Mozilla/5.0;/pub/index.asp;NULL
....
and I want to put it on a mongodb database with the following structure :
{'log' : '360715;157.55.34.97;Mozilla/5.0;/pub/index.asp;NULL'}
{'log' : '3360714;157.55.32.233;Mozilla/5.0;/pub/index.asp;NULL'}
....
Currently I am uploading line by line, like this:
for data_line in records:
    parsed_line = re.sub(r'[^a-zA-Z0-9,.\;():/]', '', data_line)
    to_insert = unicode(parsed_line, "utf-8")
    db.data_source.insert({'log': to_insert})
Is there a way to upload all the lines at once in the format I want?
The line-by-line approach is much too slow.
Previously, I wrote a script in Python to parse the text file; it takes one second for 100,000 lines and 65 seconds for 3,400,000.
I considered using MongoDB to improve the process, but with MongoDB it takes the same time just to upload 100,000 lines to the database as my script takes to parse all the data. So it's not too hard to say that I'm doing something badly wrong with MongoDB.
I'm grateful if someone can help me.
Greetings, João
I would try bulk inserting the data into MongoDB. Right now you are calling insert() once for every single line in your text file, which is expensive.
If you are running MongoDB version 2.6.x you can use bulk inserts: http://api.mongodb.org/python/current/examples/bulk.html. Example code from the tutorial below:
>>> import pymongo
>>> db = pymongo.MongoClient().bulk_example
>>> db.test.insert(({'i': i} for i in xrange(10000)))
[...]
>>> db.test.count()
10000
In this case a single insert() call inserted 10,000 documents into MongoDB.
Another example below:
>>> new_posts = [{"author": "Mike",
... "text": "Another post!",
... "tags": ["bulk", "insert"],
... "date": datetime.datetime(2009, 11, 12, 11, 14)},
... {"author": "Eliot",
... "title": "MongoDB is fun",
... "text": "and pretty easy too!",
... "date": datetime.datetime(2009, 11, 10, 10, 45)}]
>>> posts.insert(new_posts)
[ObjectId('...'), ObjectId('...')]
http://api.mongodb.org/python/current/tutorial.html
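If you are on PyMongo 3.x or later, insert_many does the same thing. Here is a rough adaptation of your loop (the database name, file name and batch size are just placeholders):

import re
from pymongo import MongoClient

db = MongoClient().my_database          # database name is a placeholder

batch = []
with open("logfile.txt") as records:    # file name is a placeholder
    for data_line in records:
        parsed_line = re.sub(r'[^a-zA-Z0-9,.\;():/]', '', data_line)
        batch.append({'log': parsed_line})
        if len(batch) >= 10000:         # flush every 10,000 documents
            db.data_source.insert_many(batch)
            batch = []
if batch:                               # insert whatever is left over
    db.data_source.insert_many(batch)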
I have two tables.
In one table there are two columns: one has the ID and the other the abstract of a document, about 300-500 words long. There are about 500 rows.
The other table has only one column and more than 18,000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO, etc.
I am interested in a script that will scan each abstract in table 1 and identify one or more of the acronyms present in it that are also present in table 2.
Finally, the program will create a separate table where the first column contains the content of the first column of table 1 (i.e. the ID) and the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo-SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(documents.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, its running time is roughly cubic (documents × words per abstract × acronyms), since acronyms.index performs a linear search (over our largest list, no less). We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
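For instance, here is a rough sketch with SQLite from Python; the table and column names (documents, acronyms, id, abstract, value) are made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, abstract TEXT);
    CREATE TABLE acronyms  (value TEXT);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(0, "Document zero discusses the value of ABC in the context of NGF")])
conn.executemany("INSERT INTO acronyms VALUES (?)", [("ABC",), ("NGF",), ("EPO",)])

# Whole-word match by padding the abstract with spaces; punctuation handling
# is left out of this sketch.
rows = conn.execute("""
    SELECT d.id, a.value
    FROM documents d
    JOIN acronyms a ON ' ' || d.abstract || ' ' LIKE '% ' || a.value || ' %'
""").fetchall()
print(rows)   # e.g. [(0, 'ABC'), (0, 'NGF')]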
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc. However, it did not work in Microsoft Access.
The second and the third are for Python, I assume. Can I feed the acronyms and documents as input files?
babru
It didn't work in Access because tables are accessed differently there (e.g. acronym.[id]).