Uploading Multiple Models to LightTag

Check out the video explaining this notebook here

In this tutorial we’ll use four different models to generate suggestions. We’ll then use LightTag’s review feature to compare the models’ performance and generate a high-precision labeled dataset. To showcase the API and techniques, we’ll do this on two distinct datasets: data from the Federal Register and a collection of political tweets.

The models we’ll be using are the Named Entity Recognition components from each of the following: Stanford CoreNLP, spaCy (both the large en_core_web_lg and the small en_core_web_sm English models), and Zalando’s Flair.

Outline

This guide is broken down into a few parts as follows:

  1. First, we write utility functions for each of the models above, which run them on our text and return the results in LightTag’s expected format

  2. We’ll pull the datasets we want to process from LightTag and run each of the models on them

  3. We’ll unify the model outputs, since some of them output different names for the same thing (ORG vs ORGANIZATION)

  4. We’ll create a new Schema based on the unified tags

  5. Upload the model suggestions

  6. Review the Data in LightTag

  7. Pull metrics

Part 1 - Adapters for our four models

[52]:
from ltsession import LTSession # Thin wrapper over LightTag's api, get it here (https://gist.github.com/talolard/793563397c48dca32f75c9d4b6f8f560)
import spacy
import requests
import re
import pandas as pd # We use this to check ourselves
from flair.data import Sentence
from flair.models import SequenceTagger

CoreNLP

We’re running CoreNLP in a Docker container. CoreNLP trims whitespace and sometimes returns overlapping annotations, so we need to handle both of those cases.

[33]:
preWhiteSpace = re.compile(r'^\s+')

def stanford_to_lighttag_format(example,ent):
    '''
    Takes a LightTag example and a Stanford entity and returns a LightTag suggestion
    '''
    match = preWhiteSpace.search(example['content'])
    offset = match.end(0) if match else 0 # CoreNLP strips leading whitespace, so we use that regex to adjust offsets
    start = ent["characterOffsetBegin"] + offset
    end = ent["characterOffsetEnd"] + offset
    return {
        "example_id":example["id"],
        "start":start,
        "end":end,
        "tag":ent["ner"],
        "value":example['content'][start:end]
#         "tag_id":tagMap[sug["ner"]]
    }
# This is the URL of the CORENLP server running in a docker container
url='http://localhost:9000/?properties={"annotators":"ner","outputFormat":"json"}'
def process_with_stanford(example):
    '''
    Gets a LightTag example, runs coreNLP on it and returns a list of suggestions in LightTag format
    '''
    results = []
    txt = example['content'].encode('utf8') # We need to send it bytes
    data = requests.post(url,data=txt,).json() #Send to the container
    cursor = -1 # Track the end of the last CoreNLP annotation, so we can ignore overlapping ones
    for sentence in data['sentences']: # CoreNLP does sentence parsing as well, which we don't care about

        for entity in sentence["entitymentions"]: # Iterate over the entities
            sug = stanford_to_lighttag_format(example,entity) # Convert the Stanford entity to LightTag format
            if sug['start'] > cursor: # Don't accept overlaps
                results.append(sug)
                cursor=sug['end']
    return results #The list of lighttag suggestions

Spacy Big and Small

We’re using two of spaCy’s built-in NER models. It’s a little easier than Stanford.

[25]:
big_nlp = spacy.load("en_core_web_lg") # Load the big spacy model
small_nlp = spacy.load("en_core_web_sm") #Load the small spacy model

def spacyToSug(example,ent):
    return {
        "example_id":example["id"],
        "start":ent.start_char,
        "end":ent.end_char,
        "tag":ent.label_,
        "value":example['content'][ent.start_char:ent.end_char]
    }
def process_with_spacy_big(example):
    results = []
    doc = big_nlp(example['content'])
    for ent in doc.ents:
        results.append(spacyToSug(example,ent))
    return results
def process_with_spacy_small(example):
    results = []
    doc = small_nlp(example['content'])
    for ent in doc.ents:
        results.append(spacyToSug(example,ent))
    return results

Flair

Zalando’s Flair package has received many rave reviews and made some big claims, so it will be interesting to see how it fares against the others. The only thing to note is that if you are running on CPU, use the ner-fast model; the regular one is very slow.

[4]:
ftagger = SequenceTagger.load('ner-fast')
def flair_to_suggestions(example,ent):
    return {
        "example_id":example["id"],
        "start":ent.start_pos,
        "end":ent.end_pos,
        "tag":ent.tag,
        "value":example['content'][ent.start_pos:ent.end_pos]
#         "tag_id":tagMap[sug["ner"]]
    }

def process_with_flair(example):
    doc = Sentence(example['content'])
    ftagger.predict(doc)
    return [flair_to_suggestions(example,ent) for ent in doc.get_spans('ner')]

2019-10-17 14:53:55,792 loading file /home/tal/.flair/models/en-ner-fast-conll03-v0.4.pt

Putting them together

Here we write a helper function that receives a list of examples and processes each one with each of the models. We also collect the list of example_ids that we have seen, and later submit a testament for each (model, example) pair. A testament tells LightTag that the model saw the example, even if it made no predictions, which lets LightTag report accurate analytics.

[45]:
def process_multiple_examples(examples):
    models={ # Dictionary of models, each has a list of suggestions
        'spacy_big':[],
        'spacy_small':[],
        'stanford':[],
        'flair':[],

    }
    example_ids = [] # we use this to track which examples have been seen. Later we'll submit a testament to LightTag for each model
    for num,example in enumerate(examples):
        models['spacy_big'] += process_with_spacy_big(example)
        models['spacy_small'] += process_with_spacy_small(example)
        models['flair'] += process_with_flair(example)
        models['stanford'] += process_with_stanford(example)
        example_ids.append(example['id']) # Take note of the example_id we just processed
        if num % 10 == 0:
            print(num)
    return {'models':models,'example_ids':example_ids}

Part 2 - Get The Data From LightTag

We’ve already uploaded the data to our LightTag workspace, but if you’d like to follow along on yours, you can find the raw data here. The important point is that you pull the data from your LightTag workspace so that you have the example_ids.

[21]:
session = LTSession(workspace='demo',user='lighttag',pwd='Shiva666') # Start an API session
[62]:
fed_reg_examples = session.get('v1/projects/default/datasets/fedreg/examples/').json() # Retrieve the examples from the fedreg dataset
trump_examples = session.get('v1/projects/default/datasets/tweets/examples/').json() # Retrieve the examples from the tweets dataset
examples = fed_reg_examples[:250] + trump_examples[:250] #We take a subset because Flair is slow
[23]:
len(fed_reg_examples),len(trump_examples),len(examples)
[23]:
(1818, 6444, 8262)
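
Before running the models, it can help to confirm the shape of the records we just pulled. The adapter functions in Part 1 only rely on each example having an id and a content field; any other fields your workspace returns are ignored. A minimal sanity-check sketch (the exact set of extra fields may vary):

[ ]:
# Quick sanity check: the adapters above only assume these two fields exist
sample = fed_reg_examples[0]
assert 'id' in sample and 'content' in sample
print(list(sample.keys()))  # inspect whatever other fields your workspace returns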
[63]:
#Run all of the models on the data
result_dict = process_multiple_examples(examples)


0
10
20
...
490
[64]:
# This is what the results look like
model_outputs = result_dict['models']
model_outputs['stanford'][0]
[64]:
{'example_id': '40e46279-6602-4d97-bf61-66878d565d1d',
 'start': 4,
 'end': 23,
 'tag': 'MISC',
 'value': 'Trade Agreement Act'}

Part 3 - Normalizing the Output

We ran different models, and while all of them do “NER”, they use different terms and different granularities. In order to compare them, we need to normalize the tags they use, which is what we do below. We make a dictionary that maps each tag we want to replace to its replacement value, then iterate over the suggestions and apply it when necessary.

[65]:
mapper_dict = dict(ORGANIZATION='ORG',CARDINAL='NUMBER',LOCATION='GPE',LOC='GPE',COUNTRY='GPE',STATE_OR_PROVINCE='GPE',
            NATIONALITY='NORP',WORK_OF_ART='MISC',CITY='GPE',IDEOLOGY='NORP',PER='PERSON',ORDINAL='NUMBER',
            PRODUCT='MISC',RELIGION='NORP'
            )
replace_if_need = lambda tag: mapper_dict.get(tag,tag) # If the tag is in the dict, return its replacement, otherwise keep it
def normalize_suggestion(suggestion):
    suggestion['tag'] = replace_if_need(suggestion['tag'])
    return suggestion
for model_name in model_outputs:
    model_outputs[model_name] =list(map(normalize_suggestion,model_outputs[model_name]))
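
As a quick programmatic check that the mapping actually took effect, we can assert that none of the tag names we mapped away survive in the normalized output. A small sketch using the variables defined above:

[ ]:
# None of the tags we mapped away should appear in the normalized suggestions
remaining = {s['tag'] for suggestions in model_outputs.values() for s in suggestions}
assert remaining.isdisjoint(mapper_dict.keys()), remaining & set(mapper_dict.keys())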


Checking ourselves

[66]:

frames = []
for model_name in model_outputs:
    suggestions_pd = pd.DataFrame(model_outputs[model_name])
    suggestions_pd['model'] = model_name
    frames.append(suggestions_pd)
AllSuggestions = pd.concat(frames)

It’s really useful to look at a pivot table of tags vs. models, counting how often each model predicted each tag. This tells us nothing about who did a better job, but it can help us recognize overlapping tag names or tags we might not care about.

[68]:
AllSuggestions.pivot_table(index='model',columns='tag',values='start',aggfunc=len).fillna(0).T
[68]:
model             flair  spacy_big  spacy_small  stanford
tag
CAUSE_OF_DEATH      0.0        0.0          0.0       7.0
CRIMINAL_CHARGE     0.0        0.0          0.0       3.0
DATE                0.0      335.0        332.0     398.0
DURATION            0.0        0.0          0.0      32.0
EVENT               0.0        6.0         15.0       0.0
FAC                 0.0        7.0         25.0       0.0
GPE               351.0      565.0        563.0     596.0
HANDLE              0.0        0.0          0.0      90.0
LAW                 0.0      239.0        164.0       0.0
MISC              490.0       51.0         97.0     257.0
MONEY               0.0       58.0         43.0      31.0
NORP                0.0       52.0         56.0      71.0
NUMBER              0.0      585.0        608.0     846.0
ORG               726.0     1013.0       1012.0     552.0
PERCENT             0.0       19.0         25.0      24.0
PERSON            130.0      175.0        184.0     207.0
QUANTITY            0.0        1.0          1.0       0.0
SET                 0.0        0.0          0.0      20.0
TIME                0.0       26.0         21.0      17.0
TITLE               0.0        0.0          0.0     121.0
URL                 0.0        0.0          0.0     138.0

Part 4 - Defining a New Schema

In case we don’t already have a schema defined in LightTag that contains all of these tags, we can create one now with the API. We’ll take the list of tags that appeared in the dataframe we just calculated, then define a new schema.

[70]:
tags=AllSuggestions.tag.unique().tolist()


[73]:
schema_def = {
    'name':'ner-model-comparison',
    'tags':[{'name':t,'description':t } for t in tags]
}
new_schema =session.post('v1/projects/default/schemas/bulk/',json=schema_def)
[74]:
schema_id = new_schema.json()['id']
[75]:
schema_id
[75]:
11385

Part 5 - Registering the models and uploading suggestions

As we saw before, we need to register a model before we can upload suggestions to it. Models belong to a schema, like the one we just defined. In this example, we’ll iterate over the models we ran and, for each one, register it, upload its suggestions, and submit its testaments in one go.

[77]:
registered_models = {} # Keep track of the models we've already registered
for model_name in model_outputs:
    model_def = {  #definition of the model
        'schema':schema_id,
        'name':model_name,
        'metadata':{
            'anything':['you','want']
        }
    }
    response = session.post('v2/models/',json=model_def) # Send it to LightTag
    model = response.json() # Get back the model we just registered
    registered_models[model_name] = model # Store it for later

    session.post(model['url']+'suggestions/',json=model_outputs[model_name]) # Send the suggestions
    session.post(model['url']+'testaments/',json=result_dict['example_ids']) # Testaments tell LightTag all of the examples this model has seen
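
If you want to be defensive about failed uploads, you can capture the responses from the two POST calls inside the loop above and fail loudly on errors. A minimal sketch of that variant, assuming LTSession returns requests.Response objects (which the .json() calls above suggest):

[ ]:
# Hypothetical variant of the two upload calls inside the loop above:
# capture the responses and raise on any HTTP error instead of posting fire-and-forget
sug_resp = session.post(model['url']+'suggestions/', json=model_outputs[model_name])
test_resp = session.post(model['url']+'testaments/', json=result_dict['example_ids'])
for resp in (sug_resp, test_resp):
    resp.raise_for_status()  # requests' built-in check for HTTP error status codes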

Part 6 - Reviewing The Results In LightTag

Now that our suggestions have been uploaded, we want to know how our models compare to each other. We can get a rough sense by looking at the Inter Model Agreement in LightTag’s analytics dashboard. But agreement isn’t enough: we want to know who was right and who was wrong, and we’ll use LightTag’s review feature for that.

Inter Model Agreement

A quick way to see whether our models tend to agree or conflict. In our case, it looks like there is a lot of disagreement.

Review

Still, the question remains: are any of these models better than the others? Do they perform differently on the two datasets? There’s only one way to find out: by reviewing the data. Luckily, LightTag makes this easy with Review Mode.

Inter Model Agreement API

The same agreement numbers shown in the dashboard are also available programmatically:

[87]:
IMA = session.get("/v1/metrics/model/iaa/",params={"schema_id":schema_id}).json()
IMA = pd.DataFrame(IMA)
IMA.head()
[87]:
dataset id model_x model_y num_agree schema size
0 42d4181d-934f-4c58-850d-ecdf6fdeb830 62f92ec5-8ae5-4843-aa16-99a638360dc5/62f92ec5-... 62f92ec5-8ae5-4843-aa16-99a638360dc5 62f92ec5-8ae5-4843-aa16-99a638360dc5 2691 f46c639f-7359-4978-9104-62d23b20656d 2691
1 8f5bd425-ae8c-45f2-9cdd-f297ae6a5806 62f92ec5-8ae5-4843-aa16-99a638360dc5/62f92ec5-... 62f92ec5-8ae5-4843-aa16-99a638360dc5 62f92ec5-8ae5-4843-aa16-99a638360dc5 455 f46c639f-7359-4978-9104-62d23b20656d 455
2 42d4181d-934f-4c58-850d-ecdf6fdeb830 62f92ec5-8ae5-4843-aa16-99a638360dc5/96f8d8a1-... 62f92ec5-8ae5-4843-aa16-99a638360dc5 96f8d8a1-7e00-4ba8-b806-76ae745bf13e 1881 f46c639f-7359-4978-9104-62d23b20656d 2691
3 8f5bd425-ae8c-45f2-9cdd-f297ae6a5806 62f92ec5-8ae5-4843-aa16-99a638360dc5/96f8d8a1-... 62f92ec5-8ae5-4843-aa16-99a638360dc5 96f8d8a1-7e00-4ba8-b806-76ae745bf13e 346 f46c639f-7359-4978-9104-62d23b20656d 455
4 42d4181d-934f-4c58-850d-ecdf6fdeb830 62f92ec5-8ae5-4843-aa16-99a638360dc5/b24cf488-... 62f92ec5-8ae5-4843-aa16-99a638360dc5 b24cf488-8b83-4766-b6cd-cee6afac2fe7 353 f46c639f-7359-4978-9104-62d23b20656d 2691
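
The raw metrics come back keyed by model UUIDs, so it helps to map them back to the names we registered and compute an agreement rate per pair. A minimal sketch, assuming each entry in registered_models carries an 'id' field matching the model_x / model_y columns above:

[ ]:
# A minimal sketch: per-pair agreement rate, with model UUIDs mapped back to readable names.
# Assumes each registered model dict includes an 'id' matching model_x / model_y above.
id_to_name = {m['id']: name for name, m in registered_models.items()}
IMA['agreement_rate'] = IMA['num_agree'] / IMA['size']
IMA['model_x_name'] = IMA['model_x'].map(id_to_name)
IMA['model_y_name'] = IMA['model_y'].map(id_to_name)
IMA.pivot_table(index='model_x_name', columns='model_y_name', values='agreement_rate')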