A Guide to Explainable Named Entity Recognition

Named entity recognition (NER) is a crucial part of many NLP projects. There are plenty of examples of NER implementations, but when it comes to understanding how the NER process works in the background or how it behaves with the data, more explainability is needed. In this article, we will try to make named entity recognition explainable and interpretable, which will help us understand its working concept. The main points to be covered in this article are listed below.

Contents

  1. What is Named Entity Recognition (NER)?
  2. Extract data from Kaggle
    1. Loading data
    2. Data preprocessing
  3. NER modeling
  4. Explain the prediction

Let’s start by understanding Named Entity Recognition first.

What is Named Entity Recognition (NER)?

In one of our articles, we explained that named entities are the words in textual data that refer to objects existing in the real world. Examples of such real-world objects are the names of people, places, or things, and these objects have their own identity in the text. For example: Narendra Singh Modi, Mumbai, Plasto water tank, etc. Each named entity has its own class; for instance, Narendra Modi is the name of a person. Named entity recognition (NER) can be thought of as the process of making a machine recognize these objects along with their class and other specifications. In that article, we also discussed how to implement NER using the spaCy and NLTK libraries.
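As a quick refresher on that idea, here is a minimal example using spaCy (this assumes the small English model en_core_web_sm is installed; it is not part of the article's own code):

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Narendra Modi visited Mumbai last week.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Narendra Modi PERSON", "Mumbai GPE"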

Talking about the applications of NER, the process can be used to summarize information from documents, optimize search engine algorithms, identify biomedical entities, and recommend content.

In this article, we aim to make the NER implementation process more explainable. For this, we will use the Keras and LIME libraries. Additionally, we will discuss how to pull data from Kaggle into the Google Colab environment. Let’s start by extracting the data.

Extract data from Kaggle

To extract data from Kaggle, we need an account on the Kaggle website, which can be created using this link. After creating an account, we need to go to the account page. For example, the address of my account page is

https://www.kaggle.com/yugeshvermaaim/account

Here, yugeshvermaaim is the account name. After reaching the account page, we need to scroll down to the API section.

In this section, we have to click on the Create New API Token button, which provides us with a JSON file named kaggle.json. After downloading this file, we need to upload it to the Google Colab environment. We can find a shortcut to upload a file in the left panel of the Google Colab notebook.

The first button in this panel is for uploading a file, which we use to upload the kaggle.json file. After uploading the JSON file, we can use the following command to install Kaggle in the environment.

! pip install kaggle

After installation, we need to create a directory. Using the following command we can do this:

! mkdir ~/.kaggle

The following command copies kaggle.json into that directory:

! cp kaggle.json ~/.kaggle/

The following command gives the file the required permissions:

! chmod 600 ~/.kaggle/kaggle.json

Now, for the NER implementation, we are using an NER dataset from Kaggle, which can be found at this link. This dataset is an annotated corpus for named entity recognition and can be used for entity classification.

To download this data, we need to copy the API command, which can be retrieved from the three-dot menu on the right side of the New Notebook panel on the dataset page.

For our dataset, the API command is:

kaggle datasets download -d abhinavwalia95/entity-annotated-corpus

This command can be run in Google Colab:

!kaggle datasets download -d abhinavwalia95/entity-annotated-corpus

Output:

The downloaded file is a zip archive, which can be unzipped using the following command:

! unzip entity-annotated-corpus

Output:

The same steps can be adapted to extract any other dataset of interest from Kaggle. After this data extraction, we are ready to implement NER. Let’s start by loading the data.

Loading data

import pandas as pd

# Forward-fill propagates each sentence number down to all of its words
data = pd.read_csv("ner_dataset.csv", encoding="latin1").fillna(method="ffill")

Data Verification

data.head()

Output:

data.describe()

Output:

From the above description, we can see that there are 47,959 sentences containing 35,178 unique words, with 42 parts of speech and 17 categories of tags.

The image above shows sentence 4 from the dataset. To reconstruct sentences like this, I defined a function that joins all the words belonging to the same sentence number. Interested readers may find the function here.
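The function itself lives in the linked notebook; a minimal sketch of such a grouping, assuming the dataset’s "Sentence #", "Word", and "Tag" columns:

# Group rows by sentence number, collecting the words and tags of each sentence
grouped = data.groupby("Sentence #", sort=False).agg({"Word": list, "Tag": list})
sentences = [" ".join(words) for words in grouped["Word"]]
labels = list(grouped["Tag"])

print(sentences[3])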

Let’s check the tags for these words:

print(labels[3])

Output:

Data preprocessing

In this section, we will start by building a vocabulary of roughly 10,000 common words. The image below shows part of the vocabulary.

print(vocabulary)
len(vocabulary)

Output:
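The vocabulary construction is also in the linked notebook; a minimal sketch of how it could be built, assuming PAD and UNK placeholder tokens for padding and out-of-vocabulary words:

from collections import Counter

# Keep the most frequent words, reserving two slots for the placeholders
word_counts = Counter(data["Word"])
vocabulary = ["PAD", "UNK"] + [w for w, _ in word_counts.most_common(9998)]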

The second step in this section is to pad each sequence to a common length.

print(word2idx)
print(tag2idx)

Output:
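The mapping dictionaries and the padding step are defined in the notebook; a minimal sketch, assuming a common length of 50 and the "O" (outside) tag as label padding:

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 50  # assumed common sequence length

word2idx = {w: i for i, w in enumerate(vocabulary)}
tag2idx = {t: i for i, t in enumerate(sorted(data["Tag"].unique()))}
idx2tag = {i: t for t, i in tag2idx.items()}

# Encode every sentence and tag sequence, then pad both to MAX_LEN
X = pad_sequences([[word2idx.get(w, word2idx["UNK"]) for w in s.split()]
                   for s in sentences],
                  maxlen=MAX_LEN, padding="post", value=word2idx["PAD"])
y = pad_sequences([[tag2idx[t] for t in tags] for tags in labels],
                  maxlen=MAX_LEN, padding="post", value=tag2idx["O"])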

After creating the padded sequences of common length, we are ready to split our data.

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)

After these data preprocessing steps, we are ready for NER modeling.

NER modeling

For NER modeling, we used a bi-directional LSTM with a recurrent dropout of 0.1, compiled with the RMSprop optimizer. We fit the model for 5 epochs. The complete code for this procedure can be found here.
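The architecture is not reproduced in the article; a minimal sketch of such a model, reusing the assumed sizes from the preprocessing sketches above:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

model = Sequential([
    Embedding(input_dim=len(vocabulary), output_dim=64, input_length=MAX_LEN),
    Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.1)),
    TimeDistributed(Dense(len(tag2idx), activation="softmax")),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])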

Using the following code, we can train our network:

# The trailing dimension gives each timestep a single integer label, as
# expected by sparse_categorical_crossentropy
history = model.fit(X_train, y_train.reshape(*y_train.shape, 1),
                    batch_size=32, epochs=5,
                    validation_split=0.1, verbose=1)

After training the model, we are ready to explain the predictions.

Explain the prediction

As mentioned above, we will make the NER process more explainable using LIME, an open-source library for making artificial intelligence and machine learning models more explainable. Before explaining the NER model, we need to install it. We will also use the eli5 library, which bundles an implementation of LIME. Installation can be done using the following lines of code.

!pip install lime

Output:

!pip install eli5

Output:

After installation, we are ready to explain the NER model’s predictions. LIME explains classifiers, so to explain NER we fix one word position and treat predicting that word’s tag as a multiclass classification problem.

Let’s look at the 100th sample of our text data.

index = 99
label = labels[index]
text = sentences[index]
print(text)
print()
print(" ".join([f"{t} ({l})" for t, l in zip(text.split(), label)]))

Output:

Let’s start explaining the predictions of the NER model. For this, we have defined a class named NERExplainerGenerator with a method named get_predict_function, which provides the predictions to the TextExplainer of eli5’s LIME implementation.
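The class definition is in the linked notebook; a minimal sketch of what it could look like, reusing the preprocessing assumptions above (the signature here is illustrative, not the article’s exact code):

import numpy as np

class NERExplainerGenerator:
    """Wraps the trained model so LIME can query tag probabilities for one word."""

    def __init__(self, model, word2idx, idx2tag, max_len):
        self.model = model
        self.word2idx = word2idx
        self.idx2tag = idx2tag
        self.max_len = max_len

    def _texts_to_matrix(self, texts):
        # Encode each text, mapping out-of-vocabulary tokens to UNK
        X = np.full((len(texts), self.max_len), self.word2idx["PAD"], dtype="int32")
        for i, text in enumerate(texts):
            tokens = text.split()[:self.max_len]
            X[i, :len(tokens)] = [self.word2idx.get(w, self.word2idx["UNK"])
                                  for w in tokens]
        return X

    def get_predict_function(self, word_index):
        def predict_func(texts):
            X = self._texts_to_matrix(texts)
            probs = self.model.predict(X)   # shape: (n_samples, max_len, n_tags)
            return probs[:, word_index, :]  # tag probabilities for the chosen word
        return predict_func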

From the example above, we will use the word at index 6 and generate the corresponding prediction function. Using LIME’s MaskingTextSampler, we can initialize a sampler for the LIME algorithm.

from eli5.lime.samplers import MaskingTextSampler

# Instantiate the generator (signature follows the sketch above)
explainer_generator = NERExplainerGenerator(model, word2idx, idx2tag, MAX_LEN)

word_index = 6
predict_func = explainer_generator.get_predict_function(word_index=word_index)

# Replace masked tokens with "UNK" and mask at most 70% of the tokens per sample
sampler = MaskingTextSampler(
    replacement="UNK",
    max_replace=0.7,
    token_pattern=None,
    bow=False
)

Generating and printing a few samples from the sampler:

samples, similarity = sampler.sample_near(text, n_samples=6)
print(samples)

Output:

Initializing the TextExplainer from eli5’s LIME implementation to explain the prediction:

from eli5.lime import TextExplainer
te = TextExplainer(
    sampler=sampler,
    position_dependent=True,
    random_state=42
)
 
te.fit(text, predict_func)

After fitting it, we are ready to use the TextExplainer.

te.explain_prediction(
    target_names=list(explainer_generator.idx2tag.values()),
    top_targets=4
)

Output:

In the output above, we can see that we have made the NER model more explainable. The word Indian is heavily highlighted, which means it contributes most to the predicted tag; the dataset frequently contains India as a country name, so the model has learned this association.

Last words

In this article, we explained how to perform named entity recognition using the Keras library and how to use the LIME library to make it more explainable. We saw how named entity recognition works and how it behaves with the data.
