Information Extraction from Structured and Unstructured Documents Using Natural Language Processing

Gulshan
7 min read · Jul 11, 2021

Introduction

This post revolves around the creation of a system capable of extracting information from structured documents, i.e. documents containing tables (for example, invoices), as well as unstructured documents (for example, a biodata document).

The technologies used are:

  1. Natural language processing (NLP) for text classification; the library used is spaCy 3.0
  2. Libraries like pdfminer and pdfplumber for handling PDFs, and pytesseract, an OCR-based library, for image-based PDFs
  3. Django to link the front end, i.e. the file-upload interface, with the Python code

The steps involved in the text extraction are:

1. Data Collection

Several types of documents, including structured ones, were provided, and they were mostly in the form of PDFs.

Challenges

# The PDFs provided were not uniform, i.e. some PDFs were image-based (we can't select text from them) while others were text-selectable.

We used optical character recognition (OCR) to overcome this problem.

The steps involved are:

a. The PDFs were converted to images, structured so that each page becomes one image.

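A minimal sketch of this step with pdf2image (the DPI value and the file-naming scheme here are assumptions):

from pdf2image import convert_from_path

def pdf_to_images(pdf_path):
    # Each element of pages is a PIL Image, one per PDF page.
    pages = convert_from_path(pdf_path, dpi=300)
    image_paths = []
    for i, page in enumerate(pages):
        path = f"page_{i}.jpg"
        page.save(path, "JPEG")
        image_paths.append(path)
    return image_paths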

b. The images are then pre-processed for efficient extraction, i.e. with the help of OpenCV the images were cropped and the text sections focused so that the OCR engine can provide better results.

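A minimal sketch of such preprocessing with OpenCV (the grayscale conversion, Otsu thresholding and cropping heuristic are assumptions about the exact steps used):

import cv2

def preprocess(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binarise with Otsu's threshold so the text stands out from the background.
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Crop to the bounding box of the dark (text) pixels.
    coords = cv2.findNonZero(255 - thresh)
    x, y, w, h = cv2.boundingRect(coords)
    return thresh[y:y + h, x:x + w]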

c. An open-source module named pytesseract is used to extract the text.

Here is an overview of pytesseract:

# Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images.

# Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

An imageprocess function processes all the images produced by pdf2image and converts each one to a string.

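A minimal sketch, reusing the pdf_to_images and preprocess helpers sketched above:

import pytesseract

def imageprocess(image_paths):
    text = ""
    for path in image_paths:
        processed = preprocess(path)  # OpenCV preprocessing from step b
        # image_to_string accepts a PIL Image or a numpy array.
        text += pytesseract.image_to_string(processed)
    return text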

d. The text obtained is in raw format, so some text cleaning is done.

raw output

i.e. removal of garbage characters like extra spaces, “\n” and “\t”.
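A minimal sketch of this cleanup (the exact cleaning rules are an assumption):

import re

def clean_text(raw):
    # Collapse tabs, newlines and runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", raw).strip()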

2. The next stage was to classify and characterise the data

This is divided into two parts:

  1. For structured data
  2. For unstructured data

For Structured Data

# For structured documents, we got data that can be separated if we are able to identify the delimiter.

e.g. the data looked like this:

For this, the data is mostly extracted using regex and some modules like flatten, an NLTK utility used to simplify the nested (recursive) data obtained from the initial extraction.

With the help of pdfplumber, which provides the structured data in the form of tables, the data can be traversed and extracted by index.
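A minimal sketch of the pdfplumber step (the filename is hypothetical; final_data_1 matches the variable used in the snippet below):

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    final_data_1 = []
    for page in pdf.pages:
        # extract_tables() returns each table as a list of rows (lists of cells).
        final_data_1.extend(page.extract_tables())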

e.g. if we have indices named “name” and “details”, the data can be extracted like this:

for i in final_data_1:
    # Find the rows that mark the start and end of the section of interest.
    a = i.index(['name'])
    b = i.index([' DETAILS'])
    # Slice out the rows between the two markers.
    data = i[a:b]


Then a final dictionary is made, consisting of the keys and values obtained as per the requirement.
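A minimal sketch, assuming each extracted row holds a key in its first cell and a value in its second:

final_dict = {}
for row in data:
    if len(row) >= 2:
        final_dict[row[0]] = row[1]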

2. For Unstructured Data

We faced many challenges with this format, including:

a. No dataset was available for classifying the data

b. There was no regularity in the data

c. The documents came in multiple formats

We overcame the challenge of having no dataset by annotating the data ourselves for custom named-entity classification.

NER

Named entity recognition (NER) is a sub-task of information extraction (IE) that seeks out and categorises specified entities in a body or bodies of texts. NER is also simply known as entity identification, entity chunking and entity extraction.

The annotated data looked like this:
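An illustrative annotation in the common (text, {"entities": [...]}) format; the sentence, character offsets and labels here are made-up examples:

TRAIN_DATA = [
    (
        "Gulshan is a software engineer at Example Corp.",
        {"entities": [(0, 7, "NAME"), (34, 46, "ORG")]},
    ),
]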

spaCy for NER

SpaCy is an open-source library for advanced Natural Language Processing in Python. It is designed specifically for production use and helps build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. Some of the features provided by spaCy are tokenization, part-of-speech (PoS) tagging, text classification and named entity recognition.

SpaCy provides an exceptionally efficient statistical system for NER in Python, which can assign labels to contiguous groups of tokens. It provides a default model which can recognize a wide range of named or numerical entities, including person, organization, language, event, etc. Apart from these default entities, spaCy also gives us the liberty to add arbitrary classes to the NER model, by training the model to update it with newer trained examples.

The steps involved in using spaCy are:

  1. spaCy 3.0 does not support the JSON format, so we had to convert the dataset to the .spacy format, which can be done like this:
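A minimal sketch of the conversion, assuming annotations in the TRAIN_DATA format shown above:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip annotations that don't align with token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")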

2. For training purposes we need a configuration file, which can be made with the help of the command-line utility provided by spaCy. To make a configuration, we essentially inherit the base configuration and override it.

To do this, we use the following command:
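Per the spaCy documentation (base_config.cfg is the file saved from the quickstart widget):

python -m spacy init fill-config base_config.cfg config.cfg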

The steps are:

INSTRUCTIONS: WIDGET

  1. Select your requirements and settings.
  2. Use the buttons at the bottom to save the result to your clipboard or a file base_config.cfg.
  3. Run init fill-config to create a full config.
  4. Run train with your config and data.

INSTRUCTIONS: CLI

  1. Run the init config command and specify your requirements and settings as CLI arguments.
  2. Run train with the exported config and data.

Here we make changes such as how many iterations we require when training the model, and we can specify the batch size.
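For example, the relevant parts of config.cfg might be edited like this (the values shown are illustrative, not the ones used in this project):

[nlp]
batch_size = 128

[training]
max_epochs = 30
eval_frequency = 200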

After defining the pipeline components, we came across scenarios like these:

  1. Train a new component from scratch on your data.
  2. Update an existing trained component with more examples.
  3. Include an existing trained component without updating it.
  4. Include a non-trainable component, like a rule-based EntityRuler or Sentencizer, or a fully custom component.

Training with custom code

The spacy train recipe lets you specify an optional argument --code that points to a Python file. The file is imported before training and allows you to add custom functions and architectures to the function registry that can then be referenced from your config.cfg. This lets you train spaCy pipelines with custom components, without having to re-implement the whole training workflow. When you package your trained pipeline later using spacy package, you can provide one or more Python files to be included in the package and imported in its __init__.py. This means that any custom architectures, functions or components will be shipped with your pipeline and registered when it’s loaded.

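A minimal sketch of the training invocation (the paths are illustrative, and functions.py is a hypothetical file holding the registered custom functions; --code is only needed when custom code is used):

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy --code functions.py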

The next step is to load the model and get the desired results.

We make some variables to store information like the name.

Then we call the model to get the desired results.
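A minimal sketch of inference, reusing the helpers sketched earlier (the NAME label matches the illustrative annotation above, and model-best is spaCy's conventional output directory):

import spacy

nlp = spacy.load("./output/model-best")

# extracted_text is the cleaned OCR output produced by the pipeline above.
extracted_text = clean_text(imageprocess(pdf_to_images("input.pdf")))
doc = nlp(extracted_text)

name = None
for ent in doc.ents:
    if ent.label_ == "NAME":
        name = ent.text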

The results were like this

The Final Part

Django is used to make an upload interface; the above system then performs the information extraction and returns the data in a table.
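A minimal sketch of such a Django view (the template names and the extract_information wrapper are hypothetical):

from django.shortcuts import render

def upload(request):
    if request.method == "POST" and request.FILES.get("document"):
        uploaded = request.FILES["document"]
        # Save the uploaded PDF to disk so the extraction pipeline can read it.
        with open("uploaded.pdf", "wb") as f:
            for chunk in uploaded.chunks():
                f.write(chunk)
        # Hypothetical wrapper around the OCR + NER pipeline described above.
        results = extract_information("uploaded.pdf")
        return render(request, "results.html", {"results": results})
    return render(request, "upload.html")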

After clicking on “view”, we get information like this:

Thank you
