Introduction
This post describes building a system that extracts information from structured documents (documents made up of tables, such as invoices) as well as unstructured documents (such as a biodata document).
The technologies used are:
- Natural language processing (NLP) for text classification, using spaCy 3.0
- pdfminer for reading text-based PDFs, with OCR for image-based ones
- Django to connect the front end (the file upload interface) to the Python code
The steps involved in text extraction are:
1. Data Collection
Several types of documents, including structured ones, were provided, mostly as PDFs.
Challenges
# The PDFs were not uniform: some were image-based, so text could not be selected from them, while others had selectable text.
We used optical character recognition (OCR) to overcome this problem.
The steps involved are:
a. The image-based PDFs were converted to images, with each page converted to a separate image.
Code snippet
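A minimal sketch of this step, assuming the pdf2image package (which relies on poppler being installed). The function names, the file naming scheme and the dpi value are illustrative assumptions, not the original code:

```python
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build an output file name such as 'invoice_page_001.png' (hypothetical scheme)."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.png"


def pdf_to_images(pdf_path, out_dir, dpi=300):
    """Render every page of a PDF to a PNG file and return the saved paths."""
    # Imported here so the naming helper works even without poppler installed.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    # convert_from_path returns one PIL image per page.
    for number, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        target = out_dir / page_image_name(pdf_path, number)
        page.save(target, "PNG")
        saved.append(target)
    return saved
```

A higher dpi usually helps OCR on small print, at the cost of larger images and slower processing.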
b. The images are then pre-processed to make extraction more effective: with OpenCV, the images were cropped to focus on the text regions so that OCR gives better results.
Code snippet
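One possible way to do this kind of cropping with OpenCV is sketched below. It assumes dark text on a light page and uses Otsu thresholding to find the text pixels; the function names and the padding value are assumptions, and the original pre-processing may have differed:

```python
def padded_box(x, y, w, h, pad, img_w, img_h):
    """Grow a bounding box by `pad` pixels and clamp it to the image bounds."""
    left = max(x - pad, 0)
    top = max(y - pad, 0)
    right = min(x + w + pad, img_w)
    bottom = min(y + h + pad, img_h)
    return left, top, right, bottom


def crop_to_text(image_path, pad=10):
    """Return a grayscale crop that surrounds the dark (text) pixels of a page."""
    import cv2  # provided by the opencv-python package

    image = cv2.imread(str(image_path))
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Otsu thresholding, inverted so that text pixels become non-zero.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    points = cv2.findNonZero(mask)
    if points is None:  # blank page: nothing to crop
        return gray

    x, y, w, h = cv2.boundingRect(points)
    left, top, right, bottom = padded_box(x, y, w, h, pad, gray.shape[1], gray.shape[0])
    return gray[top:bottom, left:right]
```

The small padding keeps characters at the edge of the text block from being clipped.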
c. An open-source module named pytesseract is used to extract the text.
Here is an overview of pytesseract:
# Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images.
# Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
This function processes all the images produced by pdf2image; the imageprocess function then converts each image to a string.
Here is the snippet:
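A minimal sketch of such a function, assuming pytesseract and Pillow are installed along with the Tesseract engine. The names `ocr_image` and `images_to_text` are hypothetical and are not the `imageprocess` function from the original code:

```python
def ocr_image(image_path):
    """Run Tesseract on one image file and return the recognised text."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def images_to_text(image_paths, ocr=ocr_image):
    """OCR each page image in order and join the page texts with newlines."""
    return "\n".join(ocr(path) for path in image_paths)
```

Passing the OCR function as a parameter makes it easy to swap in a different engine or a stub while testing the rest of the pipeline.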
d. The text obtained is in raw format, so some text cleaning is done
raw output
i.e. removal of garbage characters such as extra spaces, “\n” and “\t”
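This kind of cleaning can be sketched with a single regular expression. The function name is an assumption, and the original cleaning may have handled more cases:

```python
import re


def clean_text(raw):
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    # \s+ matches any run of whitespace, including "\n" and "\t".
    return re.sub(r"\s+", " ", raw).strip()
```

Note that this also discards line breaks, so it suits text that will be processed as one continuous string rather than line by line.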
2. The next stage was to classify and characterise the data
This is divided into two parts:
- For Structured
- For Unstructured
1. For Structured data
# For structured documents, the data can be separated once we are able to identify the delimiter.
e.g. the data looked like this:
The data is mostly extracted using regex, along with helpers such as flatten, an nltk utility used to simplify the nested (recursive) data obtained from the initial extraction.
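As an illustration of delimiter-based extraction, the sketch below splits "key: value" lines with a regex and flattens nested lists. The sample lines and the ":" delimiter are hypothetical, and `flatten` here is a plain-Python stand-in rather than the nltk function:

```python
import re


def flatten(nested):
    """Flatten arbitrarily nested lists/tuples into one flat list (stand-in helper)."""
    flat = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def split_fields(line, delimiter=":"):
    """Split 'key: value' text on the first delimiter, trimming surrounding spaces."""
    pattern = r"\s*" + re.escape(delimiter) + r"\s*"
    return re.split(pattern, line.strip(), maxsplit=1)


# Hypothetical sample lines, not taken from the original documents.
lines = ["Invoice No : 1042", "Total: 250.00"]
pairs = [split_fields(line) for line in lines]
```

Using `maxsplit=1` keeps values that themselves contain the delimiter (for example a time such as 10:30) in one piece.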
pdfplumber provides the structured data in the form of tables that can be traversed by index, so the data is extracted by index.
e.g. if we have entries named “name” and “details”, the data between them can be extracted like this:
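for i in final_data_1:
    # list.index takes the value itself, not a list wrapping it
    a = i.index('name')
    b = i.index(' DETAILS')
    # everything from 'name' up to (not including) ' DETAILS'
    data = i[a:b]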
The snippet is attached.
Then a final dictionary is built, consisting of the keys and values obtained as required.
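Putting these pieces together, a rough sketch could look like the following. It assumes pdfplumber's `extract_tables()` is used to read the rows; the helper names and the sample row are hypothetical:

```python
def slice_between(cells, start_label, end_label):
    """Return the cells from start_label up to (not including) end_label."""
    start = cells.index(start_label)
    end = cells.index(end_label, start + 1)
    return cells[start:end]


def read_table_rows(pdf_path):
    """Yield every table row on every page as a list of cell strings."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    # Empty cells come back as None; normalise them to "".
                    yield [(cell or "").strip() for cell in row]


# Hypothetical flattened row, for illustration only.
row = ["name", "Jane Doe", "DETAILS", "Invoice 1042"]
section = slice_between(row, "name", "DETAILS")
final = {section[0]: section[1:]}
```

`list.index` raises `ValueError` when a label is missing, so real documents with varying layouts would need a fallback for that case.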
2. For Unstructured data
We faced many challenges with this format, including:
a. No dataset was available for classifying the data
b. No regularity in the data
c. Documents came in multiple formats
We overcame the lack of a dataset by annotating the data ourselves to classify custom named entities.
NER
Named entity recognition (NER) is a sub-task of information extraction (IE) that seeks out and categorises specified entities in a body or bodies of texts. NER is also simply known as entity identification, entity chunking and entity extraction.
The annotated data looked like this:
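For illustration, character-offset annotations for spaCy NER are commonly written as (text, {"entities": [(start, end, label)]}) pairs. The sentence and the labels below are hypothetical and are not the original dataset:

```python
# Hypothetical training example; the text and labels are illustrative only.
TRAIN_DATA = [
    (
        "John Smith is a Python developer in Pune",
        {"entities": [(0, 10, "NAME"), (16, 22, "SKILL"), (36, 40, "LOCATION")]},
    ),
]


def entity_texts(example):
    """Return (label, substring) pairs so the offsets can be checked by eye."""
    text, annotations = example
    return [(label, text[start:end]) for start, end, label in annotations["entities"]]
```

Checking the substrings this way is a cheap guard against off-by-one offsets, which are a frequent source of silently dropped entities.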
spaCy for NER
spaCy is an open-source library for advanced Natural Language Processing in Python. It is designed specifically for production use and helps build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. Some of the features provided by spaCy are tokenization, part-of-speech (PoS) tagging, text classification and named entity recognition.
spaCy provides an exceptionally efficient statistical system for NER in Python, which can assign labels to contiguous groups of tokens. It provides a default model that can recognize a wide range of named or numerical entities, including person, organization, language, event etc. Apart from these default entities, spaCy also lets us add arbitrary classes to the NER model by training it on new examples.
The steps involved in using spaCy are:
1. spaCy 3.0 no longer trains directly from the JSON format; it uses its binary .spacy format, so we had to convert the dataset to .spacy format, which can be done like this:
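A minimal sketch of the conversion using spaCy's `DocBin`, assuming the annotations are (text, {"entities": [(start, end, label)]}) pairs. The function names are hypothetical and the original conversion code may have differed:

```python
def non_overlapping(entities):
    """Keep entities in start order, dropping any that overlap an earlier one."""
    kept = []
    last_end = 0
    for start, end, label in sorted(entities):
        if start >= last_end:
            kept.append((start, end, label))
            last_end = end
    return kept


def to_spacy_file(examples, output_path, lang="en"):
    """Write (text, {"entities": [...]}) examples to a binary .spacy file."""
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in non_overlapping(annotations["entities"]):
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None means the offsets do not align with tokens
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)
```

Overlapping spans are filtered first because `doc.ents` does not accept them.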
2. For training we need a configuration file, which can be generated with the command-line utility provided by spaCy. To make a configuration, we start from the base configuration and override it.
To do this we use the following command:
The steps are:
INSTRUCTIONS: WIDGET
- Select your requirements and settings.
- Use the buttons at the bottom to save the result to your clipboard or a file base_config.cfg.
- Run init fill-config to create a full config.
- Run train with your config and data.
INSTRUCTIONS: CLI
- Run the init config command and specify your requirements and settings as CLI arguments.
- Run train with the exported config and data.
Here we make changes such as how many iterations we want when training the model, and we can specify the batch size.
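The two CLI steps can be sketched from Python as below. The file names (base_config.cfg, config.cfg, train.spacy, dev.spacy) and the output directory are assumptions, and `--training.max_epochs` is shown as one example of overriding a config value on the command line:

```python
import subprocess
import sys


def fill_config_command(base_config, full_config):
    """Command that expands a partial base config into a full config."""
    return [sys.executable, "-m", "spacy", "init", "fill-config", base_config, full_config]


def train_command(config, output_dir, train_path, dev_path, max_epochs=None):
    """Command that trains a pipeline, optionally overriding training.max_epochs."""
    command = [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
    ]
    if max_epochs is not None:
        command += ["--training.max_epochs", str(max_epochs)]
    return command


def run(command):
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)
```

For example, `run(fill_config_command("base_config.cfg", "config.cfg"))` followed by `run(train_command("config.cfg", "output", "train.spacy", "dev.spacy", max_epochs=30))` would generate the full config and start training.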
After defining the pipeline components, we came across scenarios such as:
- Train a new component from scratch on your data.
- Update an existing trained component with more examples.
- Include an existing trained component without updating it.
- Include a non-trainable component, like a rule-based EntityRuler or Sentencizer, or a fully custom component.
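As an example of the last scenario, a rule-based EntityRuler can be added to a blank pipeline as sketched below. The helper names and the idea of matching phrases case-insensitively are assumptions made for illustration:

```python
def phrase_patterns(label, phrases):
    """Build case-insensitive token patterns for the EntityRuler."""
    return [
        {"label": label, "pattern": [{"LOWER": word.lower()} for word in phrase.split()]}
        for phrase in phrases
    ]


def blank_pipeline_with_rules(patterns, lang="en"):
    """Create a blank pipeline whose only component is a rule-based EntityRuler."""
    import spacy

    nlp = spacy.blank(lang)
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    return nlp
```

Splitting phrases on whitespace is a simplification: phrases containing punctuation may be tokenized differently by spaCy and would need hand-written token patterns.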
Training with custom code
The spacy train recipe lets you specify an optional argument --code that points to a Python file. The file is imported before training and allows you to add custom functions and architectures to the function registry that can then be referenced from your config.cfg. This lets you train spaCy pipelines with custom components, without having to re-implement the whole training workflow. When you package your trained pipeline later using spacy package, you can provide one or more Python files to be included in the package and imported in its __init__.py. This means that any custom architectures, functions or components will be shipped with your pipeline and registered when it’s loaded.
The snippet is:
The next step is to load the model and get the desired results.
We create some variables to store information such as the name,
then call the model to get the desired results.
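A minimal sketch of this step, assuming the trained pipeline was saved to output/model-best (the directory name depends on the output path passed to training). The function names are hypothetical:

```python
from collections import defaultdict


def group_entities(entities):
    """Group (label, text) pairs into {label: [text, ...]}, preserving order."""
    grouped = defaultdict(list)
    for label, text in entities:
        grouped[label].append(text)
    return dict(grouped)


def extract_entities(text, model_path="output/model-best"):
    """Load a trained pipeline from disk and return its entities grouped by label."""
    import spacy

    nlp = spacy.load(model_path)
    doc = nlp(text)
    return group_entities((ent.label_, ent.text) for ent in doc.ents)
```

In a web application the pipeline would normally be loaded once at start-up rather than on every call, since loading from disk is comparatively slow.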
The results looked like this:
The Final part
Django is used to build an upload interface; the system described above then performs the information extraction and returns the data in a table.
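A rough sketch of such a view is shown below. The form field name "document", the template names upload.html and results.html, and the `extract` callable standing in for the extraction pipeline are all assumptions:

```python
def table_rows(extracted):
    """Turn {"NAME": ["John Smith"], ...} into (field, value) rows for a template."""
    return [(field, ", ".join(values)) for field, values in extracted.items()]


def upload_view(request, extract=None):
    """Handle the upload form: on POST, run extraction and render the results table."""
    # Imported here so table_rows can be used without Django being configured.
    from django.shortcuts import render

    if request.method == "POST" and "document" in request.FILES:
        uploaded = request.FILES["document"]
        # `extract` stands for the extraction pipeline described in this post.
        extracted = extract(uploaded) if extract else {}
        return render(request, "results.html", {"rows": table_rows(extracted)})
    return render(request, "upload.html")
```

The results template can then loop over `rows` to fill a two-column table of field names and extracted values.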
After clicking on view, we get information like this:
Thank you