Structure to Knowledge: Information Extraction from Client Onboarding Documents

Sarang Shrivastava
4 min read · May 12, 2023


Image taken from Author’s Slide

This post summarizes a key aspect of my talk at GIDS 2020 on information extraction from financial documents.

The Intricacies of Financial Documents

The client onboarding process in the financial sector often involves dealing with documents dense with legal entities and their respective roles. These documents, frequently extensive in length, encompass both structured and unstructured data. Essential questions arise from these documents, such as the nature of the customer document, identification of individuals and organizations mentioned, their connections to the principals being onboarded, their roles, addresses, and more. Further, it is necessary to determine whether the counterparty possesses the ability to purchase specific products, whether the beneficial owner has authorized the counterparty to act on their behalf, and whether the document is duly executed with an effective date. Understanding the semantics of these documents is pivotal in deriving such key insights.

Addressing these questions presents a plethora of technical challenges that are diverse, immense, and stimulating. Some can be met with existing market solutions, while others necessitate the adaptation of existing technology, such as retraining models using proprietary datasets or developing new iterations of these techniques. Furthermore, some challenges mandate novel research, and others require the development of entirely new methodologies. Let's see how we went about solving them.

Identifying Important Named Elements

Image taken from Author’s Slide

Named Entity Recognition (NER) is a crucial task that varies in complexity, from identifying short spans such as persons or organizations to discerning lengthy addresses. The primary interest lies in understanding the different entities mentioned in these documents, along with their registered and mailing addresses. While open-source tools and libraries can identify organizations and persons out of the box, they lack specialized training in the financial domain. Even so, these models remain useful: they have been trained on extensive datasets and have learned rich latent representations for entities. Using transfer learning, a proprietary dataset can be built to fine-tune existing models such as spaCy, which employs a CNN-based neural architecture. The spaCy 3.0 release, with its support for fine-tuning transformer-based architectures, is also noteworthy.
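As a minimal sketch of the data-preparation step, the snippet below converts proprietary annotations (entity surface forms plus labels) into the character-offset format that spaCy's training pipeline expects. The sentence and labels are illustrative, not real client data.

```python
def to_spacy_example(text, entities):
    """Convert (surface_form, label) annotations into spaCy's
    character-offset training format: (text, {"entities": [...]}).
    Assumes each surface form occurs once in the text."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            continue  # skip annotations that don't align with the text
        spans.append((start, start + len(surface), label))
    return text, {"entities": spans}

text = "Acme Capital LLC appoints Jane Doe as authorized signatory."
example = to_spacy_example(
    text, [("Acme Capital LLC", "ORG"), ("Jane Doe", "PERSON")]
)
```

Examples in this format can then be fed to `spacy train` (or an `nlp.update` loop) to fine-tune the pretrained NER component on domain-specific data.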

Image taken from Author’s Slide

Extracting Complicated Addresses

Address extraction is a task of higher complexity, warranting a BERT-base backbone with a token classification head. Address representations vary across countries, with additional complications such as prefix lines like C/O Org, or addresses in island regions where standard concepts of cities and states may not exist. Tools such as libpostal can be used to parse these address spans, or a gazetteer may be employed.
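A token-classification head needs per-token BIO labels as supervision. The sketch below derives such labels from a labeled address span; it uses naive whitespace tokenization for clarity, whereas a real pipeline would use the model's own subword tokenizer and propagate labels to subwords. The example address is made up.

```python
def bio_labels(tokens, span_tokens, tag="ADDR"):
    """Label the first occurrence of span_tokens inside tokens
    with B-/I- tags; everything else gets O."""
    labels = ["O"] * len(tokens)
    n = len(span_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == span_tokens:
            labels[i] = f"B-{tag}"
            for j in range(i + 1, i + n):
                labels[j] = f"I-{tag}"
            break
    return labels

sentence = "Registered office : C/O Acme Corp , 1 Harbour Road , George Town".split()
address = "C/O Acme Corp , 1 Harbour Road , George Town".split()
labels = bio_labels(sentence, address)
```

Note how the C/O prefix line is simply part of the address span; the token classifier learns to include such prefixes rather than relying on a fixed city/state template.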

Image taken from Author’s Slide

Identifying Relations between Elements

Relation extraction tasks, such as identifying an organization’s address or aliases, are another pivotal component in understanding these documents. The relationships may be represented in both structured and unstructured formats, and the complexity escalates quickly due to input variability. The number of relations in a sentence can easily change with the addition of a single word, as seen in the examples:

A, B acts as C, D to E and F
A, B acts as C, D to E and F respectively

A ternary relation extraction dataset can be curated from these documents, focusing on organizations, person names, and roles. A BERT-based backbone with a relation head could be employed to tackle this problem. One significant aspect of this task is deciding which specific token embeddings should be forwarded to the relation head. Various techniques can be used, including max pooling or averaging the token embeddings representing each entity before feeding them to the relation layer. Making the network aware of specific tokens can be achieved by placing entity markers around the organizations and roles, which improves results.
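The entity-marker idea can be sketched as a simple preprocessing step: wrap each entity span in marker tokens before the sequence is passed to the BERT-style encoder and relation head. The marker scheme below (`[ORG]`, `[ROLE]`) is illustrative; published variants use tokens such as `[E1]…[/E1]`, added to the tokenizer's vocabulary as special tokens.

```python
def add_entity_markers(tokens, spans):
    """Wrap entity spans with marker tokens.
    spans: list of (start, end, marker), end exclusive,
    assumed non-overlapping and sorted by start."""
    out, cursor = [], 0
    for start, end, marker in spans:
        out.extend(tokens[cursor:start])   # text before the entity
        out.append(f"[{marker}]")          # opening marker
        out.extend(tokens[start:end])      # the entity itself
        out.append(f"[/{marker}]")         # closing marker
        cursor = end
    out.extend(tokens[cursor:])            # trailing text
    return out

tokens = "Acme Corp acts as custodian to Beta Fund".split()
marked = add_entity_markers(
    tokens, [(0, 2, "ORG"), (4, 5, "ROLE"), (6, 8, "ORG")]
)
```

At classification time, the embeddings of the opening marker tokens (or a pooled span representation) are concatenated and fed to the relation head, which helps the model attend to the entity pair rather than the whole sentence.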
