HITEC and HICAT: hierarchic categorization tool and assistance tool

Introduction

Most of you have experienced in some way the truth of John Naisbitt's famous saying: "We are drowning in information but starving for knowledge." This applies particularly to a field as dependent on text processing as the work of patent offices, and especially to patent classification. The number of patent applications has been increasing every year, demanding an ever higher level of computer support for categorization-related tasks.

Below we present the HITEC/HICAT tools, a high-performance text categorization system that can efficiently support various patent classification tasks.

HITEC: HIerarchical TExt Categorizer

The software consists of a categorization engine and an interface. The engine is called HITEC, an acronym for HIerarchical TExt Categorizer, and was developed by Textminer Ltd. HITEC is based on a universal feature extractor (UFEX) algorithm that can classify arbitrary types of entities represented numerically as vectors (for example, images or audio/video files), and hence has a much wider range of applications.

HITEC is a general text classification tool that has been applied successfully to different problems and has achieved good results at prestigious data mining competitions. Its application received awards for precision and creativity in the query categorization contest of the ACM KDD Cup 2005.

HICAT: HIerarchical Categorization Assistance Tool

The presentation tool is called HICAT, which stands for HIerarchical Categorization Assistance Tool, and was developed by Arcanum Development Ltd. Its appearance can be seen in the screenshots in the presentation.

The input of the software is either plain text or text together with bibliographic data.

The output of the software is a weighted list of guesses presented hierarchically. The weights represent meaningful confidence levels, as shown in the reliability analysis below.

 

 


Features of HITEC

There are also numerous natural language processing facilities:

As noted earlier, the application of HITEC won awards at the KDD Cup 2005.


 

Use of a categorization assistance tool

We envisage the following scenarios for the use of categorization tools at patent offices and by users of patent information:

  1. autocategorization: we accept the first guess of the categorizer as the primary symbol
  2. precategorization: we get the first category (let’s say, the highest confidence class) returned by the categorizer and redirect the patent to an expert of the given class
  3. flat assisted categorization: the categorizer presents to the user three categories (for example, maingroups) and the user selects the most relevant one
  4. hierarchic assisted categorization: the categorizer presents to the user three sections; then the user selects the one he/she finds the most relevant, the software then presents the classes found within the selected section and the user can select the most relevant class. This process is iterated going downward in the hierarchy.

Autocategorization is fully automatic, while hierarchic assisted categorization is closest to the current hierarchic browsing of the IPC, with the addition of a confidence ranking based on the query.
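The scenarios above can be illustrated with a minimal Python sketch. It is not HITEC's interface: the `GUESSES` dictionary, its confidence values, and the function names are all hypothetical, and the hierarchy level is inferred simply from the length of the IPC symbol (1 character for a section, 3 for a class, 4 for a subclass).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical weighted guesses (IPC symbol -> confidence); values are invented.
GUESSES: Dict[str, float] = {
    "A": 0.10, "B": 0.55, "C": 0.35,
    "A61": 0.10, "B01": 0.40, "B65": 0.15, "C07": 0.25, "C08": 0.10,
    "B01D": 0.30, "B01J": 0.10,
}

def top_guesses(guesses: Dict[str, float], length: int,
                parent: str = "", k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most confident symbols of one level below `parent`."""
    candidates = [(symbol, weight) for symbol, weight in guesses.items()
                  if len(symbol) == length and symbol.startswith(parent)]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]

def autocategorize(guesses: Dict[str, float], length: int) -> str:
    """Scenarios 1 and 2: take the single most confident symbol of a level."""
    return top_guesses(guesses, length, k=1)[0][0]

def hierarchic_assist(guesses: Dict[str, float],
                      choose: Callable[[List[Tuple[str, float]]], str]) -> str:
    """Scenario 4: offer three options per level, descending from the choice."""
    parent = ""
    for length in (1, 3, 4):  # section, class, subclass
        options = top_guesses(guesses, length, parent)
        if not options:
            break
        parent = choose(options)  # the user's selection at this level
    return parent

# A stand-in "user" that always picks the first option offered.
first_option = lambda options: options[0][0]
print(autocategorize(GUESSES, 3))               # most confident class
print(top_guesses(GUESSES, 3))                  # scenario 3: flat list of three
print(hierarchic_assist(GUESSES, first_option))
```

In precategorization the same top-ranked symbol would be used only to route the document to an expert of that class, rather than being accepted as the final primary symbol.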

 

Precision of HITEC


According to the scenarios above, we have evaluated the precision of the system after training with 600,000 documents downloaded from Espace Access discs.

The following table contains the precision measured on 100% of the test documents, that is, without filtering out any documents by confidence level:

 

                      Section   Class   Subclass   Maingroup
Autocategorization    80 %      72 %    64 %       48 %
Precategorization     84 %      77 %    70 %       54 %
Flat assisted         97 %      91 %    84 %       70 %
Hierarchic assisted   -         93 %    92 %       80 %

 

Evaluating the precision of assisted categorization is not straightforward; we checked whether the top three categories returned by the categorizer included the primary category.

The most important numbers from this table are:

  1. in the precategorization scenario
     · in 77% of the cases the class returned with the highest confidence was one of the classes of the test document (in 72% of the cases it was the primary class)
     · in 54% of the cases the first maingroup was one of the maingroups associated with the patent or application of the test document
  2. in hierarchic assisted categorization, after selection of the right section,
     · in 93% of the cases the top three classes returned by the system included the primary class
     · after selecting the right class and subclass, in 80% of the cases the top three maingroups returned by the system included the primary maingroup.
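The three ways of counting a hit described above can be sketched as follows. This is an illustrative reconstruction rather than the evaluation code actually used: the function name `precision_top_k` and the four sample documents are invented, and each document's first assigned symbol is assumed to be its primary one.

```python
from typing import List, Sequence

def precision_top_k(ranked: Sequence[Sequence[str]],
                    targets: Sequence[Sequence[str]], k: int) -> float:
    """Share of documents whose top-k guesses contain at least one target."""
    hits = sum(1 for guesses, wanted in zip(ranked, targets)
               if set(guesses[:k]) & set(wanted))
    return hits / len(ranked)

# Invented data: ranked class guesses per document, and the classes assigned
# by human classifiers (primary class first).
ranked: List[List[str]] = [
    ["B01", "C07", "A61"],
    ["C07", "B01", "C08"],
    ["A61", "B65", "C07"],
    ["B65", "A61", "B01"],
]
assigned: List[List[str]] = [["B01"], ["B01", "C07"], ["C07"], ["G06"]]
primary = [[symbols[0]] for symbols in assigned]

auto = precision_top_k(ranked, primary, 1)     # first guess is the primary class
precat = precision_top_k(ranked, assigned, 1)  # first guess is any assigned class
flat = precision_top_k(ranked, primary, 3)     # primary class is among the top three
print(auto, precat, flat)
```

On this toy data the three measures come out as 0.25, 0.5 and 0.75, which mirrors the ordering of the rows in the table: the looser the criterion, the higher the measured precision.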


Analysis of reliability

The precision as a function of the ratio of retrieved documents, selected by confidence level, is presented below for two scenarios.

 

 

The dots in the graphs correspond to the confidence levels returned for the first category. The lower the confidence level, the more documents are returned.

In the precategorization task, when selecting a confidence level that filters out 20% of the documents, the precision of the algorithm on the remaining 80% is 82%.

In the assisted categorization task, even at the lowest confidence levels, the precision at the class level is above 90%, and it exceeds 95% when 20% of the documents are filtered out based on the confidence level.

It can be observed that the categorizer achieves higher precision when only documents with higher confidence levels are selected.

As a conclusion, we can state that one can rely on the quality of the returned confidence level.
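The trade-off between the share of documents retained and the precision on them can be sketched as below. The function `precision_at_coverage` and the ten sample results are hypothetical; they only illustrate the kind of calculation behind such a curve, not the measurements reported here.

```python
from typing import List, Tuple

def precision_at_coverage(results: List[Tuple[float, bool]],
                          keep_fraction: float) -> float:
    """Precision on the most confident `keep_fraction` of the documents.

    `results` holds one (confidence, first_guess_was_correct) pair per document.
    """
    ranked = sorted(results, key=lambda item: item[0], reverse=True)
    kept = ranked[:max(1, round(len(ranked) * keep_fraction))]
    return sum(1 for _, correct in kept if correct) / len(kept)

# Invented results for ten documents.
results = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, True), (0.75, False),
    (0.70, True), (0.65, True), (0.60, True), (0.30, False), (0.20, False),
]

print(precision_at_coverage(results, 1.0))  # all documents kept
print(precision_at_coverage(results, 0.8))  # least confident 20% filtered out
```

In this toy example, dropping the two least confident documents raises the precision from 0.7 to 0.875, which is the behaviour one expects when the confidence levels are informative.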

 


Comparison to human classifiers

We have also analyzed the mistakes of the categorizer and compared them with the decisions of human classifiers.

 

We counted the test documents that carry, besides the section of the primary symbol (for example, section B), a secondary symbol from another section (for example, section C). We found that about every 10th document whose primary classification symbol is in section B is also classified in one of the groups of section C.

We also computed the ratio of the categorizer's mistakes in favor of each section when the section of the primary symbol was different. It turned out that about every 12th document that human classifiers assigned to section B was mistakenly assigned to section C by the algorithm.

From these statistics we have drawn a map, the so-called ambiguity and confusion map.

To our satisfaction, the co-classifications made by humans and the mistakes of the categorizer turned out to be correlated.
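The two statistics behind such a map can be sketched in a few lines of Python. The function names and the six sample documents are invented for illustration; the original computation may differ in detail.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

def ambiguity_map(documents: List[Tuple[str, Set[str]]]) -> Dict[Pair, float]:
    """Share of documents with primary section p that humans also put in section s."""
    primary_counts = Counter(primary for primary, _ in documents)
    pair_counts = Counter(
        (primary, other)
        for primary, secondary in documents
        for other in secondary
        if other != primary
    )
    return {pair: n / primary_counts[pair[0]] for pair, n in pair_counts.items()}

def confusion_map(predictions: List[Tuple[str, str]]) -> Dict[Pair, float]:
    """Share of documents with primary section p that the categorizer put in q != p."""
    primary_counts = Counter(primary for primary, _ in predictions)
    pair_counts = Counter(pair for pair in predictions if pair[0] != pair[1])
    return {pair: n / primary_counts[pair[0]] for pair, n in pair_counts.items()}

# Invented data: (primary section, secondary sections assigned by humans)
human = [("B", {"C"}), ("B", set()), ("B", set()), ("B", {"C"}),
         ("C", {"B"}), ("C", set())]
# Invented data: (primary section, section guessed first by the categorizer)
machine = [("B", "B"), ("B", "C"), ("B", "B"), ("B", "B"),
           ("C", "C"), ("C", "B")]

ambiguity = ambiguity_map(human)
confusion = confusion_map(machine)
for pair in sorted(ambiguity):
    print(pair, ambiguity[pair], confusion.get(pair, 0.0))
```

Comparing the two values for each pair of sections, as the final loop does, is the basis for judging whether the categorizer's mistakes follow the same pattern as human co-classification.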

 


How to access the services?

HITEC can be accessed in two ways:

  1. as a precategorization tool, through a high-throughput interface over the HTTP protocol
  2. as a categorization assistance tool, through HICAT, which is a web interface

HICAT is a clone of the reformed IPC publication. HICAT is therefore itself a proof of concept that the XML files created from the reformed IPC can be used in application integration.

Get access to HICAT


To request access to the categorizer, please contact attila@arcanum.com.

Articles are also available at http://www.textminer.hu.