Most of you have experienced in some way the famous saying of John Naisbitt: "We are drowning in information but starving for knowledge." This applies particularly to text-intensive fields such as the work of patent offices, and especially to patent classification. The number of patent applications has been increasing every year, demanding an ever higher level of computer support for categorization-related tasks.
Below we present the HITEC/HICAT tools, a high-performance text categorization system that can efficiently support various patent classification tasks.
The software comprises a categorization engine and an interface. The engine is called HITEC, an acronym for HIerarchical TExt Categorizer, and was developed by Textminer Ltd. HITEC is based on a universal feature extractor (UFEX) algorithm that can classify arbitrary types of entities represented numerically by vectors (for example images or audio/video files), and hence it has much wider application areas than text alone.
HITEC is a general text classification tool that has been applied successfully to different problems and has achieved good results at prestigious data mining competitions. An application of HITEC received awards for precision and creativity in the query categorization contest of the ACM KDD Cup 2005.
The presentation tool is referred to as HICAT, which stands for HIerarchical Categorization Assistance Tool, and was developed by Arcanum Development Ltd. Its appearance can be seen in the screenshots in the presentation.
The input of the software is either simple text or text and bibliographic data.
The output of the software is a weighted list of guesses presented hierarchically. The weights represent meaningful confidence levels, as will be shown later.
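The exact output format of HITEC is not specified here. As a rough illustration only, a weighted, hierarchical list of guesses could be modelled as a small tree. The names `Guess`, `ranked` and `best_path`, the example symbols, and the 0 to 1 confidence scale are assumptions made for this sketch, not part of the tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Guess:
    """One guessed category at some level of the hierarchy (hypothetical type)."""
    symbol: str        # e.g. "B", "B01", "B01D"
    confidence: float  # assumed to lie in [0, 1]
    children: List["Guess"] = field(default_factory=list)


def ranked(guesses: List[Guess]) -> List[Guess]:
    """Return the guesses of one level, highest confidence first."""
    return sorted(guesses, key=lambda g: g.confidence, reverse=True)


def best_path(guesses: List[Guess]) -> List[Tuple[str, float]]:
    """Follow the highest-confidence guess at each level down the hierarchy."""
    path = []
    level = guesses
    while level:
        top = ranked(level)[0]
        path.append((top.symbol, top.confidence))
        level = top.children
    return path


# Toy example with made-up weights.
example = [
    Guess("C", 0.2),
    Guess("B", 0.7, [
        Guess("B65", 0.2),
        Guess("B01", 0.6, [Guess("B01D", 0.5)]),
    ]),
]
print(best_path(example))
```

Following only the best guess at each level corresponds to fully automatic use; an assisted interface would instead show the whole ranked list at each level.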
There are also numerous natural language processing facilities:
As noted earlier, we obtained awards for an application of HITEC at the KDD Cup 2005.
We envisage the following scenarios for the use of categorization tools at patent offices and among patent information users:
Autocategorization is fully automatic, while hierarchic assisted categorization is closest to the current hierarchic browsing of the IPC, with the addition of a confidence ranking based on the query.
According to the scenarios above, we have evaluated the precision of the system after training with 600,000 documents downloaded from Espace Access discs.
The following table contains the precision at 100% of the test documents:
| Scenario | Section | Class | Subclass | Maingroup |
|---|---|---|---|---|
| Autocategorization | 80 % | 72 % | 64 % | 48 % |
| Precategorization | 84 % | 77 % | 70 % | 54 % |
| Flat assisted | 97 % | 91 % | 84 % | 70 % |
| Hierarchic assisted | | 93 % | 92 % | 80 % |
Measuring the precision of assisted categorization is not straightforward; we checked whether the top three categories returned by the categorizer include the primary category.
The most important numbers from this table are:
1. in the precategorization scenario
· in 77% of the cases the class returned with the highest confidence was one of the classes of the test document (in 72% of the cases it was the primary class)
· in 54% of the cases the first maingroup was one of the maingroups associated with the patent or application of the test document
2. in hierarchic assisted categorization, after selection of the right section,
· in 93% of the cases the top three classes returned by the system included the primary class
· after selecting the right class and subclass, in 80% of the cases the top three maingroups returned by the system included the primary maingroup.
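The two kinds of figures quoted above (first guess matching any assigned category, and primary category appearing among the top three guesses) can be computed along the following lines. This is a minimal sketch on made-up data; the function names and the input layout (one ranked list of symbols per test document) are assumptions, not the evaluation code used for the table.

```python
def top_k_precision(predictions, primary, k=3):
    """Fraction of documents whose primary category is among the top k guesses.

    predictions: one ranked list of category symbols per document
    primary:     the primary category symbol of each document
    """
    hits = sum(1 for guesses, truth in zip(predictions, primary)
               if truth in guesses[:k])
    return hits / len(primary)


def first_guess_in_any(predictions, assigned):
    """Fraction of documents whose first guess is one of the assigned categories.

    assigned: one set of category symbols per document (primary and secondary)
    """
    hits = sum(1 for guesses, truths in zip(predictions, assigned)
               if guesses and guesses[0] in truths)
    return hits / len(assigned)


# Toy data: four test documents with class-level guesses.
predictions = [
    ["B01", "C07", "A61"],
    ["C07", "B01", "A61"],
    ["A61", "H04", "G06"],
    ["G06", "H04", "B01"],
]
primary = ["B01", "B01", "G06", "A61"]
assigned = [{"B01"}, {"B01", "C07"}, {"G06"}, {"A61", "G06"}]

print(top_k_precision(predictions, primary, k=1))  # primary class as first guess
print(top_k_precision(predictions, primary, k=3))  # primary class in top three
print(first_guess_in_any(predictions, assigned))   # first guess in any assigned class
```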
The precision as a function of the ratio of retrieved documents, selected by confidence level, is presented below for two scenarios.
Each dot in the graphs corresponds to a confidence level returned for the first category. The lower the confidence level, the more documents are returned.

In the precategorization task, when a confidence level is selected that filters out 20% of the documents, the precision of the algorithm on the remaining 80% is 82%.

In the assisted categorization task, even at the lowest confidence levels, the precision at the class level is above 90%, and it exceeds 95% when 20% of the documents are filtered out based on the confidence level.
It can be observed that when documents with higher confidence levels are selected, the categorizer achieves higher precision.
As a conclusion, we can state that one can rely on the quality of the returned confidence level.
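The relationship between confidence threshold, share of retained documents and precision can be reproduced with a short calculation. The sketch below assumes one confidence value and one correct/incorrect flag per test document; the function name and the toy numbers are illustrative only and are not taken from the experiments.

```python
def precision_at_coverage(confidences, correct, coverage):
    """Precision on the most confident `coverage` share of the documents.

    confidences: confidence of the first guess for each document
    correct:     True if that first guess was right
    coverage:    share of documents to keep, e.g. 0.8 filters out 20%
    """
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    keep = max(1, round(coverage * len(ranked)))  # keep at least one document
    kept = ranked[:keep]
    return sum(1 for _, ok in kept if ok) / len(kept)


# Toy data: ten documents, errors concentrated at low confidence.
confidences = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
correct = [True, True, True, True, True, True, False, True, False, False]

for coverage in (1.0, 0.8, 0.5):
    print(coverage, precision_at_coverage(confidences, correct, coverage))
```

If the confidence levels are meaningful, precision should not decrease as coverage is reduced, which is the behaviour described above.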
We have also analyzed the mistakes of the categorizer by comparing them with the decisions of human classifiers.
We counted the test documents in which, besides the section of the primary symbol (for example, section B), another section appears in a secondary symbol (for example, section C). We found that about every 10th document whose primary classification symbol is in section B is also classified in one of the groups of section C.
We also computed the ratio of mistakes made in favor of a section when the section of the primary symbol was different. It turned out that about every 12th document categorized by human classifiers in section B was mistakenly categorized in section C by the algorithm.
From the above statistics we have drawn a so-called ambiguity and confusion map.
To our satisfaction, it turned out that the co-classifications made by humans and the mistakes of the categorizer are correlated.
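As an illustration of how such a comparison might be computed, the sketch below builds an ambiguity map (how often human classifiers add a secondary symbol from another section) and a confusion map (how often the categorizer picks another section by mistake), then measures their Pearson correlation. The function names, the input layout and the choice of Pearson correlation are assumptions made for this example; the text does not state which correlation measure was used.

```python
from collections import Counter
from math import sqrt


def ambiguity_map(docs):
    """docs: (primary_section, set_of_secondary_sections) per document.
    Returns {(primary, other): share of `primary` documents also placed in `other`}."""
    totals, pairs = Counter(), Counter()
    for primary, secondary in docs:
        totals[primary] += 1
        for other in secondary:
            if other != primary:
                pairs[(primary, other)] += 1
    return {pair: n / totals[pair[0]] for pair, n in pairs.items()}


def confusion_map(results):
    """results: (primary_section, predicted_section) per document.
    Returns {(primary, predicted): share of `primary` documents mistaken for `predicted`}."""
    totals, pairs = Counter(), Counter()
    for primary, predicted in results:
        totals[primary] += 1
        if predicted != primary:
            pairs[(primary, predicted)] += 1
    return {pair: n / totals[pair[0]] for pair, n in pairs.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def map_correlation(ambiguity, confusion):
    """Correlate the two maps over all section pairs seen in either of them."""
    pairs = sorted(set(ambiguity) | set(confusion))
    return pearson([ambiguity.get(p, 0.0) for p in pairs],
                   [confusion.get(p, 0.0) for p in pairs])


# Toy data mirroring the "every 10th" and "every 12th" examples in the text.
docs = [("B", {"C"})] + [("B", set())] * 9
results = [("B", "C")] + [("B", "B")] * 11
print(ambiguity_map(docs))      # share of B documents co-classified in C
print(confusion_map(results))   # share of B documents mistaken for C
```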
HITEC can be accessed in two ways:
HICAT is a clone of the reformed IPC publication. HICAT itself is therefore a proof of concept that the XML files created from the reformed IPC can be used in application integration.
For access to the categorizer, please contact attila@arcanum.com.
Related articles are also available at http://www.textminer.hu.