EMC Developer Network

Content Intelligence Services

A Developer's Guide

Recently, we have been adding more and more CIS-related content. We realize that many of you may not know what CIS is or why it might interest a developer or system architect. We suggest that you read this guide and any other available information, then contact your Account Manager to see CIS in action. The ROI for a system like CIS can be substantial.

Got some ideas for CIS-related content? Need clarification of some of these concepts? As usual, you can e-mail us with any questions.

CIS Overview

As you know, one of the key challenges in Document Management is adding structure to what is basically unstructured content. CIS automates this task by intelligently mining the pertinent information from a document, then tagging and classifying it. CIS is one of those magical pieces of software that seems to deliver something for nothing. You have to see it working before you will believe that it really does deliver all that it promises.

Our definition of CIS is...

"CIS automates the organization and tagging of enterprise content based on powerful information extraction, conceptual classification, business analysis, and taxonomy and metadata schema management capabilities."

What does this really mean? It means that CIS can look at the content of your documents and automatically fill in the attributes for you. It will also automatically select a folder in which to store each document. It does all of this using rules and associations specified by you or your industry.

How can it possibly do this? CIS works out attribute values and also gives each document a single concept which describes the document as a whole. The selection of attribute values and concepts is determined by a Domain Map customized by you to reflect your company's particular terminology.

Domain Map Overview

Domain Maps are also called 'Taxonomy Schemas', 'Vocabulary Management' or 'Metadata Management'. This document covers only the basics of Domain Maps. We will start by looking at a Taxonomy, because a Domain Map can be viewed as a superset of a Taxonomy.

Taxonomies

A Taxonomy is a collection of concepts. These concepts can be represented in an abstract, structured form. We refer to the nodes in this structure as TaxonomyNodes. The structure of the TaxonomyNodes can be used for logical operations, most commonly:

  • to categorize documents,
  • to build a Docbase folder hierarchy,
  • to aid navigation (such as a Yahoo-type navigation paradigm).

Within a taxonomy all concepts are equal; the TaxonomyNode hierarchy is used simply for convenience. Some of the code samples in this series will illustrate why this is useful.

Here is a sample Taxonomy shown as a hierarchy.

Tissue and Cell Preparation
  Freeze Etching
  Freeze Substitution
  Sectioning
  Staining
    Horseradish Peroxidase
    Immunocytochemistry
      Ferritin Labeling
      Immunoelectron Microscopy
      Immunofluorescence Technique
      Immunoperoxidase
    Silver Impregnation
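
To make the structure concrete, here is a minimal Python sketch of the sample taxonomy above. It is an illustration only: the TaxonomyNode class and the walk function are hypothetical names of our own and are not part of the CIS or BSS API. The sketch walks the hierarchy and turns each node into a path, which is the kind of result you would want when building a folder hierarchy or a navigation breadcrumb.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class TaxonomyNode:
    """Hypothetical stand-in for a TaxonomyNode: a named concept with children."""
    name: str
    children: List["TaxonomyNode"] = field(default_factory=list)


def walk(node: TaxonomyNode, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the full path to every node, parents before children (depth-first)."""
    here = path + (node.name,)
    yield here
    for child in node.children:
        yield from walk(child, here)


# The sample taxonomy from the text above.
taxonomy = TaxonomyNode("Tissue and Cell Preparation", [
    TaxonomyNode("Freeze Etching"),
    TaxonomyNode("Freeze Substitution"),
    TaxonomyNode("Sectioning"),
    TaxonomyNode("Staining", [
        TaxonomyNode("Horseradish Peroxidase"),
        TaxonomyNode("Immunocytochemistry", [
            TaxonomyNode("Ferritin Labeling"),
            TaxonomyNode("Immunoelectron Microscopy"),
            TaxonomyNode("Immunofluorescence Technique"),
            TaxonomyNode("Immunoperoxidase"),
        ]),
        TaxonomyNode("Silver Impregnation"),
    ]),
])

# Each path could serve as a folder path or a navigation breadcrumb.
folder_paths = ["/" + "/".join(p) for p in walk(taxonomy)]
```

Because every node is reached through the same walk, leaf and branch concepts are treated alike; the hierarchy only contributes the path.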

Domain Maps

A Domain Map represents these same concepts along with the keyword evidence for each concept. CIS analyzes the document and finds keywords; it then uses the mapping of keywords to concepts to decide which is most likely the key concept for the given document. As a developer you can retrieve all concepts and their weighting based on the evidence found in the document.

It is common for the structure of the concepts represented in the Domain Map to match the structure modeled in the Taxonomy as this eases administration.

Domain Maps actually consist of three things: Concepts, Concept Types and Keywords. The following sections will look in detail at these elements.

Many companies already have Domain Maps defined and may have spent a substantial amount of money on this process. Also, there are some industry-standard Domain Maps available which help standardize terminology across companies. You can leverage this existing work by importing these maps into CIS.

Taxonomies vs. Domain Maps

At this point you are probably wondering about the difference between Taxonomies and Domain Maps and when you would use one versus the other. This is a very complex, and quite academic, subject, so we will just give you some pointers here. If you want more in-depth information then please e-mail us and we will answer your specific questions.

We will start by clarifying a few points.

  • The TaxonomyNodes in a Taxonomy can be represented in a hierarchical form.
  • The concepts in a Domain Map can be represented in a hierarchical form.
  • The structures of these two hierarchies do not have to match (although we recommend that they do).

This raises the following questions:

  • Why wouldn't you always make the two structures match?

    1. Unless you use TEF (CIS's import mechanism), you must create your Domain Map and then add individual TaxonomyNodes linked to the concepts. This is not a scalable solution if you have a large number of nodes.
    2. If TEF is used, it can generate the TaxonomyNode objects in the form of a hierarchy, but the concepts would not (by default) have a hierarchy. This is done to avoid "evidence propagation" and to tag documents only to the leaf nodes. In this case the Domain Map is not a "master" but a "slave" structure and simply stores evidence information.

  • When would you traverse the Taxonomy rather than the Domain Map hierarchy?

    1. Traversing the Taxonomy is easier than traversing the Domain Map. If there is a Taxonomy, it should probably be traversed.
    2. When the Domain Map does not have a hierarchy.

  • When would you traverse the Domain Map rather than the Taxonomy hierarchy?

    1. When you need to gather the evidence along with the nodes.
    2. When the Taxonomy does not have a hierarchy.

  • Why can't CIS synchronize the structures?
    Version 5.0 of CIS will streamline this process; look out for more details coming soon.

Concepts

Concepts are the basic descriptors for a document; they have a single type and are defined by keywords. For example, the Concept 'IBM' may have a type of 'Company' and keywords including 'Big Blue', 'International Business Machines', etc. CIS looks at the keywords in a document and weights each concept based on the occurrence of keywords. It is quite possible for the concept's own name not to appear in the document. Imagine a document talking about banks, checks and overdrafts: the concept may be 'Financial Institution' even though this string is not found in the document.

CIS will also return a 'weighting' based on how certain it is that the document relates to the concept. As a developer you can retrieve the weighting for each concept as well as the evidence that CIS found, i.e. the keywords that made CIS believe that this concept was relevant. You can also navigate the Taxonomy's hierarchy in order to find the peer/child relationships between concepts - you'll see why this is useful when you start to delve into the code samples.
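
The idea of weighting concepts by keyword evidence can be sketched in a few lines of Python. This is a toy illustration, not how CIS computes its weightings: the DOMAIN_MAP dictionary, the score_concepts function and the scoring formula (each concept's share of all keyword hits) are assumptions of ours, and the 'Financial Institution' keywords are taken from the example above.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical Domain Map fragment: concept -> keywords that count as evidence.
DOMAIN_MAP: Dict[str, List[str]] = {
    "IBM": ["ibm", "big blue", "international business machines"],
    "Financial Institution": ["bank", "check", "overdraft"],
}


def score_concepts(text: str) -> List[Tuple[str, float, Dict[str, int]]]:
    """Return (concept, weight, evidence) tuples, strongest concept first.

    The weight is simply a concept's share of all keyword hits; this is an
    assumed formula that ignores keyword placement.
    """
    lowered = text.lower()
    hits: Dict[str, Dict[str, int]] = {}
    for concept, keywords in DOMAIN_MAP.items():
        evidence: Counter = Counter()
        for kw in keywords:
            # Whole-word match, allowing a plural "s" or "es".
            pattern = r"\b" + re.escape(kw) + r"(?:e?s)?\b"
            count = len(re.findall(pattern, lowered))
            if count:
                evidence[kw] = count
        if evidence:
            hits[concept] = dict(evidence)
    total = sum(sum(ev.values()) for ev in hits.values())
    scored = [(c, sum(ev.values()) / total, ev) for c, ev in hits.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


doc = "The bank returned two checks and charged an overdraft fee. Big Blue was not amused."
results = score_concepts(doc)
for concept, weight, evidence in results:
    print(concept, round(weight, 2), evidence)
```

Note that the string 'Financial Institution' never occurs in the sample text, yet that concept comes out on top because three of its keywords were found.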

Concept Types

Concept Types are used to categorize concepts for convenience. For example, you could list all 'Companies' by querying for all concepts with a type of 'Company'.
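
As a rough sketch of that query, assuming concepts are held as simple in-memory records (the Concept class, the sample data and the concepts_of_type function are hypothetical and not part of the BSS API):

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Concept:
    """Hypothetical concept record: a name plus its single Concept Type."""
    name: str
    concept_type: str


# Illustrative data only.
CONCEPTS = [
    Concept("IBM", "Company"),
    Concept("EMC", "Company"),
    Concept("Staining", "Technique"),
]


def concepts_of_type(concepts: List[Concept], concept_type: str) -> List[str]:
    """Return the names of all concepts with the given type."""
    return [c.name for c in concepts if c.concept_type == concept_type]


companies = concepts_of_type(CONCEPTS, "Company")
```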

Evidence

Evidence is represented by keywords related to each concept. Finding these keywords in the document suggests that the specific concept is relevant. (How relevant is determined by the frequency and placement of the keywords.)

Feature Sets

A Feature Set is a logical collection of keywords that clarifies the context in which those keywords count as valid evidence. For example, you could specify Branch + !Tree to match the keyword 'branch' only when it is not used near the keyword 'tree', describing, say, a Bank-Branch concept rather than a Tree-Branch concept.
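
The Branch + !Tree rule could be sketched as follows. We do not know how CIS defines "near", so this example assumes a window of five tokens on either side; the feature_matches function and its parameters are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def feature_matches(text: str, required: str, excluded: str, window: int = 5) -> int:
    """Count occurrences of `required` with no `excluded` token within `window` tokens."""
    tokens = tokenize(text)
    count = 0
    for i, tok in enumerate(tokens):
        if tok != required:
            continue
        # Tokens up to `window` positions before and after the match.
        nearby = tokens[max(0, i - window): i + window + 1]
        if excluded not in nearby:
            count += 1
    return count


bank_doc = "Visit your local branch to open a checking account."
tree_doc = "The bird sat on a branch of the old oak tree."

bank_hits = feature_matches(bank_doc, "branch", "tree")
tree_hits = feature_matches(tree_doc, "branch", "tree")
```

Only the first document yields evidence for a Bank-Branch concept; in the second, 'tree' falls inside the window and the match is rejected.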

Keywords

CIS effectively parses the document looking for these keywords and then builds a 'picture' of the document based on the occurrence of the keywords. Keywords are the bottom of the Taxonomy food chain; everything else is derived from them.

A Keyword is just a string. (You may also hear keywords referred to as 'tokens'; if you are old enough to remember yacc and lex, you will recognize the term.)

Models

When you apply a Domain Map to your document you get an intersection; this resultant set of data is called a model. The model is the part of the Domain Map which relates to your document.

There are three types of model that CIS can create.

  1. Concept List
    This is a list of the concepts that CIS found when comparing the document to the Domain Map.

  2. Concept Header
    It is possible to have CIS suggest, or automatically add, concepts to the Domain Map based on their recurrence in documents. The concept header model would contain all of the concepts that CIS thinks should be added to the Domain Map.

  3. Text Model
    Using filters CIS can extract key parts of your documents and output them as XML content for later reuse.

Architecture

CIS is a server process and is usually installed on its own server rather than on the Docbase's server.

CIS has its own public, unsupported API (separate from the DFC), called the BSS API. BSS API calls can be made either synchronously or asynchronously. Typically, a command to automatically categorize a document would be made asynchronously: CIS queues the command and processes the request. If you wanted to query CIS for a list of possible values for a drop-down list, you would make a synchronous call.

When you install CIS you get an 'unsupported' directory which contains the SDK, including an API overview in PDF format and Javadocs (both also available below).
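
The difference between the two calling styles can be illustrated with a small Python sketch. None of the names below come from the BSS API; FakeClassifier is a hypothetical stand-in that only mimics the behavior described above, using a single-worker thread pool as the request queue.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


class FakeClassifier:
    """Hypothetical stand-in for a CIS server; not the BSS API."""

    def __init__(self) -> None:
        # A single worker processes queued requests one at a time, in order.
        self._queue = ThreadPoolExecutor(max_workers=1)

    def list_values(self, concept_type: str) -> List[str]:
        """Synchronous call: the caller blocks until the values are returned."""
        values = {"Company": ["EMC", "IBM"]}
        return values.get(concept_type, [])

    def categorize_async(self, document_id: str) -> Future:
        """Asynchronous call: the request is queued and a Future is returned at once."""
        return self._queue.submit(self._categorize, document_id)

    def _categorize(self, document_id: str) -> str:
        # Placeholder for the real categorization work.
        return f"{document_id}: categorized"

    def shutdown(self) -> None:
        """Wait for queued requests to finish, then release the worker."""
        self._queue.shutdown(wait=True)


server = FakeClassifier()
pending = server.categorize_async("doc-001")  # queued; the caller carries on
dropdown = server.list_values("Company")      # caller waits for the answer
outcome = pending.result(timeout=5)           # collect the queued result later
server.shutdown()
```

The synchronous style suits user-interface lookups where the caller needs the answer before it can continue, while queuing suits bulk categorization where the caller does not need to wait.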

Resources

Content Personalization Services Application Developer's Guide
The Content Personalization Services (CPS) Application Developer's Guide describes how to create applications that access the Documentum Content Personalization Services. You can access the functionality of CPS through the Basic System Services (BSS) API. The Content Personalization Services SDK provides tools and instructions for creating your own applications using the BSS API.

Content Personalization Services API Javadocs
CPS's Basic System Services (BSS) API packages provide a means of gaining access to the Business Intelligence Server's functionality. We are making the BSS Javadocs available to give you the opportunity to explore the depth of the CPS server's abilities.