Content Intelligence Services
A Developer's Guide
Recently, we have been adding more and more CIS-related content. We realize that many of
you do not know what CIS is or why it might be of interest to a developer or system architect.
We suggest that you read this guide and any other available information, and then contact your Account
Manager to see CIS in action. The ROI for a system like CIS can be substantial.
Got some ideas for CIS-related content? Need clarification of some of these concepts? As
usual, you can always e-mail us with any questions.
CIS Overview
As you know, one of the key challenges in Document Management is adding structure to
what is basically unstructured content. CIS automates this task by mining the pertinent
information from the document and then tagging and classifying the document. CIS is one of those magical
pieces of software that seems to deliver something for nothing; you have to see it working
before you will believe that it really does deliver all that it promises.
Our definition of CIS is...
"CIS automates the organization and tagging of enterprise content based on powerful
information extraction, conceptual classification, business analysis, and taxonomy
and metadata schema management capabilities."
What does this really mean? It means that CIS can look at the content of your documents
and automatically fill in the attributes for you. It will also automatically select a folder
in which to store each document. It does all of this using rules and associations specified by you or your
industry.
How can it possibly do this? CIS works out attribute values and also gives each
document a single concept which describes the document as a whole. The selection of
attribute values and concepts is determined by a Domain Map customized by
you to reflect your company's particular terminology.
Domain Map Overview
Domain Maps are also called 'Taxonomy Schemas', 'Vocabulary Management' or 'Metadata Management'. In this document
we will only cover the basics of Domain Maps. We will start by looking at a Taxonomy, because
a Domain Map can be viewed as a superset of a Taxonomy.
Taxonomies
A Taxonomy is a collection of concepts. These concepts can be represented in an abstract, structured
form. We refer to the nodes in this structure as TaxonomyNodes. The structure of the TaxonomyNodes can
be used for logical operations, most commonly:
- to categorize documents,
- to build a Docbase folder hierarchy,
- to aid navigation (such as a Yahoo-type navigation paradigm).
Within a taxonomy all concepts are equal; the TaxonomyNode hierarchy is used simply for convenience.
Some of the code samples in this series will illustrate why this is useful.
Here is a sample Taxonomy shown as a hierarchy.
Tissue and Cell Preparation
  Freeze Etching
  Freeze Substitution
  Sectioning
  Staining
    Horseradish Peroxidase
    Immunocytochemistry
      Ferritin Labeling
      Immunoelectron Microscopy
      Immunofluorescence Technique
      Immunoperoxidase
    Silver Impregnation
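A hierarchy like this one can be modelled as a simple tree. The following is a minimal Python sketch, not CIS or DFC code: the TaxonomyNode class shown here, its methods, and the way the sample nodes are nested are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class TaxonomyNode:
    """Hypothetical in-memory stand-in for a TaxonomyNode (not the CIS API)."""
    name: str
    children: List["TaxonomyNode"] = field(default_factory=list)

    def add(self, *names: str) -> List["TaxonomyNode"]:
        """Create child nodes with the given names and return them."""
        nodes = [TaxonomyNode(n) for n in names]
        self.children.extend(nodes)
        return nodes

    def walk(self, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
        """Yield the full path of this node and every descendant, depth first."""
        here = path + (self.name,)
        yield here
        for child in self.children:
            yield from child.walk(here)


# Assumed nesting of part of the sample taxonomy.
root = TaxonomyNode("Tissue and Cell Preparation")
_, _, _, staining = root.add(
    "Freeze Etching", "Freeze Substitution", "Sectioning", "Staining")
_, immuno, _ = staining.add(
    "Horseradish Peroxidase", "Immunocytochemistry", "Silver Impregnation")
immuno.add("Ferritin Labeling", "Immunoperoxidase")

# Each path could serve as a folder path or a navigation breadcrumb.
folder_paths = ["/" + "/".join(p) for p in root.walk()]
for folder in folder_paths:
    print(folder)
```

Walking the tree this way gives one path per node, which is the kind of structure you would need to build a folder hierarchy or a drill-down navigation menu.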
Domain Maps
A Domain Map represents these same concepts along with the keyword evidence for
each concept. CIS analyzes the document and finds keywords. It then uses the mapping
of keywords to concepts to decide which concept is most likely the key concept for
the given document. As a developer you can
retrieve all concepts and their weighting based on the evidence found in the document.
It is common for the structure of the concepts represented in the Domain Map to
match the structure modeled in the Taxonomy, as this eases administration.
Domain Maps actually consist of 3 things: Concepts, Concept
Types and Keywords. The following sections will look in detail at these elements.
Many companies already have Domain Maps defined and may have spent a substantial amount of money on
this process. Also, there are some industry standard Domain Maps available which help standardize
terminology across companies. You can leverage this existing work by importing these maps into CIS.
Taxonomies vs. Domain Maps
At this point you are probably wondering about the difference between Taxonomies and Domain Maps
and when you would use one versus the other. This is a very complex, and quite academic, subject
so we will just give you some pointers here. If you want more in-depth information then please
e-mail us and we will answer your specific questions.
We will start by clarifying a few points.
- The TaxonomyNodes in a Taxonomy can be represented in a hierarchical form.
- The concepts in a Domain Map can be represented in a hierarchical form.
- The structures of these 2 hierarchies do not have to match (although we would recommend that they do).
This raises the following questions:
Why wouldn't you always make the 2 structures match?
- Unless you use TEF (CIS's import mechanism), you must create your Domain Map and then add
individual TaxonomyNodes linked to the concepts. This is not a scalable solution if you have
a large number of nodes.
- If TEF is used, it can generate the TaxonomyNode objects in the form of a hierarchy, but the
concepts will not, by default, have a hierarchy. This is done to avoid "evidence propagation"
and to tag documents only to the leaf nodes. In this case the Domain Map is not a "master" but
a "slave" structure, and simply stores evidence information.
When would you traverse the Taxonomy rather than the Domain Map hierarchy?
- Traversing the Taxonomy is easier than traversing the Domain Map. If there is a Taxonomy, it should
probably be traversed.
- When the Domain Map does not have a hierarchy.
When would you traverse the Domain Map rather than the Taxonomy hierarchy?
- When you need to gather the evidence along with the nodes.
- When the Taxonomy does not have a hierarchy.
Why can't CIS synchronize the structures?
Version 5.0 of CIS will streamline this process; look out for more details coming soon.
Concepts
Concepts are the basic descriptors for a document. Each concept has a single type and is defined by keywords.
For example, the Concept 'IBM' may have a type of 'Company' and keywords including 'Big Blue', 'International
Business Machines', etc. CIS looks at the keywords in a document and weights each concept based on the
occurrence of keywords. It is quite possible for the concept's text itself not to appear in the document. Imagine
a document talking about banks, checks and overdrafts: the concept may be 'Financial Institution' even though this
string is not found in the document.
CIS will also return a 'weighting' based on how certain it is that the document relates to the concept.
As a developer you can retrieve the weighting for each concept as well as the evidence that CIS found,
i.e. the keywords that made CIS believe that this concept was relevant. You can also navigate the Taxonomy's
hierarchy in order to find the peer/child relationships between concepts. You'll see why this is useful when you
start to delve into the code samples.
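To make the idea of weighting and evidence concrete, here is a minimal Python sketch. It is not the CIS scoring algorithm, which this guide does not describe: the DOMAIN_MAP dictionary, the weigh_concepts function, and the rule that a concept's weighting is its share of all keyword hits are assumptions chosen purely for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical Domain Map fragment: concept -> keyword evidence.
DOMAIN_MAP: Dict[str, List[str]] = {
    "Financial Institution": ["bank", "check", "overdraft"],
    "IBM": ["big blue", "international business machines"],
}


def weigh_concepts(text: str) -> List[Tuple[str, float, Dict[str, int]]]:
    """Return (concept, weighting, evidence) tuples, highest weighting first.

    The weighting is simply the concept's share of all keyword hits
    (an assumption, not how CIS actually scores documents).
    """
    lowered = text.lower()
    results = []
    for concept, keywords in DOMAIN_MAP.items():
        evidence = Counter()
        for kw in keywords:
            # Whole-word match, allowing a plural "s" (a simplification).
            hits = len(re.findall(r"\b" + re.escape(kw) + r"s?\b", lowered))
            if hits:
                evidence[kw] = hits
        if evidence:
            results.append((concept, sum(evidence.values()), dict(evidence)))
    total = sum(score for _, score, _ in results)
    ranked = [(c, score / total, ev) for c, score, ev in results]
    return sorted(ranked, key=lambda r: r[1], reverse=True)


doc = ("Banks charge a fee when a check causes an overdraft. "
       "The bank may waive it for Big Blue employees.")
for concept, weighting, evidence in weigh_concepts(doc):
    print(concept, round(weighting, 2), evidence)
```

Note that the sample text never contains the string 'Financial Institution', yet that concept ranks first because its keywords supply most of the evidence.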
Concept Types
Concept Types are used to categorize concepts for convenience. For example you could list all
'Companies' by querying for all concepts with a type of 'Company'.
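As a rough illustration, the lookup described above amounts to a filter on the type of each concept. The mapping and function below are hypothetical Python stand-ins, not CIS calls, and the sample concepts and types are invented for the example.

```python
from typing import Dict, List

# Hypothetical concept -> Concept Type assignments; each concept has one type.
CONCEPT_TYPES: Dict[str, str] = {
    "IBM": "Company",
    "Documentum": "Company",
    "Financial Institution": "Organization",
}


def concepts_of_type(concept_type: str) -> List[str]:
    """List every concept whose type matches, in alphabetical order."""
    return sorted(c for c, t in CONCEPT_TYPES.items() if t == concept_type)


print(concepts_of_type("Company"))
```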
Evidence
Evidence is represented by keywords related to each concept. Finding these keywords in the
document suggests that the specific concept is relevant. (How relevant is determined by the frequency
and placement of the keywords.)
Feature Sets
A Feature Set is a logical collection of keywords used to clarify the context in which the keywords
are deemed to be valid evidence. For example, you could specify Branch + !Tree to match the
keyword 'branch' only when it is not used near the keyword 'tree'. This would provide evidence for,
say, a Bank-Branch concept rather than a Tree-Branch concept.
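The Branch + !Tree rule can be sketched as a proximity check. This Python example is only an approximation of the idea: the feature_set_hits function is hypothetical, and treating "near" as "within 5 words on either side" is our assumption, since the guide does not say how CIS measures proximity.

```python
import re
from typing import List


def feature_set_hits(text: str, required: str, excluded: str,
                     window: int = 5) -> List[int]:
    """Return word positions where `required` occurs with no `excluded`
    keyword within `window` words on either side."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if word != required:
            continue
        nearby = words[max(0, i - window): i + window + 1]
        if excluded not in nearby:
            hits.append(i)
    return hits


bank_text = "Visit your local branch to open an account."
tree_text = "The bird sat on a branch of the old oak tree."

print(feature_set_hits(bank_text, "branch", "tree"))  # evidence found
print(feature_set_hits(tree_text, "branch", "tree"))  # no evidence
```

Only the first sentence yields evidence for the Bank-Branch concept; in the second, 'tree' falls inside the window and the occurrence of 'branch' is discarded.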
Keywords
CIS effectively parses the document looking for these keywords and then builds a 'picture' of
the document based on the occurrence of the keywords. Keywords are the bottom of the Taxonomy food
chain; everything else is derived from them.
A Keyword is just a string. (You may also hear keywords referred to as 'tokens'; if you are
old enough to remember yacc and lex you will recognize the term.)
Models
When you apply a Domain Map to your document you get an intersection; this resultant set of
data is called a model. The model is the part
of the Domain Map which relates to your document.
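The intersection idea can be shown with plain set operations. This is a minimal Python sketch under our own assumptions: the build_model function and the sample Domain Map are hypothetical, and matching is reduced to exact, whitespace-separated words.

```python
from typing import Dict, Set

# Hypothetical Domain Map fragment: concept -> keywords.
DOMAIN_MAP: Dict[str, Set[str]] = {
    "Financial Institution": {"bank", "overdraft", "check"},
    "Staining": {"stain", "dye"},
    "Sectioning": {"microtome", "slice"},
}


def build_model(text: str, domain_map: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Keep only the concepts, and the keywords of each, that occur in the text."""
    words = set(text.lower().split())
    model = {}
    for concept, keywords in domain_map.items():
        found = keywords & words  # set intersection
        if found:
            model[concept] = found
    return model


doc = "apply the dye then slice the sample with a microtome"
print(build_model(doc, DOMAIN_MAP))
```

The result contains only the Staining and Sectioning concepts, together with the keywords that were actually found, which is the part of the Domain Map that relates to this document.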
There are 3 types of model that CIS can create.
Concept List
This is a list of the concepts that CIS found when comparing the document to the Domain Map.
Concept Header
It is possible to have CIS suggest, or automatically add, concepts to the Domain Map based on
their recurrence in documents. The concept header model would contain all of the concepts that
CIS thinks should be added to the Domain Map.
Text Model
Using filters CIS can extract key parts of your documents and output them as XML content
for later reuse.
Architecture
CIS is a server process and is usually installed on its own server rather than
on the Docbase's server.
CIS has its own public, but unsupported, API called the BSS API, which is separate from the DFC. BSS
API calls can be made
either synchronously or asynchronously. Typically, a command to automatically categorize a document
would be made asynchronously: CIS queues the command and processes the request later. If you
wanted to query CIS to get a list of possible values for a drop-down list, you would make a
synchronous call.
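The difference between the two calling styles can be sketched without any CIS code at all. In the Python example below, FakeCisClient and every method on it are hypothetical stand-ins (none of these names come from the BSS API); a single-worker thread pool plays the part of the CIS command queue.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


class FakeCisClient:
    """Hypothetical stand-in for a CIS client; these are not BSS API calls."""

    def __init__(self) -> None:
        # A single worker processes queued commands one at a time, in order.
        self._queue = ThreadPoolExecutor(max_workers=1)
        self.categorized: List[str] = []

    def list_values(self, concept_type: str) -> List[str]:
        """Synchronous call: the caller blocks until the answer is returned."""
        values = {"Company": ["Documentum", "IBM"]}
        return values.get(concept_type, [])

    def categorize_async(self, document_id: str) -> Future:
        """Asynchronous call: queue the command and return immediately."""
        return self._queue.submit(self._categorize, document_id)

    def _categorize(self, document_id: str) -> str:
        self.categorized.append(document_id)
        return document_id

    def close(self) -> None:
        """Wait for queued commands to finish, then release the worker."""
        self._queue.shutdown(wait=True)


client = FakeCisClient()
pending = [client.categorize_async(doc_id) for doc_id in ("doc-1", "doc-2")]
drop_down = client.list_values("Company")  # needed right away: synchronous
client.close()                             # queued work has finished after this
print(drop_down, [f.result() for f in pending])
```

The drop-down values are needed immediately by the user interface, so the caller waits for them; categorization can happen in the background, so the caller only holds on to a handle for each queued request.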
When you install CIS you get an 'unsupported' directory which contains the SDK, including an API
overview in PDF format and Javadocs (both are also available below).
Resources
Content Personalization Services Application Developer's Guide
The Content Personalization Services (CPS) Application Developer's Guide describes how to create
applications that access the Documentum Content Personalization Services. You can access the
functionality of CPS through the Basic System Services (BSS) API. The Content Personalization
Services SDK provides tools and instructions for creating your own applications using the BSS API.
Content Personalization Services API Javadocs
CPS's Basic System Services (BSS) API packages provide a means of gaining access to the Business
Intelligence Server's functionality. We are making the BSS Javadocs available to give you the
opportunity to explore the depth of the CPS server's abilities.