The combination of
the representativeness of UMBEL's subject concepts (the scope
of the ontology) and their relationships (the structure of
the backbone) is fundamental. These factors in turn express the functional
capabilities of the system. The use of OpenCyc as the source basis for
UMBEL is fundamental to these capabilities.
First Things First: The Importance of Context
A reference structure of
almost any nature has value. A reference structure provides context, which in turn provides fixed points
in the information space for relating distributed datasets to one
another. A reference structure of concepts has the further
benefit of providing a logical reference structure for instances as
well.
While Wikipedia is perhaps the most comprehensive
collection of well-known instances, no single source can or will be
complete in scope. Thus, many public and private sources of entities
will emerge as reference hubs.
How does each of these rich
instance sources relate to the others? What is the subject concept or
topical basis by which they overlap or complement one another? What is the
framework and graph structure of knowledge that gives this information
context?
Answering these questions is the benefit brought by a structure of
reference concepts, independent of the specifics of the reference
structure itself.
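The idea of concepts as fixed points can be sketched in a few lines of Python. The source names, instance labels and concept names below are invented for illustration and are not drawn from UMBEL or OpenCyc; the sketch only shows how two instance sources that share no records can still be related through a common set of reference concepts.

```python
from collections import defaultdict

# Hypothetical instance sources: each maps an instance label to the
# reference concept it has been assigned. These names are placeholders,
# not actual UMBEL or OpenCyc identifiers.
source_a = {"Mona Lisa": "Painting", "Louvre": "Museum", "Paris": "City"}
source_b = {"The Starry Night": "Painting", "MoMA": "Museum", "Danube": "River"}


def index_by_concept(sources):
    """Group instances from every source under their reference concept."""
    index = defaultdict(lambda: defaultdict(set))
    for source_name, instances in sources.items():
        for instance, concept in instances.items():
            index[concept][source_name].add(instance)
    return index


def overlap_and_complement(sources):
    """Return concepts present in all sources, and concepts found in only one."""
    index = index_by_concept(sources)
    shared = {c for c, by_src in index.items() if len(by_src) == len(sources)}
    unique = {c: next(iter(by_src)) for c, by_src in index.items() if len(by_src) == 1}
    return shared, unique


sources = {"A": source_a, "B": source_b}
shared, unique = overlap_and_complement(sources)
print(sorted(shared))          # ['Museum', 'Painting']
print(sorted(unique.items()))  # [('City', 'A'), ('River', 'B')]
```

The shared concepts show where the sources overlap, and the unique ones show where they complement one another; neither answer is available without the common reference concepts.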
Over time, it is likely that a few Web-based
reference structures will emerge, compete, and be supplemented by
still further structures. This evolution is expected, natural and
desirable in that it provides choice.
Alternative Approaches
Since the Web's inception, various
alternatives have been tried or have been in ascendance for organizing and bringing
structure to Web content. Some of these may be too static and
inflexible, others perhaps too arbitrary or parochial. All approaches
to date have had little collective success.
Here is a
summary of some of these alternative approaches:
Existing library systems: Dewey Decimal Classification, Library of Congress, UDC and many other library classification schemes have been touted for the Web. None has enjoyed broad acceptance. Reasons cited for this failure include that physical books are very different from free digital bits, that Web schema need to evolve quickly, and that stewards and curation are lacking.
Market share: at various times certain successful vendors have held temporary, minor ascendance with content organizational frameworks, generally directory structures. Examples include About, Yahoo!, Open Directory Project (DMOZ), Northern Light, etc. Yet even at their peaks, market shares were low, external adoption was rare, and scope was arbitrary and open to question; interest in directories is now nearly absent.
WordNet: though of strong interest and use to computational linguists, and quite popular for many content analyses, WordNet has seen little consumer or commercial interest. However, its synset structure and coverage are extremely valuable for concept disambiguation, and WordNet therefore has a role in UMBEL (as it does in many other online systems).
Standards efforts: some sporadic successes and some notable failures have occurred in the standards arenas. Generally, the successful initiatives tend to be in close communities where there are clear financial benefits for adherence, such as in the exchange of financial or commerce data; broader and more ambitious efforts have tended to be less successful.
Professional organizations and associations: areas such as finance, pharmaceuticals, biology, physics and many other bounded communities have enjoyed sporadic and sometimes notable success in developing and using domain-specific schema; none has yet transferred beyond its beginning boundaries to the broader Web.
Government initiatives: there are episodic successes for government-sponsored content organizational initiatives, mostly in metadata, controlled vocabularies and ontologies, often where contractors or suppliers may be compelled to comply. NIH's National Library of Medicine (and other NIH branches) has also seen significant domain successes, due to its foresight and its receptive biology, genetics and medical communities.
Upper ontologies: UMBEL investigated this area considerably in the early months of the project. Most of the upper ontologies have relatively sparse subject concept content, being geared to smaller, abstract-oriented upper structures. Some, such as SUMO and DOLCE and now PROTON, have concerted initiatives to extend to middle- and domain-level ontologies. To date, penetration of these systems into general Web or commercial realms has been quite limited.
Wikipedia: a clear and phenomenal success, Wikipedia and related initiatives like Wikinvest and Wikicompany and scores more have proven to be a rich fount for named entities and article-length content, but not for the category and content organization structures in which that content is embedded. This is an area of keen academic and collective interest, and it may still result in useful organizational schema as these popular wikis continue to evolve and mature. However, they have not yet done so; while they are a rich source for entities and data, UMBEL decided to pass on their use for backbone structure at this time.
No collective structure: tagging or folksonomies or doing nothing have perhaps the greatest market share at present.
Since its inception, the stated intent of the UMBEL project has been to base
its subject structure on extant systems. To minimize development time,
the structure needed to be drawn from one of the categories above.
Development of a de novo structure was rejected
because of the development time required and the low probability of gaining
acceptance in the face of so many competing alternatives.
Rationale for OpenCyc
The granddaddy
of knowledge bases suitable to all human content and knowledge is Cyc.
Because of its more than 20-year history, Cyc brings with it
considerable strengths and some weaknesses.
Amongst all
alternatives, Cyc rapidly emerged as the leading candidate. While its
strengths warranted close attention, its weaknesses suggested that
considerable effort would be needed to overcome them. This combination called for
significant investigation and due diligence.
First,
here are OpenCyc's strengths:
Venerable and solid: through an estimated 1000 person-years of engineering and effort, the Cyc structure has been tested and refined through many projects and applications. While a few years back such groundings were unparalleled in the field, we are also now seeing some Internet-wide projects tap into the law of large numbers to get significant inputs of human labor. Cyc has also tapped this approach for ongoing expansion of its KB using the online FACTory game.
Community: there is a large community of Cyc users and supporters from academic, government, commercial and non-profit realms. Moreover, the formation of The Cyc Foundation has served as a vehicle for tapping into volunteer effort.
Upgrade path: OpenCyc has an upgrade path to the more capable ResearchCyc, full Cyc and the services of Cycorp.
Comprehensive: no existing system has the scope, breadth and coverage of human concepts to match that of Cyc (however, sources for named entities such as Wikipedia have recently passed Cyc in scope; see next section).
Common sense: since its founding as a project, later backed by the standalone Cycorp, Cyc has set for itself a challenge that is both more pragmatic and harder than that of other knowledge systems: to capture the common sense at the heart of human reasoning. This objective means codifying generally unstated logic and rules of thumb, not unlike teaching a baby to walk and talk and read, all of which are lengthy tasks of trial and error. However, as Cyc has gained this foundation, it has also acquired a more solid basis for its reasoning and conceptual relationships.
Power and inference: ultimately the purpose of a knowledge base is to support reasoning and inference by computer when presented with an (often small) starting set of assertions or facts. Cyc has literally thousands of microtheories now governing its inference domains, giving it a scope and power unmatched by other systems. The importance of such reasoning lies not in the silly science fiction of autonomous intelligent robots, but in achievable aids to make connections, determine relationships, and filter and order results.
Robust supporting capabilities: such knowledge base-wide capabilities can also be deeply leveraged in such areas as entity extraction, machine translation, natural language processing, risk analysis or one of the other dozens of specialty modules available in Cyc.
Free and open: last, but not least, a mostly complete Cyc was released as a free and open source version in 2002. OpenCyc has now been downloaded more than 100,000 times and is in production use for many applications. Non-profits and academics can also obtain access to the full capabilities of the Cyc system through ResearchCyc. This open character is absolutely essential because leading Web applications and leading innovators of the Web eschew proprietary systems.
As first encountered, one impression of OpenCyc was that of a very solid structure, but somewhat obscured and deserving of a fresh cleaning.
The Decision and Implementation
Nearly five full months of due diligence were devoted to the question of the suitability of OpenCyc as the conceptual and relationship grounding for UMBEL.
On balance, OpenCyc's benefits significantly outweighed its weaknesses at the time. This balance was also considerably superior to that of all potential alternatives. An important factor throughout this deliberation was the commitment of Cycorp and The Cyc Foundation to the aims of UMBEL, and the willingness of those organizations to lend time and effort.
The decision was thus made in October 2007 to base UMBEL on OpenCyc and to undertake the (eventual) two person-years of effort to clean and vet the OpenCyc knowledge base for UMBEL's purposes.
As discussed in the accompanying piece on UMBEL's role, the project has also made two pivotal decisions with respect to OpenCyc and its use:
All UMBEL subject concepts are based on existing concepts in OpenCyc. This means UMBEL inherits the proven structure and relationships extant in OpenCyc.
No new subject concepts will be added to UMBEL that are not included in OpenCyc. This means that UMBEL's structure will not diverge from the structural relations already in OpenCyc. This decision preserves the use of UMBEL as a sort of contextual middleware between unstructured Web content and the inferential and tools infrastructure within OpenCyc (and beyond into ResearchCyc and Cyc for commercial purposes) and back again to the Web. We term this "round-tripping", and the capability is available for any of the 20,000 subject concepts vetted from OpenCyc within UMBEL.
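Taken together, the two decisions amount to a subset constraint: every UMBEL subject concept must already exist in OpenCyc. A minimal sketch of that constraint, and of the two-way identifier lookup that round-tripping relies on, might look like the following. The concept names and the `umbel:` and `cyc:` prefixes are invented placeholders, not actual UMBEL or OpenCyc identifiers, and the actual vetting took person-years of effort rather than a set comparison.

```python
# Placeholder concept names; not real OpenCyc or UMBEL identifiers.
opencyc_concepts = {"Dog", "Mammal", "Animal", "City", "Museum"}
umbel_candidates = {"Dog", "Mammal", "City", "Podcast"}


def vet_subject_concepts(candidates, source_concepts):
    """Keep only candidates already present in the source knowledge base."""
    accepted = candidates & source_concepts
    rejected = candidates - source_concepts
    return accepted, rejected


def build_round_trip_maps(accepted):
    """Build two-way lookups between hypothetical UMBEL and Cyc identifiers."""
    to_cyc = {f"umbel:{name}": f"cyc:{name}" for name in accepted}
    to_umbel = {cyc_id: umbel_id for umbel_id, cyc_id in to_cyc.items()}
    return to_cyc, to_umbel


accepted, rejected = vet_subject_concepts(umbel_candidates, opencyc_concepts)
to_cyc, to_umbel = build_round_trip_maps(accepted)

# Round trip: UMBEL -> Cyc -> UMBEL returns the starting identifier.
start = "umbel:Dog"
assert to_umbel[to_cyc[start]] == start

print(sorted(accepted), sorted(rejected))  # ['City', 'Dog', 'Mammal'] ['Podcast']
```

Because every accepted concept has a counterpart on the Cyc side, the lookup is invertible; a concept added to UMBEL alone (here the hypothetical `Podcast`) would have no counterpart and could not make the round trip.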
Fortunately, in the intervening months, Cycorp has been responsive and made changes to the OpenCyc concept structure and its conversion to OWL in support of needs and observations brought forth by the UMBEL project. These changes provide comfort and a solid, adaptable structure for UMBEL moving forward.