TEXT Indexing Elements
This chapter describes elements related to Tibero TEXT indexes.
Tibero TEXT indexes can be configured with the following preference classes:
STORAGE
Storage of an index and a table, which are created when a TEXT index
is created.
LEXER
Lexer used for the TEXT index.
WORDLIST
Token created for the TEXT index.
STOPLIST
Words which are excluded from TEXT indexes.
DATASTORE
Datastore that specifies where your TEXT index is stored.
FILTER
Filtering method to index your document.
SECTION GROUP
Organizes your document into section groups.
Preferences can be created using TEXT_DDL.CREATE_PREFERENCE.
Tibero TEXT provides predefined preferences for the seven preference classes above.
Users can create new preferences using the defined preferences.
After a preference has been created, the properties of the preference can be changed using TEXT_DDL.SET_ATTRIBUTE.
STORAGE
The STORAGE class specifies the storage for an index and table which are created when a TEXT indexis created.
BASIC_STORAGE is defined in the STORAGE class and has the following properties:
I_TABLE_CLAUSE
Tablespace of a table that saves TEXT index token.
I_ROWID_INDEX_CLAUSE
Tablespace of an index for the ROWID column in a token table.
I_INDEX_CLAUSE
Tablespace of an index for a token column in a token table.
K_TABLE_CLAUSE
Reserved for future functionality.
P_TABLE_CLAUSE
Reserved for future functionality.
N_TABLE_CLAUSE
Reserved for future functionality.
R_TABLE_CLAUSE
Tablespace of an index for the ROWID column in a mapping table.
S_TABLE_CLAUSE
Reserved for future functionality.
LEXER
The LEXER class specifies the lexer to use for a TEXT index.
BASIC_LEXER
Uses whitespace or special character delimited words.
BASIC_LEXER has the following attributes:
MIXED_CASE
Option to preserve case-sensitivity of text for indexing. If set to YES, case
is preserved. (Attribute: TRUE/FALSE, default value: FALSE)
BASE_LETTER
Option to ignore base letters, such as Umlaut and acute accent for indexing.If set to YES, base letters are ignored.
(Attribute: TRUE/FALSE, default value: FALSE)
WHITESPACE
Option to search for a sentence or paragraph by joining with punctuation or newline characters to separate sentences or paragraphs. The default valueis space or tab. Other than the default value, user-specified whitespace characters can be recognized.
(Attribute: string, default value: space or tab characters)
JAPANESE_LEXER
Used to generate tokens in Japanese text. JAPANESE_LEXER is a Japanese morphological analyzer. JAPANESE_LEXER has the following attributes.
ASCII_MIXED_CASE
Option to preserve case-sensitivity of ASCII text for indexing. If set to YES, case is preserved. (Attribute: TRUE/FALSE, Default: FALSE)
DELIMETER
Option to ignore Japanese delimiters, such as /, -, ·, etc. If set to YES, such characters are ignored. (Attribute: TRUE/FALSE, Default: FALSE)
CHINESE_LEXER
Used to generate tokens in Chinese text. CHINESE_LEXER has the following attributes.
Only available in UTF8 and GBK character sets.
Can use CONTEXT indexes but not CTXCAT indexes.
CHINESE_VGRAM_LEXER
Used to generate tokens in Chinese text. Unlike CHINESE_LEXER, tokens are generated based on the number of characters. Lexical meaning is not used.
CHINESE_VGRAM_LEXER has the following limitations.
Only available in UTF8 character set.
WORLD_LEXER
Used to generate tokens from documents containing multiple languages. This automatically detects the language and generates a token. Tokens are generated based on whitespace for languages with whitespace, and they are made in 2-grams for Japanese or Chinese containing no whitespace.
WORLD_LEXER has the following limitations.
Only available in UTF8 character sets.
Japanese or Chinese, which generate tokens in 2-gram, do not support wildcards.
Can use CONTEXT indexes but not CTXCAT indexes.
WORDLIST
The WORDLIST class specifies a token created when a TEXT index is created. BASIC_WORDLIST is defined in the WORDLIST class and has the following properties:
PREFIX_INDEX
If this is specified as YES, prefixes of words are indexed to improve the performance of searching for a prefix.
(Attribute: YES/NO, Default: NO)
PREFIX_MIN_LENGTH
Minimum length of a prefix when PREFIX_INDEX is specified as YES. (Range: 1 ~ 64, Default: 1)
PREFIX_MAX_LENGTH
Maximum length of a prefix when PREFIX_INDEX is specified as YES. (Range: 1 ~ 64, Default: 64)
STOPLIST
The STOPLIST class specifies words which are excluded from TEXT indexes. DEFAULT_STOPLIST and EMPTY_STOPLIST are defined in the STOPLIST class.
EMPTY_STOPLIST is a STOPLIST without STOPWORD.
A new stoplist can be created using TEXT_DDL.CREATE_STOPLIST. A new stopword can be added to the list using TEXT_DDL.ADD_STOPWORD.
DATASTORE
The DATASTORE class specifies how your documents are stored. The DATASTORE class defines DIRECT_DATASTORE and MULTI_COLUMN_DATASTORE.
DIRECT_DATASTORE
DIRECT_DATASTORE internally stores data in text columns. Each row is indexed as a single document. DIRECT_DATASTORE has no attributes.
MULTI_COLUMN_DATASTORE
MULTI_COLUMN_DATASTORE stores data in multiple columns. Tibero combines the columns specified with MULTI_COLUMN_DATASTORE by each row and then indexes the rows.
COLUMNS
Comma-separated columns to be indexed. You can specify the columns of type VARCHAR, CLOB, NUMBER, and DATE. You can also specify column
name, column expression, and PSM function. (Attribute: STRING)
DELIMITER
Delimiter to split the column text.
– COLUMN_NAME_TAG: Delimits the text of each column by XML format start and end tags. Each tag has a column name. (Default)
– NEWLINE: Delimits the text of each column by newline.
FILTER
The FILTER class specifies how to filter documents to be indexed. The FILTER class only defines NULL_FILTER.
NULL_FILTER
NULL_FILTER does not filter. It indexes general documents, HTML and XML documents without filtering. Type NULL_FILTER has no attributes.
SECTION GROUP
The SECTION_GROUP class divides the document structure into multiple sections. This class only defines NULL_SECTION_GROUP, which does not define a section. You can use TEXT_DDL.CREATE_SECTION_GROUP to create a new SECTION_GROUP.
TEXT_DDL Package
The following procedures are provided by TEXT_DDL to create and manage preferences.
ADD_SPECIAL_SECTION
Adds a SENTENCE or PARAGRAPH as a SPECIAL SECTION in
a SECTION GROUP.
ADD_STOPWORD
Adds a stopword to a stoplist.
CREATE_PREFERENCE
Creates a new preference.
CREATE_SECTION_GROUP
Defines a new SECTION_GROUP.
CREATE_STOPLIST
Creates a new stoplist.
DROP_PREFERENCE
Deletes a preference.
DROP_SECTION_GROUP
Deletes SECTION_GROUP.
DROP_STOPLIST
Deletes a stoplist.
REMOVE_SECTION
Deletes a SECTION from a SECTION GROUP.
REMOVE_STOPWORD
Deletes a stopword from a stoplist.
SET_ATTRIBUTE
Specifies the attributes of a preference.
ADD_SPECIAL_SECTION
The ADD_SPECIAL_SECTION procedure adds a SENTENCE or PARAGRAPH as a SPECIAL SECTION in a SECTION GROUP. Adding a special section allows you to find a sentence or paragraph by using the WITHIN operator.
After specifying section groups containing special sections, the delimiter specified when creating indexes tokenizes special sections.
Sentences are delimited by [word][punctuation][whitespace] or [word][punctuation][newline]. Paragraphs are delimited by [word][punctuation][newline][whitespace] or [word][punctuation][newline][newline].
Prototype
Parameter
group_name
Name of a SECTION GROUP to be added.
section_name
Name of a section to be added. Can be specified as 'SENTENCE' or
'PARAGRAPH'.
Example
ADD_STOPWORD
The ADD_STOPWORD procedure adds a stopword to a stoplist.
Prototype
Parameter
stoplist_name
STOPLIST name.
stopword
Stopword to add.
lang
Language to which the stopword applies. (Unsupported)
Example
CREATE_PREFERENCE
The CREATE_PREFERENCE procedure creates a new preference.
Prototype
Parameter
udef_pref_name
PREFERENCE name
pdef_pref_name
Predefined Tibero TEXT preference name.
Example
CREATE_SECTION_GROUP
The CREATE_SECTION_GROUP procedure creates a new section group.
Prototype
Parameter
udef_pref_name
Preference name.
pdef_pref_name
Predefined Tibero TEXT preference name.
Example
CREATE_STOPLIST
The CREATE_STOPLIST procedure creates a new stoplist.
Prototype
Parameter
stoplist_name
Stoplist name.
stoplist_type
Stoplist type.
Currently, DEFAULT_STOPLIST and EMPTY_STOPLIST are supported.
Example
DROP_PREFERENCE
The DROP_STOPLIST procedure deletes a stoplist.
Prototype
Parameter
udef_pref_name
Preference name.
Example
DROP_SECTION_GROUP
The DROP_PREFERENCE procedure deletes a preference.
Prototype
Parameter
udef_pref_name
Section group name.
Example
DROP_STOPLIST
The DROP_STOPLIST procedure deletes a stoplist.
Prototype
Parameter
stoplist_name
Stoplist to delete.
Example
REMOVE_SECTION
Deletes a SECTION from a SECTION GROUP
Prototype
Parameter
group_name
Name of a SECTION GROUP to be deleted.
section_name
Name of a SECTION to be deleted.
Example
REMOVE_STOPWORD
The REMOVE_STOPWORD procedure deletes a stopword from a stoplist.
Prototype
Parameter
stoplist_name
Name of a stoplist which includes the stopword to be deleted.
stopword
Stopword to delete.
lang
Language to which the stopword applies. (Unsupported)
Example
SET_ATTRIBUTE
The SET_ATTRIBUTE procedure specifies the attributes for a preference.
Prototype
Parameter
udef_pref_name
Preference to specify the attribute for.
attribute_name
Attribute name.
value
Attribute value.
Example
CTX_DDL Package
The following are procedures provided by CTX_DDL to manage CONTEXT indexes.
OPTIMIZE_INDEX
Optimizes synchronized indexes. This improves query performance because old data is removed.
Prototype
Parameter
idx_name
Index name.
optlevel
One of the following two optimization methods.
– FAST: combine duplicate tokens, but do not remove old (removed row) data.
– FULL: combine duplicate tokens, and then remove the old (removed row) data.
maxtime
Maximum optimization time. (Unsupported)
token
Token to be optimized. (Unsupported)
part_name
Partition name for local indexes.
token_type
Token type. (Unsupported)
parallel_degree
Parallel degree. (Unsupported)
Example
SYNC_INDEX
Synchronizes CONTEXT indexes to process inserts, updates, and deletes to the base table.
Prototype
Parameter
idx_name
Index name.
memory
Runtime memory to use for synchronization. (Unsupported)
part_name
Partition name for local indexes.
parallel_degree
Parallel degree. (Unsupported)
maxtime
Maximum synchronization time. (Unsupported)
locking
How to deal with the case where another synchronization is already running
in the same index or partition. (Unsupported)
Example
CTX_OUTPUT Package
The following are procedures provided by CTX_OUTPUT to manage CONTEXT indexes.
END_LOG
Stops logging for indexing.
Prototype
Example
START_LOG
Starts logging for indexing.
Prototype
Parameter
file_name
Name of the log file.
write_option
If the log file already exists, this parameter specifies whether you overwrite
the file or add a new log file.
Example
Last updated