Mapping
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
Applicable to text fields/values
Text values are analyzed when indexing documents
The result is stored in data structures that are efficient for searching / etc.
The _source object is not used when searching for documents
It contains the exact values specified when indexing a document
Character filters add, remove, or change characters
An analyzer contains zero or more character filters
Character filters are applied in the order in which they are specified
Example: html_strip filter
Input:
I'm in a <em>good</em> mood - and I <strong>love</strong> it!
Output:
I'm in a good mood - and I love it!
An analyzer contains one tokenizer
Tokenizes a string (i.e. splits it into tokens)
Characters may be stripped as part of the tokenization
Example:
Input:
Oh my god!
Output:
["Oh", "my", "god"]
Token filters receive the output of the tokenizer as input (i.e. the tokens)
A token filter can add, remove, or modify tokens
An analyzer contains zero or more token filters
Token filters are applied in the order in which they are specified
Example: lowercase filter
Input:
["I", "REALLY", "like", "beer"]
Output:
["i", "really", "like", "beer"]
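The three stages above (character filters, then one tokenizer, then token filters) can be sketched in plain Python. This is only a rough stand-in for what Lucene does: the regular expressions and the function names (html_strip, tokenize, lowercase, analyze) are assumptions made for illustration, not Elasticsearch APIs.

```python
import re

def html_strip(text):
    # Crude character filter: drop anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    # Crude tokenizer: runs of word characters, keeping inner apostrophes.
    return re.findall(r"\w+(?:'\w+)*", text)

def lowercase(tokens):
    # Token filter: modify every token.
    return [token.lower() for token in tokens]

def analyze(text, char_filters=(), token_filters=()):
    # Filters run in the order in which they are specified.
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenize(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(html_strip("I'm in a <em>good</em> mood - and I <strong>love</strong> it!"))
print(tokenize("Oh my god!"))
print(analyze("I REALLY like beer", token_filters=[lowercase]))
```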
By default, there are no character filters. The input is tokenized by the standard tokenizer and then passed through the lowercase token filter.
To test:
To add a character filter and a token filter:
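A minimal sketch of the two request bodies for the Analyze API, built as Python dictionaries and printed as JSON. The sample text is borrowed from the html_strip example earlier in these notes; sending the request (for example with an HTTP client) is left out on purpose.

```python
import json

text = "I'm in a <em>good</em> mood - and I <strong>love</strong> it!"

# Test the default behaviour: the standard analyzer.
default_request = {"analyzer": "standard", "text": text}

# Spell out the pipeline: one character filter, one tokenizer, one token filter.
custom_request = {
    "char_filter": ["html_strip"],
    "tokenizer": "standard",
    "filter": ["lowercase"],
    "text": text,
}

for body in (default_request, custom_request):
    print("POST /_analyze")
    print(json.dumps(body, indent=2))
```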
One inverted index per text field
Terms are sorted alphabetically for performance reasons
Created and maintained by Apache Lucene!
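To make the idea concrete, here is a toy inverted index for a single text field. It is a sketch only: Lucene's real structures are far more compact, and the two sample documents and the build_inverted_index helper are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Terms are kept in sorted order.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {1: "I like beer", 2: "I really like coffee"}
inverted = build_inverted_index(docs)
for term, doc_ids in inverted.items():
    print(term, doc_ids)
```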
A mapping defines the structure of documents (fields and data types)
It is also used to configure how a field is indexed
Similar to a database schema
Two types of mapping:
Explicit mapping: Define field mappings ourselves
Dynamic mapping: ES generates field mappings for us
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
object
text
float
boolean
double
short
long
integer
date
...
Object
Used for any JSON object
Objects may be nested
Mapped using the properties parameter
Objects are not stored as objects in Apache Lucene but are transformed to ensure that we can index any valid JSON
nested
Similar to the object data type, but maintains object relationships
Good for indexing arrays of objects
Nested objects are stored as hidden documents
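The difference between object and nested can be illustrated with a small simulation. The flatten helper below is an assumption that mimics how an array of objects is flattened into multi-valued fields; the reviews document is hypothetical.

```python
def flatten(doc, prefix=""):
    # Turn nested objects into dotted field names with lists of values.
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat

doc = {"reviews": [{"author": "alice", "rating": 5},
                   {"author": "bob", "rating": 3}]}

# object: the link between "alice" and 5 is lost after flattening.
flat = flatten(doc)
object_match = "alice" in flat["reviews.author"] and 3 in flat["reviews.rating"]

# nested: every object is checked on its own, like a hidden document.
nested_match = any(r["author"] == "alice" and r["rating"] == 3
                   for r in doc["reviews"])

print(flat)
print("object matches alice AND rating 3:", object_match)
print("nested matches alice AND rating 3:", nested_match)
```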
keyword
Used for exact matching of values
Good for filtering, aggregations and sorting
For full-text searches, use the text data type instead
keyword fields are analyzed with the keyword analyzer
It is a no-op analyzer
It outputs the unmodified string as a single token
Suppose we have created an index with the following:
The first two documents will be indexed.
Behind the scenes, ES maps price to float when indexing the first document.
When indexing the second document, ES indexes "7.4" as a float instead of a string, due to coercion.
However, when inspecting the _source field, we will still see "price": "7.4". This means _source does not reveal how the field is indexed.
The third document will not be indexed.
Coercion is enabled by default, but it can be disabled.
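A rough simulation of coercion for a float field. The three documents are hypothetical stand-ins (in particular "7.4m" is an assumed example of a value that cannot be coerced), and Python's float() is more lenient than Elasticsearch in some edge cases, so treat this as an illustration only.

```python
def index_price(value, coerce=True):
    """Return the float that would be indexed, or raise ValueError."""
    if isinstance(value, bool):
        raise ValueError(f"cannot index {value!r} as float")
    if isinstance(value, (int, float)):
        return float(value)
    if coerce and isinstance(value, str):
        return float(value)  # raises ValueError for strings such as "7.4m"
    raise ValueError(f"cannot index {value!r} as float")

documents = [{"price": 7.4}, {"price": "7.4"}, {"price": "7.4m"}]
for doc in documents:
    try:
        # The document itself (its _source) is left untouched.
        print(doc, "-> indexed as", index_price(doc["price"]))
    except ValueError as error:
        print(doc, "-> rejected:", error)
```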
To create a mapping for an index:
Note that dot notation can also be used for the fields of an object:
To retrieve a mapping for an index:
To add an additional mapping to existing indices:
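A sketch of the request bodies for the steps above, written as Python dictionaries and printed as JSON. The index name reviews and every field name are hypothetical.

```python
import json

# Create an index with an explicit mapping (object field via "properties").
create_index = ("PUT", "/reviews", {
    "mappings": {"properties": {
        "rating": {"type": "float"},
        "content": {"type": "text"},
        "author": {"properties": {
            "first_name": {"type": "text"},
            "email": {"type": "keyword"},
        }},
    }},
})

# The same author fields written with dot notation.
create_index_dotted = ("PUT", "/reviews_dotted", {
    "mappings": {"properties": {
        "rating": {"type": "float"},
        "content": {"type": "text"},
        "author.first_name": {"type": "text"},
        "author.email": {"type": "keyword"},
    }},
})

# Retrieve the mapping of an index.
get_mapping = ("GET", "/reviews/_mapping", None)

# Add a new field mapping to an existing index.
add_field = ("PUT", "/reviews/_mapping", {
    "properties": {"created_at": {"type": "date"}},
})

for method, path, body in (create_index, create_index_dotted, get_mapping, add_field):
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```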
Dates are specified in one of three ways:
Specially formatted strings
Milliseconds since the epoch (long) (1st of Jan 1970)
Seconds since the epoch (integer)
Custom formats are also supported
By default:
3 supported formats:
A date without time
A date with time
Milliseconds since the epoch (long)
If you are using a UNIX timestamp (in seconds), remember to multiply it by 1000
UTC timezone assumed if none is specified
Dates must be formatted according to the ISO 8601 spec
Example:
"2020-07-07" --> ES will assume it is 00:00 UTC
"2020-07-07T10:08:02Z" --> UTC
"2020-07-07T11:08:02+01:00" --> UTC+1
Customize the format for date
Recommended to use the default format if possible
"strict_date_optional_time||epoch_millis"
Using Java's DateTimeFormatter syntax, e.g. dd/MM/yyyy
Using built-in formats like epoch_second
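The default date handling can be approximated in Python as follows. The to_epoch_millis helper is an assumption that only covers the three example strings from these notes; the purchased_at field in the mapping is hypothetical and shows where a custom format would go.

```python
import json
from datetime import datetime, timezone

def to_epoch_millis(value):
    # Numbers are taken to be milliseconds since the epoch already.
    if isinstance(value, int):
        return value
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # No offset given: assume UTC (and 00:00 for a date without time).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

for value in ("2020-07-07", "2020-07-07T10:08:02Z", "2020-07-07T11:08:02+01:00"):
    print(value, "->", to_epoch_millis(value))

# A UNIX timestamp is in seconds, so multiply it by 1000 first.
unix_seconds = 1594116482
print(unix_seconds, "->", to_epoch_millis(unix_seconds * 1000))

# Mapping with a custom pattern plus a built-in format (hypothetical field).
mapping = {"properties": {
    "purchased_at": {"type": "date", "format": "dd/MM/yyyy||epoch_second"},
}}
print(json.dumps(mapping, indent=2))
```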
The properties parameter defines nested fields for object and nested fields
The coerce parameter is used to enable / disable coercion of values (enabled by default)
To disable at the field level: "field": { "type": "<type>", "coerce": false }
To disable at the index level:
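A sketch of an index creation body that turns coercion off for the whole index through the index.mapping.coerce setting, while one field opts back in. The field names are hypothetical.

```python
import json

index_body = {
    "settings": {"index.mapping.coerce": False},
    "mappings": {"properties": {
        # Inherits the index-level setting: no coercion.
        "price": {"type": "float"},
        # Field-level setting overrides the index-level one.
        "amount": {"type": "integer", "coerce": True},
    }},
}
print(json.dumps(index_body, indent=2))
```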
ES makes use of several data structures
No single data structure serves all purposes
Inverted indices are excellent for searching text, but perform poorly for many other data access patterns
Doc values is another data structure used by Apache Lucene
It is optimized for a different data access pattern (document --> terms)
doc_values is essentially an uninverted inverted index, used for sorting, aggregations and scripting
It is an additional data structure but not a replacement
We can disable doc_values to save disk space by setting it to false
If you won't use the field for aggregations / sorting / scripting, you can disable doc_values
However, this cannot be changed without reindexing documents into a new index
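A toy illustration of the two access patterns. The documents and the tags field are hypothetical, and plain dictionaries stand in for Lucene's on-disk structures.

```python
from collections import Counter

docs = {1: ["beer", "lager"], 2: ["wine"], 3: ["beer"]}

# Inverted index: term --> documents (good for searching).
inverted = {}
for doc_id, terms in docs.items():
    for term in terms:
        inverted.setdefault(term, []).append(doc_id)

# Doc values: document --> terms (good for sorting and aggregations).
doc_values = {}
for term, doc_ids in inverted.items():
    for doc_id in doc_ids:
        doc_values.setdefault(doc_id, []).append(term)

# A terms aggregation only needs the document --> terms direction.
term_counts = Counter(term for terms in doc_values.values() for term in terms)
print(inverted)
print(doc_values)
print(term_counts.most_common())

# Mapping for a field that will never be sorted, aggregated or scripted on.
mapping = {"properties": {"tags": {"type": "keyword", "doc_values": False}}}
```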
Norms are normalization factors used for relevance scoring
Often we don't just want to filter results, but also rank them
Norms can be disabled to save disk space
This is useful for fields that won't be used for relevance scoring
The fields can still be used for aggregations
Setting index to false disables indexing for a field
Values are still stored within _source
Saves disk space and improves indexing throughput
Often used for time series data
Fields with indexing disabled can still be used for aggregations
NULL values cannot be indexed or searched
Use the null_value parameter to replace NULL values with another value
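A sketch of a mapping that uses the norms, index and null_value parameters described above, plus a tiny helper that imitates the null_value substitution. All field names and the indexed_value helper are hypothetical.

```python
import json

mapping = {"properties": {
    # Never used for relevance scoring, so norms are switched off.
    "description": {"type": "text", "norms": False},
    # Only aggregated on, never searched or filtered.
    "response_time_ms": {"type": "integer", "index": False},
    # Explicit nulls are indexed as the string "NULL" instead.
    "status": {"type": "keyword", "null_value": "NULL"},
}}

def indexed_value(field, value):
    # Imitates null_value: only the indexed value changes, not _source.
    replacement = mapping["properties"][field].get("null_value")
    return replacement if value is None and replacement is not None else value

print(json.dumps(mapping, indent=2))
print(indexed_value("status", None))
print(indexed_value("status", "active"))
```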
The copy_to parameter is used to copy multiple field values into a "group field"
Simply specify the name of the target field as the value
Example: first_name + last_name --> full_name
Values are copied, not terms / tokens
The analyzer of the target field is used for the values
The target field is not a part of _source
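A sketch of the first_name + last_name --> full_name example as a mapping, with a small simulation of which values end up in the target field. The sample document is made up.

```python
import json

mapping = {"properties": {
    "first_name": {"type": "text", "copy_to": "full_name"},
    "last_name": {"type": "text", "copy_to": "full_name"},
    "full_name": {"type": "text"},
}}

document = {"first_name": "John", "last_name": "Smith"}

# Values (not tokens) of every source field are handed to the target field.
copied_values = [
    document[field]
    for field, config in mapping["properties"].items()
    if config.get("copy_to") == "full_name" and field in document
]

print(json.dumps(mapping, indent=2))
print("values analyzed for full_name:", copied_values)
print("full_name in _source:", "full_name" in document)
```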
Limitations:
Generally, ES field mappings cannot be changed
We can add new field mappings, but not modify existing ones
Only a few mapping parameters can be updated for existing mappings, e.g. ignore_above
The solution is to reindex documents to a new index ...
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
First, PUT the new mapping to a new index
Then:
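A minimal sketch of the Reindex API request body. The index names reviews and reviews_new are hypothetical.

```python
import json

reindex_request = {
    "source": {"index": "reviews"},
    "dest": {"index": "reviews_new"},
}

print("POST /_reindex")
print(json.dumps(reindex_request, indent=2))
```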
_source data types:
The data type in _source doesn't reflect how the values are indexed
_source contains the field values supplied at index time
It's common to use _source values from search results
We can modify _source values while reindexing
Alternatively this can be done at the application level
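As a sketch of modifying _source while reindexing, the request below attaches a Painless script that converts a field to a string. The index names, the product_id field and the script text are assumptions made for illustration; the second half shows the application-level alternative in plain Python.

```python
import json

reindex_with_script = {
    "source": {"index": "reviews"},
    "dest": {"index": "reviews_new"},
    "script": {
        # Hypothetical Painless script: stringify product_id when present.
        "source": (
            "if (ctx._source.product_id != null) {"
            " ctx._source.product_id = ctx._source.product_id.toString();"
            " }"
        ),
    },
}
print("POST /_reindex")
print(json.dumps(reindex_with_script, indent=2))

def stringify_product_id(source):
    # Application-level alternative: fix the value before indexing.
    fixed = dict(source)
    if fixed.get("product_id") is not None:
        fixed["product_id"] = str(fixed["product_id"])
    return fixed

print(stringify_product_id({"product_id": 123, "rating": 4.5}))
```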
Using query is always preferred
For example, we want to query ip by using ip_address:
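One way to get this behaviour is a field alias, assuming that is what this passage refers to. The sketch adds an alias field ip_address that points at the existing ip field; the index name access_logs and the sample IP are hypothetical.

```python
import json

# PUT /access_logs/_mapping
alias_mapping = {
    "properties": {
        "ip_address": {"type": "alias", "path": "ip"},
    }
}

# GET /access_logs/_search  (query the alias as if it were the real field)
search_request = {"query": {"term": {"ip_address": "192.168.0.1"}}}

print(json.dumps(alias_mapping, indent=2))
print(json.dumps(search_request, indent=2))
```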
Example:
Document:
So you could use ingredients for aggregations:
As well as exact match search:
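A sketch of how this could look with a multi-field mapping: ingredients is mapped as text with an extra keyword sub-field. The sub-field name keyword, the sample document and both requests are assumptions for illustration.

```python
import json

mapping = {"properties": {
    "ingredients": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword"}},
    },
}}

document = {"ingredients": ["Spaghetti", "Bacon", "Eggs"]}

# Aggregate on the keyword sub-field.
aggregation = {
    "size": 0,
    "aggs": {"ingredients": {"terms": {"field": "ingredients.keyword"}}},
}

# Exact match on the keyword sub-field.
exact_match = {"query": {"term": {"ingredients.keyword": "Spaghetti"}}}

for body in (mapping, document, aggregation, exact_match):
    print(json.dumps(body, indent=2))
```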
Priorities of index templates:
A new index may match multiple index templates
An order parameter can be used to define the priority of an index template
The value is simply an integer
Templates with lower values are merged first
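The merge order can be simulated as follows. The two templates, their patterns and the merge_templates helper are hypothetical, and the merge here is a shallow one over flat settings; it only illustrates that a template with a higher order value is applied later and therefore wins on conflicts.

```python
from fnmatch import fnmatchcase

templates = [
    {"index_patterns": ["access-logs-*"], "order": 1,
     "settings": {"number_of_replicas": 2}},
    {"index_patterns": ["*"], "order": 0,
     "settings": {"number_of_shards": 1, "number_of_replicas": 1}},
]

def merge_templates(all_templates, index_name):
    matching = [
        template for template in all_templates
        if any(fnmatchcase(index_name, pattern)
               for pattern in template["index_patterns"])
    ]
    merged = {}
    # Lower order values are merged first, so higher ones override them.
    for template in sorted(matching, key=lambda t: t.get("order", 0)):
        merged.update(template["settings"])
    return merged

print(merge_templates(templates, "access-logs-2020-07"))
print(merge_templates(templates, "products"))
```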
Dynamic mapping is enabled by default
It creates some overhead
It can be disabled by setting dynamic to false
In this way, you will not be able to search on a field that is not mapped.
However, documents with unmapped fields will still be indexed without an issue.
Another setting is strict
Documents with unmapped fields will be rejected.
It is also possible to enable dynamic mapping for a single field only, for example other in the following example:
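A sketch of such a mapping: dynamic is strict for the index as a whole, while the object field other overrides it. The index name people and the name field are hypothetical.

```python
import json

# PUT /people
index_body = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "name": {"type": "text"},
            # New fields are still mapped automatically inside "other".
            "other": {"type": "object", "dynamic": True},
        },
    }
}
print(json.dumps(index_body, indent=2))
```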
We can use dynamic_date_formats to parse non-standard date:
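A sketch of an index body that sets dynamic_date_formats, with the matching Python parse shown for comparison. The dd/MM/yyyy pattern and the sample value are chosen for illustration.

```python
import json
from datetime import date, datetime

# With this setting, a new string field holding "07/07/2020" would be
# detected as a date instead of being mapped as text.
index_body = {"mappings": {"dynamic_date_formats": ["dd/MM/yyyy"]}}
print(json.dumps(index_body, indent=2))

# The same pattern expressed with Python's strptime directives.
parsed = datetime.strptime("07/07/2020", "%d/%m/%Y").date()
print(parsed)
```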
match / unmatch parameters
Used to specify conditions for field names
Field names must match the condition specified by the match parameter
unmatch is used to exclude fields that were matched by the match parameter
Both parameters support patterns with wildcards *
Hard-coding field names wouldn't make any sense
Example:
Regular expressions are supported as well:
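The match / unmatch logic can be simulated in Python. The field_matches helper and the template names are hypothetical; fnmatchcase is a stand-in for the simple wildcard matching (it also understands ? and [], which goes beyond a plain * wildcard), and setting match_pattern to regex switches the template to regular expressions.

```python
import re
from fnmatch import fnmatchcase

def field_matches(name, match=None, unmatch=None, match_pattern="simple"):
    def test(pattern):
        if match_pattern == "regex":
            return re.fullmatch(pattern, name) is not None
        return fnmatchcase(name, pattern)
    if match is not None and not test(match):
        return False
    if unmatch is not None and test(unmatch):
        return False
    return True

dynamic_templates = [
    {"strings_only_text": {
        "match_mapping_type": "string",
        "match": "text_*",
        "unmatch": "*_keyword",
        "mapping": {"type": "text"},
    }},
    {"profit_fields": {
        "match_pattern": "regex",
        "match": r"^profit_\d+$",
        "mapping": {"type": "long"},
    }},
]

for field in ("text_title", "text_id_keyword", "name"):
    print(field, field_matches(field, match="text_*", unmatch="*_keyword"))
print("profit_2020",
      field_matches("profit_2020", match=r"^profit_\d+$", match_pattern="regex"))
```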
path_match and path_unmatch
These parameters evaluate the full field path (i.e. not just the field names)
This is the dot notation (e.g. name.first_name)
Wildcards are supported
Example:
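A sketch of a dynamic template that uses path_match, together with a quick check of which full field paths a pattern would select. The employer.name.* pattern, the field paths and the template name are hypothetical, and fnmatchcase again stands in for the wildcard matching.

```python
import json
from fnmatch import fnmatchcase

dynamic_template = {
    "copy_to_full_name": {
        "match_mapping_type": "string",
        "path_match": "employer.name.*",
        "mapping": {"type": "text", "copy_to": "full_name"},
    }
}
print(json.dumps(dynamic_template, indent=2))

pattern = dynamic_template["copy_to_full_name"]["path_match"]
paths = ["employer.name.first_name", "employer.name.last_name", "employer.position"]
selected = [path for path in paths if fnmatchcase(path, pattern)]
print(selected)
```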
Dynamic mapping is convenient, but often not a good idea in production
Save disk space with static mapping!
Set dynamic to strict if possible
Avoid unexpected results
Don't map strings as both text and keyword unless you need both
text: full-text searches
keyword: aggregations / sorting / filtering on exact values
Disable coercion
Use appropriate numeric data types
For whole numbers, integer might be enough; long will use more disk space
float might be precise enough; double will use 2x disk space
Mapping parameters
Set doc_values to false if you do not need sorting / aggregations / scripting
Set norms to false if you do not need relevance scoring
Set index to false if you do not need to filter on values
Can still do aggregations (e.g. time-series)
Not worth the effort when storing lots of documents
Stem: Reduce a word to its base form
Stop: Do not index common words that carry little meaning, like "on" / "at" ...