Mapping

Text Analysis

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

Applicable to text fields/values
Text values are analyzed when indexing documents
The result is stored in data structures that are efficient for searching / etc.
The _source object is not used when searching for documents
- It contains the exact values specified when indexing a document

Character Filters

Add / Remove / Change characters
An analyzer contains zero or more character filters
Character filters are applied in the order in which they are specified
Example: html_strip filter
- Input: I'm in a <em>good</em> mood - and I <strong>love</strong> it!
- Output: I'm in a good mood - and I love it!

Tokenizers

An analyzer contains one tokenizer
Tokenizes a string (i.e. splits it into tokens)
Characters may be stripped as part of the tokenization
Example:
- Input: Oh my god!
- Output: ["Oh", "my", "god"]

Token filters

Receive the output of the tokenizer as input (i.e. the tokens)
A token filter can add, remove, or modify tokens
An analyzer contains zero or more token filters
Token filters are applied in the order in which they are specified
Example: lowercase filter
- Input: ["I", "REALLY", "like", "beer"]
- Output: ["i", "really", "like", "beer"]

By default (the standard analyzer), there are no character filters. The input is tokenized by the standard tokenizer, and the tokens are then passed through the lowercase token filter.

To test:

POST /_analyze
{
  "text": "2 guys walk into    a bar, but the third... DUCKS! :-)",
  "analyzer": "standard"
}

To specify the character filters, tokenizer and token filters explicitly (equivalent to the standard analyzer):

POST /_analyze
{
  "text": "2 guys walk into    a bar, but the third... DUCKS! :-)",
  "char_filter": [],
  "tokenizer": "standard",
  "filter": ["lowercase"]
}
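
The pipeline described above (character filters, then one tokenizer, then token filters) can be sketched in plain Python. This is only a rough model, not how Elasticsearch implements it: html_strip, simple_tokenizer and lowercase below are simplified stand-ins written for this note, and the regex tokenizer does not follow the real standard tokenizer's rules.

```python
import re

def html_strip(text):
    # Character filter (simplified): drop anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", text)

def simple_tokenizer(text):
    # Tokenizer (simplified): keep runs of word characters, dropping punctuation.
    return re.findall(r"\w+", text)

def lowercase(tokens):
    # Token filter: modify every token.
    return [token.lower() for token in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:    # zero or more, applied in order
        text = char_filter(text)
    tokens = tokenizer(text)            # exactly one tokenizer
    for token_filter in token_filters:  # zero or more, applied in order
        tokens = token_filter(tokens)
    return tokens

print(analyze("Oh my <em>GOD</em>!", [html_strip], simple_tokenizer, [lowercase]))
# prints ['oh', 'my', 'god']
```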

Inverted Indices

One inverted index per text field
Terms are sorted alphabetically for performance reasons
Created and maintained by Apache Lucene!
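
As a rough illustration of the idea (not of Lucene's actual on-disk format), an inverted index maps each term to the documents that contain it. The sketch below assumes a trivial analyzer (lowercase, split on whitespace); build_inverted_index is a name made up for this note.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [doc ids]} with terms in sorted order."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # stand-in for a real analyzer
            postings[term].add(doc_id)
    # Terms are kept in sorted order, as in the notes above.
    return {term: sorted(postings[term]) for term in sorted(postings)}

docs = {1: "2 guys walk into a bar", 2: "a duck walks into a bar"}
index = build_inverted_index(docs)
print(index["bar"])   # prints [1, 2]
print(index["duck"])  # prints [2]
```

Searching for a term is then a lookup by term (term --> documents), which is the access pattern this structure is good at.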

Mapping

Define the structure of documents (fields and datatypes)
- Also used to configure how a field is indexed
Similar to database schema
Two types of mapping:
- Explicit mapping: Define field mappings ourselves
- Dynamic mapping: ES generates field mappings for us

Data types

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

object
text
float
boolean
double
short
long
integer
date
...

Object

Used for any JSON object
Objects may be nested

Mapped using the properties parameter
Objects are not stored as objects in Apache Lucene but are transformed to ensure that we can index any valid JSON

nested

Similar to object data type but maintains object relationships
- Good for indexing arrays of objects

nested objects are stored as hidden documents
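
To see why nested matters for arrays of objects, here is a small model of the difference. It assumes that, for the object type, the values of each sub-field are pooled together per field path, so the pairing between values of one object is lost. The reviews data and the function names are made up for illustration.

```python
def flatten_object_array(field, objects):
    """Model of the object type: pool the values of each sub-field together."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat

def matches_flattened(flat, conditions):
    # Each condition is checked against the pooled values independently.
    return all(value in flat.get(path, []) for path, value in conditions.items())

def matches_nested(field, objects, conditions):
    # Model of the nested type: all conditions must hold within ONE object.
    prefix = field + "."
    return any(
        all(obj.get(path[len(prefix):]) == value for path, value in conditions.items())
        for obj in objects
    )

reviews = [{"author": "John", "rating": 5}, {"author": "Alice", "rating": 2}]
flat = flatten_object_array("reviews", reviews)
query = {"reviews.author": "John", "reviews.rating": 2}

print(matches_flattened(flat, query))            # prints True (false positive)
print(matches_nested("reviews", reviews, query)) # prints False
```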

keyword

Used for exact matching of values
Good for filtering, aggregations and sorting
For full-text searches, use the text data type instead
keyword fields are analyzed with the keyword analyzer
- No-op analyzer
- Output unmodified string as a single token

POST /_analyze
{
    "text": "Hello! I am using Whatsapp.",
    "analyzer": "keyword"
}

Coercion

Suppose we index the following documents into a new index:

POST /coercion_test/_bulk
{ "create": { "_id": 1 } }
{ "price": 7.4 }
{ "create": { "_id": 2 } }
{ "price": "7.4" }
{ "create": { "_id": 3 } }
{ "price": "7.4m" }

The first two documents will be indexed.
- Behind the scenes, ES maps price to float when it indexes the first document.
- When indexing the second document, ES indexes "7.4" as a float instead of a string, thanks to coercion.
- However, _source still shows "price": "7.4", because _source holds the original values and does not reveal how a field is indexed.
The third document will be rejected, because "7.4m" cannot be coerced to a number.
Coercion is enabled by default but it could be disabled.
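
The behaviour of the three documents above can be mimicked with a small Python model. This is a sketch under simplifying assumptions: Python's float() is more lenient than Elasticsearch's parser, and coerce_float and index_document are hypothetical helpers, not part of any client library.

```python
def coerce_float(value, coerce=True):
    """Return the float that would be indexed, or raise ValueError."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str) and coerce:
        return float(value)  # raises ValueError for "7.4m"
    raise ValueError(f"cannot index {value!r} into a float field")

def index_document(source, coerce=True):
    indexed = {"price": coerce_float(source["price"], coerce)}
    # _source keeps the original value; only the indexed value is coerced.
    return {"_source": source, "indexed": indexed}

result = index_document({"price": "7.4"})
print(result["_source"]["price"])  # prints 7.4 (still the string "7.4")
print(result["indexed"]["price"])  # prints 7.4 (now a float)
```

Calling index_document({"price": "7.4m"}) raises ValueError, and so does index_document({"price": "7.4"}, coerce=False), which mirrors a rejected document.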

Explicit Mappings (Static Mappings)

To create a mapping for an index:

PUT /<index>
{
    "mappings": {
        "properties": {
            "f1": { "type": "<type>" },
            "f2": { "type": "<type>" },
            "<nested field>": {
                "properties": {
                    "nf1": { "type": "<type>" },
                    "nf2": { "type": "<type>" }
                }
            }
        }
    }
}

Dot notation can also be used for the fields of an object:

PUT /<index>
{
    "mappings": {
        "properties": {
            "f1": { "type": "<type>" },
            "f2": { "type": "<type>" },
            "<nested field>.<nf1>": { "type": "<type>" },
            "<nested field>.<nf2>": { "type": "<type>" }
        }
    }
}

To retrieve a mapping for an index:

GET /<index>/_mappings

To add a field mapping to an existing index:

PUT /<index>/_mapping
{
    "properties": {
        "<new_field>": { "type": "<type>" }
    }
}

date

Specified in one of three ways:
- Specially formatted strings
- Milliseconds since the epoch, i.e. 1 January 1970 (long)
- Seconds since the epoch (integer)
Custom format is also supported
By default:
- 3 supported formats:
  - A date without time
  - A date with time
  - MS since the epoch (long)
    Remember to multiply a UNIX timestamp (seconds) by 1000
- UTC timezone assumed if none is specified
- Dates must be formatted according to the ISO 8601 spec
Example:
- "2020-07-07" --> ES will assume it is 00:00 UTC
- "2020-07-07T10:08:02Z" --> UTC
- "2020-07-07T11:08:02+01:00" --> UTC+1
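
To make the three examples concrete, the sketch below converts each default form to milliseconds since the epoch, assuming dates are stored internally as a UTC millisecond value. It uses only Python's standard library; to_epoch_millis is a hypothetical helper, and Python's ISO parser is not identical to Elasticsearch's strict_date_optional_time.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(value):
    if isinstance(value, int):
        return value  # already milliseconds since the epoch
    # Older Python versions do not accept a trailing "Z", so spell out the offset.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # UTC assumed if none given
    return (parsed - EPOCH) // timedelta(milliseconds=1)

print(to_epoch_millis("2020-07-07"))                 # prints 1594080000000
print(to_epoch_millis("2020-07-07T10:08:02Z"))       # prints 1594116482000
print(to_epoch_millis("2020-07-07T11:08:02+01:00"))  # prints 1594116482000

unix_seconds = 1594116482
print(to_epoch_millis(unix_seconds * 1000))          # prints 1594116482000
```

The second and third strings denote the same instant, which is why they produce the same value.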

Mapping Parameters

format

Customize the format for date
Recommended to use the default format if possible
- "strict_date_optional_time||epoch_millis"
Custom patterns use Java's DateTimeFormatter syntax (e.g. dd/MM/yyyy)
Built-in formats such as epoch_second can also be used

properties

Defines nested fields for object and nested fields

coerce

Used to enable / disable coercion of value (enabled by default)
To disable in field level: "field": { "type": "<type>", "coerce": false }
To disable in index level:

PUT /<index>
{
    "settings": { "index.mapping.coerce": false },
    "mappings": {
        "properties": {
            ...
        }
    }
}

doc_values

ES makes use of several data structures
- No single data structure serves all purposes
  - Inverted indices are excellent for searching text, but perform poorly for many other data access patterns
Doc values is another data structure used by Apache Lucene
- Optimized for a different data access pattern (document --> terms)
doc_values is essentially an uninverted inverted index used for sorting / aggregations and scripting
- It is an additional data structure but not a replacement
We can disable doc_values to save disk space by setting it to false
- If you won't use the field for aggregations / sorting / scripting, you can disable doc_values
- However, this cannot be changed afterwards without reindexing the documents into a new index

PUT /<index>
{
    "mappings": {
        "properties": {
            "<f1>": { "type": "keyword", "doc_values": false }
        }
    }
}
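
The document --> terms access pattern can be illustrated with a plain dictionary keyed by document id. This is a conceptual sketch only (Lucene's doc values are a columnar on-disk structure); the docs data and the helper names are made up.

```python
def build_doc_values(docs, field):
    """Column of values addressed by document id (document --> value)."""
    return {doc_id: doc[field] for doc_id, doc in docs.items()}

def sort_by(doc_values, doc_ids):
    # Sorting needs the value of each matching document, looked up by id.
    return sorted(doc_ids, key=lambda doc_id: doc_values[doc_id])

def terms_aggregation(doc_values, doc_ids):
    # Aggregating also walks documents and reads their values.
    counts = {}
    for doc_id in doc_ids:
        counts[doc_values[doc_id]] = counts.get(doc_values[doc_id], 0) + 1
    return counts

docs = {1: {"tag": "b"}, 2: {"tag": "a"}, 3: {"tag": "b"}}
tag_values = build_doc_values(docs, "tag")
print(sort_by(tag_values, [1, 2, 3]))            # prints [2, 1, 3]
print(terms_aggregation(tag_values, [1, 2, 3]))  # prints {'b': 2, 'a': 1}
```

With only an inverted index (term --> documents), both operations would require scanning every term to find the ones that belong to a given document.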

norms

Normalization factors used for relevance scoring
Often we don't just want to filter results, but also rank them
Norms can be disabled to save disk space
- Useful for fields that won't be used for relevance scoring
- The fields can still be used for aggregation

PUT /<index>
{
    "mappings": {
        "properties": {
            "<f1>": { "type": "text", "norms": false }
        }
    }
}

index

Disable indexing for a field
Values are still stored within _source
Saves disk space and improves indexing throughput
Often used for time series data
Fields with indexing disabled can still be used for aggregations

null_value

NULL values cannot be indexed or searched
Use this parameter to replace NULL values with another value

copy_to

Used to copy multiple field values into a "group field"
Simply specify the name of the target field as the value
- first_name + last_name --> full_name
Values are copied but not terms / tokens
- The analyzer of the target field is used for the values
The target field is not a part of _source
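
The points above (values are copied rather than tokens, the target's analyzer is applied, and _source is left untouched) can be modelled as follows. Everything here is illustrative: the analyzers are simple stand-ins and index_with_copy_to is a hypothetical helper.

```python
import re

def keyword_like(value):
    # Stand-in for the keyword analyzer: one unmodified token.
    return [value]

def standard_like(value):
    # Stand-in for the standard analyzer: lowercase word tokens.
    return re.findall(r"\w+", value.lower())

def index_with_copy_to(source, copy_to, analyzers):
    """copy_to: {field: target field}; analyzers: {field: callable}."""
    indexed = {}
    for field, value in source.items():
        indexed.setdefault(field, []).extend(analyzers[field](value))
        target = copy_to.get(field)
        if target is not None:
            # The raw value is copied, then analyzed with the TARGET's analyzer.
            indexed.setdefault(target, []).extend(analyzers[target](value))
    return {"_source": dict(source), "indexed": indexed}

result = index_with_copy_to(
    {"first_name": "John", "last_name": "Doe"},
    copy_to={"first_name": "full_name", "last_name": "full_name"},
    analyzers={"first_name": keyword_like, "last_name": keyword_like,
               "full_name": standard_like},
)
print(result["indexed"]["full_name"])    # prints ['john', 'doe']
print("full_name" in result["_source"])  # prints False
```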

Updating field mappings

Limitations:

Generally, existing ES field mappings cannot be changed
We can add new field mappings, but not modify existing ones
Only a few mapping parameters can be updated for existing mappings
- e.g. ignore_above
The solution is to reindex the documents into a new index ...

Reindex documents with the reindex API

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

First, create a new index with the new mappings:

PUT /<new_idx>
{
    "mappings": {
        "properties": {
            "f1": { "type": "<type>" }
            ...
        }    
    }
}

Then:

POST /_reindex
{
    "source": { "index": "<old_idx>" },
    "dest": { "index": "<new_idx>" }
}

_source data types:

The data type doesn't reflect how the values are indexed
_source contains the field values supplied at index time
It's common to use _source values from search results
We can modify _source value while reindexing
Alternatively this can be done at the application level

POST /_reindex
{
    "source": { "index": "<old_idx>" },
    "dest": { "index": "<new_idx>" },
    "script": { 
        "source": """
            if (ctx._source.product_id != null) {
                ctx._source.product_id = ctx._source.product_id.toString();
            }
        """
    }
}

Reindex documents matching a query

POST /_reindex
{
    "source": { 
        "index": "<old_idx>",
        "query": { "match_all": { } }
    },
    "dest": { "index": "<new_idx>" }
}

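When only some documents should be reindexed, selecting them with a query is preferred over filtering them out in a script, since the query avoids processing documents that are not needed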

Reindex only selected fields

POST /_reindex
{
    "source": { 
        "index": "<old_idx>",
        "_source": [ "f1", "f2", ... ]
    },
    "dest": { "index": "<new_idx>" }
}

Change a field name

POST /_reindex
{
    "source": { "index": "<old_idx>" },
    "dest": { "index": "<new_idx>" },
    "script": {
        "source": """
        // Rename <field> to <new_field>
        ctx._source.<new_field> = ctx._source.remove("<field>");
        """
    }
}

Conditional

POST /_reindex
{
    "source": { "index": "<old_idx>" },
    "dest": { "index": "<new_idx>" },
    "script": {
        "source": """
        if (ctx._source.rating < 4.0) {
            ctx.op = "noop"; // skip documents with rating < 4.0
        }
        """
    }
}
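
What the scripts above do to each document can be pictured with a small Python model of the reindex loop. This is a sketch of the semantics only, not of the Reindex API itself: reindex and rename_and_filter are made-up names, and the content / comment fields are hypothetical.

```python
def reindex(source_docs, script=None):
    dest = []
    for doc in source_docs:
        # Each document gets its own ctx, with a copy of _source and an op.
        ctx = {"_source": dict(doc), "op": "index"}
        if script is not None:
            script(ctx)
        if ctx["op"] == "noop":
            continue  # the document is not written to the destination
        dest.append(ctx["_source"])
    return dest

def rename_and_filter(ctx):
    source = ctx["_source"]
    if source["rating"] < 4.0:
        ctx["op"] = "noop"
        return
    # Rename content to comment: remove the old key and reuse its value.
    source["comment"] = source.pop("content")

old_index = [{"rating": 4.5, "content": "good"}, {"rating": 3.0, "content": "meh"}]
print(reindex(old_index, rename_and_filter))
# prints [{'rating': 4.5, 'comment': 'good'}]
```

The source documents are left unchanged; only the copies written to the destination are modified.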

Field aliases

For example, to query the ip field under the name ip_address:

PUT /traffic/_mapping
{
    "properties": {
        "ip_address": { "type": "alias", "path": "ip" }
    }
}

Multi-field mappings

Example:

PUT /multi
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "ingredients": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }  
  }
}

Document:

POST /multi/_doc
{
  "description": "To make blah",
  "ingredients": ["A", "B", "C"]
}

The ingredients.keyword field can then be used for exact matching (as well as aggregations and sorting):

GET /multi/_search
{
    "query": {
        "term": {
            "ingredients.keyword": "B"
        }
    }
}

While the ingredients field is used for full-text search:

GET /multi/_search
{
    "query": {
        "match": {
            "ingredients": "B"
        }
    }
}

Index Template

PUT /_template/<template_name>
{
    "index_patterns": [ "index-*" ],
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            ...
        }
    }
}

Priorities of index templates:

A new index may match multiple index templates
An order parameter can be used to define the priority of index template
- The value is simply an integer
- Templates with lower values are merged first
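
The effect of order can be sketched as a merge in which templates are applied from the lowest order to the highest, so that values from higher orders override earlier ones. The sketch does a shallow merge of settings only, which is a simplification, and merge_templates is a hypothetical helper.

```python
def merge_templates(templates):
    """templates: list of {"order": int, "settings": {...}} matching one index."""
    merged = {}
    for template in sorted(templates, key=lambda t: t["order"]):
        merged.update(template["settings"])  # later (higher order) values win
    return merged

templates = [
    {"order": 1, "settings": {"number_of_shards": 2}},
    {"order": 0, "settings": {"number_of_shards": 1, "number_of_replicas": 0}},
]
print(merge_templates(templates))
# prints {'number_of_shards': 2, 'number_of_replicas': 0}
```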

Dynamic Mapping

Enabled by default
Comes with some overhead
Can be disabled

PUT /<idx>
{
    "mappings": {
        "dynamic": false,
        "properties": {
            "f1": { "type": "<type>" }
        }
    }
}

With this setting, fields that are not defined in the mapping are not indexed, so we cannot search on them.
However, documents containing undefined fields are still accepted, and the values remain in _source.
Another setting is strict

PUT /<idx>
{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "f1": { "type": "<type>" }
        }
    }
}

Documents containing undefined fields will be rejected.

It is also possible to set dynamic per object field. In the following example, dynamic mapping is enabled only for the other field:

PUT /<idx>
{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "f1": { "type": "<type>" },
            "other": {
                "dynamic": true,
                "properties": {
                    ...
                }
            }
        }
    }
}
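
The three values of dynamic (true, false and "strict") can be compared with a small model. This is a simplified sketch that only tracks top-level field names; index_document is a hypothetical helper, and the mapping is just a set of mapped field names.

```python
def index_document(mapping, dynamic, doc):
    """mapping: set of mapped field names (extended when dynamic is True)."""
    searchable = {}
    for field, value in doc.items():
        if field not in mapping:
            if dynamic == "strict":
                raise ValueError(f"field [{field}] is not defined in the mapping")
            if dynamic is False:
                continue        # kept in _source only, not indexed
            mapping.add(field)  # dynamic: true adds a mapping for the field
        searchable[field] = value
    return {"_source": dict(doc), "searchable": searchable}

mapping = {"f1"}
result = index_document(mapping, False, {"f1": 1, "extra": 2})
print(result["searchable"])  # prints {'f1': 1}
print(result["_source"])     # prints {'f1': 1, 'extra': 2}
```

With dynamic set to "strict", the same call raises ValueError, which stands in for the rejected document.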

Date Detection

We can use dynamic_date_formats to detect dates in non-standard formats:

PUT /<idx>
{
    "mappings": {
        "dynamic_date_formats": [ "<format>" ]
    }
}

Dynamic Templates

PUT /<index>
{
    "mappings": {
        "dynamic_templates": [
            {
                "integers": {
                    "match_mapping_type": "long",
                    "mapping": { "type": "integer" }
                }
            }
        ]
    }
}

match / unmatch parameters

Used to specify conditions for field names
Field names must match the condition specified by the match parameter
unmatch is used to exclude fields that were matched by the match parameter
Both parameters support patterns with wildcards *
- Hard-coding field names wouldn't make any sense
Example:

PUT /<idx>
{
    "mappings": {
        "dynamic_templates": [
            {
                "strings_only_text": {
                    "match_mapping_type": "string",
                    "match": "text_*",
                    "unmatch": "*_keyword",
                    "mapping": { "type": "text" }
                }
            },
            {
                "strings_only_keyword": {
                    "match_mapping_type": "string",
                    "match": "*_keyword",
                    "mapping": { "type": "keyword" }
                }
            }
        ]
    }
}
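
How match and unmatch select a template for a new string field can be sketched with Python's fnmatch module. Assumptions: templates are tried in order and the first one that applies is used, every field is detected as a string, and fnmatch wildcards are close enough to the simple * patterns used here. pick_template is a hypothetical helper.

```python
from fnmatch import fnmatchcase

TEMPLATES = [
    ("strings_only_text",
     {"match": "text_*", "unmatch": "*_keyword", "mapping": {"type": "text"}}),
    ("strings_only_keyword",
     {"match": "*_keyword", "mapping": {"type": "keyword"}}),
]

def pick_template(templates, field_name):
    for name, rule in templates:
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue  # field name does not satisfy match
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue  # field name is excluded by unmatch
        return name, rule["mapping"]
    return None  # no template applies

print(pick_template(TEMPLATES, "text_title"))
# prints ('strings_only_text', {'type': 'text'})
print(pick_template(TEMPLATES, "text_title_keyword"))
# prints ('strings_only_keyword', {'type': 'keyword'})
```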

Regular expressions are supported as well, by setting match_pattern to regex:

PUT /<idx>
{
    "mappings": {
        "dynamic_templates": [
            {
                "name": {
                    "match_mapping_type": "string",
                    "match_pattern": "regex",
                    "match": "^[a-zA-Z]+_name$",
                    "mapping": {
                        "type": "text"
                    }
                }
            }
        ]
    }
}

path_match and path_unmatch

These parameters evaluate the full field path (i.e. not just the field names)
This is the dot notation (e.g. name.first_name)
Wildcards are supported
Example:

PUT /<idx>
{
    "mappings": {
        "dynamic_templates": [
            {
                "name": {
                    "match_mapping_type": "string",
                    "path_match": "employer.name.*",
                    "mapping": {
                        "type": "text",
                        "copy_to": "full_name"
                    }
                }
            }
        ]
    }
}

Mapping Recommendations

Dynamic mapping is convenient, but often not a good idea in production
Save disk space with static mapping!
Set dynamic to strict if possible
1. Avoid unexpected results
Don't map strings as both text and keyword unless both are needed
1. text: full-text searches
2. keyword: aggregations / sorting / filtering on exact values
Disable coercion
Use appropriate numeric data types
1. For whole numbers, integer might be enough. long will use more disk space
2. float might be precise enough. double will use 2x disk space.
Mapping parameters
1. Set doc_values to false if you do not need sorting/aggregations/scripting
2. Set norms to false if you do not need relevance scoring
3. Set index to false if you do not need to filter on values
  1. Can still do aggregations (e.g. time-series)
4. Mostly worth the effort when storing lots of documents

Stem / Stop

Stemming: reduce a word to its root form (e.g. "walked" --> "walk")
Stop words: common words that carry little meaning, like "on" / "at" ..., are not indexed
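
Both ideas can be written as token filters. The sketch below is deliberately naive: the stop word list is a tiny made-up sample, and naive_stem just strips a few suffixes, which is far cruder than the stemming algorithms real analyzers use.

```python
STOP_WORDS = {"a", "an", "at", "on", "the"}  # tiny illustrative list

def remove_stop_words(tokens):
    # Stop filter: drop common words that carry little meaning.
    return [token for token in tokens if token not in STOP_WORDS]

def naive_stem(token):
    # Very rough stemmer: strip one known suffix if enough of the word remains.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["the", "ducks", "walked", "on", "water"]
print([naive_stem(token) for token in remove_stop_words(tokens)])
# prints ['duck', 'walk', 'water']
```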

