Skip to main content
Skip to main content
Edit this page

Dictionary

A dictionary in ClickHouse provides an in-memory key-value representation of data from various internal and external sources, optimizing for super-low latency lookup queries.

Dictionaries are useful for:

  • Improving the performance of queries, especially when used with JOINs
  • Enriching ingested data on the fly without slowing down the ingestion process

Uses cases for Dictionary in ClickHouse

Speeding up joins using a Dictionary

Dictionaries can be used to speed up a specific type of JOIN: the LEFT ANY type where the join key needs to match the key attribute of the underlying key-value storage.

Using Dictionary with LEFT ANY JOIN

If this is the case, ClickHouse can exploit the dictionary to perform a Direct Join. This is ClickHouse's fastest join algorithm and is applicable when the underlying table engine for the right-hand side table supports low-latency key-value requests. ClickHouse has three table engines providing this: Join (that is basically a pre-calculated hash table), EmbeddedRocksDB and Dictionary. We will describe the dictionary-based approach, but the mechanics are the same for all three engines.

The direct join algorithm requires that the right table is backed by a dictionary, such that the to-be-joined data from that table is already present in memory in the form of a low-latency key-value data structure.

Example

Using the Stack Overflow dataset, let's answer the question: What is the most controversial post concerning SQL on Hacker News?

We will define controversial as when posts have a similar number of up and down votes. We compute this absolute difference, where a value closer to 0 means more controversy. We'll assume the post must have at least 10 up and down votes - posts which people don't vote on aren't very controversial.

With our data normalized, this query currently requires a JOIN using the posts and votes table:

Use smaller datasets on the right side of JOIN: This query may seem more verbose than is required, with the filtering on PostIds occurring in both the outer and sub queries. This is a performance optimization which ensures the query response time is fast. For optimal performance, always ensure the right side of the JOIN is the smaller set and as small as possible. For tips on optimizing JOIN performance and understanding the algorithms available, we recommend this series of blog articles.

While this query is fast, it relies on us to write the JOIN carefully to achieve good performance. Ideally, we would simply filter the posts to those containing "SQL", before looking at the UpVote and DownVote counts for the subset of blogs to compute our metric.

Applying a dictionary

To demonstrate these concepts, we use a dictionary for our vote data. Since dictionaries are usually held in memory (ssd_cache is the exception), users should be cognizant of the size of the data. Confirming our votes table size:

Data will be stored uncompressed in our dictionary, so we need at least 4GB of memory if we were to store all columns (we won’t) in a dictionary. The dictionary will be replicated across our cluster, so this amount of memory needs to be reserved per node.

In the example below the data for our dictionary originates from a ClickHouse table. While this represents the most common source of dictionaries, a number of sources are supported including files, http and databases including Postgres. As we'll show, dictionaries can be automatically refreshed providing an ideal way to ensure small datasets subject to frequent changes are available for direct joins.

Our dictionary requires a primary key on which lookups will be performed. This is conceptually identical to a transactional database primary key and should be unique. Our above query requires a lookup on the join key - PostId. The dictionary should in turn be populated with the total of the up and down votes per PostId from our votes table. Here's the query to obtain this dictionary data:

To create our dictionary requires the following DDL - note the use of our above query:

In self-managed OSS, the above command needs to be executed on all nodes. In ClickHouse Cloud, the dictionary will automatically be replicated to all nodes. The above was executed on a ClickHouse Cloud node with 64GB of RAM, taking 36s to load.

To confirm the memory consumed by our dictionary:

Retrieving the up and down votes for a specific PostId can be now achieved with a simple dictGet function. Below we retrieve the values for the post 11227902:

Not only is this query much simpler, it's also over twice as fast! This could be optimized further by only loading posts with more than 10 up and down votes into the dictionary and only storing a pre-computed controversial value.

Query time enrichment

Dictionaries can be used to look up values at query time. These values can be returned in results or used in aggregations. Suppose we create a dictionary to map user IDs to their location:

We can use this dictionary to enrich post results:

Similar to our above join example, we can use the same dictionary to efficiently determine where most posts originate from:

Index time enrichment

In the above example, we used a dictionary at query time to remove a join. Dictionaries can also be used to enrich rows at insert time. This is typically appropriate if the enrichment value does not change and exists in an external source which can be used to populate the dictionary. In this case, enriching the row at insert time avoids the query time lookup to the dictionary.

Let's suppose that the Location of a user in Stack Overflow never changes (in reality they do) - specifically the Location column of the users table. Suppose we want to do an analytics query on the posts table by location. This contains a UserId.

A dictionary provides a mapping from user id to location, backed by the users table:

We omit users with an Id < 0, allowing us to use the Hashed dictionary type. Users with Id < 0 are system users.

To exploit this dictionary at insert time for the posts table, we need to modify the schema:

In the above example the Location is declared as a MATERIALIZED column. This means the value can be provided as part of an INSERT query and will always be calculated.

ClickHouse also supports DEFAULT columns (where the value can be inserted or calculated if not provided).

To populate the table we can use the usual INSERT INTO SELECT from S3:

We can now obtain the name of the location from which most posts originate:

Advanced Dictionary Topics

Choosing the Dictionary LAYOUT

The LAYOUT clause controls the internal data structure for the dictionary. A number of options exist and are documented here. Some tips on choosing the correct layout can be found here.

Refreshing dictionaries

We have specified a LIFETIME for the dictionary of MIN 600 MAX 900. LIFETIME is the update interval for the dictionary, with the values here causing a periodic reload at a random interval between 600 and 900s. This random interval is necessary in order to distribute the load on the dictionary source when updating on a large number of servers. During updates, the old version of a dictionary can still be queried, with only the initial load blocking queries. Note that setting (LIFETIME(0)) prevents dictionaries from updating. Dictionaries can be forcibly reloaded using the SYSTEM RELOAD DICTIONARY command.

For database sources such as ClickHouse and Postgres, you can set up a query that will update the dictionaries only if they really changed (the response of the query determines this), rather than at a periodic interval. Further details can be found here.

Other dictionary types

ClickHouse also supports Hierarchical, Polygon and Regular Expression dictionaries.

More reading