Getting Started with Clickhouse

Let me start with my use case. I have a lot of Performance API data, where each page view has many Performance Entries. Just imagine one page view has for example 20 entries and you have 10.000 page views a day, this makes a 200.000 entries, just for one day. Aggregate this over a week (1.4M) or a month (6M entries). Exploring these with MariaDB is possible but I want aggregates, sums, averages, etc. across these values, for that a "Column-oriented database" makes a lot more sense. This brought me to clickhouse. There are alike tools for the job, like Apache Druid and others, I chose clickhouse because it is open source and the setup is much simpler (just one docker container) than it is for druid (at least six) and seems to integrate well with django, which I am using.

Set it up

I just followed the guide, installed the docker container, hooked it up to a docker volume and off we go.
I enter my container like so: docker-compose exec clickhouse bash which enters the container with clickhouse in it and gives me a bash. Just as if I had installed clickhouse on my machine.

A Client?

I looked around if there is some tool like Sequel Pro just for clickhouse, I didn't find one. There is a page though with some UIs for clickhouse. I decided to go pure command line for learning it. Eventually I might try out tabix, even though it looks not as if it is still actively maintained.

Insert 2 Million Rows into my First Table

Get into the clickhouse shell first. Given I am in the docker container I just run clickhouse client and the clickhouse shell opens.

It has pretty nice auto completion and the expected commands like show tables or describe table X work just as expected. So this is real SQL, as promised.

I now follow the "Advanced Tutorial" where I will create a table with 2M rows for exploring some NY taxi data. The table creation worked as described in there. Next I need to insert some data. The query beginning like this:

INSERT INTO trips
  SELECT * FROM s3(
    'https://datasets.../nyc-taxi/trips_{1..2}.gz',
...

crashed my docker container. I tried it a couple of times. Maybe not enough memory available, not sure. I didn't look into the logs and got no helpful error message, it just silently stopped and exited the container. Ok.

Reading this query and reading the tutorial, I learned that two files are loaded into the table. Ok, fine, let's split up the queries and read one file after another into the DB. This reminded me of what I read before, which was I guess an important thing to know about clickhouse:

it is important to understand that each insert into a MergeTree table causes a part (folder) to be created in storage. To minimize parts, bulk insert lots of rows at a time (tens of thousands or even millions at once).

I don't understand what "parts" are yet, but it seems to be more efficient to have less.

To insert the two files we will create two parts by just running two insert statements, one for each file.

INSERT INTO trips
  SELECT * FROM s3(
    'https://datasets.../nyc-taxi/trips_1.gz',
...

and

INSERT INTO trips
  SELECT * FROM s3(
    'https://datasets.../nyc-taxi/trips_2.gz',
...

this works.

Curiosity triggers me to see the row count, as expected using standard SQL:

SELECT count()
FROM trips

   ┌─count()─┐
1. │ 1999657 │ -- 2.00 million
   └─────────┘

1 row in set. Elapsed: 0.004 sec.

Wow, and this just took a couple of seconds, inside a docker container running on Mac, btw. using colima (container runtime on MacOS, and alternative docker runtime).

Explaining Queries

A short detour, if you don't care about this skip this section. My curiosity just triggered me to look what EXPLAIN tells me about the queries, since this is a column-oriented DB it surely looks different, the explanation. Let's see. I tried the simple version first:

EXPLAIN
SELECT count()
FROM trips

   ┌─explain────────────────────────────────────────────────┐
1. │ Expression ((Project names + Projection))              │
2. │   MergingAggregated                                    │
3. │     Expression (Change column names)                   │
4. │       ReadFromPreparedSource (Optimized trivial count) │
   └────────────────────────────────────────────────────────┘

4 rows in set. Elapsed: 0.003 sec.

Ok, I see I don't understand this too well yet. But the auto completion showed that there is something called EXPLAIN QUERY TREE. So let's try this:

EXPLAIN QUERY TREE
SELECT count()
FROM trips

   ┌─explain───────────────────────────────────────────────────────────────────────────────────┐
1. │ QUERY id: 0                                                                               │
2. │   PROJECTION COLUMNS                                                                      │
3. │     count() UInt64                                                                        │
4. │   PROJECTION                                                                              │
5. │     LIST id: 1, nodes: 1                                                                  │
6. │       FUNCTION id: 2, function_name: count, function_type: aggregate, result_type: UInt64 │
7. │   JOIN TREE                                                                               │
8. │     TABLE id: 3, alias: __table1, table_name: default.trips                               │
   └───────────────────────────────────────────────────────────────────────────────────────────┘

8 rows in set. Elapsed: 0.001 sec.

Interesting. Not what I expected, since it does not have any numbers on the complexity of the query, how many rows/columns it had to look up, as MariaDB/MySQL has it. Here is the docs for the EXPLAIN commands, and I saw there are many more, like EXPLAIN PLAN or EXPLAIN ESTIMATE, very interesting. I get back to it later.

I always like to look a bit behind the curtain, even if I do not yet understand all that it means, but I know that this little piece of knowledge will help me to connect the dots later. But if I don't collect dots there is nothing to connect.

Back to going through the tutorial.

First "Complex" Query

In the tutorial the next query is a one that needs the DB to do more work. Find some distinct values by running:

SELECT DISTINCT pickup_ntaname
FROM trips

     ┌─pickup_ntaname───────────────────────────────────┐
  1. │ Midtown-Midtown South                            │
  2. │ Hudson Yards-Chelsea-Flatiron-Union Square       │
...

     ┌─pickup_ntaname────────────────┐
189. │ Grymes Hill-Clifton-Fox Hills │
     └───────────────────────────────┘
     ┌─pickup_ntaname─────────────────────────────┐
190. │ Annadale-Huguenot-Prince's Bay-Eltingville │
     └────────────────────────────────────────────┘

190 rows in set. 
Elapsed: 0.073 sec. 
Processed 2.00 million rows, 60.32 MB (27.52 million rows/s., 830.20 MB/s.)
Peak memory usage: 4.20 MiB.

Notice the detailed info below the query. This is pretty cool. It grouped the data by pickup_ntaname and there are 190 distinct ones it seems.

My curiosity is triggered, I am exploring the data a bit, it seems that pickup_ct2010 is the neighborhood where people got picked up, lets group by this and read the counts and sort it reverse by counts. Let's try out how much of my SQL knowledge works here. Surprise, surprise, all of it.

SELECT
    pickup_ct2010,
    count(*) AS pickups
FROM trips
GROUP BY pickup_ct2010
ORDER BY pickups DESC

   ┌─pickup_ct2010─┬─pickups─┐
1. │ Manhattan     │ 1808857 │
2. │ Queens        │  124920 │
3. │ Brooklyn      │   34988 │
4. │               │   29209 │
5. │ Bronx         │    1507 │
6. │ New Jersey    │     144 │
7. │ Staten Island │      32 │
   └───────────────┴─────────┘

7 rows in set. 
Elapsed: 0.044 sec. 
Processed 2.00 million rows, 35.32 MB (45.90 million rows/s., 810.56 MB/s.)
Peak memory usage: 269.24 KiB.

A peak memory usage of 270kB, this looks impressive. I am not sure a row-oriented DB would have done it that fast and with this little of memory usage. Feels good.

Other "Complex" Queries

For the initial use case I described, the performance entries, I need to evaluate a lot of percentile values to get useful data. This is not that simple in MariaDB queries, but here these things are built-in, let me list a couple of queries that work in clickhouse and that show the advantages of it.

SELECT median(tip_amount) FROM trips; -- the median
SELECT avg(tip_amount) FROM trips; -- the average
SELECT quantile(0.9)(trip_distance) FROM trips; -- 90% percentile

-- clickhouse has useful custom functions, like `toHour`
SELECT toHour(pickup_datetime) AS pickup_hour, count(*) AS c
FROM trips GROUP BY pickup_hour ORDER BY c DESC;

In the section "3. Analyze the Data" a lot of interesting and amazingly fast queries are being done. Go through this one with time, it is fun and shows the power of clickhouse well. Also it takes away this fear of running these types of queries, that I have at least in a row-oriented DB. I have seen them slow down a product often enough.

Clickhouse provides many custom functions, like the toHour() shown above, the date and time functions are here, make sure to explore the chapters around functions, there is plenty of interesting stuff, even some machine learning functions.

Dictionaries

The tutorial section "4. Create a Dictionary" starts with:

If you are new to ClickHouse, it is important to understand how dictionaries work

A dictionary looks like a table, I am creating it via CREATE DICTIONARY just like a table, and I also see it when i run show tables. The primary key though is crucial here, I understand.

You select a value from this dictionary like using SELECT dictGet('<dictionary name>', '<column to return>', keyInDict), the docs brag with this taking basically no time. But to be honest, I wouldn't have expected this to take much time, the example they have in there is a table with 256 rows. That should be a no brainer for any DB.

Auto-completion ftw I just listed all the dict functions clickhouse provides 🤯

> select dict
dictGet                        dictGetFloat32                 dictGetInt32OrDefault          dictGetIPv6OrDefault           dictGetUInt32OrDefault         DICTIONARIES
dictGetAll                     dictGetFloat32OrDefault        dictGetInt64                   dictGetOrDefault               dictGetUInt64                  dictionaries
dictGetChildren                dictGetFloat64                 dictGetInt64OrDefault          dictGetOrNull                  dictGetUInt64OrDefault         Dictionary
dictGetDate                    dictGetFloat64OrDefault        dictGetInt8                    dictGetString                  dictGetUInt8                   dictionary
dictGetDateOrDefault           dictGetHierarchy               dictGetInt8OrDefault           dictGetStringOrDefault         dictGetUInt8OrDefault          DICTIONARY
dictGetDateTime                dictGetInt16                   dictGetIPv4                    dictGetUInt16                  dictGetUUID                    dictionary_use_async_executor
dictGetDateTimeOrDefault       dictGetInt16OrDefault          dictGetIPv4OrDefault           dictGetUInt16OrDefault         dictGetUUIDOrDefault           dictIsIn
dictGetDescendants             dictGetInt32                   dictGetIPv6                    dictGetUInt32                  dictHas

Quite a bunch.

Mmmh, I am through the dictionaries chapter, but I don't see why this is so special. What I basically understood is, that one can use dictionaries as a shorter way of joining tables, where one table is the dictionary. A query using a dictionary might look like this:

SELECT
    count(*) AS total,
    dictGetOrDefault('taxi_zone_dictionary', 'Borough', toUInt64(pickup_nyct2010_gid), 'Unknown') AS borough_name
FROM trips
GROUP BY pickup_nyct2010_gid

The function dictGetOrDefault reads from the dictionary "taxi_zone_dictionary" the column "Borough" for the key "toUInt64(pickup_nyct2010_gid)" and in case there is no such key in the dictionary use "Unknown". Definitely much easier to read and write than it would be using a standard SQL join.

I still struggle to understand why dictionaries are so very important, as mentioned at the start of this paragraph.

In the next chapter a JOIN with a dictionary is shown, that looks just like a JOIN with any other table. And then the tutorial says:

Behind the scenes, ClickHouse is actually calling the dictGet function for the taxi_zone_dictionary dictionary, but the JOIN syntax is more familiar for SQL developers.

To me this sounds a bit like "we have made joins easier". Which I am fine with and actually also thankful. I didn't practice complex joins for a while and I did get rusty and will surely oversee things and make mistakes.

Will see if there comes a better "Why" for the dictionaries.

The tutorial ends here. Ok, I got a feeling for clickhouse and I believe I can basically use it and understand some of its power. But I feel this was just the tip of the iceberg.

Visualizations, Client

In the beginning I said I will stick to the command line, and I did. But meanwhile I remembered that I had been using Apache Superset for visualizing these kind of data and that this might be one option for a UI tool for clickhouse, it has a query console and visualization capabilities, actually superset's core.

But the clickhouse command line tool is quite nice too, it has intelligent auto-completion and managing of query editing. Just the output of a lot of data is not that useful as a UI would offer it.

Anyway. Give it a try if you have the right use case. Clickhouse feels good until now. I might report about an update with more experience.

Getting Started with Clickhouse

Set it up

A Client?

Insert 2 Million Rows into my First Table

Explaining Queries

First "Complex" Query

Other "Complex" Queries

Dictionaries

Visualizations, Client

Comments

More from this blog

Building PicoSSG: 'Just Enough Code'

ClickHouse Aggregations and Django

EventSource: Native Browser Streaming API

Smallest test runner

Command Palette

Set it up

A Client?

Insert 2 Million Rows into my First Table

Explaining Queries

First "Complex" Query

Other "Complex" Queries

Dictionaries

Visualizations, Client

Comments

More from this blog