
NYPD Complaint Data

Tab-separated value, or TSV, files are common and may include field headings as the first line of the file. ClickHouse can ingest TSV files, and can also query TSV files in place without ingesting them. This guide covers both cases. If you need to query or ingest CSV files, the same techniques apply: simply substitute CSV for TSV in your format arguments.

While working through this guide you will:

  • Investigate: Query the structure and content of the TSV file.
  • Determine the target ClickHouse schema: Choose proper data types and map the existing data to those types.
  • Create a ClickHouse table.
  • Preprocess and stream the data to ClickHouse.
  • Run some queries against ClickHouse.

The dataset used in this guide comes from the NYC Open Data team, and contains data about "all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD)". At the time of writing, the data file is 166MB, but it is updated regularly.

Source: data.cityofnewyork.us Terms of use: https://www1.nyc.gov/home/terms-of-use.page

Prerequisites

A note about the commands described in this guide

There are two types of commands in this guide:

  • Some of the commands query the TSV files; these are run at the command prompt.
  • The rest of the commands query ClickHouse; these are run in clickhouse-client or the Play UI.
Note

The examples in this guide assume that you have saved the TSV file to ${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv; please adjust the commands if needed.

Familiarize yourself with the TSV file

Before starting to work with the ClickHouse database familiarize yourself with the data.

Look at the fields in the source TSV file

This is an example of a command to query a TSV file, but don't run it yet.
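A minimal sketch of the kind of command meant here, asking clickhouse-local to describe the file and let ClickHouse infer the schema (the file path follows the note above; adjust it if needed):

```bash
clickhouse-local --query \
"describe file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')"
```

The `file()` table function reads the local file, and the `TSVWithNames` format tells ClickHouse that the first line contains the field headings.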

Sample response

Tip

Most of the time the above command will let you know which fields in the input data are numeric, which are strings, and which are tuples. This is not always the case. Because ClickHouse is routinely used with datasets containing billions of records, only a default number (100) of rows is examined to infer the schema, in order to avoid parsing billions of rows. The response below may not match what you see, as the dataset is updated several times each year. Looking at the Data Dictionary you can see that CMPLNT_NUM is specified as text, not numeric. By overriding the default of 100 rows for inference with the setting SETTINGS input_format_max_rows_to_read_for_schema_inference=2000 you can get a better idea of the content.

Note: as of version 22.5 the default is now 25,000 rows for inferring the schema, so only change the setting if you are on an older version or if you need more than 25,000 rows to be sampled.

Run this command at your command prompt. You will be using clickhouse-local to query the data in the TSV file you downloaded.
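A sketch of what this command might look like, applying the schema-inference setting mentioned in the tip above:

```bash
clickhouse-local --query \
"describe file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
SETTINGS input_format_max_rows_to_read_for_schema_inference=2000"
```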

Result:

At this point you should check that the columns in the TSV file match the names and types specified in the Columns in this Dataset section of the dataset web page. The inferred data types are not very specific: all numeric fields are set to Nullable(Float64), and all other fields are Nullable(String). When you create a ClickHouse table to store the data you can specify more appropriate and performant types.

Determine the proper schema

In order to figure out what types should be used for the fields it is necessary to know what the data looks like. For example, the field JURISDICTION_CODE is numeric: should it be a UInt8, or an Enum, or is Float64 appropriate?
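One way to check, sketched here with clickhouse-local, is to look at the range of values in the field:

```bash
clickhouse-local --query \
"SELECT min(JURISDICTION_CODE), max(JURISDICTION_CODE)
 FROM file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')"
```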

Result:

The query response shows that the JURISDICTION_CODE fits well in a UInt8.

Similarly, look at some of the String fields and see if they are well suited to being DateTime or LowCardinality(String) fields.

For example, the field PARKS_NM is described as "Name of NYC park, playground or greenspace of occurrence, if applicable (state parks are not included)". The names of parks in New York City may be a good candidate for a LowCardinality(String):
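A count of the distinct values indicates whether LowCardinality is suitable; a sketch:

```bash
clickhouse-local --query \
"SELECT uniqExact(PARKS_NM)
 FROM file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')"
```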

Result:

Have a look at some of the park names:
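For example, with a query along these lines:

```bash
clickhouse-local --query \
"SELECT DISTINCT PARKS_NM
 FROM file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 LIMIT 10"
```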

Result:

The dataset in use at the time of writing has only a few hundred distinct parks and playgrounds in the PARKS_NM column. This is well within the LowCardinality recommendation to stay below 10,000 distinct strings in a LowCardinality(String) field.

DateTime fields

Based on the Columns in this Dataset section of the dataset web page there are date and time fields for the start and end of the reported event. Looking at the min and max of CMPLNT_FR_DT and CMPLNT_TO_DT gives an idea of whether or not the fields are always populated:
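A sketch of the kind of query meant here, shown for CMPLNT_FR_DT (the other date and time fields can be checked the same way):

```bash
clickhouse-local --query \
"SELECT min(CMPLNT_FR_DT), max(CMPLNT_FR_DT)
 FROM file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')"
```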

Result:

Result:

Result:

Result:

Make a plan

Based on the above investigation:

  • JURISDICTION_CODE should be cast as UInt8.
  • PARKS_NM should be cast to LowCardinality(String)
  • CMPLNT_FR_DT and CMPLNT_FR_TM are always populated (possibly with a default time of 00:00:00)
  • CMPLNT_TO_DT and CMPLNT_TO_TM may be empty
  • Dates and times are stored in separate fields in the source
  • Dates are mm/dd/yyyy format
  • Times are hh:mm:ss format
  • Dates and times can be concatenated into DateTime types
  • There are some dates before January 1st 1970, which means we need a 64 bit DateTime
Note

There are many more changes to be made to the types, they all can be determined by following the same investigation steps. Look at the number of distinct strings in a field, the min and max of the numerics, and make your decisions. The table schema that is given later in the guide has many low cardinality strings and unsigned integer fields and very few floating point numerics.

Concatenate the date and time fields

To concatenate the date and time fields CMPLNT_FR_DT and CMPLNT_FR_TM into a single String that can be cast to a DateTime, select the two fields joined by the concatenation operator: CMPLNT_FR_DT || ' ' || CMPLNT_FR_TM. The CMPLNT_TO_DT and CMPLNT_TO_TM fields are handled similarly.
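A sketch of the concatenation for the start fields:

```bash
clickhouse-local --query \
"SELECT CMPLNT_FR_DT || ' ' || CMPLNT_FR_TM AS complaint_begin
 FROM file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 LIMIT 10"
```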

Result:

Convert the date and time String to a DateTime64 type

Earlier in the guide we discovered that there are dates in the TSV file before January 1st 1970, which means that we need a 64 bit DateTime type for the dates. The dates also need to be converted from MM/DD/YYYY to YYYY/MM/DD format. Both of these can be done with parseDateTime64BestEffort().
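A sketch combining the concatenation with the parsing (the alias names are illustrative):

```bash
clickhouse-local --query \
"WITH (CMPLNT_FR_DT || ' ' || CMPLNT_FR_TM) AS CMPLNT_START,
      (CMPLNT_TO_DT || ' ' || CMPLNT_TO_TM) AS CMPLNT_END
 SELECT parseDateTime64BestEffort(CMPLNT_START) AS complaint_begin,
        parseDateTime64BestEffortOrNull(CMPLNT_END) AS complaint_end
 FROM file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')
 LIMIT 25"
```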

The concatenated strings from the previous step are parsed into DateTime64 values with parseDateTime64BestEffort(). As the complaint end time is not guaranteed to exist, parseDateTime64BestEffortOrNull is used for it.

Result:

Note

The dates shown as 1925 above are from errors in the data. There are several records in the original data with dates in the years 1019 - 1022 that should be 2019 - 2022. They are being stored as Jan 1st 1925 as that is the earliest date with a 64 bit DateTime.

Create a table

The decisions made above on the data types used for the columns are reflected in the table schema below. We also need to decide on the ORDER BY and PRIMARY KEY used for the table. At least one of ORDER BY or PRIMARY KEY must be specified. Here are some guidelines on deciding on the columns to include in ORDER BY; more information is in the Next Steps section at the end of this document.

Order By and Primary Key clauses

  • The ORDER BY tuple should include fields that are used in query filters
  • To maximize compression on disk the ORDER BY tuple should be ordered by ascending cardinality
  • If it exists, the PRIMARY KEY tuple must be a subset of the ORDER BY tuple
  • If only ORDER BY is specified, then the same tuple will be used as PRIMARY KEY
  • The primary key index is created using the PRIMARY KEY tuple if specified, otherwise the ORDER BY tuple
  • The PRIMARY KEY index is kept in main memory

Looking at the dataset and the questions that might be answered by querying it we might decide that we would look at the types of crimes reported over time in the five boroughs of New York City. These fields might be then included in the ORDER BY:

Column     Description (from the data dictionary)
OFNS_DESC  Description of offense corresponding with key code
RPT_DT     Date event was reported to police
BORO_NM    The name of the borough in which the incident occurred

Querying the TSV file for the cardinality of the three candidate columns:
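A sketch of such a cardinality check:

```bash
clickhouse-local --query \
"SELECT uniqExact(OFNS_DESC), uniqExact(RPT_DT), uniqExact(BORO_NM)
 FROM file('${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv', 'TSVWithNames')"
```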

Result:

Ordering by cardinality, the ORDER BY becomes:
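With the borough name (five boroughs) having the lowest cardinality and the report date the highest, the tuple would be along these lines (using the source column names; the actual table uses the renamed columns):

```sql
ORDER BY ( BORO_NM, OFNS_DESC, RPT_DT )
```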

Note

The table below will use more easily read column names; the names above will be mapped to them.

Putting together the changes to data types and the ORDER BY tuple gives this table structure:
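An abbreviated sketch of such a table, showing only the columns discussed so far (the renamed column names and the time zone are illustrative assumptions, not the full schema):

```sql
CREATE TABLE NYPD_Complaint (
    complaint_number     String,
    date_reported        Date,
    borough              LowCardinality(String),
    offense_description  LowCardinality(String),
    park_name            LowCardinality(String),
    jurisdiction_code    UInt8,
    -- 64-bit DateTime, as some dates precede January 1st 1970
    complaint_begin      DateTime64(0, 'America/New_York'),
    complaint_end        DateTime64(0, 'America/New_York')
)
ENGINE = MergeTree
-- ordered by ascending cardinality, per the guidelines above
ORDER BY (borough, offense_description, date_reported)
```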

Finding the primary key of a table

The ClickHouse system database, specifically system.tables, has all of the information about the table you just created. This query shows the ORDER BY (sorting key) and the PRIMARY KEY:
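A sketch of such a query:

```sql
SELECT partition_key, sorting_key, primary_key
FROM system.tables
WHERE name = 'NYPD_Complaint'
FORMAT Vertical
```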

Response

Preprocess and Import Data

We will use the clickhouse-local tool for data preprocessing and clickhouse-client to upload it.

clickhouse-local arguments used

Tip

table='input' appears in the arguments to clickhouse-local below. clickhouse-local takes the provided input (cat ${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv) and inserts the input into a table. By default the table is named table. In this guide the name of the table is set to input to make the data flow clearer. The final argument to clickhouse-local is a query that selects from the table (FROM input) which is then piped to clickhouse-client to populate the table NYPD_Complaint.
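A sketch of the pipeline described in the tip (the SELECT is abbreviated to a placeholder; the real query applies all of the type conversions discussed earlier):

```bash
cat ${HOME}/NYPD_Complaint_Data_Current__Year_To_Date_.tsv \
  | clickhouse-local --table='input' --input-format='TSVWithNames' \
      --query "SELECT * FROM input" \
  | clickhouse-client --query='INSERT INTO NYPD_Complaint FORMAT TSV'
```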

Validate the Data

Note

The dataset changes once or more per year, so your counts may not match what is in this document.

Query:
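For example, a simple row count (your count will differ as the dataset is updated):

```sql
SELECT count() FROM NYPD_Complaint
```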

Result:

The size of the dataset in ClickHouse is just 12% of the original TSV file. Compare the size of the original TSV file with the size of the table:

Query:
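A sketch of a query for the table size, using system.tables:

```sql
SELECT formatReadableSize(total_bytes) AS size
FROM system.tables
WHERE name = 'NYPD_Complaint'
```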

Result:

Run Some Queries

Query 1. Compare the number of complaints by month

Query:
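A sketch of such a query, assuming a report-date column named date_reported (adjust to your schema):

```sql
SELECT dateName('month', date_reported) AS month, count() AS complaints
FROM NYPD_Complaint
GROUP BY month
ORDER BY complaints DESC
```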

Result:

Query 2. Compare total number of complaints by Borough

Query:
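A sketch, assuming a borough column named borough:

```sql
SELECT borough, count() AS complaints
FROM NYPD_Complaint
GROUP BY borough
ORDER BY complaints DESC
```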

Result:

Next Steps

A Practical Introduction to Sparse Primary Indexes in ClickHouse discusses the differences in ClickHouse indexing compared to traditional relational databases, how ClickHouse builds and uses a sparse primary index, and indexing best practices.