Writing Queries in ClickHouse using GitHub Data

This dataset contains all of the commits and changes for the ClickHouse repository. It can be generated using the native git-import tool distributed with ClickHouse.

The generated data provides a TSV file for each of the following tables:

commits - commits with statistics.
file_changes - files changed in every commit with the info about the change and statistics.
line_changes - every changed line in every changed file in every commit with full info about the line and the information about the previous change of this line.

As of November 8th, 2022, each TSV is approximately the following size and number of rows:

commits - 7.8M - 266,051 rows
file_changes - 53M - 266,051 rows
line_changes - 2.7G - 7,535,157 rows

Generating the data

This is optional. We distribute the data freely - see Downloading and inserting the data.

Generating the data takes around 3 minutes (as of November 8th 2022 on a MacBook Pro 2021) for the ClickHouse repository.

A full list of available options can be obtained from the tool's native help.

The help output also provides the DDL for each of the above tables.

The queries in this guide should work on any repository. Feel free to explore and report your findings. Some guidelines with respect to execution times (as of November 2022):

Linux - ~/clickhouse git-import - 160 mins

Downloading and inserting the data

The following data can be used to reproduce a working environment. Alternatively, this dataset is available in play.clickhouse.com - see Queries for further details.

Generated files for the following repositories can be found below:

ClickHouse (Nov 8th 2022)
Linux (Nov 8th 2022)

To insert this data, prepare the database by executing the following queries:
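
As a minimal sketch of the preparation step, the statements below create a database and the commits table. The database name `git` and the column names and types are assumptions recalled from the DDL that the tool's help prints; check them against the help output of your version before use.

```sql
-- Assumption: the database is called `git`; any name works as long as
-- later statements use the same one.
CREATE DATABASE IF NOT EXISTS git;

-- Assumption: columns mirror the commits DDL printed by the git-import help.
CREATE TABLE git.commits
(
    hash String,
    author LowCardinality(String),
    time DateTime,
    message String,
    files_added UInt32,
    files_deleted UInt32,
    files_renamed UInt32,
    files_modified UInt32,
    lines_added UInt32,
    lines_deleted UInt32,
    hunks_added UInt32,
    hunks_removed UInt32,
    hunks_changed UInt32
)
ENGINE = MergeTree
ORDER BY time;
```

The file_changes and line_changes tables would be created the same way, each with the column list from its own DDL.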

Insert the data using INSERT INTO SELECT and the s3 function. For example, below, we insert the ClickHouse files into each of their respective tables:

commits

file_changes

line_changes
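
The INSERT INTO SELECT pattern with the s3 function might look like the sketch below for the commits table. The URL is a hypothetical placeholder (substitute the link to the generated commits file), and the target table `git.commits` and the column structure are assumptions that should match whatever DDL was used to create the table.

```sql
-- Hypothetical URL: replace <commits-file-url> with the real location
-- of the commits TSV before running.
INSERT INTO git.commits
SELECT *
FROM s3(
    '<commits-file-url>',
    'TSV',
    -- Assumed structure; it must list the columns in the order they appear in the file.
    'hash String, author LowCardinality(String), time DateTime, message String, files_added UInt32, files_deleted UInt32, files_renamed UInt32, files_modified UInt32, lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32'
);
```

The same statement shape applies to file_changes and line_changes, changing the target table, the file URL, and the structure string.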

Queries

The tool suggests several queries via its help output. We have answered these, along with some supplementary questions of interest. The queries are ordered by approximately increasing complexity rather than in the tool's arbitrary order.

This dataset is available in play.clickhouse.com in the git_clickhouse database. We provide a link to this environment for all queries, adapting the database name as required. Note that play results may vary from those presented here due to differences in the time of data collection.

History of a single file

The simplest of queries. Here we look at all commit messages for StorageReplicatedMergeTree.cpp. Since recent messages are likely the most interesting, we sort with the most recent first.
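
A query along these lines could look like the following sketch. It assumes the data was loaded into a table named `git.file_changes` with columns `time`, `commit_hash`, `change_type`, `author`, `path`, `lines_added`, `lines_deleted` and `commit_message`, and that the file lives at `src/Storages/StorageReplicatedMergeTree.cpp`; none of these names are confirmed by the text above, so adjust them to the actual schema.

```sql
SELECT
    time,
    -- Shorten the hash for readability.
    substring(commit_hash, 1, 11) AS commit,
    change_type,
    author,
    lines_added,
    lines_deleted,
    commit_message
FROM git.file_changes
-- Assumed full path of the file within the repository.
WHERE path = 'src/Storages/StorageReplicatedMergeTree.cpp'
-- Most recent changes first.
ORDER BY time DESC
LIMIT 10;
```

On play.clickhouse.com the table would instead be referenced through the git_clickhouse database.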

Find the current active files

List files with most modifications

What day of the week do commits usually occur?

History of subdirectory/file - number of lines, commits and contributors over time

List files with maximum number of authors

Oldest lines of code in the repository

Files with longest history

Distribution of contributors with respect to docs and code over the month

Authors with the most diverse impact

Favorite files for an author

Largest files with lowest number of authors

Commits and lines of code distribution by time; by weekday, by author; for specific subdirectories

Matrix of authors that shows which authors tend to rewrite another author's code

Who is the highest percentage contributor per day of week?

Distribution of code age across repository

What percentage of code for an author has been removed by other authors?

List files that were rewritten the most times

On what weekday does the code have the highest chance to stay in the repository?

Files sorted by average code age

Who tends to write more tests / CPP code / comments?

How do an author's commits change over time with respect to code/comments percentage?

What is the average time before code will be rewritten and the median (half-life of code decay)?

What is the worst time to write code, in the sense that the code has the highest chance of being rewritten?

Which authors' code is the most sticky?

Most consecutive days of commits by an author

Line by line commit history of a file

Unsolved Questions

Git blame

Related Content
