# Introduction

Why do we use indexes? Searching a row in a sorted file with N length takes $O(log_2 N)$ comparisons and the same number of reads from a filesystem which is heavy itself. However, tables in databases are not sorted which complicates the operation, Furthermore, if you have a lot of reads, updates and deletions on them. Writing the sorted version of the file (table) would dramatically slow the database down. One more thing makes it even more complicated - every table may be sorted in more than one order. That’s why we use indexes which holds only attributes used in sorting and reference to a place where the data are kept.

# Types of indexes in MySQL

In MySQL, we have few indexes available which we can use. Each of them has a different application and it’s better to know a bit about them.

## BTree+ index

B-Tree+ is the most used index in MySQL. It should be used when you have many reads and few writes in the table because reindexing of the table is very costly. There are few key things about B-Tree+:

• it’s a rooted tree
• every node does not have more than $m$ children
• every node (excluding root and leafs) has not less than $m/2$ children. $m/2$ is rounded down to a natural digit and called a factor of minimization.
• root, if it’s not a leaf, has at least two children.
• every leaf is always on the same level
• nodes which are not a leaf heaving $k$ children has $k-1$ keys.
• node does not have more than $m-1$ keys. Leaf which is not a root has at least $m/2$ keys.

Where $m$ is a number of nodes in one level. In practical usage, the $m$ value is very high which makes the tree very weight but very short. It helps reduce the number of reads from hard drive what is very costly and has a huge impact on performance.

An example of the tree is shown in the picture below.

That’s theory, in practice, you can create the index using query.

CREATE INDEX test_indexing USING BTREE ON bkielbasa_index.indexing(id);


or in a less verbose version

CREATE INDEX test_indexing ON bkielbasa_index.indexing(id);


because the B-tree+ index is a default one in MySQL. To create an index on more than one column, all of them must be NOT NULL. Remember that the primary key is always an index.

## RTREE index

R-Tree index is used to index spatial data like geographical coordinates, rectangles or polygons. How it looks in 2d is shown below.

To create T-Tree index you can use CREATE SPATIAL INDEX like showed below:

CREATE SPATIAL INDEX some_index ON some_table_with_geometry (geometry_coords);


## HASH index

Another index type available in MySQL database is hash index. The key thing you should know about it that the index is kept in the memory. What’s important to stress is that HASH index has $O(1)$ average search and an $O(n)$ worst case search. Hash Indexes is the general recommendation for equality based lookups. The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot be used to search for the next entry in order.) Additionally, the only whole index can be used in the search. The InnoDB indexes can be used in open and range lookups which will be described further in the article.

Creating indexes is very similar to the above

## FULLTEXT index

Fulltext index was introduced to InnoDB in MySQL 5.6 and helps in a searching text (CHAR, VARCHAR, or TEXT columns). How does it works and why it’s faster than regular search? In Fulltext index, they split the text into words and make an index of the words and not the whole text. This works a lot faster for text searches when looking for specific words.

Adding the index is very similar to previous indexes but using it is a bit different.

CREATE TABLE opening_lines (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
opening_line TEXT(500),
author VARCHAR(200),
title VARCHAR(200),
FULLTEXT idx (opening_line)
) ENGINE=InnoDB;

SELECT id, author, title FROM opening_lines WHERE MATCH(opening_line) AGAINST('Ishmael');


but remember - it works only on columns where you already added the index.

The fulltext index is not used very often because there are better alternatives such Lucene, Lucene with Compass or Solr which have much more features and are faster than fulltext index.

# Introduction to explain

To fully understand next chapters you should be familiar with command EXPLAIN in MySQL. EXPLAIN provides information how MySQL will execute your code statements. It works with SELECT, DELETE, INSERT, REPLACE, and UPDATE statements. Here are columns you will see in the output of the command

Column JSON Name Meaning
id select_id The SELECT identifier
select_type None The SELECT type
table table_name The table for the output row
partitions partitions The matching partitions
type access_type The join type
possible_keys possible_keys The possible indexes to choose
key key The index actually chosen
key_len key_length The length of the chosen key
ref ref The columns compared to the index
rows rows Estimate of rows to be examined
filtered filtered Percentage of rows filtered by table condition

The columns are more detailed described in the docs. We will be focused on few of them: Extra, key, key_len and rows.

Please notice that multiple records may be returned for a single query. A simple example is shown below

EXPLAIN SELECT * FROM
orderdetails d
INNER JOIN orders o ON d.orderNumber = o.orderNumber
INNER JOIN products p ON p.productCode = d.productCode
INNER JOIN productlines l ON p.productLine = l.productLine
INNER JOIN customers c on c.customerNumber = o.customerNumber
WHERE o.orderNumber = 10101G


and the output

********************** 1. row **********************
id: 1
select_type: SIMPLE
table: l
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 7
Extra:
********************** 2. row **********************
id: 1
select_type: SIMPLE
table: p
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 110
Extra: Using where; Using join buffer
********************** 3. row **********************
id: 1
select_type: SIMPLE
table: c
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 122
Extra: Using join buffer
********************** 4. row **********************
id: 1
select_type: SIMPLE
table: o
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 326
Extra: Using where; Using join buffer
********************** 5. row **********************
id: 1
select_type: SIMPLE
table: d
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2996
Extra: Using where; Using join buffer
5 rows in set (0.00 sec)


While studying the EXPLAIN output for performance it is important to fetch as small columns as it’s possible (the rows column) and use always an index. Using indexes indicates a small number of reads from the hard drive what’s expensive. A low number of rows used decreases the complexity of the searching.

What’s interesting you can get the information in JSON format. You can achieve that by passing format=json to the query.

explain format=json SELECT * FROM film WHERE (film_id BETWEEN 1 and 10) or (film_id BETWEEN 911 and 920)

********* 1. row *********
EXPLAIN: {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "10.04"
},
"table": {
"table_name": "film",
"access_type": "range",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"film_id"
],
"key_length": "2",
"rows_examined_per_scan": 20,
"rows_produced_per_join": 20,
"filtered": "100.00",
"cost_info": {
"eval_cost": "4.00",
"prefix_cost": "10.04",
},
"used_columns": [
"film_id",
"title",
"description",
"release_year",
"language_id",
"original_language_id",
"rental_duration",
"rental_rate",
"length",
"replacement_cost",
"rating",
"special_features",
"last_update"
],
"attached_condition": "((film.film_id between 1 and 10) or (film.film_id between 911 and 920))"
}
}
}


In forks of MySQL like MariaDB or Percona the feature is even more useful because shows even more information.

# Example

When you have a general overview on the topic, let’s check how it works in practice. I created two tables: clients

CREATE TABLE clients (
id int(11) NOT NULL AUTO_INCREMENT,
username varchar(255) NOT NULL,
password varchar(255) NOT NULL,
email varchar(255) NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


and clients_indexed

CREATE TABLE clients_indexed (
id int(11) NOT NULL AUTO_INCREMENT,
username varchar(255) NOT NULL,
password varchar(255) NOT NULL,
email varchar(255) NOT NULL,
PRIMARY KEY (id),
KEY username_idx (username,email)
) ENGINE=InnoDB DEFAULT CHARSET=utf8


and added to each of them exactly 200002 identical records. The only difference in the schema is that the second table has an index called username_idx. Maybe let’s start with a trivial example. Our task is to find a record with ID = 1053.

Query Result
SELECT * FROM clients where id = 1053; 1 row in set (0.00 sec)
SELECT * FROM clients_indexed where id = 1053; 1 row in set (0.00 sec)

We could not expect a different result. In Both queries, we used the primary key. More interesting results we’ll get when we try to find on a column without any index.

Query Result
select * from clients where username = 'username_0.12939831440280966'; 1 row in set (0.35 sec)
select * from clients_indexed where username = 'username_0.12939831440280966'; 1 row in set (0.00 sec)

In table without the index on username column, we had to wait much longer than in the second table. That’s because MySQL had to read all the columns from the hard drive and try to match our criteria. Here is explain of both queries.

Table without index:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE clients NULL ALL NULL NULL NULL NULL 776890 10.00 Using where

Table with index:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE clients_indexed NULL ref username_idx username_idx 767 const 1 100.00 NULL

It exactly says why the second search was much faster - it filtered 100% of the rows! In the table without index before analyzing the data MySQL could filter only 10% of all records. It means that MySQL had to read from hard drive 90% of the records to try to match the where condition. That’s a huge difference. Please notice that in the indexed table the engine did not even need use where condition to compare columns with the condition - he just already got what he needed!

## Lookups

What’s important to mention that you can use index to find values from the beginning. It means that in the query above the index was used but in the queries below it will not:

 --- won't use index

SELECT * from clients where username LIKE '%a%';
SELECT * from clients where username LIKE '%a';
SELECT * from clients where username LIKE 'a%a';


BTree+ Indexes are useful in three kinds of look-ups:

• point lookup - an example of it you can find in queries above. Point lookups are look-ups where you use equal sign for example: where username = 'admin'.
• open range lookup - it’s a search where you specify start or end of the indexed value. Example where id > 10 or where price < 300.
• close range lookup - the same as open range lookup but you define both beginning and end of the indexed values. Example where id > 10 and id < 100

# Cons of using indexes

Using indexes has not only advantages but disadvantages, too. Below you have list of few of them.

## Indexes take additional space on the hard drive

On huge tables or badly designed indexes it may be a big deal. I think it’s visible very well in tables from this article.

-rw-r-----   1 bartlomiejkielbasa  admin       8664 Oct  3 19:09 clients.frm
-rw-r-----   1 bartlomiejkielbasa  admin  100663296 Oct  3 19:23 clients.ibd
-rw-r-----   1 bartlomiejkielbasa  admin       8664 Oct  3 19:10 clients_indexed.frm
-rw-r-----   1 bartlomiejkielbasa  admin  192937984 Oct  3 19:25 clients_indexed.ibd
-rw-r-----   1 bartlomiejkielbasa  admin         61 Oct  3 19:09 db.opt


What is idb file?

By default, all InnoDB tables and indexes are stored in the system tablespace. As an alternative, you can store each InnoDB table and associated indexes in its own data file. This feature is called “file-per-table tablespaces” because each table has its own tablespace, and each tablespace has its own .ibd data file.

Read more in the official docs. In other words, ibd files contain indexes. The file with indexes for clients_indexed table takes almost twice more size than indexes for the first table!

## Write operations take more time

The indexes must be updated in this situation which is an additional operation on your hard drive. You will have more performance improvement on tables with a huge number of reads and a low number of updates, deletes and so on.

# Summary

Indexes can save our time but can be problematic in very huge tables or if they are not well designed. And remember - Use indexes find specific records - not to find all of them!