MySQL low cardinality index efficiency

I’ve heard the opinion that indexes on low cardinality columns don’t work well. I set out to disprove that.

First, let’s create a table with a boolean column “active” and fill it with dummy data. Please note it includes a (default, B-Tree) index on the “active” column.

root@localhost test > show create table users\G *************************** 1. row *************************** Table: users Create Table: CREATE TABLE `users` ( `id` int(11) NOT NULL AUTO_INCREMENT, `name` varchar(50) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `active` tinyint(1) DEFAULT NULL, PRIMARY KEY (`id`), KEY `active` (`active`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec) root@localhost test > DELIMITER ;; root@localhost test > CREATE PROCEDURE insert_random(IN amount INT, IN percent_active INT) -> BEGIN -> DECLARE counter INT DEFAULT 0; -> myloop: LOOP -> if (counter = amount) THEN -> LEAVE myloop; -> end if; -> INSERT INTO users SET -> name = UUID(), -> email = UUID(), -> active = IF(FLOOR(RAND() * 100) < percent_active, 1, 0); -> SET counter = counter + 1; -> END LOOP; -> END ;; Query OK, 0 rows affected (0.00 sec) root@localhost test > DELIMITER ; root@localhost test > call insert_random(1000000, 10); Query OK, 1 row affected (36.60 sec) root@localhost test > select count(*) from users; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row in set (0.18 sec)

We’ve inserted a million users into our table! That took some time. Please note the 0.18 seconds to count them all.

Now, let’s see how fast we can count the active users:

root@localhost test > select count(*) from users where active = true; +----------+ | count(*) | +----------+ | 99887 | +----------+ 1 row in set (0.04 sec)

Cool, pretty fast. Can you explain that?

root@localhost test > explain select count(*) from users where active = true\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: users partitions: NULL type: ref possible_keys: active key: active key_len: 2 ref: const rows: 208462 filtered: 100.00 Extra: Using index 1 row in set, 1 warning (0.00 sec)

Now let’s drop the index and run the test again:

root@localhost test > alter table users drop key active; Query OK, 0 rows affected (0.05 sec) Records: 0 Duplicates: 0 Warnings: 0 root@localhost test > select count(*) from users where active = true; +----------+ | count(*) | +----------+ | 99887 | +----------+ 1 row in set (0.24 sec)

Wow, a lot slower… I’ve ran the selects a couple of times with consistent results to verify no caching was influencing the results. Let’s see the explain:

root@localhost test > explain select count(*) from users where active = true\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: users partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 912236 filtered: 10.00 Extra: Using where 1 row in set, 1 warning (0.00 sec)

As you hopefully know: “Using index” good, “Using where” bad.

TL;DR A boolean column consisting of 10% TRUE and 90% FALSE queried for TRUE values using an index takes 0.04 sec, while not using an index takes 0.24 sec. The index speeds up the query by about a factor of six.

I’ve repeated the test with worse case scenario of a binary column split 50/50. 50% true, 50% false. The numbers are a bit less consistent, but generally around 0.16 for the indexed version and 0.24 for the unindexed version. Go indexes!