Sunday, 11 January 2015

mysql collation and character set

Hi all,
have you ever wondered what is collation latin1_swedish_ci shown in phpMyAdmin? 
yeah sure it might not be useful to your work but it might be useful some day to have knowledge, isn't it?

From MySQL docs:
character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'c' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.
Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.
But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).
so based upon docs latin1_swedish_ci  collation means it's case - insensitive .
some of example from stack-overflow which will clarify further doubts regarding this.
For example, latin1_general_cs is case-sensitive with both VARCHAR and VARBINARY:
DROP TABLE IF EXISTS `case_sensitive`;
CREATE TABLE `case_sensitive` (
    `id` INT NOT NULL AUTO_INCREMENT,
    `nonbinary` VARCHAR(255),
    `binary`  VARBINARY(255),
    PRIMARY KEY (`id`)
) ENGINE=InnoDB COLLATE latin1_general_cs;

INSERT INTO `case_sensitive` (`nonbinary`, `binary`) VALUES ('A', 'A');

SELECT * FROM `case_sensitive` WHERE `nonbinary` = 'A';

+----+-----------+--------+
| id | nonbinary | binary |
+----+-----------+--------+
|  1 | A         | A      |
+----+-----------+--------+
1 row in set (0.00 sec)

SELECT * FROM `case_sensitive` WHERE `binary` = 'A';

+----+-----------+--------+
| id | nonbinary | binary |
+----+-----------+--------+
|  1 | A         | A      |
+----+-----------+--------+
1 row in set (0.00 sec)

SELECT * FROM `case_sensitive` WHERE `nonbinary` = 'a';

Empty set (0.00 sec)

SELECT * FROM `case_sensitive` WHERE `binary` = 'a';

Empty set (0.00 sec)
Whereas latin1_general_ci is case-insensitive with VARCHAR, and case-sensitive with VARBINARY:
DROP TABLE IF EXISTS `case_insensitive`;
CREATE TABLE `case_insensitive` (
    `id` INT NOT NULL AUTO_INCREMENT,
    `nonbinary` VARCHAR(255),
    `binary`  VARBINARY(255),
    PRIMARY KEY (`id`)
) ENGINE=InnoDB COLLATE latin1_general_ci;

INSERT INTO `case_insensitive` (`nonbinary`, `binary`) VALUES ('A', 'A');

SELECT * FROM `case_insensitive` WHERE `nonbinary` = 'A';

+----+-----------+--------+
| id | nonbinary | binary |
+----+-----------+--------+
|  1 | A         | A      |
+----+-----------+--------+
1 row in set (0.00 sec)

SELECT * FROM `case_insensitive` WHERE `binary` = 'A';

+----+-----------+--------+
| id | nonbinary | binary |
+----+-----------+--------+
|  1 | A         | A      |
+----+-----------+--------+

SELECT * FROM `case_insensitive` WHERE `nonbinary` = 'a';

+----+-----------+--------+
| id | nonbinary | binary |
+----+-----------+--------+
|  1 | A         | A      |
+----+-----------+--------+

SELECT * FROM `case_insensitive` WHERE `binary` = 'a';

Empty set (0.00 sec)


so for next time if you want that when you want to have case sensitive / insensitive string comparision for database don't forget to choose accordingly! :)

Now what will you do if you want to change collation of existing table?

MySQL has 4 levels of collation: server, database, table, column. If you change the collation of the server, database or table, you don't change the setting for each column, but you change the default collations.
E.g if you change the default collation of a database, each new table you create in that database will use that collation, and if you change the default collation of a table, each column you create in that table will get that collation.
alter table query to change collation type.
alter table <some_table> convert to character set utf8 collate utf8_unicode_ci;


enjoy!

No comments:

Post a Comment