collation – Page 3 – Make Me Engineer

Humanized or natural number sorting of mixed word-and-number strings

June 30, 2022 by Tarik

This builds on your test data but works with arbitrary data and with any number of elements in the string. Register a composite type made up of one text and one integer value, once per database. I call it ai: CREATE TYPE ai AS (a text, i int); The trick is to form an …
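The excerpt above is cut off, so the following is only a rough Python analogue of the idea it introduces (pairing one text part with one integer part), not the PostgreSQL technique itself. The helper name natural_key and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical helper: each match is one (text, digits) pair, loosely
# mirroring the (a text, i int) composite type from the excerpt.
_PAIR = re.compile(r"(\D*)(\d*)")

def natural_key(s):
    """Return a list of (text, number) pairs usable as a sort key."""
    return [
        # -1 marks "no number here", so "a" sorts before "a0".
        (text, int(num) if num else -1)
        for text, num in _PAIR.findall(s)
        if text or num  # skip the empty match at the end of the string
    ]

names = ["item10", "item2", "item1"]
print(sorted(names, key=natural_key))
```

Because every key is a list of (str, int) tuples, comparisons never mix types, which is what lets the numeric parts sort by value rather than character by character.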

How to change the CHARACTER SET (and COLLATION) throughout a database?

June 28, 2022 by Tarik

Change the database collation: ALTER DATABASE <database_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Change the table collation: ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Change the column collation: ALTER TABLE <table_name> MODIFY <column_name> VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
What do the parts of utf8mb4_0900_ai_ci mean?
3 bytes: utf8
4 bytes: utf8mb4 (new)
v4.0 …

How do I perform an accent insensitive compare (e with è, é, ê and ë) in SQL Server?

June 18, 2022 by Tarik

Coerce to an accent-insensitive collation. You’ll also need to ensure both sides have the same collation, to avoid errors or further coercions, if you want to compare against a table variable or temp table varchar column, and because the constant value will have the collation of the database. Update: only for local variables, not …

What is the best collation to use for MySQL with PHP? [closed]

June 17, 2022 by Tarik

The main differences are sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin, which compares characters in binary format. utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The language-specific utf8 collations (such as utf8_swedish_ci) contain additional language rules that make them the …

UTF-8: General? Bin? Unicode?

May 29, 2022 by Tarik

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct. Here is the difference: For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is …

How do I sort unicode strings alphabetically in Python?

May 25, 2022 by Tarik

IBM’s ICU library does that (and a lot more). It has Python bindings: PyICU. Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651. The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, …

Troubleshooting “Illegal mix of collations” error in mysql

May 7, 2022 by Tarik

This is generally caused by comparing two strings with incompatible collations or by attempting to select data of different collations into a combined column. The COLLATE clause allows you to specify the collation used in the query. For example, the following WHERE clause will always give the error you posted: WHERE 'A' COLLATE latin1_general_ci = …

How do I see what character set a MySQL database / table / column is?

May 5, 2022 by Tarik

Here’s how I’d do it:
For Schemas (or Databases; they are synonyms): SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = "schemaname";
For Tables: SELECT CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = "schemaname" AND T.table_name = "tablename";
For Columns: SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = "schemaname" AND table_name …

Efficiently replace all accented characters in a string?

May 4, 2022 by Tarik

Here is a more complete version based on the Unicode standard.
var Latinise = {};
Latinise.latin_map = {"Á":"A", "Ă":"A", "Ắ":"A", "Ặ":"A", "Ằ":"A", "Ẳ":"A", "Ẵ":"A", "Ǎ":"A", "Â":"A", "Ấ":"A", "Ậ":"A", "Ầ":"A", "Ẩ":"A", "Ẫ":"A", "Ä":"A", "Ǟ":"A", "Ȧ":"A", "Ǡ":"A", "Ạ":"A", "Ȁ":"A", "À":"A", "Ả":"A", "Ȃ":"A", "Ā":"A", "Ą":"A", "Å":"A", "Ǻ":"A", "Ḁ":"A", "Ⱥ":"A", "Ã":"A", "Ꜳ":"AA", "Æ":"AE", "Ǽ":"AE", "Ǣ":"AE", "Ꜵ":"AO", "Ꜷ":"AU", "Ꜹ":"AV", "Ꜻ":"AV", "Ꜽ":"AY", "Ḃ":"B", "Ḅ":"B", "Ɓ":"B", "Ḇ":"B", …
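As a point of comparison with the lookup table above, here is a minimal Python sketch that uses Unicode decomposition instead of an explicit map. This is a different technique from the excerpt's, and the function name strip_accents is an assumption made for illustration. It only handles characters that have a canonical decomposition, so letters such as Æ or Ⱥ, which the map covers, pass through unchanged.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks; drop those.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("è é ê ë"))
```

An explicit map remains the better fit when ligatures and stroked letters also need ASCII replacements, since decomposition alone cannot produce "AE" from "Æ".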

What’s the difference between utf8_general_ci and utf8_unicode_ci?

May 1, 2022 by Tarik

For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci. All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared. _unicode_ci and _general_ci are two different sets of rules …