Wednesday, September 9, 2015

Converting/Exporting mixed encoding MySQL data to UTF8

I had to move an old MySQL database storing the info of a Greek website, and guess what; the default schema collation was latin1_swedish_ci and the charset latin1, the defaults of MySQL instance (which no one changes during installation) :/
The schema contained a mixture of tables, some of them in latin1_swedish_ci collation and some other with the proper UTF8 settings. Trying to export the data from either MySQL Workbench, phpMyAdmin and host's panel I was getting an ANSI encoded sql file. Normally, that's fine but if your data contains UTF8 characters (i.e. Greek letters) then you've got a problem.

 The Greek word "νέο" from a utf8 table is shown as "Ξ½Ξ­-ΞΏ" but the word "Φωτογραφία" from a latin1 table is shown as "Γ¦ÉΓβ€ΓΒΏΓΒ³ΓΒΓΒ±Γ†ΓΒ―ΓΒ±". Changing the file encoding to UTF-8 will fix the utf8 strings but the latin1 strings are completely ruined.
In order to fix that, I had to split the export/import task for the utf8 and the latin1 tables.

For the utf8 tables, I just used one of my dumps changing the file encoding to UTF-8.
The easiest way to do that in Windows is using Notepad++ (Encoding -> Encode in UTF-8 not Convert to UTF-8). If you are in Linux, I guess iconv is the easiest way.

For the latin1 tables, I had to execute mysqldump specifying the character set:
mysqldump -p[PASSWORD]  --default-character-set=latin1 --add-drop-table [DB_NAME] > db_latin1.sql
Again, changing the file encoding to UTF-8, will fix the strings from the latin1 tables. This time you can remove all the statements related with the utf8 tables and replace the "latin1" charset statements with "utf8". Now you can either manually merge the two dump files into one, or you could import both dump files in a new schema.

Related posts:

No comments: