UTF-8 Readme: Difference between revisions From Online Manual

Latest revision as of 08:10, 16 September 2016

UTF-8 is an encoding standard that can represent all Unicode characters. This allows it to show almost any writing system in the world.

UTF-8 in SMF 2.0.x

SMF 2.0.x includes the option to run the forum with or without UTF-8. If you install a new SMF 2.0.x forum with UTF-8, or if you upgrade an existing SMF 2.0.x forum to UTF-8, all posts will be stored in the database using the UTF-8 character set, and every web page will tell the browser that it uses the UTF-8 character set. For each language pack you install on your forum, you must choose the UTF-8 version of that language pack.

If you choose not to use UTF-8 for your SMF 2.0.x forum, you must choose the non-UTF-8 versions of all language packs that you install on your forum.

If you choose the wrong character set for any of your language packs, you will see some "garbage characters" on the screen. The solution is to use only the correct character set version of the language packs you have chosen for your forum.
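The "garbage characters" effect can be reproduced outside SMF. The following is a minimal Python sketch, not SMF code: it encodes a string as UTF-8 and then decodes the bytes as ISO-8859-1, which is roughly what happens when text in one character set is shown on a page that declares another. The function name and the sample string are illustrative assumptions.

```python
def show_with_wrong_charset(text: str, stored_as: str, shown_as: str) -> str:
    """Encode text with one character set and decode the bytes with another."""
    return text.encode(stored_as).decode(shown_as, errors="replace")


sample = "café"

# UTF-8 bytes interpreted as ISO-8859-1: the accented letter turns into two characters.
garbled = show_with_wrong_charset(sample, "utf-8", "iso-8859-1")
print(garbled)

# Matching character sets on both sides: the text survives unchanged.
matching = show_with_wrong_charset(sample, "utf-8", "utf-8")
print(matching)
```

When both sides agree on the character set, the text round-trips unchanged, which is why every language pack has to match the forum's character set.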

You can download language packs with a UTF-8 character set from http://download.simplemachines.org/?smflanguages.

Why would I need UTF-8?

There are a few reasons you might need UTF-8:

  • If you want to support multiple languages that use different character sets on your forum. For instance, if you want to support both Russian and Turkish, you will need a character set that supports both.
  • If the software integrating with SMF uses UTF-8. In some cases such an integration can require character sets to match.
  • If you need better search results or improved sorting. In some cases searching and sorting by the database can be improved by choosing UTF-8 as your character set.
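The Russian and Turkish case can be checked with a short Python sketch. It assumes windows-1251 for Russian and ISO-8859-9 for Turkish as the legacy character sets; the sample words and the helper name are illustrative only.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given character set."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


russian = "Привет"
turkish = "Günaydın"

# Each legacy character set covers one language but not the other; UTF-8 covers both.
for charset in ("windows-1251", "iso-8859-9", "utf-8"):
    print(charset, can_encode(russian, charset), can_encode(turkish, charset))
```

Only the UTF-8 row reports success for both strings, which is the situation the first reason in the list above describes.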

Why would I not need UTF-8?

If none of the above reasons apply to your forum, UTF-8 would probably not be very useful.

How to convert to UTF-8

This procedure will work well if you have used the recommended character sets in your language files in the past.

1. Start with a backup of your database. Character set conversions nearly always work correctly, but it is best to be prepared for the unexpected.

2. Check your default language file (Administration Center » Languages » Edit Languages) and make a note of the character set used.

3. Go to Forum Maintenance > Convert the database and data to UTF-8 (this option will only be available if SMF detects a database version which supports UTF-8).

4. Select the character sets for your data (member posts) and database. By default, SMF will choose the character set of your default language file.

5. Press Proceed, and your database will be converted. Depending on the size of your database, the conversion process might pause from time to time to avoid overloading the server. Once it finishes successfully, your forum data is stored as UTF-8.

6. Replace each language pack currently in use with the UTF-8 version of that language pack.

7. Once all the UTF-8 language packs have been installed, convert the language setting of each user by running this query. Run it only once, because every run appends the suffix again:

UPDATE smf_members
SET lngfile = CONCAT(lngfile, '-utf8')
WHERE lngfile != ''

8. In your Administration Center, change the default language, ensuring that you choose the UTF-8 version.

9. Check to see if all your data was properly converted.

10. If any of your posts contain HTML entities, convert these to UTF-8 as well by running "Convert HTML-entities to UTF-8 characters".
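To illustrate what step 10 is about: posts written in a character set that could not represent some characters may contain numeric HTML entities such as &#1055; in place of those characters. The Python sketch below shows the general idea of turning such entities into real characters. It is an assumption-laden illustration, not the code SMF runs; the function name, the regular expression, and the sample post are hypothetical.

```python
import re

# Matches decimal numeric entities such as &#1055; (named entities like &lt; are left alone).
NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def numeric_entities_to_characters(post_body: str) -> str:
    """Replace decimal numeric HTML entities with the characters they stand for."""
    return NUMERIC_ENTITY.sub(lambda match: chr(int(match.group(1))), post_body)


stored = "&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &lt;b&gt;"
converted = numeric_entities_to_characters(stored)
print(converted)

# After conversion the text can be stored as ordinary UTF-8 bytes.
print(len(converted.encode("utf-8")))
```

The sketch deliberately leaves entities such as &lt; untouched, since those protect markup rather than stand in for missing characters.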

What to do if your site uses a mix of character sets

If your site began as a non-UTF-8 installation and you later added UTF-8 language files, the procedure above might not work perfectly. There are likely to be hundreds of posts in the database in a character set different from the forum's default character set. As a result, site search functions might not work well, and the situation might be holding you back from converting to SMF 2.0.

The good news is that you can fix this problem and convert to UTF-8. You must simply plan the process out a little more carefully, and might need to take one or two extra steps.

  • Before you begin, double-check the default language file and character set you used to install SMF, and also the language file and character set you added later (the one used for storing the important posts on the forum). Make sure to put the forum into maintenance mode.
  • Set the default language for your forum to whatever it was when you first installed the forum (English ISO-8859-1, for example).
  • In order to change your default language (from English to Greek, for example), it is best to get a copy of the upgrade package for the current version of SMF, for the language you want to be your new default. Copy all these files just as you normally would for an upgrade or install.
  • Run upgrade.php, still using the old default character set, and then delete upgrade.php.
  • If your character set problems have affected your search, take the following steps:
    • Go to Administration Center » Search » Search Method, delete the text index if one exists, and select "No index" as the search method.
    • Remove all rows from tables with names like smf_log_search_*. Use phpMyAdmin to do this. Do not drop these tables.
    • Now proceed through the instructions above up to step 4. During step 4, you will have to set the Data character set to the character set used in the posts on your forum.
    • Follow the remainder of the instructions above to finish up.
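The reason the Data character set matters can be shown with a small Python sketch. It assumes a post whose bytes are already UTF-8 Greek, stored in a forum whose default character set is ISO-8859-1, as in the example above. The helper name and the sample text are hypothetical, and the sketch only mimics the idea of a character set conversion, not SMF's converter.

```python
def convert_to_utf8(raw: bytes, data_charset: str) -> bytes:
    """Interpret raw bytes as data_charset and re-encode the result as UTF-8."""
    return raw.decode(data_charset).encode("utf-8")


# Bytes as they sit in the database: Greek text that is already UTF-8.
raw_post = "Καλημέρα".encode("utf-8")

# Converting with the forum's default character set treats every byte as a
# separate ISO-8859-1 character, so the post ends up encoded twice.
wrong = convert_to_utf8(raw_post, "iso-8859-1")

# Converting with the character set the posts really use leaves them intact.
right = convert_to_utf8(raw_post, "utf-8")

print(right == raw_post)
print(len(raw_post), len(wrong))
```

Choosing the character set the posts were actually written in is what keeps the converted data readable.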

To see how two SMF forum administrators applied this approach to sites that had been created with ISO-8859-1 English, with the UTF-8 Greek character set added later, please read "It's all Greek to me .... :)" at http://www.simplemachines.org/community/index.php?msg=3093945


