The topic "Unicode" is quite complex, but I have to outline it briefly here to explain one feature of Nemp. The basic problem is nicely summarized in this picture. (For non-German readers: the broken characters should be the German letter "ß"; the text then translates to "Crappy Encoding".)

The ID3 tag is ultimately just a long sequence of bytes, i.e. numbers between 0 and 255. If a file contains text, these numerical values are interpreted as letters and displayed accordingly. With the "normal" letters A-Z and a-z there are usually no problems. The problems already start with German umlauts. And if you then switch alphabets, e.g. to Cyrillic, Greek or Hebrew, it becomes increasingly difficult.
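The step from numbers to letters can be illustrated in Python (a generic demonstration, nothing Nemp-specific):

```python
# The same four byte values, viewed once as numbers and once as ASCII text
data = bytes([78, 101, 109, 112])
print(list(data))            # [78, 101, 109, 112]
print(data.decode("ascii"))  # Nemp
```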

Unicode assigns every existing character its own unique number. Most of them (obviously) have a number above 255, but if a byte can only hold values up to 255, then you have to think a bit harder about how to encode these characters correctly.
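A quick illustration: only the plain Latin letters (and a few others, like "ß") have numbers that fit into a single byte; other characters get much larger ones.

```python
# Unicode code points: only values up to 255 fit into one byte
for ch in "Aßя€":
    print(ch, ord(ch))
# A 65, ß 223, я 1103, € 8364
```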

The most useful methods for this are the encodings UTF-8 and UTF-16, which can encode almost all characters. But if a program does not understand UTF-8, you get effects like the one in the picture.

Another method is the family of ISO-8859 standards. Here one agrees on a certain range of the Unicode number space and then interprets the values between 0 and 255 accordingly. Depending on the interpretation, the same byte is then displayed as a Cyrillic, Greek, Hebrew, Thai, or some other letter. But this only works if you share files exclusively with people who have agreed on the same ISO-8859 standard.
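In Python, decoding the same byte under two different ISO-8859 standards shows the ambiguity directly:

```python
b = bytes([0xE4])             # one byte, value 228
print(b.decode("iso8859_1"))  # ä (Latin / Western European)
print(b.decode("iso8859_7"))  # δ (Greek)
```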

Encoding in ID3v1 tags

Let's get back to the ID3 tags. The very simply constructed ID3v1 tag provides no information about the character encoding used. This can lead to text not being read correctly. To my knowledge, the various ISO-8859 standards are widely used, and more recently also UTF-8. Nemp can try to determine the character encoding used in ID3v1 tags. This works sometimes, maybe often, but certainly not always.

First, Nemp tries to interpret the information as UTF-8. If that fails, a guess is made based on the filename: if the filename contains many Cyrillic or Greek characters, the metadata is likely to contain those characters as well. Nemp then selects one of the ISO-8859 standards that covers the character range in question.
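A rough sketch of such a heuristic in Python. This is my own hypothetical reconstruction of the idea, not Nemp's actual code; the chosen Unicode block ranges and fallback encodings are assumptions:

```python
def guess_decode(raw: bytes, filename: str) -> str:
    """Guess-decode ID3v1 text: try UTF-8 first, then fall back to an
    ISO-8859 variant chosen from the script used in the filename."""
    try:
        return raw.decode("utf-8")  # strict: rejects invalid byte sequences
    except UnicodeDecodeError:
        pass
    if any("\u0370" <= c <= "\u03ff" for c in filename):  # Greek block
        return raw.decode("iso8859_7", errors="replace")
    if any("\u0400" <= c <= "\u04ff" for c in filename):  # Cyrillic block
        return raw.decode("iso8859_5", errors="replace")
    return raw.decode("iso8859_1", errors="replace")      # Western default

print(guess_decode("тест".encode("iso8859_5"), "тест.mp3"))  # тест
```

The strict UTF-8 attempt works as a first filter because most single-byte text containing non-ASCII values is not valid UTF-8, so the decoder raises an error instead of silently producing garbage.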

Encoding in other formats

Other metadata formats (including ID3v2 tags) do not have this problem. Here, UTF-8 or UTF-16 is usually mandatory, or the encoding used can (or must) be stated explicitly in the metadata. If Nemp still displays nonsense here, the error usually lies in the files, not in Nemp. The dumbest thing I have ever seen was UTF-16 encoding with a byte order mark and a zero terminator. That would not be unusual by itself, but this concept was applied to every single character: the text "Nemp" would be encoded as FE FF 00 4E 00 00 FE FF 00 65 00 00 FE FF 00 6D 00 00 FE FF 00 70 00 00. Completely nuts.
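For the curious, that broken scheme can be unpicked in Python (the byte values here are chosen so the result spells "Nemp"):

```python
# Each character: 2-byte BOM (FE FF) + UTF-16BE code unit + 2-byte zero terminator
raw = bytes.fromhex("FEFF004E0000 FEFF00650000 FEFF006D0000 FEFF00700000")
# Every character occupies 6 bytes; bytes 2-3 hold the actual code unit
chars = [raw[i + 2:i + 4].decode("utf-16-be") for i in range(0, len(raw), 6)]
print("".join(chars))  # Nemp
```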