okanat a day ago

As a Turkish speaker who used a Turkish-locale setup in my teenage years, these kinds of bugs frustrated me endlessly. Half of the Java or Python apps I installed never ran. My PHP web servers always had problems with random software. Ultimately, I had to change my system's language to English. However, the US has godawful standards for everything: dates, measurement units, paper sizes.

When I shared computers with my parents I had to switch languages back and forth all the time. This helped me learn English rather quickly, but I consider it a huge accessibility and software-design issue.

If your program depends on letter cases, that is a badly designed program, period. If a language ships a toUpper or toLower function without a mandatory language parameter, it is badly designed too. The only slightly better option is making toUpper and toLower ASCII-only and throwing an error for any other character set.

While half of the language design of C is questionable and outright dangerous, making its functions locale-sensitive by all popular OSes was an avoidable mistake. Yet everybody did that. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.

I don't care if Unicode releases a conversion map. Natural-language behavior should always require natural-language metadata too. Even modern languages like Rust did a crappy job of enforcing it: https://doc.rust-lang.org/std/primitive.char.html#method.to_... . Yes, it is significantly safer, but converting 'ß' to 'SS' in German definitely has gotchas too.
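
The point about a mandatory language parameter is easy to demonstrate in Java, whose String casing methods accept an explicit Locale (a small sketch; the tag "tr-TR" selects Turkish):

```java
import java.util.Locale;

public class TurkishCase {
    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Locale-neutral rules: 'i' uppercases to plain 'I'.
        System.out.println("i".toUpperCase(Locale.ROOT));   // I
        // Turkish rules: 'i' uppercases to dotted capital İ (U+0130).
        System.out.println("i".toUpperCase(turkish));       // İ
        // And capital 'I' lowercases to dotless ı (U+0131).
        System.out.println("I".toLowerCase(turkish));       // ı
    }
}
```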

  • collinfunk a day ago

    > While half of the language design of C is questionable and outright dangerous, making its functions locale-sensitive by all popular OSes was an avoidable mistake. Yet everybody did that. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.

    POSIX requires that many functions account for the current locale. I'm not sure why you are blaming GNU for this.

    • keyle 17 hours ago

      C wasn't designed to run Facebook; it was designed so you didn't have to write assembly.

      • jen20 9 hours ago

        At a time when many machines did not have as many bytes of memory as there are Unicode code points.

    • immibis 13 hours ago

      I'm not sure why you are blaming POSIX! The role of POSIX is to write down what is already common practice in almost all POSIX-like systems. It doesn't usually specify new behaviour.

      • GTP 8 hours ago

        I always assumed it was the other way around: a system follows POSIX to be POSIX-compliant.

  • newpavlov 11 hours ago

    >Even modern languages like Rust did a crappy job of enforcing it

    Rust did the only sensible thing here. String handling algorithms SHOULD NOT depend on locale and reusing LATIN CAPITAL LETTER I arguably was a terrible decision on the Unicode side (I know there were reasons for it, but I believe they should've bit the bullet here), same as Han unification.

  • OptionOfT 3 hours ago

    > Even modern languages like Rust did a crappy job of enforcing it: https://doc.rust-lang.org/std/primitive.char.html#method.to_... .

    What were they supposed to do?

    According to Wikipedia:

    > Traditionally, ⟨ß⟩ did not have a capital form, and was capitalized as ⟨SS⟩. Some type designers introduced capitalized variants. In 2017, the Council for German Orthography officially adopted a capital form ⟨ẞ⟩ as an acceptable variant, ending a long debate.[4] Since 2024 the capital has been preferred over ⟨SS⟩.[5]

    Source: https://en.wikipedia.org/wiki/%C3%9F#:~:text=Traditionally%2...

    So this has been adopted since 2017, and Rust follows the Unicode standard. It's not up to Rust; it's up to Unicode. If that was the mapping in 2017, that's on Unicode, not Rust.

    But I'm unsure whether we can change the existing Unicode mapping without breaking backwards compatibility. I did learn that ẞ (the uppercase variant) is treated as equivalent to SS in case-insensitive matching (copy the SS into a find-in-page search and you'll see the ẞ highlighted, in both upper and lower case).
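
    The one-to-many mapping is easy to observe in Java, which follows Unicode's SpecialCasing rules (a small sketch):

```java
import java.util.Locale;

public class SharpS {
    public static void main(String[] args) {
        // Uppercasing ß expands to two characters, per Unicode SpecialCasing.
        System.out.println("ß".toUpperCase(Locale.ROOT));       // SS
        // The capital ẞ (U+1E9E) lowercases back to ß ...
        System.out.println("\u1E9E".toLowerCase(Locale.ROOT));  // ß
        // ... but "SS" lowercases to "ss", so the round trip loses the letter.
        System.out.println("ß".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT)); // ss
    }
}
```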

  • 1718627440 a day ago

    > However, US has godawful standards for everything: dates, measurement units, paper sizes.

    Isn't the choice of language normally independent of the date and unit formats?

    • neandrake a day ago

      There are OS-level settings for date and unit formats but not all software obeys that, instead falling back to using the default date/unit formats for the selected locale.

    • Waterluvian a day ago

      They’re about as independent as system language defaults causing software not to work properly. It’s that whole realm of “well we assumed that…” design error.

    • okanat a day ago

      > > However, US has godawful standards for everything: dates, measurement units, paper sizes.

      > Isn't the choice of language and date and unit formats normally independent.

      You would hope so, but no. Quite a bit of software ties the language setting to the locale setting. If you are lucky, they will provide an "English (UK)" option (which still uses miles, and FFS WTF is a stone!).

      On Windows you can kinda select the units easily. On Linux, let me introduce you to the journey through the LC_* environment variables: https://www.baeldung.com/linux/locale-environment-variables . This doesn't mean the websites or the apps will obey them. Quite a few of them don't and just use LANGUAGE, LANG or LC_CTYPE as their setting.

      My company switched to Notion this year (I still miss Confluence). It was hell until last month since they only had "English (US)" and used M/D/Y everywhere with no option to change!

      • miki123211 14 hours ago

        macOS actually lets you do English (Afghanistan) or English (Somalia) or whatever.

        It's just "English" (I don't know when it's US and when it's UK; it's UK for Poland), but with the date / temperature / currency / unit preferences of whatever locale you actually live in.

        • spookie 10 hours ago

          At least for any country in continental Europe, "English" is usually "English International", meaning English (UK).

          Maybe there are some exceptions if we speak globally, hence limiting myself to Europe. But I assume it is the same deal.

      • spookie 10 hours ago

        Certain desktop environments like KDE provide a nice GUI for changing the locale environment variables. It has worked quite well for me, to use euro instead of my country's small currency :')

      • menage 17 hours ago

        > FFS WTF is a stone!

        It's actually a pretty good weight for measuring humans (14lb). Your weight in pounds varies from day to day but your weight in (half-)stones is much more stable.

        • doix 17 hours ago

          The real travesty is that the sub-unit of a stone is the pound and not the pebble. I have no idea what stones and pounds are, but if it were stones and pebbles, at least it'd be funnier.

          • pandemic_region 15 hours ago

            There's a full metric system hidden there: rock - stone - pebble - grain.

            I propose 614 stones to the rock, 131 pebbles to the stone, and 14707 grains to the pebble. Of course.

            • okanat 14 hours ago

              Let's introduce the commonly used unit of crumble which is 3/4 of a grain!

              • thechao 10 hours ago

                The commonly used unit should be 23/17ths.

      • doublerabbit a day ago

        > FFS WTF is a stone

        An English imperial unit of weight, originally based on actual stones and mainly used for weighing agricultural goods such as meat and potatoes. We also used tons and pounds before we adopted the metric system from Europe.

        • emmelaich a day ago

          A stone is 1/8th of a long hundredweight. Easy!

          • stefs 16 hours ago

            My car gets 40 rods to the hogshead and that's the way I likes it!

  • emmelaich a day ago

    If it's offered, choose EN-Australian or EN-international. Then you get sensible dates and measurement units.

    • benhurmarcel 15 hours ago

      I usually set the Ireland locale, they use English but use civilized units. Sometimes there's also a "English (Europe)" or "English (Germany)" locale that works too.

      • distances 14 hours ago

        I also use Ireland sometimes for user accounts. For example, Hotels.com only offers the local languages when you select which country to use. The Irish version is one of the few that allows you to pay in euros in English.

      • okanat 14 hours ago

        Nowadays this works for many applications. Not for the "legacy" ARM compiler, though, which was definitely invented after Windows NT adopted Unicode. It crashes with "English (Germany)". Just whyy.

    • Waterluvian a day ago

      And if you want it to be more sensible but still not sensible, pick EN-ca.

  • layer8 a day ago

    > While half of the language design of C is questionable and outright dangerous, making its functions locale-sensitive by all popular OSes was an avoidable mistake.

    It wasn’t a mistake for local software that is supposed to automatically use the user’s locale. It’s what made a lot of local software usefully locale-sensitive without the developer having to put much effort into it, or even necessarily be aware of it. It’s the reason why setting the LC_* environment variables on Linux has any effect on most software.

    The age of server software, and software talking to other systems, is what made that default less convenient.

    • jkrejcha 18 hours ago

      On the contrary, the locale APIs are problematic for many reasons. If C had just been like "well C only supports the C locale, write your own support if that's what you want", much more software would have been less subtly broken.

      There are a few fundamental problems with it:

      1. The locale APIs weren't designed very well and things were added over the years that do not play nice with it.

      So, as an example, what should `int toupper(int c)` return? (By the way, the parameter `c` is really an unsigned char; if you pass anything but a single byte here, you get undefined behavior.) What if you're using something that uses a multibyte encoding? You only get one byte back, so that doesn't really help there either.

      Many of the functions were clearly designed for the "1 character = 1 byte" world, which is a key assumption of all of these APIs. Which is fine if you're working with ASCII, but blows up as soon as you change locales.

      And even so, it creates problems wherever you try to use it. Say I have a "shell" whose commands are all internally stored as uppercase, but you want to be compatible with user input. If you try to use anything outside of ASCII with locales, you can't just store the command list in uppercase form, because then the commands won't match when doing a string comparison with the obvious function (strcmp). You have to use strcoll instead, and even then you might not get a match for multibyte encodings.
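
      The same trap reproduces outside C; here is a hypothetical Java transliteration of the shell example, with an explicit Turkish locale standing in for a changed process locale:

```java
import java.util.Locale;

public class ShellLookup {
    public static void main(String[] args) {
        String command = "EXIT";  // command table stored in uppercase ASCII
        String typed = "exit";    // user input
        // Locale-neutral uppercasing matches the table ...
        System.out.println(typed.toUpperCase(Locale.ROOT).equals(command)); // true
        // ... but Turkish rules turn "exit" into "EXİT", and the lookup fails.
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(typed.toUpperCase(turkish).equals(command));     // false
    }
}
```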

      2. The locale is global state.

      The worst part is that it's actually global state (not even faux-global state like errno). This means it's wildly thread-unsafe: thread 1 can be running toupper(x) while another thread, possibly in a completely different library, calls setlocale (as many library functions do, to guard against the semantics of standard library functions changing unexpectedly). And boom, instant undefined behavior, with basically nothing you can reasonably do about it. You'll probably get something out of it, but the pieces are probably going to display weirdly unless your users are from the US, where the C locale is pretty close to the US locale.

      This means any of the functions in this list[1] is potentially a bomb:

      > fprintf, isprint, iswdigit, localeconv, tolower, fscanf, ispunct, iswgraph, mblen, toupper, isalnum, isspace, iswlower, mbstowcs, towlower, isalpha, isupper, iswprint, mbtowc, towupper, isblank, iswalnum, iswpunct, setlocale, wcscoll, iscntrl, iswalpha, iswspace, strcoll, wcstod, isdigit, iswblank, iswupper, strerror, wcstombs, isgraph, iswcntrl, iswxdigit, strtod, wcsxfrm, islower, iswctype, isxdigit.

      And there are some important ones in there too, like strerror. Searching through GitHub as a random sample, it's not uncommon to see these functions used[2], and really, would you expect `isdigit` to be thread-unsafe?

      It's a little better with POSIX, which defines a bunch of "_r" variants of functions like strerror that at least give some thread safety (and uselocale is at least a thread-local variant of setlocale, which lets you safely do the whole "guard all library calls by switching to the C locale" dance). But Windows doesn't support uselocale, so you have to use _configthreadlocale instead.

      It also creates hard-to-trace bug reports. Saying you only support ASCII is not great today, but it's at least somewhat understandable, and ASCII is commonly seen as the lowest common denominator for software. Sure, ideally we'd all use byte strings where we don't care and UTF-8 where we actually want to work with text (and maybe UTF-16 on Windows for certain things), but memory corruption when you do something with a string, only for people in a certain part of the world in certain circumstances, is not a great user experience, or developer experience for that matter.

      The thing is, I actually like C in a lot of ways. It's a very useful programming language with incredible importance today, and probably far into the future, but I don't really think the locale API was all that well designed.

      [1]: Source: https://en.cppreference.com/w/c/locale/setlocale.html

      [2]: https://github.com/search?q=strerror%28+language%3AC&type=co...

      • eklitzke 4 hours ago

        I think it's important to point out the distinction between what POSIX mandates and what actual libc implementations, notably glibc, do. Nearly all non-reentrant POSIX functions are only non-reentrant if you are using a 1980s computer that for some reason has threads but doesn't have thread-local storage. Things like strerror are implemented using TLS in glibc nowadays. So while it is technically true that you need the _r versions to be portable to computers nobody has used in 30 years, in practice you usually don't need to worry about this, especially on Linux, since results are stored in static thread-local memory rather than static global memory.

        As for the string.h stuff, while it is all terrible, it's at least well documented that everything is broken unless you use wchar_t, and nobody uses wchar_t because it's the worst possible localization solution. No one is seriously trying to do real localization in C (and if they were, they'd be using libicu).

        • jkrejcha 2 hours ago

          strerror, at least on glibc, was only made thread safe back in 2020[1], which is really not that long ago in the grand scheme of things. It was WONTFIXed when it was initially reported back in 2005(!). There have only been 10 glibc releases since then and the 2.32 branch is still actively maintained.

          There is probably a wide breadth of software that is actively not using that glibc version.

          But yeah, agreed that trying to do localization with the built-in functions is fraught with traps and pitfalls. Part of the problem, though, is less about localization and more that bugs can be inflicted on you unless you're careful to overwrite the locale with the C locale (and make sure to do so everywhere you can).

          [1]: https://sourceware.org/bugzilla/show_activity.cgi?id=1890 (see specifically the target milestone, the 2023 date seems to be overly pessimistic)

  • arccy a day ago

    use Australian English: English but with same settings for everything else, including keyboard layout

    • okanat a day ago

      I live in Germany now, so I generally set it to Irish nowadays. Since I like the ISO-style enter key, I use the UK keyboard layout (it's also easier to switch to Turkish from than the ANSI layout). However, many OSes now have an "English (Europe)" locale too.

    • Sesse__ a day ago

      Many Linux distributions provide en_DK specifically for this purpose. English as it is used in Denmark. :-)

      • Symbiote 17 hours ago

        This uses a comma decimal separator, which might or might not be desired.

        Irish English locale uses a dot.

      • fph a day ago

        Denmark doesn't have Euros as currency, unfortunately.

        • jojomodding a day ago

          Tying currency to locale seems insane. I have bank accounts in multiple currencies and use both several times per week. Why does all software on my system need to have a default currency? Most software does not care about money, those that do usually give you a quote in a currency fixed by someone else.

          • input_sh 16 hours ago

            It's about how easy it is to reach the € sign. Ideally, it should be as easy to type as the $ sign is in the en_US layout.

            For what it's worth, I think most European keyboard layouts have key combos for € and $ defined (many have £ as well), while on en_US you can only type $ (without messing with settings). Europe of course has more currencies than just €, but those use two-letter abbreviations instead of a special symbol.

            • simonask 10 hours ago

              zł has entered the chat. ;-)

              (The Polish Ł is typically not easily typable on non-Polish keyboards.)

              • Sesse__ 5 hours ago

                Huh, do typical Linux keyboards not have it on AltGr-L?

  • thaumasiotes a day ago

    > If a language ships toUpper or a toLower function without a mandatory language field, it is badly designed too. The only slightly-better option is making toUpper and toLower ASCII-only and throwing error for any other character set.

    There is a deeper bug within Unicode.

    The Turkish letter TURKISH CAPITAL LETTER DOTLESS I is represented as the code point U+0049, which is named LATIN CAPITAL LETTER I.

    The Greek letter GREEK CAPITAL LETTER IOTA is represented as the code point U+0399, named... GREEK CAPITAL LETTER IOTA.

    The relationship between the Greek letter I and the Roman letter I is identical in every way to the relationship between the Turkish letter dotless I and the Roman letter I. (Heck, the lowercase form is also dotless.) But lowercasing works on GREEK CAPITAL LETTER IOTA because it has a code point to call its own.

    Should iota have its own code point? The answer to that question is "no": it is, by definition, drawn identically to the ascii I. But Unicode has never followed its principles. This crops up again and again and again, everywhere you look. (And, in "defense" of Unicode, it has several principles that directly contradict each other.)

    Then people come to rely on behavior that only applies to certain buggy parts of Unicode, and get messed up by parts that don't share those particular bugs.
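
    The asymmetry can be seen in Java, where only the shared Latin code point is locale-sensitive (a small sketch):

```java
import java.util.Locale;

public class IotaVsI {
    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // GREEK CAPITAL LETTER IOTA has its own code point, so lowercasing
        // gives the same answer in every locale.
        System.out.println("\u0399".toLowerCase(turkish));      // ι
        System.out.println("\u0399".toLowerCase(Locale.ROOT));  // ι
        // LATIN CAPITAL LETTER I is shared with Turkish, so its lowercase
        // form depends on the locale.
        System.out.println("I".toLowerCase(turkish));           // ı
        System.out.println("I".toLowerCase(Locale.ROOT));       // i
    }
}
```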

    • layer8 a day ago

      It’s not a bug, it’s a feature. The reason is that ISO 8859-7 [0] used for Greek has a separate character code for Iota (for all greek letters, really), while ISO 8859-3 [1] and -9 [2] used for Turkish do not for the usual dotless uppercase I.

      One important goal of Unicode is to be able to convert from existing character sets to Unicode (and back) without having to know the language of the text that is being converted. If they had invented a separate code point for I in Turkish, then when converting text from those existing ISO character encodings, you’d have to know whether the text is Turkish or English or something else, to know which Unicode code point to map the source “I” into. That’s exactly what Unicode was designed to avoid.

      [0] https://en.wikipedia.org/wiki/ISO/IEC_8859-7

      [1] https://en.wikipedia.org/wiki/ISO/IEC_8859-3

      [2] https://en.wikipedia.org/wiki/ISO/IEC_8859-9

      • thaumasiotes 20 hours ago

        I know that. That's why I mentioned

        > in "defense" of Unicode, it has several principles that directly contradict each other

        Unicode wants to do several things, and they aren't mutually compatible. It is premised on the idea that you can be all things to all people.

        > It’s not a bug, it’s a feature.

        It is a bug. It directly violates Unicode's stated principles. It's also a feature, but that won't make it not a bug.

      • newpavlov 11 hours ago

        >If they had invented a separate code point for I in Turkish, then when converting text from those existing ISO character encodings, you’d have to know whether the text is Turkish or English or something else, to know which Unicode code point to map the source “I” into. That’s exactly what Unicode was designed to avoid.

        Great. So now we have to know the locale to handle case conversion for probably centuries to come, but it was totally worth it to save a bit of effort in the relatively short transition phase. /s

        • JuniperMesos 7 hours ago

          You always have to know locale to handle case conversion - this is not actually defined the same way in different human languages and it is a mistake to pretend it is.

          • newpavlov 7 hours ago

            In most cases the locale is encoded in the character itself, i.e. Latin "a" and Cyrillic "а" are two different characters, despite being visually indistinguishable in most fonts.

            The "language-sensitive" section of the special casing document [0] is extremely small and contains only the cases of stupid reuse of Latin I.

            [0]: https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing....

        • fhars 10 hours ago

          Without it, there would not have been a transition phase.

          • newpavlov 10 hours ago

            I call BS. Without a series of MAJOR blunders, Unicode was destined to succeed. Once the rest of the world had migrated to Unicode, I am more than certain that the Turks would've migrated as well. Yes, they might have complained for several years and spent a minuscule amount of resources adapting conversion software, but that's it; a decade or two later everyone would've forgotten about it.

            I believe that even addition of emojis was completely unnecessary despite the pressure from Japanese telecoms. Today's landscape of messengers only confirms that.

  • themafia a day ago

    I thought locale was mostly controlled by the environment, so you can run your system and each program with its own separate locale settings if you like.

    • silon42 14 hours ago

      I wish there were a single-letter universal locale with sane values, maybe called U or E, with:

      ISO (or RFC) date and time formats, UTF-8 by default (maybe also an alternative with ISO 8859-1), a decimal point in numbers and _ for thousands, metric measurements / A4 paper, ..., Unicode neutral collation,

      but keeping US English as the language.

  • fukka42 12 hours ago

    Just use English. If you want to program you need to learn it anyway to make sense of anything.

    I'm not a native English speaker btw. I learned it as I was learning programming as a kid 20 years ago

    • whynotmaybe 10 hours ago

      Yes and no. This only works if you don't create software that is used internationally.

      If you only work in English, you will test in English and miss use cases like the one described in the article.

      Did you know that many towns and streets in Canada have a ' in their name? And that many websites reject any ' in their text fields because they think it's SQL injection?

      • jen20 9 hours ago

        Ms O’Reilly would like a word about surname fields.

      • fukka42 9 hours ago

        My EU country does the same. Of course software should work for the locales you're targeting but that is different from the language used by developer tooling. The GP is talking about changing the locale of their development machine so I assume that's what they're referring to.

mikestew a day ago

When I saw "Turkish alphabet bug", I just knew it was some version of toLower() gone horribly wrong.

(I'm sure there's a good reason, but I find it odd that compiler message tags are invariably uppercase, yet in this problem code they lowercased the tag to look it up in an enum of lowercase names. Why isn't the enum uppercase, like the things you're going to look up?)

  • kevin_thibedeau a day ago

    With Turkish you can't safely case-fold with toupper() or tolower() in a C/US locale: i->I and I->i are both wrong. Uppercasing wouldn't work. You have to use Unicode or Latin-5 to manage it.
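
    In Java terms (a sketch; the locale-neutral Locale.ROOT plays the role of the C/US locale), the round trip silently loses the dotless letter:

```java
import java.util.Locale;

public class LossyRoundTrip {
    public static void main(String[] args) {
        // Turkish dotless ı, uppercased under locale-neutral rules, becomes plain I ...
        String up = "\u0131".toUpperCase(Locale.ROOT);
        System.out.println(up);                           // I
        // ... and lowercasing that gives dotted i, not the ı we started with.
        System.out.println(up.toLowerCase(Locale.ROOT));  // i
    }
}
```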

    • teo_zero 18 hours ago

      You misunderstood the parent post. They were suggesting looking up the exact string that ends up in the message, without any conversion. So if the message contains INFO, ERROR, etc., then look up "INFO", "ERROR", and so on.

    • oneshtein 18 hours ago

      It's a bug in the Turkish locale. They hacked the Latin alphabet instead of creating a separate letter with separate rules.

  • kokada 15 hours ago

    Without looking at the source code I think it is because the log functions are lowercase, but I am not sure this is the reason.

  • thaumasiotes a day ago

    > Why isn't the enum uppercase, like the things you're going to lookup?

    Another question: why does the log record the string you intended to look up, instead of the string you actually did look up?

sjrd a day ago

I am one of the maintainers of the Scala compiler, and this is one of the things that immediately jumps out at me when I review code containing any casing operation. Always explicitly specify the locale. However, unlike TFA and other comments, I don't suggest `Locale.US`. That's a little too US-centric. The canonical locale is in fact `Locale.ROOT`. Granted, in practice it's equivalent, but I find it a little bit more sensible.

Also, this is the last remaining major system-dependent default in Java. They made strict floating point the default in 17 and UTF-8 the default encoding in 18 (JEP 400); only the locale remains. I hope they make ROOT the default in an upcoming version.

FWIW, in the Scala.js implementation, we've been using UTF-8 and ROOT as the defaults forever.
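
A minimal Java sketch of the advice; `normalizeTag` is a hypothetical helper, and setting the default locale to Turkish simulates an affected user's machine:

```java
import java.util.Locale;

public class CanonicalCasing {
    // Machine-facing identifiers should be cased with a fixed, reproducible
    // locale, never with the user's default.
    static String normalizeTag(String tag) {
        return tag.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.forLanguageTag("tr-TR"));
        System.out.println("INFO".toLowerCase());  // "ınfo": broken lookup key
        System.out.println(normalizeTag("INFO"));  // "info": stable everywhere
    }
}
```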

  • mormegil 17 hours ago

    I agree that Locale.ROOT is the canonical choice. But in this case, Locale.US also makes sense: it isn't saying some abstract "US is the global default", it is saying "we know we are upcasing an English word".

    • fukka42 12 hours ago

      Wouldn't the British locale make more sense then?

  • JuniperMesos a day ago

    > However, unlike TFA and other comments, I don't suggest `Locale.US`. That's a little too US-centric. The canonical locale is in fact `Locale.ROOT`. Granted, in practice it's equivalent, but I find it a little bit more sensible.

    I have no idea what `Locale.ROOT` refers to, and I'd be worried that it's accidentally the same as the system locale or something, exactly the sort of thing that will unexpectedly change when a Turkish-speaker uses a computer or what have you.

    • layer8 a day ago

      > I'd be worried that it's accidentally the same as the system locale or something

      The API docs clearly specify that Locale.ROOT “is regarded as the base locale of all locales, and is used as the language/country neutral locale for the locale sensitive operations.”

    • troad 20 hours ago

      > However, unlike TFA and other comments, I don't suggest `Locale.US`. That's a little too US-centric. The canonical locale is in fact `Locale.ROOT`. Granted, in practice it's equivalent, but I find it a little bit more sensible.

      Isn't it kind of strange to say that Locale.US is too US-centric, and therefore we'll invent a new, fictitious locale, the contents of which are all the US defaults, but which we'll call "the base locale of all locales"? That somehow seems even more US-centric to me than just saying Locale.US.

      Setting the locale as Locale.US is at least comprehensible at a glance.

      • sjrd 20 hours ago

        I guess it's one way to look at it. I see it as: I want a reproducible locale, independent of the user's system. If I see US, I'm wondering if it was chosen to be English because the program was written in English. When I localize the program, should I make that locale configurable? ROOT communicates that it must not be configurable, and never dependent on the system.

      • Symbiote 15 hours ago

        I am surprised to find Java's Locale.ROOT is not American.

          DateFormat dateFormat = DateFormat.getDateInstance(DateFormat.DEFAULT, Locale.ROOT);
          System.out.println(dateFormat.format(new Date()));
        
          dateFormat = DateFormat.getTimeInstance(DateFormat.DEFAULT, Locale.ROOT);
          System.out.println(dateFormat.format(new Date()));
        
          NumberFormat numberFormatter = NumberFormat.getNumberInstance(Locale.ROOT);
          System.out.println(numberFormatter.format(12.34));
        
          NumberFormat currencyFormatter = NumberFormat.getCurrencyInstance(Locale.ROOT);
          System.out.println(currencyFormatter.format(12.34));
        
          2025 Oct 13
          10:12:42
          12.34
          ¤ 12.34
        
        Even POSIX C is less American than I expected, with a metric paper size and no currency symbol defined (¤ isn't in ASCII). Only the date format is American.

        • brainwad 11 hours ago

          That's not the American date format, either - which would be Oct 13 2025.

      • outadoc 14 hours ago

        I assume that Locale.ROOT will stay backwards-compatible, whereas theoretically Locale.US could change. What if it changes its currency in the future, for example, or its date format?

    • kevin_thibedeau a day ago

      It is a programming-language-agnostic equivalent of the POSIX C locale with Unicode enhancements.

jillesvangurp 16 hours ago

Interesting one. That, and relying on system character encodings, is a source of subtle bugs. I've been bitten by it many times in the past, e.g. with XML parsing. Modern Kotlin thankfully has very few (if any) places left where this can happen. Kotlin has parameters with default values, so anything that relies on a character encoding usually has an encoding parameter that defaults to UTF-8.

The bug here was in the default Java implementation that Kotlin uses on the JVM. On Kotlin/JS, both toLowerCase() and lowercase() do exactly the same thing. Also, the deprecation mechanism in Kotlin is kind of cool: the deprecated implementation is still there, and you can use a compiler flag to downgrade the error and keep using it.

  @Deprecated("Use lowercase() instead.", ReplaceWith("lowercase(Locale.getDefault())", "java.util.Locale"))
  @DeprecatedSinceKotlin(warningSince = "1.5", errorSince = "2.1")
  @kotlin.internal.InlineOnly
  public actual inline fun String.toLowerCase(): String = (this as java.lang.String).toLowerCase()

  /**
   * Returns a copy of this string converted to lower case using Unicode mapping rules of the invariant locale.
   *
   * This function supports one-to-many and many-to-one character mapping,
   * thus the length of the returned string can be different from the length of the original string.
   *
   * @sample samples.text.Strings.lowercase
   */
  @SinceKotlin("1.5")
  @kotlin.internal.InlineOnly
  public actual inline fun String.lowercase(): String = (this as java.lang.String).toLowerCase(Locale.ROOT)

johnyzee a day ago

Ugh, I've had the exact same problem in a Java project, which meant I had to go through thousands and thousands of lines of code and make sure that every 'toLowerCase()' on enum names included Locale.ENGLISH as a parameter.

As the article demonstrates, the error manifests in a completely inscrutable way. But once I saw the bug from a couple of users with Turkish sounding names, I zeroed in on it. And cursed a few times under my breath whoever messed up that character table so bad.

  • nradov a day ago

    Were you not using static analysis tools? All of the popular ones will warn about that issue with locales.

    • lucumo 18 hours ago

      They do. But a generic warning about locale-dependence doesn't really tell you that ASCII-strings will be broken. For nearly every purpose ASCII is the same in every locale. If you have a string that is guaranteed to be ASCII (like an enum constant is in most code styles), it's easy to think "not a problem here" and move on.
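The surprise lucumo describes is easy to reproduce on the JVM. A minimal sketch (note the two-argument Locale constructor is deprecated in recent JDKs in favor of Locale.of, but still works):

```java
import java.util.Locale;

public class TurkishLower {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        // Pure ASCII input, yet the result depends on the locale:
        System.out.println("INFO".toLowerCase(turkish));     // ınfo (U+0131, dotless)
        System.out.println("info".toUpperCase(turkish));     // İNFO (U+0130, dotted)
        System.out.println("INFO".toLowerCase(Locale.ROOT)); // info
    }
}
```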

zettabomb a day ago

I have always wondered why Turkey chose to Latinize in this way. I understand that the issue is having two similar vowels in Turkish, but not why they decided to invent the dotless I when other diacritics already existed. Ĭ Î Ï Í Ì Į Ĩ and almost certainly a dozen others would've worked, unless there was already some significance to the dot in Turkish that's not obvious.

  • jeroenhd a day ago

    Computers and localisation weren't relevant back in the early 20th century. The dotless i existed before the dotted i (as iota in Greek script). The European scholars who put an extra dot on the letter to make it stand out a bit more are as much to blame as the Turks for making the distinction between the different i-vowels clear.

    Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.

    • JuniperMesos a day ago

      It's not exactly programmers failing to take into account that not everybody writes in English - if that were the case, then it would simply be impossible to represent the Turkish lowercase-dotless and uppercase-dotted I at all. The actual problem is failing to take into account that operations on text strings that work in one language's writing system might not work the same way in a different language's writing system. There are a lot of languages in the world that use the Latin writing system, and even if you are personally a fluent speaker and writer of several of them, you might simply have not learned about Turkish's specific behavior with I.

    • jagrsw a day ago

      > that not everybody writes in English.

      I don't know... I understand the history and reasons for this capitalization behavior in Turkish, and my native language isn't English either - it had to use a lot of strange encodings before the introduction of UTF-8.

      But messing around with the capitalization of ASCII <= codepoint(127) is a risky business, in my opinion. These codepoints are explicitly named:

      "LATIN CAPITAL LETTER I" "LATIN SMALL LETTER I"

      and requiring them not to match exactly during capitalization/lowercasing sounds very risky.

    • troad 20 hours ago

      > Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.

      This bug is the exact opposite of that. The program would have worked fine had it used pure ASCII transforms (±0x20); it was the use of library functions that did in fact take Turkish into account that caused the problem.

      More broadly, this is not an easy issue to solve. If a Turkish programmer writes code, what is the expected behaviour for metaprogramming and compilers? Are the function names in English or Turkish? What about variables, object members, struct fields? You could have one variable name that references some government ID number using its native Turkish name, right next to another variable name that uses the English "ID". How does the compiler know what locale to use for which symbol?

      Boiling all of this down to 'just be more considerate' is not actually constructive or actionable.
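The "pure ASCII transform (±0x20)" troad mentions is tiny to write by hand. A sketch (the helper name asciiLower is made up here; this is not a standard-library function):

```java
public class AsciiLower {
    // ASCII-only lowercasing: shift A-Z by 0x20, leave every other
    // character (including İ, ı, and all non-ASCII) untouched.
    static String asciiLower(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 'A' && chars[i] <= 'Z') {
                chars[i] += 0x20;
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(asciiLower("INFO")); // info, in every locale
        System.out.println(asciiLower("İNFO")); // İnfo: the non-ASCII İ is left alone
    }
}
```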

      • jeroenhd 10 hours ago

        The issue is actually quite easy to solve by specifying a default locale for string operations when you are not dealing with user input. Whether you pick US or ROOT or Turkish as a default locale, all you need to do is make sure that your fancy metaprogramming tricks relying on strings-as-enums are all parsed the same way. Locale.ROOT for Java, InvariantCulture or ToUpperInvariant() for C#, you name it.

        The whole problem is that the compiler has no idea about the locale of any strings in the system, that's why it's on the programmer to specify them.

        Lowercasing/uppercasing a string takes an (infuriatingly) optional locale parameter, and the moment that gets involved, you should think twice before using it for anything other than user data processing. I would happily see Oracle deprecate all string operations lacking a locale in the next version of Java.
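The pattern described above, sketched in Java (the class and the Severity enum are made-up stand-ins for any string-as-identifier scheme):

```java
import java.util.Locale;

public class StableIdentifiers {
    // Hypothetical enum standing in for any machine-readable identifier.
    enum Severity { INFO, WARNING, ERROR }

    // Machine-readable identifiers get the invariant locale;
    // user-facing text would use the user's locale instead.
    static String toKey(Severity s) {
        return s.name().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (Severity s : Severity.values()) {
            System.out.println(toKey(s)); // info, warning, error in every locale
        }
    }
}
```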

        • troad 10 hours ago

          > actually quite easy to solve

          I cannot square your earlier assertion that we should be more mindful "that not everybody writes in English", with your current assertion that all code must only ever contain English, for simplicity's sake. Either is a cogent position on its own, just not both at the same time.

          This bug arose because the programmers made incorrect assumptions about the result of a case-changing operation. If you impose English case rules on Turkish symbol names, this exact bug would simply arise in reverse.

          More problematically, as I alluded to earlier, Turkish code may contain a mix of languages. It may, for example, be using a DSL to talk to a database with fields named in Turkish, as well as making calls to standard library functions named in English. Which half of the code is your proposed invariant locale going to break?

  • mrighele a day ago

    The issue is not the invention of the dotless I - it already existed. The issue is that they took a vowel pair, i/I, assigned the lowercase to one vowel and the uppercase to a different one, and invented whatever was left missing.

    It's like they decided that the uppercase of "a" is "E" and the uppercase of "e" is "A".

    • pinkmuffinere a day ago

      This is misleading, because it assumes that i/I naturally represent one vowel, which is just not the case. i/I represents one vowel in _English_, when written with a latin script. ̶I̶n̶ ̶f̶a̶c̶t̶ ̶e̶v̶e̶n̶ ̶t̶h̶i̶s̶ ̶i̶s̶n̶'̶t̶ ̶c̶o̶r̶r̶e̶c̶t̶,̶ ̶i̶/̶I̶ ̶r̶e̶p̶r̶e̶s̶e̶n̶t̶s̶ ̶o̶n̶e̶ ̶p̶h̶o̶n̶e̶m̶e̶,̶ ̶n̶o̶t̶ ̶o̶n̶e̶ ̶v̶o̶w̶e̶l̶.̶ <see troad's comment for correction>

      There is no reason to assume that the English representation is in general "correct", "standard", or even "first". The modern script for Turkish was adopted around the 1920's, so you could argue perhaps that most typewriters presented a standard that should have been followed. However, there was variation even between different typewriters, and I strongly suspect that typewriters weren't common in Turkey when the change was made.

      • troad a day ago

        > In fact even this isn't correct, i/I represents one phoneme, not one vowel.

        Not quite. In English, 'i' and 'I' are two allographs of one grapheme, corresponding to many phonemes, based on context. (Using linguistic definitions here, not compsci ones.) The 'i's in 'kit' and 'kite' stand for different phonemes, for example.

        > There is no reason to assume that the English representation is in general "correct", "standard", or even "first".

        Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.

        No one is saying Turkish cannot break from that convention - they can feel free to do anything they like - but the resulting issues are fairly predictable, and their adverse effects fall mainly on Turkish speakers in practice, not on the rest of us.

        • Muromec a day ago

          > but the resulting issues are fairly predictable, and their adverse effects fall mainly on Turkish speakers in practice, not on the rest of us.

          I don't think it's fair to call it predictable. When this convention was chosen, the problem of "what is the uppercase letter to I" was always bound to the context of language. Now it suddenly isn't. Shikata ga nai ("it can't be helped"). It wasn't even an explicit assumption that could be reflected upon, it was an implicit one that just happened.

        • pinkmuffinere a day ago

          > Not quite. In English, 'i' and 'I' are two allographs of one grapheme, corresponding to many phonemes, based on context. (Using linguistic definitions here, not compsci ones.) The 'i's in 'kit' and 'kite' stand for different phonemes, for example.

          You're right, apologies my linguistics is rusty and I was overconfident.

          > Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.

          I think my main argument is that the importance of standardizing to i/I was much less obvious in the 1920's. The benefits are obvious to us now, but I think we would be hard pressed to predict this outcome a-priori.

      • ginko a day ago

        >This is misleading, because it assumes that i/I naturally represent one vowel, which is just not the case.

        It does in literally any language using a latin alphabet other than Turkish.

        • pinkmuffinere a day ago

          This may be correct, I'd have to do a 'real' search, which I'm too lazy to do, lol sorry. However there are definitely other (non-latin) scripts that have either i or I, but for which i/I is not a correct pair. For example, greek has ι/Ι too.

    • ozgung a day ago

      Nope, we decided to do it the correct and logical way for our alphabet. Some glyphs are either dotted or dotless. So, we have Iı, İi, Oo, Öö, Uu, Üü, Cc, Çç, Ss and Şş. You see the Ii pair is actually the odd one in the series.

      Also, we don't have serifs in our I. It's just a straight line. So, it's not even related to your Ii pair in English. You can't dictate how we write our straight lines, can you?

      The root cause of the problem is in the implementation and standardization of the computer systems. Computers are originally designed only for English alphabet in mind. And patched to support other languages over time, poorly. Computers should obey the language rules, not the other way around.

      • oneshtein 18 hours ago

        Yep, but you decided to abuse Latin alphabet instead of creating your own code page with your own letters and with your own rules.

        • ozgung 15 hours ago

          We created our own letters and our own rules. In 1928, long before code pages and computers.

          The assumption that letters come in universal pairs is wrong. That assumption is the bug. You can’t assume that capitalization rules must be the same for every language implementing a specific alphabet. Those rules may change for every language. They do.

          And not just capitalization rules. Auto complete, for instance, should respect the language as well. You can’t “correct” a French word to an English word. Localization is not optional when dealing with text.

          • silon42 13 hours ago

            Do all the letters have separate Unicode codepoints (no reusing Latin ones)?

            • dropbear3 3 hours ago

              There are the following codepoints:

                  U+0049 I LATIN CAPITAL LETTER I
                  U+0069 i LATIN SMALL LETTER I
                  U+0130 İ LATIN CAPITAL LETTER I WITH DOT ABOVE
                  U+0131 ı LATIN SMALL LETTER DOTLESS I
              
              While the names of the first two don't explicitly state that they should be dotless and dotted, respectively, the Unicode standard section on the block containing those two [0] does contrast them with the dotted and dotless versions, at least implying that they should be rendered dotless and dotted, respectively.

              Unicode has historically been against adding a separate codepoint for every single language's orthography when the glyphs are (almost) identical to an existing one ("allographs"). Controversy arose when the consortium proposed considering Han characters, which do have language variants, to be allographs, which led to what is known as "Han unification".

              [0]: https://www.unicode.org/charts/PDF/U0000.pdf
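Java makes the same distinction visible at the API level: the char-level methods use only the locale-free simple mappings, while the String-level methods consult the locale. A sketch:

```java
import java.util.Locale;

public class DotCodepoints {
    public static void main(String[] args) {
        // The char-level API knows nothing about locales and uses the simple
        // Unicode mappings: both i (U+0069) and ı (U+0131) map to I (U+0049).
        System.out.println((int) Character.toUpperCase('i')); // 73, i.e. U+0049
        System.out.println((int) Character.toUpperCase('ı')); // 73, i.e. U+0049

        // The String-level API consults the locale:
        System.out.println("i".toUpperCase(new Locale("tr"))); // İ (U+0130)
        System.out.println("i".toUpperCase(Locale.ROOT));      // I (U+0049)
    }
}
```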

      • zettabomb a day ago

        >Also, we don't have serifs in our I.

        That depends on font.

        >So, it's not even related to your Ii pair in English.

        Modern Turkish uses the Latin script, of course it's related.

        >You can't dictate how we write our straight lines, can you.

        No, I can't, I just want to understand why the Turks decided to change this letter, and this letter only, from the rest of the standard Latin script/diacritics.

        • ozgung 13 hours ago

          > I just want to understand why the Turks decided to change this letter, and this letter only

          Because Turkish uses a phonetic alphabet suited to Turkish sounds, based on Latin letters. There are 8 vowels, in two subsets:

          AIOU and EİÖÜ.

          When you pair them up with zip(), the pairs are phonetically related sounds but totally different letters. Turkish also uses suffixes for everything, and vowels in these suffixes sometimes change between these two subgroups.

          This design lets me write any word uniquely and almost correctly using the Turkish alphabet.

          Dis dizayn lets mi rayt ani vörd yüniğkli end olmost koreğtkli yuzing dı törkiş alfabet.

          Ö is the dotted version of O. İ is the dotted version of I. Related but different. Their lowercase versions are logically (not by historical convention) öoiı. So we didn't just want to change I, and only I. We just added dots. Since there is no Oö pair in any language, our OoÖö vowels didn't get the same attention. Same for our Ğğ and Şş.

          I hope this answers the question.

      • thaumasiotes a day ago

        > Computers are originally designed only for English alphabet in mind.

        Computers are originally designed for no alphabet at all. They only have two symbols.

        ASCII is a set of operating codes that includes instructions to physically move different parts of a mechanical typewriter. It was already a mistake when it was used for computer displays.

        • JuniperMesos a day ago

          Note that ASCII stands for "American Standard Code for Information Interchange". There's no expectation here that this is a suitable code for any language other than English, the de-facto language of the United States of America.

          • anonymars 21 hours ago

            Does the situation change in Unicode?

    • steezeburger a day ago

      I don’t think that’s the right way to think about it. It’s not like they were Latinizing Turkish with ASCII in mind. They wanted a one-to-one mapping between letters and sounds. The dot versus no dot marks where in your mouth or throat the vowel is formed. They didn’t have this concept that capital I automatically pairs with lowercase i. The dot was always part of the letter itself. The reform wasn’t trying to fit existing Western conventions, it was trying to map the Turkish sounds to symbols.

      • LudwigNagasena 19 hours ago

        They switched from Arabic script to Latin script. They literally did latinize Turkish, but they ditched the convention of 1 to 1 correspondence between lowercase and uppercase letters that is invariant across all languages that use Latin script except for German script, Turkish script and its offspring Azerbaijani script.

        • steezeburger 23 minutes ago

          I was just saying they didn't do that with ASCII in mind. I was not saying they didn't Latinize.

        • cachius 10 hours ago

          > correspondence between lowercase and uppercase [not in] German script

          Where is it broken in German script? Do you mean small ß and capital ẞ?

          • LudwigNagasena 9 hours ago

            Yes, ẞ is an optional variant of ß, which is traditionally capitalized as SS.

    • okanat a day ago

      Not really. Turkish has a feature that is called "vowel harmony". You match suffixes you add to a word based on a category system: low pitch vs high pitch vowels where a,ı,o,u are low pitch and e,i,ö,ü are high pitch.

      Ö and ü were already borrowed from the German alphabet. The umlauted variants 'ö' and 'ü' have a similar effect on 'o' and 'u' respectively: they bring a back vowel to the front. See: https://en.wikipedia.org/wiki/Vowel . Similarly, removing the dots brings them back.

      Turkish already had i sound and its back variant which is a schwa-like sound: https://en.wikipedia.org/wiki/Close_back_unrounded_vowel . It has the same relation in IPA as 'ö' has to 'o' and 'ü' has to 'u'. Since the makers of the Turkish variant of Latin Alphabet had the rare chance of making a regular pronunciation system with the state of the language and since removing the dots had the effect of making a front vowel a back vowel, they simply copied this feature from ö and ü to i:

      Just remove the dots to make it a back vowel! Now we have ı.

      When it comes to capitalization, ö becomes Ö and ü becomes Ü. So it is only logical to make the capital of i İ and the lowercase of I ı.

      • ithkuil a day ago

        Yes it's hard to come up with a different capital than I unless you somehow can see into the future and foresee the advent of computers, which the Turkish alphabet reform predates.

        Of course the latin capital I is dotless because originally the lowercase latin "i" was also dotless. The dot has been added later to make text more legible.

      • thaumasiotes a day ago

        > low pitch vs high pitch vowels where a,ı,o,u are low pitch and e,i,ö,ü are high pitch.

        Does that reflect the Turkish terminology? Ordinarily you would call o and u "high" while a and e are "low". The distinction between o/u and ö/ü is the other dimension: o/u are "back" while ö/ü are "front".

        • selcuka a day ago

          > Does that reflect the Turkish terminology?

          Yes. The Turkish terms are "kalın ünlü" and "ince ünlü". They literally translate to "low pitch vowel"/"high pitch vowel" (or "thick vowel"/"thin vowel") in this context.

          There is a second vowel harmony rule [1] (called the lesser vowel harmony) that makes the distinction you pointed out. The letters a/e/ı/i are called flat vowels, and o/ö/u/ü are called round vowels.

          [1] https://georgiasomethingyouknowwhatever.wordpress.com/2015/0...

      • oneshtein 18 hours ago

        So, instead of adding two full letters, with proper upper case and lower case, you added two halves to hack the Latin alphabet. This is the bug.

  • nurettin a day ago

    There were actually three! i (as in th[i]s), î (as in ch[ee]se) and ı, which sounds nothing like the first two - something like the e in bag[e]l. I guess it sounded so different that it warranted such a drastic symbolic change.

    • ithkuil a day ago

      Turkish exhibits a vowel harmony system and uses diacritics on other vowels too and the choice to put "i" together with other front vowels like "ü" and "ö" and put "ı" together with back vowels like "u" and "o" is actually pretty elegant.

      The latinization reform of the Turkish language predates computers, and it was hard to foresee the woes that future generations would have with that choice.

  • ayhanfuat a day ago

    Except for the a/e pair, front and back vowels have dotted and dotless versions in Turkish: ı and i, o and ö, u and ü.

    • o11c a day ago

      In that case they should've used ï for consistency.

      • thaumasiotes 21 hours ago

        That would be the opposite of consistency; i is the front vowel and ı is the back one.

        Note that the vowel /i/ cannot umlaut, because it's already a front vowel. The ï you cite comes from French, where the two dots represent diaeresis rather than umlaut. When umlaut is a feature of your language, combining the notation like that isn't likely to be a good idea.

    • zettabomb a day ago

      Makes sense enough, but why not use i and ï to be consistent?

      • okanat a day ago

        Turkish i/İ sounds pretty similar to most of the European languages. Italian, French and German pronounce it pretty similar. Also removing umlauts from the other two vowels ö and ü to write o and u has the same effect as removing the dot from i. It is just consistent.

        • zettabomb a day ago

          No, what I mean is, o and u get an umlaut (two dots) to become ö and ü, but i doesn't get an umlaut, it's just a single dot from ı to i. Why not make it i and ï? That would be more consistent, in my opinion.

          • selcuka a day ago

            I guess the aim was to reuse as much of the standard Latin alphabet as possible.

            A better solution would have been to leave i/I as they are (similar to j/J), and introduce a new lowercase/uppercase letter pair for "ı", such as Iota (ɩ/Ɩ).

      • ayhanfuat a day ago

        This was shortly after the Turkish War of Independence. Illiteracy was quite high (estimated at over 85%) and the country was still being rebuilt. My guess is they did their best to represent all the sounds while creating a one to one mapping between sounds and letters but also not deviating too much from familiar forms. There were probably conflicting goals so inconsistencies were bound to happen.

Someone 16 hours ago

FTA: “Less than a week later, they had a fix ready: (source: GitHub)

    map[name] = "box${primitiveType.javaKeywordName.capitalize(Locale.US)}"
[…]

In September 2020, nearly a year after the coroutines bug had been fixed and forgotten

[…]

When they came to fix this issue, the Kotlin team weren’t leaving anything to chance. They scoured the entire compiler codebase for case-conversion operations—calls to capitalize(), decapitalize(), toLowerCase(), and toUpperCase()”

Bloody late, I would say. If something like this happened in OpenBSD, I think they would have done that, and more (the article doesn't mention tooling to detect the introduction of similar new bugs, or adding warnings to documentation), at the first spotting of such a bug.

How come no reviewer mentioned such things when the first fix was reviewed?

Also, why are they using Locale.US, and not Locale.ROOT (https://docs.oracle.com/javase/8/docs/api/java/util/Locale.h...)?

liquidpele a day ago

It’s always Turkish lol. That was our language of choice to QA anything… if it worked on that it would pretty much work on anything.

voidUpdate 15 hours ago

> If capitalize() was an ambiguous name, what should its replacement be called? Can you think of a name that describes the function’s behaviour more clearly?

In c#, setting every letter to its uppercase form is ToUpper, and I think capitalise is perfectly reasonable for setting the first character. I'm not sure I've ever referred to uppercasing a string as capitalising it

carstenhag a day ago

I was scrolling and scrolling, waiting for the author to mention the new methods, which of course every Android Dev had to migrate to at some point. And 99% of us probably thought how annoying this change is, even though it probably reduced the number of bugs for Turkish users :)

Unrelated, but a month ago I found a weird behaviour where in a kotlin scratch file, `List.isEmpty()` is always true. Questioned my sanity for at least an hour there... https://youtrack.jetbrains.com/issue/KTIJ-35551/

  • ajkjk a day ago

    well now I wanna know what's going on there!

emmelaich a day ago

Could have been worse --

    Ramazan Çalçoban sent his estranged wife Emine the text message:
    Zaten sen sıkışınca konuyu değiştiriyorsun.
    "Anyhow, whenever you can't answer an argument, you change the subject."

    Unfortunately, what she thought he wrote was:
    Zaten sen sikişınce konuyu değiştiriyorsun.
    "Anyhow, whenever they are fucking you, you change the subject."
This led to a fight in which the woman was stabbed and died and the man committed suicide in prison.

https://gizmodo.com/a-cellphones-missing-dot-kills-two-peopl...

  • esafak 8 hours ago

    > sikişınce

    The last one should be an i too :)

zahlman 13 hours ago

I knew from the headline that this would be the Turkish I thing, but I couldn't fathom why a compiler would care about case-folding. "I don't know Kotlin, but surely its syntax is case-sensitive like all the other commonly used languages nowadays?"

> The code is part of a class named CompilerOutputParser, and is responsible for reading XML files containing messages from the Kotlin compiler. Those files look something like this:

"Oh."

"... Seriously?"

As if I didn't hate XML enough already.

  • _ZeD_ 11 hours ago

    what do you propose to handle translation messages? how do you think they should map the compiler codes to human messages?

    • colejohnson66 11 hours ago

      .NET ResX localization generates a source file. So localized messages are just `Resources.TheKey` - a named property like anything else in the language. It also catches key renaming bugs because the code will fail to compile if you remove/rename a key without updating users.

    • zahlman 10 hours ago

      ... Just about anything else? The baseline expectation in other data formats is that keys are case-sensitive. (Also, that there are keys, specifically designed for lookup of the needed data, rather than tag names.)

  • James_K 9 hours ago

    >https://www.w3.org/TR/REC-xml/

    >match

    >[Definition: (Of strings or names:) Two strings or names being compared are identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed.

    I'm quite fond of XML myself, and this is not an issue in XML.

    • zahlman 6 hours ago

      I mean, yes, the standard says that parsing should be case sensitive (I also found https://stackoverflow.com/questions/7414747). But people parse it as if it were case insensitive all the time, in part thanks to tradition established by HTML. In this case, the XML looked like

        <MESSAGES>
          <INFO path="src/main/Kotlin/Example.kt" line="1" column="1">
            This is a message from the compiler about a line of code.
          </INFO>
        </MESSAGES>
      
      and other code would try to filter the messages for presentation. And it appears that even if their spec demanded uppercase tag names, they were case-folding them for lookup purposes (to map "INFO" to some constant like CompilerMessageSeverity.INFO, or something like that).
Dwedit 20 hours ago

In C# programming, you are able to specify a culture every time you call a function such as numbers <-> strings, or case conversion. Or you specify the "Invariant Culture", which is basically US English. But the default culture is still based on your system's locale, you need to explicitly name the invariant culture everywhere. Because it involves a lot of filling in parameters for many different functions, people often leave it out, then their code breaks on systems where "," is the decimal separator.

You can also change the default culture to the invariant culture and save all the headaches. Save the localized number conversion and such for situations where you actually need to interact with localized values.
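The JVM has the same trap for numbers. A sketch with java.text.NumberFormat showing both directions of the problem:

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class DecimalSeparators {
    public static void main(String[] args) throws ParseException {
        // Formatting: the same value renders differently per locale.
        System.out.println(NumberFormat.getInstance(Locale.GERMANY).format(3.14)); // 3,14
        System.out.println(NumberFormat.getInstance(Locale.US).format(3.14));      // 3.14

        // Parsing: the same text means different numbers per locale.
        String input = "3,14";
        System.out.println(NumberFormat.getInstance(Locale.GERMANY).parse(input)); // 3.14
        // In Locale.US the comma is a grouping separator, so this is 314:
        System.out.println(NumberFormat.getInstance(Locale.US).parse(input));      // 314
    }
}
```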

  • gf000 17 hours ago

    The same is true for Java/Kotlin (in this case at least). The problem is that there is a zero parameter version that implicitly depends on global state, so you may end up with the bug unless you were already familiar with the issue at hand - I think the same applies for C#.

    Though linters will routinely catch this particular issue FWIW.

phplovesong 16 hours ago

Wow this is bad. Even for a language like Java, using vanilla strings as some sort of enum-like value, and then going even further and downcasing them, is a 100% bug magnet waiting for the kaboom.

flexagoon a day ago

Wouldn't at least the first issue be solved by using Unicode case folding instead of lowercase? Python, for example, has separate .casefold() and .lower() methods, and AFAIK casefold would always turn I into i, and is much more appropriate for this use case.

  • rhdunn 15 hours ago

    There are 3 types of case folding:

    1. Simple one-to-one mappings -- E.g. `T` to `t`. These are typically the ones handled by `lower()` or similar methods as they work on single characters so can modify a string in place (the length of the string doesn't change).

    2. More complex one-to-many mappings -- E.g. German `ß` to `ss`. These are covered by functions like `casefold()`. You can't modify the string in place so the function needs to always write to a new string buffer.

    3. Locale-specific mappings -- This is what this bug is about. In Turkish `I` maps to `ı` whereas other languages/locales it maps to `i`. You can only implement this by passing the locale to the case function, irrespective of whether you are also doing (1) or (2).

    • zahlman 12 hours ago

      This is not quite right, at least for Python. .upper() and .lower() (and .casefold() as well) implement the default casing algorithms from the Unicode specification, which are one-to-many (but still locale-naive). Other languages, meanwhile, might well implement locale-aware mapping that defaults to the system locale rather than requiring a locale to be passed.

  • zahlman 12 hours ago

    Both .casefold() and .lower() in Python use the default Unicode casing algorithms. They're unicode-aware, but locale-naive. So .lower() also works for this purpose; the point of .casefold() is more about the intended semantics.

    See also: https://stackoverflow.com/questions/19030948 where someone sought the locale-sensitive behaviour.

esafak a day ago

Kotlin keywords should be assumed to be English.

  • pimlottc 21 hours ago

    Logging levels are not language keywords.

    • DamonHD 12 hours ago

      The implied locale of those logging levels is (US) English though. And so any recasing of them should be in that locale.

charcircuit a day ago

Everyone who has used Java has hit this before. Java really should force people to always specify the locale and get rid of the versions of the functions without locale parameters. There is so much hidden broken code out there.

  • Uvix a day ago

    That only helps if devs specify an invariant locale (ROOT for Java) where needed. In practice, I think you'll see devs blindly using the user's current locale, like it silently does today.

    • jeroenhd a day ago

      The invariant locale can't parse the numbers I enter (my locale uses comma as a decimal separator). More than a few applications will reject perfectly valid numbers. Intel's driver control panel was even so fucked up that I needed to change my locale to make it parse its own UI layout files.

      Defaulting to ROOT makes a lot of sense for internal constants, like in the example in this article, but defaulting to ROOT for everything just exposes the problems that caused Sun to use the user locale by default in the first place.

      • Uvix a day ago

        Agreed, there are cases where user locale is needed. So many so that I expect that to be devs’ default if required to specify, and that they won’t use ROOT where they should.

naniwaduni a day ago

A stark reminder that all operations on strings are wrong.

  • hshdhdhehd 17 hours ago

    And all code is operations on strings. (The code starts out as a string).

  • lifthrasiir a day ago

    Or that strings are not human texts.

    • oneshtein 18 hours ago

      Kotlin is not for humans.

immibis 13 hours ago

For a while when I made Minecraft mods, I had my test environment set to Turkish for this exact reason (there's some simple command-line parameter you can pass to the JVM). Half the other mods installed in this environment would have broken textures, but mine never did since I tested it.
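For testing this in-process rather than via JVM flags, Locale.setDefault gives the same effect as launching with the user.language/user.country system properties (-Duser.language=tr -Duser.country=TR). A sketch:

```java
import java.util.Locale;

public class TurkishTestHarness {
    public static void main(String[] args) {
        // Mirrors launching the JVM with -Duser.language=tr -Duser.country=TR.
        Locale.setDefault(new Locale("tr", "TR"));

        // Any locale-naive case conversion now exhibits the bug:
        System.out.println("INFO".toLowerCase()); // ınfo, not info
    }
}
```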

originHarbor1 15 hours ago

Every programmer learns about Turkish 'i' the hard way, usually at 3 AM in production.

hshdhdhehd 17 hours ago

Tldr. ToLowerCase is like converting a time to a string. For human display purposes only.

  • DenisM 7 hours ago

    It stops working when people ask for case-insensitive file names, as is their right.

darkhorn a day ago

Java; write once, run anywhere, except on Turkish Windows.