If you’re involved in UK government web work, you’re probably more familiar with the Welsh language than the average UK resident. The official guidelines state:
For services provided to the public in Wales, and with due regard to the Welsh Language Act, the Welsh and English languages must be treated on a basis of equality.
… although generally, people take an encouragingly pragmatic and relaxed view. Which is just as well.
The Welsh language has 28 letters – or possibly 29, depending on your stance on the letter ‘J’. (Er, what about Jones?) These include: ch, dd, ff, ng, ll, ph, rh and th. Yes, I know each of those is written with two characters, but each one counts as a single letter. Technically, a letter like this is known as a digraph.
And here’s the best bit. In a Welsh dictionary, words beginning with ‘ch’ don’t come between ‘cg’ and ‘ci’ (if there are any). Oh no. They would come after ‘cz’ (if there was a ‘z’, which there isn’t), since ‘ch’ is considered to be the letter between ‘c’ and ‘d’. Unless, that is, it really does mean the two letters as two distinct letters, rather than one. So for example, in the placename ‘Bangor’, the ‘n’ and ‘g’ are two letters, and not the digraph ‘ng’.
So what? Well, if you’re trying to write the specification for a bilingual English / Welsh database with presentation in alphabetical order, and you want to do it properly, you’re going to have to write one very clever sorting routine.
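To give a flavour of what that routine might look like, here is a minimal Python sketch rather than a finished implementation. It assumes the 29-letter alphabet order (with ‘j’ included, and ‘ng’ placed after ‘g’), ignores accents when comparing, and relies on a hand-maintained exceptions table for words like ‘Bangor’. The names `welsh_sort_key`, `welsh_letters` and `NOT_DIGRAPH` are my own inventions for illustration.

```python
import unicodedata

# Assumed alphabet order: digraphs are single letters, and 'ng' follows 'g'.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i", "j",
    "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th", "u", "w", "y",
]
RANK = {letter: position for position, letter in enumerate(WELSH_ALPHABET)}
DIGRAPHS = {letter for letter in WELSH_ALPHABET if len(letter) == 2}

# Hypothetical exceptions table: word -> character positions where a pair that
# looks like a digraph is really two separate letters. A real system would
# need a proper word list; this single entry is only for illustration.
NOT_DIGRAPH = {"bangor": {2}}


def strip_accents(text):
    """Drop combining marks so that accented vowels sort with the plain ones."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def welsh_letters(word):
    """Split a word into Welsh letters, treating digraphs as one letter."""
    plain = strip_accents(word.lower())
    split_at = NOT_DIGRAPH.get(plain, set())
    letters = []
    i = 0
    while i < len(plain):
        pair = plain[i:i + 2]
        if pair in DIGRAPHS and i not in split_at:
            letters.append(pair)
            i += 2
        else:
            letters.append(plain[i])
            i += 1
    return letters


def welsh_sort_key(word):
    """Map a word to a list of alphabet positions; unknown characters go last."""
    return [RANK.get(letter, len(WELSH_ALPHABET) + ord(letter[0]))
            for letter in welsh_letters(word)]


words = ["dafad", "chwarae", "cwm", "ci"]
print(sorted(words))                      # ['chwarae', 'ci', 'cwm', 'dafad']
print(sorted(words, key=welsh_sort_key))  # ['ci', 'cwm', 'chwarae', 'dafad']
```

The hard part is not the sorting itself but the exceptions table: nothing in the spelling tells you that the ‘ng’ in ‘Bangor’ is two letters, so that knowledge has to come from somewhere outside the code.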
In case the reason for my research is of any use to anyone else: the UK government web guidelines from 2002 say to use ‘Character set Latin 1’ (i.e. ISO-8859-1) for Welsh, even though this doesn’t contain the required w-circumflex and y-circumflex characters (ŵ and ŷ)… but this may have been because the all-encompassing UTF-8 was only properly adopted as an internet standard in late 2003. The BBC, National Assembly for Wales and Welsh nationalist political party Plaid (formerly Plaid Cymru) all currently use ISO-8859-1.
UTF-8 is the choice of the Welsh Language Board – which seems like the strongest possible endorsement. It is also used by Welsh language TV channel S4C, plus the Welsh versions of Google and Wicipedia (not a typo). The missing characters also appear in ISO-8859-14, also known as ‘Latin 8’, but this is rarely, if ever, used.
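If you want to check the character set claims for yourself, a quick Python sketch will do it. It simply tries to encode the circumflexed w and y in each of the three character sets mentioned here; `can_encode` is a helper name I made up, and the codec names are the standard aliases Python ships with.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


CIRCUMFLEX_LETTERS = "ŵŷŴŶ"

for encoding in ("iso-8859-1", "iso-8859-14", "utf-8"):
    print(encoding, can_encode(CIRCUMFLEX_LETTERS, encoding))
```

Only ISO-8859-1 should come back as False, which is the problem in a nutshell.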