Change pattern for tOtherEnumerationValue
The current pattern for tOtherEnumerationValue is other:\w{2,}, i.e. other: followed by at least two word characters. In XML Schema regular expressions, \w matches "all characters except the set of "punctuation", "separator" and "other" characters". These classes are Unicode General Categories, which the Unicode Character Database assigns to every character (UnicodeData.txt lists all characters with their General Category).
If we look at the characters of the commonly used Windows-1252 character set, the following characters are included in \w:
- Letters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŒœŠšŸŽžƒˆ
- Numbers: 0123456789²³¹¼½¾
- Symbols: $+<=>^`|~¢£¤¥¦¨©¬®¯°±´¸×÷˜€™
In the same character set, the following characters are excluded from \w:
- Punctuation: !"#%&'()*,-./:;?@[]_{}¡§«¶·»¿–—‘’‚“”„†‡•…‰‹›
- Separators: space and no-break space
- Other: control characters and soft hyphen
\w also allows any other Unicode character that is not categorised as "punctuation", "separator" or "other". Here are some more examples of Unicode characters included in \w:
- Letters: ČĐIJɱΘИفफ़นኯ駱𝕱𝛘
- Marks: (diacritical marks that combine with other characters)
- Numbers: ٠١٢٣٤٥٦٧٨٩௦௧௨௩௪௫௬௭௮௯௰௱௲
- Symbols: ˂˃˄˅℃⇐⇑⇒⇓√∛∜🀉🅰🅱🎡🎢🎣🎤💯💰💱
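The inclusion and exclusion lists above can be reproduced with a short Python sketch. It does not use Python's own \w, which has different semantics (it accepts the underscore and rejects symbols); instead it checks the General Category of each character directly. The function names are hypothetical, and the result depends on the Unicode version bundled with the Python interpreter, so treat it as an approximation of what a schema validator does rather than a reference implementation.

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    # XSD \w: any character whose General Category is not
    # punctuation (P*), separator (Z*) or other (C*).
    return unicodedata.category(ch)[0] not in "PZC"


def matches_current_pattern(value: str) -> bool:
    # Emulates other:\w{2,}; XSD patterns must match the whole value.
    prefix = "other:"
    if not value.startswith(prefix):
        return False
    rest = value[len(prefix):]
    return len(rest) >= 2 and all(is_xsd_word_char(ch) for ch in rest)


for candidate in ["other:abc", "other:🎉💯", "other:my-value", "other:my_value"]:
    print(candidate, matches_current_pattern(candidate))
```

Under these assumptions the first two candidates are accepted and the last two are rejected, because the hyphen and the underscore both belong to punctuation categories (Pd and Pc).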
It seems a bit random to allow other:🥳🎉 but not values with hyphens or underscores. The railML documentation says "minimum two characters, white space not allowed", which would indicate the pattern other:\S{2,}, or more generally other:\P{Z}{2,}.
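The two patterns suggested by the documentation wording are close but not identical, which a small Python sketch can illustrate. It assumes the XML Schema definitions: \S excludes only space, tab, line feed and carriage return, while \P{Z} excludes every character in a separator category. Python's re module has no \P{Z}, so both checks are written by hand; the function names are hypothetical.

```python
import unicodedata

XSD_WHITESPACE = " \t\n\r"  # the four characters matched by XSD \s


def _suffix(value: str):
    # Returns the part after "other:", or None if the prefix is missing.
    prefix = "other:"
    return value[len(prefix):] if value.startswith(prefix) else None


def matches_non_whitespace(value: str) -> bool:
    # Emulates other:\S{2,}
    rest = _suffix(value)
    return (rest is not None and len(rest) >= 2
            and all(ch not in XSD_WHITESPACE for ch in rest))


def matches_non_separator(value: str) -> bool:
    # Emulates other:\P{Z}{2,}
    rest = _suffix(value)
    return (rest is not None and len(rest) >= 2
            and all(unicodedata.category(ch)[0] != "Z" for ch in rest))
```

Both functions accept other:my-value and reject a value containing a plain space. They differ on edge cases: a no-break space (category Zs) passes the \S variant but not the \P{Z} variant, while a tab (category Cc) passes \P{Z} but not \S.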
Recommended solution
- Change the pattern of tOtherEnumerationValue to other:[A-Za-z0-9\-_]{2,}
- Update documentation of previous versions (see related issues) to recommend using only letters A-Z, a-z and digits 0-9.
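The recommended pattern uses only ASCII ranges, so it can be tried out directly with Python's re module. This is a minimal sketch: fullmatch stands in for the implicit anchoring of XML Schema patterns, and the constant and function names are hypothetical.

```python
import re

# Recommended pattern: ASCII letters, digits, hyphen and underscore, at least two.
PROPOSED = re.compile(r"other:[A-Za-z0-9\-_]{2,}")


def matches_proposed_pattern(value: str) -> bool:
    # fullmatch requires the whole value to match, as an XSD pattern facet does.
    return PROPOSED.fullmatch(value) is not None


for candidate in ["other:my-value", "other:my_value", "other:🎉💯", "other:a"]:
    print(candidate, matches_proposed_pattern(candidate))
```

With this pattern, values containing hyphens or underscores are accepted, while emoji, non-ASCII letters, white space and single-character values are rejected.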
Related issues
Documentation updates in previous versions:
- #651 (closed) railML 3.1
- version2#485 (closed) railML 2.5
- #626 (closed) railML 3.2
- #625 (closed) railML 3.3