Phone Numbers Data for Taiwan in OSM — Opening a Can of Worms
Posted by assanges on 30 March 2026 in English. Last updated on 1 April 2026. This article is also available in Taiwanese Mandarin.
OpenStreetMap’s collaborative nature is both its biggest strength and a source of persistent data-quality issues. With thousands of contributors independently adding phone tags to shops, restaurants, clinics, and government offices, each person tends to follow their own formatting style. For Taiwan, that means a database where the same country code can show up as +886, +886+, or +886(2), and a single city’s worth of phone numbers might span a dozen different conventions.
This post catalogues what we found when we scanned OSM elements across all six special municipalities and five additional counties and cities. We are also working on a normalizer to fix these issues.
The Scale of the Problem
Across eleven cached regions — all six special municipalities (臺北市, 新北市, 桃園市, 臺中市, 臺南市, 高雄市) plus 苗栗縣, 新竹市, 臺東縣, 連江縣, 金門縣 — we found 49,260 tags (phone or contact:phone) on 49,229 elements. After splitting multi-value fields on semicolons, that yields 50,643 individual phone number strings to classify.
| Format class | Count | Share |
|---|---|---|
| E.123 space (+886 2 1234 5678) | 41,842 | 82.6% |
| RFC 3966 dash (+886-2-1234-5678) | 6,655 | 13.1% |
| No separator (+886212345678) | 1,158 | 2.3% |
| Local format, no country code (02-1234-5678) | 854 | 1.7% |
| Corrupt/typo country code (+866 …, +886(2)…) | 92 | 0.2% |
| Other (wrong country, junk) | 42 | 0.1% |
Roughly 1 in 6 individual values (17.4%) deviates from the most common contributor convention, creating inconsistency that complicates deduplication, display, and machine parsing.
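To make the classification step concrete, here is a minimal Python sketch of how a tag could be split on semicolons and each value assigned to a coarse format class. The function names and regular expressions are illustrative assumptions, not the rules used for the survey; in particular, this sketch lumps corrupt country codes, mixed separators, and values with extensions together as `other`.

```python
import re

# Illustrative patterns only; they approximate the format classes in this post.
E123_SPACE = re.compile(r"^\+886( \d+)+$")      # +886 2 1234 5678
RFC3966_DASH = re.compile(r"^\+886(-\d+)+$")    # +886-2-1234-5678
NO_SEPARATOR = re.compile(r"^\+886\d+$")        # +886212345678
LOCAL = re.compile(r"^0\d[\d\s-]*$")            # 02-1234-5678


def split_values(tag_value: str) -> list[str]:
    """Split a multi-value phone tag on semicolons, dropping empty parts."""
    return [part.strip() for part in tag_value.split(";") if part.strip()]


def classify(value: str) -> str:
    """Assign one phone string to a coarse format class."""
    if E123_SPACE.match(value):
        return "e123_space"
    if RFC3966_DASH.match(value):
        return "rfc3966_dash"
    if NO_SEPARATOR.match(value):
        return "no_separator"
    if LOCAL.match(value):
        return "local"
    # Everything else: typo country codes, mixed separators, extensions, junk.
    return "other"
```

A full classifier would need a separate rule for each of the remaining classes; the point here is only that the two dominant conventions are easy to tell apart mechanically.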
Things went from bad to downright ridiculous.
The format split varies noticeably by region:
| Region | Region (ZH) | Tags | Values | E.123 | RFC 3966 | Other |
|---|---|---|---|---|---|---|
| TPE | 臺北市 ★ | 9,804 | 10,146 | 85% | 11% | 4% |
| NWT | 新北市 ★ | 13,963 | 14,395 | 85% | 11% | 4% |
| TAO | 桃園市 ★ | 4,177 | 4,277 | 83% | 12% | 5% |
| TXG | 臺中市 ★ | 8,065 | 8,282 | 73% | 24% | 4% |
| TNN | 臺南市 ★ | 4,246 | 4,322 | 84% | 11% | 6% |
| KNN | 高雄市 ★ | 5,168 | 5,262 | 84% | 10% | 5% |
| MIA | 苗栗縣 | 1,318 | 1,338 | 77% | 18% | 5% |
| HSZ | 新竹市 | 1,088 | 1,094 | 82% | 12% | 6% |
| TTT | 臺東縣 | 1,101 | 1,178 | 96% | 3% | 1% |
| LIE | 連江縣 | 95 | 103 | 98% | 0% | 2% |
| KMN | 金門縣 | 235 | 246 | 90% | 8% | 2% |
★ Special municipality. Taichung stands out with 24% RFC 3966 usage, roughly double the rate of any other special municipality, suggesting a dominant local editing pattern or tool default in that contributor community. The most consistent regions are 連江縣 and 臺東縣, at 98% and 96% E.123 respectively, possibly because their smaller contributor pools converge on informal norms more easily.
What “Correct” Means: Format Standards in OSM
Before diving into the issues, it’s worth clarifying what “correct” actually means in an OSM context.
The OSM wiki’s Key:phone page does not mandate a single format. It documents E.123 international notation, RFC 3966 (tel: URI dash notation), and NANP-style formatting without expressing a clear preference between them. In practice, E.123 space notation is the most commonly used by Taiwan contributors — which is why we use it as the normalisation target — but RFC 3966 dash notation is a legitimate alternative that the wiki explicitly acknowledges.
So the goal of normalization isn’t strict compliance with any one standard — it’s internal consistency: a dataset where everything follows the same convention is just much easier to work with than one that mixes three formats at random.
What Consistent Looks Like
For Taiwan, the most common contributor convention is E.123 space notation, followed by RFC 3966 dash notation (labelled NANP below because it resembles the hyphenated North American +1 style):
ITU E.123
----------------------------------------
+886 2 1234 5678 ← Taipei landline
+886 4 1234 5678 ← Taichung landline
+886 37 123 456 ← Miaoli landline (3-digit area code)
+886 89 123 456 ← Taitung landline (3-digit area code)
+886 9XX XXX XXX ← Mobile
+886 800 XXX XXX ← Toll-free (0800)
NANP
----------------------------------------
+886-2-1234-5678 ← Taipei landline
+886-4-1234-5678 ← Taichung landline
+886-37-123-456 ← Miaoli landline (3-digit area code)
+886-89-123-456 ← Taitung landline (3-digit area code)
+886-9XX-XXX-XXX ← Mobile
+886-800-XXX-XXX ← Toll-free (0800)
Multiple numbers separated by semicolons, no trailing semicolon:
ITU E.123
----------------------------------------
+886 2 8787 8787;+886 2 8787 8765
(or)
NANP
----------------------------------------
+886-2-8787-8787;+886-2-8787-8765
Both are acceptable normalised formats. The open question for the community is agreeing on one and applying it consistently to resolve the current mixing.
Not your average daily struggle
Our findings
Issue 1: Inconsistent Separators
The most common deviation is mixing hyphens and spaces. Both of these encode the same number:
+886 2 2181 2345 ← E.123 (space, most common in TW OSM data)
+886-2-2181-2345 ← RFC 3966 dash (legitimate, less common)
The real problem is mixing both within a single value, which belongs to neither convention:
+886 2 2873-6548 ← space after country code, dashes within
+886-2-28358739 ← dashes, then no grouping in subscriber number
We found 1,554 values that contain both spaces and hyphens in a single phone string — the worst of both worlds, unambiguously wrong under either standard.
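Detecting the mixed case is mechanical. The sketch below flags values that contain both a space and a hyphen and, assuming E.123 space notation is the chosen target, collapses every separator run into a single space. Both function names are hypothetical, and the sketch does not regroup digits: an ungrouped subscriber number stays ungrouped.

```python
import re


def has_mixed_separators(value: str) -> bool:
    """True when a single phone string contains both a space and a hyphen."""
    return " " in value and "-" in value


def unify_separators(value: str) -> str:
    """Replace every run of spaces and hyphens with one space."""
    return re.sub(r"[\s-]+", " ", value.strip())
```

For example, `unify_separators("+886 2 2873-6548")` gives `+886 2 2873 6548`, while `+886-2-28358739` becomes `+886 2 28358739` and still needs its subscriber digits grouped.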
Issue 2: Missing Country Code
Some contributors enter phone numbers the way they would dial them locally — without the +886 prefix:
02-2581-7780
02 8751 3227
0222346763
0921067050
OSM’s phone tag is meant to hold an internationally dialable number. A value like 02-2581-7780 is ambiguous outside Taiwan: consumers have no way to know which country’s area-code conventions apply. We found 854 such values, including mobile numbers entered as bare 09XXXXXXXX strings.
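As a rough illustration, a local-format value can be converted by dropping the trunk prefix 0 and prepending +886. This sketch assumes the caller already knows the number is Taiwanese (for example, because the element lies inside Taiwan), which is exactly the context a bare local number fails to carry on its own. The function name is hypothetical, and the output is an unspaced string that still needs grouping.

```python
import re


def add_country_code(local: str) -> str:
    """Turn a national-format number (leading 0) into +886 followed by digits.

    Assumes the number is Taiwanese; nothing in the string itself proves that.
    """
    digits = re.sub(r"[\s-]", "", local)
    if not re.fullmatch(r"0\d+", digits):
        raise ValueError(f"not a national-format number: {local!r}")
    # Drop the trunk prefix 0 and prepend the country code.
    return "+886" + digits[1:]
```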
Issue 3: No Separator After Country Code
A related variant omits any separator between the country code and the rest of the number:
+886288613257
+886228839850
These are syntactically valid in E.164 (the all-digits form used by telephony APIs) but fail most display validators and are unreadable as stored OSM data. We found 1,158 such values.
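Re-inserting separators requires knowing where the area code ends, so any formatter needs a prefix table. The sketch below is built on loud assumptions: the multi-digit area codes are only the ones mentioned in this post, every other landline is treated as having a one-digit area code, mobile and toll-free numbers get a three-digit prefix, and subscriber digits are grouped purely by length. It is not a complete model of the numbering plan, and all names are hypothetical.

```python
import re

# Multi-digit area codes mentioned in this post (without the trunk 0), longest
# first so that 826 wins over 82. Assumption: everything else is one digit.
LONG_AREA_CODES = ("826", "836", "37", "49", "82", "89")


def group_digits(digits: str) -> str:
    """Group subscriber digits by length: 8 -> 4+4, 6 or 7 -> 3+rest."""
    if len(digits) == 8:
        return f"{digits[:4]} {digits[4:]}"
    if len(digits) in (6, 7):
        return f"{digits[:3]} {digits[3:]}"
    return digits


def respace(value: str) -> str:
    """Insert E.123-style spaces into '+886' followed by bare digits."""
    # Assumed length: 8 or 9 national digits after the country code.
    match = re.fullmatch(r"\+886(\d{8,9})", value)
    if match is None:
        raise ValueError(f"expected +886 followed by 8 or 9 digits: {value!r}")
    national = match.group(1)
    if national.startswith("9") or national.startswith("800"):
        # Mobile and toll-free: three-digit prefix, then the grouped remainder.
        prefix = national[:3]
    else:
        prefix = next(
            (code for code in LONG_AREA_CODES if national.startswith(code)),
            national[:1],
        )
    return f"+886 {prefix} {group_digits(national[len(prefix):])}"
```

Under these assumptions, `respace("+886288613257")` yields `+886 2 8861 3257`. The prefix-matching rule is the fragile part: a table that is missing an area code will silently produce the wrong grouping.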
Issue 4: Corrupt or Malformed Country Codes
A small but non-trivial number of entries contain clear input errors:
+866 2 29126883 ← mistyped digit (866 instead of 886)
+886+2 2311 2940 ← extra plus sign
+886(2)28232410 ← parenthesised area code (North American style)
+886.2 2322 3477 ← dot as separator
+8886 2 8780 6278 ← extra digit in country code
+00886-2-23825234 ← international dialling prefix 00 prepended
We found 92 such values. These will silently fail in any phone-number parsing library that enforces ITU-T E.164 syntax.
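Most of these errors follow a handful of shapes, so a small table of prefix rewrites can propose corrections. The sketch below covers only the malformed prefixes listed above; the pattern list and function name are assumptions for illustration. Because rewriting a typo automatically risks false positives, the function returns a suggestion for human review rather than a guaranteed fix.

```python
import re
from typing import Optional

# (pattern, replacement) pairs covering only the malformed prefixes shown above.
PREFIX_FIXES = [
    # International dialling prefix 00 left in after the plus sign.
    (re.compile(r"^\+00886"), "+886"),
    # Mistyped country code followed by a non-digit.
    (re.compile(r"^\+(?:866|8886)(?=\D)"), "+886"),
    # A plus, dot, or opening parenthesis right after the country code.
    (re.compile(r"^\+886[+.(]\s*(\d+)\)?\s*"), r"+886 \1 "),
]


def suggest_prefix_fix(value: str) -> Optional[str]:
    """Return a suggested correction, or None when no known pattern matches."""
    for pattern, replacement in PREFIX_FIXES:
        fixed, count = pattern.subn(replacement, value, count=1)
        if count:
            return fixed.strip()
    return None
```

The suggestions only repair the country-code prefix; the remaining digits keep whatever grouping the contributor typed.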
Issue 5: Duplicate Entries in Multi-Value Fields
OSM supports multiple phone numbers for one element using semicolons. We found 1,320 multi-value tags across the dataset. Of those, 24 contain duplicate entries — the same number appearing more than once:
+886 2 2916 0300;+886 2 2916 0300
+886 89 862 326;+886 89 862 326;+886 89 862 326
This suggests copy-paste mistakes during editing. While harmless individually, duplicates inflate the apparent number of contact options and can confuse machine consumers.
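Removing exact duplicates is straightforward. The following sketch (function name hypothetical) keeps the first occurrence of each number and preserves order. It compares strings literally, so the same number written in two different formats would only be caught after both have been normalised.

```python
def dedupe_values(tag_value: str) -> str:
    """Drop repeated numbers from a semicolon-separated tag, keeping order."""
    seen = set()
    unique = []
    for part in tag_value.split(";"):
        number = part.strip()
        if number and number not in seen:
            seen.add(number)
            unique.append(number)
    return ";".join(unique)
```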
Issue 6: Extension Numbers — a Format Wild West
You are the one accountable, Raiden!! (via @M4HCHE3ZY on X (formerly Twitter))
Beyond the main number itself, 635+ values encode an extension, using at least five different conventions found in the data:
| Convention | Example | Count |
|---|---|---|
| Hash # | +886 2 2536 3001#8653 | 572 |
| Tilde ~ | +886 2 2368 0031~2 | 26 |
| ext. / ext | +886 2 2741 5991 ext.21 | 30 |
| Chinese 分機 | +886 4 2528 5394分機6000 | 7 |
| Comma , (iOS) | +886 2 2938 2300,630 | ~1+ |
Detecting extensions is tractable
As community members pointed out, a simple rule works: after the leading +, any character that is not a digit, space, or hyphen ([^\s\d-]) can be treated as the start of the extension suffix. This is essentially what our normalizer does: split at the first such character, normalize the base number, then reattach the suffix verbatim.
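A minimal sketch of that split rule follows. It is not the normalizer's actual code; the function name is hypothetical, and the search skips a leading + because the plus sign would otherwise match the pattern itself. Whitespace in front of the extension marker is kept in the suffix, so base and suffix can be joined back into the original string.

```python
import re

# First character that is not a digit, whitespace, or hyphen.
EXTENSION_START = re.compile(r"[^\s\d-]")


def split_extension(value: str) -> tuple[str, str]:
    """Split a phone string into (base number, extension suffix)."""
    value = value.strip()
    # Start searching after a leading '+', which is part of the base number.
    offset = 1 if value.startswith("+") else 0
    match = EXTENSION_START.search(value, offset)
    if match is None:
        return value, ""
    base = value[: match.start()].rstrip()
    # The suffix keeps any whitespace that preceded the marker.
    return base, value[len(base):]
```

For instance, `split_extension("+886 2 2741 5991 ext.21")` returns the base `+886 2 2741 5991` and the suffix ` ext.21`, leading space included.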
Encoding extensions is where it breaks down
The OSM wiki’s Key:phone#Extensions page currently documents three different conventions without picking one, which is itself a signal of how unresolved this is.
E.123 specifies ext as the separator. It was standardised in the printed-directory era — ext 8653 is readable on a business card, but apps do not reliably parse it. There is no DTMF interpretation; the extension string is purely informational.
Apple iOS (and macOS Contacts) stores extensions using a comma , as a pause-and-dial separator: +886-2-2938-2300,630. The comma signals the dialler to wait for the call to connect, then send the remaining digits as DTMF tones — so 630 is dialled automatically after the main number picks up. This is practical on-device behaviour, but it creates two distinct problems in OSM data:
- Ambiguity with multi-value separators. OSM uses ; to separate multiple phone numbers in a single tag. Comma has no such defined role in OSM, so an iOS-style value like +886 2 2938 2300,630 is likely to be misread as a single malformed number rather than a number-plus-extension. We found 16 values with commas in the dataset; most are multi-value numbers incorrectly separated by , instead of ;, but at least one appears to be a genuine iOS-exported extension.
- Non-portability. A comma-encoded extension is only meaningful to a DTMF-capable dialler. It conveys no human-readable information and is invisible to any parser that does not understand the pause-dial convention.
libphonenumber detects extensions across many separators (#, ext, x, ,, etc.), but how it writes the extension back out depends on the output format requested, so it does not settle the question of a canonical form for OSM either.
RFC 3966 (tel: URI) is the most formally specified option — it uses ;ext=NNN. But RFC 3966 extensions create a structural conflict with OSM’s data model that is worth spelling out in full.
The RFC 3966 semicolon conflict
OSM uses the semicolon ; as the multi-value separator for phone tags:
+886 2 1234 5678;+886 2 8765 4321 ← two phone numbers, standard OSM
RFC 3966’s extension syntax also uses a semicolon as a parameter delimiter:
tel:+886-2-1234-5678;ext=8653 ← RFC 3966 with extension
If a contributor stores this in an OSM tag, any OSM editor or data consumer that naively splits on ; will interpret it as two values: tel:+886-2-1234-5678 and ext=8653. The extension becomes a phantom second phone number — one that is not a phone number at all.
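The failure is easy to reproduce. The sketch below imitates a consumer that splits on semicolons without any knowledge of RFC 3966 parameters; the function name and the "looks like a number" check are illustrative assumptions, not any particular editor's behaviour.

```python
def naive_split(tag_value: str) -> list[str]:
    """Split a phone tag the way a simple OSM consumer might."""
    return [part.strip() for part in tag_value.split(";")]


for part in naive_split("tel:+886-2-1234-5678;ext=8653"):
    # A plausible phone value starts with '+' or 'tel:+'; 'ext=8653' does not.
    looks_like_number = part.startswith(("+", "tel:+"))
    print(f"{part!r}: looks like a number = {looks_like_number}")
```

The second element, `ext=8653`, comes out as a standalone value, which is the phantom entry described above.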
The obvious workaround is to escape the semicolon as \;, a convention some OSM tags use for literal semicolons inside values. But this creates its own problems:
- OSM editors do not consistently honour \; escaping; many will still split on it or display it literally.
- RFC 3966 parsers expect a raw ; as the parameter delimiter. A backslash-escaped \;ext=8653 is not valid RFC 3966 and will not be parsed correctly by any compliant tel: URI parser.
- Machine readability is not improved: a consumer now needs to know both OSM’s backslash-escaping convention and RFC 3966’s parameter syntax, and reconcile the two. It adds encoding complexity without giving any parser a clean path to the extension digits.
The backslash escape is a leaky workaround that satisfies neither standard fully. It is, in effect, a third encoding layered on top of two already-conflicting ones.
The result is that RFC 3966 extension notation is structurally incompatible with OSM’s semicolon-as-multi-value convention, with no clean resolution available today. For this reason, our normalizer preserves extension suffixes as-is rather than attempting to rewrite them into any standard form.
A Note on E.123 and Machine Readability
Here’s something worth keeping in mind: even a perfectly normalised E.123 phone tag isn’t as machine-friendly as it looks.
E.123 was standardised by the ITU-T in 1988 — when the primary medium for a phone number was a business card, letterhead, or printed directory. The spaces in +886 2 1234 5678 are visual grouping aids for human readers, not semantic tokens. A parser encountering that string has to strip the spaces, infer the country code, and figure out the area code boundary — all heuristically.
RFC 3966’s tel:+886-2-1234-5678 is marginally more structured (hyphens as explicit separators, a URI scheme that signals “this is a phone number”), but still requires a real parser to interpret the digit groups. The truly machine-readable form is E.164 — +886212345678, all digits, no punctuation — which is what telephony APIs and databases actually want. None of these is what OSM stores by default.
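Going from either human-oriented notation to E.164 is the easy direction, since it only removes information. A rough sketch (function name hypothetical) that strips an optional tel: scheme plus spaces and hyphens, and checks the result against the E.164 limit of 15 digits:

```python
import re


def to_e164(value: str) -> str:
    """Strip visual separators from an E.123 or RFC 3966 style number."""
    body = value.strip()
    if body.startswith("tel:"):
        body = body[len("tel:"):]
    digits = re.sub(r"[\s-]", "", body)
    # E.164: a plus sign followed by at most 15 digits.
    if not re.fullmatch(r"\+\d{1,15}", digits):
        raise ValueError(f"cannot reduce to E.164: {value!r}")
    return digits
```

The reverse direction, choosing where the spaces go, is the part that needs numbering-plan metadata. This sketch also rejects values that carry an extension, because E.164 has no place for one.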
This tension is fundamental: OSM’s phone tag is human-oriented. Normalization to E.123 is about making data consistent and editable by contributors, not about producing a format that apps can blindly ingest without parsing. The downstream app still needs a library like libphonenumber to do the real work — which is exactly why that library’s correctness for Taiwan’s edge-case area codes matters as much as it does.
A Note on Unexpected Area Code Grouping by google/libphonenumber
This one is subtle. Google’s libphonenumber — the standard library used by virtually every phone-number parser — groups some Taiwanese area codes differently than how they appear in local usage.
Taiwan assigns 3-digit and 4-digit area codes to several regions. libphonenumber’s metadata appears to represent these as extensions of their 2-digit neighbours, producing a different grouping than what locals would recognise:
| Dialled | libphonenumber output | Expected output (E.123) |
|---|---|---|
| 037-123-456 | +886 3 7123 456 | +886 37 123 456 |
| 049-123-4567 | +886 4 9123 4567 | +886 49 123 4567 |
| 082-123-456 | +886 8 2123 456 | +886 82 123 456 |
| 0826-12345 | +886 8 26123 45 | +886 826 12345 |
| 0836-12345 | +886 8 36123 45 | +886 836 12345 |
| 089-123-456 | +886 8 9123 456 | +886 89 123 456 |
Affected regions: Miaoli (037), Nantou (049), Kinmen (082), Wuqiu (0826), Matsu/Lienchiang (0836), and Taitung (089).
This means that even phone numbers already stored in +886 X XXXX XXXX form may carry a different digit grouping if they were entered via a tool backed by libphonenumber. The grouping we use here follows the National Numbering Plan and official government contact listings — though it’s worth noting this may be an intentional design choice in libphonenumber’s metadata.
See also:
- Issue Tracker: Unexpected formatting of the TW numbers with 3/4-digit area codes
- 公眾電信網路號碼計畫 (Public Telecommunication Network Numbering Plan, Chinese only) [PDF]
- TG Group Chat
About OpenStreetMap Taiwan Community
The OpenStreetMap Taiwan Community (OSMTW) is made up of enthusiastic mappers interested in Taiwan. Since 2010, OSMTW has evolved from a small gathering of individuals into a vibrant local community, welcoming more people to collaborate on mapping projects. OSMTW now co-hosts monthly meetups and occasional expeditions with the local Wikidata community in Taipei. Come by, check it out, and join us!
Ends/ Mon, 30 March 2026
Issued at NST 16:15
Last updated at Wed, 1 Apr 2026 NST 12:00
NNNN
Discussion
Comment from bryceco on 31 March 2026 at 03:47
This is worth reading whether you map in Taiwan or anywhere else in the world. Thanks for posting!