

An innocent question

At work I posed an innocent question about displaying large integers (like 12345678) with commas as thousands separators (like 12,345,678). As I found out, this was quite a loaded question.

My question had to do with implementing it in Python. I couldn't find a built-in method, so I wrote up this awful one-liner:

  >>> x = 2493085724309857243980
  >>> print ",".join(  [str(x/(10**i))[-3:] for i in range(3*10,1,-3) if x/(10**i)>0] + [str(x)[-3:]]  )

It works for numbers with up to ten commas, which is big, but the limit is hard-coded, so it is not portable. Nor does it work with floats (in fact, it breaks in a fantastic display of numbers).

I also assumed commas were appropriate. They're not always.
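
For comparison, here is a minimal sketch of a more general version, assuming Python 3 and integer input only. The name group_digits and its sep and size parameters are made up for illustration; they are not from any library. It has no ten-comma limit, and the separator is a parameter precisely because commas are not always appropriate.

```python
def group_digits(n, sep=",", size=3):
    """Insert sep between groups of `size` digits, counting from the right."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    groups = []
    while digits:
        # Peel off the rightmost group, then drop it from the remainder.
        groups.append(digits[-size:])
        digits = digits[:-size]
    # Groups were collected right-to-left, so reverse before joining.
    return sign + sep.join(reversed(groups))


print(group_digits(2493085724309857243980))
print(group_digits(12345678, sep="."))
```

Newer Pythons (2.7 and 3.1 onward) also accept a comma in the format spec, as in format(12345678, ","), though that hard-codes the comma just like my one-liner did.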


My question came back with this response:

  Doesn't python have a printf like function that is handled at the
  bytecode level? The best (simplest and efficient) way to do it is the
  interpreter/compiler level.

So I did some research, and found out how to do it. C's printf has a way of separating the thousands places with a comma when the ' (apostrophe) flag is applied to i, d, f, etc. (the flag is a POSIX extension rather than part of ISO C). Except Python's printf-style parser doesn't understand it!
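
A quick way to see the rejection for yourself, sketched for Python 3. The helper name try_apostrophe_flag is hypothetical; the exact wording of the error message may differ between Python versions, so the sketch only reports it rather than relying on it.

```python
def try_apostrophe_flag(n):
    """Return the formatted string, or the error text if the flag is rejected."""
    try:
        # The ' flag is what POSIX printf uses to request digit grouping.
        return "%'d" % n
    except ValueError as exc:
        return "ValueError: %s" % exc


print(try_apostrophe_flag(1234567))
```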

The real answer

After some research I found that since each country separates their long numbers differently (Europeans might write 1.234.567,89 while Americans write 1,234,567.89 -- that's why the central question of this text is "loaded"), UNIX provides "locales" to tune the standard output of various things. From locale(7):

   A  locale is a set of language and cultural rules.  These cover aspects
   such as language for messages, different  character  sets,  lexigraphic
   conventions,  etc.   A program needs to be able to determine its locale
   and act accordingly to be portable to different cultures.

Back to Python:

   >>> import locale
   >>> locale.format("%d", 3245452, 1)

Oops. It seems that the default locale has no thousands separator defined:

   >>> locale.localeconv()["thousands_sep"]

... so you have to switch to a locale that does:

   >>> locale.setlocale(locale.LC_NUMERIC, 'en_US.ISO8859-1')
   >>> locale.localeconv()["thousands_sep"]

Now things work:

   >>> locale.format("%d", 3245452, 1)
   >>> locale.format("%d", 324545278968968698, 1)
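
In current Python 3 the same call is spelled locale.format_string, since locale.format was deprecated and later removed. Below is a hedged sketch of the steps above: the helper name group_with_locale and the list of candidate locale names are my assumptions, and which locales are actually installed varies from system to system. If none of the candidates is available, it falls back to whatever locale is active, which in the plain "C" locale means no grouping at all.

```python
import locale


def group_with_locale(n, candidates=("en_US.UTF-8", "en_US.ISO8859-1", "en_US")):
    """Format integer n with the first candidate locale that is installed."""
    # Calling setlocale with no second argument only queries the current setting.
    saved = locale.setlocale(locale.LC_NUMERIC)
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_NUMERIC, name)
            except locale.Error:
                continue  # this locale is not installed; try the next one
            return locale.format_string("%d", n, grouping=True)
        # No candidate worked: format with the locale that was already active.
        return locale.format_string("%d", n, grouping=True)
    finally:
        # Put the numeric locale back so the rest of the program is unaffected.
        locale.setlocale(locale.LC_NUMERIC, saved)


print(group_with_locale(3245452))
```

Restoring the saved locale matters because setlocale changes process-wide state.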

Doing things in C is semantically identical:

   #include <stdio.h>
   #include <locale.h>
   int main(void)
   {
     int i = 1234567;
     /* Print i with default locale, C */
     printf("%'15d (%s)\n", i, setlocale(LC_NUMERIC, NULL));
     /* Switch locale for numerics, and print i */
     setlocale(LC_NUMERIC, "en_US.iso88591");
     printf("%'15d (%s)\n", i, setlocale(LC_NUMERIC, NULL));
     return 0;
   }

It outputs:

         1234567 (C)
       1,234,567 (en_US.iso88591)


On the topic of internationalization, Joel Spolsky has an essay worth reading: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

Updated 2004-12-02 01:36 EST

Contact: michalg at domain where domain is (more)