<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://blog.gmane.org/gmane.comp.python.tutor">
    <title>gmane.comp.python.tutor</title>
    <link>http://blog.gmane.org/gmane.comp.python.tutor</link>
    <description/>
    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>1</syn:updateFrequency>
    <syn:updateBase>1901-01-01T00:00+00:00</syn:updateBase>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81652"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81651"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81650"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81649"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81648"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81647"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81646"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81645"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81644"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81643"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81642"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81641"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81640"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81639"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81638"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81637"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81636"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81635"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81634"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.python.tutor/81633"/>
      </rdf:Seq>
    </items>
    <image rdf:resource="http://gmane.org/img/gmane-25t.png"/>
    <textinput rdf:resource=""/>
  </channel>
  <image rdf:about="http://gmane.org/img/gmane-25t.png">
    <title>Gmane</title>
    <url>http://gmane.org/img/gmane-25t.png</url>
    <link>http://gmane.org</link>
  </image>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81652">
    <title>Re: Python web script to run a command line expression</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81652</link>
    <description>&lt;pre&gt;
I'm not sure if this is what you are looking for or if this will work on 
WAMP but python has a virtual terminal emulator called Vte or 
python-vte. I use it to display the terminal and run commands.
I'm using it on Linux by adding "from gi.repository import Vte".
Hope it helps.



On 18-05-2013 04:20, Ahmet Anil Dindar wrote:

_______________________________________________
Tutor maillist  -  Tutor&amp;lt; at &amp;gt;python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
&lt;/pre&gt;</description>
    <dc:creator>William Ranvaud</dc:creator>
    <dc:date>2013-05-18T20:23:04</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81651">
    <title>model methods in Django</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81651</link>
    <description>&lt;pre&gt; I'm following the official docs, and after learning Python I'm sure of
how methods work, but the model example in the beginners' guide has me
really confused.

The model definition is omitted, but can anyone explain how this method
(was_published_recently) is given these attributes:

class Poll(models.Model):
    # ...
    def was_published_recently(self):
        return self.pub_date &amp;gt;= timezone.now() - datetime.timedelta(days=1)
    was_published_recently.admin_order_field = 'pub_date'
    was_published_recently.boolean = True
    was_published_recently.short_description = 'Published recently?'

Are the names of the attributes already attached to these
functions/methods, or are they being created on the fly with whatever
name you want? As I am unable to comprehend what is going on, I don't
really have a clue as to what each definition is doing and how it
affects the model. Even after reading this section of the docs over
and over again, I'm still lost.
&lt;/pre&gt;</description>
    <dc:creator>Matthew Ngaha</dc:creator>
    <dc:date>2013-05-18T19:16:27</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81650">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81650</link>
    <description>&lt;pre&gt;
Yes, str() in 2.x uses the locale predicates from &amp;lt;ctype.h&amp;gt;:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/ctype.h.html

However, 2.x bytearray uses the bytes_methods from 3.x, which use pyctype:

2.7.5 source:
http://hg.python.org/cpython/file/ab05e7dd2788/Include/pyctype.h
http://hg.python.org/cpython/file/ab05e7dd2788/Python/pyctype.c
http://hg.python.org/cpython/file/ab05e7dd2788/Include/bytes_methods.h
http://hg.python.org/cpython/file/ab05e7dd2788/Objects/stringlib/ctype.h

Note that the table in pyctype.c is only defined for ASCII.


Here's a non-sick example. A system in the US might customize
LC_MEASUREMENT to use SI units and LC_TIME to have Monday as the first
day of the week.


The re module has the re.L flag to enable limited locale support. It
only affects the alphanumeric category and word boundaries. You're
probably better off using re.U and the Unicode database.
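
A small sketch of the difference, assuming Python 3 (where str patterns are already Unicode-aware and re.ASCII switches that off). The sample text is made up, and ascii() is used only so the output prints on any terminal:

```python
import re

text = "na\u00efve caf\u00e9"   # "naive cafe" with accented letters

# Unicode-aware matching: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text, re.UNICODE)

# ASCII-only matching: accented letters split the words apart.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(ascii(unicode_words))
print(ascii(ascii_words))
```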


It's 2 bytes, not one. If you use a non-BMP \U escape on a narrow
build it creates a surrogate pair.  Each surrogate has a 10-bit range
in a 2-byte code. The lead surrogate is in the range 0xD800-0xDBFF,
and the trail is in the range 0xDC00-0xDFFF.
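
Here is a rough sketch of how a pair is built, assuming Python 3. The helper name to_surrogate_pair is made up for illustration, and U+1F600 is just an arbitrary non-BMP code point:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point (U+10000 to U+10FFFF) into UTF-16 surrogates."""
    offset = code_point - 0x10000                  # a 20-bit number
    high_bits, low_bits = divmod(offset, 0x400)    # upper and lower 10 bits
    lead = 0xD800 + high_bits
    trail = 0xDC00 + low_bits
    return lead, trail

lead, trail = to_surrogate_pair(0x1F600)
print(hex(lead), hex(trail))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex())
```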
&lt;/pre&gt;</description>
    <dc:creator>eryksun</dc:creator>
    <dc:date>2013-05-18T19:15:07</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81649">
    <title>Re: Retrieving data from a web site</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81649</link>
    <description>&lt;pre&gt;Hi

Just a minor observation:

On 18 May 2013 13:44, Peter Otten &amp;lt;__peter__&amp;lt; at &amp;gt;web.de&amp;gt; wrote:



You don't need javascript in this case, assuming the reference is to the
UK lotto. A simple curl test confirms that (for the UK lottery at least)
the numbers can be retrieved without the involvement of javascript,
so Python will be able to do the same. (URL:
https://www.national-lottery.co.uk/player/p/results.ftl  Apologies if this
is about some other lottery and I've missed it...)

Best,

Walter
&lt;/pre&gt;</description>
    <dc:creator>Walter Prins</dc:creator>
    <dc:date>2013-05-18T17:48:53</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81648">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81648</link>
    <description>&lt;pre&gt;

 
Thanks for the links. Without examples it remains pretty abstract, but I think I know what is meant by this locale category now: "The LC_CTYPE category shall define character classification, case conversion, and other character attributes." So if you switch from one locale to another, certain attributes of a character set might change. A switch from locale A to locale B might affect an attribute such as "casing"; therefore, the mapping from lower- to uppercase *might* differ by locale. In stupid country X, "a".upper() may return "B".

It seems that the result of str.isalpha() and str.isdigit() *might* be different depending on the setting of locale.LC_CTYPE.

It is pretty sick that all these things can be adjusted separately (what is the use of having Danish collation, Russian case conversion, English decimal sign, and Japanese codepage? ;-)

 



That one is the clearest IMHO. Oh no, now I see the possible impact on regexes. The meaning of e.g. "\s+"
might change depending on the locale.LC_CTYPE setting!!



That is a nice way of putting it. So if you slice a multibyte char "mb", mb[0] will return the first byte? That is annoying.

&lt;/pre&gt;</description>
    <dc:creator>Albert-Jan Roskam</dc:creator>
    <dc:date>2013-05-18T16:45:32</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81647">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81647</link>
    <description>&lt;pre&gt;
Each surrogate in a UTF-16 surrogate pair is 10 bits, for a total of
20-bits. Thus UTF-16 sets the upper bound on the number of code points
at 2**20 + 2**16 (BMP). UTF-8 only needs 4 bytes for this number of
codes.
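
The arithmetic can be checked in a couple of lines (a quick sketch, assuming Python 3):

```python
surrogate_combinations = 2**10 * 2**10   # lead x trail, 20 bits in total
bmp = 2**16
total = surrogate_combinations + bmp

# Number of code points reachable via UTF-16, U+0000 to U+10FFFF inclusive.
print(total, total == 0x10FFFF + 1)

# Bytes UTF-8 needs for the highest code point.
utf8_len = len(chr(0x10FFFF).encode("utf-8"))
print(utf8_len)
```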


LC_CTYPE is the locale category that classifies characters. In Debian
Linux, the English-language locales copy LC_CTYPE from the i18n
(internationalization) locale:

short: http://goo.gl/Hs8RD
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/locales/i18n?view=markup

Here's the mapping between the symbolic Unicode names in the latter
(e.g. &amp;lt;U0020&amp;gt;) and UTF-8:

short: http://goo.gl/cZ3dS
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/charmaps/UTF-8?view=markup

The i18n locale is defined by the ISO/IEC technical report 14652, as
an instance of an upward compatible extension to the POSIX locale
specification called the FDCC-set (i.e. Set of Formal Definitions of
Cultural Conventions). Here it is in all its glory, if you like
reading technical reports:

http://www.open-std.org/jtc1/sc22/wg20/docs/n972-14652ft.pdf

If that's not enough, here's the POSIX 1003.1 locale spec:

short: http://goo.gl/aOJUx
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html


Narrow builds create UTF-16 surrogate pairs from \U literals, but
these aren't treated as an atomic unit for slicing, iteration, or
string length.
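
Narrow builds no longer exist from Python 3.3 on, but the effect can still be seen by counting UTF-16 code units by hand. A small sketch, assuming Python 3 and an arbitrary non-BMP character:

```python
ch = "\U0001F600"

utf16 = ch.encode("utf-16-le")   # little-endian, no BOM
code_units = len(utf16) // 2     # each UTF-16 code unit is two bytes

print(len(ch))       # one code point on Python 3.3 and later
print(code_units)    # two code units, which is what a narrow build counted
```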
&lt;/pre&gt;</description>
    <dc:creator>eryksun</dc:creator>
    <dc:date>2013-05-18T14:23:45</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81646">
    <title>Re: Retrieving data from a web site</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81646</link>
    <description>&lt;pre&gt;

You can use a tool like lxml that "understands" html (though in this case 
you'd need a javascript parser on top of that) -- or hack something together 
with string methods or regular expressions. For example:

import urllib2
import json

s = urllib2.urlopen("http://*********/goldencasket").read()
s = s.partition("latestResults_productResults")[2].lstrip(" =")
s = s.partition(";")[0]
data = json.loads(s)
lotto = data["GoldLottoSaturday"]

print lotto["drawDayDateNumber"]
print map(int, lotto["primaryNumbers"])
print map(int, lotto["secondaryNumbers"])

While this is brittle, I've found that doing it "right" is usually not 
worthwhile as it won't survive the next website redesign either.

PS: &amp;lt;http://*********/goldencasket/results/download-results&amp;gt;
has links to zipped csv files with the results. Downloading, inflating and 
reading these should be the simplest and best way to get your data.
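
For anyone who wants to try the string-method approach without hitting the site, here is a sketch for Python 3 that runs on a made-up stand-in for the page source. The sample values are invented; only the field names come from the snippet above:

```python
import json

# Hypothetical stand-in for the downloaded page source; the real page and
# its values are not reproduced here.
page = (
    'var other = 1; '
    'latestResults_productResults = {"GoldLottoSaturday": '
    '{"drawDayDateNumber": "Sat 01/01/00, Draw 0000", '
    '"primaryNumbers": ["3", "11", "19", "27", "35", "43"], '
    '"secondaryNumbers": ["7", "21"]}}; var more = 2;'
)

# Keep what follows the variable name, drop the " = " and cut at the first ";".
s = page.partition("latestResults_productResults")[2].lstrip(" =")
s = s.partition(";")[0]
data = json.loads(s)
lotto = data["GoldLottoSaturday"]

primary = [int(n) for n in lotto["primaryNumbers"]]
secondary = [int(n) for n in lotto["secondaryNumbers"]]
print(lotto["drawDayDateNumber"])
print(primary)
print(secondary)
```

Cutting at the first ";" only works while the embedded data contains no semicolon itself, which is part of why the approach is brittle.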

&lt;/pre&gt;</description>
    <dc:creator>Peter Otten</dc:creator>
    <dc:date>2013-05-18T12:44:45</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81645">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81645</link>
    <description>&lt;pre&gt;

The UTF-8 data structure was originally designed to go up to 6 bytes, but since Unicode itself is limited to 1114112 code points (U+0000 to U+10FFFF), no more than 4 bytes are needed for UTF-8.

Also, it is wrong to say that the 4-byte UTF-8 values are "East Asian languages". The full Unicode range contains 17 "planes" of 65,536 code points. The first such plane is called the "Basic Multilingual Plane", and it includes all the code points that can be represented in 1 to 3 UTF-8 bytes. The BMP includes in excess of 13,000 East Asian code points, e.g.:


py&amp;gt; import unicodedata as ud
py&amp;gt; c = '\u3050'
py&amp;gt; print(c, ud.name(c), c.encode('utf-8'))
ぐ HIRAGANA LETTER GU b'\xe3\x81\x90'


The 4-byte UTF-8 values are in the second and subsequent planes, called the "supplementary planes" (the first of these is the "Supplementary Multilingual Plane"). They include historical character sets such as Egyptian hieroglyphs, cuneiform, musical and mathematical symbols, Emoji, gaming symbols, Ancient Arabic and Persian, and many others.

http://en.wikipedia.org/wiki/Plane_(Unicode)



Well, that's certainly common, but not all legacy encodings are supersets of ASCII. For example:

http://en.wikipedia.org/wiki/Big5

I see, however, that Python's implementation of Big5 is *technically* incorrect but *practically* useful, as it does include ASCII.



No idea :-)





UCS-2 is a fixed-width encoding that is identical to UTF-16 for code points up to U+FFFF. It differs from UTF-16 in that it *cannot* encode code points U+10000 and higher, in other words, it does not support surrogate pairs. So UCS-2 is obsolete in the sense it doesn't include the whole set of Unicode characters.

In Python 3.2 and older, Python has a choice between a *narrow build* that uses UTF-16 (including surrogates) for strings in memory, or a *wide build* that uses UTF-32. The choice is made when you compile the Python interpreter. Other programming languages may use other systems.

Python 3.3 uses a different, more flexible scheme for keeping strings in memory. Depending on the largest code point in a string, the string will be stored in either Latin-1 (one byte per character), UCS-2 (two bytes per character, and no surrogates) or UTF-32 (four bytes per character). This means that there is no longer a need for surrogate pairs, but only strings that *need* four bytes per character will use four bytes.
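
One way to watch this happen is to compare the memory used by strings of different lengths. This is only a sketch: it assumes CPython 3.3 or later, relies on sys.getsizeof (an implementation detail), and the helper name bytes_per_char is made up:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

narrow = bytes_per_char("a")             # fits in Latin-1
medium = bytes_per_char("\u3050")        # in the BMP, but beyond Latin-1
wide = bytes_per_char("\U0001F600")      # outside the BMP

print(narrow, medium, wide)
```

Subtracting the size of the one-character string cancels out the fixed per-object overhead, so only the per-character cost is left.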




Endianness is relevant for UTF-16 too.

It is not relevant for UTF-8 because UTF-8 defines the order that multiple bytes must appear. UTF-8 is defined in terms of *bytes*, not multi-byte words. So the code point U+3050 is encoded into three bytes, *in this order*:

0xE3 0x81 0x90

There's no question about which byte comes first, because the order is set. But UTF-16 defines the encoding in terms of double-byte words, so the question of how words are stored becomes relevant. A 16-bit word can be laid out in memory in at least two ways:

[most significant byte] [least significant byte]

[least significant byte] [most significant byte]

so U+3050 could legitimately appear as bytes 0x3050 or 0x5030 depending on the machine you are using.
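
You can see both layouts from Python by asking for an explicit byte order (a quick sketch, assuming Python 3):

```python
ch = "\u3050"

big = ch.encode("utf-16-be")      # most significant byte first
little = ch.encode("utf-16-le")   # least significant byte first

print(big.hex(), little.hex())
```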

It's hard to talk about endianness without getting confused, or at least for me it is :-) Even though I've written down 0x3050 and 0x5030, it is important to understand that they both have the same numeric value of 12368 in decimal. The difference is just in how the bytes are laid out in memory. By analogy, Arabic numerals used in English and other Western languages are written in *big endian order*:

1234 means 1 THOUSAND 2 HUNDREDS 3 TENS 4 UNITS

Imagine a language that wrote numbers in *little endian order*, but using the same digits. You would count:

0
1
2
...
01  # no UNITS 1 TEN
11  # 1 UNITS 1 TEN
21  # 2 UNITS 1 TEN
...
4321  # 4 UNITS 3 TENS 2 HUNDREDS 1 THOUSAND


Since both UTF-16 and UTF-32 are defined in terms of 16 or 32 bit words, endianness is relevant; since UTF-8 is defined in terms of 8-bit bytes, it is not.
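
To convince yourself that the two layouts really are the same number, here is a short sketch (Python 3) using int.from_bytes, plus the numeral analogy done with string reversal:

```python
# The same 16-bit value, stored in two byte orders.
big = int.from_bytes(bytes([0x30, 0x50]), "big")
little = int.from_bytes(bytes([0x50, 0x30]), "little")
print(big, little)

# The numeral analogy: reversing the digits of a little-endian numeral
# gives the familiar big-endian one.
little_endian_numeral = "4321"
value = int(little_endian_numeral[::-1])
print(value)
```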

Fortunately, all(?) modern computing hardware has standardized on the same "endianness" of individual bytes. This was not always the case, but today if you receive a byte with bits:

0b00110000

then there is no(?) doubt that it represents decimal 48, not 12.




Certainly not each byte! That would be impossible, since the BOM itself is *two bytes* for UTF-16 and *four bytes* for UTF-32.

Remember, a BOM is not compulsory. If you decide beforehand that you will always use big-endian UTF-16, say, there is no need to waste time with a BOM. But then you're responsible for producing big-endian words even if your hardware is little-endian.

A BOM is useful when you're transmitting a file to somebody else, and they *might* not have the same endianness as you. If you can pass a message on via some other channel, you can say "I'm about to send you a file in little-endian UTF-16" and all will be good. But since you normally can't, you just insert the BOM at the start of the file, and they can auto-detect the endianness.

How do they do that? They read the first two bytes as a 16-bit word. If they read it as 0xFFFE, that tells them that their byte-order and my byte-order are mismatched, and they should just use the opposite byte-order from what their system uses by default. If they read it as 0xFEFF, our byte-orders match, and we're right to go.

You can stick a BOM at the beginning of every string, but that's rather wasteful, and it leads to difficulty with string processing (especially concatenating strings), so it's best not to use BOMs except *at most* once per file.
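
A minimal sketch of that auto-detection, assuming Python 3. The function name sniff_utf16_byte_order is made up; the BOM constants come from the standard codecs module:

```python
import codecs

def sniff_utf16_byte_order(data):
    """Return 'big', 'little', or None based on the first two bytes."""
    if data[:2] == codecs.BOM_UTF16_BE:    # bytes FE FF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:    # bytes FF FE
        return "little"
    return None                            # no BOM: the order must be agreed elsewhere

payload = codecs.BOM_UTF16_LE + "\u3050".encode("utf-16-le")
order = sniff_utf16_byte_order(payload)
print(order)

# The plain "utf-16" codec does the same check itself and drops the BOM.
decoded = payload.decode("utf-16")
print(decoded == "\u3050")
```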



Because UTF-8 is a very cunning system that was designed by very clever people (Ken Thompson and Rob Pike, building on an earlier proposal by Dave Prosser) to be unambiguous when read one byte at a time.

When reading a stream of UTF-8 bytes, you look at the first bit of the current byte. If it is a zero, then you have a single-byte code, so you can decode that byte and move on to the next byte. A single byte with a leading 0 gives you 128 possible different values (0 to 127). (If this sounds like ASCII, that's not a coincidence.)

But if the current byte starts with bits 110, then you throw those three bits away, and keep the next five bits. Then you read the next byte, check that it starts with bits 10, and keep the six bits following that. That gives you 5+6 = 11 useful bits in total, from two bytes read, which is enough to reach code point 2047 (U+07FF).

If the current byte starts with bits 1110, then you throw those four bits away and keep the next four. Then you read in two more bytes, check that they both start with bits 10, and keep the next six bits from each. This gives you 4+6+6 = 16 bits in total, which is enough to reach code point 65535 (U+FFFF).

If the current byte starts with 11110, you throw away those five bits, keep the next three, and read in the next three bytes. This gives you 3+6+6+6 = 21 bits, which could reach 2097151 (0x1FFFFF). That is more than the number we actually need, since Unicode stops at U+10FFFF.

(Notice that the number of leading 1s in the first byte tells you how many bytes you need to read. Also note that not all byte sequences are valid UTF-8.) In summary:

U+0000 - U+007F =&amp;gt; 0xxxxxxx
U+0080 - U+07FF =&amp;gt; 110xxxxx 10xxxxxx
U+0800 - U+FFFF =&amp;gt; 1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF =&amp;gt; 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
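
The rules in that table can be turned into a toy decoder. This is only a sketch for Python 3: the name decode_first is made up, integer division and remainder stand in for bit masks, and overlong forms and surrogates are not rejected, so it is a teaching aid rather than a validator:

```python
def decode_first(data):
    """Decode the first code point of a UTF-8 byte string.

    Returns (code_point, bytes_consumed).
    """
    first = data[0]
    if first // 0x80 == 0b0:            # 0xxxxxxx: single byte
        return first, 1
    if first // 0x20 == 0b110:          # 110xxxxx: keep five bits
        value, extra = first % 0x20, 1
    elif first // 0x10 == 0b1110:       # 1110xxxx: keep four bits
        value, extra = first % 0x10, 2
    elif first // 0x08 == 0b11110:      # 11110xxx: keep three bits
        value, extra = first % 0x08, 3
    else:
        raise ValueError("invalid leading byte")
    tail = data[1:1 + extra]
    if len(tail) != extra:
        raise ValueError("truncated sequence")
    for byte in tail:
        if byte // 0x40 != 0b10:        # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = value * 0x40 + byte % 0x40   # append six payload bits
    return value, 1 + extra

result = decode_first(bytes([0xE3, 0x81, 0x90]))
print(hex(result[0]), result[1])
```

The example bytes are the three-byte encoding of U+3050 mentioned earlier.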




You're using UTF-8. I'm talking about UTF-16.
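
For code point 63000, the two encodings can be compared side by side. A short sketch, written for Python 3 (where chr replaces Python 2's unichr):

```python
ch = chr(63000)

utf8_len = len(ch.encode("utf-8"))
utf16_len = len(ch.encode("utf-16-be"))   # "-be" avoids the extra two-byte BOM

print(utf8_len, utf16_len)
```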





&lt;/pre&gt;</description>
    <dc:creator>Steven D'Aprano</dc:creator>
    <dc:date>2013-05-18T12:12:16</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81644">
    <title>Re: Unsubscribe</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81644</link>
    <description>&lt;pre&gt;
At the bottom of every message is a link to a web page to "change 
subscription options."  At the bottom of that page is a button that can 
unsubscribe you.

&lt;/pre&gt;</description>
    <dc:creator>Dave Angel</dc:creator>
    <dc:date>2013-05-18T12:06:51</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81643">
    <title>Re: Retrieving data from a web site</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81643</link>
    <description>&lt;pre&gt;
Further investigation shows that the numbers are available if I view the 
source of the page. So, all I have to do is parse the page and extract 
the drawn numbers. I'm not sure, at the moment, how I might do that but 
I have something to work with.

&lt;/pre&gt;</description>
    <dc:creator>Phil</dc:creator>
    <dc:date>2013-05-18T10:16:41</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81642">
    <title>Re: Retrieving data from a web site</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81642</link>
    <description>&lt;pre&gt;
http://tatts.com/goldencasket


Not that I can find. A Google search hasn't turned up anything.


Good point Peter, I'll investigate.

&lt;/pre&gt;</description>
    <dc:creator>Phil</dc:creator>
    <dc:date>2013-05-18T09:40:49</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81641">
    <title>Re: use python to change the webpage content?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81641</link>
    <description>&lt;pre&gt;


Maybe this? http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet/

&lt;/pre&gt;</description>
    <dc:creator>Albert-Jan Roskam</dc:creator>
    <dc:date>2013-05-18T10:42:44</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81640">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81640</link>
    <description>&lt;pre&gt;



Thank you. That unicodedata module is very handy sometimes (and crucial for regexes, sometimes). I rarely use it but I should have remembered it.

&lt;/pre&gt;</description>
    <dc:creator>Albert-Jan Roskam</dc:creator>
    <dc:date>2013-05-18T10:39:22</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81639">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81639</link>
    <description>&lt;pre&gt;





Thanks for all your replies. I knew about code points, but to represent the unicode string (code point) as a utf-8 byte string (bytes), characters 0-127 are 1 byte (of 8 bits), then 128-255 (accented chars) 
are 2 bytes, and so on up to 4 bytes for East Asian languages. But later on Joel Spolsky's "standard" page about unicode I read that it goes to 6 bytes. That's what I implied when I mentioned "utf8".





I would admit it if otherwise, but that's what I meant ;-)






I always viewed the codepage as "the bunch of chars on top of ascii", e.g. cp1252 (latin-1) is ascii (0-127) + another 128 characters that are used in Europe (euro sign, Scandinavian and Mediterranean (Spanish) chars, but not Slavic chars). A certain locale implies a certain codepage (on Windows), but where does the locale category LC_CTYPE fit in this story?




Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)? Or maybe this is a different abbreviation. I read about the basic multilingual plane (BMP) and surrogate pairs and all. The author suggested that messing with surrogate pairs is a topic to dive into in case one's nail bed is being derusted. I wholeheartedly agree.





Why is endianness relevant only for utf-32, but not for utf-8 and utf-16? Is "utf-8" a shorthand for saying "utf-8 le"?




So each byte starts with a BOM? Or each file? I find utf-32 indeed the easiest to understand. In utf-8, how does a system "know" that the given octet of bits is to be interpreted as a single-byte character, or rather like "hold on, these eight bits are gibberish as they are right now, let's check what happens if we add the next eight bits", in other words a multibyte char (forgive me the naive phrasing ;-). Why I mention this in the context of BOM: why aren't these needed to indicate "multibyte char ahead!"?





Just as I thought I was starting to understand it.... Sorry. len(unichr(63000).encode("utf-8")) returns three bytes.
What should I do to arrive at two? Something like len(unichr(63000).encode("&amp;lt;internal unicode encoding that Python 2.7 uses&amp;gt;"))? 



Ah, ok, this answers one of my questions above.


Thanks again, all, it is much appreciated!

&lt;/pre&gt;</description>
    <dc:creator>Albert-Jan Roskam</dc:creator>
    <dc:date>2013-05-18T10:01:40</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81638">
    <title>Re: Retrieving data from a web site</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81638</link>
    <description>&lt;pre&gt;

What's the url of the page? 

Are there alternatives that give the number as plain text? 

If not, do the images have names like whatever0.jpg, whatever1.jpg, 
whatever2.jpg, ...? Then you could infer the value from the name. 

If not, is a digit always represented by the same image? Then you could map 
the image urls to the digits.
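
If the names do follow that pattern, inferring the value is a one-line regex. A sketch for Python 3 with entirely hypothetical URLs (example.com and the whatever prefix are placeholders, not the real site):

```python
import re

# Hypothetical image URLs of the kind the page might use.
urls = [
    "http://example.com/img/whatever4.jpg",
    "http://example.com/img/whatever0.jpg",
    "http://example.com/img/whatever7.jpg",
]

def digit_from_url(url):
    """Pull the digit out of a name like whatever7.jpg."""
    match = re.search(r"whatever(\d)\.jpg$", url)
    if match is None:
        raise ValueError("unexpected image name: " + url)
    return int(match.group(1))

digits = [digit_from_url(u) for u in urls]
print(digits)
```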


&lt;/pre&gt;</description>
    <dc:creator>Peter Otten</dc:creator>
    <dc:date>2013-05-18T09:25:03</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81637">
    <title>Re: Retrieving data from a web site</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81637</link>
    <description>&lt;pre&gt;
Thanks for the replies,

The site in question is the Lotto results page and the drawn numbers are 
not obscured. So I don't expect that there would be any legal or 
copyright problems.

I have written a simple program that checks the results, for an unlikely 
win, but I have to manually enter the drawn numbers. I thought the next 
step might be to automatically download the results.

I can see that this would be a relatively easy task if the digits were 
not displayed as graphics.

&lt;/pre&gt;</description>
    <dc:creator>Phil</dc:creator>
    <dc:date>2013-05-18T08:41:06</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81636">
    <title>Python web script to run a command line expression</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81636</link>
    <description>&lt;pre&gt;Hi,
I have WAMP running on my office computer. I wonder how I can implement a
python script that runs within WAMP and executes a command line expression.
This way, I will be able to run my command line expressions through a web
page on the intranet.

I appreciate your suggestions.

++Ahmet
&lt;/pre&gt;</description>
    <dc:creator>Ahmet Anil Dindar</dc:creator>
    <dc:date>2013-05-18T07:20:50</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81635">
    <title>use python to change the webpage content?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81635</link>
    <description>&lt;pre&gt;There is an online simulator for a physics project I'm doing and I want to
use the data the simulator generates on that website. I can get data using
urllib.request and regular expressions, but I also want to change some of the
input values and then get different sets of data. However, if I change the
inputs, the address of the webpage doesn't change, so I can't get data
with different initial conditions. I'm wondering how I can implement this.

Thanks for your effort!!
&lt;/pre&gt;</description>
    <dc:creator>Jiajun Xu</dc:creator>
    <dc:date>2013-05-16T01:44:23</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81634">
    <title>Re: Retrieving data from a web site</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81634</link>
    <description>&lt;pre&gt;
In addition to Dave's points there is also the legality to consider.
Images are often copyrighted (although images of digits are less 
likely!) and sites often have conditions of use that prohibit web 
scraping. Such sites often include scripts that analyze user activity 
and if they suspect you of being a robot may ban your computer from 
accessing the site - including by browser.

So be sure that you  are allowed to access the site robotically and that 
you are allowed to download the content or you could find yourself 
blacklisted and unable to access the site even with your browser.


&lt;/pre&gt;</description>
    <dc:creator>Alan Gauld</dc:creator>
    <dc:date>2013-05-18T06:33:30</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81633">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81633</link>
    <description>&lt;pre&gt;

By the way, your sentence above reflects a misunderstanding. Unicode characters (strictly speaking, code points) are not "bytes", four or otherwise. They are abstract entities represented by a number between 0 and 1114111, or in hex, 0x10FFFF. Code points can represent characters, or parts of characters (e.g. accents, diacritics, combining characters and similar), or non-characters.

Much confusion comes from conflating bytes and code points, or bytes and characters. The first step to being a Unicode wizard is to always keep them distinct in your mind. By analogy, the floating point number 23.42 is stored in memory or on disk as a bunch of bytes, but there is nothing to be gained from confusing the number 23.42 with the bytes 0xEC51B81E856B3740, which is how it is stored as a C double on a little-endian machine.
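
If you want to see those bytes for yourself, the struct module will show them. A quick sketch, assuming Python 3; packing in network (big-endian) order and reversing is just one way to get the little-endian layout:

```python
import struct

big_endian = struct.pack("!d", 23.42)   # "!" means network (big-endian) order
little_endian = big_endian[::-1]        # the same eight bytes, reversed

print(big_endian.hex())
print(little_endian.hex())
```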

Unicode code points are abstract entities, but in the real world, they have to be stored in a computer's memory, or written to disk, or transmitted over a wire, and that requires *bytes*. So Unicode defines schemes for storing code points as bytes. These are called *encodings*. Only encodings involve bytes, so it is nonsense to talk about "four-byte" Unicode characters, since that conflates the abstract Unicode character set with one of various concrete encodings.

There are three standard Unicode encodings. (These are not to be confused with the dozens of "legacy encodings", a.k.a. code pages, used prior to the Unicode standard. They do not cover the entire range of Unicode, and are not part of the Unicode standard.) These encodings are:

UTF-8
UTF-16
UTF-32 (also sometimes known as UCS-4)

plus at least one older, obsolete encoding, UCS-2.

UTF-32 is the least common, but simplest. It simply maps every code point to four bytes. In the following, I will follow this convention:

- code points are written using the standard Unicode notation, U+xxxx where the x's are hexadecimal digits;

- bytes are written in hexadecimal, using a leading 0x.

Code point U+0000 -&amp;gt; bytes 0x00000000
Code point U+0001 -&amp;gt; bytes 0x00000001
Code point U+0002 -&amp;gt; bytes 0x00000002
...
Code point U+10FFFF -&amp;gt; bytes 0x0010FFFF


UTF-32 is simple because the mapping is trivial, and uncommon because for typical English-language text it wastes a lot of memory: every character takes four bytes.
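
A minimal sketch of that mapping, assuming Python 3 (where chr() accepts any code point up to 0x10FFFF) and the standard 'utf-32-be' codec. It checks that the encoded form is nothing more than the code point written as a four-byte big-endian integer:

```python
code_points = [0x0000, 0x0001, 0x0002, 0x10FFFF]

for cp in code_points:
    encoded = chr(cp).encode('utf-32-be')
    # The encoded form is just the code point as a 4-byte big-endian integer.
    assert encoded == cp.to_bytes(4, 'big')
    print('U+%04X' % cp, encoded.hex())
```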

The only complication is that UTF-32 depends on the endianness of your system. The examples above gloss over this and show big-endian order. In fact, there are two common ways that bytes can be stored:

- "big endian", where the most-significant (largest) byte is on the left (lowest address);
- "little endian", where the most-significant (largest) byte is on the right.

So in a little-endian system, we have this instead:

Code point U+0000 -&amp;gt; bytes 0x00000000
Code point U+0001 -&amp;gt; bytes 0x01000000
Code point U+0002 -&amp;gt; bytes 0x02000000
...
Code point U+10FFFF -&amp;gt; bytes 0xFFFF1000

(Note that little-endian is not merely the reverse of big-endian. It is the order of bytes that is reversed, not the order of digits, or the order of bits within each byte.)
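
A quick way to see the difference, sketched in Python 3 with the standard 'utf-32-be' and 'utf-32-le' codecs:

```python
cp = 0x10FFFF
big = chr(cp).encode('utf-32-be')
little = chr(cp).encode('utf-32-le')

print(big.hex())     # 0010ffff
print(little.hex())  # ffff1000

# Reversing the bytes converts one order into the other...
assert little == big[::-1]
# ...but reversing the hex digits does not.
assert little.hex() != big.hex()[::-1]
```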

So when you receive a bunch of bytes that you know represent text encoded using UTF-32, you can bunch the bytes in groups of four and convert them to Unicode code points. But you need to know the endianness. One way to do that is to add a Byte Order Mark at the beginning of the bytes. If you look at the first four bytes, and they look like 0x0000FEFF, then you have big-endian UTF-32. But if they look like 0xFFFE0000, then you have little-endian.
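
Here is a rough sketch of that check in Python 3. The function decode_utf32 is a hypothetical helper written for this example, and falling back to big-endian when there is no BOM is my own assumption; the BOM constants come from the standard codecs module.

```python
import codecs

def decode_utf32(data):
    """Decode UTF-32 bytes, using a leading BOM to pick the byte order.

    Hypothetical helper for illustration; without a BOM it assumes big-endian.
    """
    if data.startswith(codecs.BOM_UTF32_BE):    # bytes 00 00 FE FF
        return data[4:].decode('utf-32-be')
    if data.startswith(codecs.BOM_UTF32_LE):    # bytes FF FE 00 00
        return data[4:].decode('utf-32-le')
    return data.decode('utf-32-be')

sample_be = codecs.BOM_UTF32_BE + 'hi'.encode('utf-32-be')
sample_le = codecs.BOM_UTF32_LE + 'hi'.encode('utf-32-le')
print(decode_utf32(sample_be), decode_utf32(sample_le))  # hi hi
```

In practice you rarely need to write this yourself: decoding with the plain 'utf-32' codec looks for the BOM and handles it for you.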

So that's UTF-32. UTF-16 is a little more complicated.

UTF-16 divides the Unicode range into two groups:

* The first (approximately) 65000 code points, each of which is represented as two bytes;

* Everything else, each of which is represented as a pair of double bytes, a so-called "surrogate pair".

For the first 65000-odd code points, the mapping is trivial, and relatively compact:

code point U+0000 =&amp;gt; bytes 0x0000
code point U+0001 =&amp;gt; bytes 0x0001
code point U+0002 =&amp;gt; bytes 0x0002
...
code point U+FFFF =&amp;gt; bytes 0xFFFF


Code points beyond that point are encoded into a pair of double bytes (four bytes in total):

code point U+10000 =&amp;gt; bytes 0xD800 DC00
...
code point U+10FFFF =&amp;gt; bytes 0xDBFF DFFF
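
The post does not spell out how the pair is calculated, so here is a sketch of the standard UTF-16 arithmetic in Python 3. The function to_surrogates is a hypothetical helper written for this example; the result is cross-checked against Python's own 'utf-16-be' codec.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper; assumes cp is in the range 0x10000 to 0x10FFFF.
    """
    # Subtracting 0x10000 leaves a 20-bit number; divmod by 0x400 (2**10)
    # splits it into the top ten bits and the bottom ten bits.
    high, low = divmod(cp - 0x10000, 0x400)
    return 0xD800 + high, 0xDC00 + low

for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogates(cp)
    # Cross-check against Python's own big-endian UTF-16 encoder.
    expected = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
    assert chr(cp).encode('utf-16-be') == expected
    print('U+%X' % cp, '%04X %04X' % (high, low))
```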


Notice a potential ambiguity here. If you receive the two bytes 0xD800, is that the start of a surrogate pair, or the code point U+D800? The Unicode standard resolves this ambiguity by officially reserving code points U+D800 through U+DFFF for use as surrogates in UTF-16, so they never stand for characters in their own right.

Like UTF-32, UTF-16 also has to distinguish between big-endian and little-endian. It does so with a leading BOM, only this time it is two bytes, not four:

0xFEFF =&amp;gt; big-endian
0xFFFE =&amp;gt; little-endian
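
A small sketch of how this looks from Python 3, assuming the standard 'utf-16', 'utf-16-be' and 'utf-16-le' codecs and the BOM constants in the codecs module:

```python
import codecs

text = 'A'
print(text.encode('utf-16-be').hex())  # 0041
print(text.encode('utf-16-le').hex())  # 4100

# The plain 'utf-16' codec writes a BOM first, in the machine's native order,
# so its output starts with one of the two marks.
with_bom = text.encode('utf-16')
assert with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# Decoding with 'utf-16' reads the BOM and picks the right byte order.
assert (codecs.BOM_UTF16_BE + b'\x00A').decode('utf-16') == 'A'
assert (codecs.BOM_UTF16_LE + b'A\x00').decode('utf-16') == 'A'
```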


Last but not least, we have UTF-8. UTF-8 is slowly becoming the standard for storing Unicode on disk, because it is very compact for common English-language text, backwards-compatible with ASCII text files, and doesn't require a BOM. (Although Microsoft software sometimes adds a UTF-8 signature at the start of files, namely the three bytes 0xEFBBBF.)
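
Both points can be checked with a short sketch in Python 3. The 'utf-8-sig' codec is the standard library's name for UTF-8 with that signature; the sample string is arbitrary.

```python
import codecs

# ASCII text encodes to the same bytes under ASCII and UTF-8.
assert 'Hello'.encode('utf-8') == 'Hello'.encode('ascii')

# The 'utf-8-sig' codec writes the three-byte signature EF BB BF first.
signed = 'Hello'.encode('utf-8-sig')
assert signed[:3] == codecs.BOM_UTF8
print(signed[:3].hex())  # efbbbf

# Decoding with 'utf-8-sig' strips the signature if it is present.
assert signed.decode('utf-8-sig') == 'Hello'
```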

UTF-8 is also a variable-width encoding. Unicode code-points are mapped to one, two, three or four bytes, as needed:

Code points U+0000 to U+007F =&amp;gt; 1 byte
Code points U+0080 to U+07FF =&amp;gt; 2 bytes
Code points U+0800 to U+FFFF =&amp;gt; 3 bytes
Code points U+10000 to U+10FFFF =&amp;gt; 4 bytes

(Older versions of UTF-8 could go up to six bytes, but now that Unicode is officially limited to code points no higher than U+10FFFF, UTF-8 only goes up to four bytes.)
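
As a sanity check, this sketch (Python 3 assumed) encodes the first and last code point of each range and prints how many bytes each one takes:

```python
# Boundary code points for each UTF-8 length (surrogates U+D800 to U+DFFF
# are skipped because they cannot be encoded on their own).
boundaries = [0x0000, 0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in boundaries:
    encoded = chr(cp).encode('utf-8')
    print('U+%04X' % cp, len(encoded), encoded.hex())
```

The lengths come out as 1, 1, 2, 2, 3, 3, 4, 4.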



&lt;/pre&gt;</description>
    <dc:creator>Steven D'Aprano</dc:creator>
    <dc:date>2013-05-18T03:49:38</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.python.tutor/81632">
    <title>Re: why is unichr(sys.maxunicode) blank?</title>
    <link>http://permalink.gmane.org/gmane.comp.python.tutor/81632</link>
    <description>&lt;pre&gt;
There's no name since the code point isn't assigned, but the category
is defined:

    &amp;gt;&amp;gt;&amp;gt; import unicodedata
    &amp;gt;&amp;gt;&amp;gt; unicodedata.category(u'\U0010FFFD')
    'Co'
    &amp;gt;&amp;gt;&amp;gt; unicodedata.category(u'\U0010FFFE')
    'Cn'
    &amp;gt;&amp;gt;&amp;gt; unicodedata.category(u'\U0010FFFF')
    'Cn'

'Co' is the private use category, and 'Cn' is for codes that aren't assigned.
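
For anyone on Python 3, roughly the same check can be written as a script. This is only a sketch; it passes a default of None to unicodedata.name() so that a missing name does not raise ValueError.

```python
import unicodedata

for cp in (0x10FFFD, 0x10FFFE, 0x10FFFF):
    ch = chr(cp)
    # name() takes an optional default, returned when no name is defined.
    print('U+%X' % cp, unicodedata.category(ch), unicodedata.name(ch, None))
```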

&lt;/pre&gt;</description>
    <dc:creator>eryksun</dc:creator>
    <dc:date>2013-05-18T03:28:55</dc:date>
  </item>
  <textinput rdf:about="http://search.gmane.org/?group=$group=gmane.comp.python.tutor">
    <title>Search Engine</title>
    <description>Search the mailing list at Gmane</description>
    <name>query</name>
    <link>http://search.gmane.org/?group=$group=gmane.comp.python.tutor</link>
  </textinput>
</rdf:RDF>
