<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel about="http://blog.gmane.org/gmane.comp.lang.perl.perl5.porters">
    <title>gmane.comp.lang.perl.perl5.porters</title>
    <link>http://blog.gmane.org/gmane.comp.lang.perl.perl5.porters</link>
    <description/>
    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>1</syn:updateFrequency>
    <syn:updateBase>1901-01-01T00:00+00:00</syn:updateBase>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64353"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64352"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64351"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64350"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64349"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64348"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64347"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64346"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64345"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64344"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64343"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64342"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64341"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64340"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64339"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64338"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64337"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64336"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64335"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64334"/>
      </rdf:Seq>
    </items>
    <image rdf:resource="http://gmane.org/img/gmane-25t.png"/>
    <textinput rdf:resource=""/>
  </channel>
  <image rdf:about="http://gmane.org/img/gmane-25t.png">
    <title>Gmane</title>
    <url>http://gmane.org/img/gmane-25t.png</url>
    <link>http://gmane.org</link>
  </image>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64353">
    <title>Re: maint-5.8 breaks NYTProf 2.07</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64353</link>
    <description>
Yes, thanks. I fixed NYTProf trunk before I submitted the patch.

Guess I should make progress on a new release before RC2 arrives...
Thanks for reminding me about that Andreas.

Tim.

</description>
    <dc:creator>Tim Bunce</dc:creator>
    <dc:date>2008-11-23T11:57:05</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64352">
    <title>Re: [perl #60738] Something missing in Config(3perl)</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64352</link>
    <description>On Sat, 22 Nov 2008 11:29:16 -0800, "rrt&lt; at &gt;sc3d.org (via RT)"
&lt;perlbug-followup&lt; at &gt;perl.org&gt; wrote:


What is unclear about "a plain ''"? It is an empty string.


Correct. Nothing to see, please move on.

&lt;longer explanation&gt;
Configure uses 'dist' to generate Configure, and 'dist' defines 'mv',
but perl doesn't use it. I will not overrule the part of dist that
defines 'mv' just to make this unused variable go away.
&lt;/longer explanation&gt;

</description>
    <dc:creator>H.Merijn Brand</dc:creator>
    <dc:date>2008-11-23T08:13:39</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64351">
    <title>Smoke [5.11.0] 34896 FAIL(F) MSWin32 WinXP/.Net SP3 (x86/2 cpu)</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64351</link>
    <description>Automated smoke report for 5.11.0 patch 34896
maldoror.bath.planit.group: Intel(R) Core(TM)2 CPU 6700 @ 2.66GHz(~2660 MHz) (x86/2 cpu)
    on        MSWin32 - WinXP/.Net SP3
    using     bcc32 version 5.5.1
    smoketime 5 hours 35 minutes (average 16 minutes 47 seconds)

Summary: FAIL(F)

O = OK  F = Failure(s), extended report at the bottom
X = Failure(s) under TEST but not under harness
? = still running or test results not (yet) available
Build failures during:       - = unknown or N/A
c = Configure, m = make, M = make (after miniperl), t = make test-prep

   34896     Configuration (common) -DCCTYPE=BORLAND -DINST_TOP=$(INST_DRV)\Smoke\doesntexist
----------- ---------------------------------------------------------
F F         
F F         -Dusemymalloc
F F         -Duselargefiles
F F         -Duselargefiles -Dusemymalloc
F F         -Duseithreads -Uuseimpsys
F F         -Duseithreads -Uuseimpsys -Dusemymalloc
F F         -Duseithreads -Uuseimpsys -Duselargefiles
F F         -Duseithreads -Uuseimpsys -Duselargefiles -Dusemymalloc
F F         -Duseithreads
F F         -Duseithreads -Duselargefiles
| +--------- -DDEBUGGING
+----------- no debugging


Locally applied patches:
    DEVEL
    SMOKE34896

Failures: (common-args) -DCCTYPE=BORLAND -DINST_TOP=$(INST_DRV)\Smoke\doesntexist
[default] 
[default] -DDEBUGGING
[default] -Dusemymalloc
[default] -DDEBUGGING -Dusemymalloc
[default] -Duselargefiles
[default] -DDEBUGGING -Duselargefiles
[default] -Duselargefiles -Dusemymalloc
[default] -DDEBUGGING -Duselargefiles -Dusemymalloc
[default] -Duseithreads -Uuseimpsys
[default] -DDEBUGGING -Duseithreads -Uuseimpsys
[default] -Duseithreads -Uuseimpsys -Dusemymalloc
[default] -DDEBUGGING -Duseithreads -Uuseimpsys -Dusemymalloc
[default] -Duseithreads -Uuseimpsys -Duselargefiles
[default] -DDEBUGGING -Duseithreads -Uuseimpsys -Duselargefiles
[default] -Duseithreads -Uuseimpsys -Duselargefiles -Dusemymalloc
[default] -DDEBUGGING -Duseithreads -Uuseimpsys -Duselargefiles -Dusemymalloc
[default] -Duseithreads
[default] -DDEBUGGING -Duseithreads
[default] -DDEBUGGING -Duseithreads -Duselargefiles
../ext/B/t/deparse.t........................................FAILED
    63, 65
../ext/PerlIO/t/ioleaks.t...................................FAILED
../lib/Attribute/Handlers/t/linerep.t.......................FAILED

[default] -Duseithreads -Duselargefiles
../ext/B/t/deparse.t........................................FAILED
    63, 65
../ext/IO/t/io_sock.t.......................................FAILED
../ext/PerlIO/t/ioleaks.t...................................FAILED
../lib/Attribute/Handlers/t/linerep.t.......................FAILED

Compiler messages(bcc32):
Warning W8008 crc32.c 232: Condition is always true in function crc32
Warning W8066 crc32.c 242: Unreachable code in function crc32
Warning W8004 deflate.c 351: 'hash_head' is assigned a value that is never used in function deflateSetDictionary
Warning W8008 deflate.c 1278: Condition is always false in function fill_window
Warning W8066 deflate.c 1279: Unreachable code in function fill_window
Warning W8004 deflate.c 1669: 'bflush' is assigned a value that is never used in function deflate_slow

</description>
    <dc:creator>Steve Hay</dc:creator>
    <dc:date>2008-11-23T04:24:00</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64350">
    <title>maint-5.8 breaks NYTProf 2.07</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64350</link>
    <description>Tim,

are you aware that your commit 34899 broke NYTProf?

http://www.nntp.perl.org/group/perl.cpan.testers/2008/11/msg2669348.html

I can reproduce the result.

</description>
    <dc:creator>Andreas J. Koenig</dc:creator>
    <dc:date>2008-11-23T04:03:20</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64349">
    <title>Matching multi-character folds</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64349</link>
    <description>This email is best viewed under utf8.

The Unicode standard lists several different cases where a character (or
code point if you prefer) should match a multiple character sequence 
when case is ignored.

One of these, oft mentioned on this list, is the German lower case sharp
s, ß (U+00DF).  'ss' =~ /ß/i is true.

And perl does currently work that way if and only if the ß is stored in 
utf8.  For the purposes of this email, I'm assuming all strings are in utf8.

In a recent email, Yves has said that he thinks it is debatable whether 
or not it should work this way.  My own view is that they should match. 
It is beyond debate that the utf8ness of the strings should not matter. 
To quote from the perltodo: "The handling of Unicode is unclean in 
many places. For example, the regexp engine matches in Unicode semantics 
whenever the string or the pattern is flagged as UTF-8, but that should 
not be dependent on an internal storage detail of the string. Likewise, 
case folding behaviour is dependent on the UTF8 internal flag being on 
or off."

Yves has submitted an RFC for the first part of that statement, and I'm 
now going to talk about the second.  I believe we have established that 
there will be a new mode of operation, which will become the default in 
5.12, in which characters in the 128-255 range case-fold match as the 
Unicode standard says.  But there are some issues with multi-char folds 
(the only one in that range being ß) generally.

To start the discussion about the multi-char folds, I give examples of 
the various types defined in the standard.  The first type is that of ß.

Another type is ligatures (they don't view ß as a ligature, and I don't
know why).  So 'fi' =~ /ﬁ/i is true. (U+FB01)

Another type is where there is no single precomposed upper or title
case character corresponding to a lower case one. 
For instance LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. 
(U+01F0)

Still another type is lower case Greek letters with an iota subscript or an
iota adscript.  I won't put in an example.

And the final types all have to do with putting a combining dot above i
and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't 
support in Unicode.

I think it is more correct for these things to match than not. 
However, I'm not so sure when things are put in a character class.  What 
should /[ß]/i match?  I'm tempted to say not 'ss' because character 
classes match only a single character.  But with the J with caron, that 
really is like a single character, with the caron just a modifier.  For 
that I'm tempted to say yes 'ǰ' =~ /[ǰ]/i.  The problem is that the 
concept of a character class doesn't fit with the Unicode ideas.  I 
haven't done any research as to what other languages, etc do.

Would you like to know what happens today in perl?  Well I'll tell you 
anyway.  /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false.  In fact, every 
other multi-char case-insensitive fold returns false.  This may well be 
the only time in perl history, savor the moment, when the infamous ß 
gives an arguably more correct result than other characters.

The code in regcomp.c takes special pains to make all these match.  But 
it doesn't work, except in the [ß] case.  So we don't have to worry 
about breaking existing code if we decide it should work differently.

Let's look at it the other direction.  Should ß =~ /ss/i ?  Should 'ǰ' 
=~ /ǰ/i ?  They both are true currently.  However, things like ß =~ 
/s{2}/i is false, and that seems inconsistent.

So, I'm not sure what the right answers are, but things are somewhat 
broken today, and I'd like to get clarity on how it should work.


</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-23T03:58:38</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64348">
    <title>Matching multi-character folds</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64348</link>
    <description>This email is best viewed under utf8.

The Unicode standard lists several different cases where a character (or
code point if you prefer) should match a multiple character sequence 
when case is ignored.

One of these, oft mentioned on this list, is the German lower case sharp
s, ß (U+00DF).  'ss' =~ /ß/i is true.

And perl does currently work that way if and only if the ß is stored in 
utf8.  For the purposes of this email, I'm assuming all strings are in utf8.

In a recent email, Yves has said that he thinks it is debatable whether 
or not it should work this way.  My own view is that they should match, 
and it is beyond debate that the utf8ness of the strings should not 
matter.  To quote from the perltodo: "The handling of Unicode is unclean 
in many places. For example, the regexp engine matches in Unicode 
semantics whenever the string or the pattern is flagged as UTF-8, but 
that should not be dependent on an internal storage detail of the 
string. Likewise, case folding behaviour is dependent on the UTF8 
internal flag being on or off."

To start the discussion about the multi-char folds, I give examples of 
the various types defined in the standard.  The first case is that of ß.

Another case is ligatures (they don't view ß as a ligature, and I don't
know why).  So 'fi' =~ /ﬁ/i is true. (U+FB01)

Another case is where there is no single precomposed upper or title
case character corresponding to a lower case one.  For instance
LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. (U+01F0)

Still another case is lower case Greek letters with an iota subscript or an
iota adscript.  I won't put in an example.

And the final cases all have to do with putting a combining dot above i
and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't 
support in Unicode.

I think it is more correct for these things to match than not. 
However, I'm not so sure when things are put in a character class.  What 
should /[ß]/i match?  I'm tempted to say not 'ss' because character 
classes match only a single character.  But with the J with caron, that 
really is like a single character, with the caron really just a 
modifier.  For that I'm tempted to say yes 'ǰ' =~ /[ǰ]/i.  The problem 
is that the concept of a character class doesn't fit with the Unicode 
ideas.  I haven't done any research as to what other languages, etc do.

Would you like to know what happens today in perl?  Well I'll tell you 
anyway.  /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false.  In fact, every 
other multi-char fold returns false.  This in fact may be the only time 
in perl history, savor the moment, when the infamous ß gives an arguably 
more correct result than other characters.

Now the code in regcomp.c takes special pains to make all these match. 
But it doesn't work, except in the [ß] case.  So we don't have to worry 
about breaking existing code if we decide it should work differently.

Let's look at it the other direction.  Should ß =~ /ss/i ?  Should 'ǰ' 
=~ /ǰ/i ?  They both are true currently.  However, things like ß =~ 
/s{2}/i is false, and that seems inconsistent.

So, I'm not sure what the right answers are, but things are broken today.


</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-23T03:46:29</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64347">
    <title>[perl #60738] Something missing in Config(3perl)</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64347</link>
    <description># New Ticket Created by  rrt&lt; at &gt;sc3d.org 
# Please include the string:  [perl #60738]
# in the subject line of all future correspondence about this issue. 
# &lt;URL: http://rt.perl.org/rt3/Ticket/Display.html?id=60738 &gt;



This is a bug report for perl from rrt&lt; at &gt;sc3d.org,
generated with the help of perlbug 1.36 running under perl 5.10.0.


-----------------------------------------------------------------
[Please enter your report here]
Config(3perl) contains the following passage:

 "mv"
           From Loc.U:

           This variable is defined but not used by Configure.  The value is a
           plain ’’ and is not useful.

A plain what? (The groff source is equally unhelpful, there really is
nothing there.)

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=docs
    severity=low
---
Site configuration information for perl 5.10.0:

Configured by Debian Project at Sat Nov  1 21:51:58 UTC 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.26-1-686, archname=i486-linux-gnu-thread-multi
    uname='linux rebekka 2.6.26-1-686 #1 smp thu oct 9 15:18:09 utc 2008 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:


---
@INC for perl 5.10.0:
    /home/rrt/local/share/perl5
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .

---
Environment for perl 5.10.0:
    HOME=/home/rrt
    LANG=en_GB.UTF-8
    LANGUAGE=en_GB:en_US:en_GB:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/sbin:/sbin:/usr/sbin:/usr/NX/bin:/usr/local/epocemx/bin:/home/rrt/bin:/home/rrt/local/i686/bin:/home/rrt/local/bin:/home/rrt/.luarocks/bin:/home/rrt/Work/Adsensus/svn/nancy/trunk:/home/rrt/Work/Adsensus/svn/adsensus/trunk:/usr/local/bin:/usr/bin:/bin:/usr/games
    PERL5LIB=/home/rrt/local/share/perl5
    PERL_BADLANG (unset)
    SHELL=/bin/bash


</description>
    <dc:creator>rrt&lt; at &gt;sc3d.org (via RT)</dc:creator>
    <dc:date>2008-11-22T19:29:16</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64346">
    <title>Re: [perl #58182] Unicode bug: More questions about coding</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64346</link>
    <description>On approximately 11/22/2008 6:34 PM, came the following characters from 
the keyboard of karl williamson:



Not sure that I agree that the parser should know that much about what 
the functions do, but on the other hand, having a function that does 
nothing in certain lexical situations seems useless, so we've arrived at 
the same conclusion by different paths... the functions should continue 
to behave as they did in the use bytes case.




I would surmise the same way you did, for when the space is consumed.

For the 3:1 case, I understand what you have done, thanks for 
clarifying.  Whether one should be concerned about the possibility of 
consuming triple the space needed for the typical character string because 
of case shifting is not my call, certainly, but I'm glad to understand 
this can happen.

For temporary variables, it is kind of a ho-hum situation, a bit of 
space wasted until they drop out of scope.  For variables that might 
stick around for a while, it could be a concern.  Perhaps it is my 
ignorance, but I know of no way for the programmer to say "OK, this 
string is now going to be used as the key in a hash that will last the 
lifetime of the program, and it would be good to make it as short as 
possible".  Such a function could be used for premature optimization, 
but could save significant space for long term data that has, say, been 
uppercased for consistent comparison to input values (rather than doing 
less efficient case-insensitive comparisons).




Sounds good.

</description>
    <dc:creator>Glenn Linderman</dc:creator>
    <dc:date>2008-11-23T02:59:35</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64345">
    <title>Re: [perl #58182] Unicode bug: More questions about coding</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64345</link>
    <description>Sure.  What I meant was that the function isn't the place to be coding 
in noop cases, so I could have answered my own question if I had thought 
about it.  I should leave it to do what it always has done in bytes mode, 
and if someone wants to change perl so that it acts as documented, the 
place to do so is not the function, but the parser.


The worst case for a single byte to utf8 is 2:1.  The worst case for in 
general changing the case of a utf8 character is 3:1, because of the 
extra modifiers that go with it.  The way the functions (not ones I have 
touched, by the way) work is that they essentially malloc enough space 
for the worst case for converting from byte to utf8.  Any extra dangles 
until the scalar's reference count goes to 0, when the entire amount is 
freed.  This extra may be needed if the variable is appended to, say. 
If the scalar has to grow beyond what is adjacent to the string (and I 
haven't really looked at this code, but am doing some surmising), a new 
string is allocated, the original's space is freed, and the scalar now 
has a different string pointer.
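
<![CDATA[
A minimal sketch of the worst-case allocation for the byte-to-utf8 upgrade, in C, assuming 8-bit bytes and Latin-1 input on an ASCII machine. The name latin1_to_utf8 is made up for illustration and is not a function from the perl source; the real code manages the buffer through the scalar rather than a bare malloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not a perl API): upgrade a Latin-1 byte string of
 * length len to UTF-8. Each input byte needs at most 2 output bytes, so
 * 2 * len + 1 bytes (including the terminating NUL) always suffices. */
char *latin1_to_utf8(const unsigned char *src, size_t len, size_t *out_len)
{
    char *dst;
    size_t n = 0;

    assert(src != NULL || len == 0);
    dst = malloc(2 * len + 1);          /* worst case, allocated up front */
    if (dst == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[n++] = (char)c;                    /* ASCII: 1 byte */
        } else {
            dst[n++] = (char)(0xC0 | (c >> 6));    /* lead byte 110000xx */
            dst[n++] = (char)(0x80 | (c & 0x3F));  /* continuation 10xxxxxx */
        }
    }
    dst[n] = '\0';
    if (out_len != NULL)
        *out_len = n;
    return dst;   /* any unused tail of the buffer stays allocated */
}
```

The caller frees the result with free(). The slack between the bytes written and 2 * len + 1 is the "extra" that dangles until the buffer is released.
]]>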


I'm sorry I wasn't very clear.  Restated, in compatibility mode the 
existing macros are used which have the non-ascii chars be caseless, but 
in the new mode, the table is used for the entire 0-255 range.



</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-23T02:34:21</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64344">
    <title>Re: ? Bug in macro UTF_START_MARK when U8 &gt; 8 bits</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64344</link>
    <description>I guess I was assuming there are such machines today.  It is easy to do 
a #if on UCHAR_MAX, and define the macros depending on the result to 
include the mask or not, so one doesn't have to depend on the compiler.
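
<![CDATA[
Roughly what I have in mind, as a sketch only. MASK_U8, START_MARK and start_mark are hypothetical names, and the shape of the start-mark expression is an assumption about the macro under discussion, not a copy of it.

```c
#include <limits.h>

/* MASK_U8 is a hypothetical name, not a macro from the perl source. */
#if UCHAR_MAX > 0xFF
#  define MASK_U8(x) ((x) & 0xFF)   /* wide chars: keep only the low 8 bits */
#else
#  define MASK_U8(x) (x)            /* 8-bit chars: the store truncates anyway */
#endif

/* Assumed shape of a UTF-8 start-byte marker: the top len bits set. */
#define START_MARK(len) MASK_U8(0xFE << (7 - (len)))

/* Start byte marker for a sequence of len bytes, 2 <= len <= 6. */
unsigned char start_mark(int len)
{
    return (unsigned char)START_MARK(len);
}
```

The preprocessor picks the variant, so the result does not depend on the compiler optimizing a redundant mask away.
]]>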

</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-23T02:10:27</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64343">
    <title>Re: ? Bug in macro UTF_START_MARK when U8 &gt; 8 bits</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64343</link>
    <description>On approximately 11/22/2008 3:13 PM, came the following characters from 
the keyboard of karl williamson:


Ah, um, yes, probably.  Well that is the definition straight from C, in 
which Perl is implemented.  (Of course bits are smaller, but Perl 
doesn't expose bits except through bit vectors; they are not directly an 
integral type, and so are irrelevant to this discussion.)  So that 
doesn't answer the question completely.

Still leaving me with the question below.




But I don't know if there are any, either.  UCHAR_MAX would tell the 
tale, indeed.  A reasonable compiler for an 8-bit architecture should 
optimize away any   &amp; 0xFF  operations, so I doubt it would be harmful 
to fix it, even if there are no such platforms, but if there are not, I 
don't expect any in the future either... unless there is a jump to a 
minimum 16-bit integer size, someday.

</description>
    <dc:creator>Glenn Linderman</dc:creator>
    <dc:date>2008-11-23T00:33:07</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64342">
    <title>Re: [perl #58182] Unicode bug: More questions about coding</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64342</link>
    <description>On approximately 11/22/2008 2:54 PM, came the following characters from 
the keyboard of karl williamson:


Maybe so.  Would be a more efficient noop that way.  But what I meant, 
is that if the user wants a noop, they wouldn't generally code it as

use bytes;
$foo = uc( $foo );
no bytes;




I'm assuming you are saying here that the worst case for a lowercase 
character converted to uppercase, or an uppercase character converted to 
lowercase, can be 3:1 (since these are the operations of concern), 
rather than the worst case conversion of one character byte to one UTF-8 
sequence is 3:1 (since I don't think that happens).

It is a space vs time tradeoff... and the results are highly dependent 
on the data being manipulated...

So you allocate 3:1 space, if you don't use it, do you give it back, or 
leave it dangle for the next potential operation?




My comment was similar to others I've seen here.  I'm by no means an 
insider, although I've been hanging around quite a while.  You are 
looking at the code and doing the work; as long as you have a reasonable 
justification (like the comment you found) for the change, I think it 
will fly.  Gratuitous changes don't seem to be particularly welcome, but 
if it makes the code more correct, easier to maintain, shorter, not 
measurably slower, things seem to be accepted.




Adding 32 to numbers in that range is the same as OR 32; subtracting 32 
from numbers in that range is the same as AND ~32.  Flipping the bit via 
logical operations, versus doing arithmetic.  6 of one, a half-dozen of 
the other.  Logic operations used to be faster, way back when, because 
there was no possibility of carries; with today's processors, it is 
generally one clock for either.
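
<![CDATA[
To make the equivalence concrete, here is a small C sketch for ASCII machines only. The function names are illustrative, not anything in the perl source.

```c
#include <assert.h>

/* Lowercase an ASCII uppercase letter, by arithmetic and by bit-setting. */
unsigned char lower_add(unsigned char c)
{
    assert(c >= 'A' && c <= 'Z');
    return (unsigned char)(c + 32);
}

unsigned char lower_or(unsigned char c)
{
    assert(c >= 'A' && c <= 'Z');
    return (unsigned char)(c | 32);
}

/* Uppercase an ASCII lowercase letter, by arithmetic and by bit-clearing. */
unsigned char upper_sub(unsigned char c)
{
    assert(c >= 'a' && c <= 'z');
    return (unsigned char)(c - 32);
}

unsigned char upper_and(unsigned char c)
{
    assert(c >= 'a' && c <= 'z');
    return (unsigned char)(c & ~32);
}

/* Returns 1 if the arithmetic and bitwise forms agree on every letter. */
int forms_agree(void)
{
    for (unsigned char c = 'A'; c <= 'Z'; c++)
        if (lower_add(c) != lower_or(c))
            return 0;
    for (unsigned char c = 'a'; c <= 'z'; c++)
        if (upper_sub(c) != upper_and(c))
            return 0;
    return 1;
}
```

The two forms agree because bit 5 (value 32) is clear in every ASCII uppercase letter and set in every lowercase one, so no carry or borrow ever occurs.
]]>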

Now, though, you've got me not understanding something.

If there was "ad hoc" logic to test for a-z A-Z ranges, and it is about 
the same expense as a table, but that now you've invented a table for 
the 128-255 range, wouldn't it be simpler overall to use the table for 
the a-z A-Z ranges also?




</description>
    <dc:creator>Glenn Linderman</dc:creator>
    <dc:date>2008-11-23T00:27:03</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64341">
    <title>Re: on expectations of privacy</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64341</link>
    <description>

... a great deal of that danger due to the fact that for all of the 
documentation on XS (including the book and a quarter on it), the easiest way 
to figure out how to do something in a Perl extension is to read the Perl 
source code.

Make it easier for people to copy and paste *good* code, and they'll copy and 
paste less naughty code (not *no* naughty code; just less).

</description>
    <dc:creator>chromatic</dc:creator>
    <dc:date>2008-11-22T23:23:25</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64340">
    <title>Re: [perl #58182] Unicode bug: code review request</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64340</link>
    <description>
The email that got through the filters stripped off my html file, I 
presume for security reasons.

</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-22T23:14:20</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64339">
    <title>Re: ? Bug in macro UTF_START_MARK when U8 &gt; 8 bits</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64339</link>
    <description>
I asked earlier about U8 size, and Nicholas Clark responded: "It's 
always going to be an unsigned char, it's always going to be the
smallest type on the platform, and it's always going to be at least 8 bits."

I'm pretty sure it is a bug on such architectures, if there are any. 
The same problem occurs with the UTF8_ACCUMULATE macro.  limits.h has a 
UCHAR_MAX macro that can be used to decide if this is a problem or not, 
and, if it is, just add in a 0xFF mask.  Is this worth fixing?
One file refers to a CHARMASK macro, but it only tests whether it is 
defined, and it never is.
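
<![CDATA[
A sketch of the suspected problem, in C. It assumes the start-mark computation has the shape 0xFE << (7 - len), which is from memory and may differ in detail from the real macro; the macro and function names here are made up, and uint16_t stands in for an unsigned char wider than 8 bits.

```c
#include <stdint.h>

/* Assumed shape of the computation; not copied from the perl source. */
#define START_MARK_UNMASKED(len) (0xFE << (7 - (len)))
#define START_MARK_MASKED(len)   ((0xFE << (7 - (len))) & 0xFF)

/* uint16_t simulates a U8 that is wider than 8 bits. */
uint16_t wide_unmasked(int len)
{
    return (uint16_t)START_MARK_UNMASKED(len);   /* high bits survive */
}

uint16_t wide_masked(int len)
{
    return (uint16_t)START_MARK_MASKED(len);     /* only the low 8 bits */
}
```

For len == 2 the unmasked form leaves 0x1FC0 in the wide type, where a real 8-bit store would have truncated it to 0xC0; the masked form gives 0xC0 either way.
]]>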

</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-22T23:13:24</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64338">
    <title>Re: Conflation of "ASCII" , "utf8" with other things in perl.</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64338</link>
    <description>I  would like to fix some of these things I've found confusing.  The 
biggest one for my coding so far is NATIVE_TO_ASCII, when it means 
NATIVE_TO_LATIN1, but only if the patches might be met with favor by 
those who can approve them.  After all, I wouldn't really be fixing 
bugs, but making it easier to maintain.  And I don't know what the perl 
culture is about making changes.  There are also a few instances where 
the EBCDIC/ASCII stuff could be cleaned up.  I'd like to do that.

I think the EBCDIC stuff could be explained by a paragraph in the 
perlapi pod.

I'd be happy to make pod changes as I discover problems.  I would think 
that I'd want to use git for that for the files where the pod is derived 
from the .c, as I could change the documentation, and not have to worry 
about someone else changing the code before I submit the patch.  (I 
would like one large patch instead of a line here and there.)  From 
reading this list, it sounds like git is at least pretty good at 
handling changes to a source file that come from multiple sources, so 
that manual intervention isn't often required.  Yves gave a simple 
formula for a porter to use git, which looked pretty easy.  Is that 
really all I would need to do?

</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-22T23:09:48</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64337">
    <title>Re: [perl #58182] Unicode bug: More questions about coding</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64337</link>
    <description>
In thinking about this some more, it seems that if we wanted to make 
things no-ops in bytes mode, the place to do it would be the parser (or 
whatever it is that sets up the execution stack), so that functions that 
are no-ops aren't even called.
The worst case in Unicode is 3:1.  But I decided to choose the worst 
case, as that is what is done when a string is upgraded to utf8.

I still am feeling my way about the change culture here.  I've worked on 
projects where only a major bug warranted a change--customers had to 
live with the ones that management didn't deem major enough.  And I've 
worked on projects where the developer was God, and could do whatever 
they liked.  I prefer ones where there is a discussion and consensus as 
to what should happen.

That's similar to what I did.

I don't understand what you're saying here, but I use the existing code 
to handle characters in this range, which comes down to testing whether 
'A' &lt;= c &lt;= 'Z' on ASCII machines and then adding 32 to get the lower 
case; similarly, subtracting 32, for lower case going the other way.  A 
table lookup is about the same amount of machine work.
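
A minimal sketch of the two approaches compared above, assuming an ASCII 
platform.  The names lower_by_range, lower_table and init_lower_table 
are hypothetical and are not taken from the perl source.

<![CDATA[
```c
/* Hypothetical helpers for illustration; assumes an ASCII platform. */

/* Range test: 'A'..'Z' maps to lower case by adding 32 */
static unsigned char lower_by_range(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 32) : c;
}

/* Table lookup: one indexed load per character once the table is filled */
static unsigned char lower_table[256];

static void init_lower_table(void)
{
    int i;

    for (i = 0; i < 256; i++)
        lower_table[i] = lower_by_range((unsigned char)i);
}
```
]]>

The range test costs two comparisons and an add; the table costs one 
load.  Which is faster in practice depends on the machine and on whether 
the table is already in cache.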


I ended up doing it for all the functions, and including the range 
128-255 without going to the general Unicode functions.  The expense for 
characters above 255 is 2 comparisons.  The payoff for characters below 
256 is significant.


</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-22T22:54:46</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64336">
    <title>[perl #58182] Unicode bug: code review request</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64336</link>
    <description>Attached for code review are changes to pp.c to enable case handling of 
characters in the range 128-255.  This is not a patch yet.  Since I'm 
new to changing the perl source, I wanted to get feedback before 
submitting a real patch.  Also, changes in a couple of other files 
depend on some unresolved issues.

There are two attachments.  One is in standard patch format.  The other 
is an html file containing another type of diff that I prefer.  In it, 
the cyan background marks deleted things and the yellow marks added 
ones; changes in line indentation are not shown.

I'm wondering what sorts of things I might be overlooking.  I do know 
that I don't know much about how overloading or magic might affect these 
routines.  I don't think my changes would affect these, but then, I 
don't know much about these.

There are a couple of comments marked TODO, which means I have special 
questions about them.

One of my design goals was to not slow things down unnecessarily.  I 
think that, if anything, I have sped things up.  One area of concern I 
have, though, in that regard is in the loop in pp_uc.  I don't know 
anything about modern optimizers.  Before, it was a very tight loop, 
which an optimizer could reasonably handle.  Now, the mainstream is a 
tight loop, but there is a conditional in it which can cause a 
significant amount of code to be executed, that could fool the 
optimizer.  Perhaps the non-mainstream case should be put in a function.

I added some code to improve the efficiency of utf8 handling, so that if 
perl has hard-coded into it the upper and lower cases of a character, it 
doesn't have to go out to the general Unicode functions.

Most of the changes (as opposed to additions) are in pp_ucfirst().  Most 
of these come from rearranging the code so that in all cases the changed 
character is known before processing the rest of the string. 
Previously, this preliminary step was done only if the string was 
encoded in utf-8.  Otherwise, it was done as it went along.

An earlier author contemplated combining lc and uc into one function.  I 
haven't done that yet.  At the expense of two extra comparisons per 
function call, I could save perhaps as much space in perl as I've used 
up by adding the code for the new functionality.

I have added several code-generating macros.  Normally, I don't like 
these, but I think it makes things more readable here.

I have added many more comments than are typical in the Perl source.  I 
earlier got feedback that this might be a good thing.

If I don't get any feedback, I'll end up submitting a patch.  This does 
pass all regression tests.  With the new behavior enabled it fails one, 
which I have gotten permission from the author to change.

If you were to try compiling this, you would get missing symbol errors 
for macros defined in headers that are not included here.  The meaning 
of several of these should be obvious, but here are definitions of ones 
that may not be:

IN_UNI_8_BIT is true if and only if characters in the range 128-255 are 
to be treated as having upper and lower case as defined by Unicode. 
When false, these routines should deliver the same results as they 
always have.

toLOWER_LATIN1(c) takes a character in the range 0-255 and returns its 
lower case as defined by Unicode.

toUPPER_LATIN1_MOD(c) takes a character in the range 0-255 and returns 
its upper case as defined by Unicode, except for 3 tricky characters, 
for which it returns a modified value, as explained in the code's comments.

UTF8_TWO_BYTE_HI(c) returns the first utf8-encoded byte for a Unicode 
character c whose utf8 is known to take exactly two bytes.  Similarly 
for UTF8_TWO_BYTE_LO, which returns the second byte.
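
A minimal sketch of what the two-byte macros compute, assuming standard 
UTF-8 on an ASCII platform.  TWO_BYTE_HI, TWO_BYTE_LO and store_two_byte 
are hypothetical stand-ins, not the definitions from the perl headers, 
and UTF-EBCDIC uses a different bit layout.

<![CDATA[
```c
/* Hypothetical stand-ins for UTF8_TWO_BYTE_HI / UTF8_TWO_BYTE_LO.
 * Assumes standard UTF-8 (not UTF-EBCDIC) and a code point c in the
 * range U+0080..U+07FF, which always needs exactly two bytes. */
#define TWO_BYTE_HI(c) ((unsigned char)(0xC0 | (((c) >> 6) & 0x1F)))
#define TWO_BYTE_LO(c) ((unsigned char)(0x80 | ((c) & 0x3F)))

/* Store the two UTF-8 bytes of code point c at p[0] and p[1] */
static void store_two_byte(unsigned char *p, unsigned c)
{
    p[0] = TWO_BYTE_HI(c);   /* 110xxxxx: top five of the eleven bits */
    p[1] = TWO_BYTE_LO(c);   /* 10xxxxxx: low six bits */
}
```
]]>

Under these assumptions U+0178 is stored as the bytes 0xC5 0xB8 and 
U+039C as 0xCE 0x9C.
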
--- pp.c.blead	2008-11-17 00:37:48.000000000 -0700
+++ pp.c	2008-11-22 14:01:58.000000000 -0700
@@ -3521,22 +3521,90 @@
 #endif
 }
 
+/* Generally UTF-8 and UTF-EBCDIC are indistinguishable at this level.  So 
+ * most comments below say UTF-8, when in fact they mean UTF-EBCDIC as well */
+
+/* Both the characters below can be stored in two UTF-8 bytes.  In UTF-8 the max
+ * character that 2 bytes can hold is U+07FF, and in UTF-EBCDIC it is U+03FF.
+ * See http://www.unicode.org/unicode/reports/tr16 */
+
+#define LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS 0x0178 /* Also is title case */
+#define GREEK_CAPITAL_LETTER_MU 0x039C /* Upper and title case of MICRO SIGN */
+
+/* Generates code to store a unicode codepoint c that is known to occupy
+ * exactly two UTF-8 and UTF-EBCDIC bytes into p and p+1. */
+#define STORE_UNI_TO_UTF8_TWO_BYTE(p, c)    \
+    *(p) = UTF8_TWO_BYTE_HI(c);    \
+    *((p)+1) = UTF8_TWO_BYTE_LO(c);
+
+/* Like STORE_UNI_TO_UTF8_TWO_BYTE, but advances p to point to the next
+ * available byte after the two bytes */
+#define CAT_UNI_TO_UTF8_TWO_BYTE(p, c)    \
+    *(p)++ = UTF8_TWO_BYTE_HI(c);    \
+    *((p)++) = UTF8_TWO_BYTE_LO(c);
+
+/* Generates code to store the upper case of latin1 character l which is known
+ * to have its upper case be non-latin1 into the two bytes p and p+1.  There
+ * are only two characters that fit this description, and this macro knows
+ * about them, and that the upper case values fit into two UTF-8 or UTF-EBCDIC
+ * bytes */
+#define STORE_NON_LATIN1_UC(p, l)    \
+if ((l) == LATIN_SMALL_LETTER_Y_WITH_DIAERESIS) {    \
+    STORE_UNI_TO_UTF8_TWO_BYTE((p), LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS)  \
+} else {    \
+    STORE_UNI_TO_UTF8_TWO_BYTE((p), GREEK_CAPITAL_LETTER_MU)    \
+}
+
+/* Like STORE_NON_LATIN1_UC, but advances p to point to the next available byte
+ * after the character stored */
+#define CAT_NON_LATIN1_UC(p, l)    \
+if ((l) == LATIN_SMALL_LETTER_Y_WITH_DIAERESIS) {    \
+    CAT_UNI_TO_UTF8_TWO_BYTE((p), LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS)    \
+} else {    \
+    CAT_UNI_TO_UTF8_TWO_BYTE((p), GREEK_CAPITAL_LETTER_MU)    \
+}
+
+/* Generates code to add the two UTF-8 bytes (probably u) that are the upper
+ * case of l into p and p+1.  u must be the result of toUPPER_LATIN1_MOD(l),
+ * and must require two bytes to store it.  Advances p to point to the next
+ * available position */
+#define CAT_TWO_BYTE_UNI_UPPER(p, l, u)    \
+if ((u) != LATIN_SMALL_LETTER_Y_WITH_DIAERESIS) {    \
+    CAT_UNI_TO_UTF8_TWO_BYTE((p), (u))/* not special case, just save it */ \
+} else if ((l) == LATIN_SMALL_LETTER_SHARP_S) {    \
+    *(p)++ = 'S'; *(p)++ = 'S'; /* upper case is 'SS' */    \
+} else {    \
+    CAT_NON_LATIN1_UC((p), (l)) /* else is one of the other special cases */ \
+}
+
 PP(pp_ucfirst)
 {
+    /* Both lcfirst() and ucfirst().  Only the first character changes.  This
+     * means that likely we can change in-place, i.e., just take the source and
+     * change that one character and store it back, but not if read-only etc.,
+     * or if the length changes */
+
     dVAR;
     dSP;
     SV *source = TOPs;
-    STRLEN slen;
+    STRLEN slen; /* slen is the byte length of the whole SV. */
     STRLEN need;
     SV *dest;
-    bool inplace = TRUE;
-    bool doing_utf8;
+    bool inplace = TRUE;    /* Be optimistic that the new and old lengths are
+       the same, so only need to change the first
+       character, in place */
+
+    bool doing_utf8 = FALSE;
+    bool convert_source_to_utf8 = FALSE;   /* If need to convert */
     const int op_type = PL_op-&gt;op_type;
     const U8 *s;
     U8 *d;
     U8 tmpbuf[UTF8_MAXBYTES_CASE+1];
-    STRLEN ulen;
-    STRLEN tculen;
+    STRLEN ulen;    /* ulen is the byte length of the original Unicode character
+     * stored as UTF-8 at s. */
+    STRLEN tculen;  /* tculen is the byte length of the freshly titlecased (or
+     * lowercased) character stored in tmpbuf.  May be either
+     * UTF-8 or not, but in either case is the number of bytes */
 
     SvGETMAGIC(source);
     if (SvOK(source)) {
@@ -3548,23 +3616,161 @@
 slen = 0;
     }
 
-    if (slen &amp;&amp; DO_UTF8(source) &amp;&amp; UTF8_IS_START(*s)) {
+
+    /* First calculate what the changed first character should be.  This affects
+     * whether we can just swap it out, leaving the rest of the string unchanged,
+     * or even if the dest has to be in UTF-8 even if the source isn't */
+
+    if (! slen) {   /* If empty */
+need = slen + 1; /* still need a trailing NUL */
+    } else if (DO_UTF8(source)) {
 doing_utf8 = TRUE;
-utf8_to_uvchr(s, &amp;ulen);
-if (op_type == OP_UCFIRST) {
-    toTITLE_utf8(s, tmpbuf, &amp;tculen);
+
+/* If the source character is invariant, it is ASCII (or, in EBCDIC, a
+ * mapping of an ASCII character or a caseless C1 control).  In ASCII,
+ * the lower and upper cases of any character are also ASCII (and title
+ * case is the same as upper case).  So it is safe to use the simple
+ * case change macros which avoid the overhead of the general
+ * functions.  Note that if perl were to be extended to do locale
+ * handling in UTF-8 strings, this wouldn't be true in, for example,
+ * Lithuanian or Turkic.  */
+
+if (UTF8_IS_INVARIANT(*s)) {
+    *tmpbuf = (op_type == OP_LCFIRST) ? toLOWER(*s) : toUPPER(*s);
+    inplace = TRUE;
+    tculen = ulen = 1;
+    need = slen + 1;
+} else if (UTF8_IS_DOWNGRADEABLE_START(*s)) {
+    U8 chr;
+
+    /* Similarly, if the source character isn't ASCII but is in the
+     * latin1 range (or EBCDIC mapping thereof), we have the case
+     * changes compiled into perl, and can avoid the overhead of the
+     * general functions.  In this range, the characters are stored as
+     * two UTF-8 bytes, and it turns out that any changed-case version is
+     * also two bytes (in both ASCIIish and EBCDIC machines). */
+
+    inplace = TRUE;
+    tculen = ulen = 2;
+    need = slen + 1;
+
+    /* Convert the two source bytes to a single Unicode code point
+     * value, change case and save for below */
+
+    chr = UTF8_ACCUMULATE(*s, *(s+1));
+    if (op_type == OP_LCFIRST) {
+U8 lower = toLOWER_LATIN1(chr);
+STORE_UNI_TO_UTF8_TWO_BYTE(tmpbuf, lower);
+    } else {
+U8 upper = toUPPER_LATIN1_MOD(chr);
+
+/* Most of the latin1 range characters are well-behaved.  Their
+ * title and upper cases are the same, and are also in the
+ * latin1 range.  The macro above returns their upper (hence
+ * title) case, and all that need be done is to save the result
+ * for below.  However, several characters are problematic, and
+ * have to be handled specially.  The MOD means that these
+ * tricky characters all get mapped to the single character
+ * tested just below */
+
+if (upper != LATIN_SMALL_LETTER_Y_WITH_DIAERESIS) {
+    STORE_UNI_TO_UTF8_TWO_BYTE(tmpbuf, upper);
+} else if (chr == LATIN_SMALL_LETTER_SHARP_S) {
+
+    /* This one is tricky because the title and upper cases are
+     * different, but mostly because they are two characters
+     * long, though the UTF-8 is still two bytes, so the stored
+     * length doesn't change */
+
+    *tmpbuf = 'S';  /* This one is tricky.  The UTF-8 is 'Ss' */
+    *(tmpbuf + 1) = 's';
+} else {
+
+    /* The others have their title and upper cases the same,
+     * but are tricky because the changed-case characters
+     * aren't in the latin1 range.  They, however, do fit into
+     * two UTF-8 bytes */
+
+    STORE_NON_LATIN1_UC(tmpbuf, chr);    
+}
+    }
 } else {
-    toLOWER_utf8(s, tmpbuf, &amp;tculen);
+
+    /* Here, can't short-cut the general case */
+
+    utf8_to_uvchr(s, &amp;ulen);
+    if (op_type == OP_UCFIRST) {
+toTITLE_utf8(s, tmpbuf, &amp;tculen);
+    } else {
+toLOWER_utf8(s, tmpbuf, &amp;tculen);
+    }
+    /* we can do in-place if and only if the lengths are the same.  */
+    inplace = (ulen == tculen);
+    need = slen + 1 - ulen + tculen;
+}
+    } else { /* Not UTF-8.  Need to consider locale and if latin1 is treated as
+caseless.  Note that a locale takes precedence */
+tculen = 1;    /* Most characters will require one byte, */
+need = slen + 1;    /* but will be overridden for the tricky ones */
+if (op_type == OP_LCFIRST) {
+
+    /* lower case the first letter: no trickiness for any character */
+
+    *tmpbuf = (IN_LOCALE_RUNTIME) ? toLOWER_LC(*s) :
+((IN_UNI_8_BIT) ? toLOWER_LATIN1(*s) : toLOWER(*s));
+} else  /* is ucfirst() */ if (IN_LOCALE_RUNTIME) {
+    *tmpbuf = toUPPER_LC(*s);/* This would be a bug if any locales
+   have upper and title case different */
+} else if (! IN_UNI_8_BIT) {
+    *tmpbuf = toUPPER(*s);/* Non-ascii are caseless, or on EBCDIC
+   machines whatever the native
+   function does */
+} else { /* is ucfirst non-UTF-8, not in locale, and cased latin1 */
+    *tmpbuf = toUPPER_LATIN1_MOD(*s);
+
+    /* tmpbuf now has the correct title case for all latin1 characters
+     * except for the several ones that have tricky handling.  All
+     * of these are mapped by the MOD to the letter below. */
+
+    if (*tmpbuf == LATIN_SMALL_LETTER_Y_WITH_DIAERESIS) {
+
+/* We use the original to distinguish between the tricky cases */
+
+if (*s == LATIN_SMALL_LETTER_SHARP_S) {
+    /* Two character title case 'Ss', but can remain non-UTF-8 */
+    need = slen + 2;
+    *tmpbuf = 'S';
+    *(tmpbuf + 1) = 's';   /* The length of tmpbuf is &gt;= 2 */
+    tculen = 2;
+} else {
+
+    /* The other two tricky ones have their title case outside
+     * latin1.  It is the same as their upper case. */
+
+    STORE_NON_LATIN1_UC(tmpbuf, *s)
+
+    /* The UTF-8 and UTF-EBCDIC lengths of both these characters
+     * and their upper cases is 2. */
+
+    tculen = ulen = 2;
+    doing_utf8 = TRUE;
+    inplace = FALSE;
+
+    /* The entire result will have to be in UTF-8.  Assume worst
+     * case sizing in conversion. */
+
+    convert_source_to_utf8 = TRUE;
+    need = slen * 2 + 1;
+}
+    }
 }
-/* If the two differ, we definately cannot do inplace.  */
-inplace = (ulen == tculen);
-need = slen + 1 - ulen + tculen;
-    } else {
-doing_utf8 = FALSE;
-need = slen + 1;
     }
 
-    if (SvPADTMP(source) &amp;&amp; !SvREADONLY(source) &amp;&amp; inplace &amp;&amp; SvTEMP(source)) {
+
+    /* Here, have the first character's changed case stored in tmpbuf.  Ready to
+     * generate the result */
+
+    if (inplace &amp;&amp; SvPADTMP(source) &amp;&amp; !SvREADONLY(source) &amp;&amp; SvTEMP(source)) {
 /* We can convert in place.  */
 
 dest = source;
@@ -3585,46 +3791,71 @@
 
     if (doing_utf8) {
 if(!inplace) {
-    /* slen is the byte length of the whole SV.
-     * ulen is the byte length of the original Unicode character
-     * stored as UTF-8 at s.
-     * tculen is the byte length of the freshly titlecased (or
-     * lowercased) Unicode character stored as UTF-8 at tmpbuf.
-     * We first set the result to be the titlecased (/lowercased)
-     * character, and then append the rest of the SV data. */
-    sv_setpvn(dest, (char*)tmpbuf, tculen);
-    if (slen &gt; ulen)
-        sv_catpvn(dest, (char*)(s + ulen), slen - ulen);
+    if (! convert_source_to_utf8) {
+
+/* Here both source and dest are in UTF-8, but have to create
+ * the entire output.  We initialize the result to be the
+ * title/lower cased first character, and then append the rest
+ * of the string. */
+
+sv_setpvn(dest, (char*)tmpbuf, tculen);
+if (slen &gt; ulen) {
+    sv_catpvn(dest, (char*)(s + ulen), slen - ulen);
+}
+    } else {
+const U8 *const send = s + slen;
+
+/* Here the dest needs to be in UTF-8, but the source isn't,
+ * except we have UTF-8'd the first character of the source
+ * into tmpbuf.  First put that into dest, and then append the
+ * rest of the source, converting it to UTF-8 as we go. */
+
+/* Assert tculen is 2 here because the only two characters that
+ * get to this part of the code have 2-byte UTF-8 equivalents */
+*d++ = *tmpbuf;
+*d++ = *(tmpbuf + 1);
+s++;/* We have just processed the 1st char */
+
+for (; s &lt; send; s++) {
+    d = uvchr_to_utf8(d, *s);
+}
+*d = '\0';
+SvCUR_set(dest, d - (U8*)SvPVX_const(dest));
+    }
     SvUTF8_on(dest);
 }
-else {
+else {   /* in-place UTF-8.  Just overwrite the first character */
     Copy(tmpbuf, d, tculen, U8);
     SvCUR_set(dest, need - 1);
 }
     }
-    else {
-if (*s) {
+    else {  /* Not UTF-8 */
+if (slen) {
     if (IN_LOCALE_RUNTIME) {
 TAINT;
 SvTAINTED_on(dest);
-*d = (op_type == OP_UCFIRST)
-    ? toUPPER_LC(*s) : toLOWER_LC(*s);
     }
-    else
-*d = (op_type == OP_UCFIRST) ? toUPPER(*s) : toLOWER(*s);
+    if (inplace) {  /* in-place, only need to change the 1st char */
+*d = *tmpbuf;
+    } else {/* Not in-place */
+
+/* First copy the case-changed character(s) from tmpbuf */
+Copy(tmpbuf, d, tculen, U8);
+
+/* Then the rest.  This will copy the trailing NUL */
+Copy(s + 1, d + tculen, slen, U8);
+SvCUR_set(dest, need - 1);
+    }
 } else {
     /* See bug #39028  */
     *d = *s;
 }
 
+/* It can be that we don't treat the source as UTF-8 if in a "use bytes",
+ * but, still want the destination to retain that flag */
+
 if (SvUTF8(source))
     SvUTF8_on(dest);
-
-if (!inplace) {
-    /* This will copy the trailing NUL  */
-    Copy(s + 1, d + 1, slen, U8);
-    SvCUR_set(dest, need - 1);
-}
     }
     SvSETMAGIC(dest);
     RETURN;
@@ -3644,43 +3875,37 @@
     const U8 *s;
     U8 *d;
 
+    dTARGET;
+
     SvGETMAGIC(source);
 
-    if (SvPADTMP(source) &amp;&amp; !SvREADONLY(source) &amp;&amp; !SvAMAGIC(source)
-&amp;&amp; SvTEMP(source) &amp;&amp; !DO_UTF8(source)) {
-/* We can convert in place.  */
 
-dest = source;
-s = d = (U8*)SvPV_force_nomg(source, len);
-min = len + 1;
-    } else {
-dTARGET;
+    dest = TARG;
 
-dest = TARG;
+    /* The old implementation would copy source into TARG at this point.
+       This had the side effect that if source was undef, TARG was now
+       an undefined SV with PADTMP set, and they don't warn inside
+       sv_2pv_flags(). However, we're now getting the PV direct from
+       source, which doesn't have PADTMP set, so it would warn. Hence the
+       little games.  */
 
-/* The old implementation would copy source into TARG at this point.
-   This had the side effect that if source was undef, TARG was now
-   an undefined SV with PADTMP set, and they don't warn inside
-   sv_2pv_flags(). However, we're now getting the PV direct from
-   source, which doesn't have PADTMP set, so it would warn. Hence the
-   little games.  */
+    if (SvOK(source)) {
+s = (const U8*)SvPV_nomg_const(source, len);
+    } else {
+if (ckWARN(WARN_UNINITIALIZED))
+    report_uninit(source);
+s = (const U8*)"";
+len = 0;
+    }
 
-if (SvOK(source)) {
-    s = (const U8*)SvPV_nomg_const(source, len);
-} else {
-    if (ckWARN(WARN_UNINITIALIZED))
-report_uninit(source);
-    s = (const U8*)"";
-    len = 0;
-}
-min = len + 1;
+    SvUPGRADE(dest, SVt_PV);
 
-SvUPGRADE(dest, SVt_PV);
-d = (U8*)SvGROW(dest, min);
-(void)SvPOK_only(dest);
+    min = len + 1;
 
-SETs(dest);
-    }
+    d = (U8*)SvGROW(dest, min);
+    (void)SvPOK_only(dest);
+
+    SETs(dest);
 
     /* Overloaded values may have toggled the UTF-8 flag on source, so we need
        to check DO_UTF8 again here.  */
@@ -3690,29 +3915,51 @@
 U8 tmpbuf[UTF8_MAXBYTES+1];
 
 while (s &lt; send) {
-    const STRLEN u = UTF8SKIP(s);
-    STRLEN ulen;
 
-    toUPPER_utf8(s, tmpbuf, &amp;ulen);
-    if (ulen &gt; u &amp;&amp; (SvLEN(dest) &lt; (min += ulen - u))) {
-/* If the eventually required minimum size outgrows
- * the available space, we need to grow. */
-const UV o = d - (U8*)SvPVX_const(dest);
-
-/* If someone uppercases one million U+03B0s we SvGROW() one
- * million times.  Or we could try guessing how much to
- allocate without allocating too much.  Such is life. */
-SvGROW(dest, min);
-d = (U8*)SvPVX(dest) + o;
-    }
-    Copy(tmpbuf, d, ulen, U8);
-    d += ulen;
-    s += u;
+
+    /* If the UTF-8 character is invariant, then it is in the range
+     * known by the standard macro; result is only one byte long */
+
+    if (UTF8_IS_INVARIANT(*s)) {
+*d++ = toUPPER(*s);
+s++;
+    } else if (UTF8_IS_DOWNGRADEABLE_START(*s)) {
+
+/* Likewise, if it fits in a byte, its case change is in our
+ * table */
+
+U8 orig = UTF8_ACCUMULATE(*s, *(s+1));
+U8 upper = toUPPER_LATIN1_MOD(orig);
+CAT_TWO_BYTE_UNI_UPPER(d, orig, upper);
+s += 2;
+    } else {
+
+/* Otherwise, need the general UTF-8 case */
+
+const STRLEN u = UTF8SKIP(s);
+STRLEN ulen;
+
+toUPPER_utf8(s, tmpbuf, &amp;ulen);
+if (ulen &gt; u &amp;&amp; (SvLEN(dest) &lt; (min += ulen - u))) {
+    /* If the eventually required minimum size outgrows
+     * the available space, we need to grow. */
+    const UV o = d - (U8*)SvPVX_const(dest);
+
+    /* If someone uppercases one million U+03B0s we SvGROW() one
+     * million times.  Or we could try guessing how much to
+     allocate without allocating too much.  Such is life. */
+    SvGROW(dest, min);
+    d = (U8*)SvPVX(dest) + o;
+}
+Copy(tmpbuf, d, ulen, U8);
+d += ulen;
+s += u;
+    }
 }
 SvUTF8_on(dest);
 *d = '\0';
 SvCUR_set(dest, d - (U8*)SvPVX_const(dest));
-    } else {
+    } else {/* Not UTF-8 */
 if (len) {
     const U8 *const send = s + len;
     if (IN_LOCALE_RUNTIME) {
@@ -3721,20 +3968,115 @@
 for (; s &lt; send; d++, s++)
     *d = toUPPER_LC(*s);
     }
-    else {
-for (; s &lt; send; d++, s++)
+    else if (! IN_UNI_8_BIT) {
+for (; s &lt; send; d++, s++) {
     *d = toUPPER(*s);
+}
+    } else {
+for (; s &lt; send; d++, s++) {
+    *d = toUPPER_LATIN1_MOD(*s);
+    if (*d != LATIN_SMALL_LETTER_Y_WITH_DIAERESIS) continue;
+
+    /* To avoid extra tests in the mainstream case, all
+     * characters that require special handling are mapped by
+     * the MOD to the one above.  To get here means it is not
+     * mainstream, needs special handling.  Use the original
+     * source to distinguish between the cases */
+
+    if (*s == LATIN_SMALL_LETTER_SHARP_S) {
+
+/* uc() of this requires 2 characters, but they are
+ * ASCII.  If not enough room, grow the string */
+
+if (SvLEN(dest) &lt; ++min) {
+    const UV o = d - (U8*)SvPVX_const(dest);
+    SvGROW(dest, min);
+    d = (U8*)SvPVX(dest) + o;
+}
+*d++ = 'S'; *d = 'S'; /* upper case is 'SS' */
+continue;
+    }
+
+    /* The other two special handling characters have their
+     * upper cases outside the latin1 range, hence need to be
+     * in UTF-8, so the whole result needs to be in UTF-8.  So,
+     * here we are somewhere in the middle of processing a
+     * non-UTF-8 string, and realize that we will have to convert
+     * to UTF-8.  What to do?  There are several possibilities.
+     * The simplest to code is to convert what we have so far,
+     * set a flag, and continue on in the loop.  The flag would
+     * be tested each time through the loop, and if set, the
+     * next character would be converted to UTF-8 and stored.
+     * But, I didn't want to slow down the mainstream case at
+     * all for this fairly rare case, so I (khw) didn't want to
+     * add a test that didn't absolutely have to be there in
+     * the loop, besides the possibility that it would get too
+     * complicated for optimizers to deal with.  Another
+     * possibility is to just give up, convert the source to
+     * UTF-8, and restart the function that way.  Another
+     * possibility is to convert both what has already been
+     * processed and what is yet to come separately to UTF-8,
+     * then jump into the loop that handles UTF-8.  But the most
+     * efficient time-wise is what follows, and turned out to
+     * not require much extra code.  */
+    /* Convert what we have so far into UTF-8 */
+
+    len = d - (U8*)SvPVX_const(dest);
+    SvCUR_set(dest, len);
+    len = sv_utf8_upgrade(dest);
+
+    /* Assume the worst case space requirements for converting
+     * what we haven't processed so far: that it will require
+     * two bytes for each input character, plus the NUL at the
+     * end, and make sure there is enough room in the
+     * destination for that amount.  This may cause the string
+     * pointer to move, so re-find it.  TODO: It would be nice
+     * to have the upgrade reserve this much space to avoid the
+     * possibility of two grows */
+
+    SvGROW(dest, len + (2 * (send -s)) + 1);
+    d = (U8*)SvPVX(dest) + len;
+
+    /* And append the current character's upper case in UTF-8 */
+
+    CAT_NON_LATIN1_UC(d, *s)
+
+    /* Now process the remainder of the input, converting to
+     * upper and UTF-8.  If each resulting byte is invariant in
+     * UTF-8, output as-is, otherwise convert to UTF-8 and append
+     * it to the output.  The MOD characters are all variant */
+
+    s++;
+    for (; s &lt; send; s++) {
+U8 upper = toUPPER_LATIN1_MOD(*s);
+if (UTF8_IS_INVARIANT(upper)) {
+    *d++ = upper;
+} else {
+    CAT_TWO_BYTE_UNI_UPPER(d, *s, upper);
+}
+    }
+
+    /* Here have processed the whole input; no need to continue
+     * with the outer loop.  Each character has been converted
+     * to upper case and converted to UTF-8 */
+
+    break;
+}
     }
 }
-if (source != dest) {
-    *d = '\0';
-    SvCUR_set(dest, d - (U8*)SvPVX_const(dest));
-}
+*d = '\0';  /* Here d points to 1 after last char, add NUL */
+SvCUR_set(dest, d - (U8*)SvPVX_const(dest));
     }
     SvSETMAGIC(dest);
     RETURN;
 }
 
+/* lc currently doesn't properly handle the case of GREEK_CAPITAL_LETTER_SIGMA,
+ * which has a special lower case when it occurs at the end of a word.  This
+ * comment gives a discussion of this problem.
+ *
+ * To be furnished
+ * */
 PP(pp_lc)
 {
     dVAR;
@@ -3792,40 +4134,41 @@
 U8 tmpbuf[UTF8_MAXBYTES_CASE+1];
 
 while (s &lt; send) {
-    const STRLEN u = UTF8SKIP(s);
-    STRLEN ulen;
-    const UV uv = toLOWER_utf8(s, tmpbuf, &amp;ulen);
+    if (UTF8_IS_INVARIANT(*s)) {
+*d++ = toLOWER(*s);
+s++;
+    } else if (UTF8_IS_DOWNGRADEABLE_START(*s)) {
+U8 lower = toLOWER_LATIN1(UTF8_ACCUMULATE(*s, *(s+1)));
+CAT_UNI_TO_UTF8_TWO_BYTE(d, lower);
+s += 2;
+    } else {
+const STRLEN u = UTF8SKIP(s);
+STRLEN ulen;
+const UV uv = toLOWER_utf8(s, tmpbuf, &amp;ulen);
 
 #define GREEK_CAPITAL_LETTER_SIGMA 0x03A3 /* Unicode U+03A3 */
-    if (uv == GREEK_CAPITAL_LETTER_SIGMA) {
-NOOP;
-/*
- * Now if the sigma is NOT followed by
- * /$ignorable_sequence$cased_letter/;
- * and it IS preceded by /$cased_letter$ignorable_sequence/;
- * where $ignorable_sequence is [\x{2010}\x{AD}\p{Mn}]*
- * and $cased_letter is [\p{Ll}\p{Lo}\p{Lt}]
- * then it should be mapped to 0x03C2,
- * (GREEK SMALL LETTER FINAL SIGMA),
- * instead of staying 0x03A3.
- * "should be": in other words, this is not implemented yet.
- * See lib/unicore/SpecialCasing.txt.
- */
-    }
-    if (ulen &gt; u &amp;&amp; (SvLEN(dest) &lt; (min += ulen - u))) {
-/* If the eventually required minimum size outgrows
- * the available space, we need to grow. */
-const UV o = d - (U8*)SvPVX_const(dest);
-
-/* If someone lowercases one million U+0130s we SvGROW() one
- * million times.  Or we could try guessing how much to
- allocate without allocating too much.  Such is life. */
-SvGROW(dest, min);
-d = (U8*)SvPVX(dest) + o;
-    }
-    Copy(tmpbuf, d, ulen, U8);
-    d += ulen;
-    s += u;
+/* TODO Shouldn't this be commented out?, to avoid pointless
+ * test?  */
+
+if (uv == GREEK_CAPITAL_LETTER_SIGMA) {
+    NOOP;
+    /* See comments before this routine about handling this */
+}
+if (ulen &gt; u &amp;&amp; (SvLEN(dest) &lt; (min += ulen - u))) {
+    /* If the eventually required minimum size outgrows
+     * the available space, we need to grow. */
+    const UV o = d - (U8*)SvPVX_const(dest);
+
+    /* If someone lowercases one million U+0130s we SvGROW() one
+     * million times.  Or we could try guessing how much to
+     allocate without allocating too much.  Such is life. */
+    SvGROW(dest, min);
+    d = (U8*)SvPVX(dest) + o;
+}
+Copy(tmpbuf, d, ulen, U8);
+d += ulen;
+s += u;
+    }
 }
 SvUTF8_on(dest);
 *d = '\0';
@@ -3839,9 +4182,14 @@
 for (; s &lt; send; d++, s++)
     *d = toLOWER_LC(*s);
     }
-    else {
-for (; s &lt; send; d++, s++)
+    else if (! IN_UNI_8_BIT) {
+for (; s &lt; send; d++, s++) {
     *d = toLOWER(*s);
+}
+    } else {
+for (; s &lt; send; d++, s++) {
+    *d = toLOWER_LATIN1(*s);
+}
     }
 }
 if (source != dest) {
@@ -4769,7 +5117,7 @@
     if (do_utf8) {
 while (m &lt; strend &amp;&amp; !( *m == ' ' || is_utf8_space((U8*)m) )) {
     const int t = UTF8SKIP(m);
-    /* is_utf8_space returns FALSE for malform utf8 */
+    /* is_utf8_space returns FALSE for malformed UTF-8 */
     if (strend - m &lt; t)
 m = strend;
     else
</description>
    <dc:creator>karl williamson</dc:creator>
    <dc:date>2008-11-22T22:24:22</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64335">
    <title>Re: on expectations of privacy</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64335</link>
    <description>
Sounds like fun.  Big fan of cleanups, I am.
</description>
    <dc:creator>Chip Salzenberg</dc:creator>
    <dc:date>2008-11-22T20:42:10</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64334">
    <title>[patch@34896] vms readdir() fixes for UNIX/EFS mode</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64334</link>
    <description>Readdir() on VMS, when opening a directory with a UNIX file 
specification, should return all the files in UNIX format.

This includes dropping the ".DIR" from the directory specifications.

Because traditionally Perl has not done this, this fix will only be done 
when the DECC$EFS_CHARSET feature is enabled.
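
A simplified sketch of the ".DIR" dropping described above, for 
illustration only.  drop_dir_suffix is a hypothetical helper and is not 
part of vms.c.  Unlike the patch, it works on a whole name string, 
accepts a name with no version, and does not stat() the entry.

<![CDATA[
```c
#include <ctype.h>
#include <string.h>

/* drop_dir_suffix is a hypothetical helper, not a function from vms.c.
 * It removes a trailing ".DIR" or ".DIR;1" (any letter case) from name in
 * place and returns 1 if it removed something, else 0.  Unlike the patch,
 * it does not stat() the entry to confirm that it really is a directory. */
static int drop_dir_suffix(char *name)
{
    size_t len = strlen(name);
    const char *ext;

    /* Ignore a trailing version number of exactly ";1" */
    if (len >= 2 && strcmp(name + len - 2, ";1") == 0)
        len -= 2;

    if (len < 4)
        return 0;

    ext = name + len - 4;
    if (ext[0] == '.'
        && toupper((unsigned char)ext[1]) == 'D'
        && toupper((unsigned char)ext[2]) == 'I'
        && toupper((unsigned char)ext[3]) == 'R') {
        name[len - 4] = '\0';   /* cut the extension and any version */
        return 1;
    }
    return 0;
}
```
]]>

The stat() check in the real patch matters because an ordinary file can 
also be named with a ".DIR" extension.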


Next on the fix list for Unix compatibility mode: &lt;*&gt; always returns 
specifications in VMS format instead of UNIX format when 
DECC$FILENAME_UNIX_REPORT is active.

Regards,
-John
--- /rsync_root/perl/vms/vms.c	Mon Nov 10 06:51:29 2008
+++ vms/vms.c	Sat Nov 22 10:46:34 2008
@@ -9631,11 +9631,32 @@
 &amp;vs_spec,
 &amp;vs_len);
 
-    /* Drop NULL extensions on UNIX file specification */
-    if ((dd-&gt;flags &amp; PERL_VMSDIR_M_UNIXSPECS &amp;&amp;
-(e_len == 1) &amp;&amp; decc_readdir_dropdotnotype)) {
-e_len = 0;
-e_spec[0] = '\0';
+    if (dd-&gt;flags &amp; PERL_VMSDIR_M_UNIXSPECS) {
+
+        /* In Unix report mode, remove the ".dir;1" from the name */
+        /* if it is a real directory. */
+        if (decc_filename_unix_report || decc_efs_charset) {
+            if ((e_len == 4) &amp;&amp; (vs_len == 2) &amp;&amp; (vs_spec[1] == '1')) {
+                if ((toupper(e_spec[1]) == 'D') &amp;&amp;
+                    (toupper(e_spec[2]) == 'I') &amp;&amp;
+                    (toupper(e_spec[3]) == 'R')) {
+                    Stat_t statbuf;
+                    int ret_sts;
+
+                    ret_sts = stat(buff, (stat_t *)&amp;statbuf);
+                    if ((ret_sts == 0) &amp;&amp; S_ISDIR(statbuf.st_mode)) {
+                        e_len = 0;
+                        e_spec[0] = 0;
+                    }
+                }
+            }
+        }
+
+        /* Drop NULL extensions on UNIX file specification */
+        if ((e_len == 1) &amp;&amp; decc_readdir_dropdotnotype) {
+            e_len = 0;
+            e_spec[0] = '\0';
+        }
     }
 
     strncpy(dd-&gt;entry.d_name, n_spec, n_len + e_len);
</description>
    <dc:creator>John E. Malmberg</dc:creator>
    <dc:date>2008-11-22T17:31:58</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64333">
    <title>Re: on expectations of privacy</title>
    <link>http://permalink.gmane.org/gmane.comp.lang.perl.perl5.porters/64333</link>
    <description>
Gtk2 and friends[1] do this by using ExtUtils::Depends.

-Torsten

[1] Nearly all of the modules in http://search.cpan.org/~TSCH/ link to, or are 
linked to by, one of the others.

</description>
    <dc:creator>Torsten Schoenfeld</dc:creator>
    <dc:date>2008-11-22T17:06:21</dc:date>
  </item>
  <textinput about="http://search.gmane.org/?group=$group=gmane.comp.lang.perl.perl5.porters">
    <title>Search Engine</title>
    <description>Search the mailing list at Gmane</description>
    <name>query</name>
    <link>http://search.gmane.org/?group=$group=gmane.comp.lang.perl.perl5.porters</link>
  </textinput>
</rdf:RDF>
