<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://blog.gmane.org/gmane.comp.file-systems.ceph.devel">
    <title>gmane.comp.file-systems.ceph.devel</title>
    <link>http://blog.gmane.org/gmane.comp.file-systems.ceph.devel</link>
    <description/>
    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>1</syn:updateFrequency>
    <syn:updateBase>1901-01-01T00:00+00:00</syn:updateBase>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15659"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15658"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15649"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15643"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15617"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15605"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15601"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15599"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15594"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15587"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15580"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15570"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15569"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15568"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15567"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15549"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15547"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15546"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15542"/>
        <rdf:li rdf:resource="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15541"/>
      </rdf:Seq>
    </items>
    <image rdf:resource="http://gmane.org/img/gmane-25t.png"/>
    <textinput rdf:resource=""/>
  </channel>
  <image rdf:about="http://gmane.org/img/gmane-25t.png">
    <title>Gmane</title>
    <url>http://gmane.org/img/gmane-25t.png</url>
    <link>http://gmane.org</link>
  </image>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15659">
    <title>OSD throttles documentation</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15659</link>
    <description>&lt;pre&gt;Hi John,

I am having trouble interpreting

http://ceph.com/docs/master/dev/osd_internals/osd_throttles/

Was it generated by a tool? I would very much appreciate a hint :-)

Cheers

&lt;/pre&gt;</description>
    <dc:creator>Loic Dachary</dc:creator>
    <dc:date>2013-06-18T21:32:34</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15658">
    <title>ceph-deploy problems on weird /dev device names?</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15658</link>
    <description>&lt;pre&gt;I remember seeing a few reports of problems from users with strange block 
device names in /dev (sdaa*, c0d1p2* etc.) and have a bug open 
(http://tracker.ceph.com/issues/5345), but looking at the code I don't 
immediately see the problem, and I don't have any machines that have this 
problem.  Are there any users who have seen this problem who can try the 
latest version and/or help test a fix?

Thanks!
sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo&amp;lt; at &amp;gt;vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

&lt;/pre&gt;</description>
    <dc:creator>Sage Weil</dc:creator>
    <dc:date>2013-06-18T20:43:02</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15649">
    <title>Erasure code library summary</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15649</link>
    <description>&lt;pre&gt;Hi Ceph,

TL;DR: use jerasure 1.2 with Reed-Solomon to code/decode/repair an object, and upgrade to 2.0 when available.

Disclaimer: I'm no expert ;-) The terms are explained on Wikipedia[1].

Using Reed-Solomon, an object O is encoded by dividing it into consecutive chunks O1, O2, ... ON and computing parity blocks P1, P2, ... PK. Reading the original content of object O is a simple concatenation of O1, O2, ... ON. If O2 or P2 is lost, it can be repaired/reconstructed using the surviving chunks among O1 ... ON and P1 ... PK. If the use case is mostly reading objects and repairs are at least 1000 times less likely than normal operations, being able to read the object from non-coded chunks is attractive.
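
As a rough illustration of the chunk/parity layout described above, here is a minimal Python sketch. It is a deliberate simplification: it uses a single XOR parity chunk (K = 1) instead of real Reed-Solomon arithmetic, so it can only repair one lost chunk, and the helper names (split_chunks, xor_parity, repair) are hypothetical and not part of jerasure.

```python
from functools import reduce

def split_chunks(data, n):
    """Split data into n equal-size chunks, zero-padding the tail."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(n)]

def xor_parity(chunks):
    """Single parity chunk: byte-wise XOR of all chunks (stand-in for P1)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def repair(chunks, parity, lost):
    """Rebuild the chunk at index `lost` from the survivors and the parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return xor_parity(survivors + [parity])

obj = b"hello erasure coded world"
chunks = split_chunks(obj, 4)
parity = xor_parity(chunks)

# Normal read: plain concatenation of the unencoded chunks.
restored = b"".join(chunks)[:len(obj)]

# Repair: pretend O2 (index 1) is lost and rebuild it.
rebuilt = repair(chunks, parity, 1)
```

The normal read path never touches the parity chunk, which is the property that makes unencoded chunks attractive for read-mostly workloads.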

Reed-Solomon is significantly more expensive to encode (on the order of 100MB/s on a single 2.5GHz core) than fountain codes with the current jerasure implementation[2]. However, gf-complete[3], which will be used in the upcoming version of jerasure, significantly improves performance (2 to 10 times faster) and the difference becomes negligible.

The Reed-Solomon coding family is the only one that can keep the chunks unencoded and therefore concatenable.

The jerasure library is packaged and being worked on by the author at the moment. All other Free Software implementations are either not packaged or not maintained. 

The license[4] of jerasure is compatible with the license of Ceph.

Performance depends on the parameters to the Reed-Solomon functions, but it will also be influenced by the buffer sizes used when calling the encoding functions: smaller buffers will mean more calls and more overhead.

Open questions:

* Does the Mojette Transform [5] have compelling qualities compared to other code families?
* Do hierarchical codes [6] have compelling qualities? Implementing them would require a different API. To be effective they need to take into account the context in which an object is stored, whereas the other codes only require the object itself.
* I have not experimented with the jerasure API yet.

Feedback and criticisms are welcome :-)

[1] http://en.wikipedia.org/wiki/Erasure_code
[2] jerasure 1.2 http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html
[3] gf-complete http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703.html
[4] jerasure license https://github.com/tsuraan/Jerasure/blob/master/License.txt
[5] Mojette Transform http://en.wikipedia.org/wiki/Mojette_Transform
[6] hierarchical codes http://www.e-biersack.eu/BPublished/nc_springer.pdf


&lt;/pre&gt;</description>
    <dc:creator>Loic Dachary</dc:creator>
    <dc:date>2013-06-18T12:22:59</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15643">
    <title>(unknown)</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15643</link>
    <description>&lt;pre&gt;



&lt;/pre&gt;</description>
    <dc:creator>AFG GTBANK LOAN</dc:creator>
    <dc:date>2013-06-17T19:02:19</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15617">
    <title>ceph branch status</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15617</link>
    <description>&lt;pre&gt;&lt;/pre&gt;</description>
    <dc:creator>ceph branch robot</dc:creator>
    <dc:date>2013-06-17T15:00:16</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15605">
    <title>[PATCH 0/8] misc fixes for mds</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15605</link>
    <description>&lt;pre&gt;From: "Yan, Zheng" &amp;lt;zheng.z.yan&amp;lt; at &amp;gt;intel.com&amp;gt;

these patches are also in:
  git://github.com/ukernel/ceph.git wip-mds

Regards
Yan, Zheng

&lt;/pre&gt;</description>
    <dc:creator>Yan, Zheng</dc:creator>
    <dc:date>2013-06-17T12:10:21</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15601">
    <title>rbdwrapper: userland library for transparent access to rbd images</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15601</link>
    <description>&lt;pre&gt;Hi,

I would like to contribute source code for an
experimental pure userland library to provide transparent
client access to rbd images without the use
of the client kernel modules rbd.ko and libceph.ko.

The attached rpm source package basically provides
 - the code for librbdwrapper.so
 - and for the admin command rbdwrapper_adm 

librbdwrapper.so can be utilized by an application
via the LD_PRELOAD mechanism and intercepts some
basic standard library calls like open(), close(),
read() or write() to rbd images.

A call to open() is intercepted only if the path
name in the call refers to an rbdwrapper file system
object. These are administered by the rbdwrapper_adm
command, which plays a similar role to the rbd
subcommands
  rbd map &amp;lt;image name&amp;gt;
  rbd showmapped
for rbd images via the rbd.ko and libceph.ko kernel
modules.

The benefit of librbdwrapper is that
 - it utilizes the same userland libraries which are
   used by the ceph daemons
 - it allows for a higher degree of parallelism as far
   as the number of communication channels to the
   OSDs is concerned: on a single system with multiple
   client applications, each of these applications
   uses its own instance of the ceph messenger
   communication module

The drawback of librbdwrapper compared to real block
devices as provided by the rbd.ko kernel module is that
it does not provide a real block device, i.e. it cannot
utilize Linux kernel vfs services like mount() or the
generic block device driver services.


Regards

Andreas Bluemle


&lt;/pre&gt;</description>
    <dc:creator>Andreas Bluemle</dc:creator>
    <dc:date>2013-06-17T07:30:37</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15599">
    <title>[PATCH 2/3] mds: fix cap revoke race</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15599</link>
    <description>&lt;pre&gt;From: "Yan, Zheng" &amp;lt;zheng.z.yan&amp;lt; at &amp;gt;intel.com&amp;gt;

If caps are being revoked by the auth MDS, don't consider them as
issued even if they are still issued by a non-auth MDS. The non-auth
MDS should also be revoking/exporting these caps; we just haven't
received the cap revoke/export message yet.

Signed-off-by: Yan, Zheng &amp;lt;zheng.z.yan&amp;lt; at &amp;gt;intel.com&amp;gt;
---
 fs/ceph/caps.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 9a5ccc9..a8c616b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -697,6 +697,15 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; int __ceph_caps_issued(struct ceph_inode_info *ci, int *implemented)
 if (implemented)
 *implemented |= cap-&amp;gt;implemented;
 }
+/*
+ * exclude caps issued by non-auth MDS, but are being revoked
+ * by the auth MDS. The non-auth MDS should be revoking/exporting
+ * these caps, but the message is delayed.
+ */
+if (ci-&amp;gt;i_auth_cap) {
+cap = ci-&amp;gt;i_auth_cap;
+have &amp;amp;= ~cap-&amp;gt;implemented | cap-&amp;gt;issued;
+}
 return have;
 }
 
&lt;/pre&gt;</description>
    <dc:creator>Yan, Zheng</dc:creator>
    <dc:date>2013-06-17T02:48:15</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15594">
    <title>reply to this email:eventmanager301&lt; at &gt;hotmail.com</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15594</link>
    <description>&lt;pre&gt;

&lt;/pre&gt;</description>
    <dc:creator>RESERVE BANK OF INDIA</dc:creator>
    <dc:date>2013-06-16T16:48:12</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15587">
    <title>Using GF-complete in Ceph</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15587</link>
    <description>&lt;pre&gt;Hi James,

It would be great to re-use GF-complete ( http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703.html ) as a basis for erasure coding in Ceph ( http://ceph.com/ ). The engineering work to make room for erasure coding began a few weeks ago ( at the moment replication is used ). I'll keep working on it until it's done. 

Any advice you may have on how to approach the integration of erasure coding in Ceph would be very welcome.

Cheers

&lt;/pre&gt;</description>
    <dc:creator>Loic Dachary</dc:creator>
    <dc:date>2013-06-15T07:44:13</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15580">
    <title>Comments on Ceph distributed parity implementation</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15580</link>
    <description>&lt;pre&gt;Dear Community
I am a young engineer (not software or math, please bear with me) with some suggestions regarding erasure codes. I have never used Ceph or any other distributed file system.

I stumbled upon the suggestion for adding erasure codes to Ceph, as
described in this article
http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

First, I would like to say that adding erasure codes to Ceph is a great initiative.
Ceph needs its own implementation and it has to be done right. I cannot stress this enough: the software suggested in that article would result in very low performance.

Why?
Reed-Solomon is normally regarded as being very slow compared to other erasure codes, because the underlying Galois-Field multiplication is slow. Please see the video at usenix.org for an explanation.

The Zfec library and the other suggested software rely on the Vandermonde matrix, a matrix used in Reed-Solomon erasure codes; a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3].
There is something even better, though: by using the Intel SSE2/3 SIMD instructions it is possible to do this as fast as any other XOR-based erasure code (RaptorQ LT-codes, LDPC etc.).

The suggested FECpp lib uses this optimisation, but with a relatively small Galois field of only 2^8. Since Ceph aims at unlimited scalability, increasing the size of the Galois field would improve performance [4]. Of course the configured Ceph object size and/or stripe width have to be taken into account.
Please see
https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions


The solution
Use the GF-Complete open source library [4] to implement Reed-Solomon in Ceph in order to allow Ceph to scale to infinity.
James S. Plank, the author of GF-complete, has developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
Jerasure 2.0, which uses the GF-complete arithmetic based on Intel SSE SIMD instructions, is currently in development, with a release expected in August 2013. It will be released under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes of 8, 16, 32, 64 or 128 bits.

The limit of this implementation would be the processor's L2/L3 cache, not the underlying arithmetic.

Best Regards
Martin Flyvbjerg

[1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
[2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
[3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
[4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf

&lt;/pre&gt;</description>
    <dc:creator>Martin Flyvbjerg</dc:creator>
    <dc:date>2013-06-14T20:13:26</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15570">
    <title>Writing to RBD image while its snapshot is being created causes I/O errors</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15570</link>
    <description>&lt;pre&gt;Hi,

I noticed that writing to an RBD image using the kernel driver while its snapshot 
is being created causes I/O errors and the filesystem (reiserfs) eventually 
aborts and remounts itself in read-only mode:

[192507.327359] end_request: I/O error, dev rbd7, sector 818528
[192507.331510] end_request: I/O error, dev rbd7, sector 819200
[192507.348478] end_request: I/O error, dev rbd7, sector 408
[192507.352647] REISERFS abort (device rbd7): Journal write error in 
flush_commit_list

Is this happening by design or is it a bug? I know that I should freeze the 
filesystem before attempting to create a snapshot, but these I/O errors seem 
unnecessary to me. If it's really impossible to write to an image while its 
snapshot is being created, couldn't write requests be blocked until 
snapshotting completes?

I'm using ceph 0.56.6 and kernel 3.4.48.

&lt;/pre&gt;</description>
    <dc:creator>Karol Jurak</dc:creator>
    <dc:date>2013-06-14T14:47:19</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15569">
    <title>PG recovery throttling and queue processing optimizations</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15569</link>
    <description>&lt;pre&gt;Hello!

I would like to present our optimizations for Ceph, such as priority-based
recovery throttling and OSD op queue processing optimizations.

I would also like to see your comments on this patch.
Ideas from this patch may be useful for real optimizations.
The patch is against the latest cuttlefish.

This patch includes two things:

1. Recovery throttling and reordering based on OSD load.
2. Optimization of OpWQ processing.


First of all, the following variables are calculated: the PG cost, the OSD
average load and a "prioritized recovery" flag for each PG.
The OSD average load is calculated as the geometric mean of the PG costs.
The PG cost and the PG prioritized recovery flag are set in the calculate_pg_cost() function.
Client and high-priority ops increase the PG cost and trigger setting of the
priority flag, while backfills decrease it.
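
To make the "geometric mean of PG costs" concrete, here is a minimal Python sketch. It only illustrates the definition; the function name osd_average_load and the sample costs are assumptions, and the patch below computes its load value from the per-PG queue sizes in a slightly different way.

```python
import math

def osd_average_load(pg_costs):
    """Geometric mean of the per-PG costs; 0.0 when there are no PGs."""
    if not pg_costs:
        return 0.0
    return math.prod(pg_costs) ** (1.0 / len(pg_costs))

load = osd_average_load([1.0, 10.0, 100.0])  # cube root of 1000
```

Compared with an arithmetic mean, a geometric mean is less dominated by a single very expensive PG.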

1. Recovery throttling and reordering based on OSD load.
We insert a pause after every recovery operation, from 0.1 to 5.3 seconds,
depending on the PG load and on whether the PG currently being recovered has
high priority.
The formula is:
throttle = pg_recovery_prio ? 0.1 : 0.3 + osd_average_load * throttle_coef
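
Read with C operator precedence, the formula gives priority PGs a flat pause and every other PG a base pause plus a load-dependent term. A minimal Python sketch, assuming the default values of the osd_recovery_throttle* options added in the patch below (the function name recovery_throttle is hypothetical):

```python
THROTTLE = 0.3          # osd_recovery_throttle default in the patch
THROTTLE_ACTIVE = 0.1   # osd_recovery_throttle_active default
THROTTLE_COEF = 0.08    # osd_recovery_throttle_coef default

def recovery_throttle(pg_recovery_prio, osd_average_load):
    """Pause in seconds to insert after one recovery operation."""
    if pg_recovery_prio:
        return THROTTLE_ACTIVE
    return THROTTLE + osd_average_load * THROTTLE_COEF

prio_pause = recovery_throttle(True, 40.0)    # flat pause for priority PGs
idle_pause = recovery_throttle(False, 0.0)    # base pause on an idle OSD
busy_pause = recovery_throttle(False, 62.5)   # load that reaches the 5.3 s upper bound
```

With these defaults, the stated upper bound of 5.3 seconds corresponds to an average load of 62.5.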

PGs with the recovery priority flag set are recovered first, and then
the other PGs.

2. Optimization of OpWQ processing.
While using the "bobtail" release we noticed that ceph spends a lot of time
waiting for pg.lock(). So we modified the OpWQ dequeuing and processing code to
try to handle ops for other PGs if the current PG is already locked.

To achieve this goal we move all incoming ops to the "pg_for_processing" map,
and process PGs not in incoming order but in descending order of PG cost,
skipping locked PGs.

So op threads will never wait for a PG to be unlocked if there are other PGs to
process.
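
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of "highest cost first, skip locked PGs"; the function name next_pg, the dict/set inputs and the PG names are assumptions and do not mirror the C++ data structures in the patch.

```python
import heapq

def next_pg(costs, locked):
    """Return the highest-cost PG that is not locked, or None.

    costs:  dict mapping a PG name to its current cost
    locked: set of PG names whose lock is currently held
    """
    # heapq is a min-heap, so negate the cost to pop the most expensive PG first.
    heap = [(-cost, name) for name, cost in costs.items()]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        if name not in locked:
            return name
    return None  # every queued PG is locked; the caller can wait and retry

costs = {"pg.a": 5, "pg.b": 40, "pg.c": 12}
first = next_pg(costs, locked=set())
skipping = next_pg(costs, locked={"pg.b"})
blocked = next_pg(costs, locked={"pg.a", "pg.b", "pg.c"})
```

A worker only has to wait in the last case, when every queued PG is locked.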


With part 1, the recovery process no longer affects most client operations.
Part 2 increased the efficiency of op queue processing, which affected disk I/O
utilization. It had a large effect on the "bobtail" release, and gave a small
improvement on "cuttlefish".

Any comments are welcome.


Thanks,
Sergey.

---

diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index d7684a4..d87b6ed 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -390,6 +390,9 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; OPTION(osd_op_pq_min_cost, OPT_U64, 65536)
 OPTION(osd_disk_threads, OPT_INT, 1)
 OPTION(osd_recovery_threads, OPT_INT, 1)
 OPTION(osd_recover_clone_overlap, OPT_BOOL, true)   // preserve clone_overlap during recovery/migration
+OPTION(osd_recovery_throttle, OPT_FLOAT, 0.3)
+OPTION(osd_recovery_throttle_active, OPT_FLOAT, 0.1)
+OPTION(osd_recovery_throttle_coef, OPT_FLOAT, 0.08)
 OPTION(osd_backfill_scan_min, OPT_INT, 64)
 OPTION(osd_backfill_scan_max, OPT_INT, 512)
 OPTION(osd_op_thread_timeout, OPT_INT, 15)
diff --git a/src/include/xlist.h b/src/include/xlist.h
index 5384561..50de9b1 100644
--- a/src/include/xlist.h
+++ b/src/include/xlist.h
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -150,6 +150,7 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; public:
   public:
     iterator(item *i = 0) : cur(i) {}
     T operator*() { return static_cast&amp;lt;T&amp;gt;(cur-&amp;gt;_item); }
+    item *get_cur() { return cur; }
     iterator&amp;amp; operator++() {
       assert(cur);
       assert(cur-&amp;gt;_list);
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -161,6 +162,9 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; public:
 
   iterator begin() { return iterator(_front); }
   iterator end() { return iterator(NULL); }
+  void remove(iterator i) {
+      remove(i.get_cur());
+  }
 };
 
 
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 91c214d..bbd8dd7 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -870,6 +870,10 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; OSD::OSD(int id, Messenger *internal_messenger, Messenger *external_messenger,
  Messenger *hbclientm, Messenger *hbserverm, MonClient *mc,
  const std::string &amp;amp;dev, const std::string &amp;amp;jdev) :
   Dispatcher(external_messenger-&amp;gt;cct),
+  pg_load(0.0),
+  m_osd_recovery_throttle(g_conf-&amp;gt;osd_recovery_throttle),
+  m_osd_recovery_throttle_active(g_conf-&amp;gt;osd_recovery_throttle_active),
+  m_osd_recovery_throttle_coef(g_conf-&amp;gt;osd_recovery_throttle_coef),
   osd_lock("OSD::osd_lock"),
   tick_timer(external_messenger-&amp;gt;cct, osd_lock),
   authorize_handler_cluster_registry(new AuthAuthorizeHandlerRegistry(external_messenger-&amp;gt;cct,
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -6421,6 +6425,62 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; void OSD::enqueue_op(PG *pg, OpRequestRef op)
   op_wq.queue(make_pair(PGRef(pg), op));
 }
 
+void OSD::OpWQ::_calculate_pg_cost(PGRef pg) {
+    assert(qlock.is_locked());
+    int cost = 0;
+    if (pg == NULL) return;
+    pg-&amp;gt;recovery_prio = false;
+    if (pg_for_processing.count(&amp;amp;*pg)) {
+        for (list&amp;lt;OpRequestRef&amp;gt;::iterator i = pg_for_processing[&amp;amp;*pg].begin(); i != pg_for_processing[&amp;amp;*pg].end(); i++) {
+            OpRequestRef op = *i;
+            if (op-&amp;gt;request-&amp;gt;get_type() == CEPH_MSG_OSD_OP ||
+                    op-&amp;gt;request-&amp;gt;get_type() == CEPH_MSG_OSD_OPREPLY ||
+                    op-&amp;gt;request-&amp;gt;get_type() == MSG_OSD_SUBOP ||
+                    op-&amp;gt;request-&amp;gt;get_type() == MSG_OSD_SUBOPREPLY) {
+                pg-&amp;gt;recovery_prio = true;
+            }
+        }
+        for (list&amp;lt;OpRequestRef&amp;gt;::iterator i = pg_for_processing[&amp;amp;*pg].begin(); i != pg_for_processing[&amp;amp;*pg].end(); i++) {
+            OpRequestRef op = *i;
+            if (op-&amp;gt;request-&amp;gt;get_type() == MSG_OSD_PG_BACKFILL ||
+                    op-&amp;gt;request-&amp;gt;get_type() == MSG_OSD_PG_SCAN
+                    /* || op-&amp;gt;request-&amp;gt;op == MOSDPGBackfill::OP_BACKFILL_PROGRESS
+                    || op-&amp;gt;request-&amp;gt;op == MOSDPGBackfill::OP_BACKFILL_FINISH */) {
+                if (pg-&amp;gt;recovery_prio) {
+                    cost += 1000;
+                } else {
+                    cost += 1;
+                }
+            } else {
+                cost += 10 * (op-&amp;gt;request-&amp;gt;get_priority() - 1) + 1;
+            }
+        }
+    }
+    pg_for_processing_costs[&amp;amp;*pg] = cost;
+}
+
+double OSD::OpWQ::_get_pg_cost(PG* pg) {
+    assert(qlock.is_locked());
+    int sum = pg_for_processing_costs[pg];
+    if (sum &amp;gt; 0 &amp;amp;&amp;amp; pg_for_processing.count(pg)) {
+        return (double)pg_for_processing_costs[pg] / pg_for_processing[pg].size();
+    }
+    return 0.0;
+}
+
+double OSD::pg_cost(PG *pg)
+{
+    Mutex::Locker l(op_wq.qlock);
+    return op_wq._get_pg_cost(pg);
+}
+
+void OSD::OpWQ::_push_pg(PGRef pg) {
+    assert(qlock.is_locked());
+    assert(pg.get() != NULL);
+    _calculate_pg_cost(pg);
+    pg_for_processing_queue.push(&amp;amp;*pg);
+}
+
 void OSD::OpWQ::_enqueue(pair&amp;lt;PGRef, OpRequestRef&amp;gt; item)
 {
   unsigned priority = item.second-&amp;gt;request-&amp;gt;get_priority();
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -6432,6 +6492,11 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; void OSD::OpWQ::_enqueue(pair&amp;lt;PGRef, OpRequestRef&amp;gt; item)
   else
     pqueue.enqueue(item.second-&amp;gt;request-&amp;gt;get_source_inst(),
       priority, cost, item);
+  {
+      Mutex::Locker l(qlock);
+      _calculate_pg_cost(&amp;amp;*(item.first));
+      _push_pg(item.first);
+  }
   osd-&amp;gt;logger-&amp;gt;set(l_osd_opq, pqueue.length());
 }
 
&amp;lt; at &amp;gt;&amp;lt; at &amp;gt; -6454,41 +6519,144 &amp;lt; at &amp;gt;&amp;lt; at &amp;gt; void OSD::OpWQ::_enqueue_front(pair&amp;lt;PGRef, OpRequestRef&amp;gt; item)
   else
     pqueue.enqueue_front(item.second-&amp;gt;request-&amp;gt;get_source_inst(),
       priority, cost, item);
+  {
+      Mutex::Locker l(qlock);
+      _calculate_pg_cost(&amp;amp;*(item.first));
+      _push_pg(item.first);
+  }
   osd-&amp;gt;logger-&amp;gt;set(l_osd_opq, pqueue.length());
 }
 
 PGRef OSD::OpWQ::_dequeue()
 {
-  assert(!pqueue.empty());
+  static unsigned int order = 0;
+  ++order;
+  //assert(!_empty());
   PGRef pg;
+  unsigned int sum = 0;
   {
     Mutex::Locker l(qlock);
-    pair&amp;lt;PGRef, OpRequestRef&amp;gt; ret = pqueue.dequeue();
-    pg = ret.first;
-    pg_for_processing[&amp;amp;*pg].push_back(ret.second);
+#undef dout_prefix
+#define dout_prefix *_dout
+    dout(10) &amp;lt;&amp;lt; "MAP_SIZE: pg_for_processing=" &amp;lt;&amp;lt; pg_for_processing.size() &amp;lt;&amp;lt; "; pfp_queue=" &amp;lt;&amp;lt; pg_for_processing_queue.size() &amp;lt;&amp;lt; "; pqueue=" &amp;lt;&amp;lt; pqueue.length() &amp;lt;&amp;lt; dendl;
+    if (!pg_for_processing_queue.empty()) {
+        while (pg.get() == NULL &amp;amp;&amp;amp; !pg_for_processing_queue.empty()) {
+            PG* top = pg_for_processing_queue.top();
+            pg_for_processing_queue.pop();
+            if (top != NULL) {
+                // check if usable
+                if (pg_for_processing.count(top)) {
+                    if (!top-&amp;gt;is_locked()) {
+                        pg = top;
+                    } else {
+                        pg_for_processing_postponed.insert(top);
+                    }
+                }
+            }
+        }
+    } else if (pqueue.empty() &amp;amp;&amp;amp; pg_for_processing.size()) {
+        unsigned int j = 0;
+        for (map&amp;lt;PG*, list&amp;lt;OpRequestRef&amp;gt; &amp;gt;::iterator i = pg_for_processing.begin(); i != pg_for_processing.end(); i++, j++) {
+            if (j == order % pg_for_processing.size()) {
+                if (!i-&amp;gt;first-&amp;gt;is_locked()) {
+                    pg = i-&amp;gt;first;
+                }
+                break;
+            }
+        }
+    }
+    else {
+        // do nothing
+    }
+
+    while (!pqueue.empty()) {
+        pair&amp;lt;PGRef, OpRequestRef&amp;gt; ret = pqueue.dequeue();
+        pg_for_processing[&amp;amp;*(ret.first)].push_back(ret.second);
+        if (pg.get() == NULL) {
+            pg = ret.first;
+        } else {
+            if (pg_for_processing_postponed.find(ret.first) == pg_for_processing_postponed.end()) {
+                _push_pg(ret.first);
+            }
+        }
+    }
+
+    double geom = 1.0;
+    for (map&amp;lt;PG*, list&amp;lt;OpRequestRef&amp;gt; &amp;gt;::iterator i = pg_for_processing.begin(); i != pg_for_processing.end(); i++) {
+        sum += i-&amp;gt;second.size();
+        geom *= i-&amp;gt;second.size();
+    }
+    osd-&amp;gt;pg_load = ::pow(geom, 1.0 / (pg_for_processing.size() + 1.0)) * pg_for_processing.size();
   }
   osd-&amp;gt;logger-&amp;gt;set(l_osd_opq, pqueue.length());
+  dout(10) &amp;lt;&amp;lt; "pg is " &amp;lt;&amp;lt; pg.get() &amp;lt;&amp;lt; dendl;
+#undef dout_prefix
+#define dout_prefix _prefix(_dout, whoami, get_osdmap())
+  if (pg.get() == NULL) {
+      Cond cond;
+      Mutex mutex("OSD::OpWQ::_process_throttle");
+      mutex.Lock();
+      cond.WaitInterval(g_ceph_context, mutex, utime_t(0, 5000ul));
+      mutex.Unlock();
+  }
   return pg;
 }
 
 void OSD::OpWQ::_process(PGRef pg)
 {
-  pg-&amp;gt;lock();
-  OpRequestRef op;
-  {
-    Mutex::Locker l(qlock);
-    if (!pg_for_processing.count(&amp;amp;*pg)) {
+#undef dout_prefix
+#define dout_prefix *_dout
+  utime_t start = ceph_clock_now(g_ceph_context);
+  dout(10) &amp;lt;&amp;lt; "start dequeue op from pg " &amp;lt;&amp;lt; pg &amp;lt;&amp;lt; "; pqueue len=" &amp;lt;&amp;lt; osd-&amp;gt;logger-&amp;gt;get(l_osd_opq) &amp;lt;&amp;lt; dendl;
+  bool locked = false;
+  if (pg.get() != NULL) {
+      locked = pg-&amp;gt;try_lock();
+  }
+  if (locked) {
+      OpRequestRef op;
+      {
+        Mutex::Locker l(qlock);
+        if (!pg_for_processing.count(&amp;amp;*pg)) {
+          pg-&amp;gt;unlock();
+          return;
+        }
+        assert(pg_for_processing[&amp;amp;*pg].size());
+        op = pg_for_processing[&amp;amp;*pg].front();
+        pg_for_processing[&amp;amp;*pg].pop_front();
+        if (!(pg_for_processing[&amp;amp;*pg].size())) {
+          pg_for_processing.erase(&amp;amp;*pg);
+          pg-&amp;gt;recovery_prio_lock.Lock();
+          pg-&amp;gt;recovery_prio = false;
+          pg-&amp;gt;recovery_prio_lock.Unlock();
+        } else
+          pg_for_processing_postponed.insert(pg);
+      }
+      osd-&amp;gt;dequeue_op(pg, op);
+      {
+          Mutex::Locker l(qlock);
+          if (pg_for_processing_postponed.find(pg) != pg_for_processing_postponed.end()) {
+              _push_pg(pg);
+              /*if (find(pg_for_processing_list.begin(), pg_for_processing_list.end(), pg) != pg_for_processing_list.end())
+                pg_for_processing_list.push_back(pg); */
+              pg_for_processing_postponed.erase(pg);
+          }
+      }
       pg-&amp;gt;unlock();
-      return;
-    }
-    assert(pg_for_processing[&amp;amp;*pg].size());
-    op = pg_for_processing[&amp;amp;*pg].front();
-    pg_for_processing[&amp;amp;*pg].pop_front();
-    if (!(pg_for_processing[&amp;amp;*pg].size()))
-      pg_for_processing.erase(&amp;amp;*pg);
+  } else {
+      {
+        Mutex::Locker l(qlock);
+        if (pg.get() != NULL) {
+/*            list&amp;lt;PGRef&amp;gt;::iterator pgpos = find(pg_for_processing_list.begin(), pg_for_processing_list.end(), pg);
+            if (pgpos != pg_for_processing_list.end())
+              pg_for_processing_list.erase(pgpos); */
+            pg_for_processing_postponed.insert(pg);
+        }
+      }
   }
-  osd-&amp;gt;dequeue_op(pg, op);
-  pg-&amp;gt;unlock();
+  utime_t stop = ceph_clock_now(g_ceph_context);
+  dout(10) &amp;lt;&amp;lt; "stop dequeue op from pg " &amp;lt;&amp;lt; pg &amp;lt;&amp;lt; ". operation took " &amp;lt;&amp;lt; (stop-start) &amp;lt;&amp;lt; " seconds" &amp;lt;&amp;lt; dendl;
+#undef dout_prefix
+#define dout_prefix _prefix(_dout, whoami, get_osdmap())
 }
 
 
@@ -6556,6 +6724,8 @@ void OSD::process_peering_events(
   ThreadPool::TPHandle &amp;amp;handle
   )
 {
+  dout(10) &amp;lt;&amp;lt; "process_peering_events start size=" &amp;lt;&amp;lt; pgs.size() &amp;lt;&amp;lt; dendl;
+  utime_t start = ceph_clock_now(g_ceph_context);
   bool need_up_thru = false;
   epoch_t same_interval_since = 0;
   OSDMapRef curmap = service.get_osdmap();
@@ -6566,6 +6736,8 @@ void OSD::process_peering_events(
     set&amp;lt;boost::intrusive_ptr&amp;lt;PG&amp;gt; &amp;gt; split_pgs;
     PG *pg = *i;
     pg-&amp;gt;lock();
+    utime_t start_pg = ceph_clock_now(g_ceph_context);
+    dout(10) &amp;lt;&amp;lt; "process_peering_events pg=" &amp;lt;&amp;lt; pg-&amp;gt;get_pgid() &amp;lt;&amp;lt; " start" &amp;lt;&amp;lt; dendl;
     curmap = service.get_osdmap();
     if (pg-&amp;gt;deleting) {
       pg-&amp;gt;unlock();
@@ -6591,6 +6763,7 @@ void OSD::process_peering_events(
     } else {
       dispatch_context_transaction(rctx, pg);
     }
+    dout(10) &amp;lt;&amp;lt; "process_peering_events pg=" &amp;lt;&amp;lt; pg-&amp;gt;get_pgid() &amp;lt;&amp;lt; " took " &amp;lt;&amp;lt; (double)(ceph_clock_now(g_ceph_context) - start_pg) &amp;lt;&amp;lt; dendl;
     pg-&amp;gt;unlock();
     handle.reset_tp_timeout();
   }
@@ -6599,6 +6772,7 @@ void OSD::process_peering_events(
   dispatch_context(rctx, 0, curmap);
 
   service.send_pg_temp();
+  dout(10) &amp;lt;&amp;lt; "process_peering_events " &amp;lt;&amp;lt; pgs.size() &amp;lt;&amp;lt; " took " &amp;lt;&amp;lt; (double)(ceph_clock_now(g_ceph_context) - start) &amp;lt;&amp;lt; dendl;
 }
 
 // --------------------------------
@@ -6607,6 +6781,9 @@ const char** OSD::get_tracked_conf_keys() const
 {
   static const char* KEYS[] = {
     "osd_max_backfills",
+    "osd_recovery_throttle",
+    "osd_recovery_throttle_active",
+    "osd_recovery_throttle_coef",
     NULL
   };
   return KEYS;
@@ -6619,6 +6796,13 @@ void OSD::handle_conf_change(const struct md_config_t *conf,
     service.local_reserver.set_max(g_conf-&amp;gt;osd_max_backfills);
     service.remote_reserver.set_max(g_conf-&amp;gt;osd_max_backfills);
   }
+  if (changed.count("osd_recovery_throttle") ||
+      changed.count("osd_recovery_throttle_active") ||
+      changed.count("osd_recovery_throttle_coef")) {
+    m_osd_recovery_throttle = conf-&amp;gt;osd_recovery_throttle;
+    m_osd_recovery_throttle_active = conf-&amp;gt;osd_recovery_throttle_active;
+    m_osd_recovery_throttle_coef = conf-&amp;gt;osd_recovery_throttle_coef;
+  }
 }
 
 // --------------------------------
@@ -6694,3 +6878,59 @@ int OSD::init_op_flags(OpRequestRef op)
 
   return 0;
 }
+
+/*---------------------------------------------------*/
+#undef dout_prefix
+#define dout_prefix (*_dout &amp;lt;&amp;lt; " recovery_wq ")
+PG* OSD::RecoveryWQ::_dequeue() {
+    if (osd-&amp;gt;recovery_queue.empty())
+        return NULL;
+
+    if (!osd-&amp;gt;_recover_now())
+        return NULL;
+
+    PG *pg = NULL;
+    for (xlist&amp;lt;PG*&amp;gt;::iterator i = osd-&amp;gt;recovery_queue.begin(); !i.end(); ++i) {
+        if ((*i)-&amp;gt;recovery_prio) {
+            pg = *i;
+            osd-&amp;gt;recovery_queue.remove(i); // invalidates i !
+            break;
+        }
+    }
+    if (pg == NULL) {
+        pg = osd-&amp;gt;recovery_queue.front();
+        osd-&amp;gt;recovery_queue.pop_front();
+    }
+    return pg;
+}
+
+void OSD::RecoveryWQ::_process(PG *pg, ThreadPool::TPHandle &amp;amp;handle) {
+  dout(10) &amp;lt;&amp;lt; "STARTREC " &amp;lt;&amp;lt; pg &amp;lt;&amp;lt; " recovery_wq=" &amp;lt;&amp;lt; osd-&amp;gt;recovery_queue.size() &amp;lt;&amp;lt; "; pg_load=" &amp;lt;&amp;lt; osd-&amp;gt;pg_load &amp;lt;&amp;lt; dendl;
+
+  double loadd = 0;
+  double cur_load = osd-&amp;gt;pg_load;
+  if (cur_load &amp;gt;= 1.0) cur_load -= 1.0;
+  double throttle_coef = osd-&amp;gt;m_osd_recovery_throttle_coef;
+  double loadfrac = modf(cur_load * throttle_coef, &amp;amp;loadd);
+  int load = (int)loadd;
+  if (load &amp;lt; 0) load = 0;
+  if (load &amp;gt; 5) load = 5;
+
+  osd-&amp;gt;do_recovery(pg);
+  pg-&amp;gt;put("RecoveryWQ");
+  dout(10) &amp;lt;&amp;lt; "ENDREC " &amp;lt;&amp;lt; pg &amp;lt;&amp;lt; dendl;
+  Mutex lock("RecWQ::process");
+  Cond cond;
+  lock.Lock();
+  unsigned long wait_time = 1.0e9 * osd-&amp;gt;m_osd_recovery_throttle;
+  if (pg-&amp;gt;recovery_prio) {
+      wait_time = 1.0e9 * osd-&amp;gt;m_osd_recovery_throttle_active;
+  }
+  wait_time += 1.0e9 * loadfrac;
+
+  handle.reset_tp_timeout();
+  dout(10) &amp;lt;&amp;lt; "WAITREC " &amp;lt;&amp;lt; load &amp;lt;&amp;lt; "," &amp;lt;&amp;lt; wait_time &amp;lt;&amp;lt; dendl;
+  cond.WaitInterval(g_ceph_context, lock, utime_t(load, wait_time));
+  lock.Unlock();
+  handle.reset_tp_timeout();
+}
diff --git a/src/osd/OSD.h b/src/osd/OSD.h
index ac2c634..a557558 100644
--- a/src/osd/OSD.h
+++ b/src/osd/OSD.h
@@ -15,6 +15,10 @@
 #ifndef CEPH_OSD_H
 #define CEPH_OSD_H
 
+#include "boost/heap/priority_queue.hpp"
+#undef _ASSERT_H
+#define _ASSERT_H _dout_cct
+
 #include "boost/tuple/tuple.hpp"
 
 #include "PG.h"
@@ -482,6 +486,11 @@ public:
   virtual const char** get_tracked_conf_keys() const;
   virtual void handle_conf_change(const struct md_config_t *conf,
   const std::set &amp;lt;std::string&amp;gt; &amp;amp;changed);
+  double pg_cost(PG* pg);
+  double pg_load;
+  double m_osd_recovery_throttle;
+  double m_osd_recovery_throttle_active;
+  double m_osd_recovery_throttle_coef;
 
 protected:
   Mutex osd_lock;// global lock
@@ -735,12 +744,23 @@ private:
        PGRef &amp;gt; {
     Mutex qlock;
     map&amp;lt;PG*, list&amp;lt;OpRequestRef&amp;gt; &amp;gt; pg_for_processing;
+    map&amp;lt;PG*, double&amp;gt; pg_for_processing_costs;
+    struct compare_costs {
+        OpWQ *parent;
+        compare_costs(OpWQ *wq) : parent(wq) {}
+        bool operator()(PG* const a, PG* const b) const {
+            return parent-&amp;gt;pg_for_processing_costs[&amp;amp;*a] &amp;lt; parent-&amp;gt;pg_for_processing_costs[&amp;amp;*b];
+        }
+    };
+    boost::heap::priority_queue&amp;lt;PG*, boost::heap::compare&amp;lt;compare_costs&amp;gt; &amp;gt; pg_for_processing_queue;
+    set&amp;lt;PGRef&amp;gt; pg_for_processing_postponed;
     OSD *osd;
     PrioritizedQueue&amp;lt;pair&amp;lt;PGRef, OpRequestRef&amp;gt;, entity_inst_t &amp;gt; pqueue;
     OpWQ(OSD *o, time_t ti, ThreadPool *tp)
       : ThreadPool::WorkQueueVal&amp;lt;pair&amp;lt;PGRef, OpRequestRef&amp;gt;, PGRef &amp;gt;(
 "OSD::OpWQ", ti, ti*10, tp),
 qlock("OpWQ::qlock"),
+    pg_for_processing_queue(compare_costs(this)),
 osd(o),
 pqueue(o-&amp;gt;cct-&amp;gt;_conf-&amp;gt;osd_op_pq_max_tokens_per_priority,
        o-&amp;gt;cct-&amp;gt;_conf-&amp;gt;osd_op_pq_min_cost)
@@ -751,6 +771,9 @@ private:
       pqueue.dump(f);
     }
 
+    void _calculate_pg_cost(PGRef pg);
+    double _get_pg_cost(PG* pg);
+    void _push_pg(PGRef pg);
     void _enqueue_front(pair&amp;lt;PGRef, OpRequestRef&amp;gt; item);
     void _enqueue(pair&amp;lt;PGRef, OpRequestRef&amp;gt; item);
     PGRef _dequeue();
@@ -785,6 +808,8 @@ private:
       unlock();
     }
     bool _empty() {
+      Mutex::Locker l(qlock);
+      if (pg_for_processing.size()) return false;
       return pqueue.empty();
     }
     void _process(PGRef pg);
@@ -1222,27 +1247,14 @@ protected:
       if (pg-&amp;gt;recovery_item.remove_myself())
 pg-&amp;gt;put("RecoveryWQ");
     }
-    PG *_dequeue() {
-      if (osd-&amp;gt;recovery_queue.empty())
-return NULL;
-      
-      if (!osd-&amp;gt;_recover_now())
-return NULL;
-
-      PG *pg = osd-&amp;gt;recovery_queue.front();
-      osd-&amp;gt;recovery_queue.pop_front();
-      return pg;
-    }
+    PG *_dequeue();
     void _queue_front(PG *pg) {
       if (!pg-&amp;gt;recovery_item.is_on_list()) {
 pg-&amp;gt;get("RecoveryWQ");
 osd-&amp;gt;recovery_queue.push_front(&amp;amp;pg-&amp;gt;recovery_item);
       }
     }
-    void _process(PG *pg) {
-      osd-&amp;gt;do_recovery(pg);
-      pg-&amp;gt;put("RecoveryWQ");
-    }
+    void _process(PG *pg, ThreadPool::TPHandle &amp;amp;handle);
     void _clear() {
       while (!osd-&amp;gt;recovery_queue.empty()) {
 PG *pg = osd-&amp;gt;recovery_queue.front();
diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index d356597..d8f8cae 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -145,7 +145,7 @@ PG::PG(OSDService *o, OSDMapRef curmap,
   #ifdef PG_DEBUG_REFS
   _ref_id_lock("PG::_ref_id_lock"), _ref_id(0),
   #endif
-  deleting(false), dirty_info(false), dirty_big_info(false), dirty_log(false),
+  deleting(false), recovery_prio(false), recovery_prio_lock("PG::recovery_prio_lock"), dirty_info(false), dirty_big_info(false), dirty_log(false),
   info(p),
   info_struct_v(0),
   coll(p), log_oid(loid), biginfo_oid(ioid),
@@ -194,6 +194,19 @@ void PG::lock(bool no_lockdep)
   dout(30) &amp;lt;&amp;lt; "lock" &amp;lt;&amp;lt; dendl;
 }
 
+bool PG::try_lock() {
+    if (!_lock.TryLock()) {
+        dout(10) &amp;lt;&amp;lt; "pg_try_lock_fail " &amp;lt;&amp;lt; this &amp;lt;&amp;lt; dendl;
+        return false;
+    } else {
+        assert(!dirty_info);
+        assert(!dirty_log);
+
+        dout(10) &amp;lt;&amp;lt; "pg_try_lock " &amp;lt;&amp;lt; this &amp;lt;&amp;lt; dendl;
+        return true;
+    }
+}
+
 void PG::lock_with_map_lock_held(bool no_lockdep)
 {
   _lock.Lock(no_lockdep);
diff --git a/src/osd/PG.h b/src/osd/PG.h
index 9446334..f833790 100644
--- a/src/osd/PG.h
+++ b/src/osd/PG.h
@@ -413,8 +413,11 @@ protected:
 
 public:
   bool deleting;  // true while in removing or OSD is shutting down
+  bool recovery_prio;
+  Mutex recovery_prio_lock;
 
   void lock(bool no_lockdep = false);
+  bool try_lock();
   void unlock() {
     //generic_dout(0) &amp;lt;&amp;lt; this &amp;lt;&amp;lt; " " &amp;lt;&amp;lt; info.pgid &amp;lt;&amp;lt; " unlock" &amp;lt;&amp;lt; dendl;
     assert(!dirty_info);
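
For reviewers, the wait computed after each recovery in RecoveryWQ::_process
can be reproduced with the sketch below. It is a Python transcription of the
arithmetic only and is not part of the patch; the function name recovery_wait
and the example values are made up for illustration.

```python
import math


def recovery_wait(pg_load, throttle, throttle_active, throttle_coef, recovery_prio):
    """Return (seconds, nanoseconds) as they are passed to utime_t in the patch."""
    cur_load = pg_load
    if cur_load >= 1.0:
        cur_load -= 1.0
    # Split the scaled load into its fractional and whole parts.
    loadfrac, whole = math.modf(cur_load * throttle_coef)
    # Whole part becomes the seconds field, clamped to the range 0..5.
    seconds = min(max(int(whole), 0), 5)
    # Base delay depends on whether the PG is flagged with recovery_prio.
    base = throttle_active if recovery_prio else throttle
    nanoseconds = int(1.0e9 * base)
    # Mirrors the C++ compound assignment: add as double, then truncate.
    nanoseconds = int(nanoseconds + 1.0e9 * loadfrac)
    return seconds, nanoseconds
```

For example, recovery_wait(3.5, 0.25, 0.5, 1.0, False) gives 2 whole seconds
plus 750000000 ns. With recovery_prio set, the same inputs give 1000000000 ns,
so the nanosecond field can reach a full second.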
&lt;/pre&gt;</description>
    <dc:creator>Sergey Fionov</dc:creator>
    <dc:date>2013-06-14T14:08:23</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15568">
    <title>radosgw- bind user to pool</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15568</link>
    <description>&lt;pre&gt;Hello,

We have several projects that need to save data to a RADOS object store via radosgw.

Is it possible to bind a rgw user to a specific pool?

Is it also possible to bind a whole rgw to a specific pool? We have set up two rgws, but when I add a pool in rgw1, it is synchronized to the other one (rgw2). The same happens with the user database. Is it possible to separate them? We thought about one rgw per project, but because of the sync, I do not think that is possible.

When you add more than one pool to the radosgw, in which pool will data be written? Is it random?

Thank you very much.

Regards

Philipp Jäger
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo&amp;lt; at &amp;gt;vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

&lt;/pre&gt;</description>
    <dc:creator>Jäger, Philipp</dc:creator>
    <dc:date>2013-06-14T13:59:01</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15567">
    <title>Issues with ceph-deploy/ceph-disk</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15567</link>
    <description>&lt;pre&gt;Dear developers,

I asked a similar question on ceph-users but did not get a reply, so please forgive me for posting here.

I have a slight issue with ceph-deploy (aside from bug #4984), specifically during osd prepare and activate. I am trying to build a cluster using some old HP servers and Ubuntu Precise, but the devices in these HP servers show up as /dev/cciss/c0d0 and so on (or /dev/cciss/c0d0p1 for partitions). Is there a workaround, or a place where I can make changes, since ceph-deploy/ceph-disk expects something like /dev/sda instead?

Thanks in advance.

Regards,
Luke


&lt;/pre&gt;</description>
    <dc:creator>Luke Jing Yuan</dc:creator>
    <dc:date>2013-06-14T09:57:29</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15549">
    <title>[PATCH 1/2] rbd: fetch object order before using it</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15549</link>
    <description>&lt;pre&gt;rbd_dev_v2_header_onetime() fetches striping information, and
checks whether the image can be read by comparing the stripe unit
to the object size. It determines the object size by shifting
the object order, which is 0 at this point since it has not been
read yet. Move the call to get the image size and object order
before rbd_dev_v2_header_onetime() so it is set before use.

Signed-off-by: Josh Durgin &amp;lt;josh.durgin&amp;lt; at &amp;gt;inktank.com&amp;gt;
---
 drivers/block/rbd.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index cecf5c6..9f72125 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4286,6 +4286,10 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev)
 bool first_time = rbd_dev-&amp;gt;header.object_prefix == NULL;
 int ret;
 
+ret = rbd_dev_v2_image_size(rbd_dev);
+if (ret)
+return ret;
+
 if (first_time) {
 ret = rbd_dev_v2_header_onetime(rbd_dev);
 if (ret)
@@ -4319,10 +4323,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev)
 "is EXPERIMENTAL!");
 }
 
-ret = rbd_dev_v2_image_size(rbd_dev);
-if (ret)
-return ret;
-
 if (rbd_dev-&amp;gt;spec-&amp;gt;snap_id == CEPH_NOSNAP)
 if (rbd_dev-&amp;gt;mapping.size != rbd_dev-&amp;gt;header.image_size)
 rbd_dev-&amp;gt;mapping.size = rbd_dev-&amp;gt;header.image_size;
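
To illustrate why the ordering matters: the object size is 2 to the power of
the object order, so an order that has not been read yet (0) yields a 1-byte
object size, and any realistic stripe unit then fails the comparison. The
Python sketch below shows only the arithmetic. The validity rule (stripe unit
must equal the object size) and the example values are assumptions for
illustration and may not match the driver's exact check.

```python
def object_size(obj_order):
    """Object size in bytes for a given object order (2 to the power obj_order)."""
    return 2 ** obj_order


def stripe_unit_supported(stripe_unit, obj_order):
    """Hypothetical check: accept only a stripe unit equal to the object size."""
    return stripe_unit == object_size(obj_order)


stripe_unit = 4 * 1024 * 1024  # example value: 4 MiB stripe unit

# Order not fetched yet (still 0): object size is 1 byte, so the check fails.
print(stripe_unit_supported(stripe_unit, 0))
# Order fetched first (22 is an example value): 4 MiB objects, so the check passes.
print(stripe_unit_supported(stripe_unit, 22))
```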
&lt;/pre&gt;</description>
    <dc:creator>Josh Durgin</dc:creator>
    <dc:date>2013-06-13T03:10:45</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15547">
    <title>v0.64 released</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15547</link>
    <description>&lt;pre&gt;A new development release of Ceph is out.  Notable changes include:

 * osd: monitor both front and back interfaces
 * osd: verify both front and back network are working before rejoining 
   cluster
 * osd: fix memory/network inefficiency during deep scrub
 * osd: fix incorrect mark-down of osds
 * mon: fix start fork behavior
 * mon: fix election timeout
 * mon: better trim/compaction behavior
 * mon: fix units in 'ceph df' output
 * mon, osd: misc memory leaks
 * librbd: make default options/features for newly created images (e.g., 
   via qemu-img) configurable
 * mds: many fixes for mds clustering
 * mds: fix rare hang after client restart
 * ceph-fuse: add ioctl support
 * ceph-fuse/libcephfs: fix for cap release/hang
 * rgw: handle deep uri resources
 * rgw: fix CORS bugs
 * ceph-disk: add '[un]suppress-active DEV' command
 * debian: rgw: stop daemon on uninstall
 * debian: fix upstart behavior with upgrades

You can get v0.64 from the usual locations:

 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.64.tar.gz
 * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
 * For RPMs, see http://ceph.com/docs/master/install/rpm

&lt;/pre&gt;</description>
    <dc:creator>Sage Weil</dc:creator>
    <dc:date>2013-06-12T22:30:33</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15546">
    <title>Ceph developers, please note: changes to 'ceph' CLI tool in master branch</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15546</link>
    <description>&lt;pre&gt;A large restructuring of the 'ceph' command-line tool has been pushed to 
the master branch (and will be present in v0.65 as well).  The ceph tool 
you execute is now a Python script that talks to the cluster through 
rados.py, the Python binding to librados.so (and, of course, then, with 
librados.so).

Those who install/upgrade using packages will get the coordinated 
versions of all the pieces, and can stop reading except as a matter of 
interest.

However, those who build from source will need to be aware that 
PYTHONPATH must include the source versions of src/pybind/rados.py, and 
LD_LIBRARY_PATH must include .libs.  Currently, you must set these in 
your environment before running ./ceph.

I've just pushed a commit that will try to automatically determine this 
situation: if the path to the ceph tool ends in src/, and there exist 
.libs and pybind directories there, the tool will reset LD_LIBRARY_PATH 
and PYTHONPATH and re-exec itself so that LD_LIBRARY_PATH takes effect.
It also issues a message to stderr noting this "developer mode":

*** DEVELOPER MODE: setting PYTHONPATH and LD_LIBRARY_PATH

This latest convenience feature for developers is in
commit e5184ea95031b7bea4264062de083045767d5dc3 in master.
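
The detection described above could look roughly like the following. This is
only a sketch, not the code from the commit: the function names find_dev_dirs
and dev_environment are hypothetical, and the exact checks and path handling
in the real tool may differ.

```python
import os


def find_dev_dirs(tool_path):
    """Return (libs_dir, pybind_dir) if tool_path looks like a source tree, else None.

    Assumption for illustration: "developer mode" means the tool's directory
    is named src and contains both .libs and pybind subdirectories.
    """
    tool_dir = os.path.dirname(os.path.abspath(tool_path))
    if os.path.basename(tool_dir) != "src":
        return None
    libs_dir = os.path.join(tool_dir, ".libs")
    pybind_dir = os.path.join(tool_dir, "pybind")
    if os.path.isdir(libs_dir) and os.path.isdir(pybind_dir):
        return libs_dir, pybind_dir
    return None


def dev_environment(tool_path, environ):
    """Return a copy of environ adjusted for developer mode, or None if not needed."""
    dirs = find_dev_dirs(tool_path)
    if dirs is None:
        return None
    libs_dir, pybind_dir = dirs
    if environ.get("LD_LIBRARY_PATH") == libs_dir and environ.get("PYTHONPATH") == pybind_dir:
        return None  # already set up, so no re-exec is needed (avoids an exec loop)
    new_env = dict(environ)
    new_env["LD_LIBRARY_PATH"] = libs_dir
    new_env["PYTHONPATH"] = pybind_dir
    return new_env
```

A caller would pass the returned environment to something like os.execvpe to
restart the script. The re-exec is needed because the dynamic linker reads
LD_LIBRARY_PATH only when the process starts.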

&lt;/pre&gt;</description>
    <dc:creator>Dan Mick</dc:creator>
    <dc:date>2013-06-12T21:20:48</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15542">
    <title>Concealed Business Proposition</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15542</link>
    <description>&lt;pre&gt;


MR.SUNG LEE
DAH SING BANK LTD
DES VOEUX RD.BRANCH,CENTRAL HONG KONG,
HONG KONG.

Good Day,

I am Mr.Sung Lee, Auditing and Account Credit Officer, Dah Sing Bank
Ltd,(Hong
Kong).I have a very sensitive and confidential brief for you I ask for your
partnership in re-profiling funds Transfer. I will give you the details
but in
summary the funds would be done from my bank( Dah Sing Bank) and further to
this other information to facilitate the remittance of the funds will be
revealed to you in due course. For your assistance, you shall receive 25% of
the funds to be transferred .

I will appreciate if you can reply me.

I'm looking forward to it
Mr.Sung Lee

&lt;/pre&gt;</description>
    <dc:creator>Mr.Sung Lee</dc:creator>
    <dc:date>2013-06-12T13:37:45</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15541">
    <title>Alert!!!</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15541</link>
    <description>&lt;pre&gt;&lt;/pre&gt;</description>
    <dc:creator>Webmaster Security</dc:creator>
    <dc:date>2013-06-12T12:05:47</dc:date>
  </item>
  <item rdf:about="http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15531">
    <title>radosrgw performance problems</title>
    <link>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/15531</link>
    <description>&lt;pre&gt;Hello,

We have a performance problem with radosgw.
We get only 8-9 mb/s per upload, also when testing with s3cmd on the rgw host itself.
(Two uploads at the same time: 15 mb/s combined; three uploads at the same time: 21 mb/s combined.)
But when putting a file via rados rbd, we get 40 mb/s upload, so there is no general network or other problem.

The speed is the same with the Inktank apache/fastcgi and with the original one. The hardware is also fast enough. We use Ubuntu 12.04 LTS and ceph 0.61.2.

Do you have any idea why the rgw is so slow? How can we identify where the problem is?

(I have heard about using the rgw admin socket to check perfcounters, but it seems that this is deprecated? When I type ceph --admin-daemon ... it says unknown command, and I cannot find it in the ceph documentation. I then wanted to benchmark with rest-bench, but it says "ERROR: failed to create bucket: XmlParseFailure -failed initializing benchmark", so I could not measure the speed.)

Ceph.conf- rgw part:

[client.radosgw.connect2]
host = hcrgwko2
rgw socket path = /tmp/connect2.sock
log file = /var/log/ceph/connect2.log
rgw dns name =  FQDN

Thank you very much.


Regards

Philipp

&lt;/pre&gt;</description>
    <dc:creator>Jäger, Philipp</dc:creator>
    <dc:date>2013-06-11T13:27:25</dc:date>
  </item>
  <textinput rdf:about="http://search.gmane.org/?group=$group=gmane.comp.file-systems.ceph.devel">
    <title>Search Engine</title>
    <description>Search the mailing list at Gmane</description>
    <name>query</name>
    <link>http://search.gmane.org/?group=$group=gmane.comp.file-systems.ceph.devel</link>
  </textinput>
</rdf:RDF>
