<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim">
    <title>gmane.comp.ai.gensim</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim</link>
    <description/>
    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>1</syn:updateFrequency>
    <syn:updateBase>1901-01-01T00:00+00:00</syn:updateBase>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1915"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1914"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1913"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1912"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1911"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1910"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1909"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1908"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1907"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1906"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1905"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1904"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1903"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1902"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1901"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1900"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1899"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1898"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1897"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.comp.ai.gensim/1896"/>
      </rdf:Seq>
    </items>
    <image rdf:resource="http://gmane.org/img/gmane-25t.png"/>
    <textinput rdf:resource=""/>
  </channel>
  <image rdf:about="http://gmane.org/img/gmane-25t.png">
    <title>Gmane</title>
    <url>http://gmane.org/img/gmane-25t.png</url>
    <link>http://gmane.org</link>
  </image>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1915">
    <title>[gensim:1910] How to print topic distribution per word</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1915</link>
    <description>&lt;pre&gt;Hi,

First of all, thanks for coming up with this very useful toolkit.

I just wish to ask if there is a way to force inference to print the 
distribution over all topics for a single word e.g.

lda = LdaModel(corpus, num_topics=100, id2word=dictionary,passes=50)

case 1 :
new_vec = dictionary.doc2bow(['dog','computer','philosophy','speaker'])
print lda[new_vec]

case 2:
new_vec2 = dictionary.doc2bow(['dog'])
print lda[new_vec2]


For case 1, I get a list of a few possible topics with their distributions. 
However, for case 2, I only get the most probable topic and the probability 
associated.

The goal here is to extract for each word a list of the entire 100 topical 
probabilities as features.

Any help would be appreciated.

Ben

&lt;/pre&gt;</description>
    <dc:creator>Ben Leong</dc:creator>
    <dc:date>2013-06-20T03:19:30</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1914">
    <title>[gensim:1908] Re: Wikipedia Question</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1914</link>
    <description>&lt;pre&gt;Hello Ben,


you want the file `enwiki-latest-pages-articles.xml.bz2`; I'm not sure
what the "articles-multistream" dump is. It may work with the
articles.xml parser too; I never tried.



Wherever you want; your download dir is fine.


The script can be run from anywhere (assuming you have gensim
installed). Running the script like that, without parameters, will
display an example invocation, where you will see how to specify the
wiki data location.

Hope that helps, regards,
Radim


&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-06-18T10:14:27</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1913">
    <title>[gensim:1908] Re: Changing var_maxiter and var_thresh under distributed LDA</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1913</link>
    <description>&lt;pre&gt;Radim,

Thanks for the response.

-izzy

On Thursday, May 30, 2013 11:11:00 AM UTC-4, Radim Řehůřek wrote:

&lt;/pre&gt;</description>
    <dc:creator>izzy</dc:creator>
    <dc:date>2013-05-31T15:29:12</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1912">
    <title>[gensim:1906] Re: Changing var_maxiter and var_thresh under distributed LDA</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1912</link>
    <description>&lt;pre&gt;Hello izzy,

these are constants that are not part of the API and changing them has
no effect in distributed computations.

You have two options:

* either hardwire them to a different value;

* or, promote them to proper parameters (=passed via LdaModel
constructor) + pass them to `dispatcher.initialize()` in the
constructor.

In both options you'd modify the LdaModel constructor in ldamodel.py.
If you go with option 2), please consider a pull request -- this may
be useful to other people as well.
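
A minimal sketch of option 2, using toy stand-in classes rather than gensim's real code. ToyWorker, ToyDispatcher and ToyModel are hypothetical names, and the default values are placeholders. The point it illustrates is that workers only see what is passed through initialize(), so a value set on the model afterwards never reaches them:

```python
class ToyWorker:
    """Hypothetical stand-in for a remote worker; it only knows what initialize() sends."""

    def initialize(self, **params):
        self.params = dict(params)


class ToyDispatcher:
    """Hypothetical stand-in for the dispatcher; forwards parameters to every worker."""

    def __init__(self, num_workers):
        self.workers = [ToyWorker() for _ in range(num_workers)]

    def initialize(self, **params):
        for worker in self.workers:
            worker.initialize(**params)


class ToyModel:
    """Takes the former constants as constructor parameters and passes them on."""

    def __init__(self, dispatcher, num_topics, var_maxiter=50, var_thresh=0.001):
        # Placeholder defaults; the real defaults live in ldamodel.py.
        self.var_maxiter = var_maxiter
        self.var_thresh = var_thresh
        dispatcher.initialize(
            num_topics=num_topics,
            var_maxiter=var_maxiter,
            var_thresh=var_thresh,
        )


dispatcher = ToyDispatcher(num_workers=2)
model = ToyModel(dispatcher, num_topics=30, var_maxiter=1000, var_thresh=0.0001)
print(dispatcher.workers[0].params["var_maxiter"])
```

With this shape, every worker receives the same settings as the model that created it.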

Best,
Radim


On May 29, 4:24 pm, izzy &amp;lt;risrael...-Re5JQEeQqe8AvxtiuMwx3w&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-30T15:11:00</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1911">
    <title>[gensim:1906] Re: Random seed in LDA?</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1911</link>
    <description>&lt;pre&gt;

On Sunday, February 24, 2013 4:52:31 AM UTC-5, Radim Řehůřek wrote:
I'm an academic, so I may need to show that I can replicate the exact same 
run at some point in the future. 

&lt;/pre&gt;</description>
    <dc:creator>izzy</dc:creator>
    <dc:date>2013-05-29T14:26:25</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1910">
    <title>[gensim:1905] Changing var_maxiter and var_thresh under distributed LDA</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1910</link>
    <description>&lt;pre&gt;I can't seem to change VAR_MAXITER (and also possibly VAR_THRESH) when 
using distributed LDA.  It seems to default to 50 for var_maxiter when I 
switch distributed from "FALSE" to "TRUE".  I guess I can start digging 
around in the source code, but does it need to be done differently than in 
serial mode?  (See below)


*DISTRIBUTED*


Here's the relevant code:

   mod = models.LdaModel(id2word=dictionary, num_topics=30, passes=10, 
update_every=1, distributed=True)
   mod.VAR_MAXITER = 1000
   mod.VAR_THRESH = 0.0001
   mod.update(corpus)

*Here's a snippet of output from one of the workers:*

2013-05-29 10:16:35,812 : INFO : worker #0 received job #1
2013-05-29 10:16:39,552 : INFO : 812/2000 documents converged within *50 iterations*
2013-05-29 10:16:39,555 : INFO : finished processing job #1
2013-05-29 10:16:40,354 : INFO : worker #0 returning its state after 2 jobs
2013-05-29 10:16:40,671 : INFO : resetting worker #0


*SERIAL*

*code:*

   mod = models.LdaModel(id2word=dictionary, num_topics=3&lt;/pre&gt;</description>
    <dc:creator>izzy</dc:creator>
    <dc:date>2013-05-29T14:24:17</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1909">
    <title>[gensim:1903] Re: "KeyError 0" when dictionary contains duplicate terms</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1909</link>
    <description>&lt;pre&gt;Hello Paul,


On May 29, 7:09 am, Paul Brown &amp;lt;p...-jDwgCC3FOVZs6hBKEjR/UA&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

Sure! Trouble with mismatching ids is an evergreen issue here.

I'm not sure what you mean by "width of corpus", or how
`len(id2token)` could != `len(token2id)` (unless you have the same
token under two different ids), but I'm sure it'll be clearer from the
pull request. :)

Thanks,
Radim




&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-29T07:58:23</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1908">
    <title>[gensim:1903] "KeyError 0" when dictionary contains duplicate terms</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1908</link>
    <description>&lt;pre&gt;
Hi --

I'm working with a modest corpus/dictionary of ~400k documents over a 
vocabulary of just under 1500 terms, and on some initial experiments with 
building LDA models, I encountered an exception with "KeyError: 0".  After 
some investigation, I observed that the feature matrix had the expected 
1479 columns but the dictionary — as loaded — contained 1469 terms.  With 
that observation in-hand, I found the 10 duplicate terms in the dictionary 
and was able to get things cleaned up.

A couple of suggestions:

   1. if len(id2token) != len(token2id) then an error should be presented 
   to the user, as other code relies on len(id2token) = len(token2id) as an 
   invariant.
   2. if len(id2token) != width of corpus feature matrix then an error 
   should be presented to the user, as it may take quite some time for the 
   debug messages to land on a term that is missing a corresponding index.
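
The two checks above could be sketched roughly as follows. This is plain Python over ordinary dicts, not gensim's Dictionary class; validate_dictionary and its arguments are hypothetical names used only for illustration:

```python
def validate_dictionary(token2id, id2token, num_features):
    """Raise ValueError when the mappings or the corpus width disagree."""
    if len(id2token) != len(token2id):
        raise ValueError(
            f"id2token has {len(id2token)} entries but token2id has "
            f"{len(token2id)}; the same token may appear under two ids"
        )
    if len(id2token) != num_features:
        raise ValueError(
            f"dictionary has {len(id2token)} terms but the corpus feature "
            f"matrix has {num_features} columns"
        )


# A consistent dictionary passes silently.
validate_dictionary({"dog": 0, "cat": 1}, {0: "dog", 1: "cat"}, num_features=2)

# The token "dog" stored under two ids makes the two mappings differ in size.
try:
    validate_dictionary(
        {"dog": 0, "cat": 1}, {0: "dog", 1: "cat", 2: "dog"}, num_features=3
    )
except ValueError as err:
    print("validation failed:", err)
```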

I'm willing to provide a pull request if these sound like useful 
validations.

Best.
&lt;/pre&gt;</description>
    <dc:creator>Paul Brown</dc:creator>
    <dc:date>2013-05-29T05:09:46</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1907">
    <title>[gensim:1901] Re: Segmentation Fault in models.LsiModel()</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1907</link>
    <description>&lt;pre&gt;...regarding the "extra 100 samples": oversampling improves SVD
accuracy. For more info, check out the academic articles linked from
http://radimrehurek.com/gensim/tut2.html#available-transformations .

Best,
Radim

On May 28, 7:40 pm, Roger Leitzke &amp;lt;roger.leit...-Re5JQEeQqe8AvxtiuMwx3w&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-28T21:36:03</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1906">
    <title>[gensim:1900] Re: Segmentation Fault in models.LsiModel()</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1906</link>
    <description>&lt;pre&gt;Hello again Roger,

looks like you're not using any optimized BLAS library, is that
correct? If so, you're losing a lot of performance there.

In any case, if `qr_destroy` on random data works, the problem must
lie in some combination of code+data. You can add more logging
statements inside `matutils.qr_destroy`, to see which line exactly
causes the segfault. Then we'll be wiser :)

And to speed up the debugging, try skipping the first 120,000
documents (using `itertools.islice` on your corpus), because in your
log the trouble appears only with the 120k-140k batch.
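
A small sketch of that skipping step, with a toy generator standing in for the real corpus (skip_documents and toy_corpus are made-up names). Note that the result is a one-shot iterator, so it suits a single pass over the data:

```python
from itertools import islice


def skip_documents(corpus, num_skipped):
    """Return an iterator over corpus that starts after the first num_skipped documents."""
    return islice(corpus, num_skipped, None)


# Toy corpus of 10 bag-of-words documents; the real case would skip 120000.
toy_corpus = ([(doc_id, 1.0)] for doc_id in range(10))
remaining = list(skip_documents(toy_corpus, 7))
print(remaining)
```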

Best,
Radim


On May 28, 7:40 pm, Roger Leitzke &amp;lt;roger.leit...-Re5JQEeQqe8AvxtiuMwx3w&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-28T19:32:43</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1905">
    <title>Re: [gensim:1899] Re: Segmentation Fault in models.LsiModel()</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1905</link>
    <description>&lt;pre&gt;Hi Radim,

Thanks for answering! Yes, unfortunately I have to use such a big dictionary.
I have small models too, but the largest model that I ran without problems
has 3,226,254 terms. However, I had the same problem trying to run a tfidf
model with a dictionary containing 3,674,106 terms. Well, I manually checked
your code and it worked fine. The numpy configuration is below:

blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib64']
    language = f77

lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib64']
    language = f77

atlas_threads_info:
  NOT AVAILABLE

blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib64']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]

atlas_blas_threads_info:
  NOT AVAILABLE

lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib64']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]

atlas_info:
  NOT AVAILABLE

lapack_mkl_info:
  NOT AVAILABLE

blas_mkl_info:
  NOT AVAILABLE

a&lt;/pre&gt;</description>
    <dc:creator>Roger Leitzke</dc:creator>
    <dc:date>2013-05-28T17:40:27</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1904">
    <title>[gensim:1898] Re: Segmentation Fault in models.LsiModel()</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1904</link>
    <description>&lt;pre&gt;Hello Roger,

On May 27, 5:49 pm, Roger Leitzke &amp;lt;roger.leit...-Re5JQEeQqe8AvxtiuMwx3w&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

some nice iron you have :)

I don't see anything wrong in your log, and you say you already
checked your input. So my guess is there is some unknown problem in
your BLAS library processing large matrices.

With a dictionary of 5,664,776 terms (is this really necessary?), the
crashing point seems to be somewhere inside the QR decomposition of a
5,664,776 x 350 matrix ~= 16GB.
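
The size estimate can be reproduced with a few lines of arithmetic. The sketch assumes dense float64 storage (8 bytes per value), and reads the 350 columns as the 250 requested topics plus 100 oversampling columns:

```python
num_terms = 5664776
num_columns = 350        # assumed: 250 topics + 100 extra samples
bytes_per_value = 8      # assumed: float64

size_bytes = num_terms * num_columns * bytes_per_value
print(f"{size_bytes / 1e9:.1f} GB")
```

That comes to a bit under 16 GB for this one matrix alone, before any temporary copies made during the decomposition.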

Can you please check manually, from a python shell, whether the
following works?


Best,
Radim



&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-27T17:58:21</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1903">
    <title>[gensim:1897] Segmentation Fault in models.LsiModel()</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1903</link>
    <description>&lt;pre&gt;Hi everyone,

I searched in this mailing list about the problem that I'm having but I
didn't find any solution. Well, I'm getting a segmentation fault when building
the lsi model. I verified the dictionary and the tfidf.mm in order to find
a missing id (I read in the mailing list that someone had a problem with
ids), but they look OK to me. My code is below and the log file is
attached. Does anyone have any idea what the problem could be or how
I could solve this? I'm running it on a 2x Intel Xeon E7-2850 2.0 GHz
Hyper-Threading machine with 80GB of memory, and using:

Gensim version: 0.8.6
Numpy version: 1.3.0
Scipy version: 0.7.0

*Code*:

   inputfolder = '/home/roger.granada/lsaC'
   corpus = corpora.MmCorpus(join(inputfolder, 'tfidf.mm'))
   dictionary = Dictionary.load_from_text(join(inputfolder, 'dictionary_tfidf.txt'))
   modellsa = models.LsiModel(corpus, id2word=dictionary, num_topics=250)
   modellsa.save(join(inputfolder, 'model.lsi'))

 *Output*:
  - Attached

&lt;/pre&gt;</description>
    <dc:creator>Roger Leitzke</dc:creator>
    <dc:date>2013-05-27T15:49:20</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1902">
    <title>[gensim:1896] Re: Accessing topics per document in test set</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1902</link>
    <description>&lt;pre&gt;Hello Thomas,

no, topics of the indexed documents are not stored. What is stored is
a scaled version of the topics (a vector of unit length).

If this vector is what you need, you can get it with (if you're using
the stable release):

    def vec_by_id(docid):
        for index in [server.stable.opt_index,
server.stable.fresh_index]:
            if index is not None and docid in index:
                return index.vec_by_id(docid)

or with `vec = server.stable.vec_by_id(docid)` (if you're using the
newer github version).

Both versions return `None` when the requested `docid` is not
indexed at all.

Best,
Radim


On May 24, 8:33 pm, Thomas Uhrig &amp;lt;tuhrig...-Re5JQEeQqe8AvxtiuMwx3w&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-25T12:16:11</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1901">
    <title>[gensim:1896] Re: Accessing topics per document in test set</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1901</link>
    <description>&lt;pre&gt;Hi!

Is it possible to get the topics for a document id using *GensimServer*? I 
tried something like:

        server = SessionServer(Config.SIMILARITY_SERVER)
        
        model = server.debug_model().lsi;

        print model[ server.stable.fresh_index.vec_by_id( id ) ]

But this looks a little bit complicated and (that's the worst ;) it doesn't 
work. Any ideas?

Thanks.

On Thursday, March 14, 2013 4:01:21 PM UTC+1, hh wrote:

&lt;/pre&gt;</description>
    <dc:creator>Thomas Uhrig</dc:creator>
    <dc:date>2013-05-24T18:33:47</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1900">
    <title>[gensim:1894] Re: Merge Step of Distributed LDA</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1900</link>
    <description>&lt;pre&gt;Hello M,

sorry for the delay -- you are correct, E-step is distributed across
all worker nodes, M-step is not.

M-step is adding up the resulting worker matrices (fast) + computing
dirichlet_expectation on the resulting matrix (fast), so I think
sending data around would actually make it a lot slower.
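
To make the cost of that step concrete, here is a rough pure-Python sketch of the two operations: an element-wise sum of the per-worker matrices, followed by a Dirichlet expectation over each row. It is not gensim's implementation; the function names, the small digamma approximation and the toy numbers are all assumptions, and any blending with the previous model state is left out:

```python
import math


def digamma(x):
    """Approximate digamma for x greater than 0 (recurrence plus asymptotic series)."""
    result = 0.0
    # Shift x up to at least 6, where the asymptotic series is accurate.
    for _ in range(max(0, math.ceil(6 - x))):
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def merge_worker_stats(worker_stats):
    """Element-wise sum of per-worker (topics x terms) matrices."""
    merged = [row[:] for row in worker_stats[0]]
    for stats in worker_stats[1:]:
        for k, row in enumerate(stats):
            for w, value in enumerate(row):
                merged[k][w] += value
    return merged


def dirichlet_expectation(row):
    """E[log(theta)] for theta drawn from Dirichlet(row)."""
    total = digamma(sum(row))
    return [digamma(value) - total for value in row]


# Two toy workers, one topic, two terms.
merged = merge_worker_stats([[[0.25, 0.5]], [[0.75, 0.5]]])
expectations = [dirichlet_expectation(row) for row in merged]
print(merged, expectations)
```

Both functions touch each matrix entry a constant number of times, which is consistent with the M-step being cheap compared with shipping the matrices between nodes.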

Or did you mean something else? If you have an idea, try it out
(gensim is on github) -- always better to see hard numbers than
assume :)

Regards,
Radim


On May 9, 4:51 am, "M." &amp;lt;sil...-Re5JQEeQqe8AvxtiuMwx3w&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-16T10:10:45</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1899">
    <title>[gensim:1894] Gensim now available on PiCloud as public environment</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1899</link>
    <description>&lt;pre&gt;Hello,

I saw that Gensim has a section on distributed computing [1]. I just wanted 
to let you all know that PiCloud [2] supports gensim. While any user could 
always create a custom environment, and install the latest version 
themselves [3], we've decided to address the issue directly by introducing 
a publicly shared environment with gensim.

In short, specifying a job's environment as "/picloud/gensim" will use a 
public, up-to-date version of gensim. You can clone the environment to 
"freeze" the version, so that it doesn't change under you. Hope this helps!

[1] http://radimrehurek.com/gensim/distributed.html
[2] http://www.picloud.com
[3] http://docs.picloud.com/environment.html

While we aren't integrated into gensim to the point where 
"distributed=True" will use PiCloud, we're interested in hearing whether 
this is a desirable &amp;amp; in-demand feature.

Best Regards,
John

--
John Riley
PiCloud, Inc.

&lt;/pre&gt;</description>
    <dc:creator>John Riley</dc:creator>
    <dc:date>2013-05-10T02:28:01</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1898">
    <title>[gensim:1893] Merge Step of Distributed LDA</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1898</link>
    <description>&lt;pre&gt;Been going through the distributed LDA code. Seems like the E-step can be 
done in parallel, while the M-step has to gather the results of all the 
intermediate E-steps and then run itself on one computer. Any way to better 
parallelize the M-step? Something like the merge-step in mergesort, where 
after all the E steps are done, the worker nodes pair-off and M-step, and 
then merge etc. and bubble up returning a completed M-step to the 
dispatcher?

&lt;/pre&gt;</description>
    <dc:creator>M.</dc:creator>
    <dc:date>2013-05-09T02:51:56</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1897">
    <title>[gensim:1892] install problem</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1897</link>
    <description>&lt;pre&gt;When I try to install scipy

我 went to
/家庭/ VI /构建/ scipy的

sudo的蟒蛇setup.py安装


错误：库dfftpack有Fortran的来源，但没有Fortran编译器发现

&lt;/pre&gt;</description>
    <dc:creator>叶璐</dc:creator>
    <dc:date>2013-05-08T07:00:28</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1896">
    <title>[gensim:1890] Re: extremely slow lda?</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1896</link>
    <description>&lt;pre&gt;Yes, I am using the generically compiled version from ubuntu. I will
compile my own version. Thanks both of you for all the help.

On May 7, 4:32 pm, Radim Řehůřek &amp;lt;m...-yqFObnq8frArm/5+6H3lwA&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

&lt;/pre&gt;</description>
    <dc:creator>Jason</dc:creator>
    <dc:date>2013-05-07T21:40:16</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.comp.ai.gensim/1895">
    <title>[gensim:1889] Re: extremely slow lda?</title>
    <link>http://permalink.gmane.org/gmane.comp.ai.gensim/1895</link>
    <description>&lt;pre&gt;Hello Jason,

glad we figured it out.

On May 7, 9:31 pm, Jason &amp;lt;jason...-Re5JQEeQqe8AvxtiuMwx3w&amp;lt; at &amp;gt;public.gmane.org&amp;gt; wrote:

There's no such formula; it depends on the language/your goal/etc. In
general, include words that are meaningful to your app + exclude words
that are not (stop words, function words, ...). For example, academic
papers like to trim the vocabulary to ~10k. Real-world apps have to
deal with real-world words so larger vocab is needed, but it quickly
becomes diminishing returns, esp. for English. In the tutorials I used
100k.

Stemming/lemmatization can be useful to reduce the vocabulary size,
too, but I saw you're already using that :) Note that stemming can
also decrease accuracy (new vs. news), and is tricky in general for
non-English languages (mentioning this because I believe I saw some
non-latin scripts in the dictionary you sent earlier).



On my laptop, this takes about 5.8 seconds. So your server seems to be
~14x slower.

ATLAS is a fine library, but my guess is you have some gene&lt;/pre&gt;</description>
    <dc:creator>Radim Řehůřek</dc:creator>
    <dc:date>2013-05-07T20:32:51</dc:date>
  </item>
  <textinput rdf:about="http://search.gmane.org/?group=$group=gmane.comp.ai.gensim">
    <title>Search Engine</title>
    <description>Search the mailing list at Gmane</description>
    <name>query</name>
    <link>http://search.gmane.org/?group=$group=gmane.comp.ai.gensim</link>
  </textinput>
</rdf:RDF>
