From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: Settings question
Date Fri, 16 Dec 2016 16:06:05 GMT
Hi Kris,

> when is later

the next round/cycle, provided that all unfetched URLs fit into -topN
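
For illustration, a rough sketch of one crawl round (paths and the -topN value
are only examples, following the standard tutorial layout); with
http.redirect.max=0 the redirect targets enter the CrawlDb during updatedb and
become candidates for the next generate:

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s                      # redirects are only recorded, not followed
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s     # redirect targets are added to the CrawlDb
  # "later" = the next time generate selects them and fetch runs again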

> an optimal setting for this when nutch needs to follow the redirect?

http.redirect.max > 3
  Hardly what you want. Worst case: you are sent around in circles and the
  fetcher gets caught in redirect loops.

3 >= http.redirect.max > 0
  If the fetcher follows redirects immediately, this may cause duplicate fetches
  when multiple URLs point to the same redirect target. That's a potential drawback.
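
If you want the fetcher to follow a redirect or two right away, the setting
goes into nutch-site.xml, along these lines (the value 2 here is only an example):

  <property>
    <name>http.redirect.max</name>
    <value>2</value>
    <!-- fetcher follows up to 2 redirects immediately -->
  </property>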

http.redirect.max = 0
  Avoids unnecessary work because redirect targets are deduplicated in the CrawlDb.
  But it's not optimal if
  - redirects are used by crawled sites to set cookies (in combination with protocol-httpclient)
  - cycles take a long time and ephemeral redirects become invalid in the meantime
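
With this setting you can check how many recorded redirects are sitting in the
CrawlDb, e.g. (example path):

  bin/nutch readdb crawl/crawldb -stats
  # redirect sources show up as db_redir_temp / db_redir_perm,
  # the discovered targets stay db_unfetched until a later round fetches them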

Best,
Sebastian

On 12/15/2016 07:31 PM, KRIS MUSSHORN wrote:
> 
> <property>
>   <name>http.redirect.max</name>
>   <value>0</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property> 
> 
> 
> when is later and what is an optimal setting for this when nutch needs to follow the redirect?
> 
> TIA 
> Kris 
> 

